UNISON: Unpaired Cross-Lingual Image Captioning

Jiahui Gao (1), Yi Zhou (2), Philip L.H. Yu (3)*, Shafiq Joty (4), Jiuxiang Gu (5)
(1) The University of Hong Kong, Hong Kong; (2) Johns Hopkins University, USA; (3) The Education University of Hong Kong, Hong Kong; (4) Nanyang Technological University, Singapore; (5) Adobe Research, USA
sumiler@hku.hk, yzhou188@jhu.edu, plhyu@eduhk.hk, srjoty@ntu.edu.sg, jigu@adobe.com
*Corresponding author.

Abstract

Image captioning has emerged as an interesting research field in recent years due to its broad application scenarios. The traditional paradigm of image captioning relies on paired image-caption datasets to train the model in a supervised manner. However, creating such paired datasets for every target language is prohibitively expensive, which hinders the extensibility of captioning technology and deprives a large part of the world population of its benefit. In this work, we present a novel unpaired cross-lingual method to generate image captions without relying on any caption corpus in the source or the target language. Specifically, our method consists of two phases: (i) a cross-lingual auto-encoding process, which utilizes a sentence-parallel (bitext) corpus to learn the mapping from the source to the target language in the scene graph encoding space and to decode sentences in the target language, and (ii) a cross-modal unsupervised feature mapping, which seeks to map the encoded scene graph features from the image modality to the language modality. We verify the effectiveness of our proposed method on the Chinese image caption generation task. Comparisons against several existing methods demonstrate the effectiveness of our approach.

1 Introduction

Image captioning has attracted a lot of attention in recent years due to its emerging applications, including image indexing, virtual assistants, etc. Despite the impressive results achieved by existing captioning techniques, most of them focus on English because of the availability of image-caption paired datasets, and they cannot generalize to languages where such paired datasets are not available. In reality, there are more than 7,100 different languages spoken by billions of people worldwide (source: Ethnologue; Eberhard, Simons, and Fennig 2019). Building visual-language technologies only for English would deprive a significantly large population of non-English speakers of AI benefits and also lead to ethical concerns, such as unequal access to resources. Therefore, similar to other NLP tasks (e.g., parsing, question answering) (Hu et al. 2020; Conneau et al. 2018; Gu et al. 2020), visual-language tasks should also be extended to multiple languages. However, creating paired captioning datasets for each target language is infeasible, since the labeling process is very time-consuming and requires excessive human labor.

To alleviate the aforementioned problem, there have been several attempts at relaxing the requirement of image-caption paired data in the target language (Gu et al. 2018; Song et al. 2019), which rely on paired image-caption data in a pivot language to generate captions in the target language via sentence-level translation. However, even for English, the existing captioning datasets (e.g., MS-COCO (Lin et al. 2014)) are not sufficiently large and comprise only limited object categories, making it challenging to generalize the trained captioners to scenarios in the wild (Tran et al. 2016).
In addition, sentence-level translation relies purely on the text description and cannot observe the entire image, which may ignore important contextual semantics and lead to inaccurate translation. Thus, such pivot-based methods fail to fully address the problem. Recently, a few works have explored the image captioning task in an unpaired setting (Feng et al. 2019; Gu et al. 2019; Laina, Rupprecht, and Navab 2019). Nevertheless, these methods still rely on a manually labeled caption corpus. For example, Gu et al. (2019) train their model on shuffled image-caption pairs of MS-COCO; Feng et al. (2019) use an image description corpus from Shutterstock; Laina, Rupprecht, and Navab (2019) create a training dataset by sampling images and captions from different image-caption datasets. Although they are unpaired methods in spirit, one could still argue that they depend heavily on the collected caption corpus to obtain a reasonable cross-modal mapping between the vision and language distributions, a resource that is not always practical to assume. It therefore remains questionable how these methods would perform when there is no caption data at all. To the best of our knowledge, there is yet no work that investigates image captioning without relying on any caption corpus.

Despite the giant gap between images and texts, they are essentially different mediums describing the same entities and how these entities are related in the objective world. Such internal logic is the most essential information carried by the medium, and it can be leveraged as the bridge to connect data in different modalities. A scene graph (Wang et al. 2018) is a structural representation that contains 1) the objects, 2) their respective attributes, and 3) how they are related as described by the medium (image or text); it has been developed into a mature technique for visual understanding tasks in recent years (Johnson, Gupta, and Fei-Fei 2018). Previous research on scene graph generation has demonstrated its effectiveness in aiding cross-modal alignment (Yang et al. 2019; Gu et al. 2019). However, existing scene graph generators are only available in English, which poses challenges for extending their application to other languages. One naive approach is to perform cross-lingual alignment to other target languages by conducting a superficial word-to-word translation of the scene graph nodes, which neglects the contextual information of the sentence or the image. Since a word on a single node can carry drastically different meanings in different contexts, such an approach often leads to a sub-optimal cross-lingual mapping. To address this issue, we propose a novel Cross-lingual Hierarchical Graph Mapping (HGM) that conducts the alignment between languages in the scene graph encoding space and benefits from contextual information by gathering semantics across different levels of the scene graph. Notably, the scene graph translation process can be enhanced by large-scale parallel corpora (bitext), which are easily accessible for many languages (Esplà et al. 2019).

In this paper, we propose UNpaIred croSs-lingual image captiONing (UNISON), a novel approach to generate image captions in the target language without relying on any caption corpus. Our UNISON framework consists of two phases: (i) a cross-lingual auto-encoding process and (ii) a cross-modal unsupervised feature mapping (Fig. 1).
Using the parallel corpus, the cross-lingual auto-encoding process trains the HGM to map a scene graph derived from the source language (English) sentence to the space of the target language (Chinese), and learns to generate a sentence in the target language based on the mapped scene graph. Then, a cross-modal feature mapping (CMM) function is learned in an unsupervised manner, which aligns the image scene graph features from the image modality to the language modality. The image scene graph features, mapped to the target language by the HGM and aligned to the language modality by the CMM, are subsequently fed to the decoder from phase (i) to generate image captions in the target language. Our experiments show 1) the effectiveness of the proposed HGM for cross-lingual alignment in the scene graph encoding space (Section 5.2) and 2) the superior performance of our UNISON framework as a whole (Section 5.1).

2 Related Work

Paired Image Captioning. Previous studies on supervised image captioning mostly follow the popular encoder-decoder framework (Vinyals et al. 2015; Rennie et al. 2017; Anderson et al. 2018) and focus on generating captions in English, since neural image captioning models require large-scale annotated image-caption pairs to achieve good performance. To relax the requirement of human effort in caption annotation, Lan, Li, and Dong (2017) propose a fluency-guided learning framework to generate Chinese captions based on pseudo captions, which are translated from English captions. Yang et al. (2019) adopt the scene graph as a structured representation to connect the image and text domains and generate captions. Zhong et al. (2020) propose a method that selects important sub-graphs of scene graphs to generate comprehensive captions. Nguyen et al. (2021) further close the semantic gap between image and text scene graphs with HOI labels.

Unpaired Image Captioning. The main challenge in unpaired image captioning is to learn the captioner without any image-caption pairs. Gu et al. (2018) first propose an approach based on a pivot language. They obviate the requirement of paired image-caption data in the target language but still rely on paired image-caption data in the pivot language. Feng et al. (2019) use a concept-to-sentence model to generate pseudo image-caption pairs, and align image features and text features in an adversarial manner. Song et al. (2019) introduce a self-supervised reward to train the pivot-based captioning model on pseudo image-caption pairs. Gu et al. (2019) propose a scene graph-based method for unpaired image captioning on disordered images and captions.

Summary. While several attempts have been made towards unpaired image captioning, they require a caption corpus to learn a reasonable cross-modal mapping between the vision and language distributions; e.g., the corpus in Feng et al. (2019) is collected from Shutterstock image descriptions, and Gu et al. (2019) use the MS-COCO corpus after shuffling the image-caption pairs. Thus, arguably these approaches are not entirely unpaired, as they rely on a labeled corpus, limiting their applicability to different languages. In contrast, our method generates captions in the target language without relying on any caption corpus.

3 Methods

3.1 Preliminary and Our Setting

In the conventional paired paradigm, image captioning aims to learn a captioner that can generate an image caption $\hat{S}$ for a given image $I$, such that $\hat{S}$ is similar to the ground-truth (GT) caption.
Given the image-caption pairs $\{I_i, S_i\}_{i=1}^{N_I}$, the popular encoder-decoder framework is formulated as:

$$I \rightarrow S: \quad I \rightarrow v \rightarrow \hat{S} \qquad (1)$$

where $v$ denotes the encoded image feature. The training objective for Eq. 1 is to maximize the probability of the words in the GT caption given the previous GT words and the image.

Compared with the paired setting, which relies on paired image-caption data and cannot generalize beyond the language used to label the captions, our unpaired setting does not depend on any image-caption pairs and can be extended to other target languages. Specifically, we assume that we have an image dataset $\{I_i\}_{i=1}^{N_I}$ and a source-target parallel corpus $\{(S^x_i, S^y_i)\}_{i=1}^{N_S}$. Our goal is to generate a caption $\hat{S}^y$ in the target language $y$ (Chinese) for an image $I$ with the help of the unpaired images and the parallel corpus.

3.2 Overall Framework

As shown in Fig. 1, there are two phases in our framework: (i) a cross-lingual auto-encoding process and (ii) a cross-modal unsupervised feature mapping, which can be formulated as the following equations, respectively:

$$S^x \rightarrow S^y: \quad S^x \rightarrow G^x \rightarrow G^y \rightarrow z^y \rightarrow \hat{S}^y \qquad (2)$$

$$I \rightarrow S^y: \quad I \rightarrow G^{x,I} \rightarrow G^{y,I} \rightarrow z^{y,I} \rightarrow z^y \rightarrow \hat{S}^y \qquad (3)$$

where $l \in \{x, y\}$ denotes the source or target language; $S^l$ is the sentence in language $l$; $G^l$ and $G^{l,I}$ are the scene graphs for language $l$ in the sentence modality and the image modality ($I$), respectively; and $z^l$ and $z^{l,I}$ are the encoded scene graph features for $G^l$ and $G^{l,I}$, respectively.

Figure 1: Overview of our UNISON framework. It has two phases: a cross-lingual auto-encoding process and an unsupervised cross-modal feature mapping. The cross-lingual scene graph mapping in the first phase (top) is designed to map the scene graph from the source language (e.g., English) to the target language (e.g., Chinese) without relying on a scene graph parser in the target language. The unsupervised cross-modal feature mapping in the second phase (bottom) is designed to align the visual modality with the textual modality. We mark the object, relationship, and attribute nodes in yellow, blue, and grey in the scene graphs. The English sentences in parentheses (marked in gray) are translated by Google Translate for better understanding.

The cross-lingual auto-encoding process (shown in the top of Fig. 1) aims to generate a sentence in the target language given a scene graph in the source language: we first extract the sentence scene graph $G^x$ from each (English) sentence $S^x$ using a sentence scene graph parser, and map it to $G^y$ via our proposed HGM (detailed in a later section). Then we feed $G^y$ to the encoder to produce the scene graph features $z^y$, which the decoder then takes as input to generate $\hat{S}^y$. Note that the mapping from $G^x$ to $G^y$ is done at the embedding level, i.e., no symbolic $G^y$ is constructed. This phase addresses the misalignment among different language domains.
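To make the data flow of Eqs. (2) and (3) concrete, the following minimal sketch traces both phases as plain Python functions. The component names (parser, hgm, enc_y, cmm, dec_y) are placeholders standing in for the modules described in the rest of this section; this is an illustration of the pipeline, not the authors' released code.

```python
# Minimal sketch of the two UNISON phases (Eqs. 2 and 3).
# All components are placeholder callables for the modules described in Section 3.

def caption_from_sentence(S_x, parser, hgm, enc_y, dec_y):
    """Phase (i), cross-lingual auto-encoding: S^x -> G^x -> G^y -> z^y -> S^y."""
    G_x = parser(S_x)      # English sentence scene graph
    G_y = hgm(G_x)         # cross-lingual mapping in the embedding space
    z_y = enc_y(G_y)       # encoded object/relationship/attribute features
    return dec_y(z_y)      # Chinese sentence

def caption_from_image(I, img_parser, hgm, enc_y, cmm, dec_y):
    """Phase (ii) and inference: I -> G^{x,I} -> G^{y,I} -> z^{y,I} -> z^y -> S^y."""
    G_xI = img_parser(I)   # image scene graph (English node labels)
    G_yI = hgm(G_xI)       # the shared HGM maps nodes to the target language
    z_yI = enc_y(G_yI)     # encode in the image modality
    z_y = cmm(z_yI)        # cross-modal mapping to the language modality
    return dec_y(z_y)      # Chinese caption
```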
The cross-modal unsupervised feature mapping (shown in the bottom part of Fig. 1) closes the gap between the image modality and the language modality: we first extract the image scene graph $G^{x,I}$ from image $I$, which is in the source language $x$ (English). After that, we map $G^{x,I}$ to $G^{y,I}$ with the HGM (shared with the first phase). As shown in Eq. 3, a cross-modal mapping function ($z^{y,I} \rightarrow z^y$) is learned, which maps the encoded image scene graph features from the image modality to the language modality. Once mapped to $z^y$, we can use the sentence decoder to generate $\hat{S}^y$. We elaborate on each phase in detail below.

3.3 Cross-Lingual Auto-Encoding Process

Scene Graph. A scene graph $G = (V, E)$ contains three kinds of nodes: object, relationship, and attribute nodes. Let $o_i$ denote the $i$-th object. A triplet $\langle o_i, r_{i,j}, o_j \rangle$ in $G$ is composed of two objects, $o_i$ (in the subject role) and $o_j$ (in the object role), along with their relation $r_{i,j}$. As each object may have a set of attributes, we denote by $a^k_i$ the $k$-th attribute of object $o_i$. To generate an image scene graph $G^I$, we build the image scene graph generator based on Faster R-CNN (Ren et al. 2015) and MOTIFS (Zellers et al. 2018). To generate a sentence scene graph $G^x$, we first convert each sentence into a dependency tree with a syntactic parser (Anderson et al. 2016), and then apply a rule-based method (Schuster et al. 2015) to build the graph. $G^y$ is mapped from $G^x$ through our HGM module.

Cross-Lingual Hierarchical Graph Mapping (HGM). Our hierarchical graph mapping contains three levels: (i) word-level mapping, (ii) sub-graph mapping, and (iii) full-graph mapping. The semantic information from all three levels is fused in a self-adaptive manner via a self-gated mechanism, which effectively takes into account the structures and relations from the context. The proposed HGM is illustrated in Fig. 2. Let $\langle e^l_{o_i}, e^l_{r_{i,j}}, e^l_{o_j} \rangle \in G^l$ denote the triplet for relation $r^l_{i,j}$ in language $l$, where $e^l_{o_i}$, $e^l_{o_j}$, and $e^l_{r_{i,j}}$ are the embeddings representing the subject $o^l_i$, the object $o^l_j$, and the relationship $r^l_{i,j}$. Formally, our hierarchical graph mapping from language $x$ to language $y$ can be expressed as:

$$\langle e^y_{o_i}, e^y_{r_{i,j}}, e^y_{o_j} \rangle = \langle f_{\mathrm{HGM}}(e^x_{o_i}, G^x),\; f_{\mathrm{Word}}(e^x_{r_{i,j}}),\; f_{\mathrm{HGM}}(e^x_{o_j}, G^x) \rangle \qquad (4)$$

$$f_{\mathrm{HGM}}(e^x_{o_i}, G^x) = \alpha_w f_{\mathrm{Word}}(e^x_{o_i}) + \alpha_s f_{\mathrm{Sub}}(e^x_{o_i}, G^x) + \alpha_f f_{\mathrm{Full}}(e^x_{o_i}, G^x) \qquad (5)$$

$$[\alpha_w, \alpha_s, \alpha_f] = \mathrm{softmax}\big(f_{\mathrm{MLP}}(f_{\mathrm{Word}}(e^x_{o_i}))\big) \qquad (6)$$

where $e^y_{o_i}$, $e^y_{o_j}$, and $e^y_{r_{i,j}}$ are the mapped embeddings in the target language $y$; $\alpha_w$, $\alpha_s$, and $\alpha_f$ are the level-wise importance weights calculated by Eq. 6; and $f_{\mathrm{MLP}}(\cdot)$ is a multi-layer perceptron (MLP) composed of three fully-connected (FC) layers with ReLU activations.

Word-level Mapping. The word-level mapping relies on a retrieval function $f_{\mathrm{Word}}(\cdot)$: given an embedding in language $x$, $f_{\mathrm{Word}}(\cdot)$ retrieves the most similar embedding in language $y$ from a cross-lingual word embedding space, as illustrated in Fig. 2(a), where cosine similarity is used to measure the distance. In practice, we adopt the pre-trained common space trained on Wikipedia, following Joulin et al. (2018). The retrieved embedding is then passed to an FC layer to obtain a high-dimensional embedding in language $y$.
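As an illustration of the retrieval-based word-level mapping $f_{\mathrm{Word}}$, the sketch below performs a cosine-similarity nearest-neighbour lookup in a shared cross-lingual embedding space and projects the result with an FC layer. The use of PyTorch, the random stand-in embeddings, and the dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def f_word(e_x, target_vocab_emb, fc):
    """Word-level mapping: retrieve the nearest target-language embedding by
    cosine similarity in a shared cross-lingual space, then project it with an
    FC layer to the higher-dimensional space used by the scene graph encoder."""
    # e_x: (d,) source-language embedding; target_vocab_emb: (V, d) target vocabulary
    sims = F.cosine_similarity(e_x.unsqueeze(0), target_vocab_emb, dim=-1)  # (V,)
    nearest = target_vocab_emb[sims.argmax()]                               # (d,)
    return fc(nearest)                                                      # (d_hidden,)

# Usage sketch with random stand-ins for the pre-trained common space.
d, d_hidden, V = 300, 1000, 5000
fc = torch.nn.Linear(d, d_hidden)
e_x = torch.randn(d)
target_vocab_emb = torch.randn(V, d)
e_y = f_word(e_x, target_vocab_emb, fc)   # mapped embedding in the target language
```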
Graph-level Mapping. Since the relations and structure of the surrounding nodes encode crucial context information, we introduce node mapping with graph-level information (as illustrated in Fig. 2(b) and 2(c)): namely sub-graph mapping ($f_{\mathrm{Sub}}$) and full-graph mapping ($f_{\mathrm{Full}}$), which first construct a contextualized embedding at the graph level and then conduct the cross-lingual mapping on the produced embedding. More specifically, for sub-graph mapping, the contextualized embedding is computed as $\sum_{k=1}^{N_o} \mathrm{sconv}(e^x_{o_i}, e^x_{o_k}) / N_o$, where $N_o$ is the total number of nodes directly connected to node $o_i$ and $\mathrm{sconv}(\cdot)$ is the spatial convolution operation (Yang et al. 2019). For full-graph mapping, the contextualized embedding is calculated by an attention module as $\sum_{k=1}^{N_o} \alpha_k e^x_{o_k}$, where $\alpha_k$ is calculated by a softmax over all the object embeddings $e^x_{o_{1:N_o}}$. Both $f_{\mathrm{Sub}}$ and $f_{\mathrm{Full}}$ use a linear mapping to project the resulting contextualized (English) embedding to the target (Chinese) embedding space. We consider graph-level mapping only for the object nodes, since relationships only exist between objects. For relationship and attribute nodes, only word-level mapping is performed.

Self-gated Adaptive Fusion. To leverage the complementary advantages of the information at different levels, we propose a self-gate mechanism to adaptively adjust the importance weights when fusing the embeddings. Specifically, the importance scores are calculated from the word-level embedding by passing it through a three-class MLP and a softmax function (Eq. 6). Compared with directly concatenating the embeddings from different levels, which assigns them equal importance, our fusion mechanism adaptively concentrates on important information and suppresses noise when the context becomes sophisticated.

Figure 2: Illustration of the HGM: (a) word-level mapping, (b) sub-graph mapping, (c) full-graph mapping. Sub-graph mapping only considers directly connected nodes, while full-graph mapping considers all the nodes in the scene graph.

Scene Graph Encoder. We encode $G^x$ and $G^y$ (mapped by the HGM) with two scene graph encoders, $G^x_{\mathrm{Enc}}(\cdot)$ and $G^y_{\mathrm{Enc}}(\cdot)$, which are implemented with spatial graph convolutions. The output of each scene graph encoder can be formulated as:

$$\big[f^l_{o_{1:N^l_o}},\; f^l_{r_{1:N^l_r}},\; f^l_{a_{1:N^l_a}}\big] = G^l_{\mathrm{Enc}}(G^l), \quad l \in \{x, y\} \qquad (7)$$

where $f^l_{o_{1:N^l_o}}$, $f^l_{r_{1:N^l_r}}$, and $f^l_{a_{1:N^l_a}}$ denote the sets of encoded object embeddings, relationship embeddings, and attribute embeddings, respectively. Each object embedding $f^l_{o_i}$ is calculated by considering the relationship triplets $\langle e^l_{\mathrm{sub}(o_i)}, e^l_{r_{\mathrm{sub}(o_i),i}}, e^l_{o_i} \rangle$ and $\langle e^l_{o_i}, e^l_{r_{i,\mathrm{obj}(o_i)}}, e^l_{\mathrm{obj}(o_i)} \rangle$, where $\mathrm{sub}(o_i)$ represents the subjects for which $o_i$ acts as an object, and $\mathrm{obj}(o_i)$ represents the objects for which $o_i$ plays the subject role. $f^l_{r_i}$ is calculated based on the relationship triplet $\langle e^l_{o_i}, e^l_{r_{i,j}}, e^l_{o_j} \rangle$. $f^l_{a_i}$ is the attribute embedding calculated from object $o_i$ and its associated attributes.

Sentence Decoder. As shown in Fig. 1, we have two decoders, $G^x_{\mathrm{Dec}}(\cdot)$ and $G^y_{\mathrm{Dec}}(\cdot)$. Each decoder is composed of three attention modules and an LSTM-based decoder. It takes the encoded scene graph features as input and generates the captions. The decoding process is defined as:

$$o^l_t, h^l_t = G^l_{\mathrm{Dec}}\big(f_{\mathrm{Triplet}}([z^l_o, z^l_r, z^l_a]),\; h^l_{t-1},\; \hat{s}^l_{t-1}\big) \qquad (8)$$

$$\hat{s}^l_t \sim \mathrm{softmax}(W_o o^l_t) \qquad (9)$$

where $l \in \{x, y\}$, $\hat{s}^l_t$ is the $t$-th decoded word drawn from the dictionary according to the softmax probability, $W_o$ is a learnable weight matrix, $o^l_t$ is the cell output of the decoder, and $h^l_t$ is the hidden state. $f_{\mathrm{Triplet}}(\cdot)$ is a non-linear mapping function that takes the concatenated features as input and outputs the triplet-level feature. $z^l_o$ is calculated by an attention module defined as $\sum_{i}^{N^l_o} \alpha^l_{o_i} f^l_{o_i}$, where $\alpha^l_{o_i}$ is the attention weight calculated by a softmax operation over $f^l_{o_{1:N^l_o}}$; $z^l_r$ and $z^l_a$ are calculated in a similar way.
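For concreteness, a minimal sketch of one decoding step in the spirit of Eqs. (8)-(9) is shown below: the encoded object, relationship, and attribute features are pooled by softmax attention, fused into a triplet-level feature, and fed to an LSTM cell. The single LSTMCell, the attention scorers, and the layer sizes are simplifying assumptions (the paper uses a 2-layer LSTM and beam search at inference), so treat this as an illustration rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletDecoderStep(nn.Module):
    """One decoding step in the spirit of Eqs. (8)-(9): attention pooling over the
    encoded scene graph features, triplet fusion, and an LSTM cell."""
    def __init__(self, feat_dim=1000, hidden=1000, vocab=11731):
        super().__init__()
        self.att_o = nn.Linear(feat_dim, 1)     # attention scorers, one per node type
        self.att_r = nn.Linear(feat_dim, 1)
        self.att_a = nn.Linear(feat_dim, 1)
        self.f_triplet = nn.Sequential(nn.Linear(3 * feat_dim, feat_dim), nn.ReLU())
        self.word_emb = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTMCell(feat_dim + hidden, hidden)  # also consumes the previous word
        self.W_o = nn.Linear(hidden, vocab)                 # Eq. (9)

    def pool(self, feats, scorer):
        alpha = F.softmax(scorer(feats), dim=0)   # attention weights over the nodes
        return (alpha * feats).sum(dim=0)         # z^l_o / z^l_r / z^l_a

    def forward(self, f_o, f_r, f_a, h_c, prev_word):
        z = torch.cat([self.pool(f_o, self.att_o),
                       self.pool(f_r, self.att_r),
                       self.pool(f_a, self.att_a)], dim=-1)
        x = torch.cat([self.f_triplet(z), self.word_emb(prev_word)], dim=-1)
        h, c = self.lstm(x.unsqueeze(0), h_c)     # Eq. (8)
        logits = self.W_o(h)                      # softmax over the vocabulary gives s^l_t
        return logits, (h, c)

# Usage sketch: 5 object, 4 relationship, and 3 attribute features of dimension 1000.
step = TripletDecoderStep()
h_c = (torch.zeros(1, 1000), torch.zeros(1, 1000))
logits, h_c = step(torch.randn(5, 1000), torch.randn(4, 1000),
                   torch.randn(3, 1000), h_c, torch.tensor(0))
```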
Joint-training Mechanism. Inspired by the observation that language-agnostic common structures exist in the encoded scene graph space and can be leveraged to benefit the encoding process, we propose a joint training mechanism that enhances the features in the target language with the help of the features in the source language. In practice, we train a separate scene graph encoder for each language in parallel, and then align the encoded scene graph features by enforcing them to be semantically close. Specifically, we train the scene graph encoders ($G^x_{\mathrm{Enc}}$ and $G^y_{\mathrm{Enc}}$), the sentence decoders ($G^x_{\mathrm{Dec}}$ and $G^y_{\mathrm{Dec}}$), and the cross-lingual HGM module ($G^x \rightarrow G^y$), supervised by a parallel corpus. The two graph encoders encode $G^x$ and $G^y$ into feature representations and predict sentences ($\hat{S}^x$ and $\hat{S}^y$) with the decoders. We minimize the following cross-entropy loss:

$$\mathcal{L}_{\mathrm{XE}} = -\sum_t \log P_{\theta_{G^x \rightarrow S^x}}(s^x_t \mid s^x_{0:t-1}, G^x) - \sum_t \log P_{\theta_{G^y \rightarrow S^y}}(s^y_t \mid s^y_{0:t-1}, G^y) \qquad (10)$$

where $s^x_t$ and $s^y_t$ are the ground-truth words, $G^x$ and $G^y$ are the sentence scene graphs in the two languages with $G^y$ derived from $G^x$ using our HGM, and $\theta_{G^x \rightarrow S^x}$ and $\theta_{G^y \rightarrow S^y}$ are the parameters of the two encoder-decoder models. To close the semantic gap between the encoded scene graph features $\{z^x_o, z^x_a, z^x_r\}$ and $\{z^y_o, z^y_a, z^y_r\}$, we introduce a Kullback-Leibler (KL) divergence loss:

$$\mathcal{L}_{\mathrm{KL}} = \exp\big(\mathrm{KL}(p(z^x_o) \,\|\, p(z^y_o))\big) + \exp\big(\mathrm{KL}(p(z^x_a) \,\|\, p(z^y_a))\big) + \exp\big(\mathrm{KL}(p(z^x_r) \,\|\, p(z^y_r))\big) \qquad (11)$$

where $p(\cdot)$ is composed of a linear layer that maps the input features to a low dimension $d_c$, followed by a softmax to obtain a probability distribution. The overall objective of our joint training mechanism is $\mathcal{L}_{\mathrm{Phase1}} = \mathcal{L}_{\mathrm{XE}} + \mathcal{L}_{\mathrm{KL}}$.

3.4 Unsupervised Cross-Modal Feature Mapping

To adapt the learned model from the sentence modality to the image modality, we draw inspiration from Gu et al. (2019) and adopt CycleGAN (Zhu et al. 2017) to align the features. For each type $p \in \{o, r, a\}$ of triplet embedding in Eq. 8, we have two mapping functions, $g^p_{I \rightarrow y}(\cdot)$ and $g^p_{y \rightarrow I}(\cdot)$, where $g^p_{I \rightarrow y}(\cdot)$ maps the features from the image modality to the sentence modality, and $g^p_{y \rightarrow I}(\cdot)$ maps from the sentence modality to the image modality. Note that we freeze the cross-lingual mapping module trained in the first phase. The training objective for cross-modal feature mapping is defined as:

$$\mathcal{L}^p_{\mathrm{CycleGAN}} = \mathcal{L}^{I \rightarrow y}_{\mathrm{GAN}} + \mathcal{L}^{y \rightarrow I}_{\mathrm{GAN}} + \lambda \mathcal{L}^{I \rightarrow y}_{\mathrm{cyc}} \qquad (12)$$

where $\mathcal{L}^{I \rightarrow y}_{\mathrm{cyc}}$ is a cycle-consistency loss, and $\mathcal{L}^{I \rightarrow y}_{\mathrm{GAN}}$ and $\mathcal{L}^{y \rightarrow I}_{\mathrm{GAN}}$ are the adversarial losses for the mapping functions with respect to the discriminators. Specifically, the objective of the mapping function $g^p_{I \rightarrow y}(\cdot)$ is to fool the discriminator $D^p_y$ through adversarial learning. We formulate the objective function for cross-modal mapping as:

$$\mathcal{L}^{I \rightarrow y}_{\mathrm{GAN}} = \mathbb{E}_S[\log D^p_y(z^y_p)] + \mathbb{E}_I[\log(1 - D^p_y(g^p_{I \rightarrow y}(z^I_p)))] \qquad (13)$$

where $z^y_p$ and $z^I_p$ are the encoded embeddings for the sentence scene graph $G^y$ and the image scene graph $G^{y,I}$, respectively. The adversarial loss for the sentence-to-image mapping, $\mathcal{L}^{y \rightarrow I}_{\mathrm{GAN}}$, is defined similarly. The cycle-consistency loss $\mathcal{L}^{I \rightarrow y}_{\mathrm{cyc}}$ is designed to regularize the training and make the mapping functions cycle-consistent:

$$\mathcal{L}^{I \rightarrow y}_{\mathrm{cyc}} = \mathbb{E}_I\big[\big\| g^p_{y \rightarrow I}(g^p_{I \rightarrow y}(z^I_p)) - z^I_p \big\|_1\big] + \mathbb{E}_y\big[\big\| g^p_{I \rightarrow y}(g^p_{y \rightarrow I}(z^y_p)) - z^y_p \big\|_1\big] \qquad (14)$$

The overall training objective for phase 2 is $\mathcal{L}_{\mathrm{Phase2}} = \mathcal{L}^o_{\mathrm{CycleGAN}} + \mathcal{L}^a_{\mathrm{CycleGAN}} + \mathcal{L}^r_{\mathrm{CycleGAN}}$.
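The sketch below illustrates how the losses of Eqs. (12)-(14) can be computed for one feature type $p$. The MLP mappers, the sigmoid discriminators, and the non-saturating form of the generator loss are common stand-ins and assumptions on our part; only the generator-side objective is shown (discriminator updates are omitted), and $\lambda = 10$ follows the value reported in Section 4.2.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, d_out), nn.LeakyReLU(), nn.Linear(d_out, d_out))

d = 1000
g_I2y, g_y2I = mlp(d, d), mlp(d, d)                 # mapping functions g^p_{I->y}, g^p_{y->I}
D_y = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())  # discriminator on the sentence side
D_I = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())  # discriminator on the image side
bce, l1, lam = nn.BCELoss(), nn.L1Loss(), 10.0

def cyclegan_losses(z_I, z_y):
    """Generator-side adversarial + cycle-consistency losses (Eqs. 12-14) for one type p."""
    fake_y, fake_I = g_I2y(z_I), g_y2I(z_y)
    pred_y, pred_I = D_y(fake_y), D_I(fake_I)
    # Non-saturating generator losses: each mapper tries to fool its discriminator.
    loss_gan_I2y = bce(pred_y, torch.ones_like(pred_y))
    loss_gan_y2I = bce(pred_I, torch.ones_like(pred_I))
    # Cycle consistency: mapping there and back should reconstruct the input (Eq. 14).
    loss_cyc = l1(g_y2I(fake_y), z_I) + l1(g_I2y(fake_I), z_y)
    return loss_gan_I2y + loss_gan_y2I + lam * loss_cyc

z_I = torch.randn(50, d)   # encoded image scene graph features z^I_p (batch of 50)
z_y = torch.randn(50, d)   # encoded sentence scene graph features z^y_p
loss = cyclegan_losses(z_I, z_y)
```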
3.5 Inference of the UNISON Framework

During inference, given an image $I$, we first extract the image scene graph $G^{x,I}$ with a pre-trained image scene graph generator and then map $G^{x,I}$ in $x$ (English) to $G^{y,I}$ in $y$ (Chinese) with our HGM module. After that, we encode $G^{y,I}$ with $G^y_{\mathrm{Enc}}(\cdot)$ and map the encoded features to the language domain through $g^p_{I \rightarrow y}(\cdot)$. The mapped features are then fed to the LSTM-based sentence decoder $G^y_{\mathrm{Dec}}(\cdot)$ to generate the image caption $\hat{S}^y$ in the target language $y$.

4 Experiments

4.1 Datasets and Setting

Datasets. For cross-lingual auto-encoding, we collect a paired English-Chinese corpus from existing MT datasets, including WMT19 (Barrault et al. 2019), AIC MT (Wu et al. 2017), UM (Tian et al. 2014), and Trans-zh (Brightmart 2019; https://doi.org/10.5281/zenodo.3402023). We filter the sentences in the MT datasets according to an existing caption-style dictionary containing 7,096 words from Li et al. (2019). For the first phase, we use 151,613 sentence pairs for training, 5,000 sentence pairs for validation, and 5,000 pairs for testing. For the second phase, following Li et al. (2019), we use 18,341 training images from MS-COCO and randomly select 18,341 Chinese sentences from the training split of the MT corpus. During evaluation, we use the validation and testing splits of COCO-CN.

| Corpus | 0 Obj/G | 1 Obj/G | 2 Obj/G | ≥3 Obj/G |
|---|---|---|---|---|
| Raw | 17.7% | 42.6% | 24.4% | 15.4% |
| Back-Trans. | 12.3% | 13.3% | 15.1% | 59.3% |

Table 1: Statistics of the English sentence scene graphs, where n Obj/G denotes the number of objects in a scene graph and ≥3 means greater than or equal to 3.

Preprocessing. We extract the image scene graphs with MOTIFS (Zellers et al. 2018) pre-trained on VG (Krishna et al. 2017). We tokenize and lowercase the English sentences, then replace the tokens that appear fewer than five times with UNK, resulting in a vocabulary size of 13,194. We segment the Chinese sentences with Jieba (https://github.com/fxsjy/jieba), resulting in a vocabulary size of 11,731. The English sentence scene graphs are extracted with the parser proposed by Anderson et al. (2016). We augment the English sentences with pre-trained back-translators (Ng et al. 2019), resulting in 808,065 English sentences in total, which helps enrich the English sentence scene graphs. Specifically, the statistics in Table 1 show that the percentage of scene graphs containing at least 3 objects increases from 15.4% to 59.3%.

4.2 Implementation Details

During the cross-lingual auto-encoding phase, we set the dimension of the scene graph embeddings to 1,000 and $d_c$ to 100. A 2-layer LSTM with a hidden size of 1,000 is adopted as the decoder. We initialize the graph mapping from a pre-trained common space (Joulin et al. 2018) to stabilize training. The cross-lingual encoder-decoder is first trained with $\mathcal{L}_{\mathrm{XE}}$ for 80 epochs, and then with the joint loss $\mathcal{L}_{\mathrm{Phase1}}$ for 20 epochs. During the unsupervised cross-modal mapping phase, we learn the cross-modal feature mapping on the unpaired MS-COCO images and the translation corpus. Specifically, we inherit and freeze the parameters of the Chinese scene graph encoder, the HGM, and the Chinese sentence decoder from the cross-lingual auto-encoding process. The cross-modal mapping functions and discriminators are learned with $\mathcal{L}_{\mathrm{Phase2}}$. We optimize the model with Adam, a batch size of 50, and a learning rate of $5 \times 10^{-5}$. The discriminators are implemented with a linear layer of dimension 1,000 and a LeakyReLU activation. We set $\lambda$ to 10. During inference, we use beam search with a beam size of 5. We use the popular BLEU (Papineni et al. 2002), CIDEr (Vedantam, Lawrence Zitnick, and Parikh 2015), METEOR (Denkowski and Lavie 2014), and ROUGE (Lin 2004) metrics for evaluation.
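For reference, the training settings reported in Section 4.2 can be collected into a small configuration dictionary. The key names below are our own illustrative labels, not the authors' configuration schema.

```python
# Settings as reported in Section 4.2; the dictionary structure and key names are
# illustrative assumptions, not taken from the released code.
UNISON_CONFIG = {
    "phase1_cross_lingual_auto_encoding": {
        "scene_graph_embedding_dim": 1000,
        "kl_projection_dim_dc": 100,            # d_c used by p(.) in Eq. (11)
        "decoder": {"type": "LSTM", "layers": 2, "hidden_size": 1000},
        "init": "pre-trained cross-lingual common space (Joulin et al. 2018)",
        "schedule_epochs": [("L_XE", 80), ("L_XE + L_KL", 20)],
    },
    "phase2_cross_modal_mapping": {
        "optimizer": "Adam",
        "batch_size": 50,
        "learning_rate": 5e-5,
        "discriminator": "Linear(1000) + LeakyReLU",
        "cycle_loss_weight_lambda": 10,
        "frozen_modules": ["Chinese scene graph encoder", "HGM", "Chinese sentence decoder"],
    },
    "inference": {"beam_size": 5},
}
```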
4.3 Model Statement

To gain insights into the effectiveness of our HGM, we construct ablative models by progressively introducing cross-lingual graph mappings at different levels:

- GM-Base is our baseline model, which adopts Google's MT system (Wu et al. 2016) to symbolically map the scene graph from English to Chinese in a node-to-node manner.
- GM-Word maps the English scene graph to Chinese through word-level mapping in the scene graph encoding space.
- GM-Word+Sub considers both word-level and sub-graph-level mappings by directly concatenating them.
- HGM-Base considers mappings across all levels, which are directly concatenated and passed through an FC layer.
- HGM is similar to HGM-Base, except that it adopts a self-gated fusion to adaptively fuse the three features, as illustrated by Eq. 5 and Eq. 6.

5 Results and Analysis

5.1 Overall Results

We demonstrate the superior performance of the proposed UNISON framework on the Chinese image caption generation task. We first compare UNISON with the SOTA unpaired method Graph-Align (Gu et al. 2019) under the setting without any caption corpus. More specifically, we run Graph-Align (code acquired from the first author of Gu et al. (2019)) and translate the generated English captions to Chinese with Google Translate for comparison.

| Method | B@1 | B@2 | B@3 | B@4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|
| Setting w/o caption corpus | | | | | | | |
| Graph-Align (En) (Gu et al. 2019) + Google Trans. | 39.2 | 16.7 | 6.5 | 2.3 | 13.2 | 26.5 | 9.3 |
| UNISON | 44.9 | 19.9 | 8.6 | 3.3 | 16.5 | 29.6 | 12.7 |
| Setting w/ caption corpus | | | | | | | |
| FC-2k (En) (Rennie et al. 2017) + Google Trans. | 58.9 | 38.0 | 23.5 | 14.3 | 23.5 | 40.2 | 47.3 |
| FC-2k (Cn, Pseudo COCO) (Rennie et al. 2017) | 60.4 | 40.7 | 26.8 | 17.3 | 24.0 | 43.6 | 52.7 |
| UNISON | 63.4 | 43.2 | 29.5 | 17.9 | 24.5 | 45.1 | 53.5 |

Table 2: Performance comparisons on the test split of COCO-CN. B@n is short for BLEU-n. En and Cn in parentheses represent English and Chinese, respectively. Google Trans. stands for Google Translate.

From the results in Table 2, we find that our method significantly surpasses Graph-Align with translation, demonstrating that translation at the graph level is superior to translation at the sentence level. This is reasonable, since graph-level alignment is able to consider the structural and relational information of the whole image, while sentence-level translation suffers from information loss, as it can only observe the predicted sentences, and can be severely affected if the translation tools perform poorly. We do not compare with the other unpaired method (Song et al. 2019) here, as its dataset and code are not publicly available.

To further verify the effectiveness of our framework, we compare UNISON with the supervised pipeline methods: (i) FC-2k (En) + Trans.: we train the FC-2k model on the image-caption pairs (En) of MS-COCO and translate the generated captions (En) to captions (Cn) using Google Translate; (ii) FC-2k (Pseudo): we train the FC-2k model on pseudo Chinese image-caption pairs of MS-COCO, where the captions (Cn) are translated from the captions (En) by Google Translate. For these comparisons, we fine-tune our cross-lingual mapping on the unpaired captions. The results show that our method significantly and consistently outperforms the FC-2k (En) + Trans. and FC-2k (Pseudo) models on all metrics, even though our unpaired setting is much weaker.
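As an aside on the metrics above, BLEU-style scores on segmented Chinese output can be reproduced with standard tooling; the snippet below uses NLTK's corpus_bleu on Jieba-segmented tokens. This is an illustration of the metric only, not the authors' evaluation pipeline, and the example sentences are invented. Note that NLTK returns scores in [0, 1], whereas the tables in this paper report scores on a 0-100 scale.

```python
# Illustrative BLEU computation on segmented Chinese captions with NLTK.
import jieba
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = ["一个男人在街上骑自行车"]      # ground-truth caption ("a man riding a bicycle on the street")
hypotheses = ["一个男人骑着自行车穿过街道"]  # generated caption ("a man rides a bicycle across the street")

refs_tok = [[list(jieba.cut(r))] for r in references]   # each item: list of tokenized references
hyps_tok = [list(jieba.cut(h)) for h in hypotheses]

smooth = SmoothingFunction().method1
b1 = corpus_bleu(refs_tok, hyps_tok, weights=(1, 0, 0, 0), smoothing_function=smooth)
b4 = corpus_bleu(refs_tok, hyps_tok, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"B@1 = {b1:.3f}, B@4 = {b4:.3f}")   # multiply by 100 to match the scale used in the tables
```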
5.2 Effectiveness of Cross-Lingual Alignment

Analyzing the superior performance of HGM. We conduct experiments on the MT task to demonstrate our HGM's effectiveness in cross-lingual alignment; the results are shown in Table 3.

| Method | B@1 | B@4 | M | R |
|---|---|---|---|---|
| GEN | 25.0 | 5.2 | 14.4 | 27.3 |
| GM-Base | 26.6 | 7.3 | 15.1 | 28.1 |
| GM-Word | 28.1 | 8.0 | 15.4 | 28.2 |
| GM-Word+Sub | 29.2 | 9.9 | 16.3 | 30.2 |
| HGM-Base | 29.6 | 9.9 | 16.5 | 30.4 |
| HGM | 30.4 | 11.1 | 17.0 | 31.9 |

Table 3: Performance comparison between variants of HGM on the Chinese sentence generation task. The test split of the MT corpus is used for evaluation.

The advantage of HGM lies in four aspects. (1) The cross-lingual graph translation is effective. Our HGM and its variants achieve considerably higher performance compared with GEN, which directly generates Chinese sentences based on English scene graphs. (2) The cross-lingual alignment in the encoding space is superior to direct symbolic translation. GM-Word achieves considerably higher performance than GM-Base, which shows that the scene graph encoding contains richer information and is more suitable for cross-lingual alignment. (3) Performing node mapping with features across different graph levels boosts the performance. When we consider full-graph and sub-graph level features, the cross-lingual alignment achieves significant performance improvements, which verifies the importance of the structural and relational information in the context. E.g., HGM outperforms GM-Word on the B@1, B@4, METEOR, and ROUGE metrics by 8.2%, 38.8%, 10.4%, and 13.1% (relative), respectively (see the arithmetic check after Table 4). (4) The adaptive self-gate fusion mechanism is beneficial. We can observe that HGM-Base is surpassed by HGM by a large margin. As shown in Table 5, the role of self-gate fusion becomes even more essential when HGM is applied to image scene graphs.

Joint training benefits the encoding process. We train our models using the joint loss $\mathcal{L}_{\mathrm{Phase1}}$, where $\mathcal{L}_{\mathrm{KL}}$ enforces the distributions of the latent scene graph embeddings in the two languages to be close. Table 4 shows that the models trained with the joint loss consistently outperform their counterparts trained with only $\mathcal{L}_{\mathrm{XE}}$ on all metrics, which indicates that the encoding process of the target language can benefit from the source language.

| Method | L_KL | B@1 | B@2 | B@4 | M | R |
|---|---|---|---|---|---|---|
| GM-Word | ✓ | 28.1 | 16.2 | 8.0 | 15.4 | 28.2 |
| w/o joint | | -0.4 | -0.4 | -0.3 | -0.1 | -0.2 |
| GM-Word+Sub | ✓ | 29.2 | 17.9 | 9.9 | 16.3 | 30.2 |
| w/o joint | | -0.3 | -0.3 | -0.4 | -0.2 | -0.1 |
| HGM | ✓ | 30.4 | 19.1 | 11.1 | 17.0 | 31.9 |
| w/o joint | | -0.5 | -0.3 | -0.3 | -0.2 | -0.4 |

Table 4: Effectiveness of joint training. Results are reported on the Chinese sentence generation task (test set).
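As a quick arithmetic check of the relative gains quoted in point (3) above, the snippet below recomputes them from the Table 3 rows for GM-Word and HGM.

```python
# Relative improvement of HGM over GM-Word, recomputed from Table 3.
gm_word = {"B@1": 28.1, "B@4": 8.0, "METEOR": 15.4, "ROUGE": 28.2}
hgm     = {"B@1": 30.4, "B@4": 11.1, "METEOR": 17.0, "ROUGE": 31.9}

for metric in gm_word:
    rel = (hgm[metric] - gm_word[metric]) / gm_word[metric] * 100
    print(f"{metric}: +{rel:.2f}%")
# Prints: B@1: +8.19%, B@4: +38.75%, METEOR: +10.39%, ROUGE: +13.12%,
# which round to the 8.2%, 38.8%, 10.4%, and 13.1% quoted in the text.
```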
5.3 Effectiveness of Cross-Modal Mapping

Table 5 shows the performance of the Chinese image captioners with and without CMM. We can see that adversarial training consistently improves the models' performance. Specifically, CMM boosts the performance of our HGM by 3.8% (B@1), 0.7% (B@4), 1.2% (ROUGE), and 3.0% (CIDEr), respectively. Notably, GM-Word+Sub and HGM-Base perform even worse than GM-Base, because the generated image scene graphs are noisy, with repeated relation triples (as explained in Section 5.5), leading to degradation of the contextualized cross-lingual graph mappings (sub-graph and full-graph), whereas the self-gated fusion tackles this problem by decreasing the importance of the noisy graph-level mappings.

| Method | B@1 | B@4 | M | R | C |
|---|---|---|---|---|---|
| GM-Word | 40.1 | 2.2 | 15.6 | 28.4 | 9.5 |
| GM-Word+CMM | 43.1 | 3.0 | 16.5 | 29.4 | 12.6 |
| GM-Word+Sub | 37.3 | 2.5 | 14.3 | 27.0 | 7.9 |
| GM-Word+Sub+CMM | 40.6 | 2.8 | 15.2 | 28.3 | 10.8 |
| HGM-Base | 38.0 | 2.4 | 14.4 | 27.3 | 8.0 |
| HGM-Base+CMM | 39.8 | 2.6 | 14.8 | 27.7 | 10.2 |
| HGM | 41.1 | 2.6 | 15.7 | 28.4 | 9.7 |
| HGM+CMM | 44.9 | 3.3 | 16.5 | 29.6 | 12.7 |

Table 5: Effectiveness of CMM. Results are reported on the test set of COCO-CN. C is short for CIDEr.

5.4 Human Evaluation

Table 6 shows the human evaluation results. Caption quality is measured by fluency and relevancy metrics. Fluency measures whether the generated caption is fluent, and relevancy measures whether the caption correctly describes the relevant information of the image. The metrics are graded as 1 (Very poor), 2 (Poor), 3 (Adequate), 4 (Good), or 5 (Excellent). We invite 10 Chinese native speakers from diverse professional backgrounds to participate in the evaluation. Table 6 reports the mean scores, which illustrate that our method can generate relevant and human-like captions.

| Metric | GM-Word | GM-Word+Sub | HGM | HGM* | GT |
|---|---|---|---|---|---|
| Rel. | 2.78 | 2.96 | 3.22 | 3.96 | 4.86 |
| Flu. | 2.49 | 2.76 | 3.05 | 4.06 | 4.91 |

Table 6: Human evaluation on the COCO-CN test split. HGM* represents the fine-tuned HGM. Models are trained with CMM. Rel. and Flu. are short for relevancy and fluency, respectively.

5.5 Qualitative Results

We provide some Chinese captioning examples for MS-COCO images in Fig. 3. We can see that our method can generate reasonable image descriptions without using any paired image-caption data. We also observe that the image scene graphs are quite noisy, which potentially explains the performance degradation when introducing graph mappings without the self-gated fusion mechanism (see Table 5).

Figure 3: Qualitative results of different unsupervised cross-lingual caption generation models (the English translations of the example outputs include, e.g., "A young girl standing on a golden meadow" and "A man riding a bicycle on the sidewalk").

6 Conclusion

In this paper, we propose a novel framework to learn a cross-lingual image captioning model without any image-caption pairs. Extensive experiments demonstrate the effectiveness of our proposed methods. We hope our work can provide inspiration for future research on unpaired image captioning.

Acknowledgments

We would like to thank Lingpeng Kong, Renjie Pi, and the anonymous reviewers for insightful suggestions that have significantly improved the paper. This work was supported by TCL Corporate Research (Hong Kong). The research of Philip L.H. Yu was supported by a start-up research grant from the Education University of Hong Kong (#R4162).

References

Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV.
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Barrault, L.; Bojar, O.; Costa-Jussà, M. R.; Federmann, C.; Fishel, M.; Graham, Y.; Haddow, B.; Huck, M.; Koehn, P.; Malmasi, S.; et al. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In WMT.
Brightmart. 2019. NLP Chinese Corpus: Large Scale Chinese Corpus for NLP. Zenodo.
Conneau, A.; Rinott, R.; Lample, G.; Williams, A.; Bowman, S.; Schwenk, H.; and Stoyanov, V. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. In EMNLP.
Denkowski, M.; and Lavie, A. 2014. Meteor Universal: Language specific translation evaluation for any target language. In ACL.
Eberhard, D. M.; Simons, G. F.; and Fennig, C. D., eds. 2019. Ethnologue: Languages of the World. SIL International, 22nd edition.
Esplà, M.; Forcada, M.; Ramírez-Sánchez, G.; and Hoang, H. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks, 118-119. Dublin, Ireland: European Association for Machine Translation.
Feng, Y.; Ma, L.; Liu, W.; and Luo, J. 2019. Unsupervised image captioning. In CVPR.
Gu, J.; Joty, S.; Cai, J.; and Wang, G. 2018. Unpaired image captioning by language pivoting. In ECCV.
Gu, J.; Joty, S.; Cai, J.; Zhao, H.; Yang, X.; and Wang, G. 2019. Unpaired image captioning via scene graph alignments. In ICCV.
Gu, J.; Kuen, J.; Joty, S.; Cai, J.; Morariu, V.; Zhao, H.; and Sun, T. 2020. Self-supervised relationship probing. In NeurIPS.
Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; and Johnson, M. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. CoRR, abs/2003.11080.
Johnson, J.; Gupta, A.; and Fei-Fei, L. 2018. Image generation from scene graphs. In CVPR.
Joulin, A.; Bojanowski, P.; Mikolov, T.; Jégou, H.; and Grave, E. 2018. Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion. In EMNLP.
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV.
Laina, I.; Rupprecht, C.; and Navab, N. 2019. Towards Unsupervised Image Captioning with Shared Multimodal Embeddings. In ICCV.
Lan, W.; Li, X.; and Dong, J. 2017. Fluency-guided cross-lingual image captioning. In ACMMM.
Li, X.; Xu, C.; Wang, X.; Lan, W.; Jia, Z.; Yang, G.; and Xu, J. 2019. COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval. TMM.
Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
Ng, N.; Yee, K.; Baevski, A.; Ott, M.; Auli, M.; and Edunov, S. 2019. Facebook FAIR's WMT19 News Translation Task Submission. arXiv preprint arXiv:1907.06616.
Nguyen, K.; Tripathi, S.; Du, B.; Guha, T.; and Nguyen, T. Q. 2021. In Defense of Scene Graphs for Image Captioning. In ICCV.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In CVPR.
Schuster, S.; Krishna, R.; Chang, A.; Fei-Fei, L.; and Manning, C. D. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In ACL.
Song, Y.; Chen, S.; Zhao, Y.; and Jin, Q. 2019. Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards. In ACMMM.
Tian, L.; Wong, D. F.; Chao, L. S.; Quaresma, P.; Oliveira, F.; and Yi, L. 2014. UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation. In LREC.
Tran, K.; He, X.; Zhang, L.; Sun, J.; Carapcea, C.; Thrasher, C.; Buehler, C.; and Sienkiewicz, C. 2016. Rich image captioning in the wild. In CVPRW.
Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. In CVPR.
Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In CVPR.
Wang, Y.-S.; Liu, C.; Zeng, X.; and Yuille, A. 2018. Scene graph parsing as dependency parsing. arXiv preprint arXiv:1803.09189.
Wu, J.; Zheng, H.; Zhao, B.; Li, Y.; Yan, B.; Liang, R.; Wang, W.; Zhou, S.; Lin, G.; Fu, Y.; et al. 2017. AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding. arXiv preprint arXiv:1711.06475.
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Yang, X.; Tang, K.; Zhang, H.; and Cai, J. 2019. Auto-encoding scene graphs for image captioning. In CVPR.
Zellers, R.; Yatskar, M.; Thomson, S.; and Choi, Y. 2018. Neural Motifs: Scene graph parsing with global context. In CVPR.
Zhong, Y.; Wang, L.; Chen, J.; Yu, D.; and Li, Y. 2020. Comprehensive Image Captioning via Scene Graph Decomposition. In Vedaldi, A.; Bischof, H.; Brox, T.; and Frahm, J., eds., ECCV.
Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.