Published as a conference paper at ICLR 2019

IMPROVING SEQUENCE-TO-SEQUENCE LEARNING VIA OPTIMAL TRANSPORT

Liqun Chen1, Yizhe Zhang2, Ruiyi Zhang1, Chenyang Tao1, Zhe Gan3, Haichao Zhang4, Bai Li1, Dinghan Shen1, Changyou Chen5, Lawrence Carin1
1Duke University, 2Microsoft Research, 3Microsoft Dynamics 365 AI Research, 4Baidu Research, 5SUNY at Buffalo
{liqun.chen}@duke.edu

ABSTRACT

Sequence-to-sequence models are commonly trained via maximum likelihood estimation (MLE). However, standard MLE training considers a word-level objective, predicting the next word given the previous ground-truth partial sentence. This procedure focuses on modeling local syntactic patterns, and may fail to capture long-range semantic structure. We present a novel solution to alleviate these issues. Our approach imposes global sequence-level guidance via new supervision based on optimal transport, enabling the overall characterization and preservation of semantic features. We further show that this method can be understood as a Wasserstein gradient flow trying to match our model to the ground-truth sequence distribution. Extensive experiments are conducted to validate the utility of the proposed approach, showing consistent improvements over a wide variety of NLP tasks, including machine translation, abstractive text summarization, and image captioning.

1 INTRODUCTION

Sequence-to-sequence (Seq2Seq) models are widely used in various natural language processing tasks, such as machine translation (Bahdanau et al., 2015; Cho et al., 2014; Sutskever et al., 2014), text summarization (Chopra et al., 2016; Rush et al., 2015) and image captioning (Vinyals et al., 2015; Xu et al., 2015). Typically, Seq2Seq models are based on an encoder-decoder architecture, with an encoder mapping a source sequence into a latent vector, and a decoder translating the latent vector into a target sequence. The goal of a Seq2Seq model is to optimize this encoder-decoder network to generate sequences close to the target. Therefore, a proper measure of the distance between sequences is crucial for model training.

Maximum likelihood estimation (MLE) is often used as the training paradigm in existing Seq2Seq models (Goodfellow et al., 2016; Lamb et al., 2016). The MLE-based approach maximizes the likelihood of the next word conditioned on its previous ground-truth words. Such an approach adopts the cross-entropy loss as the objective, essentially measuring the word difference at each position of the target sequence (assuming the preceding words are the ground truth). That is, MLE only provides a word-level training loss (Ranzato et al., 2016). Consequently, MLE-based methods suffer from the so-called exposure bias problem (Bengio et al., 2015; Ranzato et al., 2016), i.e., the discrepancy between the training and inference stages. During inference, each word is generated sequentially based on previously generated words; however, ground-truth words are used at each timestep during training (Huszár, 2015; Wiseman & Rush, 2016). Such discrepancy between training and testing leads to accumulated errors along the sequence-generation trajectory, and may therefore produce unstable results in practice. Further, commonly used metrics for evaluating the generated sentences at test time are sequence-level, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). This also indicates a mismatch between the training loss and the test-time evaluation metrics.
Attempts have been made to alleviate the above issues via a sequence-level training loss that enables comparisons between the entire generated and reference sequences. Such efforts roughly fall into two categories: (i) reinforcement-learning-based (RL) methods (Bahdanau et al., 2017; Ranzato et al., 2016) and (ii) adversarial-learning-based methods (Yu et al., 2017; Zhang et al., 2017). These methods overcome the exposure bias issue by criticizing model output during training; however, both schemes have their own vulnerabilities. RL methods often suffer from large variance in policy-gradient estimation, and control variates and carefully designed baselines (such as a self-critic) are needed to make RL training more robust (Liu et al., 2018; Rennie et al., 2017). Further, the rewards used by RL training are often criticized as a bad proxy for human evaluation, as they are usually highly biased towards certain particular aspects (Wang et al., 2018b). On the other hand, adversarial supervision relies on the delicate balance of a mini-max game, which can be easily undermined by mode-trapping and gradient-vanishing problems (Arjovsky et al., 2017; Zhang et al., 2017). Sophisticated tuning is often required for successful adversarial training.

Figure 1: Different matching schemes between the sentences "six students playing football" and "six freshmen are playing soccer". Left to right: hard matching, soft bipartite matching and OT matching. Dominant edges are shown in dark green for OT matching.

We present a novel Seq2Seq learning scheme that leverages optimal transport (OT) to construct a sequence-level loss. Specifically, the OT objective aims to find an optimal matching of similar words/phrases between two sequences, providing a way to promote their semantic similarity (Kusner et al., 2015). Compared with the above RL and adversarial schemes, our approach has: (i) semantic invariance, allowing better preservation of sequence-level semantic information; and (ii) improved robustness, since neither the REINFORCE gradient nor the mini-max game is involved. The OT loss allows end-to-end supervised training and acts as an effective sequence-level regularizer of the MLE loss. Another novel strategy distinguishing our model from previous approaches is that during training we consider not only the OT distance between the generated sentence and the ground-truth references, but also the OT distance between the generated sentence and its corresponding input. This enables our model to simultaneously match the generated output sentence with both the source sentence(s) and the target reference sentence, thus encouraging the generator to leverage information contained in the input sentence(s) during generation.

The main contributions of this paper are summarized as follows. (i) A new sequence-level training algorithm based on optimal transport is proposed for Seq2Seq learning. In practice, the OT distance is introduced as a regularization term to the MLE training loss. (ii) Our model can be interpreted as an approximate Wasserstein gradient flow, learning to approximately match the sequence distribution induced by the generator and a target data distribution. (iii) In order to demonstrate the versatility of the proposed method, we conduct extensive empirical evaluations on three tasks: machine translation, text summarization, and image captioning.
2 SEMANTIC MATCHING WITH OPTIMAL TRANSPORT

We consider two components of a sentence: its syntactic and semantic parts. In a Seq2Seq model, it is often desirable to preserve the semantic meaning, while the syntactic part can be more flexible. Conventional training schemes, such as MLE, are known to be well-suited for capturing the syntactic structure. As such, we focus on the semantic part.

An intuitive way to assess semantic similarity is to directly match the key words between the synthesized and the reference sequences. Consider the respective sequences as sets $A$ and $B$, with vocabulary items as their elements. Then the matching can be evaluated by $|A \cap B|$, where $|\cdot|$ is the counting measure for sets. We call this hard matching, as it seeks to exactly match words from both sequences.

For language models, the above hard matching can be an oversimplification. This is because words have semantic meaning, and two different words can be close to each other in the semantic space. To account for such ambiguity, we can relax hard matching to soft bipartite matching (SBM). More specifically, assuming all sequences have the same length $n$, we pair $w_{i_k} \in A$ and $w'_{j_k} \in B$ for $k \in [1, K]$, such that $K \le n$, the indices $\{i_k\}$ and $\{j_k\}$ are unique, and $\mathcal{L}_{\mathrm{SBM}} = \sum_k c(w_{i_k}, w'_{j_k})$ is minimized. Here $c(w, w')$ is a cost function measuring the semantic dissimilarity between the two words. For instance, the cosine distance $c(x, y) = 1 - \frac{x^\top y}{\|x\|_2 \|y\|_2}$ between two word embedding vectors $x$ and $y$ is a popular choice (Pennington et al., 2014). This minimization can be solved exactly, e.g., via the Hungarian algorithm (Kuhn, 1955). Unfortunately, its $O(n^3)$ complexity scales badly for common NLP tasks, and the objective is also non-differentiable wrt model parameters. As such, end-to-end supervised training is not feasible with the Hungarian matching scheme. To overcome this difficulty, we propose to further relax the matching criteria while keeping the favorable features of a semantic bipartite matching. OT arises as a natural candidate.
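To make the hard-matching and soft-bipartite-matching baselines concrete, the following is a minimal NumPy/SciPy sketch that scores the two toy sentences from Figure 1 under both schemes. The toy embeddings are random stand-ins for a learned embedding matrix, and `linear_sum_assignment` is SciPy's Hungarian solver; this is an illustration under those assumptions, not code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy "word embeddings": in practice these come from a learned embedding matrix E.
words = ["six", "students", "playing", "football", "freshmen", "are", "soccer"]
emb = {w: v for w, v in zip(words, np.random.default_rng(0).normal(size=(7, 8)))}

A = ["six", "students", "playing", "football"]
B = ["six", "freshmen", "are", "playing", "soccer"]

# Hard matching: count of exactly shared words, |A ∩ B|.
hard_score = len(set(A) & set(B))

# Soft bipartite matching: cosine-distance cost, solved by the Hungarian algorithm.
def cosine_dist(x, y):
    return 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

C = np.array([[cosine_dist(emb[a], emb[b]) for b in B] for a in A])
rows, cols = linear_sum_assignment(C)   # O(n^3), and non-differentiable wrt the embeddings
sbm_cost = C[rows, cols].sum()          # L_SBM = sum_k c(w_{i_k}, w'_{j_k})

print(hard_score, sbm_cost)
```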
2.1 OPTIMAL TRANSPORT AND WASSERSTEIN DISTANCE

We first provide a brief review of optimal transport, which defines distances between probability measures on a domain $\mathbb{X}$ (the sequence space in our setting). The optimal transport distance for two probability measures $\mu$ and $\nu$ is defined as (Peyré et al., 2017):
$$ \mathcal{D}_c(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}\left[c(x, y)\right] , \qquad (1) $$
where $\Pi(\mu, \nu)$ denotes the set of all joint distributions $\gamma(x, y)$ with marginals $\mu(x)$ and $\nu(y)$, and $c(x, y): \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ is the cost function for moving $x$ to $y$, e.g., the Euclidean or cosine distance. Intuitively, the optimal transport distance is the minimum cost that $\gamma$ induces in order to transport from $\mu$ to $\nu$. When $c(x, y)$ is a metric on $\mathbb{X}$, $\mathcal{D}_c(\mu, \nu)$ induces a proper metric on the space of probability distributions supported on $\mathbb{X}$, commonly known as the Wasserstein distance (Villani, 2008). One of the most popular choices is the 2-Wasserstein distance $W_2^2(\mu, \nu)$, where the squared Euclidean distance $c(x, y) = \|x - y\|_2^2$ is used as the cost.

OT distance on discrete domains. We mainly focus on applying the OT distance to textual data; therefore, we only consider OT between discrete distributions. Specifically, consider two discrete distributions $\mu, \nu \in \mathbb{P}(\mathbb{X})$, which can be written as $\mu = \sum_{i=1}^n u_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m v_j \delta_{y_j}$, with $\delta_x$ the Dirac function centered on $x$. The weight vectors $u = \{u_i\}_{i=1}^n$ and $v = \{v_j\}_{j=1}^m$ respectively belong to the $n$- and $m$-dimensional simplex, i.e., $\sum_{i=1}^n u_i = \sum_{j=1}^m v_j = 1$, as both $\mu$ and $\nu$ are probability distributions. Under such a setting, computing the OT distance defined in (1) is equivalent to solving the following network-flow problem (Luise et al., 2018):
$$ \mathcal{L}_{\mathrm{ot}}(\mu, \nu) = \min_{T \in \Pi(u, v)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, c(x_i, y_j) = \min_{T \in \Pi(u, v)} \langle T, C \rangle , \qquad (2) $$
where $\Pi(u, v) = \{T \in \mathbb{R}_+^{n \times m} \mid T\mathbf{1}_m = u,\; T^\top \mathbf{1}_n = v\}$, $\mathbf{1}_n$ denotes an $n$-dimensional all-one vector, $C$ is the cost matrix given by $C_{ij} = c(x_i, y_j)$, and $\langle T, C \rangle = \mathrm{Tr}(T^\top C)$ represents the Frobenius dot-product. We refer to the minimizer $T^*$ of (2) as the OT matching. Comparing the two objectives, one can readily recognize that soft bipartite matching represents a special constrained solution to (2), where $T$ can only take values in $\Gamma = \{T \mid \max_i\{\|T e_i\|_0, \|e_i^\top T\|_0\} \le 1,\; T_{ij} \in \{0, 1\},\; \|T\|_0 = K\}$ instead of $\Pi(u, v)$; here $\|\cdot\|_0$ is the $L_0$ norm and $e_i$ is the unit vector along the $i$-th axis. As such, OT matching can be regarded as a relaxed version of soft bipartite matching. In Figure 1 we illustrate the three matching schemes discussed above.

The IPOT algorithm. Unfortunately, the exact minimization over $T$ is in general computationally intractable (Arjovsky et al., 2017; Genevay et al., 2018; Salimans et al., 2018). To overcome such intractability, we consider an efficient iterative approach to approximate the OT distance. We propose to use the recently introduced Inexact Proximal point method for Optimal Transport (IPOT) algorithm to compute the OT matrix $T^*$, and thus also the OT distance (Xie et al., 2018). IPOT provides a solution to the original OT problem specified in (2). Specifically, IPOT iteratively solves the following optimization problem using the proximal point method (Boyd & Vandenberghe, 2004):
$$ T^{(t+1)} = \arg\min_{T \in \Pi(u, v)} \left\{ \langle T, C \rangle + \beta\, \mathcal{B}(T, T^{(t)}) \right\} , \qquad (3) $$
where the proximity metric term $\mathcal{B}(T, T^{(t)})$ penalizes solutions that are too distant from the latest approximation, and $1/\beta$ is understood as the generalized stepsize. This renders a tractable iterative scheme towards the exact OT solution. In this work, we employ the generalized KL Bregman divergence $\mathcal{B}(T, T^{(t)}) = \sum_{i,j} T_{ij} \log \frac{T_{ij}}{T^{(t)}_{ij}} - \sum_{i,j} T_{ij} + \sum_{i,j} T^{(t)}_{ij}$ as the proximity metric. Algorithm 1 describes the implementation details for IPOT.

Algorithm 1: IPOT algorithm
1: Input: feature vectors $S = \{z_i\}_1^n$, $S' = \{z'_j\}_1^m$ and generalized stepsize $1/\beta$
2: $\sigma = \frac{1}{m}\mathbf{1}_m$, $T^{(1)} = \mathbf{1}_n \mathbf{1}_m^\top$
3: $C_{ij} = c(z_i, z'_j)$, $A_{ij} = e^{-C_{ij}/\beta}$
4: for $t = 1, 2, 3, \ldots$ do
5:   $Q = A \odot T^{(t)}$   // $\odot$ is the Hadamard product
6:   for $k = 1, \ldots, K$ do   // $K = 1$ in practice
7:     $\delta = \frac{1}{n Q \sigma}$,  $\sigma = \frac{1}{m Q^\top \delta}$
8:   end for
9:   $T^{(t+1)} = \mathrm{diag}(\delta)\, Q\, \mathrm{diag}(\sigma)$
10: end for
11: Return $\langle T, C \rangle$

Note that the Sinkhorn algorithm (Cuturi, 2013) can also be used to compute the OT matrix. Specifically, the Sinkhorn algorithm solves the entropy-regularized optimization problem $\hat{\mathcal{L}}_{\mathrm{ot}}(\mu, \nu) = \min_{T \in \Pi(u, v)} \langle T, C \rangle - \frac{1}{\epsilon} H(T)$, where $H(T) = -\sum_{i,j} T_{ij}(\log T_{ij} - 1)$ is the entropy regularization term and $\epsilon > 0$ is the regularization strength. However, in our experiments we empirically found that the numerical stability and performance of the Sinkhorn algorithm are quite sensitive to the choice of the hyper-parameter $\epsilon$, so only IPOT is considered in our model training.
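The proximal point iterations in Algorithm 1 are straightforward to implement. Below is a minimal NumPy sketch, assuming a cosine cost and uniform weights; the function names, the fixed outer-iteration count, and the default β = 0.5 (the value used later for the NMT experiments) are our choices for illustration, not part of the paper's released code.

```python
import numpy as np

def cosine_cost(Z_ref, Z_gen):
    """Pairwise cosine-distance cost matrix between two sets of embeddings."""
    Zr = Z_ref / np.linalg.norm(Z_ref, axis=1, keepdims=True)
    Zg = Z_gen / np.linalg.norm(Z_gen, axis=1, keepdims=True)
    return 1.0 - Zr @ Zg.T                      # C_ij = 1 - cos(z_i, z'_j)

def ipot_distance(Z_ref, Z_gen, beta=0.5, n_iter=50, inner_k=1):
    """Approximate OT distance <T, C> via the IPOT iterations of Algorithm 1;
    1/beta plays the role of the generalized stepsize."""
    n, m = Z_ref.shape[0], Z_gen.shape[0]
    C = cosine_cost(Z_ref, Z_gen)
    A = np.exp(-C / beta)
    T = np.ones((n, m))                         # T^(1) = 1_n 1_m^T
    sigma = np.full((m, 1), 1.0 / m)
    for _ in range(n_iter):
        Q = A * T                               # Hadamard product
        for _ in range(inner_k):                # K = 1 in practice
            delta = 1.0 / (n * (Q @ sigma))
            sigma = 1.0 / (m * (Q.T @ delta))
        T = delta * Q * sigma.T                 # diag(delta) Q diag(sigma)
    return np.sum(T * C), T
```

In training, the returned scalar is used as the OT loss between two embedded sequences, with gradients flowing back through the cost matrix.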
2.2 OPTIMAL TRANSPORT DISTANCE AS A SEQUENCE-LEVEL LOSS

Figure 2: Schematic computation graph of the OT loss.

Figure 2 illustrates how OT is computed to construct the sequence-level loss. Given two sentences, we can construct their word-level or phrase-level embedding matrices $S$ and $S'$, where $S = \{z_i\}$ is usually recognized as the reference sequence embedding and $S' = \{z'_j\}$ as the model output sequence embedding. The cost matrix $C$ is then computed via $C_{ij} = c(z_i, z'_j)$ and passed on to the IPOT algorithm to obtain the OT distance. Our full algorithm is summarized in Algorithm 2, and more detailed model specifications are given below.

Encoding model belief with a differentiable sequence generator. We first describe how to design a differentiable sequence generator, so that gradients can be backpropagated from the OT losses to update the model belief. The Long Short-Term Memory (LSTM) recurrent neural network (Hochreiter & Schmidhuber, 1997) is used as our sequence model. At each timestep $t$, the LSTM decoder outputs a logit vector $v_t$ over the vocabulary, based on its context. Directly sampling from the multinomial distribution $\hat{w}_t \sim \mathrm{Softmax}(v_t)$ is a non-differentiable operation¹, so we consider the following differentiable alternatives:

Soft-argmax: $\hat{w}^{\mathrm{SA}}_t = \mathrm{Softmax}(v_t/\tau)$, where $\tau \in (0, 1)$ is the annealing parameter (Zhang et al., 2017). This approximates the deterministic sampling scheme $\hat{w}^{\max}_t = \arg\max\{v_t\}$.

Gumbel-softmax (GS): $\hat{w}^{\mathrm{GS}}_t = \mathrm{Softmax}((v_t + \xi_t)/\tau)$, where $\xi_t$ are iid Gumbel random variables, one for each vocabulary entry. It is also known as the Concrete distribution (Jang et al., 2016; Maddison et al., 2017).

Unstable training and sub-optimal solutions have been observed with the GS-based scheme for the Seq2Seq tasks we considered (see Appendix G, Table 11), possibly due to the extra uncertainty introduced. As such, we assume the use of soft-argmax to encode the model belief in $\hat{w}_t$ unless otherwise specified. Note that $\hat{w}_t$ is a normalized non-negative vector that sums to one.

Sequence-level OT-matching loss. To pass the model belief on to the OT loss, we use the mean word embedding predicted by the model, given by $\hat{z}_t = E^\top \hat{w}_t$, where $E \in \mathbb{R}^{V \times d}$ is the word embedding matrix, $V$ is the vocabulary size and $d$ is the dimension of the embedding vectors. We collect the predicted sequence embeddings into $S_g = \{\hat{z}_t\}_{t=1}^L$, where $L$ is the length of the sequence. Similarly, we denote the reference sequence embeddings as $S_r = \{z_t\}_{t=1}^L$, using the ground-truth one-hot input token sequence $\{w_t\}$. Based on the sequence embeddings $S_r$ and $S_g$, we can compute the sequence-level OT loss between the ground truth and the model prediction using the IPOT algorithm described above for different Seq2Seq tasks:
$$ \mathcal{L}_{\mathrm{seq}} = \mathrm{IPOT}(S_g, S_r) . \qquad (4) $$

¹ Here $\hat{w}_t$ is understood as a one-hot vector, in order to be notationally consistent with its differentiable alternatives.

Algorithm 2: Seq2Seq learning via optimal transport
1: Input: batch size $m$, paired input and output sequences $(X, Y)$
2: Load the MLE pre-trained Seq2Seq model $M(\cdot\,; \theta)$ and the word embedding matrix $E$
3: for iteration $= 1, \ldots,$ MaxIter do
4:   for $i = 1, \ldots, m$ do
5:     Draw a pair of sequences $x_i, y_i \sim (X, Y)$, where $x_i = \{\tilde{w}_{i,t}\}$, $y_i = \{w_{i,t}\}$
6:     Compute logit vectors from the model: $\{v_{i,t}\} = M(x_i; \theta)$
7:     Encode the model belief: $\hat{w}_{i,t} = \text{Soft-argmax}(v_{i,t})$
8:     Feature vector embeddings: $S_{r,i} = \{E^\top w_{i,t}\}$, $S_{g,i} = \{E^\top \hat{w}_{i,t}\}$
9:   end for
10:  Update $M(\cdot\,; \theta)$ by optimizing $\frac{1}{m}\sum_{i=1}^m \left[\mathcal{L}_{\mathrm{MLE}}(x_i, y_i; \theta) + \gamma \mathcal{L}_{\mathrm{seq}}(S_{r,i}, S_{g,i})\right]$
11: end for
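The belief-encoding step (lines 7-8 of Algorithm 2) reduces to a low-temperature softmax followed by a matrix product with the embedding table. The sketch below shows this in NumPy; all sizes, the random seed, and the temperature are illustrative, and `S_g`, `S_r` would then be passed to an IPOT routine such as the one sketched in Section 2.1.

```python
import numpy as np

def soft_argmax(logits, tau=0.1):
    """Differentiable surrogate for argmax: a low-temperature softmax over the vocabulary."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

# Toy setup (sizes are illustrative, not the paper's).
V, d, L = 1000, 512, 20                          # vocab size, embedding dim, sequence length
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))                      # word embedding matrix, E in R^{V x d}
logits = rng.normal(size=(L, V))                 # decoder logits v_t for each timestep
ref_ids = rng.integers(0, V, size=L)             # ground-truth token indices w_t

w_hat = soft_argmax(logits)                      # model belief, one distribution per step
S_g = w_hat @ E                                  # predicted embeddings, z_hat_t = E^T w_hat_t
S_r = E[ref_ids]                                 # reference embeddings (one-hot lookup)
# S_g and S_r are fed to IPOT to obtain L_seq; the total loss is L_MLE + gamma * L_seq.
```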
Soft-copying mechanism. We additionally consider feature matching using the OT criterion between the source and the target. Intuitively, this encourages the global semantic meaning to be preserved from source to target. It is related to the copy network (Gu et al., 2016); however, in our framework the copying mechanism can be understood as soft optimal-transport-based copying, instead of the original hard retrieval-based copying used by Gu et al. (2016). This soft copying mechanism considers semantic similarity in the embedding space, and thus presumably delivers a smoother transformation of information. In the case where the source and target sequences do not share a vocabulary (e.g., machine translation), this objective can still be applied by sharing the word embedding space between source and target. Ideally, the embeddings for the same concept in different languages will automatically be aligned by optimizing such a loss, making a cosine-similarity-based cost matrix available. This is also related to bilingual skip-gram (Luong et al., 2015b). We denote this loss as $\mathcal{L}_{\mathrm{copy}} = \mathrm{IPOT}(S_g, S_s)$, where $S_s$ represents the source sequence embeddings.

Complementing MLE training with OT regularization. The OT training objectives discussed above cannot train a proper language model on their own, as they do not explicitly consider word ordering, i.e., the syntactic structure of a language model. To overcome this issue, we propose to combine the OT loss with the de facto likelihood loss $\mathcal{L}_{\mathrm{MLE}}$, which gives the final training objective $\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \gamma \mathcal{L}_{\mathrm{seq}}$, where $\gamma > 0$ is a hyper-parameter to be tuned. For tasks with both input and output sentences, such as machine translation and text summarization, $\mathcal{L}_{\mathrm{copy}}$ can be applied, in which case the final objective can be written as $\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \gamma_1 \mathcal{L}_{\mathrm{copy}} + \gamma_2 \mathcal{L}_{\mathrm{seq}}$.
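The sketch below assembles the full objective for one toy training pair. The compact `ipot_distance` helper repeats, in abbreviated form, the routine sketched in Section 2.1; the tensors, the toy cross-entropy term, and the γ values are illustrative assumptions (γ is tuned per task in the paper).

```python
import numpy as np

def ipot_distance(S_a, S_b, beta=0.5, n_iter=50):
    """Compact IPOT (Algorithm 1) with cosine cost; returns <T, C>."""
    A_n = S_a / np.linalg.norm(S_a, axis=1, keepdims=True)
    B_n = S_b / np.linalg.norm(S_b, axis=1, keepdims=True)
    C = 1.0 - A_n @ B_n.T
    n, m = C.shape
    A, T, sigma = np.exp(-C / beta), np.ones((n, m)), np.full((m, 1), 1.0 / m)
    for _ in range(n_iter):
        Q = A * T
        delta = 1.0 / (n * (Q @ sigma))
        sigma = 1.0 / (m * (Q.T @ delta))
        T = delta * Q * sigma.T
    return float(np.sum(T * C))

# Toy tensors standing in for one training pair (sizes are illustrative).
rng = np.random.default_rng(0)
d, L_src, L_tgt, V = 16, 7, 5, 50
S_s = rng.normal(size=(L_src, d))     # source sequence embeddings
S_r = rng.normal(size=(L_tgt, d))     # reference (target) sequence embeddings
S_g = rng.normal(size=(L_tgt, d))     # generated sequence embeddings (soft-argmax @ E)
log_probs = np.log(rng.dirichlet(np.ones(V), size=L_tgt))   # decoder log-probabilities
targets = rng.integers(0, V, size=L_tgt)

L_mle = -log_probs[np.arange(L_tgt), targets].mean()        # word-level cross-entropy
L_seq = ipot_distance(S_g, S_r)                             # output vs. reference
L_copy = ipot_distance(S_g, S_s)                            # output vs. source
gamma1, gamma2 = 0.1, 0.1                                   # assumed values; tune per task
total_loss = L_mle + gamma1 * L_copy + gamma2 * L_seq
```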
3 INTERPRETATION AS APPROXIMATE WASSERSTEIN GRADIENT FLOWS

To further justify the use of our approach (minimizing the loss $\mathcal{L}_{\mathrm{MLE}} + \gamma \mathcal{L}_{\mathrm{ot}}$, where $\mathcal{L}_{\mathrm{ot}}$ denotes the Wasserstein loss), we now explain how our model approximately learns to match the ground-truth sequence distribution. Our derivation is based on the theory of Wasserstein gradient flows (WGF) (Villani, 2008). In WGF, the Wasserstein distance describes the local geometry of a trajectory in the space of probability measures converging to a target distribution (Ambrosio et al., 2005). In the following, we show that the proposed method learns to approximately match the data distribution, from the perspective of WGF. For simplicity we only discuss the continuous case, while a similar argument also holds for the discrete case (Li & Montufar, 2018).

We denote the distribution induced by the sequences generated from the decoder at the $l$-th iteration as $\mu_l$. Assume the sequence data distribution is given by $p_d(x)$. Intuitively, the optimal generator in a Seq2Seq model learns a distribution $\mu^*(x)$ that matches $p_d(x)$. Based on Craig (2014), this can be achieved by composing a sequence of discretized WGFs given by
$$ \mu_l = J_h(\mu_{l-1}) = J_h(J_h(\cdots(\mu_0))) , \qquad (5) $$
with $J_h(\cdot)$ defined as
$$ J_h(\mu) = \arg\min_{\nu \in \mathcal{P}_s} \left\{ \tfrac{1}{2h} W_2^2(\mu, \nu) + D_{\mathrm{KL}}(\nu \| p_d) \right\} = \arg\min_{\nu \in \mathcal{P}_s} \{\mathcal{L}_{\mathrm{WGF}}(\mu, \nu)\} , \qquad (6) $$
where $\lambda = 1/(2h)$ is a regularization parameter ($h$ is the generalized learning rate); $W_2^2(\mu, \nu)$ denotes the 2-Wasserstein distance between $\mu$ and $\nu$; $\mathcal{P}_s$ is the space of distributions with finite second-order moments; and $D_{\mathrm{KL}}(\nu \| p_d) = \mathbb{E}_{x \sim \nu}[\log \nu(x) - \log p_d(x)]$ is the Kullback-Leibler (KL) divergence. It is not difficult to see that the discretized WGF is essentially optimizing the KL divergence with a proximal descent scheme, using the 2-Wasserstein distance as the proximity metric. We denote $\mu^*_h = \lim_{l \to \infty} \mu_l$ with generalized learning rate $h$.

It is well known that $\lim_{h \to 0} \mu^*_h = p_d$ (Chen et al., 2018a), that is to say, the induced model distribution $\mu_l$ asymptotically converges to the data distribution $p_d$. In our case, instead of using $\mathcal{L}_{\mathrm{WGF}}(\mu, \nu)$ as the loss function, we define a surrogate loss using its upper bound $\mathcal{L}_{\mathrm{WGF}}(\mu, \nu) \le \mathcal{L}_{\mathrm{WGF}}(p_d, \nu)$, where the inequality holds because (6) converges to $p_d$. When our model distribution $\mu$ is parameterized by $\theta$, $\mu_l$ can be solved for with stochastic updates on $\theta$ based on the following equation with stepsize $\eta$:
$$ \theta_l \leftarrow \theta_{l-1} + \eta \nabla_\theta \mathcal{L}_{\mathrm{WGF}}(p_d, \mu_{l-1}) = \theta_{l-1} + \eta \left\{ \nabla_\theta D_{\mathrm{KL}}(\mu_{l-1} \| p_d) + \tfrac{1}{2h} \nabla_\theta W_2^2(p_d, \mu_{l-1}) \right\} . \qquad (7) $$
Unfortunately, (7) is an infeasible update, as we do not know $p_d$. However, we argue that this update is still locally valid when the current model approximation $\mu_{l-1}$ is close to $p_d$. To see this, recall that the KL divergence is a natural Riemannian metric on the space of probability measures (Amari, 1985), and is therefore locally symmetric. So we can safely replace the $D_{\mathrm{KL}}(\mu \| p_d)$ term with $D_{\mathrm{KL}}(p_d \| \mu)$ when $\mu$ is close to $p_d$. This recovers the loss function $\mathcal{L}_{\mathrm{MLE}} + \gamma \mathcal{L}_{\mathrm{seq}}$ derived in Section 2.2, as $D_{\mathrm{KL}}(p_d \| \mu) = \mathcal{L}_{\mathrm{MLE}} + H(p_d)$, where $H(p_d)$ is the entropy of $p_d$, independent of $\mu$, and $\mathcal{L}_{\mathrm{seq}} = W_2^2(p_d, \mu)$. This justifies the use of our proposed scheme in a model-refinement stage, where the model distribution $\mu$ is sufficiently close to $p_d$. Empirically, we have observed that our scheme also improves training even when $\mu$ is distant from $p_d$. While the above justification is developed based on Euclidean transport, other non-Euclidean costs such as the cosine distance usually yield better empirical performance, as they are better adjusted to the geometry of sequence data.

4 RELATED WORK AND DISCUSSION

Optimal transport in NLP. Although widely used in other fields such as computer vision (Rubner et al., 2000), OT has only recently been applied in NLP. Pioneered by the work of Kusner et al. (2015) on the word mover's distance (WMD), existing literature primarily considers OT either on a macroscopic level, such as topic modeling (Huang et al., 2016), or on a microscopic level, such as word embedding (Xu et al., 2018). Euclidean distance, instead of other more general distances, is often used as the transportation cost, in order to approximate the OT distance with the Kantorovich-Rubinstein duality (Gulrajani et al., 2017) or a more efficient yet less accurate lower bound (Kusner et al., 2015). Our work employs OT for mesoscopic sequence-to-sequence models, presenting an efficient IPOT-based implementation to enable end-to-end learning for general cost functions. The proposed OT loss not only refines the word embedding matrix but also improves the Seq2Seq model (see Appendix H for details).

RL for sequence generation. A commonly employed strategy for sequence-level training is via reinforcement learning (RL). Typically, this type of method employs RL by considering the evaluation metrics as the reward to guide generation (Bahdanau et al., 2017; Huang et al., 2018; Ranzato et al., 2016; Rennie et al., 2017; Zhang et al., 2018a). However, these approaches often introduce procedures that may yield large-variance gradients, resulting in unstable training. Moreover, it has been recognized that these automatic metrics may correlate poorly with human judgments in many scenarios (Wang et al., 2018b). As such, reinforcing the evaluation metrics can potentially boost the quantitative scores but not necessarily improve generation quality, as such metrics usually reward exact overlap of text snippets rather than semantic similarity.
Some nonstandard metrics like SPICE (Anderson et al., 2016) also consider semantic similarity; however, they also cannot learn a good model on their own (Liu et al., 2017). Unlike RL methods, our method requires no human-defined rewards, thus preventing the model from over-fitting to one specific metric. As a concrete example, the two semantically similar sentences "do you want to have lunch with us" and "would you like to join us for lunch" would be considered a bad match by automatic metrics like BLEU, but would be rated a reasonable match under the OT objective.

GAN for sequence generation. Another type of method adopts the framework of generative adversarial networks (GANs) (Goodfellow et al., 2014), providing sequence-level guidance through a learned discriminator (or critic). To construct such a loss, Fedus et al. (2018); Guo et al. (2018); Lin et al. (2017); Yu et al. (2017) combine the policy-gradient algorithm with the original GAN training procedure, while Chen et al. (2018b) and Zhang et al. (2017) use a so-called feature-mover distance and maximum mean discrepancy (MMD) to match features of real and generated sentences, respectively. However, mode-collapse and gradient-vanishing problems make the training of these methods challenging. Unlike GAN methods, since no min-max games are involved, the training of our model is more robust. Moreover, compared with GANs, no additional critic is introduced in our model, which keeps the model complexity comparable to MLE and makes it less demanding to tune.

5 EXPERIMENTS

We consider a wide range of NLP tasks to experimentally validate the proposed model, and benchmark it against other strong baselines. All experiments are implemented with TensorFlow and run on a single NVIDIA TITAN X GPU. Code for our experiments is available from https://github.com/LiqunChen0606/Seq2Seq-OT.

5.1 NEURAL MACHINE TRANSLATION

We test our model on two datasets: (i) a small-scale English-Vietnamese parallel corpus of TED talks, which has 133K sentence pairs, from the IWSLT Evaluation Campaign (Cettolo et al., 2015); and (ii) a large-scale English-German parallel corpus with 4.5M sentence pairs, from the WMT Evaluation Campaign (Vaswani et al., 2017). We used Google's Neural Machine Translation (GNMT) model (Wu et al., 2016) as our baseline, following the architecture and hyper-parameter settings from the GNMT repository² to make a fair comparison. For the English-Vietnamese (i.e., VI-EN and EN-VI) tasks, a 2-layer LSTM with 512 units in each layer is adopted as the decoder, with a 1-layer bidirectional LSTM adopted as the encoder; the word embedding dimension is set to 512. The attention proposed in Luong et al. (2015a) is used, together with a dropout rate of 0.2. For the English-German (i.e., DE-EN and EN-DE) tasks, we train a 4-layer LSTM decoder with 1024 units in each layer. A 2-layer bidirectional LSTM is used as the encoder, and we adopt the attention used in Wu et al. (2016). The word embedding dimension is set to 1024.
Standard stochastic gradient descent is used for training with a decreasing learning rate, and we set β = 0.5 for the IPOT algorithm. More training details are provided in Appendix A. In terms of wall-clock time, our model only slightly increases training time. For the German-English task, it took roughly 5.5 days to train the GNMT model and 6 days to train our proposed model from scratch, which amounts to only a roughly 10% increase.

We apply different combinations of Lcopy and Lseq to fine-tune the pre-trained GNMT model (Luong et al., 2018), and the results are summarized in Tables 1 and 2. Additional results for training from scratch are provided in Appendix B. The proposed OT approach consistently improves upon MLE training in all experimental setups.

Table 1: BLEU scores on VI-EN and EN-VI.
Systems                        NT2012   NT2013
VI-EN: GNMT                    20.7     23.8
VI-EN: GNMT+Lseq               21.9     25.4
VI-EN: GNMT+Lseq+Lcopy         21.9     25.5
EN-VI: GNMT                    23.8     26.1
EN-VI: GNMT+Lseq               24.4     26.5
EN-VI: GNMT+Lseq+Lcopy         24.5     26.9

Table 2: BLEU scores on DE-EN and EN-DE.
Systems                        NT2013   NT2015
DE-EN: GNMT                    29.0     29.9
DE-EN: GNMT+Lseq               29.1     29.9
DE-EN: GNMT+Lseq+Lcopy         29.2     30.1
EN-DE: GNMT                    24.3     26.5
EN-DE: GNMT+Lseq               24.3     26.6
EN-DE: GNMT+Lseq+Lcopy         24.6     26.8

We additionally tested our model with a more expressive 8-layer LSTM on the EN-DE task. The BLEU score of our method is 28.0 on NT2015. For reference, the GNMT model (same architecture) and a Transformer model (Vaswani et al., 2017) report scores of 27.6 and 27.3, respectively. Our method outperforms both baselines, and it is also competitive with the state-of-the-art BLEU score of 28.4 reported by Vaswani et al. (2017) using a highly sophisticated model design.

German-to-English translation examples are provided in Table 3 for qualitative assessment. The main differences among the reference translation, our translation and the GNMT translation are highlighted in blue and red. Our OT-augmented translations are more faithful to the reference than their MLE-trained counterparts. The soft-copying mechanism introduced by OT successfully maintains the key semantic content of the reference. Presumably, the OT loss helps refine the word embedding matrices and promotes matching between words with similar semantic meanings. Vanilla GNMT translations, on the other hand, ignore or misinterpret some of the key terms. More examples are provided in Appendix E.

² https://github.com/tensorflow/nmt

Table 3: Comparison of German-to-English translation examples. Matched key phrases are shown in the same color. First example: May is not the date when the new prime minister visited Japan, but actually the time he won the election. Second example: GNMT's paraphrase choices are not as accurate as ours. Third example: the nominating committee is controlled by the government, not a "UN-controlled nomination committee" as in GNMT's result, and GNMT also fails to capture the word "retain".

Reference: India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.
GNMT: India's new prime minister , Narendra Modi , is meeting his Japanese counterpart , Shinzo Abe , in Tokyo at his first important visit abroad in May to discuss economic and security relations .
Ours: India's new Prime Minister Narendra Modi meets his Japanese counterpart , Shinzo Abe , in Tokyo at his first major foreign visit since his election in May in order to discuss economic and security relations .

Reference: The next day, turning up for work as usual, she was knocked down by a motorcyclist who had mounted the pavement in what passers-by described as a vicious rage.
GNMT: The next day , when she went to work as usual , she was driven by a motorcyclist who , as passants described , went on foot in a kind of brutal anger .
Ours: The next day , when she went to work as usual , she was crossed by a motorcyclist who , was described by passers-by , in a sort of brutal rage on the road .

Reference: Chinese leaders presented the Sunday ruling as a democratic breakthrough because it gives Hong Kongers a direct vote, but the decision also makes clear that Chinese leaders would retain a firm hold on the process through a nominating committee tightly controlled by Beijing.
GNMT: The Chinese leadership presented Sunday's decision as a democratic breakthrough , because Hong Kong's citizens have a direct right to vote , but the decision also makes it clear that the Chinese leadership is firmly in control of the process through a UN-controlled nomination committee .
Ours: The Chinese leadership presented Sunday's decision as a democratic breakthrough because it gives the citizens of Hong Kong a direct right to vote , but the decision also makes it clear that the Chinese leadership keeps the process firmly in the hands of a government-controlled Nomination Committee .
We also test the robustness of our method wrt the hyper-parameter γ. Results are summarized in Appendix C. Our OT-augmented model is robust to the choice of γ: the test BLEU scores are consistently higher than the baseline for γ ∈ (0, 1].

5.2 ABSTRACTIVE TEXT SUMMARIZATION

We consider two datasets for abstractive text summarization. The first is the Gigaword corpus (Graff et al., 2003), which has around 3.8M training samples, 190K validation samples, and 1951 test samples. The input pairs consist of the first sentence and the headline of an article. We also evaluate our model on the DUC-2004 test set (Over et al., 2007), which consists of 500 news articles. Our implementation of the Seq2Seq model adopts a simple architecture, consisting of a bidirectional GRU encoder and a GRU decoder with an attention mechanism (Bahdanau et al., 2015)³. Results are summarized in Tables 4 and 5. Our OT-regularized model outperforms the respective baselines.

Table 4: ROUGE scores on Gigaword.
Systems                   RG-1   RG-2   RG-L
Seq2Seq                   33.4   15.7   32.4
Seq2Seq+Lseq              35.8   17.5   33.7
Seq2Seq+Lseq+Lcopy        36.2   18.1   34.0

Table 5: ROUGE scores on DUC2004.
Systems                   RG-1   RG-2   RG-L
Seq2Seq                   28.0   9.4    24.8
Seq2Seq+Lseq              29.5   9.8    25.5
Seq2Seq+Lseq+Lcopy        30.1   10.1   26.0

The state-of-the-art ROUGE result for the Gigaword dataset is 36.92, reported by Wang et al. (2018a); however, a much more complex architecture is used to achieve that score. We use a relatively simple Seq2Seq model in our experiments to demonstrate the versatility of the proposed OT method. Applying it to (i) more complicated models and (ii) more recent datasets such as CNN/Daily Mail (See et al., 2017) will be interesting future work. Summarization examples are provided in Appendix D. Similar to the machine translation task, our proposed method captures the key semantic information in both the source and reference sentences.

³ https://github.com/thunlp/TensorFlow-Summarization

5.3 IMAGE CAPTIONING

We also consider an image captioning task using the COCO dataset (Lin et al., 2014), which contains 123,287 images in total, each annotated with at least 5 captions. Following Karpathy's split (Karpathy & Fei-Fei, 2015), 113,287 images are used for training and 5,000 images are used for validation and testing.
We follow the implementation of Show, Attend and Tell (Xu et al., 2015)⁴, and use ResNet-152 (He et al., 2016), image tagging (Gan et al., 2017), and Fast RCNN (Anderson et al., 2018) as the image feature extractors (encoders), and a one-layer LSTM with 1024 units as the decoder. The word embedding dimension is set to 512. Note that in this task the inputs are images instead of sequences, so Lcopy cannot be applied. We report BLEU-k (k from 1 to 4) (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee & Lavie, 2005) scores, and the results under different settings are shown in Table 6.

⁴ https://github.com/DeepRNN/image_captioning

Table 6: Results for image captioning on the COCO dataset.
Method                                        BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr
Soft Attention (Xu et al., 2015)              70.7    49.2    34.4    24.3    23.9    -
Hard Attention (Xu et al., 2015)              71.8    50.4    35.7    25.0    23.0    -
Show & Tell (Vinyals et al., 2015)            -       -       -       27.7    23.7    85.5
ATT-FCN (You et al., 2016)                    70.9    53.7    40.2    30.4    24.3    -
SCN-LSTM (Gan et al., 2017)                   72.8    56.6    43.3    33.0    25.7    101.2
Adaptive Attention (Lu et al., 2017)          74.2    58.0    43.9    33.2    26.6    108.5
Top-Down Attention (Anderson et al., 2018)    77.2    -       -       36.2    27.0    113.5
No attention, Resnet-152
Show & Tell                                   70.3    53.7    39.9    29.5    23.6    87.1
Show & Tell+Lseq (Ours)                       70.9    54.2    40.4    30.1    23.9    90.0
No attention, Tag
Show & Tell                                   72.1    55.2    41.3    30.1    24.5    93.4
Show & Tell+Lseq (Ours)                       72.3    55.4    41.5    31.0    24.6    94.7
Soft attention, Fast RCNN
Show, Attend & Tell                           74.0    58.0    44.0    33.1    25.2    99.1
Show, Attend & Tell+Lseq (Ours)               74.5    58.4    44.5    33.8    25.6    102.9

Consistent across-the-board improvements are observed with the introduction of the OT loss, in contrast to RL-based methods, where drastic improvements can only be observed for the optimized evaluation metric (Rennie et al., 2017). Consequently, the OT loss is a more reliable way to improve the quality of generated captions, compared with RL methods that aim to optimize, and therefore potentially overfit, one specific metric. Examples of generated captions are provided in Appendix F.

6 CONCLUSION

This work is motivated by a major deficiency in training Seq2Seq models: the MLE training loss does not operate at the sequence level. Inspired by soft bipartite matching, we propose the use of optimal transport as a sequence-level loss to improve Seq2Seq learning. By applying this new method to machine translation, text summarization, and image captioning, we demonstrate that our proposed model helps improve performance compared with strong baselines. We believe the proposed method is a general framework that will be useful for other sequence generation tasks as well, such as conversational response generation (Li et al., 2017; Zhang et al., 2018c), which is left as future work.

REFERENCES

Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science & Business Media, 1985.
Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. 2005.
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382-398. Springer, 2016.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.
Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, 2005.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.
Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation, 2015.
Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. In UAI, 2018a.
Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. Adversarial text generation via feature-mover's distance. In NeurIPS, 2018b.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
Sumit Chopra, Michael Auli, and Alexander M Rush. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL, 2016.
Katy Craig (ed.). The exponential formula for the Wasserstein metric. PhD thesis, The State University of New Jersey, 2014.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
William Fedus, Ian Goodfellow, and Andrew M Dai. MaskGAN: Better text generation via filling in the ______. In ICLR, 2018.
Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In AISTATS, 2018.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT Press, Cambridge, 2016.
David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English Gigaword. Linguistic Data Consortium, Philadelphia, 2003.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mechanism in sequence-to-sequence learning. In ACL, 2016.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. In AAAI, 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. Supervised word mover's distance. In NIPS, 2016.
Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. Hierarchically structured reinforcement learning for topically coherent visual story generation. arXiv preprint arXiv:1805.08191, 2018.
Ferenc Huszár. How (not) to train your generative model: scheduled sampling, likelihood, adversary? arXiv:1511.05101, 2015.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv:1611.01144, 2016.
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015.
Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS, 2016.
Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In EMNLP, 2017.
W. Li and G. Montufar. Natural gradient via optimal transport. arXiv preprint arXiv:1803.07033, 2018.
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In NIPS, 2017.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein's identity. In ICLR, 2018.
Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In ICCV, 2017.
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. arXiv:1805.11897, 2018.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv:1508.04025, 2015a.
Thang Luong, Hieu Pham, and Christopher D Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015b.
Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial, 2018. URL https://github.com/tensorflow/nmt.
Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. ICLR, 2017.
Paul Over, Hoa Dang, and Donna Harman. DUC in context. Information Processing & Management, 2007.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Technical report, 2017.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover's distance as a metric for image retrieval. IJCV, 2000.
Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.
Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: summarization with pointer-generator networks. ACL, 2017.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2008.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization. IJCAI, 2018a.
Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL, 2018b.
Sam Wiseman and Alexander M Rush. Sequence-to-sequence learning as beam-search optimization. In EMNLP, 2016.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for Wasserstein distance. arXiv:1802.04307, 2018.
Hongteng Xu, Wenlin Wang, Wei Liu, and Lawrence Carin. Distilled Wasserstein learning for word embedding and topic modeling. In NeurIPS, 2018.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, 2016.
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Liqun Chen, Dinghan Shen, Guoyin Wang, and Lawrence Carin. Sequence generation with guider network. arXiv preprint arXiv:1811.00696, 2018a.
Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as Wasserstein gradient flows. In ICML, 2018b.
Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.
Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. Generating informative and diverse conversational responses via adversarial information maximization. In NeurIPS, 2018c.

A TRAINING DETAILS FOR NMT TASK

The following training details basically follow the instructions from the tensorflow/nmt GitHub repository.

EN-VI: 2-layer LSTMs of 512 units with a bidirectional encoder (i.e., 1 bidirectional layer for the encoder); the embedding dimension is 512. Luong attention (Luong et al., 2015a) (scale=True) is used together with dropout keep-prob=0.8. All parameters are uniformly initialized. We use SGD with learning rate 1.0 as follows: train for 12K steps (around 12 epochs); after 8K steps, we start halving the learning rate every 1K steps.

DE-EN: The training hyperparameters are similar to the EN-VI experiments, except for the following details. The data is split into subword units using BPE (32K operations). We train 4-layer LSTMs of 1024 units with a bidirectional encoder (i.e., 2 bidirectional layers for the encoder); the embedding dimension is 1024. We train for 350K steps (around 10 epochs); after 170K steps, we start halving the learning rate every 17K steps.
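The two schedules above can be captured in a small configuration sketch. The dictionary layout and field names below are our own convention (the actual tensorflow/nmt tool is driven by command-line flags); only the numeric values are transcribed from the text.

```python
# Hyper-parameters transcribed from Appendix A; the dict structure itself is illustrative.
NMT_CONFIGS = {
    "en-vi": {
        "encoder": {"type": "bi_lstm", "layers": 1, "units": 512},
        "decoder": {"type": "lstm", "layers": 2, "units": 512},
        "embedding_dim": 512,
        "attention": "luong_scaled",
        "dropout_keep_prob": 0.8,
        "optimizer": "sgd", "learning_rate": 1.0,
        "train_steps": 12_000, "start_decay_step": 8_000,
        "decay_every": 1_000, "decay_factor": 0.5,   # halve the learning rate
    },
    "de-en": {
        "subword": "bpe_32k",
        "encoder": {"type": "bi_lstm", "layers": 2, "units": 1024},
        "decoder": {"type": "lstm", "layers": 4, "units": 1024},
        "embedding_dim": 1024,
        "optimizer": "sgd", "learning_rate": 1.0,
        "train_steps": 350_000, "start_decay_step": 170_000,
        "decay_every": 17_000, "decay_factor": 0.5,
        "ipot_beta": 0.5,                            # beta used for IPOT (Section 5.1)
    },
}
```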
B END-TO-END NEURAL MACHINE TRANSLATION

Table 7: BLEU scores on VI-EN and EN-VI.
Systems                    NT2012   NT2013
VI-EN: GNMT                20.7     23.8
VI-EN: GNMT+OT (Ours)      21.9     25.5
EN-VI: GNMT                23.0     25.4
EN-VI: GNMT+OT (Ours)      24.1     26.5

Table 8: BLEU scores on DE-EN and EN-DE.
Systems                    NT2013   NT2015
DE-EN: GNMT                28.5     29.0
DE-EN: GNMT+OT (Ours)      28.8     29.5
EN-DE: GNMT                23.7     25.3
EN-DE: GNMT+OT (Ours)      24.1     26.2

Tables 7 and 8 show the quantitative comparison for training from random initialization.

C BLEU SCORE FOR DIFFERENT HYPER-PARAMETERS

We tested different γ for the OT loss term and summarized the results in Figure 3. γ = 0.1 gave the best performance for the EN-VI experiment. The results are robust wrt the choice of γ ≤ 1.0.

Figure 3: Performance of EN-VI translation for different γ (BLEU on newstest2012 and newstest2013).

D SUMMARIZATION EXAMPLES

Examples for abstractive summarization are provided in Table 9.

Table 9: Examples on text summarization.

Source: japan s nec corp. and UNK computer corp. of the united states said wednesday they had agreed to join forces in supercomputer sales .
Reference: nec UNK in computer sales tie-up
Baseline: nec UNK computer corp.
Ours: nec UNK computer corp sales supercomputer.

Source: five east timorese youths who scaled the french embassy s fence here thursday , left the embassy on their way to portugal friday .
Reference: UNK latest east timorese asylum seekers leave for portugal
Baseline: five east timorese youths leave embassy
Ours: five east timorese seekers leave embassy for portugal

Source: the us space shuttle atlantis separated from the orbiting russian mir space station early saturday , after three days of test runs for life in a future space facility , nasa announced .
Reference: atlantis mir part ways after three-day space collaboration by emmanuel UNK
Baseline: atlantis separate from mir
Ours: atlantis separate from mir space by UNK

Source: australia s news corp announced monday it was joining brazil s globo , mexico s grupo televisa and the us tele-communications inc. in a venture to broadcast ### channels via satellite to latin america .
Reference: news corp globo televisa and tele-communications in satellite venture
Baseline: australia s news corp joins brazil
Ours: australia s news corp joins brazil in satellite venture

E NEURAL MACHINE TRANSLATION EXAMPLES

In Table 10, we show more examples for comparison. From these examples, sentences generated from our model are more faithful to the reference sentences.

F IMAGE CAPTION EXAMPLES

Figure 4 shows a comparison of our model with the baselines.

Figure 4: Examples of image captioning on MS COCO.
Res152: a group of people standing in the snow.
Res152-OT: a group of people standing on a snowy slope.
Fast RCNN: a group of people on skis in the snow.
Fast RCNN-OT: a group of people standing on top of a snow covered slope.
Res152: a table with a laptop and a laptop.
Res152-OT: a table with a laptop.
Fast RCNN: a laptop computer sitting on top of a desk.
Fast RCNN-OT: a laptop computer sitting on top of a wooden table.
Res152: a man wearing a suit and tie.
Res152-OT: man in a suit and tie in a suit.
Fast RCNN: a man in a suit and tie and tie in a suit.
Fast RCNN-OT: a man in a suit and tie standing in front of a building.
Res152: a young boy holding a baseball bat.
Res152-OT: a woman holding a baseball bat in his hand.
Fast RCNN: a young girl wearing a hat and a hat.
Fast RCNN-OT: a little girl holding a baseball bat in her hand.

G ENCODING MODEL BELIEF WITH SOFTMAX AND GUMBEL-SOFTMAX

Table 11: BLEU scores on VI-EN and EN-VI using different choices of model belief.
Systems                       NT2012   NT2013
VI-EN: GNMT                   20.7     23.8
VI-EN: GNMT+OT (GS)           20.9     24.5
VI-EN: GNMT+OT (softmax)      21.8     24.3
VI-EN: GNMT+OT (ours)         21.9     25.5
EN-VI: GNMT                   23.0     25.4
EN-VI: GNMT+OT (GS)           23.3     25.7
EN-VI: GNMT+OT (softmax)      23.5     26.0
EN-VI: GNMT+OT (ours)         24.1     26.5

To find the best differentiable sequence-generating mechanism, we also experimented with Softmax and Gumbel-softmax (a.k.a. the Concrete distribution). Detailed results are summarized in Table 11. We can see that the Softmax- and Gumbel-softmax-based OT models provide less significant gains in BLEU score compared with the baseline MLE model; in some situations, the performance even degenerates. We hypothesize that this is because Softmax encodes more ambiguity, and Gumbel-softmax has a larger variance due to the extra random variable involved, which in turn hurts learning. A more involved variance-reduction scheme might offset such negative impacts for Gumbel-softmax, which is left as future work.

H OT IMPROVES BOTH MODEL AND WORD EMBEDDING MATRIX

Table 12: Comparison experiment.
Metric   Baseline   Case 1   Case 2
BLEU-2   71.87      73.60    75.12
BLEU-3   61.18      63.07    64.82
BLEU-4   56.59      58.48    60.27
BLEU-5   53.73      55.69    57.50

To identify the source of the performance gains, we designed a toy sequence-to-sequence experiment to show that OT helps to refine both the language model and the word embedding matrix. We use the English corpus from the WMT dataset (from our machine translation task) and train an auto-encoder (Seq2Seq model) on this dataset. We evaluate the reconstruction quality with the BLEU score. In Case 1, we stop the OT gradient from flowing back to the sequence model (only affecting the word embedding matrix), while in Case 2 the gradient from OT can affect the entire model. Detailed results are shown in Table 12. We can see that Case 1 is better than the baseline model, which means OT helps to refine the word embedding matrix.
Case 2 achieves the highest BLEU, which implies OT also helps to improve the language model.

Table 10: More DE-EN translation examples.

Reference: When former First Lady Eleanor Roosevelt chaired the International Commission on Human Rights, which drafted the Universal Declaration of Human Rights that would in 1948 be adopted by the United Nations as a global covenant, Roosevelt and the drafters included a guarantee that everyone has the right to form and to join trade unions for the protection of his interests.
Ours: When former First Lady Eleanor Roosevelt held the presidency of the International Commission on Human Rights , drafted by the Universal Declaration of Human Rights , adopted in 1948 by the United Nations as a global agreement , Roosevelt and the other authors gave a guarantee that everyone has the right to form or join trade unions to protect their interests .
GNMT: When former First Lady Eleanor Roosevelt presided over the International Human Rights Committee , which drew up the Universal Declaration of Human Rights , as adopted by the United Nations in 1948 as a global agreement , Roosevelt and the other authors added a guarantee that everyone has the right to form trade unions to protect their interests or to accede to them .

Reference: India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.
Ours: India's new Prime Minister Narendra Modi meets his Japanese counterpart , Shinzo Abe , in Tokyo at his first major foreign visit since his election in May in order to discuss economic and security relations .
GNMT: India's new prime minister , Narendra Modi , is meeting his Japanese counterpart , Shinzo Abe , in Tokyo at his first important visit abroad in May to discuss economic and security relations .

Reference: The police used tear gas .
Ours: The police used tear gas .
GNMT: The police put in tear gas .

Reference: There were three people killed.
Ours: Three people were killed .
GNMT: Three people had been killed .

Reference: The next day, turning up for work as usual, she was knocked down by a motorcyclist who had mounted the pavement in what passers-by described as a vicious rage.
Ours: The next day , when she went to work as usual , she was crossed by a motorcyclist who , as described by passers-by , was in a sort of brutal rage on the road .
GNMT: The next day , when she went to work as usual , she was driven by a motorcyclist who , as passants described , went on foot in a kind of brutal anger .

Reference: Double-check your gear.
Ours: Check your equipment twice .
GNMT: Control your equipment twice .

Reference: Chinese leaders presented the Sunday ruling as a democratic breakthrough because it gives Hong Kongers a direct vote, but the decision also makes clear that Chinese leaders would retain a firm hold on the process through a nominating committee tightly controlled by Beijing.
Ours: The Chinese leadership presented Sunday's decision as a democratic breakthrough because it gives the citizens of Hong Kong a direct right to vote , but the decision also makes it clear that the Chinese leadership keeps the process firmly in the hands of a government-controlled Nomination Committee .
GNMT: The Chinese leadership presented Sunday's decision as a democratic breakthrough , because Hong Kong's citizens have a direct right to vote , but the decision also makes it clear that the Chinese leadership is firmly in control of the process through a UN-controlled nomination committee .

Reference: Her mother arrived at Mount Sinai Hospital Thursday after an emergency call that she was in cardiac arrest at an Upper East Side clinic, Yorkville Endoscopy, sources said.
Ours: According to sources , her mother was sent to Mount Sinai on Thursday after an emergency due to heart failure in a clinic at the Upper East Side , Yorkville Endoscopy .
GNMT: According to sources , her mother was sent to Mount Sinai hospital on Thursday after an emergency due to heart closure in a clinic at the Upper East Side , Yorkville Endoscopy .

Reference: Ukrainian soldiers had to withdraw from their positions in Ilovaysk after two columns of Russian armor and 1,000 troops last week moved into the Donetsk region to bolster the beleaguered separatists, Col. Andriy Lysenko, spokesman for the Ukrainian National Security and Defense Council, told reporters in Kiev on Saturday.
Ours: Ukrainian soldiers had to withdraw from their positions in Ilowajsk after two Russian tanks and 1,000 soldiers entered the Donetsk region last week to support the belated separatists , said Colonel Andrij Lysenko , Speaker of the Ukrainian National Security and Defence Council , to reporters on Saturday in Kiev .
GNMT: Ukrainian soldiers had to withdraw from their positions in Ilowajsk after two Russian tanks and 1,000 soldiers invaded the Donetsk region last week to support the beloved separatists , said Colonel Andriy Lysenko , Ukrainian National Security and Defence Council spokesman on Saturday in Kiev .

Reference: Mountain Rescue doctor, Professor Volker Lischke, who was there with his team to provide safety, and who was equipped with a four-wheel Bully and Quad, said: "I know him from Frankfurt - he trains for a specialist sleigh trail - it's just that he pulls the sleigh himself. The man is, therefore, in a sense his own sleigh dog."
Ours: Bergwacht doctor Professor Volker Lischke , who , with his team , endowed with Allrad-Bully and Quad for safety , said : "The kenn I from Frankfurt , trained for a special sleigh trail , just that he trains the sleigh himself , so the man is , in a sense , his own sled dog ."
GNMT: Bergwacht physician Professor Volker Lischke , who with his team , equipped with Allrad-Bully and Quad , for safety , said : "Den kenn ich aus Frankfurt , which trained for a special Schlittentrail , only to stop the sleigh itself , so the man is in some way his own hook dog ."