# Neural Machine Translation with Gumbel-Greedy Decoding

Jiatao Gu, Daniel Jiwoong Im, Victor O.K. Li

The University of Hong Kong, {jiataogu, wangyong, vli}@eee.hku.hk
AIFounded Inc., daniel.im@aifounded.com

Previous neural machine translation models rely on heuristic search algorithms (e.g., beam search) to avoid solving the maximum-a-posteriori problem over translation sentences at test time. In this paper, we propose Gumbel-Greedy Decoding, which trains a generative network to predict translations directly under a trained model. We solve this problem using the Gumbel-Softmax reparameterization, which makes our generative network differentiable and trainable through standard stochastic gradient methods. We empirically demonstrate that our proposed model is effective for generating sequences of discrete words.

## Introduction

Neural machine translation (NMT) (Cho et al. 2014; Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014) has recently become the method of choice in machine translation research and has been shown empirically to outperform traditional translation systems. The basic approach is to model the conditional probability of the translation, where we train the model either to maximize the log-likelihood of the ground-truth translation (teacher forcing) or to maximize the expected reward of sampled translations (REINFORCE).

Despite these advances, a key problem remains with such sequential modeling approaches: once the model is trained, the most probable output, i.e., the one that maximizes the log-likelihood, cannot be found exactly at test time. This is because it requires solving the maximum-a-posteriori (MAP) problem over all possible output sequences. To avoid this problem, heuristic search algorithms (e.g., greedy decoding, beam search) are used to approximate the optimal translation.

In this paper, we address this issue with a discriminator-generator framework: we train the discriminator and the generator at training time, and emit translations with the generator alone at test time. Instead of relying on a non-optimal search algorithm at test time, such as greedy search, we propose to train the generator to predict the result of the search directly. Such an approach would typically suffer from the non-differentiability of generating discrete words. Here, we address this problem by turning the discrete output node into a differentiable node using the Gumbel-Softmax reparameterization (Jang, Gu, and Poole 2016). Throughout the paper, we refer to this new process of generating sequences of words as Gumbel-Greedy Decoding (GGD).

We extensively evaluate the proposed GGD on large parallel corpora with different variants of generators and discriminators. The empirical results demonstrate that GGD improves translation quality.

## Neural Machine Translation

Neural machine translation (NMT) models commonly share the auto-regressive property, as it is the natural way to model sequential data. More formally, we can define the distribution over a translation sentence $Y = [y_1, \dots, y_T]$ given a source sentence $X = [x_1, \dots, x_{T_s}]$ as a conditional language model:

$$p(Y \mid X) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, X)$$
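To make this factorization and the heuristic decoding it motivates concrete, the following sketch scores a candidate translation under the conditional language model and performs greedy decoding. It is a minimal illustration, not the paper's implementation: `step_fn` is a hypothetical stand-in for a trained decoder that returns next-word log-probabilities given a partial translation (conditioning on the source sentence $X$ is assumed to be handled inside it).

```python
import numpy as np

def sequence_log_prob(step_fn, tokens):
    """log p(Y|X) = sum_t log p(y_t | y_<t, X), following the factorization above.

    `step_fn(prefix)` is a hypothetical decoder step: it returns a vector of
    next-word log-probabilities given the partial translation `prefix`.
    """
    total = 0.0
    for t, y_t in enumerate(tokens):
        log_probs = step_fn(tokens[:t])   # log p(. | y_<t, X)
        total += log_probs[y_t]
    return total

def greedy_decode(step_fn, eos_id, max_len=50):
    """Greedy search: pick the locally most probable word at every step.

    This is the heuristic approximation to the MAP problem discussed above;
    it is not guaranteed to find the globally most probable sequence Y.
    """
    prefix = []
    for _ in range(max_len):
        next_token = int(np.argmax(step_fn(prefix)))
        prefix.append(next_token)
        if next_token == eos_id:
            break
    return prefix
```

Beam search generalizes the greedy loop by keeping the $k$ highest-scoring prefixes at each step instead of a single one; both are approximations to the intractable MAP search over all output sequences.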
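The Gumbel-Softmax reparameterization (Jang, Gu, and Poole 2016) referred to in the introduction can be sketched as follows. This is a minimal, framework-agnostic illustration in numpy (temperature value chosen for the example only), showing how a categorical sample over the vocabulary is relaxed into a differentiable, approximately one-hot vector.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=np.random):
    """Draw a differentiable, approximately one-hot sample from a categorical
    distribution parameterized by `logits`.

    Adding Gumbel(0, 1) noise to the logits makes an argmax over the perturbed
    logits an exact categorical sample; the softmax with temperature `tau`
    relaxes that argmax into a differentiable vector (closer to one-hot as
    tau -> 0).
    """
    # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = rng.uniform(low=1e-10, high=1.0, size=np.shape(logits))
    gumbel_noise = -np.log(-np.log(u))
    # Perturb the logits and apply a temperature-controlled softmax
    y = (np.asarray(logits) + gumbel_noise) / tau
    y = y - y.max(axis=-1, keepdims=True)   # for numerical stability
    exp_y = np.exp(y)
    return exp_y / exp_y.sum(axis=-1, keepdims=True)

# Example: a relaxed sample over a toy vocabulary of 5 words
logits = np.array([1.0, 2.0, 0.5, -1.0, 0.0])
print(gumbel_softmax(logits, tau=0.5))
```

Because the output is a continuous vector rather than a hard word index, gradients can flow through the sampling step, which is what allows the generator in GGD to be trained with standard stochastic gradient methods.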