# Divergence-Guided Simultaneous Speech Translation

Xinjie Chen1*, Kai Fan2, Wei Luo2, Linlin Zhang1*, Libo Zhao3*, Xinggao Liu1, Zhongqiang Huang2

1Zhejiang University  2Alibaba DAMO Academy  3South China University of Technology
{xinjiechen, zhanglinlinlin, lxg}@zju.edu.cn, {z.huang, k.fan, w.luo}@alibaba-inc.com, wilbzhao@mail.scut.edu.cn

*Work done when interning at Alibaba DAMO Academy. Corresponding authors.

## Abstract

To achieve high-quality translation with low latency, a Simultaneous Speech Translation (SimulST) system relies on a policy module that decides whether to translate immediately or wait for additional streaming input, along with a translation model capable of effectively handling partial speech input. Prior research has tackled these components separately, either using wait-k policies based on fixed-length segments or detected word boundaries, or dynamic policies based on different strategies (e.g., meaningful units), while employing offline models for prefix-to-prefix translation. In this paper, we propose Divergence-Guided Simultaneous Speech Translation (DiG-SST), a tightly integrated approach focusing on both translation quality and latency for streaming input. Specifically, we introduce a simple yet effective prefix-based strategy for training translation models with partial speech input, and develop an adaptive policy that makes read/write decisions for the translation model based on the expected divergence in translation distributions resulting from future input. Our experiments on multiple translation directions of the MuST-C benchmark demonstrate that our approach achieves a better trade-off between translation quality and latency compared to existing methods.

## Introduction

Simultaneous Speech Translation (SimulST) aims to achieve real-time, high-quality translation from streaming speech input while maintaining low latency. Early efforts have conventionally employed a cascaded approach involving both a streaming Automatic Speech Recognition (ASR) model and a Simultaneous Text Machine Translation (SimulMT) model (Oda et al. 2014; Dalvi et al. 2018). While this approach has its merits, it nevertheless suffers from issues such as error propagation and latency accumulation (Le, Lecouteux, and Besacier 2017; Xue et al. 2020). In response to these challenges, recent progress in speech translation (ST) has been primarily focused on end-to-end approaches, leading to significant improvements in both offline and simultaneous ST tasks (Berard et al. 2016; Weiss et al. 2017; Berard et al. 2018; Bansal et al. 2019; Ren et al. 2020; Liu et al. 2021).

Following the advancements in SimulMT, researchers have investigated both fixed and adaptive read/write policies for SimulST. The absence of explicit linguistic boundaries in continuous speech signals presents a unique challenge. To adapt the wait-k policy (Ma et al. 2019) for speech input, various approaches have been proposed, including segmenting audio streams into fixed-length chunks (Ma, Pino, and Koehn 2020; Ma et al. 2021), or at the subword/word level (Dong et al. 2022; Zhang and Feng 2023).
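For concreteness, the sketch below shows one way the fixed, chunk-based wait-k schedule described above can be realized: the system reads k fixed-length speech chunks before its first write, then alternates one write with one read until the source is exhausted. This is a minimal illustration of the policy family, not the implementation of any cited system; `translate_prefix` (returning the next target token given the partial speech and the current hypothesis) and the chunking itself are hypothetical stand-ins for a prefix-capable translation model and a streaming front end.

```python
from typing import Callable, Iterator, List


def wait_k_chunks(
    speech_chunks: Iterator[bytes],  # streaming audio, delivered as fixed-length chunks
    translate_prefix: Callable[[List[bytes], List[str]], str],  # hypothetical: next token from partial input
    k: int = 3,          # number of chunks to read before the first write
    eos: str = "</s>",
) -> List[str]:
    """Chunk-based wait-k schedule: read k chunks, then alternate write/read."""
    received: List[bytes] = []
    output: List[str] = []
    source_done = False

    while True:
        # READ until len(output) + k chunks have arrived, unless the source is exhausted.
        while not source_done and len(received) < len(output) + k:
            chunk = next(speech_chunks, None)
            if chunk is None:
                source_done = True
            else:
                received.append(chunk)
        # WRITE one target token conditioned on the partial speech seen so far.
        token = translate_prefix(received, output)
        if token == eos:
            break
        output.append(token)
    return output
```

Larger values of k delay the first output but give the model more source context; in the limit of k exceeding the utterance length, the schedule degenerates to offline translation.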
Adaptive policies have also been studied to leverage contextual information when making read/write decisions. Zhang et al. (2022) propose detecting meaningful units in speech that can be translated independently without considering future input, while Papi, Negri, and Turchi (2023) experiment with using attention scores to derive adaptive policies during inference.

In spite of their demonstrated improvements, there remain significant discrepancies with respect to the desirable attributes of the translation model and the policy module in SimulST. First, most prior research has employed offline models trained on complete audio utterances to translate partial speech input, which raises questions about their effectiveness given the apparent gap between training and inference. Second, fixed policies, even when grounded in the detection of subword/word boundaries, disregard available context and cannot make informed read/write decisions. While existing adaptive policies can take the partial input and translation history into account to make dynamic decisions, they are based on heuristics and lack any direct measure of the potential impact of such decisions on translation quality, which is essential for achieving a delicate balance between translation quality and latency.

Recently, Transducer-based approaches (Liu et al. 2021; Tang et al. 2023) have achieved success in addressing the aforementioned issues using synchronized audio inputs and translation outputs, without explicitly modeling read/write decisions. However, these approaches are computationally intensive and require training a distinct model for each latency configuration.

In this paper, we adhere to the conventional approach of separately modeling and improving translation and read/write decisions, and introduce an integrated approach called Divergence-Guided Simultaneous Speech Translation (DiG-SST). Specifically, we propose:

- **Prefix-enhanced Translation**: We include prefix-to-prefix and prefix-to-full ST samples, in addition to the conventional offline training data, during translation model training. This strategy reduces the gap between conventional offline training and simultaneous inference, improving the model's effectiveness in low-latency settings.
- **Divergence-based Policy Module**: We suggest using the divergence between the translation distributions of the next target word, computed based on the partial input versus the complete input using the model from prefix-enhanced training, as guidance for the read/write decisions of the translation model. We develop a new modeling approach that estimates these divergence scores from the partial input alone during inference, achieved by adding a few lightweight layers on top of the translation model; a minimal sketch of this decision rule is given after the list.
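As an illustration of how such a policy could drive inference, the sketch below implements a generic divergence-guided read/write loop: a lightweight estimator predicts, from the partial input alone, how much the next-token distribution is expected to change once more speech arrives, and the system writes only when that expected divergence is small. This is a hedged sketch of the general idea rather than the exact DiG-SST procedure; `next_token`, `expected_divergence`, and the threshold value are hypothetical placeholders.

```python
from typing import Callable, List, Optional, Sequence

SpeechChunk = Sequence[float]


def divergence_guided_decode(
    stream: Callable[[], Optional[SpeechChunk]],  # returns the next speech chunk, or None when exhausted
    next_token: Callable[[List[SpeechChunk], List[str]], str],          # hypothetical: translation model step
    expected_divergence: Callable[[List[SpeechChunk], List[str]], float],  # hypothetical lightweight head
    threshold: float = 0.2,  # latency/quality knob: lower -> more reads before each write
    eos: str = "</s>",
) -> List[str]:
    """Illustrative divergence-guided read/write loop.

    WRITE when the estimated divergence between the next-token distribution
    given the current partial input and the (unseen) complete input is small;
    otherwise READ another speech chunk.
    """
    chunks: List[SpeechChunk] = []
    hypothesis: List[str] = []
    source_done = False

    while True:
        if not source_done and expected_divergence(chunks, hypothesis) > threshold:
            chunk = stream()  # READ: future input is expected to change the prediction
            if chunk is None:
                source_done = True
            else:
                chunks.append(chunk)
            continue
        token = next_token(chunks, hypothesis)  # WRITE: prediction is deemed stable
        if token == eos:
            break
        hypothesis.append(token)
    return hypothesis
```

The threshold acts as the latency knob at inference time: raising it makes the policy write earlier (lower latency), while lowering it makes it read more speech before committing to each token (higher quality at higher latency).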
Our experiments demonstrate that the proposed approach compares favorably against other methods across three translation directions of the MuST-C dataset in both offline and simultaneous translation scenarios. The code is available at https://github.com/cxjfluffy/DiG-SST.

## Background and Related Works

Speech translation is categorized into offline and simultaneous scenarios based on the inference mode. A standard speech translation training sample, denoted by $D = (\mathbf{s}, \mathbf{x}, \mathbf{y})$, comprises a speech audio $\mathbf{s} = (s_1, \ldots, s_T)$, its transcription $\mathbf{x} = (x_1, \ldots, x_I)$, and its translation $\mathbf{y} = (y_1, \ldots, y_J)$. Subsequent discussions mainly focus on end-to-end techniques.

**Offline Speech Translation** generates all target tokens based on the complete audio input. The offline ST model first encodes the audio input $\mathbf{s}$ into a representation $\mathbf{h}$, which is then decoded to predict $\mathbf{y}$. The decoding process of an offline ST model parameterized by $\theta$ is defined as:

$$
p(\mathbf{y} \mid \mathbf{s}; \theta) = \prod_{j} p\left(y_j \mid \mathbf{s}, \mathbf{y}_{<j}; \theta\right)
$$