Published as a conference paper at ICLR 2023

# FEDERATED NEAREST NEIGHBOR MACHINE TRANSLATION

Yichao Du, Zhirui Zhang, Bingzhe Wu, Lemao Liu, Tong Xu and Enhong Chen
University of Science and Technology of China
State Key Laboratory of Cognitive Intelligence
Tencent AI Lab
duyichao@mail.ustc.edu.cn, {tongxu, cheneh}@ustc.edu.cn, zrustc11@gmail.com, {bingzhewu, redmondliu}@tencent.com

ABSTRACT

To protect user privacy and meet legal regulations, federated learning (FL) is attracting significant attention. Training neural machine translation (NMT) models with traditional FL algorithms (e.g., FedAvg) typically relies on multi-round model-based interactions. However, this is impractical and inefficient for translation tasks due to the vast communication overhead and heavy synchronization. In this paper, we propose a novel Federated Nearest Neighbor (FedNN) machine translation framework that, instead of multi-round model-based interactions, leverages a one-round memorization-based interaction to share knowledge across different clients and build low-overhead privacy-preserving systems. The whole approach equips the public NMT model trained on large-scale accessible data with a k-nearest-neighbor (kNN) classifier and integrates the external datastore constructed from the private text data of all clients to form the final FL model. A two-phase datastore encryption strategy is introduced to preserve privacy during this process. Extensive experiments show that FedNN significantly reduces computational and communication costs compared with FedAvg, while maintaining promising translation performance in different FL settings.

1 INTRODUCTION

In recent years, neural machine translation (NMT) has significantly improved translation quality (Bahdanau et al., 2015; Vaswani et al., 2017; Hassan et al., 2018) and has been widely adopted in many commercial systems. The current mainstream system is first built on a large-scale corpus collected by the service provider and then directly applied to translation tasks for different users and enterprises. However, this application paradigm faces two critical challenges in practice. On the one hand, previous works have shown that NMT models perform poorly in specific scenarios, especially when they are trained on corpora from very distinct domains (Koehn & Knowles, 2017; Chu & Wang, 2018). Fine-tuning is a popular way to mitigate the effect of domain drift, but it brings additional model deployment overhead and in particular requires high-quality in-domain data provided by users or enterprises. On the other hand, some users and enterprises have strict data security requirements due to business concerns or government regulations (e.g., GDPR and CCPA), meaning that private user data cannot be accessed directly for model training. Thus, a conventional centralized training approach is infeasible in these scenarios. In response to this dilemma, a natural solution is federated learning (FL) (Li et al., 2019), which enables different data owners to train a global model in a distributed manner while leaving raw private data isolated to preserve data privacy. Generally, a standard FL workflow, such as FedAvg (McMahan et al., 2017), consists of multi-round model-based interactions between server and clients. At each round, every client first trains on its local sensitive data and sends the resulting model update to the server, which aggregates these local updates into an improved global model, as sketched below.
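To make the communication pattern concrete, here is a minimal PyTorch-style sketch of one FedAvg round; the `Client` interface, `local_train`, and `num_examples` are illustrative names rather than any particular implementation.

```python
import copy
from typing import Dict, List

import torch


def fedavg_round(global_state: Dict[str, torch.Tensor],
                 clients: List["Client"]) -> Dict[str, torch.Tensor]:
    """One FedAvg round: every client trains locally, the server averages.

    `Client` is an assumed interface exposing `num_examples` and a
    `local_train(state_dict)` method that returns an updated state dict.
    """
    updates, sizes = [], []
    for client in clients:
        # Each client downloads the current global model, fine-tunes it on its
        # private data, and sends only the parameters back to the server.
        local_state = client.local_train(copy.deepcopy(global_state))
        updates.append(local_state)
        sizes.append(client.num_examples)

    total = float(sum(sizes))
    new_state = {}
    for name in global_state:
        # Data-size-weighted average: theta^{r+1} = sum_m (n_m / n) * theta^r_m
        new_state[name] = sum((n / total) * upd[name].float()
                              for n, upd in zip(sizes, updates))
    return new_state
```

Note that every round ships the full parameter set in both directions, which is exactly the overhead this paper argues against.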
This straightforward idea has been implemented by prior works (Roosta et al., 2021; Passban et al., 2022) that directly apply FedAvg to machine translation and introduce parameter pruning strategies during node communication. Despite this, multi-round model-based interactions are impractical and inefficient for NMT applications. Current models heavily rely on deep neural networks as the backbone, and their parameter counts can reach tens or even hundreds of millions, bringing vast computation and communication overhead. In real-world scenarios, different clients (i.e., users and enterprises) usually have limited computation and communication capabilities, making it difficult to meet the frequent model training and node communication requirements of the standard FL workflow. Further, due to capability differences between clients, heavy synchronization also hinders the efficiency of the FL workflow. Reducing the number of interactions eases this problem but incurs significant performance loss. Inspired by the recent remarkable performance of memorization-augmented techniques (e.g., the k-nearest-neighbor, kNN, approach) in natural language processing (Khandelwal et al., 2020; 2021; Zheng et al., 2021a;b) and computer vision (Papernot & Mcdaniel, 2018; Orhan, 2018), we take a new perspective on the federated NMT training problem. In this paper, we propose a novel Federated Nearest Neighbor (FedNN) machine translation framework, which equips the public NMT model trained on large-scale accessible data with a kNN classifier and integrates the external datastore constructed from the private data of all clients to form the final FL model. In this way, we replace the multi-round model-based interactions of the conventional FL paradigm with a one-round encrypted memorization-based interaction to share knowledge among different clients and drastically reduce computation and communication overhead. Specifically, FedNN follows a similar server-client architecture. The server holds large-scale accessible data to construct the public NMT model for all clients, while each client leverages its local private data to build an external datastore that is collected to augment the public NMT model via kNN retrieval. Based on this architecture, the key is to merge and broadcast all datastores built by different clients while avoiding privacy leakage. We design a two-phase datastore encryption strategy that adopts an adversarial mode between server and clients to preserve privacy during the memorization-based interaction. On the one hand, the server builds a (K, V)-encryption model for the clients to increase the difficulty of reconstructing private text from the datastores constructed by other clients; the K-encryption model is coupled with the public NMT model to ensure the correctness of kNN retrieval. On the other hand, all clients apply a shared content-encryption model to their local datastores during the collection process, so that the server cannot directly access the original datastores. During inference, each client uses the corresponding content-decryption model to obtain the final integrated datastore. A simplified sketch of this one-round interaction is given below.
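The following highly simplified sketch is only meant to fix ideas about the one-round interaction; `decoder_states`, `encrypt_key` (standing in for the server-provided K-encryption model) and `encrypt_value` (standing in for the shared content-encryption model) are assumed placeholder interfaces, not the paper's actual code.

```python
from typing import Callable, List, Sequence, Tuple

import torch

EncryptedEntry = Tuple[torch.Tensor, bytes]   # (encrypted key, encrypted value)
Datastore = List[EncryptedEntry]


def build_client_datastore(nmt_model,
                           bitext: Sequence[Tuple[str, Sequence[int]]],
                           encrypt_key: Callable[[torch.Tensor], torch.Tensor],
                           encrypt_value: Callable[[int], bytes]) -> Datastore:
    """Each client turns its private parallel data into encrypted (key, value) pairs.

    Keys are decoder context representations produced by the shared public NMT
    model (so retrieval stays compatible across clients); values are target tokens.
    """
    datastore: Datastore = []
    for src, tgt in bitext:
        # Force-decode the reference translation to get one hidden state per token
        # (assumed helper returning a [len(tgt), dim] tensor).
        hiddens = nmt_model.decoder_states(src, tgt)
        for h, token in zip(hiddens, tgt):
            datastore.append((encrypt_key(h), encrypt_value(token)))
    return datastore


def merge_on_server(client_datastores: List[Datastore]) -> Datastore:
    """The server only concatenates encrypted datastores; it never sees plaintext."""
    merged: Datastore = []
    for ds in client_datastores:
        merged.extend(ds)
    return merged
```

At inference time, each client downloads the merged datastore, decrypts the values with the shared content-decryption model, and plugs the result into a kNN-augmented decoder.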
We set up several FL scenarios (i.e., Non-IID and IID settings) with a multi-domain English-German (En-De) translation dataset and demonstrate that FedNN not only drastically decreases computation and communication costs compared with FedAvg, but also achieves state-of-the-art translation performance in the Non-IID setting. Additional experiments verify that FedNN scales easily to large numbers of clients with sparse data, thanks to the memorization-based interaction across different clients. Our code is open-sourced at https://github.com/duyichao/FedNN-MT.

2 FEDNMT: FEDERATED NEURAL MACHINE TRANSLATION

Current commercial NMT systems are built on a large-scale corpus collected by the service provider and directly applied to different users and enterprises. However, this mode struggles to flexibly satisfy the model customization and privacy protection requirements of users and enterprises. In this work, we focus on a more general application scenario, where users and enterprises participate in collaboratively training NMT models with the service provider, but the service provider cannot directly access the private data. Formally, this application scenario consists of $|C|$ clients (i.e., users or enterprises) and a central server (i.e., the service provider). The central server holds vast accessible translation data $D_s = \{(x^i_s, y^i_s)\}_{i=1}^{|D_s|}$, where $x^i = (x^i_1, x^i_2, \ldots, x^i_{|x^i|})$ and $y^i = (y^i_1, y^i_2, \ldots, y^i_{|y^i|})$ (for brevity, we omit the subscript $s$ here) are text sequences in the source and target languages, respectively. The central server can easily train a public NMT model $f_\theta$ on this corpus, where $\theta$ denotes the model parameters. Each client $c$ holds private data $D_c = \{(x^i_c, y^i_c)\}_{i=1}^{|D_c|}$, which is usually sparse in practice (i.e., $|D_c| \ll |D_s|$) and accessible only to itself. This setting naturally falls into the federated learning framework. The straightforward idea is to apply the vanilla FL method (i.e., FedAvg) or its variants (Roosta et al., 2021; Passban et al., 2022). Generally, FedAvg consists of multi-round model-based interaction updates between server and clients. At each round $r$, each client $c$ downloads the global model $f_{\theta^r}$ from the server and optimizes it on $D_c$. The local update $\theta^r_c$ is then uploaded to the server, which aggregates these updates into a new model $f_{\theta^{r+1}}$ via simple parameter averaging: $\theta^{r+1} = \sum_{m=1}^{|C|} \frac{n_m}{n} \theta^r_m$, where $n_m$ denotes the number of data points in the $m$-th client's private data and $n$ is the total number of training examples. However, such an FL workflow is inefficient for the above application scenario because the parameter count of NMT models typically reaches tens or even hundreds of millions, bringing vast computation and communication overhead. The system heterogeneity between server and clients, i.e., mismatched bandwidth, computation resources, etc., also makes it difficult to satisfy the frequent update and communication requirements of the standard FL workflow.

3 FEDNN: FEDERATED NEAREST NEIGHBOR MACHINE TRANSLATION

Inspired by advanced memorization-augmented techniques, e.g., kNN-MT (Khandelwal et al., 2021), which has shown a promising ability to directly augment a pre-trained NMT model with external knowledge via kNN retrieval, we explore leveraging a one-round memorization-based interaction rather than multi-round model-based interactions to achieve knowledge sharing across different clients; the basic kNN-MT decoding step is sketched below for reference.
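For reference, kNN-MT decoding interpolates the NMT output distribution with a retrieval distribution over the datastore. The sketch below is a generic, self-contained formulation of that step, not the paper's code; `k`, `temperature` and `lam` are the usual kNN-MT hyper-parameters, and a real system would use FAISS instead of brute-force distances.

```python
import torch
import torch.nn.functional as F


def knn_mt_step(p_nmt: torch.Tensor, query: torch.Tensor,
                keys: torch.Tensor, values: torch.Tensor,
                k: int = 8, temperature: float = 10.0,
                lam: float = 0.5) -> torch.Tensor:
    """Interpolate the NMT distribution with a kNN distribution over a datastore.

    p_nmt:  [vocab] softmax output of the NMT model at the current step
    query:  [d]     current decoder hidden state
    keys:   [n, d]  datastore keys (decoder states seen during force-decoding)
    values: [n]     datastore values (target-token ids, LongTensor)
    """
    # Retrieve the k nearest keys by L2 distance.
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # [n]
    knn_dist, knn_idx = dists.topk(k, largest=False)

    # Turn negative distances into a distribution over the retrieved tokens.
    weights = F.softmax(-knn_dist / temperature, dim=0)        # [k]
    p_knn = torch.zeros_like(p_nmt)
    p_knn.scatter_add_(0, values[knn_idx], weights)

    # p(y_t | x, y_<t) = lam * p_kNN + (1 - lam) * p_NMT
    return lam * p_knn + (1.0 - lam) * p_nmt
```

In FedNN, the keys and values of this datastore come from the merged, encrypted local datastores of all clients rather than from a single in-domain corpus.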
In this work, we design a novel Federated Nearest Neighbour (Fed NN) machine translation framework, which extends the promising capability of k NN-MT in the federated scenario and introduces two-phase datastore encryption strategy to avoid data privacy leakage. The whole approach complements the public NMT model built by the central server with a k NN classifier and safely collects the local datastore constructed by private text data from all clients to form the global FL model. The entire workflow of Fed NN is illustrated in Figure 1, consisting of initialization, one-round memorization-based interaction and model inference on clients. 3.1 INITIALIZATION Fed NN starts with the public NMT model and encryption models. The central server is responsible for optimizing the public NMT model fθ with Ds. Following k NN-MT (Khandelwal et al., 2021), the memorization (also called as datastore) is a set of key-value pairs. Given a sentence pair (xs, ys) Ds, we gain the context representation fθ(xs, ys,Medical>IT), since Fed Avg utilizes the size of different datasets as weights to aggregate client models. For the aggregation frequency, Fed Avg1 is much better than Fed Avg and more details can be found in Appendix C.2. We find that frequent aggregation significantly reduces the parameter conflicts between different models, but it brings high communication cost. (iii) FT-Ensemble is better than Fed Avg , indicating that the fusion of output probabilities leads to less knowledge conflict compared with model aggregation. (iv) Fed NN achieves an average 4.41/1.99 BLEU score improvement on the client test set compared to Fed Avg1 and C-NMT respectively, and maintains a competitive performance on the server test set. It demonstrates the effectiveness of Fed NN in capturing client-side knowledge by memorization and integrating it with P-NMT. (v) Although Fed NN slightly increases inference time, it not only improves translation quality, but also significantly reduces communication and computation overhead compared with other FL baselines, which is tolerable for clients. For the IID setting, we have some different findings: (i) Some FL methods that do not leverage server data in their training process (i.e., Fed Avg1, FT-Ensemble and Fed NN) outperform C-NMT on the client test set. The reason is that there is no statistical data heterogeneity among clients, resulting in fewer parameter conflicts and less conflict of probability outputs. (ii) The performance of Fed NN is slightly weaker than that in the Non-IID setting. It demonstrates that the benefit of the memorization-based interaction is more significant when the data distribution is more heterogeneous. Overall, Fed NN shows stable performance with less communication and computational overhead in Non-IID and IID settings, which verifies the practicality of memorization-based interaction mechanisms. More results and analysis are shown in Appendix B. 4.3 THE IMPACT OF CLIENT S NUMBER We further verify the effectiveness of Fed NN on a larger number of clients. We adopt the number of clients ranging from (3, 6, 12, 18) for quick experiments.1 The detailed results are shown in Figure 2. Comparisons with FL Methods. As the number of clients increases, we observe that: (i) Both Fed Avg1 and FT-Ensemble show varying degrees of performance degradation on the client test sets, especially for FT-Ensemble. We conjecture that the limited local data cannot support the training of local models and retain most of the knowledge of P-NMT. 
(ii) FedNN outperforms the FL baselines on both private and global test sets in the Non-IID setting, while in the IID setting it maintains performance similar to FedAvg1 on the private test sets and keeps a higher global performance. These results show that FedNN, benefiting from the memorization-based interaction, can quickly scale to large-scale client scenarios and avoid the performance loss caused by insufficient local private data. Further analysis of the FL methods is given in Appendix E.

Comparisons with Personalized Methods. We also compare FedNN with personalized methods, including FT (fine-tuning P-NMT with only local client-side data) and AK-MT (constructing a datastore with only local client-side data and decoding with the assisted Meta-k network). AK-MT and FT perform similarly, as AK-MT is able to capture personalized knowledge through local memorization. The performance of both AK-MT and FT tends to decrease in the Non-IID setting as the number of clients increases, while FedNN hardly degrades. In the IID setting, although the performance of all methods degrades, FedNN still achieves the best performance on all client test sets, because the global memorization provides more similar patterns than the local memorization to assist inference.

1 Due to the limited resources in our experiments, there are no additional domains to maintain the Non-IID setting when the number of clients increases. Thus, we directly separate the Non-IID and IID data distributions with a ratio of (1, 1/6). Note that the Non-IID setting here is not a strict one, but it is still worth exploring.

Figure 2: The translation performance of FL and personalized methods when the number of clients increases.

Figure 3: The impact of data distribution heterogeneity for different FL and personalized methods.

4.4 THE IMPACT OF DATA HETEROGENEITY

To further investigate the effect of data heterogeneity across the three clients on FL performance, we adopt a mixing ratio $\alpha \in \{0, 0.2, 0.4, 0.6, 0.8\}$ to construct the desired data distribution: we randomly take a proportion $\alpha$ from each domain to build an IID dataset, and the remaining domain data is then mixed with one third of this IID dataset to form the final data distribution. As $\alpha \to 0$, the partitions become more heterogeneous (Non-IID); conversely, larger $\alpha$ makes the data distribution more uniform. As shown in Figure 3, the performance of the personalized methods (FT and AK-MT) degrades as data heterogeneity decreases, which is caused by the reduction of available domain-specific data on each client. FT-Ensemble also degrades across all client test sets and is worse than FT, while FedAvg1 shows opposite performance trends between Law and IT/Medical. This is because, when FedAvg aggregates, each client's model weight is proportional to its data size, and as $\alpha$ increases, the client data sizes shift from $|D_{Law}| > |D_{Medical}| > |D_{IT}|$ towards an equal share of $\frac{1}{3}(|D_{Law}| + |D_{Medical}| + |D_{IT}|)$. Our FedNN maintains stable and remarkable performance across all client test sets and significantly outperforms the other methods on the server test set.
It indicates that the memorization-based interaction mechanism could capture and retain the knowledge of all clients, avoiding the knowledge conflict based on traditional model-based interaction. 4.5 QUANTITATIVE ANALYSIS OF PRIVACY We quantify the potential privacy-leaking risks of global memorization. Since all clients obtain the public NMT model, they could utilize their own datastore to train a reverse attack model to reconstruct the private data in Published as a conference paper at ICLR 2023 global memorization. In this experiment, we task one client (e.g., IT) as the attacker and others as the defenders (e.g., Medical and Law). The reconstruction BLEU (Papineni et al., 2002)/Precision(P)/Recall(R)/F1 scores are used to evaluate the degree of privacy leakage. The more experimental details are shown in Appendix D. As illustrated in Table 2, whether the input is an unencrypted key or a key encrypted by f KE, the threat model has very low scores in all defenders, especially for the recall score, meaning that it is difficult to recover and identify valuable information from global memorization. Furthermore, the f KE increases the difficulty of reconstructing the private text from the memorization constructed by other clients. We also provide some case studies in Appendix D.4, which better help qualitatively assess the safety of Fed NN. Table 2: The reconstruction BLEU/Precision(P)/Recall(R)/F1 score [%] of the attack model. Metric Datastore IT Medical IT Law Medical IT Medical Law Law IT Law Medical BLEU (K, V) 8.21 5.09 9.90 6.16 7.33 8.30 (Kf KE, V) 6.52 4.27 7.88 5.58 6.35 6.86 P/R/F1 (K, V) 14.55/2.90/4.84 35.78/7.43/12.3 12.18/7.53/9.31 23.18/11.65/15.51 11.15/7.04/8.63 9.75/4.85/6.48 (Kf KE, V) 14.73/2.30/3.98 41.26/5.28/9.36 11.88/7.43/9.14 12.18/7.53/9.31 11.81/6.35/8.26 9.62/4.04/5.69 5 RELATED WORK The FL algorithm for deep learning (Mc Mahan et al., 2017) is first proposed for language modeling and image classification tasks. Then theory and framework of FL are widely applied to many fields, including computer vision (Lim et al., 2020), data mining (Chai et al., 2021), and edge computing (Ye et al., 2020). Recently, researchers explore applications of FL in privacy-preserving NLP, such as next word prediction (Hard et al., 2018; Chen et al., 2019), aspect sentiment classification (Qin et al., 2021), relation extraction (Sui et al., 2021), and machine translation (Roosta et al., 2021; Passban et al., 2022). For machine translation, previous works directly apply Fed Avg for this task and introduce some parameter pruning strategies during node communication. However, multi-round model-based interactions are impractical and inefficient for NMT because of the huge computational and communication costs associated with large NNT models. Different from them, we design an efficient federated nearest neighbor machine translation framework that requires only one-round memorization interaction to obtain a high-quality global translation system. Memorization-augmented methods have attracted much attention from the community and achieved remarkable performance on many NLP tasks, including language modeling (Khandelwal et al., 2020; He et al., 2021), named entity recognition (Wang et al., 2022), few-shot learning with pre-trained language model (Bari et al., 2021; Nie et al., 2022), and machine translation (Khandelwal et al., 2021; Zheng et al., 2021a;b; Wang et al., 2021; Du et al., 2022). For the NMT system, Khandelwal et al. 
(2021) first propose k NN-MT, a simple and efficient non-parametric approach that plugs k NN classifier over a large datastore with traditional NMT models (Vaswani et al., 2017; Zhang et al., 2018a;b; Guo et al., 2020; Wei et al., 2020) to achieve significant improvement. Our work extends the promising capability of k NN-MT in the federated scenario and introduces two-phase datastore encryption strategy to avoid data privacy leakage. 6 CONCLUSION In this paper, we present a novel federated nearest neighbor machine translation framework to handle the federated NMT training problem. This FL framework equips the public NMT model trained on large-scale accessible data with a k NN classifier and safely collects all local datastores via a two-phase datastore encryption strategy to form the global FL model. Extensive experimental results demonstrate that our proposed approach significantly reduces computational and communication costs compared with Fed Avg, while achieving promising performance in different FL settings. In the future, we would like to explore this approach on other sequence-to-sequence tasks. Another interesting direction is to further investigate the effectiveness of our method on a larger number of clients, such as hundreds of clients with more domains. Published as a conference paper at ICLR 2023 ACKNOWLEDGEMENTS We thank the anonymous reviewers for helpful feedback on early versions of this work. We appreciate Wenxiang Jiao, Xing Wang, Longyue Wang and Zhaopeng Tu for the fruitful discussions. This work was done when the first author was an intern at Tencent AI Lab and supported by the grants from National Natural Science Foundation of China (No.62222213, 62072423), and the USTC Research Funds of the Double First-Class Initiative (No.YD2150002009). Zhirui Zhang, Tong Xu and Enhong Chen are the corresponding authors. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In EMNLP, 2015. M Saiful Bari, Batool Haider, and Saab Mansour. Nearest neighbour few-shot learning for cross-lingual classification. Ar Xiv, abs/2109.02221, 2021. Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. Findings of the 2014 workshop on statistical machine translation. In WMT@ACL, 2014. Kallista A. Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan Mc Mahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. Co RR, abs/1611.04482, 2016. URL http://arxiv.org/abs/1611.04482. Di Chai, Leye Wang, Kai Chen, and Qiang Yang. Secure federated matrix factorization. IEEE Intelligent Systems, 36:11 20, 2021. Mingqing Chen, Ananda Theertha Suresh, Rajiv Mathews, Adeline Wong, Cyril Allauzen, Franccoise Beaufays, and Michael Riley. Federated learning of n-gram language models. In Co NLL, 2019. Chenhui Chu and Rui Wang. A survey of domain adaptation for neural machine translation. In COLING, 2018. Yichao Du, Weizhi Wang, Zhirui Zhang, Boxing Chen, Tong Xu, Jun Xie, and Enhong Chen. Non-parametric domain adaptation for end-to-end speech translation. In EMNLP, 2022. Taher El Gamal. A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Trans. Inf. Theory, 31:469 472, 1984. Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, and Enhong Chen. 
Incorporating bert into parallel sequence decoding with adapters. In Neur IPS, 2020. Andrew Hard, Kanishka Rao, Rajiv Mathews, Franc oise Beaufays, Sean Augenstein, Hubert Eichner, Chlo e Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. Ar Xiv, abs/1811.03604, 2018. Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William D. Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. Achieving human parity on automatic chinese to english news translation. Ar Xiv, abs/1803.05567, 2018. Published as a conference paper at ICLR 2023 Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. Efficient nearest neighbor language models. In EMNLP, 2021. Herv e J egou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:117 128, 2011. Jeff Johnson, Matthijs Douze, and Herv e J egou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7:535 547, 2021. Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In ICLR, 2020. Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Nearest neighbor machine translation. In ICLR, 2021. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Co RR, abs/1412.6980, 2015. Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In NMT@ACL, 2017. Tian Li, Anit Kumar Sahu, Ameet S. Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37:50 60, 2019. Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Tao Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22:2031 2063, 2020. H. B. Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Ag uera y Arcas. Communicationefficient learning of deep networks from decentralized data. In AISTATS, 2017. Feng Nie, Meixi Chen, Zhirui Zhang, and Xuan Cheng. Improving few-shot performance of language models via nearest neighbor calibration. Ar Xiv, abs/2212.02216, 2022. Goldreich Oded. Foundations of cryptography: Volume 2, basic applications. 2004. A. Emin Orhan. A simple cache model for image recognition. Ar Xiv, abs/1805.08709, 2018. Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, 2019. Nicolas Papernot and Patrick Mcdaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. Ar Xiv, abs/1803.04765, 2018. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002. Peyman Passban, Tanya Roosta, Rahul Gupta, Ankit R. Chadha, and Clement Chung. Training mixed-domain translation models via federated learning. In NAACL, 2022. Han Qin, Guimin Chen, Yuanhe Tian, and Yan Song. Improving federated learning for aspect-based sentiment analysis via topic memories. In EMNLP, 2021. Ronald L. Rivest, Adi Shamir, and Leonard M. Adleman. 
A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 21:120 126, 1978. Published as a conference paper at ICLR 2023 Tanya Roosta, Peyman Passban, and Ankit R. Chadha. Communication-efficient federated learning for neural machine translation. Ar Xiv, abs/2112.06135, 2021. Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. Bleurt: Learning robust metrics for text generation. In ACL, 2020. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. Ar Xiv, abs/1508.07909, 2016. Dianbo Sui, Yubo Chen, Kang Liu, and Jun Zhao. Distantly supervised relation extraction in federated settings. In EMNLP, 2021. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neur IPS, 2017. Dongqi Wang, Hao-Ran Wei, Zhirui Zhang, Shujian Huang, Jun Xie, Weihua Luo, and Jiajun Chen. Nonparametric online learning from human feedback for neural machine translation. In AAAI Conference on Artificial Intelligence, 2021. Shuhe Wang, Xiaoya Li, Yuxian Meng, Tianwei Zhang, Rongbin Ouyang, Jiwei Li, and Guoyin Wang. knn-ner: Named entity recognition with nearest neighbor search. Ar Xiv, abs/2203.17103, 2022. Hao-Ran Wei, Zhirui Zhang, Boxing Chen, and Weihua Luo. Iterative domain-repaired back-translation. In EMNLP, 2020. Yunfan Ye, Shen Li, Fang Liu, Yonghao Tang, and Wanting Hu. Edgefed: Optimized federated learning based on edge computing. IEEE Access, 8:209191 209198, 2020. Junpeng Zhang, Mengqian Li, Shuiguang Zeng, Bin Xie, and Dongmei Zhao. A survey on security and privacy threats to federated learning. In 2021 International Conference on Networking and Network Applications, Na NA 2021, Lijiang City, China, October 29 - Nov. 1, 2021, pp. 319 326. IEEE, 2021. doi: 10.1109/Na NA53684.2021.00062. URL https://doi.org/10.1109/Na NA53684.2021. 00062. Zhirui Zhang, Shujie Liu, Mu Li, M. Zhou, and Enhong Chen. Joint training for neural machine translation models with monolingual data. In AAAI Conference on Artificial Intelligence, 2018a. Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. Regularizing neural machine translation by target-bidirectional agreement. In AAAI Conference on Artificial Intelligence, 2018b. Xin Zheng, Zhirui Zhang, Junliang Guo, Shujian Huang, Boxing Chen, Weihua Luo, and Jiajun Chen. Adaptive nearest neighbor machine translation. In ACL, 2021a. Xin Zheng, Zhirui Zhang, Shujian Huang, Boxing Chen, Jun Xie, Weihua Luo, and Jiajun Chen. Nonparametric unsupervised domain adaptation for neural machine translation. In EMNLP(findings), 2021b. Published as a conference paper at ICLR 2023 A IMPLEMENTATION DETAILS AND EVALUATION The statistics of the dataset and datastore used by the server/clients are listed in Table 3 and Table 4, respectively. We follow the recipe 2 to perform data pre-processing. The Moses toolkit 3 is used to tokenize all sentences and learn bpe-code in the publicly available corpus WMT14. Based on this, we split all the words of the above datasets into subword units (Sennrich et al., 2016). All experiments are implemented based on the FAIRSEQ toolkit (Ott et al., 2019). We train the public model on the WMT14 En-De dataset and use it as the initialization model for all methods. We adopt Transformer (Vaswani et al., 2017) as model structure of all baselines, in which it consists of 6 transformer encoder layers, and 6 transformer decoder layers. 
The input embedding size of the transformer layer is 512, the FFN layer dimension is 2048, and the number of self-attention heads is 8. During training, we deploy the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 5e-4 and 4K warm-up updates to optimize model parameters. Both label smoothing coefficient and dropout rate are set to 0.1. The batch size is set to 16K tokens. We train all models with 4 Tesla-V100 GPU and set patience to 5 to select the best checkpoint on the validation set. The FAISS (Johnson et al., 2021) is leveraged to construct the datastore and we use its Index IVFPQ strategy to implement Product Quantizer K-encryption and fast nearest neighbor search. We utilize the FAISS to learn 4096 cluster centroids on public datastore, and apply it to client s datastore. During inference, the beam size and length penalty are set to 5 and 1 for all methods and we search 64 clusters for each target token when using FAISS. In all experiments, we report the case-sensitive BLEU score (Papineni et al., 2002) using sacre BLEU4. We estimate the number of floating-point operations (FLOPs) used to train the model by multiplying the training time, the number of GPUs used, and an estimation of the sustained single-precision floating-point capacity of each GPU5. Table 3: The statistics of datasets for server and clients. Server Client WMT14 IT Medical Law Train 4,475,414 222,927 248,009 467,309 Dev 45,206 2,000 2,000 2,000 Test 3,003 2,000 2,000 2,000 Table 4: The statistics of datastores for server and clients. Server Client Global WMT14 IT Medical Law (K, V) size 117,427,034 3,085,523 5,858,648 16,868,065 25,812,236 Hard Disk Space (Datastore) 114 GB 3,938 MB 6,890 MB 17,717 MB 28,545 MB Hard Disk Space (Faiss Index) 8,988 MB 244 MB 451 MB 1,266 MB 1,978 MB 2https://github.com/facebookresearch/fairseq/blob/main/examples/translation/prepare-wmt14en2de.sh 3https://github.com/moses-smt/mosesdecoder 4https://github.com/mjpost/sacrebleu, with a configuration of 13a tokenizer, case-sensitiveness, and full punctuation 5The single-precision floating-point capacity for Tesla-V100 GPU is 14 TFLOPs. Published as a conference paper at ICLR 2023 B MORE RESULTS FOR THE NON-IID SETTING B.1 PERFORMANCE COMPARISONS WITH CONTROLLER We compare the performance (BLEU) and overhead of Fed NN, Fed Avg, and Controller in the Non-IID setting of the En2De translation task. For the Controller model, as shown in Roosta et al. (2021) s study, 6E-6D/C-C(0-3) model achieves the best trade-off of performance and efficiency among all FL methods. Thus, we follow their setup and adopt layers 0 and 3 (both for the encoder and decoder) as controllers to participate in the parameter interaction of Fed Avg training. The experimental results are shown in the Table 5. We find that the Controller has a significant performance improvement compared to P-NMT, but is still worse than Fed Avg1 and Fed NN. In addition, since the Controller falls into the multi-round model-based FL interaction paradigm, its communication overhead is still much higher than Fed NN. Table 5: The performance and overhead comparison with Controller. . Comm. and Comp. refer to communication and computational cost in GB and FLOPs respectively Methods Client Test Server Test Overall Performance Cost IT Law Medical WMT14 Client Global Comm. Comp. 
P-NMT 26.62 35.91 30.27 26.63 30.93 29.86 Controller 27.78 46.30 35.62 18.72 36.57 +5.63 32.11 +2.25 10.86 6.77 1017 Fed Avg1 28.26 53.00 45.90 13.45 42.39 +11.45 35.15 +5.30 388.12 3.23 1018 Fed NN 35.62 55.57 49.21 22.29 46.80 +15.87 40.67 +10.82 5.08 6.72 1015 B.2 EVALUATION WITH BLEURT We evaluate the two settings in Table 1 using the neural metric - BLEURT (Sellam et al., 2020). The detailed results are shown in Table 6. We can get similar conclusions when using the BLEU score as an evaluation metric, i.e., for the Non-IID setting, our Fed NN significantly outperforms all other FL methods; for the IID setting, our Fed NN also achieves comparable performance to the Fed Avg1 and FT-Ensemble. Table 6: BLEURT score [%] of different methods in Table 1. Methods Client Test Server Test Overall Performance IT Law Medical WMT14 Client Global C-NMT 70.46 78.49 74.97 71.62 74.64 - 73.89 - P-NMT 62.00 72.65 65.64 71.93 66.76 - 68.06 - Fed Avgs 1 63.02 73.71 67.06 72.19 67.93 +1.17 69.00 +0.94 Fed Avg1 64.09 78.05 72.74 53.61 71.63 +4.86 67.12 -0.93 Fed Avgs 62.87 74.11 67.50 72.20 68.16 +1.40 69.17 +1.11 Fed Avg 52.63 77.37 65.02 56.26 65.01 -1.76 62.82 -5.24 FT-Ensemble 64.08 72.47 70.01 59.17 68.85 +2.09 66.43 -1.62 Fed NN 68.86 78.12 72.74 67.38 73.24 +6.48 71.78 +3.72 Fed Avgs 1 66.38 76.31 71.65 72.46 71.45 +4.68 71.70 +3.65 Fed Avg1 69.93 78.67 75.68 55.78 74.76 +8.00 70.02 +1.96 Fed Avgs 64.92 74.32 68.68 72.15 69.31 +2.54 70.02 +1.96 Fed Avg 68.96 78.08 74.39 59.44 73.81 +7.05 70.22 +2.16 FT-Ensemble 70.28 78.80 75.04 58.78 74.71 +7.94 70.73 +2.67 Fed NN 67.69 77.74 72.31 68.16 72.58 +5.82 71.48 +3.42 Published as a conference paper at ICLR 2023 B.3 SIGNIFICANT TEST FOR TABLE 1 We use the bootstrap re-sampling method to test the significant difference between Fed NN and other methods. Table 7 shows the significance test results of English-German direction under the Non-IID setting. The - means that Fed NN is not significantly better than the method. We can find that Fed NN significant outperforms all FL methods, including Fed Avg and FT-Ensemble. Table 7: The significant test between Fed NN and other methods for the Non-IID setting in Table 1. Methods Client Test Server Test IT Law Medical WMT14 C-NMT - 0.01 0.05 - P-NMT 0.01 0.01 0.01 0.01 Fed Avg1 0.01 0.01 0.05 0.01 FT-Ensemble 0.01 0.01 0.01 0.05 B.4 PERFORMANCE COMPARISONS ON GERMAN-ENGLISH DIRECTION As illustrated in Table 8, we report the performance of different FL methods in the Non-IID setting of German-English Direction. We observe that the findings in the German-English direction remain consistent with the English-German direction (shown in Table 1), in which Fed NN outperforms other methods in terms of overall performance both client-side and globally. Table 8: Performance of different methods in the German-English direction for the Non-IID setting. Methods Client Test Server Test Overall Performance IT Law Medical WMT14 Client Global P-NMT 31.70 39.86 34.37 31.64 35.31 - 34.39 - Fed Avg1 32.22 58.32 48.56 16.83 46.37 +11.06 38.98 +4.59 FT-Ensemble 35.76 44.07 43.20 21.48 41.01 +5.70 36.13 +1.74 Fed NN 41.11 60.18 53.44 27.12 51.58 +16.27 45.46 +11.07 C ABLATION STUDY ON THE NON-IID SETTING C.1 THE IMPACT OF CLIENT DATA SIZE ON DIFFERENT FL METHODS We carry out an ablation study to verify the impact of client data size on different FL methods, including Fed Avg1, FT-Ensemble and Fed NN. 
For each domain, we adopt a ratio range of β {0.0, 0.2, 0.4, 0.6, 0.8} to randomly sample from its complete data to constitute the client data of different scales. The detailed results are shown in the Table 9. We can observe that the performance and cost of all methods increase as the size of the client data increases. Moreover, Fed NN significantly outperforms other FL methods in terms of performance, communication and computational overhead. C.2 THE IMPACT OF INTERACTION FREQUENCY ON FEDAVG We conduct experiments to analyze the impact of model interaction frequency on the Fed Avg performance. We set the frequency to k {1, 2, 5, 10, 20, }, i.e., the client interacts the model with the server after k Published as a conference paper at ICLR 2023 Table 9: The impact of client data size on the performance and cost of different methods. refers to the improvement of methods compared with P-NMT. Comm. and Comp. refer to communication and computational cost in GB and FLOPs respectively. β Methods Client Test Server Test Overall Performance Cost IT Law Medical WMT14 Client Global Comm. Comp. 0.2 P-NMT 26.62 35.91 30.27 26.63 30.93 29.86 Fed Avg1 20.27 47.42 33.92 15.73 33.87 +2.94 29.34 -0.52 111.59 9.27 1017 FT-Ensemble 29.95 40.25 37.41 20.81 35.87 +4.94 32.11 +2.25 4.85 1.40 1017 Fed NN 29.65 46.89 41.46 22.72 39.33 +8.40 35.18 +5.32 2.00 1.34 1015 Fed Avg1 20.99 48.98 35.58 16.58 35.18 +4.25 30.53 +0.67 116.44 9.68 1017 FT-Ensemble 29.60 39.50 38.45 21.43 35.85 +4.92 32.25 +2.39 4.85 2.81 1017 Fed NN 30.41 50.30 43.89 23.32 41.53 +10.60 36.98 +7.12 2.77 2.69 1015 Fed Avg1 22.36 52.38 42.04 12.83 38.93 +7.99 32.40 +2.55 245.00 2.04 1018 FT-Ensemble 28.32 40.55 40.19 18.61 36.35 +2.48 31.92 +2.06 4.85 4.21 1017 Fed NN 31.65 52.29 45.65 23.68 43.20 +12.26 38.32 +8.46 3.54 4.03 1015 Fed Avg1 24.11 52.71 43.72 12.76 40.18 +9.25 33.33 +3.47 354.74 2.92 1017 FT-Ensemble 29.01 38.16 39.30 16.87 35.49 +4.56 30.84 +0.98 4.85 5.62 1017 Fed NN 32.15 54.11 46.62 22.22 44.29 +13.36 38.78 +8.92 4.31 5.38 1015 rounds of local updates. We set the total computation overhead to be the same for a fair comparison of translation performance and communication overhead. The detailed results are shown in the Table 10. We find that the performance decreases significantly as k increases, especially when k = (i.e., a copy of the server model is trained locally until convergence), and the average performance drops to a level similar to that of P-NMT. The reason is that too many local updates suffer from catastrophic forgetting of knowledge from the previous aggregated models, resulting in strong knowledge conflicts in the new round of interactions. The optimal performance is presented at k = 1, which means that frequent interactions are essential to alleviate the knowledge conflicts for Fed Avg. Table 10: BLEU score [%] and communication cost of Fed Avg with different interaction frequency. . Comm. Cost refer to communication cost in GB . Methods Client Test Server Test Overall Performance Comm. 
IT Law Medical WMT14 Client Global Cost P-NMT 26.62 35.91 30.27 26.63 30.93 29.86 Fed Avg1 28.26 53.00 45.90 13.45 42.39 +11.45 35.15 +5.30 388.12 Fed Avg2 26.37 52.92 44.22 13.07 41.17 +10.24 34.15 +4.29 194.06 Fed Avg5 24.93 52.23 41.08 12.85 39.41 +8.48 32.77 +2.92 77.62 Fed Avg10 22.63 51.11 38.60 12.65 37.45 +6.51 31.25 +1.39 38.81 Fed Avg20 21.82 49.53 36.24 12.75 35.86 +4.93 30.09 +0.23 19.41 Fed Avg 17.30 47.06 30.61 13.33 31.57 +0.63 27.01 -2.85 4.85 C.3 THE IMPACT OF ENSEMBLE STRATEGY ON FT-ENSEMBLE The ensemble strategy of FT-Ensemble could be implemented in two ways: the first is to directly average the probability distribution of each client model (as used in the Table 1), i.e., FT-Ensemble; the second, similar Published as a conference paper at ICLR 2023 Table 11: BLEU score [%] of FT-Ensemble with different aggregation strategy. Methods Client Test Server Test Overall Performance IT Law Medical WMT14 Client Global Public Model 26.62 35.91 30.27 26.63 30.93 - 29.86 - FT-Ensemble 30.11 38.14 39.15 17.13 35.80 4.87 31.13 1.28 FT-Ensemble-Wei 24.09 48.58 34.11 16.06 35.59 4.66 30.71 0.85 Fed NN 35.62 55.57 49.21 22.29 46.80 15.87 40.67 10.82 Table 12: The impact of P-NMT s quality. refers to the improvement of the method compared with the model mentioned in Table 1. Methods IT Law Medical WMT14 Clients Avg. Global Avg. BLEU BLEU BLEU BLEU BLEU BLEU P-NMT 30.72 +4.10 38.69 +2.78 35.90 +5.63 29.77 +3.14 35.10 +4.17 33.77 +3.91 Fed Avg1 28.63 +0.37 58.32 +5.32 49.08 +3.18 16.05 +2.60 45.34 +2.95 38.02 +2.87 Fed NN 38.24 +2.62 55.76 +0.19 50.65 +1.44 22.65 +0.36 48.22 +1.42 41.83 +1.16 to Fed Avg, is to weight the probability distribution of each client model s output by assigning weights to it according to its data size, i.e., FT-Ensemble-Wei. The performance comparison of these two ways in the Non-IID setting are shown in the Table 11. We find that FT-Ensemble outperforms FT-Ensemble-Wei in both client-side and global overall performance. FT-Ensemble has a more balanced performance on the client side, while FT-Ensemble-Wei is similar to Fed Avg in that the performance is more biased towards the client Law s model, which has more local data. Our Fed NN outperforms both of these methods on all clients. Note that the two implementations of FT-Ensemble described above in the IID setting are equivalent since the data size is the same for all client. C.4 THE IMPACT OF PUBLIC MODEL S QUALITY Since Fed NN performs federated learning based on the P-NMT, we investigate the impact of the P-NMT s quality on performance. We introduce WMT20 En-De data to train the P-NMT, which contains 40 million parallel pairs, and conduct fast experiments in the Non-IID setting. From Table 12, we can observe that as the quality of the P-NMT improves, all methods show better performance. D THE DETAILS OF PRIVACY LEAKAGE ANALYSIS D.1 DATASET CONSTRUCTION Given a local parallel sentence pair (xc, yc) Dc of client attacker, the public NMT model generates the context representation k = fθ(xc, yc, xc <2tgt> yc, and <2tgt> are used to identify the generation of source and target languages, respectively. By traversing the entire Dc, we obtain the whole dataset R = {r1, r2, . . . , rn} used to train the threat model, where n = P|Dc| i |yi| + |Dc|. The detailed statistics of dataset used for threat model are shown in Table 13. Published as a conference paper at ICLR 2023 Table 13: The statistics of datasets for the threat model. 
          IT          Medical     Law
Train     3,085,523   5,858,648   16,868,065
Dev       34,737      55,577      51,423
Test      2,000       2,000       2,000

Figure 4: The threat model based on the autoregressive paradigm.

D.2 THE ARCHITECTURE OF THE THREAT MODEL

The goal of the threat model is to reconstruct the corresponding original text from the memorization $(k, v)$ of the client defender. As shown in Figure 4, we use a Transformer decoder as the architecture of the threat model, which resembles a left-to-right language model under the auto-regressive paradigm. It consists of 6 Transformer layers; the input embedding size is 512, the FFN layer dimension is 2048, and the number of self-attention heads is 8. We first project the first input token $k$ to the same dimension as the word embedding using a linear layer, and then perform auto-regressive left-to-right reconstruction modeling.

D.3 EVALUATION OF PRIVACY LEAKAGE

We quantify the private information leaked by the global memorization using sentence-level and word-level metrics, i.e., reconstruction BLEU and privacy-word hitting Precision/Recall/F1. Assume the text recovered by the threat model from memorization $(k_i, v_i)$ is $h_i = \{h_{i,1}, h_{i,2}, \ldots, h_{i,|h_i|}\}$ and the ground truth is $g_i = \{g_{i,1}, g_{i,2}, \ldots, g_{i,|g_i|}\}$, where $i \in \{1, 2, \ldots, N\}$ indexes the test samples. We calculate the reconstruction BLEU score using sacreBLEU. Before evaluating the word-level privacy leakage, we need to extract the privacy dictionary of the client defender. The privacy dictionary is obtained by computing the difference between the word distribution of the defender's private dataset and that of the server's public dataset. We then filter $h_i$ and $g_i$ according to this dictionary to obtain sentences $h^p_i$ and $g^p_i$ that contain only privacy words. The word-level metrics are then computed as follows:
$$\mathrm{Precision} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{|g^p_i|} \mathrm{Count_{hit}}(g^p_{i,j}, h^p_i)}{\sum_{i=1}^{N} |h^p_i|}, \qquad \mathrm{Recall} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{|h^p_i|} \mathrm{Count_{hit}}(h^p_{i,j}, g^p_i)}{\sum_{i=1}^{N} |g^p_i|}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where $\mathrm{Count_{hit}}(x, y)$ indicates whether $x$ appears in $y$.

D.4 QUALITATIVE ANALYSIS OF PRIVACY

Some qualitative cases are illustrated in Table 15. We find that the style of all reconstructed texts remains consistent with the attacker's training data, including text length and domain style. For example, in Case 1 and Case 2, the texts reconstructed by the attack model trained on the Law domain exhibit the domain style of the Law client. This means that it is difficult to recover and identify valuable information, such as domain-specific and private words, from the global memorization.

E COST ANALYSIS OF FL METHODS FOR DIFFERENT NUMBERS OF CLIENTS

The communication and computational costs of different FL methods are illustrated in Table 14. For communication, the cost of FedAvg is much higher than that of FedNN and FT-Ensemble, because FedAvg requires multiple rounds of model-based communication, whereas FedNN and FT-Ensemble require only a single round of communication (memorization-based for FedNN, model-based for FT-Ensemble). For computation, the cost of FT-Ensemble grows linearly with the number of clients; it cannot be extended to practical applications because of the number of local models that must be ensembled for inference. In contrast, the cost of FedNN is only 1/60 and N/2 of that of FedAvg1 and FT-Ensemble, respectively; the scaling behaviour behind these numbers is sketched below.
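The helper below simply encodes the complexity expressions listed in the Compl. column of Table 14; it is a back-of-the-envelope scaling guide under those expressions only, not the paper's accounting, and the measured totals reported in Table 14 need not match it exactly.

```python
def communication_cost_gb(model_mb: float, num_clients: int,
                          rounds: int, datastore_mb: float) -> dict:
    """Rough per-method communication cost (GB), following Table 14's Compl. column.

    FedAvg:      M * N * R * 2   (model travels up and down every round)
    FT-Ensemble: M * N * (N + 1) (one round, but every client needs every model)
    FedNN:       (D + M) * N     (one round of datastore/model exchange)
    M = model size (MB), N = number of clients, R = rounds,
    D = total size of all encrypted datastores (MB).
    """
    m, n, r, d = model_mb, num_clients, rounds, datastore_mb
    return {
        "FedAvg": m * n * r * 2 / 1024,
        "FT-Ensemble": m * n * (n + 1) / 1024,
        "FedNN": (d + m) * n / 1024,
    }
```

Whatever the exact constants, the qualitative picture is the one the table reports: FedAvg's cost grows with the number of rounds R, FT-Ensemble's grows quadratically with the number of clients N, while FedNN pays a one-off cost that is linear in N.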
Considering many clients limited communication bandwidth and computational resources, Fed NN is a promising framework selection to save a lot of communication time and computational consumption. Table 14: The communication cost and computation cost of different methods, where M, N, R and D respectively represent the model size (414MB), number of client, rounds of communication (160) and the total size of all encrypted datastores (1978MB). Communication Cost (GB) Computation Cost (FLOPs) Compl. 3 6 12 18 3 6 12 18 Fed Avg M N R 2 388.12 776.25 1552.50 3105.00 3.23 1018 3.23 1018 3.23 1018 3.23 1018 FT-Ensemble M N (N+1) 4.85 16.98 63.07 138.27 7.02 1017 1.40 1018 2.11 1018 2.82 1018 Fed NN (D+M) N 5.08 12.08 26.10 40.12 6.72 1015 6.72 1015 6.72 1015 6.72 1015 F LIMITATIONS In this paper, we utilizes one round of memorization-based interaction to share knowledge among different clients, thus building low-overhead privacy-preserving translation systems. We discuss limitations of our method as follows. Despite our proposed approach achieves strong performance when exploiting global memorization sharing, it leads to reduced inference efficiency due to the need for k NN retrieval. As shown in Table 1, the inference Published as a conference paper at ICLR 2023 speed of Fed NN is about 0.75 that of P-NMT. In practice, these costs may be acceptable since we employ FAISS to speed up k NN retrieval. We encourage future work to improve the efficiency of k NN retrieval. The communication overhead required for memorization-based interaction is positively correlated with the client data size. Extremely large client data will make our approach inapplicable because it leads to higher communication overhead. Our approach is more applicable to the generic scenario described in Section 2, i.e., private data is sparse (|Dc| |Ds|). We also encourage further exploration of how to build a smaller and more accurate memorization further to mitigate this problem. This paper is still very preliminary in the privacy leakage analysis of memorization interaction. Although the threat model on shared global memorization has a very low reconstruction scores, privacy leakage is still a potential risk. How to better evaluate and mitigate the privacy leakage of memorization remains an open question, which we leave for future work. Table 15: Examples of qualitative analysis for privacy leakage. Text in green / blue represent the defender- specific ground-truth and private words, respectively. Text in red represents the hit private words by attacker. The bold word represents the threat model of the client-side attacker, where the superscript f KE represents the input of training data k is encrypted K-Encryption, otherwise it is not. Case Examples Case 1: Defender is IT <2src> cursor ; quickly moving ; to an object <2tgt> cursor ; schnell zu einem Objekt bewegen Medicalf KE: <2src> curves , fainting, salivation, vomiting, diarrhoea, fainting, fainting, or vomiting, or diarrhoea. <2tgt> Kleben, Fainting, Speichelfluss, Erbrechen, Durchfall, Ohnmacht oder Erbrechen oder Durchfall oder Erbrechen Medical: <2src> curonium or vecuronium: <2tgt> Vecuronium oder Vecuronium: Lawf KE: <2src> palm oil falling within CN code 2710 00 90 <2tgt> Palm ol des KN-Codes 2710 00 90 Law: <2src> curbiting the use of the designation butter in Annex I to Regulation (EEC) No 3143 85 <2tgt> curbitration der Bezeichnung Butter in Anhang I der Verordnung (EWG) Nr. 
3143 / 85 Case 2: Defender is Medical <2src> Intravenous infusion after reconstitution and dilution. <2tgt> Intraven ose Infusion nach Aufl osung und Verd unnung. ITf KE: <2src> Inserts a placeholder. <2tgt> Hiermit f ugen Sie einen Platzhalter ein IT: <2src> Inserts a new row. <2tgt> F ugt eine neue Zeile ein. Lawf KE: <2src> Appointment of the date of minimum durability shall be given. <2tgt> Die Angabe des Law: <2src> The minimum of date durability <2tgt> Angabe des Mindesthaltbarkeitsdatums Case 3: Defender is Law <2src> The Commission consistently takes a favourable view of such aid . <2tgt> Derartige Beihilfen werden von der Kommission stets bef urwortet. ITf KE: <2src> The & kappname; Handbook <2tgt> Das Handbuch zu & kappname; IT: <2src> Following packages depend on the installed packages: <2tgt> Die folgenden Pakete h angen von den installierten Pakete ab: Medicalf KE: <2src> Most common side effects with Azarga (seen in between 1 and 10 patients in 100) areheadache, dizziness, somnolence (sleepiness), nausea (feeling sick), diarrhoea, abdominal tummy pain, diarrhoea, flatulence (gas), abdominal (tummy) pain, dyspepsia (indigestion), diarrhoea, nausea (feeling sick), vomiting, abdominal (tummy) pain, dyspepsia (indigestion), flatulence (wind)... Medical: <2src> European Commission granted a marketing authorisation valid throughout the European Union for Nobilis Influenza H5N6 to Intervet International BV on 24 April 2009. <2tgt> April 2009 erteilte die Europ aische Kommission dem Unternehmen Intervet International BV eine Gene -hmigung f ur das Inverkehrbringen von Nobilis Influenza H5N6 in der gesamten Europ aischen Union.