Published as a conference paper at ICLR 2023

# FEDERATED NEAREST NEIGHBOR MACHINE TRANSLATION

Yichao Du, Zhirui Zhang, Bingzhe Wu, Lemao Liu, Tong Xu and Enhong Chen
University of Science and Technology of China
State Key Laboratory of Cognitive Intelligence
Tencent AI Lab
duyichao@mail.ustc.edu.cn, {tongxu, cheneh}@ustc.edu.cn, zrustc11@gmail.com, {bingzhewu, redmondliu}@tencent.com

ABSTRACT

To protect user privacy and meet legal regulations, federated learning (FL) is attracting significant attention. Training neural machine translation (NMT) models with traditional FL algorithms (e.g., FedAvg) typically relies on multi-round model-based interactions. However, this is impractical and inefficient for translation tasks due to the vast communication overhead and heavy synchronization. In this paper, we propose a novel Federated Nearest Neighbor (FedNN) machine translation framework that, instead of multi-round model-based interactions, leverages a one-round memorization-based interaction to share knowledge across different clients and build low-overhead privacy-preserving systems. The whole approach equips the public NMT model trained on large-scale accessible data with a k-nearest-neighbor (kNN) classifier and integrates the external datastore constructed from the private text data of all clients to form the final FL model. A two-phase datastore encryption strategy is introduced to preserve privacy during this process. Extensive experiments show that FedNN significantly reduces computational and communication costs compared with FedAvg, while maintaining promising translation performance in different FL settings.

1 INTRODUCTION

In recent years, neural machine translation (NMT) has significantly improved translation quality (Bahdanau et al., 2015; Vaswani et al., 2017; Hassan et al., 2018) and has been widely adopted in many commercial systems. The current mainstream system is first built on a large-scale corpus collected by the service provider and then directly applied to translation tasks for different users and enterprises. However, this application paradigm faces two critical challenges in practice. On the one hand, previous works have shown that NMT models perform poorly in specific scenarios, especially when they are trained on corpora from very distinct domains (Koehn & Knowles, 2017; Chu & Wang, 2018). Fine-tuning is a popular way to mitigate the effect of domain drift, but it brings additional model deployment overhead and in particular requires high-quality in-domain data provided by users or enterprises. On the other hand, some users and enterprises have strict data security requirements due to business concerns or government regulations (e.g., GDPR and CCPA), meaning that private user data cannot be accessed directly for model training. Thus, a conventional centralized training approach is infeasible in these scenarios. In response to this dilemma, a natural solution is federated learning (FL) (Li et al., 2019), which enables different data owners to train a global model in a distributed manner while leaving raw private data isolated to preserve data privacy. Generally, a standard FL workflow, such as FedAvg (McMahan et al., 2017), consists of multi-round model-based interactions between server and clients. At each round, every client first trains on its local sensitive data and sends the resulting model update to the server, which aggregates these local updates into an improved global model, as sketched below.
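To make the communication pattern concrete, here is a minimal PyTorch-style sketch of one FedAvg round; the `Client` interface, `local_train`, and `num_examples` are illustrative names rather than any particular implementation.

```python
import copy
from typing import Dict, List

import torch


def fedavg_round(global_state: Dict[str, torch.Tensor],
                 clients: List["Client"]) -> Dict[str, torch.Tensor]:
    """One FedAvg round: every client trains locally, the server averages.

    `Client` is an assumed interface exposing `num_examples` and a
    `local_train(state_dict)` method that returns an updated state dict.
    """
    updates, sizes = [], []
    for client in clients:
        # Each client downloads the current global model, fine-tunes it on its
        # private data, and sends only the parameters back to the server.
        local_state = client.local_train(copy.deepcopy(global_state))
        updates.append(local_state)
        sizes.append(client.num_examples)

    total = float(sum(sizes))
    new_state = {}
    for name in global_state:
        # Data-size-weighted average: theta^{r+1} = sum_m (n_m / n) * theta^r_m
        new_state[name] = sum((n / total) * upd[name].float()
                              for n, upd in zip(sizes, updates))
    return new_state
```

Note that every round ships the full parameter set in both directions, which is exactly the overhead this paper argues against.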
This straightforward idea has been implemented by prior works (Roosta et al., 2021; Passban et al., 2022) that directly apply FedAvg to machine translation and introduce parameter pruning strategies during node communication. Despite this, multi-round model-based interactions are impractical and inefficient for NMT applications. Current models heavily rely on deep neural networks as the backbone, and their parameter counts can reach tens or even hundreds of millions, bringing vast computation and communication overhead. In real-world scenarios, different clients (i.e., users and enterprises) usually have limited computation and communication capabilities, making it difficult to meet the frequent model training and node communication requirements of the standard FL workflow. Further, due to capability differences between clients, heavy synchronization also hinders the efficiency of the FL workflow. Reducing the number of interactions eases this problem but incurs significant performance loss. Inspired by the recent remarkable performance of memorization-augmented techniques (e.g., the k-nearest-neighbor, kNN, approach) in natural language processing (Khandelwal et al., 2020; 2021; Zheng et al., 2021a;b) and computer vision (Papernot & Mcdaniel, 2018; Orhan, 2018), we take a new perspective on the federated NMT training problem. In this paper, we propose a novel Federated Nearest Neighbor (FedNN) machine translation framework, which equips the public NMT model trained on large-scale accessible data with a kNN classifier and integrates the external datastore constructed from the private data of all clients to form the final FL model. In this way, we replace the multi-round model-based interactions of the conventional FL paradigm with a one-round encrypted memorization-based interaction to share knowledge among different clients and drastically reduce computation and communication overhead. Specifically, FedNN follows a similar server-client architecture. The server holds large-scale accessible data to construct the public NMT model for all clients, while each client leverages its local private data to build an external datastore that is collected to augment the public NMT model via kNN retrieval. Based on this architecture, the key is to merge and broadcast all datastores built by different clients while avoiding privacy leakage. We design a two-phase datastore encryption strategy that adopts an adversarial mode between server and clients to preserve privacy during the memorization-based interaction. On the one hand, the server builds a (K, V)-encryption model for the clients to increase the difficulty of reconstructing private text from the datastores constructed by other clients; the K-encryption model is coupled with the public NMT model to ensure the correctness of kNN retrieval. On the other hand, all clients apply a shared content-encryption model to their local datastores during the collection process, so that the server cannot directly access the original datastores. During inference, each client uses the corresponding content-decryption model to obtain the final integrated datastore. A simplified sketch of this one-round interaction is given below.
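The following highly simplified sketch is only meant to fix ideas about the one-round interaction; `decoder_states`, `encrypt_key` (standing in for the server-provided K-encryption model) and `encrypt_value` (standing in for the shared content-encryption model) are assumed placeholder interfaces, not the paper's actual code.

```python
from typing import Callable, List, Sequence, Tuple

import torch

EncryptedEntry = Tuple[torch.Tensor, bytes]   # (encrypted key, encrypted value)
Datastore = List[EncryptedEntry]


def build_client_datastore(nmt_model,
                           bitext: Sequence[Tuple[str, Sequence[int]]],
                           encrypt_key: Callable[[torch.Tensor], torch.Tensor],
                           encrypt_value: Callable[[int], bytes]) -> Datastore:
    """Each client turns its private parallel data into encrypted (key, value) pairs.

    Keys are decoder context representations produced by the shared public NMT
    model (so retrieval stays compatible across clients); values are target tokens.
    """
    datastore: Datastore = []
    for src, tgt in bitext:
        # Force-decode the reference translation to get one hidden state per token
        # (assumed helper returning a [len(tgt), dim] tensor).
        hiddens = nmt_model.decoder_states(src, tgt)
        for h, token in zip(hiddens, tgt):
            datastore.append((encrypt_key(h), encrypt_value(token)))
    return datastore


def merge_on_server(client_datastores: List[Datastore]) -> Datastore:
    """The server only concatenates encrypted datastores; it never sees plaintext."""
    merged: Datastore = []
    for ds in client_datastores:
        merged.extend(ds)
    return merged
```

At inference time, each client downloads the merged datastore, decrypts the values with the shared content-decryption model, and plugs the result into a kNN-augmented decoder.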
We set up several FL scenarios (i.e., Non-IID and IID settings) with a multi-domain English-German (En-De) translation dataset and demonstrate that FedNN not only drastically decreases computation and communication costs compared with FedAvg, but also achieves state-of-the-art translation performance in the Non-IID setting. Additional experiments verify that FedNN scales easily to large numbers of clients with sparse data, thanks to the memorization-based interaction across different clients. Our code is open-sourced at https://github.com/duyichao/FedNN-MT.

2 FEDNMT: FEDERATED NEURAL MACHINE TRANSLATION

Current commercial NMT systems are built on a large-scale corpus collected by the service provider and directly applied to different users and enterprises. However, this mode struggles to flexibly satisfy the model customization and privacy protection requirements of users and enterprises. In this work, we focus on a more general application scenario, where users and enterprises participate in collaboratively training NMT models with the service provider, but the service provider cannot directly access the private data. Formally, this application scenario consists of $|C|$ clients (i.e., users or enterprises) and a central server (i.e., the service provider). The central server holds vast accessible translation data $D_s = \{(x^i_s, y^i_s)\}_{i=1}^{|D_s|}$, where $x^i = (x^i_1, x^i_2, \ldots, x^i_{|x^i|})$ and $y^i = (y^i_1, y^i_2, \ldots, y^i_{|y^i|})$ (for brevity, we omit the subscript $s$ here) are text sequences in the source and target languages, respectively. The central server can easily train a public NMT model $f_\theta$ on this corpus, where $\theta$ denotes the model parameters. Each client $c$ holds private data $D_c = \{(x^i_c, y^i_c)\}_{i=1}^{|D_c|}$, which is usually sparse in practice (i.e., $|D_c| \ll |D_s|$) and accessible only to itself. This setting naturally falls into the federated learning framework. The straightforward idea is to apply the vanilla FL method (i.e., FedAvg) or its variants (Roosta et al., 2021; Passban et al., 2022). Generally, FedAvg consists of multi-round model-based interaction updates between server and clients. At each round $r$, each client $c$ downloads the global model $f_{\theta^r}$ from the server and optimizes it on $D_c$. The local update $\theta^r_c$ is then uploaded to the server, which aggregates these updates into a new model $f_{\theta^{r+1}}$ via simple parameter averaging: $\theta^{r+1} = \sum_{m=1}^{|C|} \frac{n_m}{n} \theta^r_m$, where $n_m$ denotes the number of data points in the $m$-th client's private data and $n$ is the total number of training examples. However, such an FL workflow is inefficient for the above application scenario because the parameter count of NMT models typically reaches tens or even hundreds of millions, bringing vast computation and communication overhead. The system heterogeneity between server and clients, i.e., mismatched bandwidth, computation resources, etc., also makes it difficult to satisfy the frequent update and communication requirements of the standard FL workflow.

3 FEDNN: FEDERATED NEAREST NEIGHBOR MACHINE TRANSLATION

Inspired by advanced memorization-augmented techniques, e.g., kNN-MT (Khandelwal et al., 2021), which has shown a promising ability to directly augment a pre-trained NMT model with external knowledge via kNN retrieval, we explore leveraging a one-round memorization-based interaction rather than multi-round model-based interactions to achieve knowledge sharing across different clients; the basic kNN-MT decoding step is sketched below for reference.
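For reference, kNN-MT decoding interpolates the NMT output distribution with a retrieval distribution over the datastore. The sketch below is a generic, self-contained formulation of that step, not the paper's code; `k`, `temperature` and `lam` are the usual kNN-MT hyper-parameters, and a real system would use FAISS instead of brute-force distances.

```python
import torch
import torch.nn.functional as F


def knn_mt_step(p_nmt: torch.Tensor, query: torch.Tensor,
                keys: torch.Tensor, values: torch.Tensor,
                k: int = 8, temperature: float = 10.0,
                lam: float = 0.5) -> torch.Tensor:
    """Interpolate the NMT distribution with a kNN distribution over a datastore.

    p_nmt:  [vocab] softmax output of the NMT model at the current step
    query:  [d]     current decoder hidden state
    keys:   [n, d]  datastore keys (decoder states seen during force-decoding)
    values: [n]     datastore values (target-token ids, LongTensor)
    """
    # Retrieve the k nearest keys by L2 distance.
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # [n]
    knn_dist, knn_idx = dists.topk(k, largest=False)

    # Turn negative distances into a distribution over the retrieved tokens.
    weights = F.softmax(-knn_dist / temperature, dim=0)        # [k]
    p_knn = torch.zeros_like(p_nmt)
    p_knn.scatter_add_(0, values[knn_idx], weights)

    # p(y_t | x, y_<t) = lam * p_kNN + (1 - lam) * p_NMT
    return lam * p_knn + (1.0 - lam) * p_nmt
```

In FedNN, the keys and values of this datastore come from the merged, encrypted local datastores of all clients rather than from a single in-domain corpus.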
In this work, we design a novel Federated Nearest Neighbour (Fed NN) machine translation framework, which extends the promising capability of k NN-MT in the federated scenario and introduces two-phase datastore encryption strategy to avoid data privacy leakage. The whole approach complements the public NMT model built by the central server with a k NN classifier and safely collects the local datastore constructed by private text data from all clients to form the global FL model. The entire workflow of Fed NN is illustrated in Figure 1, consisting of initialization, one-round memorization-based interaction and model inference on clients. 3.1 INITIALIZATION Fed NN starts with the public NMT model and encryption models. The central server is responsible for optimizing the public NMT model fθ with Ds. Following k NN-MT (Khandelwal et al., 2021), the memorization (also called as datastore) is a set of key-value pairs. Given a sentence pair (xs, ys) Ds, we gain the context representation fθ(xs, ys,Medical>IT), since Fed Avg utilizes the size of different datasets as weights to aggregate client models. For the aggregation frequency, Fed Avg1 is much better than Fed Avg and more details can be found in Appendix C.2. We find that frequent aggregation significantly reduces the parameter conflicts between different models, but it brings high communication cost. (iii) FT-Ensemble is better than Fed Avg , indicating that the fusion of output probabilities leads to less knowledge conflict compared with model aggregation. (iv) Fed NN achieves an average 4.41/1.99 BLEU score improvement on the client test set compared to Fed Avg1 and C-NMT respectively, and maintains a competitive performance on the server test set. It demonstrates the effectiveness of Fed NN in capturing client-side knowledge by memorization and integrating it with P-NMT. (v) Although Fed NN slightly increases inference time, it not only improves translation quality, but also significantly reduces communication and computation overhead compared with other FL baselines, which is tolerable for clients. For the IID setting, we have some different findings: (i) Some FL methods that do not leverage server data in their training process (i.e., Fed Avg1, FT-Ensemble and Fed NN) outperform C-NMT on the client test set. The reason is that there is no statistical data heterogeneity among clients, resulting in fewer parameter conflicts and less conflict of probability outputs. (ii) The performance of Fed NN is slightly weaker than that in the Non-IID setting. It demonstrates that the benefit of the memorization-based interaction is more significant when the data distribution is more heterogeneous. Overall, Fed NN shows stable performance with less communication and computational overhead in Non-IID and IID settings, which verifies the practicality of memorization-based interaction mechanisms. More results and analysis are shown in Appendix B. 4.3 THE IMPACT OF CLIENT S NUMBER We further verify the effectiveness of Fed NN on a larger number of clients. We adopt the number of clients ranging from (3, 6, 12, 18) for quick experiments.1 The detailed results are shown in Figure 2. Comparisons with FL Methods. As the number of clients increases, we observe that: (i) Both Fed Avg1 and FT-Ensemble show varying degrees of performance degradation on the client test sets, especially for FT-Ensemble. We conjecture that the limited local data cannot support the training of local models and retain most of the knowledge of P-NMT. 
(ii) FedNN outperforms the FL baselines on both private and global test sets in the Non-IID setting, while in the IID setting it maintains performance similar to FedAvg1 on the private test sets and keeps a higher global performance. These results show that FedNN, benefiting from the memorization-based interaction, can quickly scale to large-scale client scenarios and avoid the performance loss caused by insufficient local private data. Further analysis of the FL methods is given in Appendix E.

Comparisons with Personalized Methods. We also compare FedNN with personalized methods, including FT (fine-tuning P-NMT with only local client-side data) and AK-MT (constructing a datastore with only local client-side data and decoding with the assisted Meta-k network). AK-MT and FT perform similarly, as AK-MT is able to capture personalized knowledge through local memorization. The performance of both AK-MT and FT tends to decrease in the Non-IID setting as the number of clients increases, while FedNN hardly degrades. In the IID setting, although the performance of all methods degrades, FedNN still achieves the best performance on all client test sets, because the global memorization provides more similar patterns than the local memorization to assist inference.

1 Due to the limited resources in our experiments, there are no additional domains to maintain the Non-IID setting when the number of clients increases. Thus, we directly separate the Non-IID and IID data distributions with a ratio of (1, 1/6). Note that the Non-IID setting here is not a strict one, but it is still worth exploring.

Figure 2: The translation performance of FL and personalized methods when the number of clients increases.

Figure 3: The impact of data distribution heterogeneity for different FL and personalized methods.

4.4 THE IMPACT OF DATA HETEROGENEITY

To further investigate the effect of data heterogeneity across the three clients on FL performance, we adopt a mixing ratio $\alpha \in \{0, 0.2, 0.4, 0.6, 0.8\}$ to construct the desired data distribution: we randomly take a proportion $\alpha$ from each domain to build an IID dataset, and the remaining domain data is then mixed with one third of this IID dataset to form the final data distribution. As $\alpha \to 0$, the partitions become more heterogeneous (Non-IID); conversely, larger $\alpha$ makes the data distribution more uniform. As shown in Figure 3, the performance of the personalized methods (FT and AK-MT) degrades as data heterogeneity decreases, which is caused by the reduction of available domain-specific data on each client. FT-Ensemble also degrades across all client test sets and is worse than FT, while FedAvg1 shows opposite performance trends between Law and IT/Medical. This is because, when FedAvg aggregates, each client's model weight is proportional to its data size, and as $\alpha$ increases, the client data sizes shift from $|D_{Law}| > |D_{Medical}| > |D_{IT}|$ towards an equal share of $\frac{1}{3}(|D_{Law}| + |D_{Medical}| + |D_{IT}|)$. Our FedNN maintains stable and remarkable performance across all client test sets and significantly outperforms the other methods on the server test set.
It indicates that the memorization-based interaction mechanism could capture and retain the knowledge of all clients, avoiding the knowledge conflict based on traditional model-based interaction. 4.5 QUANTITATIVE ANALYSIS OF PRIVACY We quantify the potential privacy-leaking risks of global memorization. Since all clients obtain the public NMT model, they could utilize their own datastore to train a reverse attack model to reconstruct the private data in Published as a conference paper at ICLR 2023 global memorization. In this experiment, we task one client (e.g., IT) as the attacker and others as the defenders (e.g., Medical and Law). The reconstruction BLEU (Papineni et al., 2002)/Precision(P)/Recall(R)/F1 scores are used to evaluate the degree of privacy leakage. The more experimental details are shown in Appendix D. As illustrated in Table 2, whether the input is an unencrypted key or a key encrypted by f KE, the threat model has very low scores in all defenders, especially for the recall score, meaning that it is difficult to recover and identify valuable information from global memorization. Furthermore, the f KE increases the difficulty of reconstructing the private text from the memorization constructed by other clients. We also provide some case studies in Appendix D.4, which better help qualitatively assess the safety of Fed NN. Table 2: The reconstruction BLEU/Precision(P)/Recall(R)/F1 score [%] of the attack model. Metric Datastore IT Medical IT Law Medical IT Medical Law Law IT Law Medical BLEU (K, V) 8.21 5.09 9.90 6.16 7.33 8.30 (Kf KE, V) 6.52 4.27 7.88 5.58 6.35 6.86 P/R/F1 (K, V) 14.55/2.90/4.84 35.78/7.43/12.3 12.18/7.53/9.31 23.18/11.65/15.51 11.15/7.04/8.63 9.75/4.85/6.48 (Kf KE, V) 14.73/2.30/3.98 41.26/5.28/9.36 11.88/7.43/9.14 12.18/7.53/9.31 11.81/6.35/8.26 9.62/4.04/5.69 5 RELATED WORK The FL algorithm for deep learning (Mc Mahan et al., 2017) is first proposed for language modeling and image classification tasks. Then theory and framework of FL are widely applied to many fields, including computer vision (Lim et al., 2020), data mining (Chai et al., 2021), and edge computing (Ye et al., 2020). Recently, researchers explore applications of FL in privacy-preserving NLP, such as next word prediction (Hard et al., 2018; Chen et al., 2019), aspect sentiment classification (Qin et al., 2021), relation extraction (Sui et al., 2021), and machine translation (Roosta et al., 2021; Passban et al., 2022). For machine translation, previous works directly apply Fed Avg for this task and introduce some parameter pruning strategies during node communication. However, multi-round model-based interactions are impractical and inefficient for NMT because of the huge computational and communication costs associated with large NNT models. Different from them, we design an efficient federated nearest neighbor machine translation framework that requires only one-round memorization interaction to obtain a high-quality global translation system. Memorization-augmented methods have attracted much attention from the community and achieved remarkable performance on many NLP tasks, including language modeling (Khandelwal et al., 2020; He et al., 2021), named entity recognition (Wang et al., 2022), few-shot learning with pre-trained language model (Bari et al., 2021; Nie et al., 2022), and machine translation (Khandelwal et al., 2021; Zheng et al., 2021a;b; Wang et al., 2021; Du et al., 2022). For the NMT system, Khandelwal et al. 
(2021) first propose k NN-MT, a simple and efficient non-parametric approach that plugs k NN classifier over a large datastore with traditional NMT models (Vaswani et al., 2017; Zhang et al., 2018a;b; Guo et al., 2020; Wei et al., 2020) to achieve significant improvement. Our work extends the promising capability of k NN-MT in the federated scenario and introduces two-phase datastore encryption strategy to avoid data privacy leakage. 6 CONCLUSION In this paper, we present a novel federated nearest neighbor machine translation framework to handle the federated NMT training problem. This FL framework equips the public NMT model trained on large-scale accessible data with a k NN classifier and safely collects all local datastores via a two-phase datastore encryption strategy to form the global FL model. Extensive experimental results demonstrate that our proposed approach significantly reduces computational and communication costs compared with Fed Avg, while achieving promising performance in different FL settings. In the future, we would like to explore this approach on other sequence-to-sequence tasks. Another interesting direction is to further investigate the effectiveness of our method on a larger number of clients, such as hundreds of clients with more domains. Published as a conference paper at ICLR 2023 ACKNOWLEDGEMENTS We thank the anonymous reviewers for helpful feedback on early versions of this work. We appreciate Wenxiang Jiao, Xing Wang, Longyue Wang and Zhaopeng Tu for the fruitful discussions. This work was done when the first author was an intern at Tencent AI Lab and supported by the grants from National Natural Science Foundation of China (No.62222213, 62072423), and the USTC Research Funds of the Double First-Class Initiative (No.YD2150002009). Zhirui Zhang, Tong Xu and Enhong Chen are the corresponding authors. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In EMNLP, 2015. M Saiful Bari, Batool Haider, and Saab Mansour. Nearest neighbour few-shot learning for cross-lingual classification. Ar Xiv, abs/2109.02221, 2021. Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. Findings of the 2014 workshop on statistical machine translation. In WMT@ACL, 2014. Kallista A. Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan Mc Mahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. Co RR, abs/1611.04482, 2016. URL http://arxiv.org/abs/1611.04482. Di Chai, Leye Wang, Kai Chen, and Qiang Yang. Secure federated matrix factorization. IEEE Intelligent Systems, 36:11 20, 2021. Mingqing Chen, Ananda Theertha Suresh, Rajiv Mathews, Adeline Wong, Cyril Allauzen, Franccoise Beaufays, and Michael Riley. Federated learning of n-gram language models. In Co NLL, 2019. Chenhui Chu and Rui Wang. A survey of domain adaptation for neural machine translation. In COLING, 2018. Yichao Du, Weizhi Wang, Zhirui Zhang, Boxing Chen, Tong Xu, Jun Xie, and Enhong Chen. Non-parametric domain adaptation for end-to-end speech translation. In EMNLP, 2022. Taher El Gamal. A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Trans. Inf. Theory, 31:469 472, 1984. Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, and Enhong Chen. 
Incorporating bert into parallel sequence decoding with adapters. In Neur IPS, 2020. Andrew Hard, Kanishka Rao, Rajiv Mathews, Franc oise Beaufays, Sean Augenstein, Hubert Eichner, Chlo e Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. Ar Xiv, abs/1811.03604, 2018. Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William D. Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. Achieving human parity on automatic chinese to english news translation. Ar Xiv, abs/1803.05567, 2018. Published as a conference paper at ICLR 2023 Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. Efficient nearest neighbor language models. In EMNLP, 2021. Herv e J egou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:117 128, 2011. Jeff Johnson, Matthijs Douze, and Herv e J egou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7:535 547, 2021. Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In ICLR, 2020. Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Nearest neighbor machine translation. In ICLR, 2021. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Co RR, abs/1412.6980, 2015. Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In NMT@ACL, 2017. Tian Li, Anit Kumar Sahu, Ameet S. Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37:50 60, 2019. Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Tao Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22:2031 2063, 2020. H. B. Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Ag uera y Arcas. Communicationefficient learning of deep networks from decentralized data. In AISTATS, 2017. Feng Nie, Meixi Chen, Zhirui Zhang, and Xuan Cheng. Improving few-shot performance of language models via nearest neighbor calibration. Ar Xiv, abs/2212.02216, 2022. Goldreich Oded. Foundations of cryptography: Volume 2, basic applications. 2004. A. Emin Orhan. A simple cache model for image recognition. Ar Xiv, abs/1805.08709, 2018. Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, 2019. Nicolas Papernot and Patrick Mcdaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. Ar Xiv, abs/1803.04765, 2018. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002. Peyman Passban, Tanya Roosta, Rahul Gupta, Ankit R. Chadha, and Clement Chung. Training mixed-domain translation models via federated learning. In NAACL, 2022. Han Qin, Guimin Chen, Yuanhe Tian, and Yan Song. Improving federated learning for aspect-based sentiment analysis via topic memories. In EMNLP, 2021. Ronald L. Rivest, Adi Shamir, and Leonard M. Adleman. 
A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 21:120 126, 1978. Published as a conference paper at ICLR 2023 Tanya Roosta, Peyman Passban, and Ankit R. Chadha. Communication-efficient federated learning for neural machine translation. Ar Xiv, abs/2112.06135, 2021. Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. Bleurt: Learning robust metrics for text generation. In ACL, 2020. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. Ar Xiv, abs/1508.07909, 2016. Dianbo Sui, Yubo Chen, Kang Liu, and Jun Zhao. Distantly supervised relation extraction in federated settings. In EMNLP, 2021. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neur IPS, 2017. Dongqi Wang, Hao-Ran Wei, Zhirui Zhang, Shujian Huang, Jun Xie, Weihua Luo, and Jiajun Chen. Nonparametric online learning from human feedback for neural machine translation. In AAAI Conference on Artificial Intelligence, 2021. Shuhe Wang, Xiaoya Li, Yuxian Meng, Tianwei Zhang, Rongbin Ouyang, Jiwei Li, and Guoyin Wang. knn-ner: Named entity recognition with nearest neighbor search. Ar Xiv, abs/2203.17103, 2022. Hao-Ran Wei, Zhirui Zhang, Boxing Chen, and Weihua Luo. Iterative domain-repaired back-translation. In EMNLP, 2020. Yunfan Ye, Shen Li, Fang Liu, Yonghao Tang, and Wanting Hu. Edgefed: Optimized federated learning based on edge computing. IEEE Access, 8:209191 209198, 2020. Junpeng Zhang, Mengqian Li, Shuiguang Zeng, Bin Xie, and Dongmei Zhao. A survey on security and privacy threats to federated learning. In 2021 International Conference on Networking and Network Applications, Na NA 2021, Lijiang City, China, October 29 - Nov. 1, 2021, pp. 319 326. IEEE, 2021. doi: 10.1109/Na NA53684.2021.00062. URL https://doi.org/10.1109/Na NA53684.2021. 00062. Zhirui Zhang, Shujie Liu, Mu Li, M. Zhou, and Enhong Chen. Joint training for neural machine translation models with monolingual data. In AAAI Conference on Artificial Intelligence, 2018a. Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. Regularizing neural machine translation by target-bidirectional agreement. In AAAI Conference on Artificial Intelligence, 2018b. Xin Zheng, Zhirui Zhang, Junliang Guo, Shujian Huang, Boxing Chen, Weihua Luo, and Jiajun Chen. Adaptive nearest neighbor machine translation. In ACL, 2021a. Xin Zheng, Zhirui Zhang, Shujian Huang, Boxing Chen, Jun Xie, Weihua Luo, and Jiajun Chen. Nonparametric unsupervised domain adaptation for neural machine translation. In EMNLP(findings), 2021b. Published as a conference paper at ICLR 2023 A IMPLEMENTATION DETAILS AND EVALUATION The statistics of the dataset and datastore used by the server/clients are listed in Table 3 and Table 4, respectively. We follow the recipe 2 to perform data pre-processing. The Moses toolkit 3 is used to tokenize all sentences and learn bpe-code in the publicly available corpus WMT14. Based on this, we split all the words of the above datasets into subword units (Sennrich et al., 2016). All experiments are implemented based on the FAIRSEQ toolkit (Ott et al., 2019). We train the public model on the WMT14 En-De dataset and use it as the initialization model for all methods. We adopt Transformer (Vaswani et al., 2017) as model structure of all baselines, in which it consists of 6 transformer encoder layers, and 6 transformer decoder layers. 
The input embedding size of the transformer layer is 512, the FFN layer dimension is 2048, and the number of self-attention heads is 8. During training, we deploy the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 5e-4 and 4K warm-up updates to optimize model parameters. Both label smoothing coefficient and dropout rate are set to 0.1. The batch size is set to 16K tokens. We train all models with 4 Tesla-V100 GPU and set patience to 5 to select the best checkpoint on the validation set. The FAISS (Johnson et al., 2021) is leveraged to construct the datastore and we use its Index IVFPQ strategy to implement Product Quantizer K-encryption and fast nearest neighbor search. We utilize the FAISS to learn 4096 cluster centroids on public datastore, and apply it to client s datastore. During inference, the beam size and length penalty are set to 5 and 1 for all methods and we search 64 clusters for each target token when using FAISS. In all experiments, we report the case-sensitive BLEU score (Papineni et al., 2002) using sacre BLEU4. We estimate the number of floating-point operations (FLOPs) used to train the model by multiplying the training time, the number of GPUs used, and an estimation of the sustained single-precision floating-point capacity of each GPU5. Table 3: The statistics of datasets for server and clients. Server Client WMT14 IT Medical Law Train 4,475,414 222,927 248,009 467,309 Dev 45,206 2,000 2,000 2,000 Test 3,003 2,000 2,000 2,000 Table 4: The statistics of datastores for server and clients. Server Client Global WMT14 IT Medical Law (K, V) size 117,427,034 3,085,523 5,858,648 16,868,065 25,812,236 Hard Disk Space (Datastore) 114 GB 3,938 MB 6,890 MB 17,717 MB 28,545 MB Hard Disk Space (Faiss Index) 8,988 MB 244 MB 451 MB 1,266 MB 1,978 MB 2https://github.com/facebookresearch/fairseq/blob/main/examples/translation/prepare-wmt14en2de.sh 3https://github.com/moses-smt/mosesdecoder 4https://github.com/mjpost/sacrebleu, with a configuration of 13a tokenizer, case-sensitiveness, and full punctuation 5The single-precision floating-point capacity for Tesla-V100 GPU is 14 TFLOPs. Published as a conference paper at ICLR 2023 B MORE RESULTS FOR THE NON-IID SETTING B.1 PERFORMANCE COMPARISONS WITH CONTROLLER We compare the performance (BLEU) and overhead of Fed NN, Fed Avg, and Controller in the Non-IID setting of the En2De translation task. For the Controller model, as shown in Roosta et al. (2021) s study, 6E-6D/C-C(0-3) model achieves the best trade-off of performance and efficiency among all FL methods. Thus, we follow their setup and adopt layers 0 and 3 (both for the encoder and decoder) as controllers to participate in the parameter interaction of Fed Avg training. The experimental results are shown in the Table 5. We find that the Controller has a significant performance improvement compared to P-NMT, but is still worse than Fed Avg1 and Fed NN. In addition, since the Controller falls into the multi-round model-based FL interaction paradigm, its communication overhead is still much higher than Fed NN. Table 5: The performance and overhead comparison with Controller. . Comm. and Comp. refer to communication and computational cost in GB and FLOPs respectively Methods Client Test Server Test Overall Performance Cost IT Law Medical WMT14 Client Global Comm. Comp. 
P-NMT 26.62 35.91 30.27 26.63 30.93 29.86 Controller 27.78 46.30 35.62 18.72 36.57 +5.63 32.11 +2.25 10.86 6.77 1017 Fed Avg1 28.26 53.00 45.90 13.45 42.39 +11.45 35.15 +5.30 388.12 3.23 1018 Fed NN 35.62 55.57 49.21 22.29 46.80 +15.87 40.67 +10.82 5.08 6.72 1015 B.2 EVALUATION WITH BLEURT We evaluate the two settings in Table 1 using the neural metric - BLEURT (Sellam et al., 2020). The detailed results are shown in Table 6. We can get similar conclusions when using the BLEU score as an evaluation metric, i.e., for the Non-IID setting, our Fed NN significantly outperforms all other FL methods; for the IID setting, our Fed NN also achieves comparable performance to the Fed Avg1 and FT-Ensemble. Table 6: BLEURT score [%] of different methods in Table 1. Methods Client Test Server Test Overall Performance IT Law Medical WMT14 Client Global C-NMT 70.46 78.49 74.97 71.62 74.64 - 73.89 - P-NMT 62.00 72.65 65.64 71.93 66.76 - 68.06 - Fed Avgs 1 63.02 73.71 67.06 72.19 67.93 +1.17 69.00 +0.94 Fed Avg1 64.09 78.05 72.74 53.61 71.63 +4.86 67.12 -0.93 Fed Avgs 62.87 74.11 67.50 72.20 68.16 +1.40 69.17 +1.11 Fed Avg 52.63 77.37 65.02 56.26 65.01 -1.76 62.82 -5.24 FT-Ensemble 64.08 72.47 70.01 59.17 68.85 +2.09 66.43 -1.62 Fed NN 68.86 78.12 72.74 67.38 73.24 +6.48 71.78 +3.72 Fed Avgs 1 66.38 76.31 71.65 72.46 71.45 +4.68 71.70 +3.65 Fed Avg1 69.93 78.67 75.68 55.78 74.76 +8.00 70.02 +1.96 Fed Avgs 64.92 74.32 68.68 72.15 69.31 +2.54 70.02 +1.96 Fed Avg 68.96 78.08 74.39 59.44 73.81 +7.05 70.22 +2.16 FT-Ensemble 70.28 78.80 75.04 58.78 74.71 +7.94 70.73 +2.67 Fed NN 67.69 77.74 72.31 68.16 72.58 +5.82 71.48 +3.42 Published as a conference paper at ICLR 2023 B.3 SIGNIFICANT TEST FOR TABLE 1 We use the bootstrap re-sampling method to test the significant difference between Fed NN and other methods. Table 7 shows the significance test results of English-German direction under the Non-IID setting. The - means that Fed NN is not significantly better than the method. We can find that Fed NN significant outperforms all FL methods, including Fed Avg and FT-Ensemble. Table 7: The significant test between Fed NN and other methods for the Non-IID setting in Table 1. Methods Client Test Server Test IT Law Medical WMT14 C-NMT - 0.01 0.05 - P-NMT 0.01 0.01 0.01 0.01 Fed Avg1 0.01 0.01 0.05 0.01 FT-Ensemble 0.01 0.01 0.01 0.05 B.4 PERFORMANCE COMPARISONS ON GERMAN-ENGLISH DIRECTION As illustrated in Table 8, we report the performance of different FL methods in the Non-IID setting of German-English Direction. We observe that the findings in the German-English direction remain consistent with the English-German direction (shown in Table 1), in which Fed NN outperforms other methods in terms of overall performance both client-side and globally. Table 8: Performance of different methods in the German-English direction for the Non-IID setting. Methods Client Test Server Test Overall Performance IT Law Medical WMT14 Client Global P-NMT 31.70 39.86 34.37 31.64 35.31 - 34.39 - Fed Avg1 32.22 58.32 48.56 16.83 46.37 +11.06 38.98 +4.59 FT-Ensemble 35.76 44.07 43.20 21.48 41.01 +5.70 36.13 +1.74 Fed NN 41.11 60.18 53.44 27.12 51.58 +16.27 45.46 +11.07 C ABLATION STUDY ON THE NON-IID SETTING C.1 THE IMPACT OF CLIENT DATA SIZE ON DIFFERENT FL METHODS We carry out an ablation study to verify the impact of client data size on different FL methods, including Fed Avg1, FT-Ensemble and Fed NN. 
For each domain, we adopt a ratio range of β {0.0, 0.2, 0.4, 0.6, 0.8} to randomly sample from its complete data to constitute the client data of different scales. The detailed results are shown in the Table 9. We can observe that the performance and cost of all methods increase as the size of the client data increases. Moreover, Fed NN significantly outperforms other FL methods in terms of performance, communication and computational overhead. C.2 THE IMPACT OF INTERACTION FREQUENCY ON FEDAVG We conduct experiments to analyze the impact of model interaction frequency on the Fed Avg performance. We set the frequency to k {1, 2, 5, 10, 20, }, i.e., the client interacts the model with the server after k Published as a conference paper at ICLR 2023 Table 9: The impact of client data size on the performance and cost of different methods. refers to the improvement of methods compared with P-NMT. Comm. and Comp. refer to communication and computational cost in GB and FLOPs respectively. β Methods Client Test Server Test Overall Performance Cost IT Law Medical WMT14 Client Global Comm. Comp. 0.2 P-NMT 26.62 35.91 30.27 26.63 30.93 29.86 Fed Avg1 20.27 47.42 33.92 15.73 33.87 +2.94 29.34 -0.52 111.59 9.27 1017 FT-Ensemble 29.95 40.25 37.41 20.81 35.87 +4.94 32.11 +2.25 4.85 1.40 1017 Fed NN 29.65 46.89 41.46 22.72 39.33 +8.40 35.18 +5.32 2.00 1.34 1015 Fed Avg1 20.99 48.98 35.58 16.58 35.18 +4.25 30.53 +0.67 116.44 9.68 1017 FT-Ensemble 29.60 39.50 38.45 21.43 35.85 +4.92 32.25 +2.39 4.85 2.81 1017 Fed NN 30.41 50.30 43.89 23.32 41.53 +10.60 36.98 +7.12 2.77 2.69 1015 Fed Avg1 22.36 52.38 42.04 12.83 38.93 +7.99 32.40 +2.55 245.00 2.04 1018 FT-Ensemble 28.32 40.55 40.19 18.61 36.35 +2.48 31.92 +2.06 4.85 4.21 1017 Fed NN 31.65 52.29 45.65 23.68 43.20 +12.26 38.32 +8.46 3.54 4.03 1015 Fed Avg1 24.11 52.71 43.72 12.76 40.18 +9.25 33.33 +3.47 354.74 2.92 1017 FT-Ensemble 29.01 38.16 39.30 16.87 35.49 +4.56 30.84 +0.98 4.85 5.62 1017 Fed NN 32.15 54.11 46.62 22.22 44.29 +13.36 38.78 +8.92 4.31 5.38 1015 rounds of local updates. We set the total computation overhead to be the same for a fair comparison of translation performance and communication overhead. The detailed results are shown in the Table 10. We find that the performance decreases significantly as k increases, especially when k = (i.e., a copy of the server model is trained locally until convergence), and the average performance drops to a level similar to that of P-NMT. The reason is that too many local updates suffer from catastrophic forgetting of knowledge from the previous aggregated models, resulting in strong knowledge conflicts in the new round of interactions. The optimal performance is presented at k = 1, which means that frequent interactions are essential to alleviate the knowledge conflicts for Fed Avg. Table 10: BLEU score [%] and communication cost of Fed Avg with different interaction frequency. . Comm. Cost refer to communication cost in GB . Methods Client Test Server Test Overall Performance Comm. 
IT Law Medical WMT14 Client Global Cost P-NMT 26.62 35.91 30.27 26.63 30.93 29.86 Fed Avg1 28.26 53.00 45.90 13.45 42.39 +11.45 35.15 +5.30 388.12 Fed Avg2 26.37 52.92 44.22 13.07 41.17 +10.24 34.15 +4.29 194.06 Fed Avg5 24.93 52.23 41.08 12.85 39.41 +8.48 32.77 +2.92 77.62 Fed Avg10 22.63 51.11 38.60 12.65 37.45 +6.51 31.25 +1.39 38.81 Fed Avg20 21.82 49.53 36.24 12.75 35.86 +4.93 30.09 +0.23 19.41 Fed Avg 17.30 47.06 30.61 13.33 31.57 +0.63 27.01 -2.85 4.85 C.3 THE IMPACT OF ENSEMBLE STRATEGY ON FT-ENSEMBLE The ensemble strategy of FT-Ensemble could be implemented in two ways: the first is to directly average the probability distribution of each client model (as used in the Table 1), i.e., FT-Ensemble; the second, similar Published as a conference paper at ICLR 2023 Table 11: BLEU score [%] of FT-Ensemble with different aggregation strategy. Methods Client Test Server Test Overall Performance IT Law Medical WMT14 Client Global Public Model 26.62 35.91 30.27 26.63 30.93 - 29.86 - FT-Ensemble 30.11 38.14 39.15 17.13 35.80 4.87 31.13 1.28 FT-Ensemble-Wei 24.09 48.58 34.11 16.06 35.59 4.66 30.71 0.85 Fed NN 35.62 55.57 49.21 22.29 46.80 15.87 40.67 10.82 Table 12: The impact of P-NMT s quality. refers to the improvement of the method compared with the model mentioned in Table 1. Methods IT Law Medical WMT14 Clients Avg. Global Avg. BLEU BLEU BLEU BLEU BLEU BLEU P-NMT 30.72 +4.10 38.69 +2.78 35.90 +5.63 29.77 +3.14 35.10 +4.17 33.77 +3.91 Fed Avg1 28.63 +0.37 58.32 +5.32 49.08 +3.18 16.05 +2.60 45.34 +2.95 38.02 +2.87 Fed NN 38.24 +2.62 55.76 +0.19 50.65 +1.44 22.65 +0.36 48.22 +1.42 41.83 +1.16 to Fed Avg, is to weight the probability distribution of each client model s output by assigning weights to it according to its data size, i.e., FT-Ensemble-Wei. The performance comparison of these two ways in the Non-IID setting are shown in the Table 11. We find that FT-Ensemble outperforms FT-Ensemble-Wei in both client-side and global overall performance. FT-Ensemble has a more balanced performance on the client side, while FT-Ensemble-Wei is similar to Fed Avg in that the performance is more biased towards the client Law s model, which has more local data. Our Fed NN outperforms both of these methods on all clients. Note that the two implementations of FT-Ensemble described above in the IID setting are equivalent since the data size is the same for all client. C.4 THE IMPACT OF PUBLIC MODEL S QUALITY Since Fed NN performs federated learning based on the P-NMT, we investigate the impact of the P-NMT s quality on performance. We introduce WMT20 En-De data to train the P-NMT, which contains 40 million parallel pairs, and conduct fast experiments in the Non-IID setting. From Table 12, we can observe that as the quality of the P-NMT improves, all methods show better performance. D THE DETAILS OF PRIVACY LEAKAGE ANALYSIS D.1 DATASET CONSTRUCTION Given a local parallel sentence pair (xc, yc) Dc of client attacker, the public NMT model generates the context representation k = fθ(xc, yc, xc <2tgt> yc, and <2tgt> are used to identify the generation of source and target languages, respectively. By traversing the entire Dc, we obtain the whole dataset R = {r1, r2, . . . , rn} used to train the threat model, where n = P|Dc| i |yi| + |Dc|. The detailed statistics of dataset used for threat model are shown in Table 13. Published as a conference paper at ICLR 2023 Table 13: The statistics of datasets for the threat model. 
          IT          Medical     Law
Train     3,085,523   5,858,648   16,868,065
Dev       34,737      55,577      51,423
Test      2,000       2,000       2,000

Figure 4: The threat model based on the autoregressive paradigm.

D.2 THE ARCHITECTURE OF THE THREAT MODEL

The goal of the threat model is to reconstruct the corresponding original text from the memorization $(k, v)$ of the client defender. As shown in Figure 4, we use a Transformer decoder as the architecture of the threat model, which resembles a left-to-right language model under the auto-regressive paradigm. It consists of 6 Transformer layers; the input embedding size is 512, the FFN layer dimension is 2048, and the number of self-attention heads is 8. We first project the first input token $k$ to the same dimension as the word embedding using a linear layer, and then perform auto-regressive left-to-right reconstruction modeling.

D.3 EVALUATION OF PRIVACY LEAKAGE

We quantify the private information leaked by the global memorization using sentence-level and word-level metrics, i.e., reconstruction BLEU and privacy-word hitting Precision/Recall/F1. Assume the text recovered by the threat model from memorization $(k_i, v_i)$ is $h_i = \{h_{i,1}, h_{i,2}, \ldots, h_{i,|h_i|}\}$ and the ground truth is $g_i = \{g_{i,1}, g_{i,2}, \ldots, g_{i,|g_i|}\}$, where $i \in \{1, 2, \ldots, N\}$ indexes the test samples. We calculate the reconstruction BLEU score using sacreBLEU. Before evaluating the word-level privacy leakage, we need to extract the privacy dictionary of the client defender. The privacy dictionary is obtained by computing the difference between the word distribution of the defender's private dataset and that of the server's public dataset. We then filter $h_i$ and $g_i$ according to this dictionary to obtain sentences $h^p_i$ and $g^p_i$ that contain only privacy words. The word-level metrics are then computed as follows:
$$\mathrm{Precision} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{|g^p_i|} \mathrm{Count_{hit}}(g^p_{i,j}, h^p_i)}{\sum_{i=1}^{N} |h^p_i|}, \qquad \mathrm{Recall} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{|h^p_i|} \mathrm{Count_{hit}}(h^p_{i,j}, g^p_i)}{\sum_{i=1}^{N} |g^p_i|}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where $\mathrm{Count_{hit}}(x, y)$ indicates whether $x$ appears in $y$.

D.4 QUALITATIVE ANALYSIS OF PRIVACY

Some qualitative cases are illustrated in Table 15. We find that the style of all reconstructed texts remains consistent with the attacker's training data, including text length and domain style. For example, in Case 1 and Case 2, the texts reconstructed by the attack model trained on the Law domain exhibit the domain style of the Law client. This means that it is difficult to recover and identify valuable information, such as domain-specific and private words, from the global memorization.

E COST ANALYSIS OF FL METHODS FOR DIFFERENT NUMBERS OF CLIENTS

The communication and computational costs of different FL methods are illustrated in Table 14. For communication, the cost of FedAvg is much higher than that of FedNN and FT-Ensemble, because FedAvg requires multiple rounds of model-based communication, whereas FedNN and FT-Ensemble require only a single round of communication (memorization-based for FedNN, model-based for FT-Ensemble). For computation, the cost of FT-Ensemble grows linearly with the number of clients; it cannot be extended to practical applications because of the number of local models that must be ensembled for inference. In contrast, the cost of FedNN is only 1/60 and N/2 of that of FedAvg1 and FT-Ensemble, respectively; the scaling behaviour behind these numbers is sketched below.
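The helper below simply encodes the complexity expressions listed in the Compl. column of Table 14; it is a back-of-the-envelope scaling guide under those expressions only, not the paper's accounting, and the measured totals reported in Table 14 need not match it exactly.

```python
def communication_cost_gb(model_mb: float, num_clients: int,
                          rounds: int, datastore_mb: float) -> dict:
    """Rough per-method communication cost (GB), following Table 14's Compl. column.

    FedAvg:      M * N * R * 2   (model travels up and down every round)
    FT-Ensemble: M * N * (N + 1) (one round, but every client needs every model)
    FedNN:       (D + M) * N     (one round of datastore/model exchange)
    M = model size (MB), N = number of clients, R = rounds,
    D = total size of all encrypted datastores (MB).
    """
    m, n, r, d = model_mb, num_clients, rounds, datastore_mb
    return {
        "FedAvg": m * n * r * 2 / 1024,
        "FT-Ensemble": m * n * (n + 1) / 1024,
        "FedNN": (d + m) * n / 1024,
    }
```

Whatever the exact constants, the qualitative picture is the one the table reports: FedAvg's cost grows with the number of rounds R, FT-Ensemble's grows quadratically with the number of clients N, while FedNN pays a one-off cost that is linear in N.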
Considering many clients limited communication bandwidth and computational resources, Fed NN is a promising framework selection to save a lot of communication time and computational consumption. Table 14: The communication cost and computation cost of different methods, where M, N, R and D respectively represent the model size (414MB), number of client, rounds of communication (160) and the total size of all encrypted datastores (1978MB). Communication Cost (GB) Computation Cost (FLOPs) Compl. 3 6 12 18 3 6 12 18 Fed Avg M N R 2 388.12 776.25 1552.50 3105.00 3.23 1018 3.23 1018 3.23 1018 3.23 1018 FT-Ensemble M N (N+1) 4.85 16.98 63.07 138.27 7.02 1017 1.40 1018 2.11 1018 2.82 1018 Fed NN (D+M) N 5.08 12.08 26.10 40.12 6.72 1015 6.72 1015 6.72 1015 6.72 1015 F LIMITATIONS In this paper, we utilizes one round of memorization-based interaction to share knowledge among different clients, thus building low-overhead privacy-preserving translation systems. We discuss limitations of our method as follows. Despite our proposed approach achieves strong performance when exploiting global memorization sharing, it leads to reduced inference efficiency due to the need for k NN retrieval. As shown in Table 1, the inference Published as a conference paper at ICLR 2023 speed of Fed NN is about 0.75 that of P-NMT. In practice, these costs may be acceptable since we employ FAISS to speed up k NN retrieval. We encourage future work to improve the efficiency of k NN retrieval. The communication overhead required for memorization-based interaction is positively correlated with the client data size. Extremely large client data will make our approach inapplicable because it leads to higher communication overhead. Our approach is more applicable to the generic scenario described in Section 2, i.e., private data is sparse (|Dc| |Ds|). We also encourage further exploration of how to build a smaller and more accurate memorization further to mitigate this problem. This paper is still very preliminary in the privacy leakage analysis of memorization interaction. Although the threat model on shared global memorization has a very low reconstruction scores, privacy leakage is still a potential risk. How to better evaluate and mitigate the privacy leakage of memorization remains an open question, which we leave for future work. Table 15: Examples of qualitative analysis for privacy leakage. Text in green / blue represent the defender- specific ground-truth and private words, respectively. Text in red represents the hit private words by attacker. The bold word represents the threat model of the client-side attacker, where the superscript f KE represents the input of training data k is encrypted K-Encryption, otherwise it is not. Case Examples Case 1: Defender is IT <2src> cursor ; quickly moving ; to an object <2tgt> cursor ; schnell zu einem Objekt bewegen Medicalf KE: <2src> curves , fainting, salivation, vomiting, diarrhoea, fainting, fainting, or vomiting, or diarrhoea. <2tgt> Kleben, Fainting, Speichelfluss, Erbrechen, Durchfall, Ohnmacht oder Erbrechen oder Durchfall oder Erbrechen Medical: <2src> curonium or vecuronium: <2tgt> Vecuronium oder Vecuronium: Lawf KE: <2src> palm oil falling within CN code 2710 00 90 <2tgt> Palm ol des KN-Codes 2710 00 90 Law: <2src> curbiting the use of the designation butter in Annex I to Regulation (EEC) No 3143 85 <2tgt> curbitration der Bezeichnung Butter in Anhang I der Verordnung (EWG) Nr. 
3143 / 85 Case 2: Defender is Medical <2src> Intravenous infusion after reconstitution and dilution. <2tgt> Intraven ose Infusion nach Aufl osung und Verd unnung. ITf KE: <2src> Inserts a placeholder. <2tgt> Hiermit f ugen Sie einen Platzhalter ein IT: <2src> Inserts a new row. <2tgt> F ugt eine neue Zeile ein. Lawf KE: <2src> Appointment of the date of minimum durability shall be given. <2tgt> Die Angabe des Law: <2src> The minimum of date durability <2tgt> Angabe des Mindesthaltbarkeitsdatums Case 3: Defender is Law <2src> The Commission consistently takes a favourable view of such aid . <2tgt> Derartige Beihilfen werden von der Kommission stets bef urwortet. ITf KE: <2src> The & kappname; Handbook <2tgt> Das Handbuch zu & kappname; IT: <2src> Following packages depend on the installed packages: <2tgt> Die folgenden Pakete h angen von den installierten Pakete ab: Medicalf KE: <2src> Most common side effects with Azarga (seen in between 1 and 10 patients in 100) areheadache, dizziness, somnolence (sleepiness), nausea (feeling sick), diarrhoea, abdominal tummy pain, diarrhoea, flatulence (gas), abdominal (tummy) pain, dyspepsia (indigestion), diarrhoea, nausea (feeling sick), vomiting, abdominal (tummy) pain, dyspepsia (indigestion), flatulence (wind)... Medical: <2src> European Commission granted a marketing authorisation valid throughout the European Union for Nobilis Influenza H5N6 to Intervet International BV on 24 April 2009. <2tgt> April 2009 erteilte die Europ aische Kommission dem Unternehmen Intervet International BV eine Gene -hmigung f ur das Inverkehrbringen von Nobilis Influenza H5N6 in der gesamten Europ aischen Union.