# Iron: Private Inference on Transformers

Meng Hao¹, Hongwei Li¹, Hanxiao Chen¹, Pengzhi Xing¹, Guowen Xu², Tianwei Zhang²
¹University of Electronic Science and Technology of China  ²Nanyang Technological University
{menghao,hanxiao.chen,p.xing}@std.uestc.edu.cn, hongweili@uestc.edu.cn, {guowen.xu,tianwei.zhang}@ntu.edu.sg
(This work was done at NTU as a visiting student.)

Abstract

We initiate the study of private inference on Transformer-based models in the client-server setting, where clients have private inputs and servers hold proprietary models. Our main contribution is a set of new secure protocols for matrix multiplication and for the complex non-linear functions that are central to Transformers: Softmax, GELU activations, and LayerNorm. Specifically, we first propose a customized homomorphic encryption-based protocol for matrix multiplication that crucially relies on a novel compact packing technique. This design achieves m× less communication (m is the number of rows of the output matrix) than the most efficient prior work. Second, we design efficient protocols for the three non-linear functions by integrating advanced underlying protocols with specialized optimizations. Compared to the state-of-the-art protocols, our recipes reduce the communication and computation overhead by about half. Furthermore, all protocols are numerically precise and preserve the accuracy of the plaintext model. Together, these techniques allow us to implement Iron, an efficient framework for private Transformer inference. Experiments on several real-world datasets and models demonstrate that Iron achieves 3×–14× less communication and 3×–11× less runtime than the prior art.

1 Introduction

Transformer-based models [1–5] have achieved tremendous success in natural language processing (NLP) and computer vision (CV) thanks to their strong representation capabilities. As a new neural network architecture, the Transformer [1] mainly relies on the self-attention mechanism to compute representations without sequence-aligned recurrence or convolution. Following this work, a number of Transformer variants, such as BERT [2] and GPT [3] in NLP, and ViT [4] and Swin Transformer [5] in CV, have achieved state-of-the-art performance on many real-world tasks. The success of Transformers and other large models has enabled emerging inference services and applications [6, 7]. In particular, a service provider trains a complex Transformer-based model and deploys it as a paid inference service, e.g., machine translation or question answering. A client queries this service with input samples and obtains the desired responses.

Unfortunately, current inference systems suffer from serious privacy concerns [8]. On the one hand, clients need to send confidential inputs to the service provider, which compromises their data privacy if the provider is untrusted. On the other hand, it is undesirable for the provider to distribute the proprietary Transformer-based model to clients, since constructing such a model requires a large amount of data and computation resources [9]. There thus exists a gap between unprecedented performance and privacy constraints, which motivates our study of private Transformer inference.
Private inference aims to protect the server's model weights from clients while guaranteeing that the server learns no information about clients' private inputs. Recently, private inference on traditional neural networks (e.g., convolutional neural networks) has been approached with secure two-party computation (2PC) techniques [10–14]. However, Transformers have an essentially different structure, and private Transformer inference therefore brings several new challenges. First, Transformer-based models rely on many high-dimensional matrix multiplications, rather than the matrix-vector multiplications widely studied in prior works. While prior matrix-vector multiplication protocols can be straightforwardly extended to our setting, even the most efficient design [14] incurs heavy communication because a large number of ciphertexts must be exchanged. Second, Transformer-based models use complex math functions, such as Softmax, GELU activations [15], and LayerNorm, in each block, rather than crypto-friendly non-linear functions such as ReLU and MaxPool. Existing methods either use precision-impaired high-order polynomial approximations [16, 11] or support only a limited set of math functions for specific scenarios [17]. Even worse, all of these approaches are computationally intensive and often require a large amount of communication (for more related work, please refer to Appendix A.5). To facilitate the widespread adoption of Transformer-based inference services in privacy-critical scenarios, designing efficient protocols for the above operations is of paramount importance.

In this paper, we design Iron, an efficient hybrid cryptographic framework for private Transformer inference that reveals no sensitive information about the server's model weights or the clients' inputs. Iron contributes several new specialized protocols for the complicated operations in Transformers to alleviate the performance overhead. Specifically, we first propose a customized homomorphic encryption-based protocol for matrix multiplication. Our insight is to pack more plaintext inputs into a single ciphertext by devising a compact packing method, while preserving the functionality of matrix multiplication. Compared to the most efficient matrix-vector multiplication solution, implemented in Cheetah [14], we achieve an m× improvement in communication overhead (m is the number of rows of the output matrix), which amounts to about an 8× reduction for various Transformer models. Second, we carefully design efficient protocols for Softmax, GELU, and LayerNorm. These protocols are built upon SIRNN [17], the state-of-the-art cryptographic framework for private inference on recurrent neural networks, with several customized optimizations, such as reducing the overhead of exponentiation in Softmax and simplifying GELU and LayerNorm. These optimizations achieve 1.3×–1.8× less runtime and 1.4×–1.8× less communication on the three non-linear functions. Furthermore, these protocols are numerically precise and preserve the accuracy of the plaintext model. We also give a formal security proof for the designed protocols.

Based on the above components, we implement the private Transformer inference framework Iron and conduct end-to-end experiments with various BERT architectures [2] (BERT-Tiny, BERT-Medium, BERT-Base, and BERT-Large) on GLUE benchmarks [18]. Note that Iron readily extends to other Transformer-based models (e.g., ViT), since they share very similar architectures and the same operations.
Experimental results show that Iron achieves 3×–14× less communication and 3×–11× less runtime than SIRNN on the four BERT models. Moreover, compared with the general-purpose state-of-the-art framework MP-SPDZ [16], Iron provides up to two orders of magnitude improvement in both communication and computation efficiency.

A concurrent work [19] proposes THE-X, a privacy-preserving Transformer inference scheme based on homomorphic encryption. Below, we highlight important differences in protocol design and security. (1) Protocol design. Our work designs new efficient protocols for the complex operations of Transformer-based models; orthogonal to ours, THE-X replaces them with crypto-friendly operations. For example, THE-X replaces GELUs with simpler operations, i.e., ReLUs, and Softmax with a combination of ReLU and polynomials. (2) Security. Our work achieves more rigorous privacy protection than THE-X. Specifically, we use homomorphic encryption and secret sharing to hide the private information (including intermediate results) of all layers. Such a rigorous privacy guarantee is in line with recent state-of-the-art private inference works [14, 17]. In THE-X, however, the inputs of each non-linear layer are leaked to the client, which may cause severe privacy leakage in real-world applications [13]. Our work may therefore be used to enhance the security of THE-X.

2 Preliminaries

2.1 Threat Model

As shown in the left part of Figure 1, Iron targets a general private inference scenario, where the server P0 holds a Transformer-based model M with private weights w, while the client P1 holds a private input x. Our framework enables the client to query the server's inference service and learn the output of the model on its input, i.e., M(w, x). As in prior works [14, 17], we consider an honest-but-curious adversary that passively corrupts either the server or the client, but not both. Such an adversary follows the protocol specification exactly, but may try to learn more information than allowed (e.g., the model's weights or inference inputs) by analyzing the data it receives. In Appendix A.1.2, we give a more formal description of the threat model for the security analysis.

2.2 Cryptographic Primitives

All our protocols are built on the 2-out-of-2 additive secret sharing (ASS) technique [21, 22] over the ring Z_{2ℓ}, in which an ℓ-bit input x is split into two random shares ⟨x⟩0 and ⟨x⟩1, held by P0 and P1, respectively, such that x = ⟨x⟩0 + ⟨x⟩1 mod 2ℓ. When ℓ = 1, i.e., over Z2, we write ⟨x⟩^B for Boolean shares. We use ⟨x⟩ to denote that Pb holds ⟨x⟩b for b ∈ {0, 1}. The security of ASS [21] guarantees that, given a single share ⟨x⟩0 or ⟨x⟩1, the value of x is perfectly hidden. This secret-sharing property is maintained throughout our private inference scheme. ASS naturally supports linear operations without communication. For instance, to compute cx + y from a public constant c and secret shares ⟨x⟩ and ⟨y⟩, Pb locally computes ⟨z⟩b = c·⟨x⟩b + ⟨y⟩b, so that ⟨z⟩0 + ⟨z⟩1 = cx + y mod 2ℓ. To realize richer functionalities on shared inputs, we invoke homomorphic encryption or oblivious transfer techniques [22].

Notations. Let x ∤ y denote that x is not a divisor of y. We use bold lower-case letters (e.g., x) to represent vectors and bold upper-case letters (e.g., X) to denote matrices. Like prior works [17, 14], we encode inputs with the fixed-point representation, denoted by Fix (refer to Appendix A.1.1 for details).
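For concreteness, the following minimal sketch (not Iron's code; the bit width, scale, and helper names are illustrative) shows the fixed-point encoding and the 2-out-of-2 additive sharing described above, together with a local evaluation of cx + y on shares.

```python
# Minimal sketch (not Iron's code): fixed-point encoding and 2-out-of-2 additive secret
# sharing over Z_{2^ell}; the bit width, scale, and helper names are illustrative.
import secrets

ell = 64                          # ring Z_{2^ell}
MOD = 1 << ell
scale = 12                        # fractional bits of the fixed-point encoding Fix(.)

def fix(v):                       # Fix(v): encode a real value into the ring
    return round(v * (1 << scale)) % MOD

def share(x):                     # split x into two random shares <x>_0, <x>_1
    x0 = secrets.randbelow(MOD)
    return x0, (x - x0) % MOD

def reconstruct(x0, x1):
    return (x0 + x1) % MOD

# Linear operations need no communication: each party computes c*<x>_b + <y>_b locally.
x0, x1 = share(fix(1.5))
y0, y1 = share(fix(-0.25))
c = 3
z0, z1 = (c * x0 + y0) % MOD, (c * x1 + y1) % MOD
assert reconstruct(z0, z1) == (c * fix(1.5) + fix(-0.25)) % MOD
```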
Let A_{N,p} denote the set of integer polynomials A_{N,p} = Z_p[x]/(x^N + 1). We use a circumflex over lower-case letters (e.g., â) to represent a polynomial, and â[i] to denote the i-th coefficient of â. Given polynomials x̂, ŷ ∈ A_{N,p}, the product ẑ = x̂ · ŷ over A_{N,p} is defined by ẑ[i] = Σ_{0≤j≤i} x̂[j]ŷ[i−j] − Σ_{i<j<N} x̂[j]ŷ[N+i−j] mod p, for i ∈ [N].

When mnk > N, we first partition the matrices X and Y into sub-matrices of mw × nw and nw × kw elements, respectively, such that mw·nw·kw ≤ N. Zero-padding is required when mw ∤ m, nw ∤ n, or kw ∤ k. The protocol is shown in Algorithm 1.

Complexity and security analysis. In total, the two parties exchange ⌈k/kw⌉·(⌈m/mw⌉ + ⌈n/nw⌉) ciphertexts and perform O(mnk/N) homomorphic additions and multiplications. We formalize the selection of the parameters mw, nw, and kw as an optimization problem that minimizes the communication cost. Through our analysis in Appendix A.2.2, in general scenarios our method theoretically achieves an m× communication improvement over Cheetah [14]. Notably, when mnk ≤ N, we reduce the communication cost by a factor of m, since Cheetah encodes each row of the matrix into a separate ciphertext. The security proof is given in Appendix A.2.3.

Algorithm 1 Secure Matrix Multiplication Protocol
Input: P0 holds X ∈ Z_{2ℓ}^{m×n}, and P1 holds Y ∈ Z_{2ℓ}^{n×k}.
Output: P0 and P1 get ⟨Z⟩0, ⟨Z⟩1 ∈ Z_{2ℓ}^{m×k}, respectively, where Z = XY.
1: P0, P1 compute partition window sizes 0 < mw ≤ m, 0 < nw ≤ n, and 0 < kw ≤ k such that mw·nw·kw ≤ N, and set n′ = ⌈n/nw⌉, m′ = ⌈m/mw⌉, and k′ = ⌈k/kw⌉.
2: P1 partitions the matrix Y into block matrices Y_{β,γ} ∈ Z_{2ℓ}^{nw×kw} for β ∈ [n′] and γ ∈ [k′].
3: P1 encodes the blocks into polynomials ŷ_{β,γ} = πR(Y_{β,γ}) for β ∈ [n′] and γ ∈ [k′]. Then P1 sends the ciphertexts {CT_{β,γ} = Enc(ŷ_{β,γ})} to P0.
4: P0 partitions the matrix X into block matrices X_{α,β} ∈ Z_{2ℓ}^{mw×nw} for α ∈ [m′] and β ∈ [n′], and encodes the blocks into polynomials x̂_{α,β} = πL(X_{α,β}).
5: P0 samples plaintext polynomials r̂_{α,γ} uniformly at random for α ∈ [m′] and γ ∈ [k′], and decodes them into a random mask R ∈ Z_{2ℓ}^{m×k} according to Theorem 4.1.
6: On receiving the ciphertexts {CT_{β,γ}} from P1, P0 computes CT′_{α,γ} = Σ_{β∈[n′]} (x̂_{α,β} · CT_{β,γ}) − r̂_{α,γ} for α ∈ [m′] and γ ∈ [k′], and sends the ciphertexts {CT′_{α,γ}} to P1.
7: P0 outputs R mod 2ℓ as its share ⟨Z⟩0.
8: On receiving the ciphertexts {CT′_{α,γ}} from P0, P1 computes ẑ_{α,γ} = Dec(CT′_{α,γ}), which are decoded into ⟨Z⟩1 using the method in Theorem 4.1.

The random masking in step 5 avoids extra information leakage from the coefficients of the output polynomial; we refer to [14] and its implementation for the details.
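To make the packing idea behind Algorithm 1 concrete, the following plaintext sketch (in Python, with no homomorphic encryption) packs X and Y into single polynomials so that one negacyclic product yields every entry of XY at a fixed coefficient position. The encodings pack_left and pack_right below are one valid choice for illustration and are not necessarily the exact πL and πR of Theorem 4.1; all names and toy parameters are ours.

```python
# Toy plaintext sketch of the coefficient-packing idea behind Algorithm 1 (no HE).
# pack_left/pack_right are one illustrative choice, not necessarily the paper's pi_L/pi_R.
import numpy as np

def negacyclic_mul(a, b, N, p):
    """Product in Z_p[x]/(x^N + 1): x^N wraps around with a sign flip."""
    z = np.zeros(N, dtype=np.int64)
    for i in np.flatnonzero(a):
        for j in np.flatnonzero(b):
            t = (i + j) % N
            sign = 1 if i + j < N else -1
            z[t] = (z[t] + sign * a[i] * b[j]) % p
    return z

def pack_left(X, n, k, N):
    """Place X[i][j] at coefficient i*n*k + (n - 1 - j)."""
    a = np.zeros(N, dtype=np.int64)
    for i, row in enumerate(X):
        for j, v in enumerate(row):
            a[i * n * k + (n - 1 - j)] = v
    return a

def pack_right(Y, n, N):
    """Place Y[j][l] at coefficient l*n + j."""
    b = np.zeros(N, dtype=np.int64)
    for j, row in enumerate(Y):
        for l, v in enumerate(row):
            b[l * n + j] = v
    return b

# Toy sizes with m*n*k <= N (in practice N is the HE ring dimension, e.g., 4096).
m, n, k, N, p = 3, 4, 2, 64, 2**13
rng = np.random.default_rng(0)
X = rng.integers(0, p, (m, n))
Y = rng.integers(0, p, (n, k))

z = negacyclic_mul(pack_left(X, n, k, N), pack_right(Y, n, N), N, p)
# Every entry Z[i][l] of XY lands at coefficient i*n*k + l*n + (n - 1).
Z = np.array([[z[i * n * k + l * n + (n - 1)] for l in range(k)] for i in range(m)])
assert np.array_equal(Z, (X @ Y) % p)
```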
4.2 Protocols for Non-linear Functions

Our non-linear protocols rely on several underlying protocols from the state-of-the-art works [13, 17]. Figure 3 enumerates the inputs and outputs of these protocols, which we then use as black boxes (see Appendix A.3.2 for details). On this basis, we provide efficient protocols for Softmax, GELU, and LayerNorm with specific optimizations. Additional details and the security analysis are given in Appendices A.3.3 and A.3.4.

Figure 3: Underlying protocols from [13, 17]: multiplication ΠMulOT (z = xy), comparison ΠCMP (Boolean shares of z = 1{x > 0}), negative exponential ΠnExp (z = e^x for x < 0), reciprocal square root ΠrSqrt, and reciprocal ΠRecip (z = 1/x); all take secret-shared inputs and produce secret-shared outputs.

4.2.1 Softmax

To evaluate attention layers, we need an efficient protocol that computes Softmax on secret-shared values. In particular, for a vector x ∈ Z_{2ℓ}^d, the Softmax function is defined as Softmax_i(x) = e^{x_i} / Σ_{j∈[d]} e^{x_j} for i ∈ [d]. The main challenge is to efficiently compute the underlying exponential function.

Following the idea from [11, 30], we first normalize the input vector to x − max_{i∈[d]} x_i, whose entries are never positive, and then invoke the existing exponential protocol ΠnExp from Figure 3, which only evaluates exponentiation on negative inputs. A simple analysis shows that Softmax(x − max_{i∈[d]} x_i) equals Softmax(x). We evaluate the maximum using a tree-reduction protocol, denoted by Πmax: we arrange the vector x ∈ Z_{2ℓ}^d into a binary tree of depth log d and evaluate the tree in a top-down fashion [31]. Each comparison of two secret-shared elements x_i and x_j reduces to invocations of ΠCMP and ΠMulOT, i.e., max(x_i, x_j) = ΠMulOT(x_i − x_j, ΠCMP(x_i − x_j)) + x_j. With the above insight, the Softmax protocol is detailed in Algorithm 2.

Algorithm 2 Secure Softmax Protocol
Input: P0, P1 hold ⟨x⟩0 ∈ Z_{2ℓ}^d and ⟨x⟩1 ∈ Z_{2ℓ}^d, respectively.
Output: P0, P1 get ⟨y⟩0 ∈ Z_{2ℓ}^d and ⟨y⟩1 ∈ Z_{2ℓ}^d, respectively, where y = Softmax(x).
1: P0, P1 invoke Πmax(⟨x⟩) to compute ⟨max(x)⟩, where max(x) = max_{i∈[d]} x_i.
2: For i ∈ [d], P0, P1 invoke ΠnExp on input ⟨x̄_i⟩ and learn ⟨e^{x̄_i}⟩, where x̄_i = x_i − max(x).
3: P0, P1 invoke ΠRecip with input ⟨Σ_{i∈[d]} e^{x̄_i}⟩ and learn ⟨1/Σ_{i∈[d]} e^{x̄_i}⟩.
4: For i ∈ [d], P0, P1 invoke ΠMulOT with inputs ⟨1/Σ_{i∈[d]} e^{x̄_i}⟩ and ⟨e^{x̄_i}⟩, and set the outputs as ⟨y_i⟩.

SIRNN [17] also provides a generic exponential protocol by extending ΠnExp: the exponential of x equals 1/nExp(−x) if x ≥ 0 and nExp(x) otherwise, where nExp is the exponential function restricted to negative inputs. Compared to our solution, realizing Softmax with this generic exponential protocol additionally requires d calls to the reciprocal protocol ΠRecip and the multiplication protocol ΠMulOT. Besides, compared with the generic library MP-SPDZ [16], which natively supports exponentiation, our protocol achieves orders of magnitude improvement, as shown in Section 5.2.
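As a plaintext sanity check of the strategy in Algorithm 2 (no cryptography; the helper names are ours), the sketch below mirrors the protocol's structure: a tree-reduction maximum, exponentiation of non-positive values only, and a single reciprocal of the sum.

```python
# Plaintext sketch of the strategy in Algorithm 2 (no cryptography, names are ours):
# subtract the maximum so every exponent is non-positive, exponentiate, and use a
# single reciprocal of the sum, mirroring Pi_max, Pi_nExp, Pi_Recip, and Pi_MulOT.
import numpy as np

def tree_max(x):
    """Pairwise tree reduction of depth about log2(d), mirroring Pi_max."""
    x = list(x)
    while len(x) > 1:
        nxt = [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]
        if len(x) % 2:          # an odd leftover element is promoted to the next level
            nxt.append(x[-1])
        x = nxt
    return x[0]

def softmax_normalized(x):
    x = np.asarray(x, dtype=np.float64)
    t = x - tree_max(x)         # every entry is <= 0, so only a "negative exponential" is needed
    e = np.exp(t)               # d evaluations, one per entry (Pi_nExp in the protocol)
    return e / e.sum()          # one reciprocal of the sum, then d multiplications

x = np.array([2.0, -1.0, 0.5, 3.0])
assert np.allclose(softmax_normalized(x), np.exp(x) / np.exp(x).sum())
```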
4.2.2 GELU

Rather than the crypto-friendly ReLU [13], Transformer-based models use GELU activations [15], which can be written as GELU(x) = 0.5x·(1 + Tanh(√(2/π)·(x + 0.044715x³))). The complete protocol is shown in Algorithm 3, in which we apply two insights to reduce the cost of GELU. First, we present an optimized protocol for the square of a secret-shared input ⟨x⟩. It relies on the observation that x² = ⟨x⟩0² + ⟨x⟩1² + 2⟨x⟩0⟨x⟩1, where the first two terms can be computed locally by P0 and P1; we invoke OT protocols only for the cross term 2⟨x⟩0⟨x⟩1, and the optimized overhead is half that of the multiplication protocol ΠMulOT. Second, we further optimize the evaluation of Tanh(x) = (e^{2x} − 1)/(e^{2x} + 1). We observe that the sign of x equals the sign of Tanh(x), which lets us leverage the negative exponential protocol ΠnExp almost for free. At a high level, our Tanh protocol first learns the sign of the input x, then evaluates Tanh on a negative input x̄ with |x̄| = |x|, and finally produces the real output Tanh(x), which equals −Tanh(x̄) if x ≥ 0 and Tanh(x̄) otherwise. Our leaner Tanh protocol is given in Algorithm 5 of Appendix A.3.1. SIRNN [17] recently proposed the most efficient Tanh protocol, based on the identity Tanh(x) = 2·Sigmoid(2x) − 1 with Sigmoid(x) = 1/(1 + e^{−x}); however, its Sigmoid evaluation uses the same idea as the generic exponential protocol discussed in Section 4.2.1. Compared with the protocol in SIRNN, our recipe for Tanh saves one invocation of the reciprocal protocol ΠRecip and one invocation of the multiplication protocol ΠMulOT.

Algorithm 3 Secure GELU Protocol
Input: P0, P1 hold ⟨x⟩0 ∈ Z_{2ℓ} and ⟨x⟩1 ∈ Z_{2ℓ}, respectively.
Output: P0, P1 get ⟨y⟩0 ∈ Z_{2ℓ} and ⟨y⟩1 ∈ Z_{2ℓ}, respectively, where y = GELU(x).
1: P0, P1 invoke ΠMulOT with inputs ⟨x⟩, and set their outputs as ⟨z⟩ = Fix(√(2/π))·(⟨x⟩ + Fix(0.044715)·⟨x³⟩).
2: P0, P1 invoke ΠTanh with inputs ⟨z⟩, and set their outputs as ⟨Tanh(z)⟩.
3: P0, P1 invoke ΠMulOT with inputs Fix(0.5)·⟨x⟩ and Fix(1) + ⟨Tanh(z)⟩, and output ⟨y⟩.

4.2.3 LayerNorm

For a vector x ∈ Z_{2ℓ}^d, the LayerNorm function is defined as LayerNorm_i(x) = γ·(x_i − µ)/σ + β for i ∈ [d], where µ = Σ_{i∈[d]} x_i / d and σ = √(Σ_{i∈[d]} (x_i − µ)²). In contrast to batch normalization (BN) in CNNs, which can be evaluated for free [14], LayerNorm requires multiplication and reciprocal square root operations. In our implementation, we observe that the multiplications dominate the overhead of LayerNorm. To address this, we adopt the same idea as in GELU to compute the square of x_i − µ, which saves half of the communication and computation costs. Besides, inspired by the BN optimization in [14], we apply a LayerNorm merge technique to further reduce the overhead. Specifically, we observe that the weights γ and β of the LayerNorm layer are already known to P0. As a result, P0 first multiplies the scale factor γ into the weights of the next linear layer; after invoking the linear layer protocol on the scaled weights, P0 adds the shift weight β to its additive share. The optimized protocol for LayerNorm is presented in Algorithm 4.

Algorithm 4 Secure LayerNorm Protocol
Input: P0, P1 hold ⟨x⟩0 ∈ Z_{2ℓ}^d and ⟨x⟩1 ∈ Z_{2ℓ}^d, respectively.
Output: P0, P1 get ⟨y⟩0 and ⟨y⟩1, respectively, where y = LayerNorm(x).
1: For i ∈ [d], P0, P1 invoke ΠMulOT to compute ⟨(x_i − µ)²⟩, where µ = Σ_{i∈[d]} x_i / d.
2: P0, P1 invoke ΠrSqrt with input ⟨Σ_{i∈[d]} (x_i − µ)²⟩ to learn the output ⟨1/σ⟩.
3: For i ∈ [d], P0, P1 invoke ΠMulOT with inputs ⟨1/σ⟩ and ⟨x_i − µ⟩, and set the outputs as ⟨y_i⟩.
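The shared-square optimization used in both the GELU and LayerNorm protocols above can be illustrated with the following plaintext sketch over Z_{2ℓ}; the secure cross-term product is simulated in the clear here, and the modulus and names are illustrative.

```python
# Sketch of the shared-square optimization (names and modulus are illustrative):
# x^2 = <x>_0^2 + <x>_1^2 + 2<x>_0<x>_1, so only the cross term needs a secure product.
import secrets

ell = 64
MOD = 1 << ell

def share(v):
    v0 = secrets.randbelow(MOD)
    return v0, (v - v0) % MOD

x = 123456789
x0, x1 = share(x)

local0 = (x0 * x0) % MOD          # computed locally by P0, no interaction
local1 = (x1 * x1) % MOD          # computed locally by P1, no interaction

# The only interactive step is a secure product of x0 and x1 (via OT in the protocol);
# here it is computed in the clear and re-shared purely for illustration.
c0, c1 = share((2 * x0 * x1) % MOD)

y0, y1 = (local0 + c0) % MOD, (local1 + c1) % MOD
assert (y0 + y1) % MOD == (x * x) % MOD
```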
5 Evaluation

5.1 Experimental Setup

Implementation. Iron is built on top of the SEAL library [32] and the EMP toolkit [33] in C++. We also use the EzPC framework [34], which compiles high-level TensorFlow code into secure computation protocols that are then executed by our cryptographic backends. Like [17], we simulate a LAN setting, where the bandwidth is 377 MBps and the echo latency is 0.8 ms. All the following experiments are performed on AWS c5.9xlarge instances with Intel Xeon 8000-series CPUs at 3.6 GHz.

Datasets and Models. We evaluate Iron on four NLP models from [35, 36]: BERT-Tiny, BERT-Medium, BERT-Base, and BERT-Large. These models are parameterized by three hyper-parameters: the number of blocks, the dimension of representations, and the number of input tokens (refer to Appendix A.4.1 for the hyper-parameters of these models). We train the models for four NLP tasks on the Stanford Sentiment Treebank (SST-2), the Microsoft Research Paraphrase Corpus (MRPC), the Multi-Genre Natural Language Inference Corpus (MNLI), and the Stanford Question Answering Dataset (QNLI) from the GLUE benchmarks [18].

5.2 Microbenchmark Evaluation

Table 1: Runtime (sec) and communication (MB) costs of our matrix multiplication and non-linear protocols compared with the state of the art.
Matrix multiplication (dimensions (m, n, k)), ours: (32, 8, 16): 0.006 s / 0.11 MB; (128, 64, 128): 0.066 s / 1.74 MB; (128, 768, 768): 1.71 s / 15.45 MB. Cheetah on (32, 8, 16): 0.16 s / 2.79 MB.
Non-linear protocols, ours: Softmax 4.78 s / 206.265 MB; LayerNorm 2.34 s / 102.435 MB; GELU 0.30 s / 10.07 MB. MP-SPDZ Softmax: 297.75 s.

Matrix multiplication. In the left part of Table 1, we compare the proposed matrix multiplication protocol with the state-of-the-art counterparts, Cheetah [14] and SIRNN [17]. For fairness, we follow Cheetah for the homomorphic encryption parameter setup. Compared with Cheetah, our runtime is 3×–26× faster and our communication cost is 8×–25× lower, depending on the input size. Notably, for small matrices (e.g., of 32×8 and 8×16 elements), our protocol requires only 0.11 MB of communication and 6 ms of runtime, while Cheetah needs 2.79 MB and 160 ms. The reason is that our protocol encrypts the whole matrix into a single ciphertext, whereas Cheetah can only encrypt each row into a ciphertext, i.e., m ciphertexts in total (m is the number of rows of the output matrix). Moreover, compared with SIRNN, which implements the most efficient OT-based matrix multiplication protocol, our protocol incurs up to two orders of magnitude less communication and an order of magnitude less runtime.

Non-linear functions. The right part of Table 1 compares our Softmax, GELU, and LayerNorm protocols with the generic MP-SPDZ framework [16] and the state-of-the-art SIRNN library [17] for math functions. It is worth noting that we implement some functions that these frameworks did not provide before: GELU and LayerNorm are implemented in MP-SPDZ by calling its built-in tanh, square root, and reciprocal functions, and we add Softmax and GELU protocols to SIRNN using its sigmoid and exponential functions. As the table shows, our protocols are orders of magnitude better than those of MP-SPDZ in terms of both runtime and communication; in particular, for the communication cost, our protocols achieve a 785×–993× improvement. Moreover, although our protocols are built upon the underlying protocols of SIRNN, we achieve 1.3×–1.8× lower runtime and 1.4×–1.8× lower communication thanks to our customized optimizations. This improvement yields significant savings in communication, since the non-linear protocols dominate the overall overhead, as described below.

5.3 End-to-end Inference Evaluation

Figure 4: End-to-end comparison with SIRNN and MP-SPDZ: runtime (min) and communication (GB) on BERT-Tiny, BERT-Medium, BERT-Base, and BERT-Large.

Comparison with prior methods. In the left part of Figure 4, we evaluate our protocols on the four BERT models against SIRNN [17]. Our runtime is 3.3×–11.83× faster than that of SIRNN, and our communication cost is 3.47×–14.11× lower across the four models. Moreover, our performance gains scale up as the model size grows, because our protocols achieve better amortized overhead when processing large-scale evaluations. We also compare end-to-end private inference with MP-SPDZ in Figure 4. The results show that our protocols are orders of magnitude better in terms of both time and communication costs.
This is because specialized protocols are more communication-efficient than generic alternatives, which is also observed by SIRNN.

Performance breakdown. In Figure 5, we present the runtime and communication breakdown of Iron on the four BERT models. For clarity, we report the result of a single encoder. Recall that our private inference can be divided into linear protocols and non-linear protocols. For the linear protocols, we observe that as the model size increases, the proportion of communication overhead remains approximately constant, accounting for 12–13%. This shows that our compact ciphertext encoding effectively reduces the communication size. For the non-linear functions, as the model size increases, the proportion of computation overhead gradually decreases, from 84% in BERT-Tiny to 76% in BERT-Large. The savings come from amortizing the communication and computation costs by packing data from large tensors. Despite these advantages, the main bottleneck of our work is the communication overhead of the non-linear layers; solving this communication issue while maintaining model accuracy remains an open problem.

Figure 5: Runtime and communication breakdown (linear vs. non-linear protocols) on the four BERT models.

Accuracy comparison with plaintext. In the left part of Figure 6, we show the accuracy of the plaintext (floating-point) model and of Iron (fixed-point) on BERT-Tiny. We observe that the accuracy achieved by Iron matches the accuracy of the plaintext TensorFlow code. Specifically, the accuracy loss does not exceed 0.3% on all datasets, and, surprisingly, Iron exceeds the plaintext baseline on MNLI by 0.85%. Similar results also appear in private CNN inference [13]. These results experimentally validate that our protocols are numerically precise. Moreover, in the right part of Figure 6, we compare the accuracy with the plaintext baseline on MRPC as the fractional scale varies. Iron with a scale of 12 exactly matches the accuracy of the plaintext model, and the accuracy loss is lower than 1% when the scale is at least 6. This conclusion is in line with the prior work [13], namely that neural networks can tolerate stochastic fault behavior [37].

Figure 6: Accuracy of Iron versus the plaintext baseline on MNLI, MRPC, QNLI, and SST-2 (left) and on MRPC across fixed-point scales (right).

6 Discussion

We discuss possible solutions to further improve the efficiency of private inference on Transformers. They are compatible with our proposed protocols and hence can be directly integrated into our framework.

Pushing expensive operations into an offline phase. Our framework readily extends to the preprocessing model, with an offline, client-input-independent phase and an online, input-dependent phase. This paradigm has been instantiated in private CNN inference [38, 8], where about 99% of the cryptographic overhead can be moved to the offline phase. We briefly outline how to build our protocols in this paradigm. For linear operations, we can generate Beaver's triples in matrix form [22, 38] in advance using our matrix multiplication protocol; in the online phase, these triples are consumed, requiring only non-cryptographic operations and plaintext communication. For non-linear layers, almost all operations crucially rely on OT protocols, which can be pre-generated in the offline phase [39, 14]. Like the linear protocols, the online phase is then cheap.
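As a rough illustration of this preprocessing paradigm for linear layers, the sketch below consumes a matrix Beaver triple (A, B, C = AB) that could be generated offline, e.g., with the protocol of Section 4.1. It is the standard construction from [22, 38], shown in plaintext with a toy modulus; all names are ours.

```python
# Sketch of consuming a matrix Beaver triple (A, B, C = AB) in the online phase.
# Standard construction from [22, 38] in plaintext; the triple could be produced offline
# with the HE-based protocol of Section 4.1. Toy modulus and names are illustrative.
import numpy as np

MOD = 1 << 16                    # small modulus so int64 products never overflow in this demo
rng = np.random.default_rng(1)

def share(M):
    M0 = rng.integers(0, MOD, M.shape, dtype=np.int64)
    return M0, (M - M0) % MOD

m, n, k = 3, 4, 2
X = rng.integers(0, MOD, (m, n), dtype=np.int64)   # secret-shared inputs of a linear layer
Y = rng.integers(0, MOD, (n, k), dtype=np.int64)
X0, X1 = share(X); Y0, Y1 = share(Y)

# Offline: random A, B and C = A@B are generated and secret-shared.
A = rng.integers(0, MOD, (m, n), dtype=np.int64)
B = rng.integers(0, MOD, (n, k), dtype=np.int64)
A0, A1 = share(A); B0, B1 = share(B); C0, C1 = share((A @ B) % MOD)

# Online: open E = X - A and F = Y - B (uniformly random, so they leak nothing), then
# compute shares of Z = XY locally from X@Y = E@F + E@B + A@F + C.
E = (X0 - A0 + X1 - A1) % MOD
F = (Y0 - B0 + Y1 - B1) % MOD
Z0 = (E @ F + E @ B0 + A0 @ F + C0) % MOD          # P0's share
Z1 = (E @ B1 + A1 @ F + C1) % MOD                  # P1's share
assert np.array_equal((Z0 + Z1) % MOD, (X @ Y) % MOD)
```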
Using mixed bitwidths in private inference. Iron works with a uniform bitwidth, which must be large enough to accommodate all intermediate values. While this effectively avoids integer overflows, it may cause intensive communication, since the performance of non-linear operators depends critically on the bitwidth. Taking inspiration from [17, 40], one possible optimization is to employ non-uniform (mixed) bitwidths: operate in low bitwidths and switch to high bitwidths only when necessary. To this end, a mixed-precision model should be generated with proper compilers using techniques like quantization [41, 42]. With mixed-bitwidth fixed-point models, our protocols can be integrated seamlessly.

Applying orthogonal model optimizations. We can improve performance by simplifying the model architecture, e.g., via model pruning and advanced neural architecture search [43, 44], which have been commonly adopted in private CNN inference [12, 45, 37]. Like prior works [12, 45, 37] that find and tailor models to the requirements of private inference, we can define a proper search space with the goal of decreasing the overhead, e.g., substituting the costly Softmax and GELU functions or reducing the matrix dimensions in matrix multiplication. As mentioned in Section 1, the concurrent work THE-X [19] has explored replacing complex math functions with HE-friendly alternatives while achieving comparable accuracy.

7 Conclusion

We propose Iron, an efficient cryptographic framework for private Transformer inference. Specifically, we design a new encoding method for optimizing homomorphic encryption-based matrix multiplication. Further, we devise several communication-efficient non-linear protocols for Softmax, LayerNorm, and GELU by integrating advanced secret-sharing primitives. Experimental results show that Iron outperforms the prior art by up to one order of magnitude in terms of computation and communication overhead. We believe that our novel protocols will help advance practical instantiations of private Transformer inference.

Acknowledgment

The authors would like to thank the anonymous reviewers for their insightful comments. This work is supported by the Key-Area Research and Development Program of Guangdong Province under Grant 2020B0101360001, the National Natural Science Foundation of China under Grants 62020106013, 61972454, and 61802051, the Sichuan Science and Technology Program under Grants 2020JDTD0007 and 2020YFG0298, the Fundamental Research Funds for Chinese Central Universities under Grant ZYGX2020ZB027, and the Singapore Ministry of Education (MOE) AcRF Tier 2 Grant MOE-T2EP201210006.

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of NeurIPS, 2017.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NeurIPS, 2018.
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of ICLR, 2020.
[5] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of ICCV, pages 10012–10022, 2021.
[6] Clarifai general image embedding model. URL https://www.clarifai.com/models/general-image-embedding.
[7] OpenAI API. URL https://openai.com/blog/openai-api.
[8] Ryan Lehmkuhl, Pratyush Mishra, Akshayaram Srinivasan, and Raluca Ada Popa. Muse: Secure inference resilient to malicious clients. In Proceedings of USENIX Security, pages 2201–2218, 2021.
[9] Yupei Liu, Jinyuan Jia, Hongbin Liu, and Neil Zhenqiang Gong. Stolenencoder: Stealing pre-trained encoders. arXiv:2201.05889, 2022.
[10] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In Proceedings of ICML, pages 201–210, 2016.
[11] Brian Knott, Shobha Venkataraman, Awni Hannun, Shubho Sengupta, Mark Ibrahim, and Laurens van der Maaten. Crypten: Secure multi-party computation meets machine learning. In Proceedings of NeurIPS, 2021.
[12] Zahra Ghodsi, Akshaj Kumar Veldanda, Brandon Reagen, and Siddharth Garg. Cryptonas: Private inference on a relu budget. In Proceedings of NeurIPS, pages 16961–16971, 2020.
[13] Deevashwer Rathee, Mayank Rathee, Nishant Kumar, Nishanth Chandran, Divya Gupta, Aseem Rastogi, and Rahul Sharma. Cryptflow2: Practical 2-party secure inference. In Proceedings of ACM CCS, pages 325–342, 2020.
[14] Zhicong Huang, Wen-jie Lu, Cheng Hong, and Jiansheng Ding. Cheetah: Lean and fast secure two-party deep neural network inference. In Proceedings of USENIX Security, 2022.
[15] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[16] Marcel Keller. Mp-spdz: A versatile framework for multi-party computation. In Proceedings of ACM CCS, pages 1575–1590, 2020.
[17] Deevashwer Rathee, Mayank Rathee, Rahul Kranti Kiran Goli, Divya Gupta, Rahul Sharma, Nishanth Chandran, and Aseem Rastogi. Sirnn: A math library for secure rnn inference. In Proceedings of IEEE S&P, pages 1003–1020, 2021.
[18] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR, pages 535–548, 2018.
[19] Tianyu Chen, Hangbo Bao, Shaohan Huang, Li Dong, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. The-x: Privacy-preserving transformer inference with homomorphic encryption. In Findings of ACL, pages 3510–3520, 2022.
[20] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of ACM CCS, pages 308–318, 2016.
[21] Ronald Cramer, Ivan Bjerre Damgård, et al. Secure multiparty computation. Cambridge University Press, 2015.
[22] Daniel Demmler, Thomas Schneider, and Michael Zohner. ABY: A framework for efficient mixed-protocol secure two-party computation. In Proceedings of NDSS, 2015.
[23] Taher ElGamal. A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Transactions on Information Theory, 31(4):469–472, 1985.
[24] Pascal Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of Eurocrypt, pages 223–238, 1999.
[25] Zvika Brakerski. Fully homomorphic encryption without modulus switching from classical GapSVP. In Proceedings of Crypto, pages 868–886, 2012.
[26] Junfeng Fan and Frederik Vercauteren. Somewhat practical fully homomorphic encryption. Cryptology ePrint Archive, 2012.
[27] Vladimir Kolesnikov and Ranjit Kumaresan. Improved ot extension for transferring short secrets. In Proceedings of Crypto, pages 54–70, 2013.
[28] Gilad Asharov, Yehuda Lindell, Thomas Schneider, and Michael Zohner. More efficient oblivious transfer and extensions for faster secure computation. In Proceedings of ACM CCS, pages 535–548, 2013.
[29] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. Gazelle: A low latency framework for secure neural network inference. In Proceedings of USENIX Security, pages 1651–1669, 2018.
[30] Sijun Tan, Brian Knott, Yuan Tian, and David J Wu. Cryptgpu: Fast privacy-preserving machine learning on the gpu. In Proceedings of IEEE S&P, 2021.
[31] Théo Ryffel, Pierre Tholoniat, David Pointcheval, and Francis Bach. Ariann: Low-interaction privacy-preserving deep learning via function secret sharing. In Proceedings of PETS, pages 291–316, 2022.
[32] SEAL. URL https://github.com/microsoft/SEAL.
[33] emp-toolkit. URL https://github.com/emp-toolkit.
[34] EzPC. URL https://github.com/mpc-msri/EzPC.
[35] BERT. URL https://github.com/google-research/bert.
[36] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: On the importance of pre-training compact models. arXiv:1908.08962, 2019.
[37] Zahra Ghodsi, Nandan Kumar Jha, Brandon Reagen, and Siddharth Garg. Circa: Stochastic relus for private deep learning. In Proceedings of NeurIPS, 2021.
[38] Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, and Raluca Ada Popa. Delphi: A cryptographic inference service for neural networks. In Proceedings of USENIX Security, pages 2505–2522, 2020.
[39] M Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M Songhori, Thomas Schneider, and Farinaz Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Proceedings of AsiaCCS, pages 707–721, 2018.
[40] Siam Umar Hussain, Mojan Javaheripi, Mohammad Samragh, and Farinaz Koushanfar. Coinn: Crypto/ml codesign for oblivious inference via neural networks. In Proceedings of ACM CCS, pages 3266–3281, 2021.
[41] Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. I-bert: Integer-only bert quantization. In Proceedings of ICML, pages 5506–5518, 2021.
[42] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of AAAI, pages 8815–8821, 2020.
[43] François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. In Proceedings of EMNLP, 2021.
[44] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In Proceedings of NeurIPS, 2019.
[45] Nandan Kumar Jha, Zahra Ghodsi, Siddharth Garg, and Brandon Reagen. Deepreduce: Relu reduction for fast private inference. In Proceedings of ICML, pages 4839–4849, 2021.
[46] Payman Mohassel and Yupeng Zhang. Secureml: A system for scalable privacy-preserving machine learning. In Proceedings of IEEE S&P, pages 19–38, 2017.
[47] Ran Canetti. Universally composable security: A new paradigm for cryptographic protocols. In Proceedings of FOCS, pages 136–145, 2001.
[48] Robert E Goldschmidt. Applications of division by convergence. PhD thesis, MIT, 1964.
[49] Xiaoqian Jiang, Miran Kim, Kristin Lauter, and Yongsoo Song. Secure outsourced matrix computation and application to neural networks. In Proceedings of ACM CCS, pages 1209–1222, 2018.
[50] Fabian Boemer, Anamaria Costache, Rosario Cammarota, and Casimir Wierzynski. ngraph-he2: A high-throughput framework for neural network inference on encrypted data. In Proceedings of the 7th ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography, pages 45–56, 2019.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [No]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [No]
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]