# Falcon: Fast Spectral Inference on Encrypted Data

Qian Lou (Indiana University Bloomington, louqian@iu.edu), Wen-jie Lu (Alibaba Group, juhou.lwj@alibaba-inc.com), Cheng Hong (Alibaba Group, vince.hc@alibaba-inc.com), Lei Jiang (Indiana University Bloomington, jiang60@iu.edu)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Homomorphic Encryption (HE)-based secure neural network (NN) inference is one of the most promising security solutions for emerging Machine Learning as a Service (MLaaS). In the HE-based MLaaS setting, a client encrypts its sensitive data and uploads the encrypted data to a server that processes the encrypted data directly, without decryption, and returns the encrypted result to the client. The client's data privacy is preserved since only the client holds the private key. Existing HE-enabled neural networks (HENNs), however, suffer from heavy computational overhead. State-of-the-art HENNs adopt ciphertext packing techniques to reduce homomorphic multiplications by packing multiple messages into a single ciphertext. Nevertheless, these HENNs require rotations to compute the sum of the elements within the same ciphertext. We observe that HENNs pay significant computational overhead for rotations: each rotation is about 10x more expensive than a homomorphic multiplication between a ciphertext and a plaintext, so massive rotations have become the primary obstacle to efficient HENNs. In this paper, we propose Falcon, a fast frequency-domain deep neural network for inference on encrypted data. Falcon includes a fast Homomorphic Discrete Fourier Transform (HDFT) using block-circulant matrices to homomorphically support spectral operations. We also propose several methods to reduce inference latency, including Homomorphic Spectral Convolution and Homomorphic Spectral Fully-Connected operations that combine batched HE with block-circulant matrices. Our experimental results show that Falcon achieves state-of-the-art inference accuracy and reduces inference latency by 45.45%-85.34% over prior HENNs on MNIST and CIFAR-10.

1 Introduction

Homomorphic Encryption (HE)-enabled neural networks (NNs) [1, 2, 3] are designed for secure Machine Learning as a Service (MLaaS). In HE-enabled MLaaS, a client encrypts his/her data and uploads the encrypted data to a server in the cloud. The server computes inferences on the encrypted data and returns the encrypted output to the client. The server cannot decrypt the encrypted input or output during an inference. However, HE naturally supports only the linear layers of a neural network. Some interactive HE-enabled NNs (HENNs) [4, 5, 6] use multi-party computation (MPC) to involve the client in the computation of activation layers, exchanging several gigabytes of data with the client during an inference, while other non-interactive HENNs [1, 2, 3] approximate activations by a square function to perform secure inferences without involving the client. A non-interactive HE-enabled NN is a practical MLaaS solution with competitive accuracy for clients that have limited computing power and small network bandwidth. However, both interactive and non-interactive HE-enabled inferences are slow. An inference of state-of-the-art HENNs [5, 3] on an encrypted CIFAR-10 image costs several hundred seconds. Their long inference latency is caused by expensive HE rotations.
Modern HE cryptosystems, e.g., BFV [7], pack a vector of small integers into a single large integer, so that concurrent HE arithmetic operations on the individual integers can be carried out by a single operation on the large integer. This single instruction multiple data (SIMD) computing style significantly reduces the inference latency of HENNs from multiple hours to several hundred seconds. However, each accumulation in the linear layers of a HE-enabled NN requires a rotation operation to shuffle the small integers packed into a large integer. As a result, rotations consume > 90% of the inference latency of a HE-enabled NN on an encrypted CIFAR-10 image.

Recent interactive HENNs [4, 8] use frequency-domain convolutions [9, 10] to perform only element-wise multiplications in their linear layers, eliminating expensive HE rotations. A plaintext image of a client is first converted to its frequency-domain representation by the discrete Fourier transform (DFT), encrypted to a ciphertext, and then sent to a cloud server. Instead of HE multiply-accumulate (MAC) operations, only HE element-wise multiplications are required to perform frequency-domain convolutions on the server. After receiving the encrypted frequency-domain linear layer output, the client decrypts it and converts the plaintext frequency-domain output to a normal linear layer output by the inverse DFT (IDFT). At last, the client performs activations on the normal linear layer output and moves to the next layer. Frequency-domain convolutions reduce the inference latency of interactive HENNs by 30%-40%.

However, naïvely using frequency-domain convolutions in non-interactive HENNs prolongs inference latency. Each HE operation introduces a certain amount of noise into the encrypted data. When the accumulated noise in the encrypted data exceeds the noise budget of a HENN, the encrypted data cannot be correctly decrypted. Therefore, the total number of HE operations along the critical path of a HENN determines the required noise budget, and a larger noise budget increases the latency of each HE operation. In interactive HENNs, the server sends the output to the client at the end of each linear layer, and the client participates in the computation of each activation layer. The client performs DFT and IDFT on plaintext data, which does not increase the number of HE operations at all. On the contrary, non-interactive HENNs homomorphically compute all linear and activation layers on the server without involving the client. The DFT and IDFT applied in non-interactive HENNs operate on encrypted data and thus must be homomorphic. Homomorphic DFT and IDFT greatly increase the noise budget of a non-interactive HENN by adding more HE operations to its critical path. Based on our estimation, the enlarged noise budget prolongs the inference latency of a frequency-domain non-interactive HENN by > 100%.

In this paper, we propose Falcon for fast non-interactive privacy-preserving inference. Our contributions can be summarized as follows.

- We propose a novel homomorphic DFT algorithm to homomorphically and efficiently convert an encrypted input to its encrypted frequency-domain representation.
- We propose fast HE-enabled convolution and fully-connected techniques in the spectral domain using block-circulant weight matrices.
- We analyze the improvements and overhead of the proposed techniques on both HE noise growth and HE parameter selection.
- Our experiments show that Falcon reduces inference latency by 45.45%-85.34% over prior HENNs on MNIST and CIFAR-10.

[Figure 1: Different HENN schemes. (a) An interactive HENN; (b) a non-interactive HENN. Linear layers are computed homomorphically (HLinear) on the server.]

[Figure 2: A homomorphic dot-product. The encrypted inputs [x0..x3] are multiplied slot-wise by the plaintext weights w0..w3, and rot([y], 2) followed by rot([y], 1) with additions produces the accumulated result [y0~3].]

2 Background

2.1 Secure Neural Network Inference

Recent works [1, 2, 3, 4, 5, 6] use HE to implement the linear layers of a HENN for MLaaS. However, HE cannot support non-linear activation layers. As Figure 1(a) shows, interactive HENNs [4, 5, 6] use MPC and secret sharing to make the server send the output to the client at the end of each linear layer and involve the client in the computation of each activation layer. In contrast, non-interactive HENNs [1, 2, 3] approximate activations by a square function to compute an entire secure inference without involving the client, as shown in Figure 1(b). Compared to interactive HENNs, non-interactive HENNs place a lower requirement on the computing power and network bandwidth of the client, which makes them more friendly to low-power mobile devices. A state-of-the-art interactive HENN, Delphi [5], has to exchange 2GB of data between the client and server for a single ResNet-32 inference on an encrypted CIFAR-10 image. In this paper, we focus on accelerating non-interactive HENNs. In particular, we select LoLa [3], implemented with BFV [7], as our baseline, due to its state-of-the-art inference accuracy and latency. Compared to other HE cryptosystems such as CKKS [11], the BFV-based LoLa improves inference latency by 30%.

2.2 Homomorphic Encryption

Homomorphic Encryption. HE allows operations on encrypted data without requiring access to the secret key [7]. Given a public key pk, a private key sk, an encryption function ϵ(), and a decryption function σ(), a HE operation ⊗ is supported if there is an operation ⊕ on ciphertexts such that σ(ϵ(x1, pk) ⊕ ϵ(x2, pk), sk) = σ(ϵ(x1 ⊗ x2, pk), sk), where x1 and x2 are plaintexts. Each HE operation introduces a certain amount of noise into the encrypted data. When the accumulated noise exceeds the noise budget, errors occur during HE decryption. A bootstrapping operation [12] can reduce the noise in encrypted data, but it is extremely expensive. Prior HENNs instead use leveled HE, which fixes a noise budget and computes only a limited number of HE operations.

SIMD and Rotation. Modern HE cryptosystems, e.g., BFV [7], support single instruction multiple data (SIMD) vectors by encoding multiple integers into one larger integer based on the Chinese Remainder Theorem. For instance, as Figure 2 shows, an input integer vector [x0, x1, x2, x3] can be encrypted into mx, while a weight integer vector [w0, w1, w2, w3] can be packed into mw. By computing one HE multiplication between mx and mw, four multiplications are performed simultaneously on the individual integers, i.e., [x0·w0, x1·w1, x2·w2, x3·w3] = [y0, y1, y2, y3]. A HE cryptosystem also supports rotations to shuffle the individual integers in a packed vector. For instance, rotating the vector [y0, y1, y2, y3] by 2 yields [y2, y3, y0, y1]. A HE rotation is computationally expensive [4] and introduces non-trivial noise into the encrypted data.

Homomorphic Multiply-Accumulate. The major operation in the linear layers of a HENN is the homomorphic MAC, as shown in Figure 2.
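To make the SIMD and rotation semantics above concrete, the following is a minimal plaintext sketch of the homomorphic dot-product in Figure 2, using a numpy array as a stand-in for the packed ciphertext slots and np.roll as a stand-in for the HE rotation; the values are illustrative and no encryption is involved.

```python
import numpy as np

# Plaintext stand-in for the SIMD/rotation semantics in Figure 2.
# A BFV ciphertext packs a vector into slots; here a ciphertext is a numpy
# array, MultPC is an element-wise multiply, and rot() is modeled by np.roll.
def rot(ct, r):
    # A HE rotation shifts packed slots cyclically; np.roll(ct, -r) mimics rot(ct, r).
    return np.roll(ct, -r)

x = np.array([3, 1, 4, 1])          # "encrypted" input slots [x0..x3]
w = np.array([2, 7, 1, 8])          # plaintext weight slots  [w0..w3]

y = x * w                           # one MultPC: slot-wise products [y0..y3]
for shift in (2, 1):                # log2(4) = 2 rotate-and-add steps
    y = y + rot(y, shift)           # rot([y], 2) then rot([y], 1), as in Figure 2

print(y[0], np.dot(x, w))           # slot 0 now holds the full dot product (25)
```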
For a fully-connected (FC) layer, we assume the encrypted input vector x has n elements, the encrypted output vector y has m elements, and the plaintext weight matrix w has dimension n × m. For each row wj of w (0 ≤ j ≤ m − 1) and the packed input [x], the output element [yj] = [wj · x] is computed as

    [yj] = MultPC(wj, [x]);   [yj] = AddCC([yj], rot([yj], n/2^i))  for i = 1, ..., log2(n),   (1)

where MultPC denotes a HE SIMD multiplication between a packed plaintext and a packed ciphertext, AddCC denotes a HE SIMD addition between two ciphertexts, and rot denotes a HE rotation. As Equation 1 describes, log2(n) HE rotations are required to accumulate one element of y. We summarize the HE operation counts and noise of an FC layer of LoLa in Table 1. Among all HE operations, rotations dominate the inference latency of LoLa. In this paper, we focus on eliminating HE rotate-and-accumulate operations.

2.3 Frequency-Domain Convolution

Interactive HENNs. Prior interactive HENNs [4, 8] use frequency-domain convolutions [9, 10] to perform only element-wise multiplications in their HE linear layers and thereby reduce rotation overhead, as shown in Figure 1(a). By the Convolution Theorem, convolutions in the spatial domain are equivalent to element-wise products in the frequency domain. Therefore, we have

    y = w ∗ x = IDFT(DFT(w) ⊙ DFT(x)),   (2)

where IDFT is the inverse DFT. Frequency-domain convolutions greatly reduce the inference latency of interactive HENNs, since the unencrypted DFT and IDFT performed by the client introduce little computing overhead.

[Figure 3: A block-circulant weight matrix. The weight matrix is partitioned into p × q circulant blocks w_{i,j}; each k × k block is defined by only k values that are circularly shifted, so a block-vector product reduces to a DFT, an element-wise multiplication, and an inverse DFT.]

| Scheme | #MultPC | #AddCC | #rot | noise |
| --- | --- | --- | --- | --- |
| LoLa | n_c | n_c·log2(n_a) | n_c·log2(n_a) | η0·ηm·n_a + ηr·(n_a − 1) |
| LoLa+DFT | 3·n_c′ | 2·n_c′·log2(n_a′) | 2·n_c′·log2(n_a′) | 3·η0·ηm·n_a′ + 2·ηr·(n_a′ − 1) |

Table 1: The comparison of non-interactive HENN convolution schemes. Iw, Ih, Ic and Ow, Oh, Oc are the width, height, and channel counts of the convolution input and output; the kernels have Ic input channels, Oc output channels, and size f × f; N is the number of ciphertext slots. MultPC: HE SIMD multiplication between a plaintext and a ciphertext; AddCC: HE SIMD addition between two ciphertexts; rot: HE rotation. n_c = ⌈f²·Ic·Iw·Ih / N⌉ and n_c′ = ⌈(Ic·Iw·Ih)² / N⌉ are the ciphertext counts of LoLa and LoLa+DFT; n_a = Ic·f² and n_a′ = Ic·Iw·Ih are the accumulation counts of LoLa and LoLa+DFT, where ⌈x⌉ rounds x up to the nearest integer. η0: initial noise; ηm: MultPC noise; ηr: rotation noise.

Non-interactive HENNs. Unlike interactive HENNs, the non-interactive HENNs shown in Figure 1(b) have to use a homomorphic DFT (HDFT) and IDFT (HIDFT), since the entire inference occurs on the server. Although a CKKS-based homomorphic DFT with bootstrapping [11] exists, there is no BFV-based DFT or IDFT. Even with BFV-based DFT and IDFT functions, the total number of HE operations and the noise of a non-interactive HENN would be greatly increased by HDFT and HIDFT, which also require HE rotate-and-accumulate operations. The DFT and IDFT can be written as

    DFT(x)_t = Σ_{i=0}^{n−1} x_i·ω_n^{it};   IDFT(x)_t = (1/n)·Σ_{i=0}^{n−1} x_i·ω_n^{−it},   (3)

where ω_n = e^{2πi/n}. As Table 1 shows, naïvely applying HDFT and HIDFT to LoLa (LoLa+DFT) more than doubles the number of rotations, since n_a′ > n_a and n_c′ > n_c, and thus introduces more noise. To maintain a larger noise budget, LoLa+DFT has to enlarge the HE encryption parameters, i.e., the ciphertext modulus q and the polynomial degree N, which in turn prolong the latency of each MultPC, AddCC, and rot operation.
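The spectral shortcut in Equations 2 and 3 can be checked in plaintext with numpy's FFT playing the role of the DFT; this is only a numerical sanity check of the Convolution Theorem for circular convolution, not the homomorphic computation itself.

```python
import numpy as np

# Plaintext check of Equation (2): a circular convolution in the spatial
# domain equals an element-wise product in the frequency domain.
rng = np.random.default_rng(0)
n = 8
x = rng.integers(-8, 8, n).astype(float)   # quantized-style integer input
w = rng.integers(-8, 8, n).astype(float)   # integer kernel of the same length (a zero-padded kernel in practice)

# Direct circular convolution: y[t] = sum_i w[i] * x[(t - i) mod n]
y_direct = np.array([sum(w[i] * x[(t - i) % n] for i in range(n)) for t in range(n)])

# Spectral path of Equation (2): DFT -> element-wise product -> inverse DFT
y_spectral = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real

assert np.allclose(y_direct, y_spectral)
```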
2.4 Block-Circulant Weight Matrices

Recent work [13] compresses the weight matrices of a plaintext NN into block-circulant matrices to reduce inference computation. The n × m weight matrix is divided into p × q square blocks, each of which (w_{i,j}) contains k × k elements, where 0 ≤ i ≤ p − 1 and 0 ≤ j ≤ q − 1. We have p = ⌈m/k⌉ and q = ⌈n/k⌉. The k × k elements of a block can be derived from only k elements by circular shifts. The input vector is divided into q parts, while the output vector is divided into p parts. As shown in Figure 3, each part y_i of the output vector can be computed as

    y_i = Σ_{j=0}^{q−1} IDFT(DFT(w_{i,j}) ⊙ DFT(x_j)).   (4)

The computational complexity of an FC layer is then O(p·q·k·log k).

2.5 Comparison against Prior Works

Table 2 shows a comparison of existing HENNs. A non-interactive HENN, E2DM, was proposed to compute secure matrix multiplications using HE and MPC schemes. LoLa and CHET achieve the state-of-the-art performance among non-interactive HENNs using HE ciphertext batching techniques. However, they still suffer from long inference latency due to massive and expensive homomorphic rotations. ENSEI and MPCHE were proposed to reduce the homomorphic rotations of interactive HENNs using spectral-domain convolutions. Nevertheless, their methods, based on the convolution theorem, cannot be applied to fully-connected operations. In addition, both ENSEI and MPCHE focus on interactive HENNs and cannot perform DFT and IDFT homomorphically: they force clients to apply DFT and IDFT to unencrypted data. Directly applying their methods to non-interactive HENNs prolongs inference latency because of the huge overhead of homomorphic DFT. Our Falcon includes an efficient homomorphic DFT technique and supports both spectral convolutions and spectral FC operations, which significantly reduces the inference latency of non-interactive HENNs.

| Features | E2DM [14] | LoLa [3] | CHET [15] | ENSEI [16] | MPCHE [8] | Ours |
| --- | --- | --- | --- | --- | --- | --- |
| Non-interactive HENN | yes | yes | yes | no | no | yes |
| Spectral Convolution & FC | no | no | no | conv only | conv only | yes |
| Efficient homomorphic DFT | no | no | no | no | no | yes |

Table 2: The comparison of HENNs.

3.1 BFV-based Homomorphic DFT

Algorithm 1: Homomorphic DFT (HDFT).
  Input: a stacked input ciphertext [x]; the block size k
  Output: a spectral ciphertext [x̂]
  [x̂] = MultPC([x], ω)
  for i = 1; i <= log2(k); i++ do
      [x̂i] = rot([x̂], k / 2^i)
      [x̂] = AddCC([x̂i], [x̂])
  end
  return [x̂]

[Figure 4: Homomorphic DFT. The twiddle factors are split into real (Re) and imaginary (Im) slot vectors, multiplied slot-wise with the stacked input, and accumulated by rotate-and-add steps (e.g., rot(y, 1)); w00 = w00_Re + j·w00_Im.]
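Algorithm 1 works because of how the input and the twiddle factors are stacked into slots before the single MultPC. The exact stacking and permutation used by Falcon are prepared offline and are not fully specified in the text, so the following plaintext sketch shows only one plausible layout: k copies of the input, one per output frequency, with complex values used directly instead of the Re/Im slot pairs of the BFV version.

```python
import numpy as np

# Plaintext sketch of a slot layout under which Algorithm 1 computes a k-point
# DFT with one element-wise multiply plus log2(k) rotate-and-add steps.
def rot(v, r):
    return np.roll(v, -r)

k = 8
x = np.arange(k).astype(complex)                     # length-k input block

# "Stacked" ciphertext: k copies of x, one copy per output frequency t.
x_stacked = np.tile(x, k)

# Pre-stacked twiddles: block t holds w_k^{i*t} for i = 0..k-1 (Equation 3).
omega = np.exp(2j * np.pi / k)                        # the paper's w_n = e^{2*pi*i/n}
tw = np.concatenate([omega ** (np.arange(k) * t) for t in range(k)])

v = x_stacked * tw                                    # the single MultPC
for i in range(1, int(np.log2(k)) + 1):               # log2(k) rotate-and-adds
    v = v + rot(v, k // (2 ** i))                     # rot([x_hat], k/2^i) then AddCC

hdft = v[::k]                                         # slot t*k holds DFT(x)_t; other slots are garbage
# With w_n = e^{2*pi*i/n}, this matches numpy's inverse-FFT convention up to the 1/k factor.
assert np.allclose(hdft, np.fft.ifft(x) * k)
```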
[Figure 5: A homomorphic fully-connected layer, comparing the FC operations in LoLa (rotate-and-accumulate of slot-wise products) with Falcon's spectral FC on a toy example.]

To support the homomorphic DFT of Equation 3 under the BFV scheme, we must first quantize and encode the complex entries ω_n^{it} of the DFT conversion matrix.

Quantization. State-of-the-art non-interactive HENNs, e.g., LoLa, rely on the BFV protocol, which supports only integers. The inputs, weights, activations, and outputs of a non-interactive HENN are therefore all integers, and so is x_i in the HDFT and HIDFT of Equation 3. ω_n^{it} can be quantized as

    Q(ω_n^{it}) = ⌊S1·ω_n^{it}⌉ = ⌊S1·cos(2πit/n)⌉ + j·⌊S1·sin(2πit/n)⌉,   (5)

where ⌊·⌉ is the rounding function converting a real number to an integer and S1 is an integer scaling factor. By multiplying by S1, the real-valued inputs, weights, and ω_n^{it} of a HENN are scaled to keep more digits after the decimal point during rounding. We quantize the inputs, activations, weights, and ω_n^{it} of a HENN to 8 bits, and we integrate the quantized HENN into the forward propagation of training to minimize the accuracy loss.

Encoding Complex Numbers. Unlike HEAAN [11], the BFV protocol does not natively support complex numbers. We encode the real part (Re) ⌊S1·cos(2πit/n)⌉ and the imaginary part (Im) ⌊S1·sin(2πit/n)⌉ of Q(ω_n^{it}) in Equation 5 in two SIMD slots of a BFV ciphertext, so that C = Re + j·Im. For a HE complex addition C0 + C1, we compute (Re0 + Re1) + j·(Im0 + Im1). For a HE complex multiplication C0 · C1, we compute ((Re0 + Im0)·Re1 − (Re1 + Im1)·Im0) + j·((Re0 + Im0)·Re1 − (Re1 − Im1)·Re0), which reuses the product (Re0 + Im0)·Re1 and therefore needs only three slot-wise multiplications.

Homomorphic DFT and IDFT. We present a BFV-based homomorphic DFT in Algorithm 1, which homomorphically converts an encrypted integer vector [x] to its encrypted frequency-domain vector [x̂]. Here ω denotes the quantized DFT twiddle factors, whose entries are stacked and permuted in an offline phase. Table 3 compares the computational complexity of a prior homomorphic DFT (FHDFT) [17] and our Algorithm 1 when both use the block-circulant matrix technique. Algorithm 1 needs log2(k) rotations and additions and one MultPC operation for each real-part vector (Re) and imaginary-part vector (Im) of the input vector x. Figure 4 shows an example of Algorithm 1 with k = 2, which requires one multiplication, one rotation, and one addition for each real-part vector (Re) and imaginary-part vector (Im).

| Scheme | #MultPC | #AddCC | #rot | #Depth | noise |
| --- | --- | --- | --- | --- | --- |
| FHDFT | 3·log2(k) | 3·log2(k) | 2·log2(k) | 3·log2(k) | 3η·log2(k) |
| Ours | 1 | 2·log2(k) | 2·log2(k) | 1 | η |

Table 3: The comparison of homomorphic DFT schemes with k points. Here η = η0·ηm·k + ηr·(k − 1).
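Before moving to the spectral layers, a quick plaintext check confirms that the three-product complex multiplication rule above is algebraically equivalent to a full complex multiplication; the S1 scaling and rounding of Equation 5 are omitted here.

```python
import numpy as np

# Plaintext check of the complex-product rule in Section 3.1: with the real and
# imaginary parts kept in separate slots, one complex multiplication needs only
# three slot-wise products because (Re0 + Im0) * Re1 is shared.
rng = np.random.default_rng(1)
re0, im0, re1, im1 = rng.integers(-50, 50, 4)

p_shared = (re0 + im0) * re1          # product reused by both output parts
real = p_shared - (re1 + im1) * im0   # = re0*re1 - im0*im1
imag = p_shared - (re1 - im1) * re0   # = re0*im1 + im0*re1

assert real + 1j * imag == (re0 + 1j * im0) * (re1 + 1j * im1)
```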
3.2 Homomorphic Spectral FC Layer

Algorithm 2 shows our proposed homomorphic fully-connected (HFC) layer, which takes a spatial-domain vector x as the input and outputs the encrypted result [y] = [x]·W, where W is the weight matrix of the FC layer. If the FC layer is not the first layer, x is already packed into l = Nc(O/k) ciphertexts. In this paper, we let Nc(x) = ⌈I·x/N⌉ denote the number of ciphertexts needed to hold x stacked inputs of size I, where N is the number of ciphertext slots and ⌈y⌉ rounds y up to the nearest integer. If N is large enough, x requires only a single ciphertext.

Algorithm 2 then uses HDFT to homomorphically convert each ciphertext [xi] into its spectral-domain value [x̂i]. Next, a homomorphic element-wise multiplication between the spectral input [x̂i] and the weight Ŵi is performed by a single MultPC operation, where Ŵi can be pre-computed. According to Equation 4 and Figure 3, we do not need to accumulate the result entries inside a block, but we still need to accumulate the entries across the I/k blocks. Therefore, log2(I/k) rotations and additions are required to accumulate these blocks. We then perform an inverse HDFT step to convert the spectral result [v̂i] into its spatial value [vi]. Finally, we use AddCC to add the partial sums into one ciphertext. Table 4 compares the computational overheads and noise growth of LoLa and our Falcon. Falcon requires 2·log2(I/k) rotations, whereas LoLa needs log2(I)·Nc(O) rotations. Our experiments show that when k >= 2, Falcon requires less computation than LoLa. For the noise growth, we set η = η0·ηm·I + ηr·(I − 1). When k > 1, Falcon reduces the noise accumulation, thereby potentially enabling more efficient HE parameters.

Figure 5 shows an example of why Algorithm 2 is better than our baseline LoLa. In this example, with k = 2, I = 2, O = 2, and N = 2, LoLa requires Nc(O) = 2 ciphertexts and 2·log2(I) = 2 additions, rotations, and multiplications. Falcon requires only Nc(O/k) = 1 ciphertext and 1·log2(I/k) = 0 rotations. This is only a toy example, and the overhead of HDFT must also be counted for our method. In practice, when I, O, and N are large, the overhead of HDFT is tiny compared to the other computations within a spectral FC layer, as shown in Section 5.

Algorithm 2: Homomorphic FC Layer (HFC).
  Input: an input vector x with size I; FC weights W with size I × O; the block size k
  Output: an output ciphertext [y] = [x]·W
  x is packed into l = Nc(O/k) ciphertexts {[x0], [x1], ..., [x_{l−1}]}
  for i = 0; i < l; i++ do
      [x̂i] = HDFT([xi])
      [v̂i] = MultPC([x̂i], Ŵi)
      for j = log2(I/k); j > 0; j-- do
          [ŷi] = rot([v̂i], j)
          [v̂i] = AddCC([ŷi], [v̂i])
      end
      [vi] = inverse HDFT([v̂i])
      [yi] = AddCC([yi], [vi])
  end
  return [y] assembled from [y0], ..., [y_{l−1}]

Algorithm 3: Homomorphic Convolutional Layer.
  Input: an input tensor x with size Ic × Iw × Ih; kernels W with size Ic × Oc × f × f; the block size k
  Output: the convolution result [y]
  x is packed into l = Mc(O′/k) ciphertexts {[x0], [x1], ..., [x_{l−1}]}
  for i = 0; i < l; i++ do
      [x̂i] = HDFT([xi])
      [x̂i] = MultPC([x̂i], Ŵi)
      for jc = 1; jc <= log2(Ic); jc = jc + kc do
          [x̂i] = AddCC([x̂i], rot([x̂i], Iw·Ih·Ic / 2^jc))
      end
      for jw = 1; jw <= log2(f); jw = jw + kw do
          [x̂i] = AddCC([x̂i], rot([x̂i], Ih·f / 2^jw))
      end
      for jh = 1; jh <= log2(f); jh = jh + kh do
          [x̂i] = AddCC([x̂i], rot([x̂i], f / 2^jh))
      end
      [yi] = inverse HDFT([x̂i])
  end
  return [y] assembled from [y0], ..., [y_{l−1}]

| Scheme | #MultPC | #AddCC | #rot | #Depth | noise |
| --- | --- | --- | --- | --- | --- |
| LoLa | Nc(O) | log2(I)·Nc(O) | log2(I)·Nc(O) | 1 | η |
| Ours | 3·Nc(O/k) | 2·log2(I/k) | 2·log2(I/k) | — | — |

Table 4: The comparison of homomorphic FC operations with I inputs and O outputs.
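The arithmetic behind Equation 4 and the blockwise accumulation in Algorithm 2 (which the convolution layer described next reuses) can be checked in plaintext. The sketch below assumes one common circulant convention, namely that each block is the circulant matrix whose first column is its defining vector; Falcon's actual slot packing and rotation schedule are not modeled here.

```python
import numpy as np

# Plaintext sketch of Equation (4): an FC layer whose weight matrix is made of
# p x q circulant blocks of size k x k. Each output block is the sum, over the
# q input blocks, of IDFT(DFT(w_ij) * DFT(x_j)), with one inverse DFT per
# output block. The homomorphic version packs these values into ciphertext
# slots and performs the cross-block sums with rotate-and-add; here we only
# verify the arithmetic against an ordinary dense matrix-vector product.
rng = np.random.default_rng(2)
k, q, p = 4, 3, 2                      # block size, input blocks, output blocks

def circulant(c):
    # k x k circulant matrix whose first column is c
    return np.stack([np.roll(c, s) for s in range(len(c))], axis=1)

# Defining vectors of the p*q circulant blocks, and the input split into q blocks.
w = rng.integers(-5, 5, (p, q, k)).astype(float)
x = rng.integers(-5, 5, q * k).astype(float)
xb = x.reshape(q, k)

# Spectral path: accumulate element-wise spectral products across input blocks.
y_spectral = np.concatenate([
    np.fft.ifft(sum(np.fft.fft(w[i, j]) * np.fft.fft(xb[j]) for j in range(q))).real
    for i in range(p)
])

# Reference: the dense block-circulant weight matrix applied to x.
W = np.block([[circulant(w[i, j]) for j in range(q)] for i in range(p)])
assert np.allclose(y_spectral, W @ x)
```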
3.3 Homomorphic Spectral Convolution Layer

Algorithm 3 shows how to perform a homomorphic spectral convolution. In this paper, Ic, Iw, and Ih are the number of input channels, the input width, and the input height, respectively; the kernel size is f and the number of kernel output channels is Oc; the output channel count, height, and width are Oc, Oh, and Ow. We let Mc(x) = ⌈Ic·Ih·Iw·x / N⌉ denote the number of ciphertexts needed to hold x stacked inputs.

During a convolution, each sliding window of the input, of size I′ = f²·Ic, performs a dot-product operation. Each convolution needs O′ = Oc·f²/s² dot-products, where s is the stride size. Previous works use Mc(O′) ciphertexts to pack O′ stacked inputs, so that one homomorphic convolution is converted to O′ homomorphic dot-products, each on a sliding window of size I′. Table 5 shows that previous work requires log2(I′)·Mc(O′) rotations, with a noise growth of η = η0·ηm·I′ + ηr·(I′ − 1). By using the block-circulant matrix technique, we only need Mc(O′/k) ciphertexts, as shown in Algorithm 3, which reduces the numbers of homomorphic rotations, multiplications, and additions by up to a factor of k. In addition, the dot-product within each ciphertext in Algorithm 3 only needs to accumulate entries across blocks, which is implemented by setting kc·kw·kh = k. Thus, each dot-product of size I′ requires only I′/k additions and rotations to accumulate its partial results. Table 5 summarizes the computational overheads and noise growth of LoLa and our Falcon. Falcon reduces the rotations, additions, and multiplications over LoLa by a factor of roughly k·log2(k)/2, and incurs about k× less noise growth than LoLa.

| Scheme | #MultPC | #AddCC | #rot | #Depth | noise |
| --- | --- | --- | --- | --- | --- |
| LoLa | Mc(O′) | log2(I′)·Mc(O′) | log2(I′)·Mc(O′) | 1 | η |
| Ours | 3·Mc(O′/k) | 2·log2(I′/k) | 2·log2(I′/k) | — | — |

Table 5: The comparison of homomorphic convolution schemes.

4 Experimental Methodology

Dataset, Networks and Operation Details. Our datasets include MNIST [18] and CIFAR-10 [19]. The network architecture for MNIST is the same as LoLa [20], but we replace the spatial-domain middle layers with frequency-domain ones. The network architecture and operations are summarized in Table 6. The block size k of the circulant matrices is set to 8 so that accuracy is not decreased. For CIFAR-10, we evaluate a 3-layer CNN, the same as LoLa [20]. To keep the original accuracy, a circulant-matrix block size of k = 16 is used.

Cryptosystem Settings. We use the BFV scheme in SEAL [21] to implement Falcon. For MNIST and CIFAR-10, the plaintext modulus t is composed of the primes 2148728833, 2148794369, and 2149810177, the polynomial modulus degree is N = 16384, and the coefficient modulus Q has 440 bits. More specific encryption parameter settings are shown in Table 7 and Table 8. The security level exceeds 128 bits, as verified by the LWE estimator [22]. For fair comparisons with the baselines, we ran all experiments on the same Azure standard B8ms virtual machine with 8 vCPUs and 32GB of DRAM.

5 Results and Analysis

We compare Falcon against state-of-the-art works including CryptoNets [23], Faster CryptoNets (FCryptoNets) [24], nGraph-HE [25], CHET [15], and LoLa [20]. The homomorphic operation counts (HOPs), encryption parameter bits, message sizes, latencies, and accuracies are summarized in Table 7. HOPs is the total number of homomorphic operations, including MultPC, MultCC, AddCC, and rot. Message size is the size of the encrypted input and output that the client needs to transmit.

5.1 MNIST

As Table 7 shows, CryptoNets runs one MNIST image inference in 205 seconds, since it uses an inefficient encoding method: CryptoNets encodes each pixel into a single message, so a 28 × 28 MNIST image is encoded and encrypted into 784 ciphertexts. nGraph-HE reduces the inference latency of CryptoNets by 34% through multiple compiler optimizations.
FCryptoNets exploits the weight sparsity of neural networks to speed up MNIST inference to 39.1 seconds. However, due to the inefficient encoding method, it still suffers from a huge number of HOPs (> 67K) and a large message size (> 368MB). LoLa efficiently encodes multiple pixels of an image into a single message, so that one ciphertext can contain all 784 MNIST pixels; in practice, LoLa encodes f² = 25 messages, each containing Ow × Oh = 169 pixels, to quickly process the first-layer convolution. Although EVA [26] and LoLa significantly reduce HOPs and inference latency, they introduce 2K and 225 expensive rotations, respectively. Rotations are the computational bottleneck of LoLa. LoLa directly uses the over-pessimistic encryption parameters generated by SEAL [21], e.g., a 440-bit Q and N = 2^14. According to the noise analysis in Table 4 and Table 5, {40, 60, 80, 60, 70} noise bits are accumulated by the convolution, square, FC, square, and FC layers, respectively, so a 340-bit Q with N = 2^14 suffices while guaranteeing a 128-bit security level. Using the 340-bit Q, LoLa† (LoLa with the tightened parameters) reduces the latency of LoLa by 0.1 seconds. Compared to LoLa†, our Falcon saves 10 bits of noise by using block-circulant matrices but adds 100 bits of noise for the homomorphic DFT and inverse DFT, so Falcon uses a 430-bit Q. Falcon reduces the number of rotations by 56% and achieves an MNIST inference latency of 1.2 seconds without any accuracy loss, as shown in Table 7.

Table 6 reports the batching representations and operations that Falcon applies in each layer; these batching representations are defined in LoLa. Since the first FC layer of LoLa (I=845, O=100) occupies 63% of the total latency, we replace this layer with our homomorphic spectral FC operations and keep the other layers the same as LoLa. Falcon performs a convolution vector - row major multiplication in the first layer without rotations, generating 5 dense ciphertexts. These 5 dense ciphertexts are combined into one ciphertext with 4 rotations and additions. This ciphertext is squared with a single multiplication, producing a dense ciphertext, which is homomorphically converted into a frequency-domain ciphertext by our HDFT algorithm (Algorithm 1). Then 13 copies of the spectral ciphertext are stacked before applying the homomorphic FC layer, which takes 16 rotations when complex-number encoding is supported. A single round of HSFC then produces a 1 × 100 interleaved ciphertext, which is squared by one multiplication. The last operation is an interleaved vector - row major multiplication, which requires 70 rotations.

| Layer | Input size | Representation | Falcon HE operation | #Rot |
| --- | --- | --- | --- | --- |
| convolution layer (f²=25, Oc=5) | 25 × 169 | convolution | convolution vector - row major multiplication | 0 |
| | 5 × 169 | dense | combine to one vector using 4 rotations and additions | 4 |
| square layer | 1 × 845 | dense | square | 0 |
| DFT | 1 × 845 | dense | homomorphic DFT | 12 |
| FC layer (I=845, O=100) | 1 × 845 | dense | stack vectors using 2·log2(O/k) rotations and additions | 8 |
| | 1 × (845·O/k) | stacked | homomorphic FC | 14 |
| inverse DFT | 1 × 100 | interleaved | inverse homomorphic DFT | 12 |
| square layer | 1 × 100 | interleaved | square | 0 |
| FC layer (I=100, O=10) | 1 × 100 | interleaved | interleaved vector - row major multiplication | 70 |
| output | 1 × 10 | sparse | | |

Table 6: Falcon message representations and operations on MNIST. Falcon only replaces the first FC layer with the HSFC. Falcon uses only (12+8+14+12) = 46 rotations in that layer, reducing the rotation count of the first FC layer by 70% compared to LoLa, which consumes 151 rotations.
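As a quick tally, summing the per-layer rotation counts listed in Table 6 reproduces the totals discussed next; this is only bookkeeping, and the counts themselves are taken from the table.

```python
# Rotation bookkeeping for the MNIST network, using the per-layer counts from
# Table 6. The frequency-domain FC block (DFT + stacking + HFC + inverse DFT)
# costs 12 + 8 + 14 + 12 = 46 rotations, versus 151 in LoLa's first FC layer.
rotations = {
    "combine conv outputs": 4,
    "homomorphic DFT": 12,
    "stack vectors": 8,
    "homomorphic FC": 14,
    "inverse homomorphic DFT": 12,
    "last FC (interleaved)": 70,
}
spectral_fc = sum(rotations[name] for name in
                  ("homomorphic DFT", "stack vectors",
                   "homomorphic FC", "inverse homomorphic DFT"))
print(spectral_fc, sum(rotations.values()))   # 46 rotations in the spectral FC, 120 in total
```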
Falcon uses only 46 rotations, reducing the rotation count of the first FC layer by 70% compared to LoLa, which consumes 151 rotations. Falcon needs 120 rotations in total for an MNIST inference, a 47% reduction over the 225 rotations of LoLa, and thereby reduces the latency by 45%.

| Scheme | HOPs | AddCC | MultPC | MultCC | Rot | N (bits) | Q (bits) | Message Size | Latency (s) | Acc (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CryptoNets | 612K | 312K | 296K | 945 | 0 | - | - | 368MB | 205 | 98.95 |
| nGraph-HE | 612K | 312K | 296K | 945 | 0 | 14 | 350 | - | 135 | 98.95 |
| FCryptoNets | 67K | 38K | 24K | 945 | 0 | 13 | 219 | 411MB | 39.1 | 98.71 |
| EVA | 10K | 4K | 4K | 3 | 2K | 15 | 480 | 63MB | 121.5 | 99.05 |
| LoLa | 798 | 393 | 178 | 2 | 225 | 14 | 440 | 51MB | 2.2 | 98.95 |
| LoLa† | 798 | 393 | 178 | 2 | 225 | 14 | 340 | 51MB | 2.1 | 98.95 |
| Falcon | 626 | 312 | 204 | 2 | 120 | 14 | 430 | 51MB | 1.2 | 98.95 |

Table 7: The MNIST results.

5.2 CIFAR-10

Table 8 shows the inference latency, accuracy, and HOPs of existing works on CIFAR-10. nGraph-HE performs a CIFAR-10 inference in 1628 seconds with 62.1% accuracy. FCryptoNets improves accuracy by 14.6% over nGraph-HE with a deeper and more complex neural network architecture, but it suffers from a huge latency of 39K seconds. EVA and LoLa support the same vector representations and operations, but they are based on HEAAN and BFV, respectively, with different encryption parameters, and therefore have different latencies. LoLa introduces 53K expensive rotations, and each rotation has about 10x the latency of a MultPC multiplication. Our Falcon removes 86% of the rotations and reduces the latency from 730 seconds to 107 seconds. Table 9 reports the batching representations and operations that Falcon applies in each layer. Since the second convolution layer of LoLa occupies 97% of the total latency, we replace this layer with our homomorphic spectral convolution and keep the other layers the same as LoLa. Falcon uses 7674 rotations, reducing the rotation count of the second convolution layer by 86% compared to LoLa, which consumes 52975 rotations. Our baseline LoLa can be improved to LoLa† by using a more appropriate encryption parameter Q.

| Scheme | HOPs | AddCC | MultPC | MultCC | Rot | N (bits) | Q (bits) | Message Size | Latency (s) | Acc (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EVA | 150K | 67K | 67K | 9 | 16K | 16 | 1225 | 63MB | 3062 | 81.5 |
| LoLa | 123K | 61K | 8.2K | 2 | 53K | 14 | 440 | 210MB | 730 | 76.5 |
| LoLa† | 123K | 61K | 8.2K | 2 | 53K | 14 | 330 | 210MB | 730 | 76.5 |
| Falcon | 21K | 10K | 11.9K | 2 | 7.9K | 14 | 430 | 210MB | 107 | 76.5 |

Table 8: The CIFAR-10 results.

| Layer | Input size | Representation | Falcon homomorphic operation | #Rot |
| --- | --- | --- | --- | --- |
| convolution (f²=64, Oc=83) | 64 × 196 | convolution | convolution vector - row major multiplication | 0 |
| | 83 × 196 | dense | combine to one vector using 82 rotations and additions | 82 |
| square | 1 × 16268 | dense | square | 0 |
| DFT | 1 × 16268 | dense | homomorphic DFT | 12 |
| convolution (f²=25, Oc=163) | 1 × 16268 | dense | homomorphic convolution | 7650 |
| | 4075/k × k | sparse | combine to one vector using 4075/k − 1 additions | 0 |
| inverse DFT | 1 × 4075 | dense | inverse homomorphic DFT | 12 |
| square | 1 × 4075 | dense | square | 0 |
| FC (I=4075, O=10) | 1 × 4075 | dense | 10 dense vector - row major multiplications | 120 |
| output | 1 × 10 | sparse | | |

Table 9: Falcon operations on CIFAR-10. Falcon replaces the second convolution layer with the HSConv. Falcon uses only (12+7650+12) = 7674 rotations, reducing the rotation count of the second convolution layer by 86% compared to LoLa, which consumes 52975 rotations.

6 Conclusion

In this paper, we propose Falcon, a low-latency deep neural network on encrypted data, which consists of a homomorphic DFT unit, a homomorphic FC unit, and a homomorphic convolution unit based on block-circulant matrices. Our experimental results show that Falcon reduces inference latency by 45.45%-85.34% over prior HENNs on various datasets. Falcon is the first frequency-domain non-interactive HENN.
Broader Impact

Falcon enables low-latency privacy-preserving neural network inference on encrypted data. With Falcon, users can enjoy low-latency secure inference services. In particular, users can receive low-latency and powerful machine learning inference services by uploading their sensitive data without worrying about data privacy. Falcon has no negative impact on our society. If our proposed method fails, the latency of secure inferences will simply be prolonged.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions. This work was partially supported by the National Science Foundation (NSF) through awards CCF-1908992 and CCF-1909509.

References

[1] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pages 201-210, 2016.

[2] Qian Lou and Lei Jiang. SHE: A fast and accurate deep neural network for encrypted data. In Advances in Neural Information Processing Systems, pages 10035-10043, 2019.

[3] Alon Brutzkus, Ran Gilad-Bachrach, and Oren Elisha. Low latency privacy preserving inference. In International Conference on Machine Learning, pages 812-821, 2019.

[4] Song Bian, Tianchen Wang, Masayuki Hiromoto, Yiyu Shi, and Takashi Sato. ENSEI: Efficient secure inference via frequency-domain homomorphic convolution for privacy-preserving visual recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

[5] Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, and Raluca Ada Popa. Delphi: A cryptographic inference service for neural networks. In USENIX Security Symposium, 2020.

[6] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. GAZELLE: A low latency framework for secure neural network inference. In USENIX Security Symposium, 2018.

[7] Junfeng Fan and Frederik Vercauteren. Somewhat practical fully homomorphic encryption. Cryptology ePrint Archive, Report 2012/144, 2012. https://eprint.iacr.org/2012/144.

[8] Shaohua Li, Kaiping Xue, Bin Zhu, Chenkai Ding, Xindi Gao, David Wei, and Tao Wan. Falcon: A Fourier transform based approach for fast and secure convolutional neural network predictions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

[9] Nicolas Vasilache, Jeff Johnson, Michaël Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. In International Conference on Learning Representations, 2015.

[10] Oren Rippel, Jasper Snoek, and Ryan P. Adams. Spectral representations for convolutional neural networks. In Advances in Neural Information Processing Systems, pages 2449-2457, 2015.

[11] Kyoohyung Han, Minki Hhan, and Jung Hee Cheon. Improved homomorphic discrete Fourier transforms and FHE bootstrapping. IEEE Access, 7:57361-57370, 2019.

[12] Craig Gentry and Dan Boneh. A fully homomorphic encryption scheme. Stanford University, 2009.

[13] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, et al. CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices. In IEEE/ACM International Symposium on Microarchitecture, pages 395-408, 2017.

[14] Xiaoqian Jiang et al. Secure outsourced matrix computation and application to neural networks. In ACM SIGSAC Conference on Computer and Communications Security, 2018.
[15] Roshan Dathathri, Olli Saarikivi, Hao Chen, Kim Laine, Kristin Lauter, Saeed Maleki, Madan Musuvathi, and Todd Mytkowicz. CHET: An optimizing compiler for fully-homomorphic neural-network inferencing. In PLDI 2019, pages 142-156. ACM, June 2019.

[16] Song Bian, Tianchen Wang, Masayuki Hiromoto, Yiyu Shi, and Takashi Sato. ENSEI: Efficient secure inference via frequency-domain homomorphic convolution for privacy-preserving visual recognition. arXiv preprint arXiv:2003.05328, 2020.

[17] K. Han, M. Hhan, and J. H. Cheon. Improved homomorphic discrete Fourier transforms and FHE bootstrapping. IEEE Access, 7:57361-57370, 2019.

[18] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.

[19] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset, 2014. http://www.cs.toronto.edu/kriz/cifar.html.

[20] Alon Brutzkus et al. Low latency privacy preserving inference. In International Conference on Machine Learning, 2019.

[21] Microsoft SEAL (release 3.4). https://github.com/Microsoft/SEAL, October 2019. Microsoft Research, Redmond, WA.

[22] Martin R. Albrecht, Rachel Player, and Sam Scott. On the concrete hardness of learning with errors. Cryptology ePrint Archive, Report 2015/046, 2015. https://eprint.iacr.org/2015/046.

[23] Nathan Dowlin et al. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, 2016.

[24] Edward Chou et al. Faster CryptoNets: Leveraging sparsity for real-world encrypted inference. arXiv, abs/1811.09953, 2018.

[25] Fabian Boemer et al. nGraph-HE: A graph compiler for deep learning on homomorphically encrypted data. In ACM International Conference on Computing Frontiers, 2019.

[26] Roshan Dathathri, Blagovesta Kostova, Olli Saarikivi, Wei Dai, Kim Laine, and Madan Musuvathi. EVA: An encrypted vector arithmetic language and compiler for efficient homomorphic computation. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, pages 546-561, New York, NY, USA, 2020. Association for Computing Machinery.