# Low Latency Privacy Preserving Inference

Alon Brutzkus (Microsoft Research and Tel Aviv University, Israel), Oren Elisha (Microsoft, Israel), Ran Gilad-Bachrach (Microsoft Research, Israel). Correspondence to: Ran Gilad-Bachrach.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## Abstract

When applying machine learning to sensitive data, one has to find a balance between accuracy, information security, and computational complexity. Recent studies combined Homomorphic Encryption with neural networks to make inferences while protecting against information leakage. However, these methods are limited by the width and depth of neural networks that can be used (and hence the accuracy) and exhibit high latency even for relatively simple networks. In this study we provide two solutions that address these limitations. In the first solution, we present more than a 10× improvement in latency and enable inference on wider networks compared to prior attempts with the same level of security. The improved performance is achieved by novel methods to represent the data during the computation. In the second solution, we apply the method of transfer learning to provide private inference services using deep networks with latency of 0.16 seconds. We demonstrate the efficacy of our methods on several computer vision tasks.

## 1. Introduction

Machine learning is used in several domains such as education, health, and finance in which data may be private or confidential. Therefore, machine learning algorithms should preserve privacy while making accurate predictions. The privacy requirement pertains to all sub-tasks of the learning process, such as training and inference. For the purpose of this study, we focus on private neural-network inference. In this problem, popularized by the work on CryptoNets (Dowlin et al., 2016), the goal is to build an inference service that can make predictions on private data. To achieve this goal, the data is encrypted before it is sent to the prediction service, which should be capable of operating on the encrypted data without having access to the raw data.

Several cryptology technologies have been proposed for this task, including Secure Multi-Party Computation (MPC) (Yao, 1982; Goldreich et al., 1987), hardware enclaves, such as Intel's Software Guard Extensions (SGX) (McKeen et al., 2013), Homomorphic Encryption (HE) (Gentry, 2009), and combinations of these techniques. The different approaches present different trade-offs in terms of computation, accuracy, and security. HE presents the most stringent security model. The security assumption relies on the hardness of solving a mathematical problem for which there are no known efficient algorithms, even by quantum computers (Gentry, 2009; Albrecht et al., 2018). Other techniques, such as MPC and SGX, make additional assumptions and therefore provide weaker protection to the data (Yao, 1982; McKeen et al., 2013; Chen et al., 2018; Koruyeh et al., 2018). While HE provides the highest level of security, it is also limited in the kind of operations it allows and the complexity of these operations (see Section 3.1).

CryptoNets (Dowlin et al., 2016) was the first demonstration of a privacy preserving Encrypted Prediction as a Service (EPaaS) solution (Sanyal et al., 2018) based on HE.
CryptoNets are capable of making predictions with accuracy of 99% on the MNIST task (LeCun et al., 2010) with a throughput of 59000 predictions per hour. However, CryptoNets have several limitations. The first is latency: it takes CryptoNets 205 seconds to process a single prediction request. The second is the width of the network that can be used for inference. The encoding scheme used by CryptoNets, which encodes each node in the network as a separate message, can create a memory bottleneck when applied to networks with many nodes. For example, the CryptoNets approach requires hundreds of gigabytes of RAM to evaluate a standard convolutional network on CIFAR-10. The third limitation is the depth of the network that can be evaluated. Each layer in the network adds more sequential multiplication operations, which result in increased noise levels and message size growth (see Section 5). This makes private inference on deep networks, such as AlexNet (Krizhevsky et al., 2012), infeasible with the CryptoNets approach.

In this study, we present two solutions that address these limitations. The first solution is Low-Latency CryptoNets (LoLa), which can evaluate the same neural network used by CryptoNets in as little as 2.2 seconds. Most of the speedup (11.2×) comes from novel ways to represent the data when applying neural networks using HE; the rest of the speedup (8.2×) is due to the use of a faster implementation of HE. In a nutshell, CryptoNets represent each node in the neural network as a separate message for encryption, while LoLa encrypts entire layers and changes representations of the data throughout the computation. This speedup is achieved while maintaining the same accuracy in predictions and 128 bits of security (Albrecht et al., 2018). (The HE scheme used by CryptoNets was found to have some weaknesses that are not present in the HE scheme used here.) LoLa provides another significant benefit over CryptoNets: it substantially reduces the amount of memory required to process the encrypted messages and allows for inference on wider networks. We demonstrate this in an experiment conducted on the CIFAR-10 dataset, for which the CryptoNets approach fails to execute since it requires hundreds of gigabytes of RAM. In contrast, LoLa can make predictions in 12 minutes using only a few gigabytes of RAM.

The experiment on CIFAR demonstrates that LoLa can handle larger networks than CryptoNets. However, the performance still degrades considerably when applied to deeper networks. To handle such networks we propose another solution based on transfer learning. In this approach, data is first processed to form deep features that have higher semantic meaning. These deep features are encrypted and sent to the service provider for private evaluation. We discuss the pros and cons of this approach in Section 5 and show that it can make private predictions on the Caltech-101 dataset in 0.16 seconds using a model that has class-balanced accuracy of 81.6%. To the best of our knowledge, we are the first to propose using transfer learning in private neural network inference with HE. Our code is freely available at https://github.com/microsoft/CryptoNets.

## 2. Related Work

The task of private predictions has gained increasing attention in recent years. Dowlin et al. (2016) presented CryptoNets, which demonstrated the feasibility of private neural network predictions using HE. CryptoNets are capable of making predictions with high throughput but are limited in both the size of the network they can support and the latency per prediction.
Bourse et al. (2017) used a different HE scheme that allows fast bootstrapping, which results in only a linear penalty for additional layers in the network. However, it is slower per operation and therefore the results they presented are significantly less accurate (see Table 2). Boemer et al. (2018) proposed an HE based extension to the Intel nGraph compiler. They use similar techniques to CryptoNets (Dowlin et al., 2016) with a different underlying encryption scheme, HEAAN (Cheon et al., 2017). Their solution is slower than ours in terms of latency (see Table 2). In another recent work, Badawi et al. (2018) obtained a 40.41× acceleration over CryptoNets by using GPUs. In terms of latency, our solution is more than 6× faster than their solution even though it uses only the CPU.

Sanyal et al. (2018) argued that many of these methods leak information about the structure of the neural network that the service provider uses through the parameters of the encryption. They presented a method that leaks less information about the neural network, but their solution is considerably slower. Nevertheless, their solution has the nice benefit that it allows the service provider to change the network without requiring changes on the client side. In parallel to our work, Chou et al. (2018) introduced alternative methods to reduce latency. Their solution on MNIST runs in 39.1 seconds (98.7% accuracy), whereas our LoLa runs in 2.2 seconds (98.95% accuracy). On CIFAR, their inference takes more than 6 hours (76.7% accuracy), and ours takes less than 12 minutes (74.1% accuracy). Makri et al. (2019) apply transfer learning to private inference; however, their methods and threat model are different. Other researchers proposed using different encryption schemes. For example, the Chameleon system (Riazi et al., 2018) uses MPC to demonstrate private predictions on MNIST, and Juvekar et al. (2018) used a hybrid MPC-HE approach for the same task. Several hardware based solutions, such as the one of Tramer & Boneh (2018), were proposed. These approaches are faster; however, this comes at the cost of a lower level of security.

## 3. Background

In this section we provide a concise background on Homomorphic Encryption and CryptoNets. We refer the reader to Dowlin et al. (2017) and Dowlin et al. (2016) for a detailed exposition. We use bold letters to denote vectors, and for any vector $\mathbf{u}$ we denote by $u_i$ its $i$-th coordinate.

### 3.1. Homomorphic Encryption

In this study, we use Homomorphic Encryption (HE) to provide privacy. HE schemes allow operations on encrypted data without requiring access to the secret key (Gentry, 2009). The data used for encryption is assumed to be elements in a ring $R$. On top of the encryption function $E$ and the decryption function $D$, the HE scheme provides two additional operators $\oplus$ and $\otimes$ so that for any $x_1, x_2 \in R$,

$$D\left(E(x_1) \oplus E(x_2)\right) = x_1 + x_2 \quad \text{and} \quad D\left(E(x_1) \otimes E(x_2)\right) = x_1 \times x_2,$$

where $+$ and $\times$ are the standard addition and multiplication operations in the ring $R$. Therefore, the $\oplus$ and $\otimes$ operators allow computing addition and multiplication on the data in its encrypted form, and thus computing any polynomial function. We note that the ring $R$ refers to the plaintext message ring and not the encrypted message ring. In the rest of the paper we refer to the former ring, and note that by the homomorphic properties, any operation on an encrypted message translates to the same operation on the corresponding plaintext message.
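To make this programming model concrete, here is a minimal Python mock of the interface (a sketch only: the `Ciphertext` wrapper, `E`, and `D` are hypothetical stand-ins that hide the payload rather than encrypt it; a real scheme such as BFV encrypts polynomials and adds noise; the component-wise vector view is introduced below):

```python
from dataclasses import dataclass

@dataclass
class Ciphertext:
    _slots: tuple  # hidden payload; in a real scheme this is an encrypted polynomial

def E(v):
    """Encrypt a vector of ring elements (mock: merely wraps the values)."""
    return Ciphertext(tuple(v))

def D(c):
    """Decrypt (mock; a real scheme requires the secret key)."""
    return list(c._slots)

def he_add(c1, c2):
    """The homomorphic addition operator: component-wise on ciphertexts."""
    return Ciphertext(tuple(a + b for a, b in zip(c1._slots, c2._slots)))

def he_mul(c1, c2):
    """The homomorphic multiplication operator: component-wise on ciphertexts."""
    return Ciphertext(tuple(a * b for a, b in zip(c1._slots, c2._slots)))

# the homomorphic contract: compute on ciphertexts, decrypt the result
x1, x2 = [1, 2, 3], [4, 5, 6]
assert D(he_add(E(x1), E(x2))) == [5, 7, 9]
assert D(he_mul(E(x1), E(x2))) == [4, 10, 18]
```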
Much progress has been made since Gentry (2009) introduced the first HE scheme. We use the Brakerski/Fan-Vercauteren (BFV) scheme (Fan & Vercauteren, 2012; Brakerski & Vaikuntanathan, 2014), as implemented in version 2.3.1 of SEAL (http://sealcrypto.org/), with parameters that guarantee 128 bits of security according to the proposed standard for Homomorphic Encryption (Albrecht et al., 2018). In this scheme, the ring on which the HE operates is $R = \mathbb{Z}_p[x]/(x^n+1)$ where $\mathbb{Z}_p = \mathbb{Z}/p\mathbb{Z}$ and $n$ is a power of 2 (see the supplementary material for a brief introduction to the rings used in this work). If the parameters $p$ and $n$ are chosen so that there is an order $2n$ root of unity in $\mathbb{Z}_p$, then every element in $R$ can be viewed as a vector of dimension $n$ of elements in $\mathbb{Z}_p$, where addition and multiplication operate component-wise (Brakerski et al., 2014). Therefore, throughout this paper we will refer to the messages as $n$-dimensional vectors in $\mathbb{Z}_p$.

The BFV scheme allows another operation on the encrypted data: rotation. The rotation operation is a slight modification of the standard rotation of size $k$ that sends the value in the $i$-th coordinate of a vector to the $((i + k) \bmod n)$-th coordinate. To present the rotation operations it is easier to think of the message as a $2 \times n/2$ matrix:

$$\begin{pmatrix} m_1 & m_2 & \cdots & m_{n/2} \\ m_{n/2+1} & m_{n/2+2} & \cdots & m_n \end{pmatrix}.$$

With this view in mind, two rotations are allowed. One switches the rows, which turns the above matrix into

$$\begin{pmatrix} m_{n/2+1} & m_{n/2+2} & \cdots & m_n \\ m_1 & m_2 & \cdots & m_{n/2} \end{pmatrix},$$

and the other rotates the columns. For example, rotating the original matrix by one column to the right results in

$$\begin{pmatrix} m_{n/2} & m_1 & \cdots & m_{n/2-1} \\ m_n & m_{n/2+1} & \cdots & m_{n-1} \end{pmatrix}.$$

Since $n$ is a power of two, and the rotations we are interested in are powers of two as well, thinking about the rotations as standard rotations of the elements in the message yields similar results for our purposes. In this view, the row-rotation is a rotation of size $n/2$ and smaller rotations are achieved by column rotations.

### 3.2. CryptoNets

Dowlin et al. (2016) introduced CryptoNets, a method that performs private predictions on neural networks using HE. Since HE does not support non-linear operations such as ReLU or sigmoid, they replaced these operations by square activations. Using a convolutional network with 4 hidden layers, they demonstrated that CryptoNets can make predictions with an accuracy of 99% on the MNIST task. They presented latency of 205 seconds for a single prediction and throughput of 59000 predictions per hour. The high throughput is due to the vector structure of encrypted messages used by CryptoNets, which allows processing multiple records simultaneously. CryptoNets use a message for every input feature. However, since each message can encode a vector of dimension $n$, $n$ input records are encoded simultaneously such that $v^i_j$, the value of the $j$-th feature in the $i$-th record, is mapped to the $i$-th coordinate of the $j$-th message. In CryptoNets, all operations between vectors and matrices are implemented using additions and multiplications only (no rotations). For example, a dot product between two vectors of length $k$ is implemented by $k$ multiplications and additions. A plaintext sketch of this encoding and dot product appears below.
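A minimal plaintext simulation of the CryptoNets encoding (NumPy arrays stand in for encrypted messages; the slot-wise arithmetic mirrors the component-wise semantics above; the sizes are illustrative, not the paper's parameters):

```python
import numpy as np

n = 8   # slots per message (illustrative; real parameters use n in the thousands)
k = 4   # features per record

records = np.arange(n * k).reshape(n, k)   # n records, k features each

# CryptoNets encoding: one message per feature; record i sits in slot i
# of every message, so message j holds feature j of all n records.
messages = [records[:, j].copy() for j in range(k)]

# Dot product with a weight vector w using only multiplications and
# additions (no rotations): exactly k operations, scoring n records at once.
w = np.array([2, -1, 3, 5])
acc = np.zeros(n, dtype=records.dtype)
for j in range(k):
    acc = acc + w[j] * messages[j]   # slot-wise: all n records at once

assert np.array_equal(acc, records @ w)
```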
CryptoNets have several limitations. First, the fact that each feature is represented as a separate message results in a large number of operations for a single prediction and therefore high latency. The large number of messages also results in high memory usage and creates memory bottlenecks, as we demonstrate in Section 4.4. Furthermore, CryptoNets cannot be applied to deep networks such as AlexNet. This is because the multiplication operation in each layer increases the noise in the encrypted message and the required size of each message, which makes each operation significantly slower when many layers exist in the network (see Section 5 for further details).

## 4. Low-Latency CryptoNets (LoLa)

In this section we present the Low-Latency CryptoNets (LoLa) solution for private inference. LoLa uses various representations of the encrypted messages and alternates between them during the computation. This results in better latency and memory usage than CryptoNets, which uses a single representation where each pixel (feature) is encoded as a separate message. In Section 4.1, we show a simple example of a linear classifier, where a change of representation can substantially reduce latency and memory usage. In Section 4.2, we present different types of representations and how various matrix-vector multiplication implementations can transform one type of representation to another. In Section 4.3, we apply LoLa to a private inference task on MNIST and show that it can perform a single prediction in 2.2 seconds. In Section 4.4, we apply LoLa to perform private inference on the CIFAR-10 dataset.

### 4.1. Linear Classifier Example

In this section we provide a simple example of a linear classifier that illustrates the limitations of CryptoNets that are due to its representation. We show that a simple change of the input representation results in much faster inference and significantly lower memory usage. This change of representation technique is at the heart of the LoLa solution, and in the next sections we show how to extend it to nonlinear neural networks with convolutions.

Assume that we have a single $d$-dimensional input vector $v$ (e.g., a single MNIST image) and we would like to apply a private prediction of a linear classifier $w$ on $v$. Using the CryptoNets representation, we would need $d$ messages, one for each entry of $v$, and $d$ multiplication and addition operations to perform the dot product $w \cdot v$. Consider the following alternative representation: encode the input vector $v$ as a single message $m$ where, for all $i$, $m_i = v_i$. We can implement a dot product between two vectors, whose sizes are a power of 2, by applying point-wise multiplication between the two vectors and a series of $\log d$ rotations of sizes $1, 2, 4, \ldots, d/2$, with an addition after each rotation. The result of such a dot product is a vector that holds the result of the dot product in all its coordinates. (For example, for the vectors $(v_1, \ldots, v_4)$ and $(w_1, \ldots, w_4)$, point-wise multiplication followed by a rotation of size 1 and summation results in the vector $(v_1 w_1 + v_4 w_4,\; v_2 w_2 + v_1 w_1,\; v_3 w_3 + v_2 w_2,\; v_4 w_4 + v_3 w_3)$; another rotation of size 2 and summation results in a 4-dimensional vector that contains the dot product of the vectors in all coordinates.) Thus, in total the operation requires $\log d$ rotations and additions and a single multiplication, and uses only a single message. This results in a significant reduction in latency and memory usage compared to the CryptoNets approach, which requires $d$ operations and $d$ messages. A plaintext sketch of this procedure follows.
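A minimal plaintext simulation of the rotation-based dot product (`np.roll` stands in for the HE rotation; in the encrypted setting each multiplication, rotation, and addition is a homomorphic operation on a single message):

```python
import numpy as np

def rotate(m, k):
    # stand-in for the HE rotation of size k: slot i moves to slot (i + k) mod n
    return np.roll(m, k)

d = 8                        # vector length, a power of 2
v = np.arange(1, d + 1)      # input vector, dense representation
w = np.arange(d, 0, -1)      # classifier weights

m = v * w                    # one point-wise multiplication
shift = 1
while shift < d:             # log2(d) rotations of sizes 1, 2, ..., d/2
    m = m + rotate(m, shift)
    shift *= 2

# every slot of the single message now holds the dot product
assert np.all(m == v @ w)
```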
### 4.2. Message Representations

In the previous section we saw an example of a linear classifier with two different representations of the encrypted data and how they affect the number of HE operations and memory usage. To extend these observations to generic feed-forward neural networks, we define other forms of representations which are used by LoLa. Furthermore, we define different implementations of matrix-vector multiplication and show how they change representations during the forward pass of the network.

As an example of this change of representation, consider a matrix-vector multiplication where the input vector is represented as a single message ($m_i = v_i$ as in the previous section). We can multiply the matrix by the vector using $r$ dot-product operations, where the dot product is implemented as in the previous section and $r$ is the number of rows in the matrix. The result of this operation is a vector of length $r$ that is spread across $r$ messages (recall that the result of the dot-product operation is a vector which contains the dot-product result in all of its coordinates). Therefore, the result has a different representation than the representation of the input vector.

Different representations can induce different computational costs, and therefore the choice of the representations used throughout the computation is important for obtaining an efficient solution. In the LoLa solution, we propose using different representations for different layers of the network. In this study we present implementations for several neural networks to demonstrate the gains of using varying representations. Automating the process of finding efficient representations for a given network is beyond the scope of the current work. We present different possible vector representations in Section 4.2.1 and discuss matrix-vector multiplication implementations in Section 4.2.2. We note that several representations are new (e.g., convolutional and interleaved), whereas the SIMD and dense representations were used before.

#### 4.2.1. Vector Representations

Recall that a message encodes a vector of length $n$ of elements in $\mathbb{Z}_p$. For the sake of brevity, we assume that the dimension of the vector $v$ to be encoded is of length $k$ such that $k \le n$, for otherwise multiple messages can be combined. We consider the following representations.

**Dense representation.** A vector $v$ is represented as a single message $m$ by setting $v_i \mapsto m_i$.

**Sparse representation.** A vector $v$ of length $k$ is represented in $k$ messages $m^1, \ldots, m^k$ such that $m^i$ is a vector in which every coordinate is set to $v_i$. (Recall that the messages are in the ring $R = \mathbb{Z}_p[x]/(x^n+1)$ which, by the choice of parameters, is homomorphic to $(\mathbb{Z}_p)^n$. When a vector has the same value $v_i$ in all its coordinates, its polynomial representation in $\mathbb{Z}_p[x]/(x^n+1)$ is the constant polynomial $v_i$.)

**Stacked representation.** For a short (low dimension) vector $v$, the stacked representation holds several copies of the vector $v$ in a single message $m$. Typically this will be done by finding $d = \lceil \log k \rceil$, the smallest $d$ such that the dimension of $v$ is at most $2^d$, and setting $m_i = m_{i+2^d} = m_{i+2 \cdot 2^d} = \cdots = v_i$.

**Interleaved representation.** The interleaved representation uses a permutation $\sigma$ of $[1, \ldots, n]$ to set $m_{\sigma(i)} = v_i$. The dense representation is a special case of the interleaved representation where $\sigma$ is the identity permutation.

**Convolution representation.** This is a special representation that makes convolution operations efficient.
A convolution, when flattened to a single dimension, can be viewed as a restricted linear operation in which there is a weight vector $w$ of length $r$ (the window size) and a set of permutations $\sigma_i$ such that the $i$-th output of the linear transformation is $\sum_j w_j v_{\sigma_i(j)}$. The convolution representation takes a vector $v$ and represents it as $r$ messages $m^1, \ldots, m^r$ such that $m^j_i = v_{\sigma_i(j)}$. (For example, consider a matrix $A \in \mathbb{R}^{4 \times 4}$ which corresponds to an input image and a $2 \times 2$ convolution filter that slides across the image with stride 2 in each direction. Let $a_{i,j}$ be the entry at row $i$ and column $j$ of $A$. Then, in this case $r = 4$ and the following messages are formed: $M^1 = (a_{1,1}, a_{1,3}, a_{3,1}, a_{3,3})$, $M^2 = (a_{1,2}, a_{1,4}, a_{3,2}, a_{3,4})$, $M^3 = (a_{2,1}, a_{2,3}, a_{4,1}, a_{4,3})$, and $M^4 = (a_{2,2}, a_{2,4}, a_{4,2}, a_{4,4})$. In some cases it is more convenient to combine the interleaved representation with the convolution representation by a permutation $\tau$ such that $m^j_{\tau(i)} = v_{\sigma_i(j)}$.)

**SIMD representation.** This is the representation used by CryptoNets (Dowlin et al., 2016). They represent each feature as a separate message but map multiple data vectors into the same set of messages, as described in Section 3.2.

#### 4.2.2. Matrix-Vector Multiplications

Matrix-vector multiplication is a core operation in neural networks. The matrix may contain the learned weights of the network, and the vector represents the values of the nodes at a certain layer. Here we present different ways to implement such matrix-vector operations. Each method operates on vectors in different representations and produces output in yet another representation. Furthermore, the weight matrix has to be represented appropriately as a set of vectors, either column-major or row-major, to allow the operation. We assume that the matrix $W$ has $k$ columns $c_1, \ldots, c_k$ and $r$ rows $r_1, \ldots, r_r$. We consider the following matrix-vector multiplication implementations.

**Dense Vector Row Major.** If the vector is given as a dense vector and each row $r_j$ of the weight matrix is encoded as a dense vector, then the matrix-vector multiplication can be applied using $r$ dot-product operations. As already described above, a dot product requires a single multiplication and $\log n$ additions and rotations. The result is a sparse representation of a vector of length $r$.

**Sparse Vector Column Major.** Recall that $Wv = \sum_i v_i c_i$. Therefore, when $v$ is encoded in a sparse format, the message $m^i$ has all its coordinates set to $v_i$, and $v_i c_i$ can be computed using a single point-wise multiplication. Therefore, $Wv$ can be computed using $k$ multiplications and additions, and the result is a dense vector.

**Stacked Vector Row Major.** For the sake of clarity, assume that $k = 2^d$ for some $d$. In this case $n/k$ copies of $v$ can be stacked in a single message $m$ (this operation requires $\log(n/k)$ rotations and additions). By concatenating $n/k$ rows of $W$ into a single message, a special version of the dot-product operation can be used to compute $n/k$ elements of $Wv$ at once. First, a point-wise multiplication of the stacked vector and the concatenated rows is applied, followed by $d$ rotations and additions, where the rotations are of sizes $1, 2, \ldots, 2^{d-1}$. The result is in the interleaved representation. (For example, consider a $2 \times 2$ matrix $W$ flattened to a vector $w = (w_{1,1}, w_{1,2}, w_{2,1}, w_{2,2})$ and a two-dimensional vector $v = (v_1, v_2)$. Then, after stacking the vectors, point-wise multiplication, a rotation of size 1, and summation, the second entry of the result contains $w_{1,1} v_1 + w_{1,2} v_2$ and the fourth entry contains $w_{2,1} v_1 + w_{2,2} v_2$. Hence, the result is in an interleaved representation.) The Stacked Vector Row Major method gets its efficiency from two places: first, the number of modified dot-product operations is $rk/n$, and second, each dot-product operation requires a single multiplication and only $d$ rotations and additions (compared to $\log n$ rotations and additions in the standard dot-product procedure). A plaintext sketch of this procedure on the $2 \times 2$ example follows.
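A minimal plaintext simulation of the stacked procedure on the $2 \times 2$ example above (`np.roll` again stands in for the HE rotation):

```python
import numpy as np

def rotate(m, k):
    # stand-in for the HE rotation of size k: slot i moves to slot (i + k) mod n
    return np.roll(m, k)

n = 4                             # slots per message
W = np.array([[1, 2],             # 2x2 weight matrix, so k = 2 and d = 1
              [3, 4]])
v = np.array([5, 6])

stacked = np.tile(v, n // 2)      # (v1, v2, v1, v2): stacked representation
rows = W.reshape(-1)              # (w11, w12, w21, w22): concatenated rows

m = stacked * rows                # one point-wise multiplication
m = m + rotate(m, 1)              # d = 1 rotation of size 1, plus an addition

# entries 1 and 3 (an interleaved representation) now hold W @ v
assert m[1] == W[0] @ v and m[3] == W[1] @ v
```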
**Interleaved Vector Row Major.** This setting is very similar to the dense vector row major procedure, with the only difference being that the columns of the matrix have to be shuffled to match the permutation of the interleaved representation of the vector. The result is in sparse format.

**Convolution Vector Row Major.** A convolution layer applies the same linear transformation to different locations on the data vector $v$. For the sake of brevity, assume the transformation is one-dimensional; in neural network language that would mean that the kernel has a single map. Obviously, if more maps exist, then the process described here can be repeated multiple times. Recall that a convolution, when flattened to a single dimension, is a restricted linear operation where the weight vector $w$ is of length $r$, and there exists a set of permutations $\sigma_i$ such that the $i$-th output of the linear transformation is $\sum_j w_j v_{\sigma_i(j)}$. In this case, the convolution representation is made of $r$ messages such that the $i$-th element in the message $m^j$ is $v_{\sigma_i(j)}$. By using a sparse representation of the vector $w$, we get that $\sum_j w_j m^j$ computes the set of required outputs using $r$ multiplications and additions. When the weights are not encrypted, the multiplications used here are relatively cheap since the weights are scalars and BFV supports a fast implementation of multiplying a message by a scalar. The result of this operation is in a dense format.

### 4.3. Secure Networks for MNIST

Here we present private predictions on the MNIST dataset (LeCun et al., 2010) using the techniques described above and compare them to other private prediction solutions for this task (see Table 2). Recall that CryptoNets use the SIMD representation in which each pixel requires its own message. Therefore, since each image in the MNIST dataset is made up of an array of 28 × 28 pixels, the input to the CryptoNets network is made of 784 messages. On the reference machine used for this work (an Azure standard B8ms virtual machine with 8 vCPUs and 32 GB of RAM), the original CryptoNets implementation runs in 205 seconds. Re-implementing it to use better memory management and multi-threading in SEAL 2.3 reduces the running time to 24.8 seconds. We refer to the latter version as CryptoNets 2.3.

LoLa and CryptoNets use different approaches to evaluating neural networks. As a benchmark, we applied both to the same network, which has an accuracy of 98.95%. After suppressing adjacent linear layers, it can be presented as a 5 × 5 convolution layer with a stride of (2, 2) and 5 output maps, which is followed by a square activation function that feeds a fully connected layer with 100 output neurons, another square activation, and another fully connected layer with 10 outputs (the supplementary material includes an image of the architecture). LoLa uses different representations and matrix-vector multiplication implementations throughout the computation. Table 1 summarizes the message representations and operations that LoLa applies in each layer.
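To illustrate the input encoding used here, the following plaintext sketch builds the convolution representation for the $4 \times 4$ example of Section 4.2.1 and evaluates the convolution with $r = 4$ scalar multiplications and additions (a toy stand-in for the 28 × 28 image and its 25 messages; the filter weights are made up):

```python
import numpy as np

A = np.arange(1, 17, dtype=float).reshape(4, 4)  # toy 4x4 "image" with entries a_ij
w = np.array([0.5, -1.0, 2.0, 1.5])              # 2x2 filter, flattened; r = 4

# Convolution representation: message j collects the j-th entry of every
# 2x2 stride-2 window, e.g. M1 = (a11, a13, a31, a33), M2 = (a12, a14, a32, a34).
msgs = [A[di::2, dj::2].reshape(-1) for di in (0, 1) for dj in (0, 1)]

# Convolution Vector Row Major: r scalar-by-message multiplications and
# additions; with plaintext weights these are the cheap scalar multiplications.
out = sum(w[j] * msgs[j] for j in range(4))      # dense representation of 4 outputs

# sanity check against a direct evaluation of the strided convolution
direct = np.array([np.sum(A[i:i + 2, j:j + 2].reshape(-1) * w)
                   for i in (0, 2) for j in (0, 2)])
assert np.allclose(out, direct)
```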
The inputs to LoLa are 25 messages which are the convolution representation of the image. LoLa performs a convolution vector row major multiplication for each of the 5 maps of the convolution layer, which results in 5 dense output messages. These 5 dense output messages are joined together to form a single dense vector of 5 × 169 = 845 elements. This vector is squared using a single multiplication, and 8 copies of the result are stacked before applying the dense layer. Then 13 rounds of stacked vector row major multiplication are performed. The 13 vectors of interleaved results are rotated and added to form a single interleaved vector of dimension 100. The vector is then squared using a single multiplication. Finally, an interleaved vector row major multiplication is used to obtain the final result in sparse format. LoLa computes the entire network in only 2.2 seconds, which is 11× faster than CryptoNets 2.3 and 93× faster than CryptoNets. Table 2 shows a summary of the performance of different methods. In the supplementary material we show the dependence of the performance on the number of processor cores.

We provide two additional versions of LoLa. The first, LoLa-Dense, uses a dense representation as input and then transforms it to a convolution representation using HE operations; it proceeds similarly to LoLa in subsequent layers. It performs a single prediction in 7.2 seconds. We provide more details on this solution in the supplementary material. The second version, LoLa-Small, is similar to LoLa-Conv but has only a convolution layer, a square activation, and a dense layer. This solution has an accuracy of only 96.92% but can make a prediction in as little as 0.29 seconds.

### 4.4. Secure Networks for CIFAR

The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) presents a more challenging task of recognizing one of 10 different types of objects in an image of size 3 × 32 × 32. For this task we train a convolutional neural network that is depicted in Figure 1. The exact details of the architecture are given in the supplementary material. For inference, adjacent linear layers were collapsed to form a network with the following structure: (i) an 8 × 8 × 3 convolution with a stride of (2, 2) and 83 maps, (ii) a square activation, (iii) a 6 × 6 × 83 convolution with stride (2, 2) and 163 maps, (iv) a square activation, and (v) a dense layer with 10 outputs. The accuracy of this network is 74.1%.

This network is much larger than the network used for MNIST by CryptoNets. The input to this network has 3072 nodes, the first hidden layer has 16268 nodes, and the second hidden layer has 4075 nodes (compared to 784, 845, and 100 nodes respectively for MNIST). Due to the sizes of the hidden layers, implementing this network with the CryptoNets approach of using the SIMD representation requires hundreds of gigabytes of RAM, since a message has to be memorized for every node. Therefore, this approach is infeasible on most computers.

For this task we take a similar approach to the one presented in Section 4.3. The image is encoded using the convolution representation into 3 × 8 × 8 = 192 messages. The convolution layer is implemented using the convolution vector row major matrix-vector multiplication technique. The results are combined into a single message using rotations and additions, which allows the square activation to be performed with a single point-wise multiplication. The second convolution layer is performed using dense vector row major multiplication.
Although this layer is a convolution layer, each window of the convolution is so large that it is more efficient to implement it as a dense layer. The output is a sparse vector which is converted into a dense vector by point-wise multiplications and additions, which allows the second square activation to be performed with a single point-wise multiplication. The last dense layer is implemented with the dense vector row major technique, again resulting in a sparse output.

LoLa uses plaintext modulus $p = 2148728833 \times 2148794369 \times 2149810177$ (the factors are combined using the Chinese Remainder Theorem) and $n = 16384$. During the computation, LoLa uses 12 GB of RAM for a single prediction. It performs a single prediction in 730 seconds, of which the second layer consumes 711 seconds. The bottleneck in performance is due to the sizes of the weight matrices and data vectors, as evident from the number of parameters, which is more than 500,000, compared to fewer than 90,000 in the MNIST network.

Table 1. Message size, message representation, and operations in each layer of the LoLa inference solution on MNIST. The input size format is number of vectors × dimension.

| Layer | Input size | Representation | LoLa operation |
| --- | --- | --- | --- |
| 5 × 5 convolution layer | 25 × 169 | convolution | convolution vector row major multiplication |
| | 5 × 169 | dense | combine to one vector using 4 rotations and additions |
| square layer | 1 × 845 | dense | square |
| dense layer | 1 × 845 | dense | stack vectors using 8 rotations and additions |
| | 1 × 6760 | stacked | stacked vector row major multiplication |
| | 13 × 8 | interleaved | combine to one vector using 12 rotations and additions |
| square layer | 1 × 100 | interleaved | square |
| dense layer | 1 × 100 | interleaved | interleaved vector row major multiplication |
| output layer | 1 × 10 | sparse | |

Table 2. MNIST performance comparison. Solutions are grouped by accuracy levels.

| Method | Accuracy | Latency (s) |
| --- | --- | --- |
| FHE-DiNN100 (Bourse et al., 2017) | 96.35% | 1.65 |
| LoLa-Small | 96.92% | 0.29 |
| CryptoNets (Dowlin et al., 2016) | 98.95% | 205 |
| nGraph-HE (Boemer et al., 2018) | 98.95%* | 135 |
| Faster-CryptoNets (Chou et al., 2018) | 98.7% | 39.1 |
| CryptoNets 2.3 | 98.95% | 24.8 |
| HCNN (Badawi et al., 2018) | 99% | 14.1 |
| LoLa-Dense | 98.95% | 7.2 |
| LoLa | 98.95% | 2.2 |

\* The accuracy is not reported in Boemer et al. (2018); however, they implement the same network as in Dowlin et al. (2016).

Figure 1. The structure of the network used for CIFAR classification. (Figure not reproduced in this extraction.)

## 5. Private Inference using Deep Representations

Homomorphic Encryption has two main limitations when used for evaluating deep networks: noise growth and message size growth. Every encrypted message contains some noise, and every operation on an encrypted message increases the noise level. When the noise becomes too large, it is no longer possible to decrypt the message correctly. The mechanism of bootstrapping (Gentry, 2009) can mitigate this problem, but at the cost of a performance hit. The message size grows with the size of the network as well. Since, at its core, the HE scheme operates in $\mathbb{Z}_p$, the parameter $p$ has to be selected such that the largest number obtained during the computation is smaller than $p$. Since every multiplication might double the required size of $p$, it has to grow exponentially with respect to the number of layers in the network. The recently introduced HEAAN scheme (Cheon et al., 2017) is more tolerant towards message growth, but even HEAAN would not be able to operate efficiently on deep networks.
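A back-of-the-envelope sketch of the message-size-growth problem (the bit-widths and fan-in below are assumptions for illustration, not parameters from this work): each dense layer adds a bounded number of bits to the worst-case magnitude, but each square activation doubles the bit-size of the values, so the required plaintext modulus grows exponentially with depth:

```python
import math

def required_plaintext_bits(depth, input_bits=8, weight_bits=8, fan_in=1024):
    # crude worst-case magnitude bound: a dense layer multiplies by a weight
    # and sums fan_in terms (+weight_bits + log2(fan_in) bits), then the
    # square activation squares the values (x2 bits)
    bits = input_bits
    for _ in range(depth):
        bits += weight_bits + math.ceil(math.log2(fan_in))  # linear layer
        bits *= 2                                           # square activation
    return bits

for depth in (1, 2, 4, 8):
    # grows roughly like 2**depth, which dictates an exponentially large p
    print(depth, required_plaintext_bits(depth))
```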
We propose solving both the message size growth and the noise growth problems using deep representations: instead of encrypting the data in its raw format, it is first converted, by a standard network, to create a deep representation. For example, if the data is an image, then instead of encrypting the image as an array of pixels, a network such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2014), or ResNet (He et al., 2016) first extracts a deep representation of the image, using one of its last layers. The resulting representation is encrypted and sent for evaluation. This approach has several advantages. First, the representation is small even if the original image is large. In addition, with deep representations it is possible to obtain high accuracies using shallow networks: in most cases a linear predictor is sufficient, which translates to a fast evaluation with HE. It is also a very natural thing to do since in many cases of interest, such as medical imaging, training a very deep network from scratch is almost impossible since data is scarce. Hence, it is a common practice to use deep representations and train only the top layer(s) (Yosinski et al., 2014; Tajbakhsh et al., 2016).

To test the deep representation approach we used AlexNet (Krizhevsky et al., 2012) to generate features and trained a linear model to make predictions on the Caltech-101 dataset (Fei-Fei et al., 2006); more complex classifiers did not improve accuracy. In the supplementary material we provide a summary of the data representations used for the Caltech-101 dataset. Since the Caltech-101 dataset is not class balanced, we used only the first 30 images from each class, where the first 20 were used for training and the other 10 examples were used for testing. The obtained model has class-balanced accuracy of 81.6%. The inference on the encrypted data takes only 0.16 seconds when using the dense vector row major multiplication. We note that such transfer learning approaches are common in machine learning, but to the best of our knowledge they were not introduced as a solution to private predictions with HE before.

The use of transfer learning for private predictions has its limitations. For example, if a power-limited client uses private predictions to offload computation to the cloud, the transfer learning technique would not be useful because most of the computation is on the client's side. However, there are important cases in which this technique is useful. For example, consider a medical institution which trains a shallow network on deep representations of private x-ray images of patients and would like to make its model available for private predictions. However, to protect its intellectual property, it is not willing to share its model. In that case, it can use this technique to provide private predictions while protecting the privacy of the model.

## 6. Conclusions

The problem of privacy in machine learning is gaining importance due to legal requirements and greater awareness of the benefits and risks of machine learning systems. In this study, we presented two HE based solutions for private inference that address key limitations of previous HE based solutions. We demonstrated both the ability to operate on more complex networks and lower latency on networks that were already studied in the past. The performance gain is mainly due to the use of multiple representations during the computation process. This may be useful in other applications of HE. One example is training machine learning models over encrypted data. This direction is left for future study.
## Acknowledgments

We thank Kim Laine for helpful discussions.

## References

Albrecht, M., Chase, M., Chen, H., Ding, J., Goldwasser, S., Gorbunov, S., Hoffstein, J., Lauter, K., Lokam, S., Micciancio, D., et al. Homomorphic encryption standard. 2018.

Badawi, A. A., Chao, J., Lin, J., Mun, C. F., Jie, S. J., Tan, B. H. M., Nan, X., Aung, K. M. M., and Chandrasekhar, V. R. The AlexNet moment for homomorphic encryption: HCNN, the first homomorphic CNN on encrypted data with GPUs. arXiv preprint arXiv:1811.00778, 2018.

Boemer, F., Lao, Y., and Wierzynski, C. nGraph-HE: A graph compiler for deep learning on homomorphically encrypted data. arXiv preprint arXiv:1810.10121, 2018.

Bourse, F., Minelli, M., Minihold, M., and Paillier, P. Fast homomorphic evaluation of deep discretized neural networks. Technical report, Cryptology ePrint Archive, Report 2017/1114, 2017.

Brakerski, Z. and Vaikuntanathan, V. Efficient fully homomorphic encryption from (standard) LWE. SIAM Journal on Computing, 43(2):831–871, 2014.

Brakerski, Z., Gentry, C., and Vaikuntanathan, V. (Leveled) fully homomorphic encryption without bootstrapping. ACM Transactions on Computation Theory (TOCT), 6(3):13, 2014.

Chen, G., Chen, S., Xiao, Y., Zhang, Y., Lin, Z., and Lai, T. H. SgxPectre attacks: Leaking enclave secrets via speculative execution. arXiv preprint arXiv:1802.09085, 2018.

Cheon, J. H., Kim, A., Kim, M., and Song, Y. Homomorphic encryption for arithmetic of approximate numbers. In International Conference on the Theory and Application of Cryptology and Information Security, pp. 409–437. Springer, 2017.

Chou, E., Beal, J., Levy, D., Yeung, S., Haque, A., and Fei-Fei, L. Faster CryptoNets: Leveraging sparsity for real-world encrypted inference. arXiv preprint arXiv:1811.09953, 2018.

Dowlin, N., Gilad-Bachrach, R., Laine, K., Lauter, K., Naehrig, M., and Wernsing, J. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pp. 201–210, 2016.

Dowlin, N., Gilad-Bachrach, R., Laine, K., Lauter, K., Naehrig, M., and Wernsing, J. Manual for using homomorphic encryption for bioinformatics. Proceedings of the IEEE, 105(3):552–567, 2017.

Fan, J. and Vercauteren, F. Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive, 2012:144, 2012.

Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.

Gentry, C. Fully homomorphic encryption using ideal lattices. In STOC, volume 9, pp. 169–178, 2009.

Goldreich, O., Micali, S., and Wigderson, A. How to play any mental game. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pp. 218–229. ACM, 1987.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Juvekar, C., Vaikuntanathan, V., and Chandrakasan, A. Gazelle: A low latency framework for secure neural network inference. arXiv preprint arXiv:1801.05507, 2018.

Koruyeh, E. M., Khasawneh, K., Song, C., and Abu-Ghazaleh, N. Spectre returns! Speculation attacks using the return stack buffer. In 12th USENIX Workshop on Offensive Technologies (WOOT 18). USENIX Association, 2018.
Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Y., Cortes, C., and Burges, C. J. MNIST handwritten digit database. AT&T Labs, 2010.

Makri, E., Rotaru, D., Smart, N. P., and Vercauteren, F. EPIC: Efficient private image classification (or: Learning from the masters). In Cryptographers' Track at the RSA Conference, pp. 473–492. Springer, 2019.

McKeen, F., Alexandrovich, I., Berenzon, A., Rozas, C. V., Shafi, H., Shanbhogue, V., and Savagaonkar, U. R. Innovative instructions and software model for isolated execution. HASP@ISCA, 10, 2013.

Riazi, M. S., Weinert, C., Tkachenko, O., Songhori, E. M., Schneider, T., and Koushanfar, F. Chameleon: A hybrid secure computation framework for machine learning applications. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 707–721. ACM, 2018.

Sanyal, A., Kusner, M. J., Gascón, A., and Kanade, V. TAPAS: Tricks to accelerate (encrypted) prediction as a service. arXiv preprint arXiv:1806.03461, 2018.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Tajbakhsh, N., Shin, J. Y., Gurudu, S. R., Hurst, R. T., Kendall, C. B., Gotway, M. B., and Liang, J. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312, 2016.

Tramer, F. and Boneh, D. Slalom: Fast, verifiable and private execution of neural networks in trusted hardware. arXiv preprint arXiv:1806.03287, 2018.

Yao, A. C. Protocols for secure computations. In 23rd Annual Symposium on Foundations of Computer Science (SFCS 1982), pp. 160–164. IEEE, 1982.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.