# CryptoNAS: Private Inference on a ReLU Budget

Zahra Ghodsi, Akshaj Veldanda, Brandon Reagen, Siddharth Garg
New York University
{zg451, akv275, bjr5, sg175}@nyu.edu

**Abstract.** Machine learning as a service has given rise to privacy concerns surrounding clients' data and providers' models, and has catalyzed research in private inference (PI): methods to process inferences without disclosing inputs. Recently, researchers have adapted cryptographic techniques to show that PI is possible; however, all solutions increase inference latency beyond practical limits. This paper makes the observation that existing models are ill-suited for PI and proposes a novel NAS method, named CryptoNAS, for finding and tailoring models to the needs of PI. The key insight is that in PI operator latency costs are inverted: non-linear operations (e.g., ReLU) dominate latency, while linear layers become effectively free. We develop the idea of a ReLU budget as a proxy for inference latency and use CryptoNAS to build models that maximize accuracy within a given budget. CryptoNAS improves accuracy by 3.4% and latency by 2.4× over the state-of-the-art.

## 1 Introduction

User privacy has recently emerged as a first-order constraint, sparking interest in private inference (PI) using deep learning models: techniques that preserve data and model confidentiality without compromising user experience (i.e., inference accuracy). In response to this concern, highly secure solutions for PI have been proposed using cryptographic primitives, namely homomorphic encryption (HE) and secure multi-party computation (MPC). However, both solutions incur substantial slowdown, to the point where even state-of-the-art techniques that combine HE and MPC cannot achieve practical inference latency.
Most prior work has focused on developing new systems (e.g., Cheetah [1]) and security protocols (e.g., MiniONN [2]) for privacy, while little effort has been made to find new network architectures tailored to the needs of PI. Realizing practical PI requires a better understanding of latency bottlenecks and new deep learning model optimizations to address them directly. Prior work on PI [3, 2, 4] has developed cryptographic protocols optimized for common CNN operators. High-performance protocols take a hybrid approach to combine the strengths and avoid the pitfalls of different cryptographic methods. Most protocols today use one method for linear layers (e.g., secret sharing [5] for convolutions) and another for non-linear layers (e.g., Yao's garbled circuits [6] for ReLUs). Figure 1 compares the cryptographic latency of ReLU and linear layers for the MiniONN PI protocol [2] (details in Section 2). The data highlights how a layer's plaintext latency has little bearing on its corresponding cryptographic latency: ReLU operations dominate latency, taking up to 10,000× longer to process than convolution layers.

Given the inversion of operator latency costs, enabling efficient PI requires developing new models that maximize accuracy while minimizing ReLU operations, which we refer to as a model's ReLU budget. This stands in stark contrast to existing neural architecture search (NAS) approaches that focus only on optimizing the number of floating point operations (FLOPs). Based on this insight we propose two optimization techniques: ReLU reduction and ReLU balancing. ReLU reduction describes methods to reduce the ReLU counts starting from a ReLU-heavy baseline network, for which two solutions are developed: ReLU pruning and ReLU shuffling.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Figure 1: PI latency of a convolution and following ReLU layer.
The convolution layer has input size 8×W×W with W ∈ {8, 16, 32, 64} and filter size 8×3×3.

ReLU pruning removes unnecessary ReLUs from existing models, specifically from skip-connection and re-shaping layers of CNNs. ReLU shuffling moves the skip connection base such that the forwarded activations are post-ReLU, eliminating the redundancy. We show that ReLU reduction is effective and can be implemented with negligible accuracy impact; e.g., ReLU pruning reduces a model's ReLU count by up to 5.8× and reduces latency proportionately. ReLU balancing scales the number of channels in the CNN to maximize model capacity constrained by a ReLU budget. We prove formally that the maximum model capacity under a ReLU budget is achieved when constant ReLUs per layer are used; in contrast, traditional channel scaling commonly assumes constant FLOPs per layer [7, 8, 9]. ReLU balancing is most effective for smaller models, which benefit the most from the additional parameters, providing an accuracy improvement of up to 3.75% on the CIFAR-100 dataset.

Leveraging these optimizations, we propose CryptoNAS: a new and efficient automated search methodology to find models that maximize accuracy within a given ReLU budget. CryptoNAS searches over a space of feed-forward CNN architectures with skip connections (referred to as the macro-search space in the NAS literature [9, 10]). First, to incorporate constrained ReLU budgets, CryptoNAS uses ReLU reduction to make skip connections effectively free in terms of latency. Next, the number of channels in the core network, defined as the skip-free layers, is scaled based on ReLU balancing, which keeps the number of ReLUs per layer constant. Finally, CryptoNAS uses an efficient decoupled search procedure based on the observation that PI latency is dominated by the ReLUs of the core network, while skips can be freely added to increase accuracy.
CryptoNAS uses the ENAS [9] macro search to determine where to add skips, and separately scales the size of the core network to meet the ReLU budget. This paper makes the following contributions:

- Develop the idea of a ReLU budget, showing how ReLUs are the primary latency bottleneck in PI to motivate the need for new research in ReLU-lean networks.
- Propose optimizations for ReLU-efficient networks. ReLU reduction (ReLU pruning and shuffling) reduces network ReLU counts and ReLU balancing (provably) maximizes model capacity per ReLU. Our ReLU reduction technique reduces ReLU cost by over 5× with minimal accuracy impact, while ReLU balancing increases accuracy by 3.75% on a ReLU budget.
- We develop CryptoNAS to automatically and efficiently find new models tailored for PI. CryptoNAS decouples core and PI optimizations for search efficiency. Resulting networks along the accuracy-latency Pareto frontier outperform prior work with accuracy (latency) savings of 1.21% (2.2×) and 4.01% (1.3×) on CIFAR-10 and CIFAR-100, respectively.

## 2 Background on Private Inference

CryptoNAS implements the MiniONN protocol [2] for PI, using the same threat model and security assumptions. We note, however, that the choice of PI protocol is largely independent of the qualitative results; e.g., we expect similar benefit if secret sharing is replaced with HE, as done in Gazelle [3]. In this section we provide the necessary background on MiniONN and the cryptographic primitives it uses.

### 2.1 Cryptographic Primitives

MiniONN uses the following two-party computation (2PC) cryptographic primitives: secret sharing for linear layers and garbled circuits for non-linear operators. The protocol is executed between two parties: a server (S), hosting the model, and a client (C) with the input data for inference using S's model.

**Secret Sharing** is a 2PC technique for computing additions and multiplications on secure data; MiniONN uses the SPDZ version of secret sharing [11].
Imagine that C and S wish to compute the dot product $y = w \cdot x$, where $x \in \mathbb{F}_p^{n \times 1}$, an $n$-dimensional column vector in the finite field $\mathbb{F}_p$ (defined by prime $p$), belongs to C. Similarly, row vector $w \in \mathbb{F}_p^{1 \times n}$ belongs to S. To do so, C first blinds $x$ using a randomly picked vector $r \in \mathbb{F}_p^{n \times 1}$ and sends $x_S = x - r$ to S. $x_S$ is referred to as S's share. C's share is simply $x_C = r$. Note that the sum of the client and server shares equals the secret, i.e., $x_C + x_S = x$; this will hold true for any value shared between the two parties. Note also that S cannot infer anything about $x$ from its share alone.

To handle multiplications, the SPDZ protocol uses Beaver's multiplication triples [5], which are pre-computed offline with $w$ and $r$ as inputs from S and C, respectively. The protocol generates and outputs $u \in \mathbb{F}_p$ to S and $v \in \mathbb{F}_p$ to C, such that (1) $v$ is a random value and no party learns the other party's output; and (2) $u + v = w \cdot r$. Using the triple, S can compute $y_S$, its share of the result $y$, as follows: $y_S = w \cdot x_S + u$. C's share is simply $y_C = v$. One can verify that $y_S + y_C = y = w \cdot x$. In practice, multiplication triples can be generated using HE. Although HE is computationally expensive, this step can be performed offline if S's input ($w$) is fixed; note that the triples only require C's random input $r$, which can also be generated offline. By moving triple generation offline, the protocol's online cost of performing a linear operation (e.g., a dot product) is reduced to near-plaintext performance.

**Yao's Garbled Circuits (GC).** Secret sharing only supports additions and multiplications; it cannot be used for non-polynomial non-linear operations like ReLU. GC is an alternate secure 2PC solution that, unlike secret sharing, works on Boolean representations of C's and S's inputs and allows the two parties to compute a bi-variate Boolean function $B : \{0,1\}^n \times \{0,1\}^n \rightarrow \{0,1\}$ without either party revealing its input to the other.
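Before turning to GC details, the secret-sharing arithmetic above can be made concrete. The sketch below runs the blinded dot product with a Beaver triple over a toy prime field; the modulus and values are illustrative choices of ours, not those of an actual SPDZ deployment, and the "offline" triple is computed in the clear here rather than with HE.

```python
import secrets

P = 2_147_483_647  # illustrative prime modulus; real SPDZ uses carefully chosen parameters

def share(x):
    """Additively share a vector: C keeps r, S gets x - r (mod P)."""
    r = [secrets.randbelow(P) for _ in x]              # C's share, uniformly random
    x_s = [(xi - ri) % P for xi, ri in zip(x, r)]      # S's share reveals nothing about x
    return r, x_s

def beaver_triple(w, r):
    """Offline phase (done with HE in practice): split w . r into u + v (mod P)."""
    wr = sum(wi * ri for wi, ri in zip(w, r)) % P
    v = secrets.randbelow(P)                           # C's output, uniformly random
    u = (wr - v) % P                                   # S's output
    return u, v

# C's input x and S's weights w (toy values)
x, w = [3, 1, 4], [2, 7, 1]

r, x_s = share(x)                                      # online: C sends x_s to S
u, v = beaver_triple(w, r)                             # offline: triple generation

y_s = (sum(wi * xi for wi, xi in zip(w, x_s)) + u) % P # S's share: y_S = w . x_S + u
y_c = v                                                # C's share: y_C = v
assert (y_s + y_c) % P == sum(wi * xi for wi, xi in zip(w, x)) % P  # y = w . x = 17
```

The non-linear step of Section 2.2 can be sketched with the same bookkeeping: mocking the garbled circuit by a plain function, the shares $y^S = \max(0, q^C + q^S) - r'$ and $y^C = r'$ again sum to ReLU($q$).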
We briefly describe the GC protocol and refer the reader to [12] for more details. The GC protocol represents the function $B$ as a Boolean logic circuit with 2-input logic gates. C and S evaluate the circuit gate by gate; critically, each gate evaluation involves data transfer between the parties and expensive cryptographic computations, i.e., AES decryptions. Thus, as observed in Figure 1, the GC protocol incurs significant online cost per function evaluation. In practice, the GC protocol is asymmetric in that one party acts as the so-called garbler and the other as the evaluator. The evaluator obtains the encrypted output and shares it with the garbler. The encrypted output is then decrypted by the garbler.

### 2.2 The MiniONN Protocol for Private Inference

Here we briefly summarize the MiniONN protocol and refer the reader to the original paper for details [2]. MiniONN employs secret sharing and GC for private inference in a client-server (C-S) setting where C and S keep their input and model private, respectively. Building on MiniONN's terminology, we model a neural network containing $D$ layers as:

$$y_{i+1} = f(W_i y_i + b_i), \quad i \in [0, D-1], \tag{1}$$

where $y_0 = x$ is C's input and $y_D = z$ is the output prediction. $\{W_i\}_{i \in [0, D-1]}$ and $\{b_i\}_{i \in [0, D-1]}$ are S's model weights and biases, respectively, and $f$ is the ReLU function. Imagine that S and C have shares of layer $i$'s inputs, $y_i^S$ and $y_i^C$, respectively. MiniONN first uses secret sharing to compute shares of $q_i = W_i y_i + b_i$, which we will call $q_i^S$ and $q_i^C$. Recall that because multiplication triples are generated offline, this computation happens almost as quickly as it would non-securely. Finally, the GC protocol is used to privately perform the ReLU operation with S as the garbler and C as the evaluator. C contributes $q_i^C$ and $r_{i+1}$, its pre-generated share of the subsequent layer's input, while S contributes $q_i^S$.
The GC protocol computes $y_{i+1}^S = \max(0, q_i^C + q_i^S) - r_{i+1}$; only S is able to decrypt this value and learn the output. C sets $y_{i+1}^C = r_{i+1}$. It can be verified that C and S now hold shares of layer $i$'s outputs. The protocol proceeds similarly for the next layer.

## 3 CryptoNAS: Finding Accurate, Low-Latency Networks for PI

We now present CryptoNAS: a method for finding accurate CNNs on a ReLU budget. A formal NAS procedure is described in Section 3.3, which comprises network optimizations to reduce ReLU counts (Section 3.1) and maximize network parameters per ReLU (Section 3.2). Updates to the MiniONN protocol to support skip connections are described in Section 3.4.

Figure 2: Illustration of ReLU count reduction methods. (a) Core network prior to adding skip connections. (b) Conventional skip connections increase ReLU counts. (c) ReLU shuffling preserves the functionality of (b) but with lower ReLU cost. (d) ReLU pruning further reduces ReLU cost by completely eliminating ReLUs from skips.

**CryptoNAS Search Space.** As in prior work [10, 9], CryptoNAS searches over all feed-forward depth-$D$ CNN architectures where layer $i \in [0, D-1]$ has an input feature map of shape $H_i \times H_i \times C_i$ and performs standard convolutions with $C_{i+1}$ filters of size $F_i \times F_i \times C_i$. Layer $i$ can take inputs from any prior layer; the channels of the corresponding feature maps are concatenated and reshaped to the appropriate input dimension using 1×1 convolutions, as shown in Figure 2. Vector $S_i \in \{0,1\}^{i-1}$ indicates the preceding layers from which layer $i$ takes inputs. Consequently, the design space for CryptoNAS consists of the depth $D$ and per-layer parameters $\{H_i, C_i, F_i, S_i\}_{i \in [0, D-1]}$.
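For concreteness, a point in this design space can be encoded directly and its core-network ReLU count tallied. The encoding and counting rule below (stride-1 convolutions, one ReLU per output activation, i.e., $H_i^2 \cdot C_{i+1}$ for layer $i$) are our own illustrative assumptions, not code from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Layer:
    h: int                  # input spatial resolution H_i (H_i x H_i feature map)
    c_in: int               # input channels C_i
    c_out: int              # number of filters C_{i+1}
    f: int                  # filter size F_i
    skips: List[int] = field(default_factory=list)  # earlier layers feeding this one (S_i)

def core_relus(net: List[Layer]) -> int:
    """ReLUs of the core network, assuming stride-1 convolutions and one ReLU
    per output activation: H_i^2 * C_{i+1} for layer i."""
    return sum(l.h * l.h * l.c_out for l in net)

# A toy 3-layer architecture point
net = [
    Layer(h=32, c_in=3,  c_out=16, f=3),
    Layer(h=32, c_in=16, c_out=16, f=3, skips=[0]),
    Layer(h=16, c_in=16, c_out=64, f=3, skips=[0, 1]),  # after a 2x pooling
]
assert core_relus(net) == 32*32*16 + 32*32*16 + 16*16*64  # 49,152 ReLUs
```

Under this counting rule, skips contribute no ReLUs once pruned, which is what lets CryptoNAS treat them as free.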
### 3.1 Reducing ReLU Counts

We find that many existing hand-crafted deep learning models and NAS techniques use ReLUs indiscriminately; this is not surprising, as in plaintext inference ReLU latency is far lower than that of MAC operations and thus effectively free. We propose two simple optimizations for skip-based CNN layers to (i) significantly reduce ReLUs while preserving accuracy (ReLU shuffling) or (ii) save even more ReLUs in return for a small drop in accuracy (ReLU pruning).

**ReLU Shuffling.** A standard way (e.g., as done in DenseNets [13] and ENAS [9]) to add skip connections is to forward the post-convolution output of a layer and concatenate it with the convolution output of a future layer; the concatenated feature maps are then passed through a ReLU operation and re-shaped, as shown in Figure 2. Thus, the number of ReLU operations in a layer is magnified by the number of its incoming skip connections. ReLU shuffling proposes to move (shuffle) the standard convolution-skip-ReLU sequence to convolution-ReLU-skip (see Figure 2(c)). In doing so, the ReLU after the first convolution in the core network is reused by the skip connection. In PI, ReLU shuffling is strictly beneficial: it precisely preserves network functionality to achieve the exact same accuracy with fewer ReLUs.

**ReLU Pruning.** With ReLU shuffling the cost of skip connections is reduced, but skips still add ReLUs to the core network. ReLU pruning goes a step further by removing all ReLUs introduced by skips, as shown in Figure 2(d). We found that these ReLUs can be removed with only minor accuracy impact. ReLU pruning further reduces ReLU counts by 2× over ReLU shuffling, and we show in Section 4 that for smaller ReLU budgets, it actually provides higher accuracy compared to ReLU shuffling while being competitive for higher ReLU budgets.
### 3.2 ReLU Balancing for Maximum Network Capacity

Many traditional networks (e.g., VGG [8], ResNet [7]) and NAS procedures [9, 10] keep the number of FLOPs constant across layers, i.e., when the spatial resolution is reduced across pooling layers (typically by a factor of two in each dimension), the number of channels is correspondingly increased (typically doubled)¹. CryptoNAS's optimization objective, i.e., to minimize ReLUs and not FLOPs, suggests an alternate channel scaling rule wherein the number of ReLUs is kept constant across layers. Indeed, we show that this rule is not only intuitively appealing, but also maximizes model capacity (number of trainable parameters) for a fixed ReLU budget².

**Lemma 3.1.** Consider a depth-$D$ CNN with $H_i \times H_i \times C_i$ input tensors and $F \times F \times C_i$ sized convolutional filters in layer $i$ for all $i \in [0, D-1]$, each followed by a ReLU operation. Assuming the spatial resolution is reduced by a factor of $\alpha$ in each pooling layer, the number of trainable parameters given a fixed ReLU budget is maximized if $C_{i+1} = \alpha^2 C_i$ after pooling layer $i$.

We show in Section 4 that ReLU balancing provides consistent wins in accuracy over traditional FLOP-balanced networks, especially for small ReLU budgets.
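The lemma can be checked numerically. The sketch below uses an illustrative setup of our own (depth 12, 32×32 input, 3×3 filters, two pooling points): it sizes one network by the ReLU-balancing rule (channels grow by $\alpha^2$ after pooling) and one by the traditional FLOP-balancing rule (channels grow by $\alpha$), rescales the latter to roughly the same ReLU budget, and compares trainable parameters.

```python
def build(depth, c1, pool_growth, h0=32, alpha=2):
    """Per-layer resolutions and channels, with pooling at depth/3 and 2*depth/3."""
    hs, cs, h, c = [], [], h0, c1
    for i in range(depth):
        if i in (depth // 3, 2 * depth // 3):  # pooling point
            h //= alpha                        # spatial resolution drops by alpha
            c *= pool_growth                   # channels grow by the scaling rule
        hs.append(h)
        cs.append(c)
    return hs, cs

def relus(hs, cs):
    return sum(h * h * c for h, c in zip(hs, cs))       # one ReLU per activation

def params(cs, f=3, c_in=3):
    chans = [c_in] + cs
    return sum(f * f * a * b for a, b in zip(chans, chans[1:]))

hs_r, cs_r = build(12, 16, pool_growth=4)  # ReLU balanced: channels x alpha^2 = 4
hs_f, cs_f = build(12, 16, pool_growth=2)  # FLOP balanced: channels x alpha   = 2
scale = relus(hs_r, cs_r) / relus(hs_f, cs_f)           # equalize the ReLU budget
cs_f = [round(c * scale) for c in cs_f]

assert relus(hs_r, cs_r) == 196_608        # constant 16,384 ReLUs in every layer
assert params(cs_r) > 4 * params(cs_f)     # far more capacity at equal ReLU count
```

At an equal ReLU budget, the ReLU-balanced network here has roughly 2.0M parameters versus roughly 0.5M for the FLOP-balanced one, illustrating the capacity gap the lemma predicts.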
**Algorithm 1: CryptoNAS**

     1: Input: 𝔻 := set of network depths, R_bdg := ReLU budget
     2: Output: N* := highest-accuracy model under R_bdg
     3: Definitions:
     4:   C_i, H_i, F_i, S_i := channels, resolution, filter size, and skip connections of layer i
     5:   α := resolution scaling factor
     6: function CRYPTONAS(𝔻, R_bdg)
     7:   for D ∈ 𝔻 do
     8:     /* optimize core network parameters */
     9:     H_i = H_0 if 0 < i ≤ D/3;  H_0/α if D/3 < i ≤ 2D/3;  H_0/α² otherwise
    10:     C_i = R_bdg/(D·H_0²) if 0 < i ≤ D/3;  β·C_1 if D/3 < i ≤ 2D/3;  β²·C_1 otherwise
    11:     {S_i*, F_i*} ← ENAS(D)   /* run ENAS to optimize skip connections and filter sizes */
    12:
    13:     N_D = {C_i*, H_i*, F_i*, S_i*}, i = 0, ..., D−1
    14:   N* = argmax_{D ∈ 𝔻} acc(N_D)
    15:   return N*

### 3.3 The CryptoNAS Method

CryptoNAS uses the above optimizations to develop a NAS technique tailored to the needs of PI, searching over a large space of CNNs with skip connections. Any architecture in this search space can be viewed as containing two components: the core network (see Figure 2(a)) and the skip connections between layers of the core. For a traditional FLOP-constrained search, such as in [14], these components are coupled: the FLOPs in a larger core network would have to be compensated with fewer skips and vice-versa, requiring an expensive search over the product of these two sub-spaces. Instead, CryptoNAS leverages the two optimizations described in Sections 3.1 and 3.2 to efficiently decouple the search, i.e., to search for the core network and skip connections separately. That is, since skips are effectively free and the PI latency is dominated by the core network, searching for the most accurate architecture under a ReLU budget breaks down into two separate questions: (i) what is the most accurate core network within a ReLU budget, and (ii) where should skips be added to maximally increase the core network's accuracy? CryptoNAS answers these questions for networks of varying depth D picked from a set 𝔻 specified by the user (line 7), as described in Algorithm 1.
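The core-sizing step of Algorithm 1 can be sketched as follows; we assume β = α² per the ReLU-balancing rule, and the helper name and defaults are our own illustrative choices.

```python
def core_channels(depth, relu_budget, h0=32, alpha=2):
    """Size the core network for a ReLU budget: constant ReLUs per layer means
    C_1 = budget / (depth * H_0^2), with channels growing by alpha^2 (= beta)
    after each of the two pooling points at depth/3 and 2*depth/3."""
    c1 = relu_budget // (depth * h0 * h0)
    stage = [c1, c1 * alpha**2, c1 * alpha**4]
    return [stage[min(i // (depth // 3), 2)] for i in range(depth)]

# e.g., a 344K-ReLU budget at depth 12 on 32x32 inputs
chans = core_channels(12, 344_000)
assert chans[:5] == [27, 27, 27, 27, 108] and chans[-1] == 432
```

Because the resolution drops by α = 2 at each pooling point while channels grow by α² = 4, every layer ends up with the same ReLU count (here 32² · 27 = 27,648), keeping the total just under the budget.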
For a given depth, CryptoNAS first determines the core network's parameters as follows. Consistent with a body of prior work on NAS, pooling layers are inserted at layers D/3 and 2D/3 to reduce spatial resolution by a factor of α = 2 in each dimension. Following the ReLU-balancing rule (Section 3.2), the number of channels is correspondingly increased by a factor of α² = 4 after each of the two pooling layers. With this in place, the only remaining free variable is C₁, the number of channels in the first layer; C₁ (and correspondingly the number of channels in all subsequent layers) is set to the largest value such that the total ReLU count of the core network is within the budget (lines 9-10).

¹ Recall that FLOPs are proportional to the square of the number of channels.
² See the Appendix for the complete proof.

Separately, CryptoNAS uses ENAS macro-search [9] to determine the best skip connections for each depth (line 11). ENAS is ideally suited for this purpose because the search procedure depends on depth only and not the number of channels in the core model (recall that we seek to separately scale the size of the core model). The skip connections returned by ENAS are then stitched into the previously found core model. These two steps yield optimized networks for different depths. The final step selects the model with the highest accuracy on the validation set (line 14). Experiments show that CryptoNAS produces a Pareto frontier of points that outperform existing solutions (see Section 4.2). Critically, for each point on the Pareto front, CryptoNAS performs only one ENAS search followed by a single run to train the core model plus skips for each network depth explored.

### 3.4 Incorporating Skip Connections in MiniONN

The MiniONN protocol only considers skip-free CNNs. Here we extend MiniONN to support skip connections, which are necessary for finding highly accurate models. First, assume layer i + 1 receives a skip connection forwarding the previous layer's input.
Recall that skip connections are made by concatenating the outputs of ReLU layers and reshaping the concatenation via a 1×1 convolution. Thus, we can write the output of this concatenation layer as

$$y_{i+1} = W'_{i+1} [y'_{i+1}; y_i] + b'_{i+1}, \tag{2}$$

where $W'_{i+1}$ and $b'_{i+1}$ are the weights and biases of the 1×1 convolution layer, and $y'_{i+1} = f(W_i y_i + b_i)$. To compute this layer privately, S and C can generate multiplication triples offline for weight $W'_{i+1}$ and random value $[y_{i+1}^C; y_i^C]$ (recall that the protocol takes these two inputs). The random value in the multiplication triple protocol comes from C and is constructed by concatenating C's shares of the two layers, which are determined in the offline phase as well. This construction is easily extended to skip connections over multiple layers.

## 4 Evaluation

In this section we evaluate CryptoNAS and show that it outperforms the state-of-the-art, analyze scaling trends, and present a case study showing that our optimizations generalize to architectures beyond the core CryptoNAS search space, e.g., WRNs [15].

### 4.1 Experimental Setup

CryptoNAS implements MiniONN [2] for PI with the extensions described in Section 3.4. Beaver triples are generated with the SEAL library [16], and the ABY library [17] is used for GC. We report online runtimes for the client and server together, including both computation and communication costs. Since ReLU evaluations are bottlenecks for both computation and communication, reductions in ReLU counts reduce both costs proportionally. CryptoNAS is evaluated on CIFAR-10 and CIFAR-100 [18]. We note that CIFAR-10/100 are the standard for both PI and NAS research due to the long run times, and we chose them to be consistent with prior work. Datasets are preprocessed with image centering (subtracting the mean and dividing by the standard deviation), and images are augmented for training using random horizontal flips, 4-pixel padding, and random crops.
Experiments for latency are run on a 3 GHz Intel Xeon E5-2690 processor with 60 GB of RAM, and networks are trained on Tesla P100 GPUs. We use CryptoNAS to discover three models with depth ∈ {6, 12, 24}, which we refer to as CNet1, CNet2 and CNet3.

### 4.2 Results

**ReLU shuffling and pruning reduce ReLUs with minimal impact on accuracy.** Table 1 compares the proposed ReLU shuffling (Figure 2(c)) and ReLU pruning (Figure 2(d)) optimizations for different ReLU budgets on the CNet2 model. The ENAS baseline (Figure 2(b)) versions of these models have 385K, 1.92M, and 3.89M ReLUs; the accuracy of the ENAS baseline is the same as obtained via ReLU shuffling. We observe that compared to the baseline, ReLU shuffling reduces ReLU counts by about 1.9× for this model (across all models we see up to a 4.76× drop in ReLU counts) without compromising accuracy. For the same reduction in ReLU counts, ReLU pruning provides 2.2% higher accuracy for CIFAR-100 at a 100K ReLU budget, but incurs a small accuracy drop for larger ReLU budgets (0.64% for the 1M budget).

Table 1: Comparing the accuracy of networks generated by CryptoNAS for different ReLU budgets achieved by ReLU pruning and shuffling.

| Model (ReLU budget) | C10 Pruning Params | C10 Pruning Acc | C10 Shuffling Params | C10 Shuffling Acc | C100 Pruning Params | C100 Pruning Acc | C100 Shuffling Params | C100 Shuffling Acc |
|---|---|---|---|---|---|---|---|---|
| CNet2-100K | 1.2M | 92.18% | 305K | 90.71% | 1.3M | 68.67% | 320K | 66.46% |
| CNet2-500K | 30M | 94.41% | 7.6M | 94.51% | 31M | 77.2% | 7.8M | 77.69% |
| CNet2-1M | 124M | 95.00% | 30M | 95.47% | 128M | 79.07% | 31M | 79.71% |

Table 2: Results comparing channel scaling techniques across datasets (CIFAR-10/100) with 3 model depths (6, 12, 24). ReLU balancing produces the most accurate models on a ReLU budget.
| Model (Base-Dataset) | ReLU Budget | Depth | ReLU Balanced Params | ReLU Balanced Acc | FLOP Balanced Params | FLOP Balanced Acc | Channel Balanced Params | Channel Balanced Acc |
|---|---|---|---|---|---|---|---|---|
| CNet1-C10 | 86K | 6 | 1.4M | 91.28% | 352K | 90.55% | 94K | 88.07% |
| CNet2-C10 | 344K | 12 | 15M | 94.04% | 3.5M | 93.98% | 1M | 93.07% |
| CNet3-C10 | 1376K | 24 | 167M | 95.55% | 39M | 95.49% | 9M | 95.09% |
| CNet1-C100 | 86K | 6 | 1.3M | 68.13% | 313K | 64.38% | 94K | 50.14% |
| CNet2-C100 | 344K | 12 | 15M | 75.64% | 3.8M | 74.05% | 1.2M | 69.21% |
| CNet3-C100 | 1376K | 24 | 142M | 79.59% | 34M | 79.23% | 9M | 77.00% |

**ReLU-balanced scaling maximizes accuracy.** Next we evaluate our channel scaling rule optimization, ReLU balancing, against the commonly used FLOP-balancing approach and a strawman channel-balancing solution that has the same number of channels in each layer. Our results are shown in Table 2. We find that ReLU balancing always outperforms competing scaling techniques, likely because of its significantly larger model capacity under ReLU constraints. For CNet1-C100 at an 86K ReLU budget, ReLU balancing increases accuracy by 3.75% compared to traditional FLOP balancing. The benefits are most significant at lower ReLU budgets, likely because these models underfit and benefit more from increased capacity.

Table 3: Comparing networks discovered by ENAS to maximal networks in the search space (all skips, 5×5 filters).

| Model (Base-Dataset) | ENAS Arch Params | ENAS Arch Acc | Largest Arch Params | Largest Arch Acc |
|---|---|---|---|---|
| CNet3-C10 | 166M | 95.55% | 311M | 95.31% |
| CNet3-C100 | 149M | 79.59% | 311M | 79.11% |

**CryptoNAS performs better than maximal networks.** Given that skips are free, another relevant baseline is a maximal network that includes all skips and the largest filter sizes. We compare against maximal networks in Table 3. We observe that the ReLU-balanced maximal network for CNet3 had an accuracy loss of 0.24% (CIFAR-10) and 0.48% (CIFAR-100) compared to the models found by ENAS. That is, the models returned by ENAS are more accurate than the largest all-skip models. This also reaffirms the results reported in [10], where models with all possible skip connections showed an accuracy drop.
**Efficiency of the CryptoNAS search procedure.** The total runtime of CryptoNAS for a given ReLU budget involves a single ENAS run to find filter sizes and skip connections and a single training of the scaled core model with skip connections added, per network depth explored. Running ENAS on CNet3 (our largest model) for the CIFAR-100 dataset takes about 22 hours on relatively slow P100 GPUs, and final training takes about 19 hours.

**CryptoNAS models outperform prior art across the accuracy-latency Pareto frontier.** Using CryptoNAS, we sweep a range of model depths and ReLU budgets to understand latency-accuracy tradeoffs in PI and plot the Pareto points in Figure 3 (CryptoNAS in red). We compare CryptoNAS against prior work that (a) achieves state-of-the-art performance using standard augmentation and ReLU activations, including ResNets [7] and DenseNets [13], (b) networks like MobileNet-v2 [19] that are optimized for FLOPs instead of ReLUs, (c) the ENAS-macro and ENAS-micro search results [9], and (d) networks from prior work on PI including MiniONN [2] and Delphi [20].

Figure 3: PI runtimes for CryptoNAS networks compared to the state-of-the-art on (a) CIFAR-10 and (b) CIFAR-100. The runtime is the sum of client and server online costs.

We see that CryptoNAS's Pareto frontier dominates all prior points for regions of interest. On CIFAR-10, CNet2-100K has 0.93% higher accuracy while being 2.2× faster on inference latency compared to ResNet-20, and a more than 10% accuracy improvement over MiniONN. Compared to ResNet-110, CNet3-500K is 1.9× faster with 1.31% higher accuracy. CryptoNAS compares even more favorably to MobileNet-v2; CryptoNAS is 1.2× faster with 1.31% higher accuracy. On CIFAR-100, CNet2-1M is 3.49% more accurate than DenseNet-12-40 while being 1.5× faster.
We also achieve gains over Delphi, with CNet-100K being 1.6× and 2.4× faster than Delphi at 200K and 300K ReLU budgets, respectively, and 2.07% and 1.07% more accurate. We note that while Delphi does better than CryptoNAS for the smallest (50K) ReLU budget, it does so by systematically replacing a fraction of ReLUs with quadratic activations. In our evaluations, we give Delphi the benefit of considering all non-ReLU activations free. Second, we note that Delphi's optimizations can be applied over models found by CryptoNAS; replacing ReLUs in our networks with quadratics might yield further benefits. Finally, we note that the largest models from prior work provide slightly higher accuracy, but at an impractically high PI latency of more than 30 minutes per inference. Downscaling these models to smaller sizes resulted in much lower accuracy than comparable CryptoNAS networks.

**CryptoNAS optimizations generalize to other network architectures.** We note that although we have thus far applied CryptoNAS's optimizations to ENAS-like CNN models, the optimizations are not ENAS-specific and can be used to improve other models as well. As an example, we applied ReLU-balanced channel scaling to wide residual networks (WRNs) [15] and found consistent improvements in accuracy relative to baseline FLOP-balanced WRNs at iso-ReLU budgets. Despite the more aggressive baseline, CryptoNAS still outperforms re-scaled WRNs across a range of ReLU budgets; see Table 4 in the Appendix for complete results.

## 5 Related Work

In this section we discuss prior work on PI and NAS as related to CryptoNAS. CryptoNets [21], one of the earliest works on PI, used fully homomorphic encryption (FHE) to guarantee data (but not model) privacy. Due to its reliance on FHE for all network layers, CryptoNets is restricted to using only polynomial activations.
Subsequent work, including MiniONN [2], SecureML [4], Gazelle [3] and Delphi [20], seeks to provide both data and model privacy and supports standard activations like ReLUs; to do so, these systems switch between separately optimized protocols for linear and non-linear layers. Common to these protocols is the use of expensive GC for ReLU activations. As such, we expect CryptoNAS's benefits to extend to these protocols as well. In addition to cryptographic optimizations, Delphi [20] also proposes to automatically replace a subset of ReLUs with quadratic activations using an automated search procedure. While CryptoNAS outperforms Delphi for all but one data point, Delphi's optimizations can be applied over models found by CryptoNAS for further benefit and would be an interesting direction for future research. Nonetheless, we note that large quadratic networks are hard to train [22]. A separate line of work, DeepSecure [23] and XONN [24], has proposed PI methods for binarized neural networks, but these typically have lower accuracy than conventional networks.

There is a large and growing body of work on NAS, including methods that make use of Bayesian optimization [14], reinforcement learning (RL) [10], hill-climbing [25], genetic algorithms [26], etc. We refer the reader to an excellent survey paper by Elsken et al. [27] for more details. Most relevant to this work are multi-objective NAS methods, for example MnasNet [28], which uses measured model latency to optimize for inference time and accuracy. While these approaches could also be used to optimize for ReLU costs, they would still end up searching over spaces with a relatively high number of ReLUs per FLOP. The ReLU optimizations and decoupled search insights in CryptoNAS should help these approaches as well, in terms of both accuracy and search efficiency.
## 6 Conclusions

This paper presents CryptoNAS, a new NAS method to minimize PI latency based on the observation that ReLUs are the primary latency bottleneck in PI. This observation motivates the need for new research in ReLU-lean networks. To this end, we have proposed optimizations to obtain ReLU-lean networks, including ReLU reduction via ReLU pruning and shuffling, and ReLU balancing, which (provably) maximizes model capacity per ReLU. ReLU reduction yields > 5× reductions in ReLU cost with minimal impact on accuracy, while ReLU balancing increases accuracy by 3.75% on a fixed ReLU budget compared to traditional approaches. Finally, we develop CryptoNAS to automatically and efficiently find new models tailored for PI based on a decoupling of the search space. Resulting networks along the accuracy-latency Pareto frontier outperform prior work with accuracy (latency) savings of 1.21% (2.2×) and 4.01% (1.3×) on CIFAR-10 and CIFAR-100, respectively.

## Broader Impact

Privacy is an increasingly and critically important societal concern, especially given the fraying bonds of trust between individuals, organizations and governments. By enabling users to protect the privacy of their data and organizations to protect the privacy of their models without compromising accuracy and at reasonable latency, CryptoNAS seeks to herald a new suite of private inference solutions based on cryptography. On the other hand, performing computations in the encrypted domain can inadvertently affect the ability to detect other types of attacks on machine learning systems, e.g., inference attacks.

## Acknowledgments and Disclosure of Funding

This project was funded in part by NSF grants #1801495 and #1646671.

## References

[1] Brandon Reagen, Wooseok Choi, Yeongil Ko, Vincent Lee, Gu-Yeon Wei, Hsien-Hsin S. Lee, and David Brooks. Cheetah: Optimizing and accelerating homomorphic encryption for private inference, 2020.

[2] Jian Liu, Mika Juuti, Yao Lu, and N. Asokan.
Oblivious neural network predictions via MiniONN transformations. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 619–631, 2017.

[3] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. Gazelle: A low latency framework for secure neural network inference. In 27th USENIX Security Symposium (USENIX Security 18), pages 1651–1669, 2018.

[4] Payman Mohassel and Yupeng Zhang. SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38, 2017.

[5] Donald Beaver. Efficient multiparty protocols using circuit randomization. In Annual International Cryptology Conference, pages 420–432, 1991.

[6] Andrew Chi-Chih Yao. How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science (sfcs 1986), pages 162–167, 1986.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[8] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[9] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

[10] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[11] Ivan Damgård, Valerio Pastro, Nigel Smart, and Sarah Zakarias. Multiparty computation from somewhat homomorphic encryption. In Annual Cryptology Conference, pages 643–662, 2012.

[12] Mihir Bellare, Viet Tung Hoang, and Phillip Rogaway. Adaptively secure garbling with applications to one-time programs and secure outsourcing. In International Conference on the Theory and Application of Cryptology and Information Security, 2012.
[13] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[14] Chi-Hung Hsu, Shu-Huan Chang, Jhao-Hong Liang, Hsin-Ping Chou, Chun-Hao Liu, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Da-Cheng Juan. MONAS: Multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332, 2018.

[15] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[16] Nathan Dowlin, Ran Gilad-Bachrach, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. Manual for using homomorphic encryption for bioinformatics. 2015.

[17] Daniel Demmler, Thomas Schneider, and Michael Zohner. ABY: A framework for efficient mixed-protocol secure two-party computation. In Network and Distributed System Security Symposium (NDSS), 2015.

[18] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[19] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[20] Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, and Raluca Ada Popa. Delphi: A cryptographic inference service for neural networks. In 29th USENIX Security Symposium (USENIX Security 20), 2020.

[21] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pages 201–210, 2016.

[22] Zahra Ghodsi, Tianyu Gu, and Siddharth Garg. SafetyNets: Verifiable execution of deep neural networks on an untrusted cloud.
In Advances in Neural Information Processing Systems, pages 4672–4681, 2017.

[23] Bita Darvish Rouhani, M. Sadegh Riazi, and Farinaz Koushanfar. DeepSecure: Scalable provably-secure deep learning. In Proceedings of the 55th Annual Design Automation Conference, page 2, 2018.

[24] M. Sadegh Riazi, Mohammad Samragh, Hao Chen, Kim Laine, Kristin Lauter, and Farinaz Koushanfar. XONN: XNOR-based oblivious deep neural network inference. In 28th USENIX Security Symposium (USENIX Security 19), pages 1501–1518, 2019.

[25] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Multi-objective architecture search for CNNs. arXiv preprint arXiv:1804.09081, 2018.

[26] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, pages 2902–2911, 2017.

[27] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, pages 1–21, 2019.

[28] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.