Coded-InvNet for Resilient Prediction Serving Systems

Tuan Dinh 1, Kangwook Lee 2

Abstract
Inspired by a new coded computation algorithm for invertible functions, we propose Coded-InvNet, a new approach to design resilient prediction serving systems that can gracefully handle stragglers or node failures. Coded-InvNet leverages recent findings in the deep learning literature such as invertible neural networks, Manifold Mixup, and domain translation algorithms, identifying interesting research directions that span across machine learning and systems. Our experimental results show that Coded-InvNet can outperform existing approaches, especially when the compute resource overhead is as low as 10%. For instance, without knowing which of ten workers is going to fail, our algorithm can design a backup task to correctly recover the missing prediction with an accuracy of 85.9%, significantly outperforming the previous SOTA by 32.5%.

1 Department of Computer Sciences, University of Wisconsin-Madison, Madison, USA. 2 Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, USA. Correspondence to: Tuan Dinh, Kangwook Lee.
Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction
Prediction serving systems (PSSs) are systems that host pre-trained machine learning models for inference queries such as search, translation, and ranking. PSSs usually consist of a large number of parallel/distributed compute workers to process a large batch of input queries in parallel. As the system scales, maintaining low response time (latency) becomes more challenging. This is because the slowdown or failure of a single node can slow down the entire query processing time (Dean & Barroso, 2013; Narra et al., 2019). Multiple approaches for latency reduction have been proposed, including detect-and-relaunch (Zaharia et al., 2008), query replication (Suresh et al., 2015), and approximate inference (Han et al., 2017).

Coded computation, by introducing backup tasks in a coded form, improves system resilience with lower overhead compared to other methods (Lee et al., 2017). Recently, Kosaian et al. (2019) showed that a coded computation-based technique, called Parity Models (ParM), can significantly reduce the latency of PSSs. ParM consists of a tuple of encoder/parity model/decoder: the encoder aggregates multiple queries into a coded query, the parity model computes an inference task on it, and the decoder uses the encoded query together with the available task results to reconstruct missing task results.

However, experimental results of ParM were limited to small-scale PSSs (2-4 workers), and it is not clear whether the proposed method is applicable to large-scale PSSs. Also, small-scale regimes correspond to high resource overhead regimes, as one backup worker is required for every two to four workers. We empirically show that ParM does not scale well. For instance, when there is only one backup worker for ten main workers, ParM achieves approximately 19% accuracy on CIFAR10 image classification when one of the main workers is not available. We propose Coded-InvNet, a new coded computation-based framework to design scalable resilient prediction serving systems.
Inspired by an efficient coded computation scheme for invertible functions, Coded-InvNet designs the inference function as a computationally heavy but invertible module followed by a computationally light module. To design the invertible module, we take advantage of recent developments in invertible neural networks (Behrmann et al., 2018; Song et al., 2019; Jacobsen et al., 2018). Then, by making use of GAN-based paired domain translation algorithms (Isola et al., 2017), Coded-InvNet trains a lightweight encoding function so that one can efficiently generate encoded queries from input queries without incurring high encoding overhead. To further improve the classification accuracy in the event of failures, Coded-InvNet also leverages mixup algorithms (Zhang et al., 2017; Verma et al., 2018).

We evaluate the efficacy of Coded-InvNet in image classification and multitask learning settings. Experimental results show that Coded-InvNet can scale much beyond the scales at which existing algorithms operate. More specifically, Coded-InvNet achieves higher reconstruction accuracy compared to ParM, and the accuracy gap increases remarkably as the system scales. For instance, Coded-InvNet can achieve a reconstruction accuracy of 85.9% with 10 workers, while that of ParM is 53.4%. We also evaluate end-to-end latencies on an AWS EC2 cluster, showing that Coded-InvNet's computing overhead is negligible.

1.1. Key Idea of Coded-InvNet
The key idea of coded computation is best illustrated when the target function is linear (Lee et al., 2017). For illustration purposes, consider two inputs x1 and x2 and three parallel workers. The goal here is to assign computation tasks to the workers so that one can obtain f(x1) and f(x2) even in the presence of slowdown or failure of a node. Coded computation assigns f(x1) to the first worker and f(x2) to the second worker. For the third worker, it first combines the two input queries to get an encoded query (x1 + x2)/2. It then assigns f((x1 + x2)/2) to the third worker. By the linearity of the target function f(·), we have f((x1 + x2)/2) = (f(x1) + f(x2))/2. These three tasks can be viewed as three linearly independent weighted sums of f(x1) and f(x2), i.e.,

\[
\begin{bmatrix} f(x_1) \\ f(x_2) \\ \frac{f(x_1)+f(x_2)}{2} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}
\begin{bmatrix} f(x_1) \\ f(x_2) \end{bmatrix}. \tag{1}
\]

Observe that any two rows of the coefficient matrix on the RHS of (1) form an invertible matrix. That is, as long as any two of the three computation results are available, the decoder Dec(·, ·) can decode the computation results to recover f(x1) and f(x2). For instance, consider a situation where the second computation result f(x2) is missing. Then,

\[
\begin{bmatrix} f(x_1) \\ \frac{f(x_1)+f(x_2)}{2} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}
\begin{bmatrix} f(x_1) \\ f(x_2) \end{bmatrix}. \tag{2}
\]

Since the coefficient matrix is full rank, the decoder can simply multiply by the inverse of the coefficient matrix to recover f(x1) and f(x2).

Coded-InvNet is based on a simple observation that this coded computation framework is applicable to the family of invertible functions. Note that the target function is assumed to be linear or polynomial in the input in the existing works (Lee et al., 2017; Yu et al., 2019). Shown in Fig. 1 is a visual illustration of the coded computation algorithm for an invertible function. Consider the following encoding function Enc(x1, x2) = f^{-1}((f(x1) + f(x2))/2). By applying the target function to x1, x2, and f^{-1}((f(x1) + f(x2))/2), we obtain f(x1), f(x2), and (f(x1) + f(x2))/2. Therefore, as shown in the previous example, the decoder can always recover f(x1) and f(x2) with any two out of three task results.¹
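For concreteness, the following minimal NumPy sketch replays this scheme with the elementwise cube as a stand-in invertible (and clearly nonlinear) function; the cube, its inverse, and the vector sizes are illustrative choices, not the networks used in the paper.

```python
import numpy as np

# Stand-in invertible, nonlinear target function and its exact inverse.
# The paper's f is an invertible neural network; x**3 is only for illustration.
f = lambda x: x ** 3
f_inv = lambda y: np.cbrt(y)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=4), rng.normal(size=4)

# Worker tasks: f(x1), f(x2), and f applied to the encoded input
# Enc(x1, x2) = f^{-1}((f(x1) + f(x2)) / 2).
x3 = f_inv((f(x1) + f(x2)) / 2)            # encoded input
results = {1: f(x1), 2: f(x2), 3: f(x3)}   # f(x3) equals (f(x1) + f(x2)) / 2

# Suppose worker 2 straggles: recover f(x2) from the other two results,
# exactly as in Eq. (2): multiply the parity result by 2, subtract f(x1).
f_x2_recovered = 2 * results[3] - results[1]
assert np.allclose(f_x2_recovered, f(x2))

# Naively averaging the *inputs* would not work for a nonlinear f:
print(np.allclose(f((x1 + x2) / 2), (f(x1) + f(x2)) / 2))  # False
```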
¹ We provide a concrete example on a synthetic dataset to complete this illustration in the supplementary material.

Figure 1. Coded computation for invertible functions. Consider an invertible function f(·) and two inputs x1 and x2. The optimal encoding function Enc(·, ·) takes the two inputs and generates an encoded input f^{-1}((f(x1) + f(x2))/2). By applying the function f(·) to the original inputs and the encoded input, one can obtain linearly independent weighted sums of the f(xi)'s. Thus, a simple decoder can always take any 2 out of 3 computation results to decode f(x1) and f(x2) by solving a system of linear equations. For instance, if f(x2) is missing, one can multiply (f(x1) + f(x2))/2 by 2 and then subtract f(x1) to recover f(x2). Our framework Coded-InvNet extends this idea to design resilient prediction serving systems.

While this example demonstrates an efficient coded computation for invertible functions, this scheme as it is cannot be used in practice. This is because the encoding process involves multiple evaluations of f(·) and f^{-1}(·), resulting in a significant encoding overhead. However, in machine learning applications, one can resolve this issue for the following reasons. First, f(·) does not need to be exactly computed, and it is sufficient to compute f(·) up to some approximation error. For instance, if f(·) is a feature extractor, then a small approximation error may not alter the results of the following downstream tasks such as classification, segmentation, etc. This immediately implies that one can safely replace the computationally expensive encoding function with an approximate encoding function. Second, f(·) is also a design choice in machine learning applications. For instance, if f(·) is a feature extractor used by a downstream classification task, one can choose any f(·) as long as the classification performance is maintained.

Inspired by these observations, Coded-InvNet replaces the ideal encoding function with an approximate function that is parameterized by a lightweight neural network. Then, it learns the encoding function Enc(·) together with the target function f(·). By carefully designing the encoding function's architecture, one can obtain a low enough encoding approximation error while keeping the encoding overhead negligible.

2. Related Works
Straggler mitigation in prediction serving systems has been studied via various approaches, such as detect-and-relaunch (Zaharia et al., 2008), replication (Dean & Barroso, 2013), approximate computing (Goiri et al., 2015), and coded computing (Lee et al., 2017). Among these, coded computation requires a minimal amount of compute resource overhead, making it a promising direction, with many methods proposed to provide resiliency to slowdowns and failures in machine learning PSSs (Lee et al., 2017; Li et al., 2016; Kosaian et al., 2018; 2019; Narra et al., 2019). Learning-a-code (Kosaian et al., 2018; Dhakal et al., 2019) learns a neural network-based encoder/decoder pair to transform erasure-coded queries into a form from which decoders can reconstruct unavailable predictions. However, this approach suffers from high encoding and decoding overhead. ParM (Kosaian et al., 2019) overcomes this limitation by learning a new inference function applied to encoded queries, which they call parity models.
Narra et al. (2019) propose a novel convolutional neural network for multi-image classification that classifies a collage image in one shot, thus reducing the redundancy cost. This design is specific to image classification, while Coded-InvNet is applicable to more downstream tasks.

Invertible Neural Networks (INNs) are NNs that can be inverted from the input up to the projection, or final classes (Jacobsen et al., 2018; Behrmann et al., 2018; Song et al., 2019). i-RevNets (Jacobsen et al., 2018) define an INN family based on a succession of homeomorphic layers. Behrmann et al. (2018) build i-ResNet on top of residual networks (He et al., 2016), showing that one can invert a residual block with an exponential convergence rate via fixed-point iteration if the Lipschitz constants of the nonlinear parts are strictly less than 1. The authors bound each layer's Lipschitz constant by normalizing the weight matrix by its spectral norm. We adopt i-ResNet for our INN's architecture.

Manifold Mixup (Verma et al., 2018) extends Mixup (Zhang et al., 2017) to augment classification models with virtual data points which are random convex combinations of data points in the hidden space. This simple scheme effectively improves the margin in the hidden space, helps flatten the class-conditional representation, and reduces the number of directions with significant variance. Coded-InvNet uses a classifier augmented with Manifold Mixup to make it more robust to encoding approximation errors.

Pix2Pix (Isola et al., 2017) is an image-to-image translation model that converts a set of images into a target image. Pix2Pix trains a conditional GAN (Mirza & Osindero, 2014) on pairs of input and target images, using a combination of GAN and L1 losses. Pix2Pix uses PatchGAN (Isola et al., 2017) and U-Net (Ronneberger et al., 2015) for the discriminator and generator architectures, respectively. We use the Pix2Pix model for our encoder.
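A minimal PyTorch sketch of the invertible residual block idea summarized in the i-ResNet paragraph above, assuming a fully connected block, an illustrative Lipschitz factor, and a plain fixed-point loop; the paper's actual classifier follows the configurations of Behrmann et al. (2018), not this toy block.

```python
import torch
import torch.nn as nn

class InvertibleResidualBlock(nn.Module):
    """y = x + g(x), invertible when Lip(g) < 1 (Behrmann et al., 2018)."""
    def __init__(self, dim: int, hidden: int = 128, lip: float = 0.7):
        super().__init__()
        # spectral_norm keeps each linear layer's spectral norm at roughly 1;
        # the extra `lip` factor keeps the overall Lipschitz bound strictly below 1.
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(dim, hidden)), nn.ELU(),
            nn.utils.spectral_norm(nn.Linear(hidden, dim)),
        )
        self.lip = lip

    def g(self, x):
        return self.lip * self.net(x)

    def forward(self, x):
        return x + self.g(x)

    @torch.no_grad()
    def inverse(self, y, n_iters: int = 100):
        # Fixed-point iteration x <- y - g(x); converges geometrically
        # because g is a contraction.
        x = y.clone()
        for _ in range(n_iters):
            x = y - self.g(x)
        return x

block = InvertibleResidualBlock(dim=8)
x = torch.randn(4, 8)
for _ in range(10):     # let the spectral-norm power iteration settle
    block(x)
block.eval()            # freeze the spectral-norm estimate for the check below
assert torch.allclose(block.inverse(block(x)), x, atol=1e-3)
```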
3. Coded-InvNet
3.1. Setting
Given a task to be deployed, we train an invertible network followed by a neural network that has significantly fewer layers and parameters compared to the first part. While this specific architectural choice may seem restrictive, it indeed supports a large range of applications. First of all, for classification tasks, state-of-the-art invertible NNs nearly match the performance of non-invertible NNs, and the gap is quickly closing. For instance, MintNet (Song et al., 2019) achieves only 1.4% lower accuracy on CIFAR10 than ResNet. Moreover, even non-invertible NNs are pseudo-invertible when trained with adversarial training (Engstrom et al., 2019), so our framework can serve any robust classification model. Though we limit our focus to classification in this work, Coded-InvNet can be immediately extended to support generative models that are based on invertible architectures, such as flow-based models (Kingma & Dhariwal, 2018) or invertible Transformer-based models (Kim et al., 2020).

Shown in Fig. 2 is the PSS architecture of Coded-InvNet. Here, the overall inference network is denoted by ŷ(x) := g(f(x)), where f(·) is the invertible module of the neural network and g(·) is the second module of the network. The goal is to compute ŷ1 = g(f(x1)), ŷ2 = g(f(x2)), ..., ŷk = g(f(xk)) in parallel with n (n ≥ k) parallel workers in the presence of stragglers or failures. At the time a query arrives, the front end of the PSS applies n − k encoding functions to obtain n − k encoded queries. The i-th encoding function is denoted by Enc_i(·), and the i-th encoded query is denoted by x_{k+i} for 1 ≤ i ≤ n − k. That is, x1, x2, ..., xk denote the k inputs while x_{k+1}, x_{k+2}, ..., xn denote the n − k encoded inputs. Similar to the example illustrated in Sec. 1.1, we first identify the design of the n − k ideal encoding functions as follows.

An ideal encoded input is such that one can get a linear combination of the f(xi)'s by applying f(·) to it. Let f(x_{k+i}) = Σ_{j=1}^{k} c_{i,j} f(xj) for 1 ≤ i ≤ n − k. Then,

\[
\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_k) \\ f(x_{k+1}) \\ \vdots \\ f(x_n) \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
c_{1,1} & c_{1,2} & \cdots & c_{1,k} \\
\vdots & & & \vdots \\
c_{n-k,1} & c_{n-k,2} & \cdots & c_{n-k,k}
\end{bmatrix}
\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_k) \end{bmatrix}. \tag{3}
\]

To guarantee decodability when any k tasks complete, any k rows of the coefficient matrix on the RHS of (3) must form a full-rank matrix. For instance, when n = k + 1, i.e., there is only one additional worker, one can satisfy this property by setting c_{1,j} = 1/k for all j. Also, when the c_{i,j}'s are chosen from an i.i.d. Gaussian distribution, one can show that the coefficient matrix satisfies the property with probability 1. Referring the readers to (Bodmann, 2013) for various ways of choosing the c_{i,j}'s, we will assume that the c_{i,j}'s are chosen such that the desired property holds.

Remark 1. When choosing the c_{i,j}'s, in addition to the decodability condition, one should also consider whether f(x_{k+i}) = Σ_{j=1}^{k} c_{i,j} f(xj) lies close enough to the original manifolds in the embedding space. This is because the invertibility of f^{-1} may not hold for embedding vectors that are too far away from the original manifolds. Moreover, if this additional condition holds, f^{-1}(f(x_{k+i})) might share some semantic features with the original data. This essentially reduces the gap between the domain where original inputs belong and the domain where encoded inputs belong, allowing for efficient learning of encoding functions with domain translation techniques.

Figure 2. Prediction serving system architecture of Coded-InvNet. Given k inputs x1, x2, ..., xk, the goal of the prediction serving system is to compute ŷ(xi) := g(f(xi)) for all 1 ≤ i ≤ k in the presence of stragglers or failures, where f(·) is an invertible function. To this end, Coded-InvNet first generates n − k encoded inputs x_{k+1}, ..., xn by applying n − k distinct encoding functions to the xi's, with x_{k+i} := Enc_i(x1, x2, ..., xk) ≈ f^{-1}(Σ_{j=1}^{k} c_{i,j} f(xj)). It then assigns the task of computing f(xi) to each of the n parallel workers. By leveraging the coded computation algorithm for invertible functions described in Sec. 1.1, one can approximately decode f(x1), f(x2), ..., f(xk) as soon as any k out of n tasks are completed. Note that approximation errors occur since we use approximate encoding functions. Once f(x1), f(x2), ..., f(xk) are computed, the front end applies g(·) to each of them and returns the query results.
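The decodability condition above ("any k rows of the coefficient matrix are full rank") can be checked numerically; the sketch below does so for the i.i.d. Gaussian choice and for the n = k + 1 choice c_{1,j} = 1/k, with illustrative sizes.

```python
import itertools
import numpy as np

def coefficient_matrix(n: int, k: int, rng) -> np.ndarray:
    """Top k rows: identity (uncoded tasks). Bottom n-k rows: the c_{i,j}'s."""
    c = rng.standard_normal((n - k, k))     # i.i.d. Gaussian choice from Sec. 3.1
    return np.vstack([np.eye(k), c])

def decodable(mat: np.ndarray, k: int) -> bool:
    """Every k-subset of rows must be full rank: any k finished tasks suffice."""
    return all(
        np.linalg.matrix_rank(mat[list(rows)]) == k
        for rows in itertools.combinations(range(mat.shape[0]), k)
    )

rng = np.random.default_rng(0)
m = coefficient_matrix(n=12, k=10, rng=rng)
print(decodable(m, k=10))   # True with probability 1 for Gaussian c_{i,j}

# The n = k + 1 special case with c_{1,j} = 1/k is decodable as well.
m1 = np.vstack([np.eye(10), np.full((1, 10), 1 / 10)])
print(decodable(m1, k=10))  # True
```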
3.2. Approximate Encoding Functions
Recall that ideal encoding functions cannot be used in practice due to their large computation overhead. Thus, we use a neural network (NN) to approximate the ideal encoding functions. The universal approximation theorem asserts that ideal encoding functions can be well approximated by some NNs (Cybenko, 1989). By limiting the number of layers and parameters of the NN, we can control the computation overhead of the NN-based approximate encoding functions.

More specifically, for each i, 1 ≤ i ≤ n − k, we want to train a neural network Enc_i(·) such that Enc_i(x1, ..., xk) ≈ f^{-1}(Σ_{j=1}^{k} c_{i,j} f(xj)). Note that if x ∈ X, then Enc_i(·) : X^k → X, i.e., the encoding function takes k inputs from domain X to generate one output in X. Thus, to learn the i-th encoding function, we need to collect the following train set: {((x1, x2, ..., xk), f^{-1}(Σ_{j=1}^{k} c_{i,j} f(xj)))}. Here, the input (x1, x2, ..., xk) is a tuple of k randomly chosen inputs from the train set. Given the input tuple, the label f^{-1}(Σ_{j=1}^{k} c_{i,j} f(xj)) can always be computed by applying f(·) and f^{-1}(·). Once the train set for an encoding function is collected, one can train the encoding function such that Enc_i(x1, ..., xk) ≈ f^{-1}(Σ_{j=1}^{k} c_{i,j} f(xj)) holds on the collected train set.

One can also make use of an explicit semantic loss when training encoding functions. For illustration purposes, assume that n = k + 1 and c_{1,j} = 1/k for all j. When f(xa) is missing for some 1 ≤ a ≤ k, and all the other task results are available, one can decode f(xa) as follows: f̂(xa) = k f(x_{k+1}) − Σ_{i=1,...,k, i≠a} f(xi). The inference result will then be g(f̂(xa)). By comparing this inference result with the correct result g(f(xa)), one can explicitly capture the semantic difference between them. For instance, if the target task is classification, one can compare the logit values of these outputs and apply the distillation loss (Hinton et al., 2015). Note that one can also co-train these encoding functions together with the target function g ∘ f.

Remark 2. One can further reduce the inference overhead of the encoder by applying neural network compression techniques such as pruning, knowledge distillation, vector quantization, etc. Such techniques allow for larger encoder architectures, reducing the encoding error.

Remark 3. One can deploy Enc(·, ·) at the front-end server or at the backup workers. The former option increases the computational load of the front-end server, while the latter increases the load of the backup workers. Moreover, the latter incurs extra communication cost between the front end and the workers. Thus, one should make a proper choice considering the communication/computation tradeoff.
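A sketch of the train-set construction for one encoding function described above, assuming a trained invertible module `f`, its inverse `f_inv`, and a standard (image, label) dataset are given; all of these, as well as the function names and shapes, are illustrative assumptions rather than the paper's code.

```python
import torch

@torch.no_grad()
def make_encoder_pair(xs, f, f_inv, c):
    """One training example for Enc_i: the input tuple (x_1, ..., x_k) and the
    ideal encoded input f^{-1}(sum_j c_j f(x_j)) used as the regression target.
    `xs` has shape (k, ...); `c` is a length-k weight vector; f and f_inv are
    assumed to operate on single (unbatched) examples."""
    feats = torch.stack([f(x) for x in xs])                      # f(x_1), ..., f(x_k)
    weights = c.view(-1, *([1] * (feats.dim() - 1)))             # broadcast c_{i,j}
    return xs, f_inv((weights * feats).sum(0))                   # (inputs, label)

def build_train_set(dataset, f, f_inv, k, num_pairs, c=None, seed=0):
    """Draw k random inputs per example (the paper repeats this 50,000 times)."""
    gen = torch.Generator().manual_seed(seed)
    c = torch.full((k,), 1.0 / k) if c is None else c            # c_{1,j} = 1/k for n = k+1
    pairs = []
    for _ in range(num_pairs):
        idx = torch.randint(len(dataset), (k,), generator=gen)
        xs = torch.stack([dataset[int(i)][0] for i in idx])      # assumes (image, label) items
        pairs.append(make_encoder_pair(xs, f, f_inv, c))
    return pairs
```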
3.3. Minimizing Encoding Error Propagation
Once we introduce approximate encoding functions into this framework, f(x_{k+i}), for 1 ≤ i ≤ n − k, will not exactly match the desired linear combination Σ_{j=1}^{k} c_{i,j} f(xj). That is, with approximate encoding, we only have x_{k+i} ≈ f^{-1}(Σ_{j=1}^{k} c_{i,j} f(xj)), so f(x_{k+i}) = Σ_{j=1}^{k} c_{i,j} f(xj) + ε, where ε is the approximation error in f, which depends on (x1, x2, ..., xk).

To see how this approximation error affects the inference quality, assume that n = k + 1 and consider a scenario where f(x1) is missing while f(x2), f(x3), ..., f(xk), f(x_{k+1}) are available. By assuming c_{1,j} = 1/k, one can decode f(x1) as follows: f̂(x1) = k f(x_{k+1}) − Σ_{i=2}^{k} f(xi). Since f(x_{k+1}) = Σ_{j=1}^{k} c_{1,j} f(xj) + ε, we have f̂(x1) = f(x1) + kε. Then, ŷ1 = g(f(x1) + kε). Therefore, the recovered inference result for a missing computation task will be incorrect if ŷ1 = g(f(x1) + kε) ≠ g(f(x1)).

To prevent this, one must first make sure that the encoding functions are well trained such that Enc_i(x1, ..., xk) ≈ f^{-1}(Σ_{j=1}^{k} c_{i,j} f(xj)) holds. That way, one can expect the magnitude of ε to be small enough. However, even when the encoding error ε is small, if the second part of the inference module g(·) is highly sensitive to small perturbations, inference results could still be incorrect. Indeed, Verma et al. (2018) show that standard deep neural networks are highly sensitive to small noise injected in the embedding space. This implies that a small encoding error, which necessarily arises during the approximate encoding procedure, can distort the overall inference result. To resolve this issue, we leverage the recent Manifold Mixup (Verma et al., 2018) regularization technique, which smooths class-conditional embedding spaces, making inference outputs more robust to noise injected in the embedding space. Inspired by this, Coded-InvNet trains f(·) and g(·) with Manifold Mixup to ensure that ŷ1 = g(f(x1) + kε) = g(f(x1)) with high probability.
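A simplified sketch of the mixing step in the spirit of Manifold Mixup as it is used here: embeddings are mixed at the f/g interface together with soft labels, so that g stays smooth around convex combinations of embeddings (and hence more tolerant of the encoding error ε). The paper follows Verma et al. (2018), which also mixes at randomly chosen hidden layers; this sketch mixes only at the interface, and the names f, g, alpha are assumptions.

```python
import torch
import torch.nn.functional as F

def mixup_at_embedding_step(f, g, x, y, num_classes, alpha=1.0):
    """One training step: mix embeddings f(x) and one-hot labels with a
    Beta(alpha, alpha) coefficient, then train g on the mixed pairs.
    x: batch of inputs; y: LongTensor of class indices."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)

    h = f(x)                                   # embeddings
    h_mix = lam * h + (1.0 - lam) * h[perm]    # convex combination in embedding space

    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]

    logits = g(h_mix)
    # Soft cross-entropy against the mixed targets.
    loss = -(y_mix * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    return loss
```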
3.4. Advantages of Coded-InvNet
Separation between Encoding and Inference. By design, Coded-InvNet maintains a clear separation between encoding and inference,² which is highly advantageous when building a prediction serving system. First, every worker does the same work, i.e., simply computing f(·), simplifying the system implementation and management. On the other hand, ParM requires a highly heterogeneous system configuration: the first k workers compute f(·), and the other n − k workers compute n − k distinct parity models, say f'_1(·), ..., f'_{n−k}(·). Similarly, when the total number of workers n and the total number of inputs k change, Coded-InvNet does not require any change to the worker configuration and simply needs to retrain the encoder. Second, when optimizing inference time via various techniques such as model compression (Zhang et al., 2019), hardware optimization (Marculescu et al., 2018), or network pruning (Blalock et al., 2020), one just needs to focus on optimizing one inference function f(·). On the other hand, ParM needs to optimize n − k + 1 models, incurring a significantly larger cost.

² Learning-a-code (Kosaian et al., 2018) also has this property.

Applicability to Multi-task Serving. A representation in NNs can serve various downstream tasks (Liu et al., 2015). For instance, a well-trained image representation can be simultaneously useful for image classification, segmentation, and depth estimation (Kendall et al., 2018). For a detailed overview of multi-task learning and the role of shared representations, we refer the readers to (Ruder, 2017). We highlight that Coded-InvNet is a promising solution when the prediction serving system is serving multiple inference tasks sharing the underlying representation. Recall that Coded-InvNet reconstructs the embedding (or representation) of missing inputs, while existing approaches directly reconstruct missing inference results (Kosaian et al., 2018; 2019). Therefore, Coded-InvNet does not need any extra effort to support multiple tasks, say m tasks. The only difference is that we now have g_i(·) for 1 ≤ i ≤ m, where g_i(·) is specific to task i. Note that this does not affect the training complexity of the encoder at all. On the other hand, ParM has to train m parity models to achieve the same goal. See Sec. 4.6 for experimental results where we demonstrate the applicability of Coded-InvNet to multi-task settings.

4. Experiments
We show how we implement the components of the Coded-InvNet framework and evaluate its performance. In comparison with the baselines, we focus on the image classification task on popular datasets: MNIST (Deng, 2012), Fashion-MNIST (Xiao et al., 2017), and CIFAR10 (Krizhevsky & Hinton, 2009). These are all 10-way image classification tasks. We also demonstrate the applicability of Coded-InvNet on a large-scale ImageNet-based dataset (Howard, 2019) and on the multiple-failure setting. We report various performance metrics such as the classification accuracy when a straggler/failure happens, encoding/decoding overhead, scalability, end-to-end latency, etc. Since Coded-InvNet recovers the full embedding of the missing input, it naturally fits multi-task applications, i.e., one common embedding can be used for multiple downstream tasks. We also show how one can apply Coded-InvNet to such multi-task applications. We provide further additional experimental results and implementation details in the supplementary material.

4.1. Architectures and Training Methods
While the Coded-InvNet framework is applicable to any values of n, k, and n − k, the most important case is when n = k + 1. This is of practical interest since this scheme's compute overhead is minimal (100/k %) while still being robust against a single failure. We will consider k ∈ {2, 4, 10} and choose c_{1,j} = 1/k for all j.

Architecture. We use i-ResNet as the classification network g ∘ f.³ We mostly follow the recommended configurations in (Behrmann et al., 2018), but we remove the injective padding module to improve the invertibility of off-manifold embedding vectors. We use the Pix2Pix (Isola et al., 2017) architecture for the encoding functions in our experiments for all values of k. To avoid linear scaling, we design our encoder architecture such that only the complexity of the first few layers depends on k, and the rest of the architecture does not scale with k. See Fig. 3 for the encoder architecture for k = 2. When k = 2, the two inputs x1 and x2 are first processed by a weight-shared network. The two processing results are then concatenated and projected to a fixed-size hidden vector. In general, our encoder processes k inputs in parallel, and the concatenated output is projected to the same-size hidden vector. By limiting the size of this input-processing part, we control how the encoder complexity scales as k increases. See Sec. 4.4 for experimental results where we demonstrate that the encoding overhead can be kept nearly constant for increasing values of k.

Training. We train our i-ResNet classifier with Manifold Mixup (with mixup coefficient 1). Then, we use the trained classifier to generate a train set for the encoding network. In particular, we draw k random inputs (x1, x2, ..., xk) and compute the labels f^{-1}(Σ_{j=1}^{k} c_{i,j} f(xj)) to obtain input/output pairs. Here, since our f(·) is an i-ResNet module, the inverse function does not have an explicit form, so we compute the inverse function by solving fixed-point equations. We repeat this 50,000 times to construct a sufficiently large training set. We then train the encoder function using the Least Squares GAN loss (Mao et al., 2017) plus 100 times the L1 loss. We enforce permutation invariance of the inputs by (1) sharing the weights of the first hidden layers across all k inputs and (2) using the average function after the first hidden layer. The encoder training procedure is summarized in Fig. 3. Training the encoder may take up to 7.5 hours (ResNet-301-based architecture with 150 epochs on a 48-GB RTX8000 GPU).
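The paper's encoder is a Pix2Pix generator trained with an LSGAN plus 100 x L1 objective; the sketch below only illustrates the permutation-invariance construction described above (a weight-shared stem per input followed by averaging), with a deliberately small convolutional head standing in for the U-Net. Layer widths and shapes are illustrative.

```python
import torch
import torch.nn as nn

class PermutationInvariantEncoder(nn.Module):
    """Enc_i(x_1, ..., x_k): a weight-shared stem processes each input, the k
    stem outputs are averaged (making the encoder symmetric in its inputs),
    and a shared head maps the pooled features back to image space."""
    def __init__(self, in_ch=3, width=64):
        super().__init__()
        self.stem = nn.Sequential(                    # shared across the k inputs
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(                    # does not grow with k
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, in_ch, 3, padding=1),
        )

    def forward(self, xs):                            # xs: (batch, k, C, H, W)
        b, k = xs.shape[:2]
        feats = self.stem(xs.flatten(0, 1))           # same weights on all k inputs
        feats = feats.view(b, k, *feats.shape[1:]).mean(dim=1)  # average over inputs
        return self.head(feats)

enc = PermutationInvariantEncoder()
xs = torch.randn(2, 4, 3, 32, 32)                     # batch of 2, k = 4, CIFAR-sized
assert enc(xs).shape == (2, 3, 32, 32)
# Training objective in the paper: LSGAN adversarial term + 100 * L1 to the
# ideal encoded input f^{-1}(sum_j c_j f(x_j)).
```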
4.2. Training Results
Classifier. The trained classifiers achieve 99.2%, 91.8%, and 90.9% on MNIST, Fashion-MNIST, and CIFAR10, respectively. Note that removing the injective padding layer from the original i-ResNet, as mentioned previously, results in slight drops in classification accuracy.

Encoder. To see if the encoder training was successful, we observe the training curve (shown in Fig. 3 in the supplementary material). There, we show the L1 loss measured on the train and test sets of MNIST, respectively. The encoder was trained with k = 4 and c_{1,j} = 1/4. The train L1 loss keeps improving while the test L1 loss saturates around epoch 20.

³ We report additional results with another invertible network, i-RevNet (Jacobsen et al., 2018), in the supplementary material.

Figure 3. Encoder training. Let k = 2 and c_{1,1} = c_{1,2} = 1/2. We first pick two random inputs x1 and x2. We apply f(·) to each of them, compute the average, and then apply f^{-1}(·), obtaining the target. We then feed the two inputs to the encoder. We ensure the input symmetry of the encoder by individually processing the inputs with a weight-shared network. The processing results are concatenated and then further processed to obtain the encoded input. We then train the encoder's weights with the Pix2Pix loss.

Shown in Fig. 4 is the qualitative performance evaluation of the trained encoder. In Fig. 4 (h), we show eight random samples of (x1, x2, x3, x4) from the test set. Fig. 4 (i) visualizes the ideal encoded inputs. Recall that each of these ideal encoded inputs is computed by applying f(·) to each of the xi's, obtaining a weighted sum, and then computing f^{-1}(·). Since i-ResNet does not have an explicit inverse function, f^{-1}(·) is computed by solving a fixed-point equation via an iterative method. In Fig. 4 (j), we show that the encoded inputs obtained by our trained encoder are indistinguishable from the ideal encoding results to the naked eye.

To further check the validity of the trained encoder, we perform a qualitative evaluation by observing the visual quality of input decoding. Note that with the ideal encoded input, one can perfectly recover missing inputs. Assuming n = k + 1 and c_{1,j} = 1/k, let us define

\[
\widehat{f(x_1)} := k\, f\!\left(\mathrm{Enc}'(x_1, x_2, \ldots, x_k)\right) - \sum_{i=2}^{k} f(x_i), \tag{4}
\]
\[
\widehat{x}_1 := f^{-1}\!\left(k\, f\!\left(\mathrm{Enc}'(x_1, x_2, \ldots, x_k)\right) - \sum_{i=2}^{k} f(x_i)\right). \tag{5}
\]

That is, f̂(x1) is the reconstruction of the missing function evaluation, and x̂1 is the reconstruction of the corresponding input. One can define f̂(xi) and x̂i for 2 ≤ i ≤ k in a similar way. If the encoding function is ideal, i.e., Enc'(x1, x2, ..., xk) = f^{-1}(Σ_{j=1}^{k} f(xj)/k), then f̂(xi) = f(xi) and hence x̂i = xi. Therefore, we can indirectly evaluate the performance of an encoder by comparing x̂i and xi. See Fig. 4 (c), (d) for the comparison results for k = 2 and Fig. 4 (i), (j) for k = 4. Observe that x̂i closely recovers xi in all tested cases up to some offsets, justifying the validity of the trained encoder.

Figure 4. Qualitative performance evaluation of a trained encoder for MNIST (k = 2 and k = 4). We used c_{1,j} = 1/k. (a), (b): Randomly chosen x1 and x2. (c): The ideal encoding result, i.e., f^{-1}((f(x1) + f(x2))/2). (d): The output of a trained encoder, i.e., Enc'(x1, x2). Note that (c) and (d) are visually indistinguishable. (f), (g): Note that x1 = f^{-1}(2 f(f^{-1}((f(x1) + f(x2))/2)) − f(x2)). By replacing f^{-1}((f(x1) + f(x2))/2) with Enc'(x1, x2), we have x̂1 := f^{-1}(2 f(Enc'(x1, x2)) − f(x2)). Similarly, one can also define x̂2. Shown in (f) and (g) are x̂1 and x̂2. Observe that they successfully recover the missing inputs up to offsets. (h)-(k): We repeated the same experiment for k = 4 and observed similar results.
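A sketch of the reconstructions in Eqs. (4)-(5), assuming `enc`, `f`, and `f_inv` operate on single (unbatched) examples and that `enc` accepts the list of k inputs; only the notation from this subsection is used.

```python
import torch

@torch.no_grad()
def reconstruct_missing(enc, f, f_inv, xs, missing=0):
    """Eqs. (4)-(5) for n = k + 1 and c_{1,j} = 1/k.
    xs: the k original inputs; `missing` indexes the unavailable task."""
    k = len(xs)
    parity_embedding = f(enc(xs))                    # f(Enc'(x_1, ..., x_k))
    available = sum(f(xs[i]) for i in range(k) if i != missing)
    f_hat = k * parity_embedding - available         # Eq. (4): recovered embedding
    x_hat = f_inv(f_hat)                             # Eq. (5): recovered input
    return f_hat, x_hat

# The degraded prediction for the missing input is then g(f_hat); x_hat is only
# used for the qualitative comparison against the true input in Fig. 4.
```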
4.3. Degraded Mode Accuracy
Comparison with Baselines. We now evaluate the classification performance of Coded-InvNet in the presence of stragglers or failures. Normal accuracy is simply the top-1 classification accuracy of our classifier, i.e., E_{x,y}[ŷ = y]. If f(xi) is available at the time k tasks complete, then f̂(xi) = f(xi). Therefore, the normal accuracy captures the classification accuracy for such cases. We also define the degraded mode accuracy (Kosaian et al., 2019). If some f(xi) is not available at the time k task results are available, we first decode f̂(xi) using (4) and then compute ŷi = g(f̂(xi)), the recovered inference result on xi. Due to the encoding error, ŷi could be different from g(f(xi)). The degraded mode accuracy is the accuracy measured with respect to this recovered inference result g(f̂(x1)). We expect the degraded accuracy to be lower than the normal accuracy. More formally, it is defined as E_{x1,y1,x2,x3,...,xk}[g(f̂(x1)) = y1], where the expectation is over k random inputs.

We report the normal and degraded mode accuracy in Fig. 5. We compare our model with ParM (Kosaian et al., 2019), built with the default summation encoder/subtraction decoder,⁴ and Learning-a-code (Kosaian et al., 2018), built with a convolutional encoder/fully-connected decoder. The dotted lines represent normal accuracies on each dataset, and the bars are the degraded mode accuracies.

⁴ Kosaian et al. (2019) also develop a downsample-and-concatenate-based encoder for image classification, but this design is task-specific and currently does not support k = 10.

Figure 5. Degraded mode accuracy comparison. We compare ParM, Learning-a-code (LaC), and Coded-InvNet (Ours) at different resource overheads k = 2, 4, 10 on MNIST, Fashion-MNIST, and CIFAR10. Dotted lines represent our normal accuracy E_{x,y}[ŷ = y], i.e., the classification accuracy when the inference does not involve any reconstruction. Bars represent the degraded mode accuracy E_{x1,y1,x2,...,xk}[ŷ1 = y1]. Observe that all models achieve comparable performance when the resource overhead is high, e.g., k = 2 (50%) and k = 4 (25%). However, when the resource overhead is very low, i.e., k = 10 (10%), Coded-InvNet significantly outperforms on every dataset.

When k = 2 or k = 4, all approaches obtain similar degraded accuracies on every dataset. When k = 10, i.e., the system is larger and the resource overhead is much lower (1/k = 10%), a large number of embedding vectors have to be packed into a single aggregated vector. Also, the encoded input has more chance of being off the input manifold, and so does the corresponding embedding. These factors make the encoder learning more challenging, indicating a trade-off between resource overhead and degraded accuracy, i.e., as the resource overhead decreases (as k increases), the encoding and the degraded mode errors increase. Indeed, in the original paper, the state-of-the-art approach ParM was tested only for k ≤ 4, and it was not clear how well this approach would perform in such a low resource regime. As shown in Fig. 5, Coded-InvNet significantly outperforms ParM and Learning-a-code in the low resource regime. For instance, on MNIST, Coded-InvNet achieves 85.9% degraded accuracy while ParM and Learning-a-code achieve 53.4% and 40.1%, respectively. On CIFAR10, the degraded mode accuracy of Coded-InvNet is 39.1%, more than twice that of ParM, and three times that of Learning-a-code.

Take-away: Coded-InvNet is highly resilient in the presence of stragglers or failures. It matches the performance of the state of the art when the resource overhead is high enough (≥ 25%), and significantly outperforms it when the resource overhead is as low as 10%.
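The degraded mode accuracy defined above can be estimated by simulation; the sketch below assumes single-example callables `f`, `g`, `enc`, the n = k + 1 setting with c_{1,j} = 1/k, and always treats the first drawn test point as the missing one. These assumptions and the trial count are illustrative.

```python
import torch

@torch.no_grad()
def degraded_accuracy(f, g, enc, dataset, k, num_trials=1000, seed=0):
    """Monte-Carlo estimate of E[g(f_hat(x_1)) == y_1]: draw k random test
    points, pretend f(x_1) is missing, decode it from the parity result and
    the other k-1 embeddings, and check the recovered prediction."""
    gen = torch.Generator().manual_seed(seed)
    correct = 0
    for _ in range(num_trials):
        idx = torch.randint(len(dataset), (k,), generator=gen)
        xs = [dataset[int(i)][0] for i in idx]
        y1 = dataset[int(idx[0])][1]                   # label of the "missing" input
        parity = f(enc(xs))                            # backup worker's result
        f_hat = k * parity - sum(f(x) for x in xs[1:]) # decode via Eq. (4)
        y_pred = g(f_hat).argmax(dim=-1)
        correct += int(y_pred.item() == y1)
    return correct / num_trials
```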
Coded-InvNet on Large-scale ImageNet Data. We show that our framework can be designed for more complex datasets. In particular, we use Imagenette, a 10-class subset of ImageNet. We replace i-ResNet with an i-RevNet (Jacobsen et al., 2018) (96.2% normal accuracy). We achieve 85.5% degraded accuracy for (n, k) = (3, 2) and 58.4% for (n, k) = (5, 4). Fig. 6 shows a sample reconstruction of our algorithm when f(x1) needs to be recovered from the backup worker. One can see that the recovered computation result (5th column) closely matches the missing computation result f(x1) (1st column).

Figure 6. Reconstruction on Imagenette samples. Left to right: x1, x2, f^{-1}((f(x1) + f(x2))/2), Enc(x1, x2), and x̂1. Shown below them are the softmax scores of their prediction outputs. Note that arg max f̂(x1) = arg max f(x1), i.e., the degraded classification is correct.

Coded-InvNet with Multiple Failures. We first handle two failures at the same time by applying Coded-InvNet with (n, k) = (4, 2). We consider the following encoders: Enc_1(x1, x2) = f^{-1}((f(x1) + f(x2))/2) and Enc_2(x1, x2) = f^{-1}((f(x1) + 2f(x2))/3). With these encoders, the four tasks can be viewed as four linearly independent sums of f(x1) and f(x2), i.e.,

\[
\begin{bmatrix} f(x_1) \\ f(x_2) \\ \frac{f(x_1)+f(x_2)}{2} \\ \frac{f(x_1)+2f(x_2)}{3} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{2}{3} \end{bmatrix}
\begin{bmatrix} f(x_1) \\ f(x_2) \end{bmatrix}.
\]

One can confirm that any two rows of the coefficient matrix on the RHS form an invertible matrix. That is, as long as any two of the four computation results are available, the decoder can recover the computation results f(x1) and f(x2). We evaluate this coding scheme on the MNIST dataset. We consider a setting where both f(x1) and f(x2) are unavailable, i.e., one must recover them only from the backup task results. Even for this very non-trivial setting, we obtain 67.4% degraded accuracy. Similarly, we also test our framework with (n, k) = (6, 4). Focusing on the setting where two of the first four results are not available (i.e., both encoders are involved in decoding), we achieve 45.4% degraded accuracy. To the best of our knowledge, all the previous works focused on the n = k + 1 setting, and this is the first non-trivial degraded accuracy for n > k + 1.
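The claim that any two of the four task results suffice can be verified numerically; in the sketch below, random vectors stand in for the embeddings f(x1) and f(x2), and the encoding is assumed to be ideal.

```python
import itertools
import numpy as np

# Coefficient matrix for (n, k) = (4, 2): each row gives the weights with which
# one task's output combines f(x1) and f(x2) (two identity rows, two encoders).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1/2, 1/2],
              [1/3, 2/3]])

rng = np.random.default_rng(0)
f_x1, f_x2 = rng.normal(size=8), rng.normal(size=8)   # stand-in embeddings
F = np.stack([f_x1, f_x2])                            # shape (2, d)
task_outputs = A @ F                                  # what the four workers return

# Any two surviving workers suffice: solve the 2x2 system from their rows.
for alive in itertools.combinations(range(4), 2):
    recovered = np.linalg.solve(A[list(alive)], task_outputs[list(alive)])
    assert np.allclose(recovered, F)
print("f(x1), f(x2) recoverable from every pair of surviving workers")
```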
4.4. Encoding/decoding Overhead
At the time of query arrival, Coded-InvNet must first compute n − k encoded queries before assigning compute tasks to workers. If the encoding procedure requires a non-negligible amount of compute time, then the whole premise of coded computation is invalidated. To avoid this, as described in Sec. 4.1, we carefully chose the encoder architecture so that its compute time is negligible compared to that of the inference function. Shown in Fig. 7 are the wall-clock compute times of Enc(·), f(·), and g(·). We measure the running time over 10 runs and report the mean and standard deviation. All of these are measured on a 12-GB NVIDIA TITAN Xp GPU, 128 GB of DRAM, and 40 Intel Xeon E5-2660 CPUs. Note that the time to compute f(·) is dominant, and the encoding time is relatively negligible. The encoding overhead is 5.8% for MNIST, 8.3% for Fashion-MNIST, and 6.6% for CIFAR10.

Figure 7. Wall-clock compute time of Enc(·), f(·), and g(·). The compute overhead of encoding must be negligible compared to the actual inference time. We measure the wall-clock time to run Enc(·), f(·), and g(·) to quantify the compute overhead of encoding when k = 2. We fix the batch size at 256 images and measure the compute time over 10 independent runs, reporting averages and standard deviations. We observe low overheads: 5.8% for MNIST, 8.3% for Fashion-MNIST, and 6.6% for CIFAR10.

As k increases, the encoder complexity also increases. To avoid linearly increasing complexity, in Sec. 4.1, we designed our encoder architecture such that only the complexity of the first few layers depends on k. Shown in Fig. 8 is the encoding overhead as a function of k. Observe that the encoding overhead remains almost constant as k increases. The decoding overhead of Coded-InvNet is minimal, and a simple online decoding algorithm can make it independent of k. See the supplementary material for more details.

Figure 8. Encoding overhead as a function of k. The encoding overhead is measured for k = 2, 4, 10 on MNIST, Fashion-MNIST, and CIFAR10. We fix the batch size at 256 images and measure the compute time of Enc(·) over 10 independent runs, reporting the medians. The encoding overhead remains nearly constant as k increases on all datasets.

Take-away: Coded-InvNet's encoding overhead is negligible (< 10%) due to its lightweight encoder architecture and remains almost constant as k increases. The decoding overhead does not scale with k with online decoding.

4.5. End-to-End Latency
We evaluate the effectiveness of Coded-InvNet in reducing tail latency by measuring its end-to-end inference latency on an AWS EC2 cluster. In particular, we implement a custom testbed for Coded-InvNet and ParM, written with MPI4py (Dalcin et al., 2011). All experiments are run on an Amazon AWS EC2 cluster with GPU instances (p2.xlarge with NVIDIA K80). We deploy a pretrained inference model on k = 10 instances and deploy the encoder and the inference model on one additional instance. We also deploy one dedicated instance as the front-end node. To emulate stragglers in our testbed, we randomly add artificial latencies of 0.1 s to one of the k workers, following the setup proposed in (Tandon et al., 2017). We then measure the end-to-end latency, that is, the time between when the front-end server receives a query and when the prediction result becomes available at the front-end server. We measure latencies over 5,000 independent CIFAR10 classification queries. ParM is set to use the default linear encoder/decoder and the same neural network architecture as ours.

Figure 9. End-to-end latency on an AWS cluster with (n, k) = (11, 10). Latencies are measured for ParM (grey), our Coded-InvNet (orange), and inference models without stragglers (olive), reporting the median, mean, and 99th/99.5th/99.9th percentiles. Coded-InvNet shows negligible overhead compared to ParM and the inference models. For instance, the 99th-percentile latency of Coded-InvNet is only 3 ms higher than ParM's, accounting for less than 3% of the inference latency. Note that our extra latencies (and ParM's) probably come from the communication with the redundancy worker.

Shown in Fig. 9 are the end-to-end latency statistics of Coded-InvNet and ParM, compared with the latency statistics without artificial delays. As expected, Coded-InvNet's additional overhead is almost negligible compared to the other latency sources such as communication and inference.

Varying Batch Size. For completeness, we evaluate the end-to-end latency of Coded-InvNet with batched queries. Using (n, k) = (3, 2) with batch sizes of 2 and 4, we observe small gaps of 2.29% and 3.78%, respectively, between Coded-InvNet and ParM in the 99.9th-percentile latency.

4.6. Applicability to Multi-task Serving
To illustrate the multi-tasking advantage, we perform two 2-task experiments on CIFAR10: (1) fine and coarse classification, and (2) classification and density estimation. The overall architecture remains the same except that we have g1(·) and g2(·) for the two tasks instead of only one g(·). In the first experiment, the fine classification uses the 10 original classes, and we define two coarse classes: vehicle (airplane, automobile, ship, and truck) and animal (the other classes). Coded-InvNet achieves high accuracies for both tasks across different k (shown in Table 1 in the supplementary material). For instance, with k = 4, we achieve 86.6% and 43.4% on coarse and fine classification, respectively. For the second experiment, we co-train a classifier and a normalizing flow model (Behrmann et al., 2018) based on i-ResNet, obtaining 85.2% normal accuracy and 3.9 bits/dim, respectively. Applying Coded-InvNet with (n, k) = (5, 4), the two models recover 50% degraded accuracy and 6.8 degraded bits/dim. These elementary results demonstrate the applicability of Coded-InvNet to multi-task settings.

4.7. Ablation Study
Beyond Invertibility. Extending our framework to non-invertible architectures is a crucial future direction. Engstrom et al. (2019) show that a robust representation learned by adversarial training is approximately invertible. When f(·) is not invertible, one can still apply our framework by finding an approximate inverse of y, solving min_x ||f(x) − y||. We replace the INN in Coded-InvNet with the robust classifier of Engstrom et al. (2019) and use the approximate inverse for inversion. On CIFAR10 with (n, k) = (5, 4), we achieve 38.5% degraded accuracy, which shows the promise of this direction.
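A sketch of the approximate inversion used in this ablation, solving min_x ||f(x) − y||^2 by gradient descent; the optimizer, step count, and learning rate are illustrative, as the paper does not specify them.

```python
import torch

def approximate_inverse(f, y, x_shape, steps=500, lr=0.1, seed=0):
    """Approximate f^{-1}(y) for a non-invertible f by minimizing
    ||f(x) - y||^2 over x with Adam (Sec. 4.7, Beyond Invertibility)."""
    torch.manual_seed(seed)
    x = torch.randn(x_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (f(x) - y).pow(2).mean()
        loss.backward()
        opt.step()
    return x.detach()
```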
Effect of Manifold Mixup. The regularizer makes g(·) robust to small approximation errors. We found that the degraded accuracies on CIFAR10 for k = 2 and k = 4 decrease by 42% and 64% when trained without Manifold Mixup, despite gaining higher normal accuracy. As discussed in Sec. 3.3, this can be attributed to the effect of Manifold Mixup on learning a smooth data manifold. This effect increases the chance that the aggregated embedding lies on the manifold, and thus the corresponding inverses retain in-domain structural and semantic representations. This makes encoder training easier, as most domain translation algorithms are designed for sufficiently close domains.

5. Conclusion
We present Coded-InvNet, a new coded computation-based approach for designing a resilient prediction serving system. Inspired by a new coded computation algorithm for invertible functions, it jointly trains the inference function and the encoder, making use of recent developments in the deep learning literature such as Manifold Mixup and domain translation algorithms. Maintaining a clear separation between encoding and inference, Coded-InvNet provides a hassle-free design alternative to the existing methods for system designers.
Coded-Inv Net is shown to significantly outperform the state of the art especially when the compute resource overhead is as low as 10%. Acknowledgements This material is based upon work supported by NSF/Intel Partnership on Machine Learning for Wireless Networking Program under Grant No. CNS-2003129. Coded-Inv Net for Resilient Prediction Serving Systems Behrmann, J., Grathwohl, W., Chen, R. T., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. ar Xiv preprint ar Xiv:1811.00995, 2018. Blalock, D., Ortiz, J. J. G., Frankle, J., and Guttag, J. What is the state of neural network pruning? ar Xiv preprint ar Xiv:2003.03033, 2020. Bodmann, B. G. Frames as Codes, pp. 241 266. Birkhäuser Boston, Boston, 2013. ISBN 978-0-8176-8373-3. doi: 10.1007/978-0-8176-8373-3_7. URL https://doi. org/10.1007/978-0-8176-8373-3_7. Cybenko, G. Approximation by superpositions of a sig moidal function. Mathematics of control, signals and systems, 2(4):303 314, 1989. Dalcin, L. D., Paz, R. R., Kler, P. A., and Cosimo, A. Parallel distributed computing using python. Advances in Water Resources, 34(9):1124 1139, 2011. Dean, J. and Barroso, L. A. The tail at scale. Communica tions of the ACM, 56(2):74 80, 2013. Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141 142, 2012. Dhakal, S., Prakash, S., Yona, Y., Talwar, S., and Himayat, N. Coded federated learning. 2019 IEEE Globecom Workshops (GC Wkshps), Dec 2019. doi: 10.1109/gcwkshps45667.2019.9024521. URL http://dx.doi.org/10.1109/ GCWkshps45667.2019.9024521. Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A. Adversarial robustness as a prior for learned representations, 2019. Goiri, I., Bianchini, R., Nagarakatte, S., and Nguyen, T. D. Approxhadoop: Bringing approximations to mapreduce frameworks. In Proceedings of the Twentieth Interna tional Conference on Architectural Support for Program ming Languages and Operating Systems, pp. 383 397, 2015. Han, R., Huang, S., Wang, Z., and Zhan, J. Clap: Component-level approximate processing for low tail latency and high result accuracy in cloud online services. IEEE Transactions on Parallel and Distributed Systems, 28(8):2190 2203, 2017. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn ing for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2015. Howard, J. Imagenette dataset. 2019. URL https:// github.com/fastai/imagenette. Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vi sion and pattern recognition, pp. 1125 1134, 2017. Jacobsen, J.-H., Smeulders, A., and Oyallon, E. i revnet: Deep invertible networks. ar Xiv preprint ar Xiv:1802.07088, 2018. Kendall, A., Gal, Y., and Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482 7491, 2018. Kim, H., Papamakarios, G., and Mnih, A. The lipschitz constant of self-attention, 2020. Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. ar Xiv preprint ar Xiv:1807.03039, 2018. Kosaian, J., Rashmi, K., and Venkataraman, S. 
Learning a code: Machine learning for approximate non-linear coded computation. ar Xiv preprint ar Xiv:1806.01259, 2018. Kosaian, J., Rashmi, K., and Venkataraman, S. Parity mod els: erasure-coded resilience for prediction serving sys tems. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 30 46, 2019. Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images (technical report). 2009. Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., and Ramchandran, K. Speeding up distributed machine learn ing using codes. IEEE Transactions on Information The ory, 64(3):1514 1529, 2017. Li, S., Maddah-Ali, M. A., and Avestimehr, A. S. A unified coding framework for distributed computing with strag gling servers. In 2016 IEEE Globecom Workshops (GC Wkshps), pp. 1 6. IEEE, 2016. Liu, X., Gao, J., He, X., Deng, L., Duh, K., and Wang, Y.-Y. Representation learning using multi-task deep neu ral networks for semantic classification and information retrieval. 2015. Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE international con ference on computer vision, pp. 2794 2802, 2017. Coded-Inv Net for Resilient Prediction Serving Systems Marculescu, D., Stamoulis, D., and Cai, E. Hardware-aware machine learning: modeling and optimization. In Pro ceedings of the International Conference on Computer Aided Design, pp. 1 8, 2018. Mirza, M. and Osindero, S. Conditional generative adver sarial nets. ar Xiv preprint ar Xiv:1411.1784, 2014. Narra, K., Lin, Z., Kiamari, M., Avestimehr, S., and An navaram, M. Distributed matrix multiplication using speed adaptive coding. ar Xiv preprint ar Xiv:1904.07098, 2019. Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolu tional networks for biomedical image segmentation. In In ternational Conference on Medical image computing and computer-assisted intervention, pp. 234 241. Springer, 2015. Ruder, S. An overview of multi-task learning in deep neural networks. ar Xiv preprint ar Xiv:1706.05098, 2017. Song, Y., Meng, C., and Ermon, S. Mintnet: Building invertible neural networks with masked convolutions. In Advances in Neural Information Processing Systems, pp. 11004 11014, 2019. Suresh, L., Canini, M., Schmid, S., and Feldmann, A. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In 12th {USENIX} Symposium on Net worked Systems Design and Implementation ({NSDI} 15), pp. 513 527, 2015. Tandon, R., Lei, Q., Dimakis, A. G., and Karampatziakis, N. Gradient coding: Avoiding stragglers in distributed learn ing. In International Conference on Machine Learning, pp. 3368 3376. PMLR, 2017. Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Courville, A., Lopez-Paz, D., and Bengio, Y. Manifold mixup: Better representations by interpolating hidden states. ar Xiv preprint ar Xiv:1806.05236, 2018. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. Yu, Q., Li, S., Raviv, N., Kalan, S. M. M., Soltanolkotabi, M., and Avestimehr, S. A. Lagrange coded computing: Optimal design for resiliency, security, and privacy. vol ume 89 of Proceedings of Machine Learning Research, pp. 1215 1225. PMLR, 16 18 Apr 2019. URL http:// proceedings.mlr.press/v89/yu19b.html. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. H., and Stoica, I. Improving mapreduce performance in heteroge neous environments. In OSDI, 2008. 
Zhang, C., Patras, P., and Haddadi, H. Deep learning in mobile and wireless networking: A survey. IEEE Commu nications Surveys & Tutorials, 21(3):2224 2287, 2019. Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. ar Xiv preprint ar Xiv:1710.09412, 2017.