# Fault Tolerance in Iterative-Convergent Machine Learning

Aurick Qiao 1,2, Bryon Aragam 3, Bingjing Zhang 1, Eric P. Xing 1,2,3

1 Petuum, Inc., Pittsburgh, Pennsylvania, USA. 2 Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. 3 Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Correspondence to: Aurick Qiao.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## Abstract

Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms. We then use this framework to derive a worst-case upper bound on the cost of arbitrary perturbations to model parameters during training and to design new strategies for checkpoint-based fault tolerance. Our system, SCAR, can reduce the cost of partial failures by 78%-95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms, providing near-optimal performance in recovering from failures.

## 1. Introduction

Distributed model training for machine learning (ML) is a workload that is typically long-running and resource-intensive. Throughout a job's lifetime, it is susceptible to hardware failures, performance fluctuations, and other uncertainties inherent to real-world cluster environments. For example, processes can be preempted by a cluster resource allocator (Vavilapalli et al., 2013; Hindman et al., 2011), parameter synchronization can be bottlenecked on a slow or congested network (Li et al., 2014b; Zhang et al., 2017b), and stragglers can severely impact overall job throughput (Cipar et al., 2013; Harlap et al., 2016). These concerns are amplified in modern shared clusters and cloud-based spot instances such as those provided by Amazon Web Services (AWS). Thus, developing new fault-tolerance strategies for modern ML systems is a critical area of research.

Figure 1. The self-correcting behavior of iterative-convergent algorithms. Even though a calculation error results in an undesirable perturbation of $\delta$ at iteration $T$, the subsequent iterations still bring the solution closer to the optimum value of $x^*$.

ML-agnostic distributed systems approaches for addressing such problems often adopt strong consistency semantics. They aim to provide strong execution guarantees at a per-operation level (such as linearizability or serializability), but may also incur higher performance overhead. On the other hand, ML training is often tolerant to small calculation errors and may not require such strong consistency guarantees.
This observation has been exploited by recent ML systems to overcome cluster unreliability and resource limitation issues, such as bounded staleness consistency (Ho et al., 2013; Cipar et al., 2013; Cui et al., 2014), quantization and low-precision arithmetic (Courbariaux et al., 2014; Gupta et al., 2015; Hubara et al., 2017), and lock-free execution (Niu et al., 2011; Dean et al., 2012). One notable exception to this trend is checkpoint-based fault tolerance, a common strategy in current ML systems for mitigating hardware failures (Abadi et al., 2016; Wei et al., 2015; Low et al., 2012), which continues to enforce strong consistency semantics at a high cost of re-computing lost work.

This trend of relaxing consistency in ML systems relies on the self-correcting behavior of iterative-convergent ML training algorithms (Fig. 1). During each step, the training algorithm calculates updates based on the current values of the model parameters, and then applies the updates to obtain a better set of model parameters. By iteratively performing this computation, the model parameters eventually converge to a set of optimal values. Small computation errors made during this procedure are eventually washed out by the successive iterative improvements. This self-correcting behavior of ML training suggests a general strategy for designing robust training systems for unreliable environments, as follows:

(A) The execution system allows certain environmental faults and/or resource limitations to manifest as calculation errors in model training. These errors can be conceptualized as perturbations to the model parameters.

(B) The perturbations are self-corrected by the model training algorithm, which incurs an extra cost (e.g. additional iterations, batches, epochs, etc.). We refer to this additional cost as the rework cost of the perturbations.

Motivated by this general strategy, we develop a framework for exploiting self-correction in ML systems in a way that is adaptive to generic perturbations whose cause or origin is unknown. It provides a theoretical foundation for understanding the self-correcting behavior of iterative-convergent model training as well as the tools needed by ML systems to take advantage of this behavior. Our main contributions are:

1. We quantify the impact of generic perturbations on iterative-convergent algorithms in terms of their rework cost. Under reasonable convergence assumptions, we bound the rework cost in terms of the sizes of these perturbations.

2. We propose new strategies for checkpoint-based fault tolerance in distributed model training. Partially recovering from checkpoints, combined with prioritizing checkpoints in a way that reduces the size of perturbations, can significantly reduce the rework cost due to partial failures.

3. We design SCAR, a parameter server system for fault-tolerant ML training, and show that SCAR reduces the rework cost of partial failures by 78%-95% when compared with traditional checkpointing, which is close to optimal (vs. training with no failures).

## 2. Modeling Faults in ML Training

Most ML training algorithms are iterative, i.e. model parameters are updated given a current estimate of the model parameters $x^{(k)}$ until convergence to some target parameter $x^*$. Such algorithms are commonly called iterative-convergent, and include most optimization, Monte Carlo, and numerical schemes used in practice. These iterative schemes are of the form

$$x^{(k+1)} = f(x^{(k)}), \qquad x^{(k)} \in \mathbb{R}^d, \qquad (1)$$

for some function $f$.
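To make the abstraction in (1) concrete, the following minimal sketch (our illustration, not code from the paper; the quadratic objective, step size, and function names are assumptions) instantiates the update map $f$ as a gradient-descent step and iterates it until the parameters stop changing:

```python
import numpy as np

def make_gd_update(A, b, lr=0.05):
    """Return the update map f for gradient descent on 0.5*x'Ax - b'x.

    The minimizer x* = A^{-1} b plays the role of the target parameter
    that the iterates x^(k) converge to.
    """
    def f(x):
        grad = A @ x - b      # gradient of the quadratic objective
        return x - lr * grad  # one step of the scheme x^(k+1) = f(x^(k))
    return f

def run(f, x0, tol=1e-8, max_iters=100_000):
    """Iterate x^(k+1) = f(x^(k)) until successive iterates are within tol."""
    x = x0
    for k in range(max_iters):
        x_next = f(x)
        if np.linalg.norm(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iters

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.diag([1.0, 2.0, 5.0, 10.0])   # simple 4-D quadratic objective
    b = rng.normal(size=4)
    x_star = np.linalg.solve(A, b)
    x_hat, iters = run(make_gd_update(A, b), x0=np.zeros(4))
    print(iters, np.linalg.norm(x_hat - x_star))
```

Any fixed point of $f$ (here the minimizer of the quadratic) corresponds to the target parameter $x^*$, and the number of iterations until convergence serves as the baseline against which the cost of perturbations is measured below.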
This model of iterative-convergent algorithms assumes that the current state $x^{(k)}$ is stored persistently and losslessly in memory. In practice, modern distributed ML systems are subject to faults such as hardware failures, memory corruption, and performance fluctuations. Thus, it is unrealistic to assume that $x^{(k)}$ can always be retrieved with perfect fidelity. To model this uncertainty, let $\delta_k$ be a random variable that represents an unknown perturbation that corrupts the current state to produce a perturbed state $x^{(k)} + \delta_k$. We make no assumptions about the cause, size, or behavior of the perturbations $\delta_k$. More specifically, we assume the iterates obey the following scheme:

$$y^{(1)} = f(y^{(0)} + \delta_0), \qquad y^{(k+1)} = f(y^{(k)} + \delta_k). \qquad (2)$$

In the absence of errors, i.e. $\delta_k = 0$, we have $y^{(k)} = x^{(k)}$, which reduces to the basic iterative scheme (1). Moreover, since $\delta_k$ is arbitrary, this model allows for any type of perturbation. In particular, perturbations may occur in every iteration or periodically according to some random process. This setup captures many of the ways that system faults can be manifested as perturbations, and we give a few important examples below.

**Example 2.1 (Reduced Precision).** A simple practical example is using reduced-precision floating/fixed point representations for storing parameter values. If $\tilde{y}^{(k)}$ is a reduced-precision version of the exact parameter values $y^{(k)}$, then the algorithm suffers perturbations of $\delta_k = \tilde{y}^{(k)} - y^{(k)}$ at each iteration $k$. If the representation has a $p$-bit mantissa, then the size of $\delta_k$ is bounded by $|\delta_k| < 2^{-(p-1)} |y^{(k)}|$ (Higham, 2002).

**Example 2.2 (Bounded Staleness Consistency).** In stochastic gradient descent (SGD) under the stale synchronous parallel (SSP) consistency model (Ho et al., 2013), gradients are computed in a data-parallel fashion where each of $M$ machines may observe a stale version of the model parameters $\tilde{x}^{(k)}_m$. Suppose $\nabla(\tilde{x}^{(k)}_m, D_m)$ are the gradients computed during iteration $k$ using input data $D_m$ at machine $m$. If $\nabla(x^{(k)}, D)$ is the true stochastic gradient at iteration $k$, then the algorithm suffers a perturbation at iteration $k+1$ of

$$\delta_{k+1} = \frac{1}{M}\sum_{m=1}^{M} \nabla(\tilde{x}^{(k)}_m, D_m) - \nabla(x^{(k)}, D).$$

**Example 2.3 (Checkpoint-based Fault Tolerance).** In failure recovery from checkpoints, a copy of the entire job state is periodically saved to persistent storage, and is restored in the case of a failure. Suppose a system experiences a failure at iteration $T$, and recovers from the failure by restoring a full checkpoint of the model parameters taken at iteration $C < T$. Then the algorithm suffers a single perturbation at iteration $T$ of $\delta_T = y^{(C)} - y^{(T)}$.

## 3. Quantifying the Rework Cost

Given a target accuracy $\varepsilon > 0$, what is the "cost" in number of iterations for $y^{(k)}$ to reach $\varepsilon$-optimality compared to the unperturbed sequence $x^{(k)}$? We write "cost" in quotation marks to emphasize that this number can be negative: for example, $\delta_k$ could randomly move $y^{(k)}$ closer to $x^*$, or $\delta_k$ can be constructed in advance to improve convergence as in perturbed gradient descent (see Remark 2.2). We call this quantity the rework cost of the perturbed sequence $y^{(k)}$, introduced in Sec. 1. Our goal in the present section is to bound the rework cost, which will be formally defined next.
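Before formalizing the rework cost, here is a small empirical illustration (ours, not the paper's; the quadratic objective, step size, and fault schedule are assumptions). Gradient descent on a 4-D quadratic suffers a single checkpoint-rollback perturbation $\delta_T = y^{(C)} - y^{(T)}$ as in Example 2.3, and we count how many extra iterations it needs to reach $\varepsilon$-optimality:

```python
import numpy as np

def gd_step(x, A, b, lr=0.01):
    """One iterative-convergent update x^(k+1) = f(x^(k)) for a quadratic objective."""
    return x - lr * (A @ x - b)

def iters_to_eps(A, b, x_star, eps, perturb=None, max_iters=100_000):
    """Smallest k with ||y^(k) - x*|| < eps, where y^(k+1) = f(y^(k) + delta_k).

    `perturb(k, y)` returns the perturbation delta_k; if None, delta_k = 0 and
    the function returns the iteration count of the unperturbed sequence x^(k).
    """
    y = np.zeros_like(x_star)
    for k in range(max_iters):
        if np.linalg.norm(y - x_star) < eps:
            return k
        delta = perturb(k, y) if perturb is not None else 0.0
        y = gd_step(y + delta, A, b)
    return max_iters

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = np.diag([1.0, 2.0, 5.0, 10.0]), rng.normal(size=4)
    x_star = np.linalg.solve(A, b)
    eps = 1e-6

    # Example 2.3-style fault: at iteration T=500 the parameters are rolled
    # back to a checkpoint of the state taken at iteration C=400.
    history = {}
    def checkpoint_fault(k, y, C=400, T=500):
        history[k] = y.copy()
        return history[C] - y if k == T else np.zeros_like(y)

    clean = iters_to_eps(A, b, x_star, eps)                     # iterations for x^(k)
    faulty = iters_to_eps(A, b, x_star, eps, checkpoint_fault)  # iterations for y^(k)
    print("empirical rework cost (iterations):", faulty - clean)
```

The printed difference of the two iteration counts is exactly the empirical rework cost that is formalized next.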
### 3.1. Rework cost

In order to keep things simple, we assume that the unperturbed sequence satisfies

$$\|f(x^{(k)}) - x^*\| \le c\,\|x^{(k)} - x^*\|, \qquad 0 < c < 1. \qquad (3)$$

For any $\varepsilon > 0$, let $\tau(y^{(k)}, \varepsilon)$ denote the smallest integer $m_0$ such that $m \ge m_0$ implies $\mathbb{E}\|y^{(m)} - x^*\| < \varepsilon$ (this may be $+\infty$ or negative). Under (3), it is straightforward to derive a similar bound for the unperturbed sequence $x^{(k)}$, namely $\tau(x^{(k)}, \varepsilon) = \log\big(\|x^{(0)} - x^*\|/\varepsilon\big) / \log(1/c)$. This will be used as a baseline for comparison: the rework cost for the perturbations $\delta_k$ is defined to be

$$\tau(\delta_k, \varepsilon) := \tau(y^{(k)}, \varepsilon) - \tau(x^{(k)}, \varepsilon). \qquad (4)$$

Using the unperturbed sequence $x^{(k)}$ as a benchmark, $\tau(\delta_k, \varepsilon)$ bounds the additional number of iterations needed for the perturbed sequence $y^{(k)}$ to reach $\varepsilon$-optimality (where we bear in mind that this can be negative). Clearly, $\tau(\delta_k, \varepsilon)$ depends on the sequence $\delta_k$, and should be smaller whenever the $\delta_k$ are smaller. We seek a bound on $\tau(\delta_k, \varepsilon)$ that holds for arbitrary $\delta_k$.

**Remark 3.1.** We use the criterion $\mathbb{E}\|y^{(k)} - x^*\| < \varepsilon$ as an optimality criterion instead of directly bounding $\mathbb{P}(\|y^{(k)} - x^*\| < \varepsilon)$. This is commonly done (e.g. Bottou et al., 2016) since bounds on $\mathbb{E}\|y^{(k)} - x^*\|$ imply bounds on the latter probability via standard concentration arguments (see e.g. Rakhlin et al., 2012).

### 3.2. Bounding the rework cost

To bound the rework cost, we also require that the update $f$ satisfies a convergence rate similar to (3) for the perturbed data $\tilde{y}^{(k)} := y^{(k)} + \delta_k$:

$$\mathbb{E}\|f(\tilde{y}^{(k)}) - x^*\| \le c\,\mathbb{E}\|\tilde{y}^{(k)} - x^*\|, \qquad 0 < c < 1. \qquad (5)$$

**Theorem 3.1.** Suppose there are no perturbations after iteration $T$, i.e. $\delta_k = 0$ for all $k > T$. Under (3) and (5), we have for any $\varepsilon > 0$,

$$\tau(\delta_k, \varepsilon) \;\le\; \frac{\log\!\big(1 + \Delta_T / \|x^{(0)} - x^*\|\big)}{\log(1/c)}, \qquad (6)$$

where $\Delta_T := \sum_{\ell=0}^{T} c^{-\ell}\, \mathbb{E}\|\delta_\ell\|$.

In fact, the bound (6) is tight in the following sense: as long as (3) cannot be improved, there exists a deterministic sequence $\delta_1, \dots, \delta_T$ such that (6) holds with equality.

Theorem 3.1 is illustrated on a simple quadratic program (QP) in Figure 2, which provides empirical evidence of the tightness of the bound. The interesting part of the bound (6) is the ratio $\Delta_T / \|x^{(0)} - x^*\|$, which is essentially a ratio between the aggregated cost of the perturbations and the badness of the initialization. For more intuition, re-write this ratio as

$$\frac{\Delta_T}{\|x^{(0)} - x^*\|} = \frac{\sum_{\ell=0}^{T} c^{T-\ell}\, \mathbb{E}\|\delta_\ell\|}{c^{T}\, \|x^{(0)} - x^*\|}.$$

Up to constants, the denominator is just the error of the original sequence $x^{(k)}$ after $T$ iterations. The numerator is more interesting: it represents a time-discounted aggregate of the overall cost of each perturbation. Each perturbation $\delta_\ell$ is weighted by a discount factor $c^{T-\ell}$, which is larger for more recent perturbations (e.g. $\delta_T$) and smaller for older perturbations (e.g. $\delta_0$). Thus, the dominant quantity in (6) is a ratio between the re-weighted perturbations and the expected error from the original sequence. As expected, if the original sequence converges very quickly and the perturbations are large, the rework cost increases proportionally.

Figure 2. Illustrations of rework costs using gradient descent on a simple 4-D quadratic program: (a) rework cost vs. $\|\delta_k\|$ for a single perturbation at iteration 500; (b) rework cost vs. $\Delta_T$ for a single perturbation at iteration 500; (c) rework cost vs. $\Delta_T$ for perturbations occurring with probability $p = 0.001$ at each iteration. Each plot consists of 1,000 trials with perturbation(s) randomly generated according to a normal distribution. The red line is the rework cost bound according to Theorem 3.1. The value of $c$ is determined empirically, and the value of $\varepsilon$ is set so that an unperturbed trial converges in roughly 1,000 iterations.

Theorem 3.1 also assumes that there are no perturbations after time $T$. The idea is that if there are no more perturbations, (6) bounds the cost of the perturbations incurred so far. Of course, in practice, the system may experience faults after time $T$, in which case (6) can be adjusted to include the most recent fault. The difficulty in directly accounting for future perturbations lies in our assumption that the $\delta_k$ can be arbitrary: if future iterations can experience any perturbation, it is clear that convergence cannot be guaranteed (e.g. consider $\delta_k = \bar{x} - y^{(k)}$ for some fixed $\bar{x} \neq x^*$ and all $k > T$). Under some additional assumptions, something can be said about this case; see Example A.4.
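The bound (6) is easy to evaluate numerically. The sketch below (our illustration, not the paper's code; it assumes the contraction factor $c$ is known or estimated empirically, as in Figure 2, and uses $\Delta_T$ as written in (6)) computes the worst-case rework cost for a given schedule of expected perturbation sizes:

```python
import numpy as np

def rework_cost_bound(delta_sizes, c, x0_err):
    """Worst-case rework cost from the bound (6).

    delta_sizes : dict mapping iteration l -> E||delta_l|| (zero elsewhere)
    c           : contraction factor from assumptions (3) and (5), 0 < c < 1
    x0_err      : ||x^(0) - x*||, the badness of the initialization
    """
    T = max(delta_sizes)  # last perturbed iteration
    # Delta_T / ||x^(0) - x*||, written in the time-discounted form: each
    # delta_l is weighted by c^(T-l), and the denominator c^T * x0_err is
    # (up to constants) the error of the unperturbed sequence after T steps.
    ratio = sum(c ** (T - l) * d for l, d in delta_sizes.items()) / (c ** T * x0_err)
    return np.log1p(ratio) / np.log(1.0 / c)

if __name__ == "__main__":
    # Hypothetical numbers: a single perturbation of expected size 0.01 at
    # iteration 500, contraction factor c = 0.98, unit initial error.
    bound = rework_cost_bound({500: 0.01}, c=0.98, x0_err=1.0)
    print(f"rework cost is at most ~{bound:.0f} extra iterations")
```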
## 4. Checkpoint-Based Fault Tolerance

As an application of our framework, we study new strategies for checkpoint-based fault tolerance, by which a stateful computation is made resilient to hardware failures by periodically saving its program state to persistent storage. This fault-tolerance mechanism is used in many popular ML frameworks including TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017). Using traditional checkpointing, the entire saved program state is restored after a failure, and input data is re-loaded from its persistent storage. Then, all computation since the previous checkpoint is repeated. This process maximizes the consistency of recovery by restoring the system to an exact state it was in during the past, but can incur a high rework cost if the checkpoint interval is long. Let $T_{\mathrm{rework}}$ be the total amount of time spent re-computing lost iterations. For a single failure, $T_{\mathrm{rework}}$ for the traditional checkpoint strategy is the total amount of time between the previous checkpoint and the failure.

Although this traditional checkpointing is sufficient for many usage scenarios, it can break down in computing environments where the mean-time-to-failure is low (Harlap et al., 2017). For example, resource schedulers in shared clusters can kill running jobs to give more resources to higher-priority jobs, and cloud-based spot instances may be preempted frequently. In these environments, jobs using traditional checkpointing can incur a large penalty each time they experience a failure. In the most degenerate scenario, a job can run for an undetermined amount of time when its checkpoint interval is longer than the mean-time-to-failure. Thus, it is critical to reduce the rework cost incurred by checkpoint-based fault tolerance.

Fortunately, for iterative-convergent ML, we can exploit its self-correcting behavior to reduce $T_{\mathrm{rework}}$. In particular, we can give up the consistency of checkpoint recovery, and design a system which tries to reduce the size of the perturbation $\|\delta_T\|$ incurred upon failure. By doing so, Theorem 3.1 shows that the rework cost bound is also reduced, lowering the worst-case rework cost and thus reducing $T_{\mathrm{rework}}$.

We design a system architecture, SCAR,¹ consisting of two strategies which reduce $\|\delta_T\|$ compared to traditional checkpoint recovery: (1) partial recovery, and (2) prioritized checkpoints. SCAR extends the popular parameter server (PS) architecture for distributed model training (Ho et al., 2013; Li et al., 2014b;a): the model parameters are partitioned across a number of PS nodes, which are accessed by worker nodes. We assume that during a failure, any number of PS nodes can go down, causing the loss of their partitions of the model parameters. We present these strategies and the design of SCAR below, and show evaluation of SCAR in Section 5.

¹ SCAR stands for Self-Correcting Algorithm Recovery.

### 4.1. Partial Recovery

Our first strategy is to recover (i.e. from a previous checkpoint) only the part of the model parameters which is lost due to the failure. Since the model parameters are partitioned across several PS nodes, a partial failure of PS nodes should only cause a partial loss of model parameters. Mathematically, the partial recovery strategy should result in a smaller perturbation to the model parameters and, according to Theorem 3.1, incur a smaller rework cost.
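The difference between the two recovery strategies can be seen in a few lines. The sketch below (our illustration; it is not SCAR's implementation, and the partition layout and the amount of drift since the checkpoint are assumptions) compares the perturbation introduced by full versus partial recovery of checkpointed parameter-server partitions:

```python
import numpy as np

def full_recovery(current, checkpoint, failed):
    """Traditional recovery: every partition is rolled back to the checkpoint."""
    return {pid: checkpoint[pid].copy() for pid in current}

def partial_recovery(current, checkpoint, failed):
    """Restore only the partitions lost in the failure; keep surviving ones as-is."""
    return {pid: (checkpoint[pid].copy() if pid in failed else current[pid])
            for pid in current}

def perturbation_norm(recovered, current):
    """||delta||, where delta is the change applied to the pre-failure parameters."""
    return np.sqrt(sum(np.sum((recovered[p] - current[p]) ** 2) for p in current))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical: 4 parameter-server partitions, checkpointed some iterations ago.
    checkpoint = {p: rng.normal(size=1000) for p in range(4)}
    # The current (more converged) parameters have drifted from the checkpoint.
    current = {p: v + 0.01 * rng.normal(size=1000) for p, v in checkpoint.items()}
    failed = {2}  # one PS node goes down, losing only its partition

    d_full = perturbation_norm(full_recovery(current, checkpoint, failed), current)
    d_part = perturbation_norm(partial_recovery(current, checkpoint, failed), current)
    print(f"||delta|| full recovery = {d_full:.4f}, partial recovery = {d_part:.4f}")
```

Because partial recovery leaves the surviving partitions untouched, the perturbation it introduces is supported only on the lost coordinates; Theorem 4.1 below states the resulting comparison.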
Suppose that a fully-consistent checkpoint is taken after iteration $C$, and a failure occurs during iteration $T > C$, which triggers checkpoint recovery.

**Theorem 4.1.** Let $\delta$ be the perturbation incurred by full checkpoint recovery, and $\delta'$ be the perturbation incurred by partial checkpoint recovery. Then $\|\delta'\| \le \|\delta\|$.
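A short justification (our paraphrase, assuming the perturbations take the checkpoint-rollback form of Example 2.3 and writing $L$ for the set of coordinates lost in the failure): full recovery rolls back every coordinate, while partial recovery rolls back only the lost ones, so $\delta'$ agrees with $\delta$ on $L$ and is zero elsewhere.

```latex
% delta: full rollback on all coordinates; delta': rollback only on the lost set L.
\delta_i = y^{(C)}_i - y^{(T)}_i \ \ \text{for all } i,
\qquad
\delta'_i = \begin{cases} y^{(C)}_i - y^{(T)}_i, & i \in L,\\ 0, & i \notin L,\end{cases}
\qquad\Longrightarrow\qquad
\|\delta'\|^2 = \sum_{i \in L} \delta_i^2 \;\le\; \sum_{i} \delta_i^2 = \|\delta\|^2.
```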