# Stochastic Optimization with Laggard Data Pipelines

Naman Agarwal (Google AI Princeton, Princeton, NJ 08540, namanagarwal@google.com) · Rohan Anil (Google Research, Mountain View, CA 94043, rohananil@google.com) · Tomer Koren (Tel Aviv University & Google, Tel Aviv, Israel, tkoren@tauex.tau.ac.il) · Kunal Talwar (Cupertino, CA 95014, ktalwar@apple.com) · Cyril Zhang (Microsoft Research, New York, NY 10012, cyrilzhang@microsoft.com)

Abstract

State-of-the-art optimization is steadily shifting towards massively parallel pipelines with extremely large batch sizes. As a consequence, CPU-bound preprocessing and disk/memory/network operations have emerged as new performance bottlenecks, as opposed to hardware-accelerated gradient computations. In this regime, a recently proposed approach is data echoing (Choi et al., 2019), which takes repeated gradient steps on the same batch while waiting for fresh data to arrive from upstream. We provide the first convergence analyses of data-echoed extensions of common optimization methods, showing that they exhibit provable improvements over their synchronous counterparts. Specifically, we show that in convex optimization with stochastic minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.

1 Introduction

Recent empirical successes in large-scale machine learning have been powered by massive data parallelism and hardware acceleration, with batch sizes trending beyond 10K+ images [46] or 1M+ tokens [9]. Numerous interdisciplinary sources [5, 12, 24, 33] indicate that the performance bottlenecks of contemporary deep learning pipelines can lie in many places other than gradient computation. In other words, since the initial breakthroughs in hardware-accelerated deep learning [14, 28, 37], GPUs (and TPUs, FPGAs, etc.) have become too fast for upstream data loaders and preprocessors to keep up with.

Choi et al. [13] propose data echoing, a simple and versatile way to improve training performance in this regime. Each stage of the data pipeline runs asynchronously, oblivious to whether its input has been refreshed upstream. In particular, the optimization algorithm may choose to take additional gradient steps before a minibatch is refreshed, rather than spend idle time waiting for more data. The authors present a large-scale proof-of-concept empirical study, and find that data echoing affords a 3.25x speedup in a network-bound ImageNet setting. Some natural curiosities arise from this practice: When might this overfit? How carefully should one adjust the step size of an echoed gradient? Does acceleration work? A theoretical understanding of convergence guarantees for these data-echoed optimization algorithms is missing.

(Work performed while at Google Brain. Work performed while at Google AI Princeton and Princeton University.)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Figure 1: Schematic of data echoing, inspired by Choi et al. [13]. The upstream pipeline stages (disk I/O, network I/O, shuffler, augmentation, batch loader) feed the optimizer; if the upstream data pipeline is k = 4 times slower than SGD, then SGD can potentially take that many steps on the same batch before the next one arrives.

| Echoing factor | T = 1 | T general |
|---|---|---|
| k = 1 | — | SGD |
| k small | Compute-bound ERM | Data echoing (Thm. 7) |
| k large | Data-bound ERM | Approx-Prox [43] |
| k → ∞ | Statistical ERM | Minibatch-Prox [43] |

Table 1: Regimes of the echoing factor k and the number of batches T between which our analyses interpolate.
In this paper, we settle the issues of convergence and generalization for echoed gradient methods in convex optimization. We show that these methods can match the optimization performance of their non-stochastic counterparts, while achieving optimal statistical rates. As state-of-the-art batch sizes continue to grow, along with the distributed systems that enable them, we hope that this will provide a first theoretical grounding towards understanding the algorithmic and statistical challenges in these hardware-motivated optimization settings.

1.1 Technical contributions

Our model of data echoing is parameterized by the batch size $b$, the number of fresh i.i.d. batches $T$, and the echoing factor $k$, which is the number of gradient steps an algorithm can take on the (convex) loss on each batch. This reflects the hardware-determined setting where the data loader is at least $k$ times slower than the optimizer.

Convergence in all data echoing regimes. We first show that echoed SGD, with a correctly tuned step size, achieves a factor-$k$ speedup on the curvature term of the standard convergence rate, while keeping the optimal statistical term. Next, we develop an echoed method that is oblivious to the echoing factor $k$, attaining the same rates as echoed SGD by adding an appropriately chosen proximal regularizer. Finally, we show that Nesterov's accelerated gradient descent, when echoed, achieves the optimal rates on quadratic losses. As a side contribution, we fix a small error in a technical lemma in [11], used in establishing the stability of AGD on quadratics. For general convex losses, we arrive at the same open question as these authors.

Full interpolation between known regimes. To set up notation, suppose that we go over $T$ batches of data, and perform $k$ echoed gradient steps on each batch. In the special case of $T = 1$ fresh batch, the problem becomes empirical risk minimization with a limited computational budget of $k$ gradient steps. When $k$ is small, the error is dominated by a curvature term, while for large enough $k$ this falls below the statistical term. Motivated by the communication-limited setting, Wang et al. [43] focus on the case where $T$ is general and $k \to \infty$, analyzing the convergence of exact optimization of the prox-regularized minibatch loss. They develop a mild approx-prox guarantee when $k$ is large enough to enable an exact-plus-perturbation analysis. Our analysis generalizes and strengthens these results, handling all values of $k$; Table 1 summarizes this discussion. When $b \to \infty$, the statistical problem disappears, and we recover the classical setting of full gradient descent with $kT$ oracle calls [8, 35].

| Algorithm | Standard | Data-echoed |
|---|---|---|
| SGD | $O\!\big(\tfrac{\beta D^2}{T} + \tfrac{GD}{\sqrt{bT}}\big)$ (classical; see Lan [29]) | $O\!\big(\tfrac{\beta D^2}{kT} + \tfrac{GD}{\sqrt{bT}}\big)$ (Theorem 7) |
| Minibatch-Prox | $O\!\big(\tfrac{\beta D^2}{T} + \tfrac{GD}{\sqrt{bT}}\big)$ (Wang et al. [43]; $k$ large) | $O\!\big(\tfrac{\beta D^2}{kT} + \tfrac{GD}{\sqrt{bT}}\big)$ (Theorem 10) |
| Stochastic AGD | $O\!\big(\tfrac{\beta D^2}{T^2} + \tfrac{GD}{\sqrt{bT}}\big)$ | $O\!\big(\tfrac{\beta D^2}{k^2 T^2} + \tfrac{GD}{\sqrt{bT}}\big)$ (Theorem 13; quadratics) |

Table 2: Single-step and data-echoed convergence rates of the stochastic optimization algorithms studied in this paper. Notice that the optimization terms depend analogously on the total number of gradient steps $kT$, and the statistical terms have optimal dependence on the total number of i.i.d. samples $bT$.

Stability-based analysis. We provide a modular proof framework for data-echoing convergence bounds, based on uniform stability [7] and a potential-based notion of regret, which isolates the bias (curvature) and variance (generalization) components of the problem. This recipe (Theorem 4) can be used to sharpen bounds in more restricted settings, or to analyze future data-echoed algorithms.
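To make the interpolation concrete, the display below (our own summary, consistent with Table 2 up to constants) specializes the data-echoed rate of Theorem 7 to the limiting regimes of Table 1.

```latex
% Data-echoed GD rate (Theorem 7), up to constants:
%   E[F(w_out)] - F(w*)  <=  beta*D^2/(kT) + G*D/sqrt(bT)
\[
  \underbrace{\frac{\beta D^2}{kT}}_{\text{curvature (bias)}}
  \;+\;
  \underbrace{\frac{GD}{\sqrt{bT}}}_{\text{statistical (variance)}}
  \qquad\Longrightarrow\qquad
  \begin{cases}
    \dfrac{\beta D^2}{k} + \dfrac{GD}{\sqrt{b}} & T = 1 \ \text{(ERM with a budget of } k \text{ steps)},\\[1ex]
    \dfrac{GD}{\sqrt{bT}} & k \to \infty \ \text{(statistical / Minibatch-Prox regime)},\\[1ex]
    \dfrac{\beta D^2}{kT} & b \to \infty \ \text{(full GD with } kT \text{ oracle calls)}.
  \end{cases}
\]
```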
1.2 Motivation and context

It is well-known in the practice of GPU training that model parameter updates are not necessarily the performance bottleneck; this is why SSD storage is critical for pipelines on the scale of ImageNet [17]. For quantitative studies of I/O performance in deep learning, see [12, 45]. Many empirical advances have stemmed from innovations in data augmentation [15, 22, 41]. Unlike neural network training and inference, these data transformations can be highly sequential and/or heterogeneous, and must be done on CPU. Unlocking GPU parallelism for CPU-bound computations is often a significant engineering effort [16, 20, 27, 32].

Extremely large batch sizes have become the norm in training state-of-the-art models [4, 9, 18, 40, 46]. An overwhelming theme has been that constant factors matter; for example, memory-bound optimizers [2, 10, 39] care about factors of 2-3. When selecting hyperparameters in large-batch training setups, it is common to balance the curvature- and noise-dominated terms [26, 34, 38, 42]. This underscores the need to better understand the fine-grained dependences on $b$, $T$, and $k$, especially when the resources at stake are on the scale of GPU-years.

The idea of taking repeated steps on a batch has also been investigated in the context of federated learning [25], where a related concept is referred to as local SGD or federated averaging. There are two key distinctions: federated learning considers multiple copies of local SGD running on different workers, which synchronize intermittently through averaging; and, under the most simplified assumptions, each individual gradient step within a worker is taken on a fresh batch. While the improvements obtained in this recent and concurrent line of work (see [44] and references therein) bear resemblance to our bounds, we do not see a direct reduction in either direction. Indeed, due to the distinctions mentioned, getting similar improvements to the curvature term in federated learning is not possible beyond quadratics, as shown by [44]. Obtaining optimal rates for convex functions in the federated learning setting remains an interesting open problem.

1.3 The bias-variance problem in data echoing

As mentioned earlier, data echoing presents a natural tradeoff between the optimization gains from repeating gradient steps and the potential loss of generalization due to overfitting to stale batches. To understand this in detail, let us revisit the standard convergence guarantee for minibatch SGD on smooth functions:
$$\mathbb{E}[F(w_{\mathrm{out}})] - F(w^*) \le O\!\left(\frac{\beta D^2}{T} + \frac{GD}{\sqrt{bT}}\right).$$
We interpret the first term as a bias (curvature) term, which diminishes at a faster rate due to smoothness. The second term is the variance (statistical) term, which arises due to the stochasticity in the data, and thus naturally scales as the inverse square root of the batch size. Viewing $b$ as fixed, the variance term is intrinsic to the data; therefore, we cannot expect data echoing (or any algorithm) to give us improvements on that term for free. In fact, it is possible to make this term degrade, by overfitting on a batch. On the other hand, we can expect the bias term, which is governed by progress on the curvature of the underlying population loss, to decrease as we are given more echoing steps $k$. In light of this, the best analogous convergence rate one should hope to achieve in the data-echoing setting is
$$\mathbb{E}[F(w_{\mathrm{out}})] - F(w^*) \le O\!\left(\frac{\beta D^2}{kT} + \frac{GD}{\sqrt{bT}}\right).$$
Our results establish exactly this rate for the data-echoed version of gradient descent.
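As a purely illustrative calculation (not from the paper), the short script below evaluates the two terms of this target rate for made-up problem constants, showing that increasing the echoing factor k helps only until the statistical term dominates.

```python
import math

# Hypothetical problem constants (placeholders, not from the paper):
beta, G, D = 10.0, 1.0, 1.0   # smoothness, Lipschitz constant, initial distance
b, T = 256, 100               # batch size, number of fresh batches

for k in [1, 2, 4, 8, 16, 32]:
    bias = beta * D**2 / (k * T)         # curvature term: shrinks with more echoing
    variance = G * D / math.sqrt(b * T)  # statistical term: fixed by the fresh data
    print(f"k={k:3d}  bias={bias:.4f}  variance={variance:.4f}  "
          f"dominant={'bias' if bias > variance else 'variance'}")
```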
The data-echoed version of accelerated gradient descent is also shown to possess similar gains, but at a faster rate: its bias term decays as $1/(k^2 T^2)$. The challenge is to prevent overfitting; obtaining such rates requires careful control (depending on $k$) of step sizes. Later, we alleviate this need via data-echoed proximal GD, whose parameters are independent of $k$.

1.4 Overview of techniques

All of our theorems follow the same analysis structure. In particular, we formalize a notion of potential-bounded regret (Definition 3), which connects an algorithm's function-value progress on a minibatch to a decrement of a certain potential function with respect to an arbitrary point. This potential function depends on the algorithm in question, but the key property is that it telescopes when summed over batches; this provides a fast rate on the bias term with respect to $kT$. The second piece of the analysis connects function-value decrease on a batch to the population objective via the notion of uniform stability (Definition 1). Note that the potential decrease scales inversely with the product $\eta k$ of the step size and the echoing factor, whereas the stability constant grows with $\eta k$ (unless a proximal regularizer is added). The key to maintaining the optimal statistical rate is to balance these terms via the choice of an appropriate step size. This type of algorithmic stability analysis has appeared various times in the literature [7, 11, 21]; we show here that it affords a way to analyze echoed gradient methods.

2 Preliminaries

2.1 Problem definition

Given a convex set $W \subseteq \mathbb{R}^d$ and a domain $Z$ with a distribution $\mathcal{D}$, we consider the following stochastic convex optimization problem:
$$\min_{w \in W} \; F(w) \;\stackrel{\mathrm{def}}{=}\; \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]. \tag{1}$$
Here $f : W \times Z \to \mathbb{R}$ is such that for any $z$, $f(\cdot, z)$ is convex, differentiable, $G$-Lipschitz, and $\beta$-smooth; i.e., for all $w, w' \in W$,
$$f(w, z) \le f(w', z) + \langle \nabla f(w', z), w - w' \rangle + \tfrac{\beta}{2}\|w - w'\|^2.$$
When the minimizer exists, we define $w^* = \arg\min_{w \in W} F(w)$. However, our results pertaining to optimality gaps $F(w) - F(w^*)$ hold for arbitrary $w^*$, encompassing the case when this minimizer does not exist. We further assume that we have access to an initial point $w_0$ with a bounded distance from the comparator; i.e., $\|w_0 - w^*\| \le D$.

Minibatch optimization. We will work in the stochastic minibatch oracle model: at each time step $t$, we receive a new batch of $b$ examples $\mathbf{z}^{(t)} = \{z^{(t,i)}\}_{i=1}^{b}$ sampled i.i.d. from the distribution $\mathcal{D}$. For any batch of examples $\mathbf{z} = \{z^{(i)}\}$, we define the empirical objective on the batch as
$$f_{\mathbf{z}}(w) \;\stackrel{\mathrm{def}}{=}\; \frac{1}{b} \sum_{i=1}^{b} f(w, z^{(i)}).$$
Throughout this paper, we will use boldface $\mathbf{z}$ to denote a batch of examples, and unbolded $z$ to represent a single example in $Z$.

Optimization algorithms. We formalize a generic notion of optimization algorithms. Since these algorithms are called repeatedly by the data-echoing procedure, we augment the output space of optimization algorithms with a notion of state, which an algorithm maintains internally and passes to its next run. Formally, an optimization algorithm is an iterative procedure which takes four arguments: an initial point $w_{\mathrm{init}} \in W$, an initial state $s_{\mathrm{init}}$, the current batch $\mathbf{z}$ which determines the current objective $f_{\mathbf{z}}$, and the number of steps $k$. The algorithm outputs a point $w_{\mathrm{out}} \in W$ and an output state $s_{\mathrm{out}}$. In short, an algorithm $\mathcal{A}$ implements $(w_{\mathrm{out}}, s_{\mathrm{out}}) \leftarrow \mathcal{A}(w_{\mathrm{init}}, s_{\mathrm{init}}, \mathbf{z}, k)$. We will suppress one or more of the arguments to $\mathcal{A}$ when they are clear from the context, and write $f(\mathcal{A}(\cdot))$ as a shorthand for $f(w_{\mathrm{out}})$, ignoring the auxiliary state $s_{\mathrm{out}}$. Note that $w_{\mathrm{out}}$ and $s_{\mathrm{out}}$ are random variables, determined by the stochastic minibatch $\mathbf{z}$.
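To make the stateful-optimizer abstraction concrete, here is a minimal Python sketch of the interface $(w_{\mathrm{out}}, s_{\mathrm{out}}) \leftarrow \mathcal{A}(w_{\mathrm{init}}, s_{\mathrm{init}}, \mathbf{z}, k)$; the type names and gradient-oracle signature are our own illustrative choices, not part of the paper.

```python
from typing import Any, Callable, Protocol, Tuple
import numpy as np

# Illustrative types: a batch is an array of i.i.d. examples, and a gradient
# oracle maps (w, batch) to the gradient of the empirical batch objective f_z.
Batch = np.ndarray
GradOracle = Callable[[np.ndarray, Batch], np.ndarray]


class Optimizer(Protocol):
    """Stateful inner optimizer A: (w_init, s_init, batch, k) -> (w_out, s_out)."""

    def __call__(
        self,
        w_init: np.ndarray,
        s_init: Any,
        batch: Batch,
        k: int,
    ) -> Tuple[np.ndarray, Any]:
        ...
```

Concrete inner optimizers of this shape (gradient descent, proximal gradient descent, accelerated gradient descent) are defined in Section 3.1 and sketched after Algorithm 1 below.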
2.2 Algorithmic stability

Definition 1 (Uniform stability). A deterministic algorithm $\mathcal{A}$ is said to be $\epsilon$-uniformly stable with respect to a loss function $f : W \times Z \to \mathbb{R}$ if, for any two batches of data $\mathbf{z}, \mathbf{z}'$ differing in exactly one example, we have
$$\sup_{z \in Z} \; \big| f(\mathcal{A}(\mathbf{z}), z) - f(\mathcal{A}(\mathbf{z}'), z) \big| \le \epsilon.$$
(A similar definition exists for randomized algorithms [7]; in this work, we focus on deterministic algorithms.)

The following is a well-known result connecting stability to generalization [7]. Here, we state a version taken from [21]:

Theorem 2. If an algorithm $\mathcal{A}$ is $\epsilon$-uniformly stable, then it holds that
$$\mathbb{E}_{\mathbf{z}}\big[ F(\mathcal{A}(\mathbf{z})) - f_{\mathbf{z}}(\mathcal{A}(\mathbf{z})) \big] \le \epsilon.$$

3 The data echoing meta-algorithm

Given a minibatch optimization algorithm $\mathcal{A}$, its data-echoed extension is defined by Algorithm 1.

Algorithm 1 Data-echoing meta-algorithm
1: Input: optimizer $\mathcal{A}$; initializer $w_{\mathrm{init}} := w_0$; initial state $s_{\mathrm{init}} := s_0$; number of inner steps $k$.
2: for $t = 0, \ldots, T-1$ do
3:   Receive a batch of $b$ examples $\mathbf{z}^{(t)} = \{z^{(t,i)}\}_{i=1}^{b}$.
4:   Execute $\mathcal{A}$ on $\mathbf{z}^{(t)}$ starting at $w_t$ for $k$ steps: $(w_{t+1}, s_{t+1}) \leftarrow \mathcal{A}(w_t, s_t, \mathbf{z}^{(t)}, k)$.
5: Output: average iterate $w_{\mathrm{out}} := \frac{1}{T}\sum_{t=1}^{T} w_t$.

3.1 Data-echoed algorithms

Using the framework of Algorithm 1, we introduce the data-echoed versions of three ubiquitous optimization algorithms. In [13], several types of data echoing are defined; we focus on what the authors call batch echoing.

Data-echoed gradient descent. We first formalize gradient descent in our optimization framework. The gradient descent procedure only contains the fixed learning rate as its state: $s_{\mathrm{init}} = s_{\mathrm{out}} := \{\eta\}$. The iterations defining the inner algorithm $\mathcal{A}$ are straightforward:
$$w_0 = w_{\mathrm{init}}, \qquad \big\{\, w_{j+1} = w_j - \eta \nabla f_{\mathbf{z}}(w_j) \,\big\}_{j=0}^{k-1}, \qquad w_{\mathrm{out}} = w_k.$$
When Algorithm 1 is instantiated with this choice of $\mathcal{A}$, we call the overall procedure data-echoed gradient descent.

Data-echoed proximal gradient descent. The state of the proximally-regularized gradient descent procedure contains three variables: the fixed learning rate $\eta$, the prox parameter $\lambda$, and $w_{\mathrm{pivot}}$, the center of the prox term: $s_{\mathrm{init}} := \{\eta, \lambda, w_{\mathrm{pivot}}\}$. We define the proximal objective
$$f_{\mathrm{prox}}(w) = f_{\mathbf{z}}(w) + \tfrac{\lambda}{2}\|w - w_{\mathrm{pivot}}\|^2.$$
The iterations proceed in the same way as gradient descent, but on $f_{\mathrm{prox}}$:
$$w_0 = w_{\mathrm{init}}, \qquad \big\{\, w_{j+1} = w_j - \eta \nabla f_{\mathrm{prox}}(w_j) \,\big\}_{j=0}^{k-1}, \qquad w_{\mathrm{out}} = w_k.$$
The output state is $s_{\mathrm{out}} = \{\eta, \lambda, \frac{1}{k}\sum_{j=0}^{k-1} w_j\}$; this particular choice of returning the average iterate as the next $w_{\mathrm{pivot}}$ simplifies our analysis. With this choice of $\mathcal{A}$, the overall procedure is called data-echoed proximal gradient descent.

Data-echoed accelerated gradient descent. The state space for accelerated gradient descent consists of a step size $\eta$, an initial momentum vector $v$, and a momentum scale factor $\gamma$; thus $s_{\mathrm{init}} = \{\eta, v, \gamma\}$. Define the scalar sequence $\{\gamma_j\}$ with $\gamma_0 = \gamma$ via the recursion $\gamma_{j+1}^2 - \gamma_{j+1} = \gamma_j^2$. The updates follow the progression of Nesterov's acceleration [36]: $w_0 = w_{\mathrm{init}}$, $v_0 = v$, and for $j = 0, \ldots, k-1$,
$$w_{j+1} = (w_j + v_j) - \eta \nabla f_{\mathbf{z}}(w_j + v_j), \qquad v_{j+1} = \frac{\gamma_j - 1}{\gamma_{j+1}}\,(w_{j+1} - w_j).$$
Finally, the outputs are given by $s_{\mathrm{out}} = \{\eta, v_k, \gamma_k\}$ and $w_{\mathrm{out}} = w_k$. With this choice of $\mathcal{A}$, we refer to the overall procedure as data-echoed accelerated gradient descent.
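As an illustration (not the authors' code), the sketch below instantiates Algorithm 1 in Python with the batch-echoed gradient-descent and proximal inner steps just defined; the gradient oracle, the omission of a projection onto W, and all hyperparameter values are simplifying assumptions.

```python
import numpy as np
from typing import Callable, Iterable, Tuple

# grad_f(w, batch) is assumed to return the gradient of the empirical batch
# objective f_z at w; batches are assumed to be arrays of i.i.d. examples.
GradOracle = Callable[[np.ndarray, np.ndarray], np.ndarray]


def echoed_gd_inner(w: np.ndarray, batch: np.ndarray, grad_f: GradOracle,
                    k: int, eta: float) -> np.ndarray:
    """Inner optimizer A for data-echoed GD: k gradient steps on a single batch."""
    for _ in range(k):
        w = w - eta * grad_f(w, batch)
    return w


def echoed_prox_gd_inner(w: np.ndarray, batch: np.ndarray, grad_f: GradOracle,
                         k: int, eta: float, lam: float,
                         w_pivot: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Inner optimizer for data-echoed proximal GD: k steps on
    f_z(w) + (lam/2)||w - w_pivot||^2; returns (final iterate, new pivot)."""
    iterates = [w]
    for _ in range(k):
        g = grad_f(w, batch) + lam * (w - w_pivot)
        w = w - eta * g
        iterates.append(w)
    new_pivot = np.mean(iterates[:-1], axis=0)  # average of w_0, ..., w_{k-1}
    return w, new_pivot


def data_echoing(w0: np.ndarray, batches: Iterable[np.ndarray],
                 grad_f: GradOracle, k: int, eta: float) -> np.ndarray:
    """Algorithm 1 (batch echoing) with the GD inner step; returns the average
    of the outer iterates w_1, ..., w_T."""
    w, outer_iterates = w0, []
    for batch in batches:           # T fresh i.i.d. batches from the pipeline
        w = echoed_gd_inner(w, batch, grad_f, k, eta)
        outer_iterates.append(w)
    return np.mean(outer_iterates, axis=0)
```

The proximal variant slots into the same outer loop by threading the pair (iterate, pivot) through the state between batches.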
4 Convergence analyses of echoed methods

We will analyze the data-echoing algorithms by separating their optimization properties from their stability properties. For the latter, we use the standard notion of uniform stability, as defined earlier. For the optimization part, we use a notion of potential-bounded regret, which we define next.

Definition 3 (Potential-bounded regret). We say that an algorithm $\mathcal{A}$ has potential-bounded regret with potential function $V_{\mathcal{A}}$ if, given a $\beta$-smooth convex function $f$ on a domain $W$ and a starting point $w_{\mathrm{init}}$, $\mathcal{A}$ produces a point $w_{\mathrm{out}}$ such that for all $w^* \in W$, it holds that
$$f(w_{\mathrm{out}}) - f(w^*) \le V_{\mathcal{A}}(w_{\mathrm{init}}, s_{\mathrm{init}}, w^*) - V_{\mathcal{A}}(w_{\mathrm{out}}, s_{\mathrm{out}}, w^*).$$
This inequality is a fundamental lemma in the standard analysis of mirror descent (see [6], or Section B.2 of [1]), but we extend it to nested stateful algorithms instead of a single step. For the echoed algorithms we analyze in this work, squared Euclidean norms will be suitable potentials. We state and prove our main generic theorem below.

Theorem 4. Let $\mathcal{A}$ be an $\epsilon$-uniformly stable algorithm. Furthermore, suppose $\mathcal{A}$ has the potential-bounded regret property with respect to $V_{\mathcal{A}}$. Then, for any $w^* \in W$, Algorithm 1 with inner algorithm $\mathcal{A}$ satisfies
$$\mathbb{E}[F(w_{\mathrm{out}})] - F(w^*) \le \frac{V_{\mathcal{A}}(w_0, s_0, w^*) - \mathbb{E}[V_{\mathcal{A}}(w_T, s_T, w^*)]}{T} + \epsilon.$$

Proof. From the potential-bounded regret property of the algorithm $\mathcal{A}$, applied to the batch objective $f_{\mathbf{z}^{(t)}}$, we get that
$$f_{\mathbf{z}^{(t)}}(w_{t+1}) - f_{\mathbf{z}^{(t)}}(w^*) \le V_{\mathcal{A}}(w_t, s_t, w^*) - V_{\mathcal{A}}(w_{t+1}, s_{t+1}, w^*).$$
Let $\mathbb{E}_t[\cdot]$ denote the expectation conditioned on all randomness in the minibatches up to (and including) time $t$. From the uniform stability of $\mathcal{A}$ (via Theorem 2), we get that
$$\mathbb{E}[F(w_{t+1})] = \mathbb{E}_{t-1}\,\mathbb{E}_{\mathbf{z}^{(t)}}[F(w_{t+1})] \le \mathbb{E}_{t-1}\,\mathbb{E}_{\mathbf{z}^{(t)}}\big[f_{\mathbf{z}^{(t)}}(w_{t+1})\big] + \epsilon.$$
Thus we have
$$\mathbb{E}[F(w_{t+1})] - F(w^*) \le \mathbb{E}_{t-1}\,\mathbb{E}_{\mathbf{z}^{(t)}}\big[f_{\mathbf{z}^{(t)}}(w_{t+1}) - f_{\mathbf{z}^{(t)}}(w^*)\big] + \epsilon \le \mathbb{E}[V_{\mathcal{A}}(w_t, s_t, w^*)] - \mathbb{E}[V_{\mathcal{A}}(w_{t+1}, s_{t+1}, w^*)] + \epsilon.$$
Summing the above over time and using the convexity of $F$ gives us that
$$\mathbb{E}[F(w_{\mathrm{out}})] - F(w^*) \le \frac{1}{T}\sum_{t=0}^{T-1}\big(\mathbb{E}[F(w_{t+1})] - F(w^*)\big) \le \frac{V_{\mathcal{A}}(w_0, s_0, w^*) - \mathbb{E}[V_{\mathcal{A}}(w_T, s_T, w^*)]}{T} + \epsilon.$$

In the rest of the section, we present various applications of our main data-echoing theorem. In each case, we consider a standard algorithm, derive its stability and potential-bounded regret properties, and then use Theorem 4 to derive the convergence rate for its echoed version. All regret proofs can be found in Appendix A, and stability proofs in Appendix B; the corresponding convergence rates for the echoed algorithms are proven in Appendix C.

4.1 Echoed gradient descent

We begin by establishing the following properties of gradient descent. In the rest of the theorem and lemma statements in this section, $w^*$ is an arbitrary point in $W$.

Lemma 5 (Potential-bounded regret for GD). Let $f$ be a $\beta$-smooth convex function. Then $k$ steps of gradient descent on $f$, with a step size $\eta \le 1/\beta$, satisfy the potential-bounded regret property with $V(w, s, w^*) := \frac{1}{2\eta k}\|w - w^*\|^2$:
$$f(w_{\mathrm{out}}) - f(w^*) \le \frac{1}{2\eta k}\Big(\|w_{\mathrm{init}} - w^*\|^2 - \|w_{\mathrm{out}} - w^*\|^2\Big).$$

Lemma 6 (Stability of GD). For a $\beta$-smooth function $f$ and any $0 \le \eta \le 1/\beta$, gradient descent on the batch objective $f_{\mathbf{z}}$, run with step size $\eta$ for $k$ steps, is $\epsilon$-uniformly stable with $\epsilon = 2k\eta G^2/b$.

Combining Lemmas 5 and 6, we conclude the following convergence bound for data-echoed GD:

Theorem 7 (Data-echoed GD). $T$ outer steps of data-echoed gradient descent, with an appropriately tuned step size $\eta \le 1/\beta$ and $k$ internal steps, produce a point $w_{\mathrm{out}}$ satisfying
$$\mathbb{E}[F(w_{\mathrm{out}})] - F(w^*) \le O\!\left(\frac{\beta D^2}{kT} + \frac{GD}{\sqrt{bT}}\right).$$
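To see how the pieces combine, here is a sketch of the step-size balancing behind Theorem 7, plugging the potential of Lemma 5 and the stability constant of Lemma 6 into Theorem 4; the explicit constants below come from our own calculation and are only meant to be indicative.

```latex
% Sketch: Lemma 5 (potential) and Lemma 6 (stability) inserted into Theorem 4,
% with V(w,s,w*) = ||w - w*||^2 / (2 eta k) and eps = 2 k eta G^2 / b:
\[
  \mathbb{E}[F(w_{\mathrm{out}})] - F(w^*)
  \;\le\; \frac{\|w_0 - w^*\|^2}{2\eta k T} + \frac{2k\eta G^2}{b}
  \;\le\; \frac{D^2}{2\eta k T} + \frac{2k\eta G^2}{b}.
\]
% Equating the two terms (subject to eta <= 1/beta) gives
% eta = D sqrt(b) / (2 G k sqrt(T)); substituting back, and using eta = 1/beta
% whenever that value exceeds 1/beta, yields
\[
  \mathbb{E}[F(w_{\mathrm{out}})] - F(w^*)
  \;\le\; \frac{\beta D^2}{2kT} + \frac{2GD}{\sqrt{bT}}
  \;=\; O\!\left(\frac{\beta D^2}{kT} + \frac{GD}{\sqrt{bT}}\right).
\]
```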
4.2 Echoed proximal gradient descent

For proximal GD, we derive the following bounds on potential-bounded regret and stability:

Lemma 8 (Potential-bounded regret for prox-GD). Let $f$ be a $\beta$-smooth convex function. Consider the potential function
$$V(w, \{\eta, \lambda, w_{\mathrm{pivot}}\}, w^*) = \frac{1}{2\eta k}\|w - w^*\|^2 + \frac{\lambda}{2}\|w^* - w_{\mathrm{pivot}}\|^2.$$
Then $k$-step proximal gradient descent, with step size $\eta \le 1/(\beta + \lambda)$, has regret bounded by
$$f(w_{\mathrm{out}}) - f(w^*) \le V(w_{\mathrm{init}}, s_{\mathrm{init}}, w^*) - V(w_{\mathrm{out}}, s_{\mathrm{out}}, w^*).$$

Lemma 9 (Stability of prox-GD). For a $\beta$-smooth function $f$, any $\lambda \ge 0$, and any $0 \le \eta \le 1/(\beta + \lambda)$, $k$ steps of proximal gradient descent are $\epsilon$-uniformly stable with $\epsilon = 2G^2/(\lambda b)$.

The proofs of both lemmas are included in the supplementary material. Combining Lemmas 8 and 9, we get the following guarantee on the performance of data-echoed prox-GD (proof included in the supplementary material):

Theorem 10 (Data-echoed prox-GD). $T$ outer steps of data-echoed proximal gradient descent, with a suitably chosen prox parameter $\lambda$, step size $\eta = \frac{1}{\beta + \lambda}$, and $k$ internal steps, produce a point $w_{\mathrm{out}}$ satisfying
$$\mathbb{E}[F(w_{\mathrm{out}})] - F(w^*) \le O\!\left(\frac{\beta \|w_{\mathrm{init}} - w^*\|^2}{kT} + \frac{G\|w_{\mathrm{init}} - w^*\|}{\sqrt{bT}}\right).$$
Note that with this algorithm, the correct choice of step size no longer depends on the echoing factor $k$. In fact, even if $k$ varies across the execution of the proximal algorithm, a straightforward extension of our analysis shows that proximal gradient descent can achieve $\sum_t k_t$ instead of the $kT$ factor in the denominator of the bias term. This resilience to indeterminate echoing factors is especially appealing for the motivating setting of asynchronous pipelines.

4.3 Echoed accelerated gradient descent

For the case of Nesterov's accelerated gradient descent, we consider a slightly modified version of our data-echoing meta-procedure. This arises from the fact that, even in the stochastic setting, accelerated gradient descent algorithms output the final iterate rather than the average iterate. The resulting slightly modified procedure is outlined in Algorithm 2.

Algorithm 2 Data-echoing meta-algorithm (final iterate)
1: Input: optimizer $\mathcal{A}$; initializer $w_{\mathrm{init}} := w_0$; initial state $s_{\mathrm{init}} := s_0$; number of inner steps $k$.
2: for $t = 0, \ldots, T-1$ do
3:   Receive a batch of $b$ examples $\mathbf{z}^{(t)} = \{z^{(t,i)}\}_{i=1}^{b}$.
4:   Execute $\mathcal{A}$ on $\mathbf{z}^{(t)}$ starting at $w_t$ for $k$ steps: $(w_{t+1}, s_{t+1}) \leftarrow \mathcal{A}(w_t, s_t, \mathbf{z}^{(t)}, k)$.
5: Output: final iterate $w_{\mathrm{out}} := w_T$.

We also add a slight extension to our potential-based regret abstraction:

Lemma 11 (Potential-bounded regret for AGD). Let $f$ be a $\beta$-smooth convex function. Running accelerated gradient descent for $k$ steps, with a step size $\eta \le 1/\beta$, gives the regret bound
$$\eta\gamma_{\mathrm{out}}^2\big(f(w_{\mathrm{out}}) - f(w^*)\big) - \eta\gamma_{\mathrm{init}}^2\big(f(w_{\mathrm{init}}) - f(w^*)\big) \le \tfrac{1}{2}\Big(\|w_{\mathrm{init}} + \gamma_{\mathrm{init}} v_{\mathrm{init}} - w^*\|^2 - \|w_{\mathrm{out}} + \gamma_{\mathrm{out}} v_{\mathrm{out}} - w^*\|^2\Big).$$

Further, to bound the stability, we note the following lemma, which was essentially proved in [11]. Since we believe there is a small typo in the main argument in the original presentation of the proof, we provide an alternate derivation in the supplementary material.

Lemma 12 (Stability of AGD). Suppose that $f(\cdot, z)$ is a $\beta$-smooth convex quadratic function of $w$ for any $z$. Then, for any $0 \le \eta \le 1/\beta$ and any initial state $s_{\mathrm{init}}$, $k$ steps of accelerated gradient descent are $\epsilon$-uniformly stable with $\epsilon = O(\eta G^2 k^2 / b)$.

Combining Lemmas 11 and 12, we obtain the following guarantee for data-echoed AGD:

Theorem 13. Suppose $f(\cdot, z)$ is a convex quadratic in $w$, for all $z$. Then $T$ outer steps of echoed AGD, with echoing factor $k$ and step size $\eta = \Theta\big(\min\{\tfrac{1}{\beta},\, \tfrac{D\sqrt{b}}{G k^2 T^{3/2}}\}\big)$, produce a point $w_{\mathrm{out}}$ satisfying
$$\mathbb{E}[F(w_{\mathrm{out}})] - F(w^*) = O\!\left(\frac{\beta D^2}{k^2 T^2} + \frac{GD}{\sqrt{bT}}\right).$$

5 Experiments

We demonstrate numerical experiments on convex machine learning benchmarks. This serves as a validation of our theoretical findings, as well as a way to examine beyond-worst-case phenomena not captured by our minimax convergence guarantees. It can be seen as a combination of the experiments of Figures 4-6 in [13], where we have exchanged the state-of-the-art setting for a more robust one, allowing for a closer dissection of the bias-variance decomposition.

Methodology. We consider two logistic regression problems as benchmarks: the scaled CoverType dataset from the UCI repository [19], and MNIST [30]. We record the number of iterations (both including and excluding the data-echoing iterations) needed for SGD to reach within 1% of the optimal training loss, as we increase the echoing factor $k$ and thus decrease the rate of fresh independent samples usable by SGD. For each choice of $(b, k)$, we tune a constant learning rate by grid search to minimize this time. All details can be found in the supplementary material.
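The following Python sketch mimics this measurement protocol on a small synthetic logistic regression problem rather than the actual CoverType/MNIST benchmarks; the data, the learning-rate grid, and the target loss value are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logistic regression stand-in for the paper's benchmarks.
d, n = 20, 50_000
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.5 * rng.normal(size=n) > 0).astype(float)

def grad(w, idx):
    """Gradient of the logistic loss on the examples indexed by idx."""
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    return X[idx].T @ (p - y[idx]) / len(idx)

def loss(w):
    p = np.clip(1.0 / (1.0 + np.exp(-X @ w)), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def steps_to_target(b, k, eta, target, max_batches=2000):
    """Echoed SGD: fresh batches T (and total steps kT) needed to reach the target loss."""
    w = np.zeros(d)
    for t in range(1, max_batches + 1):
        idx = rng.integers(0, n, size=b)   # one fresh i.i.d. batch
        for _ in range(k):                 # k echoed steps on the same batch
            w = w - eta * grad(w, idx)
        if loss(w) <= target:
            return t, t * k
    return max_batches, max_batches * k

target = 1.01 * 0.45   # placeholder: "within 1% of the optimum" would use the true optimal loss
for k in [1, 2, 4, 8]:
    # Grid-search a constant learning rate for each (b, k), as in the paper's protocol.
    best = min((steps_to_target(256, k, eta, target) for eta in [0.05, 0.1, 0.2, 0.4]),
               key=lambda r: r[0])
    print(f"k={k}: fresh batches T={best[0]}, total SGD steps kT={best[1]}")
```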
Results and discussion. Figures 2 and 3 show our findings. As the batch size increases, there is a phase transition from a variance-dominated regime (the $O(GD/\sqrt{bT})$ term in our analysis is larger) to a bias-dominated regime (the $O(\beta D^2/(kT))$ term is larger). In the former regime, data-echoed SGD saturates on the stale data, and the optimal learning rate scales inversely with $k$, as predicted by the theory. In the latter regime, echoing attains a nearly embarrassingly-parallel speedup, and the optimal learning rate is close to constant. These experiments provide an end-to-end example of how the bias-variance decomposition can help to understand and diagnose the benefits and limitations of data-echoed algorithms.

Figure 2: Convergence times as a function of the echoing factor $k$, for logistic regression on the CoverType dataset. Learning rates (yellow) are tuned for each $(b, k)$ to minimize convergence times. Convergence times are presented both as the number of SGD steps $kT$ (blue) and as the number of independent samples consumed $bT$ (red). Note that the red curves reflect wall-clock time for data echoing when the data loader is $k$ times slower than the optimizer. As the batch size increases, we move from the noise-dominated regime (red curve plateaus) to the curvature-dominated regime (blue curve plateaus).

Figure 3: Convergence times, as in Figure 2, for logistic regression on the MNIST dataset. Note that the phase transition from the noise-dominated to the curvature-dominated regime happens in a batch size range commonly used in deep learning benchmarks with this dataset.

A note on deep neural nets. Our theoretical setting was originally motivated by hardware constraints most frequently encountered in the massively parallel training of deep neural networks. Beyond the convex setting, we note that the experimental design problem becomes significantly more challenging. Some potential confounds include the learning rate choice affecting the generalization gap [23], and counterintuitive interactions between learning rate and batch normalization [3, 31]. In [13], the authors study the end-to-end performance gains of data echoing. Indeed, those experiments need many tweaks (like example-wise echoing, data re-augmentation, and individually tuned momentum and learning rate schedules) to obtain their most impressive speedups.

6 Conclusion

We have established the first theoretical analyses in the nascent field of optimization algorithms for asynchronous data pipelines, where we have found that gradient descent and well-known variants can be adapted to resist overfitting to stale data. An immediate open problem is to develop a corresponding theory for local convergence and saddle-point avoidance in the non-convex setting. This work provides further motivation to show the $O(\eta G^2 k^2/b)$-uniform stability of AGD for smooth convex functions, which was conjectured in [11] with different motives. More broadly, we hope that the design and analysis of algorithms in optimization for machine learning can derive fruitful inspiration from nascent hardware considerations, like those that motivated this work.
Broader Impact

This work is theoretical in nature, and is concerned with the very general framework of stochastic optimization. As such, there are no foreseen ethical or societal consequences for the research presented herein. We hope that by providing theoretical groundwork and algorithmic techniques for efficient large-scale optimization in settings informed by modern developments in optimization, works like this one will contribute to alleviating the steep resource and energy costs of large-scale machine learning.

Acknowledgments and Disclosure of Funding

We are grateful to George Dahl and Yoram Singer for helpful discussions. CZ was supported by Elad Hazan's NSF grants IIS-1523815 and CCF-1704860, and a Google internship.

References

[1] Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.
[2] R. Anil, V. Gupta, T. Koren, and Y. Singer. Memory-efficient adaptive optimization for large-scale learning. arXiv preprint arXiv:1901.11150, 2019.
[3] S. Arora, Z. Li, and K. Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981, 2018.
[4] A. Bapna, C. A. Cherry, D. D. Lepikhin, G. Foster, M. Krikun, M. Johnson, M. Chen, N. Ari, O. Firat, W. Macherey, Y. Wu, Y. Cao, and Z. Chen. Massively multilingual neural machine translation in the wild: Findings and challenges, 2019.
[5] T. Ben-Nun and T. Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR), 52(4):1-43, 2019.
[6] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM, 2001.
[7] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526, 2002.
[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[10] X. Chen, N. Agarwal, E. Hazan, C. Zhang, and Y. Zhang. Extreme tensoring for low-memory preconditioning. arXiv preprint arXiv:1902.04620, 2019.
[11] Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
[12] S. W. Chien, S. Markidis, C. P. Sishtla, L. Santos, P. Herman, S. Narasimhamurthy, and E. Laure. Characterizing deep-learning I/O workloads in TensorFlow. In 2018 IEEE/ACM 3rd International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS), pages 54-63. IEEE, 2018.
[13] D. Choi, A. Passos, C. J. Shallue, and G. E. Dahl. Faster neural network training with data echoing. arXiv preprint arXiv:1907.05550, 2019.
[14] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3642-3649. IEEE, 2012.
[15] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113-123, 2019.
[16] S. Dalton, I. Frosio, and M. Garland. GPU-accelerated Atari emulation for reinforcement learning. arXiv preprint arXiv:1907.08467, 2019.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[19] D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[20] J. A. Guirao, K. Łęcki, J. Lisiecki, S. Panev, M. Szołucha, A. Wolant, and M. Zientkiewicz. Fast AI data preprocessing with NVIDIA DALI, 2019. URL https://developer.nvidia.com/blog/fast-ai-data-preprocessing-with-nvidia-dali/.
[21] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
[22] E. Hoffer, T. Ben-Nun, I. Hubara, N. Giladi, T. Hoefler, and D. Soudry. Augment your batch: Better training with larger batches. arXiv preprint arXiv:1901.09335, 2019.
[23] Z. Jiang, C. Zhang, K. Talwar, and M. C. Mozer. Exploring the memorization-generalization continuum in deep learning. arXiv preprint arXiv:2002.03206, 2020.
[24] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1-12, 2017.
[25] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[26] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[27] M. Khadatare, Z. Khan, and H. Bayraktar. Leveraging the hardware JPEG decoder and NVIDIA nvJPEG library on NVIDIA A100 GPUs, 2020. URL https://developer.nvidia.com/blog/leveraging-hardware-jpeg-decoder-and-nvjpeg-on-a100/.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[29] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365-397, 2012.
[30] Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[31] Z. Li and S. Arora. An exponential learning rate schedule for deep learning. arXiv preprint arXiv:1910.07454, 2019.
[32] J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox. GPU-accelerated robotic simulation for distributed reinforcement learning. In Conference on Robot Learning, pages 270-282, 2018.
[33] R. Mayer and H.-A. Jacobsen. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Computing Surveys (CSUR), 53(1):1-37, 2020.
[34] S. McCandlish, J. Kaplan, D. Amodei, and the OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.
[35] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[36] Y. E. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Doklady Akademii Nauk, volume 269, pages 543-547. Russian Academy of Sciences, 1983.
[37] R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 873-880, 2009.
[38] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
[39] N. Shazeer and M. Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.
[40] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[41] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
[42] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
[43] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch-prox. arXiv preprint arXiv:1702.06269, 2017.
[44] B. Woodworth, K. K. Patel, S. U. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro. Is local SGD better than minibatch SGD? arXiv preprint arXiv:2002.07839, 2020.
[45] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng. Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992, 2018.
[46] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.