# 8bit_optimizers_via_blockwise_quantization__b313904d.pdf

Under review as a conference paper at ICLR 2022

8-BIT OPTIMIZERS VIA BLOCK-WISE QUANTIZATION

Anonymous authors
Paper under double-blind review

ABSTRACT

Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent, but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce the gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.

Increasing model size is an effective way to achieve better performance for given resources (Kaplan et al., 2020; Henighan et al., 2020; Raffel et al., 2019; Lewis et al., 2021). However, training such large models requires storing the model, gradient, and state of the optimizer (e.g., exponentially smoothed sum and squared sum of previous gradients for Adam), all in a fixed amount of available memory. Although significant research has focused on enabling larger model training by reducing or efficiently distributing the memory required for the model parameters (Shoeybi et al., 2019; Lepikhin et al., 2020; Fedus et al., 2021; Brown et al., 2020; Rajbhandari et al., 2020), reducing the memory footprint of optimizer gradient statistics is much less studied. This is a significant missed opportunity since these optimizer states use 33-75% of the total memory footprint during training. For example, the Adam optimizer states for the largest GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2019) models are 11 GB and 41 GB in size.

In this paper, we develop a fast, high-precision non-linear quantization method, block-wise dynamic quantization, that enables stable 8-bit optimizers (e.g., Adam, AdamW, and Momentum) which maintain 32-bit performance at a fraction of the memory footprint and without any changes to the original hyperparameters.¹ While most current work uses 32-bit optimizer states, recent high-profile efforts to use 16-bit optimizers report difficulty for large models with more than 1B parameters (Ramesh et al., 2021).

¹We study 8-bit optimization with current best practice model and gradient representations (typically 16-bit mixed precision), to isolate optimization challenges. Future work could explore further compressing all three.
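To make the optimizer-state memory numbers concrete, here is a rough back-of-the-envelope calculation (not code from the paper; a minimal sketch assuming one 32-bit state per parameter for Momentum and two for Adam, which reproduces the 1B-parameter figures given in Section 1.1 below):

```python
def state_bytes(num_params: int, num_states: int, bits_per_state: int) -> int:
    """Memory consumed by the optimizer's per-parameter state tensors."""
    return num_params * num_states * (bits_per_state // 8)

n = 1_000_000_000  # a 1B-parameter model
# Momentum keeps one state per parameter, Adam keeps two (m_t and r_t).
for name, states in [("Momentum", 1), ("Adam", 2)]:
    gb_32 = state_bytes(n, states, 32) / 1e9
    gb_8 = state_bytes(n, states, 8) / 1e9
    print(f"{name}: {gb_32:.0f} GB at 32-bit -> {gb_8:.0f} GB at 8-bit")
```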
Going from 16-bit optimizers to 8-bit optimizers reduces the range of possible values from $2^{16} = 65536$ values to just $2^8 = 256$. To our knowledge, this has not been attempted before. Effectively using this very limited range is challenging for three reasons: quantization accuracy, computational efficiency, and large-scale stability. To maintain accuracy, it is critical to introduce some form of non-linear quantization to reduce errors for both common small magnitude values and rare large ones. However, to be practical, 8-bit optimizers need to be fast enough not to slow down training, which is especially difficult for non-linear methods that require more complex data structures to maintain the quantization buckets. Finally, to maintain stability with huge models beyond 1B parameters, a quantization method needs to have not only a good mean error but also excellent worst-case performance, since a single large quantization error can cause the entire training run to diverge.

Figure 1: Schematic of 8-bit optimizers via block-wise dynamic quantization, see Section 2 for more details. After the optimizer update is performed in 32-bit, the state tensor is chunked into blocks and normalized by the absolute maximum value of each block. Then dynamic quantization is performed, and the index is stored. For dequantization, a lookup in the index is performed, with subsequent denormalization by multiplication with the block-wise absolute maximum value. Outliers are confined to a single block through block-wise quantization, and their effect on normalization is limited.

We introduce a new block-wise quantization approach that addresses all three of these challenges. Block-wise quantization splits input tensors into blocks and performs quantization on each block independently. This block-wise division reduces the effect of outliers on the quantization process since they are isolated to particular blocks, thereby improving stability and performance, especially for large-scale models. Block-wise processing also allows for high optimizer throughput since each normalization can be computed independently in each core. This contrasts with tensor-wide normalization, which requires slow cross-core synchronization that is highly dependent on task-core scheduling. We combine block-wise quantization with two novel methods for stable, high-performance 8-bit optimizers: dynamic quantization and a stable embedding layer. Dynamic quantization is an extension of dynamic tree quantization for unsigned input data. The stable embedding layer is a variation of a standard word embedding layer that supports more aggressive quantization by normalizing the highly non-uniform distribution of inputs to avoid extreme gradient variation.

Our 8-bit optimizers maintain 32-bit performance at a fraction of the original memory footprint. We show this for a broad range of tasks: 1.5B and 355M parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14+WMT'16 machine translation, MoCo v2 contrastive image pretraining+finetuning, and RoBERTa pretraining. We also report additional ablations and sensitivity analyses showing that all components (block-wise quantization, dynamic quantization, and the stable embedding layer) are crucial for these results and that 8-bit Adam can be used as a simple drop-in replacement for 32-bit Adam, with no hyperparameter changes. We open-source our custom CUDA kernels and provide a PyTorch implementation that enables 8-bit optimization by changing two lines of code.
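As an illustration of what such a two-line change could look like (a minimal sketch; the 8-bit import and class names below are placeholders, not the paper's released API, so those two lines are left commented out):

```python
import torch
import torch.nn as nn

# Hypothetical module name for the released 8-bit optimizers; the real package
# and class names may differ. This only illustrates the "two-line change".
# import eightbit_optimizers as ebo  # (1) changed import

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# 32-bit baseline optimizer:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# 8-bit drop-in replacement with identical hyperparameters:
# optimizer = ebo.Adam8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.999))  # (2) changed constructor
```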
1 BACKGROUND

1.1 STATEFUL OPTIMIZERS

An optimizer updates the parameters w of a neural network by using the gradient of the loss with respect to the weights, $g_t = \frac{\partial L}{\partial w}$, at update iteration t. Stateful optimizers compute statistics of the gradient with respect to each parameter over time for accelerated optimization. Two of the most commonly used stateful optimizers are Adam (Kingma and Ba, 2014) and SGD with momentum (Qian, 1999), or Momentum for short. Without damping and scaling constants, the update rules of these optimizers are given by:

$$
\text{Momentum}(g_t, w_{t-1}, m_{t-1}) =
\begin{cases}
m_0 = g_0 & \text{Initialization} \\
m_t = \beta_1 m_{t-1} + g_t & \text{State 1 update} \\
w_t = w_{t-1} - \alpha \cdot m_t & \text{Weight update}
\end{cases}
\tag{1}
$$

$$
\text{Adam}(g_t, w_{t-1}, m_{t-1}, r_{t-1}) =
\begin{cases}
r_0 = m_0 = 0 & \text{Initialization} \\
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t & \text{State 1 update} \\
r_t = \beta_2 r_{t-1} + (1 - \beta_2) g_t^2 & \text{State 2 update} \\
w_t = w_{t-1} - \alpha \cdot \frac{m_t}{\sqrt{r_t} + \epsilon} & \text{Weight update,}
\end{cases}
$$

where $\beta_1$ and $\beta_2$ are smoothing constants, $\epsilon$ is a small constant, and $\alpha$ is the learning rate.

For 32-bit states, Momentum and Adam consume 4 and 8 bytes per parameter. That is 4 GB and 8 GB for a 1B parameter model. Our 8-bit non-linear quantization reduces these costs to 1 GB and 2 GB.
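As a reference for the update rules above, here is a minimal PyTorch sketch of one Momentum step and one Adam step (plain tensor arithmetic without bias correction, matching the rules as written; it is not the paper's CUDA implementation, and the hyperparameter defaults are only illustrative):

```python
import torch

def momentum_update(g, w, m, alpha=0.01, beta1=0.9):
    """One Momentum step: m_t = beta1 * m_{t-1} + g_t; w_t = w_{t-1} - alpha * m_t."""
    m = beta1 * m + g
    w = w - alpha * m
    return w, m

def adam_update(g, w, m, r, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step, following the update rules above (no bias correction)."""
    m = beta1 * m + (1 - beta1) * g            # state 1: smoothed gradient
    r = beta2 * r + (1 - beta2) * g * g        # state 2: smoothed squared gradient
    w = w - alpha * m / (torch.sqrt(r) + eps)  # weight update
    return w, m, r

# Example: a single Adam update on a toy parameter tensor.
w = torch.randn(4)
g = torch.randn(4)                  # gradient dL/dw at this step
m, r = torch.zeros(4), torch.zeros(4)
w, m, r = adam_update(g, w, m, r)
```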
1.2 NON-LINEAR QUANTIZATION

Quantization compresses numeric representations to save space at the cost of precision. Quantization is the mapping of a k-bit integer to a real element in D, that is, $Q^{\text{map}}: [0, 2^k - 1] \mapsto D$. For example, the IEEE 32-bit floating point data type maps the indices $0 \ldots 2^{32} - 1$ to the domain [-3.4e38, +3.4e38]. We use the following notation: $Q^{\text{map}}(i) = Q^{\text{map}}_i = q_i$; for example, $Q^{\text{map}}(2^{31} + 131072) = 2.03125$ for the IEEE 32-bit floating point data type.

To perform general quantization from one data type into another we require three steps. (1) Compute a normalization constant N that transforms the input tensor T into the range of the domain D of the target quantization data type $Q^{\text{map}}$, (2) for each element of T/N find the closest corresponding value $q_i$ in the domain D, and (3) store the index i corresponding to $q_i$ in the quantized output tensor $T^Q$. To receive the dequantized tensor $T^D$ we look up the index and denormalize: $T^D_i = Q^{\text{map}}(T^Q_i) \cdot N$.

To perform this procedure for dynamic quantization we first normalize into the range [-1, 1] through division by the absolute maximum value: $N = \max(|T|)$. Then we find the closest values via a binary search:

$$
T^Q_i = \operatorname*{arg\,min}_{j=0}^{2^n} \left| Q^{\text{map}}_j - \frac{T_i}{N} \right|
$$

1.3 DYNAMIC TREE QUANTIZATION

Figure 2: Dynamic tree quantization.

Dynamic tree quantization (Dettmers, 2016) is a method that yields low quantization error for both small and large magnitude values. Unlike data types with a fixed exponent and fraction, dynamic tree quantization uses a data type with a dynamic exponent and fraction that can change with each number. It is made up of four parts, as seen in Figure 2: (1) the first bit of the data type is reserved for a sign; (2) the number of subsequent zero bits indicates the magnitude of the exponent; (3) the first bit that is set to one indicates that all following values are reserved for (4) linear quantization. By moving the indicator bit, numbers can have either a large exponent of $10^{-7}$ or precision as high as 1/63. Compared to linear quantization, dynamic tree quantization has better absolute and relative quantization errors for non-uniform distributions. Dynamic tree quantization is strictly defined to quantize numbers in the range [-1.0, 1.0], which is ensured by performing tensor-level absolute max normalization.

2 8-BIT OPTIMIZERS

Our 8-bit optimizers have three components: (1) block-wise quantization, which isolates outliers and distributes the error more equally over all bits; (2) dynamic quantization, which quantizes both small and large values with high precision; and (3) a stable embedding layer to improve stability during optimization for models with word embeddings.

With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update, and then quantize the states back to 8-bit for storage. We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers, as we show in Section 3.

2.1 BLOCK-WISE QUANTIZATION

Our block-wise quantization reduces the cost of computing normalization and improves quantization precision by isolating outliers. In order to dynamically quantize a tensor, as defined in Section 1.2, we need to normalize the tensor into the range [-1, 1]. Such normalization requires a reduction over the entire tensor, which entails multiple synchronizations across GPU cores. Block-wise dynamic quantization reduces this cost by chunking an input tensor into small blocks of size B = 2048 and performing normalization independently in each core across this block.

More formally, using the notation introduced in Section 1.2, in block-wise quantization we treat T as a one-dimensional sequence of elements that we chunk into blocks of size B. This means that for an input tensor T with n elements we have n/B blocks. We proceed to compute a normalization constant for each block: $N_b = \max(|T_b|)$, where b is the index of the block $0 \ldots n/B$. With this block-wise normalization constant, each block can be quantized independently:

$$
T^Q_{bi} = \operatorname*{arg\,min}_{j=0}^{2^n} \left| Q^{\text{map}}_j - \frac{T_{bi}}{N_b} \right|, \quad 0 \le b < n/B
$$
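To make the block-wise procedure concrete, here is a minimal PyTorch sketch of the quantize/dequantize round-trip (illustrative only: it uses a brute-force nearest-value search over a simple evenly spaced 8-bit codebook `qmap` rather than the binary search and dynamic tree codebook described above, and the block size is kept small for readability):

```python
import torch

def blockwise_quantize(t: torch.Tensor, qmap: torch.Tensor, block_size: int = 2048):
    """Quantize block-by-block: per-block absmax normalization, then
    nearest-codebook-value lookup (the T^Q_{bi} rule above, via brute force)."""
    flat = t.flatten()
    pad = (-flat.numel()) % block_size                 # pad so the tensor splits evenly
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)                 # n/B blocks of size B
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)  # N_b per block
    normalized = blocks / absmax                       # now in [-1, 1]
    # Index of the closest codebook entry for every element (stored as uint8).
    idx = (normalized.unsqueeze(-1) - qmap).abs().argmin(dim=-1).to(torch.uint8)
    return idx, absmax.squeeze(1), t.shape, pad

def blockwise_dequantize(idx, absmax, shape, pad, qmap):
    """Look up codebook values and denormalize with the per-block absmax."""
    values = qmap[idx.long()] * absmax.unsqueeze(1)
    flat = values.flatten()
    flat = flat[: flat.numel() - pad] if pad else flat
    return flat.view(shape)

# Toy example with a linear (non-dynamic-tree) 8-bit codebook over [-1, 1].
qmap = torch.linspace(-1.0, 1.0, 256)
state = torch.randn(3, 1000)                           # stand-in for an optimizer state
q, n_b, shape, pad = blockwise_quantize(state, qmap, block_size=256)
recovered = blockwise_dequantize(q, n_b, shape, pad, qmap)
print((state - recovered).abs().max())                 # small quantization error
```

Because each block carries its own normalization constant, a single outlier only degrades precision inside its own block, which is the stability property the text above attributes to block-wise quantization.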