# Scaling Laws for Floating Point Quantization Training

Xingwu Sun * 1 2 Shuaipeng Li * 1 Ruobing Xie 1 Weidong Han 1 Kan Wu 1 Zhen Yang 1 Yixing Li 1 3 An Wang 1 4 Shuai Li 1 Jinbao Xue 1 Yu Cheng 3 Yangyu Tao 1 Zhanhui Kang 1 Chengzhong Xu 2 Di Wang 1 Jie Jiang 1

Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, pay less attention to the constituents of floating-point (FP) quantization, and thus cannot fit the LLM losses well in this scenario. In contrast, while FP quantization training is more commonly implemented in production, research on it has been relatively superficial. In this paper, we thoroughly explore the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the FP quantization training performance of LLMs. In addition to an accurate unified scaling law for FP quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit widths, which is available for future reference by hardware manufacturers; (2) We identify a critical data size in low-precision LLM training. Training data exceeding the critical data size inversely degrades LLM performance; (3) The optimal FP quantization precision is directly proportional to the computational power, yet across a wide range of computational power, we estimate that the best cost-performance precision lies between 4 and 8 bits. *Equal contribution 1Tencent Hunyuan 2University of Macau 3The Chinese University of Hong Kong 4Institute of Science Tokyo. Correspondence to: Shuaipeng Li , Chengzhong Xu , Di Wang .
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Scaling laws of large language models (LLMs) can help developers effectively select superior parameter settings before experiments and accurately predict model performance under different configurations. They are regarded as excellent guidance in LLM training. Widely acknowledged scaling-law efforts such as Kaplan et al. (2020), Hoffmann et al. (2022), and Li et al. (2024) mainly concentrate on the central factors, i.e., model size and trained token count, which significantly impact the performance of LLMs. With the rapid growth of both model and data sizes, there has been increasing attention to the efficiency and cost of LLM training. Training and serving with lower precision has become a popular solution. Currently, many representative LLMs are trained in BF16 or even lower precision (Dubey et al., 2024; Sun et al., 2024; Liu et al., 2024; Yang et al., 2024; Ma et al., 2024; Wang et al., 2023; Peng et al., 2023), aiming to balance effectiveness and efficiency. Compared to integer quantization, floating-point (FP) quantization can better maintain LLM accuracy at extremely low bit widths and thus is often adopted in low-precision LLMs. Therefore, exploring the scaling laws of LLM performance under different low-precision settings with FP quantization becomes essential to shed light on future low-precision LLM training. Recently, pioneering work conducted in-depth analyses and explorations of LLMs' scaling laws for precision in both training and inference (Kumar et al., 2024), quantitatively measuring the degradation rules of post-training quantization and quantized training. This scaling law offers a conclusion explaining the potential damage of excessively increasing training data to the performance of low-precision LLMs. However, Kumar et al.
(2024) directly adopted the bit width as the precision in its low-precision scaling laws, which might lose finer-grained modeling of the relationship between the various parameter settings related to FP quantization and the final loss of LLMs. In practice, the key factors of FP quantization, such as the exponent, mantissa, and the block size of scaling factors, may have different impacts on the final loss. A more comprehensive, precise, and practical scaling law for FP quantized training related to the data size (D), model size (N), exponent (E), mantissa (M), and block size of scaling factors (B) is urgently desired. Our work concentrates on establishing, verifying, and analyzing the scaling law for FP quantized training in LLMs. We first predict model performance via the precision-related scaling law from previous work under different data/model sizes and precision settings. Surprisingly, we discover that its predictions are not perfectly satisfactory under different FP quantized training settings. Subsequently, we carefully design a comprehensive set of explorations with experiments across different precision settings (training 366 models), examining the basic form of the scaling law as well as the potential impact of the quantization targets, exponent and mantissa bits, and block sizes on the loss. Finally, we aggregate these factors into our final scaling law for FP quantized training, with valuable insights to guide LLM training under low precision. Our FP quantization training scaling law, namely Capybara (Appendix B), is formulated as follows:

L(N, D, E, M, B) = n / N^α + d / D^β + ϵ + (D^β / N^α) · log₂B / (γ (E + 0.5)^δ (M + 0.5)^ν).
(1)

The first two terms capture the main impacts of model size N and data size D on the training loss, similar to the Chinchilla scaling law (Hoffmann et al., 2022); ϵ represents the bias. The last term can be regarded as the additional negative impact of low-precision training, where D^β / N^α implies a certain form of knowledge intensity in the LLM, and log₂B, (E + 0.5)^δ, and (M + 0.5)^ν jointly reflect the low-precision information loss of FP quantized training. We have conducted extensive fitting experiments with various possible formulations to ensure the accuracy and simplicity of our scaling law. Note that the exponential hyper-parameters α and β of model and data sizes are exactly the same as those in the first two terms. The product of the above knowledge intensity and low-precision information loss forms the last term. Figure 1 illustrates the fitting results of our Capybara scaling law compared with others, demonstrating its advantage in predicting LLM performance under different FP quantized training settings.

Figure 1. Comparing Eq. (6) from Kumar et al. (2024) and Eq. (13) from our work, our Capybara scaling law fits the data better in low-precision scenarios. Specifically, the fitting results of Kumar et al. (2024) show considerable bias in the E1M1 case. In the subfigures, data point magnitudes are roughly proportional to E.

Throughout our experiments and analyses related to the Capybara scaling law, we also discover the following observations and insights that could facilitate future low-precision LLM training: (a) Quantized weights have a relatively minor impact on performance during both forward and backward computations, while activations demonstrate a higher degree of quantization tolerance specifically when computing the gradients pertaining to themselves. (b) The data size of LLM pre-training cannot be increased indefinitely without harming performance under low precision, while larger model sizes, higher precision settings (measured by exponent and mantissa), and smaller block sizes can raise the extreme point of effectively trained tokens for LLM training. (c) Intuitively, the negative impact of low-precision training in LLMs is proportionally amplified with the knowledge intensity. (d) The exponent and mantissa have their optimal settings under different bit widths; exponent bits contribute slightly more to model performance than mantissa bits. (e) The optimal FP quantization precision exhibits a direct proportionality with computational power. Nonetheless, across a broad spectrum of computational power, our estimated optimal cost-performance precision resides within the 4-8 bit range.

2. Preliminary

Classical Scaling Laws. Scaling laws have become a fundamental framework for understanding the relationship between essential factors such as model size (N), data size (D), and the resulting loss (L) in deep learning. Two classical scaling laws are widely recognized in the industry: the Chinchilla scaling law (Hoffmann et al., 2022) and the OpenAI scaling law (Kaplan et al., 2020). The Chinchilla scaling law is expressed as:

L(N, D) = n / N^α + d / D^β + ϵ. (2)

The OpenAI scaling law is given by:

L(N, D) = [ (n / N)^(α/β) + d / D ]^β, (3)

where n, d, α, β, and ϵ are positive fitted constants. The balance between N and D emerges as critical for compute-optimal training.

Scaling Laws for Precision. Subsequent research extends this framework by incorporating the role of precision in quantized training and inference, so as to provide insights into how precision affects model performance. In Kumar et al.
(2024), precision-aware scaling laws were introduced to capture the trade-offs between model size N, data size D, and precision P. For integer quantized training, they proposed the trade-off between weight count N and weight precision P as:

Neff(N, P) = N (1 − e^(−P/γ)), (4)

where Neff indicates the effective parameter count of the model, and γ is a constant representing the sensitivity of model weights to precision. Incorporating Neff into the Chinchilla scaling law yields:

L(N, D, P) = n / [N (1 − e^(−P/γ))]^α + d / D^β + ϵ. (5)

This framework highlights that reducing the weight precision P can be compensated for by increasing the parameter count N to maintain performance, which is a critical insight for low-precision model optimization.

Current Scaling Laws Cannot Fit FP Quantization Well. Note that most previous work focused on integer quantized training. FP quantization is more prevalent in real-world applications due to its hardware compatibility and finer granularity. For instance, formats such as FP16 and BF16 are standard in many large-scale training pipelines, and emerging formats like FP8 and FP4 are gaining traction. Despite this, scaling laws specifically tailored to FP quantization remain largely unexplored. The primary distinction between FP and integer quantization lies in the allocation and usage of bits. FP numbers allocate bits to represent both the exponent and the mantissa, with each set of bits serving a distinct purpose: the exponent mainly captures dynamic range, while the mantissa mainly encodes precision within that range. In contrast, integer formats uniformly distribute all bits to refine the quantization lattice, providing consistent resolution across the representable range. This fundamental difference highlights the need for scaling laws dedicated to the unique characteristics of FP formats. Kumar et al. (2024) hypothesized that the exponent and mantissa bits should be scaled jointly (i.e., increased together as the total bit count grows).
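To make Eqs. (4)-(5) concrete, the following minimal sketch evaluates the precision-aware loss numerically. The constants n, d, α, β, ϵ are taken from the Chinchilla fit of Hoffmann et al. (2022) purely for illustration, and the sensitivity γ is a placeholder value, not the one fitted by Kumar et al. (2024):

```python
import math

def n_eff(N, P, gamma=2.2):
    # Eq. (4): effective parameter count of an N-parameter model
    # trained with P-bit integer weights; gamma is a placeholder
    # sensitivity constant, not a fitted value.
    return N * (1.0 - math.exp(-P / gamma))

def precision_aware_loss(N, D, P, n=406.4, d=410.7,
                         alpha=0.34, beta=0.28, eps=1.69):
    # Eq. (5): Chinchilla loss with N replaced by N_eff(N, P).
    # n, d, alpha, beta, eps follow the Chinchilla fit of
    # Hoffmann et al. (2022), used here for illustration only.
    return n / n_eff(N, P) ** alpha + d / D ** beta + eps
```

Under this form, lowering P shrinks Neff and raises the predicted loss (e.g., for N = 10^9 and D = 10^11, moving weights from 16 bits to 4 bits increases the predicted loss), and the gap can be closed by enlarging N.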
Then, in FP formats, precision is determined by the exponent E and mantissa M, with the total precision P = E + M + 1. Substituting P into their precision-aware scaling law, we have:

L(N, D, E, M) = n / [N (1 − e^(−(1+E+M)/γ))]^α + d / D^β + ϵ. (6)

However, upon conducting experiments and applying this scaling law to fit empirical results, particularly in low-precision training regimes, we observed significant deviations between the law's predictions and actual performance, as illustrated in Figure 1a. The unsatisfactory fit, especially for training results using low-bit FP formats, suggests that the relationship proposed in Kumar et al. (2024) does not adequately capture the nuanced dynamic impacts of FP quantization on LLM performance. In this work, we address these shortcomings by re-deriving the scaling law for FP quantized training. Our re-derivation incorporates a more nuanced understanding of how the finer-grained factors of exponent, mantissa, and block size affect low-precision training. By refining the theoretical framework and aligning it more closely with observed behaviors, we aim to establish a more accurate and predictive scaling law that bridges the gap between theoretical insights and real-world applications.

3. Setup and Scaling Laws

3.1. Method and Implementation

Quantization Method. We quantize a tensor into a low-precision FP format following the IEEE 754 standard (Kahan, 1996), which includes both normal and subnormal representations. The format consists of a sign bit, E exponent bits, and M mantissa bits. To expand the dynamic range, the special encodings are used for normal values instead of representing Infinity and Not a Number (NaN). Since modern hardware does not support arbitrary FP formats, we simulate them using QPyTorch (Zhang et al., 2019) with nearest rounding. Due to the narrow dynamic range and low representation precision of low-precision formats, we employ scaling techniques (Sun et al., 2019; Micikevicius et al., 2022).
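As a minimal illustration of the format just described, the pure-Python sketch below rounds a scalar to the nearest value representable with a sign bit, E exponent bits, and M mantissa bits, treating the all-ones exponent as ordinary normal values (no Infinity/NaN encodings) to expand the dynamic range. It is a simplified stand-in for the QPyTorch simulation used in our experiments, not the actual implementation:

```python
import math

def quantize_fp(x, E, M):
    # Round x to the nearest value of a sign/E/M floating-point
    # format with subnormals, saturating at the format's maximum.
    if x == 0.0:
        return 0.0
    bias = 2 ** (E - 1) - 1
    e_max = (2 ** E - 1) - bias          # all-ones exponent kept for normals
    e_min = 1 - bias                     # smallest normal exponent
    fp_max = 2.0 ** e_max * (2.0 - 2.0 ** -M)
    sign = -1.0 if x < 0.0 else 1.0
    a = min(abs(x), fp_max)              # saturate to the dynamic range
    e = max(math.floor(math.log2(a)), e_min)  # subnormals share e_min
    step = 2.0 ** (e - M)                # quantization step in this binade
    return sign * min(round(a / step) * step, fp_max)
```

For example, under this convention the E1M1 format represents the magnitudes {0, 1, 2, 3}, so quantize_fp(2.4, 1, 1) returns 2.0, and inputs beyond FPmax saturate: quantize_fp(1000.0, 4, 3) returns 480.0.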
The original tensor is multiplied by a higher-precision scaling factor before being cast into the low-precision format. The scaling factor is computed as follows:

S_i = FP_max / max |X[Bi : B(i+1)]| , (7)

where FP_max represents the maximum normal value of the low-precision FP format. A scaling factor can be shared by every B elements along the channel dimension. This is a unified representation covering tensor-wise scaling (B = b·d_in), channel-wise scaling (B = d_in), and block-wise (1 ≤ B