# training_with_mixedprecision_floatingpoint_assignments__3f5bff12.pdf

Published in Transactions on Machine Learning Research (06/2023)

Training with Mixed-Precision Floating-Point Assignments

Wonyeol Lee wonyeol.lee.cs@gmail.com Stanford University, USA

Rahul Sharma rahsha@microsoft.edu Microsoft Research, India

Alex Aiken aaiken@stanford.edu Stanford University, USA

Reviewed on Open Review: https: // openreview. net/ forum? id= Zo Xi7n54OB

When training deep neural networks, keeping all tensors in high precision (e.g., 32-bit or even 16-bit ﬂoats) is often wasteful. However, keeping all tensors in low precision (e.g., 8-bit ﬂoats) can lead to unacceptable accuracy loss. Hence, it is important to use a precision assignment a mapping from all tensors (arising in training) to precision levels (high or low) that keeps most of the tensors in low precision and leads to suﬃciently accurate models. We provide a technique that explores this memory-accuracy tradeoﬀby generating precision assignments for convolutional neural networks that (i) use less memory and (ii) lead to more accurate convolutional networks at the same time, compared to the precision assignments considered by prior work in low-precision ﬂoating-point training. We evaluate our technique on image classiﬁcation tasks by training convolutional networks on CIFAR-10, CIFAR-100, and Image Net. Our method typically provides > 2 memory reduction over a baseline precision assignment while preserving training accuracy, and gives further reductions by trading oﬀaccuracy. Compared to other baselines which sometimes cause training to diverge, our method provides similar or better memory reduction while avoiding divergence.

1 Introduction

In deep neural network training, ﬂoating-point formats are usually used to represent tensors and it is worthwhile to use the smallest bitwidth format that gives acceptable results. For example, it is common to replace tensors using 32-bit ﬂoats with tensors that use 16-bit ﬂoats (Kalamkar et al., 2019; Micikevicius et al., 2018). The beneﬁts are easy to understand: computations using lower-precision ﬂoats not only use less memory but are also faster (due to improved vector parallelism, locality, and reduced data movement). The downside is that there is generally some loss of training accuracy, and in the worst case training may not even converge.

For such low-precision ﬂoating-point training, the most common approaches use two ﬂoating-point formats one for lower-precision ﬂoats (e.g., 8-bit ﬂoats) and the other for higher-precision ﬂoats (e.g., 16-bit ﬂoats) and assign one of the two formats to each tensor (including weights, activations, and their gradients). The precision assignments studied in previous work fall into one of two assignment schemes (which both have several variants): the uniform assignment uses low precision for almost all tensors (often excepting those in the ﬁrst and/or last few layers) (Micikevicius et al., 2018), while the operator-based assignment limits low precision to the input tensors of certain operators (e.g., convolutions) (Sun et al., 2019). Prior work has shown that both precision assignment schemes (with well-chosen low-bitwidth ﬂoating-point formats) can match the accuracy of 32-bit-ﬂoat training (Cambier et al., 2020; Chmiel et al., 2021; Drumond et al., 2018; Fox et al., 2021; Kalamkar et al., 2019; Micikevicius et al., 2018; Sun et al., 2019; Wang et al., 2018).

There is an important limitation in all prior approaches to low-precision ﬂoating-point training: they use very few precision assignments (most often just one) for a given set of models, but there are some other

Published in Transactions on Machine Learning Research (06/2023)

(a) Squeeze Net

(b) Shuﬄe Net-v2

(c) Mobile Net-v2

Figure 1: Training trajectory of various models on CIFAR-100. Colors denote precision assignments: all-32-bit πfp32 (red), uniform πunif (yellow), and operator-based πop (blue) (see 3.1); the latter two use the 8-bit (and 16-bit) ﬂoats in Sun et al. (2019) as low (and high) precision numbers. Markers denote the width multiplier of a model, which controls the capacity of the model (see 5.3): 1.0 ( ), 0.5 ( ), 0.25 ( ), and 0.1 ( ). Some lines of πunif are missing as they converge to small values or diverge. Observe that neither πunif nor πop works best for all models: in some models, πop has a similar accuracy to πfp32; but in other (and all) models, the accuracy drop of πop (and πunif) from πfp32 are noticeably large (i.e., >1%).

models and inputs where the chosen precision assignment (i) results in noticeably worse accuracy than 32-bit-ﬂoat training, (ii) causes training to even diverge, or (iii) admits a more eﬃcient assignment that achieves similar training accuracy (see Figures 1, 3, and 4).

In this paper, we present a new, automated method for choosing precision assignments that removes the limitations described above. To do so, we formally introduce the memory-accuracy tradeoﬀproblem ( 3.2): given a dataset, a model, and two ﬂoating-point precision levels (i.e., bitwidths; high and low), ﬁnd a mixed precision assignment (a mapping from all tensors arising in training to high/low precision) for the model that maximizes training accuracy subject to a given upper bound on the model aggregate (i.e., the total number of bits of all tensors appearing in training). The model aggregate is a proxy for the memory and time required for training, as it is roughly proportional to memory footprint and also well-correlated with training time (since training is often dominated by data movement) (Micikevicius et al., 2018).

We prove that the memory-accuracy tradeoﬀproblem is theoretically diﬃcult (namely NP-hard) partly due to the exponential number of possible mixed precision assignments (which we often refer to simply as precision assignments for brevity) ( 3.3). The large number of possible assignments makes the problem diﬃcult in practice as well: there is no known analytical method for predicting the training accuracy of a given precision assignment, and for any practical model there are far too many precision assignments to simply test them all.

We propose a simple (heuristic) approach to the tradeoﬀproblem that prioritizes tensors for low-precision formats based on the tensor s size (with an additional step described below) ( 4.1). More speciﬁcally, our algorithm takes as input a single parameter giving a desired upper bound on the model aggregate. Starting with the largest tensor in the model, tensors are assigned low precision in size order (from largest to smallest) until the model aggregate falls below the given upper bound; all remaining tensors are assigned high precision. Our main result is that this method discovers mixed precision assignments that use less memory while achieving higher training accuracy than previous approaches. While we cannot show that our method ﬁnds Pareto-optimal memory-accuracy tradeoﬀs, we do show that our results are closer to Pareto-optimal than prior methods.

Some precision assignments initially generated by our algorithm cause training to diverge due to an excessive number of overﬂows. To address this issue, we propose an overﬂow handling technique that promotes tensors causing too many overﬂows from low precision to high precision during training ( 4.2). In our experiments, these promotions consume only a small amount of additional memory (< 3% of the maximum model aggregate) and prevent training from diverging. The overﬂow handling technique is not speciﬁc to our algorithm and can be applied to other precision assignment methods as well.

We evaluate a Py Torch implementation of our method on standard image classiﬁcation tasks by training four convolutional networks (and their variants) on CIFAR-10, CIFAR-100, and Image Net ( 5). We ﬁrst demonstrate that the precision assignments computed by our method alleviate the limitations of existing methods: they indeed explore the tradeoﬀbetween memory and accuracy and exhibit a better tradeoﬀthan

Published in Transactions on Machine Learning Research (06/2023)

the uniform and operator-based assignments. We then show the two main components of our method (i.e., precision demotion of larger tensors and precision promotion of overﬂowing tensors) are both important to produce competitive precision assignments. We also provide some guidance on how users may apply our method to navigate the memory-accuracy tradeoﬀ.

To summarize, this work makes four main contributions:

We formally introduce the memory-accuracy tradeoﬀproblem to explore better mixed precision assignments for low-precision ﬂoating-point training and prove the NP-hardness of the problem. We present a novel precision assignment technique, as a heuristic solution to the tradeoﬀproblem, that proposes assignments based on a single parameter denoting a desired upper bound on the model aggregate. We present a novel technique that handles an excessive number of overﬂows arising in training while using a small amount of additional memory. The technique is applicable to any (not just our) precision assignments. We demonstrate that the mixed precision assignments found by our method do explore the tradeoﬀ between memory and training accuracy, and outperform existing precision assignment methods.

We remark that this work focuses on low-precision ﬂoating-point training, not ﬁxed-point training (which uses ﬁxed-point formats), since we want to target upcoming (or very recent) hardware with native support for low-precision ﬂoats (e.g., 8-bit ﬂoats) and their operations (e.g., Andersch et al. (2022)). Also, this work focuses on low-precision training (which trains a model from scratch), not inference (which assumes a pre-trained model). More discussion is in 2. We further remark that in our experiments we simulate low-precision formats (e.g., 8-bit ﬂoats) with 32-bit ﬂoats as in prior works, since a hardware and software ecosystem that natively supports these formats does not yet exist. Similarly, we do not include other models (e.g., language models) in the experiments, since no current software stacks support per-tensor precision assignments for certain operators that those models use. More details are in 5.1 and 5.2.

For image classiﬁcation tasks and convolutional networks, our precision assignment method typically provides > 2 memory reduction over the operator-based assignment while maintaining similar training accuracy and gives further reductions by trading oﬀaccuracy. Our method also provides similar memory reduction to the uniform assignment, while avoiding the divergence of training often caused by a uniform assignment.

The paper is organized as follows. After discussing related work ( 2), we deﬁne the memory-accuracy tradeoﬀ problem and study its hardness ( 3). We then describe our algorithm for the problem ( 4) and our evaluation ( 5). We conclude with limitations and future work ( 6).

2 Related Work

Low-precision ﬂoating-point training has been extensively studied since the work of Micikevicius et al. (2018). One active research direction is to select appropriate ﬂoating-point formats (or their variants) for lowand high-precision numbers in training. Various ﬂoating-point formats have been proposed, including FP16 (Micikevicius et al., 2018), BF16 (Kalamkar et al., 2019), FP8 (Micikevicius et al., 2022; Wang et al., 2018), HFP8 (Sun et al., 2019), and FP6 (Chmiel et al., 2021), along with some variants such as HBFP (Drumond et al., 2018), S2FP8 (Cambier et al., 2020), and BM (Fox et al., 2021). Recently, the problem of automatically selecting such ﬂoating-point formats has been considered (Yang et al., 2022). Another research direction is to develop algorithmic techniques that improve training accuracy under low precision: e.g., (Björck et al., 2021; Sa et al., 2018; Yang et al., 2019a; Zamirai et al., 2020). Our work is orthogonal and complementary to all these prior works: they consider various ﬂoating-point formats or training algorithms but use a ﬁxed precision assignment, which is either the uniform or operator-based assignment (or their variants); our work explores various precision assignments once ﬂoating-point formats and training algorithms are ﬁxed (e.g., based on the prior works). The tradeoﬀbetween memory and accuracy in training is also considered in Yang et al. (2022), but the work diﬀers from ours: they vary ﬂoating-point formats when a precision assignment is ﬁxed, while we vary precision assignments when ﬂoating-point formats are ﬁxed.

Low-precision ﬁxed-point training uses ﬁxed-point formats as a low-precision representation instead of a ﬂoating-point format. Some works use ﬁxed-point formats for forward tensors and ﬂoating-point formats for backward tensors: e.g., (Choi et al., 2018; Courbariaux et al., 2015; Jacob et al., 2018; Sun et al., 2020; Yang et al., 2019b). Other works use only ﬁxed-point formats for all tensors: e.g., (Banner et al., 2018; Das et al.,

Published in Transactions on Machine Learning Research (06/2023)

2018; Gupta et al., 2015; Rajagopal et al., 2020; Sakr & Shanbhag, 2019; Wu et al., 2018; Zhang et al., 2020; Zhou et al., 2016). Among all these works, some consider various mixed precision assignments with diﬀerent bitwidth (ﬁxed-point) formats: (Sakr & Shanbhag, 2019; Zhang et al., 2020); but they are not applicable to our context (i.e., ﬂoating-point training) since they rely on some properties of ﬁxed-point formats that do not hold for ﬂoating-point formats (e.g., all numbers in a given format are equally distributed). The approach taken in Rajagopal et al. (2020) is orthogonal and complementary to ours: they use only the uniform precision assignment, but change the underlying low-precision formats during training; we consider various mixed precision assignments, but ﬁx the underlying low-precision formats during training.

Low-precision inference, often called neural network quantization (in a narrow sense), aims at reducing the latency or memory of neural network inference (instead of training) by using low-precision numbers (Nagel et al., 2021). Existing approaches typically assume a pre-trained model and try to ﬁnd low-precision formats for each part of the inference computation, either by retraining the model (called quantization-aware training) or without any retraining (called post-training quantization); see, e.g., Gholami et al. (2022); Qin et al. (2022) for surveys. Some works on inference consider various mixed precision assignments, but they are not applicable to our context: they focus on making inference more eﬃcient and usually assume a pre-trained model; we focus on making training more eﬃcient and aim at learning a model from scratch.

Floating-point tuning is another related topic, which considers the following problem: given a program, assign appropriate formats (among given candidates) to the program s ﬂoating-point variables such that the program s output has an error smaller than a given threshold for all given inputs, while also maximizing performance (Chiang et al., 2017; Guo & Rubio-González, 2018; Menon et al., 2018; Rubio-González et al., 2013; 2016). This problem is diﬀerent from the problem we focus on: the former considers the ﬂoating-point error after a single run of a program, while we consider the training accuracy after a large number of runs of a program (i.e., a gradient computation) where each run aﬀects the next run; further, the former considers general-purpose programs, while we consider deep learning programs and exploit their unique features.

In this section, we ﬁrst provide background on low-precision ﬂoating-point training ( 3.1), based on which the memory-accuracy tradeoﬀproblem is introduced ( 3.2). We then prove the NP-hardness of the problem ( 3.3). Our approach in 3 4 is more formal than most related works for two reasons: (i) we show the problem is NP-hard, which has not been considered in prior work; and (ii) to clearly describe the precision assignments to be considered.

3.1 Background: Low-Precision Floating-Point Training

Let T be the set of real-valued tensors and let [n] denote the set {1, . . . , n}. For a supervised learning task, we usually consider a model network M = (f1, . . . , fn) parameterized by θ = (θ1, . . . , θn) Tn, and a loss network L = (fn+1, . . . , fm), where fi : T2 T is a primitive operator on tensors (e.g., convolution, batch normalization, maxpool, and softmax). Given an input-output pair (x, y) T2, the model M computes a predicted output y of x by iteratively applying fi( , θi) to x (i [n]), and L computes a loss from y by iteratively applying fi ( , y) to y (i [m]\[n]). A standard way to train M is to minimize the loss value using the gradient descent algorithm: iteratively update θ by following the gradient of the loss with respect to θ.

Floating-point training. In practice, we perform a gradient computation usually with tensors represented in ﬂoating-point formats. Let π : TS FP be a precision assignment giving the ﬂoating-point format of each tensor, where TS =

{vi, dvi, θj, dθj | i [m + 1], j [n]} is the set of tensors arising in a gradient computation (explained below), and FP =

{fp(e, m, b) | e, m N, b Z} is the set of ﬂoating-point formats. Here fp(e, m, b) denotes a ﬂoating-point format that consists of a 1-bit sign, an e-bit exponent, and an m-bit mantissa, and has an (additional) exponent bias of b Z. A common choice of π is πfp32(t) =

fp32 for all t TS, where fp32 =

fp(8, 23, 0) is the standard 32-bit ﬂoating-point format.

Published in Transactions on Machine Learning Research (06/2023)

%𝑣$%! %𝑣$%#

# 𝑑𝑣$%! # 𝑑𝑣$%#

: forward computation : backward computation

%𝑣! %𝑣$ %𝑣&

# 𝑑𝑣! # 𝑑𝑣$

Figure 2: A diagram showing the tensors and operators used in a gradient computation; see Eq. (1) for details. For brevity, rounding functions rndπ( ) are omitted.

Given a precision assignment π, a gradient computation is typically performed by the backpropagation algorithm: with ˆv1 = rndπ(v1)(x) and ˆdvm+1 = rndπ(dvm+1)(1), compute

ˆvi+1 = rndπ(vi+1)(fi(ˆvi, ˆui)), ˆθj = rndπ(θj)(θj), ˆdvi = rndπ(dvi)(df i,1( ˆdvi+1, ˆvi, ˆui)), ˆdθj = rndπ(dθj)(df j,2( ˆdvj+1, ˆvj, ˆθj)), (1)

for i [m] and j [n]; see Figure 2 for a diagram. Here rnd : FP T T is a function rounding a given input to a given ﬂoating-point format, df i,1, df i,2 : T3 T are the backward operators of fi with respect to its ﬁrst and second arguments, respectively, and ˆui = ˆθi if i [n] and y otherwise. We call vi and θj the forward tensors, and dvi and dθj the backward tensors. We put a hat over each tensor to emphasize that its value is the output of a rounding function to a possibly low-precision format; remark that such a rounding function is not used within fi, df i,1, and df i,2, since they typically use large bitwidth ﬂoats (e.g., fp32) and no low-precision ﬂoats internally (Cambier et al., 2020; Kalamkar et al., 2019). After the computation, ˆdθj stores the gradient of the loss value with respect to θj.

The overall picture of ﬂoating-point training is now described as follows. In each iteration of the gradient descent algorithm, we compute ˆdθj via Eq. (1) using a given precision assignment π, training data (x, y), and current weights θ. We then update each θj by θj rndfp32(θj η ˆdθj) given a learning rate η > 0, and proceed to the next iteration until the training ends. Here we use fp32 to represent θj by following the convention in low-precision ﬂoating-point training: a master copy of weights (i.e., θj) is stored separately from the weight values (i.e., ˆθj) used in a gradient computation, and is usually represented by fp32 (Cambier et al., 2020; Kalamkar et al., 2019; Micikevicius et al., 2018). The memory overhead of this master copy is very small compared to the memory required to store other tensors (e.g., activation tensors vi) (Micikevicius et al., 2018).

Low-precision ﬂoating-point training. In low-precision training, we use a precision assignment π where some tensors have a smaller bitwidth than fp32. Particularly well-studied are π that use two predetermined ﬂoating-point bitwidths (which are diﬀerent) and optionally vary the rest of the format from tensor to tensor. We call C : TS {lo, hi} FP a precision-candidate assignment if C(t, lo) has the same bitwidth for all t TS, the same holds for hi, and the bitwidth for lo is smaller than that for hi. We deﬁne Π(C) =

{π : TS FP | t TS. π(t) {C(t, lo), C(t, hi)}} as the set of precision assignments that conform to C.

Among various precision assignments in Π(C), two have received the most attention: the uniform assignment πunif,C (Micikevicius et al., 2018) and the operator-based assignment πop,C (Sun et al., 2019). The former assigns low-precision formats to all tensors uniformly1, and the latter to (most of) the input tensors of GEMM operators (in both forward and backward passes):

πunif,C(t) =

C(t, lo) for all t TS,

C(t, lo) if t {vi, θi, dvi+1} for some i and fi is a GEMM operator (but not the ﬁrst/last one) C(t, hi) otherwise, (2)

where a GEMM operator refers to a general matrix multiplication operator which arises in, e.g., fully-connected or convolutional layers. A particular variant πop ,C of πop,C has received much attention as well (Kalamkar et al., 2019), which assigns low-precision formats to (most of) the input and output tensors of GEMM operators: it is deﬁned as πop,C except that {vi, θi, dvi+1} in Eq. (2) is replaced by {vi, θi, vi+1, dvi, dθi, dvi+1}.

1For simplicity we deﬁne πunif,C without the common exceptions for tensors near v1 and/or vm+1.

Published in Transactions on Machine Learning Research (06/2023)

We note that the precision assignments used in apex.amp and torch.amp (Nvidia, 2019; Py Torch, 2022) correspond to πop,C and πop ,C, respectively. For several choices of C, these assignments have been shown to produce training accuracy similar to that by πfp32 on many datasets and models (see 1 2).

3.2 Memory-Accuracy TradeoﬀProblem

We now introduce the following problem based on 3.1, to address the limitation of existing approaches for low-precision ﬂoating-point training discussed in 1:

Problem 3.1 (Memory-accuracy tradeoﬀ). Given training data {(xi, yi)}, a model and loss network M and L, a precision-candidate assignment C, and a lower bound r [0, 1] on the low-precision ratio, ﬁnd a precision assignment π Π(C) that maximizes acc(π) subject to ratiolo(π) r.

Here acc(π) denotes the accuracy of the model M when trained with π on {(xi, yi)}, and ratiolo(π) denotes the low-precision ratio of π, i.e., the portion of the tensors represented in low-precision under π, among all tensors appearing in a gradient computation:

ratiolo(π) =

size({t TS | π(t) = C(t, lo)})

size(TS) [0, 1]

where size(T) =

t T size(t) denotes the total size (i.e., number of elements) of all tensors in T TS.2 For instance, ratiolo(πhi) = 0 and ratiolo(πlo) = 1 for the all-high-precision assignment πhi and the all-low-precision assignment πlo. The problem asks for a precision assignment that maximizes training accuracy under a memory constraint, which is expressed as a fraction of the memory required to train the model using πhi.

3.3 NP-Hardness of the Problem

We prove that the memory-accuracy tradeoﬀproblem from 3.2 is NP-hard by showing that there is a polynomial-time reduction from the knapsack problem to this problem:

Theorem 3.2. Problem 3.1 is NP-hard.

Proof sketch. Recall the knapsack problem: given n items with weights wi N and proﬁts pi N (i [n]), ﬁnd a subset of the items that maximizes the total proﬁt while its total weight does not exceed a given threshold W N.

Given an instance (w, p, W) of the knapsack problem, we construct an instance of Problem 3.1 such that we get the following (informal) correspondence between the two: wi corresponds to the size of the parameter tensor θi; pi to the i-th component of the input data; W to the lower bound r on the low-precision ratio (in an inverse way); and selecting the i-th item corresponds to assigning a high-precision format to the tensor θi (and related tensors), which roughly decreases the low-precision ratio by wi while increasing the accuracy of the model (after training) by pi. Based on this informal correspondence, we formally prove that an optimal solution to the above instance of Problem 3.1 can be converted in linear time to an optimal solution to the given knapsack problem (w, p, W). That is, we have a linear-time reduction from the knapsack problem (which is NP-hard (Karp, 1972)) to Problem 3.1 which is therefore NP-hard. For a detailed proof, see Appendix A.

Intuitively, the proof relies on two aspects of Problem 3.1: the size of the search space (i.e., |Π(C)|) is exponential in the size of the problem (especially |TS|), and some values representable in a high-precision format underﬂow to 0 in a lower-precision format. Note that underﬂows are relevant in low-precision training: they frequently arise in practice, degrading the results of training (Micikevicius et al., 2018). The NP-hardness result indicates that it is unlikely any polynomial-time algorithm solves the problem exactly.

4 Algorithm

In this section, we propose a novel (heuristic) algorithm for the memory-accuracy tradeoﬀproblem ( 4.1), and a new technique to handle overﬂows arising in training ( 4.2). We point out that the former algorithm

2As explained in 1, the low-precision ratio is a proxy for the reduction in memory as well as training time (because the low-precision ratio increases as the model aggregate decreases).

Published in Transactions on Machine Learning Research (06/2023)

ﬁnds an initial precision assignment before training starts, whereas the latter technique updates the current precision assignment while training proceeds.

4.1 Precision Demotion for Saving Memory

Consider an input to the memory-accuracy trade-oﬀproblem (Problem 3.1): a model and loss network M = (f1, . . . , fn) and L = (fn+1, . . . , fm), a precision-candidate assignment C, and a lower bound r on the low-precision ratio. Given the input, our algorithm ﬁnds a precision assignment π in two steps (Algorithm 1).

Tensor grouping. We ﬁrst group tensors in TS such that each group consists of all the tensors between two adjacent GEMM operators (see below for details). This grouping reduces the search space over precision assignments, from all of Π(C) to a subset in which the same precision is assigned to the tensors in the same group. This speciﬁc grouping strategy is based on two observations: a majority of ﬂoating-point operations are carried out in GEMM operators, and it is standard (e.g., in Py Torch) to use the same precision for a forward tensor and its corresponding backward tensor.

Formally, we group tensors as follows. Let fk and fk (k < k ) be GEMM operators that are adjacent , i.e., there is no GEMM operator in {fk+1, . . . , fk 1}. For each such (fk, fk ), we create a group {vi, dvi, θj, dθj | i (k, k ] [m + 1], j (k, k ] [n]}. After that, we create two more groups for the remaining tensors: one for the tensors near v1 and the other for tensors near vm+1. As a result, we obtain a set of disjoint groups of tensors {T1, T2, . . .} 2TS.

Precision demotion. Given the groups of tensors, T1, T2, . . ., we construct a precision assignment π as follows: initialize π to the all-high-precision assignment and update π by demoting the precision of all tensors in a group to low precision, one group at a time, until the low-precision ratio of π becomes greater than r. We demote the precision of groups in decreasing order of their sizes (i.e., the total number of elements in tensors); that is, the precision of a larger size group is demoted earlier. Formally, let {T 1, T 2, . . .} be the reordering of {T1, T2, . . .} such that size(T 1) size(T 2) . After initializing π by π(t) = C(t, hi) for all t, we iterate over i N and update π to π(t) = C(t, lo) for all t T i, until ratiolo(π) r is ﬁrst satisﬁed. The resulting π is the output of our algorithm.

The intuition behind using group size as the priority order for precision demotion is based on the fact that it is actually optimal in a very simpliﬁed setting. Suppose that an input x to the model M stores a quantity of information I and the forward computation of M is nothing but a process of extracting the information in the input into a small number of values, i.e., the tensor vn+1. Assume that passing through each group Oi = {fk+1, . . . , fk } of operators (corresponding to the group Ti of tensors) reduces the amount of information by a factor αi (0, 1), and using low precision on the group Ti further reduces the amount of information by a constant factor β (0, 1) for all i. Then, the amount of information left in vn+1 becomes I (α1α2 ) βl, where l is the number of groups in low precision. In this simpliﬁed setting, maximizing the amount of information in vn+1 is equivalent to minimizing the number of groups in low precision, which is achieved precisely by demoting the largest groups ﬁrst (when there is a constraint on the low-precision ratio). We show empirically ( 5.4) that using the decreasing size order in precision demotion indeed produces better precision assignments than using other orders.

4.2 Precision Promotion for Handling Overﬂows

While our algorithm in 4.1 exerts a constraint on memory usage, it places no explicit constraint on training accuracy, and so not surprisingly for some models and datasets the resulting precision assignment causes training to diverge accuracy decreases signiﬁcantly and remains low after some point. We observe that when training begins to diverge (and a bit before that), many overﬂows occur in the rounding function of some tensors ˆvi, i.e., an input tensor to the function rndπ(vi)( ) in Eq. (1) contains many elements whose magnitude is larger than the maximum representable number of the format π(vi) (Figure 6(a-b); 5.4). This rapid increase in overﬂows in individual tensors is a signal that training may diverge.

Precision promotion. Based on this observation, after each gradient computation we update the current precision assignment π by promoting to high precision (i.e., C(t, hi)) any forward tensor t whose overﬂow ratio is greater than a given threshold Θ (0, 1); this updated precision assignment is used in the next gradient

Published in Transactions on Machine Learning Research (06/2023)

Algorithm 1: Computing π with precision demotion Input: (f1, . . . , fn), (fn+1, . . . , fm), C, r /* Tensor grouping */ k = 1; Tk = for i = 1 to m do Tk = Tk {vi, dvi} if k n then { Tk = Tk {θi, dθi} } if fi is GEMM then { k = k + 1; Tk = } end /* Precision demotion */ (T 1, . . . , T k) = sort(T1, . . . , Tk) by decreasing size π(t) = C(t, hi) for all t TS for j = 1 to k do if ratiolo(π) r then { break } π(t) = C(t, lo) for all t T j end return π

Algorithm 2: Training with precision promotion Input: π, Θ, C /* Training loop */ Initialize weights θ while training not ﬁnished do Compute the current gradient dθ using π Update θ using dθ /* Precision promotion */ for forward t TS do if overﬂow_ratio(t) > Θ then π(t) = C(t, hi) end end end return θ

computation (Algorithm 2). Here the overﬂow ratio of t TS denotes the number of overﬂows arising in the rounding function of ˆt in Eq. (1), divided by the number of elements in ˆt. We show empirically ( 5.4) that training always converges using this technique and the additional memory cost of promotion is small (in our experiments, < 3% of the maximum model aggregate3). For the experiments, we use Θ = 0.01; in fact we found that a wide range of values for Θ (0.1, 0.01, and 0.001) all work well. Note that this technique is not speciﬁc to our algorithm and can also be applied to other precision assignment methods.

We apply precision promotion only to forward tensors for two reasons. First, dynamic loss scaling (Micikevicius et al., 2018; Nvidia, 2019; Py Torch, 2022; Sun et al., 2019) already handles overﬂows in backward tensors, but not in forward tensors: loss scaling multiplies the backward loss tensor dvm+1 by a constant before performing backward computation, to scale up all backward tensors; the dynamic version adjusts the constant during training in a way that avoids overﬂows in backward tensors. Note that dynamic loss scaling does not aﬀect forward tensors at all. Second, we cannot use a similar idea to handle overﬂows in forward tensors, because forward tensors are not linear in the input tensor v1 whereas backward tensors are linear in the backward loss tensor dvm+1 (by the linearity of diﬀerentiation).

Precision promotion incurs little if any computational overhead: checking whether a single rounding operation overﬂows is cheap, and we only apply rounding functions to the output tensor of an arithmetic-intensive operator (e.g., convolution and batch normalization), amortizing the cost of the overﬂow checks over a large number of other operations.

5 Experiments

In this section, we evaluate our precision assignment technique (developed in 4) on standard training tasks to answer three research questions:

Does our technique explore the tradeoﬀbetween memory and accuracy and achieve a better tradeoﬀ than existing (ﬁxed) precision assignments ( 5.3)? Are the two main components of our technique, precision demotion/promotion of larger/overﬂowing tensors, important for good performance ( 5.4)? How can we choose the parameter r in our technique (i.e., a lower bound on the low-precision ratio) ( 5.5)?

3For each training where our methods (presented in 4) are used, we measure the model aggregate when the training starts and when it ends. We observe that the diﬀerence between the two values (averaged over four diﬀerent runs) is at most 3% of the maximum model aggregate (i.e., the model aggregate when all tensors are in high precision).

Published in Transactions on Machine Learning Research (06/2023)

5.1 Implementation

We have implemented our precision assignment technique using Py Torch (Paszke et al., 2019). Given a model and loss network, and a dataset, our implementation takes as parameters a precision-candidate assignment C and a lower bound r on the low-precision ratio; it then automatically assigns precisions to tensors (appearing in training) according to our technique and uses those assigned precisions in gradient computations. To make these procedures automatic, our implementation works as follows:

For each primitive operator in Py Torch (e.g., torch.nn.Conv2d), our implementation provides its wrapped version (e.g., ext3.nn.Conv2d) which records auxiliary information for our technique (e.g., ﬂoating-point format of input/output tensors) and applies proper rounding functions in forward/backward computations based on the auxiliary information. Models should now use the wrapped classes instead of the original ones. Our implementation ﬁrst constructs a computation graph (of a given model and loss network) dynamically by running a forward computation on a minibatch of input data. The computation graph and other information (e.g., each tensor s size) are recorded in the wrapped classes. Using the auxiliary information just recorded, our implementation then constructs an initial precision assignment according to 4.1, and starts training with this assignment. During the training, our implementation uses the current precision in gradient computations, and updates it after each gradient computation according to 4.2. We record the precision assignment also in the wrapped classes to automatically apply proper rounding functions in gradient computations.

We simulate low-precision formats used in the experiments (e.g., 8-bit ﬂoats) and their operations, with 32-bit ﬂoats and 32-bit operations followed by rounding functions as described in Eq. (1); simulating low-precision formats is the standard methodology set by prior works on low-precision training (Cambier et al., 2020; Fox et al., 2021; Kalamkar et al., 2019; Micikevicius et al., 2022) and we simply follow this.4 We implement the rounding functions based on the QPy Torch library (Zhang et al., 2019), but a few extensions are required, e.g., to support exponent bias and signal overﬂows for dynamic loss scaling. We automatically apply these rounding functions after each primitive operator, by using Py Torch s hook feature (e.g., nn.Module.register_*hook).

5.2 Experiment Setups

Datasets and models. As benchmarks for our experiments, we use the image classiﬁcation task and three datasets for the task: CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), and Image Net (Russakovsky et al., 2015); these task and datasets have been widely used in recent works on low-precision training as a standard choice (Chmiel et al., 2021; Rajagopal et al., 2020; Sakr & Shanbhag, 2019; Wang et al., 2018) and we simply follow this. For the task and datasets, we use four well-known models: Squeeze Net (Iandola et al., 2016), Shuﬄe Net-v2 (Ma et al., 2018), Mobile Net-v2 (Sandler et al., 2018), and Res Net-18 (He et al., 2016); they are chosen since models with relatively few weights, such as these, are generally known to be more diﬃcult to train with low precision than those with more weights (Sun et al., 2019). We considered other tasks (e.g., language modeling) and related models (e.g., RNN/transformer-based models) but did not include them in our experiments because substantial additional implementation eﬀort orthogonal to our main contributions would be required: these models use some Py Torch operators that do not support per-tensor precision assignments,5

so applying our technique to these models requires signiﬁcant modiﬁcations to Py Torch internals.

Precision-candidate and precision assignments. For the experiments, we use the precision-candidate assignment C studied in Sun et al. (2019), which uses 16-bit (and 8-bit) ﬂoats for high (and low) precision; in particular, C(t, hi) = fp(6, 9, 0) for all (forward/backward) tensors t, and C(t, lo) = fp(4, 3, 4) for all forward tensors t and fp(5, 2, 0) otherwise. We choose this particular C since it uses sub-32-bit ﬂoating-point formats for both low and high precision and the precision assignment πop,C was shown to achieve accuracy comparable to 32-bit training (Sun et al., 2019). The three ﬂoating-point formats used in C have subnormals but no

4The 8-bit formats used in our experiments (see 5.2) began to be supported natively in hardware very recently (by NVIDIA H100 GPU). But access to such hardware is still very limited (e.g., no major cloud services including AWS, Azure, and Google Cloud provide it), and these formats are not yet supported natively in software (e.g., Py Torch, Tensor Flow, and cu DNN). Due to such lack of a hardware and software ecosystem natively supporting these formats, we chose to simulate them as in prior works. 5For instance, the Py Torch functions nn.RNN and nn.Multihead Attention do not allow to change the precision of intermediate tensors (e.g., input/output tensors of each GEMM operator used in the functions) to user-deﬁned formats (e.g., fp(4, 3, 4)).

Published in Transactions on Machine Learning Research (06/2023)

inﬁnities and Na Ns, which are rounded to the largest or smallest representable numbers. Since our technique is parameterized by a precision-candidate assignment, it is easily applied to other assignments as well.

We evaluate our technique by varying its parameter r (i.e., a lower bound on low-precision ratio) over deciles r {0, 0.1, 0.2, . . . , 1}. We write πours,r to denote the precision assignment chosen by our technique (described in 4) for a given r; e.g., πours,0 is the all-high-precision assignment, and πours,1 is the all-low-precision assignment equipped with our precision promotion technique ( 4.2). Following Sun et al. (2019), all precision assignments (including πours,r) in our experiments use high precision (i.e., 16 bits) for all backward weight tensors (i.e., ˆdθj). For each precision assignment π, its low-precision ratio can change during training due to our precision promotion technique (when applied), so we compute the average of the ratio over all epochs and report this value as the low-precision ratio of π.

Other setups and compute time. All experiments were performed on NVIDIA V100 GPUs; total compute time for all experiments was 1,081 GPU days. We train all models in a standard way: we apply dynamic loss scaling (a standard technique used in low-precision ﬂoating-point training; see 4.2 for details) except for 32-bit training, and use standard settings (e.g., learning rate); see Appendix B for details. Due to random variations in training, we perform four runs of training for each conﬁguration and report the average and the range of measured quantities.

5.3 Comparison with Existing Precision Assignments

To compare our technique with existing precision assignments for ﬂoating-point training, we train each model with the following precision assignments: all-32-bit πfp32, uniform πunif (Micikevicius et al., 2018), operatorbased πop (Nvidia, 2019; Sun et al., 2019), its variant πop (Kalamkar et al., 2019; Py Torch, 2022), and ours πours,r (see 3.1 and 5.2 for their deﬁnitions). We choose πunif, πop, and πop as baselines because existing precision assignments for ﬂoating-point training fall into one of the three assignments (or their variants) (see 1 2).

We train the four models mentioned in 5.2 on CIFAR-10 and CIFAR-100, and Shuﬄe Net-v2 on Image Net. We also train smaller variants of the four models (which are more diﬃcult to train with low precision) on CIFAR100. We obtain these variant models by following Sun et al. (2019), i.e., by applying a well-known approach for model reduction that uses a parameter called the width multiplier (Howard et al., 2017): each variant model reduces the number of channels in most tensors by a width multiplier; we use three values {0.5, 0.25, 0.1} for the width multiplier. We train just one model on Image Net due to the large amount of computation involved: for each model, 44 training runs (11 choices for r and 4 runs for each choice) are required for πours,r and each run on Image Net takes nearly a half day with 16 GPUs. We use Shuﬄe Net-v2 for Image Net since the model shows interesting memory-accuracy tradeoﬀs when trained on the (smaller) CIFAR datasets.

Image Net. Figure 3 presents training results of Shuﬄe Net-v2 on Image Net: its left graph plots the average training trajectory for each precision assignment, and its right graph shows how each precision assignment trades oﬀbetween memory and accuracy, where memory is represented (inversely) by the low-precision ratio of the assignment (which is averaged over all epochs; see 5.2) and accuracy is the best test accuracy of the model during training. Each point in the right graph shows the average accuracy of four runs of training, while the shaded area shows the variation in accuracy among those four training runs.

Figure 3 shows three points. First, as the parameter r increases, the average accuracy drop of πours,r from πfp32 increases (up to 5%). In contrast, πunif and πop have a much larger average accuracy drop (more than 30%), as some training runs diverge when πunif and πop are used. Second, the tradeoﬀgiven by πours,r is better (i.e., closer to Pareto-optimal) than by πop: πours,r for r {0.3, 0.4} has both higher accuracy and larger low-precision ratio (i.e., memory reduction) than πop. In particular, πours,0.4 has 1.6 the memory reduction of πop. Third, πours,r provides options that πop cannot (which has an accuracy drop of >1%). If we want accuracy closer to πfp32, say within 0.5%, we can use πours,0.2 with 2.6% more memory than πop. If we can tolerate a larger accuracy loss, say 3%, then we can use πours,0.7 with 2.9 the memory reduction of πop.

CIFAR-10/100. Figure 4 presents the memory-accuracy tradeoﬀs of precision assignments for the four models on CIFAR-10 and CIFAR-100, and their smaller variants (with width multiplier 0.25) on CIFAR-100. The results for other smaller variants are similar and included in Figure 10 (see Appendix C.1).

Published in Transactions on Machine Learning Research (06/2023)

The conclusions from Figure 3 hold for Figure 4: πours,r provides a range of options by varying r and exhibits a better tradeoﬀthan πunif, πop, and πop in almost all cases. We give a detailed comparison as follows. First, in half of all 12 plots, πunif shows a similar tradeoﬀto πours,1. But in the remaining half, πunif has an accuracy drop much larger than all other precision assignments including πours,r, since using πunif often makes training diverge while using, e.g., πours,1 does not do so. Second, in all but two plots, πours,r shows a strictly better tradeoﬀthan πop: πours,r has noticeably larger (> 2 ) memory reduction than πop while maintaining similar accuracy. Even in the two plots, πours,r has a tradeoﬀvery close to πop. Note that in three plots, πop has an accuracy drop of >1% while πours,r provides several options that have smaller accuracy drops and more memory savings at the same time. Third, πours,r shows a strictly better (or similar) tradeoﬀthan πop in all but two (or two) plots. Note that πop has accuracy smaller than πop in all but one plots. Also it has an accuracy drop of >1% in half of all plots, and sometimes makes training even diverge (in one plot here and three other plots in Figure 10).

Additional results. To isolate the eﬀect of our precision promotion technique on the above results (Figures 3, 4 and 10), we compare our precision assignments and the baseline assignments while applying the precision promotion to all of them, and present the results in Appendix C.1 (Figures 13 15). Two observations can be made in these results: for each precision assignment, (i) if training diverged without the precision promotion, applying the precision promotion prevents such divergence and produces much higher accuracy; (ii) otherwise, the accuracy and the low-precision ratio remain similar regardless of using the precision promotion. These observations lead to the same conclusion as in the above: our assignments provide similar or better tradeoﬀ between memory and accuracy than the baseline assignments, even when the latter are equipped with our precision promotion technique. In addition, these observations also indicate that our precision promotion technique can eﬀectively handle divergence in training (see 5.4 for more results on this).

5.4 Ablation Study: Precision Demotion and Promotion

Precision demotion. To evaluate the decision to use precision demotion in decreasing-size order, we train the four models on CIFAR-100 with πours,r, πours[inc],r (which demotes tensor groups in increasing-size order) and πours[rand],r (which demotes tensor groups in random order). For πours[rand], three diﬀerent random orders are used in each case. The results, presented in Figure 5 (and Appendix C.2), show that the order of precision demotion has a signiﬁcant impact on the resulting memory-accuracy tradeoﬀ, and that decreasing order provides the best results in nearly all cases. Increasing order consistently shows the worst results, suggesting our intuition (given in 4.1) for choosing decreasing order has some basis in reality.

Precision promotion. To understand whether precision promotion of overﬂowing tensors is important to our technique, we train Shuﬄe Net-v2 on Image Net using πours[no-promo],r which does not promote tensors. The results, presented in Figure 6(a), show that several training trajectories diverge in early epochs and fail to recover afterwards. Figure 6(b) plots the top-5 tensor overﬂow ratios for the highlighted trajectory in Figure 6(a). The overﬂow ratios ﬁrst spike about when divergence occurs around epoch 11. A closer look shows that the spike in overﬂow ratio occurs shortly before divergence, and starts ﬁrst in a few tensors and then propagates to others. These observations indicate that an excessive number of overﬂows in a few tensors are the cause of the training divergence.

Finally, Figure 6(c-d) shows that precision promotion is eﬀective at preventing the divergence of training while sacriﬁcing only a small amount of memory reduction. The ﬁgure shows Shuﬄe Net-v2 on Image Net trained using our technique with and without precision promotion. Figure 6(c) shows that without precision promotion large accuracy drops occur due to divergence, whereas with precision promotion training converges. Figure 6(d) shows that the total size of tensors promoted to high precision is small for all r values. See Appendix C.2 for similar results for CIFAR-10.

5.5 Choosing the Value of r

The time and space savings of our method are most signiﬁcant when a model is regularly retrained, which commonly occurs when new data is periodically incorporated into an existing model. Assuming that new data has a similar distribution to existing data, we can choose a single r (a parameter in our method) by conducting one set of experiments where we train with πfp32 and πours,r for diﬀerent r and then choose the r value that maximizes model aggregate savings while still having an acceptable drop in accuracy.

Published in Transactions on Machine Learning Research (06/2023)

Figure 3: Results of training Shuﬄe Net-v2 on Image Net with πfp32, πunif (Micikevicius et al., 2018), πop (Sun et al., 2019), πop (Kalamkar et al., 2019), and πours,r. Left: Each line shows the average training trajectory for each precision assignment; πours,r is colored from navy to yellow (darker for smaller r). A zoomed-in version of this plot can be found in Appendix C.1. Right: Each point shows the memory-accuracy tradeoﬀof each precision assignment; a red-dashed line shows the accuracy of πfp32; and shaded areas show the variation among four training runs. In the right ﬁgure, top-right points are better than bottom-left ones. Observe that there are s above and to the right of and , respectively. is missing as its y-value is too small.

(a) CIFAR-10, Squeeze Net

(b) CIFAR-100, Squeeze Net

(c) CIFAR-100, Squeeze Net

(d) CIFAR-10, Shuﬄe Net-v2

(e) CIFAR-100, Shuﬄe Net-v2

(f) CIFAR-100, Shuﬄe Net-v2

(g) CIFAR-10, Mobile Net-v2

(h) CIFAR-100, Mobile Net-v2

(i) CIFAR-100, Mobile Net-v2

(j) CIFAR-10, Res Net-18

(k) CIFAR-100, Res Net-18

(l) CIFAR-100, Res Net-18

Figure 4: Memory-accuracy tradeoﬀs of πunif (Micikevicius et al., 2018), πop (Sun et al., 2019), πop (Kalamkar et al., 2019), and πours,r for four models and their smaller variants on CIFAR-10 and CIFAR-100. The variant models have width multiplier 0.25 and are marked by . Top-right points are better than bottom-left ones. In all but three plots, there are s above and to the right of and , respectively; even in the three plots (g,h,k), s have almost the same tradeoﬀs to and . In half of all plots, has much smaller y-values than other points. The training trajectories for the above plots and the results of other smaller models are in Appendix C.1.

Published in Transactions on Machine Learning Research (06/2023)

(a) Squeeze Net

(b) Shuﬄe Net-v2

(c) Mobile Net-v2 Figure 5: Memory-accuracy tradeoﬀs of πours,r, πours[inc],r, and πours[rand],r for three models on CIFAR-100. Observe that s are above and to the right of other points in nearly all cases. The results of Res Net-18 are in Appendix C.2.

Figure 6: Training Shuﬄe Net-v2 on Image Net with πours,r and πours[no-promo],r. (a) Training trajectories of πours[no-promo],r for diﬀerent r; colors denote r values (darker for smaller r). (b) Top-5 overﬂow ratios of tensors at each epoch, for the highlighted trajectory in (a); the largest ratio is blue and the ﬁfth largest red. (c) Memory-accuracy tradeoﬀs of πours,r and πours[no-promo],r. (d) Low-precision ratio when training ends vs. when training starts, for πours,r and πours[no-promo],r. The results on CIFAR-10 are in Appendix C.2.

To simulate this scenario, we create ﬁve datasets Image Net-200-i (i [5]) as follows, so that each of them contains diﬀerent but similar data: randomly select 1/5 of the classes in Image Net (which has 1000 classes in total), and split the training data of each class evenly into ﬁve new datasets.

For each Image Net-200-i, we train Shuﬄe Net-v2 with πfp32 and πours,r and present the results in Figure 7. Based on the tradeoﬀresults of πours,r, we can choose r = 0.4 if we desire an average of < 1% accuracy drop from πfp32, and we can choose r = 0.9 if an average 3% accuracy drop is tolerable. We make two more observations: the tradeoﬀresult of πours,r is similar across all ﬁve datasets even though each dataset is diﬀerent, and for each r the variance in the accuracy of πours,r from diﬀerent datasets and runs of training is similar to that of πfp32. Thus we expect that on a new but similar dataset, πours,r would have an accuracy drop similar to Figure 7 with acceptable variance.

Figure 7: Memory-accuracy tradeoﬀs of πours,r for Shuﬄe Netv2 on Image Net-200-i (i [5]). 6 Limitations and Future Work

Our work has the same limitation present in prior works on low-precision ﬂoating-point training: low-precision ﬂoats and operations are simulated in software (instead of being handled natively in hardware) and so the potential speedup of our method is not directly measured, though we do expect speedups to be proportional to the reduction in the model aggregate. We leave it as future work to perform such experiments on very recent or future hardware (e.g., NVIDIA H100 GPU) that natively supports more low-precision formats. Another direction for future work is to integrate our method into systems for automatically optimizing deep learning computations (e.g., Jia et al. (2019); Unger et al. (2022)) to accelerate training.

Acknowledgments

We thank Thiago S. F. X. Teixeira for helping us run the experiments on cluster machines, and Colin Unger for providing helpful comments on an earlier version of this paper. This work was supported in part by the

Published in Transactions on Machine Learning Research (06/2023)

Advanced Simulation and Computing (ASC) program of the US Department of Energy s National Nuclear Security Administration (NNSA) via the PSAAP-III Center at Stanford, Grant No. DE-NA0002373.

Michael Andersch, Greg Palmer, Ronny Krashinsky, Nick Stam, Vishal Mehta, Gonzalo Brito, and Sridhar Ramaswamy. NVIDIA Hopper Architecture In-Depth. https://developer.nvidia.com/blog/ nvidia-hopper-architecture-in-depth/, 2022.

Ron Banner, Itay Hubara, Elad Hoﬀer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Neur IPS, pp. 5151 5159, 2018.

Johan Björck, Xiangyu Chen, Christopher De Sa, Carla P Gomes, and Kilian Weinberger. Low-Precision Reinforcement Learning: Running Soft Actor-Critic in Half Precision. In ICML, pp. 980 991, 2021.

Leopold Cambier, Anahita Bhiwandiwalla, Ting Gong, Oguz H. Elibol, Mehran Nekuii, and Hanlin Tang. Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks. In ICLR, 2020.

Wei-Fan Chiang, Mark Baranowski, Ian Briggs, Alexey Solovyev, Ganesh Gopalakrishnan, and Zvonimir Rakamaric. Rigorous ﬂoating-point mixed-precision tuning. In POPL, pp. 300 315, 2017.

Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoﬀer, Ron Banner, and Daniel Soudry. Neural gradients are near-lognormal: improved quantized and sparse training. In ICLR, 2021.

Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized Clipping Activation for Quantized Neural Networks. ar Xiv:1805.06085, 2018.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binary Connect: Training Deep Neural Networks with binary weights during propagations. In Neur IPS, pp. 3123 3131, 2015.

Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, and Vadim Pirogov. Mixed Precision Training of Convolutional Neural Networks using Integer Operations. In ICLR, 2018.

Mario Drumond, Tao LIN, Martin Jaggi, and Babak Falsaﬁ. Training DNNs with Hybrid Block Floating Point. In Neur IPS, pp. 451 461, 2018.

Sean Fox, Seyedramin Rasoulinezhad, Julian Faraone, david boland, and Philip Leong. A Block Miniﬂoat Representation for Training Deep Neural Networks. In ICLR, 2021.

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A Survey of Quantization Methods for Eﬃcient Neural Network Inference. In Low-Power Computer Vision: Improving the Eﬃciency of Artiﬁcial Intelligence, pp. 291 326. CRC Press, 2022. URL https://arxiv.org/abs/2103.13630.

Hui Guo and Cindy Rubio-González. Exploiting Community Structure for Floating-Point Precision Tuning. In ISSTA, pp. 333 343, 2018.

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In ICML, pp. 1737 1746, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, pp. 770 778, 2016.

Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobile Nets: Eﬃcient Convolutional Neural Networks for Mobile Vision Applications. ar Xiv:1704.04861, 2017.

Published in Transactions on Machine Learning Research (06/2023)

Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeeze Net: Alex Net-level accuracy with 50x fewer parameters and <1MB model size. ar Xiv:1602.07360, 2016.

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and Training of Neural Networks for Eﬃcient Integer-Arithmetic-Only Inference. In CVPR, pp. 2704 2713, 2018.

Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond Data and Model Parallelism for Deep Neural Networks. In MLSys, pp. 1 13, 2019.

Dhiraj D. Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A Study of BFLOAT16 for Deep Learning Training. ar Xiv:1905.12322, 2019.

Richard M. Karp. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations, pp. 85 103, 1972.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009. URL

https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shuﬄe Net V2: Practical Guidelines for Eﬃcient CNN Architecture Design. In ECCV, pp. 122 138, 2018.

Harshitha Menon, Michael O. Lam, Daniel Osei-Kuﬀuor, Markus Schordan, Scott Lloyd, Kathryn Mohror, and Jeﬀrey Hittinger. ADAPT: Algorithmic Diﬀerentiation Applied to Floating-Point Precision Tuning. In SC, pp. 48:1 48:13, 2018.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. In ICLR, 2018.

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart F. Oberman, Mohammad Shoeybi, Michael Y. Siu, and Hao Wu. FP8 formats for deep learning. ar Xiv:2209.05433, 2022.

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A White Paper on Neural Network Quantization. ar Xiv:2106.08295, 2021.

Nvidia. Documentation of apex.amp. https://nvidia.github.io/apex/amp.html, 2019.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Py Torch: An Imperative Style, High-Performance Deep Learning Library. In Neur IPS, pp. 8024 8035, 2019.

Py Torch. Documentation of torch.amp. https://pytorch.org/docs/stable/amp.html, 2022.

Haotong Qin, Yifu Ding, Wei Fan, Chris Leﬀ, Mehdi Bahri, and Eric Shaw. Awesome Model Quantization.

https://github.com/htqin/awesome-model-quantization, 2022.

Aditya Rajagopal, Diederik Adriaan Vink, Stylianos I. Venieris, and Christos-Savvas Bouganis. Multi-Precision Policy Enforced Training (Mu PPET): A Precision-Switching Strategy for Quantised Fixed-Point Training of CNNs. In ICML, pp. 7943 7952, 2020.

Published in Transactions on Machine Learning Research (06/2023)

Cindy Rubio-González, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, and David Hough. Precimonious: Tuning Assistant for Floating-Point Precision. In SC, pp. 27:1 27:12, 2013.

Cindy Rubio-González, Cuong Nguyen, Benjamin Mehne, Koushik Sen, James Demmel, William Kahan, Costin Iancu, Wim Lavrijsen, David H. Bailey, and David Hough. Floating-Point Precision Tuning Using Blame Analysis. In ICSE, pp. 1074 1085, 2016.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Image Net Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211 252, 2015.

Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R. Aberger, Kunle Olukotun, and Christopher Ré. High-Accuracy Low-Precision Training. ar Xiv:1803.03383, 2018.

Charbel Sakr and Naresh Shanbhag. Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm. In ICLR, 2019.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobile Net V2: Inverted Residuals and Linear Bottlenecks. In CVPR, pp. 4510 4520, 2018.

Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Xiaodong Cui, Wei Zhang, and Kailash Gopalakrishnan. Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks. In Neur IPS, pp. 4901 4910, 2019.

Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, and Kailash Gopalakrishnan. Ultra-Low Precision 4-bit Training of Deep Neural Networks. In Neur IPS, pp. 1796 1807, 2020.

Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat Mc Cormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken. Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization. In OSDI, pp. 267 284, 2022.

Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training Deep Neural Networks with 8-bit Floating Point Numbers. In Neur IPS, pp. 7686 7695, 2018.

Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and Inference with Integers in Deep Neural Networks. In ICLR, 2018.

Chengrun Yang, Ziyang Wu, Jerry Chee, Christopher De Sa, and Madeleine Udell. How Low Can We Go: Trading Memory for Error in Low-Precision Training. In ICLR, 2022.

Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, and Chris De Sa. SWALP : Stochastic Weight Averaging in Low Precision Training. In ICML, pp. 7015 7024, 2019a.

Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xian-sheng Hua. Quantization Networks. In CVPR, pp. 7308 7316, 2019b.

Pedram Zamirai, Jian Zhang, Christopher R. Aberger, and Christopher De Sa. Revisiting BFloat16 Training. ar Xiv:2010.06192, 2020.

Tianyi Zhang, Zhiqiu Lin, Guandao Yang, and Christopher De Sa. QPy Torch: A Low-Precision Arithmetic Simulation Framework. ar Xiv:1910.04540, 2019.

Xishan Zhang, Shaoli Liu, Rui Zhang, Chang Liu, Di Huang, Shiyi Zhou, Jiaming Guo, Qi Guo, Zidong Du, Tian Zhi, and Yunji Chen. Fixed-Point Back-Propagation Training. In CVPR, pp. 2327 2335, 2020.

Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Do Re Fa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. ar Xiv:1606.06160, 2016.

Published in Transactions on Machine Learning Research (06/2023)

A Problem: Deferred Proof

Theorem 3.2. Problem 3.1 is NP-hard.

Proof. We prove the NP-hardness of Problem 3.1 (the memory-accuracy tradeoﬀproblem) by reducing the knapsack problem (which is NP-hard) to the tradeoﬀproblem. More precisely, we prove that the knapsack problem can be solved in polynomial time if we assume an oracle for the tradeoﬀproblem.

Recall the knapsack problem: given n items with weights wi N and proﬁts pi N (i [n]), and given a threshold W N, decide which items to choose such that the total proﬁt of the chosen items is maximized while their total weight does not exceed W. That is, ﬁnd α {0, 1}n that maximizes P

i [n] αipi subject to P

i [n] αiwi W. This problem is well-known to be NP-hard (Karp, 1972).

Given an instance of the knapsack problem (w, p, W), we construct an instance of the tradeoﬀproblem as follows.

Notations. The following construct uses a constant k N and ﬂoating-point formats fphi, fplo FP (one for high precision and the other for low precision). Below we will specify the conditions they should satisfy, and show that some k, fphi, and fplo indeed satisfy the conditions. We write rndhi( ) and rndlo( ) as shorthand for rndfphi( ) and rndfplo( ). Training setups. We consider a very simple setting for training: the gradient descent algorithm with a learning rate η = 2 l (l N) is applied for just one epoch; all parameters are initialized to 0 and their master copies are represented in fphi; and the negative loss of a model on training data (i.e., L(fθ(x), y) using notations to be described below) is used as the accuracy of the model. Here l N can be any natural number. Model and loss networks. A model network M and a loss network L are given as Figure 8, where M has n parameter tensors θi Rwi of size wi (i [n]). For an input-output pair (x, y) Rn R, M and L compute a predicted output fθ(x) R and a loss L(fθ(x), y) R as follows (assuming that no rounding functions are applied):

j [wi] θi,jxi, L(fθ(x), y) =

2 k|fθ(x) y|.

Roughly speaking, M is (a variant of) a linear classiﬁer and L is a ℓ1-loss (scaled by 2 k). Training data. Training data consists of a single input-output pair (x, y) Rn R that satisﬁes the following:

xi = rndlo( p

pi/wi), y < 2 (k+l) X

i [n] wix2 i

for all i [n]. Here y can take any value as long as it satisﬁes the above inequality. Note that xi can be diﬀerent from p

pi/wi since the latter value may not be representable in fplo. Precision-candidate assignment. A precision-candidate assignment C : TS {hi, lo} FP is given as:

fphi, C(t, lo) =

fplo for all t TS.

That is, for all tensors, fphi is used as a high-precision format and fphi as a low-precision format. Here fphi and fplo should satisfy the following:

ehi elo, mhi mlo, (3)

|rndlo(s) s| < |s| err for all s S1, (4)

rndlo(s) = 0 for all s S2, (5)

rndhi(s) = s for all s S2 S3. (6)

Here ehi and mhi (and elo and mlo) denote the number of exponent bits and mantissa bits of fphi (and fplo), and err and Sj are deﬁned as: err =

1/(6n maxi [n]pi), S1 =

pi/wi | i [n]}, S2 =

{2 k} {2 kxi | i [n]}, and S3 =

{2 (k+l)xi | i [n]}. Eq. (4) says that the relative error of representing each s S1 in fplo should be less than err. Eq. (5) says that each s S2 should underﬂow to 0 when represented in fplo. Eq. (6) says that each s S2 S3 should be representable in fphi.

Published in Transactions on Machine Learning Research (06/2023)

𝑥! ! # . ! # $ %! 𝑣#&!,$. 2() 𝑣*#&+ 𝑦 𝑣,,+ 𝜃+,$ 𝑣+ .$ %"

𝜃#,$ 𝑣# .$ %#

𝑣#&+ ℝ%" 𝑣*#&+ ℝ 𝑣*#&* ℝ

Figure 8: The model network M and the loss network L used in the proof of Theorem 3.2.

Low-precision ratio. A lower bound r [0, 1] on the low-precision ratio is given as:

max n 0, 1 2W + 1

So r decreases linearly as W increases.

We make three points on the above construction.

First, each part of the knapsack problem (w, p, W) is used in the following parts of the construction: wi is used mainly in the size of the parameter tensor θi; pi in the input xi; and W in the lower bound r. Second, there exist k N and fphi, fplo FP that satisfy Eqs. (3) (6). This can be shown as follows: ﬁrst, by taking suﬃciently many exponent and mantissa bits for fplo, we can make Eq. (4) satisﬁed; next, by taking a suﬃciently large k, we can make Eq. (5) satisﬁed; ﬁnally, by taking suﬃciently many exponent and mantissa bits for fphi, we can make Eq. (3) and Eq. (6) satisﬁed (since xi is representable in fplo and 2 (k+l) is a power of two). Third, some well-known models (e.g., Shuﬄe Net-v2) have a similar structure to M in that they apply the following operations as a subroutine: split a tensor into multiple tensors, apply some operators to each split tensor, and combine the resulting tensors into a single tensor.

We now prove that the knapsack problem (w, p, W) can be solved in polynomial time, if an oracle to the above tradeoﬀproblem is given. Suppose that π Π(C) is an optimal solution to the above tradeoﬀproblem (given by the oracle). Deﬁne an item selection α {0, 1}n for the knapsack problem as:

( 1 if π(dθi) = π(dvn+i) = π(dv2n+1) = fphi 0 otherwise

for each i [n]. Note that α can be constructed from π in linear time. Thus, it suﬃces to show that α is an optimal solution to the knapsack problem (w, p, W), which is equivalent to the following two claims:

Claim 1: We have P

i [n] αiwi W. Claim 2: For any α {0, 1}n with P

i [n] α iwi W, we have P

i [n] α ipi P

i [n] αipi.

We now prove each claim as follows.

Proof of Claim 1. If α = (0, . . . , 0), then the claim clearly holds. Suppose that α = (0, . . . , 0). Then,

i [n] αiwi size(TS) ratiolo(π) r 1 1 + 2W

Here the ﬁrst inequality uses α = (0, . . . , 0) and the deﬁnition of α and M; the second inequality uses the fact that π is a valid solution to the above tradeoﬀproblem; and the third inequality uses the deﬁnition of r. Hence, the claim holds.

Published in Transactions on Machine Learning Research (06/2023)

Proof of Claim 2. Suppose that the claim does not hold. Then, there exists α {0, 1}n such that X

i [n] α iwi W, X

i [n] α ipi > X

i [n] αipi.

Deﬁne a precision assignment π Π(C) as:

π (dv2n+1) =

π (dvn+i) =

fphi for all i [n] with α i = 1,

fplo for all other t TS.

Then, we have ratiolo(π ) r by P

i [n] α iwi W and the deﬁnition of π , M, and r. Hence, it suﬃces to show acc(π) < acc(π ), because this would contradict to the fact that π is an optimal solution.

To show acc(π) < acc(π ), we prove the following two lemmas: the ﬁrst lemma gives a closed form of acc(π) and acc(π ), and the second lemma shows that P

i [n] βiwix2 i is close to P

i [n] βipi (where the former summation appears in acc(π) and acc(π )).

Lemma A.1. The following hold:

acc(π) = 2 ky + 2 (2k+l) X

i [n] αiwix2 i , acc(π ) = 2 ky + 2 (2k+l) X

i [n] α iwix2 i .

Proof. We prove the equation for acc(π) only, since the equation for acc(π ) can be proved similarly.

First, we show that for all i [n] and j [wi],

ˆdθi,j = αi 2 kxi. (7)

Pick any i [n] and j [wi]. Note that by the deﬁnition of M, we have

ˆdθi,j = rndπ(dθi) rndπ(dvn+i)(rndπ(dv2n+1)(2 k)) rndvi(rndv0(xi))

= rndπ(dθi) rndπ(dvn+i)(rndπ(dv2n+1)(2 k)) xi ,

where the second equality uses Eq. (3) and that xi is representable in fplo. We prove Eq. (7) by case analysis on αi. Suppose αi = 1. Then, by the deﬁnition of αi, π(dθi) = π(dvn+i) = π(dv2n+1) = fphi. From this, we get the desired equation:

ˆdθi,j = rndhi rndhi(rndhi(2 k)) xi = rndhi(2 k xi) = 2 kxi,

where the last two equalities use Eq. (6). Suppose now αi = 0. Then, by the deﬁnition of αi, at least one of π(dθi), π(dvn+i), and π(dv2n+1) is fplo. If π(dvn+i) = fplo or π(dv2n+1) = fplo, we get the desired equation:

ˆdθi,j = rndπ(dθi) rndlo(2 k) xi = rndπ(dθi)(0 xi) = 0,

where the ﬁrst equality uses Eq. (3) and Eq. (6), and the second equality uses Eq. (5). The remaining case is when π(dvn+i) = π(dv2n+1) = fphi and π(dθi) = fplo. We get the desired equation in this case as well:

ˆdθi,j = rndlo rndhi(rndhi(2 k)) xi = rndlo(2 k xi) = 0,

where the second equality uses Eq. (6), and the last equality uses Eq. (5). Hence, we have proved Eq. (7).

Next, let θi be the i-th parameter tensor before training starts, and θ i be the corresponding tensor after training ends (i [n]). Then, by the deﬁnition of the tradeoﬀproblem constructed above, we have θi,j = 0 and

θ i,j = θi,j rndhi(2 l ˆdθi,j) = 0 rndhi(2 l (αi 2 kxi)) = αi ( 2 (k+l)xi),

Published in Transactions on Machine Learning Research (06/2023)

where the second equality uses Eq. (7) and the third equality uses Eq. (6). Using this equation, we ﬁnally obtain the conclusion of this lemma:

acc(π) = L(fθ (x), y)

j [wi] θ i,jxi

j [wi] αi ( 2 (k+l)xi) xi

= 2 k y + X

i [n] αi 2 (k+l)wix2 i

= 2 k y + X

i [n] αi 2 (k+l)wix2 i

= 2 ky + 2 (2k+l) X

i [n] αiwix2 i ,

where the ﬁrst two equalities use the deﬁnition of accuracy, and the second last equality uses the deﬁnition of y. This concludes the proof of the lemma.

Lemma A.2. For any β {0, 1}n, X

i [n] βiwix2 i X

i [n] βipi < 1

Proof. We ﬁrst show that for any i [n],

|wix2 i pi| < 1

Pick any i [n]. By Eq. (4) and the deﬁnition of xi, we have xi r pi

wi 1 6n maxj [n] pj r pi

wi 1 6npi .

From this, we have r pi

< xi < r pi

2 < x2 i < pi

From this, we obtain the desired result:

|wix2 i pi| < pi 1 + 1 6npi

2 1 = pi 1 3npi + 1 (6npi)2

< pi 1 3npi + 1 6npi

= pi 1 2npi = 1

where the second inequality uses 6npi > 1 (as n, pi N).

Using this result, we can show the conclusion as follows: X

i [n] βiwix2 i X

i [n] βipi = X

i [n] βi(wix2 i pi) X

i [n] |βi| |wix2 i pi| < X

where the last inequality uses |βi| 1. This completes the proof of the lemma.

Using the two lemmas, we now prove acc(π) < acc(π ). By Lemma A.2 and P

i [n] αipi < P

i [n] α ipi, we have

i [n] αiwix2 i < X

i [n] αipi + 1

i [n] α ipi 1

i [n] α iwix2 i ,

where the second inequality comes from αi, α i {0, 1} and pi N. From this, and by Lemma A.1, we obtain acc(π) < acc(π ) as desired. This concludes the proof of Claim 2, thereby ﬁnishing the proof of the theorem.

Published in Transactions on Machine Learning Research (06/2023)

Remark A.3. In the proof of Theorem 3.2, we proved the NP-hardness of Problem 3.1 by making use of only a few limited aspects of the problem. For instance, we used the fact that some values representable in a highprecision format round to zero in a low-precision format; on the other hand, many other values representable in a high-precision format round to non-zero values in a low-precision format, and this indeed occurs in practical training (even more frequently than underﬂows). Also, we used a simple setting for training in which a gradient descent algorithm is applied for one epoch, training data consist of one input-output pair, and test data is the same as training data; on the other hand, in practical training, a gradient descent algorithm is applied for many epochs, training data consists of many input-output pairs, and test data is diﬀerent from training data.

Problem 3.1 is general enough so that it embraces all the aforementioned aspects of ﬂoating-points and training, including those that are not considered in the proof of Theorem 3.2. Since those aspects are likely to make the problem even more diﬃcult, we conjecture that the problem would be more intractable than being NP-hard.

B Experiments: Deferred Details

The datasets we use have the following licenses:

CIFAR-10 and CIFAR-100: These datasets are under the MIT license. Image Net: This dataset can be used only for non-commercial research and educational purposes. For more details, see its webpage (Stanford Vision Lab, 2020).

The implementations of models we use have the following licenses:

Squeeze Net for CIFAR-10 and CIFAR-100: We adapt an implementation of the model in a public Git Hub repository (Pathak, 2020), whose license information is not available. Shuﬄe Net-v2, Mobile Net-v2, and Res Net-18 for CIFAR-10 and CIFAR-100: We adapt an implementation of these models in a public Git Hub repository (kuangliu, 2021), which is under the MIT license. Shuﬄe Net-v2 for Image Net and Image Net-200-i: We adapt an implementation of the model in the torchvision library (Py Torch, 2022b), which is under the BSD 3-Clause license.

The details of how we train models are as follows:

Four models on CIFAR-10 and CIFAR-100: We train the four models with a standard setup (kuangliu,

2021). In particular, we run the (non-Nesterov) SGD optimizer for 200 epochs with minibatch size of 128 (over 1 GPU), learning rate of 0.1, momentum of 0.9, weight decay of 5 10 4, and the cosine annealing scheduler for learning rate. For dynamic loss scaling, we use initial scale of 216, growth factor of 2, back-oﬀfactor of 0.5, and growth interval of 1 epoch, as suggested in Py Torch (Py Torch, 2022a). Shuﬄe Net-v2 on Image Net: We train the model with the default setup given in Py Torch s Git Hub repository (Py Torch, 2022c), except that we use larger minibatch size and learning rate as in (Goyal et al., 2017; Kalamkar et al., 2019; Krizhevsky, 2014; Py Torch, 2022d) to reduce the wall-clock time of training. In particular, we run the (non-Nesterov) SGD optimizer for 90 epochs with minibatch size of 1024 (over 16 GPUs), learning rate of 0.4, momentum of 0.9, weight decay of 10 4, and the cosine annealing scheduler for learning rate. For dynamic loss scale, we use initial scale of 216, growth factor of 2, back-oﬀfactor of 0.5, and growth interval of 0.5 epoch, as suggested in Py Torch (Py Torch, 2022a). Shuﬄe Net-v2 on Image Net-200-i: We train the model with the same settings for Image Net except that we use the default values for minibatch size and learning rate given in (Py Torch, 2022c), i.e., minibatch size of 256 (over 4 GPUs) and learning rate of 0.1.

C Experiments: Deferred Results

C.1 Comparison with Existing Precision Assignments

Figure 9 presents a zoomed-in version of Figure 3 (left).

Figure 10 presents results omitted in Figure 4: training results of smaller variant models (which have width multiplier 0.5 or 0.1) on CIFAR-100 with πfp32, πunif, πop, πop , and πours,r. The ﬁgure shows similar results

Published in Transactions on Machine Learning Research (06/2023)

to Figure 4: the results for the variant models with width multiplier 0.5 (and 0.1) are similar to those for the original models (and the variant models with width multiplier 0.25).

Figures 11 and 12 show the average training trajectories for the conﬁgurations presented in Figures 4 and 10.

Figures 13 and 14 present the same results as Figures 3 and 4 except the following: in the former, πop and πop are equipped with our precision promotion technique, whereas in the latter they do not so. Figures 13 and 14 do not include πunif because this assignment with the precision promotion is identical to πours,1.

C.2 Ablation Study: Precision Demotion and Promotion

Figure 16 presents results omitted in Figure 5: training results of Res Net-18 on CIFAR-100 with πours,r, πours[inc],r, and πours[rand],r. The ﬁgure shows similar results to Figure 5 except that it shows smaller diﬀerences in memory-accuracy tradeoﬀbetween the three precision assignments.

Figure 17 presents results omitted in Figure 6: training results of four models on CIFAR-10 with πours,r and πours[no-promo],r. The ﬁgure shows similar results to Figure 6 except that the training of Res Net-18 on CIFAR-10 does not diverge even with πours[no-promo],r for all r values.

References for Appendix

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training Image Net in 1 Hour. ar Xiv:1706.02677, 2017.

Dhiraj D. Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A Study of BFLOAT16 for Deep Learning Training. ar Xiv:1905.12322, 2019.

Richard M. Karp. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations, pp. 85 103, 1972.

Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. ar Xiv:1404.5997, 2014.

kuangliu. https://github.com/kuangliu/pytorch-cifar, 2021.

Gaurav Pathak. https://github.com/gsp-27/pytorch_Squeezenet, 2020.

Py Torch. Documentation of torch.amp. https://pytorch.org/docs/stable/amp.html, 2022a.

Py Torch. https://github.com/pytorch/vision/tree/main/torchvision/models, 2022b.

Py Torch. https://github.com/pytorch/vision/tree/main/references/classification, 2022c.

Py Torch. https://github.com/pytorch/vision/tree/main/references/classification#resnext, 2022d.

Stanford Vision Lab. https://image-net.org/download.php, 2020.

Published in Transactions on Machine Learning Research (06/2023)

Figure 9: A zoomed-in version of Figure 3 (left). Results of training Shuﬄe Net-v2 on Image Net with πfp32, πunif (Micikevicius et al., 2018), πop (Sun et al., 2019), πop (Kalamkar et al., 2019), and πours,r. Each line shows the average training trajectory for each precision assignment; πours,r is colored from navy to yellow (darker for smaller r).

Published in Transactions on Machine Learning Research (06/2023)

(a) CIFAR-100, Squeeze Net

(b) CIFAR-100, Squeeze Net

(c) CIFAR-100, Shuﬄe Net-v2

(d) CIFAR-100, Shuﬄe Net-v2

(e) CIFAR-100, Mobile Net-v2

(f) CIFAR-100, Mobile Net-v2

(g) CIFAR-100, Res Net-18

(h) CIFAR-100, Res Net-18

Figure 10: Continued from Figure 4. Memory-accuracy tradeoﬀs of πunif (Micikevicius et al., 2018), πop (Sun et al., 2019), πop (Kalamkar et al., 2019), and πours,r for smaller variants of four models on CIFAR-100. The variant models have width multiplier 0.5 (marked by ) or 0.1 (marked by ). Top-right points are better than bottom-left ones. In all but one plots, there are s above and to the right of and , respectively; even in the one plot (g), s have almost the same tradeoﬀs to and . In three of all plots, has much smaller y-values than other points; is missing in (h) as its y-value is too small.

Published in Transactions on Machine Learning Research (06/2023)

(a) CIFAR-10, Squeeze Net

(b) CIFAR-100, Squeeze Net

(c) CIFAR-100, Squeeze Net

(d) CIFAR-10, Shuﬄe Net-v2

(e) CIFAR-100, Shuﬄe Net-v2

(f) CIFAR-100, Shuﬄe Net-v2

(g) CIFAR-10, Mobile Net-v2

(h) CIFAR-100, Mobile Net-v2

(i) CIFAR-100, Mobile Net-v2

(j) CIFAR-10, Res Net-18

(k) CIFAR-100, Res Net-18

(l) CIFAR-100, Res Net-18

Figure 11: Training trajectories for the conﬁgurations shown in Figure 4. Each line shows the average training trajectory for each precision assignment. πours,r is colored from navy to yellow (darker for smaller r).

Published in Transactions on Machine Learning Research (06/2023)

(a) CIFAR-100, Squeeze Net

(b) CIFAR-100, Squeeze Net

(c) CIFAR-100, Shuﬄe Net-v2

(d) CIFAR-100, Shuﬄe Net-v2

(e) CIFAR-100, Mobile Net-v2

(f) CIFAR-100, Mobile Net-v2

(g) CIFAR-100, Res Net-18

(h) CIFAR-100, Res Net-18

Figure 12: Training trajectories for the conﬁgurations shown in Figure 10. Each line shows the average training trajectory for each precision assignment. πours,r is colored from navy to yellow (darker for smaller r).

Published in Transactions on Machine Learning Research (06/2023)

Figure 13: Results corresponding to Figure 3. The only diﬀerence from Figure 3 is that πop and πop here are equipped with our precision promotion technique, whereas πop and πop in the previous ﬁgure do not so.

(a) CIFAR-10, Squeeze Net

(b) CIFAR-100, Squeeze Net

(c) CIFAR-100, Squeeze Net

(d) CIFAR-10, Shuﬄe Net-v2

(e) CIFAR-100, Shuﬄe Net-v2

(f) CIFAR-100, Shuﬄe Net-v2

(g) CIFAR-10, Mobile Net-v2

(h) CIFAR-100, Mobile Net-v2

(i) CIFAR-100, Mobile Net-v2

(j) CIFAR-10, Res Net-18

(k) CIFAR-100, Res Net-18

(l) CIFAR-100, Res Net-18

Figure 14: Results corresponding to Figure 4. The only diﬀerence from Figure 4 is that πop and πop here are equipped with our precision promotion technique, whereas πop and πop in the previous ﬁgure do not so.

Published in Transactions on Machine Learning Research (06/2023)

(a) CIFAR-100, Squeeze Net

(b) CIFAR-100, Squeeze Net

(c) CIFAR-100, Shuﬄe Net-v2

(d) CIFAR-100, Shuﬄe Net-v2

(e) CIFAR-100, Mobile Net-v2

(f) CIFAR-100, Mobile Net-v2

(g) CIFAR-100, Res Net-18

(h) CIFAR-100, Res Net-18

Figure 15: Results corresponding to Figure 10. The only diﬀerence from Figure 10 is that πop and πop here are equipped with our precision promotion technique, whereas πop and πop in the previous ﬁgure do not so.

Published in Transactions on Machine Learning Research (06/2023)

(a) Res Net-18

Figure 16: Continued from Figure 5. Memory-accuracy tradeoﬀs of πours,r, πours[inc],r, and πours[rand],r for Res Net-18 on CIFAR-100. Observe that s are above and to the right of other points in nearly all cases.

(a) CIFAR-10, Squeeze Net

(b) CIFAR-10, Shuﬄe Net-v2

(c) CIFAR-10, Mobile Net-v2

(d) CIFAR-10, Res Net-18 Figure 17: Continued from Figure 6. Training four models on CIFAR-10 with πours,r and πours[no-promo],r. Column 1: Training trajectories of πours[no-promo],r for diﬀerent r; colors denote r values (darker for smaller r). Column 2: Top-5 overﬂow ratios of tensors at each epoch, for the highlighted trajectory in (a); the largest ratio is blue and the ﬁfth largest red. Column 3: Memory-accuracy tradeoﬀs of πours,r and πours[no-promo],r. Column 4: Low-precision ratio when training ends vs. when training starts, for πours,r and πours[no-promo],r.