Dimension-Free Bounds for Low-Precision Training

Zheng Li, IIIS, Tsinghua University, lzlz19971997@gmail.com
Christopher De Sa, Cornell University, cdesa@cs.cornell.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Low-precision training is a promising way of decreasing the time and energy cost of training machine learning models. Previous work has analyzed low-precision training algorithms, such as low-precision stochastic gradient descent, and derived theoretical bounds on their convergence rates. These bounds tend to depend on the dimension of the model d in that the number of bits needed to achieve a particular error bound increases as d increases. In this paper, we derive new bounds for low-precision training algorithms that do not contain the dimension d, which lets us better understand what affects the convergence of these algorithms as parameters scale. Our methods also generalize naturally to let us prove new convergence bounds on low-precision training with other quantization schemes, such as low-precision floating-point computation and logarithmic quantization.

1 Introduction

As machine learning models continue to scale to target larger problems on bigger data, the task of training these models quickly and efficiently becomes an ever-more-important problem. One promising technique for doing this is low-precision computation, which replaces the 32-bit or 64-bit floating-point numbers that are usually used in ML computations with smaller numbers, often 8-bit or 16-bit fixed-point numbers. Low-precision computation is a broadly applicable technique that has received a lot of attention, especially for deep learning, and specialized hardware accelerators have been developed to support it [2, 3, 14].

A major application for low-precision computation is the training of ML models using empirical risk minimization. This training is usually done using stochastic gradient descent (SGD), and most research in low-precision training has focused on low-precision versions of SGD. While most of this work is empirical [4-7, 11, 12, 15, 16, 18, 20, 22, 23], significant research has also been done in the theoretical analysis of low-precision training. This theoretical work has succeeded in proving bounds on the convergence rate of low-precision SGD and related low-precision methods in various settings, including for convex [8, 21] and non-convex objectives [1, 9, 17]. One common characteristic of these results is that the bounds tend to depend on the dimension d of the model being learned (equivalently, d is the number of parameters). For example, [17] gives the convergence bound

$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{(1 + \log(T+1))\,\sigma_{\max}^2}{2\mu T} + \frac{\sqrt{d}\,\sigma_{\max}\,\delta}{2} \tag{1}$$

where the objective f is strongly convex with parameter µ, low-precision SGD outputs $\bar w_T$ after T iterations, $w^*$ is the true global minimizer of the objective, $\sigma_{\max}^2$ is an upper bound on the second moment of the stochastic gradient samples $\mathbf{E}\|\nabla f(w)\|^2 \le \sigma_{\max}^2$, and δ is the quantization step, the difference between adjacent numbers in the low-precision format. Notice that, as $T \to \infty$, this bound shows convergence down to a level of error that increases with the dimension d. Equivalently, in order to achieve the same level of error as d increases, we would need to use more bits of quantization to make δ smaller. Similar dimension-dependent results, where either the error or the
number of bits needed increases with d, can also be seen in other work on low-precision training algorithms [1, 8, 21].

This dependence on d is unsatisfying because the motivation for low-precision training is to tackle large-scale problems on big data, where d can range up to $10^8$ or more for commonly used models [19]. For example, to compensate for a factor of $\sqrt{d} = 10^4$ in (1), we could add bits to decrease the quantization step δ by a factor of $\sqrt{d}$, but this would require adding $\log_2(10^4) \approx 13$ bits, which is significant compared to the 8 or 16 bits that are commonly used in low-precision training.

In this paper, we address this problem by proving bounds on the convergence of LP-SGD [17] that do not contain the dimension d in the expression. Our main technique for doing so is a tight dimension-free bound on the expected quantization error of the low-precision stochastic gradients in terms of the ℓ1-norm. Our results are summarized in Table 1, and we make the following contributions:

- We describe conditions under which we can prove a dimension-free bound on the convergence of SGD with fixed-point, quantized iterates on both convex and non-convex problems.
- We study non-linear quantization schemes, in which the representable low-precision numbers are distributed non-uniformly. We prove dimension-free convergence bounds for SGD using logarithmic quantization [16], and we show that using logarithmic quantization can reduce the number of bits needed for LP-SGD to provably converge.
- We study quantization using low-precision floating-point numbers, and we present theoretical analysis that suggests how to assign a given number of bits to exponent and mantissa to optimize the accuracy of training algorithms.
- We validate our results experimentally.

2 Related Work

Motivated by the practical implications of faster machine learning, much work has been done on low-precision training. This work can be roughly divided into two groups. The first focuses on training deep models with low-precision weights, to be later used for faster inference. For some applications, methods of this type have achieved good results with very low-precision models: for example, binarized [5, 12, 18] and ternary networks [23] have been observed to be effective (although, as is usual for deep learning, they lack theoretical convergence results). However, these approaches are still typically trained with full-precision iterates: the goal is faster inference, not faster training (although faster training is often achieved as a bonus side-effect).

A second line of work on low-precision training, which is applied to both DNN training and non-deep-learning tasks, focuses on making various aspects of SGD low-precision, while still trying to solve the same optimization problem as the full-precision version. The most common way to do this is to make the iterates of SGD (the $w_t$ in the SGD update step $w_{t+1} = w_t - \alpha_t \nabla f_t(w_t)$) stored and computed in low-precision arithmetic [4, 8, 9, 11, 17]. This is the setting we will focus on most in this paper, because it has substantial theoretical prior work which exhibits the dimension-dependence we set out to study [1, 8, 17, 21]. The only paper we found with a bound that was not dimension-dependent was De Sa et al. [9], but in that paper the authors required that the gradient samples be 1-sparse (have only one nonzero entry), which is not a realistic assumption for most ML training tasks.
In addition to quantizing the iterates, other work has studied quantizing the training set [21] and numbers used to communicate among parallel workers [1]. We expect that our results on dimension-free bounds will be complementary with these existing theoretical approaches, and we hope that they can help to explain the success of the exciting empirical work in this area.

3 Dimension-Free Bounds for SGD

In this section, we analyze the performance of stochastic gradient descent (SGD) using low-precision training. Though there are numerous variants of this algorithm, SGD remains the de facto algorithm used most for machine learning. We will start by describing SGD and how it can be made low-precision. Suppose we are trying to solve the problem

$$\text{minimize: } f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) \quad \text{over: } w \in \mathbb{R}^d. \tag{2}$$

SGD solves this problem iteratively by repeatedly running the update step

$$w_{t+1} = w_t - \alpha \nabla f_{i_t}(w_t) \tag{3}$$

where α is the step size¹ or learning rate, and $i_t$ is the index of a component function chosen randomly and uniformly at each iteration from $\{1, \ldots, n\}$. To make this algorithm low-precision, we quantize the iterates (the vectors $w_t$) and store them in a low-precision format. The standard format to use lets us represent numbers in a set

$$\mathrm{dom}(\delta, b) = \{-\delta \cdot 2^{b-1},\; -\delta \cdot (2^{b-1}-1),\; \ldots,\; -\delta,\; 0,\; \delta,\; 2\delta,\; \ldots,\; \delta \cdot (2^{b-1}-1)\}$$

with δ > 0 being the quantization gap, the distance between adjacent representable numbers, and $b \in \mathbb{N}$ being the number of bits we use [8]. Usually, δ is a power of 2, and this scheme is called fixed-point arithmetic. It is straightforward to encode numbers in this set as b-bit signed integers, by just multiplying or dividing by δ to convert to or from the encoded format, and we can even do many arithmetic computations on these numbers directly as integers. This is sometimes called linear quantization because the representable points are distributed uniformly throughout their range. However, as the gradient samples will produce numbers outside this set during iteration, we need some way to map these numbers to the set of numbers that we can represent. The standard way to do this is with a quantization function $Q(x): \mathbb{R} \to \mathrm{dom}(\delta, b)$. While many quantization functions have been proposed, the one typically used in theoretical analysis (which we will continue to use here) is randomized rounding. Randomized rounding, also known as unbiased rounding or stochastic rounding, rounds up or down at random such that $\mathbf{E}[Q(x)] = x$ whenever x is within the range of representable numbers (i.e. when $-\delta \cdot 2^{b-1} \le x \le \delta \cdot (2^{b-1}-1)$). When x is outside that range, we quantize it to the closest representable point. When we apply Q to a vector argument, it quantizes each of its components independently. Using this quantization function, we can write the update step for low-precision SGD (LP-SGD), which is a simple quantization of (3):

$$w_{t+1} = Q\left(w_t - \alpha \nabla f_{i_t}(w_t)\right). \tag{4}$$

As mentioned before, one common feature of prior bounds on the convergence of LP-SGD is that they depend on the number of dimensions d, whereas bounds on full-precision SGD under the same conditions don't. This difference is due to the fact that, when we quantize a number w, it increases its variance by

$$\mathbf{E}\left[(Q(w) - w)^2\right] \le \delta^2/4.$$

Observe that this inequality is tight since it holds as an equality when w is in the middle of two quantization points, e.g. $w = \delta/2$, as illustrated in Figure 1(a). When quantizing a vector $w \in \mathbb{R}^d$, the squared error can be increased by

$$\mathbf{E}\left[\|Q(w) - w\|_2^2\right] = \sum_{k=1}^d \mathbf{E}\left[(Q(w_k) - w_k)^2\right] \le \frac{\delta^2 d}{4} \tag{5}$$

and this bound is again tight.
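To make the quantization scheme concrete, here is a minimal Python sketch (all function names are ours, not from the paper) of randomized rounding onto dom(δ, b) and the LP-SGD update (4); the final lines Monte-Carlo-check unbiasedness and the per-coordinate variance bound $\mathbf{E}[(Q(w)-w)^2] \le \delta^2/4$.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, delta, b):
    """Randomized (unbiased) rounding onto dom(delta, b)."""
    lo, hi = -delta * 2 ** (b - 1), delta * (2 ** (b - 1) - 1)
    x = np.clip(x, lo, hi)            # values outside the range saturate to the endpoints
    k = np.floor(x / delta)           # index of the representable point just below x
    p = x / delta - k                 # probability of rounding up, so that E[Q(x)] = x
    return delta * (k + (rng.random(np.shape(x)) < p))

def lp_sgd_step(w, grad, alpha, delta, b):
    """One LP-SGD update (4): take a full-precision step, then quantize the iterate."""
    return quantize(w - alpha * grad, delta, b)

# Monte-Carlo check of unbiasedness and of E[(Q(w) - w)^2] <= delta^2 / 4
w = np.full(100000, 0.3)
q = quantize(w, delta=0.25, b=8)
print(q.mean(), ((q - w) ** 2).mean(), 0.25 ** 2 / 4)
```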
This variance inequality is the source of the d term in analyses of LP-SGD, and the tightness of the bound leads to the natural belief that the d term is inherent, and that low-precision results are inevitably dimension-dependent. However, we propose that if we can instead bound the variance in (5) with some properties of the problem itself that are not inherently dependent on d, we can achieve a result that is dimension-free.

One way to do this is to look at the variance graphically. Figure 1(a) plots the quantization error as a function of w along with the bound in (5). Notice that the squared error looks like a series of parabolas, and the bound in (5) is tight at the top of those parabolas, but loose elsewhere. Instead, suppose we want to do the opposite and produce a bound that is tight when the error is zero (at points in dom(δ, b)). To do this, we observe that

$$\mathbf{E}\left[(Q(w) - w)^2\right] \le \delta\,|w - z| \quad \text{for any } z \in \mathrm{dom}(\delta, b).$$

This bound is also tight when z is adjacent to w, and we plot it in Figure 1(a) as well. The natural vector analog of this is

$$\mathbf{E}\left[\|Q(w) - w\|_2^2\right] \le \sum_{k=1}^d \delta\,|w_k - z_k| = \delta\,\|w - z\|_1, \quad \forall z \in \mathrm{dom}(\delta, b)^d \tag{6}$$

where $\|\cdot\|_1$ denotes the ℓ1-norm. This is a dimension-free bound we can use to replace (5) to bound the convergence of LP-SGD and other algorithms. However, this replacement is nontrivial as our bound is now non-constant: it depends on w, which is a variable updated each iteration. Also, in order to bound this new ℓ1-norm term, we will need some new assumptions about the problem. Next, we will state these assumptions, along with the standard assumptions used in the analysis of SGD for both convex and non-convex objectives, and then we will use them to present our dimension-free bound on the convergence of SGD.

¹ Usually in SGD the step size is decreased over time, but here for simplicity we consider a constant learning rate schedule.

[Figure 1: Actual quantization variance $\mathbf{E}[(Q(w)-w)^2]$ in one dimension, together with the tight upper bound that we introduce, plotted with the minimum taken over all possible z: (a) linear quantization error and two possible tight upper bounds; (b) variance and bound for logarithmic quantization; (c) variance and bound for denormal floating-point quantization.]

Assumption 1. All the loss functions $f_i$ are differentiable, and their gradients are L-Lipschitz continuous in the sense of the ℓ2-norm; that is,

$$\forall i \in \{1, 2, \ldots, n\},\; \forall x, y \in \mathbb{R}^d, \quad \|\nabla f_i(x) - \nabla f_i(y)\|_2 \le L\,\|x - y\|_2.$$

Assumption 2. All the gradients of the loss functions $f_i$ are L1-Lipschitz continuous in the sense of the ℓ1-norm relative to the ℓ2-norm; that is,

$$\forall i \in \{1, 2, \ldots, n\},\; \forall x, y \in \mathbb{R}^d, \quad \|\nabla f_i(x) - \nabla f_i(y)\|_1 \le L_1\,\|x - y\|_2.$$

These two assumptions simply express Lipschitz continuity in different norms. Assumption 1 is a standard assumption in the analysis of SGD on convex objectives, and has been applied in the low-precision case as well in prior work [8]. Assumption 2 is analogous to Assumption 1, except we are bounding the ℓ1-norm instead of the ℓ2-norm. This holds naturally (with a reasonable value of $L_1$) for many problems, in particular problems for which the gradient samples are sparse.
Assumption 3. The total loss function f is µ-strongly convex for some µ > 0:

$$\forall w, v, \quad f(w) - f(v) \ge \frac{\mu}{2}\,\|w - v\|_2^2 + (w - v)^T \nabla f(v).$$

This is a standard assumption that bounds the curvature of the loss function f, and is satisfied for many classes of convex objectives. When an objective is strongly convex and Lipschitz continuous, it is standard to say it has condition number κ = L/µ, and here we extend this to say it has L1 condition number $\kappa_1 = L_1/\mu$. For our analysis of the non-convex case, we do not make this assumption.

Assumption 4. If the objective is convex, we assume that the gradient of each loss function is bounded by some constant near the optimal point $w^*$ in the sense of the ℓ1 and ℓ2 norms; that is,

$$\mathbf{E}\left[\|\nabla f_i(w^*)\|_2^2\right] \le \sigma^2, \quad \mathbf{E}\left[\|\nabla f_i(w^*)\|_1\right] \le \sigma_1.$$

If the objective is non-convex, there is not necessarily a single optimal point, so we just assume each loss function has a global bound on its gradient: for any w,

$$\mathbf{E}\left[\|\nabla f_i(w)\|_2^2\right] \le \sigma^2, \quad \mathbf{E}\left[\|\nabla f_i(w)\|_1\right] \le \sigma_1.$$

In the convex case, this assumption constrains the gradient of each loss function at the optimal point. We know $\nabla f(w^*) = \frac{1}{n}\sum_i \nabla f_i(w^*) = 0$, so it is intuitive that each $\nabla f_i(w^*)$ can be bounded by some value. In the non-convex case, however, we need a global bound on the gradient instead of just at the optimum. This is a natural assumption to make and it has been used in a lot of other work in this area. Note that this assumption only needs to hold under the expectation over all $f_i$.

Table 1: Summary of our dimension-free results compared with prior work. The values report the number of bits needed, according to the theoretical bound, for the LP-SGD [17] algorithm to achieve an expected objective gap $\mathbf{E}[f(w) - f(w^*)] \le \epsilon$ in the convex case, and an expected squared gradient $\mathbf{E}\|\nabla f(\bar w)\|_2^2 \le \epsilon$ in the non-convex case, when we let step size $\alpha \to 0$ and epoch length $T \to \infty$. Here we let R denote the radius of the range of numbers representable in the low-precision format and assume $\|w^*\|_2 = \Theta(R)$. The rest of the parameters can be found in the assumptions introduced above.

    OBJECTIVE CLASS                            CONVEX                                                       NON-CONVEX
    NUMBER OF BITS NEEDED FOR                  $\mathbf{E}[f(w) - f(w^*)] \le \epsilon$                     $\mathbf{E}\|\nabla f(\bar w)\|_2^2 \le \epsilon$
    PRIOR DIMENSION-DEPENDENT BOUND            $\log_2 O(\sqrt{d}\,R\sigma_{\max}/\epsilon)$                --
    OUR DIMENSION-FREE BOUND                   $\log_2 O(R\sigma_1/\epsilon)$                               $\log_2 O(LR\sigma_1/\epsilon)$
    DIMENSION-FREE WITH LOG. QUANTIZATION      $\log_2 O\!\left(\frac{R\sigma}{\epsilon}\log(1+\frac{\sigma_1}{\sigma})\right)$   $\log_2 O\!\left(\frac{LR}{\sqrt{\epsilon}}\log(1+\frac{\sigma_1}{\sqrt{\epsilon}})\right)$

For non-convex cases, we need the following additional assumption.

Assumption 5. The variance of the gradient of each loss function is bounded by some constant $\sigma_0^2$:

$$\forall w, \quad \mathrm{Var}(\nabla f_i(w)) = \mathbf{E}\left[\|\nabla f_i(w) - \nabla f(w)\|_2^2\right] \le \sigma_0^2.$$

With these assumptions, we can prove the following theorems for low-precision SGD.

Theorem 1. Suppose that we run LP-SGD on an objective that satisfies Assumptions 1-4, with step size $\alpha < \frac{1}{2\kappa^2\mu}$. After T LP-SGD update steps (4), select $\bar w_T$ uniformly at random from $\{w_0, w_1, \ldots, w_{T-1}\}$. Then, the expected objective gap of $\bar w_T$ is bounded by

$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{1}{2\alpha T}\|w_0 - w^*\|_2^2 + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2\kappa_1^2\mu}{4}.$$

Theorem 2. Suppose that we run LP-SGD on an objective that is non-convex and satisfies Assumptions 1, 4, and 5, with constant step size α. After T LP-SGD update steps, select $\bar w_T$ uniformly at random from $\{w_0, w_1, \ldots, w_{T-1}\}$. Then the expected squared gradient norm of $\bar w_T$ is bounded by

$$\mathbf{E}\left[\|\nabla f(\bar w_T)\|_2^2\right] \le \frac{2}{2\alpha - \alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - \alpha L}.$$

The first theorem bounds the expected gap between the objective value of the point we output after T iterations and the optimal value for a convex objective, and the second bounds the expected squared gradient norm after T iterations, where $f^*$ is the global minimum of the objective f.
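To see how the Theorem 1 bound behaves as the precision is varied, the following minimal sketch evaluates its asymptotic noise floor $(\alpha\sigma^2 + \delta\sigma_1)/2 + \delta^2\kappa_1^2\mu/4$ for several bit widths; all constants here are made up purely for illustration and are not values from the paper. Note that the dimension d never enters the computation.

```python
# Evaluate the asymptotic (T -> infinity) noise floor of Theorem 1 for several
# bit widths b, with a grid spanning [-R, R]; constants are illustrative only.
mu, L1, sigma, sigma1, R, alpha = 1.0, 40.0, 1.5, 30.0, 1.0, 1e-3
kappa1 = L1 / mu

for b in (6, 8, 12, 16):
    delta = 2 * R / 2 ** b           # quantization gap for a b-bit grid on [-R, R]
    floor = (alpha * sigma ** 2 + delta * sigma1) / 2 + delta ** 2 * kappa1 ** 2 * mu / 4
    print(f"b={b:2d}  delta={delta:.2e}  noise floor <= {floor:.3e}")
```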
By choosing an appropriate step size we can achieve convergence at a 1/T rate, while the limit we converge to depends only on dimension-free factors. Meanwhile, as mentioned in the first section, previous work gives a dimension-dependent bound (1) for the problem, which also converges at a 1/T rate.² Therefore our result guarantees a dimension-free convergence limit without weakening the convergence rate.

It is important to note that, because the dimension-dependent bound in (5) was tight, we should not expect our new result to improve upon the previous theory in all cases. In the worst case, $\kappa_1 = \sqrt{d}\,\kappa$ and similarly $\sigma_1 = \sqrt{d}\,\sigma$; this follows from the fact that for vectors in $\mathbb{R}^d$, the norms are related by the inequality $\|x\|_1 \le \sqrt{d}\,\|x\|_2$. Substituting this into our result produces a dimension-dependent bound again. This illustrates the importance of introducing the new parameters $\kappa_1$ and $\sigma_1$ and requiring that they be bounded; if we could not express our bound in terms of these parameters, the best we could do here is recover a dimension-dependent bound. This means that we achieve the standard result of a dimension-dependent bound in the worst case, but in other cases our results are strictly better.

² Previous work (1) used a decaying step size while ours uses a constant step size to achieve a better result.

[Figure 2: (a) Convergence of the training loss gap $f(w) - f(w^*)$ for full-precision (fp) SGD and LP-SGD, showing the effect of different model sizes d ∈ {128, 1024, 8192} and precisions b ∈ {8, 6}; (b) the size of the noise ball is not significantly affected by model size d; (c) the size of the noise ball does depend on $\sigma_1$, especially when precision is low. Panels (b) and (c) plot the asymptotic average loss gap from (a) as a function of model size d and of $\sigma_1$.]

Experiments. Next, we validate our theoretical results experimentally on convex problems. To do this, we analyzed how the size of the noise floor of convergence of SGD and LP-SGD varies as the dimension is changed for a class of synthetic problems. Importantly, we needed to pick a class of problems for which the parameters L, $L_1$, µ, σ, and $\sigma_1$ did not change as we changed the dimension d. To do this, we chose a class of synthetic linear regression models with loss components sampled independently and identically as

$$f_i(w) = \frac{1}{2}\left(\tilde x^T w - \tilde y\right)^2$$

where $\tilde x$ is a sparse vector sampled to have s nonzero entries, each of which is sampled uniformly from $\{-1, 1\}$, and $\tilde y$ is sampled from $\mathcal{N}(\tilde x^T w^*, \beta^2)$ for some variance parameter β. Importantly, the nonzero entries of $\tilde x$ were chosen non-uniformly such that $\Pr[\tilde x_i \ne 0] = p_i$ for some probabilities $p_i$ which decrease as i increases; this lets us ensure that µ remains constant as d is increased. For simplicity, we sampled a fresh loss component of this form at each SGD iteration, which is sometimes called the online setting.
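The following is a minimal Python sketch of this synthetic sampling scheme. The paper fixes only the endpoints $p_1 = 0.9$ and $p_d = 0.001$ (stated below) and that the $p_i$ decrease with i; the geometric interpolation between those endpoints and the use of weighted sampling without replacement to pick the s nonzero coordinates are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, s, beta = 1024, 16, 0.2
# inclusion weights decreasing from 0.9 to 0.001; geometric interpolation is our assumption
p = np.geomspace(0.9, 0.001, d)
w_star = rng.uniform(-0.5, 0.5, d)

def sample_component():
    """One loss sample f_i(w) = 0.5 * (x^T w - y)^2 with an s-sparse +/-1 feature vector x."""
    idx = rng.choice(d, size=s, replace=False, p=p / p.sum())
    x = np.zeros(d)
    x[idx] = rng.choice([-1.0, 1.0], size=s)
    y = x @ w_star + rng.normal(0.0, beta)
    return x, y, (lambda w: x * (x @ w - y))   # feature, target, and gradient of this component

x, y, grad = sample_component()
print(np.count_nonzero(x), grad(np.zeros(d))[:5])
```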
It is straightforward to derive that for this problem

$$\mu = p_d, \quad L = s, \quad L_1 = s\sqrt{s}, \quad \sigma^2 = \beta^2 s, \quad \sigma_1 = \sqrt{2/\pi}\,\beta s.$$

We set α = 0.01, β = 0.2, $p_1 = 0.9$, $p_d = 0.001$, and s = 16, we chose each entry of $w^*$ uniformly from [−1/2, 1/2], and we set δ such that the low-precision numbers would range from −1 to 1. We set these parameters so that our model satisfies Assumptions 1 to 4 while maintaining the smallest values of the parameters L, $L_1$, σ, and $\sigma_1$; choosing a different set of parameters will lead to a looser theoretical bound.

Figure 2(a) shows the convergence of SGD and LP-SGD as the dimension d is changed, for both 8-bit and 6-bit quantization. Notice that while changing d has an effect on the initial convergence rate for both SGD and LP-SGD, it has no effect on the noise ball size, the eventual loss gap that the algorithm converges to. Figure 2(b) measures this noise ball size more explicitly as the dimension is changed: it reports the loss gap averaged across the second half of the iterates. Notice that as the dimension d is changed, the average loss gap is almost unchanged, even for very low-precision methods for which the precision does significantly affect the size of the noise ball. This validates our dimension-free bounds, and shows that they can describe the actual dependence on d in at least one case.

Figure 2(c) validates our results in the opposite way: it looks at how this gap changes as our new parameters $\sigma_1$ and $L_1$ change while d, µ, and σ are kept fixed. To do this, we fixed d = 1024 and changed s across a range, setting $\beta = 0.8/\sqrt{s}$, which keeps $\sigma^2$ constant as s is changed: this has the effect of changing $\sigma_1$ (and, as a side effect, $L_1$ and L). We can see from Figure 2(c) that changing $\sigma_1$ in this way has a much greater effect on LP-SGD than on SGD. This validates our theoretical results, and suggests that $\sigma_1$ and $L_1$ can effectively determine the effect of low-precision compute on SGD.

4 Non-linear Quantization

Up until now, most theoretical work in the area of low-precision machine learning has been on linear quantization, where the distance between adjacent quantization points is a constant value δ. Another option is non-linear quantization (NLQ), in which we quantize to a set of points that are non-uniformly distributed. This approach has been shown to be effective for accelerating deep learning in some settings [16]. In general, we can quantize to a set of points

$$D = \{-q_n, \ldots, -q_1, q_0, q_1, \ldots, q_{n-1}\},$$

and, just like with linear quantization, we can still use a quantization function Q(w) with randomized rounding that rounds up or down to a number in D in such a way that $\mathbf{E}[Q(w)] = w$ for $w \in [-q_n, q_{n-1}]$. When we consider the quantization variance here, the natural dimension-dependent bound would be

$$\mathbf{E}\left[\|Q(w) - w\|_2^2\right] \le \frac{d}{4}\max_i\,(q_i - q_{i-1})^2.$$

This is still a tight bound, since it holds with equality for a number in the middle of the two adjacent quantization points that are farthest apart. However, when applied in the analysis of LP-SGD, this bound induces poor performance and often substantially over-estimates the actual error. Here we discuss a specific NLQ method and use it to introduce a tight bound on the quantization variance. This method has been previously studied as logarithmic quantization or µ-law quantization, and is defined recursively by

$$q_0 = 0, \quad q_{i+1} - q_i = \delta + \zeta q_i \tag{7}$$

where δ > 0 and ζ > 0 are fixed parameters. Note that this includes linear quantization as a special case by setting ζ = 0.
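As a concrete illustration of the recursion (7), here is a minimal Python sketch (helper names are ours) that builds the positive half of a logarithmic grid and performs unbiased randomized rounding onto it; the negative points are obtained symmetrically.

```python
import numpy as np

def log_quant_points(delta, zeta, n):
    """Positive points from (7): q_0 = 0, q_{i+1} = q_i + delta + zeta * q_i."""
    q = [0.0]
    for _ in range(n):
        q.append(q[-1] + delta + zeta * q[-1])
    return np.array(q)

def quantize_nlq(x, points, rng):
    """Unbiased randomized rounding of a nonnegative scalar onto a sorted grid."""
    x = float(np.clip(x, points[0], points[-1]))
    i = min(int(np.searchsorted(points, x, side="right")) - 1, len(points) - 2)
    lo, hi = points[i], points[i + 1]
    p = (x - lo) / (hi - lo)          # round up with probability p so that E[Q(x)] = x
    return hi if rng.random() < p else lo

rng = np.random.default_rng(0)
pts = log_quant_points(delta=0.01, zeta=0.1, n=64)
print(pts[:4], pts[-1])               # the gap q_{i+1} - q_i grows with the magnitude of q_i
print(np.mean([quantize_nlq(0.5, pts, rng) for _ in range(100000)]))  # ~0.5 (unbiased)
```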
It turns out that we can prove a tight dimension-free bound on the quantization variance of this scheme. First, we introduce the following definition.

Definition 1. An unbiased quantization function Q satisfies the dimension-free variance bound with parameters δ, ζ, and η if for all $w \in [-q_n, q_{n-1}]^d$ and all $z \in D^d$,

$$\mathbf{E}\left[\|Q(w) - w\|_2^2\right] \le \delta\,\|w - z\|_1 + \zeta\,\|z\|_2\,\|w - z\|_2 + \eta\,\|w - z\|_2^2.$$

We can prove that our logarithmic quantization scheme satisfies this bound.

Lemma 1. The logarithmic quantization scheme (7) satisfies the dimension-free variance bound with parameters δ, ζ, and $\eta = \frac{\zeta^2}{4(1+\zeta)}$.

Notice that this bound becomes identical to the linear quantization bound (6) when ζ = 0, so this result is a strict generalization of our results from the linear quantization case. With this setup, we can apply NLQ to the low-precision training algorithms we have studied earlier in this paper.

Theorem 3. Suppose that we run LP-SGD on a convex objective that satisfies Assumptions 1-4, using a quantization scheme that satisfies the dimension-free variance bound of Definition 1. If $\zeta < 1/\kappa$ and the step size α is small enough, then

$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{\|w_0 - w^*\|_2^2}{2\alpha T} + \frac{(1+\eta)\,\alpha\sigma^2 + \delta\sigma_1 + \zeta\sigma\|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4\mu}.$$

For non-convex objectives, we need to assume a bound on the iterates we deal with.

Assumption 6. The scale of the iterates is bounded by some constant $R_0$, i.e. $\forall t,\; \|w_t\|_2 \le R_0$.

Theorem 4. Suppose that we run LP-SGD on a non-convex objective that satisfies Assumptions 1, 4, 5, and 6, with constant step size $\alpha < \frac{1}{2(\eta+1)L}$, using a quantization scheme that satisfies the dimension-free variance bound of Definition 1. Then

$$\mathbf{E}\left[\|\nabla f(\bar w_T)\|_2^2\right] \le \frac{2(f(w_0) - f^*)}{\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{1}{2}L^2\zeta^2 R_0^2.$$

If we fix the representable range R (the largest-magnitude values representable in the low-precision format) and choose our quantization parameters optimally, we get the result that the number of bits we need to achieve objective gap or expected squared gradient ϵ is $\log_2 O\!\left(\frac{R\sigma}{\epsilon}\log\left(1 + \frac{\sigma_1}{\sigma}\right)\right)$ and $\log_2 O\!\left(\frac{LR}{\sqrt{\epsilon}}\log\left(1 + \frac{\sigma_1}{\sqrt{\epsilon}}\right)\right)$ respectively (as is shown in Table 1). These bounds are notable because even in the worst case, where we do not have a bound on $\sigma_1$ and must use $\sigma_1 \le \sqrt{d}\,\sigma$, which recovers the dimension term, the bounds still manage to hide it within a log term. This greatly decreases the effect of the dimension, and suggests that NLQ may be a promising technique to use for low-precision training at scale. Also note that, although the first bound holds only when $\zeta < \mu/L$, which to some extent limits how quickly the strides of logarithmic quantization can grow, the bound µ/L is independent of σ and $\sigma_1$; thus this effect of pushing $\sigma_1$ into a log term is independent of the setting of ζ.

Floating point. Next, we look at another type of non-linear quantization that is of great practical use: floating-point quantization (FPQ). Here, the quantization points are simply floating-point numbers with some fixed number of exponent bits $b_e$ and mantissa bits $b_m$. Floating-point numbers are represented in the form

$$(-1)^{\text{sign bit}} \cdot 2^{\text{exponent} - \text{bias}} \cdot (1.m_1 m_2 m_3 \ldots m_{b_m})_2 \tag{8}$$

where the exponent is a $b_e$-bit unsigned number, the $m_i$ are the $b_m$ bits of the mantissa, and the bias is a term that sets the range of the representable numbers by determining the range of the exponent. In standard floating-point numbers, the exponent ranges over $[-2^{b_e-1}+2,\; 2^{b_e-1}-1]$, which corresponds to a bias of $2^{b_e-1}-1$. To make our results more general, we also consider a non-standard bias by defining a scaling factor $s = 2^{-(\text{bias} - \text{standard bias})}$; the standard bias setting corresponds to s = 1. We also consider the case of denormal floating-point numbers, which tries to address underflow by replacing the 1 in (8) with a 0 for the smallest exponent value.
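For intuition about this format, the following minimal Python sketch (our own helper, not from the paper) enumerates the positive normal floating-point values of the form (8) for a tiny format and shows that the relative gap between neighbors is at most $2^{-b_m}$, which is the ζ parameter that Lemma 2 below assigns to FPQ.

```python
import numpy as np

def normal_fpq_points(be, bm, s=1.0):
    """Positive normal floating-point values s * (1.m...)_2 * 2^y, cf. (8)."""
    nm, ne = 2 ** bm, 2 ** be
    exps = range(-ne // 2 + 2, ne // 2)   # exponent range [-2^(be-1)+2, 2^(be-1)-1]
    return np.array(sorted(s * (1 + x / nm) * 2.0 ** y for y in exps for x in range(nm)))

pts = normal_fpq_points(be=3, bm=2)
print(pts)                                           # spacing doubles with each exponent step
print((np.diff(pts) / pts[:-1]).max(), 2.0 ** -2)    # relative gap <= zeta = 2^-bm
```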
Under these conditions, we can prove that floating-point quantization satisfies the bound in Definition 1.

Lemma 2. The FPQ scheme using randomized rounding satisfies the dimension-free variance bound with parameters $\delta_{\text{normal}}$, ζ, and η for normal FPQ, and $\delta_{\text{denormal}}$, ζ, and η for denormal FPQ, where

$$\delta_{\text{normal}} = \frac{4s}{2^{2^{b_e-1}}}, \quad \delta_{\text{denormal}} = \frac{8s}{2^{2^{b_e-1}+b_m}}, \quad \zeta = 2^{-b_m}, \quad \eta = \frac{\zeta^2}{4(1+\zeta)}.$$

This bound can be immediately combined with Theorem 3 to produce dimension-free bounds on the convergence rate of low-precision floating-point SGD. If we are given a fixed number of total bits $b = b_e + b_m$, we can minimize this upper bound on the objective gap or the expected gradient to try to predict the best way to allocate our bits between the exponent and the mantissa.

Theorem 5. When using normal FPQ for a convex objective, given b total bits, the optimal number of exponent bits $b_e$ such that the asymptotic upper bound on the objective gap given by Theorem 3 is minimized lies in the interval between

$$\log_2\left(2b + 2\log_2\frac{2(\ln 2)\,s\,\sigma_1}{\sigma\|w^*\|_2}\right) \quad \text{and} \quad \log_2\left(2b + 2\log_2\frac{2(\ln 2)\,s\,L_1}{L\|w^*\|_2 + \sigma}\right).$$

Theorem 6. When using denormal FPQ for a convex objective, given b total bits, the optimal number of exponent bits $b_e$ such that the asymptotic upper bound on the objective gap, as $T \to \infty$ and $\alpha \to 0$, given by Theorem 3 is minimized lies in the interval between

$$\log_2\left(\frac{2}{\ln 2}\left(1 - W\!\left(-\frac{e\,\sigma\|w^*\|_2}{8s\sigma_1}\right)\right)\right) \quad \text{and} \quad \log_2\left(\frac{2}{\ln 2}\left(1 - W\!\left(-\frac{e\,(L\|w^*\|_2 + \sigma)}{8sL_1}\right)\right)\right)$$

where e denotes the base of the natural logarithm and W stands for the Lambert W function. In cases where neither of these two values exists, the noise ball size increases with $b_e$; thus $b_e = 2$ would be the optimal setting, which is equivalent to linear quantization.

Theorem 7. When using normal FPQ for a non-convex objective, given b total bits, the asymptotic upper bound on the expected squared gradient, as $T \to \infty$ and $\alpha \to 0$, given by Theorem 4 is minimized at

$$b_e = \log_2\left(\frac{2}{\ln 2}\,W\!\left(\frac{(\ln 2)^2\,s\,\sigma_1\,2^{2b}}{L R_0^2}\right)\right).$$

These theorems give us an idea of where the optimal setting of $b_e$ lies such that the theoretical asymptotic error or the expected gradient is minimized. When using normal FPQ, this optimal assignment of $b_e$ is O(log(b)), and for denormal FPQ the result is independent of b. Also, we found that for denormal FPQ used on non-convex objectives, the optimal setting of $b_e$ is the solution to a transcendental equation, which may not exist. This suggests that once the total number of bits grows past a threshold, we should assign most of or all the extra bits to the mantissa.

Experiments. For FPQ, we ran experiments on two different data sets. First, we ran LP-SGD on the same synthetic data set that we used for linear regression. Here we used normal FPQ with 20 bits in total, and we get the result in Figure 3(a). In this diagram, we plotted the empirical noise ball size, its theoretical upper bound, and the optimal interval for $b_e$ as Theorem 5 predicts.
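The theoretical curve and the predicted optimal $b_e$ in these experiments can be computed directly by scanning the Theorem 3 bound over all bit splits. A minimal sketch (with made-up illustrative constants, not the measured values) follows; we split $b = b_e + b_m$ as in the text.

```python
import numpy as np

def fpq_params(be, bm, s=1.0):
    """(delta, zeta, eta) for normal FPQ, per Lemma 2."""
    delta = 4 * s * 2.0 ** -(2 ** (be - 1))
    zeta = 2.0 ** -bm
    return delta, zeta, zeta ** 2 / (4 * (1 + zeta))

def noise_ball_bound(be, bm, alpha, mu, L, L1, sigma, sigma1, w_norm):
    """Asymptotic upper bound on the objective gap from Theorem 3."""
    delta, zeta, eta = fpq_params(be, bm)
    return ((1 + eta) * alpha * sigma ** 2 + delta * sigma1 + zeta * sigma * w_norm) / 2 \
        + (delta * L1 + zeta * L * w_norm + zeta * sigma) ** 2 / (4 * mu)

# made-up illustrative constants; scan every split of b = be + bm total bits
mu, L, L1, sigma, sigma1, w_norm, alpha, b = 1.0, 40.0, 700.0, 2.4, 30.0, 1.0, 1e-4, 16
best = min(range(2, b), key=lambda be: noise_ball_bound(be, b - be, alpha, mu,
                                                        L, L1, sigma, sigma1, w_norm))
print("predicted optimal number of exponent bits:", best)
```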
[Figure 3: Plots of noise ball size vs. $b_e$ when running SGD with low-precision FPQ: (a) training noise ball size of SGD using normal FPQ on the synthetic data set; (b) training noise ball size of SGD using 16-bit normal FPQ on MNIST, showing the empirical results, the theoretical noise ball bound, and the predicted optimal interval. Note the use of two y-axes in Figure 3(b) to make the series fit in one figure.]

As the figure shows, our theorem accurately predicts the optimal setting of exponent bits, which is 5 in this case, to minimize both the theoretical upper bound and the actual empirical result of the noise ball size, despite the theoretical upper bound being loose.

Second, we ran LP-SGD on the MNIST dataset [10]. To set up the experiment, we normalized the MNIST data to be in [0, 1] by dividing by 255, then subtracted out the mean for each feature. We ran multiclass logistic regression using an L2 regularization constant of $10^{-4}$ and a step size of $\alpha = 10^{-4}$, running for 500 total epochs (passes through the dataset) to be sure we converged. For this task, our (measured) problem parameters were L = 37.41, $L_1$ = 685.27, σ = 2.38, $\sigma_1$ = 29.11, and d = 784. In Figure 3(b), we plotted the observed loss gap, averaged across the last ten epochs, for LP-SGD using various 16-bit floating-point formats. We also plot our theoretical bound on the loss gap, and the predicted optimal number of exponent bits to use based on that bound. Our results show that even though our bound is very loose for this task, it still predicts the right number of bits to use with reasonable accuracy. This experiment also validates the use of IEEE standard half-precision floating-point numbers, which have 5 exponent bits, for this sort of task.

5 Conclusion

In this paper, we present dimension-free bounds on the convergence of SGD when applied to low-precision training. We point out the conditions under which such bounds hold, for both convex and non-convex objectives. We further extend our results to non-linear methods of quantization: logarithmic quantization and floating-point quantization. We analyze the performance of SGD under logarithmic quantization and demonstrate that NLQ is a promising method for reducing the number of bits required in low-precision training. We also present ways in which our theory could be used to suggest how to allocate bits between exponent and mantissa when FPQ is used. We hope that our work will encourage further investigation of non-linear quantization techniques.

References

[1] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707-1718, 2017.
[2] Doug Burger. Microsoft unveils Project Brainwave for real-time AI. Microsoft Research, Microsoft, 22, 2017.
[3] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, et al. Configurable clouds. IEEE Micro, 37(3):52-61, 2017.
[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
[5] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123-3131, 2015.
[6] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, et al. Mixed precision training of convolutional neural networks using integer operations. arXiv preprint arXiv:1802.00930, 2018.
[7] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent.
In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561-574. ACM, 2017.
[8] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R Aberger, Kunle Olukotun, and Christopher Ré. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.
[9] Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, pages 2674-2682, 2015.
[10] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141-142, 2012.
[11] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737-1746, 2015.
[12] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107-4115, 2016.
[13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In International Conference on Neural Information Processing Systems, pages 315-323, 2013.
[14] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1-12. ACM, 2017.
[15] Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K Bansal, William Constable, Oguz Elibol, Scott Gray, Stewart Hall, Luke Hornof, et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems, pages 1742-1752, 2017.
[16] Edward H Lee, Daisuke Miyashita, Elaina Chai, Boris Murmann, and S Simon Wong. LogNet: Energy-efficient neural networks using logarithmic computation. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 5900-5904. IEEE, 2017.
[17] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, pages 5813-5823, 2017.
[18] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542. Springer, 2016.
[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[20] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680, 2018.
[21] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035-4043, 2017.
[22] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[23] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
A Algorithm

In our work, we presented dimension-free bounds on the performance of low-precision SGD; here we present the algorithm in detail.

Algorithm 1 LP-SGD: Low-Precision Stochastic Gradient Descent
  given: n loss functions $f_i$, number of iterations T, step size α, and initial iterate $w_0$.
  given: low-precision quantization function Q.
  for t = 0 to T − 1 do
    sample $i_t$ uniformly from $\{1, 2, \ldots, n\}$
    quantize $w_{t+1} \leftarrow Q\left(w_t - \alpha\nabla f_{i_t}(w_t)\right)$
  end for
  return $\bar w_T$ (sampled uniformly from $\{w_0, w_1, \ldots, w_{T-1}\}$)

B Proofs for results in Table 1

As mentioned in the caption of Table 1, here we only consider the convergence limit; that is, we assume $\alpha \to 0$, $T \to \infty$, and we compute the minimum number of bits b we would require in order for the limit to be less than some small positive ε. Meanwhile, we denote the radius of the representable range by R and we assume $R = \|w^*\|_2$ without loss of generality, as this is the worst case for all our bounds that depend on $\|w^*\|_2$. Then in linear quantization, we have

$$q_{2^{b-1}-1} = \delta \cdot (2^{b-1}-1) \ge R,$$

and in non-linear quantization, we need

$$q_{2^{b-1}-1} = \frac{\delta}{\zeta}\left((1+\zeta)^{2^{b-1}-1} - 1\right) \ge R. \tag{9}$$

In the following proofs we will take these two inequalities to hold with equality. First, we treat convex objectives.

B.1 LP-SGD in previous work

In previous work, Li et al. [17] show that

$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{(1 + \log(T+1))\,G^2}{2\mu T} + \frac{\sqrt{d}\,G\,\delta}{2};$$

here we re-denote G as $\sigma_{\max}$ for concordance with our results, where $\sigma_{\max}^2$ is an upper bound on the second moment of the stochastic gradient samples $\mathbf{E}\|\nabla f(w)\|^2 \le \sigma_{\max}^2$. Substitute $\delta = \frac{R}{2^{b-1}-1}$, set the limit (as $\alpha \to 0$ and $T \to \infty$) to be ε, and notice that $2^{b-1}-1 > 2^{b-2}$; then we have

$$\frac{\sqrt{d}\,\sigma_{\max} R}{2\,(2^{b-1}-1)} = O(\varepsilon) \quad\Longrightarrow\quad b = \log_2 O\!\left(\frac{\sqrt{d}\,\sigma_{\max} R}{\varepsilon}\right).$$

B.2 LP-SGD in our work

In Theorem 1, we know that

$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{1}{2\alpha T}\|w_0 - w^*\|_2^2 + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2\kappa_1^2\mu}{4}.$$

Set the limit (as $\alpha \to 0$ and $T \to \infty$) to be ε; then we need

$$\frac{\delta\sigma_1}{2} = O(\varepsilon), \quad \frac{\delta^2\kappa_1^2\mu}{4} = O(\varepsilon).$$

Then for sufficiently small ε (more explicitly, ε that satisfies $\frac{L_1^2}{\mu\sigma_1^2}\,O(\varepsilon) \le 1$), setting $\delta = O(\varepsilon/\sigma_1)$ will satisfy the requirements, and we will get

$$\frac{R}{2^{b-1}-1} = \delta = O\!\left(\frac{\varepsilon}{\sigma_1}\right) \quad\Longrightarrow\quad b = \log_2 O\!\left(\frac{\sigma_1 R}{\varepsilon}\right).$$

This is the expression that we wanted. Notice that even if we did not invoke small ε in the above big-O analysis, we could set

$$\delta = O\!\left(\min\left(\frac{\varepsilon}{\sigma_1}, \frac{\sqrt{\varepsilon\mu}}{L_1}\right)\right),$$

and then our number of bits would look like

$$b = \log_2 O\!\left(\max\left(\frac{\sigma_1 R}{\varepsilon}, \frac{L_1 R}{\sqrt{\varepsilon\mu}}\right)\right),$$

which shows explicitly that we have replaced the dimension factor with parameters of the loss functions.

B.3 LP-SGD in our work using NLQ

In Theorem 3, we know that, if $\zeta < 1/\kappa$,

$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{1}{2\alpha T}\|w_0 - w^*\|_2^2 + \frac{(1+\eta)\,\alpha\sigma^2 + \delta\sigma_1 + \zeta\sigma\|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4\mu}.$$

Set the limit (as $\alpha \to 0$ and $T \to \infty$) to be ε and replace $\|w^*\|_2$ with R; then we get

$$\frac{\delta\sigma_1 + \zeta\sigma R}{2} + \frac{(\delta L_1 + \zeta L R + \zeta\sigma)^2}{4\mu} = O(\varepsilon).$$

So, in addition to our requirement that $\zeta \le \kappa^{-1}$, it suffices to have

$$\delta\sigma_1 = O(\varepsilon), \quad \zeta\sigma R = O(\varepsilon), \quad \frac{\delta^2 L_1^2}{\mu} = O(\varepsilon), \quad \frac{\zeta^2(LR+\sigma)^2}{\mu} = O(\varepsilon).$$

If we set $\delta = O\!\left(\frac{\varepsilon}{\sigma_1}\right)$ and $\zeta = O\!\left(\frac{\varepsilon}{\sigma R}\right)$, then all our other requirements will be satisfied for sufficiently small ε. Specifically, we need ε to be small enough that

$$\frac{\kappa}{\sigma R}\,O(\varepsilon) \le 1, \quad \frac{L_1^2}{\sigma_1^2\mu}\,O(\varepsilon) \le 1, \quad \frac{(LR+\sigma)^2}{\sigma^2 R^2\mu}\,O(\varepsilon) \le 1.$$

As is standard in big-O analysis, we assume that ε is small enough that these requirements are satisfied, in which case our assignment of δ and ζ, combined with the results of Theorem 3, is sufficient to ensure an objective gap of ε. Next, starting from (9), the number of bits we need for non-linear quantization must satisfy

$$\frac{\delta}{\zeta}\left((1+\zeta)^{2^{b-1}-1} - 1\right) \ge R \quad\Longleftrightarrow\quad (1+\zeta)^{2^{b-1}-1} \ge 1 + \frac{\zeta R}{\delta},$$

which happens only when

$$2^{b-1} - 1 \ge \frac{\log\left(1 + \frac{\zeta R}{\delta}\right)}{\log(1+\zeta)}.$$

Since we know that $0 \le \zeta < 1$, it follows that $\log(1+\zeta) \ge \zeta/2$.
So in order for the above to be true, it suffices to have

$$2^{b-1} - 1 \ge \frac{2}{\zeta}\log\left(1 + \frac{\zeta R}{\delta}\right),$$

and since $2^{b-1} - 1 > 2^{b-2}$, it follows that it suffices to have

$$2^b \ge \frac{8}{\zeta}\log\left(1 + \frac{\zeta R}{\delta}\right).$$

And this will be true if

$$b = \log_2 O\!\left(\frac{1}{\zeta}\log\left(1 + \frac{\zeta R}{\delta}\right)\right).$$

Finally, using our assignment of δ and ζ gives us

$$b = \log_2 O\!\left(\frac{\sigma R}{\epsilon}\log\left(1 + \frac{\sigma_1}{\sigma}\right)\right).$$

This is the expression that we wanted. Notice that even if we did not invoke small ε in the above big-O analysis, we would still get a rate in which all of our ℓ1-dependent terms are inside the double logarithm, because none of the requirements above that constrain ζ are ℓ1-dependent. To be explicit, to do this we would set δ and ζ to be

$$\delta = O\!\left(\min\left(\frac{\varepsilon}{\sigma_1}, \frac{\sqrt{\varepsilon\mu}}{L_1}\right)\right), \quad \zeta = O\!\left(\min\left(\frac{\varepsilon}{\sigma R}, \frac{\sqrt{\varepsilon\mu}}{LR+\sigma}, \frac{1}{\kappa}\right)\right).$$

Then our number of bits would look like

$$b = \log_2 O\!\left(\max\left(\frac{\sigma R}{\varepsilon}, \frac{LR+\sigma}{\sqrt{\varepsilon\mu}}, \kappa\right)\log\left(1 + \frac{\zeta R}{\delta}\right)\right),$$

which shows explicitly that any ℓ1-dependent terms are inside the double logarithm. Next, we turn to non-convex objectives.

B.4 LP-SGD in our work for non-convex objectives

In Theorem 2, we know that

$$\mathbf{E}\left[\|\nabla f(\bar w_T)\|_2^2\right] \le \frac{2}{2\alpha - \alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - \alpha L}.$$

Set the limit (as $\alpha \to 0$ and $T \to \infty$) to be ε; then we need $L\delta\sigma_1 = O(\varepsilon)$. Setting $\delta = O\!\left(\frac{\varepsilon}{L\sigma_1}\right)$ will satisfy the requirements, and we will get

$$\frac{R}{2^{b-1}-1} = \delta = O\!\left(\frac{\varepsilon}{L\sigma_1}\right) \quad\Longrightarrow\quad b = \log_2 O\!\left(\frac{LR\sigma_1}{\varepsilon}\right).$$

B.5 LP-SGD in our work using NLQ for non-convex objectives

In Theorem 4, we know that, if $\alpha < \frac{1}{2(\eta+1)L}$, then

$$\mathbf{E}\left[\|\nabla f(\bar w_T)\|_2^2\right] \le \frac{2(f(w_0) - f^*)}{\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{1}{2}(L\zeta R_0)^2.$$

Set the limit (as $\alpha \to 0$ and $T \to \infty$) to be ε and replace $R_0$ with R; then we need

$$L\delta\sigma_1 + \frac{1}{2}(L\zeta R_0)^2 = O(\varepsilon),$$

and it suffices to have

$$L\delta\sigma_1 = O(\varepsilon), \quad (L\zeta R_0)^2 = O(\varepsilon),$$

i.e., $\delta = O\!\left(\frac{\varepsilon}{L\sigma_1}\right)$ and $\zeta = O\!\left(\frac{\sqrt{\varepsilon}}{LR}\right)$; then all our other requirements will be satisfied. Next, starting from (9) and reasoning exactly as in B.3, the number of bits we need for non-linear quantization will be true if

$$b = \log_2 O\!\left(\frac{1}{\zeta}\log\left(1 + \frac{\zeta R}{\delta}\right)\right).$$

Finally, using our assignment of δ and ζ gives us

$$b = \log_2 O\!\left(\frac{LR}{\sqrt{\epsilon}}\log\left(1 + \frac{\sigma_1}{\sqrt{\epsilon}}\right)\right).$$

C Proofs for theorems

Before we prove the main theorems presented in the paper, we will prove the following lemmas that will be useful later, as well as the lemmas we presented before. The proof of Lemma 1 can be extracted from the proof of Lemma 5 that we will show later.

Proof of Lemma 2. Here we consider the positive case first; symmetrically, the negative case also holds. First, for normal FPQ, with $n_m = 2^{b_m}$ and $n_e = 2^{b_e}$, the set of quantization points is

$$D = \{0\} \cup \left\{s\left(1 + \frac{x}{n_m}\right)2^y \;\middle|\; x = 0, 1, \ldots, n_m - 1,\; y = -\frac{n_e}{2} + 2, \ldots, \frac{n_e}{2}\right\},$$

and we set the parameters for the nonlinear quantization bound to be

$$\delta = \frac{4s}{2^{n_e/2}}, \quad \zeta = \frac{1}{n_m}, \quad \eta = \frac{\zeta^2}{4(1+\zeta)} = \frac{1}{4n_m(n_m+1)}.$$

For any w within the representable range, we can assume it is in $[q_i, q_{i+1})$; then

$$\mathbf{E}\left[(Q(w) - w)^2\right] = \frac{q_{i+1} - w}{q_{i+1} - q_i}(w - q_i)^2 + \frac{w - q_i}{q_{i+1} - q_i}(q_{i+1} - w)^2 = (w - q_i)(q_{i+1} - w).$$

So now we only need to prove that for all $v \in D$,

$$(w - q_i)(q_{i+1} - w) \le \delta\,|w - v| + \zeta\,|v|\,|w - v| + \eta\,|w - v|^2.$$

First, we consider the special case where $q_i = 0$. In this case, $q_{i+1} = s\cdot 2^{-n_e/2+2} = \delta$. If v = 0, it is obvious that

$$\text{LHS} = (w - q_i)(q_{i+1} - w) = w(\delta - w) \le \delta w \le \text{RHS},$$

and similarly for v = δ,

$$\text{LHS} = w(\delta - w) \le \delta(\delta - w) \le \text{RHS},$$

and for v > δ,

$$\text{RHS} \ge \delta(v - w) \ge \delta(\delta - w) \ge w(\delta - w) = \text{LHS}.$$

Next, we consider the case where $q_i \ne 0$. In this case, we can write $q_i = s\left(1 + \frac{x}{n_m}\right)2^y$, so that

$$q_{i+1} - q_i = s\,2^y\,\frac{1}{n_m} \le \zeta q_i.$$
If $v \ge q_{i+1}$, denote $y = q_{i+1} - w$; then

$$\text{LHS} = (w - q_i)(q_{i+1} - w) = y\,(q_{i+1} - q_i - y) \le y\,(\zeta q_i - y),$$
$$\text{RHS} \ge \zeta\,q_{i+1}\,(q_{i+1} - w) = \zeta q_{i+1}\,y \ge \zeta q_i\,y \ge \text{LHS}.$$

Secondly, if $0 \le v \le q_i$, denote $y = w - q_i$; then

$$\text{LHS} = (w - q_i)(q_{i+1} - w) = y\,(q_{i+1} - q_i - y) \le y\,(\zeta q_i - y),$$
$$\text{RHS} = \delta\,(w - v) + \zeta\,v\,(w - v) + \eta\,(w - v)^2 = -(\zeta - \eta)\,v^2 + (-\delta + \zeta w - 2\eta w)\,v + \delta w + \eta w^2.$$

Observe that $\zeta - \eta > 0$, so the right-hand side is a concave function of v; thus it achieves its minimum at either v = 0 or $v = q_i$. At $v = q_i$,

$$\text{RHS} = \delta y + \zeta q_i\,y + \eta y^2 \ge \zeta q_i\,y \ge \text{LHS},$$

and at v = 0, since $q_{i+1} \le (1+\zeta)q_i$ and $q_i \le w$,

$$\begin{aligned}
\text{RHS} - \text{LHS} &= \delta w + \zeta\cdot 0\cdot w + \eta w^2 - (w - q_i)(q_{i+1} - w) \\
&= (1+\eta)w^2 + (\delta - q_i - q_{i+1})\,w + q_i q_{i+1} \\
&\ge (1+\eta)w^2 - (q_i + q_{i+1})\,w + q_i q_{i+1} \\
&= (1+\eta)w^2 - \left[(2+\zeta)q_i + (q_{i+1} - (1+\zeta)q_i)\right]w + \left[(1+\zeta)q_i^2 + (q_{i+1} - (1+\zeta)q_i)\,q_i\right] \\
&= (1+\eta)w^2 - (2+\zeta)q_i\,w + (1+\zeta)q_i^2 + (q_{i+1} - (1+\zeta)q_i)(q_i - w) \\
&\ge (1+\eta)w^2 - (2+\zeta)q_i\,w + (1+\zeta)q_i^2,
\end{aligned}$$

which is an upward parabola in w. Recall that $\eta = \frac{\zeta^2}{4(\zeta+1)}$, i.e. $1+\eta = \frac{(\zeta+2)^2}{4(\zeta+1)}$; thus the discriminant is

$$(2+\zeta)^2 q_i^2 - 4(1+\eta)(1+\zeta)\,q_i^2 = 0,$$

and therefore $\text{RHS} - \text{LHS} \ge 0$. Now we extend this conclusion to the case where $v \le 0$. In this case,

$$\text{RHS} = \delta\,(w - v) + \zeta\,(-v)\,(w - v) + \eta\,(w - v)^2;$$

since w, ζ, δ, η are all positive, this is a decreasing function of v, so over $v \le 0$ it achieves its minimum at v = 0, which is the case we have already proven. So far, we have proven the lemma in the cases $w \ge 0, v \ge 0$ and $w \ge 0, v \le 0$; symmetrically it holds for $w \le 0, v \le 0$ and $w \le 0, v \ge 0$, which indicates that we can extend D to be a set containing both positive and negative numbers.

In the denormal FPQ case, the set of quantization points is

$$D = \left\{s\,\frac{x}{n_m}\,2^{-\frac{n_e}{2}+3} \;\middle|\; x = 0, 1, \ldots, n_m - 1\right\} \cup \left\{s\left(1 + \frac{x}{n_m}\right)2^y \;\middle|\; x = 0, 1, \ldots, n_m - 1,\; y = -\frac{n_e}{2} + 3, \ldots, \frac{n_e}{2}\right\},$$

and we set the parameters for the nonlinear quantization bound to be

$$\delta = \frac{8s}{n_m\,2^{n_e/2}}, \quad \zeta = \frac{1}{n_m}, \quad \eta = \frac{\zeta^2}{4(1+\zeta)} = \frac{1}{4n_m(n_m+1)}.$$

The proof for this case follows the exact same structure as the normal FPQ case.
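As a quick numerical illustration (not part of the paper's proof), the following sketch Monte-Carlo-estimates the randomized-rounding variance on a small normal FPQ grid and compares it against the Definition 1 bound evaluated at a nearby representable point z; the helper names and the specific format are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_fpq_points(be, bm, s=1.0):
    """Positive normal FPQ grid {0} union {s*(1 + x/n_m)*2^y}, as in the proof above."""
    nm, ne = 2 ** bm, 2 ** be
    pts = {0.0} | {s * (1 + x / nm) * 2.0 ** y
                   for x in range(nm) for y in range(-ne // 2 + 2, ne // 2 + 1)}
    return np.array(sorted(pts))

def stochastic_round(x, pts):
    i = int(np.searchsorted(pts, x, side="right")) - 1
    lo, hi = pts[i], pts[min(i + 1, len(pts) - 1)]
    p = 0.0 if hi == lo else (x - lo) / (hi - lo)
    return hi if rng.random() < p else lo

be, bm = 4, 3
pts = normal_fpq_points(be, bm)
delta, zeta = 4 * 2.0 ** -(2 ** (be - 1)), 2.0 ** -bm
eta = zeta ** 2 / (4 * (1 + zeta))
w, z = 1.3, 1.25                      # w in range; z = 1.25 is a representable point
var = np.mean([(stochastic_round(w, pts) - w) ** 2 for _ in range(200000)])
bound = delta * abs(w - z) + zeta * abs(z) * abs(w - z) + eta * (w - z) ** 2
print(var, bound)                     # the estimated variance stays below the bound
```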
Lemma 3. Under the condition of linear quantization using the low-precision representation (δ, b), for any $w^*, v \in \mathbb{R}^d$ where $Q_{(\delta,b)}(w^*) = w^*$,

$$\mathbf{E}\left[\|Q_{(\delta,b)}(w^* + v) - w^*\|_2^2\right] \le \|(w^* + v) - w^*\|_2^2 + \delta\,\|v\|_1,$$

where Q is the linear quantization function.

Proof of Lemma 3. (This proof follows the same structure as the proof of Lemma 1 in [8].) First, observe that this lemma holds if it holds for each dimension, so we only need to prove that for any $w^*, v \in \mathbb{R}$ where $Q_{(\delta,b)}(w^*) = w^*$, i.e. $w^* \in \mathrm{dom}(\delta, b)$,

$$\mathbf{E}\left[(Q_{(\delta,b)}(w^* + v) - w^*)^2\right] \le (w^* + v - w^*)^2 + \delta|v|;$$

then we can sum up all the dimensions to get the result. Now we consider the problem in two situations. First, if $w^* + v$ is within the range representable by (δ, b), then $\mathbf{E}\left[Q_{(\delta,b)}(w^* + v)\right] = w^* + v$. In this case,

$$\begin{aligned}
\mathbf{E}\left[(Q_{(\delta,b)}(w^* + v) - w^*)^2\right] &= \mathbf{E}\left[\left((Q_{(\delta,b)}(w^* + v) - (w^* + v)) + ((w^* + v) - w^*)\right)^2\right] \\
&= \mathbf{E}\left[(Q_{(\delta,b)}(w^* + v) - (w^* + v))^2\right] + 2\,\mathbf{E}\left[Q_{(\delta,b)}(w^* + v) - (w^* + v)\right]\left((w^* + v) - w^*\right) + ((w^* + v) - w^*)^2 \\
&= ((w^* + v) - w^*)^2 + \mathbf{E}\left[(Q_{(\delta,b)}(w^* + v) - (w^* + v))^2\right].
\end{aligned}$$

Since $w^* + v$ is within the representable range, $\mathbf{E}\left[(Q_{(\delta,b)}(w^* + v) - (w^* + v))^2\right]$ is equivalent to $\mathbf{E}\left[(Q_{(\delta,\infty)}(v) + w^* - (w^* + v))^2\right]$, which equals $\mathbf{E}\left[(Q_{(\delta,\infty)}(v) - v)^2\right]$ since $Q_{(\delta,b)}(w^*) = w^*$. Now we only need to prove that $\mathbf{E}\left[(Q_{(\delta,\infty)}(v) - v)^2\right] \le \delta|v|$. Observe that this trivially holds for v = 0, and is symmetric for positive and negative v. Without loss of generality we assume v > 0, and let z be the rounded-down quantization of v, so that $z \ge 0$. Then $Q_{(\delta,\infty)}(v)$ will round to $z + \delta$ (the rounded-up quantization of v) with probability $\frac{v - z}{\delta}$, and it will round to z with probability $\frac{z + \delta - v}{\delta}$. This quantization is unbiased because

$$\mathbf{E}\left[Q_{(\delta,\infty)}(v)\right] = \frac{v - z}{\delta}(z + \delta) + \frac{z + \delta - v}{\delta}\,z = \frac{vz - z^2 + v\delta - z\delta + z^2 + z\delta - vz}{\delta} = v.$$

Thus, its variance will be

$$\mathbf{E}\left[(Q_{(\delta,\infty)}(v) - v)^2\right] = \frac{v - z}{\delta}(z + \delta - v)^2 + \frac{z + \delta - v}{\delta}(v - z)^2 = (v - z)(z + \delta - v) = \delta(v - z) - (v - z)^2 \le \delta(v - z) \le \delta v,$$

and therefore

$$\mathbf{E}\left[(Q_{(\delta,b)}(w^* + v) - w^*)^2\right] \le (w^* + v - w^*)^2 + \delta|v|.$$

In the other case, when $w^* + v$ is in the exterior of the representable region, the quantization function $Q_{(\delta,b)}$ just maps it to the nearest representable value. Since $w^*$ is in the interior of the representable region, this operation will make $w^* + v$ closer to $w^*$. Thus, $(Q_{(\delta,b)}(w^* + v) - w^*)^2 \le (w^* + v - w^*)^2$, and so it will certainly be the case that

$$\mathbf{E}\left[(Q_{(\delta,b)}(w^* + v) - w^*)^2\right] \le (w^* + v - w^*)^2 + \delta|v|.$$

Now that we have proven the inequality for one dimension, we can sum up all d dimensions and get

$$\mathbf{E}\left[\|Q_{(\delta,b)}(w^* + v) - w^*\|_2^2\right] \le \|(w^* + v) - w^*\|_2^2 + \delta\,\|v\|_1.$$

For completeness, we also re-state the following Lemma 4, which was presented as equation (8) in [13]; here we present the proof of this lemma used in [8].

Lemma 4. Under the standard condition of Lipschitz continuity, if i is sampled uniformly at random from $\{1, \ldots, n\}$, then for any w,

$$\mathbf{E}\left[\|\nabla f_i(w) - \nabla f_i(w^*)\|_2^2\right] \le 2L\left(f(w) - f(w^*)\right).$$

Proof of Lemma 4. For any i, define $g_i(w) = f_i(w) - f_i(w^*) - (w - w^*)^T\nabla f_i(w^*)$. Clearly, if i is sampled randomly as in the lemma statement, $\mathbf{E}\left[g_i(w)\right] = f(w) - f(w^*)$. But also, $w^*$ must be the minimizer of $g_i$, so for any w,

$$0 = g_i(w^*) \le \min_\eta\; g_i(w - \eta\nabla g_i(w)) \le \min_\eta\left(g_i(w) - \eta\,\|\nabla g_i(w)\|_2^2 + \frac{\eta^2 L}{2}\|\nabla g_i(w)\|_2^2\right) = g_i(w) - \frac{1}{2L}\|\nabla g_i(w)\|_2^2,$$

where the second inequality follows from the Lipschitz continuity property. Re-writing this in terms of $f_i$ and averaging over all the i now proves the lemma statement.

Lemma 5. Under the condition of logarithmic quantization, for any $w, w^* \in \mathbb{R}^d$ with $w^*$ in the representable range, and any $v \in D^d$,

$$\mathbf{E}\left[\|Q(w) - w^*\|_2^2\right] \le \|w - w^*\|_2^2 + \delta\,\|w - v\|_1 + \zeta\,\|v\|_2\,\|w - v\|_2 + \eta\,\|w - v\|_2^2,$$

where Q is the non-linear quantization function. Note that the proof of this lemma naturally extends to Lemma 1; thus we omit a separate proof of Lemma 1 and just present the proof of Lemma 5.

Proof of Lemma 5. Here we only consider the positive case first, where $D = \{q_0, q_1, \ldots, q_{n-1}\}$ with $[0, q_{n-1}]$ being the representable range of D. As for the negative case, we will show later that it holds symmetrically. Observe that this lemma holds if it holds for each dimension, so we only need to prove that for any $w, w^* \in \mathbb{R}$ and $v \in D$,

$$\mathbf{E}\left[(Q(w) - w^*)^2\right] \le |w - w^*|^2 + \delta\,|w - v| + \zeta\,|v|\,|w - v| + \eta\,|w - v|^2;$$

then we can sum up all the dimensions and use the Cauchy-Schwarz inequality to get the result. Now we consider the problem in two situations. First, if w is outside the representable range, the quantization function Q just maps it to the nearest representable value. Since $w^*$ is in the interior of the representable range, this operation will make w closer to $w^*$. Thus, $(Q(w) - w^*)^2 \le (w - w^*)^2$, and so it will certainly be the case that

$$\mathbf{E}\left[(Q(w) - w^*)^2\right] \le |w - w^*|^2 + \delta\,|w - v| + \zeta\,|v|\,|w - v| + \eta\,|w - v|^2.$$

Second, if w is within the representable range, then $\mathbf{E}\left[Q(w)\right] = w$. In this case,

$$\mathbf{E}\left[(Q(w) - w^*)^2\right] = \mathbf{E}\left[(Q(w) - w)^2\right] + 2\,\mathbf{E}\left[Q(w) - w\right](w - w^*) + (w - w^*)^2 = (w - w^*)^2 + \mathbf{E}\left[(Q(w) - w)^2\right].$$

Since w is within the representable range, we can assume it is in $[q_i, q_{i+1})$; then

$$\mathbf{E}\left[(Q(w) - w)^2\right] = \frac{q_{i+1} - w}{q_{i+1} - q_i}(w - q_i)^2 + \frac{w - q_i}{q_{i+1} - q_i}(q_{i+1} - w)^2 = (w - q_i)(q_{i+1} - w).$$

So now we only need to prove that

$$(w - q_i)(q_{i+1} - w) \le \delta\,|w - v| + \zeta\,|v|\,|w - v| + \eta\,|w - v|^2.$$

Note that $v \in D$, so either $v \ge q_{i+1}$ or $v \le q_i$.
First, if $v \ge q_{i+1}$, denote $y = q_{i+1} - w$; then

$$\text{LHS} = (w - q_i)(q_{i+1} - w) = y\,(q_{i+1} - q_i - y) = y\,(\delta + \zeta q_i - y),$$
$$\text{RHS} = \delta\,(v - w) + \zeta\,v\,(v - w) + \eta\,(v - w)^2 \ge \delta\,(q_{i+1} - w) + \zeta\,q_{i+1}\,(q_{i+1} - w) + \eta\,(q_{i+1} - w)^2 = \delta y + \zeta q_{i+1}\,y + \eta y^2 \ge \delta y + \zeta q_i\,y - y^2 = \text{LHS}.$$

Secondly, if $0 \le v \le q_i$, denote $y = w - q_i$; then

$$\text{LHS} = y\,(q_{i+1} - q_i - y) = y\,(\delta + \zeta q_i - y),$$
$$\text{RHS} = \delta\,(w - v) + \zeta\,v\,(w - v) + \eta\,(w - v)^2 = -(\zeta - \eta)\,v^2 + (-\delta + \zeta w - 2\eta w)\,v + \delta w + \eta w^2.$$

Observe that $\zeta - \eta > 0$, so the right-hand side is a concave function of v; thus it achieves its minimum at either v = 0 or $v = q_i$. At $v = q_i$,

$$\text{RHS} = \delta y + \zeta q_i\,y + \eta y^2 \ge \delta y + \zeta q_i\,y - y^2 = \text{LHS},$$

and at v = 0,

$$\begin{aligned}
\text{RHS} - \text{LHS} &= \delta w + \zeta\cdot 0\cdot w + \eta w^2 - (w - q_i)(q_{i+1} - w) \\
&= (1+\eta)w^2 + (\delta - q_i - q_{i+1})\,w + q_i q_{i+1} \\
&= (1+\eta)w^2 - (2+\zeta)q_i\,w + q_i q_{i+1} \\
&\ge (1+\eta)w^2 - (2+\zeta)q_i\,w + (1+\zeta)q_i^2,
\end{aligned}$$

which is an upward parabola in w. Recall that $\eta = \frac{\zeta^2}{4(\zeta+1)}$, i.e. $1+\eta = \frac{(\zeta+2)^2}{4(\zeta+1)}$; thus the discriminant is

$$(2+\zeta)^2 q_i^2 - 4(1+\eta)(1+\zeta)\,q_i^2 = 0,$$

and therefore $\text{RHS} - \text{LHS} \ge 0$. Now we extend this conclusion to the case where $v \le 0$. In this case,

$$\text{RHS} = \delta\,(w - v) + \zeta\,(-v)\,(w - v) + \eta\,(w - v)^2;$$

since w, ζ, δ, η are all positive, this is a decreasing function of v, so over $v \le 0$ it achieves its minimum at v = 0, which is the case we have already proven. So far, we have proven the lemma in the cases $w \ge 0, v \ge 0$ and $w \ge 0, v \le 0$; symmetrically it holds for $w \le 0, v \le 0$ and $w \le 0, v \ge 0$, which indicates that we can extend D to be a set containing both positive and negative numbers, and we can reset D to be

$$D = \{-q_n, \ldots, -q_1, q_0, q_1, \ldots, q_{n-1}\}, \quad \text{where } q_0 = 0,\; q_{i+1} - q_i = \delta + \zeta q_i.$$

Now we have proven all the lemmas we need. Next, we make some small modifications to the assumptions (weakening them) so that our theorems hold in a more general sense. We replace Assumption 2 with:

Assumption 7. All the gradients of the loss functions $f_i$ are L1-Lipschitz continuous in the sense of the ℓ1-norm relative to the p-norm; that is,

$$\forall i \in \{1, 2, \ldots, n\},\; \forall x, y, \quad \|\nabla f_i(x) - \nabla f_i(y)\|_1 \le L_1\,\|x - y\|_p.$$

While in the body of the paper and in our experiments we choose p = 2 for simplicity, here we are going to prove that a generalization of Theorem 1 holds for all real numbers p. We also need a similar generalization of Assumption 3.

Assumption 8. The average of the loss functions $f = \frac{1}{n}\sum_i f_i$ is µ1-strongly convex near the optimal point in the sense of the p-norm; that is,

$$\frac{\mu_1}{2}\,\|w - w^*\|_p^2 \le f(w) - f(w^*),$$

with p being any real number. This assumption is essentially the same as the assumption of strong convexity that we stated before, since in practice we would choose p = 2 and then $\mu_1$ and µ would coincide. But here we are actually presenting our result in a stronger sense, in that we can choose any real number p and the proof goes through the same way. Now we are ready to prove the theorems. Note that the following proofs contain $\mu_1$ since we are proving a more general version of our theorems; substituting µ for $\mu_1$ will lead to the same results that we stated before.

Proof of Theorem 1. In low-precision SGD, we have

$$u_{t+1} = w_t - \alpha\nabla f_{i_t}(w_t), \quad w_{t+1} = Q(u_{t+1}).$$

By Lemma 3, we know that

$$\begin{aligned}
\mathbf{E}\left[\|w_{t+1} - w^*\|_2^2\right] &= \mathbf{E}\left[\|Q(w_t - \alpha\nabla f_{i_t}(w_t)) - w^*\|_2^2\right] \\
&\le \mathbf{E}\left[\|w_t - \alpha\nabla f_{i_t}(w_t) - w^*\|_2^2\right] + \delta\,\mathbf{E}\left[\|\alpha\nabla f_{i_t}(w_t)\|_1\right] \\
&= \mathbf{E}\left[\|w_t - w^*\|_2^2\right] - 2\alpha\,\mathbf{E}\left[(w_t - w^*)^T\nabla f_{i_t}(w_t)\right] + \alpha^2\,\mathbf{E}\left[\|\nabla f_{i_t}(w_t)\|_2^2\right] + \alpha\delta\,\mathbf{E}\left[\|\nabla f_{i_t}(w_t)\|_1\right] \\
&\le \mathbf{E}\left[\|w_t - w^*\|_2^2\right] - 2\alpha\,\mathbf{E}\left[(f(w_t) - f(w^*)) + \frac{\mu}{2}\|w_t - w^*\|_2^2\right] + \alpha^2\,\mathbf{E}\left[\|\nabla f_{i_t}(w_t)\|_2^2\right] + \alpha\delta\,\mathbf{E}\left[\|\nabla f_{i_t}(w_t)\|_1\right] \\
&= (1 - \alpha\mu)\,\mathbf{E}\left[\|w_t - w^*\|_2^2\right] + \alpha^2\,\mathbf{E}\left[\|\nabla f_{i_t}(w_t)\|_2^2\right] + \alpha\delta\,\mathbf{E}\left[\|\nabla f_{i_t}(w_t)\|_1\right] - 2\alpha\,\mathbf{E}\left[f(w_t) - f(w^*)\right],
\end{aligned}$$

where the second inequality holds due to the strong convexity assumption.
According to the assumptions we had, we have

$$\begin{aligned}
\mathbf{E}\left[\|\nabla f_i(w)\|_2^2\right] &= \mathbf{E}\left[\|\nabla f_i(w) - \nabla f_i(w^*) + \nabla f_i(w^*)\|_2^2\right] \\
&= \mathbf{E}\left[\|\nabla f_i(w) - \nabla f_i(w^*)\|_2^2 + 2(\nabla f_i(w) - \nabla f_i(w^*))^T\nabla f_i(w^*) + \|\nabla f_i(w^*)\|_2^2\right] \\
&= \mathbf{E}\left[\|\nabla f_i(w) - \nabla f_i(w^*)\|_2^2 + \|\nabla f_i(w^*)\|_2^2\right] \le L^2\,\mathbf{E}\left[\|w - w^*\|_2^2\right] + \sigma^2
\end{aligned}$$

and

$$\mathbf{E}\left[\|\nabla f_i(w)\|_1\right] = \mathbf{E}\left[\|\nabla f_i(w) - \nabla f_i(w^*) + \nabla f_i(w^*)\|_1\right] \le \mathbf{E}\left[\|\nabla f_i(w) - \nabla f_i(w^*)\|_1 + \|\nabla f_i(w^*)\|_1\right] \le L_1\,\mathbf{E}\left[\|w - w^*\|_2\right] + \sigma_1,$$

where the last inequality holds due to Assumption 2 (Assumption 7 with p = 2). Applying this result to the previous formula, we will have

$$\mathbf{E}\left[\|w_{t+1} - w^*\|_2^2\right] \le (1 - \alpha\mu + \alpha^2 L^2)\,\mathbf{E}\left[\|w_t - w^*\|_2^2\right] + \alpha\delta L_1\,\mathbf{E}\left[\|w_t - w^*\|_2\right] - 2\alpha\,\mathbf{E}\left[f(w_t) - f(w^*)\right] + \alpha^2\sigma^2 + \alpha\delta\sigma_1.$$

Here we introduce a positive constant C that we will set later; by the basic inequality $ab \le Ca^2 + \frac{b^2}{4C}$ we get

$$\alpha\delta L_1\,\mathbf{E}\left[\|w_t - w^*\|_2\right] \le C\left(\mathbf{E}\left[\|w_t - w^*\|_2\right]\right)^2 + \frac{\alpha^2\delta^2 L_1^2}{4C} \le C\,\mathbf{E}\left[\|w_t - w^*\|_2^2\right] + \frac{\alpha^2\delta^2 L_1^2}{4C},$$

so that

$$\mathbf{E}\left[\|w_{t+1} - w^*\|_2^2\right] \le (1 - \alpha\mu + \alpha^2 L^2 + C)\,\mathbf{E}\left[\|w_t - w^*\|_2^2\right] - 2\alpha\,\mathbf{E}\left[f(w_t) - f(w^*)\right] + \alpha^2\sigma^2 + \alpha\delta\sigma_1 + \frac{\alpha^2\delta^2 L_1^2}{4C}.$$

On setting C to be $\alpha\mu - \alpha^2 L^2$, we will have

$$2\alpha\,\mathbf{E}\left[f(w_t) - f(w^*)\right] \le \mathbf{E}\left[\|w_t - w^*\|_2^2\right] - \mathbf{E}\left[\|w_{t+1} - w^*\|_2^2\right] + \alpha^2\sigma^2 + \alpha\delta\sigma_1 + \frac{\alpha^2\delta^2 L_1^2}{4(\alpha\mu - \alpha^2 L^2)}.$$

Since we can set α small enough that $\alpha L^2 \le \frac{\mu}{2}$, the result becomes

$$2\alpha\,\mathbf{E}\left[f(w_t) - f(w^*)\right] \le \mathbf{E}\left[\|w_t - w^*\|_2^2\right] - \mathbf{E}\left[\|w_{t+1} - w^*\|_2^2\right] + \alpha^2\sigma^2 + \alpha\delta\sigma_1 + \frac{\alpha\delta^2 L_1^2}{2\mu}.$$

Now we sum up this inequality from t = 0 to t = T − 1 and divide by 2αT; then we get

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbf{E}\left[f(w_t) - f(w^*)\right] \le \frac{\|w_0 - w^*\|_2^2 - \mathbf{E}\left[\|w_T - w^*\|_2^2\right]}{2\alpha T} + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2 L_1^2}{4\mu} \le \frac{\|w_0 - w^*\|_2^2}{2\alpha T} + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2\kappa_1^2\mu}{4},$$

and since we sample $\bar w_T$ uniformly from $\{w_0, w_1, \ldots, w_{T-1}\}$, we get

$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{1}{2\alpha T}\|w_0 - w^*\|_2^2 + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2\kappa_1^2\mu}{4}.$$

Proof of Theorem 2. In low-precision SGD,

$$u_{t+1} = w_t - \alpha_t\nabla f_{i_t}(w_t), \quad w_{t+1} = Q(u_{t+1}).$$

By a Taylor expansion with some $\xi_t$ between $u_{t+1}$ and $w_{t+1}$,

$$\begin{aligned}
f(w_{t+1}) = f(Q(u_{t+1})) &= f\left(u_{t+1} + (Q(u_{t+1}) - u_{t+1})\right) \\
&= f(u_{t+1}) + (Q(u_{t+1}) - u_{t+1})^T\nabla f(u_{t+1}) + \frac{1}{2}(Q(u_{t+1}) - u_{t+1})^T\nabla^2 f(\xi_t)\,(Q(u_{t+1}) - u_{t+1}) \\
&\le f(u_{t+1}) + (Q(u_{t+1}) - u_{t+1})^T\nabla f(u_{t+1}) + \frac{L}{2}\|Q(u_{t+1}) - u_{t+1}\|_2^2.
\end{aligned}$$

Since $\mathbf{E}\left[Q(u_{t+1})\right] = u_{t+1}$, by taking the expectation we get

$$\mathbf{E}\left[f(w_{t+1})\right] \le \mathbf{E}\left[f(u_{t+1})\right] + 0 + \frac{L}{2}\mathbf{E}\left[\|Q(u_{t+1}) - u_{t+1}\|_2^2\right] = \mathbf{E}\left[f(w_t - \alpha_t\nabla f_{i_t}(w_t))\right] + \frac{L}{2}\mathbf{E}\left[\|Q(u_{t+1}) - u_{t+1}\|_2^2\right],$$

in which, for some $\xi'_t$,

$$\begin{aligned}
f(w_t - \alpha_t\nabla f_{i_t}(w_t)) &= f(w_t) - \alpha_t\nabla f(w_t)^T\nabla f_{i_t}(w_t) + \frac{\alpha_t^2}{2}\nabla f_{i_t}(w_t)^T\nabla^2 f(\xi'_t)\,\nabla f_{i_t}(w_t) \\
&\le f(w_t) - \alpha_t\nabla f(w_t)^T\nabla f_{i_t}(w_t) + \frac{\alpha_t^2 L}{2}\|\nabla f(w_t) + (\nabla f_{i_t}(w_t) - \nabla f(w_t))\|_2^2.
\end{aligned}$$

Since $\mathbf{E}\left[\nabla f_{i_t}\right] = \nabla f$, by taking the expectation of the previous terms we have

$$\mathbf{E}\left[f(w_t - \alpha_t\nabla f_{i_t}(w_t))\right] \le \mathbf{E}\left[f(w_t)\right] - \left(\alpha_t - \frac{\alpha_t^2 L}{2}\right)\mathbf{E}\left[\|\nabla f(w_t)\|_2^2\right] + \frac{\alpha_t^2 L}{2}\mathbf{E}\left[\|\nabla f_{i_t}(w_t) - \nabla f(w_t)\|_2^2\right].$$

Meanwhile, according to the previous lemma and Assumption 4, we know that

$$\frac{L}{2}\mathbf{E}\left[\|Q(u_{t+1}) - u_{t+1}\|_2^2\right] \le \frac{L\delta}{2}\,\mathbf{E}\left[\|u_{t+1} - w_t\|_1\right] = \frac{L\delta}{2}\,\mathbf{E}\left[\|\alpha_t\nabla f_{i_t}(w_t)\|_1\right] \le \frac{\alpha_t L\delta\sigma_1}{2}.$$

Now, using the assumption on the variance of $\nabla f_{i_t}(w_t)$ and substituting these results into the previous expression, we will get

$$\mathbf{E}\left[f(w_{t+1})\right] \le \mathbf{E}\left[f(w_t)\right] - \left(\alpha_t - \frac{\alpha_t^2 L}{2}\right)\mathbf{E}\left[\|\nabla f(w_t)\|_2^2\right] + \frac{\alpha_t^2\sigma_0^2 L}{2} + \frac{\alpha_t L\delta\sigma_1}{2}.$$

Then we can sum the result from t = 0 to T − 1, and we have

$$\sum_{t=0}^{T-1}\left(\alpha_t - \frac{\alpha_t^2 L}{2}\right)\mathbf{E}\left[\|\nabla f(w_t)\|_2^2\right] \le f(w_0) - \mathbf{E}\left[f(w_T)\right] + \sum_{t=0}^{T-1}\left(\frac{\alpha_t^2\sigma_0^2 L}{2} + \frac{\alpha_t L\delta\sigma_1}{2}\right).$$

If we set $\alpha_t = \alpha$ and select $\bar w_T$ uniformly at random from $\{w_0, w_1, \ldots, w_{T-1}\}$, we get

$$\mathbf{E}\left[\|\nabla f(\bar w_T)\|_2^2\right] = \frac{1}{T}\sum_{t=0}^{T-1}\mathbf{E}\left[\|\nabla f(w_t)\|_2^2\right] \le \frac{2}{2\alpha - \alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha^2\sigma_0^2 L + \alpha L\delta\sigma_1}{2\alpha - \alpha^2 L} = \frac{2}{2\alpha - \alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - \alpha L}.$$

Proof of Theorem 3.
Proof of Theorem 3. In low-precision SGD, we have
\[
u_{t+1} = w_t - \alpha \nabla f_t(w_t), \qquad w_{t+1} = Q(u_{t+1}).
\]
By Lemma 5, we know that
\begin{align*}
E\left[\|w_{t+1} - w^*\|_2^2\right] &= E\left[\|Q(w_t - \alpha \nabla f_t(w_t)) - w^*\|_2^2\right] \\
&\le E\left[\|w_t - \alpha \nabla f_t(w_t) - w^*\|_2^2\right] + \delta E\left[\|\alpha \nabla f_t(w_t)\|_1\right] + \zeta E\left[\|w_t\|_2\,\|\alpha \nabla f_t(w_t)\|_2\right] + \eta E\left[\|\alpha \nabla f_t(w_t)\|_2^2\right] \\
&\le E\left[\|w_t - w^*\|_2^2\right] - 2\alpha E\left[(w_t - w^*)^T \nabla f_t(w_t)\right] + \alpha^2 E\left[\|\nabla f_t(w_t)\|_2^2\right] + \alpha\delta E\left[\|\nabla f_t(w_t)\|_1\right] \\
&\qquad + \alpha\zeta E\left[(\|w_t - w^*\|_2 + \|w^*\|_2)\|\nabla f_t(w_t)\|_2\right] + \eta\alpha^2 E\left[\|\nabla f_t(w_t)\|_2^2\right] \\
&\le E\left[\|w_t - w^*\|_2^2\right] - 2\alpha E\left[f(w_t) - f(w^*) + \tfrac{\mu}{2}\|w_t - w^*\|_2^2\right] + \alpha^2 E\left[\|\nabla f_t(w_t)\|_2^2\right] + \alpha\delta E\left[\|\nabla f_t(w_t)\|_1\right] \\
&\qquad + \alpha\zeta E\left[(\|w_t - w^*\|_2 + \|w^*\|_2)\|\nabla f_t(w_t)\|_2\right] + \eta\alpha^2 E\left[\|\nabla f_t(w_t)\|_2^2\right] \\
&= (1 - \alpha\mu)E\left[\|w_t - w^*\|_2^2\right] + (1 + \eta)\alpha^2 E\left[\|\nabla f_t(w_t)\|_2^2\right] - 2\alpha E\left[f(w_t) - f(w^*)\right] \\
&\qquad + \alpha\zeta E\left[(\|w_t - w^*\|_2 + \|w^*\|_2)\|\nabla f_t(w_t)\|_2\right] + \alpha\delta E\left[\|\nabla f_t(w_t)\|_1\right],
\end{align*}
where the second inequality uses $\|w_t\|_2 \le \|w_t - w^*\|_2 + \|w^*\|_2$ and the third holds by the strong convexity assumption. Exactly as in the proof of Theorem 1, our assumptions give
\[
E\left[\|\nabla f_i(w)\|_2^2\right] \le L^2 E\left[\|w - w^*\|_2^2\right] + \sigma^2,
\]
and, pointwise by the triangle inequality,
\begin{align*}
\|\nabla f_i(w)\|_2 &= \|\nabla f_i(w) - \nabla f_i(w^*) + \nabla f_i(w^*)\|_2 \le \|\nabla f_i(w) - \nabla f_i(w^*)\|_2 + \|\nabla f_i(w^*)\|_2 \le L\|w - w^*\|_2 + \sigma, \\
\|\nabla f_i(w)\|_1 &= \|\nabla f_i(w) - \nabla f_i(w^*) + \nabla f_i(w^*)\|_1 \le \|\nabla f_i(w) - \nabla f_i(w^*)\|_1 + \|\nabla f_i(w^*)\|_1 \le L_1\|w - w^*\|_2 + \sigma_1,
\end{align*}
where the last inequality holds by Assumption 7 with $p = 2$. Applying these to the previous formula and writing $\eta' = 1 + \eta$, we have
\begin{align*}
E\left[\|w_{t+1} - w^*\|_2^2\right] &\le (1 - \alpha\mu)E\left[\|w_t - w^*\|_2^2\right] + \eta'\alpha^2 E\left[\|\nabla f_t(w_t)\|_2^2\right] - 2\alpha E\left[f(w_t) - f(w^*)\right] \\
&\qquad + \alpha\zeta E\left[(\|w_t - w^*\|_2 + \|w^*\|_2)\|\nabla f_t(w_t)\|_2\right] + \alpha\delta E\left[\|\nabla f_t(w_t)\|_1\right] \\
&\le (1 - \alpha\mu + \eta'\alpha^2 L^2)E\left[\|w_t - w^*\|_2^2\right] + \alpha\delta L_1 E\left[\|w_t - w^*\|_2\right] + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 \\
&\qquad + \alpha\zeta E\left[(\|w_t - w^*\|_2 + \|w^*\|_2)(L\|w_t - w^*\|_2 + \sigma)\right] - 2\alpha E\left[f(w_t) - f(w^*)\right] \\
&= (1 - \alpha\mu + \alpha\zeta L + \eta'\alpha^2 L^2)E\left[\|w_t - w^*\|_2^2\right] - 2\alpha E\left[f(w_t) - f(w^*)\right] \\
&\qquad + (\alpha\delta L_1 + \alpha\zeta L\|w^*\|_2 + \alpha\zeta\sigma)E\left[\|w_t - w^*\|_2\right] + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 + \alpha\zeta\sigma\|w^*\|_2.
\end{align*}
Here we introduce a positive constant $C$ that we will set later; by the basic inequality,
\[
(\alpha\delta L_1 + \alpha\zeta L\|w^*\|_2 + \alpha\zeta\sigma)E\left[\|w_t - w^*\|_2\right] \le C\,E\left[\|w_t - w^*\|_2\right]^2 + \frac{(\alpha\delta L_1 + \alpha\zeta L\|w^*\|_2 + \alpha\zeta\sigma)^2}{4C} \le C\,E\left[\|w_t - w^*\|_2^2\right] + \frac{\alpha^2(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4C},
\]
so
\begin{align*}
E\left[\|w_{t+1} - w^*\|_2^2\right] &\le (1 - \alpha\mu + \alpha\zeta L + \eta'\alpha^2 L^2 + C)E\left[\|w_t - w^*\|_2^2\right] - 2\alpha E\left[f(w_t) - f(w^*)\right] \\
&\qquad + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 + \alpha\zeta\sigma\|w^*\|_2 + \frac{\alpha^2(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4C}.
\end{align*}
On setting $C = \alpha\mu - \alpha\zeta L - \eta'\alpha^2 L^2$, we have
\[
2\alpha E\left[f(w_t) - f(w^*)\right] \le E\left[\|w_t - w^*\|_2^2\right] - E\left[\|w_{t+1} - w^*\|_2^2\right] + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 + \alpha\zeta\sigma\|w^*\|_2 + \frac{\alpha^2(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4(\alpha\mu - \alpha\zeta L - \eta'\alpha^2 L^2)}.
\]
Since we can set $\alpha$ small enough that $\alpha\mu - \alpha\zeta L - \eta'\alpha^2 L^2 \ge \frac{1}{2}\alpha\mu$, this becomes
\[
2\alpha E\left[f(w_t) - f(w^*)\right] \le E\left[\|w_t - w^*\|_2^2\right] - E\left[\|w_{t+1} - w^*\|_2^2\right] + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 + \alpha\zeta\sigma\|w^*\|_2 + \frac{\alpha(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{2\mu}.
\]
Now we sum this inequality from $t = 0$ to $t = T - 1$ and divide by $2\alpha T$:
\[
\frac{1}{T}\sum_{t=0}^{T-1}E\left[f(w_t) - f(w^*)\right] \le \frac{\|w_0 - w^*\|_2^2 - E\left[\|w_T - w^*\|_2^2\right]}{2\alpha T} + \frac{\eta'\alpha\sigma^2 + \delta\sigma_1 + \zeta\sigma\|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4\mu},
\]
and since we sample $\bar w_T$ uniformly from $(w_0, w_1, \ldots, w_{T-1})$, we get
\[
E\left[f(\bar w_T) - f(w^*)\right] \le \frac{1}{2\alpha T}\|w_0 - w^*\|_2^2 + \frac{\eta'\alpha\sigma^2 + \delta\sigma_1 + \zeta\sigma\|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4\mu}.
\]
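The quantizer assumed by Lemma 5 and Theorem 3 can be sketched the same way as the fixed-point one above; only the grid changes. The following illustration is our own (the function names and all parameters are hypothetical choices): it uses the closed form $q_i = \frac{\delta}{\zeta}\left((1+\zeta)^i - 1\right)$ implied by $q_{i+1} - q_i = \delta + \zeta q_i$, whose spacing grows with magnitude as in logarithmic quantization.

```python
import numpy as np

# Sketch of an unbiased stochastic rounder for the nonlinear grid of Lemma 5,
# D = {..., -q_1, 0, q_1, ...} with q_{i+1} - q_i = delta + zeta*q_i, i.e.
# q_i = (delta/zeta)*((1+zeta)**i - 1). Grid size and constants are illustrative.
def make_grid(delta, zeta, levels):
    pos = (delta / zeta) * ((1 + zeta) ** np.arange(levels + 1) - 1)
    return np.concatenate([-pos[:0:-1], pos])     # mirror around q_0 = 0

def quantize_nl(u, grid, rng):
    u = np.clip(u, grid[0], grid[-1])             # stay in the representable range
    i = np.searchsorted(grid, u, side="right") - 1
    i = np.minimum(i, len(grid) - 2)
    low, high = grid[i], grid[i + 1]
    p_up = (u - low) / (high - low)               # unbiased: E[Q(u)] = u
    return np.where(rng.random(u.shape) < p_up, high, low)

grid = make_grid(delta=1e-3, zeta=0.05, levels=100)
rng = np.random.default_rng(0)
u = rng.standard_normal(5)
print(quantize_nl(u, grid, rng))
```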
Proof of Theorem 4. In the case of non-linear quantization,
\begin{align*}
E\left[\|Q(u_{t+1}) - u_{t+1}\|_2^2\right] &\le \delta E\left[\|u_{t+1} - w_t\|_1\right] + \zeta E\left[\|w_t\|_2\,\|u_{t+1} - w_t\|_2\right] + \eta E\left[\|u_{t+1} - w_t\|_2^2\right] \\
&\le \delta E\left[\|\alpha_t \nabla f_{i_t}(w_t)\|_1\right] + \zeta E\left[R_0\|\alpha_t \nabla f_{i_t}(w_t)\|_2\right] + \eta E\left[\|\alpha_t \nabla f_{i_t}(w_t)\|_2^2\right] \\
&\le \alpha_t\delta\sigma_1 + \alpha_t\zeta R_0 E\left[\|\nabla f_{i_t}(w_t)\|_2\right] + \alpha_t^2\eta E\left[\|\nabla f_{i_t}(w_t)\|_2^2\right].
\end{align*}
Similar to the analysis in the convex case, we introduce a positive constant $C$ to be decided later:
\[
\alpha_t\zeta R_0\|\nabla f_{i_t}(w_t)\|_2 \le C\alpha_t^2\|\nabla f_{i_t}(w_t)\|_2^2 + \frac{(\zeta R_0)^2}{4C},
\]
so
\[
E\left[\|Q(u_{t+1}) - u_{t+1}\|_2^2\right] \le \alpha_t^2(\eta + C)E\left[\|\nabla f_{i_t}(w_t)\|_2^2\right] + \alpha_t\delta\sigma_1 + \frac{(\zeta R_0)^2}{4C}.
\]
Substituting into the recursion from the proof of Theorem 2 and writing $\eta' = \eta + 1$, we have
\begin{align*}
E[f(w_{t+1})] &\le E[f(w_t)] - \left(\alpha_t - \tfrac{\alpha_t^2 L}{2}\right)E\left[\|\nabla f(w_t)\|_2^2\right] + \frac{\alpha_t^2\sigma_0^2 L}{2} + \frac{L}{2}E\left[\|Q(u_{t+1}) - u_{t+1}\|_2^2\right] \\
&\le E[f(w_t)] - \left(\alpha_t - \tfrac{(\eta' + C)\alpha_t^2 L}{2}\right)E\left[\|\nabla f(w_t)\|_2^2\right] + \frac{\alpha_t^2\sigma_0^2 L + \alpha_t L\delta\sigma_1}{2} + \frac{L(\zeta R_0)^2}{8C}.
\end{align*}
Next, we sum the result from $t = 0$ to $T - 1$, set $\alpha_t = \alpha$, and select $\bar w_T$ uniformly at random from $\{w_0, w_1, \ldots, w_{T-1}\}$; then we get
\begin{align*}
E\left[\|\nabla f(\bar w_T)\|_2^2\right] = \frac{1}{T}\sum_{t=0}^{T-1}E\left[\|\nabla f(w_t)\|_2^2\right] &\le \frac{2}{2\alpha - (\eta' + C)\alpha^2 L}\cdot\frac{f(w_0) - E[f(w_T)]}{T} + \frac{\alpha^2\sigma_0^2 L + \alpha L\delta\sigma_1 + L(\zeta R_0)^2/(4C)}{2\alpha - (\eta' + C)\alpha^2 L} \\
&\le \frac{2}{2\alpha - (\eta' + C)\alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - (\eta' + C)\alpha L} + \frac{L(\zeta R_0)^2}{4C(2\alpha - (\eta' + C)\alpha^2 L)}.
\end{align*}
One possible setting of $C$ is $C = \frac{1}{\alpha L} - \eta'$; then $2\alpha - (\eta' + C)\alpha^2 L = \alpha$ and $2 - (\eta' + C)\alpha L = 1$, so we get
\[
E\left[\|\nabla f(\bar w_T)\|_2^2\right] \le \frac{2(f(w_0) - f^*)}{\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{L(\zeta R_0)^2}{4C\alpha} = \frac{2(f(w_0) - f^*)}{\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{L^2(\zeta R_0)^2}{4(1 - (\eta + 1)\alpha L)},
\]
and if we set $\alpha$ small enough that $\alpha < \frac{1}{2(\eta + 1)L}$, then $1 - (\eta + 1)\alpha L > \frac12$ and we get
\[
E\left[\|\nabla f(\bar w_T)\|_2^2\right] \le \frac{2(f(w_0) - f^*)}{\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{1}{2}L^2(\zeta R_0)^2.
\]
This is the result we stated in the theorem.

Next we show that our choice of the constant $C$ in this proof is reasonable. Consider the setting of $C$ that minimizes the size of the noise ball (as $T$ approaches infinity), that is,
\[
C^* = \operatorname*{argmin}_C\left[\frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - (\eta' + C)\alpha L} + \frac{L(\zeta R_0)^2}{4C(2\alpha - (\eta' + C)\alpha^2 L)}\right] = \operatorname*{argmin}_C\left[\frac{\alpha\sigma_0^2 L + \delta\sigma_1 L}{(2 - \eta'\alpha L) - \alpha L C} + \frac{L(\zeta R_0)^2/(4\alpha)}{C\left[(2 - \eta'\alpha L) - \alpha L C\right]}\right].
\]
Write
\[
S = \frac{C_1}{A - Bx} + \frac{C_2}{x(A - Bx)}, \qquad \text{where } A = 2 - \eta'\alpha L,\quad B = \alpha L,\quad C_1 = \alpha\sigma_0^2 L + \delta\sigma_1 L,\quad C_2 = \frac{L(\zeta R_0)^2}{4\alpha}.
\]
Then the optimal setting of $C$ can be solved from
\[
\frac{\partial S}{\partial x} = \frac{BC_1}{(A - Bx)^2} + \frac{C_2(2Bx - A)}{x^2(A - Bx)^2} = 0,
\]
i.e., $BC_1 x^2 + 2BC_2 x - AC_2 = 0$, whose positive root is
\[
x = \frac{-BC_2 + \sqrt{(BC_2)^2 + ABC_1C_2}}{BC_1} \approx \frac{A}{2B} = \frac{1}{\alpha L} - \frac{\eta'}{2}.
\]
The approximation in the last step is based on the fact that the extra term inside the square root is relatively small:
\[
\frac{AC_1}{BC_2} = \frac{4(2 - \eta'\alpha L)(\alpha\sigma_0^2 + \delta\sigma_1)}{L(\zeta R_0)^2}
\]
is a small number, so $\sqrt{(BC_2)^2 + ABC_1C_2} \approx BC_2 + \frac{AC_1}{2}$. Meanwhile, since $\alpha$ is small, $\frac{1}{\alpha L}$ is large while $\eta' = 1 + \eta$ is small in comparison, so our choice $C = \frac{1}{\alpha L} - \eta'$ is close enough to the optimal setting $\frac{1}{\alpha L} - \frac{\eta'}{2}$, and is therefore a reasonable setting.

Proof of Theorem 5. In the normal FPQ case, the set of quantization points is
\[
D = \{0\} \cup \left\{\pm s\left(1 + \frac{x}{n_m}\right)2^y \;\middle|\; x = 0, 1, \ldots, n_m - 1,\; y = -\frac{n_e}{2} + 2, \ldots, \frac{n_e}{2}\right\},
\]
and the parameters for the nonlinear quantization bound can be computed as
\[
\delta = \frac{4s}{2^{n_e/2}}, \qquad \zeta = \frac{1}{n_m}, \qquad \eta = \frac{\zeta^2}{4(1 + \zeta)} = \frac{1}{4n_m(n_m + 1)}.
\]
For NLQ-SGD, the noise ball size according to Theorem 3 (as $\alpha \to 0$) is
\[
\frac{\delta\sigma_1 + \zeta\sigma\|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4\mu};
\]
denote this as $\frac12 A + \frac{1}{4\mu}B^2$. When $b$ is large, $\delta, \zeta, \eta$ are all small, so the quadratic term $B^2$ is negligible and the dominating term of the noise ball is
\[
A = \delta\sigma_1 + \zeta\sigma\|w^*\|_2 = 4s\sigma_1\,2^{-n_e/2} + \sigma\|w^*\|_2\,\frac{1}{n_m} = 4s\sigma_1\,2^{-n_e/2} + \sigma\|w^*\|_2\,\frac{n_e}{C},
\]
using $n_m = C/n_e$ with $C = 2^b$. Setting the derivative over $n_e$ to zero, we get
\[
\frac{\partial A}{\partial n_e} = -2(\ln 2)s\sigma_1\,2^{-n_e/2} + \frac{\sigma\|w^*\|_2}{C} = 0, \qquad 2^{n_e/2} = \frac{2(\ln 2)s\sigma_1 C}{\sigma\|w^*\|_2},
\]
so
\[
n_e = 2\log_2\frac{2(\ln 2)s\sigma_1 C}{\sigma\|w^*\|_2}, \qquad b_e = \log_2 n_e = \log_2\left(2b + 2\log_2\frac{2(\ln 2)s\sigma_1}{\sigma\|w^*\|_2}\right).
\]
And when $b$ is small, $\delta, \zeta, \eta$ are large and the dominating term of the noise ball is
\[
B = \delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma = 4sL_1\,2^{-n_e/2} + (L\|w^*\|_2 + \sigma)\,\frac{1}{n_m} = 4sL_1\,2^{-n_e/2} + (L\|w^*\|_2 + \sigma)\,\frac{n_e}{C}.
\]
Setting the derivative over $n_e$ to zero, we get
\[
\frac{\partial B}{\partial n_e} = -2(\ln 2)sL_1\,2^{-n_e/2} + \frac{L\|w^*\|_2 + \sigma}{C} = 0, \qquad 2^{n_e/2} = \frac{2(\ln 2)sL_1 C}{L\|w^*\|_2 + \sigma},
\]
so
\[
n_e = 2\log_2\frac{2(\ln 2)sL_1 C}{L\|w^*\|_2 + \sigma}, \qquad b_e = \log_2\left(2b + 2\log_2\frac{2(\ln 2)sL_1}{L\|w^*\|_2 + \sigma}\right).
\]
For $b$ such that neither term dominates the result, the noise ball size is
\[
\frac12 A + \frac{1}{4\mu}B^2 = \frac{\delta\sigma_1 + \zeta\sigma\|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4\mu},
\]
whose derivative in $n_e$ is $\frac12\frac{\partial A}{\partial n_e} + \frac{B}{2\mu}\frac{\partial B}{\partial n_e}$. Since both $\frac{\partial A}{\partial n_e}$ and $\frac{\partial B}{\partial n_e}$ are increasing functions of $n_e$ and we know that
\[
\frac{\partial A}{\partial n_e}\bigg|_{n_e = 2\log_2\frac{2(\ln 2)s\sigma_1 C}{\sigma\|w^*\|_2}} = 0, \qquad \frac{\partial B}{\partial n_e}\bigg|_{n_e = 2\log_2\frac{2(\ln 2)sL_1 C}{L\|w^*\|_2 + \sigma}} = 0,
\]
the solution of $\frac{\partial}{\partial n_e}\left(\frac12 A + \frac{1}{4\mu}B^2\right) = 0$ lies in the interval between $2\log_2\frac{2(\ln 2)s\sigma_1 C}{\sigma\|w^*\|_2}$ and $2\log_2\frac{2(\ln 2)sL_1 C}{L\|w^*\|_2 + \sigma}$.
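Since the endpoints above only bracket the optimizer, in practice one can simply scan the feasible splits $b_e + b_m = b$. Here is a small sketch of that scan (our illustration; all constants are toy values of our choosing, not values from the paper), which evaluates the Theorem 5 noise-ball expression $\frac12 A + \frac{1}{4\mu}B^2$ for each split.

```python
import numpy as np

# Scan the exponent/mantissa split minimizing the Theorem 5 noise-ball proxy
# 0.5*A + B**2/(4*mu) for normal FPQ. Constants are illustrative toy values.
s, sigma, sigma1 = 1.0, 1.0, 10.0
L, L1, mu, w_norm = 1.0, 10.0, 0.5, 1.0
b = 8                                      # total bits: b = b_e + b_m

def noise_ball(b_e):
    n_e, n_m = 2.0 ** b_e, 2.0 ** (b - b_e)
    delta, zeta = 4 * s * 2.0 ** (-n_e / 2), 1 / n_m
    A = delta * sigma1 + zeta * sigma * w_norm
    B = delta * L1 + zeta * (L * w_norm + sigma)
    return 0.5 * A + B ** 2 / (4 * mu)

best = min(range(1, b), key=noise_ball)    # exhaustive search over splits
print("best exponent bits:", best, "noise ball:", noise_ball(best))
```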
Proof of Theorem 6. In the denormal FPQ case, the set of quantization points is
\[
D = \left\{\pm s\,\frac{x}{n_m}\,2^{-\frac{n_e}{2} + 3} \;\middle|\; x = 0, 1, \ldots, n_m - 1\right\} \cup \left\{\pm s\left(1 + \frac{x}{n_m}\right)2^y \;\middle|\; x = 0, 1, \ldots, n_m - 1,\; y = -\frac{n_e}{2} + 3, \ldots, \frac{n_e}{2}\right\},
\]
and the parameters for the nonlinear quantization bound are
\[
\delta = \frac{8s}{n_m}\,2^{-n_e/2} = \frac{8s\,n_e}{C}\,2^{-n_e/2}, \qquad \zeta = \frac{1}{n_m}, \qquad \eta = \frac{\zeta^2}{4(1 + \zeta)} = \frac{1}{4n_m(n_m + 1)}.
\]
For NLQ-SGD, the noise ball size according to Theorem 3 is again
\[
\frac{\delta\sigma_1 + \zeta\sigma\|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma)^2}{4\mu};
\]
denote this as $\frac12 A + \frac{1}{4\mu}B^2$. When $b$ is large, $\delta, \zeta, \eta$ are small and the dominating term of the noise ball is
\[
A = \delta\sigma_1 + \zeta\sigma\|w^*\|_2 = \frac{8s\sigma_1 n_e}{C}\,2^{-n_e/2} + \sigma\|w^*\|_2\,\frac{n_e}{C}.
\]
Denote $V(x) = xe^x$ and let the Lambert $W$ function be $W(y) = V^{-1}(y)$, $y \ge -\frac{1}{e}$. Setting the derivative over $n_e$ to zero, we need
\[
\frac{\partial A}{\partial n_e} = \frac{8s\sigma_1}{C}\,2^{-n_e/2}\left(1 - \frac{\ln 2}{2}n_e\right) + \frac{\sigma\|w^*\|_2}{C} = \frac{8s\sigma_1}{eC}\,V\!\left(1 - \frac{\ln 2}{2}n_e\right) + \frac{\sigma\|w^*\|_2}{C} = 0;
\]
thus we have
\[
n_e = \frac{2}{\ln 2}\left(1 - W\!\left(-\frac{e\,\sigma\|w^*\|_2}{8s\sigma_1}\right)\right), \qquad b_e = \log_2 n_e.
\]
And when $b$ is small, $\delta, \zeta, \eta$ are large and the dominating term of the noise ball is
\[
B = \delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma = \frac{8sL_1 n_e}{C}\,2^{-n_e/2} + (L\|w^*\|_2 + \sigma)\,\frac{n_e}{C}.
\]
Setting the derivative over $n_e$ to zero, we get
\[
\frac{\partial B}{\partial n_e} = \frac{8sL_1}{C}\,2^{-n_e/2}\left(1 - \frac{\ln 2}{2}n_e\right) + \frac{L\|w^*\|_2 + \sigma}{C} = \frac{8sL_1}{eC}\,V\!\left(1 - \frac{\ln 2}{2}n_e\right) + \frac{L\|w^*\|_2 + \sigma}{C} = 0;
\]
thus we have
\[
n_e = \frac{2}{\ln 2}\left(1 - W\!\left(-\frac{e(L\|w^*\|_2 + \sigma)}{8sL_1}\right)\right), \qquad b_e = \log_2 n_e.
\]
For $b$ such that neither term dominates the result, the noise ball size is $\frac12 A + \frac{1}{4\mu}B^2$, whose derivative in $n_e$ is $\frac12\frac{\partial A}{\partial n_e} + \frac{B}{2\mu}\frac{\partial B}{\partial n_e}$. Since both $\frac{\partial A}{\partial n_e}$ and $\frac{\partial B}{\partial n_e}$ are increasing in $n_e$ and vanish at $n_e = \frac{2}{\ln 2}\left(1 - W\!\left(-\frac{e\sigma\|w^*\|_2}{8s\sigma_1}\right)\right)$ and $n_e = \frac{2}{\ln 2}\left(1 - W\!\left(-\frac{e(L\|w^*\|_2 + \sigma)}{8sL_1}\right)\right)$ respectively, the solution of $\frac{\partial}{\partial n_e}\left(\frac12 A + \frac{1}{4\mu}B^2\right) = 0$ lies in the interval between these two values.

Proof of Theorem 7. In normal floating-point quantization, we know that
\[
\delta_{\mathrm{normal}} = \frac{4s}{2^{2^{b_e - 1}}}, \qquad \zeta = 2^{-b_m}, \qquad \eta = \frac{\zeta^2}{4(1 + \zeta)}.
\]
Same as before, denote $n_e = 2^{b_e}$, $n_m = 2^{b_m}$, and $C = 2^b = 2^{b_e + b_m}$; then
\[
\delta_{\mathrm{normal}} = \frac{4s}{2^{n_e/2}}, \qquad \zeta = \frac{1}{n_m} = \frac{n_e}{C}, \qquad \eta = \frac{1}{4n_m(n_m + 1)},
\]
and the noise ball size we wish to minimize (as $T \to \infty$ and $\alpha \to 0$), according to Theorem 4, is
\[
S = L\delta\sigma_1 + \frac12 L^2(\zeta R_0)^2 = 4sL\sigma_1\,2^{-n_e/2} + \frac{L^2R_0^2\,n_e^2}{2C^2}, \qquad \frac{\partial S}{\partial n_e} = -2(\ln 2)sL\sigma_1\,2^{-n_e/2} + \frac{L^2R_0^2}{C^2}\,n_e.
\]
The noise ball is minimized when $\frac{\partial S}{\partial n_e} = 0$, that is,
\[
n_e\,2^{n_e/2} = \frac{2(\ln 2)s\sigma_1 C^2}{LR_0^2}, \qquad \text{i.e.,} \qquad \left(\frac{\ln 2}{2}n_e\right)e^{\frac{\ln 2}{2}n_e} = \frac{(\ln 2)^2 s\sigma_1 C^2}{LR_0^2}.
\]
Let $V(x) = xe^x$ and let the Lambert $W$ function be $W(y) = V^{-1}(y)$, $y \ge -\frac{1}{e}$; then the solution is
\[
n_e = \frac{2}{\ln 2}\,W\!\left(\frac{(\ln 2)^2 s\sigma_1 C^2}{LR_0^2}\right) = \frac{2}{\ln 2}\,W\!\left(\frac{(\ln 2)^2 s\sigma_1 2^{2b}}{LR_0^2}\right), \qquad b_e = \log_2 n_e = \log_2\left(\frac{2}{\ln 2}\,W\!\left(\frac{(\ln 2)^2 s\sigma_1 2^{2b}}{LR_0^2}\right)\right).
\]
For denormal FPQ on non-convex objectives,
\[
S = \frac{8sL\sigma_1 n_e}{C}\,2^{-n_e/2} + \frac{L^2R_0^2\,n_e^2}{2C^2}, \qquad \frac{\partial S}{\partial n_e} = \frac{8s\sigma_1 L}{C}\,2^{-n_e/2}\left(1 - \frac{\ln 2}{2}n_e\right) + \frac{L^2R_0^2}{C^2}\,n_e.
\]
Setting $\frac{\partial S}{\partial n_e} = 0$, we have
\[
2^{-n_e/2}\left(1 - \frac{\ln 2}{2}n_e\right) + \frac{LR_0^2}{8s\sigma_1 C}\,n_e = 0;
\]
this is a transcendental equation which does not have an analytical solution, and it may have no solution at all. If a solution does exist, we can solve it numerically and use it as the optimal setting of the exponent bits.
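A quick way to instantiate these recipes is to evaluate the closed form for normal FPQ and scan $S(n_e)$ numerically in the denormal case. The sketch below is our own illustration (all constants are toy values, and we use SciPy's `lambertw` for $W$); it is not the paper's code.

```python
import numpy as np
from scipy.special import lambertw

# Theorem 7 recipe: closed-form exponent-bit choice for normal FPQ via the
# Lambert W function, plus a numeric scan for the denormal case, whose
# optimality condition is transcendental. Constants are toy values.
s, sigma1, L, R0, b = 1.0, 10.0, 1.0, 1.0, 8
C = 2.0 ** b

# Normal FPQ: n_e = (2/ln 2) * W((ln 2)^2 * s * sigma1 * C^2 / (L * R0^2)).
n_e = 2 / np.log(2) * lambertw((np.log(2) ** 2) * s * sigma1 * C**2 / (L * R0**2)).real
print("normal FPQ: optimal n_e ~", n_e, "-> b_e ~", np.log2(n_e))

# Denormal FPQ: no closed form, so scan the noise ball S(n_e) directly.
def S_denormal(n_e):
    return 8 * s * L * sigma1 * n_e * 2.0 ** (-n_e / 2) / C \
        + L**2 * R0**2 * n_e**2 / (2 * C**2)

grid = 2.0 ** np.arange(1, b)              # feasible n_e = 2**b_e with b_e < b
best = grid[np.argmin([S_denormal(x) for x in grid])]
print("denormal FPQ: best n_e on grid:", best)
```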