Low-Precision Stochastic Gradient Langevin Dynamics

Ruqi Zhang¹, Andrew Gordon Wilson², Christopher De Sa³

Abstract

While low-precision optimization has been widely used to accelerate deep learning, low-precision sampling remains largely unexplored. As a consequence, sampling is simply infeasible in many large-scale scenarios, despite providing remarkable benefits to generalization and uncertainty estimation for neural networks. In this paper, we provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance, due to its intrinsic ability to handle system noise. We prove that the convergence of low-precision SGLD with full-precision gradient accumulators is less affected by the quantization error than its SGD counterpart in the strongly convex setting. To further enable low-precision gradient accumulators, we develop a new quantization function for SGLD that preserves the variance in each update step. We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.

1. Introduction

Low-precision optimization has become increasingly popular for reducing the computation and memory costs of training deep neural networks (DNNs). It uses fewer bits to represent numbers in model parameters, activations, and gradients, and thus can drastically lower resource demands (Gupta et al., 2015; Zhou et al., 2016; De Sa et al., 2017; Li et al., 2017). Prior work has shown that using 8-bit numbers in training DNNs achieves about 4× latency speedups and memory reduction compared to 32-bit numbers on a wide variety of deep learning tasks (Sun et al., 2019; Yang et al., 2019; Wang et al., 2018b; Banner et al., 2018).
¹The University of Texas at Austin  ²New York University  ³Cornell University. Correspondence to: Ruqi Zhang.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

As datasets and architectures grow rapidly, low-precision optimization enables training large-scale DNNs efficiently and enables many applications on different hardware and platforms.

Despite the impressive progress in low-precision optimization, low-precision sampling remains largely unexplored. However, we believe stochastic gradient Markov chain Monte Carlo (SGMCMC) methods (Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015) are particularly suited to low-precision arithmetic because of their intrinsic robustness to system noise. In particular: (1) SGMCMC explores the weight space instead of converging to a single point, so it should not require precise weights or gradients; (2) SGMCMC even adds noise to the system to encourage exploration, and so is naturally more tolerant of quantization noise; (3) SGMCMC performs Bayesian model averaging during testing using an ensemble of models, which allows coarse representations of individual models to be compensated by the overall model average (Zhu et al., 2019).

SGMCMC is particularly compelling in Bayesian deep learning due to its ability to characterize complex and multimodal DNN posteriors, providing state-of-the-art generalization accuracy and calibration (Zhang et al., 2020; Li et al., 2016; Gan et al., 2017; Heek & Kalchbrenner, 2019). Moreover, low-precision approaches are especially appealing in this setting, where at test time we must store samples from a posterior over millions of parameters, and perform multiple forward passes through the corresponding models, which incurs significant memory and computational expense.
In this paper, we give the first comprehensive study of low-precision Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011), providing both theoretical convergence bounds and promising empirical results in deep learning. On strongly log-concave distributions (i.e., strongly convex energy functions for SGD), we prove that the convergence of SGLD with full-precision gradient accumulators is more robust to the quantization error than its counterpart in SGD. Surprisingly, we find that SGLD with low-precision gradient accumulators can diverge arbitrarily far from the target distribution with small stepsizes. We identify the source of the issue and develop a new quantization function to correct the bias with minimal overhead. Empirically, we demonstrate low-precision SGLD across different tasks, showing that it provides superior generalization and uncertainty estimation using just 8 bits. We summarize our contributions as follows:

- We provide a methodology for SGLD to leverage low-precision computation, including a new quantization function, while still guaranteeing its convergence to the target distribution.
- We offer theoretical results that explicitly show how quantization error affects the convergence of the sampler in the strongly convex setting, proving its robustness to quantization noise relative to SGD.
- We show SGLD is particularly suitable for low-precision deep learning over a range of experiments, including logistic regression and Bayesian neural networks on image and text classification.

In short, low-precision SGLD is often a compelling alternative to standard SGLD, improving speed and memory efficiency while retaining accuracy. Moreover, SGLD is arguably more amenable to low-precision computation than SGD. Our code is available here.
2. Related Work

To speed up SGLD training, most existing work relies on distributed learning with synchronous or asynchronous communication (Ahn et al., 2014; Chen et al., 2016; Li et al., 2019). Another direction is to shorten training time by accelerating convergence using variance reduction techniques (Dubey et al., 2016; Baker et al., 2019), importance sampling (Deng et al., 2020), or a cyclical learning rate schedule (Zhang et al., 2020). To speed up SGLD during testing, distillation techniques, which transfer the knowledge of an ensemble of models to a single model, are often used to save both compute and memory (Korattikara et al., 2015; Wang et al., 2018a).

Low-precision computation has become one of the most common approaches to reduce latency and memory consumption in deep learning and is widely supported on newly emerging chips including CPUs, GPUs, and TPUs (Micikevicius et al., 2018; Krishnamoorthi, 2018; Esser et al., 2020). Two main directions for improving low-precision training are developing new number formats (Sun et al., 2019; 2020) and studying mixed-precision schemes (Courbariaux et al., 2015; Zhou et al., 2016; Banner et al., 2018).

Recently, one line of work applies the Bayesian framework to learn a deterministic quantized neural network (Soudry et al., 2014; Cheng et al., 2015; Achterhold et al., 2018; van Baalen et al., 2020; Meng et al., 2020). Low-precision computation is largely unexplored for Bayesian neural networks, despite its specific promise in this domain. Su et al. (2019) propose a method to train binarized variational BNNs, and Cai et al. (2018) develop efficient hardware for training low-precision variational BNNs. The only work on low-precision MCMC known to us is Ferianc et al. (2021), which directly applies post-training quantization techniques from optimization (Jacob et al., 2018) to convert BNNs trained by Stochastic Gradient Hamiltonian Monte Carlo (Chen et al., 2014) into low-precision models.
We instead study training low-precision models by SGLD from scratch, to accelerate both training and testing.

3. Preliminaries

3.1. Stochastic Gradient Langevin Dynamics

In the Bayesian setting, given a dataset $\mathcal{D}$, a model with parameters $\theta \in \mathbb{R}^d$, and a prior $p(\theta)$, we are interested in sampling from the posterior $p(\theta \mid \mathcal{D}) \propto \exp(-U(\theta))$, where the energy function is $U(\theta) = -\sum_{x \in \mathcal{D}} \log p(x \mid \theta) - \log p(\theta)$. When the dataset is large, computing a sum over the entire dataset is expensive. Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011) reduces the cost by using a stochastic gradient estimate $\nabla \tilde U$, an unbiased estimator of $\nabla U$ based on a subset of the dataset $\mathcal{D}$. Specifically, SGLD updates the parameter $\theta$ in the $(k{+}1)$-th step following the rule

$$\theta_{k+1} = \theta_k - \alpha \nabla \tilde U(\theta_k) + \sqrt{2\alpha}\,\xi_{k+1}, \qquad (1)$$

where $\alpha$ is the stepsize and $\xi$ is standard Gaussian noise. Compared to the SGD update, the only difference is that SGLD adds Gaussian noise in each step, which essentially enables SGLD to characterize the full distribution instead of converging to a single point. The close connection between SGLD and SGD makes it convenient to implement and run on existing deep learning tasks for which SGD is the typical learning algorithm.

3.2. Low-Precision Training

We study training a low-precision model by SGLD from scratch, to reduce both training and testing costs. Specifically, we follow the framework in prior work to quantize the weights, activations, backpropagation errors, and gradients (Wu et al., 2018; Wang et al., 2018b; Yang et al., 2019). We mainly consider the effect of weight and gradient quantization, following previous work (Li et al., 2017; Yang et al., 2019). Please refer to Appendix A for more details.

3.2.1. Number Representations
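The update rule (1) differs from SGD only by the injected Gaussian noise. A minimal NumPy sketch of one step (the function and variable names are our own, not from the paper):

```python
import numpy as np

def sgld_step(theta, stoch_grad, alpha, rng):
    """One SGLD update: theta - alpha * grad + sqrt(2 * alpha) * xi."""
    xi = rng.standard_normal(theta.shape)  # standard Gaussian noise
    return theta - alpha * stoch_grad + np.sqrt(2.0 * alpha) * xi
```

After a burn-in period, each iterate can be kept as a posterior sample for Bayesian model averaging.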
To represent numbers in low precision, one simple approach is fixed point, which has been utilized in both theory and practice (Gupta et al., 2015; Lin et al., 2016; Li et al., 2017; Yang et al., 2019). Specifically, suppose we use $W$ bits to represent numbers, with $F$ of those $W$ bits representing the fractional part. Then the distance between consecutive representable numbers, called the quantization gap, is $\Delta = 2^{-F}$. The representable numbers also have a lower bound $l$ and an upper bound $u$, where $l = -2^{W-F-1}$ and $u = 2^{W-F-1} - 2^{-F}$. As the number of bits decreases, the accuracy of the number representation decreases. We use this type of number representation in our theoretical analysis and empirical demonstrations, following previous work (Li et al., 2017; Yang et al., 2019). Another common type of number representation is floating point, where each number has its own exponent. Between fixed point and floating point lies block floating point, which allows all numbers within a block to share the same exponent (Song et al., 2018). We use block floating point for the deep learning experiments since it has been shown to be more favorable for deep models (Yang et al., 2019).

3.2.2. Quantization

Given a low-precision number representation, we also need a quantization function $Q$ to convert a real-valued number into a low-precision number. Such functions include deterministic rounding and stochastic rounding. The deterministic rounding function $Q^d$ quantizes a number to its nearest representable neighbor:

$$Q^d(\theta) = \mathrm{sign}(\theta)\cdot\mathrm{clip}\!\left(\Delta\left\lfloor \frac{|\theta|}{\Delta} + \frac{1}{2}\right\rfloor,\ l,\ u\right),$$

where $\mathrm{clip}(x, l, u) = \max(\min(x, u), l)$. Stochastic rounding $Q^s$ instead quantizes a number randomly, with probabilities based on the distance to its representable neighbors:

$$Q^s(\theta) = \begin{cases} \mathrm{clip}\!\left(\Delta\left\lfloor \frac{\theta}{\Delta}\right\rfloor,\ l,\ u\right), & \text{w.p. } \left\lceil \frac{\theta}{\Delta}\right\rceil - \frac{\theta}{\Delta}, \\[4pt] \mathrm{clip}\!\left(\Delta\left\lceil \frac{\theta}{\Delta}\right\rceil,\ l,\ u\right), & \text{w.p. } 1 - \left(\left\lceil \frac{\theta}{\Delta}\right\rceil - \frac{\theta}{\Delta}\right). \end{cases}$$

An important property of $Q^s$ is that $\mathbb{E}[Q^s(\theta)] = \theta$, i.e., the quantized number is unbiased.
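The fixed-point grid and the two rounding functions above can be sketched as follows (a NumPy rendering under our own naming; clipping covers the representable range, and stochastic rounding is unbiased away from the range bounds):

```python
import numpy as np

def fixed_point_bounds(wl, fl):
    """Grid spacing and representable range for wl total bits, fl fractional bits."""
    delta = 2.0 ** (-fl)                     # quantization gap
    lower = -(2.0 ** (wl - fl - 1))
    upper = 2.0 ** (wl - fl - 1) - delta
    return delta, lower, upper

def quantize_det(x, wl, fl):
    """Deterministic (nearest-neighbor) rounding Q^d onto the fixed-point grid."""
    delta, lo, hi = fixed_point_bounds(wl, fl)
    return np.clip(np.sign(x) * delta * np.floor(np.abs(x) / delta + 0.5), lo, hi)

def quantize_stoch(x, wl, fl, rng):
    """Stochastic rounding Q^s: rounds up with probability equal to the
    fractional position, so E[Q^s(x)] = x inside the representable range."""
    delta, lo, hi = fixed_point_bounds(wl, fl)
    floor = np.floor(x / delta)
    up = rng.random(np.shape(x)) < (x / delta - floor)
    return np.clip(delta * (floor + up), lo, hi)
```

For example, with $W = 8$ and $F = 3$ the gap is $\Delta = 0.125$ and the range is $[-16, 15.875]$; quantize_det maps 0.3 to 0.25, while quantize_stoch randomizes between 0.25 and 0.375 with mean 0.3.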
$Q^s$ is generally preferred over $Q^d$ in practice since it can preserve gradient information, especially when the gradient update is smaller than the quantization gap (Gupta et al., 2015; Wang et al., 2018b). In what follows, we use $Q_W$ and $\Delta_W$ to denote the weight quantizer and its quantization gap, and $Q_G$ and $\Delta_G$ to denote the gradient quantizer and its quantization gap.

To perform the gradient update in low-precision training, there are two common choices, depending on whether we store an additional copy of full-precision weights. Full-precision gradient accumulators use a full-precision weight buffer to accumulate gradient updates, and only quantize the weights before computing gradients. SGD with full-precision gradient accumulators (SGDLP-F) updates the weights as

$$\theta_{k+1} = \theta_k - \alpha\, Q_G\big(\nabla \tilde U(Q_W(\theta_k))\big),$$

where $\theta_{k+1}$ and $\theta_k$ are kept in full precision and the weights are quantized only for the forward and backward passes (Courbariaux et al., 2015; Li et al., 2017). However, the gradient accumulator has to be updated frequently during training, so it is appealing to also represent it in low precision to further reduce costs. To achieve this, we can instead use the update

$$\theta_{k+1} = Q_W\big(\theta_k - \alpha\, Q_G(\nabla \tilde U(\theta_k))\big), \qquad (2)$$

where $\theta$ is always represented in low precision. This update of SGD is said to use low-precision gradient accumulators (SGDLP-L). Both full- and low-precision gradient accumulators have been widely used: low-precision gradient accumulators are cheaper and faster because all numbers are kept in low precision, whereas full-precision gradient accumulators generally perform better because they reflect small gradient updates more precisely (Courbariaux et al., 2015; Li et al., 2017).

4. Low-Precision SGLD

In this section, we first study the convergence of low-precision SGLD with full-precision gradient accumulators (SGLDLP-F) on strongly log-concave distributions (i.e.
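The two accumulator schemes can be contrasted in a few lines (a sketch with our own names; `q_weight` and `q_grad` stand for any quantizers, e.g. the stochastic rounding of Section 3.2.2, and `grad_fn` for the stochastic gradient):

```python
import numpy as np

def sgd_step_full_acc(theta_fp, grad_fn, q_weight, q_grad, alpha):
    """SGDLP-F: full-precision buffer; weights quantized only for the gradient pass."""
    g = q_grad(grad_fn(q_weight(theta_fp)))   # low-precision gradient at quantized weights
    return theta_fp - alpha * g               # accumulate in full precision

def sgd_step_low_acc(theta_lp, grad_fn, q_weight, q_grad, alpha):
    """SGDLP-L (Eq. 2): the updated weights are re-quantized every step."""
    g = q_grad(grad_fn(theta_lp))
    return q_weight(theta_lp - alpha * g)     # weights stay on the low-precision grid
```

The re-quantization in the second variant is exactly where small updates (smaller than $\Delta_W$) can be lost or, with stochastic rounding, turned into extra noise.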
the energy function is strongly convex) and show that SGLDLP-F is less affected by the quantization error than its SGD counterpart. Next, we analyze low-precision SGLD with low-precision gradient accumulators (SGLDLP-L) under the same setup and prove that SGLDLP-L can diverge arbitrarily far from the target distribution with a small stepsize, which, however, is typically required by SGLD to reduce asymptotic bias. Finally, we solve this problem by developing a variance-corrected quantization function and further prove that, with this quantization function, SGLDLP-L converges with small stepsizes.

4.1. Full-Precision Gradient Accumulators

As shown in Equation (1), the update of SGLD is simply an SGD update plus Gaussian noise. Therefore, the low-precision formulation for SGD in Section 3.2 naturally extends to SGLD training. Analogously to SGDLP-F, low-precision SGLD with full-precision gradient accumulators (SGLDLP-F) performs the update

$$\theta_{k+1} = \theta_k - \alpha\, Q_G\big(\nabla \tilde U(Q_W(\theta_k))\big) + \sqrt{2\alpha}\,\xi_{k+1}, \qquad (3)$$

which can also be viewed as SGDLP-F plus Gaussian noise in each step. However, the Gaussian noise turns out to help counteract the noise introduced by quantization, making SGLDLP-F more robust to inaccurate number representation and allowing it to converge better than SGDLP-F, as we show later in this section.

We now prove the convergence of SGLDLP-F. Our analysis builds on the 2-Wasserstein distance bounds for SGLD in Dalalyan & Karagulyan (2019), where the target distribution is assumed to be smooth and strongly log-concave. We additionally assume the energy function has a Lipschitz Hessian, following recent work in low-precision optimization (Yang et al., 2019). In summary, for all $\theta, \theta' \in \mathbb{R}^d$, the energy function $U$ satisfies

$$U(\theta) - U(\theta') - \nabla U(\theta')^\top (\theta - \theta') \ \ge\ (m/2)\,\|\theta - \theta'\|_2^2,$$
$$\|\nabla U(\theta) - \nabla U(\theta')\|_2 \ \le\ M\,\|\theta - \theta'\|_2,$$
$$\|\nabla^2 U(\theta) - \nabla^2 U(\theta')\|_2 \ \le\ \Psi\,\|\theta - \theta'\|_2,$$

for some positive constants $m$, $M$, and $\Psi$.
We further assume that the variance of the stochastic gradient is bounded, $\mathbb{E}[\|\nabla \tilde U(\theta) - \nabla U(\theta)\|_2^2] \le \kappa^2$, for some constant $\kappa$. For simplicity, we consider SGLD with a constant stepsize $\alpha$. We use stochastic rounding for quantizing both weights and gradients, as it is generally better than deterministic rounding and has also been used in previous low-precision theoretical analyses (Li et al., 2017; Yang et al., 2019).

Theorem 1. Run SGLDLP-F under the above assumptions with a constant stepsize $\alpha \le 2/(m + M)$. Let $\pi$ be the target distribution, $\mu_0$ the initial distribution, and $\mu_K$ the distribution obtained by SGLDLP-F after $K$ iterations. Then the 2-Wasserstein distance satisfies

$$W_2(\mu_K, \pi) \le (1 - \alpha m)^K W_2(\mu_0, \pi) + 1.65\,\frac{M}{m}(\alpha d)^{1/2} + \min\!\left(\frac{\Psi \Delta_W^2 d}{4m},\ \frac{M \Delta_W \sqrt{d}}{2m}\right) + \frac{\sqrt{(\Delta_G^2 + M^2 \Delta_W^2)\alpha d + 4\alpha\kappa^2}}{\sqrt{m}}.$$

This theorem shows that SGLDLP-F converges to the accuracy floor $\min\big(\Psi \Delta_W^2 d/(4m),\, M \Delta_W \sqrt{d}/(2m)\big)$ given a large enough number of iterations $K$ and a small enough stepsize $\alpha$. Moreover, if we further assume that the energy function is quadratic, i.e., $\Psi = 0$, then SGLDLP-F converges to the target distribution asymptotically. This matches the result for SGDLP-F on a quadratic function, which converges to the optimum asymptotically (Li et al., 2017). Our theorem also recovers the bound in Dalalyan & Karagulyan (2019) when the quantization gap is zero (we ignore the $1.65M$ in the denominator of their bound for simplicity). However, when the energy function is not quadratic, SGLDLP-F converges to the target distribution at a $O(\Delta_W^2)$ rate, whereas SGDLP-F converges to the optimum at a $O(\Delta_W)$ rate (Yang et al., 2019).¹ Recalling that $\Delta_W = 2^{-F}$, where $F$ is the number of fractional bits, our result suggests that, asymptotically, SGLD needs only half the number of bits that SGD needs to achieve the same convergence accuracy. Our comparison between SGLD and SGD also fits into the literature comparing sampling and optimization convergence bounds (Ma et al., 2019; Talwar, 2019) (see Appendix E for more details).
In summary, our theorem characterizes how sensitive SGLD is to the quantization error, and actually suggests that sampling methods are more amenable to low-precision computation than optimization methods.

4.2. Low-Precision Gradient Accumulators

As mentioned before, it would be ideal to further reduce costs using low-precision gradient accumulators. Mimicking the SGDLP-L update in Equation (2), a natural update rule for SGLD with low-precision gradient accumulators (SGLDLP-L) is

$$\theta_{k+1} = Q_W\big(\theta_k - \alpha\, Q_G(\nabla \tilde U(\theta_k)) + \sqrt{2\alpha}\,\xi_{k+1}\big). \qquad (4)$$

Surprisingly, while we can prove a convergence result for SGLDLP-L, our theory and empirical results suggest that it can diverge arbitrarily far from the target distribution with small stepsizes.

Theorem 2. Run SGLDLP-L under the same assumptions as in Theorem 1. Let $\mu_0$ be the initial distribution and $\mu_K$ the distribution obtained by SGLDLP-L after $K$ iterations. Then

$$W_2(\mu_K, \pi) \le (1 - \alpha m)^K W_2(\mu_0, \pi) + 1.65\,\frac{M}{m}(\alpha d)^{1/2} + \min\!\left(\frac{\Psi \Delta_W^2 d}{4m},\ \frac{M \Delta_W \sqrt{d}}{2m}\right) + A + \big((1 - \alpha m)^K + 1\big)\Delta_W \sqrt{d},$$

where $A = \sqrt{(\alpha \Delta_G^2 + \alpha^{-1} \Delta_W^2)d + 4\alpha\kappa^2}\,/\sqrt{m}$.

Since the term $A$ contains $\alpha^{-1}$ in its numerator, this theorem implies that as the stepsize $\alpha$ decreases, the $W_2$ distance between the stationary distribution of SGLDLP-L and the target distribution may increase. To test whether this is the case, we empirically run SGLDLP-L on a standard Gaussian distribution in Figure 1. We use 8-bit fixed point with 3 bits assigned to the fractional part. Our results verify that SGLDLP-L indeed diverges from the target distribution as the stepsize shrinks. At the same time, SGLDLP-F always converges to the target distribution across stepsizes, in line with Theorem 1.

One may try to choose a stepsize that minimizes the above $W_2$ distance to avoid divergence.

¹Their bound $O(\Delta_W^2)$ is stated for the squared norm; we therefore take its square root to compare with our $W_2$ distance bound, which is stated for the norm.
[Figure 1: histograms of samples against the true density for (a) SGLD with α = 0.001 and (b) SGLD with α = 0.0001, each comparing naïve SGLDLP-L, VC SGLDLP-L, and SGLDLP-F.]

Figure 1. Low-precision SGLD with varying stepsizes on a Gaussian distribution. Variance-corrected SGLD with low-precision gradient accumulators (VC SGLDLP-L) and SGLD with full-precision gradient accumulators (SGLDLP-F) converge to the true distribution, whereas naïve SGLDLP-L diverges, and the divergence increases as the stepsize decreases.

Finding such an optimal stepsize is in general difficult, since the constants are unknown in practice. Moreover, enabling a small stepsize in SGLD is often desirable, since it is needed to reduce the asymptotic bias of the posterior approximation (Welling & Teh, 2011).

4.3. Variance-Corrected Quantization

To correct the problem with naïve SGLD with low-precision gradient accumulators, we first need to identify the source of the issue. We show in the following that the reason is that the variance of each dimension of $\theta_{k+1}$ becomes larger than intended when using low-precision gradient accumulators. Specifically, given the stochastic gradient $\nabla \tilde U$, the update of full-precision SGLD can be viewed as sampling from a Gaussian distribution for each dimension $i$:

$$\theta_{k+1,i} \sim \mathcal{N}\big(\theta_{k,i} - \alpha \nabla \tilde U(\theta_k)_i,\ 2\alpha\big), \qquad i = 1, \dots, d.$$

Since stochastic rounding is unbiased, using it as the weight quantizer $Q_W$ and the gradient quantizer $Q_G$ in SGLDLP-L gives

$$\mathbb{E}[\theta_{k+1,i}] = \mathbb{E}\big[Q_W\big(\theta_{k,i} - \alpha\, Q_G(\nabla \tilde U(\theta_k))_i + \sqrt{2\alpha}\,\xi_{k+1,i}\big)\big] = \theta_{k,i} - \alpha \nabla \tilde U(\theta_k)_i,$$

which has the same mean as $\theta_{k+1}$ in full precision. However, the variance of $\theta_{k+1,i}$ is now essentially larger than needed.
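The divergence seen in Figure 1 is easy to reproduce. Below is a small self-contained simulation in the spirit of that experiment (our own construction, not the paper's code) on the standard Gaussian target $U(\theta) = \theta^2/2$, using 8-bit fixed point with 3 fractional bits: naïve SGLDLP-L (Eq. 4) overshoots the target variance of 1, while SGLDLP-F (Eq. 3) stays close to it.

```python
import numpy as np

WL, FL = 8, 3
DELTA = 2.0 ** (-FL)                               # quantization gap
LO, HI = -2.0 ** (WL - FL - 1), 2.0 ** (WL - FL - 1) - DELTA

def stoch_round(x, rng):
    """Unbiased stochastic rounding onto the fixed-point grid."""
    f = np.floor(x / DELTA)
    up = rng.random(np.shape(x)) < (x / DELTA - f)
    return np.clip(DELTA * (f + up), LO, HI)

def simulate(alpha, steps, low_acc, rng, chains=64):
    """Run SGLD on N(0,1) (grad U(theta) = theta); return variance of iterates."""
    theta, kept = np.zeros(chains), []
    for t in range(steps):
        noise = np.sqrt(2 * alpha) * rng.standard_normal(chains)
        g = stoch_round(stoch_round(theta, rng), rng)   # quantized gradient
        if low_acc:   # naive SGLDLP-L: re-quantize the weights every step
            theta = stoch_round(theta - alpha * g + noise, rng)
        else:         # SGLDLP-F: full-precision accumulator
            theta = theta - alpha * g + noise
        if t >= steps // 2:                             # discard burn-in
            kept.append(theta)
    return float(np.var(np.concatenate(kept)))
```

With $\alpha = 10^{-3}$, the per-step rounding variance (up to $\Delta^2/4 \approx 0.0039$) dwarfs the injected $2\alpha = 0.002$, so the naïve low-accumulator chain equilibrates at a variance well above 1.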
If we ignore the variance from $Q_G$ and the stochastic gradient, since both are also present in SGLDLP-F and have been shown to work well there, the variance of $\theta_{k+1,i}$ is

$$\mathrm{Var}[\theta_{k+1,i}] = \mathbb{E}\big[\mathrm{Var}\big[Q_W\big(\theta_{k,i} - \alpha \nabla \tilde U(\theta_k)_i + \sqrt{2\alpha}\,\xi_{k+1,i}\big) \,\big|\, \xi_{k+1,i}\big]\big] + \mathrm{Var}\big[\mathbb{E}\big[Q_W\big(\theta_{k,i} - \alpha \nabla \tilde U(\theta_k)_i + \sqrt{2\alpha}\,\xi_{k+1,i}\big) \,\big|\, \xi_{k+1,i}\big]\big] = \frac{\Delta_W^2}{4}\chi_{k+1,i} + 2\alpha,$$

where $\chi_{k+1,i} \in [0, 1]$ depends on the distance of $\theta_{k+1,i}$ to its discrete neighbors. The above equation shows that the variance of the SGLDLP-L update is larger than the correct value $2\alpha$. Empirically, Figure 1 likewise shows that naïve SGLDLP-L estimates the mean correctly but the variance wrongly. This validates our intuition that low-precision gradient accumulators with stochastic rounding add more variance than needed, leading to an inaccurate variance estimate. Moreover, we cannot simply use deterministic rounding to solve the problem, since it is biased and generally gives much worse results than stochastic rounding, especially on deep neural networks (see Appendix F for an empirical demonstration).

To enable SGLD with low-precision gradient accumulators, we propose a new quantization function $Q^{vc}$, for variance-corrected quantization, to fix this issue. The main idea of $Q^{vc}$ is to sample directly from the discrete weight space instead of quantizing a real-valued Gaussian sample. First, note that if we want a sample with small mean $\mu$ and variance $v \le \Delta_W^2/4$, we can sample from the following categorical distribution over $\{-\Delta_W, \Delta_W, 0\}$:

$$\mathrm{Cat}(\mu, v) = \begin{cases} \Delta_W, & \text{w.p. } \dfrac{v + \mu^2 + \mu\Delta_W}{2\Delta_W^2}, \\[4pt] -\Delta_W, & \text{w.p. } \dfrac{v + \mu^2 - \mu\Delta_W}{2\Delta_W^2}, \\[4pt] 0, & \text{otherwise.} \end{cases} \qquad (5)$$

Algorithm 1: Variance-Corrected Low-Precision SGLD (VC SGLDLP-L)
  given: stepsize α, number of training iterations K, gradient quantizer Q_G, and weight quantization gap Δ_W
  for k = 1 : K do
    θ_{k+1} ← Q^{vc}(θ_k − α Q_G(∇Ũ(θ_k)), 2α, Δ_W)
  end for
  output: samples {θ_k}

Now we show how to use this categorical distribution to preserve the correct mean and variance of the quantized $\theta_{k+1}$.
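Equation (5) can be checked numerically: a direct calculation shows the three-point draw has mean $\mu$ and variance $v$ whenever the two probabilities are valid. A sketch with our own names:

```python
import numpy as np

def sample_cat(mu, v, delta, rng, size):
    """Draw from the three-point distribution Cat(mu, v) over {+delta, -delta, 0}.

    Valid when both probabilities below lie in [0, 1], e.g. for small |mu| and
    v <= delta**2 / 4; the draw then has mean mu and variance v."""
    p_up = (v + mu**2 + mu * delta) / (2 * delta**2)
    p_dn = (v + mu**2 - mu * delta) / (2 * delta**2)
    u = rng.random(size)
    return np.where(u < p_up, delta, np.where(u < p_up + p_dn, -delta, 0.0))
```

The mean works out to $\Delta_W (p_+ - p_-) = \mu$ and the second moment to $\Delta_W^2 (p_+ + p_-) = v + \mu^2$, hence variance $v$.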
We do so by considering two cases. When the Gaussian variance $2\alpha$ is larger than the largest possible stochastic-rounding variance $\Delta_W^2/4$, $Q^{vc}$ first adds a small Gaussian noise and then uses a sample from Equation (5) to make up the remaining variance; otherwise, $Q^{vc}$ samples directly from Equation (5) to achieve the target variance. The full description of $Q^{vc}$ is outlined in Algorithm 2. Our variance-corrected quantization function $Q^{vc}$ always guarantees the correct mean, $\mathbb{E}[\theta_{k+1,i}] = \theta_{k,i} - \alpha \nabla \tilde U(\theta_k)_i$, and further guarantees the correct variance $\mathrm{Var}[\theta_{k+1,i}] = 2\alpha$ most of the time, the exception being when $v = 2\alpha < v_s$. That case rarely happens in practice, because the stepsize would have to be extremely small. Moreover, our quantization is simple to implement, and its cost is negligible compared to the gradient computation. Although $Q^{vc}$ only preserves the correctness of the first two moments (i.e., mean and variance), we show that this does not affect performance much, in both theory and practice.

We now prove that SGLDLP-L using $Q^{vc}$, denoted VC SGLDLP-L, converges to the target distribution up to a certain accuracy level with small stepsizes.

Theorem 3. Run VC SGLDLP-L as in Algorithm 1. In addition to the assumptions of Theorem 1, assume the gradient is bounded, $\mathbb{E}\big[\|Q_G(\nabla \tilde U(\theta_k))\|_1\big] \le G$, and let $v_0 = \Delta_W^2/4$. Then

$$W_2(\mu_K, \pi) \le (1 - \alpha m)^K W_2(\mu_0, \pi) + 1.65\,\frac{M}{m}(\alpha d)^{1/2} + \frac{\sqrt{\alpha \Delta_G^2 d + 4\alpha\kappa^2}}{\sqrt{m}} + \big((1 - \alpha m)^K + 1\big) \cdot \begin{cases} \sqrt{5 v_0 d}, & \text{if } 2\alpha > v_0, \\ \sqrt{\max\big(2\Delta_W \alpha G,\ 4\alpha d\big)}, & \text{otherwise}. \end{cases}$$

Algorithm 2: Variance-Corrected Quantization Function Q^{vc}
  input: (µ, v, Δ)    {Q^{vc} returns a variable with mean µ and variance v}
  v_0 ← Δ²/4    {Δ²/4 is the largest possible variance that stochastic rounding can cause}
  if v > v_0 then
    {add a small Gaussian noise and sample from the discrete grid to make up the remaining variance}
    x ← µ + √(v − v_0) ξ, where ξ ∼ N(0, I_d)
    r ← x − Q^d(x)
    for all i do: sample c_i from Cat(|r_i|, v_0) as in Equation (5)
    θ ← Q^d(x) + sign(r) ⊙ c
  else
    {sample from the discrete grid to achieve the target variance}
    r ← µ − Q^s(µ)
    for all i do
      v_s ← (1 − |r_i|/Δ) r_i² + (|r_i|/Δ)(−r_i + sign(r_i)Δ)²
      if v > v_s then sample c_i from Cat(0, v − v_s) as in Equation (5); θ_i ← Q^s(µ)_i + c_i
      else θ_i ← Q^s(µ)_i
      end if
    end for
  end if
  clip θ if outside the representable range
  return θ

This theorem shows that as the stepsize $\alpha \to 0$, VC SGLDLP-L converges to the target distribution up to an error, instead of diverging. Moreover, VC SGLDLP-L converges to the target distribution at a $O(\Delta_W)$ rate, which matches the convergence rate of SGD with low-precision gradient accumulators to the optimum (Li et al., 2017; Yang et al., 2019). However, we show empirically that VC SGLDLP-L has a much better dependency on the quantization gap than SGD. We leave improving the theoretical bound to future work.

We empirically demonstrate VC SGLDLP-L on the standard Gaussian distribution, under the same setting as in the previous section, in Figure 1. Regardless of the stepsize, VC SGLDLP-L converges to the target distribution and approximates it as accurately as SGLDLP-F, showing that preserving the correct variance is the key to ensuring correct convergence.
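A NumPy sketch of Algorithm 2 (our own rendering; `delta` is the weight quantization gap, and range clipping is omitted for brevity):

```python
import numpy as np

def q_det(x, delta):
    """Nearest-grid rounding Q^d (range clipping omitted)."""
    return delta * np.floor(x / delta + 0.5)

def q_stoch(x, delta, rng):
    """Unbiased stochastic rounding Q^s (range clipping omitted)."""
    f = np.floor(x / delta)
    return delta * (f + (rng.random(x.shape) < x / delta - f))

def cat_sample(mu, v, delta, rng):
    """Three-point distribution of Eq. (5): mean mu, variance v, values {+-delta, 0}."""
    p_up = (v + mu**2 + mu * delta) / (2 * delta**2)
    p_dn = (v + mu**2 - mu * delta) / (2 * delta**2)
    u = rng.random(mu.shape)
    return np.where(u < p_up, delta, np.where(u < p_up + p_dn, -delta, 0.0))

def q_vc(mu, v, delta, rng):
    """Variance-corrected quantization: grid-valued output, mean mu, variance v."""
    v0 = delta**2 / 4
    if v > v0:
        # Gaussian noise covers the excess variance; the grid sample adds v0.
        x = mu + np.sqrt(v - v0) * rng.standard_normal(mu.shape)
        r = x - q_det(x, delta)
        c = cat_sample(np.abs(r), np.full_like(mu, v0), delta, rng)
        return q_det(x, delta) + np.sign(r) * c
    # Stochastic rounding, topped up to the target variance where possible.
    s = q_stoch(mu, delta, rng)
    r = mu - s
    vs = (1 - np.abs(r) / delta) * r**2 + (np.abs(r) / delta) * (-r + np.sign(r) * delta)**2
    c = cat_sample(np.zeros_like(mu), np.maximum(v - vs, 0.0), delta, rng)
    return np.where(v > vs, s + c, s)
```

In the first branch the output $Q^d(x) + \mathrm{sign}(r)\,c$ has conditional mean $x$ and variance $v_0$, so the total variance is $(v - v_0) + v_0 = v$; in the second branch the top-up draw supplies the variance that stochastic rounding did not.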
5. Experiments

We demonstrate the generalization accuracy and uncertainty estimation of low-precision SGLD with full-precision gradient accumulators (SGLDLP-F) and with variance-corrected low-precision gradient accumulators (VC SGLDLP-L) on logistic regression and a multilayer perceptron on the MNIST dataset (Section 5.1), ResNet-18 on the CIFAR datasets and an LSTM on the IMDB dataset (Section 5.2), and ResNet-18 on the ImageNet dataset (Section 5.3). We use qtorch (Zhang et al., 2019) to simulate low-precision training in these experiments, and use the same quantization for weights, gradients, activations, and backpropagation errors unless otherwise stated. For all experiments, SGLD collects samples from the posterior over the model's weights, and predictions on test data are obtained by Bayesian model averaging. We mainly compare low-precision SGLD with low-precision SGD, which has been used to achieve state-of-the-art results in low-precision deep learning (Sun et al., 2019; 2020). We use SGLDFP and SGDFP to denote SGLD and SGD in full precision, respectively.

[Figure 2: test NLL vs. number of fractional bits (2–10) for (a) logistic regression and (b) a multilayer perceptron on MNIST, comparing SGDLP-L, naïve SGLDLP-L, VC SGLDLP-L, SGDLP-F, and SGLDLP-F against the full-precision baselines SGDFP (−0.566 / −1.14) and SGLDFP (−0.567 / −1.17).]

Figure 2. Test NLL on MNIST at different precisions. As the quantization level increases, VC SGLDLP-L and SGLDLP-F maintain high performance, whereas the corresponding SGD methods deteriorate quickly.

Logistic Regression. We first empirically verify our theorems, including the dependence of the bias on the quantization level and the tightness of the bounds, using logistic regression on the MNIST dataset. We use N(0, 1/6) as the prior distribution, following Yang et al. (2019); the resulting posterior is strongly log-concave and satisfies the assumptions in Section 4.
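Bayesian model averaging over the collected SGLD samples simply averages predictive probabilities across sampled weight vectors. A generic sketch for a linear softmax model (names and setup are our own, not the paper's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bma_predict(weight_samples, x):
    """Average class probabilities over posterior weight samples."""
    probs = [softmax(x @ w) for w in weight_samples]  # one forward pass per sample
    return np.mean(probs, axis=0)
```

This is where low precision pays off twice at test time: each stored sample is smaller, and each forward pass is cheaper.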
We use fixed-point numbers with 2 integer bits and vary the number of fractional bits, which corresponds to varying the quantization gap $\Delta$. We report test negative log-likelihood (NLL) for different numbers of fractional bits in Figure 2a. From the results, we see that SGLDLP-F is more robust to a reduced number of bits than its SGD counterpart: SGLDLP-F outperforms SGDLP-F at every bit width and recovers the full-precision result with 6 bits, whereas SGDLP-F needs 10 bits. This verifies Theorem 1, namely that SGLDLP-F converges to the target distribution up to an error and is more robust to the quantization gap than SGDLP-F. It also suggests that the bound in Theorem 1 is relatively tight in terms of the quantization gap, since SGLDLP-F needs roughly half the number of bits of SGDLP-F to achieve the same convergence accuracy. With low-precision gradient accumulators, VC SGLDLP-L is significantly better than naïve SGLDLP-L, indicating that our variance-corrected quantization effectively reduces the bias from the gradient accumulators, which verifies Theorems 2 and 3. Moreover, VC SGLDLP-L outperforms SGDLP-L at every bit width, and even outperforms SGDLP-F when using 2 fractional bits. These observations demonstrate that with either full- or low-precision gradient accumulators, SGLD maintains high performance, whereas SGD deteriorates quickly as the quantization noise increases.

Multilayer Perceptron. To test whether these results carry over to non-log-concave distributions, we replace the logistic regression model with a two-layer multilayer perceptron (MLP). The MLP has 100 hidden units and ReLU nonlinearities. From Figure 2b, we observe results similar to those for logistic regression, suggesting that our analysis still holds empirically on general distributions, and shedding light on the possibility of extending the theoretical analysis of low-precision SGLD to non-log-concave settings.
Please note that, as far as we know, there is no theoretical convergence result for low-precision optimization in non-convex settings either.

5.2. CIFAR and IMDB

We consider image and sentiment classification tasks: the CIFAR datasets (Krizhevsky et al., 2009) with ResNet-18 (He et al., 2016), and the IMDB dataset (Maas et al., 2011) with an LSTM (Hochreiter & Schmidhuber, 1997). We use 8-bit number representations, since they are increasingly popular for training deep models and are supported by the newest generation of chips (Sun et al., 2019; Banner et al., 2018; Wang et al., 2018b). We report test errors averaged over 3 runs, with standard errors, in Table 1.

Table 1. Test errors (%) on CIFAR with ResNet-18 and IMDB with LSTM. Low-precision SGLD outperforms low-precision SGD across datasets, architectures, and number representations, and the improvement grows as more low-precision arithmetic is used.

                              CIFAR-10       CIFAR-100      IMDB
  32-bit floating point
    SGLDFP                    4.65 ± 0.06    22.58 ± 0.18   13.43 ± 0.21
    SGDFP                     4.71 ± 0.02    22.64 ± 0.13   13.88 ± 0.29
    cSGLDFP                   4.54 ± 0.05    21.63 ± 0.04   13.25 ± 0.18
  8-bit fixed point
    naïve SGLDLP-L            7.82 ± 0.13    27.25 ± 0.13   16.63 ± 0.28
    VC SGLDLP-L               7.13 ± 0.01    26.62 ± 0.16   15.38 ± 0.27
    SGDLP-L                   8.53 ± 0.08    28.86 ± 0.10   19.28 ± 0.63
    SGLDLP-F                  5.12 ± 0.06    23.30 ± 0.09   15.40 ± 0.36
    SGDLP-F                   5.20 ± 0.14    23.84 ± 0.12   15.74 ± 0.79
  8-bit block floating point
    naïve SGLDLP-L            5.85 ± 0.04    26.38 ± 0.13   14.64 ± 0.08
    VC SGLDLP-L               5.51 ± 0.01    25.22 ± 0.18   13.99 ± 0.24
    SGDLP-L                   5.86 ± 0.18    26.19 ± 0.11   16.06 ± 1.81
    SGLDLP-F                  4.58 ± 0.07    22.59 ± 0.18   14.05 ± 0.33
    SGDLP-F                   4.75 ± 0.05    22.90 ± 0.13   14.28 ± 0.17
    VC cSGLDLP-L              4.97 ± 0.10    22.61 ± 0.12   13.09 ± 0.27
    cSGLD-F                   4.32 ± 0.07    21.50 ± 0.14   13.13 ± 0.37

Table 2. ECE (%) on CIFAR with ResNet-18. VC SGLDLP-L and SGLDLP-F achieve almost the same or even lower ECE than full-precision SGLD, whereas the ECE of low-precision SGD increases significantly.
                              CIFAR-10   CIFAR-100
  32-bit floating point
    SGLD                      1.11       3.92
    SGD                       2.53       4.97
    cSGLDFP                   0.66       1.38
  8-bit fixed point
    VC SGLDLP-L               0.60       3.19
    SGDLP-L                   3.40       10.38
    SGLDLP-F                  1.12       4.42
    SGDLP-F                   3.05       6.80
  8-bit block floating point
    VC SGLDLP-L               0.60       5.82
    SGDLP-L                   4.23       12.97
    SGLDLP-F                  1.19       3.78
    SGDLP-F                   2.76       5.20
    VC cSGLDLP-L              0.51       1.39
    cSGLD-F                   0.56       1.33

Fixed Point. We use 8-bit fixed point for weights and gradients but full precision for activations, since we find that low-precision activations significantly harm performance. Similar to the results in previous sections, SGLDLP-F is better than SGDLP-F, and VC SGLDLP-L significantly outperforms naïve SGLDLP-L and SGDLP-L across datasets and architectures. Notably, the improvement of SGLD over SGD grows as more low-precision arithmetic is used. For example, on CIFAR-100, VC SGLDLP-L outperforms SGDLP-L by 2.24%, SGLDLP-F outperforms SGDLP-F by 0.54%, and SGLDFP outperforms SGDFP by 0.06%. This demonstrates that SGLD is particularly compatible with low-precision deep learning because of its natural ability to handle system noise.

Block Floating Point. We also consider block floating point (BFP), another common number type, which is often preferred over fixed point for deep models due to less quantization error from overflow and underflow (Song et al., 2018). Following the block design in Yang et al. (2019), we use small-block BFP for ResNet and big-block BFP for the LSTM. The $Q^{vc}$ function naturally generalizes to BFP with only a small modification (see Appendix G for the BFP version of $Q^{vc}$). With BFP, the results of all low-precision methods improve over fixed point. SGLDLP-F can match the performance of SGLDFP with all numbers except the gradient accumulators quantized to 8 bits. VC SGLDLP-L still outperforms naïve SGLDLP-L, indicating the effectiveness of $Q^{vc}$ with BFP. Again, SGLDLP-F and VC SGLDLP-L outperform their SGD counterparts on all tasks, suggesting the general applicability of low-precision SGLD with different number types.
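The idea behind block floating point is that one exponent is shared by a whole block of values. A minimal sketch of such a quantizer (our own illustrative construction; the `wl - 2` mantissa-bit choice and clipping are assumptions, not the exact scheme of Song et al. (2018) or qtorch):

```python
import numpy as np

def quantize_bfp(x, wl):
    """Quantize a block of values to wl-bit block floating point:
    all entries share one exponent derived from the block's largest magnitude."""
    max_mag = np.max(np.abs(x))
    if max_mag == 0.0:
        return np.zeros_like(x)
    exp = int(np.floor(np.log2(max_mag)))      # shared exponent for the block
    scale = 2.0 ** (exp - (wl - 2))            # grid spacing implied by wl bits
    q = np.clip(np.round(x / scale), -2 ** (wl - 1), 2 ** (wl - 1) - 1)
    return q * scale
```

Because the exponent tracks the block's largest magnitude, overflow is avoided and small entries keep more relative precision than with a single global fixed-point format.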
Cyclical SGLD We further apply low precision to a recent variant of SGLD, cSGLD, which utilizes a cyclical learning rate schedule to speed up convergence (Zhang et al., 2020). We observe that the results of cSGLD-F are very close to those of cSGLDFP, and VC cSGLDLP-L can match or even outperform full-precision SGD with all numbers quantized to 8 bits! These results indicate that diverse samples from different modes, obtained by the cyclical learning rate schedule, can counteract the quantization error by providing complementary predictions.

Expected Calibration Error Besides generalization performance, we further report the expected calibration error (ECE) (Guo et al., 2017) to assess the uncertainty estimation of low-precision SGLD. In Table 2, we observe that SGLDLP-F and VC SGLDLP-L achieve almost the same or even lower ECE than full-precision SGLD, showing that SGLD's ability to give well-calibrated predictions does not degenerate in low precision. VC SGLDLP-L sometimes gives lower ECE than SGLDLP-F, which may be due to the regularization effect of low-precision arithmetic. Moreover, cSGLD in low precision not only achieves the best accuracy but also the best calibration, further suggesting that diverse samples obtained by a cyclical learning rate schedule have a positive effect under quantization. In contrast, the ECE of low-precision SGD increases significantly compared to full-precision SGD, implying that the quantization error makes standard DNNs even more overconfident, which might lead to wrong decisions in real-world applications.

Table 3. Test errors (%) on ImageNet with ResNet-18. The improvement of SGLD over SGD becomes larger in low-precision than in full-precision.

                              TOP-1   TOP-5
32-bit floating point
  SGLD                        30.39   10.76
  SGD                         30.56   10.97
8-bit block floating point
  SGLDLP-F                    31.47   11.77
  SGDLP-F                     32.23   12.09
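The expected calibration error reported above bins predictions by confidence and averages the accuracy–confidence gaps, weighted by bin size (Guo et al., 2017). A minimal sketch (our own helper, using the 10-bin setting mentioned in Appendix H):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin fraction) * |bin accuracy - bin confidence|.

    `confidences` are the predicted probabilities of the predicted class,
    `correct` is 1/0 per prediction. Bins are half-open (lo, hi].
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A perfectly calibrated model (e.g. 75% confidence and 75% accuracy in a bin) yields an ECE of 0, while a fully overconfident model is penalized by the full accuracy gap.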
SGLDLP-F vs VC SGLDLP-L We have provided two variants of low-precision SGLD for practical use. In general, SGLDLP-F has better performance while VC SGLDLP-L requires less computation, making them suitable for different cases. When computational resources are very limited, e.g. on edge devices, VC SGLDLP-L is preferred for its lower cost; when the resources can support full-precision gradient accumulators, SGLDLP-F is preferred for its better performance.

5.3. ImageNet

Finally, we test low-precision SGLD on a large-scale image classification dataset, ImageNet, with ResNet-18. We train SGD for 90 epochs and train SGLD for 10 epochs using the trained SGD model as the initialization. In Table 3, we observe that the improvement of SGLD over SGD is larger in low precision (0.76% top-1 error) than in full precision (0.17% top-1 error), showing the advantages of low-precision SGLD on large-scale deep learning tasks. We could not achieve reasonable results with low-precision gradient accumulators for SGD or SGLD, which might be resolved by further hyper-parameter tuning.

6. Conclusion

We provide the first comprehensive investigation of low-precision SGLD. With full-precision gradient accumulators, we prove that SGLD is convergent and can be safely used in practice, and further show that its convergence has a better dependency on the quantization gap than that of SGD. Moreover, we reveal issues in naïvely performing low-precision computation in SGLD with low-precision gradient accumulators, and propose a new, theoretically guaranteed quantization function to enable fully quantized sampling. We conduct experiments on a Gaussian distribution and a logistic regression to empirically verify our theoretical results. We further show that low-precision SGLD achieves results comparable to full-precision SGLD and significantly outperforms low-precision SGD on several Bayesian deep learning benchmarks.
MCMC was once the gold standard on small neural networks (Neal et al., 2011), but has been significantly limited by its high costs on large architectures in deep learning. We believe this work fills an important gap and will accelerate the practical use of sampling methods on large-scale and resource-restricted machine learning problems. Moreover, low-precision SGLD could broadly be used as a drop-in replacement for standard SGLD, as it confers speed and memory advantages while retaining accuracy.

Acknowledgements

RZ is supported by the NSF AI Institute for Foundations of Machine Learning (IFML). AGW is supported by NSF CAREER IIS-2145492, NSF I-DISRE 193471, NIH R01DA048764-01A1, NSF IIS-1910266, Meta Core Data Science, Google AI Research, BigHat Biosciences, Capital One, and an Amazon Research Award.

References

Achterhold, J., Koehler, J. M., Schmeink, A., and Genewein, T. Variational network quantization. In International Conference on Learning Representations, 2018.

Ahn, S., Shahbaba, B., and Welling, M. Distributed stochastic gradient MCMC. In International Conference on Machine Learning, pp. 1044–1052. PMLR, 2014.

Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. Control variates for stochastic gradient MCMC. Statistics and Computing, 29(3):599–615, 2019.

Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scalable methods for 8-bit training of neural networks. Advances in Neural Information Processing Systems, 2018.

Cai, R., Ren, A., Liu, N., Ding, C., Wang, L., Qian, X., Pedram, M., and Wang, Y. VIBNN: Hardware acceleration of Bayesian neural networks. ACM SIGPLAN Notices, 53(2):476–488, 2018.

Chen, C., Ding, N., Li, C., Zhang, Y., and Carin, L. Stochastic gradient MCMC with stale gradients. Advances in Neural Information Processing Systems, 2016.

Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691. PMLR, 2014.
Cheng, Z., Soudry, D., Mao, Z., and Lan, Z. Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562, 2015.

Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Dalalyan, A. S. and Karagulyan, A. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.

De Sa, C., Feldman, M., Ré, C., and Olukotun, K. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 561–574, 2017.

Deng, W., Lin, G., and Liang, F. A contour stochastic gradient Langevin dynamics algorithm for simulations of multi-modal distributions. Advances in Neural Information Processing Systems, 33:15725–15736, 2020.

Dubey, K. A., J Reddi, S., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient Langevin dynamics. Advances in Neural Information Processing Systems, 29:1154–1162, 2016.

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. International Conference on Learning Representations, 2020.

Ferianc, M., Maji, P., Mattina, M., and Rodrigues, M. On the effects of quantisation on model uncertainty in Bayesian neural networks. Uncertainty in Artificial Intelligence, 2021.

Gan, Z., Li, C., Chen, C., Pu, Y., Su, Q., and Carin, L. Scalable Bayesian learning of recurrent neural networks for language modeling. Association for Computational Linguistics, 2017.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
PMLR, 2017.

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746. PMLR, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Heek, J. and Kalchbrenner, N. Bayesian inference for large scale image classification. arXiv preprint arXiv:1908.03491, 2019.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.

Korattikara, A., Rathod, V., Murphy, K., and Welling, M. Bayesian dark knowledge. arXiv preprint arXiv:1506.04416, 2015.

Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Li, C., Stevens, A., Chen, C., Pu, Y., Gan, Z., and Carin, L. Learning weight uncertainty with stochastic gradient MCMC for shape classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5666–5675, 2016.

Li, C., Chen, C., Pu, Y., Henao, R., and Carin, L. Communication-efficient stochastic gradient MCMC for neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4173–4180, 2019.

Li, H., De, S., Xu, Z., Studer, C., Samet, H., and Goldstein, T. Training quantized nets: A deeper understanding. Advances in Neural Information Processing Systems, 2017.

Li, Z. and De Sa, C. M. Dimension-free bounds for low-precision training.
Advances in Neural Information Processing Systems, 2019.

Lin, D., Talathi, S., and Annapureddy, S. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858. PMLR, 2016.

Ma, Y.-A., Chen, T., and Fox, E. B. A complete recipe for stochastic gradient MCMC. Advances in Neural Information Processing Systems, 2015.

Ma, Y.-A., Chen, Y., Jin, C., Flammarion, N., and Jordan, M. I. Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881–20885, 2019.

Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, 2011.

Meng, X., Bachmann, R., and Khan, M. E. Training binary neural networks using the Bayesian learning rule. In International Conference on Machine Learning, pp. 6852–6861. PMLR, 2020.

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. Mixed precision training. International Conference on Learning Representations, 2018.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Song, Z., Liu, Z., and Wang, D. Computation error analysis of block floating point arithmetic oriented convolution neural network accelerator design. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Soudry, D., Hubara, I., and Meir, R. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems, volume 1, pp. 2, 2014.

Su, J., Cvitkovic, M., and Huang, F. Sampling-free learning of Bayesian quantized neural networks. arXiv preprint arXiv:1912.02992, 2019.
Sun, X., Choi, J., Chen, C.-Y., Wang, N., Venkataramani, S., Srinivasan, V. V., Cui, X., Zhang, W., and Gopalakrishnan, K. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. Advances in Neural Information Processing Systems, 32:4900–4909, 2019.

Sun, X., Wang, N., Chen, C.-Y., Ni, J., Agrawal, A., Cui, X., Venkataramani, S., El Maghraoui, K., Srinivasan, V. V., and Gopalakrishnan, K. Ultra-low precision 4-bit training of deep neural networks. Advances in Neural Information Processing Systems, 33, 2020.

Talwar, K. Computational separations between sampling and optimization. Advances in Neural Information Processing Systems, 2019.

van Baalen, M., Louizos, C., Nagel, M., Amjad, R. A., Wang, Y., Blankevoort, T., and Welling, M. Bayesian bits: Unifying quantization and pruning. Advances in Neural Information Processing Systems, 2020.

Wang, K.-C., Vicol, P., Lucas, J., Gu, L., Grosse, R., and Zemel, R. Adversarial distillation of Bayesian neural network posteriors. In International Conference on Machine Learning, pp. 5190–5199. PMLR, 2018a.

Wang, N., Choi, J., Brand, D., Chen, C.-Y., and Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7686–7695, 2018b.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688. Citeseer, 2011.

Wu, S., Li, G., Chen, F., and Shi, L. Training and inference with integers in deep neural networks. International Conference on Learning Representations, 2018.

Yang, G., Zhang, T., Kirichenko, P., Bai, J., Wilson, A. G., and De Sa, C. SWALP: Stochastic weight averaging in low-precision training. International Conference on Machine Learning, 2019.

Zhang, R., Li, C., Zhang, J., Chen, C., and Wilson, A. G. Cyclical stochastic gradient MCMC for Bayesian deep learning.
International Conference on Learning Representations, 2020.

Zhang, T., Lin, Z., Yang, G., and De Sa, C. QPyTorch: A low-precision arithmetic simulation framework. arXiv preprint arXiv:1910.04540, 2019.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

Zhu, S., Dong, X., and Su, H. Binary ensemble neural network: More bits per network or more networks per bit? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4923–4932, 2019.

A. Quantization Formulation

We follow the quantization framework in prior work (Wu et al., 2018; Wang et al., 2018b; Yang et al., 2019) to quantize weights, activations, backpropagation errors, and gradients, as outlined in Algorithm 3.

Algorithm 3 Low-Precision Training for SGLD.
given: an L-layer DNN {f_1, ..., f_L}; stepsize α; weight, gradient, activation, and error quantizers Q_W, Q_G, Q_A, Q_E; variance-corrected quantization Q_vc; deterministic rounding Q_d; stochastic rounding Q_s; quantization gap of weights Δ_W; data batch sequence {(x_k, y_k)}_{k=1}^K. θ^{fp}_k denotes the full-precision buffer of the weights.
for k = 1 : K do
  1. Forward Propagation:
     a_k^{(0)} = x_k
     a_k^{(l)} = Q_A(f_l(a_k^{(l-1)}, θ_k^{(l)})), l ∈ [1, L]
  2. Backward Propagation:
     e_k^{(L)} = ∇_{a_k^{(L)}} L(a_k^{(L)}, y_k)
     e_k^{(l-1)} = Q_E((∂f_l(a_k^{(l)}) / ∂a_k^{(l-1)})^T e_k^{(l)})
     g_k^{(l)} = Q_G((∂f_l / ∂θ_k^{(l)})^T e_k^{(l)})
  3. SGLD Update:
     full-precision gradient accumulators:
       θ^{fp}_{k+1} ← θ^{fp}_k − α Q_G(∇Ũ(θ_k)) + √(2α) ξ_{k+1},   θ_{k+1} ← Q_W(θ^{fp}_{k+1})
     low-precision gradient accumulators:
       θ_{k+1} ← Q_vc(θ_k − α Q_G(∇Ũ(θ_k)), 2α, Δ_W)
end for
output: samples {θ_k}

B. Proof of Theorem 1

Our proofs follow Theorem 4 in Dalalyan & Karagulyan (2019), which provides a convergence bound for Langevin dynamics with noisy gradients. We state that result below.
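The full-precision-accumulator branch of Algorithm 3 can be sketched as follows. Here `grad_fn` (a stochastic gradient of U) and the two quantizers are caller-supplied placeholders; with identity quantizers the sketch reduces to standard SGLD.

```python
import numpy as np

def sgld_lp_full_accumulator(grad_fn, quantize_w, quantize_g, theta0,
                             alpha=1e-2, n_steps=1000, rng=None):
    """SGLDLP-F sketch: a full-precision buffer `theta_fp` accumulates
    quantized gradients plus Gaussian noise; only the low-precision
    weights Q_W(theta_fp) are used to evaluate gradients."""
    rng = np.random.default_rng() if rng is None else rng
    theta_fp = np.array(theta0, dtype=float)
    samples = []
    for _ in range(n_steps):
        theta = quantize_w(theta_fp)                # low-precision weights
        noise = rng.standard_normal(theta_fp.shape)
        theta_fp = (theta_fp
                    - alpha * quantize_g(grad_fn(theta))
                    + np.sqrt(2 * alpha) * noise)   # Langevin noise
        samples.append(theta)
    return samples
```

For example, with U(θ) = ‖θ‖²/2 (so grad_fn is the identity) and no quantization, the chain samples approximately from a standard Gaussian.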
We consider Langevin dynamics with the update rule

θ_{k+1} = θ_k − α(∇U(θ_k) + ζ_k) + √(2α) ξ_{k+1}.   (6)

The gradient noise ζ_k satisfies the following three assumptions:

E[‖E[ζ_k | θ_k]‖₂²] ≤ δ²d,
E[‖ζ_k − E[ζ_k | θ_k]‖₂²] ≤ σ²d,
ξ_{k+1} is independent of (ζ_0, ..., ζ_k),

where δ > 0 and σ > 0 are constants. Under the same assumptions as in Section 4, we have the following convergence bound for the above Langevin dynamics.

Theorem 4 (Theorem 4 in Dalalyan & Karagulyan (2019)). Run the Langevin dynamics in Equation (6) with α ≤ 2/(m + M). Let π be the target distribution, µ_0 the initial distribution, and µ_K the distribution obtained after K iterations. Then

W₂(µ_K, π) ≤ (1 − αm)^K W₂(µ_0, π) + 1.65(M/m)(αd)^{1/2} + δ√d/m + σ²(αd)^{1/2}/(1.65M + σ√(αm)).

Built upon this theorem, we now prove Theorem 1.

Proof. We write the SGLDLP-F update as

θ_{k+1} = θ_k − α Q_G(∇Ũ(Q_W(θ_k))) + √(2α) ξ_{k+1}
        = θ_k − α(∇U(θ_k) + ζ_k) + √(2α) ξ_{k+1},

where

ζ_k = Q_G(∇Ũ(Q_W(θ_k))) − ∇U(θ_k)
    = [Q_G(∇Ũ(Q_W(θ_k))) − ∇Ũ(Q_W(θ_k))] + [∇Ũ(Q_W(θ_k)) − ∇U(Q_W(θ_k))] + [∇U(Q_W(θ_k)) − ∇U(θ_k)].

Since E[∇Ũ(x)] = ∇U(x) and E[Q_G(x)] = x, the first two terms vanish in expectation and

E[ζ_k | θ_k] = E[∇U(Q_W(θ_k)) − ∇U(θ_k) | θ_k].

By the smoothness assumption, ‖∇U(Q_W(θ_k)) − ∇U(θ_k)‖₂² ≤ M²‖Q_W(θ_k) − θ_k‖₂², so

‖E[ζ_k | θ_k]‖₂² ≤ E[‖∇U(Q_W(θ_k)) − ∇U(θ_k)‖₂² | θ_k] ≤ M² E[‖Q_W(θ_k) − θ_k‖₂² | θ_k] ≤ M² Δ_W² d / 4.

Let f : R → R^d denote the function f(a) = ∇U(θ_k + a(Q_W(θ_k) − θ_k)). By the mean value theorem, there exists a ∈ [0, 1] (a function of the weight quantization randomness) such that f(1) − f(0) = f′(a). Therefore

E[ζ_k | θ_k] = E[∇U(Q_W(θ_k)) − ∇U(θ_k) | θ_k]
= E[∇²U(θ_k + a(Q_W(θ_k) − θ_k))(Q_W(θ_k) − θ_k) | θ_k]
= E[∇²U(θ_k)(Q_W(θ_k) − θ_k) | θ_k] + E[(∇²U(θ_k + a(Q_W(θ_k) − θ_k)) − ∇²U(θ_k))(Q_W(θ_k) − θ_k) | θ_k]
= E[(∇²U(θ_k + a(Q_W(θ_k) − θ_k)) − ∇²U(θ_k))(Q_W(θ_k) − θ_k) | θ_k],

where the first term vanishes because stochastic rounding is unbiased, i.e. E[Q_W(θ_k) − θ_k | θ_k] = 0.
Now, by the assumption ‖∇²U(x) − ∇²U(y)‖₂ ≤ Ψ‖x − y‖₂, we get

‖E[ζ_k | θ_k]‖₂ = ‖E[(∇²U(θ_k + a(Q_W(θ_k) − θ_k)) − ∇²U(θ_k))(Q_W(θ_k) − θ_k) | θ_k]‖₂
≤ E[‖(∇²U(θ_k + a(Q_W(θ_k) − θ_k)) − ∇²U(θ_k))(Q_W(θ_k) − θ_k)‖₂ | θ_k]
≤ E[‖∇²U(θ_k + a(Q_W(θ_k) − θ_k)) − ∇²U(θ_k)‖₂ ‖Q_W(θ_k) − θ_k‖₂ | θ_k]
≤ E[Ψ‖a(Q_W(θ_k) − θ_k)‖₂ ‖Q_W(θ_k) − θ_k‖₂ | θ_k]
≤ Ψ E[‖Q_W(θ_k) − θ_k‖₂² | θ_k]
≤ Ψ Δ_W² d / 4.

Combined with the previous result, this gives

‖E[ζ_k | θ_k]‖₂ ≤ min(Ψ Δ_W² d / 4, M Δ_W √d / 2).

Now consider the variance of ζ_k. Since the gradient and quantization noises are conditionally unbiased, the cross terms vanish and

E[‖ζ_k − E[ζ_k | θ_k]‖₂²] ≤ E[‖ζ_k‖₂²]
≤ E[‖Q_G(∇Ũ(Q_W(θ_k))) − ∇Ũ(Q_W(θ_k))‖₂²] + E[‖∇Ũ(Q_W(θ_k)) − ∇U(Q_W(θ_k))‖₂²] + E[‖∇U(Q_W(θ_k)) − ∇U(θ_k)‖₂²]
≤ Δ_G² d / 4 + κ² + M² Δ_W² d / 4.

Recall that to apply the result of Theorem 4 in Dalalyan & Karagulyan (2019), we need E[‖E[ζ_k | θ_k]‖₂²] ≤ δ²d and E[‖ζ_k − E[ζ_k | θ_k]‖₂²] ≤ σ²d. We therefore set

δ√d = min(Ψ Δ_W² d / 4, M Δ_W √d / 2),
σ²d = Δ_G² d / 4 + κ² + M² Δ_W² d / 4 = ((Δ_G² + M² Δ_W²) d + 4κ²) / 4.

Since ζ_k is independent of the Gaussian noise ξ_i for i = 0, ..., k + 1, the assumptions of Theorem 4 are satisfied. Applying Theorem 4 gives

W₂(µ_K, π) ≤ (1 − αm)^K W₂(µ_0, π) + 1.65(M/m)(αd)^{1/2} + δ√d/m + σ²(αd)^{1/2}/(1.65M + σ√(αm))
≤ (1 − αm)^K W₂(µ_0, π) + 1.65(M/m)(αd)^{1/2} + δ√d/m + σ√d/√m
= (1 − αm)^K W₂(µ_0, π) + 1.65(M/m)(αd)^{1/2} + min(Ψ Δ_W² d/(4m), M Δ_W √d/(2m)) + √((Δ_G² + M² Δ_W²) αd + 4ακ²)/(2√(αm)).

Note that we drop the 1.65M in the denominator of the last term to further simplify the bound.

C. Proof of Theorem 2

Proof. Recall that the SGLDLP-L update is

θ_{k+1} = Q_W(θ_k − α Q_G(∇Ũ(θ_k)) + √(2α) ξ_{k+1}).

To utilize the result in Dalalyan & Karagulyan (2019), we introduce the intermediate dynamics ψ_{k+1} = θ_k − α Q_G(∇Ũ(θ_k)) + √(2α) ξ_{k+1}. Therefore θ_k = Q_W(ψ_k) and

ψ_{k+1} = θ_k − α Q_G(∇Ũ(θ_k)) + √(2α) ξ_{k+1}
        = Q_W(ψ_k) − α Q_G(∇Ũ(Q_W(ψ_k))) + √(2α) ξ_{k+1}
        = ψ_k − α(∇U(ψ_k) + ζ_k) + √(2α) ξ_{k+1},

where

ζ_k = (ψ_k − θ_k)/α + Q_G(∇Ũ(θ_k)) − ∇U(ψ_k)
    = (ψ_k − θ_k)/α + [Q_G(∇Ũ(θ_k)) − ∇Ũ(θ_k)] + [∇Ũ(θ_k) − ∇U(θ_k)] + [∇U(θ_k) − ∇U(ψ_k)].

Similar to the previous proof in Section B, since E[Q_W(ψ_k) | ψ_k] = ψ_k, we know that

E[ζ_k | ψ_k] = E[∇U(θ_k) − ∇U(ψ_k) | ψ_k] = E[∇U(Q_W(ψ_k)) − ∇U(ψ_k) | ψ_k],
‖E[ζ_k | ψ_k]‖₂ ≤ min(Ψ Δ_W² d / 4, M Δ_W √d / 2),

and it suffices to set δ the same as in Section B.
On the other hand, the variance is bounded by

E[‖ζ_k − E[ζ_k | ψ_k]‖₂²] ≤ E[‖Q_G(∇Ũ(θ_k)) − ∇Ũ(θ_k)‖₂²] + E[‖∇Ũ(θ_k) − ∇U(θ_k)‖₂²] + E[‖(ψ_k − θ_k)/α + ∇U(θ_k) − ∇U(ψ_k)‖₂²]
≤ Δ_G² d / 4 + κ² + E[‖∇F(ψ_k) − ∇F(θ_k)‖₂²],

where F(θ) = ‖θ‖₂²/(2α) − U(θ). Observe that since U is m-strongly convex and M-smooth, and α^{−1} ≥ M/2, F must be α^{−1}-smooth, and so

E[‖ζ_k − E[ζ_k | ψ_k]‖₂²] ≤ Δ_G² d / 4 + κ² + α^{−2} E[‖ψ_k − θ_k‖₂²] ≤ Δ_G² d / 4 + κ² + Δ_W² d / (4α²).

This essentially replaces the M² in the previous analysis with α^{−2}. It suffices to set

σ²d = Δ_G² d / 4 + κ² + Δ_W² d / (4α²).

Supposing the distribution of ψ_{K+1} is ν_K, applying Theorem 4 in Dalalyan & Karagulyan (2019) gives the rate

W₂(ν_K, π) ≤ (1 − αm)^K W₂(ν_0, π) + 1.65(M/m)(αd)^{1/2} + min(Ψ Δ_W² d/(4m), M Δ_W √d/(2m)) + √((α Δ_G² + α^{−1} Δ_W²) d + 4ακ²)/(2√(αm)).

We also have

W₂(µ_K, ν_K) = inf_J (∫ ‖x − y‖₂² dJ(x, y))^{1/2} ≤ E[‖θ_{K+1} − ψ_{K+1}‖₂²]^{1/2} ≤ Δ_W √d / 2.

Combining these two results, and using W₂(ν_0, π) ≤ W₂(µ_0, π) + Δ_W √d / 2, we get the final bound

W₂(µ_K, π) ≤ W₂(µ_K, ν_K) + W₂(ν_K, π)
≤ (1 − αm)^K W₂(µ_0, π) + 1.65(M/m)(αd)^{1/2} + min(Ψ Δ_W² d/(4m), M Δ_W √d/(2m)) + √((α Δ_G² + α^{−1} Δ_W²) d + 4ακ²)/(2√(αm)) + ((1 − αm)^K + 1) Δ_W √d / 2.

D. Proof of Theorem 3

Proof. Recall that the VC SGLDLP-L update is

θ_{k+1} = Q_vc(θ_k − α Q_G(∇Ũ(θ_k)), 2α, Δ_W).

We ignore the variance of Q_G since it is relatively small compared to the weight quantization variance in practice. Q_vc is defined as in Algorithm 2, and we have E[θ_{k+1} | θ_k] = θ_k − α ∇Ũ(θ_k). Let ψ_{k+1} = θ_k − α Q_G(∇Ũ(θ_k)) + √(2α) ξ_{k+1}; then

ψ_{k+1} − θ_{k+1} = θ_k − α Q_G(∇Ũ(θ_k)) + √(2α) ξ_{k+1} − θ_{k+1},
ψ_{k+1} = ψ_k − α(∇U(ψ_k) + ζ_k) + √(2α) ξ_{k+1},

where, as before,

ζ_k = (ψ_k − θ_k)/α + Q_G(∇Ũ(θ_k)) − ∇U(ψ_k)
    = (ψ_k − θ_k)/α + [Q_G(∇Ũ(θ_k)) − ∇Ũ(θ_k)] + [∇Ũ(θ_k) − ∇U(θ_k)] + [∇U(θ_k) − ∇U(ψ_k)].

Note that E[ψ_k − θ_k] = 0. Similar to the previous proof in Section B, we know that

‖E[ζ_k | ψ_k]‖₂² = ‖E[∇U(θ_k) − ∇U(ψ_k) | ψ_k]‖₂² ≤ M² E[‖ψ_k − θ_k‖₂² | ψ_k].
When 2α > v₀ = Δ_W²/4, the variance-corrected quantization computes x = θ_{k−1} − α Q_G(∇Ũ(θ_{k−1})) + √(2α − v₀) ξ_k and outputs θ_k = Q_d(x) + sign(r) ⊙ c with r = x − Q_d(x). Hence

ψ_k − θ_k = (√(2α) − √(2α − v₀)) ξ_k + r − sign(r) ⊙ c,

and therefore

E[‖ψ_k − θ_k‖₂²] ≤ (√(2α) − √(2α − v₀))² E[‖ξ_k‖₂²] + (√(2α) − √(2α − v₀)) Δ_W E[‖ξ_k‖₂] + 2v₀d
≤ [(√(2α) − √(2α − v₀))² + (√(2α) − √(2α − v₀)) Δ_W + 2v₀] d.

Since 2xy ≤ x² + y², we get

(√(2α) − √(2α − v₀)) Δ_W ≤ (√(2α) − √(2α − v₀))² + Δ_W²/4 = (√(2α) − √(2α − v₀))² + v₀,

so the expression can be further simplified to

E[‖ψ_k − θ_k‖₂²] ≤ [2(√(2α) − √(2α − v₀))² + 3v₀] d.

We also note that

√(2α) − √(2α − v₀) = (2α − (2α − v₀))/(√(2α) + √(2α − v₀)) = v₀/(√(2α) + √(2α − v₀)) ≤ v₀/√(2α),

so the expectation becomes

E[‖ψ_k − θ_k‖₂²] ≤ (v₀²/α + 3v₀) d.

Since 2α > v₀, it follows that

E[‖ψ_k − θ_k‖₂²] ≤ (2v₀ + 3v₀) d = 5v₀d.

Let A = 5v₀d. Then we obtain

‖E[ζ_k | ψ_k]‖₂² ≤ M² A,   ‖E[ζ_k | ψ_k]‖₂ ≤ Ψ A.

Therefore, it suffices to set δ such that δ√d = min(Ψ A, M √A). We now consider the variance, which is bounded by

E[‖ζ_k − E[ζ_k | ψ_k]‖₂²] ≤ E[‖Q_G(∇Ũ(θ_k)) − ∇Ũ(θ_k)‖₂²] + E[‖∇Ũ(θ_k) − ∇U(θ_k)‖₂²] + E[‖(ψ_k − θ_k)/α + ∇U(θ_k) − ∇U(ψ_k)‖₂²]
≤ Δ_G² d / 4 + κ² + α^{−2} E[‖ψ_k − θ_k‖₂²]
≤ Δ_G² d / 4 + κ² + A/α².

It suffices to set σ²d = Δ_G² d / 4 + κ² + A/α².
Supposing the distribution of ψ_{K+1} is ν_K, applying Theorem 4 in Dalalyan & Karagulyan (2019) gives the rate

W₂(ν_K, π) ≤ (1 − αm)^K W₂(ν_0, π) + 1.65(M/m)(αd)^{1/2} + min(Ψ A, M √A)/m + √(α Δ_G² d + 4ακ² + 4A/α)/(2√(αm)).

We also have

W₂(µ_K, ν_K) = inf_J (∫ ‖x − y‖₂² dJ(x, y))^{1/2} ≤ E[‖θ_{K+1} − ψ_{K+1}‖₂²]^{1/2} ≤ √A.

Combining these two results, we get

W₂(µ_K, π) ≤ W₂(µ_K, ν_K) + W₂(ν_K, π)
≤ (1 − αm)^K W₂(µ_0, π) + 1.65(M/m)(αd)^{1/2} + min(Ψ A, M √A)/m + √(α Δ_G² d + 4ακ² + 4A/α)/(2√(αm)) + ((1 − αm)^K + 1) √A.

When 2α < v₀ = Δ_W²/4, the variance-corrected quantization reduces to stochastic rounding. Since we assume the gradient is bounded, E[‖Q_G(∇Ũ(θ_k))‖₁] ≤ G, we have

E[‖ψ_k − θ_k‖₂²] = E[‖θ_{k−1} − α Q_G(∇Ũ(θ_{k−1})) − Q_s(θ_{k−1} − α Q_G(∇Ũ(θ_{k−1})))‖₂²] + 2αd
≤ 2 max(E[‖θ_{k−1} − α Q_G(∇Ũ(θ_{k−1})) − Q_s(θ_{k−1} − α Q_G(∇Ũ(θ_{k−1})))‖₂²], 2αd).

Using the bound in Equation (6) of Li & De Sa (2019),

E[‖θ_{k−1} − α Q_G(∇Ũ(θ_{k−1})) − Q_s(θ_{k−1} − α Q_G(∇Ũ(θ_{k−1})))‖₂²] ≤ Δ_W α E[‖Q_G(∇Ũ(θ_{k−1}))‖₁] ≤ Δ_W α G.

It follows that E[‖ψ_k − θ_k‖₂²] ≤ max(2Δ_W αG, 4αd). Let A = max(2Δ_W αG, 4αd). The rest is the same as in the case 2α > v₀.

E. Comparison to SGD Bounds

There have been many works comparing optimization and sampling algorithms, since they serve as the two main computational strategies for machine learning (Ma et al., 2019; Talwar, 2019). For example, in Ma et al. (2019), the authors compare the total variation distance between the approximate distribution and the target distribution (a sampling bound) with the objective gap (an optimization bound). Following previous work, we compare our 2-Wasserstein distance bound with previous SGD bounds. Previous low-precision SGD convergence bounds are stated in terms of the squared distance to the optimum, ‖θ_K − θ*‖₂² (Yang et al., 2019). In order to compare our bounds with theirs, we consider a 2-Wasserstein distance between two point distributions. Let µ_K be the point distribution that assigns zero probability everywhere except θ_K, and π the point distribution that assigns zero probability everywhere except θ*.
Then we get

W₂(µ_K, π) = inf_J (∫ ‖x − y‖₂² dJ(x, y))^{1/2} = E[‖θ_K − θ*‖₂²]^{1/2}.

From Yang et al. (2019), we know that E[‖θ_K − θ*‖₂²] is proportional to Δ_W. Therefore, our 2-Wasserstein distance is O(Δ_W²) whereas SGD's 2-Wasserstein distance is O(√Δ_W), which shows that SGLD is more robust to the quantization error.

F. Deterministic Rounding vs Stochastic Rounding

Compared to deterministic rounding, stochastic rounding is unbiased and thus can preserve gradient information even when the gradient update is smaller than the quantization gap. In theory, deterministic rounding makes the convergence bound worse due to its bias. For example, in Theorem 1, using deterministic rounding as the weight quantizer makes the bias term become O(Δ_W), which is worse than the current O(Δ_W²). In practice, stochastic rounding generally provides much better results than deterministic rounding, especially on deep neural networks (Gupta et al., 2015; Wang et al., 2018b). For example, on CIFAR-10 with 8-bit block floating point, we found that using deterministic rounding to quantize the weights in SGDLP-L and SGLDLP-L gives test errors of 7.44% and 7.37% respectively, which are much worse than using stochastic rounding (SGDLP-L: 5.86%, naïve SGLDLP-L: 5.85%, VC SGLDLP-L: 5.51%).

G. Algorithms with (Block) Floating Point Numbers

The quantization gaps in floating point and block floating point change depending on the number values. Therefore, we need to compute the quantization gap at each step in order to apply our variance-corrected quantization function Q_vc. It is easy to see that the quantization gap can be computed as

Δ_W = 2^{E[µ] − W + 2}, where E[µ] = clip(⌊log₂(max |µ|)⌋, l, u)   (block floating point),
Δ_W = 2^{E[µ] − W},     where E[µ] = clip(⌊log₂(|µ|)⌋, l, u)        (floating point).   (7)

Deterministic rounding and stochastic rounding are defined using the above Δ_W. We then obtain the Q_vc function with (block) floating point in Algorithm 5.
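The block-floating-point branch of Equation (7) can be sketched as follows; the helper name and bit-width defaults are our own. All entries of a block share the exponent of the largest magnitude, which determines the quantization gap.

```python
import numpy as np

def bfp_quant_gap(mu, w_bits=8, f_bits=8):
    """Quantization gap of a block-floating-point block (Equation (7)):
    2**(E - W + 2), with shared exponent
    E = clip(floor(log2(max|mu|)), l, u) and l, u set by the F exponent bits."""
    l, u = -(2 ** (f_bits - 1)), 2 ** (f_bits - 1) - 1
    max_abs = np.max(np.abs(mu))
    if max_abs == 0:
        exp = l  # empty-magnitude block: use the smallest exponent
    else:
        exp = int(np.clip(np.floor(np.log2(max_abs)), l, u))
    return 2.0 ** (exp - w_bits + 2)
```

For instance, a block with largest magnitude 0.5 has shared exponent −1, so with W = 8 the gap is 2^(−1 − 8 + 2) = 2^−7.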
This algorithm is the same as Algorithm 1 except that we recompute the quantization gap after adding the Gaussian noise, to make sure it aligns with the quantization gap of x. VC SGLDLP-L with (block) floating point is outlined in Algorithm 4.

Algorithm 4 VC SGLDLP-L with (Block) Floating Point.
given: stepsize α, number of training iterations K, gradient quantizer Q_G, deterministic rounding with (block) floating point Q_d, stochastic rounding with (block) floating point Q_s, F bits to represent the shared exponent (block floating point) or the exponent (floating point), W bits to represent each number in the block (block floating point) or the mantissa (floating point).
let l ← −2^{F−1}, u ← 2^{F−1} − 1
for k = 1 : K do
  compute µ ← θ_k − α Q_G(∇Ũ(θ_k))
  compute Δ_W(µ) following Equation (7)
  update θ_{k+1} ← Q_vc(µ, 2α, Δ_W(µ))
end for
output: samples {θ_k}

Algorithm 5 Variance-Corrected Quantization Function Q_vc with (Block) Floating Point.
input: (µ, v, Δ)
v₀ ← Δ²/4
if v > v₀ then
  x ← µ + √(v − v₀) ξ, where ξ ~ N(0, I_d)
  r ← x − Q_d(x)
  recompute Δ_W(x) following Equation (7)
  v₀ ← Δ_W(x)²/4
  for all i do
    sample c_i from Cat(|r_i|, v₀) as in Equation (5)
  end for
  θ ← Q_d(x) + sign(r) ⊙ c
else
  proceed the same as with fixed point numbers
end if

H. Experimental Details and Additional Results

H.1. Sampling methods

For both SGLD and low-precision SGLD, we collect samples {θ_j}_{j=1}^J from the posterior over the model's weights, and obtain predictions on test data (x*, y*) by Bayesian model averaging: p(y* | x*, D) ≈ (1/J) Σ_{j=1}^J p(y* | x*, D, θ_j). We train all methods on logistic regression and an MLP for 20 epochs with learning rate 0.1 and batch size 64. We additionally report test error comparisons in Figure 3.

[Figure 3 shows two panels plotting test error (%) against the number of fractional bits (2–10) for SGDLP-L, naive SGLDLP-L, VC SGLDLP-L, SGDLP-F, and SGLDLP-F, with full-precision baselines SGDFP (7.68%)/SGLDFP (7.63%) and SGDFP (2.03%)/SGLDFP (1.9%).]

Figure 3.
Test error on the MNIST dataset at different precisions.

H.3. CIFAR and IMDB

For the CIFAR datasets, we use batch size 128, learning rate 0.5, and weight decay 5e−4. We train the model for 245 epochs and use the same decaying learning rate schedule as in Yang et al. (2019). We collect 50 samples for SGLD. For the cyclical learning rate schedule, we use 7 cycles and collect 5 models per cycle (35 models in total). We use 10 bins for expected calibration error (ECE), following prior work (Guo et al., 2017).

For the IMDB dataset, we use batch size 80, learning rate 0.3, and weight decay 5e−4. We use a two-layer LSTM; the embedding and hidden dimensions are 100 and 256 respectively. We train the model for 50 epochs and use the same decaying learning rate schedule as on the CIFAR datasets. We collect 20 samples for SGLD. For the cyclical learning rate schedule, we use 1 cycle and collect 20 models.

H.4. ImageNet

We use batch size 256, learning rate 0.2, and weight decay 1e−4. We use the same decaying learning rate schedule as in He et al. (2016) and collect 20 models for SGLD.
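The Bayesian model averaging described in H.1 amounts to averaging predictive distributions over the collected posterior samples. A minimal sketch, with caller-supplied per-sample predictors standing in for the J sampled networks:

```python
import numpy as np

def bayesian_model_average(prob_fns, x):
    """p(y | x, D) ~= (1/J) * sum_j p(y | x, theta_j).

    `prob_fns` is a list of J callables, one per collected sample theta_j,
    each returning class probabilities for input x; these are placeholders
    for forward passes through the J sampled models.
    """
    probs = np.stack([f(x) for f in prob_fns])  # shape (J, n_classes)
    return probs.mean(axis=0)
```

Averaging the probabilities (rather than the weights) is what lets coarse low-precision representations of individual samples be compensated by the ensemble.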