# analysis_of_quantized_models__7f20a9e8.pdf

Published as a conference paper at ICLR 2019

ANALYSIS OF QUANTIZED MODELS

Lu Hou¹, Ruiliang Zhang¹,², James T. Kwok¹
¹Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
{lhouab,jamesk}@cse.ust.hk
²TuSimple
ruiliang.zhang@tusimple.ai

ABSTRACT

Deep neural networks are usually huge, which significantly limits their deployment on low-end devices. In recent years, many weight-quantized models have been proposed. They have small storage and fast inference, but training can still be time-consuming. This can be improved with distributed learning. To reduce the high communication cost due to worker-server synchronization, gradient quantization has recently also been proposed to train deep networks with full-precision weights. In this paper, we theoretically study how the combination of both weight and gradient quantization affects convergence. We show that (i) weight-quantized models converge to an error related to the weight quantization resolution and weight dimension; (ii) quantizing gradients slows convergence by a factor related to the gradient quantization resolution and dimension; and (iii) clipping the gradient before quantization renders this factor dimension-free, thus allowing the use of fewer bits for gradient quantization. Empirical experiments confirm the theoretical convergence results, and demonstrate that quantized networks can speed up training and achieve performance comparable to full-precision networks.

1 INTRODUCTION

Deep neural networks are usually huge. The high demand in time and space can significantly limit deployment on low-end devices. To alleviate this problem, many approaches have recently been proposed to compress deep networks. One direction is network quantization, which represents each network weight with a small number of bits. Besides significantly reducing the model size, it also accelerates network training and inference. Many weight quantization methods aim at approximating the full-precision weights in each iteration (Courbariaux et al., 2015; Lin et al., 2016; Rastegari et al., 2016; Li & Liu, 2016; Lin et al., 2017; Guo et al., 2017). Recently, loss-aware quantization minimizes the loss directly w.r.t. the quantized weights (Hou et al., 2017; Hou & Kwok, 2018; Leng et al., 2018), and often achieves better performance than approximation-based methods.

Distributed learning can further speed up training of weight-quantized networks (Dean et al., 2012). A key challenge is reducing the expensive communication cost incurred during synchronization of the gradients and model parameters (Li et al., 2014a;b). Recently, algorithms that sparsify (Aji & Heafield, 2017; Wangni et al., 2017) or quantize the gradients (Seide et al., 2014; Wen et al., 2017; Alistarh et al., 2017; Bernstein et al., 2018) have been proposed.

In this paper, we consider quantization of both the weights and gradients in a distributed environment. Quantizing both weights and gradients has been explored in DoReFa-Net (Zhou et al., 2016), QNN (Hubara et al., 2017), WAGE (Wu et al., 2018) and ZipML (Zhang et al., 2017). We differ from them in two aspects. First, existing methods mainly consider learning on a single machine, and gradient quantization is used to reduce the computations in backpropagation. In contrast, we consider a distributed environment, and use gradient quantization to reduce communication cost and accelerate distributed learning of weight-quantized networks.
Second, while DoReFa-Net, QNN and WAGE show impressive empirical results on the quantized network, theoretical guarantees are not provided. ZipML provides a convergence analysis, but it is limited to stochastic weight quantization, the square loss with a linear model, and requires the stochastic gradients to be unbiased. This can be restrictive, as most state-of-the-art weight quantization methods (Rastegari et al., 2016; Lin et al., 2016; Li & Liu, 2016; Guo et al., 2017; Hou et al., 2017; Hou & Kwok, 2018) are deterministic, and the resultant stochastic gradients are biased. In this paper, we relax the restrictions on the loss function, and study in an online learning setting how the gradient precision affects convergence of weight-quantized networks in a distributed environment. The main findings are:

1. With either full-precision or quantized gradients, the average regret of loss-aware weight quantization does not converge to zero, but to an error related to the weight quantization resolution $\Delta_w$ and dimension $d$. The smaller the $\Delta_w$ or $d$, the smaller the error (Theorems 1 and 2).

2. With either full-precision or quantized gradients, the average regret converges to this error at a $O(1/\sqrt{T})$ rate, where $T$ is the number of iterations. However, gradient quantization slows convergence (relative to using full-precision gradients) by a factor related to the gradient quantization resolution $\Delta_g$ and $d$. The larger the $\Delta_g$ or $d$, the slower the convergence (Theorems 1 and 2). This can be problematic when (i) the weight-quantized model has a large $d$ (e.g., deep networks); and (ii) the communication cost is a bottleneck in the distributed setting, which favors a small number of bits for the gradients, and thus a large $\Delta_g$.

3. For gradients following the normal distribution, gradient clipping renders the speed degradation mentioned above dimension-free. However, an additional error is incurred. The convergence speedup and the error are related to how aggressively clipping is performed. More aggressive clipping results in faster convergence, but a larger error (Theorem 3).

4. Empirical results show that quantizing gradients significantly reduces communication cost, and gradient clipping makes the speed degradation caused by gradient quantization negligible. With quantized clipped gradients, distributed training of weight-quantized networks is much faster, while accuracy comparable to that obtained with full-precision gradients is maintained (Section 4).

Notations. For a vector $x$, $\sqrt{x}$ is the element-wise square root, $x^2$ is the element-wise square, $\mathrm{Diag}(x)$ returns a diagonal matrix with $x$ on the diagonal, and $x \odot y$ is the element-wise multiplication of vectors $x$ and $y$. For a matrix $Q$, $\|x\|_Q^2 = x^\top Q x$. For a matrix $X$, $\sqrt{X}$ is the element-wise square root, and $\mathrm{diag}(X)$ returns a vector extracted from the diagonal elements of $X$.

2 PRELIMINARIES

2.1 ONLINE LEARNING

Online learning continually adapts the model with a sequence of observations. It has been commonly used in the analysis of deep learning optimizers (Duchi et al., 2011; Kingma & Ba, 2015; Reddi et al., 2018). At time $t$, the algorithm picks a model with parameter $w_t \in S$, where $S$ is a convex compact set. The algorithm then incurs a loss $f_t(w_t)$. After $T$ rounds, the performance is usually evaluated by the regret $R(T) = \sum_{t=1}^{T} \big(f_t(w_t) - f_t(w^*)\big)$ and the average regret $R(T)/T$, where $w^* = \arg\min_{w \in S} \sum_{t=1}^{T} f_t(w)$ is the best model parameter in hindsight.
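To make the regret notion concrete, the following minimal, self-contained Python sketch (not from the paper; the quadratic losses, step-size schedule, and feasible set are illustrative choices) runs projected online gradient descent and reports the average regret $R(T)/T$ against the best fixed parameter in hindsight:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, radius, lr = 10, 2000, 1.0, 0.5

def project(w):
    """Project w onto the convex compact set S = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

# Stream of quadratic losses f_t(w) = 0.5 * ||w - z_t||^2.
z = rng.normal(loc=0.5, scale=1.0, size=(T, d))

w = np.zeros(d)
incurred = []
for t in range(T):
    incurred.append(0.5 * np.sum((w - z[t]) ** 2))   # loss f_t(w_t)
    grad = w - z[t]                                   # gradient of f_t at w_t
    step = lr / np.sqrt(t + 1)                        # decaying step size
    w = project(w - step * grad)                      # projected OGD update

# Best fixed parameter in hindsight: for these losses, the minimizer of
# sum_t f_t(w) over S is the projection of the mean of the z_t onto S.
w_star = project(z.mean(axis=0))
best = 0.5 * np.sum((z - w_star) ** 2)

regret = sum(incurred) - best
print(f"average regret R(T)/T = {regret / T:.4f}")
```

With this decaying step size, the printed average regret shrinks as $T$ grows, which is the kind of $O(1/\sqrt{T})$ behaviour that the theorems below quantify for quantized models, up to the quantization-induced error terms.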
2.2 WEIGHT QUANTIZATION

In BinaryConnect (Courbariaux et al., 2015), each weight is binarized using the sign function, either deterministically or stochastically. In ternary-connect (Lin et al., 2016), each weight is stochastically quantized to $\{-1, 0, 1\}$. Stochastic weight quantization often suffers severe accuracy degradation, while deterministic weight quantization (as in the binary-weight-network (BWN) (Rastegari et al., 2016) and ternary weight network (TWN) (Li & Liu, 2016)) achieves much better performance. In this paper, we focus on loss-aware weight quantization, which further improves performance by considering the effect of weight quantization on the loss. Examples include loss-aware binarization (LAB) (Hou et al., 2017) and loss-aware quantization (LAQ) (Hou & Kwok, 2018).

Let the full-precision weights from all $L$ layers in the deep network be $w$. The corresponding quantized weight is denoted $Q_w(w) = \hat{w}$, where $Q_w(\cdot)$ is the weight quantization function. At the $(t+1)$th iteration, the second-order Taylor expansion of $f_t(\hat{w})$, i.e., $f_t(\hat{w}_t) + \nabla f_t(\hat{w}_t)^\top (\hat{w} - \hat{w}_t) + \frac{1}{2} (\hat{w} - \hat{w}_t)^\top H_t (\hat{w} - \hat{w}_t)$, is minimized w.r.t. $\hat{w}$, where $H_t$ is the Hessian at $\hat{w}_t$. A direct computation of $H_t$ is expensive. In practice, it is approximated by $\mathrm{Diag}(\sqrt{\hat{v}_t})$, where $\hat{v}_t$ is the moving average

$$\hat{v}_t = \beta \hat{v}_{t-1} + (1-\beta) \hat{g}_t^2 = \sum_{j=1}^{t} (1-\beta) \beta^{t-j} \hat{g}_j^2, \qquad (1)$$

with $\hat{g}_t$ the stochastic gradient and $\beta < 1$. This moving average is readily available in popular deep network optimizers such as RMSProp and Adam. $\mathrm{Diag}(\sqrt{\hat{v}_t})$ is also an estimate of $\mathrm{Diag}(\sqrt{\mathrm{diag}(H_t^2)})$ (Dauphin et al., 2015). Computationally, the quantized weight is obtained by first performing a preconditioned gradient descent step $w_{t+1} = w_t - \eta_t \mathrm{Diag}(\sqrt{\hat{v}_t})^{-1} \hat{g}_t$, followed by quantization via solving the following problem:

$$\hat{w}_{t+1} = Q_w(w_{t+1}) = \arg\min_{\hat{w}} \|w_{t+1} - \hat{w}\|^2_{\mathrm{Diag}(\sqrt{\hat{v}_t})} \quad \text{s.t.} \quad \hat{w} = \alpha b, \; \alpha > 0, \; b \in (S_w)^d. \qquad (2)$$

For simplicity of notation, we assume that the same scaling parameter $\alpha$ is used for all layers. Extension to layer-wise scaling is straightforward. For binarization, $S_w = \{-1, +1\}$, the weight quantization resolution is $\Delta_w = 1$, and a simple closed-form solution is obtained in (Hou et al., 2017). For $m$-bit linear quantization, $S_w = \{-M_k, \dots, -M_1, M_0, M_1, \dots, M_k\}$, where $k = 2^{m-1} - 1$, $0 = M_0 <$
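For the binarization case above ($S_w = \{-1, +1\}$), one update following Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name and hyperparameter values are arbitrary, and the closed-form solution used for (2) is the one derived for LAB in Hou et al. (2017), i.e., $b = \mathrm{sign}(w_{t+1})$ with $\alpha$ a $\sqrt{\hat{v}_t}$-weighted average of $|w_{t+1}|$.

```python
import numpy as np

def lab_step(w, g_hat, v_hat, lr=0.01, beta=0.999, eps=1e-8):
    """One loss-aware binarization step, following Eqs. (1)-(2).

    w     : full-precision weights maintained alongside the quantized model
    g_hat : stochastic gradient evaluated at the current quantized weights
    v_hat : moving average of squared gradients (second-moment estimate)
    Returns the updated (w, quantized weights, v_hat).
    """
    # Eq. (1): moving average of squared gradients, as kept by RMSProp/Adam.
    v_hat = beta * v_hat + (1.0 - beta) * g_hat ** 2
    d = np.sqrt(v_hat) + eps                     # diagonal of Diag(sqrt(v_hat))

    # Preconditioned gradient descent on the full-precision weights.
    w = w - lr * g_hat / d

    # Eq. (2) with S_w = {-1, +1}: minimize ||w - alpha*b||^2_{Diag(d)}
    # over alpha > 0 and b in {-1, +1}^d.  Closed form (Hou et al., 2017):
    b = np.where(w >= 0.0, 1.0, -1.0)            # b = sign(w)
    alpha = np.dot(d, np.abs(w)) / np.sum(d)     # d-weighted mean of |w|
    w_hat = alpha * b
    return w, w_hat, v_hat

# Toy usage on a random "layer"; in practice g_hat comes from backpropagation
# through the network evaluated at the quantized weights.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)
v_hat = np.zeros_like(w)
for _ in range(5):
    g_hat = rng.normal(size=w.shape)             # stand-in stochastic gradient
    w, w_hat, v_hat = lab_step(w, g_hat, v_hat)
print(w_hat[:5])
```

In the distributed setting considered in this paper, the stochastic gradient exchanged between workers and the server would itself be quantized (and possibly clipped), which is exactly the interaction the subsequent analysis studies.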