# Demystifying and Generalizing BinaryConnect

Tim Dockhorn (University of Waterloo), Yaoliang Yu (University of Waterloo), Eyyüb Sari (Huawei Noah's Ark Lab), Mahdi Zolnouri (Huawei Noah's Ark Lab), Vahid Partovi Nia (Huawei Noah's Ark Lab)

## Abstract

BinaryConnect (BC) and its many variations have become the de facto standard for neural network quantization. However, our understanding of the inner workings of BC is still quite limited. We attempt to close this gap in four different aspects: (a) we show that existing quantization algorithms, including post-training quantization, are surprisingly similar to each other; (b) we argue for proximal maps as a natural family of quantizers that is both easy to design and analyze; (c) we refine the observation that BC is a special case of dual averaging, which itself is a special case of the generalized conditional gradient algorithm; (d) consequently, we propose ProxConnect (PC) as a generalization of BC and we prove its convergence properties by exploiting the established connections. We conduct experiments on CIFAR-10 and ImageNet, and verify that PC achieves competitive performance.

## 1 Introduction

Scaling up to extremely large datasets and models has been a main ingredient for the success of deep learning. Indeed, with the availability of big data, more computing power, convenient software, and a bag of training tricks as well as algorithmic innovations, the size of models that we routinely train in order to achieve state-of-the-art performance has exploded, e.g., to billions of parameters in recent language models [9]. However, high memory usage and computational cost at inference time have made it difficult to deploy these models in real time or on resource-limited devices [28]. The environmental impact of training and deploying these large models has also been recognized [40].

A common approach to tackle these problems is to compress a large model through quantization, i.e., replacing high-precision parameters with lower-precision ones. For example, we may constrain a subset of the weights to be binary [1, 5, 11, 29, 33, 37, 45] or ternary [26, 53]. Quantization can drastically decrease the carbon footprint of training and inference of neural networks; however, it may come at the cost of increased bias [21]. One of the main methods to obtain quantized neural networks is to encourage quantized parameters during gradient training using explicit or implicit regularization techniques, although other methods are possible [16–18, 20, 25, 32, 51]. Besides the memory benefits, the structure of the quantization can speed up inference using, for example, faster matrix-vector products [18, 23]. Training and inference can be made even more efficient by also quantizing the activations [37] or gradients [52]. Impressive performance has been achieved with quantized networks, for example, on object detection [44] and natural language processing [43] tasks. The theoretical underpinnings of quantized neural networks, such as when and why their performance remains reasonably good, have been actively studied [3, 13, 22, 41, 46].

*Work done during an internship at Huawei Noah's Ark Lab. Correspondence to tim.dockhorn@uwaterloo.ca. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).*

BinaryConnect [BC, 11] and its many variations [10, 37, 53] are considered the gold standard for neural network quantization.
Compared to plain (stochastic) gradient descent, BC does not evaluate the gradient at the current iterate but rather at a (close-by) quantization point, using the straight-through estimator [6]. Despite its empirical success, BC has largely remained a training trick [1], and a rigorous understanding of its inner workings has yet to be found, with some preliminary steps taken in Li et al. [27] for the convex setting and in Yin et al. [45] for a particular quantization set. As pointed out in Bai et al. [5], BC only evaluates gradients at the finite set of quantization points, and therefore does not exploit the rich information carried in the continuous weights. Bai et al. [5] also observed that BC is formally equivalent to the dual averaging algorithm [31, 42], while some similarity to the mirror descent algorithm was found in Ajanthan et al. [1].

The main goal of this work is to significantly improve our understanding of BC by connecting it with well-established theory and algorithms. In doing so we not only simplify and improve existing results but also obtain novel generalizations. We summarize our main contributions in more detail:

- In Section 2, we show that existing gradient-based quantization algorithms are surprisingly similar to each other: the only high-level difference is at what points we evaluate the gradient and perform the update.
- In Section 3, we present a principled theory for constructing proximal quantizers. Our results unify previous efforts, remove tedious calculations, and bring theoretical convenience. We illustrate our theory by effortlessly designing a new quantizer that can be computed in one pass, works for different quantization sets (binary or multi-bit), and includes previous attempts as special cases [5, 11, 45, 53].
- In Section 4, we significantly extend the observation of Bai et al. [5] that the updates of BC are the same as those of the dual averaging algorithm [31, 42]: BC is a nonconvex counterpart of dual averaging, and more importantly, dual averaging itself is simply the generalized conditional gradient algorithm applied to a smoothed dual problem. The latter fact, even in the convex case, does not appear to be widely recognized, to the best of our knowledge.
- In Section 5, making use of the above results, we propose ProxConnect (PC) as a family of algorithms that generalizes BC, and we prove its convergence properties for both the convex and the nonconvex setting. We rigorously justify the diverging parameter in proximal quantizers and resolve a discrepancy between theory and practice in the literature [1, 5, 45].
- In Section 6, we verify that PC outperforms BC and ProxQuant [5] on CIFAR-10, both for fine-tuning pretrained models and for end-to-end training. On the more challenging ImageNet dataset, PC yields competitive performance despite minimal hyperparameter tuning.

## 2 Background

We consider the usual (expected) objective $\ell(w) = \mathbb{E}\,\ell(w, \xi)$, where $\xi$ represents random sampling. For instance, $\ell(w) = \frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where $\xi$ is uniform over the $n$ training samples (or minibatches thereof), $w$ are the weights of a neural network, and $\ell_i$ may be the cross-entropy loss of the $i$-th training sample. We denote a sample (sub)gradient of $\ell$ at $w$ and $\xi$ as $\widetilde\nabla\ell(w) = \nabla\ell(w, \xi)$, so that $\mathbb{E}\,\widetilde\nabla\ell(w) = \nabla\ell(w)$. Throughout, we use the starred notation $w_*$ for continuous weights and reserve $w$ for (semi)discrete ones. We are interested in solving the following (nonconvex) problem:

$$\min_{w \in Q} \; \ell(w), \qquad (1)$$

where $Q \subseteq \mathbb{R}^d$ is a discrete, nonconvex quantization set. For instance, on certain low-resource devices it may be useful or even mandatory to employ binary weights, i.e., $Q = \{\pm 1\}^d$.
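For a finite $Q$, the Euclidean projection $\mathrm{P}_Q$ used by several of the algorithms below is simply elementwise nearest-codeword rounding; the following minimal NumPy sketch (illustrative only, with names of our own choosing) specializes to the sign function for $Q = \{\pm 1\}$.

```python
import numpy as np

# Euclidean projection onto a finite quantization set Q:
# elementwise nearest codeword. Illustrative sketch only.

def project(w, Q):
    Q = np.asarray(Q, dtype=float)
    idx = np.argmin(np.abs(w[..., None] - Q), axis=-1)  # nearest codeword index
    return Q[idx]

w = np.array([-0.7, 0.2, 1.4])
print(project(w, [-1.0, 1.0]))       # binary:  [-1.  1.  1.]  (same as np.sign here)
print(project(w, [-1.0, 0.0, 1.0]))  # ternary: [-1.  0.  1.]
```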
Importantly, our goal is to compete against the non-quantized, continuous-weights network (i.e., $Q = \mathbb{R}^d$). In other words, we do not necessarily aim to solve the (hard, combinatorial) problem (1) globally and optimally. Instead, we want to find discrete weights $w \in Q$ that remain satisfactory when compared to the non-quantized continuous weights. This is how we circumvent the difficulty in (1) and, more importantly, how the structure of $\ell$ can come to our aid. If $\ell$ is reasonably smooth, a close-by quantized weight of a locally, or even globally, optimal continuous weight will likely yield similar performance.

Tremendous progress has been made on quantizing and compressing neural networks. While it is not possible to discuss all related work, below we recall a few families of gradient-based quantization algorithms that directly motivate our work; more methods can be found in recent surveys [15, 36].

**BinaryConnect (BC).** Courbariaux et al. [11] considered binary networks where $Q = \{\pm 1\}^d$ and proposed the BinaryConnect algorithm:

$$w_*^{t+1} = w_*^t - \eta_t\, \widetilde\nabla\ell\big(\mathrm{P}(w_*^t)\big), \qquad (2)$$

where $\mathrm{P}$ is a projector that quantizes the continuous weights $w_*^t$ either deterministically (by taking the sign) or stochastically. Note that the (sample) gradient is evaluated at the quantized weights $w_t := \mathrm{P}(w_*^t)$, while its continuous output, after being scaled by the step size $\eta_t$, is added to the continuous weights $w_*^t$. For later comparison, it is useful to break the BC update into the following two pieces:

$$w_t = \mathrm{P}(w_*^t), \qquad w_*^{t+1} = w_*^t - \eta_t\, \widetilde\nabla\ell(w_t). \qquad (3)$$

Other choices of $\mathrm{P}$ [1, 45] and $Q$ [44, 53] in this framework have also been experimented with.

**ProxQuant (PQ).** Bai et al. [5] applied the usual proximal gradient algorithm to solve (1):

$$w_{t+1} = \mathrm{P}\big(w_t - \eta_t\, \widetilde\nabla\ell(w_t)\big), \qquad (4)$$

which can be similarly decomposed into:

$$w_t = \mathrm{P}(w_*^t), \qquad w_*^{t+1} = w_t - \eta_t\, \widetilde\nabla\ell(w_t). \qquad (5)$$

Thus, the only high-level difference between BC and PQ is that the former updates the continuous weights $w_*^t$ while the latter updates the quantized weights $w_t$. This seemingly minor difference turns out to cause drastically different behaviors of the two algorithms. For example, choosing $\mathrm{P}$ to be the Euclidean projection onto $Q$ works well for BinaryConnect but not at all for ProxQuant.

**Reversing BinaryConnect.** The above comparison naturally suggests a reversed variant of BC:

$$w_t = \mathrm{P}(w_*^t), \qquad w_*^{t+1} = w_t - \eta_t\, \widetilde\nabla\ell(w_*^t), \qquad (6)$$

which amounts to switching the continuous and quantized weights in the BC update. Similar to PQ, the choice of $\mathrm{P}$ is critical in this setup. To the best of our knowledge, this variant has not been formally studied before. Reversing BinaryConnect may not be intuitive; however, it serves as a starting point for a more general method (see Section 5).

**Post-Training Quantization.** Lastly, we can also rewrite the naive post-training quantization scheme in a similar form:

$$w_t = \mathrm{P}(w_*^t), \qquad w_*^{t+1} = w_*^t - \eta_t\, \widetilde\nabla\ell(w_*^t), \qquad (7)$$

where we simply train the continuous network as usual and only quantize at the end. Note that the quantized weights $w_t$ do not affect the update of the continuous weights $w_*^t$.
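The four schemes above differ only in where the gradient is evaluated and which iterate it updates. The toy sketch below makes this explicit; it is a schematic NumPy illustration on a least-squares problem with the sign quantizer, not the training code used in the paper.

```python
import numpy as np

# Toy comparison of the update schemes (3), (5), (6), (7).
# Illustrative only: sign() plays the role of the quantizer P and the
# "network" is a small least-squares problem.

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
b = A @ np.sign(rng.normal(size=8))    # target realizable by binary weights

def grad(w):                           # full-batch gradient of 0.5*mean((Aw - b)^2)
    return A.T @ (A @ w - b) / len(b)

P = np.sign                            # Euclidean projection onto Q = {-1, +1}^d

def run(scheme, steps=500, eta=0.5):
    w_star = rng.normal(size=8)        # continuous weights w_*
    for _ in range(steps):
        w = P(w_star)                  # quantized weights w_t = P(w_*^t)
        if scheme == "BC":             # (3): gradient at w_t, update w_*
            w_star = w_star - eta * grad(w)
        elif scheme == "PQ":           # (5): gradient at w_t, update w_t
            w_star = w - eta * grad(w)
        elif scheme == "rBC":          # (6): gradient at w_*, update w_t
            w_star = w - eta * grad(w_star)
        elif scheme == "PTQ":          # (7): plain training, quantize at the end
            w_star = w_star - eta * grad(w_star)
    return P(w_star)

for s in ["BC", "PQ", "rBC", "PTQ"]:
    w = run(s)
    print(s, "loss of quantized weights:", 0.5 * np.mean((A @ w - b) ** 2))
```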
## 3 What Makes a Good Quantizer?

As we have seen in Section 2, the choice of the quantizer $\mathrm{P}$ turns out to be a crucial element for solving (1). Indeed, if $\mathrm{P} = \mathrm{P}_Q$ is the projector onto the discrete quantization set $Q$, then BC (and ProxQuant) only evaluate the gradient of $\ell$ at (the finite set of) points in $Q$. As a result, the methods are not able to exploit the rich information carried in the continuous weights, which can lead to non-convergence [Fig. 1b, 5]. Since then, many semi-discrete quantizers that turn continuous weights into more and more discrete ones have been proposed [1, 5, 33, 45]. In this section, we present a principled way to construct quantizers that unifies previous efforts, removes tedious calculations, and also brings theoretical convenience when it comes to analyzing the convergence of the algorithms in Section 2.

Our construction is based on the proximal map $\mathrm{P}^\mu_r : \mathbb{R}^d \rightrightarrows \mathbb{R}^d$ of a (closed) function $r$:

$$\mathrm{P}^\mu_r(w_*) = \operatorname*{arg\,min}_{w} \; \tfrac{1}{2\mu}\|w - w_*\|_2^2 + r(w), \qquad (8)$$

where $\mu > 0$ is a smoothing parameter. The proximal map is well-defined as long as the function $r$ is lower bounded by a quadratic function, in particular when $r$ is bounded from below. If $r$ is proper and (closed) convex, then the minimizer on the right-hand side of (8) is uniquely attained, while for general (nonconvex) functions the proximal map may be multi-valued (hence the set-valued notation $\rightrightarrows$). If $r = \iota_Q$ is the indicator function (see (15) below), then $\mathrm{P}^\mu_r$ reduces to the familiar projector $\mathrm{P}_Q$ (for any $\mu$). Remarkably, a complete characterization of such maps on the real line is available:

**Theorem 3.1** ([Proposition 3, 48]). A (possibly multi-valued) map $\mathrm{P}: \mathbb{R} \rightrightarrows \mathbb{R}$ is a proximal map (of some function $r$) iff it is (nonempty) compact-valued, monotone, and has a closed graph. The underlying function $r$ is unique (up to addition of constants) iff $\mathrm{P}$ is convex-valued, while $r$ is convex iff $\mathrm{P}$ is nonexpansive (i.e., 1-Lipschitz continuous).

The sufficient and necessary conditions of Theorem 3.1 allow one to design proximal maps $\mathrm{P}$ directly, without needing to know the underlying function $r$ at all (even though it is possible to integrate a version of $r$ from $\mathrm{P}$). The point is that, as far as quantization algorithms are concerned, having $\mathrm{P}$ is enough, and hence one is excused from the tedious calculations involved in deriving $\mathrm{P}$ from $r$, as is typical in existing works. For example, the (univariate) mirror maps constructed in Ajanthan et al. [Theorem 1, 1], such as tanh, are proximal maps according to Theorem 3.1: in our notation, the (mirror descent) updates of Ajanthan et al. [1] are the same as (3) with $\mathrm{P} = (\nabla\Phi)^{-1}$ for some mirror map $\Phi$. Since $\Phi$ is taken to be strictly convex in Ajanthan et al. [1], it is easy to verify that $\mathrm{P}$ satisfies all conditions of Theorem 3.1.²

The next result allows us to employ stochastic quantizers, where in each iteration we randomly choose one of the $\mathrm{P}_i$ to quantize the weights.³ We may also apply different quantizers to different layers of a neural network.

**Theorem 3.2.** Let $\mathrm{P}_i : \mathbb{R}^d \rightrightarrows \mathbb{R}^d$, $i \in [k]$, be proximal maps. Then, the averaged map

$$\mathrm{P} := \sum_{i=1}^k \alpha_i \mathrm{P}_i, \quad \text{where } \alpha_i \ge 0, \; \sum_{i=1}^k \alpha_i = 1, \qquad (9)$$

is also a proximal map. Similarly, the product map

$$\mathrm{P} := \mathrm{P}_1 \times \mathrm{P}_2 \times \cdots \times \mathrm{P}_k, \qquad w_* = (w_{*,1}, \ldots, w_{*,k}) \mapsto \big(\mathrm{P}_1(w_{*,1}), \ldots, \mathrm{P}_k(w_{*,k})\big), \qquad (10)$$

is a proximal map (from $\mathbb{R}^{dk}$ to $\mathbb{R}^{dk}$).

(The proof of Theorem 3.2 and all other omitted proofs can be found in Appendix A.)

**Example 3.3.** Let $Q$ be a quantization set (e.g., $Q = \{-1, 0, 1\}$). Clearly, the identity map $\mathrm{id}$ and the projector $\mathrm{P}_Q$ are proximal maps. Therefore, their convex combination $\mathrm{P}_\mu = \frac{\mathrm{id} + \mu \mathrm{P}_Q}{1 + \mu}$, $\mu \ge 0$, is also a proximal map, which is exactly the quantizer used in Yin et al. [45].

Lastly, we mention that it is intuitively desirable to have $\mathrm{P}_\mu \to \mathrm{P}_Q$ when $\mu$ increases indefinitely, so that the quantizer eventually settles on bona fide discrete values in $Q$.
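A few lines of NumPy illustrate this limiting behavior for the quantizer of Example 3.3 (an illustrative sketch; the helper names are ours, not from the paper).

```python
import numpy as np

# Example 3.3 in code: P_mu = (id + mu * P_Q) / (1 + mu) for a ternary set.

Q = np.array([-1.0, 0.0, 1.0])

def proj_Q(w):
    """Euclidean projection onto Q: nearest codeword, elementwise."""
    return Q[np.argmin(np.abs(w[..., None] - Q), axis=-1)]

def P_mu(w, mu):
    """Convex combination of identity and projector (a proximal map by Thm 3.2)."""
    return (w + mu * proj_Q(w)) / (1.0 + mu)

w = np.array([-1.3, -0.4, 0.1, 0.7, 2.0])
for mu in [0.0, 1.0, 10.0, 1000.0]:
    print(f"mu = {mu:7.1f} ->", np.round(P_mu(w, mu), 3))
# As mu grows, the output approaches proj_Q(w) = [-1, 0, 0, 1, 1].
```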
Such limiting behavior is easily achieved by letting the minimizers of the underlying function $r$, or the (possibly larger) set of fixed points of $\mathrm{P}_\mu$, approach $Q$. We give one such construction by exploiting Theorem 3.1 and Theorem 3.2.

### 3.1 General Piecewise Linear Quantizers

Let $Q_j = \{q_{j,k}\}_{k=1}^{b_j}$ (with $q_{j,1} \le \cdots \le q_{j,b_j}$) be the quantization set for the $j$-th weight group⁴ $w_{*,j} \in \mathbb{R}^{d_j}$. Let $p_{j,k+1} := \frac{q_{j,k} + q_{j,k+1}}{2}$, $k \in [b_j - 1]$, be the middle points. We introduce two parameters $\rho, \varrho \ge 0$ and define

horizontal shifts: $\quad q^-_{j,k} := p_{j,k} \vee (q_{j,k} - \rho), \qquad q^+_{j,k} := p_{j,k+1} \wedge (q_{j,k} + \rho), \qquad (11)$

vertical shifts: $\quad p^-_{j,k+1} := q_{j,k} \vee (p_{j,k+1} - \varrho), \qquad p^+_{j,k+1} := q_{j,k+1} \wedge (p_{j,k+1} + \varrho), \qquad (12)$

with $q^-_{j,1} = q_{j,1}$ and $q^+_{j,b_j} = q_{j,b_j}$. Then, we define $\mathrm{L}^\varrho_\rho$ as the piecewise linear map (that simply connects the points by straight lines):

$$\mathrm{L}^\varrho_\rho(w_*) := \begin{cases} q_{j,k}, & \text{if } q^-_{j,k} \le w_* \le q^+_{j,k}, \\ q_{j,k} + (w_* - q^+_{j,k})\,\dfrac{p^-_{j,k+1} - q_{j,k}}{p_{j,k+1} - q^+_{j,k}}, & \text{if } q^+_{j,k} \le w_* < p_{j,k+1}, \\ p^+_{j,k+1} + (w_* - p_{j,k+1})\,\dfrac{q_{j,k+1} - p^+_{j,k+1}}{q^-_{j,k+1} - p_{j,k+1}}, & \text{if } p_{j,k+1} < w_* \le q^-_{j,k+1}, \end{cases} \qquad (13)$$

applied elementwise to all entries $w_*$ of $w_{*,j}$. At the middle points, $\mathrm{L}^\varrho_\rho(p_{j,k+1})$ may take any value within the two limits. Following the commonly used weight clipping in BC [11], we may set $\mathrm{L}^\varrho_\rho(w_*) = q_{j,1}$ for $w_* < q_{j,1}$ and $\mathrm{L}^\varrho_\rho(w_*) = q_{j,b_j}$ for $w_* > q_{j,b_j}$; however, other choices may also work well. The proximal map $\mathrm{L}^\varrho_\rho$ of an example ternary component is visualized in Figure 1 for different choices of $\rho$ and $\varrho$.

Figure 1: Different instantiations of the proximal map $\mathrm{L}^\varrho_\rho$ in (13) for $Q = \{-1, 0, 1\}$: (a) $\rho = 0$, $\varrho = 0.2$; (b) $\rho = \varrho = 0.2$; (c) $\rho = 0.2$, $\varrho = 0$.

The (horizontal) parameter $\rho$ controls the discretization vicinity within which a continuous weight is pulled exactly into the discrete set $Q_j$, while the (vertical) parameter $\varrho$ controls the slope (i.e., the expansiveness) of each piece. It follows at once from Theorem 3.1 that $\mathrm{L}^\varrho_\rho$ is indeed a proximal map. In particular, setting $\rho = 0$ (hence continuous weights are only discretized in the limit) and $\varrho = \frac{\mu}{2(1+\mu)}$ recovers Example 3.3 (assuming w.l.o.g. that $q_{j,k+1} - q_{j,k} \equiv 1$). On the other hand, setting $\rho = \varrho$ leads to a generalization of the (binary) quantizer in Bai et al. [5], which keeps the slope at the constant 1 (while introducing jumps at the middle points) and happens to be the proximal map of the distance function to $Q_j$ [5]. Of course, setting $\rho = \varrho = 0$ yields the identity map and allows us to skip quantizing certain weights (as is common practice), while letting $\rho, \varrho \to \infty$ recovers the projector $\mathrm{P}_{Q_j}$. Needless to say, we may adapt the quantization set $Q_j$ and the parameters $\varrho$ and $\rho$ for different weight groups, creating a multitude of quantization schemes. By Theorem 3.2, the overall operator remains a proximal map.

²The above reasoning hinges on $\Phi$ being univariate. More generally, if a multivariate mirror map $\Phi$ is 1-strongly convex (as is typical in mirror descent), then it follows from Moreau [Corollaire 10.c, 30] that $\mathrm{P} = (\nabla\Phi)^{-1}$, being a nonexpansion, is again a proximal map (of some convex function).
³This form of stochasticity still leads to deterministic networks, and is therefore conceptually different from probabilistic (quantized) networks [35, 39].
⁴In this section, time subscripts are not needed. Instead, we use subscripts to indicate weight groups.

## 4 Demystifying BinaryConnect (BC)

Bai et al. [5] observed that the updates of BC are formally the same as those of the dual averaging (DA) algorithm [31, 42], even though the latter algorithm was originally proposed and analyzed only for convex problems. A lesser-known fact is that (regularized) dual averaging itself is a special case of the generalized conditional gradient (GCG) algorithm.
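Before developing this connection, it may help to make the quantizer of Section 3.1 concrete. The following sketch is an illustrative NumPy implementation of $\mathrm{L}^\varrho_\rho$ in (11)–(13) for a single weight group, including the boundary clipping described above; it is a sketch under those definitions, not the implementation used in the experiments. With $Q = \{-1, 0, 1\}$ and $\rho = \varrho = 0.2$ it corresponds to the setting of Figure 1(b).

```python
import numpy as np

# Illustrative sketch of the piecewise-linear proximal quantizer L^varrho_rho
# of Section 3.1 for one weight group, with weight clipping at the endpoints.

def L(w, Q, rho, varrho):
    Q = np.sort(np.asarray(Q, dtype=float))
    p = (Q[:-1] + Q[1:]) / 2.0                       # middle points
    out = np.empty_like(w, dtype=float)
    for i, x in np.ndenumerate(w):
        x = float(np.clip(x, Q[0], Q[-1]))           # weight clipping as in BC
        k = int(np.searchsorted(p, x))               # index of the cell around Q[k]
        q_lo = Q[k] - rho if k > 0 else Q[0]         # q^-_k  (vee with p_k)
        q_lo = max(q_lo, p[k - 1]) if k > 0 else q_lo
        q_hi = Q[k] + rho if k < len(p) else Q[-1]   # q^+_k  (wedge with p_{k+1})
        q_hi = min(q_hi, p[k]) if k < len(p) else q_hi
        if q_lo <= x <= q_hi:                        # flat piece: snap to Q[k]
            out[i] = Q[k]
        elif x > q_hi:                               # linear piece towards p_{k+1}
            p_lo = max(Q[k], p[k] - varrho)          # p^-_{k+1}
            out[i] = Q[k] + (x - q_hi) * (p_lo - Q[k]) / (p[k] - q_hi)
        else:                                        # linear piece coming from p_k
            p_hi = min(Q[k], p[k - 1] + varrho)      # p^+_k
            out[i] = p_hi + (x - p[k - 1]) * (Q[k] - p_hi) / (q_lo - p[k - 1])
    return out

w = np.linspace(-1.5, 1.5, 7)
print(L(w, Q=[-1.0, 0.0, 1.0], rho=0.2, varrho=0.2))
# approximately [-1. -1. -0.7  0.  0.3  1.  1.]
```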
In this section, we first present the aforementioned fact, refine the observation of Bai et al. [5], and set up the stage for generalizing BC. 4.1 Generalized Conditional Gradient is Primal-Dual We first present the generalized conditional gradient (GCG) algorithm [8, 49] and point out its ability to solve simultaneously the primal and dual problems. Let us consider the following regularized problem: min w Rd f(w) := ℓ(w) + r(w), (14) where r is a general (nonconvex) regularizer. Setting r to the indicator function of Q, i.e., r(w) = ιQ(w) = 0, if w Q , otherwise , (15) reduces (14) to the original problem (1). As we will see, incorporating an arbitrary r does not add any complication but will allow us to immediately generalize BC. Introducing the Fenchel conjugate function ℓ (resp. r ) of ℓ(resp. r): ℓ (w ) := supw w, w ℓ(w), (16) which is always (closed) convex even when ℓitself is nonconvex, we state the Fenchel Rockafellar dual problem [38]: min w Rd ℓ ( w ) + r (w ), (17) which, unlike the original problem (14), is always a convex problem. We apply the generalized conditional gradient algorithm [8, 49] to solving the dual problem5 (17): Given w t , we linearize the function r and solve z t = arg min w ℓ ( w ) + w , wt = ℓ (wt), wt := r (w t ), (18) where we used the fact that ( ℓ ) 1 = ℓ . Then, we take the convex combination w t+1 = (1 λt)w t + λtz t , where λt [0, 1]. (19) The following theorem extends Bach [Proposition 4.2, 4] and Yu [Theorem 4.6, 47] to any λt: Theorem 4.1. Suppose r is L-smooth (i.e. r is L-Lipschitz continuous), then for any w: λτ πτ [(ℓ + r )(wτ) (ℓ + r )(w)] (1 λ0) (w, w0) + λ2 τ 2πτ L w τ z τ 2 2, (20) where wt := r (w t ), πt := Qt τ=1(1 λτ), and π0 := 1. (w, wt) := r (w) r (wt) w wt, w t is the Bregman divergence induced by the convex function r . While the convergence of w t to the minimum of (17) is well-known (see e.g. Yu et al. [49]), the above result also implies that a properly averaged iterate wt also converges to the minimum of the dual problem of (17): Corollary 4.2. Let wt := Pt τ=0 Λt,τwτ, where Λt,τ := λτ πτ /Ht and Ht := Pt τ=0 λτ πτ . Then, we have for any w: (ℓ + r )( wt) (ℓ + r )(w) (1 λ0) (w, w0) τ=0 λτΛt,τ w τ z τ 2 2. (21) Assuming {z τ} is bounded (e.g. when ℓ is Lipschitz continuous), the right-hand side of (21) diminishes if λt 0 and P t λt = . Setting λt = 1 t+1 recovers ergodic averaging wt = 1 t+1 Pt τ=0 wτ for which the right-hand side of (21) diminishes at the rate6 O(log t/t); see Appendix A.2 for details. Thus, GCG solves problem (17) and its dual simultaneously. 4.2 BC DA GCG We are now in a position to reveal the relationships among Binary Connect (BC), (regularized) dual averaging (DA) and the generalized conditional gradient (GCG). Since Theorem 4.1 requires r to be L-smooth, in the event that it is not we resort to a smooth approximation known as the Moreau envelope [30]: M µ r (w ) = min z 1 2µ w z 2 2 + r (z ), (22) where the minimizer is (by definition) exactly Pµ r (w ). It is well-known that M µ r is (1/µ)-smooth and (M µ r ) = r + µ 2 2 2 [30]. We then apply GCG to the approximate dual problem: min w Rd ℓ ( w ) + M µ r (w ), (23) whose own Fenchel Rockfellar dual is: min w Rd ℓ (w) + (M µ r ) (w) = min w Rd ℓ (w) + r (w) + µ 2 w 2 2. (24) 5GCG is usually applied to solving the primal problem (14) directly. Our choice of the dual problem here is to facilitate later comparison with dual averaging and Binary Connect. 6The log factor can be removed by setting, for example, λt = 2 2+t instead. 
The updates of GCG applied to the approximate problem (23) are thus: wt := M µ r (w t ) = P1/µ r (w t /µ) (see Proposition A.2 for derivation) (25) w t+1 = (1 λt)w t + λtz t , where z t = ℓ (wt), (26) which is exactly the updates of (regularized) dual averaging [42] for convex problems where7 r = r and ℓ = ℓ. Nesterov [31] motivated dual averaging by the natural desire of non-decreasing step sizes, whereas conventional subgradient algorithms counter-intuitively assign smaller step sizes to more recent iterates instead. Based on our explanation, we conclude this is possible because dual averaging solves an (approximate) smoothened dual problem, hence we can afford to use a constant (rather than a diminishing/decreasing) step size. Defining πt := Qt s=1(1 λs) for t 1, π0 := 1, π 1 := (1 λ0) 1, and setting w t = w t /πt 1, we have: wt = P1/µ r (πt 1w t /µ), w t+1 = w t λt πt ℓ (wt). (27) Let us now reparameterize πt = λt = ηt 1+Pt τ=1 ητ and 1 πt = 1 + Pt τ=1 ητ, (28) for t 1. If we also allow µ = µt to change adaptively from iteration to iteration (as in dual averaging), in particular, if µt = πt 1, we obtain the familiar update: wt = P1/πt 1 r (w t ), w t+1 = w t ηt ℓ (wt). (29) For nonconvex problems, we may replace r and ℓ with their nonconvex counterparts r and ℓ, respectively. We remark that r (resp. ℓ ) is the largest convex function that is (pointwise) majorized by r (resp. ℓ). In particular, with r = ιQ and P1/µ r = PQ (for any µ) we recover the Binary Connect update (3). While the parameter µ plays no role when r is an indicator function, we emphasize that for general r we should use the quantizer P1/πt 1 r , where importantly 1/πt 1 hence the quantizer converges to minimizers of r asymptotically. Neglecting this crucial detail may lead to suboptimality as is demonstrated in the following example: Example 4.3. Bai et al. [5] constructed the following intriguing example: 2w2, Q = { 1}, P1/µ r (w) = sign(w) ϵ|w|+ 1 µ for |w| 1 , (30) and they showed non-convergence of the algorithm w w η ℓ(P1/µ r (w)), where µ is a fixed constant. If we use P1/πt 1 r with some diverging 1/πt 1 instead, the resulting Binary Connect, with diminishing ηt or ergodic averaging, would actually converge to 0 (since P1/πt 1 r sign). 5 Prox Connect (PC): A Generalization of Binary Connect We are now ready to generalize BC by combining the results from Section 3 and Section 4. Replacing the convex envelopes, ℓ and r , with their nonconvex counterparts and replacing deterministic gradients with stochastic gradients (as well as the change-of-variable w t w t ), we obtain from (29) a family of algorithms which we term Prox Connect (PC): wt = P1/πt 1 r (w t ), w t+1 = w t ηt e ℓ(wt), (31) where the quantizer P1/πt 1 r may be designed directly by following Section 3. We have already seen in Section 4 that BC belongs to PC by choosing P1/πt 1 r = PQ (in which case πt 1 plays no role). The analysis in Section 4, initially tailored to convex functions, immediately generalizes to the nonconvex algorithm PC (for nonconvex ℓ, nonconvex r, and stochastic gradients e ℓ): 7That is, if we set λt = 1/t and allow for time-dependent µt = t; see Xiao [Algorithm 1, 42]. Theorem 5.1. Fix any w, the iterates in (31) satisfy: τ=s ητ[ wτ w, e ℓ(wτ) +r(wτ) r(w)] s 1(w) t(w)+ τ=s τ(wτ), (32) where τ(w) := rτ(w) rτ(wτ+1) w wτ+1, w τ+1 is the Bregman divergence induced by the (possibly nonconvex) function rτ(w) := 1 πτ r(w) + 1 The summand on the left-hand side of (32) is related to the duality gap in Yu et al. 
[49], which is a natural measure of stationarity for the nonconvex problem (14). Indeed, it reduces to the familiar ones when convexity is present: Corollary 5.2. For convex ℓand any w, the iterates in (31) satisfy: min τ=s,...,t E[f(wτ) f(w)] 1 Pt τ=s ητ E s 1(w) t(w)+ Xt τ=s τ(wτ) . (33) If r is also convex, then min τ=s,...,t E[f(wτ) f(w)] 1 Pt τ=s ητ E s 1(w)+ Xt η2 τ 2 e ℓ(wτ) 2 2 , (34) E f( wt) f(w) 1 Pt τ=s ητ E s 1(w)+ Xt η2 τ 2 e ℓ(wτ) 2 2 , (35) Pt τ=s ητ wτ Pt τ=s ητ . The right-hand sides of (34) and (35) diminish iff ηt 0 and P t ηt = (assuming boundedness of the stochastic gradient). We note some trade-off in choosing the step size ητ: both the numerator and denominator of the right-hand sides of (34) and (35) are increasing w.r.t. ητ. The same conclusion can be drawn for (33) and (32), where τ also depends on ητ (through the accumulated magnitude of w τ+1). A detailed analysis may need to take specific properties of r or P into account [45]. Prox Quant vs Prox Connect. It is worthwhile to point out one important difference between Prox Quant and Prox Connect: Bai et al. [5] proved convergence (to some notion of stationarity) of Prox Quant for a fixed quantizer (see Bai et al. [Theorem 5.1, 5]), i.e., Pµ r for a fixed µ, but their experiments relied on incrementing µ so that their quantizer approaches the projector PQ. This creates some discrepancy between theory and practice. The same comment also applies to [1]. In contrast, Prox Connect is derived from a rigorous theory that automatically justifies a diverging µ. In particular, choosing a constant step size ητ η0 would lead to 1/πt 1 t, matching the current practice that is now justifiable if r is strongly convex; see Appendix A.3.1 for details. Connection to Existing Algorithms. Besides the obvious BC, Prox Connect generalizes many other quantization algorithms. As it turns out, many of these algorithms can also be realized using our proposed proximal quantizer Lϱ ρ from Section 3; see Table 1 for a sample summary. reverse Prox Connect. In Section 2, we discussed the idea of reversing Binary Connect. As Prox Connect generalizes BC, we also present reverse Prox Connect (r PC) as a generalization of reversing Binary Connect: wt = P1/πt 1 r (w t ), w t+1 = wt ηt e ℓ(w t ), (36) In contrast to reversing Binnary Connect, r PC is not completely without merit: it evaluates the gradient at the continuous weights w t and hence is able to exploit a richer landscape of the loss. Even when stuck at a fixed discrete weight wt, r PC may still accumulate sizable updates (as long as the step size and the gradient remain sufficiently large) to allow it to eventually jump out of wt: note that the continuous weights w t still get updated. Finally, for constant step size η and π, we note that fixed points of r PC, when existent, satisfy: w = P1/π r (w ) η ℓ(w ) w = (id + η ℓ) 1 id + η h r πη i 1 (w ) (37) =: B η, ℓ, r πη (w ), (38) Table 1: A sample summary of existing quantization algorithms. The PC column indicates if the method is a special case of our proposed Prox Connect algorithm. The Lϱ ρ-column indicates if the method uses a quantizer which is a special case of our general quantizer Lϱ ρ introduced in Section 3. If so, the ρ, ϱ-column states how ρ and ϱ were chosen (in practice): increasing , fixed to 0, or fixed to . Other than Prox Quant-Ternary, all methods can compute their quantizers in a single neural network pass. : Trained Ternary methods might use a quantizer different than Lϱ ρ to improve performance. 
Method PC One-pass Learnable parameters Lϱ ρ ρ, ϱ Prox Quant-Binary-L1 [5] Prox Quant-Ternary [5] - Binary Connect [11] , Binary Relax [45] 0, Binary Weight [37] - Mirror Descent View [1] - Trained Ternary [53] , Ternary Weight [26] - Prox Connect (ours) - where for simplicity we assumed deterministic gradients ℓand r to be convex such that P1/π r = (id+ r/π) 1. The operator B is known as the backward-backward update (as opposed to the forwardbackward update in Prox Quant), and it is known that when η 0 slowly, backward-backward updates converge to a stationary point [34]. Thus, despite of our current limited understanding of r PC, there is some reason (and empirical evidence as shown in Section 6) to believe it might still be interesting. Importance of GCG framework: Deriving PC from the GCG framework may let us transfer recent advances on GCG [7, 14, 50] to neural network quantization. Furthermore, it is the cornerstone in justifying the widely-adopted diverging smoothing parameter. 6 Experiments 6.1 Classification on CIFAR-10 We perform image classification on CIFAR-10 [24] using Res Net20 and Res Net56 [19], comparing Binary Connect [11] with Prox Quant [5] and (reverse)Prox Connect using our proposed proximal operator Lϱ ρ. For fair comparison, we set ρ = ϱ in Lϱ ρ as this resembles the quantizer (for binary quantization) used in the original Prox Quant algorithm. Similar to Bai et al. [5], we increase the parameter ρ (or equivalently ϱ) linearly: ρt = (1 + t/B) ρ0. In contrast to Bai et al. [5], however, we increase ρ after every gradient step rather than after every epoch as this is more in line with our analysis. We treat ρ0 as a hyperparameter for which we conduct a small grid search. We consider binary (Q = { 1, 1}), ternary (Q = { 1, 0, 1}) and quaternary (Q = { 1, 0.3, 0.3, 1}) quantization. Details for all CIFAR-10 experiments can be found in Appendix B.1. Fine-Tuning Pretrained Models. In this experiment, all quantization algorithms are initialized with a pretrained Res Net. The test accuracies of the pretrained Res Net20 and Res Net56 are 92.01 and 93.01, respectively. Table 2 shows the final test accuracies for the different models. For Prox Quant and (reverse)Prox Connect we respectively picked the best ρ0 values; results for all ρ0 values can be found in Appendix B.1.7. (Reverse)Prox Connect outperforms the other two methods on all six settings. End-To-End Training. To save computational costs, it is important that quantization algorithms also perform well when they are not initialized with pretrained full-precision model. We therefore compare the four methods for randomly initialized models; see Table 3 for the results. Prox Connect outperforms all other methods on all six tasks. Interestingly, Prox Quant and reverse Prox Connect perform considerably worse for all six tasks when compared to fine-tuning. The performance drop of Binary Connect and Prox Connect when compared to fine-tuning is only significant for ternary Table 2: Fine-tuning pretrained Res Nets. Final test accuracy: mean and standard deviation in 3 runs. 
| Model | Quantization | BC [11] | PQ [5] | rPC (ours) | PC (ours) |
| --- | --- | --- | --- | --- | --- |
| ResNet20 | Binary | 90.31 (0.00) | 89.94 (0.10) | 89.98 (0.17) | 90.31 (0.21) |
| ResNet20 | Ternary | 74.95 (0.16) | 91.46 (0.06) | 91.47 (0.19) | 91.37 (0.18) |
| ResNet20 | Quaternary | 91.43 (0.07) | 91.43 (0.21) | 91.43 (0.06) | 91.81 (0.14) |
| ResNet56 | Binary | 92.22 (0.12) | 92.33 (0.06) | 92.47 (0.29) | 92.65 (0.16) |
| ResNet56 | Ternary | 74.68 (1.4) | 93.07 (0.02) | 92.84 (0.11) | 93.25 (0.12) |
| ResNet56 | Quaternary | 93.20 (0.06) | 92.82 (0.16) | 92.91 (0.26) | 93.42 (0.12) |

Table 3: End-to-end training of ResNets. Final test accuracy: mean and standard deviation in 3 runs.

| Model | Quantization | BC [11] | PQ [5] | rPC (ours) | PC (ours) |
| --- | --- | --- | --- | --- | --- |
| ResNet20 | Binary | 87.51 (0.21) | 81.59 (0.75) | 81.82 (0.32) | 89.92 (0.65) |
| ResNet20 | Ternary | 27.10 (0.21) | 47.98 (1.30) | 47.17 (1.94) | 84.09 (0.16) |
| ResNet20 | Quaternary | 89.91 (0.09) | 85.29 (0.09) | 85.05 (0.27) | 90.17 (0.14) |
| ResNet56 | Binary | 89.79 (0.45) | 86.13 (1.71) | 86.25 (1.50) | 91.26 (0.59) |
| ResNet56 | Ternary | 30.31 (7.79) | 50.54 (3.68) | 42.95 (1.57) | 84.36 (0.75) |
| ResNet56 | Quaternary | 90.69 (0.57) | 87.81 (1.60) | 87.30 (1.02) | 91.70 (0.14) |

quantization. We found that ProxQuant and reverse ProxConnect can be quite sensitive to the choice of $\rho_0$, whereas ProxConnect is stable in this regard; see Appendix B.1.7.

### 6.2 Classification on ImageNet

We perform a small study on ImageNet [12] using ResNet18 [19]. As can be seen in Table 4, BC performs slightly better for fine-tuning, whereas PC performs slightly better for end-to-end training. This is not an exhaustive study, but rather a first indication that PC can yield competitive performance on large-scale datasets. For more details on the experiments see Appendix B.2.

Table 4: Fine-tuning (top) and end-to-end training (bottom) on ImageNet. Final test accuracy: mean and standard deviation over three runs.

| | BC [11] | PC (ours), $\rho_0$ = 2e−2 | PC (ours), $\rho_0$ = 4e−2 |
| --- | --- | --- | --- |
| Fine-tuning | 65.84 (0.04) | 65.44 (0.13) | 65.70 (0.04) |

| | BC [11] | PC (ours), $\rho_0$ = 2.5e−3 | PC (ours), $\rho_0$ = 5e−3 |
| --- | --- | --- | --- |
| End-to-end | 63.79 (0.12) | 63.89 (0.14) | 63.67 (0.12) |

## 7 Conclusion

Capitalizing on a principled approach for designing quantizers and a surprising connection between BinaryConnect and the generalized conditional gradient (GCG) algorithm, we proposed ProxConnect as a unification and generalization of existing neural network quantization algorithms. Our analysis refines prior convergence guarantees and our experiments confirm the competitiveness of ProxConnect. In future work, we plan to apply ProxConnect to training other models such as transformers. The connection with GCG also opens the possibility for further acceleration.

## Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their constructive comments as well as the area chair and the senior area chair for overseeing the review process. YY thanks NSERC for funding support. We thank Jingjing Wang for some early discussions.

## References

[1] Thalaiyasingam Ajanthan, Kartik Gupta, Philip Torr, Richard Hartley, and Puneet Dokania. Mirror Descent View for Neural Network Quantization. In International Conference on Artificial Intelligence and Statistics, 2021.
[2] Milad Alizadeh, Javier Fernández-Marqués, Nicholas D. Lane, and Yarin Gal. An Empirical Study of Binary Neural Networks Optimisation. In International Conference on Learning Representations, 2019.
[3] Alexander G. Anderson and Cory P. Berg. The High-Dimensional Geometry of Binary Neural Networks. In International Conference on Learning Representations, 2018.
[4] Francis Bach. Duality Between Subgradient and Conditional Gradient Methods. SIAM Journal on Optimization, 25(1):115–129, 2015.
[5] Yu Bai, Yu-Xiang Wang, and Edo Liberty. ProxQuant: Quantized Neural Networks via Proximal Operators.
In International Conference on Learning Representations, 2018. [6] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. ar Xiv:1308.3432, 2013. [7] Leonard Berrada, Andrew Zisserman, and M. Pawan Kumar. Deep Frank Wolfe For Neural Network Optimization. In International Conference on Learning Representations, 2019. [8] Kristian Bredies and Dirk A. Lorenz. Iterated Hard Shrinkage for Minimization Problems with Sparsity Constraints. SIAM Journal on Scientific Computing, 30(2):657 683, 2008. [9] Tom Brown and et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, pages 1877 1901, 2020. [10] Shangyu Chen, Wenya Wang, and Sinno Jialin Pan. Meta Quant: Learning to Quantize by Learning to Penetrate Non-differentiable Quantization. In Advances in Neural Information Processing Systems, 2019. [11] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binary Connect: Training Deep Neural Networks with binary weights during propagations. In Advances in Neural Information Processing Systems, volume 28, pages 3123 3131, 2015. [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Image Net: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248 255, 2009. [13] Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. On the Universal Approximability and Complexity Bounds of Quantized Re LU Neural Networks. In International Conference on Learning Representations, 2019. [14] Pavel Dvurechensky, Petr Ostroukhov, Kamil Safin, Shimrit Shtern, and Mathias Staudigl. Self-Concordant Analysis of Frank Wolfe Algorithms. In International Conference on Machine Learning, 2020. [15] Yunhui Guo. A Survey on Methods and Theories of Quantized Neural Networks. ar Xiv:1808.04752, 2018. [16] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In International Conference on Machine Learning, 2015. [17] Kai Han, Yunhe Wang, Yixing Xu, Chunjing Xu, Enhua Wu, and Chang Xu. Training Binary Neural Networks through Learning with Noisy Supervision. In International Conference on Machine Learning, 2020. [18] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. ACM SIGARCH Computer Architecture News, 44(3):243 254, 2016. [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. [20] Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder. Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization. In Advances in Neural Information Processing Systems, 2019. [21] Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. Characterising Bias in Compressed Models. ar Xiv:2010.03058, 2020. [22] Lu Hou, Ruiliang Zhang, and James T. Kwok. Analysis of Quantized Models. In International Conference on Learning Representations, 2019. [23] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. Journal of Machine Learning Research, 18(1):6869 6898, 2017. [24] Alex Krizhevsky. 
Learning Multiple Layers of Features from Tiny Images, 2009. The MIT License. [25] Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely Low Bit Neural Network: Squeeze the Last Bit Out With ADMM. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [26] Fengfu Li, Bo Zhang, and Bin Liu. Ternary Weight Networks. ar Xiv:1605.04711, 2016. [27] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training Quantized Nets: A Deeper Understanding. In Advances in Neural Information Processing Systems, 2017. [28] Ji Lin, Wei-Ming Chen, Yujun Lin, Chuang Gan, and Song Han. MCUNet: Tiny Deep Learning on Io T Devices. Advances in Neural Information Processing Systems, 2020. [29] Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos. Training binary neural networks with real-to-binary convolutions. In International Conference on Learning Representations, 2020. [30] Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273 299, 1965. [31] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221 259, 2009. [32] E. Park, J. Ahn, and S. Yoo. Weighted-Entropy-Based Quantization for Deep Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7197 7205, 2017. [33] Vahid Partovi Nia and Mouloud Belbahri. Binary Quantizer. Journal of Computational Vision and Imaging Systems, 4(1):3, 2018. [34] Gregory B Passty. Ergodic Convergence to a Zero of the Sum of Monotone Operators in Hilbert Space. Journal of Mathematical Analysis and Applications, 72(2):383 390, 1979. [35] Jorn WT Peters and Max Welling. Probabilistic Binary Neural Networks. ar Xiv:1809.03368, 2018. [36] Haotong Qin, Ruihao Gong, Xianglong Liu, Xiao Bai, Jingkuan Song, and Nicu Sebe. Binary Neural Networks: A Survey. Pattern Recognition, 105:107281, 2020. [37] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: Image Net Classification Using Binary Convolutional Neural Networks. In European Conference on Computer Vision, 2016. [38] R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis. Springer, 1998. [39] Oran Shayer, Dan Levi, and Ethan Fetaya. Learning Discrete Weights Using the Local Reparameterization Trick. In International Conference on Learning Representations, 2018. [40] Emma Strubell, Ananya Ganesh, and Andrew Mc Callum. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645 3650, 2019. [41] Yanzhi Wang, Zheng Zhan, Liang Zhao, Jian Tang, Siyue Wang, Jiayu Li, Bo Yuan, Wujie Wen, and Xue Lin. Universal Approximation Property and Equivalence of Stochastic Computing Based Neural Networks and Binary Neural Networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019. [42] Lin Xiao. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. Journal of Machine Learning Research, 11:2543 2596, 2010. [43] Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. Alternating Multi-bit Quantization for Recurrent Neural Networks. In International Conference on Learning Representations, 2018. [44] P Yin, S Zhang, YY Qi, and J Xin. Quantization and Training of Low Bit-Width Convolutional Neural Networks for Object Detection. Journal of Computational Mathematics, 37(3), 2019. 
[45] Penghang Yin, Shuai Zhang, Jiancheng Lyu, Stanley Osher, Yingyong Qi, and Jack Xin. Binary Relax: A Relaxation Approach for Training Deep Neural Networks with Quantized Weights. SIAM Journal on Imaging Sciences, 11(4):2205 2223, 2018. [46] Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley J. Osher, Yingyong Qi, and Jack Xin. Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets. In International Conference on Learning Representations, 2019. [47] Yaoliang Yu. Fast Gradient Methods for Structured Sparsity. Ph D thesis, University of Alberta, 2013. [48] Yaoliang Yu, Xun Zheng, Micol Marchetti-Bowick, and Eric Xing. Minimizing Nonconvex Non-Separable Functions. In Artificial Intelligence and Statistics, 2015. [49] Yaoliang Yu, Xinhua Zhang, and Dale Schuurmans. Generalized Conditional Gradient for Structured Sparse Estimation. Journal of Machine Learning Research, 18:1 46, 2017. [50] Mingrui Zhang, Lin Chen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Quantized Frank Wolfe: Faster Optimization, Lower Communication, and Projection Free. In International Conference on Artificial Intelligence and Statistics, 2020. [51] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. In International Conference on Learning Representations, 2017. [52] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Do Re Fa Net: Training low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. ar Xiv:1606.06160, 2016. [53] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained Ternary Quantization. In International Conference on Learning Representations, 2018. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] As discussed, the goal of our paper is to improve the understanding of Binary Conenct [11] and obtain novel generalizations. We therefore focus more on theory than empirical results. As a result, we only did extensive experiments on one dataset (CIFAR-10) with one architecture (Res Net20/56). As discussed in Section 6.2, our experiments on Image Net are not exhaustive; both due to environmental reasons as well as limited availability of compute. In the future, we aim to do more exhaustive testing of Prox Connect (different architectures and problems), while keeping in mind the carbon footprint that comes with it. (c) Did you discuss any potential negative societal impacts of your work? [Yes] See the first paragraph of Section 1: quantization leads to decreased carbon footprint for training and inference (positive societal impact), however, it may come at the cost of increased bias (negative societal impact). (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] All proofs are provided in the Appendix. 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] We will provide the code once it has been approved by our internal process. We only ran experiments on publicly available datasets. 
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 6, Appendix B.1 and Appendix B.2. (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] All experiments are over three random seeds; we provide mean and standard deviation for all results. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] The amount of compute in the exploration stage of our project was insignificant compared to the experiments reported in this work. See Appendix B.1.6 and Appendix B.8 for the total amount of compute we used for the experiments reported in this work. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 6.1 and Section 6.2. (b) Did you mention the license of the assets? [Yes] Yes, the license of CIFAR-10 (MIT license) is included in the bibliography entry. The Image Net dataset is is available for free to researchers for non-commercial use . (c) Did you include any new assets either in the supplemental material or as a URL? [No] We will provide the code once it has been approved by our internal process. (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]