# parq_piecewiseaffine_regularized_quantization__01ac36bc.pdf PARQ: Piecewise-Affine Regularized Quantization Lisa Jin 1 Jianhao Ma 2 Zechun Liu 3 Andrey Gromov 1 Aaron Defazio 1 Lin Xiao 1 Abstract We develop a novel optimization method for quantization-aware training (QAT). Specifically, we show that convex, piecewise-affine regularization (PAR) can effectively induce the model parameters to cluster towards discrete, quantized values. We minimize PAR-regularized loss functions using an aggregate proximal stochastic gradient method (AProx) and show that it enjoys last-iterate convergence. Our approach provides an interpretation of the straight-through estimator (STE), a widely used heuristic for QAT, as the asymptotic form of PARQ. We present numerical experiments to demonstrate that PARQ obtains competitive performance on convolutionand transformer-based vision tasks. 1. Introduction Modern deep learning models exhibit exceptional vision and language processing capabilities, but often come with excessive sizes and demands on memory and computing. Quantization is an effective approach for model compression, which can significantly reduce their memory footprint, computing cost, as well as the latency for inference (e.g., Han et al., 2016; Sze et al., 2017). There are two main classes of quantization methods: post-training quantization (PTQ) and quantization-aware training (QAT). Both are widely adopted and receive extensive research; see the recent survey papers by Gholami et al. (2022) and Fournarakis et al. (2022) and references therein. PTQ converts the weights of a pre-trained model directly to lower precision without repeating the training pipeline; thus it has less overhead and is relatively easy to apply (Nagel et al., 2020; Cai et al., 2020; Chee et al., 2024). However, it is limited mainly to 4 or more bit regimes and can suffer steep performance drops with fewer bits (Yao et al., 2022; Dettmers & Zettlemoyer, 2023). This is especially the case 1Meta FAIR, United States. 2Dept. of Industrial and Operational Engineering, University of Michigan, Ann Arbor, MI, United States. 3Meta Reality Labs, United States. Correspondence to: Lisa Jin , Lin Xiao . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). for transformer-based models, which are more difficult to quantize (Bai et al., 2021; Qin et al., 2022) compared to convolutional architectures (Martinez et al., 2019; Qin et al., 2020). On the other hand, QAT integrates quantization into pre-training and/or fine-tuning processes and can produce low-bit (including binary) models with mild performance degradation (e.g. Fan et al., 2021; Liu et al., 2022). A key ingredient of QAT is the so-called straight-through estimator (STE), which was proposed as a heuristic (Bengio et al., 2013; Courbariaux et al., 2015) and has been extremely successful in practice (e.g., Rastegari et al., 2016; Hubara et al., 2018; Esser et al., 2019). Many efforts have been made to demystify the effectiveness of STE, especially through the lens of optimization algorithms (e.g., Li et al., 2017; Yin et al., 2018; 2019; Bai et al., 2019; Ajanthan et al., 2021; Dockhorn et al., 2021; Lu et al., 2023). However, significant gaps remain between theory and practice. In this paper, we develop a principled method for QAT based on convex regularization and interpret STE as the asymptotic form of an aggregate proximal stochastic gradient method. 
The convex regularization framework admits stronger convergence guarantees than previous work and allows us to prove last-iterate convergence of the method.

1.1. The Straight-Through Estimator (STE)

We consider training a machine learning model with parameters w ∈ ℝ^d and let f(w, z) denote the loss of the model on a training example z. Our goal is to minimize the population loss f(w) = E_z[f(w, z)], where z follows some unknown probability distribution. Here, we focus on the classical stochastic gradient descent (SGD) method. During each iteration of SGD, we draw a random training example (or mini-batch) z_t and update the model parameters as

    w_{t+1} = w_t − η_t ∇f(w_t, z_t),    (1)

where ∇f(·, z_t) denotes the stochastic gradient with respect to the first argument (here being w_t) and η_t is the step size. QAT methods modify SGD by adding a quantization step. In particular, the Binary Connect method (Courbariaux et al., 2015) can be written as

    u_{t+1} = u_t − η_t ∇f(Q(u_t), z_t),    (2)

where Q(·) is the coordinate-wise projection onto the set {±1}^d. It readily generalizes to projection onto Q^d, where Q is a finite set of arbitrary quantization values. Figure 1 shows an example with Q = {0, q1, q2}.

Figure 1. A quantization map with Q = {0, q1, q2}.

Notice that in Equation (2) we switched the notation from w_t to u_t, because we would like to define w_t = Q(u_t) as the quantized model parameters. This reveals a key feature of QAT: the stochastic gradient in (2) is computed at w_t instead of u_t itself (which would be equivalent to (1)). Here we regard u_t as a full-precision latent variable that is used to accumulate the gradients computed at w_t, and the quantization map Q(·) is applied to the latent variable u_{t+1} to generate the next quantized variable w_{t+1}.

The notion of STE arises from the intent of computing an approximate gradient of the loss function with respect to u_t. Let us define the function f̃(u, z) := f(Q(u), z) = f(w, z) in light of w = Q(u). Then we have, for each i = 1, . . . , d,

    ∂f̃(u, z)/∂u_i = ∂f(w, z)/∂w_i · dQ(u_i)/du_i.

However, due to the staircase shape of the quantization map, we have dQ(u_i)/du_i = 0 and thus ∇f̃(u, z) = 0 almost everywhere, which prevents effective learning. In order to fix this problem, STE tries to construct a nontrivial gradient with respect to u by simply treating Q(·) as the identity map during backpropagation, i.e., replacing dQ(u_i)/du_i with 1 in the above equation. This leads to the straight-through approximation

    ∇_u f̃(u, z) ≈_STE ∇f(w, z) = ∇f(Q(u), z),

so that one can interpret Equation (2) as an (approximate) SGD update for minimizing the composite function f̃(u).

There are several issues with this argument. First, we know exactly that dQ(u_i)/du_i = 0 almost everywhere, so there is no need for approximation. Second, any approximation that replaces 0 with 1 in this context warrants scrutiny of the resulting bias and its consequences on training stability. Existing works on this are restricted to special cases and weak convergence results (Li et al., 2017; Yin et al., 2019).

Alternatively, we can view (2) as an implicit algorithm for updating w_t and analyze its convergence. More explicitly,

    u_{t+1} = u_t − η_t ∇f(w_t, z_t),    w_{t+1} = Q(u_{t+1}).    (3)

Here u_t serves as an auxiliary variable that accumulates past gradients evaluated at w_0, . . . , w_t (similar to momentum). This formulation allows application of the powerful framework of regularization and proximal gradient methods (e.g., Bai et al., 2019; Dockhorn et al., 2021). And this is the path we take in this paper.

1.2.
Outline and contributions In Section 2, we review the framework of regularization and introduce a family of convex, piecewise-affine regularizers (PAR). In addition, we derive the first-order optimality conditions for minimizing PAR-regularized functions. In Section 3, we derive an aggregate proximal gradient method (AProx) for solving PAR-regularized minimization problems and provide its convergence analysis for convex losses. AProx applies a soft-quantization map that evolves over the iterations and asymptotically converges to hard quantization, thus giving a principled interpretation of STE. In Section 4, we present PARQ (Piecewise-Affine Regularized Quantization), a practical implementation of AProx with PAR regularization that does not need to pre-determine the quantization values and regularization strength. In Section 5, we conduct QAT experiments on low-bit quantization of convolutionand transformer-based vision models and demonstrate that PARQ obtains competitive performance compared to STE/Binary Connect, as well as other methods based on nonconvex regularization. We note that Dockhorn et al. (2021) already used the regularization framework and proximal optimization to interpret (demystify) Binary Connect and developed a generalization called Prox Connect. In fact, AProx is equivalent to Prox Connect albeit following quite different derivations. Nevertheless, we make the following novel contributions. We propose convex PAR to induce quantization. Dockhorn et al. (2021) focused on monotone (nondecreasing) proximal maps, which can correspond to arbitrary regularization. Although they presented convergence results for convex regularization, no such example was given to demonstrate its relevance. Beyond closing this gap between theory and practice, our construction of convex PAR is rather surprising counterintuitive for the purpose of quantization. We derive first-order optimality conditions for minimizing PAR-regularized functions. They reveal the critical role of nonsmoothness in inducing quantization. PARQ: Piecewise-Affine Regularized Quantization (a) Ψ(w) = w 1 (b) Ψ(w) = min v { 1} w v 1 Figure 2. Illustration of two nonsmooth regularizers. We prove last-iterate convergence of AProx. The convergence results of Dockhorn et al. (2021) concern the averaged iterates generated by Prox Connect/AProx. While such results are conventional in the stochastic optimization literature, they are far from satisfactory for QAT, because the averaged iterate may not be quantized even if every iterate is quantized. Last-iterate convergence gives a much stronger guarantee. We propose a practical implementation called PARQ that can adaptively choose the quantization values and regularization strength in an online fashion. Our implementation of PARQ in Py Torch is available at ht tps://github.com/facebookresearch/parq. 2. Piecewise affine regularization (PAR) Regularization is a common approach for inducing desired properties of machine learning models, by minimizing a weighted sum of the loss function f and a regularizer Ψ: minimize w Rd f(w) + λΨ(w), (4) where λ R+ is a parameter to balance the relative strength of regularization. For example, it is well known that L2regularization helps generalization by preferring smaller model parameters, and L1-regularization, illustrated in Figure 2(a), induces sparsity (e.g., Hastie et al., 2009). There have been many attempts of using regularization to induce quantization (e.g., Carreira-Perpi n an & Idelbayev, 2017; Yin et al., 2018; Bai et al., 2019). 
An obvious choice is to let Ψ be the indicator function of Q^d; in other words, Ψ(w) = Σ_{i=1}^d δ_Q(w_i), where

    δ_Q(w_i) = 0 if w_i ∈ Q, and +∞ otherwise.    (5)

Then minimizing f(w) + λΨ(w) is equivalent to the constrained optimization problem of minimizing f(w) subject to w ∈ Q^d, which is combinatorial in nature and very hard to solve in general. Yin et al. (2018) propose to use the Moreau envelope of the indicator function, which under the Euclidean metric gives Ψ(w) = min_{v∈Q^d} ‖v − w‖_2^2. A nonsmooth version is proposed by Bai et al. (2019) under the L1-metric, resulting in Ψ(w) = min_{v∈Q^d} ‖v − w‖_1; Figure 2(b) shows a W-shaped example in one dimension.

We argue that the effectiveness of a regularizer for inducing quantization largely relies on two critical properties: nonsmoothness and convexity. Smooth regularizers such as dist(w, Q^d) := min_{v∈Q^d} ‖v − w‖_2^2 behave like ‖w‖_2^2 locally, and thus do not induce exact zeros or any other discrete structure. On the other hand, nonsmooth regularizers behave like ‖w‖_1 near zero, so they can trap model parameters at the set of nondifferentiable points, which makes them more suitable for quantization. Convexity concerns the global behavior of regularization. For example, the popularity of L1-regularization for sparse optimization is largely attributed to its convexity, besides being nonsmooth. On the other hand, it is hard for a gradient-based algorithm to cross the middle hill of the nonconvex W-shaped regularizer shown in Figure 2(b) if the initial weights are trapped in the wrong valley, away from the optimal one. Therefore, ideally we would like to construct a regularizer that is both nonsmooth and convex.

2.1. Definition of PAR

To simplify presentation, we assume Ψ(w) = Σ_{i=1}^d Ψ(w_i) and use the same notation Ψ for the function of a vector or one of its coordinates (it should be self-evident from the context). For most of the discussion, we focus on the scalar case and omit the subscript i, or simply assume d = 1. Suppose that the set of target quantization values is given as Q = {0, ±q_1, . . . , ±q_m} with 0 = q_0 < q_1 < · · · < q_m. We define a piecewise-affine regularizer (PAR) as

    Ψ(w) = max_{k∈{0,...,m}} { a_k(|w| − q_k) + b_k },    (6)

where the slopes {a_k}_{k=0}^m are free parameters that satisfy 0 ≤ a_0 < a_1 < · · · < a_m = +∞, and {b_k}_{k=0}^m are determined by setting b_0 = 0, q_0 = 0, and

    b_k = b_{k−1} + a_{k−1}(q_k − q_{k−1}),    k = 1, . . . , m.

As shown in Figure 3, (±q_k, b_k) are the reflection points of the piecewise-affine graph. The function Ψ(w) is convex because the pointwise maximum of finitely many affine functions is convex (Boyd & Vandenberghe, 2004, Section 3.2.3).

Figure 3. Convex PAR: Ψ(w) = max_k { a_k(|w| − q_k) + b_k }.

We note that setting a_0 = 0 effectively removes q_0 = 0 from the quantization set Q because it is no longer a reflection point of Ψ. Figure 4 illustrates three special cases of PAR for low-bit quantization, where both Figures 4(a) and 4(b) have a_0 = 0. Finally, we note that the above definition of PAR is symmetric around zero, for the convenience of presentation. It is straightforward to extend it to the asymmetric case.

Figure 4. Three special cases of PAR for low-bit quantization: (a) 1-bit, Q = {±q_1}; (b) 2-bit, Q = {±q_1, ±q_2}; (c) ternary, Q = {0, ±q_1}.
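To make the definition concrete, the following is a minimal PyTorch sketch of the scalar PAR in (6), applied elementwise and summed over a tensor. The function name and argument layout are our own illustration, not part of the released PARQ library; the finite slopes a_0, . . . , a_{m−1} are passed explicitly, the offsets b_k are computed from the recursion above, and the infinite slope a_m is treated as a hard wall at |w| = q_m.

```python
import torch

def par_penalty(w: torch.Tensor, q: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Convex piecewise-affine regularizer (PAR) of Eq. (6), summed over entries of w.

    q: increasing positive targets [q_1, ..., q_m] (q_0 = 0 is implicit).
    a: finite slopes [a_0, ..., a_{m-1}] with 0 <= a_0 < a_1 < ...; the last
       slope a_m = +inf acts as a hard wall at |w| = q_m.
    """
    q = torch.cat([q.new_zeros(1), q])                 # [q_0 = 0, q_1, ..., q_m]
    # b_0 = 0 and b_k = b_{k-1} + a_{k-1} * (q_k - q_{k-1})
    b = torch.cat([q.new_zeros(1), torch.cumsum(a * (q[1:] - q[:-1]), dim=0)])
    absw = w.abs().unsqueeze(-1)
    pieces = a * (absw - q[:-1]) + b[:-1]              # affine pieces k = 0, ..., m-1
    val = pieces.max(dim=-1).values                    # pointwise maximum (convex)
    val = torch.where(w.abs() <= q[-1], val, torch.full_like(val, float("inf")))
    return val.sum()
```

For example, q = torch.tensor([0.25, 0.75]) with a = torch.tensor([0.0, 1.0]) reproduces the 2-bit shape of Figure 4(b), where a_0 = 0 removes zero from the effective quantization set.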
2.2. Optimality conditions

In order to understand how PAR can induce quantization, we examine the optimality conditions of minimizing PAR-regularized functions. Suppose f is differentiable and w⋆ is a solution to the optimization problem (4). The first-order optimality condition for this problem is (see, e.g., Wright & Recht, 2022, Theorem 8.18)

    0 ∈ ∇f(w⋆) + λ ∂Ψ(w⋆),

where ∂Ψ(w⋆) denotes the subdifferential of Ψ at w⋆, and λ ∂Ψ(w⋆) means multiplying each element of the set ∂Ψ(w⋆) by λ. For convenience, we rewrite it as −∇f(w⋆) ∈ λ ∂Ψ(w⋆), which breaks down into the following cases:

    w⋆_i = −q_k              ⟸  ∇_i f(w⋆) ∈ λ(a_{k−1}, a_k),
    w⋆_i ∈ (−q_k, −q_{k−1})  ⟹  ∇_i f(w⋆) = λ a_{k−1},
    w⋆_i = 0                 ⟸  ∇_i f(w⋆) ∈ λ(−a_0, a_0),
    w⋆_i ∈ (q_{k−1}, q_k)    ⟹  ∇_i f(w⋆) = −λ a_{k−1},
    w⋆_i = q_k               ⟸  ∇_i f(w⋆) ∈ λ(−a_k, −a_{k−1}).

Here ∇_i denotes the ith coordinate of the vector ∇f, the subscript i runs from 1 through d, and the piecewise-affine index k runs from 1 through m. The symbol ⟸ (⟹) means that the expression on the left-hand side is a necessary (sufficient) condition for the expression on the right-hand side.

We immediately recognize that the sufficient condition for w⋆_i = 0 (third case above) is the same as for the L1-regularization λ a_0 ‖w‖_1. Further examination reveals that for any weight not clustered at a discrete value in Q, i.e., if w⋆_i ∈ (q_{k−1}, q_k) for some k, the corresponding partial derivative ∇_i f(w⋆) must equal the single value −λ a_{k−1}. Conversely, almost all values of the partial derivatives of f, except for the 2m discrete values {±λ a_{k−1}}_{k=1}^m, can be balanced by assigning the model parameters to the 2m + 1 discrete values in Q = {0, ±q_1, . . . , ±q_m}. Intuitively, this implies that the model parameters at optimality are more likely to cluster at these discrete values. We will derive an algorithm that manifests this property rigorously in Section 3.

2.3. Proximal mapping of PAR

A fundamental tool for solving problem (4) is the proximal map of the regularizer Ψ, defined as

    prox_Ψ(u) = argmin_w { Ψ(w) + (1/2)‖w − u‖_2^2 }.

See, e.g., Wright & Recht (2022, Section 8.6) for further details. For the PAR function defined in (6), its proximal map has the following closed-form solution (letting a_{−1} = 0):

    prox_Ψ(u) = sgn(u) q_k         if |u| ∈ [a_{k−1} + q_k, a_k + q_k],
    prox_Ψ(u) = u − sgn(u) a_k     if |u| ∈ [a_k + q_k, a_k + q_{k+1}],    (7)

where sgn(·) denotes the sign or signum function. Figure 5 shows the graph of prox_Ψ(u), which is clearly monotone non-decreasing in u.

Figure 5. Graph of prox_Ψ(u).

According to Yu et al. (2015, Proposition 3), a (possibly multi-valued) map is a proximal map of some function if and only if it is compact-valued, monotone and has a closed graph. For example, the hard-quantization map in Figure 1 is the proximal map of the (nonconvex) indicator function δ_Q in (5). Dockhorn et al. (2021) work with monotone proximal maps directly, without specifying the regularizer itself. In contrast, we construct a convex regularizer and show that it can effectively induce quantization and obtain competitive performance in practice, together with stronger convergence guarantees.

3. The AProx Algorithm

The regularization structure of problem (4) can be well exploited by the proximal gradient method

    w_{t+1} = prox_{η_tλΨ}( w_t − η_t ∇f(w_t) ),    (8)

where prox_{η_tλΨ} is the proximal map of the scaled function η_tλΨ. Since η_tλ effectively scales the slopes {a_k}_{k=1}^m (with Q fixed), we obtain prox_{η_tλΨ} by simply replacing a_k in (7) with η_tλ a_k; the corresponding map is shown in Figure 6.

Figure 6. Graph of prox_{η_tλΨ}(u).

If f is convex and ∇f is L-Lipschitz continuous, then using the constant step size η_t = 1/L leads to a convergence rate of O(1/t) (e.g., Wright & Recht, 2022, Theorem 9.6).
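As a concrete companion to (7) and (8), here is a minimal PyTorch sketch of the proximal map of PAR, applied elementwise; the function name and the `scale` argument (standing in for η_tλ here, and for the aggregate strength γ_tλ introduced below) are our own illustration rather than the released implementation. It locates which flat or slanted segment |u| falls into and returns the corresponding output value.

```python
import torch

def prox_par(u: torch.Tensor, q: torch.Tensor, a: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Closed-form proximal map (7) of the PAR in (6), applied elementwise.

    q: increasing positive targets [q_1, ..., q_m] (q_0 = 0 is implicit).
    a: finite slopes [a_0, ..., a_{m-1}]; the infinite slope a_m sends every
       |u| past the last breakpoint to q_m.
    `scale` multiplies the slopes, which is how eta_t * lambda enters in (8):
    a larger scale widens the flat segments while Q stays fixed.
    """
    q = torch.cat([q.new_zeros(1), q])                     # [q_0 = 0, q_1, ..., q_m]
    s = scale * a                                          # scaled finite slopes
    # alternating breakpoints: flat segment k ends at s_k + q_k,
    # the slanted segment after it ends at s_k + q_{k+1}
    bounds = torch.stack([s + q[:-1], s + q[1:]], dim=1).reshape(-1)
    idx = torch.bucketize(u.abs(), bounds)                 # segment index of |u|
    k = torch.clamp(idx // 2, max=q.numel() - 1)
    flat = q[k]                                            # value on a flat segment
    slant = u.abs() - s[torch.clamp(idx // 2, max=s.numel() - 1)]
    out = torch.where(idx % 2 == 0, flat, slant)           # even -> flat, odd -> slanted
    return torch.sign(u) * out
```

In the aggregate method derived next, the same map is reused with `scale` set to the accumulated step size times λ, rather than the per-iteration η_tλ.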
In the context of machine learning, we minimize the expected loss over a large amount of data, i.e., f(w) = Ez[f(w, z)]. The Prox-SGD method replaces f(wt) in (8) with the stochastic gradient gt := wf(wt, zt): wt+1 = proxηtλΨ wt ηtgt . (9) However, it is well known that for the (proximal) SGD method to converge, we need diminishing and nonsummable step sizes (e.g., Robbins & Monro, 1951), i.e., ηt 0 and P t=1 ηt = + . (10) In this case, the flat segments on the graph of proxηtλΨ, as shown in Figure 6, with lengths ηtλ(ak ak 1), will all shrink to zero when ηt 0 (except at the two ends because am = + ). Therefore, the graph converges to the identity map clipped flat outside of [ qm, +qm] and we lose the action of quantization. This issue parallels that of using Prox-SGD with L1-regularization, which does not produce sparse solutions because of the shrinking deadzone in the soft-thresholding operator as ηt 0 (Xiao, 2010). To overcome the problem of diminishing regularization, we propose AProx, an aggregate proximal stochastic gradient method. Aprox shares a similar form with Binary Connect (Courbariaux et al., 2015). Specifically, it replaces the hardquantization Q( ) in (3) with an aggregate proximal map: ut+1 = ut ηtgt, wt+1 = proxγtλΨ(ut+1), (11) γ 1 t u q1 q2 q3 λa0 λa0+γ 1 t q1 λa1+γ 1 t q1 λa1+γ 1 t q2 λa2+γ 1 t q2 λa2+γ 1 t q3 γ 1 t (q2 q1) λ(a2 a1) Figure 7. Graph of proxγtλΨ(u) with scaled input. where γt = Pt s=1 ηs. Here proxγtλΨ is called an aggregate map because λΨ is scaled by the aggregate step size γt. In fact, Binary Connect is a special case of AProx with Ψ being the indicator function of Qd given in (5). The indicator function and its proximal map (Figure 1) is invariant under arbitrary scaling, thus hiding the subtlety of aggregation. The graph of proxγtλΨ can be obtained by replacing ηt in Figure 6 with γt. However, according to (10), we have γt = Pt s=1 ηs + , which implies that the flat segments in the graph, now with lengths γtλ(ak ak 1), grow larger and larger, which is opposite to the Prox-SGD method. (In both cases, the sloped segments has fixed length qk qk 1.) For the ease of visualization, we rescale the input u by γ 1 t and obtain the graph in Figure 7. In this scaled graph, the lengths of the flat segments λ(ak ak 1) stay constant but the sloped segments, with lengths γ 1 t (qk qk 1), shrink as γt increases. Asymptotically, as γt , the graph converges to hard quantization, as shown in Figure 8. 3.1. AProx versus Prox-SGD and Prox Connect To better understand the difference between AProx and Prox-SGD, we rewrite Prox-SGD in (9) as ut+1 = wt ηtgt, wt+1 = proxηtλΨ(ut+1), (12) which differ from AProx in (11) in two places (highlighted in blue). Here we give an intuitive interpretation of these differences. First, notice that the objective in (4) is the sum of f and λΨ, and both methods make progress by using the stochastic gradient of f (forward step) and the proximal map of λΨ (backward step) in a balanced manner. In Prox-SGD, ut+1 is a combination of wt and ηtgt. But wt already contains contributions from both f and λΨ, through { ηsgs}t 1 s=1 and {proxηsλΨ}t 1 s=1 respectively. Therefore, from ut+1 to obtain wt+1, we should use proxηtλΨ to balance ηtgt. PARQ: Piecewise-Affine Regularized Quantization γ 1 t u q1 q2 q3 λa0 λa1 λa2 Figure 8. Asymptotic scaled mapping as γt 0. For AProx, ut+1 is used to accumulate Pt s=1 ηsgs, solely contributed from f. 
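To make this contrast concrete, the following sketch spells out the update (11) in code. It is a toy illustration under our own naming: a plain SGD step on the latent u, with a generic proximal map passed in as a callable. Plugging in hard quantization for the prox recovers Binary Connect as noted above, while a PAR prox applied with aggregate strength γ_tλ gives AProx.

```python
import torch

def aprox(w0, stochastic_grad, prox, lam, step_size, num_steps):
    """Aggregate proximal stochastic gradient method, Eq. (11) (illustrative sketch).

    The latent variable u accumulates pure gradient steps taken at the quantized
    iterate w, while w itself is produced by a proximal map whose strength grows
    with the aggregate step size gamma_t = sum_s eta_s.
    """
    u, w = w0.clone(), w0.clone()
    gamma = 0.0
    for t in range(1, num_steps + 1):
        eta = step_size(t)
        g = stochastic_grad(w)            # stochastic gradient evaluated at quantized w_t
        u = u - eta * g                   # u_{t+1} = u_t - eta_t * g_t
        gamma += eta                      # gamma_t = sum of step sizes so far
        w = prox(u, gamma * lam)          # w_{t+1} = prox_{gamma_t * lambda * Psi}(u_{t+1})
        # Prox-SGD (12) would instead restart from w:  u = w - eta * g, followed by
        # w = prox(u, eta * lam), whose quantization effect vanishes as eta -> 0.
    return w

def hard_quantize(u, _strength, levels=torch.tensor([-1.0, 1.0])):
    """Prox of the indicator of Q^d (Binary Connect); the strength argument is irrelevant."""
    return levels[(u.unsqueeze(-1) - levels).abs().argmin(dim=-1)]
```

For instance, calling aprox(w0, grad_fn, hard_quantize, lam=1.0, step_size=lambda t: 0.1 / t**0.5, num_steps=1000) reproduces the Binary Connect recursion (3).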
Thus in computing wt+1, we need to strike a balance with the contribution from λΨ with the aggregated strength γt = Pt s=1 ηs. While the total contributions from the forward steps ( ηtgt) and backward steps (proxλΨ) are balanced in both cases, Prox-SGD spreads the backward steps on every iterate wt so the quantization effect on the last iterate eventually diminishes. In contrast, AProx always applies an aggregate proximal map to generate the last iterate, in order to balance the accumulation of pure forward steps in ut+1. The above interpretation highlights the importance of balance between the forward and backward steps in minimizing the sum of f and λΨ. With the flexibility of allowing any step size rule that satisfies (10), it can be considered as a more flexible variant, or a generalization, of the regularized dual averaging (RDA) method of Xiao (2010). Dockhorn et al. (2021) used the regularization framework and proximal maps to interpret Binary Connect/STE and developed a generalization called Prox Connect. It is derived from the generalized conditional gradient method (Yu et al., 2017), through the machinery of Fenchel-Rockafellar duality. We derived AProx as an direct extension of RDA (Xiao, 2010), but realized that it is indeed equivalent to Prox Connect, with some minor differences in setting γt. Nevertheless, our construction through balancing forward and backward steps provides a more intuitive understanding of the algorithm and may shed light on further development of structure-inducing optimization algorithms. 3.2. Convergence Analysis To simplify the presentation, we define Fλ(w) := Ez[f(w, z)] + λΨ(w). The following theorem concerns the convergence of AProx in terms of the weighted average wt = 1 Pt s=1 ηs Pt s=1 ηsws. This result appeared in Dockhorn et al. (2021, Cor. 5.2.). We include it here as a basis for proving last-iterate convergence and give its proof in Appendix A.1 for completeness. Algorithm 1 PARQ input: w1 Rd, number of quantization bits n, step sizes {ηt}T t=1, slope schedule {ρ 1 t }T t=1 initialize: u1 = w1 for t = 1, 2, . . . , T 1 do ut+1 = ut ηt f(wt, zt) Qt+1 = LSBQ(ut+1, n) wt+1 = prox PARQ(ut+1, Qt+1, ρt) end for output: w T Theorem 3.1. Assume that f(w, z) is convex in w for any z, Ψ is convex, and Fλ is continuous with Lipschitz constant G. Also, let W be the set of minimizers of Fλ(w). Then, (a) If the stepsize ηt satisfies (10) and {ws}t s=1 are generated by algorithm (11), then the weighted average wt converges in expectation to a point in W . (b) Let w0 be an initial point, R =minw W w0 w 2 and the step size ηt = R 2G q E Fλ( wt) Fλ(w ) GR2 + 1.5 ln(t) where the expectation E[ ] is taken with respect to the sequence of random variables {w1, . . . , wt}. While convergence results on the averaged iterates wt are conventional in the stochastic optimization literature, they are far from satisfactory for QAT. In particular, the averaged iterates wt are most likely not quantized even if every iterate wt is quantized. Therefore, only the last iterate is meaningful for QAT in practice. In general, last-iterate convergence of stochastic/online algorithms is crucial for regularized optimization problems aiming for a structured solution (such as sparsity and quantization). Here we establish last-iterate convergence of AProx. Theorem 3.2 (Last-iterate convergence of AProx for convex optimization). 
Under the same assumptions as in Theorem 3.1, the last iterate wt of AProx satisfies E Fλ(wt) Fλ(w ) GR2 + 1.5 ln(t) The proof of Theorem 3.2 is provided in Appendix A.2. We note that this convergence rate matches the average-iterate convergence rate established in Theorem 3.1. We note that AProx updates the variable ut with a simple SGD step. In practice, replacing it with more sophisticated methods such as Adam (Kingma & Ba, 2014) or Adam W (Loshchilov & Hutter, 2018) gives better performance. We leave their convergence analysis for future work. PARQ: Piecewise-Affine Regularized Quantization (a) prox PARQ(u, Q, ρ) (b) prox Bin Rel(u, Q, ρ) Figure 9. Proximal maps of PARQ and Binary Relax. 4. PARQ: A Practical Implementation A practical issue for implementing AProx with PAR is how to choose the PAR parameters {qk}m k=1 and {ak}m 1 k=0 , as well as the regularization strength λ; see their roles in the proximal map in Figure 7. In particular, {qk} are the target quantization values for wt and λ and {ak} determine the quantization thresholds on the scaled input γ 1 t ut. In practice, it is very hard to choose these parameters a priori for different models and datasets. Therefore, we propose a heuristic approach to estimate the target values {qk} online and at the same time avoid setting λ and {ak} explicitly. Given a vector ut Rd, we need to quantize it (elementwise) to a vector wt Qd where wt i Q for i = 1, . . . , d. We use the least-squares binary quantization (LSBQ) approach (Pouransari et al., 2020) to estimate the target quantization values in Q. LSBQ employs a form of n-bit scaled binary quantization. Specifically, let wi = Pn j=1 vjsj(ui), where the vj s satisfy v1 vn 0 and each sj : R { 1, 1} is a binary function. The optimal {vj, sj( )}n j=1 for approximating u Rd in the leastsquares sense can be found by solving the problem: minimize{vj,sj( )} Pd i=1 ui Pn j=1 vjsj(ui) 2 subject to v1 v2 vn 0, sj : R { 1, 1}, j = 1, . . . , n. For n = 1 (1-bit quantization), the solution is well-known: v1 = u 1/d, and s1(ui) = sgn(ui); see, e.g., Rastegari et al. (2016). Pouransari et al. (2020) derived the solutions for the n = 2 case and the ternary case (n = 2 with v1 = v2). For n > 2, there is no closed-form solution, but Pouransari et al. (2020) gives a simple greedy algorithm for foldable representations, which satisfy sj(ui) = sgn(ui Pj 1 ℓ=1 vℓsℓ(ui)), j = 1, . . . , n. This is the scheme that we adopt in PARQ. Table 1. Res Net test accuracy on CIFAR-10. Full-precision (FP) accuracy is shown in parentheses under each model depth. Depth # bits STE Binary Relax PARQ 1 89.56 0.18 89.98 0.13 90.48 0.26 T 90.94 0.15 91.25 0.07 91.45 0.11 2 91.22 0.15 91.57 0.06 91.71 0.03 3 91.84 0.22 91.77 0.05 91.97 0.04 4 91.93 0.04 91.92 0.16 91.93 0.05 1 91.55 0.33 91.75 0.37 91.47 0.35 T 92.42 0.09 92.34 0.23 92.97 0.15 2 92.72 0.27 92.30 0.40 92.77 0.10 3 92.73 0.44 92.86 0.40 92.86 0.25 4 92.34 0.23 92.59 0.10 92.78 0.30 Once a set of (exact or approximate) solution {vj}n j=1 is obtained, the resulting quantization values can be written in the form v1 vn by choosing either + or between the adjacent operands. For example, the largest and smallest values in Q = { q1, . . . , qm} are qm = v1 + + vn and qm = v1 vn. Since there are n binary bits, the total number of target values is |Q| = 2n. The selection of {ak} and λ is somewhat arbitrary and not consequential. We can choose them so that the asymptotic graph in Figure 8 matches the hard-quantization map depicted in Figure 1. 
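To make the greedy construction above concrete, here is a small sketch of n-bit foldable LSBQ. It is our own rendering of the rule s_j(u_i) = sgn(u_i − Σ_{ℓ<j} v_ℓ s_ℓ(u_i)): each round solves the 1-bit least-squares problem on the current residual (scale = mean absolute residual, sign = sign of the residual). The exact routine in Pouransari et al. (2020) and in the released PARQ code may differ in details.

```python
import torch

def lsbq_greedy(u: torch.Tensor, n_bits: int):
    """Greedy least-squares binary quantization with a foldable representation (sketch).

    Returns the scales v_1, ..., v_n, the 2**n_bits candidate target values Q,
    and the reconstruction sum_j v_j * s_j(u).
    """
    r = u.reshape(-1).clone()
    scales, signs = [], []
    for _ in range(n_bits):
        s_j = torch.sign(r)
        s_j[s_j == 0] = 1.0                  # keep s_j in {-1, +1}
        v_j = r.abs().mean()                 # optimal 1-bit scale for the residual
        scales.append(v_j)
        signs.append(s_j)
        r = r - v_j * s_j                    # fold: the next bit fits the residual
    v = torch.stack(scales)
    w_hat = sum(v_j * s_j for v_j, s_j in zip(scales, signs)).reshape(u.shape)
    # all signed sums +/- v_1 +/- ... +/- v_n form the quantization values Q
    combos = torch.cartesian_prod(*([torch.tensor([-1.0, 1.0])] * n_bits))
    Q = torch.unique(combos.reshape(-1, n_bits) @ v)
    return v, Q, w_hat
```

For n_bits = 1 this reduces to v_1 = ‖u‖_1/d with s_1 = sgn(u), matching the closed form quoted above. The resulting Q is what PARQ feeds into its proximal map, while {a_k} and λ only set the thresholds between adjacent targets.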
That is, we can let λak = (qk +qk+1)/2, but never really use them once Q is found by LSBQ. While in theory we require γt = Pt s=1 ηs + , in practice γt does not become very large due to the finite number of iterations we run with diminishing step sizes. Therefore, its effect on scaling the horizontal axis in Figures 7 and 8 is limited and can be absorbed by tuning the step size. On the other hand, we would like the proximal map to be able to converge to hard-quantization by the end of training (so we have fully quantized solutions). For this purpose, we use an independent schedule for growing the slope of the slanted segments. Specifically, we emulate the proximal map in Figure 7 with the one in Figure 9(a), where Q is calculated from LSBQ, and ρ is the slope of the slanted segments. For convenience, we specify a schedule for the inverse slope ρ 1 t to vary monotonically from 1 to 0 during T steps of training (so the slope ρt go to infinity). For example, ρ 1 t = 1 1 + exp (s(t t1)), (13) where s > 0 is the steepness parameter, and t1 is the transition center (usually t1 = T/2). This schedule changes ρ 1 t roughly from 1 to 0, taking value 0.5 at the transition center t1. The steepness parameter s controls how fast the transition from 1 to 0 happens, with larger s corresponding to steeper transitions. Putting everything together, we have PARQ in Algorithm 1. PARQ: Piecewise-Affine Regularized Quantization Table 2. Quantized Res Net-50 test accuracy on Image Net. Depth # bits STE Binary Relax PARQ 1 66.17 0.04 66.14 0.28 66.71 0.13 T 70.94 0.19 71.59 0.11 71.45 0.11 2 72.38 0.10 72.64 0.17 72.71 0.19 3 73.58 0.09 74.02 0.09 73.94 0.10 4 74.52 0.04 74.58 0.04 74.83 0.19 5. Experiments We train quantized convolutional and vision-transformer models using QAT on image classification tasks across five bit-widths: ternary (T) and 1 4 bits. For each model and bit-width pair, we compare PARQ with two existing QAT methods: STE/Binary Connect (Courbariaux et al., 2015) and Binary Relax (Yin et al., 2018). Specifically, STE/Binary Connect uses the hard-quantization map in Figure 1, PARQ applies the proximal map in Figure 9(a) with slope annealing, and Binary Relax effectively uses the proximal map in Figure 9(b) where the slope of slanted segments gradually decreases to 0. We note that prox PARQ is the proximal map of a convex PAR, but STE and prox Bin Rel do not correspond to convex regularization. Each entry in Tables 1 3 shows the mean and standard deviation of test accuracies over three randomly seeded runs. 5.1. Res Net on CIFAR-10 We first evaluate quantized Res Net-20 and Res Net-56 (He et al., 2016) on CIFAR-10. All weights, including those in the final projection layer, are quantized. We train for 200 epochs using SGD with 0.9 momentum and 2e 4 weight decay. Following Zhu et al. (2022), the 0.1 learning rate decays by a factor of 10 at epochs 80, 120, and 150. As shown in Table 1, PARQ performs competitively to STE and Binary Relax across all bit-widths. For 1-bit Res Net-20, it outperforms STE by nearly one accuracy point. It is the only QAT method for ternary Res Net-56 reaching within 0.1 points of full-precision accuracy. 5.2. Res Net on Image Net For QAT of Res Net-50 (He et al., 2016) on Image Net, we quantize all residual block weights per channel by computing Q row-wise over tensors. We use SGD with 0.1 learning rate, 0.9 momentum, and 1e 4 weight decay. The learning rate decays by a factor of 10 every 30 epochs. 
Similar to the experiments on CIFAR-10, PARQ performs capably against STE and Binary Relax in Table 2. It shows a slight advantage in the most restrictive 1-bit case, achieving a half-point margin over the other two methods. Table 3. Quantized Dei T test accuracy on Image Net. Size # bits STE Binary Relax PARQ 1 51.62 0.18 52.62 0.03 55.43 0.23 T 61.43 0.08 62.18 0.11 62.32 0.28 2 64.81 0.15 65.20 0.04 66.60 0.18 3 69.02 0.11 69.26 0.03 69.60 0.22 4 70.95 0.11 71.06 0.09 71.21 0.11 1 70.07 0.03 70.69 0.07 73.40 0.19 T 75.83 0.06 76.02 0.03 76.74 0.06 2 77.40 0.01 77.43 0.04 77.94 0.04 3 79.02 0.14 79.11 0.07 79.04 0.04 4 79.57 0.04 79.55 0.12 79.61 0.04 1 78.79 0.03 79.02 0.03 79.35 0.04 T 80.50 0.01 80.61 0.08 80.62 0.01 2 80.73 0.17 80.81 0.14 80.97 0.20 3 80.54 0.20 80.94 0.05 81.49 0.13 4 80.45 0.10 80.76 0.12 81.60 0.12 5.3. Dei T on Image Net Applying QAT to a different architecture, we experiment with Data-efficient image Transformers (Touvron et al., 2021, Dei T). Our Dei T experiments include the Ti, S, and B model sizes with 5M, 22M, and 86M parameters, respectively. Attention block weights are quantized channel-wise as in Section 5.2. Embeddings, layer normalization parameters, and the final projection weights are left at full precision, following the setting of Rastegari et al. (2016). We use Adam W (Loshchilov & Hutter, 2018) to train for 300 epochs with a 5e 4 learning rate and 0.05 weight decay. We hold the learning rate at 1e 8 for the final 20 epochs (after PARQ and Binary Relax converge to hard-quantization); this boosts performance relative to the default 1e 5 minimum. We apply Rand Augment (Cubuk et al., 2020) and all prior regularization strategies (Zhang et al., 2018; Yun et al., 2019) except repeated augmentation (Berman et al., 2019). Table 3 reveals that PARQ s performance trends persist across model sizes. For 1-bit Dei T-Ti and Dei T-S, PARQ outperforms Binary Relax by nearly three accuracy points. PARQ also achieves the best accuracy for ternary and and 2-bit Dei T-S, as well as 3and 4-bit Dei T-B models. Figure 11 shows the training loss curves of three different QAT methods along with full precision (FP) training on the Dei T-Ti model. We observe that in the initial phase, PARQ closely follows the FP curve because the slope of the slanted segments in its proximal map (Figure 9(a)) is close to 1. Then the training loss of PARQ increases due to the relatively sharp transition of the slope, and it follows the STE curve closely in the second half of the training process as its proximal map converges to hard quantization. The training curve of Binary Relax has a more gradual transition. PARQ: Piecewise-Affine Regularized Quantization 0.4 0.2 0 0.2 0.4 0.2 0.4 0.2 0 0.2 0.4 u 0.4 0.2 0 0.2 0.4 u Figure 10. PARQ proximal maps during early, middle, and late stages of training 2-bit Dei T-Ti (value weights from an attention layer). 50 100 150 200 250 0 FP STE Bin Rel PARQ Figure 11. Training loss curves for 2-bit Dei T-Ti model (top) and the ρ 1 t schedule in (13) with s = 50 and t1 = 0.5T. Figure 10 gives snapshots of how PAR gradually induces quantization in model parameters: compare the middle stage plot with Figure 9(a) and the late stage plot with Figure 1. Figure 12 shows the evolution of {q1, q2} (estimated by LSBQ) during the training of a 2-bit Dei T-Ti model. They are from the same layer as the one used in Figure 10 and with the same weight initialization. 
It shows that both q1 and q2 start small from initialization, expand rapidly in the early stage of training, then slowly contract in later epochs. Our experiments demonstrate that PARQ achieves competitive performance compared with QAT methods that correspond to using nonconvex regularization. Compared with using hard-quantization (STE) throughout the training process, the gradual evolution of PARQ from piecewise-affine soft quantization to hard quantization helps the training process to be more stable, and often converges to better local minima. This is more evident in the most demanding cases of low-bit quantization of smaller models. 50 100 150 200 250 300 0 Figure 12. Evolution of {q1, q2} estimated by LSBQ. 6. Conclusion We developed a novel optimization method for quantizationaware training (QAT) based on the framework of convex, piecewise-affine regularization (PAR). In order to avoid the diminishing regularization effect of the standard proximal SGD method, we propose an aggregate proximal (AProx) algorithm. The asymptotic form of AProx with PAR corresponds to hard quantization, thus giving a principled interpretation of the straight-through estimator (STE), which is a widely successful heuristic for QAT. The convex regularization framework of PARQ allows the development of strong convergence guarantees. In particular, for convex loss functions, we are able to prove lastiterate convergence of the AProx method. For future work, we are interested in extending the convergence analysis for nonconvex loss functions, as well as for variants of AProx that incorporate stochastic momentum and diagonal scaling. We have focused on PAR as an effective regularization in an optimization framework. It would also be very interesting to investigate its generalization capability in a statistical learning framework, which will help us better understand the tradeoff between model size and prediction performance. PARQ: Piecewise-Affine Regularized Quantization Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Ajanthan, T., Gupta, K., Torr, P., Hartley, R., and Dokania, P. Mirror descent view for neural network quantization. In International Conference on Artificial Intelligence and Statistics, pp. 2809 2817. PMLR, 2021. Bai, H., Zhang, W., Hou, L., Shang, L., Jin, J., Jiang, X., Liu, Q., Lyu, M., and King, I. Binary BERT: Pushing the limit of BERT quantization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4334 4348, 2021. Bai, Y., Wang, Y.-X., and Liberty, E. Prox Quant: Quantized neural networks via proximal operators. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, May 2019. Bengio, Y., L eonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation, August 2013. ar Xiv:1308.3432. Berman, M., J egou, H., Vedaldi, A., Kokkinos, I., and Douze, M. Multigrain: a unified image embedding for classes and instances. ar Xiv:1902.05509, 2019. Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004. Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8 (3-4):231 357, 2015. 
Cai, Y., Yao, Z., Dong, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. Zero Q: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13169 13178, 2020. Carreira-Perpi n an, M. A. and Idelbayev, Y. Model compression as constrained optimization, with application to neural nets. Part II: Quantization, 2017. ar Xiv:1707.04319. Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. M. Qu IP: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36, 2024. Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538 543, 1993. doi: 10.1137/0803026. URL https: //doi.org/10.1137/0803026. Courbariaux, M., Bengio, Y., and David, J.-P. Binary Connect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, volume 28, Montr eal, Canada, December 2015. Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. Rand Augment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702 703, 2020. Dettmers, T. and Zettlemoyer, L. The case for 4-bit precision: k-bit inference scaling laws. In International Conference on Machine Learning, pp. 7750 7774. PMLR, 2023. Dockhorn, T., Yu, Y., Sari, E., Zolnouri, M., and Partovi Nia, V. Demystifying and generalizing binaryconnect. Advances in Neural Information Processing Systems, 34: 13202 13216, 2021. Esser, S. K., Mc Kinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In International Conference on Learning Representations (ICLR), 2019. Fan, A., Stock, P., Graham, B., Grave, E., Gribonval, R., Jegou, H., and Joulin, A. Training with quantization noise for extreme model compression. In International Conference on Learning Representations (ICLR), 2021. Fournarakis, M., Nagel, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., and Blankevoort, T. Quantizing neural networks. In Thiruvathukal, G. K., Lu, Y.-H., Kim, J., Chen, Y., and Chen, B. (eds.), Low-Power Computer Vision: Improve the Efficiency of Artificial Intelligence, chapter 11, pp. 235 272. CRC Press, 2022. Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference. In Thiruvathukal, G. K., Lu, Y.-H., Kim, J., Chen, Y., and Chen, B. (eds.), Low Power Computer Vision: Improve the Efficiency of Artificial Intelligence, chapter 13, pp. 291 326. CRC Press, 2022. Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In Proceedings of PARQ: Piecewise-Affine Regularized Quantization the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2016. URL http://arxiv.org/abs/1510.00149. Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference and prediction. Springer, 2 edition, 2009. URL http://www-s tat.stanford.edu/ tibs/Elem Stat Learn/. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. 
Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1 30, 2018. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014. ar Xiv:1412.6980. Li, H., De, S., Xu, Z., Studer, C., Samet, H., and Goldstein, T. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, volume 31, pp. 5813 5823, 2017. Liu, Z., Oguz, B., Pappu, A., Xiao, L., Yih, S., Li, M., Krishnamoorthi, R., and Mehdad, Y. Bi T: Robustly binarized multi-distilled transformer. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 14303 14316. Curran Associates, Inc., 2022. URL https://dl.acm.org/doi/abs/10.55 55/3600270.3601310. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2018. Lu, Y., Yu, Y., Li, X., and Partovi Nia, V. Understanding neural network binarization with forward and backward proximal quantizers. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 40468 40486. Curran Associates, Inc., 2023. Martinez, B., Yang, J., Bulat, A., and Tzimiropoulos, G. Training binary neural networks with real-to-binary convolutions. In International Conference on Learning Representations, 2019. Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? Adaptive rounding for post-training quantization. In International Conference on Machine Learning, pp. 7197 7206. PMLR, 2020. Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004. Orabona, F. Last iterate of SGD converges (even in unbounded domains), 2020. URL https://parame terfree.com/2020/08/07/last-iterate-o f-sgd-converges-even-in-unbounded-d omains/. Pouransari, H., Tu, Z., and Tuzel, O. Least squares binary quantization of neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 698 699, 2020. URL ht tps://doi.org/10.1109/cvprw50498.2020. 00357. Qin, H., Gong, R., Liu, X., Shen, M., Wei, Z., Yu, F., and Song, J. Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2250 2259, 2020. Qin, H., Ding, Y., Zhang, M., Yan, Q., Liu, A., Dang, Q., Liu, Z., and Liu, X. Bi BERT: Accurate fully binarized BERT. In Proceedings of International Conference on Learning Representations (ICLR), 2022. Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNor-net: Image Net classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525 542. Springer, 2016. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3): 400 407, 1951. Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295 2329, 2017. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J egou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347 10357. PMLR, 2021. Wright, S. J. and Recht, B. Optimization for Data Analysis. Cambridge University Press, Cambridge, 2022. 
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(88):2543 2596, 2010. URL ht tp://jmlr.org/papers/v11/xiao10a.html. Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., and He, Y. Zero Quant: Efficient and affordable posttraining quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35: 27168 27183, 2022. PARQ: Piecewise-Affine Regularized Quantization Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., and Xin, J. Binary Relax: A relaxation approach for training deep neural networks with quantized weights. SIAM Journal on Imaging Sciences, 11(4):2205 2223, 2018. https://doi.org/10.1137/18M1166134. Yin, P., Lyu, J., Zhang, S., Osher, S. J., Qi, Y., and Xin, J. Understanding straight-through estimator in training activation quantized neural nets. In Proceedings of International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum ?id=Skh4j Rc KQ. Yu, Y., Zheng, X., Marchetti-Bowick, M., and Xing, E. Minimizing Nonconvex Non-Separable Functions. In Lebanon, G. and Vishwanathan, S. V. N. (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 1107 1115, San Diego, California, USA, 09 12 May 2015. PMLR. URL https://proceedings.mlr.press/v38/ yu15.html. Yu, Y., Zhang, X., and Schuurmans, D. Generalized conditional gradient for sparse estimation. Journal of Machine Learning Research, 18(144):1 46, 2017. URL http: //jmlr.org/papers/v18/14-348.html. Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023 6032, 2019. Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In Proceedings of International Conference on Learning Representations (ICLR), 2018. Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. In International Conference on Learning Representations, 2022. PARQ: Piecewise-Affine Regularized Quantization A. Convergence analysis A.1. Proof of Theorem 3.1 We consider the framework of online convex optimization, which is more general than stochastic optimization. In particular, let ft = f( , zt) be a function presented to us at each iteration t = 1, 2, . . ., and Ψ be a regularization function that we use throughout the whole process. The two-step presentation of AProx in (11) can be written in one-step as wt+1 = arg min w W s=1 ηs gs, w + λΨ(w) + 1 where w0 is the initial weight vector and gt = ft(wt). Moreover, we use a more general distance generating function h to replace (1/2) 2 2, and define the Bregman divergence as Dh(u, w) = h(u) h(w) h(w), u w . With Bregman divergence, a more general form of AProx can be written as wt+1 = arg min w W ηs gs, w + λΨ(w) + Dh(w, w0) . (15) Assumption A.1. We make the following assumptions: (a) Each loss function ft is convex and Lipschitz continuous with Lipschitz constant Gf. (b) The regularizer Ψ is convex and Lipschitz continuous with Lipschitz constant GΨ. (c) The function h is differentiable and strongly convex with convexity parameter ρ. It follows from Assumption A.1(c) that Dh(u, w) is strongly convex in w with convexity parameter ρ. Theorem A.2 (Regret bound for AProx). 
Under Assumption A.1, for any w Rd, it holds that s=1 ηs fs(ws) + λΨ(ws) fs(w) λΨ(w) (Gf + λGΨ)2 s=1 2η2 s + Dh(w, w0). (16) Proof. We adapt the proof of Bubeck (2015, Theorem 4.3) by adding the regularizer Ψ and replacing the term h(w) h(w0) with Dh(w, w0). An advantage of this replacement is that we can use any initial point w0 while the proof in (Xiao, 2010; Bubeck, 2015) requires w0 = arg min h(w). Let w0 Rd be an arbitrary initial point and define ψ0(w) = Dh(w, w0). For t 1, define s=1 ηs gs, w + λΨ(w) + Dh(w, w0). The AProx algorithm (15) can be expressed as, for t 0, wt+1 = arg min w ψt(w). Since Dh(w, w0) is strongly convex in w with convexity parameter ρ, the same property holds for ψt for all t 0. According to a basic result on minimizing strongly convex functions (e.g., Chen & Teboulle, 1993, Lemma 3.2) and the fact that wt+1 minimizes ψt, we have ψt(wt+1) ψt(w) ρ 2 w wt+1 2, t = 0, 1, 2, . . . . (17) From the definition of ψt and ψt 1, we have ψt(wt) ψt(wt+1) = ψt 1(wt) ψt 1(wt+1) + ηt gt, wt wt+1 + λΨ(wt) λΨ(wt+1) . (18) PARQ: Piecewise-Affine Regularized Quantization For the left-hand side of (18), we apply (17) to obtain ρ 2 wt+1 wt 2 ψt(wt) ψt(wt+1). For the first term on the right-hand side of (18), we apply (17) again for ψt 1 to obtain ψt 1(wt) ψt 1(wt+1) ρ 2 wt+1 wt 2. For the second term on the right-hand side of (18), we have gt, wt wt+1 + λΨ(wt) λΨ(wt+1) gt wt+1 wt + λΨ(wt) λΨ(wt+1) Gf wt+1 wt + +λGΨ wt+1 wt = (Gf + λGΨ) wt+1 wt , (19) where in the first inequality we used H older s inequality, and in the second inequality we used Assumptions A.1(a) and A.1(b) respectively. Combining the above three inequalities with (18), we get ρ wt+1 wt 2 ηt(Gf + λGΨ) wt+1 wt , which further implies wt+1 wt ηt(Gf + λGΨ)/ρ. Combining this with (19) yields gt, wt wt+1 + λΨ(wt) λΨ(wt+1) ηt(Gf + λGΨ)2/ρ. (20) Next we prove that the following inequality holds for all w Rd and all t 0: s=1 ηs gs, ws+1 + λΨ(ws+1) s=1 ηs gs, w + λΨ(w) + Dh(w, w0). (21) We proceed by induction. For the base case t = 0, the desired inequality becomes Dh(w, w0) 0, which is always true by the definition of Dh. Now we suppose (21) holds for t 1 and apply it with w = wt+1 in the first inequality below: s=1 ηs gs, ws+1 + λΨ(ws+1) s=1 ηs gs, ws+1 + λΨ(ws+1) + ηt gt, wt+1 + λΨ(wt+1) s=1 ηs gs, wt+1 + λΨ(wt+1) + Dh(wt+1, w0) + ηt gt, wt+1 + λΨ(wt+1) s=1 ηs gs, wt+1 + λΨ(wt+1) + Dh(wt+1, w0) s=1 ηs gs, w + λΨ(w) + Dh(w, w0), w W. In the last inequality above, we recognized the definition of ψt and used the fact that wt+1 is the minimizer of ψt. This finishes the proof of (21). Finally we add Pt s=1 ηs ( gs, ws + Ψ(ws)) to both sides of (21) and rearrange terms to obtain s=1 ηs gs, ws w + λΨ(ws) λΨ(w) s=1 ηs gs, ws ws+1 + λΨ(ws) λΨ(ws+1) + Dh(w, w0). (22) PARQ: Piecewise-Affine Regularized Quantization For the left-hand side of (22), we use convexity of fs to obtain fs(ws) fs(w) gs, ws w . For the right-hand side of (22), we apply (20) to obtain s=1 ηs gs, ws ws+1 + λΨ(ws) λΨ(ws+1) (Gf + λGΨ)2 Combining the above three inequalities together, we have s=1 ηs fs(ws) + Ψ(ws) fs(w) λΨ(w) (Gf + λGΨ)2 s=1 η2 s + Dh(w, w0). This finishes the proof of Theorem A.2. Now we consider the stochastic optimization problem of minimizing f(w) + λΨ(w) where the loss function f(w) := Ez[f(w, z)]. We can regard the sequence of loss functions ft in the online optimization setting as f( , zt) and compare with w = arg min f(w) + λΨ(w). 
In this case, the regret bound (16) becomes s=1 ηs f(ws, zs) + λΨ(ws) f(w , zs) λΨ(w ) (Gf + λGΨ)2 s=1 η2 s + Dh(w , w0). Using a standard online-to-stochastic conversion argument (e.g., Xiao, 2010, Theorem 3), we can derive s=1 ηs E f(ws) + λΨ(ws) f(w ) λΨ(w ) (Gf + λGΨ)2 s=1 η2 s + Dh(w , w0), (23) where the expectation E[ ] is taken with respect to the random variables {w1, . . . , wt}, which in turn depends on {z1, . . . , zt}. For the ease of presentation, we denote R2 = minw W Dh(w, w0). Moreover, we define a weighted average of all iterates up to iteration t: wt = 1 Pt s=1 ηs Then by convexity of f and Ψ, we obtain E f( wt) + λΨ( wt) f(w ) λΨ(w ) ρ Pt s=1 η2 s + R2 Pt s=1 ηs . (24) Constant stepsize. If the total number of iterations T is known ahead of time, then we can choose an optimal constant stepsize. Let ηs = η for all s = 1, . . . , T, then the bound in (24) becomes Tη = (Gf + λGΨ)2 In order to minimize the above bound, we take η = R Gf +λGΨ p ρ T and obtain E f( w T ) + λΨ( w T ) f(w ) λΨ(w ) 2(Gf + λGΨ)R r 1 PARQ: Piecewise-Affine Regularized Quantization Diminishing stepsize. The right-hand side of (24) has the same form as the convergence rate bound for the classical stochastic gradient or subgradient method (e.g., Nesterov, 2004, Section 3.2.3). A classical sufficient condition for convergence is X s=1 ηs = + , s=1 η2 s < + . In particular, if we take ηt = R 2(Gf +λGΨ) p ρ t , we have E f( wt) + λΨ( wt) f(w ) λΨ(w ) (Gf + λGΨ)R(2 + 1.5 ln(t)) ρt . Finally, Theorem 3.1 is obtained with some simplification. In particular, if we choose the Bregman divergence as the Euclidean distance 1 2 2 2, then we have ρ = 1. This leads to E f( wt) + λΨ( wt) f(w ) λΨ(w ) GR(2 + 1.5 ln(t)) where G := Gf + λGΨ. This completes the proof. A.2. Proof of Theorem 3.2 For simplicity, we denote Fλ(w) = f(w) + λΨ(w) and G = Gf + λGΨ where Gf and GΨ are the Lipschitz constants of f and Ψ, respectively. To establish the last-iterate convergence of AProx, we first introduce the following lemma, which connects the convergence of the last iteration to the convergence of the average iteration. Lemma A.3 (Lemma 1 in (Orabona, 2020)). Given that {ηt}T t=1 is a non-increasing positive sequence and {qt}T t=1 is a nonnegative sequence, the following inequality holds t=T k+1 ηt (qt q T k) . (25) Upon setting qt = E [Fλ(wt)] Fλ(w ) in Lemma A.3, we derive that ηT E Fλ(w T ) Fλ(w ) 1 t=1 ηt E Fλ(wt) Fλ(w ) t=T k+1 ηt E Fλ(wt) Fλ(w T k) . For the first term on the right-hand side, we apply Equation 23, which yields t=1 ηt E Fλ(wt) Fλ(w ) G2 t=1 η2 t + Dh(w , w0) To control the second term, we note that for any 1 k T 1 t=T k+1 ηt E Fλ(wt) Fλ(w T k) = t=T k ηt E Fλ(wt) Fλ(w T k) G2 t=T k η2 t . (28) Here we apply Equation 23 again for the last inequality upon setting w = w T k and use the fact that Dh(w, w) = 0 for all w W. Combining the above two components together, we have E Fλ(w T ) Fλ(w ) G2 + Dh(w , w0) ηT T . (29) PARQ: Piecewise-Affine Regularized Quantization Constant stepsize. If the total number of iterations T is known ahead of time, then we can choose an optimal constant stepsize. Let ηt = η for all s = 1, . . . , T, then the bound in (29) becomes E Fλ(w T ) Fλ(w ) G2 η + Dh(w , w0) ρ (2 + ln(T)) η + Dh(w , w0) Here we use the fact that Pn k=1 1 k 1 + ln(n) for all n 1. In order to minimize the above bound, we take η = (2+ln(T ))T and obtain E Fλ(w T ) Fλ(w ) 2G Dh(w , w0)(2 + ln(T)) Diminishing stepsize. Suppose we set the stepsize ηt = η t. 
Then, Equation 29 reduces to E Fλ(w T ) Fλ(w ) η + Dh(w , w0) + Dh(w , w0) To proceed, note that t dt = ln T T k = ln 1 + k T k k T k . (33) Therefore, we have T 1 X Invoking this result into Equation 32, we further have E Fλ(w T ) Fλ(w ) 3ηG2(1 + ln(T)) T + Dh(w , w0) Hence, upon setting η = 1 2 , we derive that E Fλ(w T ) Fλ(w ) G 2 Specifically, if we choose the Bregman divergence as the Euclidean distance 1 2 2 2, then we have ρ = 1. Upon defining R = minw W w0 w 2, we have E Fλ(w T ) Fλ(w ) GR(2 + 3 PARQ: Piecewise-Affine Regularized Quantization test accuracy 1-bit Ternary 2-bit 50 100 150 200 250 0 STE Bin Rel PARQ Figure 13. Dei T-S test accuracy (top row) and train loss (middle row) across several bit-widths (columns). All PARQ curves use a ρ 1 schedule with s = 1 and t1 = 0.5T (bottom row). B. Additional experiment results Figures 13 14 present accuracy and training loss curves for QAT of Dei T-S. In particular, Figure 13 uses s = 1, which is essentially linear during the annealing period. The 1-bit accuracy plots reveal that PARQ trains more stably than STE and Binary Relax; it does not exhibit any sudden drops in accuracy. It performs the most consistently on Dei T-S, suggesting the relative performance of QAT methods may vary by model size. Ablation study on ρ 1 t . Table 4 shows results of 2-bit Dei T-Ti on Image Net, trained using different s (rows) and t1 (columns) values in Equation (13). This sweep reveals that a shallow s = 1 performs best for the model and dataset setup. A later transition center of t1 = 0.75T performs noticeably better for steepness values s {10, 20}. Table 4. Ablation of parameters in (13) on 2-bit Dei T-Ti test accuracy. The only option for t1 is 0.5T for s = 1 since ρ 1 t decays linearly. t1 0.25T 0.5T 0.75T 1 66.60 10 64.11 64.62 66.02 20 63.74 63.88 66.17 40 64.05 63.89 63.89 80 64.06 64.28 63.73 PARQ: Piecewise-Affine Regularized Quantization test accuracy 1-bit Ternary 2-bit 50 100 150 200 250 0 STE Bin Rel PARQ Figure 14. Dei T-S test accuracy (top row) and train loss (middle row) across several bit-widths (columns). All PARQ curves use a ρ 1 schedule with s = 50 and t1 = 0.5T (bottom row).