# cauchyschwarz_regularizers__38f2ef7f.pdf

Published as a conference paper at ICLR 2025

CAUCHY SCHWARZ REGULARIZERS

Sueda Taner ETH Zurich, Switzerland taners@iis.ee.ethz.ch

Ziyi Wang ETH Zurich, Switzerland ziywang@student.ethz.ch

Christoph Studer ETH Zurich, Switzerland studer@ethz.ch

We introduce a novel class of regularization functions, called Cauchy Schwarz (CS) regularizers, which can be designed to induce a wide range of properties in solution vectors of optimization problems. To demonstrate the versatility of CS regularizers, we derive regularization functions that promote discrete-valued vectors, eigenvectors of a given matrix, and orthogonal matrices. The resulting CS regularizers are simple, differentiable, and can be free of spurious stationary points, making them suitable for gradient-based solvers and large-scale optimization problems. In addition, CS regularizers automatically adapt to the appropriate scale, which is, for example, beneﬁcial when discretizing the weights of neural networks. To demonstrate the efﬁcacy of CS regularizers, we provide results for solving underdetermined systems of linear equations and weight quantization in neural networks. Furthermore, we discuss specializations, variations, and generalizations, which lead to an even broader class of new and possibly more powerful regularizers.

1 INTRODUCTION

We focus on the design of novel regularization functions ℓ: RN R 0 that promote certain pre-deﬁned properties on the solution vector(s) ˆx RN of regularized optimization problems

ˆx arg min x RN f(x) + λℓ(x), (1)

where f : RN R is an objective function and λ R 0 a regularization parameter. One instance of such an optimization problem arises in the binarization of neural-network weights, where the solution(s) of (1) are the network s weights that should be binary-valued ˆx { α, +α}N, but with the appropriate scale α R automatically chosen by the regularizer the scale α can then be absorbed into the activation function.

1.1 CONTRIBUTIONS

We propose Cauchy Schwarz (CS) regularizers, a novel class of regularization functions that can be designed to impose a wide range of properties. We derive concrete examples of CS regularization functions that promote discrete-valued vectors (e.g., binaryand ternary-valued vectors), eigenvectors of a given matrix, and matrices with orthogonal columns. The resulting regularizers are (i) simple, (ii) automatically determine the appropriate scale, (iii) often free of any spurious stationary points, and (iv) differentiable, which enables the use of (stochastic) gradient-based numerical solvers that make them suitable to be used in large-scale optimization problems. In addition, we discuss a variety of specializations, variations, and generalizations, which allow for the design of an even broader class of new and possibly more powerful regularization functions. Finally, we showcase the efﬁcacy and versatility of CS regularizers for solving underdetermined systems of linear equations and neural network weight binarization and ternarization. All proofs and additional experimental results are relegated to the appendices in the supplementary material. The code for our numerical experiments is available under https://github.com/IIP-Group/CS_regularizers.

Published as a conference paper at ICLR 2025

1.2 NOTATION

Column vectors and matrices are written in boldface lowercase and uppercase letters, respectively. The entries of a vector x RN are [x]n = xn, n = 1, . . . , N, and transposition is x T. The Ndimensional all-zeros vector is 0N, and the all-ones vector 1N; we omit the dimension N if it is clear from the context. The inner product between the vectors x, y RN is x, y = x Ty, and linear dependence is denoted by

x y (a1, a2) R2\{(0, 0)} : a1x = a2y. (2)

For p 1, the p-norm of a vector x RN is x p (PN n=1 |xn|p)1/p, and we will frequently use the shorthand notation |[x]|p PN n=1 xp n (note the absence of absolute values). The entry on the nth row and kth column of a matrix X is Xn,k, the Frobenius norm is X F, and columnwise vectorization is vec(X). The N N identity matrix is IN and the M N all-zero matrix is 0M N.

1.3 RELEVANT PRIOR ART

Semideﬁnite relaxation (SDR) can be used for solving optimization problems with binary-valued solutions (Luo et al., 2010). Since SDR requires lifting (i.e., increasing the dimension of the original problem size), solving such problems quickly results in prohibitive complexity, even for moderatesized problems. As a remedy, non-lifting-based SDR approximations were proposed in (Shah et al., 2016; Castañeda et al., 2017). These methods utilize biconvex relaxation that scales better to larger optimization problems. Convex non-lifting-based approaches were also proposed for recovering binary-valued solutions from linear measurements using ℓ -norm regularization (Mangasarian & Recht, 2011). In contrast to such methods, the proposed CS regularizers are (i) differentiable, which enables their use together with differentiable objective functions and any (stochastic) gradient-based numerical solver, and (ii) can be specialized to impose a wider range of different structures.

Vector discretization is widely used for neural network parameter quantization (Hubara et al., 2018). Regularization-free approaches, e.g., the method from Rastegari et al. (2016), perform neural network binarization by simply quantizing the weights and adapting their scale to their average absolute value. Approaches that utilize projections onto discrete sets within gradient-descent-based methods have been proposed in Hou et al. (2016); Leng et al. (2018). In contrast, the proposed CS regularizers are differentiable and automatically adapt their scale to the appropriate magnitude of the solution vectors.

The prior art describes numerous vector discretization methods that rely on regularization functions. The methods in Hung et al. (2015); Tang et al. (2017); Wess et al. (2018); Bai et al. (2019); Darabi et al. (2019); Choi et al. (2020); Yang et al. (2021); Razani et al. (2021); Xu et al. (2023) use regularization functions related1 to the form ℓ(x, β) = PN n=1(x2 n β)2 for N-dimensional vectors x RN and either ﬁx the magnitude β2 (e.g., β = 1) or learn this additional parameter during gradient descent. Another strain regularization functions that utilizes trigonometric functions related2

to the form ℓ(x, β) = PN n=1 sin2(βπxn) have been proposed in Naumov et al. (2018); Elthakeb et al. (2020); Solodskikh et al. (2022). In contrast to all of the above regularization functions, the proposed CS regularizers (i) do not introduce additional trainable parameters while still being able to automatically adapt their scale to the vectors magnitude and (ii) can be designed to promote a wider range of structures. Furthermore, the proposed CS regularizers include the regularization functions of Tang et al. (2017); Darabi et al. (2019) as a special case; see App. B.3.

The recovery of matrices with orthogonal columns ﬁnds, for example, use in the orthogonal Procrustes problem: ˆX = arg min X RN K XA B F subject to XTX = IK for given matrices A and B. A closed-form solution to this problem is given by ˆX = UVT , where U and V are the leftand right-singular matrices, respectively, of the matrix BAT (Schönemann, 1966). Notably, this formulation constrains the columns of X to be orthonormal. A more general problem that relaxes this constraint to orthogonal columns with arbitrary norms was introduced in Everson (1998) along

1Some methods, e.g., Darabi et al. (2019), use non-differentiable regularizers with ||xn| β| instead of (x2 n β)2, while others, e.g., Xu et al. (2023), use regularizers of the form γ x α sign(x) and introduce additional scaling factors. 2The method in Naumov et al. (2018) ﬁxes the scale, while Elthakeb et al. (2020) utilizes a trainable parameter; the regularizer in Solodskikh et al. (2022) introduces an additional differentiable regularizer that imposes ﬁnite range.

Published as a conference paper at ICLR 2025

with iterative solution algorithms. In contrast to these approaches, CS regularizers can be used to solve Procrustes problems for matrices with orthogonal columns of arbitrary scale, without relying on matrix decomposition methods, such as the singular value, polar, or QR decomposition.

Promoting eigenvectors of a matrix is useful in applications where aligning a vector with these eigenvectors is desirable. For example, in principal component analysis (Jolliffe, 2002), data is projected onto the principal eigenvectors for dimensionality reduction. Encouraging this alignment can aid feature selection and thus improve dimensionality reduction. In this context, CS regularizers offer an advantage by promoting eigenvectors without explicitly computing eigenvalues or eigenvectors.

An instance of a CS regularizer was proposed in Ehrhardt & Arridge (2014), which is a regularization function based on the gradients of two vector-valued functions to measure how far these functions are from being parallel. In contrast, we present a general framework for designing a broad class of CS regularizers, encompassing the one from Ehrhardt & Arridge (2014) as a special case.

We ﬁnally note that the CS divergence (Principe, 2010) was used as a regularizer for improving variational autoencoders in Tran et al. (2022) and for promoting fairness in machine learning models in Liu et al. (2025). In contrast, we use the CS inequality (Steele, 2004) to design new regularization functions that can among many other structures be used to promote discrete-valued vectors (e.g., binary or ternary), eigenvectors to a given matrix, and matrices with orthogonal columns.

2 CAUCHY SCHWARZ REGULARIZERS

In this section, we introduce the general recipe for deriving CS regularizers. We then use this recipe to design speciﬁc CS regularization functions that promote discrete-valued (e.g., binary and ternary) vectors, eigenvectors of a given matrix, and matrices with orthogonal columns.

2.1 THE RECIPE

The following result is an immediate consequence of the CS inequality (Steele, 2004) and provides a recipe for the design of a wide range of regularization functions; a short proof is provided in App. A.1. Proposition 1. Fix two vector-valued functions g, h : RN RM and deﬁne the set

X x RN : g( x) h( x) . (3)

Then, the nonnegative regularization function

ℓ(x) g(x) 2 2 h(x) 2 2 | g(x), h(x) |2 (4)

is zero if and only if (iff) x X.

We call regularization functions derived from Proposition 1 CS regularizers. While CS regularizers are guaranteed to be (i) nonnegative and (ii) zero iff x X, it is also desirable for gradient-based numerical solvers that these regularizers do not exhibit any spurious stationary points. Deﬁnition 1. A spurious stationary point is a vector x / X for which ℓ(x) = 0.

Note that functions that do not have any spurious stationary points are also known as invex (Ben-Israel & Mond, 1986). In other words, all stationary points of invex functions are also global minima; this property can be useful to develop problem-speciﬁc algorithms and to analyze their convergence (see, e.g., Barik et al. (2023); Pinilla et al. (2022) and the references therein).

Whether or not a CS regularizer has spurious stationary points depends on the speciﬁc choice of the two functions g and h. Nonetheless, even if a CS regularizer has spurious stationary points, it may still accomplish the desired goal. We conclude by noting that any vector x for which g(x) = 0 or h(x) = 0 will minimize (4).

The following result will be useful below when we analyze properties of speciﬁc CS regularizers; a short proof is given in App. A.2. Lemma 1. Fix two vector-valued functions g, h : RN RM. Then, the following equalities hold:

ℓ(x) = g(x) 2 2 min β R βg(x) h(x) 2 2 = h(x) 2 2 min β R g(x) βh(x) 2 2. (5)

Published as a conference paper at ICLR 2025

Lemma 1 will be used to highlight the important auto-scale property CS regularizers, since setting β to its optimal value in (5) leads exactly to the regularization function in (4), as the optimization problem in β is continuous, quadratic, and has a closed-form solution.

2.2 RECOVERING DISCRETE-VALUED VECTORS

From Proposition 1, we can derive a range of differentiable CS regularizers that, when minimized, promote discrete-valued vectors. This can be accomplished by using entry-wise polynomials for the functions g and h. We next show three concrete and practically useful examples.

2.2.1 SYMMETRIC BINARY

Deﬁne g(x) [x2 1, . . . , x2 N]T and h(x) 1N. Then, Proposition 1 yields the following CS regularizer that promotes symmetric binary-valued vectors; see App. A.3 for the derivation. Regularizer 1 (Symmetric Binary). Let x RN and deﬁne

ℓbin(x) N|[x]|4 |[x]|2 2 . (6)

Then, the nonnegative function in (6) is only zero for symmetric binary-valued vectors, i.e., iff x { α, +α}N for any α R. Furthermore, ℓbin(x) does not have any spurious stationary points.

To gain insight into Regularizer 1, we invoke Lemma 1 and obtain

ℓbin(x) = N min β R 0

PN n=1 (xn β)(xn + β) 2. (7)

This equivalence implies that, for a given vector x, Regularizer 1 is the right-hand-side total square error in (7), but with optimally chosen scale α β; this is the auto-scale property of CS regularizers. In other words, the CS regularizer implicitly and automatically adapts its scale α to the scale of every argument x. Furthermore, this CS regularizer is zero iff x { α, α}N for some α R, as only xn = +α or xn = α for n = 1, . . . , N allows the right-hand-side of (7) to be zero.

We showcase the efﬁcacy of ℓbin for the recovery of binary-valued solutions in Section 3.1 and compare its advantages to existing binarizing regularizers (cf. Section 1.3), such as being differentiable, scaleadaptive, and free of additional optimization parameters, in App. C.1.

2.2.2 ONE-SIDED BINARY

Deﬁne g(x) [x2 1, . . . , x2 N]T and h(x) [x1, . . . , x N]T. Then, Proposition 1 yields the following CS regularizer that promotes one-sided binary-valued vectors; see App. A.4 for the derivation. Regularizer 2 (One-Sided Binary). Let x RN and deﬁne

ℓosb(x) |[x]|2|[x]|4 |[x]|3 2 . (8)

Then, the nonnegative function in (8) is only zero for one-sided binary-valued vectors, i.e., iff x {0, α}N for any α R. Furthermore, ℓosb(x) does not have any spurious stationary points.

To gain insight into Regularizer 2, we invoke Lemma 1 and obtain

ℓosb(x) = |[x]|2 min β R PN n=1 xn(xn β) 2. (9)

Once again, we observe this CS regularizer s auto-scale property and see that only vectors of the form x {0, α}N for some α R minimize (9).

2.2.3 SYMMETRIC TERNARY

Deﬁne g(x) [x3 1, . . . , x3 N]T and h(x) [x1, . . . , x N]T. Then, Proposition 1 yields the following CS regularizer that promotes symmetric ternary-valued vectors; see App. A.5 for the derivation. Regularizer 3 (Symmetric Ternary). Let x RN and deﬁne

ℓter(x) |[x]|2|[x]|6 |[x]|4 2 . (10)

Then, the nonnegative function in (10) is only zero for symmetric ternary-valued vectors, i.e., iff x { α, 0, +α}N for any α R. Furthermore, ℓter(x) does not have any spurious stationary points.

Published as a conference paper at ICLR 2025

To gain insight into Regularizer 3, we invoke Lemma 1 and obtain

ℓter(x) = |[x]|2 min β R 0

PN n=1 xn(xn β)(xn + β) 2. (11)

As above, we observe this CS regularizer s auto-scale property and see that only vectors of the form x { α, 0, +α}N for some α R minimize (11).

The CS regularizers introduced so far promote binaryor ternary-valued vectors; a visualization of their loss landscapes in two dimensions is provided in App. F. In App. B.1, we detail an approach that generalizes CS regularizers to a symmetric, discrete-valued set with 2B equispaced entries. In addition, all of the CS regularizers introduced above involve polynomials of higher (e.g., quartic) order, leading to potential numerical stability issues. In App. B.2, we propose alternative symmetric binarization regularizers that avoid such issues; similar alternative regularization functions can be derived for the other discretization regularizers.

2.3 RECOVERING EIGENVECTORS OF A GIVEN MATRIX

All CS regularizers introduced so far promote vectors with discrete-valued entries. In order to demonstrate the versatility of Proposition 1, we now propose a CS regularizer that promotes vectors that are eigenvectors of a given (and ﬁxed) matrix C RN N.

Deﬁne g(x) Cx and h(x) = x. Then, Proposition 1 yields the following CS regularizer that promotes eigenvectors of the matrix C; see App. A.6 for the derivation. Regularizer 4 (Eigenvector). Fix C RN N, let x RN, and deﬁne

ℓeig(x) Cx 2 2 x 2 2 (x T Cx)2 (12)

Then, the nonnegative function in (12) is only zero for eigenvectors of C and the all-zeros vector.

To gain insight into Regularizer 4, we invoke Lemma 1 and obtain

ℓeig(x) = |[x]|2 min β R Cx βx 2 2. (13)

As above, we observe this CS regularizer s auto-scale property and see that only scaled eigenvectors of the matrix C minimize (13).

2.4 RECOVERING MATRICES WITH ORTHOGONAL COLUMNS

Finally, we demonstrate that Proposition 1 can also be used to promote structure in matrices. The following CS regularizer promotes matrices X RN K with K N to have orthogonal columns.

Deﬁne g(X) vec(XTX) and h(x) vec(IK). Then, Proposition 1 yields the following CS regularizer that promotes matrices with orthogonal columns; see App. A.7 for the derivation. Regularizer 5 (Matrix with Orthogonal Columns). Let X RN K with K N and deﬁne

ℓom(X) K XTX 2 F X 4 F. (14)

Let X = USVT be the singular value decomposition of X. Then, we equivalently have,

ℓom(X) K PK k=1 S4 k,k PK k=1 S2 k,k 2 , (15)

which is the symmetric binarizer from (6) applied to the singular values of X. The nonnegative function in (14) is only zero for matrices X with pairwise orthogonal columns of equal length, i.e., iff XTX = αIK for α > 0. Furthermore, ℓom(X) does not have any spurious stationary points.

To gain insight into Regularizer 5, we invoke Lemma 1 and obtain

ℓom(X) = K min β R vec(XTX) βvec(IK) 2 2. (16)

Once again, we observe this CS regularizer s auto-scale property and see that only matrices X with orthogonal columns of the same norm minimize (16). Note that we have developed a regularizer that promotes the same norm for all columns of X for simplicity. However, one could modify h(x) to promote, for example, a certain user-deﬁned ratio between the norms of the columns.

Published as a conference paper at ICLR 2025

2.5 GENERALIZATIONS AND VARIATIONS

Proposition 1 and the underlying ideas enable the design of a much broader range of regularizers. We now propose one possible generalization, where we replace the CS inequality utilized in Proposition 1 with Hölder s inequality (Hölder, 1889), resulting in the following recipe for Hölder regularizers.

Proposition 2. Fix two vector-valued functions g, h: RN RN and deﬁne X as in (3). Let p, q 1 so that 1

q = 1 and let r > 0. Then, the nonnegative function

ℓ(x) g(x) r p h(x) r q | g(x), h(x) |r (17)

is zero iff x X.

Note that Proposition 1 is a special case of Proposition 2 by setting p = q = r = 2.

We include more generalizations and variations in App. B. Speciﬁcally, we propose a CS regularizer promoting vectors with symmetric equispaced discrete values in App. B.1, bounded CS regularizers for vector binarization in App. B.2, CS regularizers for discrete-valued vectors with ﬁxed scale in App. B.3, non-differentiable variants of CS regularizers in App. B.4, and a generalization of CS regularizers that is invariant to the scale of the functions g and h in App. B.5. We conclude by noting that many other CS or Hölder regularizers can be derived when combining the above ideas and those put forward in App. B. We also note that most of these results can be generalized to complex-valued vectors. A detailed investigation of such generalizations and variations is left for future work.

3 APPLICATION EXAMPLES

We now show several application examples for CS regularizers for vector discretization, recovery of eigenvectors of a given matrix, recovering matrices with orthogonal columns, and quantization of neural network weights.

3.1 RECOVERING DISCRETE-VALUED VECTORS

The proposed CS regularizers enable the recovery of binaryand ternary-valued vectors from underdetermined linear systems of equations. To this end, we solve systems of linear equations of the form b = Ax, where A RM N has i.i.d. standard normal entries and M < N. We create vectors x RN, whose entries are chosen i.i.d. with uniform probability from { 1, +1} for symmetric binary and from {0, +1} for one-sided binary, and with probability 0.25, 0.5, 0.25 from { 1, 0, +1}, respectively, for ternary-valued vectors. Then, we calculate b = Ax , and we try to recover the vector x from b by solving optimization problems of the form

ˆx arg min x RN ℓ( x) subject to b = A x (18)

using a projected gradient descent algorithm speciﬁcally, FISTA with backtracking (Beck & Teboulle, 2009; Goldstein)3. Here, ℓ( x) are the CS regularizers from Section 2.2. We ﬁx N = 100 and vary M between 30 and 90.4 We declare success for recovering x if the returned solution ˆx satisﬁes ˆx x 2/ x 2 10 2. Fig. 1 shows the success probabilities with respect to the undersampling ratio γ along with error bars calculated from the standard error of the mean.

Symmetric Binary We ﬁrst recover symmetric binary-valued solutions using Regularizer 1 with ℓbin from (6). For this scenario, Mangasarian & Recht (2011) showed that ℓ -norm minimization recovers the binary-valued solution as long as γ = M/N satisﬁes γ > 0.5 and N approaches inﬁnity. Thus, our baseline is ℓ -norm minimization, which we solve with Douglas Rachford splitting (Eckstein & Bertsekas, 1992) as in Studer et al. (2015). Fig. 1a shows the success rate for ℓbin and ℓ -norm minimization with respect to the undersampling ratio γ. For ℓbin minimization, we observed that different initializations have an impact on the success rate since the objective is non-convex. Thus, we allow at most 10 random initializations of projected gradient descent. We note

3We run projected gradient descent and the baseline algorithms for a maximum of 104 iterations. 4We also study the impact of the sparsity of x while the number of measurements M is ﬁxed in App. C.4. For each M, we randomly generate 1000 problem instances and report the average success probability.

Published as a conference paper at ICLR 2025

0.3 0.4 0.5 0.6 0.7 0.8 0.9 0

(a) Symmetric binary

0.3 0.4 0.5 0.6 0.7 0.8 0.9 0

(b) One-sided binary

0.3 0.4 0.5 0.6 0.7 0.8 0.9 0

(c) Symmetric ternary

Figure 1: Probability of success for recovering vectors with (a) binary, (b) one-sided binary, and (c) symmetric ternary values dependent on the undersampling ratio M/N.

that multiple initializations of ℓ -norm minimization do not affect its success rate as the problem is convex. We see from Fig. 1a that ℓbin minimization has, for any undersampling ratio γ, a higher probability of successfully recovering the true solution than ℓ -norm minimization.

We discuss the advantages of ℓbin over existing binarizing regularizers in App. C.1. We also provide a comparison of the success rates of ℓbin and ℓ -norm minimization with existing binarizing regularizers in App. C.2, and observe that ℓbin achieves the highest success rate. Moreover, in App. C.3, we provide the success rates of other binarizing CS regularizer variants (e.g., the Hölder, non-differentiable, and scale-invariant regularizers), where ℓbin, once again, achieves best performance.

In App. C.7, we showcase yet another application for the ℓbin regularizer. Speciﬁcally, we use it to ﬁnd approximate solutions to the weighted maximum cut (MAX-CUT) problem (Commander, 2009). Our simulation results in App. C.7 show that our approach does not identify optimal solutions for large graphs, but signiﬁcantly improves the objective values compared to the initialization.

One-Sided Binary We now recover vectors with one-sided binary-valued entries using Regularizer 2 with ℓosb from (8). In this experiment, we ran projected gradient descent for only one random initialization. Our baseline for this scenario is ℓ1-norm minimization (Cai et al., 2009), as the generated vectors are sparse with half of the entries being nonzero (on average). We solve the ℓ1-norm minimization problem with Douglas Rachford splitting. Fig. 1b demonstrates that ℓosb minimization signiﬁcantly outperforms ℓ1-norm minimization for all undersampling ratios.

Symmetric Ternary We also recover vectors with symmetric ternary-valued entires using Regularizer 3 with ℓter from (10). In this experiment, we ran projected gradient descent for only one random initialization. We use ℓ1-norm minimization as our baseline as the generated vectors are sparse with half of the entries being nonzero (on average). Fig. 1c demonstrates that ℓter minimization signiﬁcantly outperforms ℓ1-norm minimization for all undersampling ratios.

We provide similar simulation results for symmetric two-bit-valued solution recovery in App. C.5.

3.2 RECOVERING EIGENVECTORS OF A MATRIX

The proposed CS regularizers also enable the recovery of eigenvectors of a given (and ﬁxed) matrix. We once again consider a system of linear equations of the form b = Ax, where A RM N has i.i.d. standard normal entries and M < N. We create vectors x RN by uniform randomly choosing an eigenvector of a matrix C RN N with i.i.d. standard normal entries. We calculate b = Ax

and solve (18) using Regularizer 4 with ℓeig. As a baseline, we consider minimizing ℓµ( x, µ) C x µ x 2 2 subject to b = A x, similarly to (18), with an additional optimization parameter (i.e., µ) compared to minimizing ℓeig( x). We use FISTA as in Section 3.1 for both algorithms. Fig. 7 in App. C.6 demonstrates that the success rate of ℓeig remains to be consistently above 0.8, approaching 1 at an M/N-ratio of 0.5, and surpassing that of the baseline that minimizes ℓµ.

Published as a conference paper at ICLR 2025

3.3 RECOVERING MATRICES WITH ORTHOGONAL COLUMNS

We now demonstrate that the proposed regularizers can also impose structure to matrices. To this end, we consider a system of linear equations AX = B, where A RM N has i.i.d. standard normal entries with M < N, and X RN K with XTX = IK. We solve ˆX arg min X RN K ℓom( X) subject to A X = B (19)

using FISTA as in Section 3.1 with a maximum number of 1000 iterations. We have observed that for N = K = 10 and N = K = 100 and for values of M such that M/N [0.1, 1], the output of FISTA was always an orthogonal matrix, i.e., the success rate is always one.

3.4 QUANTIZING NEURAL NETWORK WEIGHTS

We now provide another application example for binarizing and ternarizing neural network weights. Our goal is to further highlight the simplicity, versatility, and effectiveness of CS regularizers.

3.4.1 METHOD

Our weight quantization procedure consists of three steps: (i) training with CS regularizers, (ii) weight quantization, and (iii) continued training of remaining parameters. We detail these steps below. In order to demonstrate solely the impact of CS regularizers, we neither modify the neural network architecture (e.g., we do not alter the layers or activations) nor the forward-backward propagation stages, since we do not introduce any non-differentiable operations during training.

Step 1: Regularized Training Let θ denote the set of all parameters of a neural network and L(θ) the loss function for learning the network s task. We solve the following optimization problem: ˆθ = arg min θ L(θ) + λ PK k=1 ηk ℓ(θk). (20)

Here, λ R 0 is a regularization parameter, θk is a vector consisting of the network weights which should share a common scaling factor (this can, for example, be an entire layer, one convolution kernel, or any other subset of network parameters), ηk is the associated normalization factor (e.g., the reciprocal value of the dimension of θk), and ℓcan be any CS regularizer (e.g., ℓbin or ℓter).

After training the network parameters with the CS regularizers for a given number of epochs, the weights contained in the vectors θk will be concentrated around { αk, αk} for symmetric binaryvalued regularization or { αk, 0, αk} for symmetric ternary-valued regularization for some αk > 0.

Step 2: Quantization The goal is to quantize the regularized weights {θk}K k=1 from the previous step. In what follows, we describe the quantization procedure for one weight vector w = θk.

We binarize the regularized weight vector w according to ˆw arg minx Xbin w x 2 2 with Xbin { x { α, α}N : α R} as in Rastegari et al. (2016), which is given by

ˆwn = α sign(wn), n = 1, . . . , N, with α = w 1/N. (21)

We ternarize the regularized weight vector w = 0N according to ˆw arg minx Xter w x 2 2 with Xter { x { α, 0, α}N : α R} as in Li et al. (2016), which is accomplished as follows: Let Iτ = {i : |wi| τ}. Then, ﬁnd the threshold that determines which entries of ˆw are nonzero as

τ = arg max τ {|wi|:i I}

i Iτ |wi|)2. (22)

Finally, compute the ternarized vector as

ˆwn = α sign(wn), |wn| τ

0, otherwise, n = 1, . . . , N, with α = 1 |Iτ | P

i Iτ |wi|. (23)

Step 3: Training with Quantized Weights After weight quantization, the number of trainable parameters is signiﬁcantly reduced since we now have only one scale factor for a vector of quantized weights. Hence, we ﬁx the signs of the weights and continue training only their shared scale factors alongside other tunable network parameters (e.g., biases, batch normalization parameters, etc.) without the use of CS regularizers and for a small number of epochs.

Published as a conference paper at ICLR 2025

3.4.2 EXPERIMENTAL RESULTS

We conduct experiments on the benchmark datasets Image Net (ILSVRC12) (Deng et al., 2009) and CIFAR-10 (Krizhevsky, 2009) for image classiﬁcation using Py Torch (Paszke et al., 2019). We follow classical data augmentation strategies as detailed in App. D.1.

Implementation As in Rastegari et al. (2016); Qin et al. (2020); He et al. (2020), we regularize and quantize all network layers except for the ﬁrst convolutional layer and the last fully-connected layer. For convolutional layers, we apply the CS regularizer to vectors consisting of the weights in all kernels that produce one output channel; this leads to one scaling factor for each output channel following the approach from Rastegari et al. (2016). For fully-connected layers, we use one CS regularizer for each row of the weight matrix; this leads to one scaling factor for each output feature. We set the weights ηk in (20) to the reciprocal of the dimension of the corresponding weight vector θk.

For Image Net, we use Res Net-18 (He et al., 2016), initialize the weights with a pretrained fullprecision model from Py Torch, and train the network for 40 and 20 epochs in Steps 1 and 3, respectively, with a batch size of 1024. For CIFAR-10, we use Res Net-20, initialize the weights with a pretrained full-precision model from Idelbayev (2021) similarly to Qin et al. (2020), and train the network for 400 and 20 epochs in Steps 1 and 3, respectively, with a batch size of 128. For both datasets, we set λ = 10 for binarization and λ = 105 for ternarization. 5 We use the Adam optimizer (Kingma & Ba, 2017) with its learning rate initialized by 0.001 and the cosine annealing learning rate scheduler (Loshchilov & Hutter, 2016).

Weight Distribution Fig. 2 illustrates the impact of CS regularizers on the network weights with histograms for one output channel of one convolutional layer based on training Res Net-18 on Image Net. We observe in Fig. 2(a) that the weight distribution of the pretrained network resembles that of a zero-mean Gaussian distribution. Figures 2(b) and (c) reveal, as expected, that the weights are becoming more concentrated around binary values after 5 and 20 epochs of training with the regularizer ℓbin, respectively. Figures 2(d) and (e) reveal a similar behavior when using ℓter.

Performance Evaluation In App. D.3, Tables 8-10, we provide the top-1 accuracy for our methods along with the full-precision baseline and various SOTA baselines that binarize or ternarize the weights of the network while the activations are left in full precision. All SOTA top-1 accuracy results in Tables 8-10 are taken from the corresponding papers.

For Image Net, we observe from Tables 8 and Table 9 that for binary-valued weights, our approach outperforms SQ-BWN (Dong et al., 2017), BWN (Rastegari et al., 2016), HWGQ (Cai et al., 2017), PCCN (Gu et al., 2019), and the ternary TWN (Li et al., 2016), while the accuracy we achieve is 4.9% lower than the best SOTA method Proxy BNN (He et al., 2020). For ternary-valued weights, our approach outperforms TWN and SQ-TWN (Dong et al., 2017), while the accuracy we achieve is 2.8% lower than the SOTA method QIL (Jung et al., 2019).

For CIFAR-10, we observe from Table 10 that for binary-valued weights, our approach outperforms Do Re Fa-Net by a small margin and achieve the same performance as LQ-Net (Zhang et al., 2018), while the accuracy of our approach is 0.9% lower than the SOTA methods DAQ (Kim et al., 2021) and LCR-BNN (Shang et al., 2022); these two methods are also the only methods outperforming our ternary-valued approach by 0.2%.

While some of the SOTA methods achieve better accuracy than our approach, our results (i) require a simpler training procedure6 and (ii) showcase the potential of CS regularizers: We only have one step of regularized training with full-precision weights, a quantization step, and a second step of training with fewer parameters; this procedure does not require any additional storage at any stage of training. In contrast, all of the baseline methods retain both the quantized and full-precision values for the weights during training, and use the quantized weights in forward and backward propagation while the full-precision values are updated with the gradients calculated with respect to the quantized values. This results in additional storage. Moreover, to reduce the quantization error or to alleviate

5We chose the regularization parameter λ empirically based on using 1/10th of the training sets for validation. We have observed that changes by a factor of 10 in λ do not have a substantial impact on the resulting accuracies. Please see Tables 3-6 in App. D.2 for an ablation study for varying λ. 6Please see Table 7 for the advantages/disadvantages of our training strategy compared to the SOTA methods.

Published as a conference paper at ICLR 2025

2 0 2 weight values

(a) Pretrained

0.05 0.00 0.05 weight values

(b) ℓbin, 5 epochs

0.04 0.00 0.04 weight values

(c) ℓbin, 20 epochs

0.06 0.00 0.06 weight values

(d) ℓter, 5 epochs

0.05 0.00 0.05 weight values

(e) ℓter, 20 epochs

Figure 2: Neural network weight histograms of one output channel of a convolutional layer in the pretrained Res Net-18 model and the model after regularized training with ℓbin or ℓter over Image Net.

the mismatch between forward and backward propagation, references Gu et al. (2019); Hu et al. (2018); Kim et al. (2021); He et al. (2020); Yang et al. (2019); Zhang et al. (2018); Jung et al. (2019) introduce more trainable parameters and Shang et al. (2022) proposes a regularization method that requires the construction of matrices that scale with the square of the number of features in one layer. Please see Table 7 in App. D.3 for a comparison of the additional variables introduced by each SOTA method.

We conclude by noting that the proposed CS regularizers could be combined with any of these existing approaches for possible further accuracy improvements.

4 LIMITATIONS

While the proposed CS regularizers provide a recipe for designing regularization functions with a wide variety of properties, they suffer from a range of limitations that we summarize next.

First and foremost, CS regularizers are typically nonconvex. While we have been able to prove that some of the proposed nonconvex CS regularizers are free of any spurious stationary points (and thus are invex), convergence to a global minimum depends on the combination of the objective, regularizer, and optimization algorithm. Thus, multiple random restarts of the optimizer might be necessary. Depending on the speciﬁc regularizer, a solution algorithm that exploits invexity, such as the one presented in Barik et al. (2023), could be utilized instead.

Furthermore, some of the CS regularizers involve higher-order polynomials (e.g., the ternarization regularizer in (10) involves eighth-order polynomials), which can result in poor convergence behavior, especially around their minimum. To counteract this issue, one can either resort to scale-invariant versions as outlined in App. B.5 or to non-differentiable variants as outlined in App. B.4. In addition, utilizing adaptive step-size selection methods and schedules that adapt (e.g., increase) the regularization parameter λ over iterations could also be used to improve convergence.

Finally, we have only scratched the surface of many of the generalizations and variations put forward in Section 2.5 and in App. B. Besides that, we have investigated the efﬁcacy of CS regularizers with only four example tasks, i.e., solving underdetermined systems of linear equations for recovering discrete-valued solutions, eigenvectors, and matrices with orthogonal columns, and neural network weight quantization with a simple training recipe, in Section 3. A thorough theoretical analysis and simulative study of alternative CS regularizers in a broader range of applications, as well as combining CS regularizers with sophisticated SOTA quantized neural network training procedures, such as the ones in He et al. (2020); Jung et al. (2019), are left for future work.

5 CONCLUSIONS

We have proposed Cauchy Schwarz (CS) regularizers, a novel class of regularization functions that can be designed to promote a wide range of properties. We have derived example regularization functions that promote discrete-valued vectors, eigenvectors to matrices, or matrices with orthogonal columns, and we have outlined a range of specializations, variations, and generalizations that lead to an even broader class of new and possibly more powerful regularizers. For solving underdetermined systems of linear equations, we have shown that CS regularizers can outperform well-established baseline methods, such as ℓ -norm or ℓ1-norm minimization. For weight quantization of neural networks, we have shown that utilizing CS regularizers enables one to achieve competitive accuracy to existing quantization methods while using a simple training procedure.

Published as a conference paper at ICLR 2025

Yu Bai, Yu-Xiang Wang, and Edo Liberty. Prox Quant: Quantized neural networks via proximal operators. In International Conference on Learning Representations, ICLR, 2019.

Adarsh Barik, Suvrit Sra, and Jean Honorio. Invex programs: First order algorithms and their convergence. ar Xiv preprint ar Xiv:2307.04456, 2023.

Amir Beck and Marc Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419 2434, January 2009.

A. Ben-Israel and B. Mond. What is invexity? J. Aust. Math. Soc. Ser. B. Appl. Math., 28(1):1 9, July 1986. doi: 10.1017/S0334270000005142.

T. Tony Cai, Guangwu Xu, and Jun Zhang. On recovery of sparse signals via ℓ1-norm minimization. IEEE Transactions on Information Theory, 55(7):3388 3397, June 2009.

Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Oscar Castañeda, Sven Jacobsson, Giuseppe Durisi, Mikael Coldrey, Tom Goldstein, and Christoph Studer. 1-bit massive MU-MIMO precoding in VLSI. IEEE J. Emerging Sel. Topics Circuits Syst., 7(4):508 522, December 2017.

Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Learning sparse low-precision neural networks with learnable regularization. IEEE Access, 8:96963 96974, 2020.

Clayton W. Commander. Maximum cut problem, MAX-CUT, pp. 1991 1999. Springer US, Boston, MA, 2009. ISBN 978-0-387-74759-0. doi: 10.1007/978-0-387-74759-0_358. URL https: //doi.org/10.1007/978-0-387-74759-0_358.

Sajad Darabi, Mouloud Belbahri, Matthieu Courbariaux, and Vahid Partovi Nia. Regularized binary network training. ar Xiv preprint ar Xiv:1812.11800, 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Image Net: A large-scale hierarchical image database. In Proc. IEEE Conf. Comp. Vision and Patt. Recog., pp. 248 255, May 2009. doi: 10.1109/CVPR.2009.5206848.

Yinpeng Dong, Jianguo Li, and Renkun Ni. Learning accurate low-bit deep neural networks with stochastic quantization. In BMVC, 01 2017. doi: 10.5244/C.31.189.

Jonathan Eckstein and Dimitri P. Bertsekas. On the douglas-rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55: 293 318, 1992.

Matthias Joachim Ehrhardt and Simon R. Arridge. Vector-valued image processing by parallel level sets. IEEE Transactions on Image Processing, 23(1):9 18, 2014. doi: 10.1109/TIP.2013.2277775.

Ahmed T Elthakeb, Prannoy Pilligundla, Fatemehsadat Mireshghallah, Tarek Elgindi, Charles-Alban Deledalle, and Hadi Esmaeilzadeh. Wave Q: Gradient-based deep quantization of neural networks through sinusoidal adaptive regularization. ar Xiv preprint ar Xiv:2003.00146, 2020.

Richard Everson. Orthogonal, but not orthonormal, Procrustes problems. Advances in Comp. Math., 3(4), 1998.

Tom Goldstein. fasta-matlab. https://github.com/tomgoldstein/fasta-matlab. Accessed: 2024-03-30.

Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4852 4861, 2019.

Published as a conference paper at ICLR 2025

Jiaxin Gu, Ce Li, Baochang Zhang, Jungong Han, Xianbin Cao, Jianzhuang Liu, and David Doermann. Projection convolutional neural networks for 1-bit CNNs via discrete back propagation. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 33, pp. 8344 8351, July 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. in IEEE Conf. on Comp. Vision and Pattern Recognition (CVPR), pp. 770 778, June 2016. doi: 10.1109/CVPR.2016.90.

Xiangyu He, Zitao Mo, Ke Cheng, Weixiang Xu, Qinghao Hu, Peisong Wang, Qingshan Liu, and Jian Cheng. Proxy BNN: Learning binarized neural networks via proxy matrices. In European Conference on Computer Vision, pp. 223 241. Springer, August 2020.

Otto Hölder. Ueber einen Mittelwerthssatz. Nachrichten von der Königlichen Gesellschaft der Wissenschaften und der Georg-Augusts-Universität zu Göttingen, (2):38 47, 1889.

Lu Hou, Quanming Yao, and James T Kwok. Loss-aware binarization of deep networks. ar Xiv preprint ar Xiv:1611.01600, 2016.

Qinghao Hu, Peisong Wang, and Jian Cheng. From hashing to CNNs: Training binary weight networks via hashing. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 32, April 2018.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1 30, 2018.

Pei-Hen Hung, Chia-Han Lee, Shao-Wen Yang, V Srinivasa Somayazulu, Yen-Kuang Chen, and Shao-Yi Chien. Bridge deep learning to the physical world: An efﬁcient method to quantize network. In IEEE Workshop on Signal Processing Systems (Si PS). IEEE, 2015.

Yerlan Idelbayev. Proper Res Net implementation for CIFAR10/CIFAR100 in Py Torch. https: //github.com/akamaster/pytorch_resnet_cifar10, 2021. Accessed: 2024-0430.

Ian T Jolliffe. Principal component analysis for special types of data. Springer, 2002.

Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Jae-Joon Han, Youngjun Kwak, Sung Ju Hwang, and Changkyu Choi. Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4350 4359, 2019.

Dohyung Kim, Junghyup Lee, and Bumsub Ham. Distance-aware quantization. In IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5251 5260, October 2021. doi: 10.110 9/ICCV48922.2021.00522.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, January 2017.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. URL https: //api.semanticscholar.org/Corpus ID:18268744.

Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with admm. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 32, 2018.

Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. ar Xiv preprint ar Xiv:1605.04711, 2016.

Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, and Mohsen Imani. Cauchy-schwarz fairness regularizer, 2025. URL https://openreview.net/forum?id=Jwo QZ9NKt H.

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. ar Xiv preprint ar Xiv:1608.03983, 2016.

Published as a conference paper at ICLR 2025

Zhi-quan Luo, Wing-kin Ma, Anthony Man-cho So, Yinyu Ye, and Shuzhong Zhang. Semideﬁnite relaxation of quadratic optimization problems. IEEE Signal Processing Magazine, 27(3):20 34, May 2010.

Olvi L. Mangasarian and Benjamin Recht. Probability of unique integer solution to a system of linear equations. Eur. J. Oper. Res., 214(1):27 30, October 2011.

Yoshiki Matsuda. Benchmarking the MAX-CUT problem on the simulated bifurcation machine, 2019. URL https://medium.com/toshiba-sbm/benchmarking-the-max-cut-pr oblem-on-the-simulated-bifurcation-machine-e26e1127c0b0. Accessed: 2024-06-27.

Maxim Naumov, Utku Diril, Jongsoo Park, Benjamin Ray, Jedrzej Jablonski, and Andrew Tulloch. On periodic functions as regularizers for quantization of neural networks. ar Xiv preprint ar Xiv:1811.09862, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Py Torch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024 8035. 2019.

Samuel Pinilla, Tingting Mu, Neil Bourne, and Jeyan Thiyagalingam. Improved imaging by invex regularizers with global optima guarantees. Advances in Neural Information Processing Systems, 35:10780 10794, 2022.

Jose C Principe. Information theoretic learning: Renyi s entropy and kernel perspectives. Springer Science & Business Media, 2010.

Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song. Forward and backward information retention for accurate binary neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2247 2256, June 2020. doi: 10.1109/CVPR42600.2020.00232.

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Image Net classiﬁcation using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525 542. Springer, 2016.

Ryan Razani, Gregoire Morin, Eyyub Sari, and Vahid Partovi Nia. Adaptive binary-ternary quantization. In Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4613 4618, June 2021.

Peter H Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1 10, 1966.

Sohil Shah, Abhay Kumar Yadav, Carlos D Castillo, David W Jacobs, Christoph Studer, and Tom Goldstein. Biconvex relaxation for semideﬁnite programming in computer vision. pp. 717 735, September 2016.

Yuzhang Shang, Dan Xu, Bin Duan, Ziliang Zong, Liqiang Nie, and Yan Yan. Lipschitz continuity retained binary neural network. In European Conference on Computer Vision, pp. 603 619. Springer, October 2022.

Kirill Solodskikh, Vladimir Chikin, Ruslan Aydarkhanov, Dehua Song, Irina Zhelavskaya, and Jiansheng Wei. Towards accurate network quantization with equivalent smooth regularizer. In European Conference on Computer Vision, pp. 727 742. Springer, 2022.

J. Michael Steele. The Cauchy Schwarz Master Class: an Introduction to the Art of Mathematical Inequalities. Cambridge University Press, 2004.

Christoph Studer, Tom Goldstein, Wotao Yin, and Richard G. Baraniuk. Democratic representations. ar Xiv:1401.3420, April 2015.

Published as a conference paper at ICLR 2025

Sueda Taner and Christoph Studer. ℓp ℓq-norm minimization for joint precoding and peak-to-averagepower ratio reduction. In Proc. Asilomar Conf. Signals, Syst., Comput., pp. 437 442, October 2021.

Wei Tang, Gang Hua, and Liang Wang. How to train a compact binary neural network with high accuracy? In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 31, 2017.

Linh Tran, Maja Pantic, and Marc Peter Deisenroth. Cauchy Schwarz regularized autoencoder. Journal of Machine Learning Research, 23(115):1 37, 2022.

Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. Weighted quantizationregularization in DNNs for weight memory minimization toward HW implementation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(11):2929 2939, 2018.

Sheng Xu, Yanjing Li, Teli Ma, Mingbao Lin, Hao Dong, Baochang Zhang, Peng Gao, and Jinhu Lu. Resilient binary neural network. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 37, pp. 10620 10628, 2023.

Huanrui Yang, Lin Duan, Yiran Chen, and Hai Li. BSQ: Exploring bit-level sparsity for mixedprecision neural network quantization. ar Xiv preprint ar Xiv:2102.10462, 2021.

Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xiansheng Hua. Quantization networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7308 7316, 2019.

Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. LQ-nets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pp. 365 382, 2018.

Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. ar Xiv preprint ar Xiv:1606.06160, 2016.

Published as a conference paper at ICLR 2025

Appendix: Cauchy Schwarz Regularizers

A PROOFS AND DERIVATIONS

A.1 PROOF OF PROPOSITION 1

From the Cauchy Schwarz inequality (Steele, 2004) follows that

| g(x), h(x) | g(x) 2 h(x) 2. (24)

Squaring both sides of (24) and rearranging terms leads to

0 g(x) 2 2 h(x) 2 2 | g(x), h(x) |2 ℓ(x). (25)

Equality in (24) holds iff g(x) h(x), for which ℓ(x) = 0.

A.2 PROOF OF LEMMA 1

Assume that h(x) = 0. Then,

g(x) βh(x) 2 2 β = 0 = β = g(x), h(x)

h(x) 2 2 . (26)

Plugging the right-hand-side into g(x) βh(x) 2 2 yields

min β R g(x) βh(x) 2 2 = g(x) 2 2 | g(x), h(x) |2

h(x) 2 2 . (27)

Multiplying both sides by h(x) 2 2 results in

ℓ(x) = h(x) 2 2 min β R g(x) βh(x) 2 2. (28)

If h(x) = 0, then (28) still holds. By swapping g(x) with h(x), the equalities in (5) follow.

A.3 DERIVATION OF REGULARIZER 1

Regularizer 1 is minimized by vectors x RN that satisfy the linear dependence condition g(x) h(x) for the speciﬁc choices g(x) = [x2 1, . . . , x2 N]T and h(x) = 1N. We have

g(x) h(x) (a1, a2) R2\{(0, 0)} : a1x2 n = a2, n = 1, . . . , N. (29)

If a1 = 0, then x2 n = a2/a1 which implies xn = α, n = 1, . . . , N, for some α R. If a1 = 0 then a2 = 0, so the condition a1x2 n = a2 cannot be satisﬁed; this implies that the only vectors that satisfy ℓbin(x) = 0 from (6) are in the following set:

Xbin = x { α, α}N : α R . (30)

The same result would also follow directly from the inspection of (7).

To establish the fact that Regularizer 1 does not have any spurious stationary points (thus, that the regularizer is invex), we need to show that ℓbin(x) = 0 iff x Xbin. To this end, we inspect

xn = 4Nx3 n 4|[x]|2xn = 4xn(Nx2 n |[x]|2) = 0, n = 1, . . . , N. (31)

Clearly, every vector x Xbin satisﬁes (31). For any other vector, the gradient is nonzero. To prove this, it is sufﬁcient to show that the derivative is nonzero for a two-dimensional, non-binary-valued vector, because any vector with a non-binary-valued subvector is non-binary-valued (and any nonbinary-valued vector has a non-binary-valued subvector). To this end, let x = [α, β]T for α = β, α = β, and α, β = 0. Then, we have

x1 = 4α(2α2 (α2 + β2)) = 4α(α2 β2) = 0, (32)

which concludes our proof.

Published as a conference paper at ICLR 2025

A.4 DERIVATION OF REGULARIZER 2

Regularizer 2 is minimized by vectors x RN that satisfy the linear dependence condition g(x) h(x) for the speciﬁc choices g(x) = [x2 1, . . . , x2 N]T and h(x) = x. We have g(x) h(x) (a1, a2) R2\{(0, 0)} : a1x2 n = a2xn, n = 1, . . . , N. (33) If xn = 0, then the condition a1x2 n = a2xn is trivially satisﬁed. If xn = 0, then we inspect a1xn = a2. If a1 = 0, then xn = a2/a1, which implies xn = α for some α R. If a1 = 0 then a2 = 0, so the condition a1xn = a2 cannot be satisﬁed; this implies that the only vectors that satisfy ℓosb(x) = 0 from (8) are in the following set: Xosb = x {0, α}N : α R . (34) The same result would also follow directly from the inspection of (9).

To establish the fact that Regularizer 2 does not have any spurious stationary points (thus, that the regularizer is invex), we need to show that ℓosb(x) = 0 iff x Xosb. To this end, we inspect

xn = 2xn |[x]|4 + 2x2 n|[x]|2 3xn|[x]|3 = 0, n = 1, . . . , N. (35) Clearly, every vector x Xosb satisﬁes (35). For any other vector, the gradient is nonzero. To prove this, it is sufﬁcient to show that the derivative is nonzero for a two-dimensional, non-one-sidedbinary-valued (non-OSB) vector, because any vector with a non-OSB subvector is non-OSB (and any non-OSB vector has a non-OSB subvector). Assume x = [α, β]T for α = β and α, β = 0. Then ℓosb(x)

x1 = 2αβ2(2α β)(α β), and by symmetry, ℓosb(x)

x2 = 2α2β(α 2β)(α β); this implies

that ℓosb(x)

x1 and ℓosb(x)

x2 cannot be zero simultaneously.

A.5 DERIVATION OF REGULARIZER 3

Regularizer 3 is minimized by vectors x RN that satisfy the linear dependence condition g(x) h(x) for the speciﬁc choices g(x) = [x3 1, . . . , x3 N]T and h(x) = x. We have g(x) h(x) (a1, a2) R2\{(0, 0)} : a1x3 n = a2xn, n = 1, . . . , N. (36) If xn = 0, then the condition a1x2 n = a2xn is trivially satisﬁed. If xn = 0, then we inspect a1x2 n = a2. If a1 = 0, then x2 n = a2/a1, which implies xn = α for some α R. If a1 = 0 then a2 = 0, so the condition a1x2 n = a2 cannot be satisﬁed; this implies that the only vectors that satisfy ℓosb(x) = 0 from (10) are in the following set: Xter = x { α, 0, α}N : α R . (37) The same result would also follow directly from the inspection of (11).

To establish the fact that Regularizer 3 does not have any spurious stationary points (thus, that the regularizer is invex), we need to show that ℓter(x) = 0 iff x Xter. To this end, we inspect

xn = 2xn |[x]|6 + 3|[x]|2x4 n 4|[x]|4x2 n = 0, n = 1, . . . , N. (38) Clearly, every vector x Xter satisﬁes (38). For any other vector, the derivative is nonzero. To prove this, it is sufﬁcient to show that the derivative is nonzero for a two-dimensional, non-ternary-valued vector, because any vector with a non-ternary-valued subvector is non-ternary-valued (and, any non-ternary-valued vector has a non-ternary-valued subvector). Assume x = [α, β] for α = β, α = β and α, β = 0. Then ℓter(x)

x1 = 2αβ2(3α2 β2)(α2 β2), and, by symmetry, ℓter(x)

2α2β(α2 3β2)(α2 β2); this implies that ℓter(x)

x1 and ℓter(x)

x2 cannot be zero simultaneously.

A.6 DERIVATION OF REGULARIZER 4

Regularizer 4 is minimized by vectors x RN that satisfy the linear dependence condition g(x) h(x) for the speciﬁc choices g(x) = Cx and h(x) = x. By deﬁnition, we have that g(x) h(x) if and only if x is an eigenvector of C.

For the sake of completeness, we provide the stationary point analysis of Regularizer 4 below: xℓeig(x) = 2( x 2 2CT C + Cx 2 2IN 2(x T Cx))x = 0N. (39) Clearly, every eigenvector x of C satisﬁes (39). While it appears to be unlikely that any other x would satisfy (39), we are unable to provide a formal proof that rules out the existence of other solutions. Therefore, we refrain from claiming that Regularizer 4 has no spurious stationary points.

Published as a conference paper at ICLR 2025

A.7 DERIVATION OF REGULARIZER 5

Regularizer 5 is minimized by matrices X RN K that satisfy the linear dependence condition g(x) h(x) for the speciﬁc choices g(X) vec(XTX) and h(x) vec(IK).

To establish the fact that Regularizer 5 does not have any spurious stationary points (thus, that the regularizer is invex), we need to show that ℓom(x) = 0 iff x Xom. To this end, we inspect

Xℓom(X) = 4NXXTX 4 X 2 FX = 0 (40)

4X(KXTX X 2 FIK) = 0 (41)

KXTX = X 2 FIK, (42)

which implies that X must have orthogonal columns of equal length.

A.8 PROOF OF PROPOSITION 2

From the triangle inequality and Hölder s inequality (Hölder, 1889) follows that

| g(x), h(x) | PN n=1 [g(x)]n[h(x)]n g(x) p h(x) q, (43)

where p, q 1 so that 1

q = 1. Raising the left-hand and the right-hand sides of (43) to the power of r > 0 and rearranging terms leads to

0 g(x) r p h(x) r q | g(x), h(x) |r ℓ(x). (44)

Both inequalities in (43) hold iff g(x) h(x), for which ℓ(x) = 0.

B GENERALIZATIONS AND VARIATIONS OF CS REGULARIZERS

B.1 BEYOND VECTOR TERNARIZATION

We now show one approach that generalizes CS regularizers to a symmetric, discrete-valued set with 2B equispaced entries. The idea behind this approach is as follows: (i) decompose x RN into a sum of B auxiliary vectors x = PB b=1 yb with yb RN and (ii) apply one regularization function to the auxiliary vectors yb, b = 1, . . . , B.

Deﬁne g {yb}B b=1 = g(y1)T, . . . , g(y B)T T using g(y) = [y2 1, . . . , y2 N]T and h {yb}B b=1 = h1(y1)T, . . . , h B(y B)T T using hb(y) = 4b 11N. Then, Proposition 1 yields the following CS regularizer that promotes symmetric equispaced-valued vectors; see App. B.1.1 for the derivation. Regularizer 6 (Symmetric Equispaced). Let yb RN for b = 1, . . . , B and deﬁne

ℓequ {yb}B b=1 KN PB b=1 |[yb]|4 PB b=1 4b 1|[yb]|2 2 (45)

with K PB b=1 42(b 1). Then, the nonnegative function (45) is only zero for vectors yb { 2b 1α, 2b 1α}N 0N, b = 1, . . . , B, for any α R; this implies that the sum of these vectors x PB b=1 yb is in the set Xequ { (2k 1)α}2B 1 k=1 with |Xequ| = 2B. Furthermore, ℓequ does not have any spurious stationary points.

To gain insight into Regularizer 6, we invoke Lemma 1 and obtain

ℓequ({yb}B b=1) = KN min β R

([yb]n 2b 1p

β)([yb]n + 2b 1p

We also observe this CS regularizer s auto-scale property and only vectors of the form yb { 2b 1α, 2b 1α}N for some α R minimize (46). This implies that the vectors x are of the form x { (2B 1)α, . . . , 3α, α, α, 3α, . . . , (2B 1)α}N for some α R.

In contrast to the initially introduced binarization and ternarization regularizers, Regularizer 6 introduces additional optimization parameters, multiplying the number of the optimization parameters by B. We provide an example use case of Regularizer 6 with simulation results for recovering two-bit solutions to underdetermined linear systems in App. C.5.

Published as a conference paper at ICLR 2025

B.1.1 DERIVATION OF REGULARIZER 6

Regularizer 6 is minimized by vectors {yb}B b=1 that satisfy the linear dependence condition g {yb}B b=1 h {yb}B b=1 for the speciﬁc choices g {yb}B b=1 = [ g(y1), . . . , g(y B)]T using g(y) = [y2 1, . . . , y2 N]T and h {yb}B b=1 = [ h(y1), . . . , h(y B)]T using h(yb) = 4b 11N. We have

g {yb}B b=1 h {yb}B b=1

(a1, a2) R2\{(0, 0)} : a1y2 b,n = a24b 1, b = 1, . . . , B, n = 1, . . . , N. (47)

If a1 = 0, then y2 n,b = a2 a1 4b 1 which implies yn,b = 0 or yb,n = α2b 1, n = 1, . . . , N, b = 1, . . . , B, for some α R. If a1 = 0 then a2 = 0, so the condition a1y2 b,n = a24b 1 cannot be satisﬁed; this implies that the only vectors yb that satisfy ℓequ {yb}B b=1 = 0 from (45) are in the following set: Yb,α = { 2b 1α, 2b 1α}N 0N, (48) with yb Yb,α, b = 1, . . . , B for any α R. The same result would also follow directly from the inspection of (46). Then, the vectors x = PB b=1 yb are in the following set:

Xequ = x { 2B 1α, . . . , 3α, α, α, 3α, . . . , 2B 1α}N : α R . (49)

To establish the fact that Regularizer 6 does not have any spurious stationary points (thus, that the regularizer is invex), we need to show that ℓequ(yb) = 0, b = 1, . . . , B iff yb Yb,α, b = 1, . . . , B for any α R. To this end, we inspect ℓequ({yb}B b=1) [yb]n = 4[yb]n KN([yb]n)2 4b 1 PB b=1 4 b 1|[y b]|2 = 0 (50)

for n = 1, . . . , N and b = 1, . . . , B. Clearly, yb Yb,α, b = 1, . . . , B satisﬁes (50). For a set of vectors in any other form, the derivative is nonzero. To prove this, it is sufﬁcient to show that the derivative is nonzero for a pair of scalars (i.e., N = 1) (yb, yb ) for yb = 0, |yb| = 2b 1|α| and |yb | = 2b 1|α|, because any pair of vectors including these entries would not satisfy (50) (and any set of vectors that do not satisfy (50) must have such a pair of entries). We have that ℓequ(yb, yb )

yb = 4yb42(b 1) y2 b 4b 1α2 = 0, (51)

which concludes our proof.

B.2 BOUNDED CS REGULARIZERS FOR VECTOR BINARIZATION

We now propose alternative binarization regularizers that avoid potential numerical issues caused by higher-order polynomials.

Deﬁne b(x) (1 + x2) 1 and g(x) [b(x1), . . . , b(x N)]T. Furthermore, let h(x) 1N. Then, Proposition 1 yields the following CS regularizer that promotes symmetric binary-valued vectors. Regularizer 7 (Bounded Symmetric Binarizer). Let x RN and deﬁne

ℓbbin(x) N PN n=1 1 (1+x2n)2 PN n=1 1 1+x2n

Then, the nonnegative function in (52) is only zero for one-sided binary-valued vectors, i.e., iff x { α, α}N for any α R. Furthermore, ℓbbin(x) does not have any spurious stationary points.

An alternative binarization regularizer can be obtained as follows. Deﬁne b(x) e x2 and g(x) [b(x1), . . . , b(x N)]T. Furthermore, let h(x) 1N. Then, Proposition 1 yields the following CS regularizer that promotes symmetric binary-valued vectors. Regularizer 8 (Alternative Bounded Symmetric Binarizer). Let x RN and deﬁne

ℓbin,exp(x) N PN n=1 e 2x2 n PN n=1 e 2x2 n 2 (53)

Then, the nonnegative function in (53) is only zero for one-sided binary-valued vectors, i.e., iff x {0, α}N for any α R.

Note that by normalizing (52) and (53) with 1/N 2, the maximum value of the resulting CS regularizer is bounded from above by 1.

Published as a conference paper at ICLR 2025

B.2.1 DERIVATION OF REGULARIZER 7 FROM APP. B.2

Regularizer 7 is minimized by vectors x RN that satisfy the linear dependence condition g(x) h(x) for the speciﬁc choices g(x) = [(1 + x1) 2, . . . , (1 + x N) 2]T and h(x) = 1N. We have

g(x) h(x) (a1, a2) R2\{(0, 0)} : a1(1 + xn)2 = a2, n = 1, . . . , N. (54)

If a1 = 0, then (1 + xn)2 = a2/a1 which implies xn = α, n = 1, . . . , N, for some α R. If a1 = 0 then a2 = 0, so the condition a1(1 + xn)2 = a2 cannot be satisﬁed; this implies that the only vectors that satisfy ℓbbin(x) = 0 from (52) are in Xbin.

To establish the fact that Regularizer 7 does not have any spurious stationary points (thus, that the regularizer is invex), we need to show that ℓbbin(x) = 0 iff x Xbin. To this end, we inspect

xn = 4(1 + x2 n) 2xn N 1+x2n + PN k=1 1 1+x2 k = 0, n = 1, . . . , N. (55)

Clearly, every vector x Xbin satisﬁes (55), while the derivative is nonzero for any other vector, similarly to (31).

B.2.2 DERIVATION OF REGULARIZER 8 FROM APP. B.2

Regularizer 8 is minimized by vectors x RN that satisfy the linear dependence condition g(x) h(x) for the speciﬁc choices g(x) = [e x2 1, . . . , e x2 N ]T and h(x) = 1N. We have

g(x) h(x) (a1, a2) R2\{(0, 0)} : a1e x2 n = a2, n = 1, . . . , N. (56)

If a1 = 0, then e x2 n = a2/a1 which implies xn = α, n = 1, . . . , N, for some α R. If a1 = 0 then a2 = 0, so the condition a1e x2 n = a2 cannot be satisﬁed; this implies that the only vectors that satisfy ℓbin,exp(x) = 0 from (53) are in Xbin.

To establish the fact that Regularizer 7 does not have any spurious stationary points (thus, that the regularizer is invex), we need to show that ℓbin,exp(x) = 0 iff x Xbin. To this end, we inspect

ℓbin,exp(x)

xn = 4 Ne 2x2 n + e x2 n PN k=1 e x2 k xn = 0, n = 1, . . . , N. (57)

Clearly, every vector x Xbin satisﬁes (57), while the derivative is nonzero for any other vector, similarly to (31).

B.3 CS REGULARIZERS FOR DISCRETE-VALUED VECTORS WITH FIXED SCALE

If one is, for example, interested in promoting binary-valued vectors with predeﬁned scale, i.e., x { α, α}N but for a given ﬁxed value of α, then one can use g(x) [x2 1, . . . , x2 N, α2]T and h(x) 1N+1 in (4). We note, however, that this particular binarization regularizer with α = 1 has been utilized before in Tang et al. (2017); similar regularizers can be found in Hung et al. (2015); Darabi et al. (2019). In general, the idea of augmenting the functions g and h with constants removes the auto-scale property of CS regularizers.

B.4 NON-DIFFERENTIABLE CS REGULARIZERS

One can also develop non-differentiable variants of CS regularizers. For example, by deﬁning g(x) |x1|, . . . , |x N| T and h(x) 1N in Proposition 1, one obtains the CS regularizer ℓbin(x) N|[x]|2 x 2 1, (58) which also promotes symmetric binary-valued entries. Intriguingly, this regularizer is equal to the scaled empirical variance of the entry-wise absolute values of x RN, i.e., ℓbin(x) = N 2Var(|x|). One could also combine the idea of (58) with Proposition 2 using p = q = 2 and r = 1 to obtain ℓbin(x)

N x 2 x 1, (59) which also promotes symmetric binary-valued entries. Such alternative versions might result in better empirical convergence if, for example, they are used within auto-differentiation frameworks that allow for non-differentiable functions. We conclude by noting that the speciﬁc regularizer in (58) has been used in Taner & Studer (2021) for dynamic-range reduction of complex-valued data in wireless systems.

Published as a conference paper at ICLR 2025

Table 1: Comparison of regularizers for vector binarization. Advantages are designated by (+) and disadvantages by (-).

Regularizer Differentiable (+) Scale-adaptive (+) Requires additional optimization variables (-) P

n(|xn| 1)2 No No No P

n(|xn| β)2 No Yes Yes P

n(x2 n 1)2 Yes No No P

n(x2 n β)2 Yes Yes Yes Ours (ℓbin) Yes Yes No

B.5 SCALE-INVARIANT HÖLDER REGULARIZER

By slightly modifying the proof of Proposition 2, one can also develop Hölder regularizers that are scale-invariant, i.e., in which scaling the entire vector-valued function g(x) or h(x) with a nonzero constant has no impact on the regularizer s function value.

Proposition 3. Fix two vector-valued functions g, h: RN RN and deﬁne X as in (3). Let p, q 1 so that 1

q = 1 and let r > 0. Furthermore, set ε 0. Then, the nonnegative function

ℓ(x) g(x) r p h(x) r q + ε | g(x), h(x) |r + ε 1 (60)

is zero iff x X.

The proof of Proposition 3 follows that of App. A.8, but where we ﬁrst add ε 0 to the left-hand and right-hand sides of (43), followed by a division by | g(x), h(x) | + ε and rearranging terms to arrive at ℓ(x) in (60).

Such scale-invariant regularization functions require special attention. First, while the parameter ε > 0 prevents the denominator in (60) from becoming zero, only ε = 0 leads to a scale-invariant regularizer. Second, regularizers derived from Proposition 3 may have signiﬁcantly more spurious stationary points than those obtained via Proposition 2. Third, evaluating the gradient of regularizers derived from (60) is typically more involved. Nonetheless, their (approximately) scale-invariant property might turn out to be useful in some applications and outweigh the above drawbacks. We note that scale-invariant versions of CS regularizers can also be obtained as a special case of Proposition 3.

B.6 VECTORS IN THE NULLSPACE OF A GIVEN MATRIX

As our last regularizer, we propose a variant that promotes unit-norm vectors in the nullspace of a given (and ﬁxed) matrix C RM N. Deﬁne g(x) [(Cx)T, x 2 2 1, 1]T and h(x) [0T M 1, 0, 1]T. Then, Proposition 1 yields the following CS regularizer that promotes unit-norm vectors in the nullspace of C.

Regularizer 9 (Nullspace Vector). Fix C RM N with M N and let x RN. Deﬁne

ℓns(x) Cx 2 2 + ( x 2 2 1)2 (61)

Then, the nonnegative function in (61) is zero for only unit-norm vectors x in the nullspace of C, i.e., iff Cx = 0M 1 with x 2 2 = 1.

C ADDITIONAL RESULTS

C.1 COMPARISON OF THE BINARIZING CS REGULARIZER WITH EXISTING REGULARIZERS

Table 1 summarizes the key properties of existing regularizers from Section 1.3 and how our regularizer can be superior to those, i.e., by being differentiable, scale-adaptive, and avoiding additional optimization parameters.

Published as a conference paper at ICLR 2025

0.3 0.4 0.5 0.6 0.7 0.8 0.9 0

Figure 3: Probability of success in binary solution recovery with the CS regularizer ℓbin and three baselines.

0.3 0.4 0.5 0.6 0.7 0.8 0.9 0

Figure 4: Probability of success in binary solution recovery with ﬁve CS regularizer variants and an ℓ -norm minimization baseline.

C.2 COMPARISONS WITH BASELINES FOR BINARY RECOVERY

We follow the same experimental setup as in Section 3.1 and provide experiments for two additional baselines: (i) assuming that the scale β is known and ﬁxed as a constant and (ii) letting β be a separate (and explicit) optimization parameter (that is learned together with the entries of the vector). As mentioned in Section 1.3, we use ℓbin,β=1 PN n=1(x2 n β)2 with known and ﬁxed β = 1, and ℓbin,β PN n=1(x2 n β)2 with additional optimization parameter β.

In Fig. 3, we observe that both of these baseline methods achieve comparable recovery performance, but CS regularizers have the advantages of (i) not requiring to know the scale a-priori and (ii) not introducing additional optimization parameters.

C.3 ADDITIONAL RESULTS FOR MORE CS REGULARIZERS FOR BINARY RECOVERY

We follow the same experimental setup as Section 3.1 and provide experiments for four additional variants of CS regularizers: Here, ℓbin,H refers to the Hölder CS regularizer from (17) with p = q = 2, r = 1, ℓbin,si to the scale-invariant Hölder CS regularizer from (60) with p = q = 2, r = 1, ℓbin to the non-differentiable CS regularizer from (58), and ℓbin,exp to the bounded CS regularizer from (53).

In Fig. 4, we observe that while ℓbin,H has comparable success rate to ℓbin, the remaining variants are outperformed by the baseline ℓ -norm.

Published as a conference paper at ICLR 2025

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0

(a) One-sided binary

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0

(b) Symmetric ternary

Figure 5: Probability of success for recovering vectors with (a) one-sided binary and (b) symmetric ternary values dependent on the density ratio K/N.

C.4 ADDITIONAL RESULTS FOR SPARSE RECOVERY

In this subsection, we slightly modify our experimental setup from Section 3.1 in order to compare the solution recovery performance of ℓosband ℓter-minimization to that of ℓ1-norm minimization with respect to the sparsity of x . We ﬁx N = 100 and M = 75. We create vectors x RN with a ﬁxed number of K uniform randomly chosen nonzero entries; these nonzero entries are +1 for one-sided binary vectors, and are chosen i.i.d. with uniform probability from { 1, +1} for ternary vectors. We vary K from 20 to 80. For each K, we randomly generate 1000 problem instances and report the average success probability and the standard deviation from the mean. We only allow one random initialization, and the remaining details of the setup are the same as those presented in Section 3.1.

Fig. 5 demonstrates the success rate of (a) ℓosb-minimization and (b) ℓter-minimization compared to ℓ1-norm minimization with respect to the density ratio of x given by δ = K/N. In Fig. 5 (a), we observe that while the success rate of ℓ1-norm minimization reduces with δ > 0.3 and almost reaches 0 at δ = 0.5, the success rate of ℓosb-minimization is almost always 1 for any density ratio. In Fig. 5 (b), we observe that ℓ1-norm minimization follows the same trend as in Fig. 5 (a) as expected, while the success rate of ℓter-minimization increases with density. The success rate of ℓter-minimization surpasses that of ℓ1-norm minimization for δ > 0.4.

C.5 SIMULATION RESULTS FOR TWO-BIT SOLUTION RECOVERY FOR REGULARIZER 6

We provide an example use case of the symmetric equispaced regularizer from (45), similarly to those of the binary and ternary regularizers from Section 3.1. To this end, we consider systems of linear equations b = Ax, where A RM N has i.i.d. standard normal entries and M < N. For two-bit-valued solution vector recovery, we create two vectors y1 RN and y2 RN whose entries are chosen i.i.d. with uniform probability from { 1, +1} and { 2, +2}, respectively, and calculate x = y1 + y2. Then, we calculate b = Ax , and we try to recover the vector x from b by solving optimization problems of the form

ˆy1, ˆy2 arg min y1, y2 RN ℓequ( y1, y2) + λ A( y1 + y2) b 2 2 (62)

using a gradient descent algorithm speciﬁcally, FISTA with backtracking (Beck & Teboulle, 2009; Goldstein) for a maximum of 104 iterations. We ﬁx N = 10, λ = 10 5 and vary M between 3 and 9. For each M, we randomly generate 1000 problem instances, with at most 10 random initializations each, and report the average success probability. We declare success for recovering x if the returned solution ˆx satisﬁes ˆx x 2/ x 2 10 2. Fig. 6 shows the success probabilities with respect to the undersampling ratio γ along with (negligibly small) error bars calculated from the standard error of the mean. Here, we observe that ℓequ can recover symmetric two-bit vectors for large undersampling ratios with a reasonable success probability.

Published as a conference paper at ICLR 2025

0.3 0.4 0.5 0.6 0.7 0.8 0.9 0

Figure 6: Two-bit solution recovery for N = 10 with CS regularizer ℓequ.

C.6 SIMULATION RESULTS FOR EIGENVECTOR RECOVERY FROM SECTION 3.2

Please see Fig. 7. Here, we observe that minimizing ℓeig provides a higher success rate than the baseline ℓµ, which requires an additional optimization parameter compared to ℓeig.

0.3 0.4 0.5 0.6 0.7 0

Figure 7: Eigenvector recovery for N = 100 with the regularizers ℓeig and baseline ℓµ, where µ must also be learned in the baseline method.

C.7 APPROXIMATING MAXIMUM-CUT PROBLEMS WITH CS REGULARIZERS

We now showcase another application in which CS regularizers can be utilized. Speciﬁcally, CS regularizers can be used to ﬁnd approximate solutions to the well-known weighted maximum cut (MAX-CUT) problem (Commander, 2009). MAX-CUT of a graph is the partition of a graph s vertices into two disjoint sets such that the total weight of the edges between these two sets is maximized. For an undirected weighted graph G = (V, E), this maximization problem can be formulated as the following integer quadratic programming problem:

maximize x { 1,+1}N 1 2

1 i<j N wij(1 xixj)

Here, xi { 1, +1}, i = 1, . . . , N, denotes the binary set label for the ith vertex of the graph, N is the number of vertices, and wij R denotes the weight of the edge between the ith and jth vertices.

The MAX-CUT problem is NP-hard, and many approximations have been proposed in the literature. Classical approximations based on semideﬁnite and continuous relaxation; see, e.g, Commander (2009) and the references therein. Here, we propose a continuous reformulation that utilizes CS regularizers:

ˆx arg min x RN ℓMC(x) + λℓbin(x) subject to |xi| 1, (64)

Published as a conference paper at ICLR 2025

which we attempt to solve using a projected gradient descent, similarly to Section 2.2 with a ﬁxed maximum number of iterations.

To evaluate this approximate approach, we ran our projected gradient descent algorithm with random initializations for 10 trials. First, we considered small graphs; here, we set λ = 1.

For the N = 5, E = 7 graph from Matsuda (2019), we recovered the MAX-CUT solution in less than 300 iterations in each of the 10 trials with random initializations.

For a graph with N = 4 vertices, E = 5 edges, and weights w12 = 10, w13 = 20, w14 = 30, w24 = 40, and w34 = 50, we recovered the MAX-CUT in less than 40 iterations in each of the 10 trials.

Second, we considered the larger graphs, where wi,j { 1, 1}, given in Matsuda (2019) for benchmarking; here, we set λ = 10 7. The table below demonstrates the graph ID, the average cut values ℓMC for the initializations and our recovered solutions across 10 trials, and the maximum cut values from Matsuda (2019).

Table 2: Comparison of Our Cut and Known Maximum Cut (Matsuda, 2019) for Different Graphs

Graph ID Initial Cut Our Cut Known Maximum Cut (Matsuda, 2019)

G10 67.4 50.5 1769.1 26.8 2000 G11 7.2 14.9 482.0 10.46 564 G12 -0.2 24.3 470.8 13.7 556 G13 22.6 25.1 494.2 8.1 582

Our approach struggles to recover the MAX-CUT for large graphs; nonetheless, it signiﬁcantly improves the objective value compared to the initialization, demonstrating its potential. Moreover, with notable computational advantages over the method in Matsuda (2019), our approach shows promise and could inspire future research.

D DETAILS OF NEURAL NETWORK QUANTIZATION EXPERIMENTS FROM SECTION 3.4

D.1 DATASETS AND PREPROCESSING

Image Net has over 1.2 M training images and 50 k validation images from 1000 object classes. We train and evaluate our network on the training and validation splits, respectively, and report the top-1 accuracy for performance evaluation. We adopt a typical data augmentation strategy on the training images as resizing the shorter side of the images to 256 pixels, taking a random crop of size 224 224, and applying a random horizontal ﬂip. For validation, we apply the same resizing and take the center 224 224 crop.

CIFAR-10 (Krizhevsky, 2009) consists of over 50 k training images and 10 k testing images from 10 object classes. We adopt a typical data-augmentation strategy on the training images as padding by 4 pixels, taking a random 32 32 crop, and applying a random horizontal ﬂip. For testing, we use the original images.

D.2 THE IMPACT OF VARYING THE REGULARIZATION PARAMETER ON THE CLASSIFICATION ACCURACIES

In Tables 3-6, we provide the average accuracy and standard deviation across 10 runs for various values of λ. The chosen values are emphasized in bold. Here, we observe that changing the value of λ by a factor of 10 does not have a substantial impact on the accuracies.

Published as a conference paper at ICLR 2025

Table 3: Top-1 accuracy of binarized Res Net-18 on Image Net for regularized training for various values of λ.

λ 0.1 1 10 100 1000

Top-1 % 51.5 0.12 59.6 0.11 62.8 0.09 62.4 0.10 59.5 0.08

Table 4: Top-1 accuracy of ternarized Res Net-18 on Image Net for regularized training for various values of λ.

λ 1e3 1e4 1e5 1e6 1e7

Top-1 % 63.5 0.12 64.8 0.10 65.3 0.10 65.3 0.08 64.6 0.17

D.3 PERFORMANCE COMPARISON WITH SOTA BINARIZED AND TERNARIZED NEURAL NETWORKS FROM SECTION 3.4.2

In Table 7, we provide a comparison of the advantages/disadvantages of our training strategy compared to the SOTA methods. We remark that our approach is the only method that does not introduce any additional variables.

In Tables 8-10, we provide a comparison of the image classiﬁcation accuracy of our methods compared to the SOTA methods. Here, we report the average accuracy and standard deviation for 10 random initializations of training. We remark that the accuracy of our methods is comparable to that of the SOTA methods.

E COMPUTATIONAL RESOURCES

For our underdetermined linear systems experiments in Section 3.1, we used MATLAB. For the maximum number of 104 iterations, projected gradient descent and Douglas-Rachford splitting algorithms each took approximately one second at most.

For our neural network weight quantization experiments in Section 3.4, we use Py Torch (Paszke et al., 2019). We note that using CS regularizers does not incur a signiﬁcant additional cost in training; to this end, we measure the time it takes to train for one epoch in the following three scenarios on one NVIDIA RTX 4090 GPU: (i) training with full-precision weights without a regularizer, (ii) regularized training with full-precision weights (Step 1 in Section 3.2.1), and (iii) training with quantized weights (Step 3 in Section 3.2.1). For these three scenarios with Res Net-18, we measured 660 s, 680 s, and 650 s, respectively; with Res Net-20, we measured 4.1 s, 5.2 s, and 3.8 s. These numbers demonstrate that, while the calculation of the CS regularizers naturally results in some overhead in Step 1, the training might be even faster than full-precision training in Step 3, depending on the size of the network and the training dataset. Therefore, depending on the ratio between the number of epochs in Steps 1 and 3, CS-regularizer-based training could be quicker than full-precision training (and thus, all the other SOTA methods) for the same total number of epochs.

F REGULARIZATION FUNCTION LANDSCAPES

Fig. 8 illustrates the loss landscapes of Regularizer 1, Regularizer 2, and Regularizer 3 in R2. Here, we clearly see the global minima of each function.

Published as a conference paper at ICLR 2025

Table 5: Top-1 accuracy of binarized Res Net-20 on CIFAR10 for regularized training for various values of λ.

λ 0.1 1 10 100 1000

Top-1 % 87.1 0.17 89.5 0.20 90.3 0.23 89.7 0.18 88.4 0.24

Table 6: Top-1 accuracy of ternarized Res Net-20 on CIFAR10 for regularized training for various values of λ.

λ 1e3 1e4 1e5 1e6 1e7

Top-1 % 90.5 0.19 90.7 0.17 91.0 0.13 91.0 0.10 90.8 0.17

Table 7: Comparison of variables that are required by SOTA neural network quantization methods and CS regularizers (ours) for training. Each column represents variables that are required in addition to (unquantized) full-precision neural network training.

Method Trainable variables Non-trainable variables Tunable hyperparameters

SQ-BWN (Dong et al., 2017) Yes Yes 0 BWN (Rastegari et al., 2016) No Yes 0 HWGQ (Cai et al., 2017) No Yes 0 PCCN (Gu et al., 2019) Yes Yes 0 BWHN (Hu et al., 2018) No Yes 0 ADMM (Leng et al., 2018) Yes No 1 IR-Net (Qin et al., 2020) No Yes 0 LCR-BNN (Shang et al., 2022) No Yes 2 DAQ (Kim et al., 2021) No Yes 1 Proxy BNN (He et al., 2020) Yes Yes 1 TWN (Li et al., 2016) No Yes 1 QNet (Yang et al., 2019) No Yes 1 QIL (Jung et al., 2019) Yes Yes 0 Do Re Fa-Net (Zhou et al., 2016) No Yes 0 LQ (Zhang et al., 2018) Yes Yes 0 DSQ (Gong et al., 2019) Yes Yes 0

Ours (ℓbin and ℓter) No No 1

Method Top-1 (%)

Res Net-18 (FP) 69.8

SQ-BWN (Dong et al., 2017) 58.4 BWN (Rastegari et al., 2016) 60.8 HWGQ (Cai et al., 2017) 61.3 PCCN (Gu et al., 2019) 63.5 BWHN (Hu et al., 2018) 64.3 ADMM (Leng et al., 2018) 64.8 IR-Net (Qin et al., 2020) 66.5 LCR-BNN (Shang et al., 2022) 66.9 DAQ (Kim et al., 2021) 67.2 Proxy BNN (He et al., 2020) 67.7

Ours (ℓbin) 62.8 0.09

Table 8: Top-1 accuracy of Res Net-18 with binary-valued weights on Image Net. FP stands for the full-precision model accuracy.

Published as a conference paper at ICLR 2025

Method Top-1 (%)

Res Net-18 (FP) 69.8

TWN (Li et al., 2016) 61.8 SQ-TWN (Dong et al., 2017) 63.8 QNet (Yang et al., 2019) 66.5 ADMM (Leng et al., 2018) 67.0 LQ (Zhang et al., 2018) 68.0 QIL (Jung et al., 2019) 68.1

Ours (ℓter) 65.3 0.08

Table 9: Top-1 accuracy of Res Net-18 with ternary-valued weights on Image Net. FP stands for the full-precision model accuracy.

Method Top-1 (%)

Res Net-20 (FP) 91.7

Do Re Fa-Net (Zhou et al., 2016) 90.0 LQ (Zhang et al., 2018) 90.1 DSQ (Gong et al., 2019) 90.2 IR-Net (Qin et al., 2020) 90.8 DAQ (Kim et al., 2021) 91.2 LCR-BNN (Shang et al., 2022) 91.2

Ours (ℓbin) 90.3 0.17 Ours (ℓter) 91.0 0.11

Table 10: Top-1 accuracy of Res Net-20 on CIFAR-10 (c) with binaryand ternary-valued weights. FP stands for the full-precision model accuracy.

(a) Symmetric binary

(b) One-sided binary

(c) Symmetric ternary

Figure 8: Two-dimensional loss landscape of CS regularizers in logarithmic scale.