# b-bit Marginal Regression

Martin Slawski, Department of Statistics and Biostatistics, Department of Computer Science, Rutgers University, martin.slawski@rutgers.edu

Ping Li, Department of Statistics and Biostatistics, Department of Computer Science, Rutgers University, pingli@stat.rutgers.edu

## Abstract

We consider the problem of sparse signal recovery from $m$ linear measurements quantized to $b$ bits. b-bit Marginal Regression is proposed as a recovery algorithm. We study the question of choosing $b$ in the setting of a given budget of bits $B = m \cdot b$ and derive a single easy-to-compute expression characterizing the trade-off between $m$ and $b$. The choice $b = 1$ turns out to be optimal for estimating the unit vector corresponding to the signal for any level of additive Gaussian noise before quantization as well as for adversarial noise. For $b \geq 2$, we show that Lloyd-Max quantization constitutes an optimal quantization scheme and that the norm of the signal can be estimated consistently by maximum likelihood by extending [15].

## 1 Introduction

Consider the common compressed sensing (CS) model

$$y_i = \langle a_i, x^* \rangle + \sigma \varepsilon_i, \quad i = 1, \ldots, m, \quad \text{or equivalently} \quad y = A x^* + \sigma \varepsilon, \qquad (1)$$

where $y = (y_i)_{i=1}^m$, $A = (A_{ij})_{i,j=1}^{m,n}$, $a_i = (A_{ij})_{j=1}^n$ for $i \in [m]$, and $\varepsilon = (\varepsilon_i)_{i=1}^m$. The $\{A_{ij}\}$ and the $\{\varepsilon_i\}$ are i.i.d. $N(0,1)$ (i.e. standard Gaussian) random variables, the latter of which will be referred to as additive noise and accordingly $\sigma > 0$ as the noise level, and $x^* \in \mathbb{R}^n$ is the signal of interest to be recovered given $(A, y)$. Let $s = \|x^*\|_0 := |S(x^*)|$, where $S(x^*) = \{j : |x_j^*| > 0\}$, be the $\ell_0$-norm of $x^*$ (i.e. the cardinality of its support $S(x^*)$). One of the celebrated results in CS is that accurate recovery of $x^*$ is possible as long as $m \gtrsim s \log n$, and can be carried out by several computationally tractable algorithms, e.g. [3, 5, 21, 26, 29]. Subsequently, the concept of signal recovery from an incomplete set ($m < n$) of linear measurements was developed further to settings in which only coarsely quantized versions of such linear measurements are available, with the extreme case of single-bit measurements [2, 8, 11, 22, 23, 28, 16]. More generally, one can think of $b$-bit measurements, $b \in \{1, 2, \ldots\}$. Assuming that one is free to choose $b$ given a fixed budget of bits $B = m \cdot b$ gives rise to a trade-off between $m$ and $b$. An optimal balance of these two quantities minimizes the error in recovering the signal. Such an optimal trade-off depends on the quantization scheme, the noise level, and the recovery algorithm.

This trade-off has been considered in previous CS literature [13]. However, the analysis therein concerns an oracle-assisted recovery algorithm equipped with knowledge of $S(x^*)$, which is not fully realistic. In [9] a specific variant of Iterative Hard Thresholding [1] for b-bit measurements is considered. It is shown via numerical experiments that choosing $b \geq 2$ can in fact achieve improvements over $b = 1$ at the level of the total number of bits required for approximate signal recovery. On the other hand, there is no analysis supporting this observation. Moreover, the experiments in [9] only concern a noiseless setting. Another approach is to treat quantization as additive error and to perform signal recovery by means of variations of recovery algorithms for infinite-precision CS [10, 14, 18]. In this line of research, $b$ is assumed to be fixed and a discussion of the aforementioned trade-off is missing.
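For concreteness, the following minimal sketch (ours, not from the paper; function names and default values are illustrative) generates synthetic data from model (1) with a standard Gaussian design and an $s$-sparse signal.

```python
import numpy as np

def simulate_cs_model(n=500, m=200, s=10, sigma=1.0, seed=0):
    """Draw (A, y, x_star) from model (1): y = A x* + sigma * eps."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))               # i.i.d. N(0, 1) design
    x_star = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    # s-sparse signal with random signs and magnitudes (illustrative choice)
    x_star[support] = rng.uniform(1.0, 2.0, size=s) * rng.choice([-1.0, 1.0], size=s)
    eps = rng.standard_normal(m)                  # additive Gaussian noise
    y = A @ x_star + sigma * eps
    return A, y, x_star

A, y, x_star = simulate_cs_model()
```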
In the present paper we provide an analysis of compressed sensing from b-bit measurements using a specific approach to signal recovery which we term b-bit Marginal Regression. This approach builds on a method for one-bit compressed sensing proposed in an influential paper by Plan and Vershynin [23], which has subsequently been refined in several recent works [4, 24, 28]. As indicated by the name, b-bit Marginal Regression can be seen as a quantized version of Marginal Regression, a simple yet surprisingly effective approach to support recovery that stands out due to its low computational cost, requiring only a single matrix-vector multiplication and a sorting operation [7]. Our analysis yields a precise characterization of the above trade-off involving $m$ and $b$ in various settings. It turns out that the choice $b = 1$ is optimal for recovering the normalized signal $x_u^* = x^*/\|x^*\|_2$ under additive Gaussian noise as well as under adversarial noise. It is shown that the choice $b = 2$ additionally enables one to estimate $\|x^*\|_2$, while being optimal for recovering $x_u^*$ among $b \geq 2$. Hence for the specific recovery algorithm under consideration, it does not pay off to take $b > 2$. Furthermore, once the noise level is significantly high, b-bit Marginal Regression is empirically shown to perform roughly as well as several alternative recovery algorithms, a finding suggesting that in high-noise settings taking $b > 2$ does not pay off in general. As an intermediate step in our analysis, we prove that Lloyd-Max quantization [19, 20] constitutes an optimal b-bit quantization scheme in the sense that it leads to a minimization of an upper bound on the reconstruction error.

**Notation:** We use $[d] = \{1, \ldots, d\}$ and $S(x)$ for the support of $x \in \mathbb{R}^n$. $x \odot x' = (x_j x'_j)_{j=1}^n$ denotes the entry-wise product. $I(P)$ is the indicator function of expression $P$. The symbol $\propto$ means "up to a positive universal constant".

**Supplement:** Proofs and additional experiments can be found in the supplement.

## 2 From Marginal Regression to b-bit Marginal Regression

**Some background on Marginal Regression.** It is common to perform sparse signal recovery by solving an optimization problem of the form

$$\min_x \; \frac{1}{2m} \|y - Ax\|_2^2 + \frac{\gamma}{2} P(x), \quad \gamma \geq 0, \qquad (2)$$

where $P$ is a penalty term encouraging sparse solutions. Standard choices for $P$ are $P(x) = \|x\|_0$, which is computationally not feasible in general, its convex relaxation $P(x) = \|x\|_1$, or non-convex penalty terms like SCAD or MCP that are more amenable to optimization than the $\ell_0$-norm [27]. Alternatively, $P$ can as well be used to enforce a constraint by setting $P(x) = \iota_C(x)$, where $\iota_C(x) = 0$ if $x \in C$ and $+\infty$ otherwise, with $C = \{x \in \mathbb{R}^n : \|x\|_0 \leq s\}$ or $C = \{x \in \mathbb{R}^n : \|x\|_1 \leq r\}$ being standard choices. Note that (2) is equivalent to the optimization problem $\min_x \; -\langle \eta, x \rangle + \frac{1}{2} x^\top (A^\top A/m) x + \frac{\gamma}{2} P(x)$, where $\eta = A^\top y / m$. Replacing $A^\top A / m$ by $E[A^\top A / m] = I$ (recall that the entries of $A$ are i.i.d. $N(0,1)$), we obtain

$$\min_x \; -\langle \eta, x \rangle + \frac{1}{2} \|x\|_2^2 + \frac{\gamma}{2} P(x), \quad \eta = A^\top y / m, \qquad (3)$$

which tends to be much simpler to solve than (2) as the first two terms are separable in the components of $x$. For the choices of $P$ mentioned above, we obtain closed-form solutions:

$$\begin{aligned}
P(x) = \|x\|_0 &: \; \hat{x}_j = \eta_j \, I(|\eta_j| \geq \gamma^{1/2}), &\quad P(x) = \|x\|_1 &: \; \hat{x}_j = (|\eta_j| - \gamma)_+ \, \mathrm{sign}(\eta_j), \\
P(x) = \iota_{\{x : \|x\|_0 \leq s\}} &: \; \hat{x}_j = \eta_j \, I(|\eta_j| \geq |\eta_{(s)}|), &\quad P(x) = \iota_{\{x : \|x\|_1 \leq r\}} &: \; \hat{x}_j = (|\eta_j| - \gamma^*)_+ \, \mathrm{sign}(\eta_j),
\end{aligned} \qquad (4)$$

for $j \in [n]$, where $(\cdot)_+$ denotes the positive part, $|\eta_{(s)}|$ is the $s$-th largest entry of $\eta$ in absolute magnitude, and $\gamma^* = \min\{\gamma \geq 0 : \sum_{j=1}^n (|\eta_j| - \gamma)_+ \leq r\}$.
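As an illustration of the closed-form rules in (4), the sketch below (ours; the bisection used to compute $\gamma^*$ is an illustrative choice) applies each rule to a precomputed $\eta = A^\top y/m$.

```python
import numpy as np

def marginal_regression_estimators(eta, gamma=None, s=None, r=None):
    """Closed-form solutions (4) of the separable surrogate (3); eta = A^T y / m."""
    out = {}
    if gamma is not None:
        # P(x) = ||x||_0: hard thresholding at gamma^(1/2)
        out["l0_penalty"] = eta * (np.abs(eta) >= np.sqrt(gamma))
        # P(x) = ||x||_1: soft thresholding at gamma
        out["l1_penalty"] = np.sign(eta) * np.maximum(np.abs(eta) - gamma, 0.0)
    if s is not None:
        # ||x||_0 <= s: keep the s largest entries in absolute value
        thresh = np.sort(np.abs(eta))[-s]
        out["l0_constraint"] = eta * (np.abs(eta) >= thresh)
    if r is not None:
        # ||x||_1 <= r: soft thresholding at the smallest gamma* such that
        # sum_j (|eta_j| - gamma*)_+ <= r, found here by simple bisection
        lo, hi = 0.0, float(np.max(np.abs(eta)))
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if np.maximum(np.abs(eta) - mid, 0.0).sum() <= r:
                hi = mid
            else:
                lo = mid
        out["l1_constraint"] = np.sign(eta) * np.maximum(np.abs(eta) - hi, 0.0)
    return out
```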
In other words, the estimators are hard- respectively soft-thresholded versions of $\eta_j = A_j^\top y / m$, which are essentially equal to the univariate (or marginal) regression coefficients $\theta_j = A_j^\top y / \|A_j\|_2^2$ in the sense that $\eta_j = \theta_j (1 + O_P(m^{-1/2}))$, $j \in [n]$; hence the term "marginal regression". In the literature, it is the estimator in the left half of (4) that is popular [7], albeit as a means to infer the support of $x^*$ rather than $x^*$ itself. Still, under model (1) the performance with respect to signal recovery can be reasonable in view of the statement below.

**Proposition 1.** Consider model (1) with $x^* \neq 0$ and the Marginal Regression estimator $\hat{x}$ defined component-wise by $\hat{x}_j = \eta_j \, I(|\eta_j| \geq |\eta_{(s)}|)$, $j \in [n]$, where $\eta = A^\top y / m$. Then there exist positive constants $c, C > 0$ such that with probability at least $1 - c n^{-1}$,

$$\frac{\|\hat{x} - x^*\|_2}{\|x^*\|_2} \leq C \, \frac{\|x^*\|_2 + \sigma}{\|x^*\|_2} \sqrt{\frac{s \log n}{m}}. \qquad (5)$$

In comparison, the relative $\ell_2$-error of more sophisticated methods like the lasso scales as $O(\{\sigma / \|x^*\|_2\} \sqrt{s \log(n)/m})$, which is comparable to (5) once $\sigma$ is of the same order of magnitude as $\|x^*\|_2$. Marginal Regression can also be interpreted as a single projected gradient iteration from $0$ for problem (2) with $P = \iota_{\{x : \|x\|_0 \leq s\}}$. Taking more than one projected gradient iteration gives rise to a popular recovery algorithm known as Iterative Hard Thresholding (IHT, [1]).

**Compressed sensing with non-linear observations and the method of Plan & Vershynin.** As a generalization of (1) one can consider measurements of the form

$$y_i = Q(\langle a_i, x^* \rangle + \sigma \varepsilon_i), \quad i \in [m], \qquad (6)$$

for some map $Q$. Without loss of generality, one may assume that $\|x^*\|_2 = 1$ as long as $x^* \neq 0$ (which is assumed in the sequel) by defining $Q$ accordingly. Plan and Vershynin [23] consider the following optimization problem for recovering $x^*$, and develop a framework for analysis that covers even more general measurement models than (6). The proposed estimator minimizes

$$\min_{x : \|x\|_2 \leq 1, \, \|x\|_1 \leq \sqrt{s}} \; -\langle \eta, x \rangle, \quad \eta = A^\top y / m. \qquad (7)$$

Note that the constraint set $\{x : \|x\|_2 \leq 1, \|x\|_1 \leq \sqrt{s}\}$ contains $\{x : \|x\|_2 \leq 1, \|x\|_0 \leq s\}$. The authors prefer the former, first because it is suited for approximately sparse signals as well and second because it is convex. However, the optimization problem with the sparsity constraint is easy to solve:

$$\min_{x : \|x\|_2 \leq 1, \, \|x\|_0 \leq s} \; -\langle \eta, x \rangle, \quad \eta = A^\top y / m. \qquad (8)$$

**Lemma 1.** The solution of problem (8) is given by $\hat{x} = \tilde{x} / \|\tilde{x}\|_2$, $\tilde{x}_j = \eta_j \, I(|\eta_j| \geq |\eta_{(s)}|)$, $j \in [n]$.

While this is elementary, we state it as a separate lemma as there has been some confusion in the existing literature. In [4] the same solution is obtained after (unnecessarily) convexifying the constraint set, which yields the unit ball of the so-called s-support norm. In [24] a family of concave penalty terms including the SCAD and MCP is proposed in place of the cardinality constraint. However, in light of Lemma 1, the use of such penalty terms lacks motivation. The minimization problem (8) is essentially that of Marginal Regression (3) with $P = \iota_{\{x : \|x\|_0 \leq s\}}$, the only difference being that the norm of the solution is fixed to one. Note that the Marginal Regression estimator is equivariant w.r.t. re-scaling of $y$: replacing $y$ by $\alpha y$ with $\alpha > 0$ changes $\hat{x}$ to $\alpha \hat{x}$. In addition, let $\alpha, \beta > 0$ and define $\hat{x}(\alpha)$ and $\hat{x}[\beta]$ as the minimizers of the optimization problems

$$\min_{x : \|x\|_0 \leq s} \; -\langle \eta, x \rangle + \frac{\alpha}{2} \|x\|_2^2, \qquad \min_{x : \|x\|_2 \leq \beta, \, \|x\|_0 \leq s} \; -\langle \eta, x \rangle. \qquad (9)$$

It is not hard to verify that $\hat{x}(\alpha) / \|\hat{x}(\alpha)\|_2 = \hat{x}[\beta] / \|\hat{x}[\beta]\|_2 = \hat{x}[1]$. In summary, for estimating the direction $x_u^* = x^* / \|x^*\|_2$ it does not matter whether a quadratic term in the objective or an $\ell_2$-norm constraint is used. Moreover, estimation of the scale $\psi^* = \|x^*\|_2$ and the direction can be separated.
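The estimator of Lemma 1 takes only a few lines; the following sketch (ours; naming is illustrative) computes $\eta$, keeps the $s$ largest entries in absolute value, and normalizes.

```python
import numpy as np

def bbit_marginal_regression(A, y, s):
    """Solution of (8) as given in Lemma 1: keep the s largest entries of
    eta = A^T y / m in absolute value, then normalize to unit l2-norm."""
    m = A.shape[0]
    eta = A.T @ y / m
    x_tilde = np.zeros_like(eta)
    top_s = np.argsort(np.abs(eta))[-s:]       # indices of the s largest |eta_j|
    x_tilde[top_s] = eta[top_s]
    norm = np.linalg.norm(x_tilde)
    return x_tilde / norm if norm > 0 else x_tilde
```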
Adopting the framework in [23], we provide a straightforward bound on the $\ell_2$-error of $\hat{x}$ minimizing (8). To this end we define two quantities which will be of central interest in subsequent analysis:

$$\begin{aligned}
\lambda &= E[g \, \theta(g)], \quad g \sim N(0,1), \quad \text{where } \theta \text{ is defined by } E[y_1 \,|\, a_1] = \theta(\langle a_1, x^* \rangle), \\
\Psi &= \inf\Big\{C > 0 : P\Big\{\max_{1 \leq j \leq n} |\eta_j - E[\eta_j]| \leq C \sqrt{\log(n)/m}\Big\} \geq 1 - 1/n\Big\}.
\end{aligned} \qquad (10)$$

The quantity $\lambda$ concerns the deterministic part of the analysis as it quantifies the distortion of the linear measurements under the map $Q$, while $\Psi$ is used to deal with the stochastic part. The definition of $\Psi$ is based on the usual tail bound for the maximum of centered sub-Gaussian random variables. In fact, as long as $Q$ has bounded range, Gaussianity of the $\{A_{ij}\}$ implies that the $\{\eta_j - E[\eta_j]\}_{j=1}^n$ are zero-mean sub-Gaussian. Accordingly, the constant $\Psi$ is proportional to the sub-Gaussian norm of the $\{\eta_j - E[\eta_j]\}_{j=1}^n$, cf. [25].

**Proposition 2.** Consider model (6) such that $\|x^*\|_2 = 1$ and the quantities in (10). Suppose that $\lambda > 0$ and denote by $\hat{x}$ the minimizer of (8). Then with probability at least $1 - 1/n$, it holds that

$$\|\hat{x} - x^*\|_2 \leq C \, \frac{\Psi}{\lambda} \sqrt{\frac{s \log n}{m}} \quad \text{for a positive universal constant } C. \qquad (11)$$

So far $s$ has been assumed to be known. If that is not the case, $s$ can be estimated as follows.

**Proposition 3.** In the setting of Proposition 2, consider $\hat{s} = |\{j : |\eta_j| > \Psi \sqrt{\log(n)/m}\}|$ and $\hat{x}$ as the minimizer of (8) with $s$ replaced by $\hat{s}$. Then with probability at least $1 - 1/n$, $S(\hat{x}) \subseteq S(x^*)$ (i.e. no false positive selection). Moreover, if

$$\min_{j \in S(x^*)} |x_j^*| > (2\Psi/\lambda) \sqrt{\log(n)/m}, \qquad (12)$$

one has $S(\hat{x}) = S(x^*)$.

**b-bit Marginal Regression.** b-bit quantized measurements directly fit into the non-linear observation model (6). Here the map $Q$ represents a quantizer that partitions $\mathbb{R}_+$ into $K = 2^{b-1}$ bins $\{R_k\}_{k=1}^K$ given by distinct thresholds $t = (t_1, \ldots, t_{K-1})$ (in increasing order) and $t_0 = 0$, $t_K = +\infty$, such that $R_1 = [t_0, t_1), \ldots, R_K = [t_{K-1}, t_K)$. Each bin is assigned a distinct representative from $\mathcal{M} = \{\mu_1, \ldots, \mu_K\}$ (in increasing order), so that $Q : \mathbb{R} \to -\mathcal{M} \cup \mathcal{M}$ is defined by $z \mapsto Q(z) = \mathrm{sign}(z) \sum_{k=1}^K \mu_k \, I(|z| \in R_k)$. Expanding model (6) accordingly, we obtain

$$y_i = \mathrm{sign}(\langle a_i, x^* \rangle + \sigma \varepsilon_i) \sum_{k=1}^K \mu_k \, I\big(|\langle a_i, x^* \rangle + \sigma \varepsilon_i| \in R_k\big) = \mathrm{sign}(\langle a_i, x_u^* \rangle + \tau \varepsilon_i) \sum_{k=1}^K \mu_k \, I\big(|\langle a_i, x_u^* \rangle + \tau \varepsilon_i| \in R_k / \psi^*\big), \quad i \in [m],$$

where $\psi^* = \|x^*\|_2$, $x_u^* = x^* / \psi^*$ and $\tau = \sigma / \psi^*$. Thus the scale $\psi^*$ of the signal can be absorbed into the definition of the bins respectively thresholds, which should be proportional to $\psi^*$. We may thus again fix $\psi^* = 1$, and in turn $x^* = x_u^*$, $\sigma = \tau$, w.l.o.g. for the analysis below. Estimation of $\psi^*$ separately from $x_u^*$ will be discussed in an extra section.

## 3 Trade-off between m and b

In this section we study in detail the central question of the introduction. Suppose we have a fixed budget $B$ of bits available and are free to choose the number of measurements $m$ and the number of bits per measurement $b$ subject to $B = m \cdot b$ such that the $\ell_2$-error $\|\hat{x} - x^*\|_2$ of b-bit Marginal Regression is as small as possible. What is the optimal choice of $(m, b)$? In order to answer this question, let us go back to the error bound (11). That bound applies to b-bit Marginal Regression for any choice of $b$ and varies with $\lambda = \lambda_b$ and $\Psi = \Psi_b$, both of which additionally depend on $\sigma$, the choice of the thresholds $t$ and the representatives $\mu$. It can be shown that the dependence of (11) on the ratio $\Psi/\lambda$ is tight asymptotically as $m \to \infty$. Hence it makes sense to compare two different choices $b$ and $b'$ in terms of the ratio of $\Omega_b = \Psi_b / \lambda_b$ and $\Omega_{b'} = \Psi_{b'} / \lambda_{b'}$. Since the bound (11) decays with $m$, for $b'$-bit measurements, $b' > b$, to improve over b-bit measurements with respect to the total number of bits used, it is then required that $\Omega_b / \Omega_{b'} > \sqrt{b'/b}$.
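The quantization map $Q$ is simple to implement. The sketch below (ours; names and the example values are illustrative, the latter being close to the Lloyd-Max choice for $N(0,1)$ discussed later) assigns each real-valued measurement to its signed bin representative for given thresholds $t$ and representatives $\mu$.

```python
import numpy as np

def bbit_quantize(z, t, mu):
    """b-bit quantizer Q: thresholds t = (t_1, ..., t_{K-1}) partition R_+ into
    K bins; mu = (mu_1, ..., mu_K) are the representatives.
    Returns sign(z) * mu_k where R_k is the bin containing |z|."""
    t = np.asarray(t, dtype=float)
    mu = np.asarray(mu, dtype=float)
    edges = np.concatenate(([0.0], t, [np.inf]))
    k = np.searchsorted(edges, np.abs(z), side="right") - 1   # bin index of |z|
    return np.sign(z) * mu[k]

# example: b = 2 bits, i.e. K = 2 bins per sign
y_quantized = bbit_quantize(np.array([-0.3, 0.2, 1.7]), t=[0.98], mu=[0.45, 1.51])
```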
The route to be taken is thus as follows: we first derive expressions for $\lambda_b$ and $\Psi_b$ and then minimize the resulting expression for $\Omega_b$ w.r.t. the free parameters $t$ and $\mu$. We are then in a position to compare $\Omega_b / \Omega_{b'}$ for $b \neq b'$.

**Evaluating $\lambda_b = \lambda_b(t, \mu)$.** Below, $\odot$ denotes the entry-wise multiplication between vectors.

**Lemma 2.** We have $\lambda_b(t, \mu) = \langle \alpha(t), E(t) \odot \mu \rangle / (1 + \sigma^2)$, where

$$\begin{aligned}
\alpha(t) &= (\alpha_1(t), \ldots, \alpha_K(t))^\top, &\quad \alpha_k(t) &= P\{|\tilde{g}| \in R_k(t)\}, &\quad \tilde{g} \sim N(0, 1 + \sigma^2), \; k \in [K], \\
E(t) &= (E_1(t), \ldots, E_K(t))^\top, &\quad E_k(t) &= E[\tilde{g} \,|\, \tilde{g} \in R_k(t)], &\quad \tilde{g} \sim N(0, 1 + \sigma^2), \; k \in [K].
\end{aligned}$$

**Evaluating $\Psi_b = \Psi_b(t, \mu)$.** Exact evaluation proves to be difficult. We hence resort to an analytically more tractable approximation which is still sufficiently accurate, as confirmed by experiments.

**Lemma 3.** As $|x_j^*| \to 0$, $j = 1, \ldots, n$, and as $m \to \infty$, we have $\Psi_b(t, \mu) \propto \sqrt{\langle \alpha(t), \mu \odot \mu \rangle}$.

Note that the proportionality constant (not depending on $b$) in front of the given expression does not need to be known, as it cancels out when computing ratios $\Omega_b / \Omega_{b'}$. The asymptotics $|x_j^*| \to 0$, $j \in [n]$, is limiting but still makes sense for $s$ growing with $n$ (recall that we fix $\|x^*\|_2 = 1$ w.l.o.g.).

**Optimal choice of $t$ and $\mu$.** It turns out that the optimal choice of $(t, \mu)$ minimizing $\Psi_b / \lambda_b$ coincides with the solution of an instance of the classical Lloyd-Max quantization problem [19, 20] stated below. Let $h$ be a random variable with finite variance and $Q$ the quantization map from above:

$$\min_{t, \mu} \; E\big[\{h - Q(h; t, \mu)\}^2\big] = \min_{t, \mu} \; E\Big[\Big\{h - \mathrm{sign}(h) \sum_{k=1}^K \mu_k \, I(|h| \in R_k(t))\Big\}^2\Big]. \qquad (13)$$

Problem (13) can be seen as a one-dimensional k-means problem at the population level, and it is solved in practice by an alternating scheme similar to that used for k-means; a sketch of such a scheme is given below. For $h$ from a log-concave distribution (e.g. Gaussian) that scheme can be shown to deliver the global optimum [12].

**Theorem 1.** Consider the minimization problem $\min_{t, \mu} \Psi_b(t, \mu) / \lambda_b(t, \mu)$. Its minimizer $(t^*, \mu^*)$ equals that of the Lloyd-Max problem (13) for $h \sim N(0, 1 + \sigma^2)$. Moreover,

$$\Omega_b(t^*, \mu^*) = \Psi_b(t^*, \mu^*) / \lambda_b(t^*, \mu^*) \propto \sqrt{(\sigma^2 + 1) / \lambda_{b,0}(t_0^*, \mu_0^*)},$$

where $\lambda_{b,0}(t_0^*, \mu_0^*)$ denotes the value of $\lambda_b$ for $\sigma = 0$ evaluated at $(t_0^*, \mu_0^*)$, the choice of $(t, \mu)$ minimizing $\Omega_b$ for $\sigma = 0$.

Regarding the choice of $(t, \mu)$, the result of Theorem 1 may not come as a surprise, as the measurements prior to quantization, $\langle a_i, x^* \rangle + \sigma \varepsilon_i$, are i.i.d. $N(0, 1 + \sigma^2)$. It is less immediate, though, that this specific choice can also be motivated as the one leading to the minimization of the error bound (11). Furthermore, Theorem 1 implies that the relative performance of b- and b'-bit measurements does not depend on $\sigma$ as long as the respective optimal choice of $(t, \mu)$ is used, which requires $\sigma$ to be known. Theorem 1 provides an explicit expression for $\Omega_b$ that is straightforward to compute. The following table lists the ratios $\Omega_b / \Omega_{b'}$ for selected values of $b$ and $b'$.

[Table: ratios $\Omega_b / \Omega_{b'}$ for $(b, b') \in \{(1, 2), (2, 3), (3, 4)\}$, together with the improvement $\sqrt{b'/b}$ required for $b'$-bit measurements to improve over b-bit measurements.]

These figures suggest that the smaller $b$, the better the performance for a given budget of bits $B$.
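The alternating scheme for the population Lloyd-Max problem (13) mentioned above can be sketched as follows for $h \sim N(0, 1 + \sigma^2)$; the implementation details (initialization via quantiles, fixed iteration count) are our own choices, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def lloyd_max_gaussian(b, sigma=0.0, iters=200):
    """Alternating scheme for the Lloyd-Max problem (13) with h ~ N(0, 1 + sigma^2),
    exploiting symmetry: only |h| on R_+ is quantized, with K = 2^(b-1) bins.
    Returns thresholds t (length K-1) and representatives mu (length K)."""
    K = 2 ** (b - 1)
    tau = np.sqrt(1.0 + sigma ** 2)                       # std. deviation of h
    # initial positive representatives: quantiles of |h|
    mu = tau * norm.ppf(0.5 + 0.5 * (np.arange(K) + 0.5) / K)
    for _ in range(iters):
        t = 0.5 * (mu[:-1] + mu[1:])                      # thresholds = midpoints
        edges = np.concatenate(([0.0], t, [np.inf]))
        lo, hi = edges[:-1] / tau, edges[1:] / tau
        mass = norm.cdf(hi) - norm.cdf(lo)
        # representatives = conditional means of |h| on each bin (half-normal formula)
        mu = tau * (norm.pdf(lo) - norm.pdf(hi)) / mass
    return t, mu
```

For $b = 2$ and $\sigma = 0$ this returns approximately $t_1 \approx 0.98$ and $\mu \approx (0.45, 1.51)$, the familiar Lloyd-Max values for a standard Gaussian.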
**Beyond additive noise.** Additive Gaussian noise is perhaps the most studied form of perturbation, but one can of course think of numerous other mechanisms whose effect can be analyzed on the basis of the same scheme used for additive noise, as long as it is feasible to obtain the corresponding expressions for $\lambda$ and $\Psi$. We here do so for the following two mechanisms acting after quantization.

(I) *Random bin flip.* For $i \in [m]$: with probability $1 - p$, $y_i$ remains unchanged; with probability $p$, $y_i$ is changed to an element from $(-\mathcal{M} \cup \mathcal{M}) \setminus \{y_i\}$ uniformly at random.

(II) *Adversarial bin flip.* For $i \in [m]$: write $y_i = q \mu_k$ for $q \in \{-1, 1\}$ and $\mu_k \in \mathcal{M}$. With probability $1 - p$, $y_i$ remains unchanged; with probability $p$, $y_i$ is changed to $-q \mu_K$.

Note that for $b = 1$, (I) and (II) coincide (a sign flip with probability $p$). Depending on the magnitude of $p$, the corresponding value $\lambda = \lambda_{b,p}$ may even be negative, which is unlike the case of additive noise. Recall that the error bound (11) requires $\lambda > 0$. Borrowing terminology from robust statistics, we consider $p_b = \min\{p : \lambda_{b,p} \leq 0\}$ as the breakdown point, i.e. the (expected) proportion of contaminated observations that can still be tolerated so that (11) continues to hold. Mechanism (II) produces a natural counterpart of gross corruptions in the standard setting (1). It can be shown that among all maps $-\mathcal{M} \cup \mathcal{M} \to -\mathcal{M} \cup \mathcal{M}$ applied randomly to the observations with a fixed probability, (II) maximizes the ratio $\Psi / \lambda$, hence the attribute "adversarial". In Figure 1 we display $\Psi_{b,p} / \lambda_{b,p}$ for $b \in \{1, 2, 3, 4\}$ for both (I) and (II). The table below lists the corresponding breakdown points. For simplicity, $(t, \mu)$ are not optimized but set to the optimal (in the sense of Lloyd-Max) choice $(t_0^*, \mu_0^*)$ in the noiseless case. The underlying derivations can be found in the supplement. Figure 1 and the table provide one more argument in favour of one-bit measurements, as they offer better robustness vis-à-vis adversarial corruptions. In fact, once the fraction of such corruptions reaches 0.2, $b = 1$ performs best on the measurement scale. For the milder corruption scheme (I), $b = 2$ turns out to be the best choice for significant but moderate $p$.

Figure 1: $\Psi_{b,p} / \lambda_{b,p}$ ($\log_{10}$-scale), $b \in \{1, 2, 3, 4\}$, $p \in [0, 0.5]$, for mechanisms (I, left: fraction of bin flips) and (II, right: fraction of gross corruptions); the curves for $b = 3$ and $b = 4$ nearly overlap.

## 4 Scale estimation

In Section 2, we have decomposed $x^* = x_u^* \psi^*$ into a product of a unit vector $x_u^*$ and a scale parameter $\psi^* > 0$. We have pointed out that $x_u^*$ can be estimated by b-bit Marginal Regression separately from $\psi^*$, since the latter can be absorbed into the definition of the bins $\{R_k\}$. Accordingly, we may estimate $x^*$ as $\hat{x} = \hat{x}_u \hat{\psi}$ with $\hat{x}_u$ and $\hat{\psi}$ estimating $x_u^*$ and $\psi^*$, respectively. We here consider the maximum likelihood estimator (MLE) for $\psi^*$, following [15], which studied the estimation of the scale parameter for the entire $\alpha$-stable family of distributions. The work of [15] was motivated by a different line of work, the "one scan" 1-bit CS algorithm [16] based on $\alpha$-stable designs [17].

First, we consider the case $\sigma = 0$, so that the measurements prior to quantization, $\{\langle a_i, x^* \rangle\}$, are i.i.d. $N(0, (\psi^*)^2)$. The likelihood function is

$$L(\psi) = \prod_{i=1}^m \prod_{k=1}^K P(|y_i| \in R_k)^{I(|y_i| \in R_k)} = \prod_{k=1}^K \big\{2\big(\Phi(t_k/\psi) - \Phi(t_{k-1}/\psi)\big)\big\}^{m_k}, \qquad (14)$$

where $m_k = |\{i : |y_i| \in R_k\}|$, $k \in [K]$, and $\Phi$ denotes the standard Gaussian cdf. Note that for $K = 1$, $L(\psi)$ is constant (i.e. does not depend on $\psi$), which confirms that for $b = 1$ it is impossible to recover $\psi^*$. For $K = 2$ (i.e. $b = 2$), the MLE has a simple closed-form expression given by $\hat{\psi} = t_1 / \Phi^{-1}(0.5(1 + m_1/m))$. The following tail bound establishes fast convergence of $\hat{\psi}$ to $\psi^*$.

**Proposition 4.** Let $\varepsilon \in (0, 1)$ and $c = 2\{\phi'(t_1/\psi^*)\}^2$, where $\phi'$ denotes the derivative of the standard Gaussian pdf. With probability at least $1 - 2\exp(-c m \varepsilon^2)$, we have $|\hat{\psi}/\psi^* - 1| \leq \varepsilon$.

The exponent $c$ is maximized for $t_1 = \psi^*$ and becomes smaller as $t_1/\psi^*$ moves away from $1$. While scale estimation from 2-bit measurements is possible, convergence can be slow if $t_1$ is not well chosen. For $b \geq 3$, convergence can be faster, but the MLE is not available in closed form [15].
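For the noiseless 2-bit case, the closed-form MLE above is essentially a one-liner; the sketch below (ours) assumes the threshold $t_1$ and the inner-bin representative $\mu_1$ used at measurement time are known.

```python
import numpy as np
from scipy.stats import norm

def scale_mle_2bit(y_quantized, t1, mu1):
    """Closed-form MLE of psi* = ||x*||_2 from 2-bit measurements (sigma = 0):
    psi_hat = t_1 / Phi^{-1}((1 + m_1/m) / 2),
    where m_1 counts observations falling into the inner bin [0, t_1).
    (A finite, nonzero estimate requires 0 < m_1 < m.)"""
    m = len(y_quantized)
    m1 = np.sum(np.isclose(np.abs(y_quantized), mu1))     # inner-bin count
    return t1 / norm.ppf(0.5 * (1.0 + m1 / m))
```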
We now turn to the case $\sigma > 0$, in which the MLE based on (14) is no longer consistent. If $x_u^*$ is known, then the joint likelihood for $(\psi, \sigma)$ is given by

$$L(\psi, \sigma) = \prod_{i=1}^m \left\{ \Phi\!\left(\frac{u_i - \psi \langle a_i, x_u^* \rangle}{\sigma}\right) - \Phi\!\left(\frac{l_i - \psi \langle a_i, x_u^* \rangle}{\sigma}\right) \right\}, \qquad (15)$$

where $[l_i, u_i]$ denotes the interval the $i$-th observation is contained in before quantization, $i \in [m]$. It is not clear to us whether the likelihood is log-concave, which would ensure that the global optimum can be obtained by convex programming. Empirically, we have not encountered any issue with spurious local minima when using $\sigma = 0$ and $\widetilde{\psi}$, the MLE from the noiseless case (14), as starting point. The only issue with (15) we are aware of concerns the case in which there exists $\psi$ so that $\psi \langle a_i, x_u^* \rangle \in [l_i, u_i]$ for all $i \in [m]$. In this situation, the MLE for $\sigma$ equals zero and the MLE for $\psi$ may not be unique. However, this is a rather unlikely scenario as long as there is a noticeable noise level. As $x_u^*$ is typically unknown, we may follow the plug-in principle, replacing $x_u^*$ by an estimator $\hat{x}_u$.

## 5 Experiments

We here provide numerical results supporting and illustrating some of the key points made in the previous sections. We also compare b-bit Marginal Regression to alternative recovery algorithms.

**Setup.** Our simulations follow model (1) with $n = 500$, $s \in \{10, 20, \ldots, 50\}$, $\sigma \in \{0, 1, 2\}$ and $b \in \{1, 2\}$. Regarding $x^*$, the support and the signs of its entries are selected uniformly at random, while the absolute magnitudes of the entries on the support are drawn from the uniform distribution on $[\beta, 2\beta]$, where $\beta = f \cdot (1/\lambda_{1,\sigma}) \sqrt{\log(n)/m}$ and $m = f^2 (1/\lambda_{1,\sigma})^2 \, s \log n$, with $f \in \{1.5, 3, 4.5, \ldots, 12\}$ controlling the signal strength. The resulting signal is then normalized to unit $\ell_2$-norm. Before normalization, the norm of the signal lies in $[1, 2]$ by construction, which ensures that as $f$ increases the signal strength condition (12) is satisfied with increasing probability. For $b = 2$, we use Lloyd-Max quantization for a $N(0, 1)$ random variable, which is optimal for $\sigma = 0$ but not for $\sigma > 0$. Each possible configuration of $s$, $f$ and $\sigma$ is replicated 20 times. Due to space limits, a representative subset of the results is shown; the rest can be found in the supplement.

**Empirical verification of the analysis in Section 3.** The relative performance of 1-bit and 2-bit measurements for estimating $x^*$ predicted by the analysis of Section 3 closely agrees with what is observed empirically, as can be seen in Figure 2.

**Estimation of the scale and the noise level.** Figure 3 suggests that the plug-in MLE for $(\psi^* = \|x^*\|_2, \sigma)$ is a suitable approach, at least as long as $\psi^*/\sigma$ is not too small. For $\sigma = 2$, the plug-in MLE for $\psi^*$ appears to have a noticeable bias, as it tends to 0.92 instead of 1 for increasing $f$ (and thus increasing $m$). Observe that for $\sigma = 0$, convergence to the true value 1 is slower than for $\sigma = 1$, while $\sigma$ is over-estimated (about 0.2) for small $f$. These two issues are presumably a plug-in effect, i.e. a consequence of using $\hat{x}_u$ in place of $x_u^*$.
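To make the plug-in MLE of Section 4 concrete, here is a minimal sketch (ours) that maximizes the interval-censored likelihood (15) numerically; the log-parameterization, the small positive starting value for $\sigma$, and the choice of optimizer are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def scale_noise_mle(A, lo, hi, x_u_hat, psi0=1.0, sigma0=0.1):
    """Plug-in MLE of (psi, sigma) based on the interval-censored likelihood (15).
    lo[i], hi[i]: signed endpoints of the bin observation i was assigned to;
    x_u_hat: an estimate of the unit-norm direction x_u*."""
    z = A @ x_u_hat                                       # <a_i, x_u_hat>

    def neg_log_lik(params):
        log_psi, log_sigma = params                       # optimize on log-scale (> 0)
        psi, sigma = np.exp(log_psi), np.exp(log_sigma)
        p = norm.cdf((hi - psi * z) / sigma) - norm.cdf((lo - psi * z) / sigma)
        return -np.sum(np.log(np.maximum(p, 1e-300)))     # guard against log(0)

    res = minimize(neg_log_lik, x0=[np.log(psi0), np.log(sigma0)], method="Nelder-Mead")
    return np.exp(res.x)                                  # (psi_hat, sigma_hat)
```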
Figure 2: Average $\ell_2$-estimation errors $\|x^* - \hat{x}\|_2$ for $b = 1$ and $b = 2$ on the $\log_2$-scale in dependence of the signal strength $f$, for $(\sigma, s) \in \{(0, 10), (0, 50), (1, 50), (2, 50)\}$. The curve "predicted improvement" (of $b = 2$ vs. $b = 1$) is obtained by scaling the $\ell_2$-estimation error of $b = 1$ by the factor predicted by the theory of Section 3. Likewise, the curve "required improvement" results from scaling the error of $b = 1$ by $1/\sqrt{2}$ and indicates what would be required of $b = 2$ to improve over $b = 1$ at the level of the total number of bits.

Figure 3: Estimation of $\psi^* = \|x^*\|_2$ (here equal to 1) and $\sigma$ (left panel: estimated norm of $x^*$; right panel: estimated noise level). The curves depict the average of the plug-in MLE discussed in Section 4, while the bars indicate $\pm 1$ standard deviation.

**b-bit Marginal Regression and alternative recovery algorithms.** We compare the $\ell_2$-estimation error of b-bit Marginal Regression to that of several common recovery algorithms. Compared to apparently more principled methods which try to enforce agreement of $Q(y)$ and $Q(A\hat{x})$ w.r.t. the Hamming distance (or a surrogate thereof), b-bit Marginal Regression can be seen as a crude approach, as it is based on maximizing the inner product between $y$ and $Ax$. One may thus expect its performance to be inferior. In summary, our experiments confirm that this is true in low-noise settings, but not so if the noise level is substantial. Below we briefly present the alternatives that we consider.

*Plan-Vershynin:* The approach in [23] based on (7), which differs only in that the constraint set results from a relaxation. As shown in Figure 4, the performance is similar though slightly inferior.

*IHT-quadratic:* Standard Iterative Hard Thresholding based on the quadratic loss [1]. As pointed out above, b-bit Marginal Regression can be seen as a one-step version of Iterative Hard Thresholding.

*IHT-hinge (b = 1):* The variant of Iterative Hard Thresholding for binary observations using a hinge loss-type loss function, as proposed in [11].

*SVM (b = 1):* Linear SVM with squared hinge loss and an $\ell_1$-penalty, implemented in LIBLINEAR [6]. The cost parameter is chosen from $\frac{1}{m \log m} \cdot \{2^{-3}, 2^{-2}, \ldots, 2^{3}\}$ by 5-fold cross-validation.

*IHT-Jacques (b = 2):* A variant of Iterative Hard Thresholding for quantized observations based on a specific piecewise linear loss function [9].

*SVM-type (b = 2):* This approach is based on solving the following convex optimization problem:

$$\min_{x, \{\xi_i\}} \; \gamma \|x\|_1 + \sum_{i=1}^m \xi_i \quad \text{subject to} \quad l_i - \xi_i \leq \langle a_i, x \rangle \leq u_i + \xi_i, \;\; \xi_i \geq 0, \;\; i \in [m],$$

where $[l_i, u_i]$ is the bin observation $i$ is assigned to. The essential idea is to enforce consistency of the observed and predicted bin assignments up to slacks $\{\xi_i\}$ while promoting sparsity of the solution via an $\ell_1$-penalty; a sketch of this formulation is given below. The parameter $\gamma$ is chosen from $m \log m \cdot \{2^{-10}, 2^{-9}, \ldots, 2^{3}\}$ by 5-fold cross-validation.

Turning to the results as depicted by Figure 4, the difference between a noiseless ($\sigma = 0$) and a heavily noisy setting ($\sigma = 2$) is perhaps most striking. For $\sigma = 0$, both IHT variants significantly outperform b-bit Marginal Regression; by comparing the errors for IHT, $b = 2$ can be seen to improve over $b = 1$ at the level of the total number of bits. For $\sigma = 2$, b-bit Marginal Regression is on par with the best performing methods; IHT-quadratic for $b = 2$ only achieves a moderate reduction in error over $b = 1$, while IHT-hinge is supposedly affected by convergence issues. Overall, the results suggest that a setting with substantial noise favours a crude approach (low-bit measurements and conceptually simple recovery algorithms).
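For reference, the SVM-type formulation described above can be written down directly in a generic convex modeling tool; the sketch below uses cvxpy, which is our illustrative choice (the paper does not specify a solver), and assumes finite bin endpoints $l_i, u_i$ (e.g., outer bins truncated to a large constant).

```python
import cvxpy as cp
import numpy as np

def svm_type_recovery(A, lo, hi, gamma):
    """SVM-type estimator: enforce the observed bin assignments up to slacks xi
    while promoting sparsity via an l1-penalty (lo, hi must be finite)."""
    m, n = A.shape
    x = cp.Variable(n)
    xi = cp.Variable(m, nonneg=True)
    constraints = [A @ x >= lo - xi, A @ x <= hi + xi]
    objective = cp.Minimize(gamma * cp.norm1(x) + cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return x.value
```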
Figure 4: Average $\ell_2$-estimation errors of several recovery algorithms on the $\log_2$-scale in dependence of the signal strength $f$, for $s = 50$. We contrast $\sigma = 0$ (left) vs. $\sigma = 2$ (right) and $b = 1$ (top: Marginal, Plan-Vershynin, IHT-quadratic, IHT-hinge, SVM) vs. $b = 2$ (bottom: Marginal, Plan-Vershynin, IHT-quadratic, IHT-Jacques, SVM-type).

## 6 Conclusion

Bridging Marginal Regression and a popular approach to 1-bit CS due to Plan & Vershynin, we have considered signal recovery from b-bit quantized measurements. The main finding is that for b-bit Marginal Regression it is not beneficial to increase $b$ beyond 2. A compelling argument for $b = 2$ is the fact that the norm of the signal can be estimated, unlike the case $b = 1$. Compared to high-precision measurements, 2-bit measurements also exhibit strong robustness properties. It is of interest if and under what circumstances the conclusion may differ for other recovery algorithms.

**Acknowledgement.** This work is partially supported by NSF-Bigdata-1419210, NSF-III-1360971, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137.

## References

[1] T. Blumensath and M. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27:265-274, 2009.

[2] P. Boufounos and R. Baraniuk. 1-bit compressive sensing. In Information Science and Systems, 2008.

[3] E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35:2313-2351, 2007.

[4] S. Chen and A. Banerjee. One-bit Compressed Sensing with the k-Support Norm. In AISTATS, 2015.

[5] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289-1306, 2006.

[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.

[7] C. Genovese, J. Jin, L. Wasserman, and Z. Yao. A Comparison of the Lasso and Marginal Regression. Journal of Machine Learning Research, 13:2107-2143, 2012.

[8] S. Gopi, P. Netrapalli, P. Jain, and A. Nori. One-bit Compressed Sensing: Provable Support and Vector Recovery. In ICML, 2013.

[9] L. Jacques, K. Degraux, and C. De Vleeschouwer. Quantized iterative hard thresholding: Bridging 1-bit and high-resolution quantized compressed sensing. arXiv:1305.1786, 2013.

[10] L. Jacques, D. Hammond, and M. Fadili. Dequantizing compressed sensing: When oversampling and non-Gaussian constraints combine. IEEE Transactions on Information Theory, 57:559-571, 2011.

[11] L. Jacques, J. Laska, P. Boufounos, and R. Baraniuk. Robust 1-bit Compressive Sensing via Binary Stable Embeddings of Sparse Vectors. IEEE Transactions on Information Theory, 59:2082-2102, 2013.

[12] J. Kieffer. Uniqueness of locally optimal quantizer for log-concave density and convex error weighting function. IEEE Transactions on Information Theory, 29:42-47, 1983.

[13] J. Laska and R. Baraniuk. Regime change: Bit-depth versus measurement-rate in compressive sensing. arXiv:1110.3450, 2011.

[14] J. Laska, P. Boufounos, M. Davenport, and R. Baraniuk. Democracy in action: Quantization, saturation, and compressive sensing. Applied and Computational Harmonic Analysis, 31:429-443, 2011.

[15] P. Li. Binary and Multi-Bit Coding for Stable Random Projections. arXiv:1503.06876, 2015.
[16] P. Li. One scan 1-bit compressed sensing. Technical report, arXiv:1503.02346, 2015.

[17] P. Li, C.-H. Zhang, and T. Zhang. Compressed counting meets compressed sensing. In COLT, 2014.

[18] J. Liu and S. Wright. Robust dequantized compressive sensing. Applied and Computational Harmonic Analysis, 37:325-346, 2014.

[19] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28:129-137, 1982.

[20] J. Max. Quantizing for Minimum Distortion. IRE Transactions on Information Theory, 6:7-12, 1960.

[21] D. Needell and J. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26:301-321, 2008.

[22] Y. Plan and R. Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66:1275-1297, 2013.

[23] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: a convex programming approach. IEEE Transactions on Information Theory, 59:482-494, 2013.

[24] R. Zhu and Q. Gu. Towards a Lower Sample Complexity for Robust One-bit Compressed Sensing. In ICML, 2015.

[25] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.

[26] M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.

[27] C.-H. Zhang and T. Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27:576-593, 2013.

[28] L. Zhang, J. Yi, and R. Jin. Efficient algorithms for robust one-bit compressive sensing. In ICML, 2014.

[29] T. Zhang. Adaptive Forward-Backward Greedy Algorithm for Learning Sparse Representations. IEEE Transactions on Information Theory, 57:4689-4708, 2011.