Random Projections with Asymmetric Quantization

Xiaoyun Li
Department of Statistics
Rutgers University
Piscataway, NJ 08854
xiaoyun.li@rutgers.edu

Ping Li
Cognitive Computing Lab
Baidu Research USA
Bellevue, WA 98004
liping11@baidu.com

Abstract

The method of random projection has been a popular tool for data compression, similarity search, and machine learning. In many practical scenarios, applying quantization on randomly projected data can be very helpful to further reduce storage cost and facilitate more efficient retrievals, while suffering only a small loss in accuracy. In real-world applications, however, data collected from different sources may be quantized under different schemes, which calls for a study of the asymmetric quantization problem. In this paper, we investigate the cosine similarity estimators derived in such a setting under the Lloyd-Max (LM) quantization scheme. We thoroughly analyze the biases and variances of a series of estimators, including the basic simple estimators, their normalized versions, and their debiased versions. Furthermore, by studying monotonicity, we show that the expectation of the proposed estimators increases with the true cosine similarity, for a broader family of stair-shaped quantizers. Experiments on nearest neighbor search justify the theory and illustrate the effectiveness of the proposed estimators.

1 Introduction

The method of random projections (RP) [35] has become a popular technique to reduce data dimensionality while preserving distances between data points, as guaranteed by the celebrated Johnson-Lindenstrauss (J-L) Lemma and its variants [24, 12, 1]. Given a high-dimensional dataset, the algorithm projects each data point onto a lower-dimensional random subspace. There is a very rich literature on the theory and applications of random projections, such as clustering, classification, near neighbor search, bio-informatics, compressed sensing, etc. [22, 10, 4, 6, 8, 17, 18, 28, 15, 7, 19, 11, 9].

In recent years, random projections + quantization has been an active research topic. That is, the projected data, which are in general real-valued (i.e., infinite precision), are quantized into integers using a small number of bits. Applying quantization on top of random projections has at least two major advantages: (i) the storage cost is further (substantially) reduced; and (ii) some important applications, such as hashing-table-based near neighbor search, require quantized data for indexing purposes. The pioneering example of quantized random projections is the so-called 1-bit (sign) random projection, initially used for analyzing the Max Cut problem [20] and later adopted for near neighbor search [8] and compressed sensing [5, 23, 25]. As one would expect, storing merely 1 bit per projected value may, in many situations, suffer a substantial loss of accuracy compared to using random projections with full (infinite) precision. There have been various studies on (symmetrically) quantized random projections beyond the 1-bit scheme, e.g., [13, 37, 26, 29, 31]. In this paper, we move further to study asymmetric quantization of random projections, a relatively new problem arising from practical scenarios which is also mathematically very interesting.
Every day, data collection takes place in virtually every setting one can think of, but it is often impractical to impose a universal encoding strategy on how the data are stored in each place. Consequently, it becomes a meaningful task to study estimation problems where the data are encoded by different algorithms, namely, the asymmetric case. In this paper, we provide some insights into this type of problem. In particular, we consider recovering inner products from asymmetrically quantized random projections, arising from the following two practical scenarios.

Scenario 1: quantization vs. full-precision. Consider, for example, a retrieval system which uses random projections to process every data vector. To save storage, the projected data stored in the repository are quantized into a small number of bits. When a query data vector arrives, it is first processed by random projections. We then have the option of quantizing the projected query vector before conducting the similarity search (with vectors in the repository); but we do not have to do the quantization step, since we still have the projected query vector in full precision (why waste it?). This situation hence creates the quantization vs. full-precision estimation problem. This setting is natural and practical, and the estimation problem has been studied in the literature, for example [14, 21, 27].

Scenario 2: quantization with different bits. In applications such as large ad hoc networks [36, 30], data are collected and processed by different nodes (e.g., sensors or mobile devices) at different locations before being sent to the central unit or cloud server. However, distinct nodes may use different quantization methods (or different numbers of bits) for many possible reasons, e.g., memory capacity or purpose of data usage. In this situation, information retrieval among data sources using different quantization schemes could be on the cards. As a tightly related topic, asymmetric distributed source coding (with different bits from different sources) has also been considered in [3, 34], among others, for sensor networks. Scenario 1 is in fact an important special case of Scenario 2, where one source of data is quantized with infinite bits. In this paper, we provide a thorough statistical analysis of the above two scenarios.

Our contributions. The major contributions of this paper include the following:

- In Section 3, we provide the bias and variance of the linear and normalized inner product estimators in Scenario 1. We reveal an interesting connection between the variance of the debiased inner product estimator and similarity search, which is very helpful in practice.
- In Sections 4 and 5, we conduct a statistical analysis in Scenario 2, and prove the monotonicity of a large family of asymmetric quantized inner product estimators, which assures their validity for practical use. A new bound on the bias is also derived in the symmetric case.
- In Section 6, an empirical study on various real-world datasets confirms the theoretical findings and illustrates the effectiveness of the proposed quantization schemes.

2 Preliminaries

Random Projections. Let $U = [u_1, \ldots, u_n]^T \in \mathbb{R}^{n \times d}$ be the original data matrix (with $d$ possibly being large). Random projections are realized by $Z = [z_1, \ldots, z_n]^T = UR$, where $R \in \mathbb{R}^{d \times k}$, $k \ll d$, is a random matrix with i.i.d. standard Gaussian entries. Let $\|\cdot\|_2$ denote the $l_2$ (Euclidean) norm.
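As a concrete (toy) illustration of this setup, the following Python sketch generates Gaussian random projections of unit-normalized data and forms the simple full-precision inner product estimate; the data sizes and variable names are ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in d dimensions, each normalized to unit l2 norm.
n, d, k = 100, 1024, 64
U = rng.standard_normal((n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

# Random projection matrix with i.i.d. standard Gaussian entries.
R = rng.standard_normal((d, k))
Z = U @ R   # projected data: one k-dimensional row z_i per original point u_i

# For two rows, each coordinate pair (x_s, y_s) = (Z[i, s], Z[j, s]) is exactly
# bivariate normal with unit variances and correlation rho = <u_i, u_j>.
i, j = 0, 1
rho = float(U[i] @ U[j])
rho_hat = float(Z[i] @ Z[j]) / k   # simple full-precision estimate of rho
print(rho, rho_hat)
```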
Throughout this paper, we assume that every data point is normalized to unit norm¹, i.e., $\|u_i\|_2 = 1$, $1 \le i \le n$. We will hence use the terms inner product and cosine similarity interchangeably. For convenience of presentation, our results (estimators and properties) will be given for a pair of data vectors, $u_i$ and $u_j$ (and correspondingly $z_i$ and $z_j$). Let $\rho = \langle u_i, u_j\rangle$ be the inner product between $u_i$ and $u_j$. We also denote $x = z_i$ and $y = z_j$. It is then easy to verify that each coordinate pair $(x, y)$ is bivariate normal:
$$\begin{pmatrix} x \\ y \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \quad (1)$$

¹ Normalizing each data vector to unit norm is a standard preprocessing step for many applications such as clustering and classification. In this paper, we adopt this assumption merely for convenience. When the data are not normalized, our results still hold, although we will need to store the values of the norms.

Lloyd-Max (LM) quantization [33, 32]. Assume a random signal model with signals generated from a probability distribution with density $X \sim f$. An $M$-level scalar quantizer $q_M(\cdot)$ is specified by $M+1$ decision borders $t_0 < t_1 < \cdots < t_M$ and $M$ reconstruction levels (or codes) $\mu_i$, $i = 1, \ldots, M$, with the quantizing operator defined as
$$q_M(x) = \mu_{i^*}, \qquad i^* = \{i : t_{i-1} < x \le t_i,\ 1 \le i \le M\}. \quad (2)$$
The distortion is an important quantity that measures how much information is lost from the original signal due to quantization. In this paper, we will also assume $M = 2^b$, with $b = 1, 2, \ldots$ being the number of bits used by the quantizer. Thus, we will write $q_b(\cdot)$ instead of $q_M(\cdot)$.

Definition 1. The distortion of a $b$-bit quantizer $q_b(\cdot)$ with respect to the distribution $f$ is defined as
$$E\big[(X - q_b(X))^2\big] = \int (x - q_b(x))^2 f(x)\,dx = \sum_{i=1}^{M} \int_{t_{i-1}}^{t_i} (x - \mu_i)^2 f(x)\,dx. \quad (3)$$

In this paper, $f$ is the standard normal density, i.e., $f(x) = \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ in the conventional Gaussian notation. Also, we will use $Q_b$ to denote the Lloyd-Max (LM) quantizer, which minimizes the distortion, and $D_b$ to denote the corresponding value of the distortion:
$$Q_b = \operatorname*{argmin}_{q} E\big[(X - q_b(X))^2\big], \qquad D_b = E\big[(X - Q_b(X))^2\big]. \quad (4)$$
A basic identity of the LM quantizer is that $E(Q_b^2(X)) = E(Q_b(X)X)$. In practice, Lloyd's algorithm [32] is used to find the solution; it alternates between updating the borders and the reconstruction levels until convergence (and the convergence is guaranteed).

Estimates using full-precision RPs. Consider observations $(x_i, y_i)$, $1 \le i \le k$, i.i.d. from the bivariate normal in (1). The task is to estimate $\rho$. One can use the usual simple estimator
$$\hat\rho_f = \frac{1}{k}\sum_{i=1}^{k} x_i y_i, \qquad E(\hat\rho_f) = \rho, \qquad Var(\hat\rho_f) = \frac{1+\rho^2}{k}, \quad (5)$$
where $E(\hat\rho)$ denotes the expectation and $Var(\hat\rho)$ the variance. Note that the variance grows as $|\rho|$ increases. One can take advantage of the following so-called normalized estimator:
$$\hat\rho_{f,n} = \frac{\sum_{i=1}^k x_i y_i}{\sqrt{\sum_{i=1}^k x_i^2}\,\sqrt{\sum_{i=1}^k y_i^2}}, \qquad E(\hat\rho_{f,n}) = \rho + O\Big(\frac{1}{k}\Big), \qquad Var(\hat\rho_{f,n}) = \frac{(1-\rho^2)^2}{k}. \quad (6)$$
$\hat\rho_{f,n}$ is nearly unbiased and it substantially reduces the variance, especially near the two extreme points $\rho = \pm 1$. We refer readers to the classical textbook [2] and recent papers [28, 27] for more details.

Estimates using symmetric LM-quantized RPs. The work [29] studied the inner product estimator under the LM quantization scheme by analyzing the biases and variances of estimators in the symmetric case; that is, the observations $x_i$ and $y_i$ are quantized by the same LM scheme with the same number of bits ($b$). In this paper, we study the asymmetric setting, using $b_1$ bits for quantizing $x_i$ and $b_2$ bits for $y_i$. Apparently, the work of [29] is a special case of our results (i.e., $b_1 = b_2$). Interestingly, our analysis also leads to a more refined bound on the estimation bias in the symmetric case compared to the corresponding bound in [29]. See Section 4 for the detailed results.
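To make the preliminaries concrete, here is a minimal Python sketch of Lloyd's algorithm for the standard normal, returning the borders, reconstruction levels, and distortion $D_b$ of the LM quantizer $Q_b$. The update rules (midpoint borders, cell conditional means) are the standard ones described above; the initialization, iteration count, and function names are our own choices, and the distortion is computed via the LM identity $E(Q_b^2(X)) = E(Q_b(X)X)$:

```python
import numpy as np
from scipy.stats import norm

def lloyd_max_gaussian(b, iters=500):
    """Lloyd's algorithm for the Lloyd-Max quantizer of N(0, 1) with M = 2**b levels.

    Returns (borders, levels, distortion D_b)."""
    M = 2 ** b
    levels = norm.ppf((np.arange(M) + 0.5) / M)   # initialize with Gaussian quantiles
    for _ in range(iters):
        # Optimal borders are midpoints of adjacent levels (plus +/- infinity).
        t = np.concatenate(([-np.inf], (levels[:-1] + levels[1:]) / 2, [np.inf]))
        # Optimal levels are conditional means of N(0,1) on each cell:
        # E[X | t_{i-1} < X <= t_i] = (phi(t_{i-1}) - phi(t_i)) / (Phi(t_i) - Phi(t_{i-1})).
        levels = (norm.pdf(t[:-1]) - norm.pdf(t[1:])) / (norm.cdf(t[1:]) - norm.cdf(t[:-1]))
    t = np.concatenate(([-np.inf], (levels[:-1] + levels[1:]) / 2, [np.inf]))
    p = norm.cdf(t[1:]) - norm.cdf(t[:-1])
    # At the LM solution E[Q_b(X) X] = E[Q_b(X)^2], hence D_b = 1 - E[Q_b(X)^2].
    Db = 1.0 - float(np.sum(p * levels ** 2))
    return t, levels, Db

def quantize(x, borders, levels):
    """Apply the scalar quantizer: q(x) = mu_i for x in (t_{i-1}, t_i]."""
    idx = np.searchsorted(borders, x, side="left") - 1
    return levels[np.clip(idx, 0, len(levels) - 1)]

for b in (1, 2, 3, 4):
    _, _, Db = lloyd_max_gaussian(b)
    print(b, round(1 - Db, 4))   # 1 - D_b: roughly 0.6366, 0.8825, 0.9655, 0.9905
```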
3 Scenario 1: Quantization vs. Full-precision

Recall that we have i.i.d. observations $\{x_i, y_i\}$, $i = 1, 2, \ldots, k$, from a standard bivariate normal with $x_i \sim N(0,1)$, $y_i \sim N(0,1)$, and $E(x_i y_i) = \rho$. In this section, we study Scenario 1: quantization vs. full-precision. That is, we quantize $x_i$ with $b$ bits and we leave $y_i$ intact. The task is to estimate $\rho$ from $\{Q_b(x_i), y_i\}$, $i = 1, 2, \ldots, k$. We can still try a simple estimator similar to (5):
$$\hat\rho_{b,f} = \frac{1}{k}\sum_{i=1}^{k} Q_b(x_i)\, y_i. \quad (7)$$
As one would expect, this estimator $\hat\rho_{b,f}$ is no longer unbiased. We can show that $E(\hat\rho_{b,f}) = \xi_{1,1}\rho$. Hence, we can attempt to remove the bias by using the following debiased estimator:
$$\hat\rho^{db}_{b,f} = \frac{\hat\rho_{b,f}}{\xi_{1,1}} = \frac{1}{\xi_{1,1}\,k}\sum_{i=1}^{k} Q_b(x_i)\, y_i. \quad (8)$$
We will need to define $\xi_{1,1}$. More generally, and analogous to the notation in [29], we define
$$\gamma_{\alpha,\beta} = E\big[Q_b(x)^\alpha y^\beta\big], \qquad \xi_{\alpha,\beta} = E\big[Q_b(x)^\alpha x^\beta\big]. \quad (9)$$
That is, $\xi_{1,1} = E(Q_b(x)x)$. Note that $\xi_{\alpha,\beta}$ can be represented in terms of $\gamma_{\alpha,\beta}$, but we use both for convenience. Also note that $\xi_{1,1} = \xi_{2,0} = 1 - D_b$ from the definitions. For $b = 1, 2, 3, 4, \infty$, we can compute $\xi_{1,1} = 0.6366, 0.8825, 0.9655, 0.9905, 1$, respectively (to four decimal places). In fact, it is also known that $D_b \approx \frac{3^{3/2}\cdot 2\pi}{12}\, 2^{-2b} = \frac{\sqrt{3}\,\pi}{2}\, 2^{-2b}$, i.e., the bias decays at the rate $O(2^{-2b})$. In the following, Theorem 1 summarizes the expectations and variances of the two estimators $\hat\rho_{b,f}$ and $\hat\rho^{db}_{b,f}$.

Theorem 1.
$$E(\hat\rho_{b,f}) = \xi_{1,1}\rho, \qquad E\big(\hat\rho^{db}_{b,f}\big) = \rho, \quad (10)$$
$$Var(\hat\rho_{b,f}) = \frac{V_{b,f}}{k}, \quad \text{with } V_{b,f} = (\xi_{2,2} - \xi_{2,0} - \xi_{1,1}^2)\rho^2 + \xi_{2,0}, \quad (11)$$
$$Var\big(\hat\rho^{db}_{b,f}\big) = \frac{V^{db}_{b,f}}{k}, \quad \text{with } V^{db}_{b,f} = \frac{(\xi_{2,2} - \xi_{2,0} - \xi_{1,1}^2)\rho^2 + \xi_{2,0}}{\xi_{1,1}^2}. \quad (12)$$

Normalized Estimator. We also attempt to take advantage of the (beneficial) effect of normalization by defining two normalized estimators; their expectations and variances are summarized in Theorem 2.

Theorem 2. As $k \to \infty$, we have
$$\hat\rho_{b,f,n} = \frac{\sum_{i=1}^k Q_b(x_i) y_i}{\sqrt{\sum_{i=1}^k Q_b^2(x_i)}\,\sqrt{\sum_{i=1}^k y_i^2}}, \qquad E(\hat\rho_{b,f,n}) = \sqrt{\xi_{1,1}}\,\rho + O\Big(\frac{1}{k}\Big), \quad (13)$$
$$\hat\rho^{db}_{b,f,n} = \frac{\hat\rho_{b,f,n}}{\sqrt{\xi_{1,1}}}, \qquad E\big(\hat\rho^{db}_{b,f,n}\big) = \rho + O\Big(\frac{1}{k}\Big), \quad (14)$$
$$Var(\hat\rho_{b,f,n}) = \frac{V_{b,f,n}}{k} + O\Big(\frac{1}{k^2}\Big), \qquad Var\big(\hat\rho^{db}_{b,f,n}\big) = \frac{V^{db}_{b,f,n}}{k} + O\Big(\frac{1}{k^2}\Big), \quad (15)$$
where $V_{b,f,n}$ has a closed form in terms of the moments $\gamma_{4,0}$, $\gamma_{1,3}$, $\gamma_{2,0}$, etc. (it is the $b_2 = \infty$ special case of $V_{b_1,b_2,n}$ given later in Theorem 6), and
$$V^{db}_{b,f,n} = \frac{V_{b,f,n}}{\xi_{1,1}}. \quad (16)$$

3.1 Benefits of normalized estimators and debiased estimators

Figure 1 plots (in the left two panels) the variance factors of the two debiased estimators $\hat\rho^{db}_{b,f}$ and $\hat\rho^{db}_{b,f,n}$, to illustrate the benefits of normalization. The right panel of Figure 1 demonstrates that the variance of the normalized estimator is always smaller, and substantially so as $\rho$ moves away from zero.

Figure 1: Scenario 1: Comparison of the theoretical variances of the two debiased estimators $\hat\rho^{db}_{b,f}$ and $\hat\rho^{db}_{b,f,n}$, as functions of $\rho \in [-1, 1]$, for LM quantization with $b = 1, 2, 3, 4, 5$ and full precision. Left panel: the variance factor $V^{db}_{b,f}$. Middle panel: the variance factor $V^{db}_{b,f,n}$ (for the normalized estimator). Right panel: the variance ratio $V^{db}_{b,f} / V^{db}_{b,f,n}$.

To elaborate on the benefit of the debiased estimators, we evaluate the mean square error (MSE): bias² + variance. Given the benefit of normalization, we consider the two normalized estimators:
$$MSE(\hat\rho_{b,f,n}) = \big(1 - \sqrt{\xi_{1,1}}\big)^2\rho^2 + \frac{V_{b,f,n}}{k} + O\Big(\frac{1}{k^2}\Big), \qquad MSE\big(\hat\rho^{db}_{b,f,n}\big) = \frac{V_{b,f,n}}{\xi_{1,1} k} + O\Big(\frac{1}{k^2}\Big).$$
Thus, to compare their mean square errors, we can examine the ratio
$$\frac{MSE(\hat\rho_{b,f,n})}{MSE\big(\hat\rho^{db}_{b,f,n}\big)} = \xi_{1,1} + \frac{k\rho^2\,\xi_{1,1}\big(1 - \sqrt{\xi_{1,1}}\big)^2}{V_{b,f,n}},$$
which quickly becomes larger than 1 as $k$ increases. Note that $\xi_{1,1} \le 1$, but it is very close to 1 when $b \ge 3$. In summary, the MSE of the debiased estimator quickly becomes smaller as $k$ increases.
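For concreteness, below is a small sketch of the four Scenario 1 estimators analyzed in Theorems 1 and 2. It reuses the `lloyd_max_gaussian` and `quantize` helpers sketched in Section 2; those names, and this function, are our own and not from the paper:

```python
import numpy as np

def scenario1_estimators(x, y, borders, levels, xi_11):
    """Scenario 1: x is quantized with b bits, y is kept in full precision.

    Returns the simple, debiased, normalized, and debiased-normalized estimates."""
    qx = quantize(x, borders, levels)                   # Q_b(x_i)
    k = len(x)
    rho_bf = float(np.sum(qx * y)) / k                  # Eq. (7), E = xi_{1,1} * rho
    rho_bf_db = rho_bf / xi_11                          # Eq. (8), E = rho
    rho_bfn = float(np.sum(qx * y)) / (np.linalg.norm(qx) * np.linalg.norm(y))
    rho_bfn_db = rho_bfn / np.sqrt(xi_11)               # debiased normalized estimator
    return rho_bf, rho_bf_db, rho_bfn, rho_bfn_db

# Simulate k projected coordinates of a pair with true cosine similarity rho.
rng = np.random.default_rng(1)
rho, k, b = 0.8, 1024, 2
x = rng.standard_normal(k)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(k)
borders, levels, Db = lloyd_max_gaussian(b)             # helper sketched in Section 2
print(scenario1_estimators(x, y, borders, levels, xi_11=1 - Db))
```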
3.2 Analysis of mis-ordering probabilities in similarity search

In similarity search, the estimates of inner products are subsequently used for ordering data vectors to identify the nearest neighbor of a given query. Intuitively, a more accurate estimator should provide a more accurate ordering, but a precise analysis of the mis-ordering probabilities is needed.

Definition 2. Suppose $u_1, u_2, u_3 \in \mathbb{R}^d$ are three data points (with $u_1$ being a query) with unit norm and pair-wise cosine similarities $\rho_{12}$, $\rho_{13}$ and $\rho_{23}$, respectively. For an estimator $\hat\rho$, the probability of mis-ordering is defined as
$$P_M(u_1; u_2, u_3) = \Pr\big(\hat\rho_{12} > \hat\rho_{13} \,\big|\, \rho_{12} < \rho_{13}\big).$$

Consider a case where $u_3$ is the nearest point to $u_1$ in the data space (which implies $\rho_{12} < \rho_{13}$). If the estimation gives $\hat\rho_{12} > \hat\rho_{13}$, we then make the wrong decision that $u_3$ is not the nearest neighbor of $u_1$.

Theorem 3. (Asymptotic mis-ordering) Suppose $u_1, u_2, u_3 \in \mathbb{R}^d$ are three data points (with $u_1$ being the query) on the unit sphere with pair-wise inner products $\rho_{12}$, $\rho_{13}$ and $\rho_{23}$, respectively. Consider two estimators $\hat\rho$ and $\hat\rho'$ based on $k$ random projections such that, as $k \to \infty$, the normality $\hat\rho \sim N(\alpha\rho, \hat\sigma_\rho^2)$ and $\hat\rho' \sim N(\alpha'\rho, \hat\sigma_\rho'^2)$ holds, with constants $\alpha, \alpha' > 0$. Denote $\delta_\rho^2 = \hat\sigma_\rho^2/\alpha^2$, $\delta_\rho'^2 = \hat\sigma_\rho'^2/\alpha'^2$, and the correlations $C = corr(\hat\rho_{12}, \hat\rho_{13})$, $C' = corr(\hat\rho'_{12}, \hat\rho'_{13})$, respectively. If $\delta'_{\rho_{12}} = a\,\delta_{\rho_{12}}$, $\delta'_{\rho_{13}} = a'\,\delta_{\rho_{13}}$, and
$$C - aa'C' < \frac{(1-a^2)\,\delta_{\rho_{12}}^2 + (1-a'^2)\,\delta_{\rho_{13}}^2}{2\,\delta_{\rho_{12}}\delta_{\rho_{13}}}, \quad (17)$$
for some $0 < a < 1$, $0 < a' < 1$, then as $k \to \infty$ we have $\hat P_M(u_1; u_2, u_3) > \hat P'_M(u_1; u_2, u_3)$, where $\hat P_M(u_1; u_2, u_3)$ and $\hat P'_M(u_1; u_2, u_3)$ are the mis-ordering probabilities of $\hat\rho$ and $\hat\rho'$, respectively.

Remark. There is an interesting connection with the variances of the aforementioned debiased estimators. Condition (17) basically assumes that the variance of the debiased $\hat\rho'$ is smaller than that of the debiased $\hat\rho$ at $\rho_{12}$ and $\rho_{13}$, by factors $a$ and $a'$ respectively. In the special case where $a = a'$ and $C = C'$, the last constraint in (17) reduces to $C < \frac{\delta_{\rho_{12}}^2 + \delta_{\rho_{13}}^2}{2\,\delta_{\rho_{12}}\delta_{\rho_{13}}}$, which always holds since the right-hand side is greater than 1. Also, note that, by the Central Limit Theorem, the normality assumption holds for all the estimators discussed in this paper. Although Theorem 3 is asymptotic, it offers valuable insight for the finite-sample case, since statistically a moderately large $k$ already provides a good approximation to the normal distribution.

The important message of Theorem 3 is that estimators with a lower debiased variance ($\delta$) tend to have a lower mis-ordering probability, which leads to more accurate identification of nearest neighbors in the original data space. This can be very useful in numerous real-world applications.
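The mis-ordering probability of Definition 2 can also be approximated directly by Monte Carlo simulation; the sketch below does so for any plug-in similarity estimator. This is not a procedure from the paper, and the function names and default parameters are our own; it only relies on the fact that, for unit-norm $u_1, u_2, u_3$, each projected coordinate triple is jointly normal with covariance given by the pairwise cosine similarities:

```python
import numpy as np

def misorder_probability(rho12, rho13, rho23, k, estimator, trials=2000, seed=0):
    """Monte Carlo estimate of Pr(rho_hat_12 > rho_hat_13) when rho12 < rho13.

    `estimator(x, y)` maps two length-k projection vectors to a similarity estimate."""
    rng = np.random.default_rng(seed)
    # Covariance of the projected coordinates (z1_s, z2_s, z3_s) for unit-norm data.
    C = np.array([[1.0, rho12, rho13],
                  [rho12, 1.0, rho23],
                  [rho13, rho23, 1.0]])
    L = np.linalg.cholesky(C)
    errs = 0
    for _ in range(trials):
        Z = rng.standard_normal((k, 3)) @ L.T   # k i.i.d. trivariate normal rows
        r12 = estimator(Z[:, 0], Z[:, 1])
        r13 = estimator(Z[:, 0], Z[:, 2])
        errs += (r12 > r13)
    return errs / trials

# Full-precision normalized estimator as a baseline.
full_norm = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(misorder_probability(0.70, 0.80, 0.75, k=256, estimator=full_norm))
```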
4 Scenario 2: Quantization with Different Bits

We now consider the more general case (Scenario 2), where the data vectors are LM-quantized with different numbers of bits. That is, given observations $\{x_i, y_i\}$, $1 \le i \le k$, we quantize $x_i$ using $b_1$ bits and $y_i$ using $b_2$ bits. Without loss of generality, we assume $b_1 < b_2$. We denote the two Lloyd-Max quantizers by $Q_{b_1}$ and $Q_{b_2}$, with distortions $D_{b_1}$ and $D_{b_2}$, respectively. Similar to Scenario 1, we define the asymmetric estimator and the corresponding normalized estimator as
$$\hat\rho_{b_1,b_2} = \frac{1}{k}\sum_{i=1}^{k} Q_{b_1}(x_i)\, Q_{b_2}(y_i), \qquad \hat\rho_{b_1,b_2,n} = \frac{\sum_{i=1}^k Q_{b_1}(x_i)\, Q_{b_2}(y_i)}{\sqrt{\sum_{i=1}^k Q_{b_1}^2(x_i)}\,\sqrt{\sum_{i=1}^k Q_{b_2}^2(y_i)}}. \quad (18)$$
As one might expect, the analysis becomes somewhat more difficult. Similar to the analysis for Scenario 1, in this section we will use the following notation:
$$\xi_{\alpha,\beta} = E\big[Q_{b_1}(x)^\alpha x^\beta\big], \qquad \gamma_{\alpha,\beta} = E\big[Q_{b_2}(x)^\alpha x^\beta\big], \qquad \zeta_{\alpha,\beta} = E\big[Q_{b_1}(x)^\alpha Q_{b_2}(y)^\beta\big], \quad (19)$$
which allows us to express the expectation and variance of $\hat\rho_{b_1,b_2}$ as
$$E(\hat\rho_{b_1,b_2}) = \zeta_{1,1}, \qquad Var(\hat\rho_{b_1,b_2}) = \frac{V_{b_1,b_2}}{k}, \qquad V_{b_1,b_2} = \zeta_{2,2} - \zeta_{1,1}^2. \quad (20)$$
$\zeta_{1,1}$ can be expressed as an infinite sum, but it appears difficult to simplify further. Nevertheless, we are able to quantify the expectation of $\hat\rho_{b_1,b_2}$ in Theorem 4.

Theorem 4. The following two bounds hold for $\rho \in [-1, 1]$:
$$\big|\, E(\hat\rho_{b_1,b_2}) - (1-D_{b_1})(1-D_{b_2})\rho \,\big| \le \Delta_1, \quad \text{and} \quad (21)$$
$$\Delta_2 - \Delta_1 \;\le\; \big| E(\hat\rho_{b_1,b_2}) - \rho \big| \;\le\; \Delta_1 + \Delta_2, \quad \text{where} \quad (22)$$
$$\Delta_1 = D_{b_2}|\rho|^3, \qquad \Delta_2 = (D_{b_1} + D_{b_2} - D_{b_1}D_{b_2})|\rho|.$$

Remark. When $b_2 \to \infty$ (i.e., Scenario 1), we have $D_{b_2} \to 0$ and the bound reduces to the equality $E(\hat\rho_{b_1,\infty}) = (1 - D_{b_1})\rho$, which matches the result in Section 3. Eq. (22) provides upper and lower bounds on the absolute bias of $\hat\rho_{b_1,b_2}$.

When $b_1 = b_2$ (i.e., the symmetric quantization case), Theorem 5 presents more refined bounds on the bias of $\hat\rho_{b_1,b_2}$.

Theorem 5. (Symmetric quantization) Suppose $b_1 = b_2 = b$. For $\rho \in [-1, 1]$, we have
$$(2D_b - D_b^2)|\rho| - D_b(1-D_b)|\rho|^3 \;\le\; \big|E(\hat\rho_{b,b}) - \rho\big| \;\le\; (2D_b - D_b^2)|\rho|. \quad (23)$$

Remark. Compared to [29], which derived $|E(\hat\rho_{b,b}) - \rho| \le 2D_b|\rho|$, our bounds are tighter.

What about a debiased version of $\hat\rho_{b_1,b_2}$? This is slightly tricky, because $E(\hat\rho_{b_1,b_2}) = \zeta_{1,1}$ cannot be explicitly expressed as $c\rho$ for some constant $c$ (otherwise the debiased estimator would simply be $\hat\rho_{b_1,b_2}/c$). In Theorem 4, Eq. (21) implies that the expectation of $\hat\rho_{b_1,b_2}$ is close to $(1-D_{b_1})(1-D_{b_2})\rho$. Thus, we recommend $\frac{\hat\rho_{b_1,b_2}}{(1-D_{b_1})(1-D_{b_2})}$ as a surrogate for the debiased estimator. Next, we provide the expectation and variance of the normalized estimator in Theorem 6.

Theorem 6. (Normalized estimator) As $k \to \infty$, we have
$$E(\hat\rho_{b_1,b_2,n}) = \frac{\zeta_{1,1}}{\sqrt{\xi_{2,0}\gamma_{2,0}}} + O\Big(\frac{1}{k}\Big), \qquad Var(\hat\rho_{b_1,b_2,n}) = \frac{V_{b_1,b_2,n}}{k} + O\Big(\frac{1}{k^2}\Big), \quad (24)$$
$$V_{b_1,b_2,n} = \frac{\zeta_{2,2} - \zeta_{1,1}^2}{\xi_{2,0}\gamma_{2,0}} - \frac{\zeta_{1,1}\zeta_{3,1} - \zeta_{1,1}^2\xi_{2,0}}{\xi_{2,0}^2\gamma_{2,0}} - \frac{\zeta_{1,1}\zeta_{1,3} - \zeta_{1,1}^2\gamma_{2,0}}{\xi_{2,0}\gamma_{2,0}^2} \quad (25)$$
$$\qquad\qquad + \frac{\zeta_{1,1}^2\zeta_{2,2} - \zeta_{1,1}^2\xi_{2,0}\gamma_{2,0}}{2\xi_{2,0}^2\gamma_{2,0}^2} + \frac{\zeta_{1,1}^2\xi_{4,0} - \zeta_{1,1}^2\xi_{2,0}^2}{4\xi_{2,0}^3\gamma_{2,0}} + \frac{\zeta_{1,1}^2\gamma_{4,0} - \zeta_{1,1}^2\gamma_{2,0}^2}{4\xi_{2,0}\gamma_{2,0}^3}.$$

Remark. When $b_2 = \infty$, the expected value of $\hat\rho_{b_1,b_2,n}$ reduces to that of $\hat\rho_{b_1,f,n}$ in Scenario 1. In that case we have $\zeta_{1,1} = \xi_{2,0}\rho$, $\gamma_{2,0} = 1$, and $\gamma_{4,0} = 3$, and it is easy to check that the expression of the variance reduces to the corresponding formula in Theorem 2. Also, note that $\xi_{2,0} = 1 - D_{b_1}$, $\gamma_{2,0} = 1 - D_{b_2}$, and $\zeta_{1,1} \approx (1-D_{b_1})(1-D_{b_2})\rho$. This means that, in practice, we can use $\frac{\hat\rho_{b_1,b_2,n}}{\sqrt{(1-D_{b_1})(1-D_{b_2})}}$ as a surrogate for the debiased version of $\hat\rho_{b_1,b_2,n}$. We plot the related results in Figure 2, which verifies the theory in Theorems 4, 5 and 6.

Figure 2: Left panel: the absolute bias (solid curves, in log10 scale) of $\hat\rho_{b_1,b_2}$ by simulation, along with the upper bounds (red dashed curves) and lower bounds (blue dashed curves) in Eq. (22). Middle panel: the absolute bias of $\hat\rho_{b_1,b_2}$ with $b_1 = b_2$ (the symmetric case), along with the upper and lower bounds in Eq. (23). Right panel: the variance factor $V_{b_1,b_2,n}$ of the normalized estimator in Theorem 6, for bit combinations $(b_1, b_2) = (1,2), (1,3), (2,3), (3,4), (4,5), (\infty,\infty)$.
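Below is a sketch of the Scenario 2 estimators in Eq. (18), together with the surrogate debiasing suggested by Theorems 4 and 6 (dividing by $(1-D_{b_1})(1-D_{b_2})$ for the simple estimator and by its square root for the normalized one). It again reuses the `lloyd_max_gaussian` and `quantize` helpers from our earlier sketch; all names and parameters here are our own:

```python
import numpy as np

def scenario2_estimators(x, y, q1, q2):
    """Scenario 2: x quantized with b1 bits, y quantized with b2 bits.

    q1 = (borders1, levels1, D_b1) and q2 = (borders2, levels2, D_b2), as returned
    by the lloyd_max_gaussian sketch. Returns the simple and normalized estimators
    of Eq. (18) together with their surrogate-debiased versions."""
    (t1, mu1, D1), (t2, mu2, D2) = q1, q2
    qx, qy = quantize(x, t1, mu1), quantize(y, t2, mu2)
    k = len(x)
    rho = float(np.sum(qx * qy)) / k                                  # Eq. (18)
    rho_n = float(np.sum(qx * qy)) / (np.linalg.norm(qx) * np.linalg.norm(qy))
    c = (1 - D1) * (1 - D2)       # E(rho_hat) is approximately c * rho (Theorem 4)
    return rho, rho_n, rho / c, rho_n / np.sqrt(c)

rng = np.random.default_rng(2)
rho, k = 0.6, 2048
x = rng.standard_normal(k)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(k)
print(scenario2_estimators(x, y, lloyd_max_gaussian(2), lloyd_max_gaussian(4)))
```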
5 Monotonicity of Inner Product Estimates

In applications such as nearest neighbor retrieval, the order of the distances tends to matter more than their exact values. Given an estimator $\hat\rho$, one would hope that $E(\hat\rho)$ is monotone in $\rho$. This is indeed the case in the full-precision situation: recall that, in Section 2, given i.i.d. observations $\{x_i, y_i\}$, $i = 1, 2, \ldots, k$, the full-precision estimator $\hat\rho_f = \frac{1}{k}\sum_{i=1}^k x_i y_i$ is monotone in $\rho$ in expectation because $E(\hat\rho_f) = \rho$. Naturally, one may ask whether the expectations of our quantized estimators, e.g., $\hat\rho_{b_1,b_2} = \frac{1}{k}\sum_{i=1}^k Q_{b_1}(x_i) Q_{b_2}(y_i)$, are also monotone in $\rho$. This turns out to be a non-trivial question. We solve this important problem rigorously in several stages, and our analysis is not restricted to LM quantizers. To proceed, we need the following definition of an increasing quantizer.

Definition 3. (Increasing quantizer) Let $Q$ be an $M$-level quantizer with borders $t_0 < \cdots < t_M$ and reconstruction levels $\mu_1, \ldots, \mu_M$. We say that $Q$ is an increasing quantizer if $\mu_1 < \cdots < \mu_M$.

We first prove the following three lemmas for increasing quantizers.

Lemma 1. (1-bit vs. others) Suppose $Q_{b_1}, Q_{b_2}$ are increasing quantizers symmetric about 0, with $b_1 \ge 1$ and $b_2 = 1$. Then $E(Q_{b_1}(x)Q_{b_2}(y))$ is strictly increasing in $\rho$ on $[-1, 1]$.

Lemma 2. (2-bit vs. 2-bit) Suppose $Q_{b_1}, Q_{b_2}$ are any two increasing quantizers symmetric about 0, with $b_1 = b_2 = 2$. Then $E(Q_{b_1}(x)Q_{b_2}(y))$ is strictly increasing in $\rho$ on $[-1, 1]$.

Lemma 3. (Universal decomposition) For any increasing discrete quantizer $Q_b$, $b \ge 3$, which is symmetric about 0, there exist a 2-bit symmetric increasing quantizer $Q_2$ and a $(b-1)$-bit symmetric increasing quantizer $Q_{b-1}$ such that $Q_b = Q_{b-1} + Q_2$.

With the above lemmas, we are ready to prove the monotonicity of $E(Q_{b_1}(x)Q_{b_2}(y))$.

Theorem 7. (Monotonicity) For any increasing quantizers $Q_{b_1}$ and $Q_{b_2}$ symmetric about 0 with $b_1 \ge 1$ and $b_2 \ge 1$ bits, the function $E(Q_{b_1}(x)Q_{b_2}(y))$ is increasing in $\rho$.

Proof. By Lemma 1, the statement holds for $b_1 = 1$ and arbitrary $b_2$. Now consider the case $b_1 \ge 2$, $b_2 \ge 2$. By Lemma 3, we can always write
$$Q_{b_1}(x) = \sum_{i=1}^{b_1-1} Q_2^{(i)}(x), \qquad Q_{b_2}(y) = \sum_{j=1}^{b_2-1} \hat Q_2^{(j)}(y),$$
where $Q_2^{(1)}, \ldots, Q_2^{(b_1-1)}$ and $\hat Q_2^{(1)}, \ldots, \hat Q_2^{(b_2-1)}$ are two sets of symmetric increasing 2-bit quantizers. Thus,
$$\frac{\partial E\big(Q_{b_1}(x)Q_{b_2}(y)\big)}{\partial \rho} = \frac{\partial}{\partial \rho} E\Big(\sum_{i=1}^{b_1-1} Q_2^{(i)}(x)\sum_{j=1}^{b_2-1}\hat Q_2^{(j)}(y)\Big) = \sum_{i=1}^{b_1-1}\sum_{j=1}^{b_2-1} \frac{\partial E\big(Q_2^{(i)}(x)\hat Q_2^{(j)}(y)\big)}{\partial \rho} > 0,$$
where the last equality is due to the linearity of expectation and differentiation, and the inequality holds because of Lemma 2. Therefore, $E(Q_{b_1}(x)Q_{b_2}(y))$ is increasing in $\rho$ for any $b_1 \ge 1$ and $b_2 \ge 1$.

Recall that, in Section 3.2, we proved the result on the mis-ordering probability, Theorem 3, which implicitly assumes that the estimators have expectations monotone in $\rho$. Therefore, Theorem 7 provides the proof needed to support that assumption in Theorem 3.
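Theorem 7 can also be sanity-checked numerically. The sketch below estimates $E(Q_{b_1}(x)Q_{b_2}(y))$ on a grid of $\rho$ by Monte Carlo, for LM quantizers (a special case of the increasing symmetric quantizers covered by the theorem), and checks that the estimated curve is increasing. This is only an illustration, not the proof, and it reuses the helper sketches from the earlier sections:

```python
import numpy as np

def expected_quantized_product(rho, q1, q2, n=200_000, seed=3):
    """Monte Carlo estimate of E[Q_b1(x) Q_b2(y)] for bivariate normal (x, y).

    The same seed is used for every rho (common random numbers) so that the
    estimated curve is smooth in rho."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
    (t1, mu1, _), (t2, mu2, _) = q1, q2
    return float(np.mean(quantize(x, t1, mu1) * quantize(y, t2, mu2)))

q1, q2 = lloyd_max_gaussian(1), lloyd_max_gaussian(3)   # asymmetric: 1 bit vs. 3 bits
grid = np.linspace(-1.0, 1.0, 21)
vals = [expected_quantized_product(r, q1, q2) for r in grid]
print(np.all(np.diff(vals) > 0))   # expected to print True (up to Monte Carlo noise)
```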
6 Empirical Study: Similarity Search

In this section, we test the proposed estimators on 3 datasets from the UCI repository (Table 1) [16]. The experiments clearly confirm that the normalization step uniformly improves the search accuracy. The results also, to an extent, illustrate the influence of the mis-ordering probability studied in Theorem 3.

Table 1: Datasets used in the empirical study. Mean ρ is the average pair-wise cosine similarity over sample pairs. Mean 1-NN ρ is the average cosine similarity of each point to its nearest neighbor.

Dataset     # samples   # features   # classes   Mean ρ   Mean 1-NN ρ
Arcene      200         10000        2           0.63     0.86
BASEHOCK    1993        4862         2           0.33     0.59
COIL20      1440        1024         20          0.61     0.93

Figure 3: Nearest neighbor search recovery results (1-NN precision, in %, versus the number of projections, from 2^6 to 2^12) using cosine similarity and quantized estimators from random projections, for the three datasets. Columns 1 and 2 (Scenario 1): the estimator $\hat\rho_{b,f}$ and its normalized version $\hat\rho_{b,f,n}$, with $b = 1, \ldots, 5$ and full precision. Columns 3 and 4 (Scenario 2): the estimator $\hat\rho_{b_1,b_2}$ and its normalized version $\hat\rho_{b_1,b_2,n}$, with $(b_1, b_2) = (1,2), (1,3), (2,4), (3,4), (4,5), (\infty,\infty)$.

For each dataset, all examples are preprocessed to have unit norm. The evaluation metric we adopt is the 1-NN precision: the proportion of true nearest neighbors (NN) that are recovered by the nearest neighbor estimated from quantized random projections, averaged over all examples.
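As an illustration of how this metric can be computed, here is a small sketch of a brute-force 1-NN precision evaluation; the function name, toy data, and implementation details are ours and are not taken from the paper:

```python
import numpy as np

def one_nn_precision(U, k, estimator, seed=0):
    """Fraction of points whose estimated 1-NN matches the exact cosine 1-NN.

    U: (n, d) data matrix with unit-norm rows; k: number of random projections;
    estimator(Zi, Zj) returns a similarity estimate from two projected rows."""
    rng = np.random.default_rng(seed)
    n, d = U.shape
    Z = U @ rng.standard_normal((d, k))

    exact = U @ U.T                       # true cosine similarities
    np.fill_diagonal(exact, -np.inf)
    true_nn = exact.argmax(axis=1)

    hits = 0
    for i in range(n):
        est = np.array([estimator(Z[i], Z[j]) if j != i else -np.inf for j in range(n)])
        hits += (est.argmax() == true_nn[i])
    return hits / n

# Example with the full-precision normalized estimator on toy data.
rng = np.random.default_rng(4)
U = rng.standard_normal((200, 100))
U /= np.linalg.norm(U, axis=1, keepdims=True)
full_norm = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(one_nn_precision(U, k=256, estimator=full_norm))
```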
We summarize the results in Figure 3. First of all, as the number of bits increases, the performance of the quantized estimators converges to that of the full-precision estimator, as expected. Importantly, the normalization step substantially improves the performance, as seen by comparing Column 2 with Column 1 (for Scenario 1) and Column 4 with Column 3 (for Scenario 2). In addition, we can, to an extent, validate the assertions of Theorem 3, which says that a smaller variance of the debiased estimator should improve the NN recovery precision. In Figure 1 (left panel), we see that the variance of the debiased estimator $\hat\rho^{db}_{b,f}$ with $b = 1$ is much smaller than with $b \ge 2$ in the high-similarity region (e.g., $|\rho| > 0.8$), and roughly the same at $\rho = 0.6$. Since Arcene and COIL20 have high mean 1-NN $\rho$ (0.86 and 0.93, respectively), Theorem 3 suggests that the cosine estimate $\hat\rho^{db}_{1,f}$ should (in general) have a smaller mis-ordering probability than with $b \ge 2$, implying a higher 1-NN precision. On the other hand, the average 1-NN $\rho$ of BASEHOCK is 0.59, so $\hat\rho^{db}_{b,f}$ with $b = 1, 2, \ldots$ should give similar performance. These claims are consistent with Column 1 of Figure 3.

The variance of the debiased normalized estimator $\hat\rho^{db}_{b,f,n}$ (Figure 1, middle panel) decreases as $b$ increases, uniformly for all $\rho$. Hence, by Theorem 3, we expect the 1-NN precision to increase with larger $b$ on all 3 datasets, as confirmed by Column 2 of Figure 3.

7 Conclusion

In this paper, we conduct a comprehensive study of estimating inner product similarities from random projections followed by asymmetric quantization. This setting is theoretically interesting and also has many practical applications. For example, in a retrieval system, data vectors (after random projections) in the repository are quantized to reduce storage and communication; when a new query vector arrives, it does not have to be quantized. Another example of asymmetric quantization arises when data are collected from different sources, each with its own quantization strategy. In this study, we propose a series of estimators for asymmetric quantization, starting with the simple basic estimator, then the normalized estimator, and then the debiased estimators. We provide a thorough analysis of the estimation errors. Furthermore, we analyze the mis-ordering probabilities and the monotonicity properties of the estimators. While our methods and analyses are largely based on the classical Lloyd-Max (LM) method, they can be extended to other, more general quantization schemes.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.
[2] Theodore W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, third edition, 2003.
[3] Jay M. Berger. A note on error detection codes for asymmetric channels. Information and Control, 4(1):68-73, 1961.
[4] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 245-250, San Francisco, CA, 2001.
[5] Petros Boufounos and Richard G. Baraniuk. 1-bit compressive sensing. In 42nd Annual Conference on Information Sciences and Systems (CISS), pages 16-21, Princeton, NJ, 2008.
[6] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225-242, 2002.
[7] Emmanuel Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489-509, 2006.
[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pages 380-388, Montreal, Canada, 2002.
[9] George E. Dahl, Jack W. Stokes, Li Deng, and Dong Yu. Large-scale malware classification using random projections and neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3422-3426, Vancouver, Canada, 2013.
[10] Sanjoy Dasgupta. Experiments with random projection. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence (UAI), pages 143-151, Stanford, CA, 2000.
[11] Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), pages 537-546, Victoria, Canada, 2008.
[12] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60-65, 2003.
[13] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry (SCG), pages 253-262, Brooklyn, NY, 2004.
[14] Wei Dong, Moses Charikar, and Kai Li. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 123-130, 2008.
[15] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
[16] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[17] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 301-312, San Diego, CA, 2003.
[18] Xiaoli Zhang Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 186-193, Washington, DC, 2003.
[19] Yoav Freund, Sanjoy Dasgupta, Mayank Kabra, and Nakul Verma. Learning the structure of manifolds using random projections. In Advances in Neural Information Processing Systems (NIPS), pages 473-480, Vancouver, Canada, 2007.
[20] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115-1145, 1995.
[21] Albert Gordo, Florent Perronnin, Yunchao Gong, and Svetlana Lazebnik. Asymmetric distances for binary embeddings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):33-47, 2014.
[22] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing (STOC), pages 604-613, Dallas, TX, 1998.
[23] Laurent Jacques, Jason N. Laska, Petros T. Boufounos, and Richard G. Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Transactions on Information Theory, 59(4):2082-2102, 2013.
[24] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.
[25] Ping Li. One scan 1-bit compressed sensing. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1515-1523, Cadiz, Spain, 2016.
[26] Ping Li. Binary and multi-bit coding for stable random projections. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1430-1438, Fort Lauderdale, FL, 2017.
[27] Ping Li. Sign-full random projections. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), pages 4205-4212, Honolulu, HI, 2019.
[28] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In 19th Annual Conference on Learning Theory (COLT), pages 635-649, Pittsburgh, PA, 2006.
[29] Ping Li and Martin Slawski. Simple strategies for recovering inner products from coarsely quantized random projections. In Advances in Neural Information Processing Systems (NIPS), pages 4567-4576, Long Beach, CA, 2017.
[30] Xiang-Yang Li. Wireless Ad Hoc and Sensor Networks: Theory and Applications. Cambridge University Press, 2008.
[31] Xiaoyun Li and Ping Li. Generalization error analysis of quantized compressive learning. In Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019.
[32] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
[33] Joel Max. Quantizing for minimum distortion. IRE Transactions on Information Theory, 6(1):7-12, 1960.
[34] S. Sandeep Pradhan and Kannan Ramchandran. Group-theoretic construction and analysis of generalized coset codes for symmetric/asymmetric distributed source coding. In Proceedings of the Conference on Information Sciences and Systems (CISS), 2000.
[35] Santosh S. Vempala. The Random Projection Method. American Mathematical Society, 2004.
[36] Ossama Younis and Sonia Fahmy. Distributed clustering in ad-hoc sensor networks: A hybrid, energy-efficient approach. In Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), Hong Kong, China, 2004.
[37] Argyrios Zymnis, Stephen P. Boyd, and Emmanuel J. Candès. Compressed sensing with quantized measurements. IEEE Signal Processing Letters, 17(2):149-152, 2010.