# Sign-Full Random Projections

*The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)*

Ping Li, Cognitive Computing Lab (CCL), Baidu Research USA, Bellevue, WA 98004, USA. pingli98@gmail.com

**Abstract.** The method of 1-bit (sign-sign) random projections has been a popular tool for efficient search and machine learning on large datasets. Given two $D$-dimensional data vectors $u, v \in \mathbb{R}^D$, one can generate $x = \sum_{i=1}^D u_i r_i$ and $y = \sum_{i=1}^D v_i r_i$, where $r_i \sim N(0,1)$ iid, and then estimate the cosine similarity $\rho$ from $\mathrm{sgn}(x)$ and $\mathrm{sgn}(y)$. In this paper, we study a series of estimators for sign-full random projections. First we prove $E(\mathrm{sgn}(x)\,y) = \sqrt{\tfrac{2}{\pi}}\,\rho$, which provides an estimator for $\rho$. Interestingly, this estimator can be substantially improved by normalizing $y$. We then study estimators based on $E\big(y_-\,1_{\{x\ge 0\}} + y_+\,1_{\{x<0\}}\big)$ and its normalized version. We analyze the theoretical limit (via the MLE) and conclude that, among the proposed estimators, no single estimator can achieve (close to) the theoretically optimal asymptotic variance for the entire range of $\rho$. On the other hand, the estimators can be combined to achieve a variance close to that of the MLE. In applications such as near-neighbor search, duplicate detection, knn classification, etc., the training data are first transformed via random projections and then only the signs of the projected data points (i.e., the $\mathrm{sgn}(x)$) are stored; the original training data are discarded. When a new data point arrives, we apply random projections but do not necessarily need to quantize the projected data (i.e., the $y$) to 1 bit. Therefore, sign-full random projections can be practically useful, and this gain essentially comes at no additional cost.

## Introduction

Consider two high-dimensional data vectors $u, v \in \mathbb{R}^D$. Suppose we generate a $D$-dimensional random vector whose entries are iid standard normal, i.e., $r_i \sim N(0,1)$, and compute
$$x = \sum_{i=1}^D u_i r_i, \qquad y = \sum_{i=1}^D v_i r_i.$$
In expectation, $E(xy) = \langle u, v\rangle = \sum_{i=1}^D u_i v_i$. If we generate $x$ and $y$ independently $k$ times, then $\frac{1}{k}\sum_{j=1}^k x_j y_j \rightarrow E(xy) = \langle u, v\rangle$, and the quality of the approximation improves as $k$ increases. This idea of random projections has been widely used for large-scale search and machine learning (Johnson and Lindenstrauss 1984; Vempala 2004; Papadimitriou et al. 1998; Dasgupta 1999; Datar et al. 2004; Li, Hastie, and Church 2006).

A popular variant is 1-bit random projections (Goemans and Williamson 1995; Charikar 2002), which we refer to as sign-sign random projections, based on the following collision probability:
$$P\big(\mathrm{sgn}(x) = \mathrm{sgn}(y)\big) = 1 - \frac{1}{\pi}\cos^{-1}\rho, \qquad \rho = \frac{\sum_{i=1}^D u_i v_i}{\sqrt{\sum_{i=1}^D u_i^2}\,\sqrt{\sum_{i=1}^D v_i^2}},$$
where $\rho$ is the cosine similarity between the two original data vectors $u$ and $v$. The method of sign-sign random projections has become popular, for example, in many search-related applications (Henzinger 2006; Manku, Jain, and Sarma 2007; Grimes 2008; Hajishirzi, Yih, and Kolcz 2010; Türe, Elsayed, and Lin 2011; Manzoor, Milajerdi, and Akoglu 2016). Note that by using only the signs of the projected data, we lose the information about the norms of the original vectors. Thus, in this context, without loss of generality, we assume that the original data vectors are normalized, i.e., $\sum_{i=1}^D u_i^2 = \sum_{i=1}^D v_i^2 = 1$. In other words, we can assume the projected data to be $x \sim N(0,1)$ and $y \sim N(0,1)$.
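As a concrete illustration (not part of the paper), the minimal NumPy sketch below generates the shared Gaussian projections and inverts the collision probability above to obtain the classic sign-sign estimate of $\rho$. The function names, the toy data, and the parameter choices are assumptions made here purely for illustration.

```python
import numpy as np

def project(u, v, k, seed=0):
    """Apply k shared Gaussian random projections to two D-dim vectors."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((k, len(u)))   # rows r_j with entries ~ N(0,1) iid
    return R @ u, R @ v                    # x_j = <u, r_j>,  y_j = <v, r_j>

def rho_hat_1(x, y):
    """Sign-sign (1-bit) estimate of cosine similarity:
    invert P(sgn(x) = sgn(y)) = 1 - arccos(rho)/pi."""
    p_hat = np.mean(np.sign(x) == np.sign(y))
    return np.cos(np.pi * (1.0 - p_hat))

# Toy check with normalized data vectors (illustrative only).
rng = np.random.default_rng(1)
u = rng.standard_normal(1000); u /= np.linalg.norm(u)
v = u + 0.5 * rng.standard_normal(1000); v /= np.linalg.norm(v)
x, y = project(u, v, k=2000)
print(float(u @ v), rho_hat_1(x, y))       # true cosine vs. 1-bit estimate
```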
Interestingly, one can take advantage of $E(\mathrm{sgn}(x)\,y)$ and several variants to considerably improve 1-bit random projections. This gain essentially comes at no additional cost. Basically, the training data after projections are stored using signs (e.g., $\mathrm{sgn}(x)$). When a new data vector arrives, however, we need to generate its random projections ($y$) but do not necessarily have to quantize them.

## Related Work

The idea of 1-bit projections has been extended to sign Cauchy projections for estimating the $\chi^2$ similarity (Li, Samorodnitsky, and Hopcroft 2013), and to 1-bit minwise hashing (Broder 1997; Li and König 2010) for estimating the resemblance between sets. More general quantization schemes of random projections have been studied in, for example, (Li and Slawski 2017). In the context of random projections, the idea of estimating $\rho$ from $\mathrm{sgn}(x)$ and $y$ was explored in (Dong, Charikar, and Li 2008). Similar ideas were also studied in the quantization literature, without using random projections (Jégou, Douze, and Schmid 2011).

## Review of Estimators Based on Full Information after Random Projections

In this context, since we are only concerned with estimating the cosine $\rho$, we can without loss of generality assume that the original data are normalized, i.e., $\|u\| = \|v\| = 1$. The projected data thus follow a bivariate normal distribution:
$$\begin{pmatrix} x_j \\ y_j \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right), \quad \text{iid}, \quad j = 1, 2, \ldots, k,$$
where $\rho = \sum_{i=1}^D u_i v_i$. The obvious estimator for $\rho$ is based on the inner product:
$$\hat\rho_f = \frac{1}{k}\sum_{j=1}^k x_j y_j, \qquad E(\hat\rho_f) = \rho, \qquad Var(\hat\rho_f) = \frac{V_f}{k}, \qquad V_f = 1 + \rho^2.$$
The derivation of the variance (i.e., $V_f$) can be found in (Li, Hastie, and Church 2006). Note that $Var(\hat\rho_f)$ is largest when $|\rho| = 1$. This is disappointing, because when two data vectors are identical, we ought to be able to estimate their similarity with no error. One can improve the estimator by simply normalizing the projected data; see (Anderson 2003) for the derivation:
$$\hat\rho_{f,n} = \frac{\sum_{j=1}^k x_j y_j}{\sqrt{\sum_{j=1}^k x_j^2}\,\sqrt{\sum_{j=1}^k y_j^2}}, \qquad E(\hat\rho_{f,n}) = \rho + O\!\left(\frac{1}{k}\right), \qquad Var(\hat\rho_{f,n}) = \frac{V_{f,n}}{k}, \qquad V_{f,n} = \left(1-\rho^2\right)^2.$$
In particular, $V_{f,n} = 0$ when $|\rho| = 1$, as desired. One can further improve $\hat\rho_{f,n}$, but not by much. The theoretical limit (i.e., the Cramér-Rao bound) of the asymptotic variance (Lehmann and Casella 1998) is attained by the maximum likelihood estimator (MLE), which is the solution of the following cubic equation:
$$\hat\rho^3 - \hat\rho^2\,\frac{1}{k}\sum_{j=1}^k x_j y_j + \hat\rho\left(\frac{1}{k}\sum_{j=1}^k x_j^2 + \frac{1}{k}\sum_{j=1}^k y_j^2 - 1\right) - \frac{1}{k}\sum_{j=1}^k x_j y_j = 0.$$
This cubic equation can have multiple real roots, but only with a small probability (Li, Hastie, and Church 2006), which decreases exponentially fast with increasing $k$. The MLE is asymptotically unbiased and its asymptotic variance is
$$E(\hat\rho_{f,m}) = \rho + O\!\left(\frac{1}{k}\right), \qquad Var(\hat\rho_{f,m}) = \frac{V_{f,m}}{k}, \qquad V_{f,m} = \frac{(1-\rho^2)^2}{1+\rho^2}.$$

## Estimator Based on Sign-Sign Random Projections

From $\Pr\big(\mathrm{sgn}(x_j) = \mathrm{sgn}(y_j)\big) = 1 - \frac{1}{\pi}\cos^{-1}\rho$, we have an asymptotically unbiased estimator and its variance:
$$\hat\rho_1 = \cos\!\left(\pi\left(1 - \frac{1}{k}\sum_{j=1}^k 1_{\{\mathrm{sgn}(x_j)=\mathrm{sgn}(y_j)\}}\right)\right),$$
$$E(\hat\rho_1) = \rho + O\!\left(\frac{1}{k}\right), \qquad Var(\hat\rho_1) = \frac{V_1}{k}, \qquad V_1 = \cos^{-1}\rho\,\big(\pi - \cos^{-1}\rho\big)\big(1-\rho^2\big).$$
As will be shown later in Lemma 2, when $|\rho| \rightarrow 1$,
$$V_1 = 2\sqrt{2}\,\pi\,(1-|\rho|)^{3/2} + o\!\left((1-|\rho|)^{3/2}\right).$$
This rate is slower than $O\!\left((1-|\rho|)^2\right)$, which is the rate at which $V_{f,n}$ and $V_{f,m}$ approach 0. Figure 1 compares the estimators in terms of $\frac{V_1}{V_{f,m}}$, $\frac{V_f}{V_{f,m}}$, and $\frac{V_{f,n}}{V_{f,m}}$. Basically, $V_{f,n} < V_f$ always, which means we should always use the normalized estimator. Note that $V_1 < V_f$ if $|\rho| > 0.5902$.

Figure 1: Ratios (the lower the better) of variance factors: $V_1/V_{f,m}$, $V_f/V_{f,m}$, $V_m/V_{f,m}$, $V_{f,n}/V_{f,m}$. They are always larger than 1, as $V_{f,m}$ is the theoretically smallest variance factor. Note that $V_m$ is the variance factor for the MLE of sign-full random projections.
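For reference, the following sketch (again an illustration assuming NumPy, not the paper's code) implements the three full-information estimators reviewed above: $\hat\rho_f$, $\hat\rho_{f,n}$, and the cubic-equation MLE $\hat\rho_{f,m}$. The rule for picking among multiple real roots (choosing the one closest to $\hat\rho_{f,n}$) is an assumption made here; the paper only notes that multiple real roots occur with small probability.

```python
import numpy as np

def rho_f(x, y):
    """Inner-product estimator; variance factor V_f = 1 + rho^2."""
    return np.mean(x * y)

def rho_f_n(x, y):
    """Normalized estimator; variance factor V_{f,n} = (1 - rho^2)^2."""
    return np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

def rho_f_m(x, y):
    """Full-information MLE: real root of the cubic
    rho^3 - rho^2*a + rho*(b - 1) - a = 0,  with a = mean(x*y), b = mean(x^2 + y^2)."""
    a = np.mean(x * y)
    b = np.mean(x * x + y * y)
    roots = np.roots([1.0, -a, b - 1.0, -a])
    real = np.clip(roots[np.abs(roots.imag) < 1e-8].real, -1.0, 1.0)
    # In the rare case of multiple real roots, keep the one closest to the
    # normalized estimator (a simple tie-break chosen here for illustration).
    return real[np.argmin(np.abs(real - rho_f_n(x, y)))]
```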
## Estimators for Sign-Full Random Projections

In many practical scenarios such as near-neighbor search and near-neighbor classification, we can store the signs of the projected data (i.e., $\mathrm{sgn}(x_j)$) and discard the original high-dimensional data. When a new data vector arrives, we generate its projected vector (i.e., $y$). At this point we actually have the option of using either the full information or just the signs (i.e., $\mathrm{sgn}(y_j)$) to estimate the similarity $\rho$. If we are able to find a better (more accurate) estimator by using the full information of $y_j$, there is no reason to use only the sign of $y_j$.

We first study the maximum likelihood estimator (MLE), in order to understand the theoretical limit.

**Theorem 1** Given $k$ iid samples $(\mathrm{sgn}(x_j), y_j)$, $j = 1, 2, \ldots, k$, with $x_j, y_j \sim N(0,1)$ and $E(x_j y_j) = \rho$, the maximum likelihood estimator (MLE, denoted by $\hat\rho_m$) is the solution to the following equation:
$$\sum_{j=1}^k \frac{\phi\!\left(\frac{\rho\,\mathrm{sgn}(x_j)\,y_j}{\sqrt{1-\rho^2}}\right)}{\Phi\!\left(\frac{\rho\,\mathrm{sgn}(x_j)\,y_j}{\sqrt{1-\rho^2}}\right)}\,\mathrm{sgn}(x_j)\,y_j = 0, \qquad (2)$$
where $\phi(t) = \frac{1}{\sqrt{2\pi}}e^{-t^2/2}$ and $\Phi(t) = \int_{-\infty}^{t}\phi(s)\,ds$ are respectively the pdf and cdf of the standard normal. Furthermore, $E(\hat\rho_m) = \rho + O\!\left(\frac{1}{k}\right)$ and $Var(\hat\rho_m) = \frac{V_m}{k}$, where, writing $t_j = \frac{\rho\,\mathrm{sgn}(x_j)\,y_j}{\sqrt{1-\rho^2}}$,
$$V_m^{-1} = \frac{\rho}{(1-\rho^2)^{7/2}}\,E\!\left[\frac{\phi(t_j)}{\Phi(t_j)}\,\mathrm{sgn}(x_j)\,y_j^3\right] + \frac{1}{(1-\rho^2)^{3}}\,E\!\left[\frac{\phi^2(t_j)}{\Phi^2(t_j)}\,y_j^2\right] - \frac{3\rho}{(1-\rho^2)^{5/2}}\,E\!\left[\frac{\phi(t_j)}{\Phi(t_j)}\,\mathrm{sgn}(x_j)\,y_j\right]. \qquad (3)$$

As the MLE equation (2) is quite sophisticated, we study this estimator mainly for theoretical interest, for example, for evaluating other estimators. We can evaluate the expectations in (3) by simulations. Figure 1 already plots $\frac{V_m}{V_{f,m}}$, to compare $\hat\rho_m$ with three estimators: $\hat\rho_1$, $\hat\rho_f$, $\hat\rho_{f,n}$. The figure shows that $\hat\rho_m$ indeed substantially improves $\hat\rho_1$.

Next, we seek estimators which are much simpler than $\hat\rho_m$. Ideally, we look for estimators which can be written as inner products. In this paper, we study four such estimators. We first present a lemma which will be needed for deriving these estimators and proving their properties. The results can also be easily validated by numerical integration.

**Lemma 1** For $\rho \in [-1, 1]$,
$$\int_0^\infty t\,e^{-t^2/2}\,\Phi\!\left(\frac{\rho t}{\sqrt{1-\rho^2}}\right)dt = \frac{1+\rho}{2}, \qquad (4)$$
$$\int_0^\infty t^3\,e^{-t^2/2}\,\Phi\!\left(\frac{\rho t}{\sqrt{1-\rho^2}}\right)dt = \frac{2 + 3\rho - \rho^3}{2}, \qquad (5)$$
$$\int_0^\infty t^2\,e^{-t^2/2}\,\Phi\!\left(\frac{\rho t}{\sqrt{1-\rho^2}}\right)dt = \sqrt{\frac{\pi}{8}} + \frac{1}{\sqrt{2\pi}}\left(\rho\sqrt{1-\rho^2} + \tan^{-1}\frac{\rho}{\sqrt{1-\rho^2}}\right), \qquad (6)$$
where we adopt the convention $\tan^{-1}\frac{1}{0} = \tan^{-1}\frac{1}{0^+} = \frac{\pi}{2}$, so that the formulas also cover $\rho = \pm 1$.

The first estimator we present is based on the (odd) moments of $\mathrm{sgn}(x_j)\,y_j$, as shown in Theorem 2.

**Theorem 2**
$$E\big(\mathrm{sgn}(x_j)\,y_j\big) = \sqrt{\frac{2}{\pi}}\,\rho, \qquad (7)$$
$$E\big[(\mathrm{sgn}(x_j)\,y_j)^3\big] = \frac{1}{\sqrt{2\pi}}\left(6\rho - 2\rho^3\right). \qquad (8)$$

**Proof Sketch:** Note that $E\big[(\mathrm{sgn}(x_j)\,y_j)^2\big] = E(y_j^2) = 1$ and $E\big[(\mathrm{sgn}(x_j)\,y_j)^4\big] = E(y_j^4) = 3$. Because $(x_j, y_j)$ is bivariate normal, we have $x_j \mid y_j \sim N\big(\rho y_j,\ 1-\rho^2\big)$, and
$$E\big(\mathrm{sgn}(x_j)\,y_j\big) = E\big(y_j\,E(\mathrm{sgn}(x_j)\mid y_j)\big) = E\big(y_j\Pr(x_j \ge 0 \mid y_j) - y_j\Pr(x_j < 0 \mid y_j)\big) = E\left(y_j\left(2\Phi\!\left(\tfrac{\rho y_j}{\sqrt{1-\rho^2}}\right) - 1\right)\right) = \sqrt{\frac{2}{\pi}}\,\rho,$$
using (4) in Lemma 1. Similarly, using Lemma 1,
$$E\big(\mathrm{sgn}(x_j)\,y_j^3\big) = 4\int_0^\infty t^3\phi(t)\,\Phi\!\left(\frac{\rho t}{\sqrt{1-\rho^2}}\right)dt - 2\int_0^\infty t^3\phi(t)\,dt = \frac{1}{\sqrt{2\pi}}\left(6\rho - 2\rho^3\right).$$

Theorem 2 leads to a simple estimator $\hat\rho_g$ and its variance:
$$\hat\rho_g = \sqrt{\frac{\pi}{2}}\,\frac{1}{k}\sum_{j=1}^k \mathrm{sgn}(x_j)\,y_j, \qquad E(\hat\rho_g) = \rho, \qquad (9)$$
$$Var(\hat\rho_g) = \frac{V_g}{k}, \qquad V_g = \frac{\pi}{2} - \rho^2. \qquad (10)$$
The variance does not vanish when $|\rho| \rightarrow 1$. Interestingly, the variance can be substantially reduced by applying a normalization step on $y_j$, as shown in Theorem 3.

**Theorem 3** As $k \rightarrow \infty$, the following asymptotic normality holds:
$$\sqrt{k}\left(\sqrt{\frac{\pi}{2}}\,\frac{\sum_{j=1}^k \mathrm{sgn}(x_j)\,y_j}{\sqrt{k}\,\sqrt{\sum_{j=1}^k y_j^2}} - \rho\right) \xrightarrow{D} N\big(0,\ V_{g,n}\big), \qquad (11)$$
$$V_{g,n} = V_g - \rho^2\left(\frac{3}{2} - \rho^2\right), \qquad (12)$$
where $V_g = \frac{\pi}{2} - \rho^2$ as in (10).

**Proof Sketch:** Let $Z_k = \frac{\frac{1}{k}\sum_{j=1}^k \mathrm{sgn}(x_j)\,y_j}{\sqrt{\frac{1}{k}\sum_{j=1}^k y_j^2}}$. As $k \rightarrow \infty$, $\frac{1}{k}\sum_{j=1}^k y_j^2 \rightarrow E(y_j^2) = 1$ a.s., and hence $Z_k \rightarrow \sqrt{\frac{2}{\pi}}\,\rho = g$ a.s. We express the deviation $Z_k - g$ as
$$Z_k - g = \frac{\frac{1}{k}\sum_j \mathrm{sgn}(x_j)\,y_j - g}{\sqrt{\frac{1}{k}\sum_j y_j^2}} + g\left(\frac{1}{\sqrt{\frac{1}{k}\sum_j y_j^2}} - 1\right) = \left(\frac{1}{k}\sum_j \mathrm{sgn}(x_j)\,y_j - g\right) + g\,\frac{1 - \frac{1}{k}\sum_j y_j^2}{2} + O_P\!\left(\frac{1}{k}\right).$$
Thus, to analyze the asymptotic variance, it suffices to study
$$E\left(\mathrm{sgn}(x)y - g + g\,\frac{1-y^2}{2}\right)^2 = E\left(\mathrm{sgn}(x)y - g\,\frac{1+y^2}{2}\right)^2 = E(y^2) + \frac{g^2}{4}E\big(1 + y^4 + 2y^2\big) - g\,E\big(\mathrm{sgn}(x)(y + y^3)\big) = 1 + \frac{g^2}{4}(1+3+2) - g(g + g_3) = 1 + \frac{g^2}{2} - g\,g_3,$$
where we recall $g_3 = E\big(\mathrm{sgn}(x)\,y^3\big) = \frac{1}{\sqrt{2\pi}}\left(6\rho - 2\rho^3\right)$. Multiplying by $\frac{\pi}{2}$ and simplifying yields $V_{g,n} = V_g - \rho^2\left(\frac{3}{2}-\rho^2\right)$.
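The estimator $\hat\rho_g$ and a numerical version of the sign-full MLE can be prototyped in a few lines. The sketch below is an illustration under my own assumptions: it uses SciPy's bounded scalar minimizer on the log-likelihood $\sum_j \log\Phi\big(\rho\,\mathrm{sgn}(x_j)y_j/\sqrt{1-\rho^2}\big)$, whose stationarity condition is the score equation (2); the paper itself does not prescribe a particular numerical solver, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def rho_g(sx, y):
    """rho_g = sqrt(pi/2) * mean(sgn(x_j) * y_j); variance factor V_g = pi/2 - rho^2.
    sx holds the stored training signs (+1/-1); y holds the query's raw projections."""
    return np.sqrt(np.pi / 2.0) * np.mean(sx * y)

def rho_m_signfull(sx, y):
    """Sign-full MLE: maximize sum_j log Phi(rho * sgn(x_j) y_j / sqrt(1 - rho^2))
    over rho in (-1, 1); equation (2) is the derivative of this objective."""
    s = sx * y
    def negloglik(r):
        return -np.sum(norm.logcdf(r * s / np.sqrt(1.0 - r * r)))
    res = minimize_scalar(negloglik, bounds=(-0.999999, 0.999999), method="bounded")
    return res.x
```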
Theorem 3 leads to the following estimator $\hat\rho_{g,n}$:
$$\hat\rho_{g,n} = \sqrt{\frac{\pi}{2}}\,\frac{\sum_{j=1}^k \mathrm{sgn}(x_j)\,y_j}{\sqrt{k}\,\sqrt{\sum_{j=1}^k y_j^2}}, \qquad E(\hat\rho_{g,n}) = \rho + O\!\left(\frac{1}{k}\right), \qquad Var(\hat\rho_{g,n}) = \frac{V_{g,n}}{k}.$$
This normalization always helps, because $V_{g,n} \le V_g$.

We can develop more estimators based on Theorem 4. Here $y_+ = \max(y, 0)$ and $y_- = \max(-y, 0)$ denote the positive and negative parts of $y$; note that $y_-\,1_{\{x\ge 0\}} + y_+\,1_{\{x<0\}} = |y|\,1_{\{\mathrm{sgn}(x)\ne\mathrm{sgn}(y)\}}$.

**Theorem 4**
$$E\big(y_-\,1_{\{x<0\}} + y_+\,1_{\{x\ge 0\}}\big) = \frac{1+\rho}{\sqrt{2\pi}}, \qquad E\big(y_-\,1_{\{x<0\}} + y_+\,1_{\{x\ge 0\}}\big)^2 = \frac{1}{2} + \frac{1}{\pi}\left(\rho\sqrt{1-\rho^2} + \tan^{-1}\frac{\rho}{\sqrt{1-\rho^2}}\right), \qquad (14)$$
$$E\big(y_-\,1_{\{x\ge 0\}} + y_+\,1_{\{x<0\}}\big) = \frac{1-\rho}{\sqrt{2\pi}}, \qquad E\big(y_-\,1_{\{x\ge 0\}} + y_+\,1_{\{x<0\}}\big)^2 = \frac{1}{2} - \frac{1}{\pi}\left(\rho\sqrt{1-\rho^2} + \tan^{-1}\frac{\rho}{\sqrt{1-\rho^2}}\right). \qquad (16)$$

This leads to another estimator, denoted by $\hat\rho_s$:
$$\hat\rho_s = 1 - \sqrt{2\pi}\,\frac{1}{k}\sum_{j=1}^k\left(y_{j,-}\,1_{\{x_j\ge 0\}} + y_{j,+}\,1_{\{x_j<0\}}\right), \qquad (17)$$
$$E(\hat\rho_s) = \rho, \qquad Var(\hat\rho_s) = \frac{V_s}{k}, \qquad V_s = 2\pi\,1_{\{\rho<0\}} + 2\tan^{-1}\frac{\sqrt{1-\rho^2}}{\rho} - 2\rho\sqrt{1-\rho^2} - (1-\rho)^2. \qquad (18)$$
After a careful check, the estimator proposed in (Dong, Charikar, and Li 2008) turns out to be equivalent to $\hat\rho_s$, despite the different expressions. (Dong, Charikar, and Li 2008) used hyper-spherical projections, which are equivalent to random projections when the original data are high-dimensional. The variance expression in (Dong, Charikar, and Li 2008) appears different, but it is indeed the same as the variance of $\hat\rho_s$ if we let the original data dimension be large.

We can again try normalizing the projected data in the hope of obtaining an improved estimator.

**Theorem 5** As $k \rightarrow \infty$,
$$\sqrt{k}\left(\frac{\sum_{j=1}^k\left(y_{j,-}\,1_{\{x_j\ge 0\}} + y_{j,+}\,1_{\{x_j<0\}}\right)}{\sqrt{k}\,\sqrt{\sum_{j=1}^k y_j^2}} - \frac{1-\rho}{\sqrt{2\pi}}\right) \xrightarrow{D} N\!\left(0,\ \frac{V_{s,n}}{2\pi}\right),$$
$$V_{s,n} = V_s - \frac{(1-\rho)^2}{2}\left(1 - 2\rho - 2\rho^2\right), \qquad (20)$$
where $V_s$ is given in (18). This leads to the following estimator:
$$\hat\rho_{s,n} = 1 - \sqrt{2\pi}\,\frac{\sum_{j=1}^k\left(y_{j,-}\,1_{\{x_j\ge 0\}} + y_{j,+}\,1_{\{x_j<0\}}\right)}{\sqrt{k}\,\sqrt{\sum_{j=1}^k y_j^2}}, \qquad (21)$$
$$E(\hat\rho_{s,n}) = \rho + O\!\left(\frac{1}{k}\right), \qquad Var(\hat\rho_{s,n}) = \frac{V_{s,n}}{k},$$
where $V_{s,n}$ is given in (20).

The resulting estimator $\hat\rho_{s,n}$ has the property that its variance approaches 0 as $\rho \rightarrow 1$. The normalization step, however, does not always help. From (20), we have $V_s \ge V_{s,n}$ if $\rho \le \frac{\sqrt{3}-1}{2} \approx 0.3660$. On the other hand, as shown in Figure 2, the normalization step only increases the variance slightly if $\rho > 0.3660$.

Figure 2 plots the ratios $\frac{V_m}{V_1}$, $\frac{V_g}{V_1}$, $\frac{V_{g,n}}{V_1}$, $\frac{V_s}{V_1}$, $\frac{V_{s,n}}{V_1}$, to compare the five estimators in terms of their improvements relative to the 1-bit estimator $\hat\rho_1$. As expected, the MLE $\hat\rho_m$ achieves the smallest asymptotic variance, with $\frac{V_m}{V_1} = \frac{2}{\pi}$ at $\rho = 0$ and $\frac{V_m}{V_1} \approx 0.36$ as $|\rho| \rightarrow 1$. This means that in the high-similarity region, using $\hat\rho_m$ can roughly reduce the required number of samples ($k$) by a factor of 3. Overall, $\hat\rho_{s,n}$ is computationally simple and its variance is very close to the variance of the MLE, at least for $\rho \ge 0.4$. We summarize some numerical values in Lemma 2.

Figure 2: Variance ratios $V_m/V_1$, $V_g/V_1$, $V_{g,n}/V_1$, $V_s/V_1$, $V_{s,n}/V_1$, to compare the five estimators in terms of the relative improvement with respect to the 1-bit estimator $\hat\rho_1$. The MLE ($\hat\rho_m$, solid blue) achieves the lowest asymptotic variance.

**Lemma 2** At $\rho = 0$,
$$\frac{V_m}{V_1} = \frac{V_g}{V_1} = \frac{V_{g,n}}{V_1} = \frac{2}{\pi} \approx 0.6366, \qquad \frac{V_s}{V_1} = \frac{4(\pi-1)}{\pi^2} \approx 0.8679, \qquad \frac{V_{s,n}}{V_1} = \frac{4(\pi-3/2)}{\pi^2} \approx 0.6653.$$
As $|\rho| \rightarrow 1$,
$$V_1 = 2\sqrt{2}\,\pi\,(1-|\rho|)^{3/2} + o\!\left((1-|\rho|)^{3/2}\right), \qquad (22)$$
and, as $\rho \rightarrow 1$, $\frac{V_s}{V_1} \rightarrow \frac{V_{s,n}}{V_1} \rightarrow \frac{4}{3\pi} \approx 0.4244$, while $\frac{V_g}{V_1} \rightarrow \infty$ and $\frac{V_{g,n}}{V_1} \rightarrow \infty$.

## Recommendation for Estimators

We have studied four estimators (with closed forms):
$$\hat\rho_g = \sqrt{\frac{\pi}{2}}\,\frac{1}{k}\sum_{j=1}^k \mathrm{sgn}(x_j)\,y_j, \qquad \hat\rho_{g,n} = \sqrt{\frac{\pi}{2}}\,\frac{\sum_{j=1}^k \mathrm{sgn}(x_j)\,y_j}{\sqrt{k}\,\sqrt{\sum_{j=1}^k y_j^2}},$$
$$\hat\rho_s = 1 - \sqrt{2\pi}\,\frac{1}{k}\sum_{j=1}^k\left(y_{j,-}\,1_{\{x_j\ge 0\}} + y_{j,+}\,1_{\{x_j<0\}}\right), \qquad \hat\rho_{s,n} = 1 - \sqrt{2\pi}\,\frac{\sum_{j=1}^k\left(y_{j,-}\,1_{\{x_j\ge 0\}} + y_{j,+}\,1_{\{x_j<0\}}\right)}{\sqrt{k}\,\sqrt{\sum_{j=1}^k y_j^2}}.$$
The choice depends on the application scenario. Presumably, for a given query, we would like to retrieve data points whose similarity $\rho$ is close to 1. However, for practical datasets, typically most data points are not similar at all. From Figure 2, if we hope to use one single estimator, then $\hat\rho_{s,n}$ is the overall best. We can also combine two estimators, $\hat\rho_s$ and $\hat\rho_{g,n}$: Figure 2 shows that $\hat\rho_s$ is better if $\rho > 0.4437$. For $\rho < 0$, we can always switch to the mirror version of the estimators.

## A Simulation Study

We provide a simulation study to verify the theoretical properties of the four estimators for sign-full random projections, $\hat\rho_g$, $\hat\rho_{g,n}$, $\hat\rho_s$, $\hat\rho_{s,n}$, as well as $\hat\rho_1$ for sign-sign projections. For a given $\rho$, we simulate $k$ standard bivariate normal variables $(x_j, y_j)$ with $E(x_j y_j) = \rho$, $j = 1, \ldots, k$. Then we choose an estimator $\hat\rho$ to estimate $\rho$.
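A minimal sketch of the remaining closed-form estimators ($\hat\rho_{g,n}$, $\hat\rho_s$, $\hat\rho_{s,n}$) and of a Monte Carlo harness in the spirit of the simulation study is given below. This is an illustration only: the function names, the number of repetitions, and the random-number setup are assumptions made here, not specifications from the paper.

```python
import numpy as np

def _mismatch(sx, y):
    """T_j = y_{j,-} 1{x_j >= 0} + y_{j,+} 1{x_j < 0}, i.e., |y_j| on sign disagreements.
    sx is assumed to contain +1/-1 entries (the stored signs of x)."""
    return np.maximum(-y, 0.0) * (sx > 0) + np.maximum(y, 0.0) * (sx < 0)

def rho_g_n(sx, y):
    """Normalized estimator rho_{g,n}."""
    return np.sqrt(np.pi / 2.0) * np.sum(sx * y) / np.sqrt(len(y) * np.sum(y * y))

def rho_s(sx, y):
    """rho_s = 1 - sqrt(2*pi) * mean(T_j)."""
    return 1.0 - np.sqrt(2.0 * np.pi) * np.mean(_mismatch(sx, y))

def rho_s_n(sx, y):
    """Normalized version rho_{s,n}."""
    T = _mismatch(sx, y)
    return 1.0 - np.sqrt(2.0 * np.pi) * np.sum(T) / np.sqrt(len(y) * np.sum(y * y))

def empirical_mse(estimator, rho, k, reps=10000, seed=0):
    """Monte Carlo MSE of an estimator fed (sgn(x_j), y_j) pairs with correlation rho."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    errs = np.empty(reps)
    for i in range(reps):
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=k)
        errs[i] = estimator(np.sign(xy[:, 0]), xy[:, 1]) - rho
    return np.mean(errs**2)

# Example: compare the empirical MSE with the asymptotic prediction V_{s,n}(rho)/k.
# print(empirical_mse(rho_s_n, rho=0.95, k=100))
```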
With $10^6$ simulations, we plot the empirical mean square errors (MSEs), $\mathrm{MSE}(\hat\rho) = \mathrm{Bias}^2(\hat\rho) + Var(\hat\rho)$, together with the theoretical variance of $\hat\rho$. If the empirical MSE curve and the theoretical variance overlap, we know that the estimator is unbiased and the theoretical variance formula is verified.

Figure 3: Empirical MSEs (solid curves) for the four proposed estimators, together with the theoretical (asymptotic) variances (dashed curves), for 6 values of $\rho$, with $k$ ranging from 10 to 1000. For $\hat\rho_g$ and $\hat\rho_s$, the solid and dashed curves overlap, confirming that they are unbiased and that the variance formulas are correct. For $\hat\rho_{g,n}$ and $\hat\rho_{s,n}$, the solid and dashed curves overlap when $k$ is not too small.

Figure 3 presents the results for 6 selected $\rho$ values: $0.99$, $0.95$, $0.75$, $0$, $-0.95$, $-0.99$. The simulations verify that both $\hat\rho_g$ and $\hat\rho_s$ are unbiased, while their normalized versions $\hat\rho_{g,n}$ and $\hat\rho_{s,n}$ are asymptotically unbiased (i.e., unbiased when $k$ is not too small). The (asymptotic) variance formulas for these four estimators are verified, since the solid and dashed curves overlap (when $k$ is not small).

Figure 4 presents the ratios of empirical MSEs (solid curves), $\frac{\mathrm{MSE}(\hat\rho_1)}{\mathrm{MSE}(\hat\rho_{s,n})}$ and $\frac{\mathrm{MSE}(\hat\rho_1)}{\mathrm{MSE}(\hat\rho_{g,n})}$, together with the theoretical asymptotic variance ratios (dashed curves), $\frac{V_1}{V_{s,n}}$ and $\frac{V_1}{V_{g,n}}$. These curves again confirm the asymptotic variance formulas. In addition, they indicate that in the high-similarity region, when the sample size $k$ is not too large, the improvement gained from using $\hat\rho_{s,n}$ can be substantially larger than what is predicted by theory. For example, when $\rho$ is close to 1 (e.g., $\rho = 0.99$), theoretically $\frac{V_1}{V_{s,n}} \approx \frac{3\pi}{4} \approx 2.3562$, yet the actual improvement can be as much as a factor of 8 (at $k = 10$). This is an additional advantage of $\hat\rho_{s,n}$.

Figure 4: Empirical ratios $\frac{\mathrm{MSE}(\hat\rho_1)}{\mathrm{MSE}(\hat\rho_{s,n})}$ and $\frac{\mathrm{MSE}(\hat\rho_1)}{\mathrm{MSE}(\hat\rho_{g,n})}$ (solid curves), together with the theoretical asymptotic variance ratios $\frac{V_1}{V_{s,n}}$ and $\frac{V_1}{V_{g,n}}$ (dashed curves). When $k$ is not small, the solid and dashed curves overlap. At high similarity and small $k$, the improvement is even more substantial.

## An Experimental Study

To further verify the theoretical results, we conduct an experimental study on the ranking task for near-neighbor search, on 4 public datasets (see Table 1 and Figure 5).

Table 1: Information about the datasets

| Dataset | # Train | # Query | # Dim |
|---|---|---|---|
| MNIST | 10,000 | 10,000 | 780 |
| RCV1 | 10,000 | 10,000 | 47,236 |
| Youtube Audio | 10,000 | 11,930 | 2,000 |
| Youtube Description | 10,000 | 11,743 | 12,183,626 |

These four datasets were downloaded from either the UCI repository or the LIBSVM website. When a dataset contains significantly more than 10,000 training samples, we only use a random sample of it. The datasets represent a wide range of application scenarios and data types. See Figure 5 for the frequencies of all pairwise $\rho$ values.

Figure 5: Histograms of all pairwise $\rho$ values (MNIST, RCV1, Youtube Audio, Youtube Description).

For each data point in the query set, we estimate its similarity to every data point in the training set using random projections. The goal is to return training data points whose estimated similarities are larger than a pre-specified threshold $\rho_0$. For each query point, we rank all the (estimated) similarities and return the top-$L$ points.
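The sketch below illustrates this ranking protocol under my own assumptions (NumPy, a stored sign matrix `SX`, and illustrative function names): similarities between a query and all stored training points are estimated from the signs alone via $\hat\rho_{s,n}$, and precision/recall are then computed for the top-$L$ retrieved points against the true-similarity threshold $\rho_0$. Sweeping $L$ from 1 to the number of training points and averaging over queries would trace curves like those in Figures 6-9.

```python
import numpy as np

def estimate_sims_s_n(SX, y):
    """Estimated similarity between one query and every training point, using only
    the stored sign matrix SX (n x k, entries +1/-1) and the query's un-quantized
    projections y (length k), via the rho_{s,n} estimator."""
    k = len(y)
    T = np.maximum(-y, 0.0) * (SX > 0) + np.maximum(y, 0.0) * (SX < 0)   # n x k
    return 1.0 - np.sqrt(2.0 * np.pi) * T.sum(axis=1) / np.sqrt(k * np.sum(y * y))

def precision_recall_at_L(est_sims, true_sims, rho0, L):
    """Retrieve the top-L training points by estimated similarity and score them
    against the true-similarity threshold rho0."""
    retrieved = np.argsort(-est_sims)[:L]        # indices of the top-L estimates
    relevant = true_sims >= rho0                 # ground-truth relevant set
    hits = int(np.count_nonzero(relevant[retrieved]))
    precision = hits / L
    recall = hits / max(int(np.count_nonzero(relevant)), 1)
    return precision, recall
```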
We can then compute the precision and recall:
$$\text{Precision} = \frac{\#\ \text{retrieved points with true similarity} \ge \rho_0}{L}, \qquad \text{Recall} = \frac{\#\ \text{retrieved points with true similarity} \ge \rho_0}{\#\ \text{total points with true similarity} \ge \rho_0}.$$
We report the precision-recall values averaged over all query data points. By varying $L$ from 1 to the number of training data points, we obtain a precision-recall curve.

Figure 6 presents the results for the RCV1 dataset, for $\rho_0 \in \{0.9, 0.8, 0.6\}$ and $k \in \{50, 100\}$. In the first row (i.e., $\rho_0 = 0.9$), we can see that $\hat\rho_{s,n}$ is substantially more accurate than both $\hat\rho_1$ and $\hat\rho_{g,n}$. Since this case represents the high-similarity region, $\hat\rho_{g,n}$ performs poorly, as expected. For smaller $\rho_0$ values, $\hat\rho_{g,n}$ performs substantially better, also as expected.

Figure 7, Figure 8, and Figure 9 present the results for the other three datasets. The trends are very similar to what we observe in Figure 6.

Figure 6: RCV1: precision-recall curves for selected $\rho_0$ and $k$ values ($\rho_0 \in \{0.9, 0.8, 0.6\}$, $k \in \{50, 100\}$), for three estimators: $\hat\rho_{s,n}$, $\hat\rho_{g,n}$, $\hat\rho_1$.

Figure 7: MNIST: precision-recall curves for selected $\rho_0$ and $k$ values, for three estimators: $\hat\rho_{s,n}$, $\hat\rho_{g,n}$, $\hat\rho_1$.

Figure 8: Youtube Audio: precision-recall curves for selected $\rho_0$ and $k$ values, for three estimators: $\hat\rho_{s,n}$, $\hat\rho_{g,n}$, $\hat\rho_1$.

Figure 9: Youtube Description: precision-recall curves for selected $\rho_0$ and $k$ values, for three estimators: $\hat\rho_{s,n}$, $\hat\rho_{g,n}$, $\hat\rho_1$.

## Conclusion

The method of sign-sign (1-bit) random projections has been a standard tool in practice. In many scenarios such as near-neighbor search and near-neighbor classification, we can store the signs of the projected data and discard the original high-dimensional data. When a new data point arrives, one can generate its projected vector and use it (without taking signs) to estimate the similarity. We study various estimators for sign-full random projections and compare their variances with the theoretical limit. No single proposed estimator attains (close to) this limit over the entire range of $\rho$; nevertheless, a combination of two estimators ($\hat\rho_{g,n}$ and $\hat\rho_s$) can almost achieve the theoretical bound. For applications which allow only a single estimator, the proposed $\hat\rho_{s,n}$ is the overall best estimator.

## References

Anderson, T. W. 2003. An Introduction to Multivariate Statistical Analysis. Hoboken, New Jersey: John Wiley & Sons, third edition.

Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences (SEQUENCES), 21–29.
Charikar, M. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), 380–388.

Dasgupta, S. 1999. Learning mixtures of Gaussians. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS), 634–644.

Datar, M.; Immorlica, N.; Indyk, P.; and Mirrokni, V. S. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry (SoCG), 253–262.

Dong, W.; Charikar, M.; and Li, K. 2008. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 123–130.

Goemans, M. X., and Williamson, D. P. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42(6):1115–1145.

Gradshteyn, I. S., and Ryzhik, I. M. 1994. Table of Integrals, Series, and Products. New York: Academic Press, fifth edition.

Grimes, C. 2008. Microscale evolution of web pages. In Proceedings of the 17th International Conference on World Wide Web (WWW), 1149–1150.

Hajishirzi, H.; Yih, W.-t.; and Kolcz, A. 2010. Adaptive near-duplicate detection via similarity learning. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 419–426.

Henzinger, M. R. 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 284–291.

Jégou, H.; Douze, M.; and Schmid, C. 2011. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1):117–128.

Johnson, W. B., and Lindenstrauss, J. 1984. Extensions of Lipschitz mapping into Hilbert space. Contemporary Mathematics 26:189–206.

Lehmann, E. L., and Casella, G. 1998. Theory of Point Estimation. New York, NY: Springer, second edition.

Li, P., and König, A. C. 2010. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web (WWW), 671–680.

Li, P., and Slawski, M. 2017. Simple strategies for recovering inner products from coarsely quantized random projections. In Advances in Neural Information Processing Systems (NIPS), 4570–4579.

Li, P.; Hastie, T.; and Church, K. W. 2006. Improving random projections using marginal information. In Proceedings of the 19th Annual Conference on Learning Theory (COLT), 635–649.

Li, P.; Samorodnitsky, G.; and Hopcroft, J. E. 2013. Sign Cauchy projections and chi-square kernel. In Advances in Neural Information Processing Systems (NIPS), 2571–2579.

Manku, G. S.; Jain, A.; and Sarma, A. D. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web (WWW), 141–150.

Manzoor, E. A.; Milajerdi, S. M.; and Akoglu, L. 2016. Fast memory-efficient anomaly detection in streaming heterogeneous graphs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 1035–1044.

Papadimitriou, C. H.; Raghavan, P.; Tamaki, H.; and Vempala, S. 1998. Latent semantic indexing: A probabilistic analysis. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 159–168.
Türe, F.; Elsayed, T.; and Lin, J. J. 2011. No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 943–952.

Vempala, S. 2004. The Random Projection Method. Providence, RI: American Mathematical Society.