# Fully Understanding the Hashing Trick

Casper Freksen, Department of Computer Science, Aarhus University, Denmark, cfreksen@cs.au.dk
Lior Kamma, Department of Computer Science, Aarhus University, Denmark, lior.kamma@cs.au.dk
Kasper Green Larsen, Department of Computer Science, Aarhus University, Denmark, larsen@cs.au.dk

All authors contributed equally, and are presented in alphabetical order.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

Feature hashing, also known as the hashing trick, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling-up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix $A : \mathbb{R}^n \to \mathbb{R}^m$ (where $m \ll n$) in order to reduce the dimension of the data from $n$ to $m$ while approximately preserving the Euclidean norm. Every column of $A$ contains exactly one non-zero entry, which equals either $-1$ or $1$. Weinberger et al. showed tail bounds on $\|Ax\|_2^2$. Specifically, they showed that for every $\varepsilon, \delta$, if $\|x\|_\infty / \|x\|_2$ is sufficiently small, and $m$ is sufficiently large, then
$$\Pr\bigl[\,\bigl|\|Ax\|_2^2 - \|x\|_2^2\bigr| < \varepsilon\|x\|_2^2\,\bigr] \ge 1 - \delta.$$
These bounds were later extended by Dasgupta et al. (2010) and most recently refined by Dahlgaard et al. (2017); however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters $\|x\|_\infty / \|x\|_2$, $m$, $\varepsilon$, $\delta$, remained an open question. We settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. We complement the asymptotic bound with empirical data, which shows that the constants hiding in the asymptotic notation are, in fact, very close to 1, thus further illustrating the tightness of the presented bounds in practice.

1 Introduction

Dimensionality reduction that approximately preserves Euclidean distances is a key tool used as a preprocessing step in many geometric, algebraic and classification algorithms, whose performance heavily depends on the dimension of the input. Loosely speaking, a distance-preserving dimensionality reduction is an (often random) embedding of a high-dimensional Euclidean space into a space of low dimension, such that pairwise distances are approximately preserved (with high probability). Its applications range over nearest neighbor search [AC09, HIM12, AIL+15], classification and regression [RR08, MM09, PBMID14], manifold learning [HWB08], sparse recovery [CT06] and numerical linear algebra [CW09, MM13, Sár06]. For more applications see, e.g., [Vem05]. One of the most fundamental results in the field was presented in the seminal paper by Johnson and Lindenstrauss [JL84].

Lemma 1 (Distributional JL Lemma). For every $n \in \mathbb{N}$ and $\varepsilon, \delta \in (0, 1)$, there exists a random $m \times n$ projection matrix $A$, where $m = \Theta(\varepsilon^{-2} \lg \frac{1}{\delta})$, such that for every $x \in \mathbb{R}^n$,
$$\Pr\bigl[\,\bigl|\|Ax\|_2^2 - \|x\|_2^2\bigr| < \varepsilon\|x\|_2^2\,\bigr] \ge 1 - \delta. \quad (1)$$

The target dimension $m$ in the lemma is known to be optimal [JW13, LN17].

Running Time Performance. Perhaps the most common proof of the lemma (see, e.g., [DG03, Mat08]) samples a projection matrix by independently sampling each entry from a standard Gaussian (or Rademacher) distribution. Such matrices are by nature very dense, and thus a naïve embedding runs in $O(m\|x\|_0)$ time, where $\|x\|_0$ is the number of non-zero entries of $x$. Due to the algorithmic significance of the lemma, much effort was invested in finding techniques to accelerate the embedding time.
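As a point of reference for the sparse constructions discussed next, the following is a minimal NumPy sketch (not taken from any of the cited works) of the dense Gaussian construction; the function name and the $1/\sqrt{m}$ scaling convention are illustrative choices, not something fixed by the paper.

```python
import numpy as np

def dense_jl_embed(x, m, seed=0):
    """Embed x into R^m with a dense Gaussian JL matrix.

    Entries are i.i.d. N(0, 1/m), so E[||Ax||_2^2] = ||x||_2^2.  The dense
    matrix-vector product costs O(m * n) time in general, and O(m * ||x||_0)
    if only the non-zero coordinates of x are touched.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, x.shape[0]))
    return A @ x

x = np.random.default_rng(1).normal(size=4096)
y = dense_jl_embed(x, m=512)
print(np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2)  # typically close to 1
```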
One fruitful approach for accomplishing this goal is to consider a distribution over sparse projection matrices. This line of work was initiated by Achlioptas [Ach03], who constructed a distribution over matrices in which the expected fraction of non-zero entries is at most one third, while maintaining the target dimension. The best result to date in constructing a sparse Johnson-Lindenstrauss matrix is due to Kane and Nelson [KN14], who presented a distribution over matrices satisfying (1) in which every column has at most $s = O(\varepsilon^{-1}\lg(1/\delta))$ non-zero entries. Conversely, Nelson and Nguyễn [NN13] showed that this is almost asymptotically optimal. That is, every distribution over $m \times n$ matrices satisfying (1) with $m = \Theta(\varepsilon^{-2}\lg(1/\delta))$, and such that every column has at most $s$ non-zero entries, must satisfy $s = \Omega(\varepsilon^{-1}\lg(1/\delta)/\lg(1/\varepsilon))$.

While the bound presented by Nelson and Nguyễn is theoretically tight, we can provably still do much better in practice. Specifically, the lower bound is attained on vectors $x \in \mathbb{R}^n$ for which, loosely speaking, the mass of $x$ is concentrated in few entries. Formally, the ratio $\|x\|_\infty / \|x\|_2$ is large. However, in practical scenarios, such as the term frequency–inverse document frequency representation of a document, we may often assume that the mass of $x$ is well-distributed over many entries (that is, $\|x\|_\infty / \|x\|_2$ is small). In these common scenarios, projection matrices which are significantly sparser turn out to be very effective.

Feature Hashing. In the pursuit of sparse projection matrices, Weinberger et al. [WDL+09] introduced dimensionality reduction via Feature Hashing, in which the projection matrix $A$ is, in a sense, as sparse as possible. That is, every column of $A$ contains exactly one non-zero entry, randomly chosen from $\{-1, 1\}$. This technique is one of the most influential mathematical tools in the study of scaling-up machine learning algorithms, mainly due to its simplicity and good performance in practice [Dal13, Sut15]. More formally, for $n, m \in \mathbb{N}_+$, the projection matrix $A$ is sampled as follows. Sample $h \in_R [n] \to [m]$ and $\sigma = (\sigma_j)_{j \in [n]} \in_R \{-1, 1\}^n$ independently. For every $i \in [m]$, $j \in [n]$, let $a_{ij} = a_{ij}(h, \sigma) := \sigma_j \mathbf{1}_{h(j) = i}$ (that is, $a_{ij} = \sigma_j$ if $h(j) = i$ and $0$ otherwise).
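For concreteness, here is a small NumPy sketch of the construction just described (not the authors' implementation); the helper name and the use of a pseudorandom generator in place of the fully random (or highly independent) choice of $h$ and $\sigma$ are ours.

```python
import numpy as np

def feature_hash(x, m, seed=0):
    """Apply the hashing trick: coordinate j is sent to bucket h(j) with a
    random sign sigma(j), i.e. the implicit m x n matrix A has exactly one
    non-zero entry per column, equal to +1 or -1."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    h = rng.integers(0, m, size=n)            # h : [n] -> [m]
    sigma = rng.choice([-1.0, 1.0], size=n)   # sigma in {-1, +1}^n
    y = np.zeros(m)
    np.add.at(y, h, sigma * x)                # y_i = sum_{j : h(j) = i} sigma_j x_j
    return y

x = np.ones(1000) / np.sqrt(1000)             # a "well-spread" unit vector
print(np.linalg.norm(feature_hash(x, m=256)) ** 2)  # typically close to 1
```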
Weinberger et al. additionally showed exponential tail bounds on $\|Ax\|_2^2$ when the ratio $\|x\|_\infty / \|x\|_2$ is sufficiently small and $m$ is sufficiently large. These bounds were later improved by Dasgupta et al. [DKS10], and most recently Dahlgaard, Knudsen and Thorup [DKT17a] further improved these concentration bounds. Conversely, a result by Kane and Nelson [KN14] implies that if we allow $\|x\|_\infty / \|x\|_2$ to be too large, then there exist vectors for which (1) does not hold. Finding the correct tradeoff between $\|x\|_\infty / \|x\|_2$ and $m, \varepsilon, \delta$ for which feature hashing performs well remained an open problem. Our main contribution is settling this problem, and providing a complete and comprehensive understanding of the performance of feature hashing.

1.1 Main results

The main result of this paper is a tight tradeoff between the target dimension $m$, the approximation ratio $\varepsilon$, the error probability $\delta$ and $\|x\|_\infty / \|x\|_2$. More formally, let $\varepsilon, \delta > 0$ and $m \in \mathbb{N}_+$. Let $\nu(m, \varepsilon, \delta)$ be the maximum $\nu \in [0, 1]$ such that for every $x \in \mathbb{R}^n$, if $\|x\|_\infty \le \nu \|x\|_2$ then (1) holds. Our main result is the following theorem, which gives tight asymptotic bounds for the performance of feature hashing, thus closing the long-standing gap.

Theorem 2. There exist constants $C \ge D > 0$ such that for every $\varepsilon, \delta \in (0, 1)$ and $m \in \mathbb{N}_+$ the following holds. If $\frac{C \lg\frac{1}{\delta}}{\varepsilon^2} \le m < \frac{2}{\varepsilon^2\delta}$ then
$$\nu(m, \varepsilon, \delta) = \Theta\left(\sqrt{\varepsilon}\,\min\left\{\frac{\lg\frac{\varepsilon m}{\lg\frac1\delta}}{\lg\frac1\delta},\ \sqrt{\frac{\lg\frac{\varepsilon^2 m}{\lg\frac1\delta}}{\lg\frac1\delta}}\right\}\right).$$
Otherwise, if $m \ge \frac{2}{\varepsilon^2\delta}$ then $\nu(m, \varepsilon, \delta) = 1$. Moreover, if $m < \frac{D \lg\frac{1}{\delta}}{\varepsilon^2}$ then $\nu(m, \varepsilon, \delta) = 0$.

While the bound presented in the theorem may strike as surprising, due to the intricacy of the expressions involved, the tightness of the result shows that this is, in fact, the correct and true bound. Moreover, the proof of the theorem demonstrates how both branches in the min expression are required in order to give a tight bound.

Experimental Results. Our theoretical bounds are accompanied by empirical results that shed light on the nature of the constants in Theorem 2. Our empirical results show that in practice the constants inside the $\Theta$-notation are significantly tighter than the theoretical proof might suggest, and in fact feature hashing performs well for a larger scope of vectors. Specifically, for a synthetic set of generated bit-vectors, we show that whenever $\frac{4\lg\frac1\delta}{\varepsilon^2} \le m < \frac{2}{\varepsilon^2\delta}$ the constant hidden by the $\Theta$-notation is at least 0.75 (except for very sparse vectors, i.e. $\|x\|_0 \le 7$). That is,
$$\nu(m, \varepsilon, \delta) \ge 0.725\,\sqrt{\varepsilon}\,\min\left\{\frac{\lg\frac{\varepsilon m}{\lg\frac1\delta}}{\lg\frac1\delta},\ \sqrt{\frac{\lg\frac{\varepsilon^2 m}{\lg\frac1\delta}}{\lg\frac1\delta}}\right\}.$$
For a bag-of-words representation of 1500 NIPS papers with stopwords removed [DKT17b, New08], our experiments show that the constant is even larger, whereas the theoretical proof provides a much smaller constant of $2^{-6}$ in front of $\sqrt{\varepsilon}$. Since feature hashing satisfies (1) whenever $\|x\|_\infty \le \nu(m, \varepsilon, \delta)\|x\|_2$, this implies that feature hashing works with a better constant than the theory suggests.

Proof Technique. As a fundamental step in the proof of Theorem 2 we prove tight asymptotic bounds for high-order norms of the approximation factor.² More formally, for every $x \in \mathbb{R}^n \setminus \{0\}$ let $X(x) = \bigl|\|Ax\|_2^2 - \|x\|_2^2\bigr|$. The technical crux of our results is tight bounds on high-order moments of $X(x)$. Note that by rescaling we may restrict our focus without loss of generality to unit vectors.

Notation 1. For every $m, r, k > 0$ denote
$$\Lambda(m, r, k) := \begin{cases}\max\left\{\sqrt{\frac{r}{m}},\ \frac{r^2}{k\ln^2\frac{emr}{k}},\ \frac{r}{k\ln\frac{emr}{k^2}}\right\}, & mr > k^2,\\[4pt]\max\left\{\sqrt{\frac{r}{m}},\ \frac{r^2}{k\ln^2\frac{emr}{k}}\right\}, & k^2 \ge mr > k,\\[4pt]\sqrt{\frac{r}{m}}, & k \ge mr.\end{cases}$$

In these notations our main technical lemmas are the following.

Lemma 3. For every even $r \le m/4$ and every unit vector $x \in \mathbb{R}^n$, $\|X(x)\|_r = O(\Lambda(m, r, \|x\|_\infty^{-2}))$.

Lemma 4. For every $k \le n$ and even $r \le \min\{m/4, k\}$, $\|X(x^{(k)})\|_r = \Omega(\Lambda(m, r, k))$, where $x^{(k)} \in \mathbb{R}^n$ is the unit vector whose first $k$ entries equal $\frac{1}{\sqrt{k}}$.

While it might seem at a glance that bounding the high-order moments of $X(x)$ is merely a technical issue, known tools and techniques could not be used to prove Lemmas 3, 4. Particularly, earlier work by Kane and Nelson [KN14, CJN18] and Freksen and Larsen [FL17] used high-order moment bounds as a step in proving probability tail bounds of random variables. The existing techniques, however, can not be adapted to bound high-order moments of $X(x)$ (see also Section 1.2), and novel approaches were needed. Specifically, our proof incorporates a novel combinatorial scheme for counting edge-labeled Eulerian graphs.

²Given a random variable $X$ and $r > 0$, the $r$th norm of $X$ (if it exists) is defined as $\|X\|_r := \sqrt[r]{\mathbb{E}[|X|^r]}$.

Previous Results. Weinberger et al. [WDL+09] showed that whenever $m = \Omega(\varepsilon^{-2}\lg(1/\delta))$, then $\nu(m, \varepsilon, \delta) = \Omega\bigl(\sqrt{\varepsilon}\,(\lg(1/\delta)\lg(m/\delta))^{-1/2}\bigr)$. Dasgupta et al. [DKS10] showed that under similar conditions $\nu(m, \varepsilon, \delta) = \Omega\bigl(\sqrt{\varepsilon}\,(\lg(1/\delta)\lg^2(m/\delta))^{-1/2}\bigr)$. These bounds were recently improved by Dahlgaard et al. [DKT17a], who showed that $\nu(m, \varepsilon, \delta) = \Omega\Bigl(\sqrt{\varepsilon}\sqrt{\frac{\lg(1/\varepsilon)}{\lg(1/\delta)\lg(m/\delta)}}\Bigr)$. Conversely, Kane and Nelson [KN14] showed that for the restricted case of $m = \Theta(\varepsilon^{-2}\lg(1/\delta))$, $\nu(m, \varepsilon, \delta) = O\Bigl(\sqrt{\varepsilon}\,\frac{\lg(1/\varepsilon)}{\lg(1/\delta)}\Bigr)$, which matches the bound in Theorem 2 if, in addition, $\lg(1/\varepsilon) \le \sqrt{\lg(1/\delta)}$.
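To make the shape of the tradeoff in Theorem 2 concrete, the following small Python sketch evaluates the expression inside the $\Theta$-notation, with all constants (including $C$ and $D$) suppressed to 1; the function name and these constant choices are illustrative only and not part of the theorem.

```python
import math

def nu_theta_expression(m, eps, delta):
    """Constant-free version of Theorem 2's tradeoff for nu(m, eps, delta)."""
    r = math.log2(1.0 / delta)                 # lg(1/delta)
    if m >= 2.0 / (eps ** 2 * delta):          # large m: every vector works
        return 1.0
    if m < r / eps ** 2:                       # below the C*lg(1/delta)/eps^2 regime
        return 0.0
    left = math.log2(eps * m / r) / r
    right = math.sqrt(math.log2(eps ** 2 * m / r) / r)
    return math.sqrt(eps) * min(left, right)

# Example: more rows allow less well-spread vectors (larger nu).
for m in (2 ** 8, 2 ** 12, 2 ** 16):
    print(m, round(nu_theta_expression(m, eps=0.1, delta=0.01), 4))
```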
1.2 Related Work

The Count Sketch scheme, presented by Charikar et al. [CCF04], was shown to satisfy (1) by Thorup and Zhang [TZ12]. The scheme essentially samples $O(\lg(1/\delta))$ independent copies of a feature hashing matrix with $m = O(\varepsilon^{-2})$ rows, and applies them all to $x$. The estimator for $\|x\|_2^2$ is then given by computing the median norm over all projected vectors. The Count Sketch scheme thus constructs a sketching matrix $A$ such that every column has $O(\lg(1/\delta))$ non-zero entries. However, this construction does not provide a norm-preserving embedding into a Euclidean space (that is, the estimator of $\|x\|_2^2$ cannot be represented as a norm of $Ax$), which is essential for some applications such as nearest-neighbor search [HIM12].

Kane and Nelson [KN14] presented a simple construction for the so-called sparse Johnson-Lindenstrauss transform. This is a distribution of $m \times n$ matrices, for $m = \Theta(\varepsilon^{-2}\lg(1/\delta))$, where every column has $s$ non-zero entries, randomly chosen from $\{-1, 1\}$. Note that if $s = 1$, this distribution yields the feature hashing one. Kane and Nelson showed that for $s = \Theta(\varepsilon m)$ this construction satisfies (1). Recently, Cohen et al. [CJN18] presented two simple proofs for this result. While their proof methods give (simple) bounds for high-order moments similar to those in Lemmas 3 and 4, they rely heavily on the fact that $s$ is relatively large. Specifically, for $s = 1$ the bounds their method or an extension thereof give are trivial.
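The following is a brief NumPy sketch of the Count Sketch style norm estimator described at the beginning of this subsection; the function name and parameters are ours, the paper gives no code, and the point of the sketch is only that the median of several independent feature-hashing norms estimates $\|x\|_2^2$ without being the norm of a single embedded vector.

```python
import numpy as np

def count_sketch_norm_estimate(x, m, copies, seed=0):
    """Median-of-copies estimator for ||x||_2^2 built from `copies`
    independent feature hashing maps with m rows each."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    estimates = []
    for _ in range(copies):
        h = rng.integers(0, m, size=n)
        sigma = rng.choice([-1.0, 1.0], size=n)
        y = np.zeros(m)
        np.add.at(y, h, sigma * x)
        estimates.append(float(y @ y))
    return float(np.median(estimates))

x = np.random.default_rng(2).normal(size=5000)
print(count_sketch_norm_estimate(x, m=400, copies=9) / (x @ x))  # close to 1
```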
2 Bounding ν(m, ε, δ)

In this section we prove the principal part of Theorem 2, assuming Lemmas 3 and 4, whose proofs are deferred to the full version of the paper. Formally, we prove the following.

Theorem 5 (Main Part of Theorem 2). There exists a constant $\hat{C} > 0$ such that for every $\varepsilon, \delta \in (0, 1)$ and $m \in \mathbb{N}_+$, if $\frac{\hat{C}\lg\frac1\delta}{\varepsilon^2} \le m < \frac{2}{\varepsilon^2\delta}$ then
$$\nu(m, \varepsilon, \delta) = \Theta\left(\sqrt{\varepsilon}\,\min\left\{\frac{\lg\frac{\varepsilon m}{\lg\frac1\delta}}{\lg\frac1\delta},\ \sqrt{\frac{\lg\frac{\varepsilon^2 m}{\lg\frac1\delta}}{\lg\frac1\delta}}\right\}\right).$$

Fix $\varepsilon, \delta \in (0, 1)$ and an integer $m$. From Lemmas 3 and 4 there exist $C_1, C_2 > 0$ such that for every $r, k$, if $r \le m/4$ then for every unit vector $x$, $\|X(x)\|_r \le 2^{C_2}\Lambda(m, r, \|x\|_\infty^{-2})$. Moreover, if $r \le k$ then
$$2^{-C_1}\Lambda(m, r, k) \le \|X(x^{(k)})\|_r \le 2^{C_2}\Lambda(m, r, k).$$
Note that in addition $\Lambda(m, 2r, k) \le 4\Lambda(m, r, k)$. Denote $\hat{C} = 2^{2C_2+2}$ and $C = 2^{2C_1+2C_2+5}$. For the rest of the proof we assume that $\frac{\hat{C}\lg\frac1\delta}{\varepsilon^2} \le m < \frac{2}{\varepsilon^2\delta}$, and we start by proving a lower bound on ν.

Lemma 6. $\nu(m, \varepsilon, \delta) = \Omega\left(\sqrt{\varepsilon}\,\min\left\{\frac{\lg\frac{\varepsilon m}{\lg\frac1\delta}}{\lg\frac1\delta},\ \sqrt{\frac{\lg\frac{\varepsilon^2 m}{\lg\frac1\delta}}{\lg\frac1\delta}}\right\}\right)$.

Proof. Let $r = \lg\frac1\delta$, and let $x \in \mathbb{R}^n$ be a unit vector such that
$$\|x\|_\infty \le \min\left\{\sqrt{\frac{\varepsilon}{2^{C_2}e}}\cdot\frac{\ln\frac{e\varepsilon m}{r}}{r},\ \sqrt{\frac{\varepsilon\ln\frac{e\varepsilon^2 m}{r}}{2^{C_2}er}}\right\},$$
and let $k := \|x\|_\infty^{-2} \ge \max\left\{\frac{2^{C_2}er^2}{\varepsilon\ln^2\frac{e\varepsilon m}{r}},\ \frac{2^{C_2}er}{\varepsilon\ln\frac{e\varepsilon^2 m}{r}}\right\}$. If $k \le mr$, then since $\frac{r^2}{k\ln^2\frac{emr}{k}}$ is convex as a function of $k \in \bigl[\frac{2^{C_2}er^2}{\varepsilon\ln^2\frac{e\varepsilon m}{r}}, mr\bigr]$, it attains its maximum at one of the endpoints, and evaluating both endpoints gives $2^{C_2}\cdot\frac{r^2}{k\ln^2\frac{emr}{k}} \le \frac{\varepsilon}{2}$. Moreover, if $k^2 \le mr$, then since $\frac{r}{k\ln\frac{emr}{k^2}}$ is convex as a function of $k \in \bigl[\frac{2^{C_2}er}{\varepsilon\ln\frac{e\varepsilon^2 m}{r}}, \sqrt{mr}\bigr]$, we similarly get $2^{C_2}\cdot\frac{r}{k\ln\frac{emr}{k^2}} \le \frac{\varepsilon}{2}$. Since clearly $2^{C_2}\sqrt{\frac{r}{m}} \le \frac{\varepsilon}{2}$, we conclude that $\|X(x)\|_r \le 2^{C_2}\Lambda(m, r, k) \le \frac{\varepsilon}{2}$ by Lemma 3, and thus
$$\Pr\bigl[\bigl|\|Ax\|_2^2 - 1\bigr| > \varepsilon\bigr] = \Pr\bigl[(X(x))^r > \varepsilon^r\bigr] \le 2^{-r} = \delta. \quad (2)$$
Hence $\nu(m, \varepsilon, \delta) \ge \min\left\{\sqrt{\frac{\varepsilon}{2^{C_2}e}}\cdot\frac{\ln\frac{e\varepsilon m}{r}}{r},\ \sqrt{\frac{\varepsilon\ln\frac{e\varepsilon^2 m}{r}}{2^{C_2}er}}\right\}$, as claimed.

Lemma 7. $\nu(m, \varepsilon, \delta) = O\left(\sqrt{\varepsilon}\,\min\left\{\frac{\lg\frac{\varepsilon m}{\lg\frac1\delta}}{\lg\frac1\delta},\ \sqrt{\frac{\lg\frac{\varepsilon^2 m}{\lg\frac1\delta}}{\lg\frac1\delta}}\right\}\right)$.

Proof. To this end, let $r = \frac{\lg\frac1\delta}{\lg C}$ (recall that $C = 2^{2C_1+2C_2+5}$), and denote
$$t := \sqrt{\varepsilon}\,\min\left\{\frac{\ln\frac{e\varepsilon m}{r}}{r},\ \sqrt{\frac{\ln\frac{e\varepsilon^2 m}{r}}{r}}\right\}.$$
Assume first that $t \le \frac{1}{\sqrt{r}}$, and let $k = \lceil\frac{1}{t^2}\rceil$. We will show that $\mathbb{E}\bigl[X(x^{(k)})^r\bigr] \ge (2\varepsilon)^r$. Since $t \le \frac{1}{\sqrt{r}}$, we have $k \ge r$. If the minimum defining $t$ is attained by the first branch, then $k = \bigl\lceil\frac{r^2}{\varepsilon\ln^2\frac{e\varepsilon m}{r}}\bigr\rceil$, and since $\frac{e\varepsilon m}{r} > e$, then $k \le mr$. Therefore
$$\mathbb{E}\bigl[X(x^{(k)})^r\bigr] = \|X(x^{(k)})\|_r^r \ge \bigl(2^{-C_1}\Lambda(m, r, k)\bigr)^r \ge \left(2^{-C_1}\frac{r^2}{k\ln^2\frac{emr}{k}}\right)^r \ge (2\varepsilon)^r.$$
Otherwise, $k = \bigl\lceil\frac{r}{\varepsilon\ln\frac{e\varepsilon^2 m}{r}}\bigr\rceil$. Moreover, since $\frac{\varepsilon^2 m}{r} > 1$, then $k \le \frac{r}{\varepsilon} \le mr$. Therefore
$$\mathbb{E}\bigl[X(x^{(k)})^r\bigr] = \|X(x^{(k)})\|_r^r \ge \bigl(2^{-C_1}\Lambda(m, r, k)\bigr)^r \ge \left(2^{-C_1}\frac{r}{k\ln\frac{emr}{k^2}}\right)^r \ge (2\varepsilon)^r.$$
Since $\mathbb{E}\bigl[X(x^{(k)})^r\bigr] \ge (2\varepsilon)^r$, applying the Paley-Zygmund inequality we get that
$$\Pr\Bigl[\bigl|\|Ax^{(k)}\|_2^2 - 1\bigr| > \varepsilon\Bigr] \ge \Pr\Bigl[X(x^{(k)})^r > \tfrac12\mathbb{E}\bigl[X(x^{(k)})^r\bigr]\Bigr] \ge \frac14\left(\frac{2^{-C_1}\Lambda(m, r, k)}{2^{C_2}\Lambda(m, 2r, k)}\right)^{2r} \ge \frac14\left(\frac{2^{-C_1}}{2^{C_2+2}}\right)^{2r} \ge \delta.$$
Therefore $\nu(m, \varepsilon, \delta) \le \|x^{(k)}\|_\infty \le t$.

Assume next that $\frac{1}{\sqrt{r}} < t < \frac{\sqrt{\varepsilon}}{4}$, and note that since $t > \frac{1}{\sqrt{r}}$, then $m > \max\{e^{1/(e\varepsilon)}, e^{\sqrt{r}}\}$. Let $k = \lceil\frac{1}{t^2}\rceil$, and consider independent $h \in_R [n] \to [m]$ and $\sigma = (\sigma_1, \ldots, \sigma_n) \in_R \{-1, 1\}^n$. Let $y \in \mathbb{R}^n$ be defined as follows. For every $j \in [n]$, $y_j = x^{(k)}_j$ if and only if $h(j) = 1$, and $y_j = 0$ otherwise. Denote $z = x^{(k)} - y$. Then $\|x^{(k)}\|_2^2 = \|y\|_2^2 + \|z\|_2^2$, and moreover $\|Ax^{(k)}\|_2^2 = \|Ay\|_2^2 + \|Az\|_2^2$, where $A = A(h, \sigma)$. Let $E_{\mathrm{first}}$ denote the event that $|h^{-1}(\{1\}) \cap [k]| = \lceil 2\sqrt{\varepsilon k}\rceil$ and that for all $j \in [n]$, if $h(j) = 1$ then $\sigma_j = 1$, and let $E_{\mathrm{rest}}$ denote the event that $\bigl|\|Az\|_2^2 - \|z\|_2^2\bigr| < \varepsilon\|z\|_2^2$. By Chebyshev's inequality, $\Pr[E_{\mathrm{rest}} \mid E_{\mathrm{first}}] = \Omega(1)$. Note that if $k = \bigl\lceil\frac{r^2}{\varepsilon\ln^2\frac{e\varepsilon m}{r}}\bigr\rceil$, then since $m > \max\{e^{1/(e\varepsilon)}, e^{\sqrt{r}}\}$ we get
$$\sqrt{\varepsilon k} \le \frac{\lg\frac1\delta}{\lg C\,\ln\frac{e\varepsilon m}{r}} = \frac{\lg\frac1\delta}{\lg C\,\bigl(\ln(em) - \ln\frac1\varepsilon - \ln r\bigr)} \le \frac{\lg\frac1\delta}{\lg C\,(\ln m - 3\ln\ln m)},$$
and otherwise $k = \bigl\lceil\frac{r}{\varepsilon\ln\frac{e\varepsilon^2 m}{r}}\bigr\rceil$, and
$$\sqrt{\varepsilon k} \le \varepsilon k \le \frac{\lg\frac1\delta}{\lg C\,\ln\frac{e\varepsilon^2 m}{r}} = \frac{\lg\frac1\delta}{\lg C\,\bigl(\ln(em) - 2\ln\frac1\varepsilon - \ln r\bigr)} \le \frac{\lg\frac1\delta}{\lg C\,(\ln m - 4\ln\ln m)}.$$
Therefore, for small enough $\varepsilon$,
$$\Pr[E_{\mathrm{first}}] = \binom{k}{\lceil 2\sqrt{\varepsilon k}\rceil}\left(\frac{1}{2m}\right)^{\lceil 2\sqrt{\varepsilon k}\rceil}\left(1 - \frac{1}{m}\right)^{k - \lceil 2\sqrt{\varepsilon k}\rceil} \ge \delta^{1 - \Omega(1)}.$$
Thus for small enough $\delta$, $\Pr[E_{\mathrm{first}} \cap E_{\mathrm{rest}}] \ge \delta$. Conditioned on $E_{\mathrm{first}} \cap E_{\mathrm{rest}}$ we get that
$$\|Ax^{(k)}\|_2^2 = \|Ay\|_2^2 + \|Az\|_2^2 \ge \frac{4\varepsilon k}{k} + (1 - \varepsilon)\|z\|_2^2 = 4\varepsilon + (1 - \varepsilon)\,\frac{k - \lceil 2\sqrt{\varepsilon k}\rceil}{k} \ge 4\varepsilon + (1 - \varepsilon)^2 > 1 + \varepsilon,$$
where the inequality before last is due to the fact that $k \ge \frac{4}{\varepsilon}$. Therefore $\nu(m, \varepsilon, \delta) \le \|x^{(k)}\|_\infty \le t$.

Finally, assume $t \ge \frac{\sqrt{\varepsilon}}{4}$. Since $t \ge \frac{\sqrt{\varepsilon}}{4}$, we get that $m \ge \frac{r}{e\varepsilon^2}\,e^{r/16} \ge \frac{r}{e\varepsilon^2\,\delta^{1/(4eC)}}$. Let $k = \lfloor\frac{2}{\varepsilon}\rfloor$. Consider independent $h \in_R [n] \to [m]$ and $\sigma = (\sigma_1, \ldots, \sigma_n) \in_R \{-1, 1\}^n$, and let $A = A(h, \sigma)$. Let $E_{\mathrm{col}}$ denote the event that there are $j \ne \ell \in [k]$ such that for every $p \ne q \in [k]$, $h(p) = h(q)$ if and only if $\{p, q\} = \{j, \ell\}$. Then for small enough $\varepsilon, \delta$,
$$\Pr[E_{\mathrm{col}}] \ge \binom{k}{2}\frac{1}{m}\left(1 - \frac{1}{m}\right)^{\binom{k}{2}-1} \ge \frac{k^2}{2m}\,\Bigl(1 - \frac{\varepsilon}{2}\Bigr)\Bigl(1 - 4e\delta^{1/(4Ce)}\Bigr) \ge \delta.$$
Conditioned on $E_{\mathrm{col}}$ we get that $\bigl|\|Ax^{(k)}\|_2^2 - 1\bigr| = \frac{2}{k} \ge \varepsilon$. Therefore $\nu(m, \varepsilon, \delta) \le \|x^{(k)}\|_\infty = \sqrt{\varepsilon/2} = O(t)$. This completes the proof of Lemma 7, and thus of Theorem 5.

3 Empirical Analysis

We complement our theoretical bounds with experimental results on both real and synthetic data. The goal of the experiments is to give bounds on some of the constants hidden in the main theorem. Our synthetic-data experiments show that for $\frac{4\lg\frac1\delta}{\varepsilon^2} \le m < \frac{2}{\varepsilon^2\delta}$ the constant inside the $\Theta$-notation in Theorem 2 is at least 0.725, except for very sparse vectors ($\|x\|_0 \le 7$), where the constant is at least 0.6. Furthermore, we confirm that $\nu(m, \varepsilon, \delta) = 1$ when $m \ge \frac{2}{\varepsilon^2\delta}$, and that there exist data points where $\nu(m, \varepsilon, \delta) < 1$ while $m = \frac{2-\gamma}{\varepsilon^2\delta}$ for some small $\gamma$. In addition, for the real-world data we tested feature hashing on, the constant is at least 1.1 or 0.8, depending on the data set.

3.1 Experiment Setup and Analysis

To arrive at the results, we ran experiments and analysed the data in several phases. In the first phase we varied the target dimension $m$ over exponentially spaced values in the range $[2^4, 2^{14}]$, and a parameter $k$ which controls the ratio between the $\ell_\infty$ and the $\ell_2$ norm; the values of $k$ varied over exponentially spaced values in the range $[2^1, 2^{13}]$.

Figure 1: The plot shows the ratio between $\hat\nu$ values and the theoretical bound, abbreviated here as $\min\{\mathrm{left}, \mathrm{right}\}$ with $\mathrm{left} = \sqrt{\varepsilon}\,\lg\frac{\varepsilon m}{\lg\frac1\delta}/\lg\frac1\delta$ and $\mathrm{right} = \sqrt{\varepsilon\lg\frac{\varepsilon^2 m}{\lg\frac1\delta}/\lg\frac1\delta}$, plotted against $\lg\frac1\delta$. This ratio corresponds to the constant in the $\Theta$-notation in Theorem 2. The points are marked with blue circles if left < right, and with green markers otherwise. The horizontal line at 0.725 is there to ease comparisons with Figure 2, while the line at 1.1 helps in comparing against real world data (Figure 4 and Figure 5).
Then for all $m$ and $k$, we generated $2^{24}$ vectors $x$ with entries in $\{0, 1\}$ such that $\|x\|_2 = \sqrt{k}\,\|x\|_\infty$, and for any given $m$ and $k$ the supports of the vectors were pairwise disjoint. We then hashed the generated vectors using feature hashing, and recorded the $\ell_2$ norm of the embedded vectors.

The second phase then calculated the distortion between the original and the embedded vectors, and computed the error probability $\hat\delta$. Loosely speaking, $\hat\delta(m, k, \varepsilon)$ is the fraction of the $2^{24}$ vectors for a given $m$ and $k$ that have distortion greater than $\varepsilon$. Formally, $\hat\delta$ is calculated using the following formula
$$\hat\delta(m, k, \varepsilon) = \frac{\bigl|\bigl\{x : \|x\|_2 = \sqrt{k}\,\|x\|_\infty,\ \bigl|\|A_m x\|_2^2 - \|x\|_2^2\bigr| \ge \varepsilon\|x\|_2^2\bigr\}\bigr|}{\bigl|\bigl\{x : \|x\|_2 = \sqrt{k}\,\|x\|_\infty\bigr\}\bigr|},$$
where $\varepsilon$ was varied over exponentially spaced values in the range $[2^{-10}, 2^{-1}]$. Note that $\hat\delta$ tends to the true error probability as the number of vectors tends to infinity. Computing $\hat\delta$ yielded a series of 4-tuples $(m, k, \varepsilon, \hat\delta)$, which can be interpreted as: given target dimension $m$, $\ell_\infty/\ell_2$ ratio $1/\sqrt{k}$ and distortion $\varepsilon$, we have measured that the failure probability is at most $\hat\delta$.

In the third phase, we varied $\delta$ over exponentially spaced values in the range $[2^{-20}, 2^0]$, and calculated a value $\hat\nu$. Intuitively, $\hat\nu(m, \varepsilon, \delta)$ is the largest $\ell_\infty/\ell_2$ ratio such that for all vectors having at most this $\ell_\infty/\ell_2$ ratio the measured error probability $\hat\delta$ is at most $\delta$. Formally,
$$\hat\nu(m, \varepsilon, \delta) = \max\Bigl\{\tfrac{1}{\sqrt{k}} : \forall k' \ge k,\ \hat\delta(m, k', \varepsilon) \le \delta\Bigr\}.$$
Note once more that $\hat\nu$ tends to the true $\nu$ value as the number of vectors tends to infinity.
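The estimation procedure described above can be sketched compactly as follows; this is an illustrative re-implementation with far fewer vectors than the $2^{24}$ used in the actual experiments, and all function names are ours.

```python
import numpy as np

def measured_delta(m, k, eps, trials=2000, seed=0):
    """Empirical delta-hat(m, k, eps): fraction of k-sparse bit vectors x
    (so ||x||_2^2 = k and ||x||_inf = 1) whose feature-hashed squared norm
    deviates from k by at least eps * k."""
    rng = np.random.default_rng(seed)
    fails = 0
    for _ in range(trials):
        h = rng.integers(0, m, size=k)            # bucket of each of the k ones
        sigma = rng.choice([-1.0, 1.0], size=k)   # sign of each of the k ones
        y = np.zeros(m)
        np.add.at(y, h, sigma)
        if abs(y @ y - k) >= eps * k:
            fails += 1
    return fails / trials

def measured_nu(m, eps, delta, ks, **kwargs):
    """Empirical nu-hat(m, eps, delta): the largest ratio 1/sqrt(k) such that
    every k' >= k in `ks` has measured error probability at most delta."""
    deltas = {k: measured_delta(m, k, eps, **kwargs) for k in ks}
    good = [k for k in ks if all(deltas[k2] <= delta for k2 in ks if k2 >= k)]
    return 1.0 / np.sqrt(min(good)) if good else 0.0

print(measured_nu(m=1024, eps=0.25, delta=0.05, ks=[2 ** i for i in range(1, 11)]))
```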
To find a bound on the constant of the $\Theta$-notation in Theorem 2, we truncated data points that did not satisfy $\frac{4\lg\frac1\delta}{\varepsilon^2} \le m < \frac{2}{\varepsilon^2\delta}$, and for the remaining points we plotted $\hat\nu$ over the theoretical bound in Figure 1. From this plot we conclude that the constant is at least 0.6 over the large range of parameters we tested. However, the smallest values seem to be outliers and come from a combination of very sparse vectors ($k = 7$) and high target dimension ($m = 2^{14}$). For the rest of the data points the constant is at least 0.725. While there are data points where the constant is larger (i.e. feature hashing performs better), there are data points close to 0.725 over the entire range of $\varepsilon$ and $\delta$.

In Figure 2 we show that we indeed need both terms in the minimum in Theorem 2, by plotting the measured $\hat\nu$ values over both terms in the minimum in the theoretical bound separately. For both terms there are points whose value is significantly below 0.725.

Figure 2: This plot shows the measured $\hat\nu$ values over each of the two terms in the minimum in the theoretical bound (abbreviated here): $\min\{\mathrm{left}, \mathrm{right}\}$. In the left subfigure the y-axis of the blue circles is $\hat\nu / \mathrm{left}$, while the y-axis of the green markers in the right subfigure is $\hat\nu / \mathrm{right}$. Note that the x-axis (values of $\lg(1/\delta)$) is the same in both subfigures, and the same as in the right subfigure of Figure 1. As in Figure 1, the horizontal line at 0.725 is there to ease comparison between the figures.

To find a bound on $m$ where $\hat\nu(m, \varepsilon, \delta) = 1$, we took the untruncated data and recorded the maximal $\hat\delta$ for each $m$ and $\varepsilon$. We then plotted $m\varepsilon^2\hat\delta$ in Figure 3. From Figure 3 it is clear that $\hat\nu(m, \varepsilon, \delta) = 1$ when $m \ge \frac{2}{\varepsilon^2\delta}$. Furthermore, the figure also shows that there are data points where $\hat\nu(m, \varepsilon, \delta) < 1$ while $m = \frac{2-\gamma}{\varepsilon^2\delta}$, for some small $\gamma$. Therefore we conclude the bound $m \ge \frac{2}{\varepsilon^2\delta}$ is tight.

Figure 3: This plot shows the constant at which $\hat\nu(m, \varepsilon, \delta)$ becomes 1. The theory states that if $m\varepsilon^2\delta \ge 2$ then $\hat\nu(m, \varepsilon, \delta) = 1$. The distinct curves in the left plot correspond to distinct values of $m$.

3.2 Real-World Data

We also ran experiments on real-world data, namely bag-of-words representations of 1500 NIPS papers with stopwords removed [DKT17b, New08]. We ran experiments on this data set both with and without preprocessing with the common logarithmic term frequency–inverse document frequency (tf-idf) weighting. These experiments were executed and analysed similarly to the synthetic experiments described above, except for a few changes. First, in order to explore any meaningful $\delta$ values we hashed each vector $2^{20}$ times. In this way, iterating over the original vectors in the real-world experiments plays a similar role to iterating over the $k$ values in the synthetic experiments. Secondly, in these experiments $m$ ranged over values in $[2^4, 2^{12}]$. The results of these experiments can be seen in Figure 4 and Figure 5, from which we conclude that feature hashing performs even better on the real-world data we tested compared to the synthetic data, as the Theorem 2 constant is always above 1.1 and 0.8 with and without tf-idf, respectively. Furthermore, the vast majority of data points have a constant around or above 1.2.

Figure 4: This plot has a similar structure to Figure 1, but is based on the NIPS dataset preprocessed with tf-idf. It shows the measured $\hat\nu$ values over the theoretical bound (abbreviated here): $\min\{\mathrm{left}, \mathrm{right}\}$. This ratio corresponds to the constant in the $\Theta$-notation in Theorem 2. The points are marked with blue circles if left < right, otherwise they are marked with green markers. The horizontal lines at 0.725 and 1.1 are there to ease comparisons with Figure 1.

Figure 5: This plot has a similar structure to Figure 1, but is based on the NIPS dataset without tf-idf preprocessing. It shows the measured $\hat\nu$ values over the theoretical bound (abbreviated here): $\min\{\mathrm{left}, \mathrm{right}\}$. This ratio corresponds to the constant in the $\Theta$-notation in Theorem 2. The points are marked with blue circles if left < right, otherwise they are marked with green markers. The horizontal lines at 0.725 and 1.1 are there to ease comparisons with Figure 1.

3.3 Implementation Details

As random number generators, we used degree 20 polynomials modulo the Mersenne prime $2^{61} - 1$, where the coefficients were random data from random.org. The random data was independent between experiments with different values of $m$, between synthetic and real-world experiments, and between the random number generator used for vector generation and the one used for hashing. Feature hashing was done using double tabulation hashing [Tho14] on 64-bit numbers. The tables in our implementation of double tabulation hashing were filled with numbers from the aforementioned random number generator. Double tabulation hashing has been proven to behave fully randomly with high probability [DKRT15]. In order to efficiently do the $2^{20}$ hashings per vector for the real-world data, we utilised the high independence of double tabulation hashing. Let $d$ be the original dimension of the vectors. We blew up the source dimension to $2^{20}d$, and at the $i$th rehash we shifted the coordinates of the original vector $i \cdot d$ places to the right.
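As an illustration of the random number generation just described, here is a small Python sketch of a degree-20 polynomial hash modulo the Mersenne prime $2^{61}-1$. It is a simplified stand-in, not the authors' code: the paper's coefficients came from random.org and its feature hashing additionally uses double tabulation hashing, whereas the sketch below seeds a pseudorandom generator.

```python
import random

MERSENNE_P = 2 ** 61 - 1  # modulus: the Mersenne prime 2^61 - 1

def make_poly_hash(m, degree=20, seed=0):
    """Return a hash function into [m] given by a random degree-`degree`
    polynomial modulo 2^61 - 1, which is (degree + 1)-independent."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(MERSENNE_P) for _ in range(degree + 1)]

    def h(x):
        acc = 0
        for c in coeffs:                 # Horner's rule, all arithmetic mod p
            acc = (acc * x + c) % MERSENNE_P
        return acc % m
    return h

h = make_poly_hash(m=2 ** 10)
print([h(j) for j in range(8)])          # buckets for the first few keys
```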
Acknowledgments

This work was supported by a Villum Young Investigator Grant.

References

[AC09] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.

[Ach03] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, June 2003.

[AIL+15] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS '15, pages 1225–1233. MIT Press, 2015. Available from: http://dl.acm.org/citation.cfm?id=2969239.2969376.

[CCF04] M. Charikar, K. C. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004.

[CJN18] M. B. Cohen, T. S. Jayram, and J. Nelson. Simple analyses of the sparse Johnson-Lindenstrauss transform. In 1st Symposium on Simplicity in Algorithms, SOSA 2018, January 7-10, 2018, New Orleans, LA, USA, pages 15:1–15:9, 2018.

[CT06] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.

[CW09] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, pages 205–214. ACM, 2009.

[Dal13] B. Dalessandro. Bring the noise: Embracing randomness is the key to scaling up machine learning algorithms. Big Data, 1(2):110–112, 2013.

[DG03] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.

[DKRT15] S. Dahlgaard, M. B. T. Knudsen, E. Rotenberg, and M. Thorup. Hashing for statistics over k-partitions. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), FOCS '15, pages 1292–1310. IEEE Computer Society, 2015.

[DKS10] A. Dasgupta, R. Kumar, and T. Sarlós. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, STOC '10, pages 341–350, 2010.

[DKT17a] S. Dahlgaard, M. Knudsen, and M. Thorup. Practical hash functions for similarity estimation and dimensionality reduction. In Advances in Neural Information Processing Systems 30, pages 6615–6625. Curran Associates, Inc., 2017.

[DKT17b] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017. Available from: http://archive.ics.uci.edu/ml.

[FL17] C. B. Freksen and K. G. Larsen. On using Toeplitz and circulant matrices for Johnson-Lindenstrauss transforms. In 28th International Symposium on Algorithms and Computation, ISAAC 2017, December 9-12, 2017, Phuket, Thailand, pages 32:1–32:12, 2017.

[HIM12] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012.

[HWB08] C. Hegde, M. Wakin, and R. Baraniuk. Random projections for manifold learning. In Advances in Neural Information Processing Systems 20, pages 641–648. Curran Associates, Inc., 2008.

[JL84] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.
[JW13] T. S. Jayram and D. P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algorithms, 9(3):26:1–26:17, 2013.

[KN14] D. M. Kane and J. Nelson. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4:1–4:23, January 2014.

[LN17] K. G. Larsen and J. Nelson. Optimality of the Johnson-Lindenstrauss lemma. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 633–638, 2017.

[Mat08] J. Matoušek. On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms, 33(2):142–156, 2008.

[MM09] O. Maillard and R. Munos. Compressed least-squares regression. In Advances in Neural Information Processing Systems 22, pages 1213–1221. Curran Associates, Inc., 2009.

[MM13] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Symposium on Theory of Computing Conference, STOC '13, Palo Alto, CA, USA, June 1-4, 2013, pages 91–100, 2013.

[New08] D. Newman. Bag of words data set, 2008. Available from: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.

[NN13] J. Nelson and H. L. Nguyễn. Sparsity lower bounds for dimensionality reducing maps. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 101–110. ACM, 2013.

[PBMID14] S. Paul, C. Boutsidis, M. Magdon-Ismail, and P. Drineas. Random projections for linear support vector machines. ACM Trans. Knowl. Discov. Data, 8(4):22:1–22:25, 2014.

[RR08] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.

[Sár06] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 143–152. IEEE Computer Society, 2006.

[Sut15] S. Suthaharan. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning. Springer Publishing Company, Incorporated, 1st edition, 2015.

[Tho14] M. Thorup. Simple tabulation, fast expanders, double tabulation, and high independence. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 90–99. IEEE Computer Society, 2014.

[TZ12] M. Thorup and Y. Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM J. Comput., 41(2):293–331, April 2012.

[Vem05] S. S. Vempala. The Random Projection Method. DIMACS series in discrete mathematics and theoretical computer science. American Mathematical Society, 2005.

[WDL+09] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1113–1120, 2009.