# Understanding Sparse JL for Feature Hashing

Meena Jagadeesan
Harvard University, Cambridge, MA 02138
mjagadeesan@college.harvard.edu

I would like to thank Prof. Jelani Nelson for advising this project.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

**Abstract** Feature hashing and other random projection schemes are commonly used to reduce the dimensionality of feature vectors. The goal is to efficiently project a high-dimensional feature vector living in $\mathbb{R}^n$ into a much lower-dimensional space $\mathbb{R}^m$, while approximately preserving Euclidean norm. These schemes can be constructed using sparse random projections, for example using a sparse Johnson-Lindenstrauss (JL) transform. A line of work introduced by Weinberger et al. (ICML '09) analyzes the accuracy of sparse JL with sparsity 1 on feature vectors with small $\ell_\infty$-to-$\ell_2$ norm ratio. Recently, Freksen, Kamma, and Larsen (NeurIPS '18) closed this line of work by proving a tight tradeoff between $\ell_\infty$-to-$\ell_2$ norm ratio and accuracy for sparse JL with sparsity 1. In this paper, we demonstrate the benefits of using sparsity $s$ greater than 1 in sparse JL on feature vectors. Our main result is a tight tradeoff between $\ell_\infty$-to-$\ell_2$ norm ratio and accuracy for a general sparsity $s$ that significantly generalizes the result of Freksen et al. Our result theoretically demonstrates that sparse JL with $s > 1$ can have significantly better norm-preservation properties on feature vectors than sparse JL with $s = 1$; we also empirically demonstrate this finding.

## 1 Introduction

Feature hashing and other random projection schemes are influential in helping manage large data [11]. The goal is to reduce the dimensionality of feature vectors: more specifically, to project high-dimensional feature vectors living in $\mathbb{R}^n$ into a lower-dimensional space $\mathbb{R}^m$ (where $m \ll n$), while approximately preserving Euclidean distances (i.e. $\ell_2$ distances) with high probability. This dimensionality reduction enables a classifier to process vectors in $\mathbb{R}^m$ instead of vectors in $\mathbb{R}^n$.

In this context, feature hashing was first introduced by Weinberger et al. [29] for document-based classification tasks such as email spam filtering. For such tasks, feature hashing yields a lower-dimensional embedding of a high-dimensional feature vector derived from a bag-of-words model. Since then, feature hashing has become a mainstream approach [28], applied to numerous domains including ranking text documents [4], compressing neural networks [7], and protein sequence classification [5].

**Random Projections** Dimensionality reduction schemes for feature vectors fit nicely into the random projection literature. In fact, the feature hashing scheme proposed by Weinberger et al. [29] boils down to uniformly drawing a random $m \times n$ matrix where each column contains one nonzero entry, equal to $+1$ or $-1$. The $\ell_2$-norm-preserving objective can be expressed mathematically as follows: for error $\epsilon > 0$ and failure probability $\delta$, the goal is to construct a probability distribution $\mathcal{A}$ over $m \times n$ real matrices that satisfies the following condition for vectors $x \in \mathbb{R}^n$:
$$\mathbb{P}_{A \sim \mathcal{A}}\big[(1 - \epsilon)\|x\|_2 \le \|Ax\|_2 \le (1 + \epsilon)\|x\|_2\big] > 1 - \delta. \tag{1}$$
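As a concrete illustration of condition (1), the following minimal sketch (our own illustration, not code from the paper; the function names, vectors, and parameter values are arbitrary assumptions) draws the one-nonzero-per-column feature hashing matrix described above and estimates the failure probability on a fixed vector by Monte Carlo.

```python
import numpy as np

def feature_hashing_matrix(m, n, rng):
    """Feature hashing / sparse JL with s = 1: each column has a single +/-1 entry."""
    A = np.zeros((m, n))
    rows = rng.integers(0, m, size=n)           # hash each coordinate to a row
    signs = rng.choice([-1.0, 1.0], size=n)     # random sign per coordinate
    A[rows, np.arange(n)] = signs
    return A

def estimate_failure_probability(x, m, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P[ ||Ax||_2 falls outside (1 +/- eps) ||x||_2 ]."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    failures = 0
    for _ in range(trials):
        A = feature_hashing_matrix(m, n, rng)
        ratio = np.linalg.norm(A @ x) / np.linalg.norm(x)
        failures += not (1 - eps <= ratio <= 1 + eps)
    return failures / trials

n, m, eps = 1000, 256, 0.1
rng = np.random.default_rng(1)
x_spread = rng.standard_normal(n)              # small l_inf-to-l_2 ratio
x_peaked = np.zeros(n); x_peaked[:2] = 1.0     # mass concentrated on two coordinates
print("spread vector:", estimate_failure_probability(x_spread, m, eps))
print("peaked vector:", estimate_failure_probability(x_peaked, m, eps))
```

On vectors whose mass is concentrated on a few coordinates, the estimated failure probability is typically much larger; this is exactly the phenomenon that the $\ell_\infty$-to-$\ell_2$ norm ratio studied below captures.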
The result underlying the random projection literature is the Johnson-Lindenstrauss lemma, which gives an upper bound on the dimension $m$ achievable by a probability distribution $\mathcal{A}$ satisfying (1):

**Lemma 1.1 (Johnson-Lindenstrauss [16])** For any $n \in \mathbb{N}$ and $\epsilon, \delta \in (0, 1)$, there exists a probability distribution $\mathcal{A}$ over $m \times n$ matrices, with $m = \Theta(\epsilon^{-2}\ln(1/\delta))$, that satisfies (1).

The optimality of the dimension $m$ achieved by Lemma 1.1 has been proven [17, 15]. To speed up projection time, it is useful to consider probability distributions over sparse matrices (i.e. matrices with a small number of nonzero entries per column). More specifically, for matrices with $s$ nonzero entries per column, the projection time for a vector $x$ goes down from $O(m\|x\|_0)$ to $O(s\|x\|_0)$, where $\|x\|_0$ is the number of nonzero entries of $x$. In this context, Kane and Nelson [19] constructed sparse JL distributions (which we define formally in Section 1.1), improving upon previous work [2, 22, 12]. Roughly speaking, a sparse JL distribution, as constructed in [19], boils down to drawing a random $m \times n$ matrix where each column contains exactly $s$ nonzero entries, each equal to $1/\sqrt{s}$ or $-1/\sqrt{s}$. Kane and Nelson show that sparse JL distributions achieve the same (optimal) dimension as Lemma 1.1, while also satisfying a sparsity property.

**Theorem 1.2 (Sparse JL [19])** For any $n \in \mathbb{N}$ and $\epsilon, \delta \in (0, 1)$, a sparse JL distribution $\mathcal{A}_{s,m,n}$ (defined formally in Section 1.1) over $m \times n$ matrices, with dimension $m = \Theta(\epsilon^{-2}\ln(1/\delta))$ and sparsity $s = \Theta(\epsilon^{-1}\ln(1/\delta))$, satisfies (1).

Sparse JL distributions are state-of-the-art sparse random projections, and achieve a sparsity that is nearly optimal when the dimension $m$ is $\Theta(\epsilon^{-2}\ln(1/\delta))$.² However, in practice, it can be necessary to utilize a lower sparsity $s$, since the projection time is linear in $s$. Resolving this issue, Cohen [8] extended the upper bound in Theorem 1.2 to show that sparse JL distributions can achieve a lower sparsity with an appropriate gain in dimension. He proved the following dimension-sparsity tradeoffs:

**Theorem 1.3 (Dimension-Sparsity Tradeoffs [8])** For any $n \in \mathbb{N}$ and $\epsilon, \delta \in (0, 1)$, a uniform sparse JL distribution $\mathcal{A}_{s,m,n}$ (defined formally in Section 1.1), with $s \le \Theta(\epsilon^{-1}\ln(1/\delta))$ and $m \ge \min\big(2\epsilon^{-2}/\delta,\ \epsilon^{-2}\ln(1/\delta)\,e^{\Theta(\epsilon^{-1}\ln(1/\delta)/s)}\big)$, satisfies (1).

**Connection to Feature Hashing** Sparse JL distributions have particularly close ties to feature hashing. In particular, the feature hashing scheme proposed by Weinberger et al. [29] can be viewed as a special case of sparse JL, namely with $s = 1$. Interestingly, in practice, feature hashing can do much better than theoretical results, such as Theorem 1.2 and Theorem 1.3, would indicate [13]. An explanation for this phenomenon is that the highest error terms in sparse JL stem from vectors with mass concentrated on a very small number of entries, while in practice, the mass of feature vectors may be spread out over many coordinates. This motivates studying the tradeoff space for vectors with low $\ell_\infty$-to-$\ell_2$ ratio. More formally, take $S_v$ to be
$$S_v = \left\{ x \in \mathbb{R}^n \;\middle|\; \frac{\|x\|_\infty}{\|x\|_2} \le v \right\},$$
so that $S_1 = \mathbb{R}^n$ and $S_v \subseteq S_w$ for $0 \le v < w \le 1$. Let $v(m, \epsilon, \delta, s)$ be the supremum over all $0 \le v \le 1$ such that a sparse JL distribution with sparsity $s$ and dimension $m$ satisfies (1) for each $x \in S_v$. (That is, $v(m, \epsilon, \delta, s)$ is the maximum $v \in [0, 1]$ such that for every $x \in \mathbb{R}^n$, if $\|x\|_\infty \le v\|x\|_2$ then (1) holds.) For $s = 1$, a line of work [29, 12, 18, 10, 19] improved bounds on $v(m, \epsilon, \delta, 1)$, and it was recently closed by Freksen et al. [13].
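To make membership in $S_v$ concrete, here is a short illustration (ours, with hypothetical vectors, not part of the paper): a vector with its mass spread evenly over $k$ coordinates has $\ell_\infty$-to-$\ell_2$ ratio $1/\sqrt{k}$, so bag-of-words feature vectors that touch many terms lie in $S_v$ for small $v$. Theorem 1.4 below pins down exactly how small this ratio must be for feature hashing.

```python
import numpy as np

def linf_l2_ratio(x):
    """||x||_inf / ||x||_2, the quantity that determines membership in S_v."""
    return np.max(np.abs(x)) / np.linalg.norm(x)

for k in [1, 4, 100, 10000]:
    x = np.zeros(10000)
    x[:k] = 1.0                      # mass spread evenly over k coordinates
    print(k, linf_l2_ratio(x))       # prints 1.0, 0.5, 0.1, 0.01
```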
**Theorem 1.4 ([13])** For any $m \in \mathbb{N}$ and $\epsilon, \delta \in (0, 1)$, the function $v(m, \epsilon, \delta, 1)$ is equal to $f(m, \epsilon, \ln(1/\delta))$ where:
$$f(m, \epsilon, p) = \begin{cases} 1 & \text{if } m \ge 2\epsilon^{-2}e^{p} \\[2pt] \Theta\!\left(\sqrt{\epsilon}\cdot\min\!\left(\dfrac{\ln(m\epsilon/p)}{p},\ \sqrt{\dfrac{\ln(m\epsilon^2/p)}{p}}\right)\right) & \text{if } \Theta(\epsilon^{-2}p) \le m < 2\epsilon^{-2}e^{p} \\[2pt] 0 & \text{if } m \le \Theta(\epsilon^{-2}p). \end{cases}$$

² Nelson and Nguyen [25] showed that any distribution satisfying (1) requires sparsity $\Omega(\epsilon^{-1}\ln(1/\delta)/\ln(1/\epsilon))$ when the dimension $m$ is $\Theta(\epsilon^{-2}\ln(1/\delta))$. Kane and Nelson [19] also showed that the analysis of sparse JL distributions in Theorem 1.2 is tight at $m = \Theta(\epsilon^{-2}\ln(1/\delta))$.

**Generalizing to Sparse Random Projections with s > 1** While Theorem 1.4 is restricted to the case of $s = 1$, dimensionality reduction schemes constructed using sparse random projections with sparsity $s > 1$ have been used in practice for projecting feature vectors. For example, sparse JL-like methods (with $s > 1$) have been used to project feature vectors in machine learning domains including visual tracking [27], face recognition [23], and recently in ELM [6]. Now, a variant of sparse JL is included in the Python sklearn library.³ In this context, it is natural to explore how constructions with $s > 1$ perform on feature vectors, by studying $v(m, \epsilon, \delta, s)$ for sparse JL with $s > 1$. In fact, a related question was considered by Weinberger et al. [29] for multiple hashing, an alternate distribution over sparse matrices constructed by adding $s$ draws from $\mathcal{A}_{1,m,n}$ and scaling by $1/\sqrt{s}$. More specifically, they show that $v(m, \epsilon, \delta, s) \ge \min(1, \sqrt{s}\cdot v(m, \epsilon, \delta, 1))$ for multiple hashing. However, Kane and Nelson [19] later showed that multiple hashing has worse geometry-preserving properties than sparse JL: that is, multiple hashing requires a larger sparsity than sparse JL to satisfy (1). Characterizing $v(m, \epsilon, \delta, s)$ for sparse JL distributions, which are state-of-the-art, remained an open problem.

In this work, we settle how $v(m, \epsilon, \delta, s)$ behaves for sparse JL with a general sparsity $s > 1$, giving tight bounds. Our theoretical result shows that sparse JL with $s > 1$, even if $s$ is a small constant, can achieve significantly better norm-preservation properties for feature vectors than sparse JL with $s = 1$. Moreover, we empirically demonstrate this finding.

**Main Results** We show the following tight bounds on $v(m, \epsilon, \delta, s)$ for a general sparsity $s$:

**Theorem 1.5** For any $s, m \in \mathbb{N}$ such that $s \le m/e$, consider a uniform sparse JL distribution (defined in Section 1.1) with sparsity $s$ and dimension $m$.⁴ If $\epsilon$ and $\delta$ are small enough⁵, the function $v(m, \epsilon, \delta, s)$ is equal to $f(m, \epsilon, \ln(1/\delta), s)$, where $f(m, \epsilon, p, s)$ is⁶:
$$f(m, \epsilon, p, s) = \begin{cases} 1 & \text{if } m \ge \min\!\big(2\epsilon^{-2}e^{p},\ \epsilon^{-2}p\,e^{\Theta(\max(1,\,p\epsilon^{-1}/s))}\big) \\[2pt] \Theta\!\left(\sqrt{\epsilon s}\cdot\sqrt{\dfrac{\ln(m\epsilon^2/p)}{p}}\right) & \text{else, if } m \ge \max\!\big(\Theta(\epsilon^{-2}p),\ s\,e^{\Theta(\max(1,\,p\epsilon^{-1}/s))}\big) \\[2pt] \Theta\!\left(\sqrt{\epsilon s}\cdot\min\!\left(\dfrac{\ln(m\epsilon/p)}{p},\ \sqrt{\dfrac{\ln(m\epsilon^2/p)}{p}}\right)\right) & \text{else, if } \Theta(\epsilon^{-2}p) \le m \le \min\!\big(\epsilon^{-2}e^{\Theta(p)},\ s\,e^{\Theta(\max(1,\,p\epsilon^{-1}/s))}\big) \\[2pt] 0 & \text{if } m \le \Theta(\epsilon^{-2}p). \end{cases}$$

Our main result, Theorem 1.5, significantly generalizes Theorem 1.2, Theorem 1.3, and Theorem 1.4. Notice our bound in Theorem 1.5 has up to four regimes. In the first regime, which occurs when $m \ge \min\big(2\epsilon^{-2}/\delta,\ \epsilon^{-2}\ln(1/\delta)\,e^{\Theta(\max(1,\,\ln(1/\delta)\epsilon^{-1}/s))}\big)$, Theorem 1.5 shows $v(m, \epsilon, \delta, s) = 1$, so (1) holds on the full space $\mathbb{R}^n$. Notice this boundary on $m$ occurs at the dimension-sparsity tradeoff in Theorem 1.3. In the last regime, which occurs when $m \le \Theta(\epsilon^{-2}\ln(1/\delta))$, Theorem 1.5 shows that $v(m, \epsilon, \delta, s) = 0$, so there are vectors with arbitrarily small $\ell_\infty$-to-$\ell_2$ norm ratio where (1) does not hold. When $s \le \Theta(\epsilon^{-1}\ln(1/\delta))$, Theorem 1.5 shows that up to two intermediate regimes exist. One of the regimes, $\Theta\big(\sqrt{\epsilon s}\cdot\min\big(\ln(m\epsilon/p)/p,\ \sqrt{\ln(m\epsilon^2/p)/p}\big)\big)$, matches the middle regime of $v(m, \epsilon, \delta, 1)$ in Theorem 1.4 with an extra factor of $\sqrt{s}$, much like the bound for multiple hashing in [29] that we mentioned previously.
However, unlike the multiple hashing bound, Theorem 1.5 sometimes has another regime, $\Theta\big(\sqrt{\epsilon s}\cdot\sqrt{\ln(m\epsilon^2/p)/p}\big)$, which does not arise for $s = 1$ (i.e. in Theorem 1.4).⁷ Intuitively, we expect this additional regime for sparse JL with $s$ close to $\Theta(\epsilon^{-1}\ln(1/\delta))$: at $s = \Theta(\epsilon^{-1}\ln(1/\delta))$ and $m = \Theta(\epsilon^{-2}\ln(1/\delta))$, Theorem 1.2 tells us $v(m, \epsilon, \delta, s) = 1$, but if $\epsilon$ is a constant, then the branch $\Theta\big(\sqrt{\epsilon s}\cdot\ln(m\epsilon/p)/p\big)$ yields $\Theta(1/\sqrt{\ln(1/\delta)})$, while the branch $\Theta\big(\sqrt{\epsilon s}\cdot\sqrt{\ln(m\epsilon^2/p)/p}\big)$ yields $\Theta(1)$. Thus, it is natural that the first branch disappears for large $m$.

³ See https://scikit-learn.org/stable/modules/random_projection.html.
⁴ We prove the lower bound on $v(m, \epsilon, \delta, s)$ in Theorem 1.5 for any sparse JL distribution.
⁵ By "small enough", we mean the condition that $\epsilon, \delta \in (0, C')$ for some positive constant $C'$.
⁶ Notice that the function $f(m, \epsilon, p, s)$ is not defined for certain constant-factor intervals between the boundaries of regimes (e.g. $C_1\epsilon^{-2}p \le m \le C_2\epsilon^{-2}p$). See Appendix A for a discussion.
⁷ This regime does not arise for $s = 1$, since $e^{\Theta(p\epsilon^{-1})} \ge \epsilon^{-2}e^{\Theta(p)}$ for sufficiently small $\epsilon$.

Our result elucidates that $v(m, \epsilon, \delta, s)$ increases approximately as $\sqrt{s}$, thus providing insight into how even small constant increases in sparsity can be useful in practice. Another consequence of our result is a lower bound on dimension-sparsity tradeoffs (Corollary A.1 in Appendix A) that essentially matches the upper bound in Theorem 1.3. Moreover, we require new techniques to prove Theorem 1.5, for reasons that we discuss further in Section 1.2.

We also empirically support our theoretical findings in Theorem 1.5. First, we illustrate with real-world datasets the potential benefits of using small constants $s > 1$ for sparse JL on feature vectors. We specifically show that $s \in \{4, 8, 16\}$ consistently outperforms $s = 1$ in preserving the $\ell_2$ norm of each vector, and that there can be up to a factor of ten decrease in failure probability for $s = 8, 16$ in comparison to $s = 1$. Second, we use synthetic data to illustrate phase transitions and other trends in Theorem 1.5. More specifically, we empirically show that $v(m, \epsilon, \delta, s)$ is not smooth, and that the middle regime(s) of $v(m, \epsilon, \delta, s)$ increase with $s$.

### 1.1 Preliminaries

We call $\mathcal{A}_{s,m,n}$ a sparse JL distribution if the entries of a matrix $A \sim \mathcal{A}_{s,m,n}$ are generated as follows. Let $A_{r,i} = \eta_{r,i}\sigma_{r,i}/\sqrt{s}$, where $\{\sigma_{r,i}\}_{r \in [m], i \in [n]}$ and $\{\eta_{r,i}\}_{r \in [m], i \in [n]}$ are defined as follows:

- The families $\{\sigma_{r,i}\}_{r \in [m], i \in [n]}$ and $\{\eta_{r,i}\}_{r \in [m], i \in [n]}$ are independent from each other.
- The variables $\{\sigma_{r,i}\}_{r \in [m], i \in [n]}$ are i.i.d. Rademachers ($\pm 1$ coin flips).
- The variables $\{\eta_{r,i}\}_{r \in [m], i \in [n]}$ are identically distributed Bernoullis ($\{0, 1\}$ random variables) with expectation $s/m$.
- The $\{\eta_{r,i}\}_{r \in [m], i \in [n]}$ are independent across columns but not independent within each column. For every column $1 \le i \le n$, it holds that $\sum_{r=1}^m \eta_{r,i} = s$. Moreover, the random variables are negatively correlated: for every subset $S \subseteq [m]$ and every column $1 \le i \le n$, it holds that $\mathbb{E}\big[\prod_{r \in S}\eta_{r,i}\big] \le \prod_{r \in S}\mathbb{E}[\eta_{r,i}] = (s/m)^{|S|}$.

A common special case is a uniform sparse JL distribution, generated as follows: for every $1 \le i \le n$, we uniformly choose exactly $s$ of the variables in $\{\eta_{r,i}\}_{r \in [m]}$ to be 1. When $s = 1$, every sparse JL distribution is a uniform sparse JL distribution, but for $s > 1$, this is not the case. Another common special case is a block sparse JL distribution, which produces a different construction for $s > 1$. In this distribution, each column $1 \le i \le n$ is partitioned into $s$ blocks of $m/s$ consecutive rows, and in each block in each column, the distribution of the variables $\{\eta_{r,i}\}$ is defined by uniformly choosing exactly one of these variables to be 1.⁸

⁸ Our lower bound in Theorem 1.5 applies to this distribution, though our upper bound does not. An interesting direction for future work would be to generalize the upper bound to this distribution.
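To make these definitions concrete, the following sketch (ours, not reference code; it assumes $s$ divides $m$ in the block case, and all names and parameters are illustrative) samples matrices from a uniform sparse JL distribution and from a block sparse JL distribution.

```python
import numpy as np

def uniform_sparse_jl(m, n, s, rng):
    """Uniform sparse JL: in each column, exactly s uniformly chosen rows
    receive entries sigma / sqrt(s), with sigma a random sign."""
    A = np.zeros((m, n))
    for i in range(n):
        rows = rng.choice(m, size=s, replace=False)
        A[rows, i] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

def block_sparse_jl(m, n, s, rng):
    """Block sparse JL: each column is split into s blocks of m/s consecutive
    rows, and one uniformly chosen row per block receives sigma / sqrt(s)."""
    assert m % s == 0, "this sketch assumes s divides m"
    block = m // s
    A = np.zeros((m, n))
    for i in range(n):
        rows = block * np.arange(s) + rng.integers(0, block, size=s)
        A[rows, i] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

rng = np.random.default_rng(0)
A = uniform_sparse_jl(m=128, n=500, s=4, rng=rng)
assert ((A != 0).sum(axis=0) == 4).all()           # exactly s nonzeros per column
x = rng.standard_normal(500)
print(np.linalg.norm(A @ x) / np.linalg.norm(x))   # close to 1
```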
### 1.2 Proof Techniques

We use the following notation. For any random variable $X$ and value $q \ge 1$, we call $\mathbb{E}[|X|^q]$ the $q$th moment of $X$, where $\mathbb{E}$ denotes the expectation. We use $\|X\|_q$ to denote the $q$-norm $(\mathbb{E}[|X|^q])^{1/q}$.

For every $[x_1, \ldots, x_n] \in \mathbb{R}^n$ such that $\|x\|_2 = 1$, we need to analyze tail bounds of an error term, which for the sparse JL construction is the following random variable:
$$\|Ax\|_2^2 - 1 = \frac{1}{s}\sum_{r=1}^{m}\sum_{1 \le i \ne j \le n}\eta_{r,i}\eta_{r,j}\sigma_{r,i}\sigma_{r,j}x_i x_j =: R(x_1, \ldots, x_n).$$
An upper bound on the tail probability of $R(x_1, \ldots, x_n)$ is needed to prove the lower bound on $v(m, \epsilon, \delta, s)$ in Theorem 1.5, and a lower bound is needed to prove the upper bound on $v(m, \epsilon, \delta, s)$ in Theorem 1.5. It turns out that it suffices to tightly analyze the moments $\mathbb{E}[(R(x_1, \ldots, x_n))^q]$. For the upper bound, we use Markov's inequality like in [13, 19, 3, 24], and for the lower bound, we use the Paley-Zygmund inequality like in [13]: Markov's inequality gives a tail upper bound from upper bounds on moments, and the Paley-Zygmund inequality gives a tail lower bound from upper and lower bounds on moments. Thus, the key ingredient of our analysis is a tight bound for $\|R(x_1, \ldots, x_n)\|_q$ on $S_v = \{x \in \mathbb{R}^n \mid \|x\|_\infty/\|x\|_2 \le v\}$ at each threshold value $v$.

While the moments of $R(x_1, \ldots, x_n)$ have been studied in previous analyses of sparse JL, we emphasize that it is not clear how to adapt these existing approaches to obtain a tight bound on every $S_v$. The moment bound that we require and obtain is far more general: the bounds in [19, 9] are limited to $\mathbb{R}^n = S_1$ and the bound in [13] is limited to $s = 1$.⁹ The non-combinatorial approach in [9] for bounding $\|R(x_1, \ldots, x_n)\|_q$ on $\mathbb{R}^n = S_1$ also turns out to not be sufficiently precise on $S_v$, for reasons we discuss in Section 2.¹⁰ Thus, we require new tools for our moment bound.

Our analysis provides a new perspective, inspired by the probability theory literature, that differs from the existing approaches in the JL literature. We believe our style of analysis is less brittle than combinatorial approaches [13, 19, 3, 24]: in this setting, once the sparsity $s = 1$ case is recovered, it becomes straightforward to generalize to other $s$ values. Moreover, our approach can yield greater precision than the existing non-combinatorial approaches [9, 8, 14], which is necessary for this setting. Thus, we believe that our structural approach to analyzing JL distributions could be of use in other settings.

In Section 2, we present an overview of our methods and the key technical lemmas to analyze $\|R(x_1, \ldots, x_n)\|_q$. We defer the proofs to the Appendix. In Section 3, we prove the tail bounds in Theorem 1.5 from these moment bounds. In Section 4, we empirically evaluate sparse JL.
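Before stating the moment bounds, here is a small numerical illustration (ours, with arbitrary parameters, reusing the uniform sparse JL sampler sketched earlier) of the error random variable $R$: it computes $R(x) = \|Ax\|_2^2 - 1$ and estimates $\|R\|_q$ by Monte Carlo, for a spread-out vector and for a vector of the form $[v, \ldots, v, 0, \ldots, 0]$.

```python
import numpy as np

def uniform_sparse_jl(m, n, s, rng):
    # Each column: s uniformly chosen rows get random signs scaled by 1/sqrt(s).
    A = np.zeros((m, n))
    for i in range(n):
        rows = rng.choice(m, size=s, replace=False)
        A[rows, i] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

def moment_estimate(x, m, s, q, trials=2000, seed=0):
    """Monte Carlo estimate of ||R(x)||_q = (E|R|^q)^(1/q), with R = ||Ax||_2^2 - 1."""
    rng = np.random.default_rng(seed)
    x = x / np.linalg.norm(x)                       # restrict to unit vectors
    n = x.shape[0]
    samples = np.empty(trials)
    for t in range(trials):
        A = uniform_sparse_jl(m, n, s, rng)
        samples[t] = np.linalg.norm(A @ x) ** 2 - 1.0
    return np.mean(np.abs(samples) ** q) ** (1.0 / q)

m, s, q, n, v = 128, 4, 4, 400, 0.25
x_peaked = np.zeros(n); x_peaked[: int(1 / v**2)] = v    # the vector [v, ..., v, 0, ..., 0]
x_spread = np.full(n, 1.0 / np.sqrt(n))
print("spread :", moment_estimate(x_spread, m, s, q))
print("peaked :", moment_estimate(x_peaked, m, s, q))    # noticeably larger
```

Consistent with the discussion above, the concentrated vector exhibits larger high-order moments, which is what drives its worse tail behavior.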
## 2 Sketch of Bounding the Moments of $R(x_1, \ldots, x_n)$

Our approach takes advantage of the structure of $R(x_1, \ldots, x_n)$ as a quadratic form of Rademachers (i.e. $\sum_{t_1, t_2} a_{t_1,t_2}\sigma_{t_1}\sigma_{t_2}$) with random variable coefficients (i.e. where $a_{t_1,t_2}$ is itself a random variable). For the upper bound, we need to analyze $\|R(x_1, \ldots, x_n)\|_q$ for general vectors $[x_1, \ldots, x_n]$. For the lower bound, we only need to show that $\|R(x_1, \ldots, x_n)\|_q$ is large for a single vector in each $S_v$, and we show we can select the vector in the $\ell_2$-unit ball with $1/v^2$ nonzero entries, all equal to $v$. For ease of notation, we denote this vector by $[v, \ldots, v, 0, \ldots, 0]$ for the remainder of the paper.

We analyze $\|R(x_1, \ldots, x_n)\|_q$ using general moment bounds for Rademacher linear and quadratic forms. Though Cohen, Jayram, and Nelson [9] also view $R(x_1, \ldots, x_n)$ as a quadratic form, we show in the supplementary material that their approach of bounding the Rademachers by gaussians is not sufficiently precise for our setting.¹¹ In our approach, we make use of stronger moment bounds for Rademacher linear and quadratic forms, some of which are known to the probability theory community through Latała's work in [21, 20] and some of which are new adaptations tailored to the constraints arising in our setting. More specifically, Latała's bounds [21, 20] target the setting where the coefficients are scalars. In our setting, however, the coefficients are themselves random variables, and we need bounds that are tractable to analyze in this setting, which involves creating new bounds to handle some cases.

⁹ As described in [13], even for the case of $s = 1$, the approach in [19] cannot be directly generalized to recover Theorem 1.4. Moreover, the approach in [13], though more precise for $s = 1$, is highly tailored to $s = 1$, and it is not clear how to generalize it to $s > 1$.
¹⁰ In predecessor work [14], we give a non-combinatorial approach similar to [9] for a sign-consistent variant of the JL distribution. Moreover, a different non-combinatorial approach for subspace embeddings is given in [8]. However, these approaches both suffer from issues in this setting that are similar to [9].
¹¹ We actually made a similar conceptual point for a different JL distribution in our predecessor work [14], but the alternate bound that we produce there also suffers from precision issues in this setting.

Our strategy for bounding $\|R(x_1, \ldots, x_n)\|_q$ is to break it down into rows. We define
$$Z_r(x_1, \ldots, x_n) := \sum_{1 \le i \ne j \le n}\eta_{r,i}\eta_{r,j}\sigma_{r,i}\sigma_{r,j}x_i x_j,$$
so that $R(x_1, \ldots, x_n) = \frac{1}{s}\sum_{r=1}^{m} Z_r(x_1, \ldots, x_n)$. We analyze the moments of $Z_r(x_1, \ldots, x_n)$, and then combine these bounds to obtain moment bounds for $R(x_1, \ldots, x_n)$. In our bounds, we use the notation $f \lesssim g$ (resp. $f \gtrsim g$) to denote $f \le Cg$ (resp. $f \ge Cg$) for some constant $C$.

### 2.1 Bounding $\|Z_r(x_1, \ldots, x_n)\|_q$

We show the following bounds on $\|Z_r(x_1, \ldots, x_n)\|_q$. For the lower bound, as we discussed before, it suffices to bound $\|Z_r(v, \ldots, v, 0, \ldots, 0)\|_q$. For the upper bound, we need to bound $\|Z_r(x_1, \ldots, x_n)\|_q$ for general vectors as a function of the $\ell_\infty$-to-$\ell_2$ norm ratio.

**Lemma 2.1** Let $\mathcal{A}_{s,m,n}$ be a sparse JL distribution such that $s \le m/e$. Suppose that $x = [x_1, \ldots, x_n]$ satisfies $\|x\|_\infty \le v$ and $\|x\|_2 = 1$. If $T$ is even, then:
$$\|Z_r(x_1, \ldots, x_n)\|_T \lesssim \begin{cases} \dfrac{\sqrt{T}\,s}{m} & \text{for } T = 2, \text{ or } 3 \le T \le \dfrac{se}{mv^2}, \\[4pt] \min\!\left(\dfrac{T^2 v^2}{\ln^2(mTv^2/s)},\ \dfrac{T}{\ln(m/s)}\right) & \text{for } T \ge 3,\ T \ge \dfrac{se}{mv^2},\ \ln(Tmv^2/s) \le T, \\[4pt] T v^2\left(\dfrac{s}{mTv^2}\right)^{2/T} & \text{for } T \ge 3,\ T \ge \dfrac{se}{mv^2},\ \ln(Tmv^2/s) > T. \end{cases}$$

**Lemma 2.2** Let $\mathcal{A}_{s,m,n}$ be a sparse JL distribution. Suppose $1/v^2$ and $T$ are even integers. Then, $\|Z_r(v, \ldots, v, 0, \ldots, 0)\|_2 \gtrsim s/m$. Moreover, if $s \le m/e$ and $T \ge \frac{se}{mv^2}$, then
$$\Big\|Z_r(v, \ldots, v, 0, \ldots, 0)\cdot\mathbb{1}\Big[\textstyle\sum_{i=1}^{1/v^2}\eta_{r,i} = 2\Big]\Big\|_T \gtrsim T v^2\left(\frac{s}{mTv^2}\right)^{2/T}$$
and
$$\|Z_r(v, \ldots, v, 0, \ldots, 0)\|_T \gtrsim \begin{cases} \dfrac{T^2 v^2}{\ln^2(mv^2 T/s)} & \text{for } 1 \le \ln(mv^2 T/s) \le T, \\[4pt] T v^2\left(\dfrac{s}{mTv^2}\right)^{2/T} & \text{for } \ln(mv^2 T/s) > T. \end{cases}$$

We now sketch our methods to prove Lemma 2.1 and Lemma 2.2.
For the lower bound (Lemma 2.2), we can view $Z_r(v, \ldots, v, 0, \ldots, 0)$ as a quadratic form $\sum_{t_1,t_2} a_{t_1,t_2}\sigma_{t_1}\sigma_{t_2}$, where $(a_{t_1,t_2})_{t_1,t_2 \in [mn]}$ is an appropriately defined block-diagonal $mn$-dimensional matrix. We can write $\mathbb{E}_{\sigma,\eta}[(Z_r(v, \ldots, v, 0, \ldots, 0))^q]$ as $\mathbb{E}_\eta\big[\mathbb{E}_\sigma[(Z_r(v, \ldots, v, 0, \ldots, 0))^q]\big]$: for fixed $\eta_{r,i}$ values, the coefficients are scalars. We make use of Latała's tight bound on Rademacher quadratic forms with scalar coefficients [21] to analyze $\mathbb{E}_\sigma[(Z_r(v, \ldots, v, 0, \ldots, 0))^q]$ as a function of the $\eta_{r,i}$. Then, we handle the randomness of the $\eta_{r,i}$ by taking an expectation of the resulting bound on $\mathbb{E}_\sigma[(Z_r(v, \ldots, v, 0, \ldots, 0))^q]$ over the $\eta_{r,i}$ values to obtain a bound on $\|Z_r(v, \ldots, v, 0, \ldots, 0)\|_q$.

For the upper bound (Lemma 2.1), since Latała's bound [21] is tight for scalar quadratic forms, the natural approach would be to use it to upper bound $\mathbb{E}_\sigma[(Z_r(x_1, \ldots, x_n))^q]$ for general vectors. However, when the vector is not of the form $[v, \ldots, v, 0, \ldots, 0]$, the asymmetry makes the resulting bound intractable to simplify. Specifically, there is a term, which can be viewed as a generalization of an operator norm to an $\ell_2$ ball cut out by $\ell_\infty$ hyperplanes, that becomes problematic when taking an expectation over the $\eta_{r,i}$ to obtain a bound on $\mathbb{E}_{\sigma,\eta}[(Z_r(x_1, \ldots, x_n))^q]$. Thus, we construct simpler estimates that avoid these complications while remaining sufficiently precise for our setting. These estimates take advantage of the structure of $Z_r(x_1, \ldots, x_n)$ and enable us to show Lemma 2.1.

### 2.2 Obtaining bounds on $\|R(x_1, \ldots, x_n)\|_q$

Now, we use Lemma 2.1 and Lemma 2.2 to show the following bounds on $\|R(x_1, \ldots, x_n)\|_q$:

**Lemma 2.3** Suppose $\mathcal{A}_{s,m,n}$ is a sparse JL distribution such that $s \le m/e$, and let $x = [x_1, \ldots, x_n]$ be such that $\|x\|_2 = 1$. Then, $\|R(x_1, \ldots, x_n)\|_2 \lesssim \sqrt{2/m}$. Now, suppose that $2 < q \le m$ is an even integer and $\|x\|_\infty \le v$. If $\frac{se}{mv^2} \ge q$, then $\|R(x_1, \ldots, x_n)\|_q \lesssim \sqrt{q/m}$. If $\frac{se}{mv^2} < q$ and if there exists a constant $C_2 \ge 1$ such that $C_2 q^3 m v^4 \ge s^2$, then $\|R(x_1, \ldots, x_n)\|_q \lesssim g$, where $g$ is:
$$g = \begin{cases} \max\!\left(\sqrt{\dfrac{q}{m}},\ C_2^{1/3}\dfrac{q^2v^2}{s\ln^2(qmv^2/s)}\right) & \text{if } \ln\!\big(\tfrac{qmv^4}{s^2}\big) \le 2,\ \ln\!\big(\tfrac{qmv^2}{s}\big) \le q, \\[6pt] \sqrt{\dfrac{q}{m}} & \text{if } \ln\!\big(\tfrac{qmv^4}{s^2}\big) \le 2,\ \ln\!\big(\tfrac{qmv^2}{s}\big) > q, \\[6pt] \max\!\left(\sqrt{\dfrac{q}{m}},\ \dfrac{qv^2}{s\ln(qmv^4/s^2)},\ \min\!\left(C_2^{1/3}\dfrac{q^2v^2}{s\ln^2(qmv^2/s)},\ \dfrac{q}{s\ln(m/s)}\right)\right) & \text{if } \ln\!\big(\tfrac{qmv^4}{s^2}\big) > 2,\ \ln\!\big(\tfrac{qmv^2}{s}\big) \le q, \\[6pt] \max\!\left(\sqrt{\dfrac{q}{m}},\ \dfrac{qv^2}{s\ln(qmv^4/s^2)}\right) & \text{if } \ln\!\big(\tfrac{qmv^4}{s^2}\big) > 2,\ \ln\!\big(\tfrac{qmv^2}{s}\big) > q. \end{cases}$$

**Lemma 2.4** Suppose $\mathcal{A}_{s,m,n}$ is a uniform sparse JL distribution. Let $q$ be a power of 2, and suppose that $0 < v \le 0.5$ and $1/v^2$ is an even integer. If $qv^2 \le s$, then $\|R(v, \ldots, v, 0, \ldots, 0)\|_q \gtrsim \sqrt{q/m}$. If $m \ge q$, $2 \le \ln(qmv^4/s^2) \le q$, $2qv^2 \le 0.5\, s\ln(qmv^4/s^2)$, and $s \le m/e$, then $\|R(v, \ldots, v, 0, \ldots, 0)\|_q \gtrsim \frac{qv^2}{s\ln(qmv^4/s^2)}$. If $s \le m/e$, $v \ge \sqrt{\ln(m/s)/q}$, and $1 \le \ln(qmv^2/s) \le q$, then $\|R(v, \ldots, v, 0, \ldots, 0)\|_q \gtrsim \frac{q^2v^2}{s\ln^2(qmv^2/s)}$.

We now sketch how to prove bounds on $\|R(x_1, \ldots, x_n)\|_q$ using bounds on $\|Z_r(x_1, \ldots, x_n)\|_T$. To show Lemma 2.3, we show that making the row terms $Z_r(x_1, \ldots, x_n)$ independent does not decrease $\|R(x_1, \ldots, x_n)\|_q$, and then we apply a general result from [20] for moments of sums of i.i.d. symmetric random variables. For Lemma 2.4, handling the correlations between the row terms $Z_r(x_1, \ldots, x_n)$ requires more care. We show that the negative correlations induced by having exactly $s$ nonzero entries per column do not lead to significant loss, and then stitch together $\|R(v, \ldots, v, 0, \ldots, 0)\|_q$ using the moments of $Z_r(v, \ldots, v, 0, \ldots, 0)$ that contribute the most.
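As a numerical companion to how these moment bounds are used in the next section, the sketch below (ours; the moment values are hypothetical placeholders rather than values from the lemmas) shows how an upper bound on $\|R\|_q$ yields a tail upper bound via Markov's inequality, and how upper and lower moment bounds yield a tail lower bound via the Paley-Zygmund inequality.

```python
import math

def markov_tail_upper(norm_q, q, eps):
    """Markov: P[|R| >= eps] = P[|R|^q >= eps^q] <= E[|R|^q] / eps^q = (||R||_q / eps)^q."""
    return (norm_q / eps) ** q

def paley_zygmund_tail_lower(norm_q, norm_2q, q, eps):
    """Paley-Zygmund applied to the nonnegative variable R^q (q even):
    if ||R||_q >= 2*eps, then P[|R| > eps] >= 0.25 * (||R||_q / ||R||_2q)^(2q)."""
    assert norm_q >= 2 * eps, "lower bound is applied only when ||R||_q >= 2*eps"
    return 0.25 * (norm_q / norm_2q) ** (2 * q)

q, eps = 8, 0.1                        # think of q as roughly ln(1/delta)
print("target delta ~", math.exp(-q))

# Hypothetical moments for a 'good' vector: a small ||R||_q forces a small tail.
print("Markov upper bound:", markov_tail_upper(norm_q=0.05, q=q, eps=eps))

# Hypothetical moments for a 'bad' vector: comparable ||R||_q and ||R||_{2q}
# force the tail probability to stay bounded away from zero.
print("Paley-Zygmund lower bound:",
      paley_zygmund_tail_lower(norm_q=0.25, norm_2q=0.35, q=q, eps=eps))
```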
## 3 Proof of Main Result from Moment Bounds

We now sketch how to prove Theorem 1.5, using Lemma 2.3 and Lemma 2.4. First, we simplify these bounds at the target parameters to obtain the following:

**Lemma 3.1** Let $\mathcal{A}_{s,m,n}$ be a sparse JL distribution, and suppose $\epsilon$ and $\delta$ are small enough, $s \le m/e$, $\Theta(\epsilon^{-2}\ln(1/\delta)) \le m < 2\epsilon^{-2}/\delta$, $v \le f(m, \epsilon, \ln(1/\delta), s)$, and $p = \Theta(\ln(1/\delta))$ is even. If $x = [x_1, \ldots, x_n]$ satisfies $\|x\|_\infty \le v$ and $\|x\|_2 = 1$, then $\|R(x_1, \ldots, x_n)\|_p \le \epsilon/e$.

**Lemma 3.2** There is a universal constant $D$ satisfying the following property. Let $\mathcal{A}_{s,m,n}$ be a uniform sparse JL distribution, and suppose $\epsilon, \delta$ are small enough, $s \le m/e$, $f(m, \epsilon, \ln(1/\delta), s) \le 0.5$, and $q$ is an even integer such that $q = \min(m/2, \Theta(\ln(1/\delta)))$. For each $\psi > 0$, there exists $v \le f(m, \epsilon, \ln(1/\delta), s) + \psi$ such that $\|R(v, \ldots, v, 0, \ldots, 0)\|_q \ge 2\epsilon$ and $\|R(v, \ldots, v, 0, \ldots, 0)\|_{2q} \le D\,\|R(v, \ldots, v, 0, \ldots, 0)\|_q$.

Now, we use Lemma 3.1 and Lemma 3.2 to prove Theorem 1.5.

**Proof of Theorem 1.5.** Since the maps in $\mathcal{A}_{s,m,n}$ are linear, it suffices to consider unit vectors $x$. First, we prove the lower bound on $v(m, \epsilon, \delta, s)$. To handle $m \ge 2\epsilon^{-2}/\delta$, we take $q = 2$ in Lemma 3.1 and apply Chebyshev's inequality. Otherwise, we take $p = \ln(1/\delta)$ (approximately) and apply Lemma 3.1 and Markov's inequality. We see that $\mathbb{P}[\,|\|Ax\|_2^2 - 1| \ge \epsilon\,]$ can be bounded as follows:
$$\mathbb{P}[\,|R(x_1, \ldots, x_n)| \ge \epsilon\,] = \mathbb{P}\big[(R(x_1, \ldots, x_n))^p \ge \epsilon^p\big] \le \epsilon^{-p}\,\mathbb{E}\big[(R(x_1, \ldots, x_n))^p\big] \le \delta.$$
Thus, condition (1) is satisfied for $x \in S_v$ when $v \le f(m, \epsilon, \ln(1/\delta), s)$, as desired.

Now, we prove the upper bound on $v(m, \epsilon, \delta, s)$. We need to lower bound the tail probability of $R(v, \ldots, v, 0, \ldots, 0)$, and to do this, we use the Paley-Zygmund inequality applied to $q$th moments. Let $D$ be defined as in Lemma 3.2, and take $q = \min\!\big(\frac{m}{2}, \frac{\ln(1/\delta) - 2}{2\ln D}\big)$. By the Paley-Zygmund inequality and Lemma 3.2, there exists $v \le f(m, \epsilon, \ln(1/\delta), s) + \psi$ such that:
$$\mathbb{P}[\,|R(v, \ldots, v, 0, \ldots, 0)| > \epsilon\,] \ge 0.25\left(\frac{\|R(v, \ldots, v, 0, \ldots, 0)\|_q}{\|R(v, \ldots, v, 0, \ldots, 0)\|_{2q}}\right)^{2q} \ge 0.25\,D^{-2q} > \delta.$$
Thus, it follows that $\sup_{x \in S_{f(m,\epsilon,\ln(1/\delta),s)+\psi},\ \|x\|_2 = 1}\ \mathbb{P}[\,|\|Ax\|_2^2 - 1| > \epsilon\,] > \delta$, as desired.

## 4 Empirical Evaluation

Recall that for sparse JL distributions with sparsity $s$, the projection time for an input vector $x$ is $O(s\|x\|_0)$, where $\|x\|_0$ is the number of nonzero entries in $x$. Since this grows linearly in $s$, in order to minimize the impact on projection time, we restrict to small constant $s$ values (i.e. $1 \le s \le 16$). In Section 4.1, we demonstrate on real-world data the benefits of using $s > 1$. In Section 4.2, we illustrate trends in our theoretical bounds on synthetic data. Additional graphs can be found in Appendix I. For all experiments, we use a block sparse JL distribution to demonstrate that our theoretical upper bounds also empirically generalize to non-uniform sparse JL distributions.

### 4.1 Real-World Datasets

We considered two bag-of-words datasets: the News20 dataset [1] (based on newsgroup documents), and the Enron email dataset [26] (based on e-mails from the senior management of Enron).¹² Both datasets were pre-processed with the standard tf-idf preprocessing. In this experiment, we evaluated how well sparse JL preserves the $\ell_2$ norms of the vectors in the dataset. An interesting direction for future work would be to empirically evaluate how well sparse JL preserves other aspects of the geometry of real-world data sets, such as the $\ell_2$ distances between pairs of vectors.

In our experiment, we estimated the failure probability $\hat{\delta}(s, m, \epsilon)$ for each dataset as follows. Let $D$ be the number of vectors in the dataset, and let $n$ be the dimension ($n = 101631$, $D = 11314$ for News20; $n = 28102$, $D = 39861$ for Enron). We drew a matrix $M \sim \mathcal{A}_{s,m,n}$ from a block sparse JL distribution.
Then, we computed $\|Mx\|_2/\|x\|_2$ for each vector $x$ in the dataset, and used these values to compute an estimate
$$\hat{\delta}(s, m, \epsilon) = \frac{\text{number of vectors } x \text{ in the dataset such that } \|Mx\|_2 \notin [(1-\epsilon)\|x\|_2,\ (1+\epsilon)\|x\|_2]}{D}.$$
We ran 100 trials to produce 100 estimates $\hat{\delta}(s, m, \epsilon)$.

*Figure 1: News20: $\hat{\delta}(s, m, 0.07)$ vs. $s$.*
*Figure 2: Enron: $\hat{\delta}(s, m, 0.07)$ vs. $s$.*

Figure 1 and Figure 2 show the mean and error bars (3 standard errors of the mean) of $\hat{\delta}(s, m, \epsilon)$ at $\epsilon = 0.07$. We consider $s \in \{1, 2, 4, 8, 16\}$, and choose $m$ values so that $0.01 \le \hat{\delta}(1, m, \epsilon) \le 0.04$. All of the plots show that $s \in \{2, 4, 8, 16\}$ achieves a lower failure probability than $s = 1$, with the differences most pronounced when $m$ is larger. In fact, at $m = 1000$, there is a factor of four decrease in $\hat{\delta}$ between $s = 1$ and $s = 4$, and a factor of ten decrease between $s = 1$ and $s = 8, 16$. We note that in plots in the Appendix, there is a slight increase between $s = 8$ and $s = 16$ at some $\epsilon, \delta, m$ values (see Appendix I for a discussion of this non-monotonicity in $s$); however, $s > 1$ still consistently beats $s = 1$. Thus, these findings demonstrate the potential benefits of using small constants $s > 1$ in sparse JL in practice, which aligns with our theoretical results.

### 4.2 Synthetic Datasets

We used synthetic data to illustrate the phase transitions in our bounds on $v(m, \epsilon, \delta, s)$ in Theorem 1.5 for a block sparse JL distribution. For several choices of $s, m, \epsilon, \delta$, we computed an estimate $\hat{v}(m, \epsilon, \delta, s)$ of $v(m, \epsilon, \delta, s)$ as follows. Our experiment borrowed aspects of the experimental design in [13]. Our synthetic data consisted of binary vectors (i.e. vectors whose entries are in $\{0, 1\}$). The binary vectors were defined by a set $W$ of values exponentially spread between 0.03 and 1¹³: for each $w \in W$, we constructed a binary vector $x_w$ where the first $1/w^2$ entries are nonzero, and computed an estimate $\hat{\delta}(s, m, \epsilon, w)$ of the failure probability of the block sparse JL distribution on the specific vector $x_w$ (i.e. $\mathbb{P}_{A \sim \mathcal{A}_{s,m,1/w^2}}[\|Ax_w\|_2 \le (1-\epsilon)\|x_w\|_2]$). We computed each $\hat{\delta}(s, m, \epsilon, w)$ using $T = 100{,}000$ samples from a block sparse JL distribution, as follows. In each sample, we independently drew a matrix $M \sim \mathcal{A}_{s,m,1/w^2}$ and computed the ratio $\|Mx_w\|_2/\|x_w\|_2$. Then, we took $\hat{\delta}(s, m, \epsilon, w) := (\text{number of samples where } \|Mx_w\|_2/\|x_w\|_2 \le 1 - \epsilon)/T$. Finally, we used the estimates $\hat{\delta}(s, m, \epsilon, w)$ to obtain the estimate
$$\hat{v}(m, \epsilon, \delta, s) = \max\{v \in W \mid \hat{\delta}(s, m, \epsilon, w) < \delta \text{ for all } w \in W \text{ with } w \le v\}.$$

¹² Note that the News20 dataset is used in [10], and the Enron dataset is from the same collection as the dataset used in [13], but contains a larger number of documents.
¹³ We took $W = \{w \mid w^{-2} \in \{986, 657, 438, 292, 195, 130, 87, 58, 39, 26, 18, 12, 9, 8, 7, 6, 5, 4, 3, 2, 1\}\}$.

Why does this procedure estimate $v(m, \epsilon, \delta, s)$? With enough samples, $\hat{\delta}(s, m, \epsilon, w) \approx \mathbb{P}_{A \sim \mathcal{A}_{s,m,1/w^2}}[\|Ax_w\|_2 \le (1-\epsilon)\|x_w\|_2]$.¹⁴ As a result, if $x_w$ is a violating vector, i.e. $\hat{\delta}(s, m, \epsilon, w) \ge \delta$, then likely $\mathbb{P}_{A \sim \mathcal{A}_{s,m,n}}[\|Ax_w\|_2 \le (1-\epsilon)\|x_w\|_2] \ge \delta$, and so $\hat{v}(m, \epsilon, \delta, s) \gtrsim v(m, \epsilon, \delta, s)$. For the other direction, we use that in the proof of Theorem 1.5, we show that, asymptotically, if a violating vector (i.e. $x$ s.t. $\mathbb{P}_{A \sim \mathcal{A}_{s,m,n}}[\|Ax\|_2 \le (1-\epsilon)\|x\|_2] \ge \delta$) exists in $S_v$, then there is a violating vector of the form $x_w$ for some $w = \Theta(v)$. Thus, the estimate $\hat{v}(m, \epsilon, \delta, s) = \Theta(v(m, \epsilon, \delta, s))$ as $T \to \infty$ and as the precision of $W$ goes to $\infty$.

*Figure 3: Phase transitions of $\hat{v}(m, 0.1, 0.01, s)$.*
*Figure 4: Phase transitions of $\hat{v}(m, 0.05, 0.05, s)$.*

Figure 3 and Figure 4 show $\hat{v}(m, \epsilon, \delta, s)$ as a function of the dimension $m$ for $s \in \{1, 2, 3, 4, 8\}$ for two settings of $\epsilon$ and $\delta$. The error bars are based on the distance to the next highest $v$ value in $W$.
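The sketch below (ours) mirrors this estimation procedure in simplified form: it uses a uniform rather than block sparse JL sampler, far fewer than 100,000 samples, and a small hypothetical grid $W$, so it illustrates the procedure rather than reproducing the experiments.

```python
import numpy as np

def uniform_sparse_jl(m, n, s, rng):
    # Each column: s uniformly chosen rows get random signs scaled by 1/sqrt(s).
    A = np.zeros((m, n))
    for i in range(n):
        rows = rng.choice(m, size=s, replace=False)
        A[rows, i] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

def delta_hat(s, m, eps, w, samples=2000, seed=0):
    """Estimated failure probability on the binary vector x_w with 1/w^2 ones."""
    rng = np.random.default_rng(seed)
    n = int(round(1.0 / w**2))
    x = np.ones(n)
    fails = 0
    for _ in range(samples):
        A = uniform_sparse_jl(m, n, s, rng)
        fails += np.linalg.norm(A @ x) <= (1 - eps) * np.linalg.norm(x)
    return fails / samples

def v_hat(m, eps, delta, s, W):
    """Largest v in W such that every test vector x_w with w <= v passes."""
    dh = {w: delta_hat(s, m, eps, w) for w in W}
    passing = [v for v in W if all(dh[w] < delta for w in W if w <= v)]
    return max(passing) if passing else 0.0

W = sorted(1.0 / np.sqrt(np.array([128, 64, 32, 16, 8, 4, 2, 1])))   # hypothetical grid
for s in [1, 2, 4]:
    print(s, v_hat(m=200, eps=0.1, delta=0.01, s=s, W=W))
```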
Our first observation is that for each set of $s, \epsilon, \delta$ values considered, the curve $\hat{v}(m, \epsilon, \delta, s)$ has sharp changes as a function of $m$. More specifically, $\hat{v}(m, \epsilon, \delta, s)$ is 0 at small $m$, then there is a phase transition to a nonzero value, then an increase to a higher value, then an interval where the value appears "flat", and lastly a second phase transition to 1. The first phase transition is shared between $s$ values, but the second phase transition occurs at different dimensions $m$ (though within a factor of 3 between $s$ values). Here, the first phase transition likely corresponds to $\Theta(\epsilon^{-2}\ln(1/\delta))$ and the second phase transition likely corresponds to $\min\big(\epsilon^{-2}e^{\Theta(\ln(1/\delta))},\ \epsilon^{-2}\ln(1/\delta)\,e^{\Theta(\ln(1/\delta)\epsilon^{-1}/s)}\big)$.

Our second observation is that as $s$ increases, the "flat" part occurs at a higher y-coordinate. Here, the increase in the flat y-coordinate as a function of $s$ corresponds to the $\sqrt{s}$ term in $v(m, \epsilon, \delta, s)$. Technically, according to Theorem 1.5, the flat parts should be increasing in $m$ at a slow rate; the empirical flatness likely arises since $W$ is a finite set in the experiments.

Our third observation is that $s > 1$ generally outperforms $s = 1$, as Theorem 1.5 suggests: that is, $s > 1$ generally attains a higher $\hat{v}(m, \epsilon, \delta, s)$ value than $s = 1$. We note that at large $m$ values (where $\hat{v}(m, \epsilon, \delta, s)$ is close to 1), lower $s$ settings sometimes attain a higher $\hat{v}(m, \epsilon, \delta, s)$ than higher $s$ settings (e.g. the second phase transition doesn't quite occur in decreasing order of $s$ in Figure 3): see Appendix I for a discussion of this non-monotonicity in $s$.¹⁵ Nonetheless, in practice, it is unlikely that one would select such a large dimension $m$, since the $\ell_\infty$-to-$\ell_2$ guarantees of smaller $m$ are likely sufficient. Hence, a greater sparsity generally leads to a better $\hat{v}(m, \epsilon, \delta, s)$ value, thus aligning with our theoretical findings.

¹⁴ With 100,000 samples, running our procedure twice yielded the same $\hat{v}(m, \epsilon, \delta, s)$ values both times.
¹⁵ In Appendix I, we also show more examples where at large $m$ values, lower $s$ settings attain a higher $\hat{v}(m, \epsilon, \delta, s)$ than higher $s$ settings.

## References

[1] The 20 newsgroups text dataset. https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html.
[2] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, June 2003.
[3] Z. Allen-Zhu, R. Gelashvili, S. Micali, and N. Shavit. Sparse sign-consistent Johnson-Lindenstrauss matrices: Compression with neuroscience-based constraints. In Proceedings of the National Academy of Sciences (PNAS), volume 111, pages 16872–16876, 2014.
[4] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger. Learning to rank with (a lot of) word features. Information Retrieval, 13(3):291–314, June 2010.
[5] C. Caragea, A. Silvescu, and P. Mitra. Protein sequence classification using feature hashing. Proteome Science, 10(1), 2012.
[6] C. Chen, C. Vong, C. Wong, W. Wang, and P. Wong. Efficient extreme learning machine via very sparse random projection. Soft Computing, 22, 2018.
[7] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of the 32nd Annual International Conference on Machine Learning (ICML), pages 2285–2294, 2015.
[8] M. B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 278–287, 2016.
[9] M. B. Cohen, T. S. Jayram, and J. Nelson. Simple analyses of the sparse Johnson-Lindenstrauss transform. In Proceedings of the 1st Symposium on Simplicity in Algorithms (SOSA), pages 1–9, 2018.
[10] S. Dahlgaard, M. Knudsen, and M. Thorup. Practical hash functions for similarity estimation and dimensionality reduction. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), pages 6618–6628, 2017.
[11] B. Dalessandro. Bring the noise: Embracing randomness is the key to scaling up machine learning algorithms. Big Data, 1(2):110–112, 2013.
[12] A. Dasgupta, R. Kumar, and T. Sarlos. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 341–350, 2010.
[13] C. Freksen, L. Kamma, and K. G. Larsen. Fully understanding the hashing trick. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pages 5394–5404, 2018.
[14] M. Jagadeesan. Simple analysis of sparse, sign-consistent JL. In Proceedings of the 23rd International Conference and 24th International Conference on Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (RANDOM), pages 61:1–61:20, 2019.
[15] T. S. Jayram and D. P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. In ACM Transactions on Algorithms (TALG), Special Issue on SODA'11, volume 9, pages 1–26, 2013.
[16] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[17] D. M. Kane, R. Meka, and J. Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In Proceedings of the 14th International Workshop and 15th International Conference on Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (RANDOM), pages 628–639, 2011.
[18] D. M. Kane and J. Nelson. A derandomized sparse Johnson-Lindenstrauss transform. CoRR, abs/1006.3585, 2010.
[19] D. M. Kane and J. Nelson. Sparser Johnson-Lindenstrauss transforms. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 16872–16876. ACM Press, 2012.
[20] R. Latała. Estimation of moments of sums of independent real random variables. Annals of Probability, 25(3):1502–1513, 1997.
[21] R. Latała. Tail and moment estimates for some types of chaos. Studia Mathematica, 135(1):39–53, 1999.
[22] P. Li, T. Hastie, and K. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 287–296, 2006.
[23] C. Ma, J. Jung, S. Kim, and S. Ko. Random projection-based partial feature extraction for robust face recognition. Neurocomputing, 149:1232–1244, 2015.
[24] J. Nelson and H. L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 117–126, 2013.
[25] J. Nelson and H. L. Nguyen. Sparsity lower bounds for dimensionality reducing maps. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pages 101–110, 2013.
[26] D. Newman. Bag of words data set. https://archive.ics.uci.edu/ml/datasets/Bag+of+Words, 2008.
[27] H. Song. Robust visual tracking via online informative feature selection. Electronics Letters, 50(25):1931–1932, 2014.
[28] S. Suthaharan. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, volume 36 of Integrated Series in Information Systems. Springer US, Boston, MA, 2016.
[29] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1113–1120, 2009.