# Sparse Dimensionality Reduction Revisited

Mikael Møller Høgsgaard 1, Lior Kamma 2, Kasper Green Larsen 1, Jelani Nelson 3, Chris Schwiegelshohn 1

The sparse Johnson-Lindenstrauss transform is one of the central techniques in dimensionality reduction. It supports embedding a set of n points in R^d into m = O(ε^{-2} lg n) dimensions while preserving all pairwise distances to within 1 ± ε. Each input point x is embedded to Ax, where A is an m × d matrix having s non-zeros per column, allowing for an embedding time of O(s‖x‖_0). Since the sparsity of A governs the embedding time, much work has gone into improving the sparsity s. The current state of the art by Kane and Nelson (2014) shows that s = O(ε^{-1} lg n) suffices. This is almost matched by a lower bound of s = Ω(ε^{-1} lg n / lg(1/ε)) by Nelson and Nguyen (2013) for d = Ω(n). Previous work thus suggests that we have near-optimal embeddings. In this work, we revisit sparse embeddings and present a sparser embedding for instances in which d = n^{o(1)}, which in many applications is realistic. Formally, our embedding achieves s = O(ε^{-1}(lg n / lg(1/ε) + lg^{2/3} n · lg^{1/3} d)). We also complement our analysis by strengthening the lower bound of Nelson and Nguyen to hold also when d ≪ n, thereby matching the first term in our new sparsity upper bound. Finally, we also improve the sparsity of the best oblivious subspace embeddings for optimal embedding dimensionality.

1 Computer Science Department, Aarhus University, Aarhus, Denmark. 2 School of Computer Science, Academic College of Tel-Aviv Yaffo, Tel-Aviv, Israel. 3 Department of EECS, UC Berkeley, Berkeley, CA, USA. Correspondence to: Mikael Møller Høgsgaard, Lior Kamma, Kasper Green Larsen, Jelani Nelson, Chris Schwiegelshohn.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Dimensionality reduction is a central technique for speeding up algorithms for large scale data analysis and reducing memory consumption for storage. A Euclidean-distance-preserving dimensionality reduction is, loosely speaking, an embedding of a high-dimensional Euclidean space into a space of low dimension that approximately preserves the Euclidean distance between every two points. One of the cornerstone results is the Johnson-Lindenstrauss transform (Johnson & Lindenstrauss, 1984), stating that every set of n points in a d-dimensional space can be embedded into only m = O(ε^{-2} lg n) dimensions while preserving all pairwise Euclidean distances between points to within a factor (1 ± ε). The simplest (random) constructions of such dimensionality reducing maps, known as the Distributional Johnson-Lindenstrauss Lemma, sample a random m × d matrix A with entries either i.i.d. N(0, 1) distributed or uniform Rademachers (−1 or 1 with probability 1/2 each). For a set X ⊆ R^d of n points, it then holds with probability at least 1 − 1/n that L = A/√m satisfies

∀x, y ∈ X : ‖Lx − Ly‖₂² ∈ (1 ± ε)‖x − y‖₂².  (1)

We say that a matrix L satisfying (1) is an ε-JL matrix for X. It is worth noting that some works require that an ε-JL matrix satisfies (1) without the square on the Euclidean norm. The two definitions are equivalent up to a constant factor scaling of ε, and we work with the former as it simplifies calculations.
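As a quick illustration of the Distributional Johnson-Lindenstrauss Lemma above, the following sketch samples a Rademacher matrix, scales it by 1/√m, and measures how well squared pairwise distances are preserved on a random point set. It is only a minimal sketch: the point set, the constant 8 in the choice of m, and the number of sampled pairs are arbitrary choices, not values prescribed by the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, eps = 500, 2000, 0.5
m = int(np.ceil(8 / eps**2 * np.log(n)))     # O(eps^-2 lg n); the constant 8 is arbitrary

X = rng.standard_normal((n, d))              # n points in R^d (one per row)
A = rng.choice([-1.0, 1.0], size=(m, d))     # uniform Rademacher entries
L = A / np.sqrt(m)                           # L = A / sqrt(m)

Y = X @ L.T                                  # row i of Y is L x_i

# Relative distortion of squared pairwise distances on a sample of pairs.
idx = rng.integers(0, n, size=(2000, 2))
orig = np.sum((X[idx[:, 0]] - X[idx[:, 1]]) ** 2, axis=1)
emb = np.sum((Y[idx[:, 0]] - Y[idx[:, 1]]) ** 2, axis=1)
mask = orig > 0                              # skip pairs where both indices coincide
print("max |emb/orig - 1| =", np.max(np.abs(emb[mask] / orig[mask] - 1)))
```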
While the target dimension of m = O(ε^{-2} lg n) is known to be optimal (Jayram & Woodruff, 2013; Larsen & Nelson, 2017), even when d = O(m), computing the embedding Lx of a point x using the construction above requires Ω(md) = Ω(ε^{-2} d lg n) operations. In some applications, this may constitute the computational bottleneck, hence much work has gone into designing faster embedding algorithms. These works may roughly be categorized into two approaches: (1) constructions that use structured embedding matrices with fast matrix-vector multiplication algorithms; and (2) constructions using sparse embedding matrices.

A classic example of the former approach is the Fast JL transform by Ailon and Chazelle (2009). Their construction embeds a point x by computing the product PHDx, where D is a diagonal matrix with random signs on the diagonal, H is a d × d Hadamard matrix and P is a random sparse matrix where each entry is non-zero only with some small probability. The main idea in that construction is that HD spreads the mass of the vector x evenly among its coordinates, which allows for a very sparse m × d embedding matrix P. In addition, a d × d Hadamard matrix has an O(d lg d) matrix-vector multiplication algorithm. Analyzing the Fast JL transform, and specifically the correct tradeoff between the target dimension and embedding time, has been studied extensively (see e.g. (Do et al., 2009; Krahmer & Ward, 2011; Freksen & Larsen, 2020; Jain et al., 2020)). The state-of-the-art tight analysis by Fandina, Høgsgaard and Larsen (2023) shows that the embedding time can be bounded by O(d lg d + min{ε^{-1} d lg n, m lg n · max{1, ε lg n / lg(1/ε)}}).

In the latter approach one instead designs embedding matrices with only s ≪ m non-zeros per column. Given such a sparse embedding matrix, it is straightforward to embed a point in O(ds) time instead of O(dm), hence minimizing s has been the focus of extensive work. The current sparsest embedding construction is due to Kane and Nelson (2014), achieving a sparsity upper bound of s = O(ε^{-1} lg n). Nelson and Nguyen (2013) presented a lower bound of s = Ω(ε^{-1} lg n / lg(m / lg n)) for any sparse ε-JL matrix, almost settling the optimality of the construction by Kane and Nelson. For optimal target dimension m = Θ(ε^{-2} lg n), this simplifies to s = Ω(ε^{-1} lg n / lg(1/ε)). While O(d ε^{-1} lg n / lg(1/ε)) is often larger than the nearly O(d lg d) embedding time achieved by Fast JL, sparse embeddings have one significant advantage in that they may also exploit sparsity in the input points. Concretely, the embedding time of a point x is easily seen to be O(s‖x‖_0), where ‖x‖_0 is the number of non-zero entries of x. In many applications, such as embedding bag-of-words and tf-idf representations of documents, the input points are indeed very sparse compared to the domain size d (one non-zero entry in x per word in the document, where d is the number of distinct words in the dictionary). For efficient use in practice, sparse dimensionality reduction techniques have for instance been implemented in the popular library scikit-learn (Pedregosa et al., 2011) as SparseRandomProjection.

Large Sets with Few Dimensions. While it may seem that there is little room for improvement in the lg(1/ε) gap between the upper and lower bounds known for the sparsity of ε-JL matrices, we identify a shortcoming in the lower bound of Nelson and Nguyen (2013). Concretely, the hard instance in their proof is the set {e_1, . . . , e_n} of standard unit vectors.
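The O(s‖x‖_0) embedding time is easy to see in code: a sparse embedding matrix applied to a sparse input vector never touches the zero coordinates. The sketch below is a rough illustration using scipy sparse formats; it places the s non-zeros of each column uniformly at random rather than using the block construction analyzed later in the paper, and all parameter values are arbitrary.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)

d, m, s = 5000, 500, 10                 # domain size, target dimension, non-zeros per column

# s-sparse embedding matrix: each column receives s random rows with signs +-1/sqrt(s).
rows = np.concatenate([rng.choice(m, size=s, replace=False) for _ in range(d)])
cols = np.repeat(np.arange(d), s)
vals = rng.choice([-1.0, 1.0], size=d * s) / np.sqrt(s)
A = sp.csc_matrix((vals, (rows, cols)), shape=(m, d))

# A sparse input point with only a handful of non-zeros (e.g. a bag-of-words vector).
nnz = 12
support = rng.choice(d, size=nnz, replace=False)
x = sp.csc_matrix((rng.random(nnz), (support, np.zeros(nnz, dtype=int))), shape=(d, 1))

y = A @ x                               # only the s * nnz relevant entries of A are used
print("||x||^2 =", x.multiply(x).sum(), " ||Ax||^2 =", y.multiply(y).sum())
```

Each column of A has squared norm s · (1/√s)² = 1, so ‖Ax‖² equals ‖x‖² in expectation; how tightly it concentrates is exactly what the choice of m and s controls.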
However, in many theoretical applications, the original dimension d is significantly smaller than the size n of the vector set. In these scenarios, this hard instance does not exist, and the lower bound degenerates to s = Ω(ε^{-1} lg d / lg(m / lg d)). Yet, the upper bound analysis by Kane and Nelson is incapable of exploiting the fact that d ≪ n and remains O(ε^{-1} lg n). In addition to broadening our theoretical understanding of sparse dimensionality reduction, we also find that d ≪ n is a natural practical setting, also when combined with sparse input points. Consider, for instance, the Sentiment140 data set consisting of 1.6M tweets (Go et al., 2009). Using a bag-of-words or tf-idf representation of the tweets, where all words occurring less than 10 times in the 1.6M tweets have been stripped, results in a data set with n = 1.6 × 10^6, d = 37,129 and an average of 12 words/non-zeros per tweet. These vectors are thus extremely sparse and have d a factor 43 smaller than n. Similarly, for the Kosarak data set (Benson et al., 2018), consisting of an anonymized clickstream from a Hungarian online news portal, there are n = 900,002 transactions, each consisting of several items. It has a total of d = 41,270 distinct items and each transaction consists of an average of 8.1 items. Here we also have a d that is a factor 22 smaller than n and very sparse input points. In general, when considering bag-of-words and tf-idf, one would assume that there is a fixed dictionary size d, while the number of data points n may be arbitrarily large, which further motivates distinguishing between n and d in the sparsity bounds. One may thus hope to give upper bounds which depend on lg d rather than lg n. This is precisely the message of our work.

1.1. Main Results

Our first main result is an improved analysis of the random sparse embedding by Kane and Nelson (2014), reducing the O(ε^{-1} lg n) upper bound on the sparsity in the case d ≪ n. Formally, we show the following.

Theorem 1.1. Let 0 < ε < ε_0 for some constant ε_0. There is a distribution over s-sparse matrices in R^{m×d} with m = O(ε^{-2} lg n) and

s = O(ε^{-1}(lg n / lg(1/ε) + lg^{2/3} n · lg^{1/3} d)),

such that for every set of n vectors X ⊆ R^d, it holds with probability at least 1 − O(1/d) that a sampled matrix is an ε-JL matrix for X.

While the first term may resemble the lower bound presented by Nelson and Nguyen (2013), their lower bound did not apply when the size of X is significantly larger than the dimension d, in which case X cannot consist of just the standard basis for R^d. Our second result complements the upper bound in Theorem 1.1 with a tight lower bound on the sparsity of ε-JL matrices. We show that if m is sufficiently smaller than d, then every ε-JL matrix embedding d-dimensional vectors into R^m must have relatively dense columns. Formally, we show the following.

Theorem 1.2. Let 0 < ε < 1/4, and let m be such that m = Ω(ε^{-2} lg n) and m ≤ (εd / lg n)^{1−o(1)}. Then there is a set of n vectors X ⊆ R^d such that any ε-JL matrix A embedding X into R^m must have a column with sparsity s satisfying

s = Ω(ε^{-1} lg n / lg(m / lg n)).

For optimal m = Θ(ε^{-2} lg n), this simplifies to s = Ω(ε^{-1} lg n / lg(1/ε)). Recall that the comparable lower bound in (Nelson & Nguyen, 2013) was specifically for the case n ≤ d. Combined with the refined upper bound, we now have a completely tight understanding of sparse dimensionality reduction when lg n ≥ lg d · lg^3(1/ε).
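As a small, purely illustrative example of the d ≪ n, sparse-input regime described above, the snippet below builds a bag-of-words matrix for a toy corpus and applies scikit-learn's SparseRandomProjection mentioned in the introduction; the corpus, the number of components and the density are placeholder values, not parameters suggested by our results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection

# Toy stand-in for a tweet corpus: in Sentiment140 one would have n = 1.6e6 documents
# over a fixed dictionary of d = 37,129 words, i.e. d much smaller than n.
corpus = [
    "sparse vectors embed quickly",
    "bag of words vectors are sparse",
    "dimensionality reduction speeds up learning",
    "sparse embeddings exploit sparse inputs",
]

X = CountVectorizer().fit_transform(corpus)          # n x d sparse bag-of-words matrix
print("shape:", X.shape, "non-zeros per row:", X.getnnz(axis=1))

# Project onto an (arbitrarily chosen) lower dimension with a sparse random matrix;
# `density` controls the fraction of non-zero entries per column of the projection.
proj = SparseRandomProjection(n_components=3, density=0.5, random_state=0)
Y = proj.fit_transform(X)
print("embedded shape:", Y.shape)
```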
While arguably small asymptotic improvements, these are the first improvements in a decade and demonstrate that the dimension of the input data may be exploited to speed up embeddings.

Subspace Embeddings. Given a k-dimensional subspace V ⊆ R^d, an ε-subspace embedding (Sarlós, 2006) is a matrix A ∈ R^{m×d} satisfying that for all x ∈ V, ‖Ax‖₂² ∈ (1 ± ε)‖x‖₂². It is known that there exists a subset V′ ⊂ V of size O(1)^k such that if A preserves the ℓ₂ norm of every vector in V′ up to (1 ± ε/2), then A is an ε-subspace embedding (Arora et al., 2006). The JL lemma thus implies that one can take m = O(k/ε²), and in fact this is optimal in the case that A is drawn from a fixed distribution over R^{m×d} that is independent of V (Nelson & Nguyen, 2014) (a so-called oblivious subspace embedding (OSE)). OSEs can be used to speed up algorithms for approximate regression, low rank approximation, and a large number of other problems in numerical linear algebra; see the monograph by Woodruff (2014).

As a simple example, consider the problem of approximate linear regression in which one wants to find a β which approximately minimizes ‖Xβ − y‖₂² for some given X ∈ R^{n×d}. This problem can be solved exactly in O(nd²) time by writing the Singular Value Decomposition X = UΣV^⊤ and then setting β_LS := VΣ^{-1}U^⊤y. Then Xβ_LS = UU^⊤y is the projection of y onto the column space of X, which minimizes the error. The sketch-and-solve paradigm (Sarlós, 2006), in one analysis, suggests taking A to be a subspace embedding for span{y, cols(X)} (which has dimension at most d + 1) and then setting β̃ to be the minimizer of ‖AXβ − Ay‖₂². Note that AX is now a much smaller matrix, so one can compute β̃ more quickly. However, we also need A to either be sparse or structured, so that AX can be computed quickly. Otherwise, if A is an arbitrary unstructured matrix, computing AX would take more time than computing β_LS exactly! Note that if each column of A has s nonzero entries, then AX can be computed in time O(s‖X‖₀), where ‖X‖₀ is the number of nonzero entries in X. (A short illustrative code sketch of this recipe is given after Theorem 1.3 below.)

Simply using the Sparse JL transform (Kane & Nelson, 2014) would lead to m = O(k/ε²), s = O(k/ε). Clarkson and Woodruff (2013) showed that m = O(k²/ε²), s = O(1) is achievable, which for OSEs is optimal (Nelson & Nguyen, 2014; Li & Liu, 2022). What, though, if we do not want to increase m at all beyond the optimal bound of O(k/ε²)? What is the best sparsity s achievable without sacrificing the asymptotic quality of dimensionality reduction? Nelson and Nguyen (2013) showed that m = O((k/ε²) · poly(ε^{-1} log k)) is achievable with s = poly(log(k/ε))/ε, and conjectured that s = O((log k)/ε) suffices with m = O(k/ε²). Cohen (2016) provided an improved bound, showing that m = O((k log k)/ε²), s = O((log k)/ε) suffices, which remains the best known bound today. In particular, for m = O(k/ε²), despite the conjecture of (Nelson & Nguyen, 2013), no sparsity bound better than s = O(k/ε) is known, which follows from a black box application of Sparse JL. In this work, we provide the first proof that keeps m = O(k/ε²) while showing a sparsity bound that is o(k/ε). Specifically, we achieve s = O(k/(ε log(1/ε)) + (k² log k)^{1/3}/ε). Formally, we show the following.

Theorem 1.3. Let 0 < ε < 1. There is a distribution over s-sparse matrices in R^{m×d} with m = O(ε^{-2}k) and

s = O(ε^{-1}(k / lg(1/ε) + k^{2/3} lg^{1/3} k)),

such that for every k-dimensional subspace V ⊆ R^d, it holds with probability at least 1 − 2^{−k^{2/3}} that a sampled matrix is an ε-JL matrix for V.
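The sketch-and-solve recipe referenced above can be illustrated in a few lines of numpy: solve the small sketched problem min_β ‖SXβ − Sy‖₂² and compare its residual with that of the exact least-squares solution. The sketch matrix here simply places s random signs per column, which is an illustrative stand-in rather than the exact construction behind Theorem 1.3, and the sizes m and s are ad-hoc choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5000, 30                           # tall least-squares instance
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Exact solution (internally O(n d^2) via a QR/SVD factorization).
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sparse sketch S: each of the n columns gets s random rows with signs +-1/sqrt(s).
m, s = 600, 4
S = np.zeros((m, n))
for j in range(n):
    rows = rng.choice(m, size=s, replace=False)
    S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

# Sketch-and-solve: solve the much smaller m x d problem instead.
beta_sk, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

res_exact = np.linalg.norm(X @ beta_ls - y)
res_sketch = np.linalg.norm(X @ beta_sk - y)
print(f"relative increase in residual: {res_sketch / res_exact - 1:.3f}")
```

With these sizes the sketched system has only m rows, so the final solve is far cheaper than the exact one, at the price of a slightly larger residual; the point of a sparse S is that forming SX itself costs only O(s‖X‖₀).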
While this is far from the conjectured optimal bound of O((log k)/ε), it provides the first analysis that maintains optimal m while providing sparsity s strictly better than applying Sparse JL as a black box. Recent subsequently work by (Chenakkod et al., 2023) shows an incomparable sparsity bound of O(lg4 (k/δ) /ε6) with embedding dimension m = O((k + log(1/δ))/ε2) where δ is the failure probability. Thus for some parameter regimes of δ, ε and k the bound fails to beat or is even worse than the O(k/ε) which the black box approach of Sparse JL yields, which as mentioned the bound of Theorem 1.3 is strictly better than, for δ 2 k2/3. 2. Technical Overview In this section, we present the central ideas employed in our new contributions. We first describe our improved upper bound analysis, then the main ideas in our lower bound, and finally the new subspace embedding results. For ease of notation, we henceforth write x to denote x 2. Sparse Dimensionality Reduction Revisited Sparser Dimensionality Reduction. One method for achieving Sparse JL matrices presented by Kane and Nelson (2014) is based on the Count Sketch algorithm (Charikar et al., 2004). An embedding matrix A is sampled by partitioning the m rows into s groups of m/s entries each. In every column of A a uniform random entry in each group is sampled and set uniformly to either 1/ s or 1/ s. All other entries are set to 0. Kane and Nelson then showed that if s = Ω(ε 1 lg(1/δ)) then for every unit vector x, it holds that Ax 2 1 ε with probability at least 1 δ. Setting δ = n 3, using linearity of A and a union bound over z = (y x)/ y x for all x, y in an input set of points/vectors X completes their proof. Hereafter we focus on showing that A preserves the norm of every vector in a set X of n2 unit vectors with good probability. Kane and Nelson also included a short argument showing that their analysis is tight for distances between the standard unit vectors e1, . . . , ed. However, our key observation is that, if d n, then a naive union bound over all n2 unit vectors in X may be too loose. Concretely, there are much fewer than n2 vectors that are of this worst case form. In particular, when d n, then most vectors in a set X of cardinality n2 must have many entries that are small in magnitude. It is already known from work on Feature Hashing (Weinberger et al., 2009; Dahlgaard et al., 2017; Freksen et al., 2018; Jagadeesan, 2019) and the Fast JL transform (Ailon & Chazelle, 2009; Fandina et al., 2023) that vectors x with a small x to x ratio are easier to embed than worst case vectors. For instance, for optimal m = Θ(ε 2 lg n), Jagadeesan (2019) showed that as long as s = Ω(ε 1 lg n/ lg(1/ε)) and the ratio ν = x / x satisfies ν p εs/ lg n, then Sparse JL preserves the norm of x to within 1 ε with probability at least 1 1/n3. In order to exploit a small dimension d, we split every vector x X into two support-disjoint vectors, referred to as a head and a tail, where the head contains the top ℓentries of x and the tail contains the remaining entries. That is, we write x = xhead + xtail. Then Ax 2 = Axhead 2 + Axtail 2 + 2 Axhead, Axtail . We now treat these three terms separately. Showing that with high probability, Axhead 2 (1 ε) xhead 2, Axtail 2 (1 ε) xtail 2 and | Axhead, Axtail | ε (since xhead, xtail = 0). The technical crux lies in bounding the cross terms. In order to bound the heads, the main observation is that there are about d ℓ dℓchoices for the positions of the heads. 
Once the positions have been chosen, we further approximate the heads by an ε-net of cardinality (1/ε)O(ℓ). Since d m = Ω(ε 2 lg n), the total number of heads we need to consider is dℓ(1/ε)O(ℓ) = d O(ℓ). Using the analysis by Kane and Nelson with δ = d O(ℓ) shows that it suffices with s = Ω(ε 1ℓlg d) to get the required bound with high probability. As for the tails, there are at most n2 distinct tails and they have xtail 1/ ℓ. We can thus use the result by Jagadeesan to show that Axtail 2 is within the interval (1 ε) xtail 2 whenever s satisfies both s = Ω(ε 1 lg n/ lg(1/ε)) and (1/ εs/ lg n, which is implied by s = Ω(ε 1 lg(n)/ℓ). The main challenge lies in bounding the cross terms, showing | Axhead, Axtail | ε. Previous results, and specifically the aforementioned results by Kane and Nelson (2014) and Jagadeesan (2019) cannot be employed, as on one hand the number of pairs is very large, and more specifically depends polynomially on n, and on the other hand the ℓ /ℓ2 ratio of the corresponding vectors cannot be upper bounded as the heads have heavy entries. In order to bound the cross terms we present new concentration bounds on the Count Sketch-based construction by Kane and Nelson. We first show that for optimal dimension m = O(ε 2 lg n), for sparsity s εm and ℓ ε 1/2 we get that with high probability for every x X, there are only few rows in A where more than 5 non-zero entries coincide with the support of xhead. In turn, this means that most entries of Axhead are not too large, and specifically do not exceed p 5/s. We then turn to analyze the probability that for some x X, Axhead, Axtail = X i [m] (Axhead)i(Axtail)i (Axhead)i X j supp(xtail) aijxj is at most ε. To this end, we partition the sum into two sums, where the first sum handles terms (i, j) where (Axhead)i and xj are large and the second sum handles the remaining terms. Here we exploit that (Axhead)i is small for most i as just argued. Furthermore, since x is unit length, there are also few choices of j where xj is large. The first sum can thus be handled by exploiting that there are few terms in the sum, and the second sum has strong concentration since the terms are small. Stronger Lower Bound. To improve over the lower bound given by Nelson and Nguyen (2013), we first need to define a harder input instance. Concretely, they used the standard unit vectors e1, . . . , en, which as argued earlier, only is a valid input for d n. Our hard instance X instead consists of all vectors of the form v S = P |S| for subsets S [d] of cardinality lg n/ lg d, all the standard unit vectors e1, . . . , ed, as well as the origin 0. Sparse Dimensionality Reduction Revisited Now consider an m d embedding matrix A, such that each column of A has at most s non-zeros, and A is an ε-JL matrix for X. Since e1, . . . , ed and 0 are in the input, it must be the case that each column aj of A has norm in 1 ε 2. Now assume for simplicity that all the columns of A had precisely s non-zero entries and those took values { 1/ s, 1/ s}. For a subset T [m] of t entries and a list of t signs σ = (σ1, . . . , σt), we say that aj has the signature (T, σ) if aj is non-zero in every coordinate corresponding to T and its coordinates inside T have the signs σ. Any column would then have s t distinct signatures. Since there are m t 2t signatures and d columns, it follows by averaging that there must be a signature shared by at least d s t / m t 2t d(s/m)t columns. 
We set t roughly as c lg d/ lg(m/s) for a small constant c > 0, resulting in at least poly(d) columns sharing the same signature. We now fix such a signature and let S be the subset of columns in A with that signature. If |S| = poly(d) ε 1 lg n/ lg d, then we can select ε 1 disjoint subsets S1, . . . , Sε 1 of S, each of cardinality lg n/ lg d. For each such subset Si, we know that the vector v Si is in X. Now inside the coordinates in T, all columns in Si are non-zero and have the same sign. Hence the entries of Av Si inside T are p lg n/(s lg d) in magnitude as the columns add up. Moreover, the entries inside T also have the same signs across distinct Av Si and Av Sj. If we now delete the entries in T from all such Av Pi, we are left with vectors whose norm is no more than 1 + ε < 2. Moreover, since v Si and v Sj have disjoint supports, they were orthogonal before embedding and thus to preserve their distance, the inner products of Av Si and Av Sj must be O(ε). Deleting the entries in T reduces these inner products by |T| lg n/(s lg d) = t lg n/(s lg d) lg n/(s lg(m/s)). If we call the resulting vectors Av Si, then it must hold that 0 Pε 1 i=1 Av Si 2 = Pε 1 i=1 Av Si 2 + P j =i Av Si, Av Sj 2ε 1 + ε 1(ε 1 1)(O(ε) lg n/(s lg(m/s))). Multiplying by ε and solving for s gives s = Ω(ε 1 lg n/ lg(m/s)). Since m = Ω(ε 2 lg n), this is equivalent to s = Ω(ε 1 lg n/ lg(m/ lg n)). To deal with columns of A that are not of the form { 1/ s, 0, 1/ s} we redefine signatures to be subsets of coordinates where aj has large norm restricted to those coordinates. Also, instead of the signs σ, we instead build a 1/4-net over the T coordinates and let the closest net point be a substitute for the signs. Comparing our argument to that of Nelson and Nguyen (2013), the key difference lies in summing up multiple columns of A that all share the same signature. To ensure this sum of columns corresponds to a vector in X, we add every sum of lg n/ lg d columns v S to the input. While the full proof is omitted from the main body of the paper, for sake of completeness, a detailed proof of Theorem 1.2 can be found in Appendix B. Subspace Embeddings. For subspace embeddings, we note that the classic approach for showing a sparsity of s = O(ε 1k) follows by constructing a 1/2-net N 1 2 V over the k-dimensional subspace V . One can then (roughly) show that if a linear embedding matrix A preserves the norm of all net points, then it preserves the pairwise distance between all points in V . Since such a net has cardinality 2O(k), the claimed sparsity follows from Kane and Nelson s s = O(ε 1 lg(1/δ)) with δ = 2 O(k). A first attempt at improving over this would be to directly insert n = 2O(k) into our improved sparse embedding from above. This would result in a sparsity of s = O(ε 1(k/ lg(1/ε) + k2/3 lg1/3 d)). The first term is fine, but the latter term depends on d, which would make the bound incomparable to previous results that only depend on k and ε. We thus take a closer look at the origin of the dependency on d. Recall that for a set of n vectors, such as the net N 1 2 , we partition the vectors w in N 1 2 into a head and a tail as w = whead + wtail where whead contains the largest ℓentries of w. We then observed that for an embedding matrix A, we have that Aw 2 can be written as Awhead 2 + Awtail 2 + 2 Awhead, Awtail . We then show that Awhead 2 and Awtail 2 are in the respective intervals (1 O(ε)) whead 2 and (1 O(ε)) wtail 2 and | Awhead, Awtail | = O(ε). 
For the second term, we exploited that wtail 1/ ℓand then combined this with the result by Jagadeesan for embedding vectors with a small . The requirement on s resulting from this term was s = Ω(ε 1 lg n/ lg(1/ε)) = Ω(ε 1k/ lg(1/ε)) as well as s = Ω(ε 1 lg(n)/ℓ) which is Ω(ε 1k/ℓ). Hence no dependencies on d here. Similarly, for the cross terms, we got the requirement s being at least Ω(ε 1 lg(n)/ ℓ) = Ω(ε 1k/ ℓ). Thus the dependency on d comes only from preserving the norms of the heads. For the heads, we argued that there were d ℓ choices for the positions of the heads and thereafter, we needed an εnet on the chosen ℓpositions. This resulted in d O(ℓ) heads in the net and we then used Kane and Nelson s analysis yielding s = O(ε 1 lg(1/δ)) with δ = d O(ℓ). Thus we need a tighter bound on the number of heads to avoid the dependency on d. The first idea is to change the definition of the head whead to be all entries wi of w with |wi| 1/ ℓ. This is a small but crucial change from the previous definition where the head contained the top ℓentries. To distinguish the two, we instead denote the heavy entries by wheavy and the remaining entries by wlight = w wheavy. Sparse Dimensionality Reduction Revisited Next, we argue that the positions of the at most ℓentries in wheavy, must be among a small set of coordinates: Lemma 2.1. Let V be a k-dimensional subspace of Rd. For every ℓ 1, there is a set S [d] of coordinates with |S| kℓsuch that for every unit vector v V , all coordinates i [d] \ S satisfy |vi| < 1/ Lemma 2.1 states that the positions of the non-zeros in all wheavy must be contained in a small set S of cardinality only |S| = kℓ. Thus there are only |S| ℓ = 2O(ℓlg k) possible positions of the non-zeros in wheavy. Next, we also argue that once the positions of the heavy entries have been determined, it suffices with 1/2-net on the chosen positions. Hence we reduce the number of wheavy to just 2O(ℓlg k) and have removed the dependency on d. Using Kane and Nelson now gives us that we need s = Ω(ε 1ℓlg k). Balancing this with s = Ω(ε 1k/ ℓ) gives ℓ= (k/ lg k)2/3. The final bound thus becomes s = O(ε 1(k/ lg(1/ε)+k2/3 lg1/3 k)) as claimed. While the full proof is omitted from the main body of the paper, for sake of completeness, a detailed proof of Theorem 1.3 can be found in Appendix C. 3. Sparsity Upper Bound In this section we prove Theorem 1.1. Let d be an integer, let ε (0, 1) and let X Rd be some finite set of n vectors. Let m = O(ε 2 lg n) and let s = O(ε 1 lg n/ lg(1/ε) + ε 1 lg2/3 n lg1/3 d). We will show that if A is sampled as in Kane and Nelson (2014), then A is an ε-JL matrix for X with probability at least 1 O(1/d). For simplicity, we will actually only show that it is an O(ε)- JL matrix. A simple rescaling of ε by a constant factor implies the result. As A is a linear transformation, and n appears in all terms inside a logarithm, it is enough to show the following claim (by replacing X with X containing xi,j = (xi xj)/ xi xj for all xi, xj X). Claim 3.1. Assume A is sampled as in Kane and Nelson (Kane & Nelson, 2014) with m = O(ε 2 lg n) and s = O(ε 1 lg n/ lg(1/ε) + ε 1 lg2/3 n lg1/3 d), then for every set X Rd of n unit vectors, it holds that with probability at least 1 O(1/d) for all x X that Ax 2 (1 O(ε)). For the rest of the section we therefore prove Claim 3.1, and we start by introducing the following notation. Notation 1. Let x Rd, and let ℓ [d]. Denote by xhead(ℓ) the vector obtained from x where all but the top ℓentries are zeroed out. Denote xtail(ℓ) = x xhead(ℓ). 
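As a small sanity check of Notation 1, the following sketch splits a vector into its top-ℓ head and remaining tail (ties broken arbitrarily) and verifies the exact decomposition of ‖Ax‖² into head, tail and cross terms that the analysis below bounds one by one. The dimensions are arbitrary illustrative values.

```python
import numpy as np

def head_tail(x: np.ndarray, ell: int):
    """Split x into (x_head, x_tail): the head keeps the ell largest-magnitude
    entries (ties broken arbitrarily), the tail keeps the rest; x = x_head + x_tail."""
    head = np.zeros_like(x)
    top = np.argsort(-np.abs(x))[:ell]
    head[top] = x[top]
    return head, x - head

rng = np.random.default_rng(0)
d, m, ell = 200, 50, 8
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                      # unit vector, as in Claim 3.1

x_head, x_tail = head_tail(x, ell)

# For any linear map A: ||Ax||^2 = ||A x_head||^2 + ||A x_tail||^2 + 2 <A x_head, A x_tail>.
A = rng.standard_normal((m, d)) / np.sqrt(m)
lhs = np.linalg.norm(A @ x) ** 2
rhs = (np.linalg.norm(A @ x_head) ** 2 + np.linalg.norm(A @ x_tail) ** 2
       + 2 * (A @ x_head) @ (A @ x_tail))
print(abs(lhs - rhs) < 1e-10)               # True: the identity is exact
```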
Let ℓ= min ε 1/2, lg n lg d 2/3 be an integer. For every T [d] ℓ , let YT be the set of all vectors y Rd such that y 1 and supp(y) T. Let Y = S T ( [d] ℓ) YT . Note that for every i [d], ei Y. Fix some set X Rd of n unit vectors. Define E1 to be the set of all matrices A Rm d such that for all x Y, Ax 2 (1 ε) x 2. Define E2 to be the set of all matrices A Rm d such that for all x X, Axtail(ℓ) 2 xtail(ℓ) 2 ε. Define E3 to be the set of all matrices A Rm d such that for all x X, either Axhead(ℓ) 2 > 2 or Axhead(ℓ), Axtail(ℓ) < ε. Claim 3.2. Assume A E1 E2 E3. Then for every x X, Ax 2 (1 O(ε)). Proof. Let x X, then Ax 2 can be written as Axhead(ℓ) 2 + Axtail(ℓ) 2 + 2 Axhead(ℓ), Axtail(ℓ) . If A is in E1, then Axhead(ℓ) 2 is in the interval (1 ε) xhead(ℓ) 2. Specifically Axhead(ℓ) 2 < 2 and thus since we also have A in E3, it must be the case that Axhead(ℓ), Axtail(ℓ) < ε. Therefore Ax 2 (1 + ε) xhead(ℓ) 2 + xtail(ℓ) 2 + ε + 2ε. Ax 2 (1 ε) xhead(ℓ) 2 + xtail(ℓ) 2 ε 2ε i.e. Ax 2 (1 O(ε)) as claimed. It remains to show that Pr[A E1 E2 E3] happens with at least probability 1 O(1/d). This is implied by the next three claims, bounding the probability of each of the events separately. Claim 3.3. Pr[A E1] 1 d 1. Claim 3.4. Pr[A E2] 1 n 1. Claim 3.5. Pr[A E3] 1 n 1. Claim 3.5, that essentially shows that with high probability over the choice of A the cross terms are small, constitute the technical crux of the upper bound result, and its proof requires a much more careful examination of the construction by Kane and Nelson (2014). We now therefore give the proof of Claim 3.5 whereas the proofs of Claim 3.3 and Claim 3.4 can be found in Appendix A. Recall that for a choice of m and s, the construction works by grouping the rows of A into s blocks of m/s consecutive rows each, [1, m/s], [m/s + 1, 2m/s] and so on. For every column, a uniform random entry in each block is chosen together with an independent uniform sign σ. That entry is then set to σ/ s. Each column thus has exactly one non-zero per block of rows. Sparse Dimensionality Reduction Revisited The rest of this section is devoted to the proof of Claim 3.5. We start by proving that xhead(ℓ) often has a desirable property. Concretely, we define the following Definition 3.6. A set J [d] ℓ of ℓcolumns of A is wellbehaved if there are no more than 6 lg n/ lg(1/ε) rows i [m] such that |{j J : aij = 0}| 6. Claim 3.7. Let A be sampled as in Kane and Nelson (Kane & Nelson, 2014) with m = O(ε 2 lg n) and s εm. Then for every set J [d] ℓ of ℓcolumns of A, it holds with probability at least 1 n 3 that J is well-behaved. Proof. Let J [d] ℓ , and denote β = 6 lg n/ lg(1/ε). For every subset I = {i1, . . . , iβ} [m] β of β rows and every sequence V1, . . . , Vβ J 6 of subsets of J of size 6 each, define the event EI,V1,...,Vβ to be the set of all matrices A such that for every i I, for every j Vi, ai,j = 0. Note first that if the entries {ai,j}i I,j Vi are not independent, then there must be two such entries in the same column and same subset of m/s rows. In this case, Pr[EI,V1,...,Vβ] = 0 as at most one of them may be non-zero. Otherwise, all 6β entries of A considered in EI,V1,...,Vβ are independent and therefore Pr[EI,V1,...,Vβ] s m 6β. As s εm and ℓ ε 1/2, and by applying a union bound we get that Pr[J is not well-behaved] is less than I ( [m] β ) V1,...,Vβ ( J 6) Pr[EI,V1,...,Vβ] me (O(1)ε 2 lg(1/ε))β ε3β . For ε smaller than some constant, this is at most εβ/2 n 3. 
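To make the construction recalled above and Definition 3.6 concrete, the following sketch samples a matrix with s blocks of m/s rows, one random ±1/√s entry per block in every column, and counts how many rows contain at least 6 non-zeros from a given column set J. The parameters are arbitrary, and the snippet only illustrates the objects involved; it is not a substitute for the probabilistic argument of Claim 3.7.

```python
import numpy as np

def sample_sparse_jl(m: int, d: int, s: int, rng: np.random.Generator) -> np.ndarray:
    """Block construction: rows are split into s blocks of m/s consecutive rows; every
    column gets exactly one non-zero per block, placed uniformly at random within the
    block, with an independent random sign and value +-1/sqrt(s)."""
    assert m % s == 0
    b = m // s                                   # block height m/s
    A = np.zeros((m, d))
    for j in range(d):
        offsets = rng.integers(0, b, size=s)     # position inside each block
        rows = np.arange(s) * b + offsets
        A[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

def heavy_rows(A: np.ndarray, J: np.ndarray, threshold: int = 6) -> int:
    """Number of rows with at least `threshold` non-zeros among the columns in J;
    Definition 3.6 calls J well-behaved when this count is small."""
    return int(np.sum(np.count_nonzero(A[:, J], axis=1) >= threshold))

rng = np.random.default_rng(0)
m, d, s, ell = 1200, 5000, 40, 25
A = sample_sparse_jl(m, d, s, rng)
J = rng.choice(d, size=ell, replace=False)       # candidate support of a head
print("rows with >= 6 non-zeros from J:", heavy_rows(A, J))
```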
Next we show that if the the support of xhead(ℓ) is a wellbehaved subset of columns, then Axhead(ℓ) has few large entries (note that | supp(xhead(ℓ))| ℓ). Claim 3.8. Let x X and assume the support of xhead(ℓ) is well-behaved. Then Axhead(ℓ) has at most 6 lg n/ lg(1/ε) entries that exceed p 5/s. Furthermore, we have Axhead(ℓ) p Proof. Let I [m] be the set of rows i [m] for which |{j supp(xhead(ℓ)) : aij = 0}| 6. Consider an i [m] \ I. The number of columns j [d] such that aij = 0 is at most 5 and for each of these we have |aij| 1/ s. Therefore since xhead(ℓ) 1 we get that |(Axhead(ℓ))i| p 5/s. Since supp(xhead(ℓ)) is wellbehaved, then |I| 6 lg n/ lg(1/ε), and the claim follows. The bound Axhead(ℓ) p ℓ/s follows simply from the support of xhead(ℓ) only having cardinality ℓand xhead(ℓ) 1. For every x X, let E3,x be the set of matrices A where Axhead(ℓ) 2 2 or | Axhead(ℓ), Axtail(ℓ) | < ε. Then E3 = x XE3,x. Our goal is to show that Pr[E3,x] 1 O(1/n2) which by a union bound over all x X completes the proof of Claim 3.5. To this end, define Wx to be the set of all matrices A for which the support of xhead(ℓ) is a wellbehaved set of columns. Claim 3.7 implies that Pr[Wx] 1 n 3. It is therefore enough to show that Pr[E3,x | Wx] 1 O(n 2), as Pr[E3,x] Pr[E3,x | Wx] Pr[Wx]. The following lemma thus concludes the proof of Claim 3.5, and the rest of this section is devoted to its proof. Lemma 3.9. Pr[E3,x | Wx] 1 O(n 2). Since Pr[E3,x | Wx Axhead(ℓ) 2 2] = 1 it is enough to bound Pr[E3,x | Wx Axhead(ℓ) 2 < 2]. Note that by disjointness of the support of xhead(ℓ) and xtail(ℓ), the vectors Axhead(ℓ) and Axtail(ℓ) are independent. In fact, Axtail(ℓ) is completely independent of all columns of A in the support of xhead(ℓ). We will therefore show that conditioned on Wx Axhead(ℓ) 2 < 2, | Axhead(ℓ), Axtail(ℓ) | = O(ε) with probability at least 1 O(1/n2) over the choice of the random columns in the support of xtail(ℓ). We can therefore condition on some outcome of u = Axhead(ℓ) where supp(xhead(ℓ)) is also well-behaved. For every i [m] and j supp(xtail(ℓ)) define bij as the Bernoulli random variable taking the value 1 if entry (i, j) of A is non-zero and 0 otherwise. In addition, let σij denote uniform random and independent signs. Then u, Axtail(ℓ) = Pm i=1 ui P j supp(xtail(ℓ)) bijσijxj/ s. To bound the sum, we split it into two sums, and bound the probabilities of each part being at most O(ε). Denote R = {(i, j) [m] supp(xtail(ℓ)) : |ui| > p and |xj| > 1/( ℓlg2(1/ε))} and S = ([m] supp(xtail(ℓ))) \ R Claim 3.10. Pr (i,j) R ui bijσijxj/ s Proof. Recall that u p ℓ/s and xtail(ℓ) 1/ ℓ. Therefore (i,j) R ui bijσijxj/ s (i,j) R |ui| bij|σijxj| (i,j) R bij . To complete the proof we will show that with probability at least 1 n 2 it holds that P (i,j) R bij O(sε) Sparse Dimensionality Reduction Revisited c lg n/ lg(1/ε). Since supp(xhead(ℓ)) is well-behaved, there are at most 6 lg n/ lg(1/ε) rows i [m] for which |ui| > p 5/s and since xtail(ℓ) = 1 there are at most ℓlg4(1/ε) columns j supp(xtail(ℓ)) such that |xj| 1/( ℓlg2(1/ε)). Thus |R| 6ℓlg n lg3(1/ε), and therefore µ := E h P (i,j) R bij i (s/m) 6ℓlg n lg3(1/ε) ε1/2 lg n lg3(1/ε), where the last inequality follows from the fact that s εm and ℓ ε 1/2. For ε smaller than some constant we get that the expectation is at most µ ε1/4 lg n/ lg(1/ε). Straightforward calculations give the following observation, whose proof is deferred to Appendix D. Observation 3.11. For every t > 0, E h exp t P (i,j) R bij i Y (i,j) R E [exp (tbij)]. 
Employing Observation 3.11 we can apply Hoeffding-like inequalities on the probability that P (i,j) R bij is large. Specifically for a large enough constant c let δ = cε 1/4 1 and t = ln(1 + δ) we get from Markov s inequality that (i,j) R bij > c lg n lg(1/ε) i = Pr exp t P (i,j) R bij > exp tc lg n (1 + δ)c lg n/ lg(1/ε) . As (1 + δ) = cε 1/4 and µ ε1/4 lg n/ lg(1/ε) we get that if c is large enough (i,j) R bij > c lg n lg(1/ε) i eε1/4 c lg n lg(1/ε) n 2 . Claim 3.12. Pr[| P (i,j) S ui bijσijxj/ s| O(ε)] is at least 1 O(n 2). Proof. We first note that the sum can be thought of as an inner product between two vectors indexd by (i, j) S. Specifically let σ, w RS be defined as follows. For every (i, j) S, σ(i,j) = σij and w(i,j) = c(i,j)bij, where c(i,j) = uixj/ s. As σ and w are independent, we get from Hoeffding s inequality that for every c > 0 Pr [| w, σ | > cε | w ] 2 exp (cε)2 Therefore it is enough to show that with probability at least 1 O(n 2) it holds that w 2 = O(ε2/ lg n). Note first E w 2 = E X (i,j) S c2 (i,j)b2 ij (i,j) S u2 i x2 j E[bij] (i,j) S u2 i x2 j = 1 m xtail(ℓ) 2 u 2 . Since we conditioned on u 2 < 2, and since xtail(ℓ) 2 1 we have that E w 2 2/m = O(ε2/ lg n). Our goal is therefore to bound Pr[ w 2 > (1+δ)E[ w 2]] for some constant δ > 0. Similarly to the previous proof we employ the following observation, whose proof is deferred to Appendix D. Observation 3.13. For every t > 0, (i,j) S c2 (i,j)bij Y (i,j) S E exp tc2 (i,j)bij . We start by bounding the coefficients c(i,j). Recall that u p ℓ/s and xtail(ℓ) 1/ ℓ, and let (i, j) S. Then either |ui| p 5/s or |xj| 1/( ℓlg2(1/ε)). In the former case |uixj/ s| ℓ), and by the choice of s and ℓwe get |uixj/ s| O(ε/ lg n). In the latter case |uixj/ s| 1/s lg(1/ε) = O(ε/ lg n). We conclude that for all (i, j) S we have |c(i,j)| = |uixj/ s| O(ε/ lg n). Let µ = E[ w 2], α = O((ε/ lg n)2) and let t = ln(1 + δ)/α for some large enough constant δ, then we get from Markov s inequality that Pr[ w 2 > (1 + δ)/m] E exp t P (i,j) S c2 (i,j)bij exp(t(1 + δ)/m) (i,j) S E exp tc2 (i,j)bij (1 + δ)(1+δ)/(αm) (2) Now note that for every (i, j) S it holds that E h exp tc2 (i,j)bij i = s metc2 (i,j) + 1 s etc2 (i,j) 1 = 1 + s (1 + δ)c2 (i,j)/α 1 Since c2 (i,j) α, we get that (1+δ)c2 (i,j)/α 1+δc2 (i,j)/α. Therefore E h exp tc2 (i,j)bij i 1+ sc2 (i,j)δ δ α sc2 (i,j) m Plugging into (2) we get that Pr w 2 > (1 + δ)µ (i,j) S exp δ α sc2 (i,j) m (1 + δ)(1+δ)/(αm) (1 + δ)(1+δ)/(αm) e2δ Sparse Dimensionality Reduction Revisited where the last inequality is due to the fact that µ 2/m. As 2/(αm) = Ω(lg n), then for a large enough constant δ the probability is at most n 2 Acknowledgements M. M. Høgsgaard and K. G. Larsen are supported by a DFF Sapere Aude Research Leader Grant No. 9064-00068B. Chris Schwiegelshohn is partially supported by the Independent Research Fund Denmark (DFF) under a Sapere Aude Research Leader grant No 1051-00106B. This work was done while the author Jelani Nelson was supported by NSF grant CCF-1951384, ONR grant N0001418-1-2562, and ONR DORECG award N00014-17-1-2127. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302 322, 2009. Arora, S., Hazan, E., and Kale, S. A fast random sampling algorithm for sparsifying matrices. 
In Proceedings of the 10th International Workshop on Randomization and Computation (RANDOM), pp. 272 279, 2006. Benson, A. R., Kumar, R., and Tomkins, A. A discrete choice model for subset selection. In Proceedings of the eleventh ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2018. Charikar, M., Chen, K. C., and Farach-Colton, M. Finding frequent items in data streams. Theor. Comput. Sci., 312 (1):3 15, 2004. Chenakkod, S., Derezi nski, M., Dong, X., and Rudelson, M. Optimal embedding dimension for sparse subspace embeddings. Ar Xiv, abs/2311.10680, 2023. URL https://api.semanticscholar. org/Corpus ID:265281115. Clarkson, K. L. and Woodruff, D. P. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pp. 81 90, 2013. Cohen, M. B. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty- Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 278 287, 2016. Dahlgaard, S., Knudsen, M., and Thorup, M. Practical hash functions for similarity estimation and dimensionality reduction. In Advances in Neural Information Processing Systems 30, pp. 6615 6625. Curran Associates, Inc., 2017. Do, T. T., Gan, L., Chen, Y., Nguyen, N., and Tran, T. D. Fast and efficient dimensionality reduction using structurally random matrices. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1821 1824, 2009. doi: 10.1109/ICASSP.2009.4959960. Fandina, O. N., Høgsgaard, M. M., and Larsen, K. G. The fast johnson-lindenstrauss transform is even faster. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 9689 9715. PMLR, 2023. URL https://proceedings.mlr.press/ v202/fandina23a.html. Freksen, C. B. and Larsen, K. G. On using toeplitz and circulant matrices for johnson-lindenstrauss transforms. Algorithmica, 82(2):338 354, 2020. Freksen, C. B., Kamma, L., and Larsen, K. G. Fully understanding the hashing trick. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montr eal, Canada, pp. 5394 5404, 2018. Go, A., Bhayani, R., and Huang, L. Twitter sentiment classification using distant supervision. Processing, 150, 01 2009. Jagadeesan, M. Understanding sparse JL for feature hashing. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 15177 15187, 2019. Jain, V., Pillai, N. S., and Smith, A. Kac meets johnson and lindenstrauss: a memory-optimal, fast johnsonlindenstrauss transform. Co RR, abs/2003.10069, 2020. Jayram, T. S. and Woodruff, D. P. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algorithms, 9 (3):26:1 26:17, 2013. Johnson, W. and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), Sparse Dimensionality Reduction Revisited volume 26 of Contemporary Mathematics, pp. 189 206. American Mathematical Society, 1984. Kane, D. M. and Nelson, J. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4:1 4:23, 2014. Krahmer, F. and Ward, R. 
New and improved johnson lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43 (3):1269 1281, 2011. Larsen, K. G. and Nelson, J. Optimality of the Johnson Lindenstrauss lemma. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pp. 633 638, 2017. Li, Y. and Liu, M. Lower bounds for sparse oblivious subspace embeddings. In Proceedings of the 41st Annual International Conference on Management of Data (PODS), pp. 251 260, 2022. Nelson, J. and Nguyen, H. L. Sparsity lower bounds for dimensionality reducing maps. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pp. 101 110, 2013. Nelson, J. and Nguy ˆen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 117 126, 2013. Nelson, J. and Nguy ˆen, H. L. Lower bounds for oblivious subspace embeddings. In Proceedings of the 41st International Colloquium on Automata, Languages and Programming (ICALP), pp. 883 894, 2014. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Sarl os, T. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 143 152, 2006. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 09, pp. 1113 1120, 2009. Woodruff, D. P. Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci., 10(1-2): 1 157, 2014. Sparse Dimensionality Reduction Revisited A. Proofs for Claim 3.3 and Claim 3.4 In this section we supply the proofs for Claims 3.3 and 3.4 from Section 3. We start with the proof of Claim 3.3 i.e. Pr[A E1] 1 d 1, where E1 is defined in the beginning of Section 3. Proof. Denote δ = 2 Ω( 3 lg2 n lg d). As n d we get that m Ω(ε 2 lg(1/δ)) and s Ω(ε 1 lg(1/δ)). Following Kane and Nelson (Kane & Nelson, 2014), for every unit vector x Rd we have that Pr Ax 2 (1 ε) 1 δ. Denote by ˆY the set of all vectors y Y such that for every i T, d3yi is an integer. Then for every y ˆY, for every i supp(y), d3yi { d3, d3 + 1, . . . , d3}, and therefore (2d3)ℓ (2d4)ℓ 2O(ℓlg d) = 2O( 3 lg2 n lg d) 1 Therefore with probability at least 1 δ 1 d 1, we get that for all y ˆY, Ay 2 (1 ε). Assume therefore that for all y ˆY, Ay 2 (1 ε). Let x Y, and let y ˆY be the closest vector in ˆY to x. Then x y 2 ℓ/d3. Since A 2 F = d we get that A(x y) A F x y where the last inequality is due to the fact that d lg n ε2 . Therefore Ax Ay + A(x y) 1 + ε + O(ε) 1 + O(ε), and similarly Ax 1 O(ε). Next we give the proof of Claim 3.4 i.e. Pr[A E2] 1 d 1, where E2 is defined in the beginning of Section 3. In showing this claim, we will make use of the following result by Jagadeesan (2019). Theorem A.1 ((Jagadeesan, 2019)). 
For any 0 < δ < 1 and 0 < ε < ε0 for some constant ε0, assume A is sampled as in Kane and Nelson (2014) with m Θ(ε 2 lg(1/δ)) and m s exp(max{1, ln(1/δ)/(εs)}), then for any vector v with εs ln(mε2/ ln(1/δ)) we have Pr[ Av 2 (1 ε) v 2] 1 δ. Using Theorem A.1, we now turn to bound the probability of E2. Proof. Fix x X. Denote v = xtail(ℓ), and let ˆε = max{ε, lg n v 2 }. We wish to apply Theorem A.1 and thus start by verifying that our choice of parameters satisfy the constraints in the theorem. Applying the right constants, we have that m Θ(ε 2 log n) Θ(ˆε 2 log n). Furthermore seΘ(max{1,(ˆεs) 1 lg n}) seΘ(max{1,(εs) 1 lg n}) semax{Θ(1), lg ε} = s max{eΘ(1), 1 lg2 n lg d + lg n/ lg(1/ε) max{eΘ(1), 1 Finally note that s ˆεs lg mˆε2 Therefore, Theorem A.1 gives us that with probability 1 1 n2 we have Av 2 (1 ˆε) v 2. That is Av 2 v 2 ˆε v 2 = v 2 max{ε v 2, lg n s v 2 }. Note first that ε v 2 < ε. Next, as v = xtail(ℓ), and x = 1 we Sparse Dimensionality Reduction Revisited have that v 1 ℓ, and therefore lg n sℓ= O 1 s max{lg1/3 n lg2/3 d , ε1/2 lg n} = O(ε). We conclude that with probability 1 1 n2 we have that Axtail(ℓ) 2 xtail(ℓ) 2 O(ε). Applying a union bound we get that Pr[A E2] 1 1 B. Sparsity Lower Bound In this section, we prove our lower bound result, Theorem 1.2. Let 0 < ε < 1/4. We first define a hard set of input vectors in Rd. Let ℓ= lg n/ lg(ed/ℓ). For every ℓ-sized subset S [d] of coordinates, form the vector x S = P ℓ. The collection of these vectors, along with the 0-vector and e1, . . . , ed, is our hard input instance X of cardinality |X| d ℓ + 1 + d (ed/ℓ)ℓ+ n 2n. Assume that A is an m d matrix in which every column has at most s non-zeros, and that A satisfies Au Av 2 (1 ε) u v 2 for all u, v X. We also assume that m = Ω(ε 2 lg n) as such a lower bound on m is already known. We prove a lower bound on s from these assumptions. Throughout the proof, we assume s m/2 as otherwise, we are already done. Let aj denote the j th column of A. We first observe that aj 2 (1 ε) for all j since aj 2 = Aej 2 = Aej A0 2 (1 ε) ej 0 2 = (1 ε). Our next step is to identify a subset T [m], such that many of the columns of A have large entries in T. For this, we prove the following lemma: Lemma B.1. Let v Rm be a vector with at most s m/2 non-zeros. For any t s/8, there are at least min{ m 1 t 1 , (s/(8t))t} distinct subsets T [m] of cardinality |T| = t for which P i T v2 i t v 2/(2s). We defer the proof to the end of the section and instead proceed with the lower bound argument. Let t be a parameter to be fixed. There are d columns in A, which by Lemma B.1 and averaging among all tsized subsets of [m] implies that there is a T with |T| = t such that at least d min{ m 1 t 1 , (s/(8t))t}/ m t d min{t/m, (s/(8t))t/(em/t)t} = d min{t/m, (s/(8em))t} columns aj of A satisfy P i T a2 i,j (1 ε)t/(2s) t/(4s). Fix such a T and let AT be the subset of columns satisfying the previous conditions for this T. 4 be a 1/4-net for the set of unit vectors in Rt, i.e. for any x Rt with x = 1, there is an x N 1 4 with x x 1/4 and x = 1. Standard results give that there is such a N 1 4 of cardinality 2O(t). For every aj AT , let a T j denote aj restricted to the t entries in T and let w(aj) denote the closest vector in N 1 4 to a T j / a T j . By averaging, there is a vector w N 1 4 where at least d min{t/m, (s/m)t}2 O(t) vectors aj AT have w as the closest vector to a T j / a T j . Fix such a w and let AT,w be the subset of columns in A satisfying the conditions. Now fix t = (1 o(1)) lg(εd/ℓ)/ lg(m/s). 
Assume first that for this choice, we have (s/m)t t/m. Then since s = o(m) (otherwise we are done with the lower bound proof), we have |AT,w| d(s/m)t2 O(t) = d(s/m)(1+o(1))t ℓ/ε. From AT,w, construct ε 1 disjoint sets of ℓvectors each. For each such set S, we have that the vector P ℓis in X. Let v1, . . . , vε 1 denote these vectors. Since they have disjoint supports, we have vi, vj = 0 for i = j and thus vi vj 2 = 2. This also implies that Avi Avj 2 2 2ε. Since Avi Avj 2 = Avi 2 + Avj 2 2 Avi, Avj and Avi 2, Avj 2 1 ε, it must be the case that Avi, Avj 2ε. On the other hand, we have Avi, Avj = P ak Sj ah, ak /ℓ. Thus P ak Sj ah, ak /ℓ 2ε. Now set all entries i T to 0 for all columns of A. Call the resulting columns ˆaj and the resulting matrix ˆA. Then for two columns ah, ak, we have ˆah, ˆak = ah, ak a T h , a T k . For any two ah, ak AT,w, we have a T h , a T k = Sparse Dimensionality Reduction Revisited a T h a T k w + (a T h / a T h w), w + (a T k / a T k w) . We have w + (a T h / a T h w), w + (a T k / a T k w) = w 2 + w, (a T h / a T h w) + w, (a T k / a T k w) + (a T h / a T h w), (a T k / a T k w) w 2 w a T h / a T h w w a T k / a T k w a T h / a T h w |a T h / a T h w 1 1/4 1/4 1/16 Hence for any two ah, ak AT,w, it holds that ˆah, ˆak ah, ak a T h a T k /4 ah, ak t(1 ε)/(8s) ah, ak t/(16s). We therefore conclude that ˆAvi, ˆAvj P ak Sj( ah, ak t/(16s))/ℓ 2ε ℓt/(16s). Finally, consider the vector z = Pε 1 i=1 ˆAvi. We have z 2 = Pε 1 i=1 ˆAvi 2 +P i =j ˆAvi, ˆAvj ε 1(1+ε)+ε 1(ε 1 1)(2ε ℓt/(16s)). Since z 2 0, it must thus be the case that (ε 1 1)ℓt/(16s) 1 + ε + (ε 1 1)2ε. Since ε 1 1 ε 1/2 and 1 + ε + (ε 1 1)2ε 4, we conclude s ε 1ℓt/128. Since d m = Ω(ε 2 lg n), we have lg(ed/ℓ) c lg(εd/ℓ) for a constant c > 0. Thus s = Ω(ε 1 lg n/ lg(m/s)) = Ω(ε 1 lg n/ lg(m/ lg n)). This was only under the assumption that (s/m)t t/m for t = (1 o(1)) lg(εd/ℓ)/ lg(m/s). This is implied by (ℓ/(εd))1 o(1) (1 o(1)) lg(εd/ℓ)/(m lg(m/s)). This is in particular implied by m (εd/ℓ)1 o(1). Constraining m (εd/ lg n)1 o(1) thus completes the proof. Proof of Lemma B.1. First consider the case where v has at least one coordinate j with v2 j t v 2/(2s). In this case, there are at least m 1 t 1 valid choices for T. If all coordinates j satisfy v2 j < t v 2/(2s), we partition the coordinates of v into buckets based on their magnitude. Concretely, for every i = 0, . . . , lg t 1, let Vi denote the subset of coordinates j for which v2 j [ v 22i 1/s, v 22i/s). Notice that all coordinates of j with v2 j < v 2/(2s) contribute at most s v 2/(2s) = v 2/2 to v 2. Furthermore, the contribution from coordinates j with j Vi for a Vi with |Vi| s/(4t), is no more than Plg t 1 i=0 (s/(4t)) v 22i/s v 2/4. Hence P i:|Vi|>s/(4t) P j Vi v2 j v 2/4. This implies that we also have P i:|Vi|>s/(4t) |Vi| v 22i/s v 2/4. For each i with |Vi| > s/(4t), let ti := 4t|Vi|/s . Then ti 4t|Vi|/s + 1 4t|Vi|/s + 4t|Vi|/s 8t|Vi|/s. Consider all sets T having |T Vi| = ti for all i with |Vi| > s/(4t). Any such T satisfies P i:|Vi|>s/(4t) ti v 22i 1/s i:|Vi|>s/(4t) |Vi| v 22i/s (t/(2s)) v 2. The number of such T is at least m/2 t P i:|Vi|>s/(4t) |Vi| ti . For |Vi| > s/(4t), we have |Vi| ti (|Vi|/ti)ti (|Vi|/(8t|Vi|/s))ti = (s/(8t))ti. The number of valid T is thus at least m s t P i:|Vi|>s/(4t)(s/(8t))ti (m/(2t))t P i ti(s/(8t)) P i ti (s/(8t))t. C. 
Subspace Embeddings In this section, we show that for any k-dimensional subspace V Rd, an embedding matrix A sampled as in Kane and Nelson (2014), with a sparsity s = Θ(ε 1(k/ lg(1/ε) + k2/3 lg1/3 k)) as in Theorem 1.3, preserves the norm of every vector in V to within 1 ε with high probability, thus proving Theorem 1.3. To simplify the proof, we will once again argue that norms are preserved to within 1 O(ε). As in Section 3, simple rescaling of ε by a constant factor implies the result. We first show that it is enough that A approximately preserves norms of a finite set defined by a 1/2-net on the subspace. The following lemma is known and appears in previous works. For sake of completeness, we supply a proof, which is deferred to the Appendix E. Lemma C.1. Let A be a matrix and V a subspace of Rd. Let N 1 2 be a 1/2-net for V and N + 1 2 = {x+y : x, y N 1 Assume that for all v N + 1 2 , Av 2 (1 O(ε)) v 2, then for all unit vectors x V , Ax 2 (1 O(ε)) x 2. As explained in the technical overview, we also employ Lemma 2.1. The lemma gives a combinatorial property of subspaces of Rd. Sparse Dimensionality Reduction Revisited Lemma C.2. 2.1 Let V be a k-dimensional subspace of Rd. For every ℓ 1, there is a set S [d] of coordinates with |S| kℓsuch that for every unit vector v V , all coordinates i [d] \ S satisfy |vi| < 1/ Proof. Let v1, . . . , vk be an orthonormal basis for V . Consider any unit vector u V and write it as u = P j αjvj with P j α2 j = 1. Then ui = P j αjvj i . By Cauchy-Schwarz, we have |ui| q P j(vj i )2 = q P j(vj i )2. Now let S [d] be all coordinates such that there is a unit vector u V with |ui| 1/ ℓ. Then for all i S, we must have P j(vj i )2 1/ℓ. But P i(vj i )2 = k and thus |S| kℓ. With the two lemmas above, we are ready to prove our main result on subspace embeddings, captured in Theorem 1.3. Similarly to the proof of Theorem 1.1, we define the following notation. Notation 2. Let x Rd. For every ℓ [d] denote by xheavy(ℓ) the vector obtained from x where all but the entries of magnitude strictly greater than 1/ ℓare zeroed out. Denote xlight(ℓ) = x xheavy(ℓ). 2 be a 1/2-net on the unit ball in V , and define N + 1 2 = N 1 2 {x+y : x, y N 1 2 {0}}. A 1/2-net can be constructed such that |N 1 2 | 4k. Let n = |N + 1 2 | 8k. Let ℓ= min ε 1/2, lg n lg k 2/3 be an integer, and let S be defined as in Lemma 2.1. We define Y as the set of all vectors y Rd such that supp(y) S, | supp(y)| ℓand y 1. Define E1 to be the set of all matrices A Rm d such that for all x Y, Ax 2 (1 ε) x 2. Define E2 to be the set of all matrices A Rm d such that for all x N + 1 2 , Axlight(ℓ) 2 xlight(ℓ) 2 ε. Define E3 to be the set of all matrices A Rm d such that for all x N + 1 2 , Axheavy(ℓ), Axlight(ℓ) < ε. Claim C.3. Assume A E1 E2 E3. Then for every unit vector x V , Ax 2 (1 O(ε)). Proof. Following Lemma C.1 and using linearity of A, it is enough to prove the claim for x = z/ z for all vectors z in N + 1 2 . Let therefore x be any such unit vector. Then Ax 2 = Axheavy(ℓ) 2 + Axlight(ℓ) 2 +2 Axheavy(ℓ), Axlight(ℓ) . Since x = 1 and every entry of xheavy(ℓ) is at least of magnitude 1/ ℓ, we have by the definition of S that supp(xheavy(ℓ)) S and | supp(xheavy(ℓ))| ℓ. Therefore xheavy(ℓ) Y and thus Ax 2 (1 + ε) xheavy(ℓ) 2 + (1 + ε) xlight(ℓ) 2 + ε + 2ε ( x 2 + O(ε)) Similarly Ax 2 (1 ε) xheavy(ℓ) 2 + (1 ε) xlight(ℓ) 2 ε 2ε ( x 2 O(ε)) As in the proof of Theorem 1.1, it remains to lower bound the probability of E1 E2 E3. Once again, we bound the probability of each event separately. Claim C.4. 
Pr[A E1] 1 2 k2/3. Proof. Denote δ = 2 Ω( 3 lg2 n lg k). We get that m Ω(ε 2 lg(1/δ)) and s Ω(ε 1 lg(1/δ)). Following the result by Kane and Nelson (2014), for every unit vector x Rd, we have that Pr Ax 2 (1 O(ε)) 1 δ. Next, for every T S such that |T| = ℓ, let YT = {y Rd : y 1 and supp(y) T}, then YT is a unit ball of an ℓ-dimensional subspace of Rd, and thus there is a 1/2-net ˆYT for YT such that | ˆYT | 4ℓ. Note that in these notations Y = S T ( S ℓ) YT , and denote in addition ˆY = S T ( S ℓ) ˆYT . Then | ˆY| |S| ℓ 4ℓ (4ek)ℓ= 2Ω(ℓlg(4ek)) . For k > 1, we have | ˆY| 2Ω(ℓlg k) = 2Ω( 3 lg2 n lg k) = δ 1/2. Therefore with probability at least 1 δ 1 2 k2/3, we get that for all y ˆY, Ay 2 (1 O(ε)). Sparse Dimensionality Reduction Revisited Assume therefore that for all y ˆY, Ay 2 (1 O(ε)). Let x Y, then there exists T S such that |T| = ℓ and supp(x) T, hence x YT . As ˆYT is a 1/2-net of YT and YT is a unit ball of an ℓ-dimensional subspace of Rd, Lemma C.1 implies that Ax 2 (1 O(ε)) x 2. Therefore Pr[A E1] 1 2 k2/3. The following claim completes the proof of Theorem 1.3. Proving bounds on the probabilities of E2 and E3 is analogous to the proofs of Claims 3.4 and 3.5 respectively, and is therefore omitted. Claim C.5. Pr[A E2] 1 1 n and Pr[A E3] 1 1 D. Proofs for Observations 3.11 and 3.13 For sake of completeness, we prove the following lemma, which implies Observations 3.11 and 3.13. Lemma D.1. Let I [m] [d] and let {c(i,j)}(i,j) I be a set of non-negative constants. For every (i, j) I, define bij as the Bernoulli random variable attaining 1 if and only if aij = 0, then (i,j) I c(i,j)bij (i,j) I E[exp(c(i,j)bij)] Proof. Recall that the rows of A are divided into s blocks I1, . . . , Is of m/s consecutive rows each. That is for every p [s], Ip = [(p 1)(m/s) + 1, pm/s]. In these notations, X (i,j) I c(i,j)bij = X i Ip:(i,j) I c(i,j)bij . As the columns of A, as well as different blocks within each column are independent, we get that (i,j) I c(i,j)bij i Ip:(i,j) I exp(c(i,j)bij) Fix j [d] and p [s], and denote C = {i Ip : (i, j) I}. For every i C, c(i,j) 0, and thus eci,j 1 (m/s) 0. Therefore i C exp(c(i,j)bij) (m/s) + (1 |C| (m/s)) = 1 + X 1 + ec(i,j) 1 i C E(exp(c(i,j)bij)) Sparse Dimensionality Reduction Revisited E. A 1/2-net suffices Here we give the defered proof of Lemma C.1 Proof of Lemma C.1. Let x V be a unit vector. We construct inductively a sequence {xi} i=0 of vectors in N 1 2 and a sequence {αi} i=0 of non-negative real numbers such that x = P i=0 αixi and moreover αi 2 i for all i 0. Let x0 be the closest vector to x in N 1 2 , and let α0 = 1. Then x = α0x0 + (x α0x0). Clearly if x α0x0 = 0 we are done, as we can define αi = 0 for all i 1. Otherwise, denote α1 = x α0x0 and v1 = α 1 1 (x α0x0), then α1 1/2, v1 is a unit vector and x = α0x0 + α1v1. Following by induction let p N and assume there are vectors x0, . . . , xp N 1 2 , numbers α0, . . . , αp+1 and a unit vector vp+1 such that x = Pp i=0 αixi + αp+1vp+1 and such that αi 2 i for all i p + 1. Let xp+1 be the closest vector in N 1 2 to vp+1. Then vp+1 = xp+1 + (v xp+1). If v xp+1 = 0 we are done, as we can define αi = 0 for all i p + 2. Otherwise, denote β = v xp+1 , αp+2 = αp+1β and vp+2 = β 1(v xp+1), then αp+2 2 p+1, vp+2 is a unit vector and i=0 αixi + αp+1vp+1 = i=0 αixi + αp+1(xp+1 + (v xp+1)) = i=0 αixi + αp+2vp+2 . This completes the construction of the sequences. Next note that i=0 αi xi 2 + X i