Journal of Machine Learning Research 19 (2018) 1-22. Submitted 5/18; Revised 9/18; Published 10/18.

Optimal Bounds for Johnson-Lindenstrauss Transformations

Michael Burr (burr2@clemson.edu)
Shuhong Gao (sgao@clemson.edu)
Department of Mathematical Sciences, Clemson University, Clemson, SC 29634, USA

Fiona Knoll (fknoll@g.clemson.edu)
Department of Mathematical Sciences, University of Cincinnati, Cincinnati, OH 45221, USA

Editor: Kenji Fukumizu

(c) 2018 Michael Burr, Shuhong Gao, and Fiona Knoll. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/18-264.html.

Abstract

In 1984, Johnson and Lindenstrauss proved that any finite set of data in a high-dimensional space can be projected to a lower-dimensional space while preserving the pairwise Euclidean distances between points up to a bounded relative error. If the desired dimension of the image is too small, however, Kane, Meka, and Nelson (2011) and Jayram and Woodruff (2013) proved that such a projection does not exist. In this paper, we provide a precise asymptotic threshold for the dimension of the image, above which a projection preserving the Euclidean distance exists, and below which no such projection exists.

Keywords: Johnson-Lindenstrauss transformation, dimension reduction, phase transition, asymptotic threshold

1. Introduction

In 1984, Johnson and Lindenstrauss (1984), in establishing a bound on the Lipschitz constant for the Lipschitz extension problem, proved that any finite set of data in a high-dimensional space can be projected into a lower-dimensional space while preserving the pairwise Euclidean distances within any desired relative error. In particular, for any finite set of vectors $x_1, \ldots, x_N \in \mathbb{R}^d$ and for any error factor $0 < \epsilon < 1/2$, there exists an absolute constant $c$ such that, for all $k \geq c\epsilon^{-2}\log N$, there exists a linear map $A : \mathbb{R}^d \to \mathbb{R}^k$ such that, for all pairs $1 \leq i, j \leq N$,

$$(1-\epsilon)\,\|x_i - x_j\|_2 \leq \|Ax_i - Ax_j\|_2 \leq (1+\epsilon)\,\|x_i - x_j\|_2,$$

where $\|\cdot\|_2$ denotes the Euclidean norm. These inequalities are implied by the following theorem (by setting $\delta = 1/N^2$ and using the union bound):

Theorem 1 (Johnson and Lindenstrauss (1984)) For any real numbers $0 < \epsilon, \delta < 1/2$, there exists an absolute constant $c > 0$ such that, for any integer $k \geq c\epsilon^{-2}\log(1/\delta)$, there exists a probability distribution $\mathcal{D}$ on $k \times d$ real matrices such that, for any fixed $x \in \mathbb{R}^d$,

$$\mathrm{Prob}_{A \sim \mathcal{D}}\left[(1-\epsilon)\|x\|_2^2 \leq \|Ax\|_2^2 \leq (1+\epsilon)\|x\|_2^2\right] > 1 - \delta, \tag{1}$$

where $A \sim \mathcal{D}$ indicates that the matrix $A$ is a random matrix with distribution $\mathcal{D}$.

Note that, in order to project a large number of vectors, $\delta$ must be sufficiently small. For instance, suppose we wish to project a set of $N = 2^{20}$ vectors to a lower-dimensional space. To apply the union bound to Inequality (1), we use any $\delta < 2^{-39}$. In this case, Inequality (1) implies that the probability of preserving all pairwise distances between the $N$ points (up to a relative error of $\epsilon$) is at least $1 - \delta N^2/2 > 0$. Since this probability is nonzero, such a projection exists (see the sketch below). A probability distribution $\mathcal{D}$ satisfying Inequality (1) is called an $(\epsilon,\delta)$-JL distribution, or simply a JL distribution. Since these transformations are linear, without loss of generality, we assume for the rest of the paper that $\|x\|_2 = 1$. When a JL distribution is specified via an explicit construction, we may call a random projection $x \mapsto Ax$ generated in this way a JL transformation.
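To make the union-bound bookkeeping concrete, the following Python sketch (our own illustration, not part of the original paper) computes a target dimension $k$ from $N$ and $\epsilon$; the function name and the choice $c = 4$ (the asymptotic constant established later in this paper, used here purely as a placeholder for the absolute constant of Theorem 1) are our own assumptions.

```python
import math

def jl_dimension(n_points: int, eps: float, c: float = 4.0) -> int:
    """Target dimension k = ceil(c * eps^-2 * log(1/delta)) with
    delta = 1/N^2, so that the union bound over the ~N^2/2 pairwise
    differences leaves a positive success probability.  The constant
    c stands in for the absolute constant of Theorem 1; c = 4 is the
    asymptotic value proved in this paper, not a finite-N guarantee."""
    delta = 1.0 / n_points**2
    return math.ceil(c * eps**-2 * math.log(1.0 / delta))

# Example from the text: N = 2**20 points and a 10% relative error.
print(jl_dimension(2**20, 0.1))  # ~11091
```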
Since the introduction of JL distributions, there has been considerable work on explicit constructions of JL distributions; see, e.g., Johnson and Lindenstrauss (1984); Frankl and Maehara (1988); Indyk and Motwani (1998); Achlioptas (2003); Ailon and Chazelle (2006); Matousek (2008); Dasgupta et al. (2010); Kane and Nelson (2014); and the references therein. A simple and easily described JL distribution is that of Achlioptas (2003). In this construction, the entries of $A$ are independently distributed as

$$A_{i,j} = \begin{cases} k^{-1/2}, & \text{with probability } 1/2, \\ -k^{-1/2}, & \text{with probability } 1/2 \end{cases}$$

(see the numerical sketch below). The recent constructions in Ailon and Chazelle (2006); Matousek (2008); Dasgupta et al. (2010); Kane and Nelson (2014) have focused on the complexity of computing a projection for the purpose of applications. We note that the ability to project a vector to a lower-dimensional space, independent of the original dimension, while preserving the Euclidean norm up to a prescribed relative error, is highly desirable. In particular, dimension reduction has applications in many fields, including machine learning (Arriaga and Vempala (2006); Weinberger et al. (2009)), low-rank approximation (Clarkson and Woodruff (2013); Nguyen et al. (2009); Ubaru et al. (2017)), approximate nearest neighbors (Ailon and Chazelle (2006); Indyk and Motwani (1998)), data storage (Candès (2008); Cormode and Indyk (2016)), and document similarity (Bingham and Mannila (2001); Lin and Gunopulos (2003)).

For both practical and theoretical purposes, it is important to know the smallest possible dimension $k$ of a potential image space for any given $\epsilon$ and $\delta$. Note that, for any $d_1 < d$, each $(\epsilon,\delta)$-JL distribution $\mathcal{D}$ on $\mathbb{R}^{k\times d}$ induces an $(\epsilon,\delta)$-JL distribution $\mathcal{D}_1$ on $\mathbb{R}^{k\times d_1}$ in a natural way: the matrices of $\mathcal{D}_1$ are obtained from $\mathcal{D}$ by deleting the last $d - d_1$ columns of a random matrix, together with the induced probability distribution. This construction is a JL distribution since $\mathbb{R}^{d_1}$ can be naturally embedded into $\mathbb{R}^d$ by extending a vector in $\mathbb{R}^{d_1}$ by $d - d_1$ zeros. Hence, if there exists an $(\epsilon,\delta)$-JL distribution on $\mathbb{R}^{k\times d}$, then there is an $(\epsilon,\delta)$-JL distribution on $\mathbb{R}^{k\times d_1}$ for all $1 \leq d_1 \leq d$. Similarly, if an $(\epsilon,\delta)$-JL distribution does not exist on $\mathbb{R}^{k\times d}$, then, for any $k_1 < k$, there cannot be an $(\epsilon,\delta)$-JL distribution on $\mathbb{R}^{k_1\times d}$. In particular, since $\mathbb{R}^{k_1}$ can be naturally embedded into $\mathbb{R}^k$ by extending a vector in $\mathbb{R}^{k_1}$ by $k - k_1$ zeros, if an $(\epsilon,\delta)$-JL distribution existed for $\mathbb{R}^{k_1\times d}$, it could be extended to an $(\epsilon,\delta)$-JL distribution for $\mathbb{R}^{k\times d}$.

Figure 1: For fixed $\epsilon$ and $\delta$, there exists a JL distribution ("JLD exists") for $k \geq 4\epsilon^{-2}\log(1/\delta)\,[1+o(1)]$. For $k < c_1\epsilon^{-2}\log(1/\delta)$, for some absolute constant $c_1 > 0$, there is no JL distribution ("no JLD exists").

In this paper, we close this gap in the limit. For any $\epsilon$ and $\delta$, we define

$$k_0(\epsilon,\delta) = \min\{k : \text{there exists an } (\epsilon,\delta)\text{-JL distribution on } \mathbb{R}^{k\times d} \text{ for every } d \geq 1\}.$$

By our definition, $k_0 = k_0(\epsilon,\delta)$ is independent of $d$, and, by Theorem 1, we have that $k_0 \leq c\epsilon^{-2}\log(1/\delta)$ for some absolute constant $c > 0$. Frankl and Maehara (1988) show that $c \leq 9$. Achlioptas (2003) further improves this bound by providing a JL distribution with $k > \frac{2\log(2/\delta)}{\epsilon^2/2 - \epsilon^3/3}$, resulting in the following upper bound:

$$k_0 \leq \left\lceil \frac{2\log(2/\delta)}{\epsilon^2/2 - \epsilon^3/3} \right\rceil + 1 = 4\epsilon^{-2}\log(1/\delta)\,[1 + o(1)],$$

where $o(1)$ approaches zero as both $\epsilon$ and $\delta$ approach zero. A lower bound on $k_0$ was not given until 2003, when Alon (2003) proved that $k_0 \geq c\epsilon^{-2}\log(1/\delta)/\log(1/\epsilon)$ for some absolute constant $c > 0$.
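As promised above, the Achlioptas construction is easy to test empirically. The sketch below is our own illustration (the helper name achlioptas_matrix, the dimensions, and the trial count are arbitrary choices); it samples the $\pm k^{-1/2}$ sign matrix and estimates how often $\|Ax\|_2^2$ leaves the interval $[1-\epsilon, 1+\epsilon]$ for a fixed unit vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def achlioptas_matrix(k: int, d: int) -> np.ndarray:
    """k x d matrix with i.i.d. entries +k^(-1/2) or -k^(-1/2),
    each sign taken with probability 1/2 (Achlioptas, 2003)."""
    return rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)

d, k, eps = 2_000, 500, 0.1
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                      # fixed unit vector

sq_norms = np.array([np.linalg.norm(achlioptas_matrix(k, d) @ x)**2
                     for _ in range(300)])
print(sq_norms.mean())                      # ~1, since E||Ax||^2 = ||x||^2
print(np.mean(np.abs(sq_norms - 1) > eps))  # empirical failure rate
```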
Improving Alon's work, Jayram and Woodruff (2013) and Kane et al. (2011) showed, through different methods, that, for some absolute constant $c_1 > 0$, there is no $(\epsilon,\delta)$-JL distribution for $k \leq c_1\epsilon^{-2}\log(1/\delta)$. Hence, there is a lower bound of the form $k_0 \geq c_1\epsilon^{-2}\log(1/\delta)$. This situation is summarized in Figure 1.

The goal of the current paper is to close the gap between the upper and lower bounds in the limit. In particular, we prove an optimal lower bound that asymptotically matches the known upper bound as $\epsilon$ and $\delta$ approach 0; see Theorem 2. This means that $4\epsilon^{-2}\log(1/\delta)$ is an asymptotic threshold for $k_0$ where a phase change phenomenon occurs. The main result of this paper is captured in the following theorem:

Theorem 2 For $\epsilon$ and $\delta$ sufficiently small, $k_0 \approx 4\epsilon^{-2}\log(1/\delta)$. More precisely,

$$\lim_{\epsilon,\delta\to 0} \frac{k_0(\epsilon,\delta)}{4\epsilon^{-2}\log(1/\delta)} = 1.$$

The rest of the paper is organized as follows: To prove Theorem 2, we follow the approach of Kane et al. (2011). To make their constant $c_1$ explicit, however, we must use a more careful argument. In Section 2, we provide explicit conditions under which we prove the main result, Theorem 2. We delay the proofs of these conditions until Section 3 in order to make the main result more accessible, since only the statements of these results are needed and not their more technical proofs. In Section 3, we prove probabilistic bounds on $s = x_1^2 + \cdots + x_k^2$, where $x = (x_1,\ldots,x_k,\ldots,x_d)$ is a random variable uniformly distributed on $S^{d-1}$. These bounds explicitly determine the hidden constants in the bounds on $\mathrm{Prob}[s < s_0(1-\epsilon)]$ and $\mathrm{Prob}[s > s_0(1+\epsilon)]$ from (Kane et al., 2011, Theorem 20). These bounds can be viewed as explicit bounds for concentration theorems for laws of large numbers from probability theory. In the Appendix, we provide an alternate proof of a result in Kane et al. (2011), showing that, if $d\Omega_{d-1}$ is the surface area measure for the $(d-1)$-dimensional sphere $S^{d-1}$, then, for any $1 \leq k \leq d$,

$$d\Omega_{d-1} = \frac{1}{2}f(s)\,ds\,d\Omega_{k-1}\,d\Omega_{d-k-1},$$

where $s \in [0,1]$ and $f(s) = s^{(k-2)/2}(1-s)^{(d-k-2)/2}$.

2. Asymptotic Threshold Bound

In this section, we prove the asymptotic threshold bound for JL transformations. In particular, we provide specific conditions that result in the asymptotic threshold bound of $4\epsilon^{-2}\log(1/\delta)$. In Section 3, we prove that these conditions hold, but the details of those proofs are more technical and are unnecessary for understanding the main results of this paper.

2.1. The Uniform Distribution on $S^{d-1}$

There is a unique probability distribution, called the uniform distribution, on the $(d-1)$-dimensional sphere $S^{d-1}$ that is invariant under the orthogonal group. We express the uniform distribution on $S^{d-1}$ in terms of the surface area differential form $d\Omega_{d-1}$, which means that, for any measurable subset $V \subseteq S^{d-1}$, the $(d-1)$-dimensional surface area of $V$ is equal to the integral with respect to $d\Omega_{d-1}$, i.e., $\mathrm{Vol}_{d-1}(V) = \int_V d\Omega_{d-1}$.¹ Thus, the uniform distribution on $S^{d-1}$ corresponds to the measure $d\Omega_{d-1}/\mathrm{Vol}_{d-1}(S^{d-1})$.

As we are interested in reducing a $d$-dimensional vector to a $k$-dimensional vector for $1 \leq k < d$, we derive a relationship between the uniform distribution on $S^{d-1}$ and the uniform distributions on $S^{k-1}$ and $S^{d-k-1}$. Following the approach of Kane et al. (2011), for $1 \leq k < d$, we define an injective map $\Psi : S^{d-1} \to [0,1] \times S^{k-1} \times S^{d-k-1}$ as follows: For any $x = (x_1, x_2, \ldots, x_d)^t \in S^{d-1}$, we define $s$ in $\Psi(x) = (s, u, v)$ as $s = x_1^2 + \cdots + x_k^2$. In the case where $0 < s < 1$, we define $u = (x_1,\ldots,x_k)^t/\sqrt{s}$ and $v = (x_{k+1},\ldots,x_d)^t/\sqrt{1-s}$.
¹ In this paper, we suppress the pullback maps on equalities for differential forms since there is a unique (almost) bijective map under consideration in each case. We leave the details to the interested reader.

When $s = 0$, i.e., $x_1 = \cdots = x_k = 0$, we define $u = (1,0,\ldots,0)^t$ (or any point in $S^{k-1}$) and $v = (x_{k+1},\ldots,x_d)^t$. Similarly, for $s = 1$, we define $u = (x_1,\ldots,x_k)^t$ and $v = (1,0,\ldots,0)^t$ (or any point in $S^{d-k-1}$). It is straightforward to check that $\Psi$ is injective. In addition, the complement of the image of $\Psi$ is a subset of $\{0,1\} \times S^{k-1} \times S^{d-k-1}$, which has, in turn, $(d-1)$-dimensional surface area equal to 0. Therefore, when convenient, we assume that $s \in (0,1)$.

For $s \in [0,1]$, we define $f(s) = s^{(k-2)/2}(1-s)^{(d-k-2)/2}$. In the Appendix, we provide an alternative proof of the computation in Kane et al. (2011), which shows that, via the map $\Psi$,

$$d\Omega_{d-1} = \frac{1}{2}f(s)\,ds\,d\Omega_{k-1}\,d\Omega_{d-k-1}.$$

Equivalently, in terms of probability distributions,

$$\frac{d\Omega_{d-1}}{\mathrm{Vol}_{d-1}(S^{d-1})} = Bf(s)\,ds\;\frac{d\Omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\;\frac{d\Omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})}, \tag{2}$$

where

$$B = \frac{\mathrm{Vol}_{k-1}(S^{k-1})\,\mathrm{Vol}_{d-k-1}(S^{d-k-1})}{2\,\mathrm{Vol}_{d-1}(S^{d-1})} = \frac{\Gamma(d/2)}{\Gamma(k/2)\,\Gamma((d-k)/2)}$$

is an appropriate scaling constant, depending on the gamma function; for more details, see the Appendix. Moreover, in this situation, we observe that $Bf(s)$ is a probability density on $[0,1]$. This implies that the uniform distribution on $S^{d-1}$ is a direct product of the distributions on the factors. In other words, a uniformly distributed random variable $X_{d-1}$ on $S^{d-1}$ can be decomposed into three random variables $\Psi(X_{d-1}) = (S, X_{k-1}, X_{d-k-1})$ with the following properties:

(i) $S$ is a random variable on $[0,1]$ with density function $Bf(s)$,
(ii) $X_{k-1}$ and $X_{d-k-1}$ are uniformly distributed on $S^{k-1}$ and $S^{d-k-1}$, and
(iii) the random variables $S$, $X_{k-1}$, and $X_{d-k-1}$ are independent.

The independence of these three random variables is a key property in our proof, as it allows us to study the three spaces separately.

2.2. Upper Bound: Explicit JL Distribution

We recall that Achlioptas (2003) proved that

$$k_0(\epsilon,\delta) \leq \left\lceil \frac{2\log(2/\delta)}{\epsilon^2/2 - \epsilon^3/3} \right\rceil + 1 = 4\epsilon^{-2}\log(1/\delta)\,[1+o(1)].$$

In this section, we give an alternate proof of this result using the approach and bounds from this paper. We recall the following construction by Dasgupta and Gupta (2003): A distribution $\mathcal{D}$ on $k \times d$ matrices is formed by picking a $d \times d$ orthonormal matrix $V = (v_1,\ldots,v_d)^t$ uniformly at random with respect to the Haar measure on orthonormal matrices and then letting $A = \frac{1}{\sqrt{s_0}}(v_1,\ldots,v_k)^t$, where $s_0 = k/d$. The following proposition shows that $k_0(\epsilon,\delta) \leq 4\epsilon^{-2}\log(1/\delta)\,[1+o(1)]$, which, in turn, implies that the limit appearing in Theorem 2 (if it exists) is at most 1:

Proposition 3 Let $0 < \epsilon, \delta < 1/2$ and $s_0 = k/d$. Suppose that there is some constant $C$ so that

$$\max\{\mathrm{Prob}_{x\in S^{d-1}}[s < s_0(1-\epsilon)],\ \mathrm{Prob}_{x\in S^{d-1}}[s > s_0(1+\epsilon)]\} \leq Ce^{-\frac{k}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)},$$

where $s = x_1^2 + \cdots + x_k^2$ is defined as in $\Psi(x) = (s,u,v)$. Then, there exists an $o(1)$ function, which approaches zero as both $\epsilon$ and $\delta$ approach zero, so that if $k > 4\epsilon^{-2}\log\frac{1}{\delta}\,[1+o(1)]$, then the distribution on $k \times d$ random matrices defined as above is an $(\epsilon,\delta)$-JL distribution; that is, for any $w \in S^{d-1}$,

$$\mathrm{Prob}_{A\sim\mathcal{D}}\left[\left|\,\|Aw\|_2^2 - 1\,\right| < \epsilon\right] \geq 1 - \delta.$$

Proof Let $V$ be the random orthogonal matrix as defined above, and let $x = (x_1,\ldots,x_d)^t = Vw$. Then $Aw = \sqrt{s_0^{-1}}\,(x_1,\ldots,x_k)^t$, and $\|Aw\|_2^2 = \frac{1}{s_0}(x_1^2 + \cdots + x_k^2)$. Since $V$ is orthonormal and $\|w\|_2 = 1$, we have that $\|x\|_2 = 1$, and, hence, $x \in S^{d-1}$. We observe that since $V$ is a random orthogonal matrix, for fixed $w \in S^{d-1}$, $x = Vw$ is a random variable, uniformly distributed on $S^{d-1}$.
Hence,

$$\mathrm{Prob}_{A\sim\mathcal{D}}\left[\left|\,\|Aw\|_2^2 - 1\,\right| > \epsilon\right] = \mathrm{Prob}_{x\in S^{d-1}}\left[\left|\frac{1}{s_0}\sum_{i=1}^k x_i^2 - 1\right| > \epsilon\right],$$

where $x \in S^{d-1}$ means that $x$ is a random variable uniformly distributed on $S^{d-1}$. Let $s = \sum_{i=1}^k x_i^2$. Then $s \in [0,1]$, and the probability above becomes

$$\mathrm{Prob}_{x\in S^{d-1}}[s < s_0(1-\epsilon)] + \mathrm{Prob}_{x\in S^{d-1}}[s > s_0(1+\epsilon)] \leq 2Ce^{-\frac{k}{2}\left(\frac{\epsilon^2}{2}-\frac{\epsilon^3}{3}\right)} \tag{3}$$

by assumption. We observe that when

$$k > 4\epsilon^{-2}\left(\log\frac{1}{\delta} + \log(2C)\right)\frac{1}{1 - 2\epsilon/3} = 4\epsilon^{-2}\log\frac{1}{\delta}\,[1+o(1)], \tag{4}$$

the right-hand side of Inequality (3) is less than $\delta$. In this case, the $o(1)$ term needed in the theorem statement appears in Inequality (4). Therefore, when $k > 4\epsilon^{-2}\log\frac{1}{\delta}\,[1+o(1)]$, the distribution $\mathcal{D}$ is an $(\epsilon,\delta)$-JL distribution.

We observe that this result also follows from work in Dasgupta and Gupta (2003). In particular, (Dasgupta and Gupta, 2003, Lemma 2.2) can be used to derive bounds similar to the assumed bounds in this statement.

2.3. Lower Bound for Arbitrary Distributions

In this section, we prove an optimal lower bound on the limit in Theorem 2 that matches the upper bound from the previous section. The proof of this lower bound is the main challenge in this paper. We begin with the following key lemma:

Lemma 4 Let $x = (x_1,\ldots,x_d)^t$ be a random variable, uniformly distributed on $S^{d-1}$, $\Psi(x) = (s,u,v)$, and $s_0 = k/d$. Suppose that, for a fixed $\epsilon > 0$,

$$\min\{\mathrm{Prob}[s > s_0(1+\epsilon)],\ \mathrm{Prob}[s < s_0(1-\epsilon)]\} \geq L,$$

where $s$ is a random variable with probability density $Bf(s)$ on $[0,1]$. For any function $c(u,v) > 0$ depending only on $u \in S^{k-1}$ and $v \in S^{d-k-1}$ (i.e., independent of $s$), we have

$$\mathrm{Prob}_{x\in S^{d-1}}[|sc - 1| > \epsilon] \geq L.$$

Proof By the equality of differential forms in Equation (2),

$$\mathrm{Prob}[|sc-1| > \epsilon] = \int_{|sc-1|>\epsilon} Bf(s)\,ds\;\frac{d\Omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\;\frac{d\Omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})} = \int_{S^{k-1}\times S^{d-k-1}}\left(\int_{|sc-1|>\epsilon} Bf(s)\,ds\right)\frac{d\Omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\;\frac{d\Omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})}.$$

Our goal is to find a lower bound on the inner integral $\int_{|sc-1|>\epsilon} Bf(s)\,ds$. Due to the independence of $u$, $v$, and $s$, $c(u,v)$ is a fixed positive constant within this integral. We observe that the region $|sc - 1| > \epsilon$ consists of two intervals, $s < (1-\epsilon)/c$ and $s > (1+\epsilon)/c$, and we consider two cases depending on the value of $c$. We begin by recalling that

$$\mathrm{Prob}[s > s_0(1+\epsilon)] = \int_{s>s_0(1+\epsilon)} Bf(s)\,ds \quad\text{and}\quad \mathrm{Prob}[s < s_0(1-\epsilon)] = \int_{s<s_0(1-\epsilon)} Bf(s)\,ds.$$

If $c \geq 1/s_0$, then $(1+\epsilon)/c \leq s_0(1+\epsilon)$, so

$$\int_{|sc-1|>\epsilon} Bf(s)\,ds \geq \int_{s>(1+\epsilon)/c} Bf(s)\,ds \geq \int_{s>s_0(1+\epsilon)} Bf(s)\,ds \geq L.$$

On the other hand, if $c < 1/s_0$, then $s_0(1-\epsilon) < (1-\epsilon)/c$, so

$$\int_{|sc-1|>\epsilon} Bf(s)\,ds \geq \int_{s<(1-\epsilon)/c} Bf(s)\,ds \geq \int_{s<s_0(1-\epsilon)} Bf(s)\,ds \geq L.$$

Therefore, the integral $\int_{|sc-1|>\epsilon} Bf(s)\,ds$ is bounded from below by $L$, and

$$\mathrm{Prob}[|sc-1|>\epsilon] \geq \int_{S^{k-1}\times S^{d-k-1}} L\;\frac{d\Omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\;\frac{d\Omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})} = L.$$

We now show that when $k \leq \eta\epsilon^{-2}\log(1/\delta)$ with $\eta < 4$, and $\epsilon$ and $\delta$ are sufficiently small, there does not exist an $(\epsilon,\delta)$-JL distribution on $\mathbb{R}^{k\times d}$. This fact, combined with the results in Section 2.2, shows that the limit appearing in Theorem 2 exists and equals 1. In order to show this, we consider the following related problem: By definition, for a probability distribution $\mathcal{D}$ on $\mathbb{R}^{k\times d}$ to be an $(\epsilon,\delta)$-JL distribution, the following inequality must hold for every $w \in S^{d-1}$:

$$\mathrm{Prob}_{A\sim\mathcal{D}}\left[\left|\,\|Aw\|_2^2 - 1\,\right| > \epsilon\right] < \delta.$$

Hence,

$$\mathrm{Prob}_{A\sim\mathcal{D},\,w\in S^{d-1}}\left[\left|\,\|Aw\|_2^2 - 1\,\right| > \epsilon\right] < \delta, \tag{5}$$

where $w \in S^{d-1}$ is a random variable distributed uniformly on $S^{d-1}$. Following the approach of Kane et al. (2011), our goal is to prove that, for every $A \in \mathbb{R}^{k\times d}$,

$$\mathrm{Prob}_{w\in S^{d-1}}\left[\left|\,\|Aw\|_2^2 - 1\,\right| > \epsilon\right] > \delta. \tag{6}$$

When Inequality (6) holds for all $A$, then Inequality (5) cannot hold for any distribution $\mathcal{D}$ on $\mathbb{R}^{k\times d}$. Therefore, an $(\epsilon,\delta)$-JL distribution does not exist.
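Since $B = \Gamma(d/2)/(\Gamma(k/2)\Gamma((d-k)/2))$, the density $Bf(s)$ from Section 2.1 is exactly the Beta$(k/2,(d-k)/2)$ density, which makes the tail probabilities above easy to explore numerically. The following sketch is our own illustration (the specific $d$, $k$, $\epsilon$, and sample size are arbitrary); it samples $x$ uniformly on $S^{d-1}$ via normalized Gaussians and compares $s = x_1^2 + \cdots + x_k^2$ against that Beta law.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, k, n = 50, 10, 100_000
s0 = k / d

# Uniform samples on S^(d-1): normalize standard Gaussian vectors.
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# s is the first component of Psi(x) = (s, u, v).
s = np.sum(x[:, :k] ** 2, axis=1)

# B*f(s) is the Beta(k/2, (d-k)/2) density; compare both tails.
beta = stats.beta(k / 2, (d - k) / 2)
eps = 0.25
print(np.mean(s > s0 * (1 + eps)), 1 - beta.cdf(s0 * (1 + eps)))
print(np.mean(s < s0 * (1 - eps)), beta.cdf(s0 * (1 - eps)))
```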
We now make the lower-bound argument precise in the following theorem:

Theorem 5 Suppose that $\eta < 4$ and let $k(\epsilon,\delta) = \lceil \eta\epsilon^{-2}\log\frac{1}{\delta}\rceil$. Let $s_0 = k/d$, and suppose that, for every $\epsilon$, $\delta$, and $s_0$ sufficiently small (to make $s_0$ sufficiently small, $d$ must be sufficiently large),

$$\min\{\mathrm{Prob}[s > s_0(1+\epsilon)],\ \mathrm{Prob}[s < s_0(1-\epsilon)]\} \geq C\delta^{\frac{\eta}{4}\gamma},$$

where $C > 0$ is an absolute constant, and $\gamma$ approaches 1 as $\epsilon$, $\delta$, and $s_0$ approach 0. Then, by decreasing $\epsilon$, $\delta$, and $s_0$ as needed, for every matrix $A \in \mathbb{R}^{k(\epsilon,\delta)\times d}$,

$$\mathrm{Prob}_{w\in S^{d-1}}\left[\left|\,\|Aw\|_2^2 - 1\,\right| > \epsilon\right] > \delta.$$

Proof We assume that $A$ has rank $k = k(\epsilon,\delta)$ since, if not, we may reduce $k$ (and decrease $\eta$ correspondingly) to the rank of $A$. Let $A = U\Sigma V^t$ be the singular value decomposition of $A$, where $U$ is a $k\times k$ orthonormal matrix, $V = (v_1,\ldots,v_d)$ is a $d\times d$ orthonormal matrix, and $\Sigma$ is a $k\times d$ diagonal matrix with $\lambda_i > 0$ as its entry at $\Sigma_{i,i}$ for $1 \leq i \leq k$. Let $x = (x_1,\ldots,x_d)^t = V^tw$. Since $V$ is orthonormal, we have $x \in S^{d-1}$. We observe that since $w$ is a uniformly distributed random variable on $S^{d-1}$, $V^tw$ is also a uniformly distributed random variable on $S^{d-1}$. Therefore, since $U$ is orthonormal, we have

$$\|Aw\|_2^2 = \|U\Sigma x\|_2^2 = \|\Sigma x\|_2^2 = \sum_{i=1}^k \lambda_i^2 x_i^2.$$

We recall the definition of the function $\Psi(x) = (s,u,v)$, where $s = x_1^2 + \cdots + x_k^2$. Moreover, we restrict our attention to the case where $s \in (0,1)$ since the complement has zero measure. Let

$$c = \sum_{i=1}^k \lambda_i^2 x_i^2 / s = \|\Sigma_{k\times k}\,u\|_2^2,$$

where $\Sigma_{k\times k}$ denotes the $k\times k$ principal submatrix of $\Sigma$; then

$$\mathrm{Prob}_{w\in S^{d-1}}\left[\left|\,\|Aw\|_2^2 - 1\,\right| > \epsilon\right] = \mathrm{Prob}_{x\in S^{d-1}}[|sc - 1| > \epsilon].$$

Due to the independence of $u$, $v$, and $s$, it follows that $c$ depends only on $u$. Therefore, by Lemma 4, it follows that

$$\mathrm{Prob}_{w\in S^{d-1}}\left[\left|\,\|Aw\|_2^2 - 1\,\right| > \epsilon\right] \geq C\delta^{\frac{\eta}{4}\gamma}.$$

It follows that, for $\epsilon$, $\delta$, and $s_0$ sufficiently small, $C\delta^{\frac{\eta}{4}\gamma} > \delta$.

Therefore, by selecting $\epsilon$, $\delta$, and $s_0$ sufficiently small, it follows from Theorem 5 that there is no $(\epsilon,\delta)$-JL distribution when $k < \eta\epsilon^{-2}\log\frac{1}{\delta}$ for $\eta < 4$. Therefore, $k_0(\epsilon,\delta) \geq \eta\epsilon^{-2}\log\frac{1}{\delta}$. We collect the results of Proposition 3 and Theorem 5 in the following corollary:

Corollary 6 Assume the hypotheses on $\mathrm{Prob}[s > s_0(1+\epsilon)]$ and $\mathrm{Prob}[s < s_0(1-\epsilon)]$ from Proposition 3 and Theorem 5 hold.

(a) There exists an $o(1)$ function that approaches 0 as $\epsilon$ and $\delta$ approach zero such that, if $k > 4\epsilon^{-2}\log\frac{1}{\delta}\,[1+o(1)]$, then there exists a JL distribution.

(b) If $k(\epsilon,\delta) = \lceil\eta\epsilon^{-2}\log\frac{1}{\delta}\rceil$ with $\eta < 4$, then, by decreasing $\epsilon$ and $\delta$, and increasing $d$, there is no $(\epsilon,\delta)$-JL distribution for any $k \leq k(\epsilon,\delta)$.

This proves the main result of the paper. In the following section, we provide the more technical results that verify the assumptions in Proposition 3 and Theorem 5.

3. Explicit Concentration Bounds

Throughout this section, we assume that $s$ is a random variable with probability density $Bf(s)$. We define $s_0 = \frac{k}{d}$, and we further assume that $0 < \epsilon, \delta \leq 1/2$, $k \geq 4 + \epsilon^{-2}$, and $s_0 < 0.4$. We derive lower and upper bounds for the following probabilities:

$$\mathrm{Prob}[s > s_0(1+\epsilon)] \quad\text{and}\quad \mathrm{Prob}[s < s_0(1-\epsilon)],$$

using the probability density $Bf(s)$ for $s$ and $f(s) = s^{(k-2)/2}(1-s)^{(d-k-2)/2}$. These probabilities appear in (Kane et al., 2011, Theorem 20) without the explicit constants that are derived in this paper. These bounds are instances of explicit concentration theorems (or explicit laws of large numbers) from probability theory. Our goal is to formulate these bounds as precisely as possible so that the lower and upper bounds are asymptotically the same when $\epsilon$ and $\delta$ approach 0.

3.1. Bounds for B

We recall that $\Gamma(1/2) = \sqrt{\pi}$, $\Gamma(1) = 1$, and $\Gamma(1+z) = z\Gamma(z)$. Hence,

$$\Gamma\left(\frac{d}{2}\right) = \left(\frac{d}{2}-1\right)! \ \text{ if } d \text{ is even}, \qquad\text{and}\qquad \Gamma\left(\frac{d}{2}\right) = \frac{(d-2)!!}{2^{(d-1)/2}}\sqrt{\pi} \ \text{ if } d \text{ is odd}.$$
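As a quick numerical cross-check of the constant $B = \Gamma(d/2)/(\Gamma(k/2)\Gamma((d-k)/2))$ from Equation (38), the following sketch (our own illustration; the dimensions and grid size are arbitrary) verifies that $Bf(s)$ integrates to 1 over $[0,1]$.

```python
import math

def B(d: int, k: int) -> float:
    """B = Gamma(d/2) / (Gamma(k/2) * Gamma((d-k)/2)),
    computed stably via log-gamma."""
    return math.exp(math.lgamma(d / 2) - math.lgamma(k / 2)
                    - math.lgamma((d - k) / 2))

# Riemann-sum check that B*f(s) is a probability density on [0, 1].
d, k, n = 40, 8, 200_000
f = lambda s: s ** ((k - 2) / 2) * (1 - s) ** ((d - k - 2) / 2)
total = sum(f(i / n) for i in range(1, n)) / n
print(B(d, k) * total)   # ~1.0
```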
In this section, we derive lower and upper bounds for $B$ (see Equation (38)) by using the following form of Stirling's approximation for $n!$ due to Robbins (1955):

$$\sqrt{2\pi}\,n^{n+1/2}e^{-n}e^{\frac{1}{12n+1}} < \Gamma(n+1) = n! < \sqrt{2\pi}\,n^{n+1/2}e^{-n}e^{\frac{1}{12n}}.$$

Since we are interested in the asymptotic behavior, we focus on the case where $d$ is even. This choice does not affect the asymptotic results of our paper, but the calculations are more straightforward in this case. We leave the details for the case where $d$ is odd to the interested reader.

Lemma 7 Suppose $k$ and $d$ are both even. Then we have the following inequality:²

$$\frac{e^{-2}}{2\sqrt{\pi}}\,\frac{(d-2)^{(d-1)/2}}{(k-2)^{(k-1)/2}(d-k-2)^{(d-k-1)/2}} \leq B \leq \frac{e^{-1}}{2\sqrt{\pi}}\,\frac{(d-2)^{(d-1)/2}}{(k-2)^{(k-1)/2}(d-k-2)^{(d-k-1)/2}}.$$

Proof Using the bound on $n!$ from Robbins (1955), we obtain

$$C_0\,\frac{(d-2)^{(d-1)/2}}{(k-2)^{(k-1)/2}(d-k-2)^{(d-k-1)/2}} \leq B \leq C_1\,\frac{(d-2)^{(d-1)/2}}{(k-2)^{(k-1)/2}(d-k-2)^{(d-k-1)/2}},$$

where

$$C_0 = \frac{e^{-1}}{2\sqrt{\pi}}\,e^{\frac{1}{6(d-2)+1}}\,e^{-\frac{1}{6(k-2)}}\,e^{-\frac{1}{6(d-k-2)}} \geq \frac{e^{-2}}{2\sqrt{\pi}} \quad\text{and}\quad C_1 = \frac{e^{-1}}{2\sqrt{\pi}}\,e^{\frac{1}{6(d-2)}}\,e^{-\frac{1}{6(k-2)+1}}\,e^{-\frac{1}{6(d-k-2)+1}} \leq \frac{e^{-1}}{2\sqrt{\pi}}.$$

Corollary 8 With $s_0 = k/d$, we have

$$\frac{e^{-2}}{2\sqrt{\pi}}\sqrt{k} \leq Bs_0f(s_0) \leq \frac{9e^{-1}}{2\sqrt{\pi}}\sqrt{k}.$$

Proof By evaluating $f$ at $s_0$ and replacing $B$ by its lower bound found in Lemma 7, we obtain the lower bound after simplification. Similarly, by using the upper bound in Lemma 7, we obtain the upper bound, where the final simplification follows from $\frac{d}{d-k} \leq 2$ (since $s_0 < 0.4$) and $\left(\frac{x}{x-2}\right)^{(x-1)/2} \leq 3$ for $x \geq 3$.

² Throughout this section, it is possible to derive tighter bounds on the constants in the following inequalities, but the ones appearing here are sufficient for our proofs. We leave the details of the tighter bounds to the interested reader.

3.2. Bounds on $\mathrm{Prob}[s > s_0(1+\epsilon)]$

We begin by mentioning the following inequalities, which are used in our arguments below:

$$\log(1+x) \geq x - \frac{x^2}{2} \quad\text{for } 0 < x < 1, \tag{7}$$

$$\log(1-x) \geq -x - x^2 \quad\text{for } 0 < x < 0.68, \tag{8}$$

$$\log(1+x) \leq x \quad\text{for } x > -1, \text{ and} \tag{9}$$

$$\log(1+x) \leq x - \frac{x^2}{2} + \frac{x^3}{3} \quad\text{for } x > -1. \tag{10}$$

These bounds can be verified by employing basic calculus techniques (e.g., derivatives and Taylor expansions) as well as sufficiently accurate approximations. Using these inequalities, we derive the following bounds:

Lemma 9 We have

$$\mathrm{Prob}[s > s_0(1+\epsilon)] \geq \frac{e^{-2}}{4\sqrt{\pi}}\,e^{-\frac{(\sqrt{k}\epsilon+1)^2}{4}\cdot\frac{1+s_0}{1-s_0}}.$$

Moreover, when $k < \eta\epsilon^{-2}\log\frac{1}{\delta}$,

$$\mathrm{Prob}[s > s_0(1+\epsilon)] \geq \frac{e^{-2}}{4\sqrt{\pi}}\,\delta^{\frac{\eta}{4}\gamma_1}, \qquad \gamma_1 = \left(1 + (\eta\log(1/\delta))^{-1/2}\right)^2\frac{1+s_0}{1-s_0}.$$

Additionally, $\gamma_1$ approaches 1 as $\epsilon$, $\delta$, and $s_0$ approach 0.

Proof Note that

$$\mathrm{Prob}[s > s_0(1+\epsilon)] = B\int_{s>s_0(1+\epsilon)} f(s)\,ds = Bs_0\int_\epsilon^{(1-s_0)/s_0} f(s_0(1+x))\,dx, \tag{11}$$

via the substitution $s = s_0(1+x)$. Let $g(s) = s^{k/2}(1-s)^{(d-k)/2}$; then $f(s_0(1+x))$ can be expressed in terms of $g(s)$, namely,

$$f(s_0(1+x)) = \frac{g(s_0(1+x))}{s_0(1+x)\,(1 - s_0(1+x))}. \tag{12}$$

To find a lower bound on $\mathrm{Prob}[s > s_0(1+\epsilon)]$, we bound $g(s_0(1+x))$ from below. Taking the logarithm of $g(s_0(1+x))$, we find

$$\log\left(g(s_0(1+x))\right) = \log g(s_0) + \frac{d}{2}\left[s_0\log(1+x) + (1-s_0)\log\left(1 - \frac{s_0}{1-s_0}x\right)\right]. \tag{13}$$

Restricting $x$ to the interval $0 \leq x < 1$, it then follows that $0 \leq \frac{s_0}{1-s_0}x < 0.68$ from the assumption that $s_0 < 0.4$. We now bound the bracketed term in Equation (13) using Inequalities (7) and (8), as follows:

$$s_0\log(1+x) + (1-s_0)\log\left(1 - \frac{s_0}{1-s_0}x\right) \geq -\frac{s_0(1+s_0)}{2(1-s_0)}\,x^2. \tag{14}$$

Hence, by substituting Inequality (14) into Equation (13) and exponentiating, we obtain the following lower bound for $g(s_0(1+x))$:

$$g(s_0(1+x)) \geq g(s_0)\,e^{-\frac{k(1+s_0)}{4(1-s_0)}x^2}. \tag{15}$$

We also observe that, since $s_0 < 0.4$ and $0 \leq x < 1$, the denominator of Equation (12) is bounded from above as follows:

$$s_0(1+x)\left(1 - s_0(1+x)\right) \leq 2s_0(1-s_0). \tag{16}$$

Therefore, by substituting Inequalities (15) and (16) into Equation (12) when $0 \leq x < 1$, we have

$$f(s_0(1+x)) \geq \frac{g(s_0)}{2s_0(1-s_0)}\,e^{-\frac{k(1+s_0)}{4(1-s_0)}x^2} = \frac{f(s_0)}{2}\,e^{-\frac{k(1+s_0)}{4(1-s_0)}x^2}. \tag{17}$$
Since $s_0 < 0.4$, we observe that $\frac{1}{s_0} - 1 = \frac{d-k}{k} > 1.5$. Since we assumed that $\epsilon \leq \frac{1}{2}$ and $k \geq 4 + \epsilon^{-2} > 4$, it follows that $\epsilon + k^{-1/2} < 1$, and so $\epsilon + k^{-1/2} < 1 < \frac{1}{s_0} - 1$. Therefore, we further restrict $x$ to the interval $(\epsilon, \epsilon + k^{-1/2})$ and observe that Inequality (17) applies in this range. Therefore, since $f$ is a positive function,

$$Bs_0\int_\epsilon^{(1-s_0)/s_0} f(s_0(1+x))\,dx \geq Bs_0\int_\epsilon^{\epsilon+k^{-1/2}} f(s_0(1+x))\,dx \geq \frac{1}{2}Bs_0f(s_0)\int_\epsilon^{\epsilon+k^{-1/2}} e^{-\frac{k(1+s_0)}{4(1-s_0)}x^2}\,dx. \tag{18}$$

Replacing $Bs_0f(s_0)$ with its lower bound given in Corollary 8 and observing that the integrand is decreasing over an interval of width $k^{-1/2}$, Inequality (18) is bounded from below by

$$\frac{1}{2}Bs_0f(s_0)\int_\epsilon^{\epsilon+k^{-1/2}} e^{-\frac{k(1+s_0)}{4(1-s_0)}x^2}\,dx \geq \frac{e^{-2}}{4\sqrt{\pi}}\,e^{-\frac{(\sqrt{k}\epsilon+1)^2}{4}\cdot\frac{1+s_0}{1-s_0}},$$

which completes the first inequality. The second inequality follows by replacing $k$ by the given upper bound and simplifying.

Lemma 10 We have

$$\mathrm{Prob}[s > s_0(1+\epsilon)] \leq \frac{27e^{-1}}{2\sqrt{\pi}}\,e^{-\frac{k-2}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)}.$$

Proof To derive an upper bound, we start with the expression for $\mathrm{Prob}[s > (1+\epsilon)s_0]$ from Equation (11). We first find an upper bound on $f(s_0(1+x))$ using the inequality $1 - x \leq e^{-x}$ for all $x$, as follows:

$$f(s_0(1+x)) = f(s_0)(1+x)^{\frac{k-2}{2}}\left(1 - \frac{s_0}{1-s_0}x\right)^{\frac{d-k-2}{2}} \leq f(s_0)(1+x)^{\frac{k-2}{2}}\,e^{-\frac{s_0}{1-s_0}\cdot\frac{d-k-2}{2}x}. \tag{19}$$

Moreover, since $s_0 < 0.4$,

$$\frac{s_0}{1-s_0}\cdot\frac{d-k-2}{2} = \frac{k}{2}\cdot\frac{d-k-2}{d-k} = \frac{k}{2} - \frac{k}{d-k} \geq \frac{k}{2} - \frac{2}{3} \geq \frac{k-2}{2}. \tag{20}$$

By applying Inequality (20) to Inequality (19), we derive the upper bound

$$f(s_0(1+x)) \leq f(s_0)(1+x)^{\frac{k-2}{2}}\,e^{-\frac{k-2}{2}x}.$$

Therefore, by extending the region of integration in Inequality (11), we find the following upper bound on the probability:

$$\mathrm{Prob}[s > s_0(1+\epsilon)] \leq Bs_0f(s_0)\int_\epsilon^\infty (1+x)^{\frac{k-2}{2}}\,e^{-\frac{k-2}{2}x}\,dx. \tag{21}$$

By integrating by parts, we observe that, for any $\ell$ and $m$ with $1 \leq \ell \leq m$,

$$\int_\epsilon^\infty (1+x)^\ell e^{-mx}\,dx \leq \frac{1}{m}(1+\epsilon)^\ell e^{-m\epsilon} + \int_\epsilon^\infty (1+x)^{\ell-1}e^{-mx}\,dx. \tag{22}$$

Applying Inequality (22) $\frac{k-2}{2}$ times to the integral in Inequality (21) and bounding the resulting geometric series from above gives

$$\int_\epsilon^\infty (1+x)^{\frac{k-2}{2}}e^{-\frac{k-2}{2}x}\,dx \leq \frac{2e^{-\frac{k-2}{2}\epsilon}}{k-2}\sum_{i=0}^{(k-2)/2}(1+\epsilon)^i \leq \frac{2(1+\epsilon)}{\epsilon(k-2)}(1+\epsilon)^{\frac{k-2}{2}}\,e^{-\frac{k-2}{2}\epsilon}.$$

By applying Inequality (10) to $(1+\epsilon)^{\frac{k-2}{2}} = e^{\frac{k-2}{2}\log(1+\epsilon)}$, we obtain the upper bound

$$\frac{2(1+\epsilon)}{\epsilon(k-2)}(1+\epsilon)^{\frac{k-2}{2}}\,e^{-\frac{k-2}{2}\epsilon} \leq \frac{2(1+\epsilon)}{\epsilon(k-2)}\,e^{-\frac{k-2}{2}\left(\frac{\epsilon^2}{2}-\frac{\epsilon^3}{3}\right)}.$$

Since $k \geq 4 + \epsilon^{-2}$, it follows that $\frac{(k-2)^2}{k} \geq k - 4 \geq \epsilon^{-2}$, and, hence, that $\epsilon(k-2) \geq \sqrt{k}$. Therefore, we can further simplify our bound to

$$\frac{2(1+\epsilon)}{\epsilon(k-2)}\,e^{-\frac{k-2}{2}\left(\frac{\epsilon^2}{2}-\frac{\epsilon^3}{3}\right)} \leq \frac{2(1+\epsilon)}{\sqrt{k}}\,e^{-\frac{k-2}{2}\left(\frac{\epsilon^2}{2}-\frac{\epsilon^3}{3}\right)}. \tag{23}$$

By combining Inequalities (21) and (23), we find an upper bound on the probability as follows:

$$\mathrm{Prob}[s > s_0(1+\epsilon)] \leq Bs_0f(s_0)\,\frac{2(1+\epsilon)}{\sqrt{k}}\,e^{-\frac{k-2}{2}\left(\frac{\epsilon^2}{2}-\frac{\epsilon^3}{3}\right)}.$$

Applying the upper bound on $Bs_0f(s_0)$ from Corollary 8 and the assumption that $\epsilon \leq \frac{1}{2}$ completes the inequality.

3.3. Bounds on $\mathrm{Prob}[s < s_0(1-\epsilon)]$

We begin this section with the following two additional inequalities on $\log(1-x)$:

$$\log(1-x) \geq -x - \frac{x^2}{2} - x^3 \quad\text{for } 0 < x < 0.815, \tag{24}$$

$$\log(1-x) \leq -x - \frac{x^2}{2} \quad\text{for } 0 \leq x < 1. \tag{25}$$

These bounds can be justified using a similar approach as for Inequalities (7)-(10). Using these inequalities, we derive the following bounds:

Lemma 11 We have

$$\mathrm{Prob}[s < s_0(1-\epsilon)] \geq \frac{e^{-2}}{4\sqrt{\pi}}\,e^{-\frac{(\sqrt{k}\epsilon+1)^2}{4(1-s_0)} - \frac{(\sqrt{k}\epsilon+1)^3}{2\sqrt{k}}}.$$

Moreover, when $k < \eta\epsilon^{-2}\log\frac{1}{\delta}$,

$$\mathrm{Prob}[s < s_0(1-\epsilon)] \geq \frac{e^{-2}}{4\sqrt{\pi}}\,\delta^{\frac{\eta}{4}\gamma_2}, \qquad \gamma_2 = \frac{\left(1 + (\eta\log(1/\delta))^{-1/2}\right)^2}{1-s_0} + 2\epsilon\sqrt{\eta\log(1/\delta)}\left(1 + (\eta\log(1/\delta))^{-1/2}\right)^3.$$

Additionally, $\gamma_2$ approaches 1 as $\epsilon$, $\delta$, and $s_0$ approach 0.

Proof The proof of this lemma is very similar to the proof of Lemma 9, so we focus on the new details. The probability can be rewritten, using the substitution $s = s_0(1-x)$, as

$$\mathrm{Prob}[s < (1-\epsilon)s_0] = B\int_{s<(1-\epsilon)s_0} f(s)\,ds = Bs_0\int_\epsilon^1 f(s_0(1-x))\,dx. \tag{26}$$

Arguing as in the proof of Lemma 9, but using Inequalities (24) and (7) in place of Inequalities (7) and (8), we obtain, for $0 \leq x < 1$,

$$f(s_0(1-x)) \geq \frac{f(s_0)}{2}\,e^{-\frac{k}{4}\left(\frac{x^2}{1-s_0} + 2x^3\right)}. \tag{32}$$

Since $\epsilon \leq \frac{1}{2}$ and $k \geq 4 + \epsilon^{-2} > 4$, it follows that $\epsilon + k^{-1/2} < 1$. Therefore, we restrict $x$ to the interval $(\epsilon, \epsilon + k^{-1/2})$ and observe that Inequality (32) applies in this range. Therefore,

$$Bs_0\int_\epsilon^1 f(s_0(1-x))\,dx \geq \frac{1}{2}Bs_0f(s_0)\int_\epsilon^{\epsilon+k^{-1/2}} e^{-\frac{k}{4}\left(\frac{x^2}{1-s_0}+2x^3\right)}\,dx. \tag{33}$$

Replacing $Bs_0f(s_0)$ with its lower bound given in Corollary 8 and observing that the integrand is decreasing on an interval of width $k^{-1/2}$, Inequality (33) is bounded from below by

$$\frac{1}{2}Bs_0f(s_0)\int_\epsilon^{\epsilon+k^{-1/2}} e^{-\frac{k}{4}\left(\frac{x^2}{1-s_0}+2x^3\right)}\,dx \geq \frac{e^{-2}}{4\sqrt{\pi}}\,e^{-\frac{(\sqrt{k}\epsilon+1)^2}{4(1-s_0)} - \frac{(\sqrt{k}\epsilon+1)^3}{2\sqrt{k}}},$$

which completes the first inequality.
The second inequality follows by replacing $k$ by the given upper bound and simplifying.

Lemma 12 We have

$$\mathrm{Prob}[s < s_0(1-\epsilon)] \leq \frac{18\,e^{1/2}}{\sqrt{\pi}}\sqrt{\frac{k}{k-4}}\;e^{-\left(\frac{k}{4}-1\right)\epsilon^2} \leq \frac{18\sqrt{2}\,e^{1/2}}{\sqrt{\pi}}\,e^{-\left(\frac{k}{4}-1\right)\epsilon^2}.$$

Proof The proof of this lemma is very similar to the proof of Lemma 10, so we focus on the new details. To prove an upper bound, we start with the bound on $\mathrm{Prob}[s < s_0(1-\epsilon)]$ from Equation (26). We first observe that

$$f(s_0(1-x)) = f(s_0)(1-x)^{\frac{k-2}{2}}\left(1 + \frac{s_0}{1-s_0}x\right)^{\frac{d-k-2}{2}}. \tag{34}$$

We now bound the logarithm of the second and third factors in Equation (34) using Inequalities (25) and (9), as follows:

$$\frac{k-2}{2}\log(1-x) + \frac{d-k-2}{2}\log\left(1 + \frac{s_0}{1-s_0}x\right) \leq \frac{k-2}{2}\left(-x - \frac{x^2}{2}\right) + \frac{d-k-2}{2}\cdot\frac{s_0}{1-s_0}x. \tag{35}$$

Since $\frac{s_0}{1-s_0} = \frac{k}{d-k}$ and $\epsilon \leq x < 1$, Inequality (35) further simplifies to

$$-\frac{k-2}{2}x - \frac{k-2}{4}x^2 + \frac{k}{2}x \leq x - \frac{k-2}{4}x^2 \leq \frac{3}{2} - \left(\frac{k}{4}-1\right)x^2.$$

Hence, for $\epsilon \leq x < 1$, we have

$$f(s_0(1-x)) \leq f(s_0)\,e^{3/2}\,e^{-\left(\frac{k}{4}-1\right)x^2} \leq f(s_0)\,e^{3/2}\,e^{-\left(\frac{k}{4}-1\right)\epsilon x}. \tag{36}$$

Substituting Inequality (36) into the integral of Equation (26), we find

$$Bs_0\int_\epsilon^1 f(s_0(1-x))\,dx \leq Bs_0f(s_0)\,e^{3/2}\int_\epsilon^\infty e^{-\left(\frac{k}{4}-1\right)\epsilon x}\,dx \leq Bs_0f(s_0)\,\frac{4e^{3/2}}{(k-4)\epsilon}\,e^{-\left(\frac{k}{4}-1\right)\epsilon^2}. \tag{37}$$

Since $k \geq 4 + \epsilon^{-2}$, Inequality (37) can be further simplified to

$$Bs_0f(s_0)\,\frac{4e^{3/2}}{(k-4)\epsilon}\,e^{-\left(\frac{k}{4}-1\right)\epsilon^2} \leq Bs_0f(s_0)\,\frac{4e^{3/2}}{\sqrt{k-4}}\,e^{-\left(\frac{k}{4}-1\right)\epsilon^2}.$$

Applying the upper bound on $Bs_0f(s_0)$ from Corollary 8 completes the first inequality of the proof. The final inequality follows from the fact that $\sqrt{k}\,(k-4)^{-1/2} \leq \sqrt{2}$ for $k \geq 8$.

This completes the proofs of all of the conditions in Section 2 and, therefore, completes the proof of our main theorem, Theorem 2.

Acknowledgments

This work was supported by a grant from the Simons Foundation (#282399 to Michael Burr) and National Science Foundation Grants CCF-1527193, CCF-1407623, DMS-1403062, and DMS-1547399.

Appendix A. Uniform Distributions on Unit Spheres in High Dimensions

In this section, we study uniform distributions and surface areas (or hypervolumes) on high-dimensional spheres. More precisely, for any $d \geq 1$, let $S^{d-1}$ denote the unit sphere of dimension $d-1$ and $d\Omega_{d-1}$ denote the surface area measure for $S^{d-1}$. We show that, for any $1 \leq k \leq d$,

$$d\Omega_{d-1} = \frac{1}{2}f(s)\,ds\,d\Omega_{k-1}\,d\Omega_{d-k-1},$$

where $s \in [0,1]$ and $f(s) = s^{(k-2)/2}(1-s)^{(d-k-2)/2}$. This is a more precise version of a result in Kane et al. (2011), replacing an unspecified constant by $1/2$. We include the proof of this result here in order to make it more accessible to a larger population of researchers. This formula is of independent interest since it shows that the uniform distribution on $S^{d-1}$ is a product of the uniform distributions on $S^{k-1}$ and $S^{d-k-1}$ with a distribution on $[0,1]$.

Theorem 13 Under the almost bijective map $\Psi : S^{d-1}\to[0,1]\times S^{k-1}\times S^{d-k-1}$, we have equality of the surface area differential forms on $S^{d-1}$, $S^{k-1}$, and $S^{d-k-1}$, i.e.,

$$d\Omega_{d-1} = \frac{1}{2}f(s)\,ds\,d\Omega_{k-1}\,d\Omega_{d-k-1},$$

where $f(s) = s^{(k-2)/2}(1-s)^{(d-k-2)/2}$. Equivalently, in terms of probability distribution measures,

$$\frac{d\Omega_{d-1}}{\mathrm{Vol}_{d-1}(S^{d-1})} = Bf(s)\,ds\;\frac{d\Omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\;\frac{d\Omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})}.$$

Hence, the uniform distribution on $S^{d-1}$ can be identified with the product distribution on $[0,1]\times S^{k-1}\times S^{d-k-1}$, where the distribution of $s$ on $[0,1]$ has density function $Bf(s)$ and

$$B = \frac{\mathrm{Vol}_{k-1}(S^{k-1})\,\mathrm{Vol}_{d-k-1}(S^{d-k-1})}{2\,\mathrm{Vol}_{d-1}(S^{d-1})} = \frac{\Gamma(d/2)}{\Gamma(k/2)\,\Gamma((d-k)/2)}. \tag{38}$$

This theorem is based on the following lemma, which is well-known to experts but is included here for completeness.

Lemma 14 Let $x = (x_1,\ldots,x_d)$ with $x_d > 0$ be a point on the upper hemisphere of $S^{d-1}$. Then the surface area measure of the unit sphere $S^{d-1}$ at $x$ is $\frac{1}{x_d}\,dx_1\cdots dx_{d-1}$.

Before we begin the proof, we recall the approach for $S^2$ in 3-dimensional space. We consider the upper hemisphere of $S^2$ as the graph of a function over $D^2$, where $D^{d-1}$ denotes the $(d-1)$-dimensional disk, namely

$$D^{d-1} = \left\{(\hat{x}_1,\ldots,\hat{x}_{d-1}) : \sum_{i=1}^{d-1}\hat{x}_i^2 \leq 1\right\}.$$

We then integrate over the disk $D^2$ to calculate the surface area of $S^2$.
In particular, the integrand is the limit of the ratio of the area of a square in $D^2$ to the area of the corresponding parallelogram above the square in the tangent space of $S^2$ as the square shrinks to a point. In the case of the sphere $S^2$, the parallelogram's area is calculated using the cross product, but we must replace the use of the cross product in higher dimensions.

Proof In $d$-dimensional space, we consider the upper hemisphere of $S^{d-1}$ as the graph of a function over the $(d-1)$-dimensional disk $D^{d-1}$. We construct a pair of $(d-1)$-dimensional parallelepipeds as follows: $P^{d-1}$ is in the tangent space of $D^{d-1}$ and $Q^{d-1}$ is in the tangent space of $S^{d-1}$. Then, we take the limit of their $(d-1)$-dimensional volumes as $P^{d-1}$ shrinks to a point. Due to complications in taking the $(d-1)$-dimensional volume in $d$-dimensional space, we extend both $P^{d-1}$ and $Q^{d-1}$ to associated, full-dimensional parallelepipeds.

Let $(\hat x_1,\ldots,\hat x_{d-1}) \in D^{d-1}$ and define $\varphi : D^{d-1}\to\mathbb{R}_{\geq 0}$ as

$$\varphi(\hat x_1,\ldots,\hat x_{d-1}) = \sqrt{1 - \sum_{i=1}^{d-1}\hat x_i^2}.$$

We observe that the graph of this function is the upper hemisphere of $S^{d-1}$. We now extend this map to $D^{d-1}\times\mathbb{R}$ as $\Phi_d : D^{d-1}\times\mathbb{R}\to\mathbb{R}^d$ defined by

$$(\hat x_1,\ldots,\hat x_{d-1},\hat x_d) \mapsto \left((1+\hat x_d)\hat x_1,\ (1+\hat x_d)\hat x_2,\ \ldots,\ (1+\hat x_d)\hat x_{d-1},\ (1+\hat x_d)\varphi(\hat x_1,\ldots,\hat x_{d-1})\right).$$

We observe that $\Phi_d|_{D^{d-1}\times\{0\}}$ maps the disk $D^{d-1}\times\{0\}$ surjectively onto the graph of $\varphi$, i.e., the upper hemisphere of $S^{d-1}$; see Figure 2.

Figure 2: $\Phi_d$ maps the disk $D^{d-1}\times\{0\}$ surjectively onto the upper hemisphere of the $(d-1)$-dimensional unit sphere $S^{d-1}$.

We observe that $(\Phi_d)^{-1} = \pi$ is the projection map onto the first $d-1$ coordinates. We recall that, at $(\hat x_1,\ldots,\hat x_{d-1})\in D^{d-1}$, the tangent space is $\mathbb{R}^{d-1}$, and we define the parallelepiped $P^{d-1}$ in the tangent space by the vectors $\Delta\hat x_i\,e_i$ of length $\Delta\hat x_i$ in the direction of the $i$th standard basis vector $e_i$ of $\mathbb{R}^{d-1}$. Similarly, the tangent space at $(\hat x_1,\ldots,\hat x_{d-1},0)\in D^{d-1}\times\mathbb{R}$ is $\mathbb{R}^d$, and we define the parallelepiped $P^d$ in this tangent space by the vectors $\Delta\hat x_i\,e_i$ for $1\leq i\leq d-1$ and $h\,e_d$ for the final direction. Then, as the vector $h\,e_d$ is perpendicular to the tangent vectors of the disk $D^{d-1}$, the $d$-dimensional volume of $P^d$ can be computed in terms of the $(d-1)$-dimensional volume of $P^{d-1}$ and the height $h$, i.e., $\mathrm{Vol}_d(P^d) = h\,\mathrm{Vol}_{d-1}(P^{d-1})$.

Next, we let $Q^{d-1}$ and $Q^d$ be the images of $P^{d-1}$ and $P^d$ under the Jacobian of $\Phi_d$, i.e., $\mathrm{Jac}\,\Phi_d$, respectively. Note that the Jacobian of $\Phi_d$, when restricted to $D^{d-1}\times\{0\}$, is

$$(\mathrm{Jac}\,\Phi_d)|_{D^{d-1}\times\{0\}} = \begin{pmatrix} 1 & & & \hat x_1 \\ & \ddots & & \vdots \\ & & 1 & \hat x_{d-1} \\ -\frac{\hat x_1}{\varphi(\hat x_1,\ldots,\hat x_{d-1})} & \cdots & -\frac{\hat x_{d-1}}{\varphi(\hat x_1,\ldots,\hat x_{d-1})} & \varphi(\hat x_1,\ldots,\hat x_{d-1}) \end{pmatrix}.$$

Since the Jacobian acts on tangent vectors, $Q^{d-1}$ is defined by the vectors

$$\Delta\hat x_i\left(f_i - \frac{\hat x_i}{\varphi(\hat x_1,\ldots,\hat x_{d-1})}f_d\right) \quad\text{for } 1\leq i\leq d-1,$$

where $f_i$ is the $i$th standard basis vector of $\mathbb{R}^d$. Moreover, $Q^d$ is defined by these vectors as well as the image of $h\,e_d$, i.e.,

$$h\left(\hat x_1,\ldots,\hat x_{d-1},\varphi(\hat x_1,\ldots,\hat x_{d-1})\right).$$

We observe that the vectors $\Delta\hat x_i\left(f_i - \frac{\hat x_i}{\varphi(\hat x_1,\ldots,\hat x_{d-1})}f_d\right)$ for $1\leq i\leq d-1$ are tangent vectors to the sphere $S^{d-1}$, and that $h(\hat x_1,\ldots,\hat x_{d-1},\varphi(\hat x_1,\ldots,\hat x_{d-1}))$ is the outward-pointing surface normal with length $h$. Since the tangent vectors are perpendicular to the outward-pointing normal, the volumes of $Q^{d-1}$ and $Q^d$ have a similar relationship as the volumes of $P^{d-1}$ and $P^d$, i.e., $\mathrm{Vol}_d(Q^d) = h\,\mathrm{Vol}_{d-1}(Q^{d-1})$. Therefore, the ratio between the $d$-dimensional volumes of $Q^d$ and $P^d$ is the same as the ratio of the $(d-1)$-dimensional volumes of $Q^{d-1}$ and $P^{d-1}$.
Since $Q^d$ is the image of $P^d$ under the linear map $(\mathrm{Jac}\,\Phi_d)|_{D^{d-1}\times\{0\}}$, it follows that

$$\mathrm{Vol}_d(Q^d) = \left|\det(\mathrm{Jac}\,\Phi_d)|_{(\hat x_1,\ldots,\hat x_{d-1},0)}\right|\,\mathrm{Vol}_d(P^d).$$

Therefore, the ratio of the volumes of $Q^{d-1}$ and $P^{d-1}$ is $\left|\det(\mathrm{Jac}\,\Phi_d)|_{(\hat x_1,\ldots,\hat x_{d-1},0)}\right|$. It is straightforward to compute the determinant of $\mathrm{Jac}(\Phi_d)$ at the point $(\hat x_1,\ldots,\hat x_{d-1},0)$ via a few row reductions to eliminate the first $d-1$ entries in the last row and turn the matrix into an upper triangular matrix whose lower right corner is $\frac{1}{\varphi(\hat x_1,\ldots,\hat x_{d-1})}$. Hence, the determinant of $\mathrm{Jac}(\Phi_d)$ is $\frac{1}{\varphi(\hat x_1,\ldots,\hat x_{d-1})}$, which is the desired scaling factor. Therefore, $\frac{1}{\varphi(\hat x_1,\ldots,\hat x_{d-1})}$ is the local stretching factor of the surface area in the map from $D^{d-1}$ to $S^{d-1}$.

We recall that the coordinates $x_1,\ldots,x_d$ are the coordinates on the upper hemisphere and $\hat x_1,\ldots,\hat x_{d-1}$ are the coordinates on $D^{d-1}$. Since, under the map $\Phi_d|_{D^{d-1}\times\{0\}}$, $x_i = \hat x_i$ for $1\leq i\leq d-1$, it follows that $d\hat x_1\cdots d\hat x_{d-1} = dx_1\cdots dx_{d-1}$ and $\varphi(\hat x_1,\ldots,\hat x_{d-1}) = x_d$. From here, the result follows directly.

Proof of Theorem 13 Let $(w_1,\ldots,w_d)$, $(x_1,\ldots,x_k)$, and $(y_1,\ldots,y_{d-k})$ be coordinates of points on the $(d-1)$-dimensional unit sphere, the $(k-1)$-dimensional unit sphere, and the $(d-k-1)$-dimensional unit sphere, respectively. Let $(\hat w_1,\ldots,\hat w_{d-1})$, $(\hat x_1,\ldots,\hat x_{k-1})$, and $(\hat y_1,\ldots,\hat y_{d-k-1})$ be the coordinates of points on the disks $D^{d-1}$, $D^{k-1}$, and $D^{d-k-1}$, respectively. Let $\phi : [0,1]\times D^{k-1}\times D^{d-k-1}\to D^{d-1}$ be defined by

$$\left(s,(\hat x_1,\ldots,\hat x_{k-1}),(\hat y_1,\ldots,\hat y_{d-k-1})\right) \mapsto \left(\sqrt{s}\,\hat x_1,\ \ldots,\ \sqrt{s}\,\hat x_{k-1},\ \sqrt{s}\sqrt{1-\sum_{i=1}^{k-1}\hat x_i^2},\ \sqrt{1-s}\,\hat y_1,\ \ldots,\ \sqrt{1-s}\,\hat y_{d-k-1}\right).$$

We observe that $\phi$ maps the disks $D^{k-1}$ and $D^{d-k-1}$ onto the half of the disk $D^{d-1}$ whose $k$th coordinate is nonnegative. As the measure of the image is the measure of the preimage scaled by the determinant of the Jacobian of $\phi$, the surface area measure of the disk $D^{d-1}$ is

$$d\hat w_1\cdots d\hat w_{d-1} = |\det\mathrm{Jac}(\phi)|\,ds\,d\hat x_1\cdots d\hat x_{k-1}\,d\hat y_1\cdots d\hat y_{d-k-1}. \tag{39}$$

For $s\in(0,1)$, writing $\hat x_k = \sqrt{1-\sum_{i=1}^{k-1}\hat x_i^2}$, the Jacobian of $\phi$ is

$$\mathrm{Jac}(\phi) = \begin{pmatrix} \frac{\hat x_1}{2\sqrt{s}} & \sqrt{s} & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \frac{\hat x_{k-1}}{2\sqrt{s}} & 0 & \cdots & \sqrt{s} & 0 & \cdots & 0 \\ \frac{\hat x_k}{2\sqrt{s}} & -\frac{\sqrt{s}\,\hat x_1}{\hat x_k} & \cdots & -\frac{\sqrt{s}\,\hat x_{k-1}}{\hat x_k} & 0 & \cdots & 0 \\ -\frac{\hat y_1}{2\sqrt{1-s}} & 0 & \cdots & 0 & \sqrt{1-s} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ -\frac{\hat y_{d-k-1}}{2\sqrt{1-s}} & 0 & \cdots & 0 & 0 & \cdots & \sqrt{1-s} \end{pmatrix}.$$

Eliminating all but the $k$th entry of the first column by adding multiples of the other columns to the first column, we obtain

$$\det\mathrm{Jac}(\phi) = \frac{1}{2\hat x_k}\,s^{\frac{k-2}{2}}(1-s)^{\frac{d-k-1}{2}}.$$

Substituting this value into Expression (39), we have the surface area measure of $D^{d-1}$ in terms of those of the disks $D^{k-1}$ and $D^{d-k-1}$. That is,

$$d\hat w_1\cdots d\hat w_{d-1} = \frac{1}{2\hat x_k}\,s^{(k-2)/2}(1-s)^{(d-k-1)/2}\,ds\,d\hat x_1\cdots d\hat x_{k-1}\,d\hat y_1\cdots d\hat y_{d-k-1}. \tag{40}$$

We observe that the coordinates of the disk $D^{t-1}$ correspond to the first $t-1$ entries of the coordinates of the unit sphere $S^{t-1}$. Therefore, we may extend $\phi$ to the map $\Psi$, as defined above, where $\Psi^{-1} = \Phi_d\circ\phi\circ(\mathrm{id}_s\times(\Phi_k)^{-1}\times(\Phi_{d-k})^{-1})$. Employing the results of Lemma 14 in various dimensions, we rewrite the surface measure of each unit sphere in terms of the surface measure of the corresponding disk:

$$d\hat x_1\cdots d\hat x_{k-1} = dx_1\cdots dx_{k-1} = x_k\,d\Omega_{k-1},$$

$$d\hat y_1\cdots d\hat y_{d-k-1} = dy_1\cdots dy_{d-k-1} = y_{d-k}\,d\Omega_{d-k-1},$$

$$d\hat w_1\cdots d\hat w_{d-1} = dw_1\cdots dw_{d-1} = w_d\,d\Omega_{d-1}.$$

By applying the map $\Psi$, we can substitute these three equalities into Equation (40) to obtain

$$w_d\,d\Omega_{d-1} = \frac{x_k\,y_{d-k}}{2\hat x_k}\,s^{(k-2)/2}(1-s)^{(d-k-1)/2}\,ds\,d\Omega_{k-1}\,d\Omega_{d-k-1} = \frac{y_{d-k}}{2}\,s^{(k-2)/2}(1-s)^{(d-k-1)/2}\,ds\,d\Omega_{k-1}\,d\Omega_{d-k-1},$$

so that

$$d\Omega_{d-1} = \frac{1}{2}\,s^{(k-2)/2}(1-s)^{(d-k-2)/2}\,ds\,d\Omega_{k-1}\,d\Omega_{d-k-1} = \frac{1}{2}f(s)\,ds\,d\Omega_{k-1}\,d\Omega_{d-k-1},$$

where the second equality above uses $x_k = \hat x_k$ and the final equality follows from the fact that $w_d = \sqrt{1-s}\,y_{d-k}$ under the map $\Psi$. Since the cases where $s = 0$ or $s = 1$ have measure 0, the result follows.
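As an informal numerical sanity check of Lemma 14 (our own illustration, not part of the paper), note that for $d = 3$ the lemma says the hemisphere area of $S^2$ equals the integral of $1/x_3$ over the unit disk, which in polar coordinates is $\int_0^{2\pi}\int_0^1 r/\sqrt{1-r^2}\,dr\,d\theta = 2\pi$:

```python
import numpy as np

# Lemma 14 with d = 3: the hemisphere area of S^2 is the integral
# over D^2 of 1/x_3, i.e.
#   integral_0^{2pi} integral_0^1 r / sqrt(1 - r^2) dr dtheta = 2*pi.
# Midpoint rule on r; the integrand has an integrable singularity at
# r = 1, so the sum approaches 2*pi from below as n grows.
n = 1_000_000
r = (np.arange(n) + 0.5) / n             # midpoints of [0, 1]
integrand = r / np.sqrt(1.0 - r**2)
area = 2.0 * np.pi * integrand.sum() / n
print(area, 2.0 * np.pi)                 # ~6.28 vs 6.2832
```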
References

Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.

Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th ACM Symposium on Theory of Computing (STOC), pages 557-563, 2006.

Noga Alon. Problems and results in extremal combinatorics. Discrete Mathematics, 273:31-53, 2003.

Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63:161-182, 2006.

Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Knowledge Discovery and Data Mining, pages 245-250, 2001.

Emmanuel J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathématique. Académie des Sciences. Paris, 346:589-592, 2008.

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 81-90, 2013.

Graham Cormode and Piotr Indyk. Stable distributions in streaming computation. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management. Springer, Berlin, Heidelberg, 2016.

Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 341-350, 2010.

Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60-65, 2003.

Peter Frankl and Hiroshi Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, Series B, 44(3):355-362, 1988.

Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th ACM Symposium on Theory of Computing (STOC), pages 604-613, 1998.

T. S. Jayram and David P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Transactions on Algorithms, 9(26), 2013.

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1), 2014.

Daniel M. Kane, Raghu Meka, and Jelani Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In Proceedings of the 15th International Workshop on Randomization and Computation (RANDOM), pages 628-639, 2011.

Jessica Lin and Dimitrios Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, May 2003.

Jiří Matoušek. On variants of the Johnson-Lindenstrauss lemma. Random Structures and Algorithms, 33(2):142-156, 2008.

Nam H. Nguyen, Thong T. Do, and Trac D. Tran. A fast and efficient algorithm for low-rank approximation of a matrix. In Proceedings of the 41st ACM Symposium on Theory of Computing (STOC), 2009.

Herbert Robbins. A remark on Stirling's formula. American Mathematical Monthly, 62:26-29, 1955.

Shashanka Ubaru, Arya Mazumdar, and Yousef Saad. Low rank approximation and decomposition of large matrices using error correcting codes. IEEE Transactions on Information Theory, 63(9):5544-5558, 2017.
Kilian Q. Weinberger, Anirban Dasgupta, John Langford, Alexander J. Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1113-1120, 2009.