# Rehashing Kernel Evaluation in High Dimensions

Paris Siminelakis*, Kexin Rong*, Peter Bailis, Moses Charikar, Philip Levis
*Equal contribution. Stanford University, Stanford, California, US. Correspondence to: Paris Siminelakis, Kexin Rong.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract. Kernel methods are effective but do not scale well to large data, especially in high dimensions, where the geometric data structures used to accelerate kernel evaluation suffer from the curse of dimensionality. Recent theoretical advances have proposed fast kernel evaluation algorithms that leverage hashing techniques and obtain worst-case asymptotic improvements. However, these advances have largely remained confined to the theoretical realm due to concerns such as super-linear preprocessing time and diminishing gains on non-worst-case datasets. In this paper, we close the gap between theory and practice by addressing these challenges via provable and practical procedures for adaptive sample-size selection, preprocessing-time reduction, and refined variance bounds that quantify the data-dependent performance of random sampling and hashing-based kernel evaluation methods. Our experiments show that these new tools offer up to a 10× improvement in evaluation time on a range of synthetic and real-world datasets.

Figure 1. (a) For radially decreasing kernels (e.g. Gaussian), points close to the query have higher kernel values than those that are far. (b) Difficult case: random sampling does not perform well when a small number of close points contribute significantly to the query density. (c) Simple case: random sampling performs well when most points have similar kernel values (distances from the query).

1. Introduction

Kernel methods are a class of non-parametric methods used for a variety of tasks including density estimation, regression, clustering and distribution testing (Muandet et al., 2017). They have a long and influential history in statistical learning (Schölkopf et al., 2002) and scientific computing (Buhmann, 2003). However, scalability challenges during both training and inference limit their applicability to large, high-dimensional datasets: a larger training set improves accuracy but incurs a quadratic increase in overall evaluation time.

In this paper, we focus on the problem of fast approximate evaluation of the Kernel Density Estimate. Given a dataset $P = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$, and a vector of non-negative weights $u \in \mathbb{R}^n$, the weighted Kernel Density Estimate (KDE) at $q \in \mathbb{R}^d$ is given by $\mathrm{KDE}^u_P(q) := \sum_{i=1}^{n} u_i k(q, x_i)$. Our goal is, after some preprocessing, to efficiently estimate the KDE at a query point with $(1 \pm \epsilon)$ multiplicative accuracy under a small failure probability $\delta$. This problem is well studied (Greengard & Rokhlin, 1987; Gray & Moore, 2003; March et al., 2015), but in high dimensions only recently have novel importance sampling algorithms, called Hashing-Based-Estimators (HBE), demonstrated provable improvements over random sampling (RS) under worst-case assumptions (Charikar & Siminelakis, 2017). At its core, HBE uses a hash function $h$ (a randomized space partition) to construct, for any $q$, an unbiased estimator of $\mu := \mathrm{KDE}^u_P(q)$. (The value $\mu \in (0, 1]$ can be seen as a margin for the decision surface $\sum_{i=1}^{n} u_i k(q, x_i) \geq 0$.)
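To make the setting concrete, here is a minimal Python sketch (our own illustration, not the authors' C++ implementation) of exact weighted KDE evaluation and the plain random-sampling estimator that HBE is compared against; the function names and example data are ours.

```python
import numpy as np

def gaussian_kernel(q, x, sigma=1.0):
    # k(q, x) = exp(-||q - x||^2 / sigma^2), the Gaussian kernel used in the paper.
    return np.exp(-np.sum((q - x) ** 2) / sigma ** 2)

def kde_exact(q, P, u, sigma=1.0):
    # Exact weighted KDE: sum_i u_i * k(q, x_i); O(n d) work per query.
    return sum(ui * gaussian_kernel(q, xi, sigma) for xi, ui in zip(P, u))

def kde_random_sampling(q, P, u, m, sigma=1.0, rng=None):
    # Unbiased RS estimator: draw m indices I with probability u_I (u on the
    # simplex) and average k(q, x_I); E[k(q, x_I)] = sum_i u_i k(q, x_i) = mu.
    rng = np.random.default_rng() if rng is None else rng
    u = np.asarray(u, dtype=float)
    idx = rng.choice(len(P), size=m, p=u / u.sum())
    return float(np.mean([gaussian_kernel(q, P[i], sigma) for i in idx]))

# Example with uniform weights u_i = 1/n, as in the experiments of Section 7.
rng = np.random.default_rng(0)
P = rng.standard_normal((10000, 16))
u = np.full(len(P), 1.0 / len(P))
q = np.zeros(16)
print(kde_exact(q, P, u), kde_random_sampling(q, P, u, m=500))
```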
Charikar & Siminelakis (2017) showed that, given a hash function with evaluation time $T$ and collision probability $\Pr[h(x) = h(y)] = \Theta(\sqrt{k(x, y)})$, one can get a $(1 \pm \epsilon)$-approximation to any $\mu \geq \tau$ in time $O(T/\epsilon^2\sqrt{\mu})$ and space $O(n/\epsilon^2\sqrt{\tau})$, improving over random sampling, which requires time $O(1/\epsilon^2\mu)$ in the worst case. HBE improves upon RS by being better able to sample points with large weights (kernel values). In particular, RS does not perform well (Figure 1(b)) when a significant portion of the density comes from a few points with large weights (close to the query). HBE instead samples uniformly from a set of biased samples (the hash buckets that the query maps to) that has a higher probability of containing high-weight points. For radially decreasing kernels (Figure 1(a)), the biased sample can be produced via Locality Sensitive Hashing (LSH) (Indyk & Motwani, 1998). To obtain $m$ such biased samples for estimating the query density, the scheme requires building $m$ independent hash tables on the entire dataset. The runtime mentioned above is achieved by setting $m = O(1/\epsilon^2\sqrt{\mu})$. In practice, one cares about values of $\mu \geq 1/\sqrt{n}$, which is a lower bound on the order of the statistical fluctuations when $P$ consists of $n$ i.i.d. samples from some smooth distribution (Sriperumbudur et al., 2016).

Limitations. Despite progress on the worst-case query time, the idea of using hashing for kernel evaluation, introduced independently in (Spring & Shrivastava, 2017) and (Charikar & Siminelakis, 2017), has largely been confined to the theoretical realm due to a few practical concerns. First, straightforward implementations of the proposed approach require a large amount (e.g. $O(n^{5/4})$ for $\tau = 1/\sqrt{n}$) of preprocessing time and memory to create the requisite number of hash tables. Second, RS can outperform HBE on certain datasets (Figure 1(c)), a fact that is not captured by the worst-case theoretical bounds. Third, to implement this approach, Charikar & Siminelakis (2017) use an adaptive procedure to estimate the sufficient sample size $m$. For this procedure to work, the estimators need to satisfy a uniform polynomial decay of the variance as a function of $\mu$ that, for the popular Gaussian kernel, only the hashing scheme of Andoni & Indyk (2006) (with a significant $\exp(O(\log^{2/3} n))$ runtime and memory overhead per hash table) was shown to satisfy. These issues call the applicability of HBE into question, and to the best of our knowledge the method had not previously been shown to empirically improve upon competing methods.

1.1. Contributions

In this paper, we close the gap between theory and practice by making the following contributions:

1. Overhead reduction via sketching. HBE requires super-linear preprocessing time and memory, a result of having to hash the entire dataset once for each sample. To reduce this overhead, we develop new theoretical results on sketching the KDE by a weighted subset of points using hashing and non-uniform sampling (Section 6). Our Hashing-Based-Sketch (HBS) is able to better sample sparse regions of space while still having variance close to that of a random sketch, leading to better performance on low-density points. Applying HBE on top of the sketch results in considerable gains while maintaining accuracy.

2. Data-dependent diagnostics. Despite worst-case theoretical guarantees, RS outperforms HBE on certain datasets in practice.
To quantify this phenomenon, we introduce an inequality that bounds the variance of unbiased estimators that include HBE and RS (Section 4). We propose a diagnostic procedure utilizing the variance bounds that, for a given dataset, chooses at runtime with minimal overhead the better of the two approaches, and does so without invoking HBE; our evaluation shows that the procedure is both accurate and efficient. The new variance bound also allows us to simplify and recover theoretical results of Charikar & Siminelakis (2017).

3. A practical algorithm for the Gaussian kernel. We design a simplified version of the adaptive procedure of Charikar & Siminelakis (2017) that estimates the sufficient sample size $m$ in parallel with estimating the density $\mu$, and provide an improved analysis that results in an order-of-magnitude speedup in the query time (Section 5). More importantly, our new analysis shows that slightly weaker conditions (non-uniform but bounded polynomial decay) on the variance of the estimators are sufficient for the procedure to work. By proving that HBE based on the hashing scheme of Datar et al. (2004) satisfies this condition, we give the first practical algorithm that provably improves upon RS for the Gaussian kernel in high dimensions. Previously, it was not known how to use this hashing scheme within an adaptive procedure for this purpose.

We perform extensive experimental evaluations of our methods on a variety of synthetic benchmarks as well as real-world datasets. Our evaluation against other state-of-the-art methods for kernel evaluation shows:

- HBE outperforms all competing methods on synthetic worst-case instances with multi-scale structure and dimensions $d \geq 10$, as well as on structured instances with a moderate number of clusters (Section 7.1).
- For real-world datasets, HBE is always competitive with alternative methods and is faster on many datasets, by up to 10× over the next best method (Section 7.2).

In summary, our theoretical and experimental results constitute an important step toward making Hashing-Based-Estimators practical and, in turn, improving the scalability of kernel evaluation on large high-dimensional datasets.

2. Related Work

Fast Multipole Methods and Dual Tree Algorithms. Historically, the problem of KDE evaluation was first studied in low dimensions ($d \leq 3$) in the scientific computing literature, resulting in the Fast Multipole Method (FMM) (Greengard & Rokhlin, 1987). This influential line of work is based on a hierarchical spatial decomposition and runs in $O(2^d)$ time per evaluation point. Motivated by applications in statistical learning (Gray & Moore, 2001), the problem was revisited in the 2000s, and analogs of the Fast Multipole Method (Gray & Moore, 2003), known as Dual Tree Algorithms (Lee et al., 2006), were proposed that aimed to capture latent underlying low-dimensional structure (Ram et al., 2009). Unfortunately, when such low-dimensional structure is absent, these methods in general have near-linear $O(n^{1-o(1)})$ runtime. Thus, under general assumptions, the only method that was known (Lee & Gray, 2009) to provably accelerate kernel evaluation in high dimensions was simple uniform random sampling (RS).

Hashing-Based-Estimators. The framework of HBE has recently been extended to apply to more general kernels (Charikar & Siminelakis, 2018), and hashing-based ideas have been used to design faster algorithms for smooth kernels (Backurs et al., 2018).
Besides kernel evaluation, researchers have also applied hashing techniques to obtain practical improvements in outlier detection (Luo & Shrivastava, 2018a), gradient estimation (Chen et al., 2018), non-parametric clustering (Luo & Shrivastava, 2018b) and range-counting (Wu et al., 2018).

Approximate Skeletonization. ASKIT (March et al., 2015) represents a different line of work aimed at addressing the deficiencies of FMM and Dual-Tree methods. ASKIT also produces a hierarchical partition of points, but then uses linear-algebraic techniques to approximate the contribution of points to each part via a combination of uniform samples and nearest neighbors. This makes the method robust to the dimension and mostly dependent on the number of points.

Core-sets. The problem of sketching the KDE for all points has been studied under the name of core-sets or ϵ-samples (Phillips, 2013). These methods are very effective in low dimensions $d \leq 3$ (Zheng et al., 2013), but become impractical to implement in higher dimensions for large datasets due to their $\Omega(n^2)$ computational requirements (e.g. (Chen et al., 2010)). The recent paper of Phillips & Tai (2018) presents an up-to-date summary.

3. Preliminaries

In this section, we present basic definitions and give a self-contained presentation of Hashing-Based-Estimators that summarizes the parts of (Charikar & Siminelakis, 2017) upon which we build. All material beyond Section 3 is new.

Notation. For a set $S \subseteq [n]$ and numbers $u_1, \ldots, u_n$, let $u_S := \sum_{i \in S} u_i$. Let $\triangle_n := \{u \in \mathbb{R}^n_+ : \|u\|_1 = 1\}$ denote the $n$-dimensional simplex. Throughout the paper we assume that we want to approximate $\mathrm{KDE}^u_P$ with $u \in \triangle_n$. We can handle the general case $u \in \mathbb{R}^n$ by treating the positive and negative parts of $u$ separately.

Kernels. All our theoretical results, unless explicitly stated, apply to general non-negative kernels. For concreteness, and in our experiments, we focus on the Gaussian kernel $e^{-\|x-y\|^2/\sigma^2}$ and the Laplace kernel $e^{-\|x-y\|/\sigma}$ (Belkin et al., 2018), and typically suppress the dependence on the bandwidth $\sigma > 0$ (equivalently, we can rescale our points).

3.1. Multiplicative Approximation & Relative Variance

Definition 1. A random variable $\hat{Z}$ is called an $(\epsilon, \delta)$-approximation to $\mu$ if $\Pr[|\hat{Z} - \mu| \geq \epsilon\mu] \leq \delta$.

Given access to an unbiased estimator $\mathbb{E}[Z] = \mu$, our goal is to output an $(\epsilon, \delta)$-approximation to $\mu$. Towards that end, the main quantity to control is the relative variance.

Definition 2. For a non-negative random variable $Z$ we define the relative variance as $\mathrm{RelVar}[Z] := \frac{\mathrm{Var}[Z]}{\mathbb{E}[Z]^2} = \frac{\mathbb{E}[Z^2]}{\mathbb{E}[Z]^2} - 1$.

The relative variance captures the fluctuations of the random variable at the scale of the expectation. This is made precise in the following lemma, which combines Chebyshev's and the Paley-Zygmund inequality.

Lemma 1. For a non-negative random variable $Z$ and parameters $t > 0$, $\theta \in [0, 1]$, we have:

$\Pr[Z \geq (t + 1)\,\mathbb{E}[Z]] \leq \frac{1}{t^2}\,\mathrm{RelVar}[Z]$,   (1)

$\Pr[Z > (1 - \theta)\,\mathbb{E}[Z]] \geq \frac{1}{1 + \frac{1}{\theta^2}\mathrm{RelVar}[Z]}$.   (2)

As $\mathrm{RelVar}[Z]$ decreases, the upper bound in (1) decreases while the lower bound in (2) increases, increasing our overall confidence that $Z$ is an accurate estimate of the mean. Thus, if one can construct a random variable $Z$ with $\mathbb{E}[Z] = \mu$ and small relative variance $\mathrm{RelVar}[Z] = O(\epsilon^2)$, Lemma 1 shows that $Z$ is an $(\epsilon, O(1))$-approximation to $\mu$. In fact, by Chernoff bounds, one can use the median trick to boost the probability of success (see the supplementary material).
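The median trick can be sketched as follows; this is a generic illustration under the assumption that `draw` returns one unbiased sample (e.g. a single RS or HBE draw), not the paper's exact boosting routine.

```python
import numpy as np

def median_of_means(draw, m, k=9, rng=None):
    # Median trick: average m i.i.d. draws to get one estimate whose relative
    # variance is small, repeat k times independently, and return the median.
    # By Chernoff bounds this boosts a constant success probability per repeat
    # to 1 - exp(-Omega(k)) overall.
    means = [np.mean([draw() for _ in range(m)]) for _ in range(k)]
    return float(np.median(means))
```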
3.2. Hashing-Based-Estimators

HBE uses hashing to create a data structure that, after some preprocessing, can produce unbiased estimators for the KDE at query time (Figure 2) with low variance. Let $\mathcal{H}$ be a set of functions and $\nu$ a distribution over $\mathcal{H}$. We denote by $h \sim \mathcal{H}_\nu$ a random function $h \in \mathcal{H}$ sampled from $\nu$ and refer to $\mathcal{H}_\nu$ as a hashing scheme.

Figure 2. Given a dataset, the HBE approach samples a number of hash functions and populates a separate hash table for each hash function. At query time, for each hash table, we sample a point at random from the hash bucket that the query maps to. The $m$ hash tables allow us to produce at most $m$ independent samples to estimate the KDE for each query.

Definition 3. Given a hashing scheme $\mathcal{H}_\nu$, we define the collision probability between two elements $x, y \in \mathbb{R}^d$ as $p(x, y) := \Pr_{h \sim \mathcal{H}_\nu}[h(x) = h(y)]$.

Pre-processing: given a dataset $P$ and a hashing scheme $\mathcal{H}_\nu$, sample $m$ hash functions $h_t \sim \mathcal{H}_\nu$ i.i.d. for $t \in [m]$, and create hash tables $H_t := h_t(P)$ for $t \in [m]$ by evaluating the hash functions on $P$.

Query process: for a query $q \in \mathbb{R}^d$ and hash table index $t \in [m]$, let $H_t(q) := \{i \in [n] : h_t(x_i) = h_t(q)\}$ denote the (potentially empty) hash bucket that $q$ maps to. If $H_t(q)$ is not empty, return $Z_{h_t} := \frac{k(X_t, q)}{p(X_t, q)}\, u_{H_t(q)}$, where $X_t$ is sampled from $H_t(q)$ with probability proportional to $u_i$; otherwise return 0.

By averaging many samples produced by the hash tables we get accurate estimates. The salient properties of the estimators $Z_{h_t}$ are captured in the following lemma.

Lemma 2. Assuming that $p(x_i, q) > 0$ for all $i \in [n]$, then

$\mathbb{E}[Z_h] = \sum_{i=1}^{n} u_i k(q, x_i)$,   (3)

$\mathbb{E}[Z_h^2] = \sum_{i,j=1}^{n} k^2(q, x_i)\, u_i\, \frac{\Pr[i, j \in H(q)]}{p^2(q, x_i)}\, u_j$.   (4)

As we see in (4), the second moment depends on the hashing scheme through the ratio $\Pr[i, j \in H(q)] / p^2(q, x_i)$. Thus, we hope to reduce the variance by selecting the hashing scheme appropriately. The difficulty lies in the fact that the variance depends on the positions of points in the whole dataset. Charikar & Siminelakis (2017) introduced a property of HBE that allowed them to obtain provable bounds on the variance.

Definition 4. Given a kernel $k$, a hashing scheme is called $(\beta, M)$-scale-free for $\beta \in [0, 1]$ and $M \geq 1$ if

$\frac{1}{M}\, k(x, y)^{\beta} \leq p(x, y) \leq M\, k(x, y)^{\beta}$.

Thus, one needs to design a hashing scheme that adapts to the kernel function. Next, we present a specific family of hashing schemes that can be used for kernel evaluation under the Exponential and Gaussian kernels.

3.3. HBE via Euclidean LSH

In the context of solving Approximate Nearest Neighbor search in Euclidean space, Datar et al. (2004) introduced the following hashing scheme, Euclidean LSH (eLSH), that uses two parameters $w > 0$ (width) and $\kappa \geq 1$ (power). First, hash functions of the form $h_i(x) := \lceil \frac{g_i^\top x}{w} + b_i \rceil$ map a $d$-dimensional vector $x$ into a set of integers ("buckets") by applying a random projection ($g_i \sim N(0, I_d)$ i.i.d.), a random shift ($b_i \sim U[0, 1]$ i.i.d.) and quantizing the result (width $w > 0$). A concatenation (power) of $\kappa$ such functions gives the final hash function. They showed that the collision probability for two points $x, y \in \mathbb{R}^d$ at distance $\|x - y\| = c \cdot w$ equals $p_1(c)^{\kappa}$, where

$p_1(c) := 1 - 2\Phi(-c^{-1}) - \sqrt{\tfrac{2}{\pi}}\, c \left(1 - e^{-c^{-2}/2}\right)$.   (5)

In order to evaluate the hashing scheme on a dataset $P$, one needs space and pre-processing time $O(d\kappa n)$.
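As an illustration, the following sketch implements the eLSH family and the single-projection collision probability $p_1(c)$ from (5); the class name and interface are ours, and a production implementation would add the table and bucketing machinery.

```python
import numpy as np
from scipy.stats import norm

class EuclideanLSH:
    # Euclidean LSH (Datar et al., 2004): kappa concatenated quantized random
    # projections, h_i(x) = ceil(<g_i, x>/w + b_i), g_i ~ N(0, I_d), b_i ~ U[0, 1].
    def __init__(self, d, w, kappa, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.w = w
        self.G = rng.standard_normal((kappa, d))
        self.b = rng.uniform(0.0, 1.0, size=kappa)

    def hash(self, x):
        # Returns a tuple of kappa bucket indices (the concatenated hash).
        return tuple(np.ceil(self.G @ x / self.w + self.b).astype(int))

def p1(c):
    # Collision probability of a single projection for two points at distance
    # c * w (equation (5)), valid for c > 0; the kappa-fold concatenation
    # collides with probability p1(c) ** kappa.
    return (1 - 2 * norm.cdf(-1.0 / c)
            - np.sqrt(2 / np.pi) * c * (1 - np.exp(-1.0 / (2 * c ** 2))))
```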
By picking $w$ large enough, i.e. for small $c$, one can show the following bounds on the collision probability.

Lemma 3 (Charikar & Siminelakis, 2017). For $\delta \leq \frac{1}{2}$ and $c \leq \min\{\delta, \frac{1}{2}\}$, the collision probability $p_1(c)$ is bounded above and below by $e^{-\sqrt{2/\pi}\,c}$ up to a $(1 \pm O(\delta))$ factor in the exponent.

Taking appropriate powers $\kappa$, one can then use Euclidean LSH to create collision probabilities appropriate for the Exponential and Gaussian kernels.

Theorem 1 (Charikar & Siminelakis, 2017). Let $\mathcal{H}_1(w, \kappa)$ denote the eLSH hashing scheme with width $w$ and power $\kappa \geq 1$, and let $R = \max_{x, y \in P \cup \{q\}}\{\|x - y\|\}$. Then:

- Laplace kernel: choosing the power $\kappa_e$ appropriately (as a function of $R$) and the width $w_e = \sqrt{2/\pi}\,\frac{1}{\beta}\,\kappa_e$ results in a $(\beta, e)$-scale-free hashing scheme for $k(x, y) = e^{-\|x - y\|}$.
- Gaussian kernel: setting $r_t := \frac{1}{2}\sqrt{t\log(1+\gamma)}$, $\kappa_t := 3\lceil r_t R\rceil^2$ and $w_t = \sqrt{2/\pi}\,\kappa_t/r_t$ results in an estimator for $k(x, y) = e^{-\|x - y\|^2}$ with relative variance

$\mathrm{RelVar}[Z_t] \leq V_t(\mu) := 4e^{3/2}\left(e^{-r_t^2}\,\frac{1}{\mu} + \frac{r_t}{\sqrt{\mu}}\right)$.   (6)

4. Refined Variance Bounds and Diagnostics

To understand how many samples are needed to estimate the density through RS or HBE, we need bounds for expressions of the type (4). In particular, for a sequence of numbers $w_1, \ldots, w_n \in [0, 1]$, e.g. $w_i = k(q, x_i)$, and $u \in \triangle_n$ such that $\mu = \sum_{i \in [n]} u_i w_i$, we want to bound

$\sup_{u \in \triangle_n,\, u^\top w = \mu}\ \sum_{i,j \in [n]} w_i^2\,(u_i V_{ij} u_j)$,   (7)

where $V \in \mathbb{R}^{n \times n}_+$ is a non-negative matrix. Our bound will be decomposed into the contributions $\mu_\ell$ from subsets of indices $S_\ell \subseteq [n]$ where the weights (kernel values) have bounded range. Specifically, let $\mu \leq \lambda \leq L \leq 1$ and define:

$S_1 = \{i \in [n] : L \leq w_i \leq 1\}$, $S_2 = \{i \in [n] \setminus S_1 : \lambda \leq w_i < L\}$, $S_3 = \{i \in [n] \setminus (S_2 \cup S_1) : \mu \leq w_i < \lambda\}$, $S_4 = \{i \in [n] : w_i < \mu\}$,

as well as $\mu_\ell = \sum_{i \in S_\ell} u_i w_i \leq \mu$ for $\ell \in [4]$. The intuition behind the definition of the sets is that, for radially decreasing kernels, they correspond to spherical annuli around the query.

Lemma 4. For non-negative weights $w_1, \ldots, w_n$, a vector $u \in \triangle_n$ and sets $S_1, \ldots, S_4 \subseteq [n]$ as above, it holds that

$\sum_{i,j \in [n]} w_i^2\,\{u_i V_{ij} u_j\} \leq \sum_{\ell, \ell' \in [3]} \sup_{i \in S_\ell,\, j \in S_{\ell'}}\left\{\frac{V_{ij} w_i}{w_j}\right\}\mu_\ell\mu_{\ell'} + \sum_{\ell \in [3]} \sup_{i \in S_\ell,\, j \in S_4}\{V_{ij} w_i\}\,\mu_\ell\, u_{S_4} + \sup_{i \in S_4,\, j \in [n]}\{V_{ij} w_i\}\,\mu_4$,   (8)

where $u_S := \sum_{i \in S} u_i$.

This simple lemma allows us to get strong bounds for scale-free hashing schemes, simplifying results of (Charikar & Siminelakis, 2017) and extending them to $\beta < 1/2$.

Theorem 2. Let $\mathcal{H}_\nu$ be a $(\beta, M)$-scale-free hashing scheme and $Z_h$ the corresponding HBE. Then $\mathrm{RelVar}[Z_h] \leq V_{\beta, M}(\mu) := 3M^3/\mu^{\max\{\beta, 1-\beta\}}$.

Proof. Letting $w_i = k(q, x_i)$, we start from (4), use the fact that $\Pr[i, j \in H(q)] \leq \min\{p(x_i, q), p(x_j, q)\}$, and use that the hashing scheme is $(\beta, M)$-scale-free to arrive at $\mathbb{E}[Z_h^2] \leq \sum_{i,j \in [n]} w_i^2\{u_i V_{ij} u_j\}$ with $V_{ij} = M^3 \frac{\min\{w_i^\beta, w_j^\beta\}}{w_i^{2\beta}}$. Let $S_1, \ldots, S_4$ be the sets defined for $\lambda = L = \mu$; then $S_2 = S_3 = \emptyset$. By Lemma 4 we get

$\mathbb{E}[Z_h^2] \leq M^3\left(\sup_{i,j \in S_1}\{w_i^{1-2\beta} w_j^{\beta-1}\}\,\mu_1^2 + \sup_{i \in S_4}\{w_i^{1-\beta}\}\,\mu_4 + u_{S_4}\sup_{i \in S_1,\, j \in S_4}\{w_i^{1-2\beta} w_j^{\beta}\}\,\mu_1\right)$.

Bounding the suprema using $w_i \geq \mu$ on $S_1$ and $w_i < \mu$ on $S_4$, and dividing by $\mathbb{E}[Z_h]^2 = \mu^2$, gives

$\mathrm{RelVar}[Z_h] \leq M^3\left(\frac{1}{\mu^{1-\beta}} + \frac{1}{\mu^{\max\{\beta, 1-\beta\}}} + \frac{1}{\mu^{\beta}}\right) \leq \frac{3M^3}{\mu^{\max\{\beta, 1-\beta\}}}$.

Noting that $\mathbb{E}[Z_h] = \mu$ and $\mu \leq 1$ concludes the proof.

Remark 1. It is easy to see that random sampling is a (trivial) $(0, 1)$-scale-free hashing scheme, hence its relative variance is bounded by $3/\mu$.

Remark 2. For any $(\frac{1}{2}, M)$-scale-free hashing scheme the relative variance is bounded by $3M^3/\sqrt{\mu}$.

4.1. Diagnostics for unbiased estimators

The inequality provides a means to quantify the dataset-dependent performance of RS and HBE: by setting the coefficients $V_{ij}$ appropriately, Lemma 4 bounds the variance of RS ($V_{ij} = 1$) and HBE ($V_{ij} = \frac{\min\{p(q, x_i),\, p(q, x_j)\}}{p(q, x_i)^2}$). However, evaluating the bound over the whole dataset is no cheaper than evaluating the methods directly. Instead, we evaluate the upper bound on a representative sample $S_0$ in place of $[n]$, using the sets $\tilde{S}_\ell = S_0 \cap S_\ell$ for $\ell \in [4]$. Given a set $S_0$, define the estimators $\tilde{\mu}_\ell = \sum_{j \in \tilde{S}_\ell} u_j w_j$. For $\epsilon \in (0, 1)$, we set $\lambda_\epsilon$ to the largest $\lambda \in [\tilde{\mu}_0, 1]$ (9) and $L_\epsilon$ to the smallest $L \in [\tilde{\mu}_0, 1]$ (10) such that the contribution of the corresponding intermediate scales stays below $\frac{1}{2}(\epsilon\tilde{\mu}_0 - \tilde{\mu}_4)$.
Since Lemma 4 holds for all $\mu \leq \lambda \leq L \leq 1$, $L_\epsilon$ and $\lambda_\epsilon$ complete the definition of the four sets on $S_0$, which we use to evaluate the upper bound. Finally, we produce a representative sample $S_0$ by running our adaptive procedure (Section 5) with Random Sampling. The procedure returns a (random) set $S_0$ such that $\tilde{\mu}_0$ is an $(\epsilon, O(1))$-approximation to $\mu$ for any given query.

Algorithm 1: Data-dependent Diagnostic
1: Input: set $P$, threshold $\tau \in (0, 1)$, accuracy $\epsilon$, number of queries $T \geq 1$, collision probability $p(x, y)$ of the hashing scheme $\mathcal{H}_\nu$.
2: for $t = 1, \ldots, T$ do
3:   $q \leftarrow \mathrm{Random}(P)$  {for each random query}
4:   $(\tilde{S}_0, \tilde{\mu}_0) \leftarrow \mathrm{AMR}(Z_{RS}(q), \epsilon, \tau)$  {Algorithm 2}
5:   Set $\lambda_\epsilon, L_\epsilon$ using (9) and (10)
6:   $V_{RS} \leftarrow$ r.h.s. of (8) for $\tilde{S}_0$ and $V_{ij} = 1$
7:   $V_{H} \leftarrow$ r.h.s. of (8) for $\tilde{S}_0$ and $V_{ij} = \min\{p(q, x_i), p(q, x_j)\}/p(q, x_i)^2$
8:   $\widehat{\mathrm{RelVar}}_{RS}(t) \leftarrow V_{RS} / \max\{\tilde{\mu}_0, \tau\}^2$
9:   $\widehat{\mathrm{RelVar}}_{HBE}(t) \leftarrow V_{H} / \max\{\tilde{\mu}_0, \tau\}^2$
10: Output: $(\mathrm{mean}_T(\widehat{\mathrm{RelVar}}_{RS}),\ \mathrm{mean}_T(\widehat{\mathrm{RelVar}}_{HBE}))$

5. Adaptive estimation of the mean

Towards our goal of producing an $(\epsilon, O(1))$-approximation to $\mu$ for a given query $q$, the arguments in Section 3.1 reduce this task to constructing an unbiased estimator $Z$ of low relative variance $\mathrm{RelVar}[Z] \leq \frac{\epsilon^2}{6}$. Given a basic estimator $Z_0$ and a bound $V(\mu)$ on its relative variance, one can always create such an estimator by taking the mean of $O(V(\mu)/\epsilon^2)$ independent samples. The question that remains in order to implement this approach is how to deal with the fact that $\mu$, and thus $V(\mu)$, is unknown. We handle this by providing an improvement to the adaptive estimation procedure of Charikar & Siminelakis (2017). The procedure makes a guess of the density and uses the guess to set the number of samples. Subsequently, the procedure performs a consistency check to evaluate whether the guess was correct up to a small constant and, if not, revises the guess and increases the sample size. The procedure applies generically to the following class of estimators (whose variance varies polynomially with $\mu$) that we introduce.

5.1. $(\alpha, \beta, \gamma)$-regular estimators

Definition 5. For $\alpha, \beta \in (0, 2]$ and $\gamma \geq 1$, an estimator $\mathcal{Z}$ is $(\alpha, \beta, \gamma)$-regular if, for some constant $C \geq 1$ and all $t \in [T]$, there exist a random variable $Z_t$ and a function $V_t : (0, 1] \to \mathbb{R}_{++}$ such that:
(A) $\mathbb{E}[Z_t] = \mu$ and $\mathrm{RelVar}[Z_t] \leq V_t(\mu)$ for all $\mu \in (0, 1]$;
(B) $1 \leq \frac{V_t(y)}{V_t(x)} \leq \left(\frac{x}{y}\right)^{2-\alpha}$ for all $x \geq y > 0$;
(C) $V_t(\mu_{t+1}) \leq C\,\mu_t^{-\beta}$, with $\mu_t := (1+\gamma)^{-t}$.
We write $Z_t \sim \mathcal{Z}(t, \gamma)$ to denote a sample from such an estimator at level $t \in [T]$.

At a high level, regular estimators generalize the properties of scale-free estimators relevant to adaptively estimating the mean. Property (B) affects the success probability, whereas (C) affects the running time and space requirements. This level of generality will be necessary to analyze the hashing scheme designed for the Gaussian kernel.

Regular Estimators via HBE. For any $t \in [T]$, given a collection of hash tables $\{H^{(i)}_t\}_{i \in [m_t]}$ created by evaluating $m_t$ i.i.d. hash functions with collision probability $p_t(x, y)$ on a set $P$, and a fixed kernel $k(x, y)$, we will denote by

$\mathcal{Z}(t, \gamma) \leftarrow \mathrm{HBE}_{k, p_t}(\{H^{(i)}_t\}_{i \in [m_t]})$   (11)

a data structure at level $t$ that, for any query $q$, is able to produce up to $m_t$ i.i.d. unbiased random variables $Z^{(i)}_t(q)$ for the density $\mu$ according to Section 3.2. The union of such data structures for $t \in [T]$ will be denoted by $\mathcal{Z}$. Before proceeding with the description of the estimation procedure, we show that $(\beta, M)$-scale-free HBE (which include random sampling) satisfy the above definition.

Theorem 3. Given a $(\beta, M)$-scale-free hashing scheme with $\beta \geq \frac{1}{2}$, the corresponding estimator is $(2-\beta, \beta, \gamma)$-regular with constant $C = 3M^3(1+\gamma)^{\beta}$.

The proof follows easily from Theorem 2 by using the same estimator for all $t \in [T]$ and $V_t(\mu) = V(\mu) = 3M^3/\mu^{\beta}$.
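For concreteness, here is a minimal sketch of the hash-table data structure behind (11) and the query-time estimator of Section 3.2; the class name and the factory/collision-probability interface are our own simplification (e.g. reusing the `EuclideanLSH` sketch above), not the authors' optimized implementation.

```python
import numpy as np

class HBE:
    # m independent hash tables, each built from one hash function; a query
    # draws one unbiased sample per table. `lsh_factory()` returns a fresh
    # object with .hash(x); `collision_prob(x, y)` returns p(x, y) for that
    # hashing scheme; `kernel(x, y)` returns k(x, y).
    def __init__(self, P, u, m, lsh_factory, collision_prob, kernel):
        self.P = np.asarray(P)
        self.u = np.asarray(u, dtype=float)
        self.p, self.k = collision_prob, kernel
        self.tables = []
        for _ in range(m):
            h, buckets = lsh_factory(), {}
            for i, x in enumerate(self.P):
                buckets.setdefault(h.hash(x), []).append(i)
            self.tables.append((h, buckets))

    def sample(self, q, t, rng):
        # One draw of Z_{h_t}: pick X from the query's bucket with probability
        # proportional to u_i and return k(X, q)/p(X, q) * u_{H_t(q)};
        # return 0 on an empty bucket.
        h, buckets = self.tables[t]
        idx = buckets.get(h.hash(q), [])
        if not idx:
            return 0.0
        w = self.u[idx]
        i = rng.choice(idx, p=w / w.sum())
        return self.k(self.P[i], q) / self.p(self.P[i], q) * w.sum()

    def estimate(self, q, rng=None):
        # Average the (up to) m independent samples.
        rng = np.random.default_rng() if rng is None else rng
        return float(np.mean([self.sample(q, t, rng) for t in range(len(self.tables))]))
```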
5.2. Adaptive Mean Relaxation

For regular estimators we propose the following procedure (Algorithm 2). In the supplementary material we analyze the probability that this procedure successfully estimates the mean, as well as the number of samples it uses, and obtain the following result.

Algorithm 2: Adaptive Mean Relaxation (AMR)
1: Input: $(\alpha, \beta, \gamma)$-regular estimator $\mathcal{Z}$, accuracy $\epsilon \in (0, 1)$, threshold $\tau \in (0, 1)$.
2: $T \leftarrow \log_{1+\gamma}(\frac{1}{\epsilon\tau}) + 1$
3: for $t = 1, \ldots, T$ do  {for each level}
4:   $\mu_t \leftarrow (1+\gamma)^{-t}$  {current guess of the mean}
5:   $m_t \leftarrow \frac{6}{\epsilon^2} V_t(\mu_{t+1})$  {sufficient samples at level $t$}
6:   $Z^{(i)}_t \leftarrow \mathcal{Z}(t, \gamma)$, i.i.d. samples for $i \in [m_t]$
7:   $\bar{Z}_t \leftarrow \mathrm{mean}\{Z^{(1)}_t, \ldots, Z^{(m_t)}_t\}$
8:   if $\bar{Z}_t \geq \mu_t$ then  {consistency check}
9:     return $\bar{Z}_t$
10: return 0  {in this case $\mu < \epsilon\tau$}

Theorem 4. Given an $(\alpha, \beta, \gamma)$-regular estimator $\mathcal{Z}$, the AMR procedure outputs a number $\hat{Z}$ such that $\Pr[|\hat{Z} - \mu| \leq \epsilon\max\{\mu, \tau\}] \geq \frac{2}{3}$, and with the same probability uses $O_\gamma(\frac{1}{\epsilon^2}\frac{1}{\mu^{\beta}})$ samples.

Using the bounds on the relative variance from Section 4, we get that any $(\beta, M)$-scale-free estimator with $\beta \geq \frac{1}{2}$ can be used to estimate the density $\mu$ using $O(1/\epsilon^2\mu^{\beta})$ samples.

5.3. Regular estimator for the Gaussian kernel

We next show a construction of an HBE for the Gaussian kernel (Algorithm 3) that results in a regular estimator.

Algorithm 3: Gaussian HBE (Gauss-HBE)
1: Input: $\gamma \geq 1$, $\tau \in (0, 1)$, $\epsilon \in (0, 1)$, dataset $P$.
2: $T \leftarrow \log_{1+\gamma}(\frac{1}{\epsilon\tau}) + 1$, $R \leftarrow \sqrt{\log(1/\epsilon\tau)}$
3: for $t = 1, \ldots, T$ do
4:   $m_t \leftarrow \frac{6}{\epsilon^2} V_t((1+\gamma)^{-(t+1)})$  {see (6)}
5:   for $i = 1, \ldots, m_t$ do
6:     $H^{(i)}_t \leftarrow \mathrm{eLSH}(w_t, \kappa_t, P)$  {see Theorem 1}
7:   $p_t(x, y) := p_1(\|x - y\|/w_t)^{\kappa_t}$  {see (5)}
8:   $k(x, y) := e^{-\|x - y\|^2}$
9:   $\mathcal{Z}_{\mathrm{Gauss}}(t, \gamma) \leftarrow \mathrm{HBE}_{k, p_t}(\{H^{(i)}_t\}_{i \in [m_t]})$
10: Output: $\mathcal{Z}_{\mathrm{Gauss}}$

Theorem 5. $\mathcal{Z}_{\mathrm{Gauss}}$ is $(1, \frac{3}{4}, \gamma)$-regular and takes preprocessing time and space bounded by $O_{d, \kappa_T, \gamma}\!\left(\epsilon^{-3+\frac{1}{4}}\,\tau^{-\frac{3}{4}}\, n\right)$.

The proof (given in the supplementary material) is based on using (6) to show that Definition 5 is satisfied with an appropriate selection of constants. We also note that (6) can be derived using Lemma 4.

6. Sketching the Kernel Density Estimate

In this section, we introduce an approach based on hashing to create a sketch of the KDE that we can evaluate using HBE or other methods.

Definition 6. Let $(u, P)$ be a set of weights and points. We call $(w, S)$ an $(\epsilon, \delta, \tau)$-sketch iff for any $q \in \mathbb{R}^d$:

$\mathbb{E}[|\mathrm{KDE}^w_S(q) - \mathrm{KDE}^u_P(q)|^2] \leq \epsilon^2\tau\delta\,\mathrm{KDE}^u_P(q)$.   (12)

Let $\mu = \mathrm{KDE}^u_P(q)$; using Chebyshev's inequality it is immediate that any $(\epsilon, \delta, \tau)$-sketch satisfies, for any $q \in \mathbb{R}^d$, $\Pr[|\mathrm{KDE}^w_S(q) - \mu| \geq \epsilon\max\{\tau, \mu\}] \leq \delta$.

Remark 3. It is easy to see that one can construct such a sketch by randomly sampling $m \geq \frac{1}{\epsilon^2\delta}\frac{1}{\tau}$ points.

Hashing-Based-Sketch (HBS). The scheme we propose samples a random point by first sampling a hash bucket $H_i$ with probability proportional to $u_{H_i}^{\gamma}$ and then sampling a point $j$ from the bucket with probability proportional to $u_j$. The weights are chosen such that the sample is an unbiased estimate of the density. This scheme interpolates between uniform over buckets and uniform over the whole dataset as we vary $\gamma \in [0, 1]$. We pick $\gamma$ so that the variance is controlled, with the additional property that any non-trivial bucket ($u_{H_i} \geq \tau$) will have a representative in the sketch with good probability. This last property is useful for estimating low-density points (e.g. for outlier detection (Gan & Bailis, 2017)). For any hash table $H$ and a vector $u \in \triangle_n$ (simplex), let $B = B(H)$ denote the number of buckets and $u_{\max} = u_{\max}(H) := \max\{u_{H_i} : i \in [B]\}$ the maximum weight of any hash bucket of $H$.
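Algorithm 4 below gives the precise procedure; the following simplified sketch (our own, with the normalization written out explicitly) illustrates the bucket-then-point sampling and the unbiased weights just described.

```python
import numpy as np

def hbs_sketch(P, u, h, m, gamma, rng=None):
    # Hash the dataset once with h, then draw m (point, weight) pairs:
    # pick a bucket with probability proportional to u_{H_i}^gamma, a point
    # inside it with probability proportional to u_j, and weight it by the
    # inverse of its selection probability (times u_j, divided by m) so that
    # sum_j w_j k(q, X_j) is an unbiased estimate of KDE_P^u(q).
    rng = np.random.default_rng() if rng is None else rng
    P, u = np.asarray(P), np.asarray(u, dtype=float)
    buckets = {}
    for j, x in enumerate(P):
        buckets.setdefault(h(x), []).append(j)
    keys = list(buckets)
    mass = np.array([u[buckets[key]].sum() for key in keys])  # u_{H_i}
    g = mass ** gamma
    probs = g / g.sum()
    S, w = [], []
    for _ in range(m):
        b = rng.choice(len(keys), p=probs)
        idx = buckets[keys[b]]
        j = rng.choice(idx, p=u[idx] / u[idx].sum())
        S.append(P[j])
        w.append(mass[b] ** (1 - gamma) * g.sum() / m)
    return np.array(S), np.array(w)
```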
The precise definition of our Hashing-Based-Sketch is given in Algorithm 4.

Algorithm 4: Hashing-Based-Sketch (HBS)
1: Input: set $P$, sketch size $m$, hashing scheme $\mathcal{H}_\nu$, threshold $\tau \in (0, 1)$, $u \in \triangle_n$
2: Sample $h \sim \mathcal{H}_\nu$ and create hash table $H = h(P)$.
3: Set $\gamma$ according to (13)
4: $S_m \leftarrow \emptyset$, $w \leftarrow 0 \cdot \mathbf{1}_m$, $B \leftarrow B(H)$
5: for $j = 1, \ldots, m$ do
6:   Sample hash bucket $H_i$ with probability proportional to $u_{H_i}^{\gamma}$
7:   Sample a point $X_j$ from $H_i$ with probability proportional to $u_j$
8:   $S_m \leftarrow S_m \cup \{X_j\}$
9:   $w_j(\gamma, m) \leftarrow \frac{1}{m}\cdot\frac{u_{H_i}}{u_{H_i}^{\gamma}}\cdot\sum_{i'=1}^{B} u_{H_{i'}}^{\gamma}$
10: Output: $(S_m, w)$

Theorem 6. Let $H$ be the hash function sampled by the HBS procedure. For $\epsilon > 0$ and $\delta \in [e^{-6}, 1)$, suppose $\gamma \in [0, 1]$ is set according to (13) and the sketch size $m$ satisfies (14); these conditions require $m$ to be at least of order $\frac{\log(1/\delta)}{\epsilon^2}\cdot\frac{1}{\tau}\,(B u_{\max})^{1-\gamma}$ and at least $\frac{\log(1/\delta)}{\tau}$. Then $(S_m, w)$ is an $(\epsilon, \frac{1}{6}, \tau)$-sketch and, if $B \leq \frac{1}{\tau}$, any hash bucket with weight at least $\tau$ will have non-empty intersection with $S_m$ with probability at least $1 - \delta$.

7. Experiments

In this section, we evaluate the performance of hashing-based methods for kernel evaluation on real and synthetic datasets, and test the diagnostic procedure's ability to predict dataset-dependent estimator performance. (Source code is available at http://github.com/kexinrong/rehashing.)

7.1. Experimental Setup

Baselines and Evaluation Metrics. As baselines for density estimation (using $u = \frac{1}{n}\mathbf{1}$), we compare against ASKIT (March et al., 2016), the tree-based method FigTree (Morariu et al., 2009), and random sampling (RS). We used the open-sourced libraries for FigTree and ASKIT; for comparison, all implementations are in C++ and we report results on a single core. We tune each set of parameters via binary search to guarantee an average relative error of at most 0.1. By default, the reported query time and relative error are averaged over 10K random queries that we evaluate exactly as ground truth. The majority of query densities in the evaluation are larger than $\tau = 10^{-4}$ (roughly $\frac{1}{10\sqrt{n}}$).

Synthetic Benchmarks. To evaluate algorithms under generic scenarios, we need benchmarks with diverse structure. The variance bound suggests that having a large number of points far from the query is hard for HBE, whereas having a few points close to the query is hard for RS. In addition, the performance of space-partitioning methods depends mostly on how clustered vs. spread out the datasets are, with the latter being harder. A fair benchmark should therefore include all of the above regimes. We propose a procedure that generates a $d$-dimensional dataset where, for $D$ random directions, clusters of points are placed on $s$ different distance scales, such that (i) each distance scale contributes equally to the kernel density at the origin, (ii) the outermost scale has $n$ points per direction, and (iii) the kernel density around the origin is approximately $\mu$. By picking $D \gg n$ the instances become more random-like, while for $D \ll n$ they become more clustered. Using this procedure, we create two families of instances (a simplified generator sketch follows this list):

- worst-case: we take the union of two datasets generated for ($D = 10$, $n = 50$K) and ($D = 5$K, $n = 100$), while keeping $s = 4$, $\mu = 10^{-3}$ fixed and varying $d$.
- D-structured: we set $N = 500$K, $s = 4$, $\mu = 10^{-3}$, $d = 100$ and vary $D$ while keeping $nD = N$.
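The exact generator is specified in the supplementary material; the following simplified sketch (our own, with an arbitrary cluster spread and scale spacing) illustrates the three properties listed above for the Gaussian kernel.

```python
import numpy as np

def multi_scale_instance(d, D, n, s, mu, spread=0.05, rng=None):
    # For D random directions, place clusters at s distance scales so that,
    # under the Gaussian kernel exp(-r^2) with uniform weights, every scale
    # contributes roughly equally to the density at the origin, the outermost
    # scale has about n points per direction, and the density is on the order
    # of mu. The cluster spread and the spacing of the scales are arbitrary
    # choices made for this illustration.
    rng = np.random.default_rng() if rng is None else rng
    r_max = np.sqrt(np.log(1.0 / mu))  # outermost scale: n * exp(-r_max^2) = mu * n
    points = []
    for _ in range(D):
        direction = rng.standard_normal(d)
        direction /= np.linalg.norm(direction)
        for j in range(1, s + 1):
            r = r_max * j / s
            n_j = max(1, int(round(mu * n * np.exp(r ** 2))))  # ~n at the outermost scale
            cluster = r * direction + spread * rng.standard_normal((n_j, d))
            points.append(cluster)
    return np.vstack(points)
```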
Table 1. Comparison of preprocessing time (sec) and average query time (ms) on real-world datasets; bold numbers mark the best average query time. The sources of the datasets are: MSD (Bertin-Mahieux et al., 2011), GloVe (Pennington et al., 2014), SVHN (Netzer et al., 2011), TMY3 (Hendron & Engebrecht, 2010), covtype (Blackard & Dean, 1999), TIMIT (Garofolo, 1993).

| Dataset | Description | n | d | Preprocess (s): HBE | FigTree | ASKIT | RS | Query (ms): HBE | FigTree | ASKIT | RS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TMY3 | energy | 1.8M | 8 | 15 | 28 | 838 | 0 | **0.8** | 3.0 | 28.4 | 55.0 |
| census | census | 2.5M | 68 | 66 | 4841 | 1456 | 0 | **2.8** | 1039.9 | 103.7 | 27.5 |
| covertype | forest | 581K | 54 | 22 | 31 | 132 | 0 | **1.9** | 23.0 | 47.0 | 149.4 |
| TIMIT | speech | 1M | 440 | 298 | 1579 | 439 | 0 | **24.3** | 1531.8 | 169.8 | 32.8 |
| ALOI | image | 108K | 128 | 24 | 70 | 14 | 0 | 6.3 | 53.5 | **5.4** | 21.2 |
| SVHN | image | 630K | 3072 | 2184 | > 10^5 | 877 | 0 | 67.9 | > 10^4 | **43.0** | 370.1 |
| MSD | song | 463K | 90 | 55 | 2609 | 107 | 0 | 5.2 | 326.9 | 8.1 | **2.1** |
| GloVe | embedding | 400K | 100 | 98 | 5603 | 76 | 0 | 19.0 | 656.5 | 86.1 | **5.0** |

Figure 3. Predicted relative variance upper bound for RS and HBE on real datasets (acoustic, mnist, susy, hep, higgs, MSD, GloVe, TIMIT, SVHN, TMY3, ALOI, census, covertype, home, shuttle, skin, corel, ijcnn, sensorless, poker, cadata, codrna). RS exhibits lower variance for datasets on the left. Overall, the diagnostic procedure correctly predicts the better estimator for all but one dataset (SVHN).

In the supplementary material we include a precise description of the procedure as well as example scatter plots.

7.2. Results

Synthetic datasets. We evaluate the four algorithms for kernel density estimation on worst-case instances with $d \in [10, 500]$ and on D-structured instances with $D \in [1, 100\mathrm{K}]$. For worst-case instances, while all methods experience increased query time as the dimension grows, HBE consistently outperforms the other methods for dimensions ranging from 10 to 500 (Figure 4, left). For the D-structured instances (Figure 4, right), FigTree and RS dominate the two extremes where the dataset is highly clustered ($D \ll n$) or scattered on a single scale ($D \gg n$), while HBE achieves the best performance for datasets in between. ASKIT's runtime is relatively unaffected by the number of clusters, but it is outperformed by the other methods.

Figure 4. Synthetic experiments: average query time versus the number of dimensions (left, worst-case instances) and versus the number of clusters $D$ (right, D-structured instances) for HBE, FigTree, ASKIT and RS.

Real-world datasets. We repeat the above experiments on eight large real-world datasets from various domains. We z-normalize each dataset dimension and tune the bandwidth based on Scott's rule (Scott, 2015). We exclude the small percentage of queries whose density is below $\tau$. Table 1 reports the preprocessing time as well as the average query time for each method. Overall, HBE achieves the best average query time on four datasets; we focus on query time since it dominates the runtime given enough queries. With sketching, HBE also exhibits preprocessing overhead comparable to FigTree and ASKIT. While the performance of FigTree and ASKIT degrades with increased dimension and dataset size respectively, HBE remains competitive across the board.

Diagnosis. To assess the accuracy of the diagnostic procedure, we compare the predicted variance of RS and HBE on an extended set of datasets (Figure 3). RS exhibits lower variance (smaller error with the same number of samples) for the datasets on the left. Overall, our proposed diagnostic correctly identified the better estimator for all but one dataset. Notice that although RS has better sampling efficiency on TIMIT, HBE ends up having better query time, since the frequent cache misses induced by the large dataset dimension offset the additional computational cost of HBE. For all datasets in Table 1, the diagnosis costs between 8% and 59% of the setup time of HBE, indicating that running the diagnosis is cheaper than running even a single query with HBE.
Supplementary material. We include an experimental comparison between HBS and competing methods (Chen et al., 2010; Cortes & Scott, 2017), where HBS is consistently the best method for low-density queries while having similar performance on random queries. We also show how to use the diagnostic procedure to produce dataset visualizations.

Acknowledgments

We thank Leslie Greengard for valuable conversations and encouragement in earlier stages of this work. We also thank Edward Gan, Daniel Kang, Sahaana Suri, and Kai Sheng Tai for feedback on an early draft of this paper. The first author has been partially supported by an Onassis Foundation Scholarship. Moses Charikar was supported by NSF grant CCF-1617577, a Simons Investigator Award and a Google Faculty Research Award. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Intel, Microsoft, NEC, SAP, Teradata, and VMware) as well as Toyota Research Institute, Keysight Technologies, Northrop Grumman, and Hitachi. We also acknowledge the support of the Intel/NSF CPS Security grant No. 1505728, the Stanford Secure Internet of Things Project and its supporters (VMware, Ford, NXP, Google), and the Stanford SystemX Alliance.

References

Andoni, A. and Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, 2006 (FOCS '06), 47th Annual IEEE Symposium on, pp. 459–468. IEEE, 2006.

Backurs, A., Charikar, M., Indyk, P., and Siminelakis, P. Efficient density evaluation for smooth kernels. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 615–626. IEEE, 2018.

Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.

Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.

Blackard, J. A. and Dean, D. J. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24:131–151, 1999.

Buhmann, M. D. Radial basis functions: theory and implementations, volume 12. Cambridge University Press, 2003.

Charikar, M. and Siminelakis, P. Hashing-Based-Estimators for kernel density in high dimensions. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pp. 1032–1043. IEEE, 2017.

Charikar, M. and Siminelakis, P. Multi-resolution hashing for fast pairwise summations. arXiv preprint arXiv:1807.07635, 2018.

Chen, B., Xu, Y., and Shrivastava, A. LSH-sampling breaks the computational chicken-and-egg loop in adaptive stochastic gradient estimation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Workshop Track Proceedings, 2018. URL https://openreview.net/forum?id=Hk8zZjRLf.

Chen, Y., Welling, M., and Smola, A. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pp. 109–116, Arlington, Virginia, United States, 2010. AUAI Press. ISBN 978-0-9749039-6-5.

Cortes, E. C. and Scott, C. Sparse approximation of a kernel mean. Trans. Sig. Proc., 65(5):1310–1323, March 2017. ISSN 1053-587X.
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262. ACM, 2004.

Gan, E. and Bailis, P. Scalable kernel density classification via threshold-based pruning. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 945–959. ACM, 2017.

Garofolo, J. S. TIMIT acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993.

Gray, A. G. and Moore, A. W. N-body problems in statistical learning. In Advances in Neural Information Processing Systems, pp. 521–527, 2001.

Gray, A. G. and Moore, A. W. Nonparametric density estimation: Toward computational tractability. In Proceedings of the 2003 SIAM International Conference on Data Mining, pp. 203–211. SIAM, 2003.

Greengard, L. and Rokhlin, V. A fast algorithm for particle simulations. Journal of Computational Physics, 73(2):325–348, 1987.

Hendron, R. and Engebrecht, C. Building America research benchmark definition: Updated December 2009, 2010.

Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM, 1998.

Lee, D. and Gray, A. G. Fast high-dimensional kernel summations using the Monte Carlo multipole method. In Advances in Neural Information Processing Systems, pp. 929–936, 2009.

Lee, D., Moore, A. W., and Gray, A. G. Dual-tree fast Gauss transforms. In Advances in Neural Information Processing Systems, pp. 747–754, 2006.

Luo, C. and Shrivastava, A. Arrays of (locality-sensitive) Count Estimators (ACE): Anomaly detection on the edge. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 1439–1448. International World Wide Web Conferences Steering Committee, 2018a.

Luo, C. and Shrivastava, A. Scaling-up split-merge MCMC with Locality Sensitive Sampling (LSS). arXiv preprint arXiv:1802.07444, 2018b.

March, W., Xiao, B., and Biros, G. ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions. SIAM Journal on Scientific Computing, 37(2):A1089–A1110, 2015.

March, W., Xiao, B., Yu, C., and Biros, G. ASKIT: An efficient, parallel library for high-dimensional kernel summations. SIAM Journal on Scientific Computing, 38(5):S720–S749, 2016.

Morariu, V. I., Srinivasan, B. V., Raykar, V. C., Duraiswami, R., and Davis, L. S. Automatic online tuning for fast Gaussian summation. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21, pp. 1113–1120. Curran Associates, Inc., 2009.

Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1–2):1–141, 2017.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.

Phillips, J. M. ε-samples for kernels. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1622–1632. SIAM, 2013.

Phillips, J. M. and Tai, W. M. Near-optimal coresets of kernel density estimates.
In LIPIcs-Leibniz International Proceedings in Informatics, volume 99. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

Ram, P., Lee, D., March, W., and Gray, A. G. Linear-time algorithms for pairwise statistical problems. In Advances in Neural Information Processing Systems, pp. 1527–1535, 2009.

Schölkopf, B., Smola, A. J., Bach, F., et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

Scott, D. W. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.

Spring, R. and Shrivastava, A. A new unbiased and efficient class of LSH-based samplers and estimators for partition function computation in log-linear models. arXiv preprint arXiv:1703.05160, 2017.

Sriperumbudur, B. et al. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.

Wu, X., Charikar, M., and Natchu, V. Local density estimation in high dimensions. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5296–5305. PMLR, 10–15 Jul 2018.

Zheng, Y., Jestes, J., Phillips, J. M., and Li, F. Quality and efficiency for kernel density estimates in large data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 433–444. ACM, 2013.