# Non-parametric classification via expand-and-sparsify representation

Kaushik Sinha
School of Computing, Wichita State University, Wichita, KS 67260
kaushik.sinha@wichita.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

In expand-and-sparsify (EaS) representation, a data point in S^{d-1} is first randomly mapped to a higher dimension R^m, where m > d, followed by a sparsification operation in which the informative k ≪ m of the m coordinates are set to one and the rest are set to zero. We propose two algorithms for non-parametric classification using such EaS representations. For our first algorithm, we use a winners-take-all operation for the sparsification step, show that the proposed classifier admits the form of a locally weighted average classifier, and establish its consistency via Stone's Theorem. Further, assuming that the conditional probability function P(y = 1|x) = η(x) is Hölder continuous, we show that for the optimal choice of m the convergence rate of this classifier is minimax-optimal. For our second algorithm, we use an empirical k-thresholding operation for the sparsification step, and under the assumption that data lie on a low-dimensional manifold of dimension d0 ≪ d, we show that the convergence rate of this classifier depends only on d0 and is again minimax-optimal. Empirical evaluations performed on real-world datasets corroborate our theoretical results.

1 Introduction

Given a training set x1, . . . , xn of size n and a test point x sampled i.i.d. from a distribution µ over some sample space X ⊆ R^d, the basic idea of non-parametric estimation (e.g., density estimation, regression or classification) is to construct a function x ↦ fn(x) = fn(x, x1, . . . , xn) that approximates the true function f (which could be a density/regression/classification function as appropriate) as n → ∞ Tsybakov [2008]. For any x ∈ X, fn(x) typically depends on a small subset of the training set lying within a small neighborhood of (thus close to) x. For example, in the case of k-nearest neighbors, fn(x) depends on the k points from the training set that are closest to x; in the case of a random forest, for each tree constituting the forest, fn(x) depends on the points from the training set that are routed to the same leaf node to which x is also routed; in the case of kernel methods, fn(x) depends on the points from the training set weighted by an appropriate kernel. In this paper, we propose a new non-parametric classification method where, for any x ∈ S^{d-1}, the appropriate neighborhoods are obtained using the expand-and-sparsify (EaS) representation of x.

On a high level, an expand-and-sparsify representation is essentially a transformation from a low-dimensional dense representation of sensory inputs to a much higher-dimensional, sparse representation. Such representations have been found, for instance, in the olfactory system of the fly (Wilson [2013]) and mouse (Stettler and Axel [2009]), the visual system of the cat (Olshausen and Field [2004]), and the electrosensory system of the electric fish (Chacron et al. [2011]). For example, in the olfactory system of Drosophila (Turner et al. [2008], Masse et al. [2009], Wilson [2013], Caron et al. [2013]), the primary sense receptors of the fly are the roughly 2,500 odor receptor neurons (also known as ORNs), which can be clustered into 50 types based on their odor responses, leading to a dense, 50-dimensional sensory input vector.
In fact, all ORNs of a given type converge on a corresponding glomerulus in the antennal lobe; there are 50 of these in a topographically fixed configuration. This information is then relayed via projection neurons to a collection of roughly 2,000 Kenyon cells (KCs) in the mushroom body, with each KC receiving signal from roughly 5-10 glomeruli (Dasgupta and Tosh [2020]). The pattern of connectivity between the glomeruli and Kenyon cells appears random (Chacron et al. [2011]). The output of the KCs is integrated by a single anterior paired lateral (APL) neuron, which then provides negative feedback causing all but the 5% highest-firing KCs to be suppressed (Lin et al. [2014]). The result is a random, sparse, high-dimensional representation of the sensory input that is the basis for subsequent learning. The primary motivation of this paper is to study the benefit of this type of naturally occurring representation in the context of supervised classification.

In our setting, the EaS representation is a mapping from the d-dimensional unit sphere S^{d-1} to {0, 1}^m, m ≫ d, where a data point x ∈ S^{d-1} is first randomly mapped to a higher dimension R^m using a random projection matrix Θ, followed by a sparsification operation in which the informative k ≪ m of the m coordinates of the resulting vector Θx ∈ R^m are set to one and the rest of the coordinates are set to zero. Rows of Θ are typically drawn independently from some distribution ν over S^{d-1} or R^d. Let Cj, j = 1, . . . , m, be the set of all examples from the input space X ⊆ S^{d-1} whose jth coordinate in their respective EaS representations is set to 1. We call Cj the response region of θj (the jth row of Θ).

Figure 1: Top: A point x ∈ R^2 (coordinate-wise values are different shades of gray) and its projection y = Θx ∈ R^15 (coordinate-wise values are different shades of red). The sparsification step sets the largest 5 values of y to 1 (black squares) and the rest to zero. Bottom: Activated response regions Cj with x ∈ Cj (x is a red dot) are shown using different colors. The points from the training set that intersect these activated response regions are shown using black dots.

Note that for any x ∈ X, there are k activated response regions corresponding to the k non-zero bits in the EaS representation of x. These k activated response regions serve as k neighborhoods of x for our proposed non-parametric classifier: for any x ∈ X and each activated response region Cj, we estimate Pr(y = 1|Cj), the expected value of y when x is restricted to Cj, and take the average of these estimates as the estimate of Pr(y = 1|x). In a toy example, we visually show the EaS representation and the activated response regions Cj of a data point in Fig. 1. Checking whether this conditional probability estimate exceeds 1/2, we predict the class label y. In particular, we present two algorithms using different sparsification schemes and analyze their theoretical properties. One may find similarity between our proposed algorithm and a random forest classifier: for any x ∈ X, the k activated response regions Cj with x ∈ Cj may be interpreted as the leaf nodes (containing x) of k decision trees in a random forest. However, unlike a random forest, we cannot simply increase k (the number of trees) without changing other hyper-parameters (such as m) and hope to achieve better prediction performance. Therefore, even though the consistency property of random forests has been studied under different settings Biau et al. [2008], Scornet et al. [2015], Scornet [2016], Tang et al. [2018], those ideas cannot be applied to our theoretical analysis.
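To make the k-WTA EaS map and its response regions concrete, the following minimal NumPy sketch (not part of the original paper; all sizes are illustrative placeholders) computes h1(x) = Γk(Θx) for a batch of points and then lists, for one query point, how many training points fall in each of its k activated response regions, i.e., its EaS-induced neighborhoods.

```python
import numpy as np

def eas_kwta(X, theta, k):
    """EaS code h1(x) = Gamma_k(Theta x): expand to R^m, then keep the top-k coordinates."""
    proj = X @ theta.T                                   # expand: (num points, m)
    top_k = np.argpartition(-proj, k - 1, axis=1)[:, :k]
    codes = np.zeros_like(proj, dtype=np.int8)
    np.put_along_axis(codes, top_k, 1, axis=1)           # sparsify: exactly k ones per row
    return codes

rng = np.random.default_rng(0)
d, m, k = 10, 200, 16                                    # placeholder sizes
theta = rng.standard_normal((m, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)    # rows drawn uniformly from S^{d-1}

X_train = rng.standard_normal((500, d))
X_train /= np.linalg.norm(X_train, axis=1, keepdims=True)
x = X_train[:1]                                          # treat one point as a query

# The k activated coordinates of x define its k response regions C_j; the training
# points sharing coordinate j (i.e., with x_i in C_j) act as x's neighborhood.
code_x = eas_kwta(x, theta, k)[0]
codes_train = eas_kwta(X_train, theta, k)
for j in np.flatnonzero(code_x):
    print(f"response region C_{j}: {int(codes_train[:, j].sum())} training points")
```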
We summarize our contributions below:

- We present an interesting connection between non-parametric estimation and EaS representation, propose two algorithms for non-parametric classification via EaS representation, and present empirical evaluations on benchmark machine learning datasets.
- For our first algorithm (using k-WTA sparsification), we establish its universal consistency and prove that it achieves a minimax-optimal convergence rate that depends on the dimension d.
- When data lie on a low-dimensional manifold having dimension d0 ≪ d, we present a second algorithm (using empirical k-thresholding sparsification) and prove that its convergence rate is minimax-optimal and depends only on d0.

The rest of the paper is organized as follows. We discuss related work in section 2. We present our first algorithm and its theoretical analysis, including consistency and convergence rate, in section 3. We present our second algorithm and derive its convergence rate in section 4. We present our empirical results in section 5 and conclude by discussing limitations and future work in section 6.

2 Related work

Non-parametric estimation is an important branch of statistics with a rich history and classical results. Interested readers may refer to Tsybakov [2008], Györfi et al. [2002], which provide excellent treatments of this subject. Here we briefly review consistency and convergence rates of well-known non-parametric methods such as partitioning estimates (histograms), k-nearest neighbors, kernel methods, and random forests. In the non-parametric literature, it is typical to estimate the regression function η(x) = E(y|x) from data, and to use the resulting estimate ηn (based on a sample of size n) to construct a plug-in decision function (classification rule). Under mild conditions, such regression estimates and the resulting plug-in classification rules are consistent for histograms, k-nearest neighbors, and kernel methods Györfi et al. [2002], Devroye et al. [1996]. Under mild conditions, different variations of random forest methods are also known to be consistent Biau et al. [2008], Scornet et al. [2015], Scornet [2016], Tang et al. [2018]. Under the assumption that the regression function η(x) is Lipschitz continuous, the L2 error of the histogram regression estimate converges at a rate O(n^{-1/(d+2)}) (Theorem 4.3 of Györfi et al. [2002]), the L2 error of the kernel regression estimate converges at a rate O(n^{-1/(d+2)}) (Theorem 5.2 of Györfi et al. [2002]), and the L2 error of the k-nearest neighbor regression estimate converges at a rate O(n^{-2/(d+2)}) (Theorem 6.2 of Györfi et al. [2002]). For random forests, Gao and Zhou [2020] established a finite-sample rate of O(n^{-1/(8d+2)}) on the convergence of pure random forests for classification, which was improved to O(n^{-1/(3.87d+2)}) by considering the midpoint splitting mechanism. They introduced another variant of random forests, which follows Breiman's original random forests but with different mechanisms for splitting dimensions and positions, to obtain a convergence rate of O(n^{-1/(d+2)} (ln n)^{1/(d+2)}), which reaches the minimax rate except for a factor of (ln n)^{1/(d+2)}.
For EaS representation, when ν is the uniform distribution over S^{d-1} and the sparsification scheme is a k-winners-take-all (k-WTA) operation that sets the k largest entries of a vector in R^m to one and the rest to zero, Dasgupta and Tosh [2020] proposed a function approximation scheme using such EaS representation that can approximate any Lipschitz continuous function f : S^{d-1} → R in the L∞ norm, where the error decays at a rate that slows exponentially with d. Further, they showed that when the data lie on a low-dimensional submanifold of S^{d-1} having dimension d0 ≪ d, using a different sparsification step, termed k-thresholding, any Lipschitz continuous function defined on this manifold can be approximated in the L∞ norm, where the error decays at a rate that slows exponentially with d0. A different EaS representation, where the projection matrix is a sparse binary matrix and the sparsification step is k-WTA, was proposed in Dasgupta et al. [2017]; it effectively hashes input data points, and such hashing has been shown to provide improved performance in accurately solving similarity search problems compared to state-of-the-art locality sensitive hashing (LSH) based techniques (Gionis et al. [1999], Andoni and Indyk [2008], Datar et al. [2004]). Such EaS representation has also been used to summarize data in the form of a bloom filter, termed the fly bloom filter (FBF), and has been successfully used to solve novelty detection problems in Dasgupta et al. [2018] and classification problems in Sinha and Ram [2021], Ram and Sinha [2021, 2022].

While our work is inspired by the results of Dasgupta and Tosh [2020], there are key differences. First, we explicitly expand upon and apply the idea of Dasgupta and Tosh [2020] to the supervised binary classification setting, and derive the rate at which the error probability of our proposed classifier converges to that of the Bayes optimal classifier. In comparison, the main motivation of Dasgupta and Tosh [2020] was to prove the existence of a weight vector that results in arbitrarily good function approximation in the unsupervised learning setting using EaS representation. Because of the existential nature of their result, the effect of sample size was neither needed nor considered. Second, the k-thresholding sparsification scheme proposed in Dasgupta and Tosh [2020] for their function approximation result assumed that the coordinate-wise thresholds were known. We make no such assumption: we explicitly describe how to compute such thresholds from data, resulting in a realizable algorithm, and we derive a convergence rate for our proposed classifier that accounts for the uncertainty associated with the random nature of these thresholds. Third, while Dasgupta and Tosh [2020] only consider Lipschitz continuous functions, we allow the conditional probability function η(x) = Pr(y = 1|x) to be Hölder continuous (a broader function class), and prove that our proposed classifier achieves the minimax-optimal convergence rate; whether such an optimal convergence rate was achievable in the proposed setting was unknown prior to our result. Finally, we note that random Fourier features Rahimi and Recht [2007] are generated using a construction similar to EaS; however, the random directions there are chosen using a kernel function that measures a notion of similarity in the input space. EaS representation can also be thought of as the opposite process of compressed sensing Candes et al. [2006], Donoho [2006], where the goal is to recover a sparse vector from random projections of it, whereas in EaS a random projection is used to generate a sparse representation.
3 Algorithm 1

Algorithm 1
Input: Training set Dn = {(xi, yi)}_{i=1}^n ⊆ S^{d-1} × {0, 1}, projection dimensionality m ∈ N, k ≪ m non-zeros in the EaS representation, random seed R; inference with test point x ∈ S^{d-1}.
TrainEaSClassifier(Dn, m, k, R)
  Sample Θ with seed R
  Initialize w[i], ct[i] ← 0, i ∈ [m]
  for (x, y) ∈ Dn do
    eas ← h1(x)
    w[i] ← w[i] + y, ∀i ∈ [m] : eas[i] = 1
    ct[i] ← ct[i] + 1, ∀i ∈ [m] : eas[i] = 1
  end
  w[i] ← w[i]/ct[i], ∀i ∈ [m]
  return Θ, w
end
InferEaSClassifier(x, Θ, k, w)
  eas ← h1(x)
  return I[(eas · w)/k ≥ 1/2]

In this section we present our first algorithm and analyze its statistical properties. For the rest of our presentation, we use the following notation. For any positive integer k ∈ N, we use [k] to denote the set {1, . . . , k}. For any vector v, we use the notation v[i] to denote its ith coordinate value. We use S^{d-1} to denote the d-dimensional unit sphere and B(x, r) to denote a closed ball of radius r centered at x. We use I to denote the indicator such that, for any boolean variable A, I[A] is 1 if A is True and 0 otherwise. We consider the binary classification setting, where the instance space is X ⊆ S^{d-1} and the set of output class labels is Y = {0, 1}. Let µ denote the measure on X and let η : X → [0, 1], defined as η(x) = Pr(y = 1|x), represent the conditional probability distribution. Then the joint probability distribution on X × Y is completely defined by µ and η. Let {(xi, yi)}_{i=1}^n be a training set of size n sampled i.i.d. from (µ, η). Due to space limitations, proofs of all our results (organized section-wise) are deferred to the Appendix.

For x ∈ S^{d-1}, the EaS representation of x that uses the k-WTA sparsification step is given by the function h1 : S^{d-1} → {0, 1}^m defined as

h1(x) = Γk(Θx), (1)

where Θ is an m × d random projection matrix whose rows θ1, . . . , θm are sampled i.i.d. from the uniform distribution over S^{d-1}, and Γk : R^m → {0, 1}^m is the k-WTA function converting a vector in R^m to one in {0, 1}^m by setting the largest k ≪ m elements of Θx to 1 and the rest to zero. For any j ∈ [m], let Cj = {x ∈ X : h1(x)[j] = 1}. We note that the subsets (response regions) C1, . . . , Cm do not form a partition since they can overlap.

We summarize our first algorithm for binary classification using h1 in Alg. 1. During its learning/training phase, a vector w ∈ [0, 1]^m summarizes the average y value over the m response regions Cj, j ∈ [m], using the training set. In particular, w[j], j ∈ [m], learned during the training phase is precisely given by

ˆηj = (Σ_{i=1}^n yi I[xi ∈ Cj]) / (Σ_{i=1}^n I[xi ∈ Cj]). (2)

Using (2), for any x ∈ X, we further define

ˆη(x) = (1/k) Σ_{j : x ∈ Cj} ˆηj. (3)

During the inference phase, for any test point x, Alg. 1 first computes (w · h1(x))/k, which is an average of the average y values over the k response regions {Cj : h1(x)[j] = 1}; this can be interpreted as an estimate of the conditional probability η(x) and is precisely the quantity ˆη(x) given in (3). Alg. 1 then makes its prediction based on whether ˆη(x) is greater than or equal to 1/2. We can rewrite this conditional probability estimate as follows:

ˆη(x) = (w · h1(x))/k = (1/k) Σ_{j : x ∈ Cj} w[j] = (1/k) Σ_{j : x ∈ Cj} ˆηj = (1/k) Σ_{j : x ∈ Cj} (Σ_{i=1}^n yi I[xi ∈ Cj]) / (Σ_{i=1}^n I[xi ∈ Cj]) = Σ_{i=1}^n wn,i(x) yi, where wn,i(x) = (1/k) Σ_{j : x ∈ Cj} I[xi ∈ Cj] / (Σ_{l=1}^n I[xl ∈ Cj]). (4)

Using viewpoint (4), Alg. 1 can be interpreted as a "plug-in" classifier Devroye et al. [1996], where prediction is based on whether the estimated conditional probability ˆη exceeds 1/2 or not.
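The following is a minimal, self-contained sketch of the training and inference phases of Alg. 1 as summarized above (a sketch only; the synthetic data, sizes, and helper names are illustrative and not from the paper). Training accumulates, for each coordinate j, the label sum and the count of training points activating j, exactly the w and ct of Alg. 1; inference averages w over the k activated coordinates and thresholds at 1/2, i.e., it computes ˆη(x) of (3).

```python
import numpy as np

def eas_kwta(X, theta, k):
    """k-WTA EaS codes h1(x) = Gamma_k(Theta x); same map as in the earlier sketch."""
    proj = X @ theta.T
    top_k = np.argpartition(-proj, k - 1, axis=1)[:, :k]
    codes = np.zeros_like(proj, dtype=np.int8)
    np.put_along_axis(codes, top_k, 1, axis=1)
    return codes

def train_eas_classifier(X, y, theta, k):
    """Alg. 1 training: w[j] is the average label over response region C_j."""
    codes = eas_kwta(X, theta, k)
    counts = codes.sum(axis=0)                     # ct[j] = #{i : x_i in C_j}
    label_sums = codes.T.astype(float) @ y         # sum of y_i over C_j
    return np.divide(label_sums, counts, out=np.zeros(len(counts)), where=counts > 0)

def predict_eas_classifier(X_test, theta, k, w):
    """Alg. 1 inference: threshold the average of w over the k activated coordinates."""
    eta_hat = eas_kwta(X_test, theta, k) @ w / k   # estimate of eta(x), cf. (3)
    return (eta_hat >= 0.5).astype(int)

# illustrative usage with synthetic data on the unit sphere (placeholder sizes)
rng = np.random.default_rng(0)
d, m, k, n = 10, 200, 16, 2000
theta = rng.standard_normal((m, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # rows uniform on S^{d-1}
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = (X[:, 0] > 0).astype(float)                          # synthetic labels
w = train_eas_classifier(X, y, theta, k)
print("training accuracy:", (predict_eas_classifier(X, theta, k, w) == y).mean())
```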
In particular, the classifier in Alg. 1 can be represented as g : X → {0, 1} described as

g(x) = 1 if ˆη(x) ≥ 1/2, and 0 otherwise. (5)

In comparison, the Bayes optimal classifier g* : X → {0, 1} is defined as

g*(x) = 1 if η(x) ≥ 1/2, and 0 otherwise. (6)

In equation (4), the weights wn,i(x) = wn,i(x, x1, . . . , xn) ∈ R depend on x1, . . . , xn. Next we show that, for sufficiently large n, the sum of these weights is 1. Therefore, ˆη is simply a weighted average estimator of η, and g is a non-parametric classifier Devroye et al. [1996].

Lemma 3.1. For any x ∈ X, suppose n is sufficiently large such that {x1, . . . , xn} ∩ Cj ≠ ∅ for each j satisfying x ∈ Cj. Then, Σ_{i=1}^n wn,i(x) = 1.

Indeed, we show in Lemma D.7 (in the Appendix) that for any x ∈ X, whenever n → ∞ and mk/n → 0 as n → ∞, |{x1, . . . , xn} ∩ Cj| → ∞ for all j such that x ∈ Cj.

3.1 Consistency of Algorithm 1

In this section we prove that Alg. 1 is universally consistent. We start with the definition of a consistent and a universally consistent classification rule Devroye et al. [1996]. Let Dn = {(x1, y1), . . . , (xn, yn)} be a training set sampled i.i.d. from a certain distribution of (x, y) and let gn : X → {0, 1} be a classification rule learned using Dn. Then gn is consistent (or asymptotically Bayes-risk efficient) for a certain distribution of (x, y) if the expected error probability E Ln = Pr{gn(x, Dn) ≠ y} → L* as n → ∞, where L* is the Bayes error probability. A sequence of decision rules is called universally consistent if it is consistent for any distribution of the pair (x, y). It is well known that a general theorem by Stone Stone [1977] (presented below) provides a recipe for establishing universal consistency of any classification rule of the form

gn(x) = 1 if ηn(x) ≥ 1/2, and 0 otherwise, (7)

based on an estimate ηn(x) of the conditional probability η(x) satisfying ηn(x) = Σ_{i=1}^n wn,i(x) yi, where the weights wn,i(x) = wn,i(x, x1, . . . , xn) are non-negative and Σ_{i=1}^n wn,i(x) = 1.

Theorem 3.2 (Stone's Theorem, Theorem 6.3 of Devroye et al. [1996]). Assume that for any distribution of x, the weights satisfy the following three conditions: (i) There is a constant c such that, for every non-negative measurable function f satisfying Ef(x) < ∞, E(Σ_{i=1}^n wn,i(x) f(xi)) ≤ c Ef(x). (ii) For all a > 0, lim_{n→∞} E(Σ_{i=1}^n wn,i(x) I[‖xi − x‖ > a]) = 0. (iii) lim_{n→∞} E(max_{1≤i≤n} wn,i(x)) = 0. Then gn is universally consistent.

Since we have already established in the previous section that the classification rule given in Alg. 1 admits the form given in (5), in conjunction with (4) and Lemma 3.1 we are ready to use Theorem 3.2 to establish universal consistency of Alg. 1. We state our consistency result below.

Theorem 3.3. Let (kn) and (mn) be increasing functions of n and let (δn) be a decreasing function of n satisfying lim_{n→∞} δn = 0. Then the classification rule using the EaS representation (1) given in Alg. 1, which admits the form given in (5), is universally consistent whenever the following conditions hold: (i) kn/mn → 0 as n → ∞, (ii) log(1/δn)/mn → 0 as n → ∞, and (iii) mn kn/n → 0 as n → ∞.

For example, mn = ⌈n^{1/2}⌉, kn = ⌈n^{1/4}⌉ and δn = 1/n satisfy conditions (i)-(iii).

Once m and k are fixed, for any x ∈ X, there are (m choose k) possible subsets of k indices, out of the m possible indices in the EaS representation, that h1(x) can set to 1. For i ∈ {1, . . . , (m choose k)}, let σ(i) represent the set of k indices in the ith subset, and let C^k_i = {x ∈ X : h1(x)[j] = 1, ∀j ∈ σ(i)} denote the set of instances whose EaS representation sets precisely the indices in σ(i) to 1 and the rest to zero. The following two results are crucial in proving our consistency result above, as well as in deriving the convergence rate in the next section.

Lemma 3.4.
For any positive integers m, k, where k < m, the following relations hold: (i) For any i ∈ {1, . . . , (m choose k)}, C^k_i = ∩_{j ∈ σ(i)} Cj. (ii) {C^k_i}_{i=1}^{(m choose k)} forms a partition of X. (iii) For any j ∈ [m], Cj = ∪_{i : j ∈ σ(i)} C^k_i. (iv) For any j ∈ [m], µ(Cj) = Σ_{i : j ∈ σ(i)} µ(C^k_i).

Lemma 3.5. Let d ≥ 3 and pick any 0 < δ < 1. There is an absolute constant c0 > 0 such that the following holds. With probability at least 1 − δ,

max_{j ∈ [m]} diam(Cj) ≤ 8 ((k + c0(d log m + log(1/δ)))/m)^{1/(d−1)}.

In particular, if k ≥ c0(d log m + log(1/δ)), then max_{j ∈ [m]} diam(Cj) ≤ 8 (2k/m)^{1/(d−1)}.

3.2 Convergence rate of Algorithm 1

While the consistency result established in the previous section describes the behavior of Alg. 1 in the large-sample limit, it does not provide the rate at which the excess Bayes risk converges to zero. In this section we establish such a finite-sample convergence rate. From the plug-in classifier viewpoint of Alg. 1 given in (5) and the definition of the Bayes optimal classifier in (6), it is easy to see that how well the prediction of Algorithm 1 relates to the prediction of the Bayes optimal classifier depends on how well ˆη(x) approximates η(x). Recall that for any x ∈ X, the point-wise error probability or risk of the Bayes optimal classifier is defined as L*(x) = Pr(g*(x) ≠ y | x) = min{η(x), 1 − η(x)} [Devroye et al., 1996]. Taking expectation over x, the risk of the Bayes optimal classifier is L* = E(min{η(x), 1 − η(x)}). Now, for any plug-in classification rule gn(x) defined in (7) that uses a conditional probability estimate ηn based on training data Dn, the excess Bayes risk is bounded by the following well-known result (Corollary 6.1 of Devroye et al. [1996]).

Lemma 3.6 ([Devroye et al., 1996]). The error probability of a classifier gn(x) defined in (7) satisfies the inequality

Pr(gn(x) ≠ y | Dn) − L* ≤ 2 Ex( |η(x) − ηn(x)| | Dn ).

Taking expectation over Dn, the above lemma immediately translates to

Pr(gn(x) ≠ y) − L* ≤ 2 Ex,Dn( |η(x) − ηn(x)| ). (8)

Therefore, in order to establish the convergence rate of Alg. 1, following Lemma 3.6 and (8), we need to bound the expected value of |ˆη(x) − η(x)|. Towards this end, we first extend η, defined over points x ∈ X, to measurable sets A ⊆ X with µ(A) > 0 as follows:

η(A) = (1/µ(A)) ∫_A η(x) µ(dx) = (1/µ(A)) ∫_A Pr(y = 1|x) µ(dx) = Pr(y = 1 | x ∈ A) = E(y | x ∈ A). (9)

Thus, η(A) is the probability that y = 1 for a point x chosen at random from the distribution µ restricted to the set A; in other words, it is the expected value of y when x is restricted to the set A. With this definition, we introduce an intermediate quantity

η̄(x) = (1/k) Σ_{j : x ∈ Cj} ηj, (10)

where ηj is defined as

ηj = (1/µ(Cj)) ∫_{Cj} η(x) µ(dx) = η(Cj). (11)

Using the triangle inequality, this allows us to write Ex,Dn|ˆη(x) − η(x)| = Ex,Dn|ˆη(x) − η̄(x) + η̄(x) − η(x)| ≤ Ex,Dn|ˆη(x) − η̄(x)| + Ex,Dn|η̄(x) − η(x)|, and we show how to individually bound each term on the right-hand side of this inequality next.

Remark 3.7. Note that, once m and k are fixed, our hypothesis space is the set of all linear models on the k-sparse m-dimensional binary vectors. The two terms on the right-hand side of the above inequality correspond to the estimation error (the error of our proposed classifier with respect to the best hypothesis from the hypothesis space) and the approximation error (the error difference between the best hypothesis from the hypothesis space and the target classifier, i.e., the Bayes optimal classifier), respectively.

3.3 Bounding Ex,Dn|ˆη(x) − η̄(x)|

Note that there is an inherent randomness in our proposed algorithm associated with the choice of Θ. In particular, the response regions Cj are random quantities that depend on the choice of Θ.
In this section, we fix Θ and, conditioned on this, we bound Ex,Dn|ˆη(x) − η̄(x)|. This ensures that for any δ > 0, the same bound holds with probability at least 1 − δ over the choice of Θ.

Lemma 3.8. Fix any Θ. Then we have

Ex,Dn|ˆη(x) − η̄(x)| ≤ √( m/(2kn) + m/(ekn) ).

Sketch of proof: From the definitions of ˆη(x) and η̄(x) given in (3) and (10) and applying Jensen's inequality, the crux of the proof is to bound the expected value of the random quantity ((Σ_{i=1}^n yi I[xi ∈ Cj]) / (Σ_{i=1}^n I[xi ∈ Cj]) − η(Cj))². Note that, for j ∈ [m], Σ_{i=1}^n I[xi ∈ Cj] = nµn(Cj) is the number of points from Dn that fall in Cj, where µn(Cj) is the empirical probability estimate. We bound the quantity of interest in Lemma 3.8 as a sum of two quantities corresponding to the following two cases.

Case 1: µn(Cj) = 0. Using the convention 0/0 = 0, the quantity of interest becomes (1/k) Σ_{j : x ∈ Cj} η²(Cj) I[µn(Cj) = 0]. Utilizing the properties of the response regions established in Lemma 3.4, we show in Lemma E.3 that its expected value is at most m/(ekn).

Case 2: nµn(Cj) > 0. Here the quantity of interest becomes ((Σ_{i=1}^n (yi − η(Cj)) I[xi ∈ Cj]) / (nµn(Cj)))² I[nµn(Cj) > 0]. Conditioned on x, x1, . . . , xn, we first show in Lemma E.1 that its expected value (w.r.t. y1, . . . , yn) is at most I[nµn(Cj) > 0]/(4nµn(Cj)). Next, conditioned on x1, . . . , xn, and utilizing the properties of the response regions established in Lemma 3.4, we show in Lemma E.2 that its expected value (w.r.t. x) is at most (1/(4k)) Σ_{j=1}^m µ(Cj) I[nµn(Cj) > 0]/(nµn(Cj)). Finally, using a standard Binomial bound (Lemma E.4), we show that its expected value (w.r.t. x1, . . . , xn) is at most m/(2kn).

3.4 Bounding Ex,Dn|η̄(x) − η(x)|

In order to bound the expected value of |η̄(x) − η(x)|, we need to impose a certain smoothness condition on η. We consider a general form of smoothness, known as Hölder continuity, for η.

Definition 3.9. We say that η : X → [0, 1] is (L, β)-smooth if for all x, x′ ∈ X we have |η(x) − η(x′)| ≤ L ‖x − x′‖^β.

Using the Hölder continuity assumption above, we first show the following.

Lemma 3.10. Suppose η is (L, β)-smooth. Then, sup_{x ∈ X} |η̄(x) − η(x)| ≤ L max_{j ∈ [m]} (diam(Cj))^β.

We have already shown in Lemma 3.5 how to bound the diameters of the Cj. Combining these two results we have:

Lemma 3.11. Let d ≥ 3 and pick any 0 < δ < 1. Assume η to be (L, β)-smooth. There is an absolute constant c0 > 0 such that the following holds. If k ≥ c0(d log m + log(1/δ)), then with probability at least 1 − δ,

Ex,Dn|η̄(x) − η(x)| ≤ 8L (2k/m)^{β/(d−1)}.

Combining Lemmas 3.8, 3.11 and 3.6, we present the main result of this section.

Theorem 3.12. Let Dn = {(xi, yi)}_{i=1}^n ⊆ X × {0, 1} be the training data and consider the EaS representation given in (1). Let d ≥ 3 and pick any 0 < δ < 1. Assume η to be (L, β)-smooth. There is an absolute constant c0 > 0 such that the following holds. If k ≥ c0(d log m + log(1/δ)), then with probability at least 1 − δ over the random choice of Θ,

Pr(g(x) ≠ y) − L* ≤ 2 ( √( m/(2kn) + m/(ekn) ) + 8L (2k/m)^{β/(d−1)} ).

Corollary 3.13. In Theorem 3.12, set m = k n^{(d−1)/(2β+(d−1))}. Then, with probability at least 1 − δ over the choice of Θ, Pr(g(x) ≠ y) − L* = O( n^{−β/(2β+(d−1))} ).

Remark 3.14. Since X ⊆ S^{d−1}, the effective dimension in our setting is d̃ = d − 1, and the convergence rate of Corollary 3.13 can be rewritten as O(n^{−β/(2β+d̃)}), which is minimax-optimal for plug-in classifiers under the assumption that η is (L, β)-smooth [Audibert and Tsybakov, 2007].
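To illustrate the choice in Corollary 3.13, the following small sketch (illustrative only; the specific values of n, d, β, k and the dropped constants are placeholders, not values from the paper) checks numerically that m = k n^{(d−1)/(2β+(d−1))} balances the estimation term of Lemma 3.8 against the approximation term of Lemma 3.11, both of order n^{−β/(2β+(d−1))} up to constants.

```python
import numpy as np

def error_terms(n, d, beta, k):
    """Two error terms of Theorem 3.12 (up to constants) for m = k * n^{(d-1)/(2b+(d-1))}."""
    m = k * n ** ((d - 1) / (2 * beta + (d - 1)))
    estimation = np.sqrt(m / (k * n))                 # estimation term, cf. Lemma 3.8
    approximation = (2 * k / m) ** (beta / (d - 1))   # approximation term, cf. Lemma 3.11
    return estimation, approximation

d, beta, k = 20, 1.0, 60                              # placeholder values
for n in [10**4, 10**5, 10**6]:
    est, app = error_terms(n, d, beta, k)
    rate = n ** (-beta / (2 * beta + (d - 1)))        # target rate n^{-b/(2b+(d-1))}
    print(f"n={n:>7}: estimation={est:.4f}, approximation={app:.4f}, target rate={rate:.4f}")
```

Both terms shrink at the same polynomial rate, which is what makes this choice of m (up to constants) rate-optimal in the sense of Remark 3.14.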
3.5 Inability to adapt to manifold structure

In Theorem 3.12, we derived the convergence rate of the classifier presented in Alg. 1 by bounding Ex,Dn|η(x) − ˆη(x)|, which upper-bounds the excess Bayes risk, and the resulting convergence rate slows exponentially with the dimension d. We now show that even if the data lie on a low-dimensional manifold having dimension d0 ≪ d, there exists a smooth η such that the quantity Ex,Dn|η(x) − ˆη(x)| decreases at a rate no faster than n^{−1/(d+1)}. To prove this claim, we assume the data lie on the following one-dimensional manifold:

X1 = {(x1, x2, 0, 0, . . . , 0) ∈ R^d : x1^2 + x2^2 = 1}. (12)

We further assume that k = β = 1, which implies that η is L-Lipschitz by the definition of (L, β)-smoothness. Our lower bound result is as follows.

Theorem 3.15. For any d > 3, let the input space X1 be the one-dimensional sub-manifold of R^d given in (12). Take k = 1 and β = 1. Suppose the random matrix Θ has rows chosen from a distribution that is uniform over S^{d−1}. Then there exists a (1/2)-Lipschitz function η : X1 → [0, 1] such that the following holds with probability at least 1/2 over the choice of Θ:

Ex,Dn|η(x) − ˆη(x)| = Ω( n^{−1/(d+1)} ).

4 Algorithm 2

In this section we present an alternate EaS representation and an associated classification algorithm that adapts to the intrinsic dimension d0 when data lie on a low-dimensional manifold with dimension d0 ≪ d. Here, we assume that the rows θi, i ∈ [m], of the matrix Θ are sampled i.i.d. from a multivariate Gaussian distribution N(0, (1/d) Id), denoted by ν, where Id is the d × d identity matrix. This EaS representation is denoted by h2 : X → {0, 1}^m, where for any point x ∈ X, the jth coordinate of h2(x) is set to 1 as given in (14), using a data-dependent threshold τn. In particular, given a training set of size 2n sampled i.i.d. from (µ, η), using the first half of it, namely D′n = {(x′1, y′1), . . . , (x′n, y′n)}, define τn : R^d → R to be

τn(θ) = max{ τ ∈ R : Σ_{i=1}^n I[θ · x′i ≥ τ] ≥ k }, (13)

i.e., τn(θ) is the kth largest value among θ · x′1, . . . , θ · x′n. (Instead of representing the training set as D2n = {(xi, yi)}_{i=1}^{2n}, for notational convenience we represent its first half as D′n = {(x′i, y′i)}_{i=1}^n and the second half as Dn = {(xi, yi)}_{i=1}^n. D′n is only used to compute the empirical thresholds in (13), and therefore we could use unlabeled data sampled i.i.d. from µ for this purpose.) The jth coordinate of the representation is then

h2(x)[j] = I[θj · x ≥ τn(θj)]. (14)

We call the sparsification scheme given in (14) empirical k-thresholding. For j ∈ [m], the jth response region Cj is defined as

Cj = C(θj) = {x ∈ X : θj · x ≥ τn(θj)}. (15)

Using the second half of the training data, namely Dn = {(x1, y1), . . . , (xn, yn)}, the average y values over the different Cj are estimated and summarized in the vector w in the same way as in Alg. 1, implying that w[j] = ˆηj, j ∈ [m], where ˆηj is given in (2). This completes the training phase of our proposed algorithm, summarized in Alg. 2.

Algorithm 2
Input: Training sets D′n = {(x′i, y′i)}_{i=1}^n, Dn = {(xi, yi)}_{i=1}^n ⊆ X × {0, 1}, projection dimensionality m ∈ N, integer k ≪ m, integer t, random seed R; inference with test point x ∈ S^{d−1}.
TrainEaSClassifier(Dn, D′n, m, k, t, R)
  Sample Θ with seed R
  Initialize w[i], ct[i] ← 0, i ∈ [m]
  for j ∈ [m] do
    Compute τn(θj) using (13) and D′n
  end
  for (x, y) ∈ Dn do
    eas ← h2(x)
    w[i] ← w[i] + y, ∀i ∈ [m] : eas[i] = 1
    ct[i] ← ct[i] + 1, ∀i ∈ [m] : eas[i] = 1
  end
  w[i] ← w[i]/ct[i], ∀i ∈ [m]
  return Θ, w
end
InferEaSClassifier(x, Θ, t, w)
  eas ← h2(x)
  Compute At(x), the indices of the t directions in {θi : h2(x)[i] = 1} closest to x
  ẽas ← eas; ẽas[i] ← 0, ∀i ∈ [m] : i ∉ At(x)
  return I[(ẽas · w)/t ≥ 1/2]

The inference phase of Alg. 2 is slightly different from that of Alg. 1. While the EaS representation using h2 is no longer exactly k-sparse, we show that for a large enough sample size, with high probability, it is at least k/2-sparse and at most 2k-sparse in expectation (see Lemma F.1 in Appendix F). Let t be an integer passed as an argument to Alg. 2. For any x ∈ X, let Θ(x) = {θj : x ∈ Cj}, and define At(x) to be the set containing the indices j ∈ [m] such that θj is one of the t closest points to x from Θ(x). To make an inference for any x ∈ X, Alg. 2 first computes (ẽas · w)/t and makes its prediction based on whether this quantity is greater than 1/2. Clearly, (ẽas · w)/t is the average of w[j] for j ∈ At(x), and therefore the conditional probability estimate ˆη(x) of Alg. 2 can be represented as

ˆη(x) = (1/t) Σ_j ˆηj I[j ∈ At(x)]. (16)
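A minimal sketch of Alg. 2 follows, under stated assumptions: we read the empirical threshold (13) as the kth largest projection value over D′n, draw rows of Θ from N(0, (1/d) Id), and use placeholder sizes throughout; the inference step averages w over the t activated directions closest to x, as in (16).

```python
import numpy as np

def empirical_thresholds(X_half, theta, k):
    """tau_n(theta_j): k-th largest value of theta_j . x'_i over D'_n, cf. (13)."""
    proj = X_half @ theta.T                        # (n, m)
    return np.sort(proj, axis=0)[-k, :]            # per-column k-th largest value

def eas_threshold(X, theta, tau):
    """h2(x)[j] = I[theta_j . x >= tau_n(theta_j)], cf. (14)."""
    return (X @ theta.T >= tau).astype(np.int8)

def train_alg2(X_half, X, y, theta, k):
    tau = empirical_thresholds(X_half, theta, k)
    codes = eas_threshold(X, theta, tau)
    counts = codes.sum(axis=0)
    label_sums = codes.T.astype(float) @ y
    w = np.divide(label_sums, counts, out=np.zeros(theta.shape[0]), where=counts > 0)
    return tau, w

def predict_alg2(X_test, theta, tau, w, t):
    """Average w over A_t(x): the t activated directions theta_j closest to x, cf. (16)."""
    codes = eas_threshold(X_test, theta, tau)
    dists = np.linalg.norm(X_test[:, None, :] - theta[None, :, :], axis=2)
    dists = np.where(codes == 1, dists, np.inf)    # consider only activated directions
    nearest_t = np.argsort(dists, axis=1)[:, :t]   # indices in A_t(x)
    # (assumes at least t coordinates are activated, which the text argues holds w.h.p.)
    eta_hat = w[nearest_t].mean(axis=1)
    return (eta_hat >= 0.5).astype(int)

# illustrative usage (placeholder sizes)
rng = np.random.default_rng(1)
d, m, k, t, n = 10, 200, 16, 8, 500
theta = rng.standard_normal((m, d)) / np.sqrt(d)   # rows ~ N(0, (1/d) I_d)
X_half = rng.standard_normal((n, d)); X_half /= np.linalg.norm(X_half, axis=1, keepdims=True)
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = (X[:, 0] > 0).astype(float)
tau, w = train_alg2(X_half, X, y, theta, k)
print(predict_alg2(X[:5], theta, tau, w, t))
```

Restricting the average to the t nearest activated directions is what keeps the estimate local to the data manifold, which is the point of the good-direction analysis below.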
Remark 4.1. Note that h1 and h2 differ in a specific way. When the support of the data does not cover the whole unit sphere and is concentrated in a possibly small region, and m is large, many of the m coordinates in the EaS representation h1 will never be activated (set to 1) by any data point, since the corresponding projection direction θj may not be one of the k closest directions to any data point. Thus, many of the θj will be unused. This problem is avoided in h2, where every θj is used, but the respective response region Cj may not be local to the manifold. For this purpose we need to identify good projection directions. Later, in Lemma B.4, we show that, for any δ > 0, with probability at least 1 − 2δ over the choice of Θ and D′n, the number of good projection directions t is linear in k. Due to space limitations, the manifold assumptions and other important details of the analysis of Alg. 2 are presented in Appendix B; we present the main theoretical results below.

Theorem 4.2. Let Dn = {(xi, yi)}_{i=1}^n ∪ D′n = {(x′i, y′i)}_{i=1}^n ⊆ X × {0, 1} be the training data, where the data lie on a low-dimensional manifold satisfying the manifold assumption presented in section B.1, and suppose the EaS representation is given as in (14). Pick any 0 < δ < 1. Assume η to be (L, β)-smooth. If k ≥ c′d ln(m/δ), where c′d is a constant that depends on d, then with probability at least 1 − 2δ,

Pr(g(x) ≠ y) − L* ≤ 2 ( √( 2m/(αd k n) ) + 4L (2k/(c1 m))^{β/d0} ),

where αd is a constant that depends on d.

Corollary 4.3. In Theorem 4.2, setting m = k n^{d0/(2β+d0)} ensures that, with probability at least 1 − δ, Pr(g(x) ≠ y) − L* = O( n^{−β/(2β+d0)} ).

Remark 4.4. The convergence rate of Corollary 4.3 depends only on d0 and is minimax-optimal for plug-in classifiers under the assumption that η is (L, β)-smooth [Audibert and Tsybakov, 2007].

Figure 2: Empirical evaluation of Alg. 1, k-NN (for k = 1 and 10) and RF on eight datasets. Here the expansion factor is m/d. An error band (shaded region) is shown for Alg. 1 over 10 independent runs.

5 Empirical evaluations

We investigate the effectiveness of our proposed method by evaluating it on eight benchmark datasets, details of which are provided in Appendix A. We address the following questions: 1. Does the performance of our proposed classifier improve with increasing m, as suggested by the theory? 2. How does our proposed classifier perform compared to other non-parametric classifiers? For each dataset, we generate train and test sets using scikit-learn's train_test_split method (80 : 20 split). We compare our proposed method against two non-parametric classifiers: scikit-learn's implementations of the k-nearest neighbor classifier (k-NN) and random forest (RF). For k-NN, we used two values of k: k = 1 and 10.
For RF, we use a grid search over the number of estimators (trees) from the set {250, 500, 750, 1000} and perform 3-fold cross-validation to choose the final model. We present our experimental results in Fig. 2, where we plot the test accuracy of Alg. 1, k-NN (for k = 1 and 10) and RF while varying the expansion factor, which we define to be m/d. As per Theorem 3.12, we set k to be d log m. As can be seen from Fig. 2, with increasing m the test accuracy of Alg. 1 increases on all eight datasets and becomes comparable to that of k-NN and RF for large m, thus corroborating our theoretical findings.

6 Conclusions, limitations and future work

In this paper, we present an interesting connection between non-parametric estimation and expand-and-sparsify representation. We presented two non-parametric classification algorithms using EaS representation and proved that both algorithms yield minimax-optimal convergence rates. The convergence rate of the first algorithm depends on the ambient dimension d, while the convergence rate of the second algorithm, under a manifold assumption, depends only on the intrinsic dimension d0 ≪ d. In both algorithms, the projection directions are chosen in a data-independent manner. One limitation of our current work is that, even though the second algorithm adapts to the manifold structure, there is a large constant, possibly depending exponentially on d, involved in bounding the excess Bayes risk, which is hidden under the Big-Oh notation. In the future, we plan to investigate data-dependent choices of projection directions for a sparse representation that would adapt to a manifold structure, so that the constant involved in bounding the excess Bayes risk from above would be independent of the ambient dimension d.

Acknowledgements: We thank the anonymous reviewers for their constructive feedback. This work is supported by funding from the NSF AI Institute for Foundations of Machine Learning (IFML) (FAIN:2019844).

References

Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, January 2008. URL https://doi.org/10.1145/1327452.1327494.

Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608-633, 2007. doi: 10.1214/009053606000001217. URL https://doi.org/10.1214/009053606000001217.

Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(66):2015-2033, 2008. URL http://jmlr.org/papers/v9/biau08a.html.

Emmanuel J. Candes, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207-1223, 2006.

S. J. Caron, V. Ruta, L. F. Abbott, and R. Axel. Random convergence of olfactory inputs in the Drosophila mushroom body. Nature, 497(7447):113-117, May 2013.

Maurice J. Chacron, André Longtin, and Leonard Maler. Efficient computation via sparse coding in electrosensory neural networks. Current Opinion in Neurobiology, 21(5):752-760, 2011.

Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper_files/paper/2010/file/b534ba68236ba543ae44b22bd110a1d6-Paper.pdf.
Sanjoy Dasgupta and Christopher Tosh. Expressivity of expand-and-sparsify representations. arXiv preprint arXiv:2006.03741, 2020. URL https://arxiv.org/pdf/2006.03741.pdf.

Sanjoy Dasgupta, Charles F. Stevens, and Saket Navlakha. A neural algorithm for a fundamental computing problem. Science, 358(6364):793-796, 2017.

Sanjoy Dasgupta, Timothy C. Sheehan, Charles F. Stevens, and Saket Navlakha. A neural data structure for novelty detection. Proceedings of the National Academy of Sciences, 115(51):13093-13098, 2018.

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG '04, pages 253-262. Association for Computing Machinery, 2004. doi: 10.1145/997817.997857. URL https://doi.org/10.1145/997817.997857.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer, 1996. ISBN 978-1-4612-0711-5.

D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006. doi: 10.1109/TIT.2006.871582.

Wei Gao and Zhi-Hua Zhou. Towards convergence rate analysis of random forests for classification. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9300-9311, 2020.

Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB '99, pages 518-529. Morgan Kaufmann Publishers Inc., 1999. ISBN 1558606157.

László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, 2002. ISBN 978-0-387-95441-7.

A. C. Lin, A. M. Bygrave, A. de Calignon, T. Lee, and G. Miesenböck. Sparse, decorrelated odor coding in the mushroom body enhances learned odor discrimination. Nature Neuroscience, 17(4):559-568, April 2014.

N. Y. Masse, G. C. Turner, and G. S. Jefferis. Olfactory information processing in Drosophila. Current Biology, 19(16):R700-713, August 2009.

Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1):419-441, March 2008. ISSN 0179-5376.

B. A. Olshausen and D. J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481-487, August 2004.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper_files/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf.

Parikshit Ram and Kaushik Sinha. Flynn: Fruit-fly inspired federated nearest neighbor classification. In International Workshop on Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML 2021 (FL-ICML'21), 2021.

Parikshit Ram and Kaushik Sinha. Federated nearest neighbor classification with a colony of fruit-flies. In Thirty-Sixth AAAI Conference on Artificial Intelligence, pages 8036-8044. AAAI Press, 2022.

Erwan Scornet. On the asymptotics of random forests. Journal of Multivariate Analysis, 146(C):72-83, April 2016. ISSN 0047-259X. URL https://doi.org/10.1016/j.jmva.2015.06.009.
Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716-1741, 2015.

Kaushik Sinha and Parikshit Ram. Fruit-fly inspired neighborhood encoding for classification. In The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2021.

D. D. Stettler and R. Axel. Representations of odor in the piriform cortex. Neuron, 63(6):854-864, September 2009.

Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595-620, 1977.

Cheng Tang, Damien Garreau, and Ulrike von Luxburg. When do random forests fail? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, 2018.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519.

G. C. Turner, M. Bazhenov, and G. Laurent. Olfactory representations by Drosophila mushroom body neurons. Journal of Neurophysiology, 99(2):734-746, February 2008.

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49-60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.

R. I. Wilson. Early olfactory processing in Drosophila: mechanisms and principles. Annual Review of Neuroscience, 36:217-241, July 2013.

| Dataset Name | # Samples | # Features |
|---|---|---|
| mnist35 | 13454 | 20 |
| fmnist24 | 11200 | 20 |
| segment (v.2) | 2310 | 19 |
| wine (v.7) | 2554 | 11 |
| wind (v.2) | 6574 | 14 |
| puma8NH (v.2) | 8192 | 8 |
| cpu_small (v.3) | 8192 | 12 |
| twomoons | 5000 | 2 |

Table 1: Dataset statistics

A Dataset details and computing environment

Details of the eight datasets used in our experiments are listed in Table 1. Among the eight datasets, twomoons is a synthetic dataset generated using the make_moons method from sklearn.datasets (https://scikit-learn.org/stable/) with noise parameter 0.2. All remaining datasets are taken from the OpenML repository (https://www.openml.org/) Vanschoren et al. [2013]. Both mnist and fmnist (fashion-mnist for short) are 10-class classification problems with 784 features. We convert them to binary classification problems by using the labels 3 and 5 for the mnist dataset and the labels 2 (Pullover) and 4 (Coat) for the fmnist dataset. For efficiency, the feature dimensions of both these datasets are reduced to 20 using principal component analysis (PCA). For the remaining six datasets, the task is that of binary classification. For all eight datasets, the features are normalized using the StandardScaler option in scikit-learn and are then scaled to unit norm. We run our experiments on a laptop with an Intel Xeon W-10855M processor, 64GB memory and an NVIDIA Quadro RTX 5000 Mobile GPU (with 16GB memory).
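A short sketch of the preprocessing pipeline described above (the random seeds and the exact fetching of the OpenML datasets are not specified in the paper and are assumptions here; the twomoons generation, standardization, unit-norm scaling, and 80:20 split follow the text of Appendix A and section 5):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, normalize

# Synthetic twomoons dataset, as in Appendix A (noise parameter 0.2).
X, y = make_moons(n_samples=5000, noise=0.2, random_state=0)

# For mnist35/fmnist24 one would instead fetch the OpenML data, keep the two
# relevant classes, and reduce to 20 features with sklearn.decomposition.PCA.

# Standardize features, then scale each sample to unit norm so that the
# data lie on the unit sphere, matching the S^{d-1} assumption.
X = StandardScaler().fit_transform(X)
X = normalize(X)

# 80:20 train/test split, as in Section 5.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```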
B Missing details of convergence rate of Algorithm 2 from section 4

In this section we present the missing details of the convergence rate analysis of Algorithm 2 from section 4. Proofs of the various technical results presented in this section, as well as the proof of Theorem 4.2, are deferred to later sections of the appendix.

B.1 Manifold assumption

To analyze the convergence rate of Algorithm 2, we proceed in the same way as in section 3.2. However, to take advantage of the fact that data lie on a low-dimensional manifold in our analysis, we make manifold assumptions similar to those in Dasgupta and Tosh [2020]. The input space X is a compact d0-dimensional Riemannian submanifold M of R^d contained in the unit sphere, that is, M ⊆ S^{d−1}. M has nice boundaries, and the distribution µ on it is almost uniform: formally, there exist constants c1, c2, c3 > 0 such that for all x ∈ M and for all r ≤ r0 (where r0 is an absolute constant),

c1 r^{d0} ≤ µ(BM(x, r)) < c2 r^{d0}, (17)

Vol(BM(x, r)) ≥ c3 r^{d0}. (18)

Here BM(x, r) = B(x, r) ∩ M. To effectively analyze a data distribution supported on a manifold, we impose conditions on the curvature by adopting the common requirement that M has a positive reach ρ > 0: that is, every point in an open tubular neighborhood of M of radius ρ has a unique nearest neighbor in M [Niyogi et al., 2008]. For each x ∈ M, let N(x) denote the (d − d0)-dimensional subspace of normal vectors to the tangent plane at x. For each x ∈ M, the sets Γρ(x) = {x + ru ∈ R^d : u ∈ N(x), ‖u‖ = 1, 0 < r < ρ} are disjoint; let τρ be their union. Let πM : τρ → M be the projection map that sends any point in Γρ(x) to x, its nearest neighbor in M.

B.2 Bounding Ex,Dn|ˆη(x) − η̄(x)|

The response regions Cj, j ∈ [m], are now random quantities that depend on the choice of Θ as well as on D′n, which is used to compute the individual thresholds. The quantity ˆη(x) given in (16) depends on t. In this section, we fix Θ and t and, conditioned on this, we bound Ex,Dn,D′n|ˆη(x) − η̄(x)|. Later, in Lemma B.4, we show that, for any δ > 0, with probability at least 1 − 2δ over the choice of Θ and D′n, t is linear in k.

Lemma B.1. Fix any Θ and t. Then the following holds:

Ex,Dn,D′n|ˆη(x) − η̄(x)| ≤ √( m/(2tn) + m/(etn) ).

Sketch of proof: The proof strategy is similar to that of Lemma 3.8, but requires careful analysis to accommodate the facts that the Cj, j ∈ [m], depend on the choice of D′n and that ˆη in (16) involves an indicator function. Details are provided in the appendix.

B.3 Bounding Ex,Dn|η̄(x) − η(x)|

Using a proof strategy identical to that of Lemma 3.10, we have:

Lemma B.2. Under the empirical k-thresholding scheme, if η is (L, β)-smooth, then for all x ∈ C1 ∪ · · · ∪ Cm,

|η̄(x) − η(x)| ≤ L max_{j ∈ [m]} (diam(Cj))^β.

Unfortunately, unlike before, not every Cj with x ∈ Cj may be local to x, and thus the average y values in some of these Cj may be very different from η(x). To address this, we call a θ ∈ R^d good if it lies within distance ρ/2 of M (i.e., in the tube τρ/2). Good θj's are close to the manifold M and are guaranteed to be activated by a single neighborhood of M. Thus, for large enough m, if the diameters of the Cj's corresponding to good θj's are made reasonably small, then the average y values in such Cj's will be reasonably close to η(x) for x ∈ Cj. First, for any good θj, we bound the diameter of Cj, emphasizing that the response region of a good θj is local.

Lemma B.3. Pick any good θ ∈ R^d. Define ∆ = ‖θ − πM(θ)‖ to be the distance from θ to its projection on M. Let C(θ) be the response region associated with θ as defined in (15). Pick any 0 < δ < 1. Then with probability at least 1 − δ over the choice of D′n, diam(C(θ)) ≤ 4 (2k/(c1 m))^{1/d0}, provided (2k/(c1 m))^{1/d0} < min(ρ, r0) and n satisfies n ≥ c0 (m/k)(log n + log(m/δ)), where c0 > 0 is a universal constant.

Next we show that, with high probability, every x ∈ M has a number of good θj that is linear in k.

Lemma B.4. Pick θ1, . . . , θm ∼ ν i.i.d. There is a constant c′d, depending on d, for which the following holds. Pick any 0 < δ < 1. Set k ≥ c′d ln(m/δ). Then with probability at least 1 − 2δ over the choice of the θj's and D′n, for every x ∈ M there are at least αd k/2 good θj's with x ∈ C(θj), where c′d and αd are constants that depend on the dimension d.

Combining Lemma B.3 and B.4, we have Lemma B.5.
Suppose the data distribution is supported on a d0-dimensional submanifold M of Rd with reach ρ > 0, that additionally satisfied (17) and (18). Suppose that the rows of Θ are chosen from N(0, Id). There is a constant c d, depending on the dimension d, and c0 > 0, universal constant, for which the following holds. Pick any 0 < δ < 1. Let k, m and n be chosen so that 4We note that Lemma B.3 a modified version of Lemma 6 of Dasgupta and Tosh [2020] to incorporate the uncertainty associated with the data dependent threshold selection in (13). Also, Lemma B.4 is a modified version of Lemma 7 of Dasgupta and Tosh [2020] which incorporates threshold selection uncertainty and additionally guaranteeing the number of good θj to be linear in k. k c d ln(m/delta), (2k/(c1m))1/d0 < min(ρ, r0) and n (c0m/k)(log n + log(m/δ)). Then with probability at least 1 2δ over the choice of Θ and D n, sup x X | η(x) η(x)| 4L 2k Combining everything leads to the main result presented in Theorem 4.2. C Various proofs from section 3 C.1 Proof of Lemma 3.1 Proof. Simply interchange the summation. That is, i=1 wi,n(x) = I[xi Cj] Pn i=1 I[xi Cj] Pn i=1 I[xi Cj] Pn i=1 I[xi Cj] D Various proofs from section 3.1 In this section we present various technical results that are needed to prove our main theorem (Theorem 3.3). Due to space limitation proofs of some of these technical results are deferred to the supplementary material. D.1 Proof of Lemma 3.4 Proof. For part (i), note that if x Ck i , then by definition, j σ(i), hΘ,k(x)[j] = 1 which implies x Cj. Therefore, Ck i j σ(i)Cj. On the other hand, if x j σ(i)Cj then hΘ,k(x)[j] = 1 j σ(i), which implies x Ck i . Therefore, j σ(i)Cj Ck i . For part (ii), take any i, j [ m k ], i = j, then σ(i) = σ(j). Therefore, if x Ck i then x / Ck j . Since x was arbitrary, Ck i Ck j = . Now, for any x X, since hΘ,k(x) has exactly k bits activated (set to 1), there must exist l [ m k ] such that x Ck l . Therefore, ( m k) i=1Ck i = X. For part (iii), take any x Cj. From the definition of Cj, hΘ,k(x)[j] = 1 and therefore, x Ck i for any i such that j σ(i). From part(ii) since such Ck i s are disjoint and x was arbitrary, the result follows. Finally, part (iv) follows immediately from part (ii) and (iii) since Ck i s are disjoint. D.2 proof of Lemma 3.5 Proof. While general idea of this result appeared in Dasgupta and Tosh [2020], we provide a simplified proof. Let ν(r) = infx X ν(B(x, r)). We first claim that if ν(r) 2 m k + c0 d log m + log 1 δ then for every j [m], Cj B(θj, r). By definition, every ball of radius r, centered at some x X, has ν-mass at least ν(r). Thus, by Lemma D.1, with probability at least 1 δ, every x X will have its nearest k θj s within a distance r. Therefore, the jth bit of the Ea S representation given in (1) will be activated by points x within distance r of θj. Therefore, diam(Cj) 2r for all j = 1, . . . , m, where r satisfies ν(B(x, r)) 2 k + c0 d log m + log 1 By Lemma D.2, we can take r to be the value (3/4)(d 1)/2 2 k + c0 d log m + log 1 d k + c0 d log m + log 1 Therefore, we have diam(Cj) 2r 4 d k + c0 d log m + log 1 k + c0 d log m + log 1 where, for the last inequality, we have used the fact that d1/(2(d 1)) 2 and for d 3 it holds that 61/(d 1) Lemma D.1 ([Chaudhuri and Dasgupta, 2010]). There is an absolute constant c0 > 0 for which the following holds. Pick any 0 < δ < 1. Pick θ1, . . . , θm independently at random from a distribution µ on Rd. Then with probability at least 1 δ, any ball B in Rd with d log m + log 1 contains at least k of the θi. 
Lemma D.2 (Lemma 12 [Dasgupta and Tosh, 2020]). Suppose d 3, r (0, 1), and ν is the uniform distribution over Sd 1. Then for any x Sd 1, ν(B(x, r)) 1 3 d rd 1 1 r2/4 (d 1)/2 1 3 d (3/4)(d 1)/2 rd 1. D.3 Proof of Theorem 3.3 Proof. Since the weights wn,i given in equation 4 are non-negative, in order to satisfy condition (i) of Stone s theorem, we need to show that there exists a positive constant c such that for any non-negative measurable function f satisfying Ef(x) < and for any n, E (Pn i=1 wn,i(x)f(xi)) c E(f(x)). Using Lemma D.3, we show that this condition is satisfied for c = 1. Concerning condition (ii) of Stone s theorem, first define an = an(kn, mn, δn) = 8 kn+c0(d log mn+log( 1 1/(d 1) , where c0 is an absolute constant. Then limn δn = 0 and using condition (i) and (ii) of this theorem we also have limn an = 0. Now pick any a > 0. We can always find a positive integer N such that for n > N, we have a > an satisfying Pn i=1 wi,n(x)I{ xi x >a} Pn i=1 wi,n(x)I{ xi x >an} for all x X. Replacing k, m, δ and a with kn, δn and an respectively in Lemma D.6, we have i=1 wi,n(x)I[ xi x > a] i=1 wi,n(x)I[ xi x > an] Taking limit yields, limn E (Pn i=1 wi,n(x)I[ xi x > a]) = 0. Concerning condition (iii) of Stone s Theorem, note that E max 1 i n wn,i(x) E 1 Pn l=1 I[xl Cj] 1 Pn l=1 I[xi Cji(x,Θ)] E 1 min1 i k{Nji(x,Θ)} where, for i = 1, . . . , k, we have used the notation ji(x, Θ) to denote the k random coordinates of hΘ,k(x) that are set to 1 and Nji(x,Θ) to denote the number of points from x1, . . . , xn that fall in Cji(x,Θ). Now from equation 19, is it easy to see that limn E (max1 i n wn,i(x)) = 0 since using lemma D.7, min1 i k{Nji(x,Θ)} in probability, as n . D.4 Technical results for satisfying condition (i) of Stone s theorem (Theorem 3.2) Lemma D.3. For any non-negative measurable function f : X R+ satisfying Ef(x) < and for any n, E (Pn i=1 wn,i(x)f(x1)) E(f(x)), where the weights wn,i are as given in (4). i=1 wi,n(x)f(xi) I[xi Cj]f(xi) Pn l=1 I[xl Cj] I[xi Cj]f(xi) Pn l=1 I[xl Cj] I[xi Cj]f(xi) Pn l=1 I[xl Cj] I[xi Cj]f(xi) Pn l=1 I[xl Cj] Cj f(x1)µ(dx1) Pr(x Ck i ) Cj f(x1)µ(dx1) Pr(x Ck i ) µ(Cj) Cj f(x1)µ(dx1) µ(Ck i ) µ(Cj) Cj f(x1)µ(dx1) µ(Ck i ) µ(Cj) Cj f(x1)µ(dx1) c= 1 Cj f(x1)µ(dx1) Ck i f(x1)µ(dx1) e= 1 Ck i f(x1)µ(dx1) Ck i f(x1)µ(dx1) = ( m k) X Ck i f(x1)µ(dx1) f(x1)µ(dx1) = E(f(x1)) g= E(f(x)) where, inequality a follows from Lemma D.4. Equality b follows from the following observation. In the line above inequality b, we are summing k m k terms. Since k m k = m m 1 k 1 and any j [m] can appear in exactly m 1 k 1 different subsets of of size k, equality b is simply rearranging the terms from the line above by changing the indices appropriately. Equality c follows from part (iv) of Lemma 3.4. Equality d follows from part (iii) of Lemma 3.4.Equality e follows from the following observation. In the line above equality e, we are summing m m 1 k 1 terms since any j [m] can appear in exactly m 1 k 1 different subsets of of size k. Again noticing that m m 1 k 1 = k m k and each σ(i) has k terms in it, we are simply rearranging the terms from the line above by changing the indices appropriately. Finally, equality f follows from part (ii) of Lemma E.4 and equality g follows from the fact that x and x1 are i.i.d. Lemma D.4. E Pn i=1 I[xi Cj]f(xi) Pn l=1 I[xl Cj] x 1 µ(Cj) R Cj f(x1)µ(dx1). 
I[xi Cj]f(xi) Pn l=1 I[xl Cj] i=1 E I[xi Cj]f(xi) Pn l=1 I[xl Cj] I[xi Cj]f(xi) 1 + P l =i I[xl Cj] = n E I{x1 Cj}f(x1) 1 1 + Pn l=2 I[xl Cj] I{x1 Cj}f(x1)Ex2,...,xn 1 1 + Pn l=2 I[xl Cj] I{x1 Cj}f(x1) 1 nµ(Cj) Cj f(x1)µ(dx1) where, equality a follows from that fact, that if xi Cj, then the term in the parenthesis does not change and if xi / Cj, then the term within the parenthesis is zero, and inequality b follows from lemma D.5 since conditioned on x and x1, the random variable Pn l=2 I{xl Cj} is Binomially distributed with parameters (n 1) and µ(Cj). Lemma D.5. (Binomial bound Györfi et al. [2002]) Let the random variable B(n, p) be Binomially distributed with parameters n and p. Then, E 1 1 + B(n, p) D.5 Technical results for satisfying condition (ii) of Stone s theorem (Theorem 3.2) Lemma D.6. Pick any 0 < δ < 1 and let a = 8 k+c0(d log m+log( 1 1/(d 1) , where c0 > 0 is an absolute constant defined in Lemma 3.5. Then, i=1 wi,n(x)I[ xi x > a] Proof. From Lemma 3.5, we have that with probability at least 1 δ, diam(Cj) a for all 1 j m. Let us define the random variable: I[xi Cj] Pn i=1 I[xi Cj] I[ xi x > a] and the event: A2 = j:x Cj Cj B(x, a) Let Ac 2 denotes complement of the event A2. Then, i=1 wi,n(x)I[ xi x > a] I[xi Cj] Pn i=1 I[xi Cj] I{ xi x >a} = E (A1|A2) Pr (A2) + E (A1|Ac 2) Pr (Ac 2) a 0 Pr(A2) + 1 δ = δ where the inequality follows from the fact that maximum value of A1 is 1 and conditioned on the event A2, value of A1 is 0. D.6 Technical results for satisfying condition (iii) of Stone s theorem (Theorem 3.2) Lemma D.7. For any x µ, let j1(x, Θ), . . . , jk(x, Θ) [m] be the k random coordinates of hΘ,k(x) that are set to 1. Let Nj1(x,Θ), . . . , Njk(x,Θ) be the number of data points falling in Cj1(x,Θ), . . . , Cjk(x,Θ) respectively. Then min{Nj1(x,Θ), . . . , Njk(x,Θ)} in probability, whenever n and mk/n 0 as n . Proof. For any random x X, let Ck i(x,Θ) be the random cell of the partition {Ck i }( m k) i=1 which sets k random coordinates j1(x, Θ), . . . , jk(x, Θ) of hΘ,k(x) to one. Then, Ck i(x,Θ) = k i=1Cji(x,Θ). Let Nn(x, Θ) = Pn i=1 I{xi Ck i(x,Θ)} be the number of data points falling in the same cell as x. We first we show that Nn(x, Θ) in probability. Let N1, N2, . . . N( m k) be the number of points of x, x1, . . . , xn falling in the m k respective cells Ck 1 , . . . , Ck ( m k). Let S = {x, x1, . . . , xn} denote the set of positions of these n + 1 points. Since these points are independent and identically distributed, fixing the set S (but not the order of the points) and Θ, the conditional probability that x falls in the ith cell Ck i is Ni/(n + 1). Then for any fixed integer t > 0, Pr(Nn(x, Θ) < t) = E [Pr (Nn(x, Θ) < t|S, Θ)] n + 1 (t 1)mk which converges to zero on our assumption on n. Since Ck i(x,Θ) Cjl(x,Θ) for l = 1, . . . , k, we have Nn(x, Θ) min{Nj1(x,Θ), . . . , Njk(x,Θ)} and the statement of the Lemma follows. E Various proofs from section 3.2 For ease of exposition, we use the notation x[1,n] to denote x1, . . . , xn and y[1,n] to denote y1, . . . , yn for various proofs appearing in the section. E.1 Proof of lemma 3.8 Proof. Fix any Θ. This ensures that C1, . . . , Cm are fixed. 
E Various proofs from section 3.2

For ease of exposition, we use the notation $x_{[1,n]}$ to denote $x_1, \dots, x_n$ and $y_{[1,n]}$ to denote $y_1, \dots, y_n$ for the various proofs appearing in this section.

E.1 Proof of Lemma 3.8

Proof. Fix any $\Theta$. This ensures that $C_1, \dots, C_m$ are fixed. Now,
\[
\mathbb{E}(\hat\eta(x) - \bar\eta(x))^2 = \mathbb{E}_{x,\, x_{[1,n]},\, y_{[1,n]}}(\hat\eta(x) - \bar\eta(x))^2
= \mathbb{E}_{x_{[1,n]}}\Big[\mathbb{E}_{x,\, y_{[1,n]}}\big((\hat\eta(x) - \bar\eta(x))^2 \,\big|\, x_{[1,n]}\big)\Big]
= \mathbb{E}_{x_{[1,n]}}\,\mathbb{E}_{x,\, y_{[1,n]}}\Bigg[\Bigg(\frac{1}{k}\sum_{j:\,x\in C_j}(\hat\eta_j - \eta_j)\Bigg)^2 \,\Bigg|\, x_{[1,n]}\Bigg]
\]
\[
\overset{a}{\le} \mathbb{E}_{x_{[1,n]}}\,\mathbb{E}_{x,\, y_{[1,n]}}\Bigg[\frac{1}{k}\sum_{j:\,x\in C_j}(\hat\eta_j - \eta_j)^2 \,\Bigg|\, x_{[1,n]}\Bigg]
= \mathbb{E}_{x_{[1,n]}}\,\mathbb{E}_{x,\, y_{[1,n]}}\Bigg[\frac{1}{k}\sum_{j:\,x\in C_j}\Bigg(\frac{\sum_{i=1}^n y_i\,\mathbb{I}[x_i \in C_j]}{\sum_{i=1}^n \mathbb{I}[x_i \in C_j]} - \eta(C_j)\Bigg)^2 \,\Bigg|\, x_{[1,n]}\Bigg]
= \mathbb{E}_{x_{[1,n]}}\,\mathbb{E}_{x,\, y_{[1,n]}}\Bigg[\frac{1}{k}\sum_{j:\,x\in C_j}\Bigg(\frac{\sum_{i=1}^n y_i\,\mathbb{I}[x_i \in C_j]}{n\mu_n(C_j)} - \eta(C_j)\Bigg)^2 \,\Bigg|\, x_{[1,n]}\Bigg]
\]
\[
= \mathbb{E}_{x_{[1,n]}}\,\mathbb{E}_{x,\, y_{[1,n]}}\Bigg[\frac{1}{k}\sum_{j:\,x\in C_j}\Bigg\{\Bigg(\frac{\sum_{i=1}^n y_i\,\mathbb{I}[x_i \in C_j]}{n\mu_n(C_j)} - \eta(C_j)\Bigg)^2 \mathbb{I}[n\mu_n(C_j) > 0] + \eta^2(C_j)\,\mathbb{I}[\mu_n(C_j) = 0]\Bigg\} \,\Bigg|\, x_{[1,n]}\Bigg]
\]
\[
= \mathbb{E}_{x_{[1,n]}}\,\mathbb{E}_{x,\, y_{[1,n]}}\Bigg[\frac{1}{k}\sum_{j:\,h_{\Theta,k}(x)[j]=1}\frac{\big(\sum_{i=1}^n (y_i - \eta(C_j))\,\mathbb{I}[x_i \in C_j]\big)^2}{(n\mu_n(C_j))^2}\,\mathbb{I}[n\mu_n(C_j) > 0] \,\Bigg|\, x_{[1,n]}\Bigg]
+ \mathbb{E}_{x,\, x_{[1,n]}}\Bigg[\frac{1}{k}\sum_{j:\,x\in C_j}\eta^2(C_j)\,\mathbb{I}[\mu_n(C_j) = 0]\Bigg]
\]
\[
\overset{b}{\le} \mathbb{E}_{x_{[1,n]}}\,\mathbb{E}_{x}\Bigg[\frac{1}{k}\,\mathbb{E}_{y_{[1,n]}}\Bigg(\sum_{j:\,x\in C_j}\frac{\big(\sum_{i=1}^n (y_i - \eta(C_j))\,\mathbb{I}[x_i \in C_j]\big)^2}{(n\mu_n(C_j))^2}\,\mathbb{I}[n\mu_n(C_j) > 0] \,\Bigg|\, x,\, x_{[1,n]}\Bigg)\Bigg] + \frac{m}{kne}
\]
\[
\overset{c}{\le} \mathbb{E}_{x_{[1,n]}}\,\mathbb{E}_{x}\Bigg[\frac{1}{4k}\sum_{j:\,x\in C_j}\frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)} \,\Bigg|\, x_{[1,n]}\Bigg] + \frac{m}{kne}
\overset{d}{\le} \mathbb{E}_{x_{[1,n]}}\Bigg[\frac{1}{4k}\sum_{j=1}^m \mu(C_j)\,\frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)}\Bigg] + \frac{m}{kne}
\]
\[
= \frac{1}{4k}\sum_{j=1}^m \mu(C_j)\,\mathbb{E}_{x_{[1,n]}}\Bigg[\frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)}\Bigg] + \frac{m}{kne}
\overset{e}{\le} \frac{1}{4k}\sum_{j=1}^m \mu(C_j)\,\frac{2}{(n+1)\mu(C_j)} + \frac{m}{kne}
= \frac{m}{2k(n+1)} + \frac{m}{kne},
\]
where inequality a is due to Jensen's inequality, inequality b follows from Lemma E.3, inequality c follows from Lemma E.1, and inequality d follows from Lemma E.2. Finally, inequality e follows from the observation that $n\mu_n(C_j)$ is Binomially distributed with parameters $n$ and $\mu(C_j)$ and by an application of Lemma E.4. The result follows noting that by Jensen's inequality $\mathbb{E}|\hat\eta(x) - \bar\eta(x)| \le \sqrt{\mathbb{E}(\hat\eta(x) - \bar\eta(x))^2}$.

Lemma E.1. Pick an $m \times d$ projection matrix $\Theta$. Suppose the EaS representation uses (i) the mapping $\Theta$ and (ii) $k$-winner-take-all sparsification. Let $x$ be sampled from $\mu$ and let $D_n = ((x_1, y_1), \dots, (x_n, y_n))$ be a random training set where $x_i$ is sampled from $\mu$ and $y_i$ is distributed as $\eta(x_i)$ for $i \in [n]$. Then the following holds:
\[
\mathbb{E}_{y_{[1,n]}}\Bigg(\sum_{j:\,x\in C_j}\frac{\big(\sum_{i=1}^n (y_i - \eta(C_j))\,\mathbb{I}[x_i \in C_j]\big)^2}{(n\mu_n(C_j))^2}\,\mathbb{I}[n\mu_n(C_j) > 0]\ \Bigg|\ x,\, x_{[1,n]}\Bigg)
\;\le\; \frac{1}{4}\sum_{j:\,x\in C_j}\frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)}.
\]

Proof. Conditioned on $x$, only $k$ of the $m$ coordinates in the EaS representation of $x$ are non-zero. WLOG, for ease of exposition, assume these $k$ non-zero coordinates to be $j_1, \dots, j_k \in [m]$. Then the number of $x_i$ that fall in any such $C_{j_l}$, where $l \in [k]$, is $n\mu_n(C_{j_l})$. The $y_i$ values corresponding to these $x_i$ points (there are $n\mu_n(C_{j_l})$ of them in total) are identically and independently distributed with expectation
\[
\mathbb{E}(y_i \mid x_i \in C_{j_l}) = \Pr(y_i = 1 \mid x_i \in C_{j_l}) = \frac{1}{\mu(C_{j_l})}\int_{C_{j_l}} \Pr(y_i = 1 \mid x_i = x)\,\mu(dx) = \frac{1}{\mu(C_{j_l})}\int_{C_{j_l}} \eta(x)\,\mu(dx) = \eta(C_{j_l}).
\]
Therefore, we can write
\[
\mathbb{E}_{y_{[1,n]}}\Bigg(\sum_{j:\,x\in C_j}\frac{\big(\sum_{i=1}^n (y_i - \eta(C_j))\,\mathbb{I}[x_i \in C_j]\big)^2}{(n\mu_n(C_j))^2}\,\mathbb{I}[n\mu_n(C_j) > 0]\ \Bigg|\ x,\, x_{[1,n]}\Bigg)
= \sum_{l=1}^{k} \frac{\mathbb{E}_{y_{[1,n]}}\Big[\big(\sum_{i=1}^n (y_i - \eta(C_{j_l}))\,\mathbb{I}[x_i \in C_{j_l}]\big)^2\,\mathbb{I}[n\mu_n(C_{j_l}) > 0]\ \big|\ x_{[1,n]}\Big]}{(n\mu_n(C_{j_l}))^2}
\]
\[
\overset{a}{=} \sum_{l=1}^{k} \frac{\sum_{i=1}^n \mathbb{E}_{y_i}\big[(y_i - \eta(C_{j_l}))^2\,\mathbb{I}[x_i \in C_{j_l}]\,\mathbb{I}[n\mu_n(C_{j_l}) > 0]\ \big|\ x_i\big]}{(n\mu_n(C_{j_l}))^2}
\overset{b}{=} \sum_{l=1}^{k} \frac{\eta(C_{j_l})\big(1 - \eta(C_{j_l})\big)\sum_{i=1}^n \mathbb{I}[x_i \in C_{j_l}]\,\mathbb{I}[n\mu_n(C_{j_l}) > 0]}{(n\mu_n(C_{j_l}))^2}
\overset{c}{\le} \frac{1}{4}\sum_{l=1}^{k} \frac{\mathbb{I}[n\mu_n(C_{j_l}) > 0]}{n\mu_n(C_{j_l})},
\]
where equality a is due to the following observation: for any $i, i' \in [n]$, $i \ne i'$, with $x_i, x_{i'} \in C_l$ for some $l \in [m]$, $y_i$ and $y_{i'}$ are identically and independently distributed with expectation $\eta(C_l)$; therefore the expectation of the cross product is simply
\[
\mathbb{E}_{y_i, y_{i'}}\big[(y_i - \eta(C_l))(y_{i'} - \eta(C_l))\big] = \mathbb{E}[y_i y_{i'}] - \eta(C_l)\,\mathbb{E}[y_i + y_{i'}] + \eta(C_l)^2 = \mathbb{E}[y_i]\,\mathbb{E}[y_{i'}] - \eta(C_l)\big(\mathbb{E}[y_i] + \mathbb{E}[y_{i'}]\big) + \eta(C_l)^2 = 0.
\]
Equality b follows from a variance computation: in particular, for any $y_i$, $i \in [n]$, with $x_i \in C_l$ for some $l \in [m]$, $\mathbb{E}(y_i - \eta(C_l))^2 = \mathbb{E}y_i^2 - 2\eta(C_l)\,\mathbb{E}y_i + \eta(C_l)^2 = \eta(C_l) - 2\eta(C_l)^2 + \eta(C_l)^2 = \eta(C_l)(1 - \eta(C_l))$. Finally, inequality c follows from the fact that for any $z \in [0, 1]$, the maximum value of $z(1-z)$ is $1/4$, together with $\sum_{i=1}^n \mathbb{I}[x_i \in C_{j_l}] = n\mu_n(C_{j_l})$. It is easy to observe that the final result is equivalent to the right-hand side of the lemma statement.
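The zero-cross-term and variance computation in equalities a and b above can be checked numerically. The following small simulation (illustrative values only, not tied to any dataset or code in the paper) confirms that for conditionally i.i.d. Bernoulli labels with mean $\eta(C)$, the second moment of the centered sum equals $N\,\eta(C)(1-\eta(C))$, i.e., only the diagonal variance terms survive.

import numpy as np

rng = np.random.default_rng(1)
eta_C, N, trials = 0.3, 25, 200_000      # region-conditional mean, points in the region, MC trials
y = (rng.random((trials, N)) < eta_C)    # i.i.d. Bernoulli(eta_C) labels within one response region
s = (y - eta_C).sum(axis=1)              # centered sum over the region
print("empirical E[(sum)^2]:", np.mean(s**2))
print("N * eta * (1 - eta) :", N * eta_C * (1 - eta_C))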
Lemma E.2. Pick an $m \times d$ projection matrix $\Theta$. Suppose the EaS representation uses (i) the mapping $\Theta$ and (ii) $k$-winner-take-all sparsification. Let $x$ be sampled from $\mu$ and let $D_n = ((x_1, y_1), \dots, (x_n, y_n))$ be a random training set where $x_i$ is sampled from $\mu$ and $y_i$ is distributed as $\eta(x_i)$ for $i \in [n]$. Then, conditioned on $x_1, \dots, x_n$, the following holds:
\[
\mathbb{E}_x\Bigg(\sum_{j:\,x\in C_j}\frac{\mathbb{I}\{n\mu_n(C_j) > 0\}}{n\mu_n(C_j)}\ \Bigg|\ x_{[1,n]}\Bigg) \;\le\; \sum_{j=1}^m \mu(C_j)\,\frac{\mathbb{I}\{n\mu_n(C_j) > 0\}}{n\mu_n(C_j)}.
\]

Proof.
\[
\mathbb{E}_x\Bigg(\sum_{j:\,x\in C_j}\frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)}\ \Bigg|\ x_{[1,n]}\Bigg)
\overset{a}{=} \sum_{i=1}^{\binom{m}{k}} \Pr(x \in C^k_i)\sum_{j\in\sigma(i)}\frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)}
= \sum_{i=1}^{\binom{m}{k}} \sum_{j\in\sigma(i)}\mu(C^k_i)\,\frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)}
\]
\[
\overset{b}{=} \sum_{j=1}^m \frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)}\sum_{i:\,j\in\sigma(i)}\mu(C^k_i)
\overset{c}{=} \sum_{j=1}^m \mu(C_j)\,\frac{\mathbb{I}[n\mu_n(C_j) > 0]}{n\mu_n(C_j)},
\]
where equality a follows from part (ii) of Lemma 3.4, since $\{C^k_i\}_{i=1}^{\binom{m}{k}}$ forms a partition of $X$, and the definition of $\sigma$ in section 3.2. Equality b follows from the following observation: in the line above equality b, we are summing $k\binom{m}{k}$ terms; since $k\binom{m}{k} = m\binom{m-1}{k-1}$ and any $j \in [m]$ can appear in exactly $\binom{m-1}{k-1}$ different subsets of $[m]$ of size $k$, equality b simply rearranges the terms from the line above by changing the indices appropriately. Equality c follows from part (iv) of Lemma 3.4.

Lemma E.3. Pick an $m \times d$ projection matrix $\Theta$. Suppose the EaS representation uses (i) the mapping $\Theta$ and (ii) $k$-winner-take-all sparsification. Let $x$ be sampled from $\mu$ and let $D_n = ((x_1, y_1), \dots, (x_n, y_n))$ be a random training set where $x_i$ is sampled from $\mu$ and $y_i$ is distributed as $\eta(x_i)$ for $i \in [n]$. Then the following holds:
\[
\mathbb{E}_{x,\,x_{[1,n]},\,y_{[1,n]}}\Bigg(\sum_{j:\,x\in C_j}\eta^2(C_j)\,\mathbb{I}[\mu_n(C_j) = 0]\Bigg) \;\le\; \frac{m}{ne}.
\]

Proof.
\[
\mathbb{E}_{x,\,x_{[1,n]},\,y_{[1,n]}}\Bigg(\sum_{j:\,x\in C_j}\eta^2(C_j)\,\mathbb{I}[\mu_n(C_j) = 0]\Bigg)
\overset{a}{=} \mathbb{E}_{x_{[1,n]}}\Bigg[\mathbb{E}_x\Bigg(\sum_{j:\,x\in C_j}\eta^2(C_j)\,\mathbb{I}[\mu_n(C_j) = 0]\ \Bigg|\ x_{[1,n]}\Bigg)\Bigg]
\overset{b}{=} \mathbb{E}_{x_{[1,n]}}\Bigg[\sum_{i=1}^{\binom{m}{k}}\Pr(x\in C^k_i)\sum_{j\in\sigma(i)}\eta^2(C_j)\,\mathbb{I}[\mu_n(C_j) = 0]\Bigg]
\]
\[
= \mathbb{E}_{x_{[1,n]}}\Bigg[\sum_{i=1}^{\binom{m}{k}}\sum_{j\in\sigma(i)}\mu(C^k_i)\,\eta^2(C_j)\,\mathbb{I}[\mu_n(C_j) = 0]\Bigg]
\overset{c}{=} \mathbb{E}_{x_{[1,n]}}\Bigg[\sum_{j=1}^m \eta^2(C_j)\,\mathbb{I}[\mu_n(C_j) = 0]\sum_{i:\,j\in\sigma(i)}\mu(C^k_i)\Bigg]
= \sum_{j=1}^m \eta^2(C_j)\,\mu(C_j)\,\mathbb{E}_{x_1,\dots,x_n}\big[\mathbb{I}[\mu_n(C_j) = 0]\big]
\]
\[
= \sum_{j=1}^m \eta^2(C_j)\,\mu(C_j)\,(1 - \mu(C_j))^n
= \frac{1}{n}\sum_{j=1}^m \eta^2(C_j)\,n\mu(C_j)\,(1 - \mu(C_j))^n
\le \frac{1}{n}\sum_{j=1}^m \eta^2(C_j)\,n\mu(C_j)\,e^{-n\mu(C_j)}
\overset{d}{\le} \frac{1}{n}\sum_{j=1}^m n\mu(C_j)\,e^{-n\mu(C_j)}
\le \frac{m}{n}\max_{j}\big\{n\mu(C_j)\,e^{-n\mu(C_j)}\big\}
\overset{e}{\le} \frac{m}{ne},
\]
where equality a follows from the fact that the quantity within the inner square bracket is unaffected by the $y_i$'s. Equality b follows from part (ii) of Lemma 3.4, since $\{C^k_i\}_{i=1}^{\binom{m}{k}}$ forms a partition of $X$, and the definition of $\sigma$ in section 3.2. Equality c follows from the following observation: in the line above equality c, we are summing $k\binom{m}{k}$ terms; since $k\binom{m}{k} = m\binom{m-1}{k-1}$ and any $j \in [m]$ can appear in exactly $\binom{m-1}{k-1}$ different subsets of $[m]$ of size $k$, equality c simply rearranges the terms from the line above by changing the indices appropriately. Inequality d follows from the fact that $\max_j \eta(C_j) \le 1$. Finally, inequality e follows from the fact that $\sup_z z e^{-z} = 1/e$.

Lemma E.4 (Binomial bound, Györfi et al. [2002]). Let the random variable $B(n, p)$ be Binomially distributed with parameters $n$ and $p$. Then
\[
\mathbb{E}\left(\frac{\mathbb{I}[B(n,p) > 0]}{B(n, p)}\right) \;\le\; \frac{2}{(n+1)p}.
\]

E.2 Proof of Lemma 3.10

Proof. Take any $x \in X$. Then
\[
|\bar\eta(x) - \eta(x)| = \Bigg|\frac{1}{k}\sum_{j:\,x\in C_j}\big(\eta_j - \eta(x)\big)\Bigg| \;\le\; \frac{1}{k}\sum_{j:\,x\in C_j}\big|\eta(x) - \eta_j\big|,
\]
and for each $j$ with $x \in C_j$,
\[
\big|\eta(x) - \eta_j\big| = \Bigg|\eta(x) - \frac{1}{\mu(C_j)}\int_{C_j}\eta(x')\,\mu(dx')\Bigg| = \frac{1}{\mu(C_j)}\Bigg|\int_{C_j}\big(\eta(x) - \eta(x')\big)\,\mu(dx')\Bigg| \le \frac{1}{\mu(C_j)}\int_{C_j}\big|\eta(x) - \eta(x')\big|\,\mu(dx')
\]
\[
\le \frac{1}{\mu(C_j)}\int_{C_j} L\,\|x - x'\|^{\beta}\,\mu(dx') \;\le\; \frac{L\,(\mathrm{diam}(C_j))^{\beta}}{\mu(C_j)}\int_{C_j}\mu(dx') \;=\; L\,(\mathrm{diam}(C_j))^{\beta} \;\le\; L\,\max_{j\in[m]}(\mathrm{diam}(C_j))^{\beta}.
\]
Hence $|\bar\eta(x) - \eta(x)| \le L\,\max_{j\in[m]}(\mathrm{diam}(C_j))^{\beta}$.

E.3 Proof of Theorem 3.15

Proof. Define $\eta : X_1 \to [0, 1]$ to be the triangular function defined below: for $0 < \theta \le 2\pi$,
\[
\eta(\cos\theta, \sin\theta, 0, 0, \dots, 0) = \begin{cases}\ \theta/\pi, & \text{if } \theta \le \pi,\\ \ 2 - \theta/\pi, & \text{if } \theta > \pi.\end{cases}
\]
Clearly, $\eta$ is $\tfrac12$-Lipschitz, and for $k = 1$, $\bar\eta(x) = \sum_{j=1}^m \eta(C_j)\,\mathbb{I}\{x \in C_j\}$. Therefore, using Theorem E.5, with probability at least $1/2$ over the choice of $\Theta$,
\[
\sup_{x \in X_1}\,|\bar\eta(x) - \eta(x)| \;\ge\; c'_d\,\frac{1}{m^{1/(d-1)}\log m},
\]
where $c'_d$ is an absolute constant depending on $d$. Taking expectation, with probability at least $1/2$ over the choice of $\Theta$, we have $\mathbb{E}_{X, D_n}|\bar\eta(X) - \eta(X)| \ge c'_d\,\frac{1}{m^{1/(d-1)}\log m}$. Next, using Lemma 3.8 with $k = 1$ yields $\mathbb{E}_{X, D_n}|\hat\eta(X) - \bar\eta(X)| \le \sqrt{m/n}$. Using the triangle inequality and combining these results,
\[
\mathbb{E}_{X, D_n}|\eta(X) - \hat\eta(X)| = \mathbb{E}_{X, D_n}|\eta(X) - \bar\eta(X) + \bar\eta(X) - \hat\eta(X)| \;\ge\; \mathbb{E}_{X, D_n}|\bar\eta(X) - \eta(X)| - \mathbb{E}_{X, D_n}|\hat\eta(X) - \bar\eta(X)| \;\ge\; c'_d\,\frac{1}{m^{1/(d-1)}\log m} - \sqrt{\frac{m}{n}}. \tag{20}
\]
Ignoring the log term, we see that for large enough $n$ the first term on the right-hand side of (20) will dominate. In particular, for $n = \Omega\big(m^{1 + \frac{2}{d-1}}\big)$, we get $\mathbb{E}_{X, D_n}|\eta(X) - \hat\eta(X)| = \Omega\big(m^{-\frac{1}{d-1}}\big)$. To see this (ignoring the log term), if $\sqrt{m/n} \le \frac{c'_d}{2}\cdot\frac{1}{m^{1/(d-1)}}$, then $\mathbb{E}_{X, D_n}|\eta(X) - \hat\eta(X)| \ge \frac{c'_d}{2}\cdot\frac{1}{m^{1/(d-1)}}$. Now, $\sqrt{m/n} \le \frac{c'_d}{2}\cdot\frac{1}{m^{1/(d-1)}}$ if and only if $n \ge \frac{4}{(c'_d)^2}\, m^{1 + \frac{2}{d-1}}$. In particular, setting $n = \frac{4}{(c'_d)^2}\, m^{1 + \frac{2}{d-1}}$, we have
\[
\mathbb{E}_{X, D_n}|\eta(X) - \hat\eta(X)| \;\ge\; \frac{c'_d}{2}\, m^{-\frac{1}{d-1}} \;=\; \Big(\frac{c'_d}{2}\Big)^{\frac{d-1}{d+1}}\, n^{-\frac{1}{d+1}}.
\]
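For completeness, the exponent bookkeeping behind the last equality can be spelled out as follows; this is only a restatement of the computation above, with no new assumptions introduced.
\[
n = \frac{4}{(c'_d)^2}\, m^{1+\frac{2}{d-1}} = \frac{4}{(c'_d)^2}\, m^{\frac{d+1}{d-1}}
\quad\Longrightarrow\quad
m = \Big(\frac{(c'_d)^2}{4}\, n\Big)^{\frac{d-1}{d+1}}
\quad\Longrightarrow\quad
m^{-\frac{1}{d-1}} = \Big(\frac{(c'_d)^2}{4}\, n\Big)^{-\frac{1}{d+1}} = \Big(\frac{2}{c'_d}\Big)^{\frac{2}{d+1}}\, n^{-\frac{1}{d+1}},
\]
so that
\[
\frac{c'_d}{2}\, m^{-\frac{1}{d-1}} = \Big(\frac{c'_d}{2}\Big)^{1-\frac{2}{d+1}}\, n^{-\frac{1}{d+1}} = \Big(\frac{c'_d}{2}\Big)^{\frac{d-1}{d+1}}\, n^{-\frac{1}{d+1}},
\]
which is the $n^{-1/(d+1)}$ rate (up to the ignored logarithmic factor) claimed in Theorem 3.15.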
Theorem E.5 (Theorem 4 of Dasgupta and Tosh [2020]). For any $d > 3$, let the input space $X_1$ be the one-dimensional sub-manifold of $\mathbb{R}^d$ given in (12). Take $k = 1$. Suppose that the random matrix $\Theta$ has rows chosen from the distribution $\nu$ that is uniform over $S^{d-1}$. For any $0 < \lambda < 1$, there exists a $\lambda$-Lipschitz function $f : X_1 \to \mathbb{R}$ such that with probability at least $1/2$ over the choice of $\Theta$, no matter how the weights $w_1, \dots, w_m$ are set, the resulting function $\hat f(x) = \sum_{j=1}^m w_j\,\mathbb{I}\{x \in C_j\}$ has approximation error at least
\[
\sup_{x \in X_1}\,|\hat f(x) - f(x)| \;\ge\; c'_d\,\lambda\,\frac{1}{m^{1/(d-1)}\log m},
\]
where $c'_d$ is some absolute constant depending on $d$.

F Various proofs from section 4 and Appendix B

F.1 Lemma F.1 and its proof

We first show that while the EaS representation using $h_2$ is not exactly $k$-sparse, for a large enough sample size, with high probability, it is at least $k/2$-sparse and at most $2k$-sparse in expectation.

Lemma F.1. Pick any $\delta > 0$. For $k$, $m$ and $n$ satisfying $n \ge c_0\,\frac{m}{k}\big(\log n + \log\frac{m}{\delta}\big)$, where $c_0 > 0$ is a universal constant, with probability at least $1 - \delta$ over the random choice of the training set $D'_n$, the following holds for all $j \in [m]$:
\[
\frac{k}{2m} \;\le\; \Pr_{x\sim\mu}\big(\theta_j\cdot x \ge \tau_n(\theta_j)\big) \;\le\; \frac{2k}{m}.
\]

Proof. We start with a version of the relative generalization error bound, originally due to Vapnik and Chervonenkis, that appeared in Chaudhuri and Dasgupta [2010].

Theorem F.2 (Theorem 15 of Chaudhuri and Dasgupta [2010]). Let $\mathcal{G}$ be a class of functions from $X$ to $\{0, 1\}$ with VC dimension $d < \infty$, and $P$ a probability distribution on $X$. Let $\mathbb{E}$ denote expectation with respect to $P$. Suppose $n$ points are drawn independently from $P$; let $\mathbb{E}_n$ denote expectation with respect to this sample. Then for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $g \in \mathcal{G}$:
\[
-\min\Big(\beta_n\sqrt{\mathbb{E}_n g},\ \beta_n^2 + \beta_n\sqrt{\mathbb{E} g}\Big) \;\le\; \mathbb{E} g - \mathbb{E}_n g \;\le\; \min\Big(\beta_n^2 + \beta_n\sqrt{\mathbb{E}_n g},\ \beta_n\sqrt{\mathbb{E} g}\Big),
\]
where $\beta_n = \sqrt{\frac{4\big(d\log(2n) + \log(8/\delta)\big)}{n}}$.

Let $\theta_i$ be the $i$th row of the projection matrix $\Theta$ and let $\mu^i$ be the distribution of $\theta_i\cdot x$ where $x$ is distributed according to $\mu$. Then, for any $a \in \mathbb{R}$, $\Pr_{x\sim\mu}(\theta_i\cdot x \ge a) = \mu^i([a, \infty))$. Suppose $x'_1, \dots, x'_n$ are sampled independently at random from $\mu$. For any interval $A_a = [a, \infty)$, let $\mu^i_n(A_a) = \frac{1}{n}\sum_{j=1}^n \mathbb{I}[\theta_i\cdot x'_j \ge a]$. Consider the class of indicator functions over the intervals $[a, \infty)$ for $a \in \mathbb{R}$. This class has VC dimension 1; define $\beta^2 = \frac{4}{n}\big(\log 2n + \log\frac{8m}{\delta}\big)$. Suppose $\mu^i_n(A_a)$ satisfies the condition $\mu^i_n(A_a) \ge 4\beta^2$. The bound $\mu^i(A_a) - \mu^i_n(A_a) \le \beta^2 + \beta\sqrt{\mu^i_n(A_a)}$ from Theorem F.2 yields
\[
\mu^i(A_a) \;\le\; \mu^i_n(A_a) + \beta^2 + \beta\sqrt{\mu^i_n(A_a)} \;\le\; \mu^i_n(A_a) + \tfrac14\,\mu^i_n(A_a) + \tfrac12\,\mu^i_n(A_a) \;=\; \tfrac74\,\mu^i_n(A_a).
\]
Similarly, the bound $\mu^i_n(A_a) - \mu^i(A_a) \le \beta\sqrt{\mu^i_n(A_a)}$ from Theorem F.2 yields
\[
\mu^i(A_a) \;\ge\; \mu^i_n(A_a) - \beta\sqrt{\mu^i_n(A_a)} \;\ge\; \mu^i_n(A_a) - \tfrac12\,\mu^i_n(A_a) \;=\; \tfrac12\,\mu^i_n(A_a).
\]
Combining these two bounds, we can conclude that for all intervals $A_a$ satisfying $\mu^i_n(A_a) \ge 4\beta^2$, with probability at least $1 - \frac{\delta}{m}$ it holds that $\tfrac12\,\mu^i_n(A_a) \le \mu^i(A_a) \le \tfrac74\,\mu^i_n(A_a)$.

Now consider only the intervals $A_{\tau_n(\theta_j)}$. From (13) it is clear that $\mu^i_n(A_{\tau_n(\theta_j)})$ can exceed $\frac{k}{m}$ by at most $\frac{1}{n}$. This is because, for any interval, $\mu^i_n(\cdot)$ can take $(n+1)$ possible values in steps of $\frac{1}{n}$, namely $0, \frac{1}{n}, \dots, 1$; therefore, from the definition of $\tau_n(\theta_j)$ in (13), if $\mu^i_n(A_{\tau_n(\theta_j)}) \ge \frac{k}{m}$, its value can be at most $\frac{k}{m} + \frac{1}{n}$. Therefore, the condition $\mu^i_n(A_{\tau_n(\theta_j)}) \ge 4\beta^2$ is implied by $\frac{k}{m} \ge \frac{16}{n}\big(\log 2n + \log\frac{8m}{\delta}\big)$; since $\log 2n + \log 8 + \log\frac{m}{\delta} \le c\big(\log n + \log\frac{m}{\delta}\big)$ for an absolute constant $c$, this holds under the assumption of the lemma, or in other words whenever $n \ge c_0\,\frac{m}{k}\big(\log n + \log\frac{m}{\delta}\big)$.
Noting that, 7 4 µi n(Aτn(θj)) 7k Therefore, if k n log n + log m δ for some universal constant c0 > 0, then with probability at least 1 δ m, k 2m Prx µ (θj x τn(θj) 2k Taking union bound over θ1, . . . , θm yields the desired results. F.2 Proof of Lemma B.3 Proof. Pick any good θ, and let x = πM(θ) be its projection on M. Since X consists of unit vectors, using Lemma F.1 with probability at least 1 δ, the points that lie in C(θ) are at least k/(2m) fraction or at most 2k/m fraction of x s (under distribution µ) that have highest dot product with θ or equivalent to closest to θ. Thus, C(θ) is the set of the form B(θ, r ) where radius r is so chosen that k/(2m) µ(B(θ, r )) 2k/m. However, it is not of the form B(x, r )which causes complication. In particular, two questions need to be answered: (i) if a point x M lies within distance r < ρ of x, how far can it possibly be from θ, and conversely, (ii) if x M lies within distance r < ρ of θ , how far can it possibly be from x? These two questions are answered in Lemma 6 of Dasgupta and Tosh [2020] showing that, BM(θ, r ) B x, r ρ ρ ((r )2 2) (22) For the left-hand containment pick r = q ρ ρ+ k 2c2m 1/d0 . Further taking, r = r ρ ρ ((r )2 2) = k 2c2m and using (17) yields that µ(BM(x, r ) < k/(2m). Now using (21) and (22) we have, BM(x, r) BM(θ, r ) BM(x, r ). Since µ(BM(x, r )) < k/(2m), this ensures BM(x, r) BM(θ, r ) C(θ). Pick r = 2k c1m 1/d0 . Using (17), this ensures that µ(B(x, r)) 2k/m. Further taking r = r ρ ρ ((r )2 2) = and using (21) and (22), we have, BM(x, r) BM(θ, r ) BM(x, r ). Since µ(BM(x, r)) 2k/m, this ensures that C(θ) BM(x, r ) which in turn is contained in BM(x, r ). Since for any good θ, ρ/2, the result follows. F.3 Proof of Lemma B.4 Proof. Let E1 be the event that bounds the diameter of the response region of any good θ, as specified in Lemma B.3, with probability at least 1 δ, over the random choice of the training set. We condition on this event. For any x M and r > 0, let A(x, r) = {θ Γρ/2 : πM(θ) x < r} ρ ρ+ for ρ/2, using Lemma B.3 and choosing large enough m satisfying (2k/(c1m))1/d0 < min(ρ, r0), if we set θ A(x, r1) x BM which implies x C(θ). It has been established that A(x, r) has non-negligible probability mass under ν. Lemma F.3 (Lemma 14 of Dasgupta and Tosh [2020]). Suppose ν is multivariate Gaussian N(0, I). There is a constant cd, that depends on the dimension d, such that any x M and 0 < r < r0, we have ν(A(x, r)) cdrd0. Using (17), M has a r1/2 cover ˆ M of size at most (2c2/c1)8d0m/k. To see this pick points x1, . . . , xn M that are at a distance r1/2 from each other. The balls B(x1, r1/4) are disjoint and each has µ(B(xi, r1/4)) c1(r1/4)d0 = (c1/(2c2))(1/8d0)k/m. Since total probability mass of these N balls is at most 1, this gives the bound on N. Pick any ˆx ˆ M. For i [m], let Ui be a binary random variable that takes value 1 if θi A(ˆx, r1/2) and 0 otherwise. Therefore, Ui = 1 w.p. ν(A(ˆx, r1/2)) 0 w.p.1 ν(A(ˆx, r1/2)) Note that ν(ˆx, r1/2) cd(r1/2)d0 = (cd/(2c2))(1/4d0)k/m = αdk/m, where we have set αd = (cd/(2c2))(1/4d0). Let U = Pm i=1 Ui be that number of θi s that fall in A(ˆx, r1/2) and EU = Pm i=1 EUi αdk. Let E2 be the event that U > αdk/2 and let Ec 2 be the complement event. Using Chernoff bound, we have Pr(Ec 2|E1) = Pr (U αdk/2|E1) Pr (U (1/2)EU|E1) e EU/8 e αdk/8 Bounding the right most quantity above to be at most δ/ ˆ M, for k as specified in the lemma statement, for suitable choice of c d, we conclude that E2 holds (conditioned on E1) with probability at least 1 δ. 
Therefore, with probability at least 1 2δ, both E1 and E2 hold, impying that for every ˆx ˆ M, there are at least αdk/2 good θi s in A(ˆx, r1/2). Now pick any arbitrary x M. There is some ˆx ˆ M with x ˆx r1/2. Moreover, for any θj A(ˆx, r1/2) θj A(x, r1/2) x C(θj). F.4 Proof of Lemma B.1 For ease of exposition, we use the notation x[1,n] to denote x1, . . . , xn and y[1,n] to denote y1, . . . , yn, x [1,n] to denote x 1, . . . , x n and y [1,n] to denote y 1, . . . , y n for various proofs appearing in the section. Note number of non-zero entries in the Ea S representation given in (14) is variable and we use the notation k(x) to denote the number of non-zero entries in h2(x). Proof. Fix any Θ. Then, conditioning on x [1,n] ensures that C1, . . . , Cm are fixed. Now, Ex,Dn,D n(ˆη(x) η(x))2 = Ex,x[1,n],y[1,n],x [1,n],y [1,n] (ˆη(x) η(x))2 a= Ex,x[1,n],y[1,n],x [1,n] (ˆη(x) η(x))2 = Ex[1,n],x [1,n] (ˆη(x) η(x))2 x[1,n], x [1,n]) = Ex[1,n],x [1,n] j:x Cj (ˆηj ηj) 2 x[1,n], x [1,n] b Ex[1,n],x [1,n] j:x Cj (ˆηj ηj)2 x[1,n], x [1,n] = Ex[1,n],x [1,n] Pn i=1 yi I[xi Cj] Pn i=1 I[xi Cj] η(Cj) 2 I[j At(x)] x[1,n], x [1,n] = Ex[1,n],x [1,n] Pn i=1 yi I[xi Cj] nµn(Cj) η(Cj) 2 I[j At(x)] x[1,n], x [1,n] = Ex[1,n],x [1,n] Pn i=1 yi I[Xi Cj] nµn(Cj) η(Cj) 2 I[nµn(Cj) > 0]I[j At(x)] +η2(Cj)I[µn(Cj) = 0]I[j At(x)] x[1,n], x [1,n] = Ex[1,n],x [1,n] Pn i=1(yi η(Cj))I{xi Cj} 2 I[nµn(Cj) > 0]I[j At(x)] x[1,n], x [1,n] +Ex[1,n],x [1,n] j:x Cj η2(Cj)I[µn(Cj) = 0]I[j At(x)] x[1,n], x [1,n] c Ex[1,n],x [1,n] Pn i=1(yi η(Cj))I[xi Cj] 2 I[nµn(Cj) > 0]I[j At(x)] x[1,n], x [1,n] = Ex[1,n],x [1,n] Pn i=1(yi η(Cj))I[xi Cj] 2 I[nµn(Cj) > 0]I[j At(x)] x, x[1,n], x [1,n]) 4Ex[1,n],x [1,n] I[nµn(Cj) > 0]I[j Ax t ] nµn(Cj) x[1,n], x [1,n] 2m t(n + 1) tne = m 2t(n + 1) + m where equality a follows from the fact that the quantity with the inner square bracket is unaffected by the yi s. Inequality b is due to Jensen s inequality. Inequality c follows from Lemma F.4, in equality d follows from Lemma F.6 and inequality e follows from Lemma F.7. The result finally follows noting that by Jensen s inequality E|ˆη(x) η(x)| p E(ˆη(x) η(x))2. Lemma F.4. Pick m d projection matrix Θ. Suppose the Ea S representation uses (i) a mapping Θ and (ii) empirical k-threshold sparsification. Let x be sampled from µ and let Dn = ((x1, y1), . . . , (xn, yn)) is a random training set where xi is sampled from µ and yi is distributed as η(xi) for i [n]. Similarly, D n = ((x 1, y 1), . . . , (x n, y n)) be another random training set independent of Dn, where x i is sampled from µ and y i is distributed as η(x i) for i [n]. Then the following holds. Ex[1,n],x [1,n] j:x Cj η2(Cj)I[µn(Cj) = 0]I[j At(x)] x[1,n], x [1,n] Proof. We start by taking the 1/t factor outside. t Ex[1,n],x [1,n] j:x Cj η2(Cj)I[µn(Cj) = 0]I[j At(x)] x[1,n], x [1,n] t Ex[1,n],x [1,n] j:x Cj η2(Cj)I[µn(Cj) = 0]I[j At(x)] x[1,n], x [1,n] t Ex[1,n],x [1,n] k =0 Pr(k(x) = k )Ex j:x Cj η2(Cj)I[nµn(Cj) = 0]I[j At(x)] k(x) = k ] x[1,n], x [1,n] t Ex[1,n],x [1,n] k =1 Pr(k(x) = k )Ex j:x Cj η2(Cj)I[nµn(Cj) = 0]I[j At(x)] k(x) = k ] x[1,n], x [1,n] t Ex[1,n],x [1,n] k =1 Pr(k(x) = k )Ex j:x Cj I[nµn(Cj) = 0]I[j At(x)] k(x) = k ] x[1,n], x [1,n] t Ex[1,n],x [1,n] k =1 Pr(k(x) = k ) j=1 µ(Cj)I[nµn(Cj) = 0] x[1,n], x [1,n] t Ex[1,n],x [1,n] Pr(k(x) = 0) j=1 µ(Cj)I[nµn(Cj)] + k =1 Pr(k(x) = k ) j=1 µ(Cj)I[nµn(Cj) = 0] x[1,n], x [1,n] t Ex[1,n],x [1,n] j=1 µ(Cj)I[nµn(Cj) = 0] k =0 Pr(k(x) = k ) ! 
x[1,n], x [1,n] t Ex[1,n],x [1,n] j=1 µ(Cj)I[nµn(Cj) = 0] x[1,n], x [1,n] j=1 Ex[1,n] µ(Cj)I[nµn(Cj) = 0] x [1,n] j=1 µ(Cj)(1 µ(Cj)n nt Ex [1,n] j=1 nµ(Cj)(1 µ(Cj)n nt Ex [1,n] j=1 nµ(Cj)e nµ(Cj) nt Ex [1,n] n nµ(Cj)e nµ(Cj)o f m nt Ex [1,n] where equality a follows from the fact that the quantity with the inner square bracket is unaffected by the yi s. Equality b follows from the fact that for k(x) = 0, number of non-zero entries in the Ea S of x using (14) is zero and x / Cj, j. Therefore, we denote the quantity P j:x Cj η2(Cj)I[nµn(Cj)]I[j Ax t ] = 0. Inequality c follows from the fact that η(Cj) 1, j. Inequality d follows from Lemma F.5. Inequality e follows since we are adding a non-negative quantity. Finally, inequality f follows from the fact that supz ze z = 1 j:x Cj I[nµn(Cj) = 0]I[j At(x)] k(x) = k j=1 µ(Cj)I[nµn(Cj) = 0] Proof. Conditioned on k(x) = k , Ea S representation given in (14) has exactly k non-zero entries. Therefore, using Lemma 3.4, j:x Cj I[nµn(Cj) = 0]I[j At(x)] k(x) = k a= ( m k ) X i=1 Pr x Ck i X j:σ(i) I[nµn(Cj) = 0]I[j At(x)] = ( m k ) X j:σ(i) µ(Ck i )I[nµn(Cj) = 0]I[j At(x)] i:j σ(i) µ(Ck i )I[nµn(Cj) = 0]I[j At(x)] j=1 I[nµn(Cj) = 0] i:j σ(i) µ(Ck i )I[j At(x)] j=1 I[nµn(Cj) = 0] i:j σ(i) µ(Ck i ) j=1 µ(Cj)I[nµn(Cj) = 0] where equality a follows from part (ii) of lemma 3.4 since {Ck i }( m k ) i=1 forms a partition of X and the definition of σ in section 3.2 with k = k . Equality b follows from the following observation. In the line above inequality b, we are summing k m k terms. Since k m k = m m 1 k 1 and any j [m] can appear in exactly m 1 k 1 different subsets of of size k , equality b is simply rearranging the terms from the line above by changing the indices appropriately. Equality c follows from part (iv) of lemma 3.4. Lemma F.6. Pick m d projection matrix Θ. Suppose the Ea S representation uses (i) a mapping Θ and (ii) empirical k-threshold sparsification. Let x be sampled from µ and let Dn = ((x1, y1), . . . , (xn, yn)) is a random training set where xi is sampled from µ and yi is distributed as η(xi) for i [n]. Similarly, D n = ((x 1, y 1), . . . , (x n, y n)) be another random training set independent of Dn, where x i is sampled from µ and y i is distributed as η(x i) for i [n]. Then the following holds. Pn i=1(yi η(Cj))I[xi Cj] 2 I[nµn(Cj) > 0]I[j At(x)] x, x[1,n], x [1,n] I[nµn(Cj) > 0]I[j At(x)] Proof. Conditioned on x and x [1,n], only k(x) = k of the m coordinates in the Ea S representation of x are non-zero. If k = 0 then the right hand side of the Lemma statement still holds with tghe value being zero since each Cj will be an empty set in that case. WLOG, for ease of exposition, assume that k > 0 and these k non-zero coordinate to be j1, . . . , jk [m]. Then the number of xi that falls in any such Cjl, where l [k ], is nµn(Cjl). The yi values corresponding to these xi points (there are nµn(Cjl) of them in total) are identically and independently distributed with expectation E(yi|xi Cjl) = Pr(yi = 1|xi Cjl) = 1 µ(Cjl) Cjl Pr(yi = 1|xi = x)µ(dx) Cjl η(x)µ(dx) = η(Cjl) Therefore, we can write Pn i=1(yi η(Cj))I[xi Cj] 2 I[nµn(Cj) > 0]I[j At(x)] x, x[1,n], x [1,n] (Pn i=1(yi η(Cjl))I[xi Cjl])2 I[nµn(Cjl) > 0]I[jl At(x)] x, x[1,n], x [1,n] (nµn(Cjl))2 (yi η(Cjl))2I[xi Cjl]I[nµn(Cjl) > 0]I[jl At(x)] x, xi, x i (nµn(Cjl))2 Pn i=1 η(Cjl)(1 η(Cjl))I[xi Cjl]I[nµn(Cjl) > 0]I[jl At(x)] (nµn(Cjl))2 I[nµn(Cjl) > 0]I[jl At(x)] where, equality a is due to the following observation. 
For any i, j [m], i = j and xi, xj Cl for some l [m], yi and yj are identically and independently distributed with expectation η(Cl). Therefore, the expectation of the cross product is simply: Eyi,yj [(yi η(Cl))(yj η(Cl)] = Eyi,yj yiyj η(Cl)(yi + yj) + η(Cl)2 = Eyi Eyj η(Cl)(Eyi + Eyj) + η(Cl)2 = 0. Equality b follows from variance computation. In particular for any yi, i [m] with xi Cl for some l [m], E (yi η(Cl))2 = Ey2 i 2η(Cl)Eyi +η(Cl)2 = η(Cl) 2η(Cl)2 +η(Cl)2 = η(Cl)(1 η(Cl)). Finally, inequality c follows from the fact that for any z [0, 1], the maximum value of z(1 z) is 1 It is easy to observe that the final result is equivalent to 1 I[nµn(Cj)>0]I[j At(x)] Lemma F.7. Pick m d projection matrix Θ. Suppose the Ea S representation uses (i) a mapping Θ and (ii) empirical k-threshold sparsification. Let x be sampled from µ and let Dn = ((x1, y1), . . . , (xn, yn)) is a random training set where xi is sampled from µ and yi is distributed as η(xi) for i [n]. Similarly, D n = ((x 1, y 1), . . . , (x n, y n)) be another random training set independent of Dn, where x i is sampled from µ and y i is distributed as η(x i) for i [n]. Then the following holds. Ex[1,n],x [1,n] I[µn(Cj) > 0]I[j At(x)] x[1,n], x [1,n] 2m t(n + 1) Proof. We start by taking the 1/t factor outside. t Ex[1,n],x [1,n] I[µn(Cj) > 0]I[j At(x)] x[1,n], x [1,n] t Ex[1,n],x [1,n] k =0 Pr(k(x) = k )Ex I[nµn(Cj) > 0]I[j At(x)] x[1,n], x [1,n] t Ex[1,n],x [1,n] k =1 Pr(k(x) = k )Ex I[nµn(Cj) > 0]I[j At(x)] x[1,n], x [1,n] t Ex[1,n],x [1,n] k =1 Pr(k(x) = k ) µ(Cj)I[nµn(Cj) > 0] x[1,n], x [1,n] t Ex[1,n],x [1,n] Pr(k(x) = 0) µ(Cj)I[nµn(Cj) > 0] k =1 Pr(k(x) = k ) µ(Cj)I[nµn(Cj) = 0] nµn(Cj) x[1,n], x [1,n] t Ex[1,n],x [1,n] µ(Cj)I[nµn(Cj) > 0] k =0 Pr(k(x) = k ) ! x[1,n], x [1,n] t Ex[1,n],x [1,n] µ(Cj)I[nµn(Cj) > 0] x[1,n], x [1,n] j=1 Ex[1,n] µ(Cj)I[nµn(Cj) > 0] 2µ(Cj) (n + 1)µ(Cj) = 2m t(n + 1) where equality a follows from the fact that for k(x) = 0, number of non-zero entries in the Ea S of x using (14) is zero and x / Cj, j. Therefore, we denote the quantity P j:x Cj I[nµn(Cj)>0]I[j At(x)] nµn(Cj) = 0. Inequality b follows from Lemma F.8. Inequality c follows since we are adding a non-negative quantity. Finally, inequality d follows from the fact that nµn(Cj) is Binomially distributed with parameters n and µ(Cj) and by an application of lemma E.4. I[nµn(Cj) > 0]I[j At(x)] µ(Cj)I[nµn(Cj) > 0] Proof. Conditioned on k(x) = k , Ea S representation given in (14) has exactly k non-zero entries. Therefore, using Lemma 3.4, I[nµn(Cj) > 0]I[j At(x)] a= ( m k ) X i=1 Pr x Ck i X I[nµn(Cj) > 0]I[j At(x)] = ( m k ) X µ(Ck i )I[nµn(Cj) > 0]I[j At(x)] µ(Ck i )I[nµn(Cj) > 0]I[j At(x)] I[nµn(Cj) > 0]I[j At(x)] i:j σ(i) µ(Ck i ) I[nµn(Cj) > 0] i:j σ(i) µ(Ck i ) µ(Cj)I[nµn(Cj) > 0] where equality a follows from part (ii) of lemma 3.4 since {Ck i }( m k ) i=1 forms a partition of X and the definition of σ in section 3.2 with k = k . Equality b follows from the following observation. In the line above inequality b, we are summing k m k terms. Since k m k = m m 1 k 1 and any j [m] can appear in exactly m 1 k 1 different subsets of of size k , equality b is simply rearranging the terms from the line above by changing the indices appropriately. Equality c follows from part (iv) of lemma 3.4. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: See section 3,4,5. 
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: See section 6. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: See section 3,4 and Appendix B-F. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. 
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: section 5, Appendix A. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Data is available from Open ML repository. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. 
Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Section 5. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: Error bar provided in the form of shaded graph. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? 
Answer: [Yes] Justification: See Appendix A. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Guidelines: The answer NA means that the paper poses no such risks. 
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: Data used from Open ML repository. KNN and Random Forest models are run from Scikit-learn. See section 5. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. 
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.