# DNNR: Differential Nearest Neighbors Regression

Youssef Nader *1, Leon Sixt *1, Tim Landgraf 1

K-nearest neighbors (KNN) is one of the earliest and most established algorithms in machine learning. For regression tasks, KNN averages the targets within a neighborhood, which poses a number of challenges: the neighborhood definition is crucial for the predictive performance, as neighbors might be selected based on uninformative features, and averaging does not account for how the function changes locally. We propose a novel method called Differential Nearest Neighbors Regression (DNNR) that addresses both issues simultaneously: during training, DNNR estimates local gradients to scale the features; during inference, it performs an n-th order Taylor approximation using estimated gradients. In a large-scale evaluation on over 250 datasets, we find that DNNR performs comparably to state-of-the-art gradient boosting methods and MLPs while maintaining the simplicity and transparency of KNN. This allows us to derive theoretical error bounds and inspect failures. In times that call for transparency of ML models, DNNR provides a good balance between performance and interpretability.²

*Equal contribution. 1 Department of Computer Science, Freie Universität Berlin, Germany. Correspondence to: Youssef Nader, Leon Sixt, Tim Landgraf. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). ² For code, see https://github.com/younader/DNNR_paper_code

1. Introduction

K-nearest neighbors (KNN) is an early machine learning algorithm (Cover & Hart, 1967) and a prototypical example of a transparent algorithm. Transparency means that a model's decision can be explained by inspecting its parts. KNN's transparency follows from its simplicity: it can be expressed in simple terms as "the system averages the targets of the most similar points to the query". At the same time, the algorithm's simplicity makes it amenable to theoretical analysis, such as obtaining bounds on KNN's prediction error (Chaudhuri & Dasgupta, 2014). However, KNN's predictive performance is limited. Most works aiming to improve KNN primarily focused on the selection of the neighbors, the distance metric, and the number of nearest neighbors k (Wettschereck & Dietterich, 1993; Weinberger & Tesauro, 2007; Cleary & Trigg, 1995). KNN's averaging scheme assumes that the target variable's changes are independent of those in the input features.

Here, we introduce Differential Nearest Neighbors Regression (DNNR) to make use of that very gradient information. For each neighbor of a query point, we estimate the gradient of the function, and then, instead of averaging the targets, we average a Taylor approximation. KNN can then be seen as a zero-order Taylor approximation, while DNNR uses higher orders of Taylor's theorem. A visual summary of the differences between KNN regression and DNNR can be found in Figure 1. In a theoretical analysis, we derived a bound on the pointwise error of DNNR in relation to parameters such as the number of training points and the neighborhood size, and found that the error bound favors DNNR over KNN. In an empirical evaluation on over 250 different datasets, we confirmed that DNNR outperforms KNN regression and performs on par with gradient boosting methods.
An ablation study then confirmed that both the gradient-based prediction and the feature scaling contribute to the performance gains. Using a synthetic dataset generated with known underlying ground truth, we simulated the error bound and found that DNNR requires fewer training points than KNN. Furthermore, we present an investigation on a DNNR failure and showcase how the model s transparency can be used in such an analysis. The regulation of Machine Learning algorithms in high-risk applications is being discussed globally or already under preparation (European Commission, 2021). We contribute to the overall goal of transparent and safer ML models in the following ways: We propose a new regression method (DNNR) that performs on par with state-of-the-art algorithms; DNNR is theoretically grounded: we provide a proof to bound DNNR s point-wise error (Theorem 1) and validate its usefulness empirically; DNNR: Differential Nearest Neighbors Regression An extensive evaluation against 11 methods on a set of 8 regression datasets, the PMLB benchmark (133 datasets), and Feynman symbolic regression (119 datasets); We provide detailed analyses to understand DNNR s performance (ablation study; impact of data properties) and transparency (inspection of failure cases). 2. Related Work KNN is a non-parametric model based on a simple voting decision rule where the target of a given point is predicted by averaging the targets of neighboring samples (Cover & Hart, 1967). For an introduction to KNN, we refer the reader to (Chen & Shah, 2018). Numerous methods have been proposed to improve this simple decision rule. (Kulkarni & Posner, 1995; Chaudhuri & Dasgupta, 2014) investigated the KNN convergence rate under different sampling conditions while (Balsubramani et al., 2019) and (Wettschereck & Dietterich, 1993) proposed different methods for an adaptive choice of k. Although the choice of the number of neighbors is critical, it is not the only factor governing KNN performance. A large subset of the KNN literature proposed techniques for the choice of the distance metric that defines the neighborhood. (Cleary & Trigg, 1995) introduced an entropy-based distance metric, and (Wang et al., 2007) proposed an adaptive distance. Metric learning methods propose data-driven learning mechanisms for the distance function (Weinberger & Tesauro, 2007; Weinberger et al., 2006; Wang et al., 2018). A similar approach changes the data representation upon which the distance function operates via feature weighting or feature selection (Aha, 1998; Vivencio et al., 2007). (Bhatia & Vandana, 2010) provides a more comprehensive overview of the different KNN techniques. However, all these methods do not change how the prediction is being performed all use an averaging scheme of the targets that effectively does not account for how the function changes within the local neighborhood. A method that uses local changes is local linear regression (LL) (Fan, 1992). Similar to KNN, local linear regression selects k-nearest neighbors and then fits a hyperplane locally. This differs from our proposed method as we fit the gradient for all nearest neighbors separately. The single hyperplane of LL regression assumes an identical gradient for each neighbor. We show results for LL regression in the quantitative evaluation. Gradient Approximation Estimating the gradient from data has been studied for various reasons, including variable selection and dimensionality reduction (Hristache et al., 2001; Mukherjee & Zhou, 2006). 
Several non-parametric methods exist to estimate the gradient from data (Fan & Gijbels, 1996; De Brabanter et al., 2013), and convergence bounds already exist for some techniques (Berahas et al., 2021; Turner et al., 2010). An error bound on L1-penalized gradient approximation using KNN is derived in (Ausset et al., 2021); that work also applied the estimated gradient to variable selection and gradient optimization problems, but did not utilize it to improve the prediction. (Ben-Shabat & Gould, 2020) use local gradient approximation for 3D model reconstruction by fitting truncated jets in the KNN neighborhoods. As we will present in detail, our method combines non-parametric gradient estimation using KNN, Taylor expansion, and feature scaling for regression modeling. To our surprise, this combination was explored neither theoretically nor empirically before.

Notation. We consider the typical supervised regression problem. The training data is given as a set of $n$ tuples of data points and target values $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$. We denote the expected target value by $\eta(x) = \mathbb{E}[Y \mid X = x]$. A ball with radius $r$ around $x$ is given by $B_{x,r}$. For a ball around $x$ containing exactly $k$ training points, we use a $\#$ sign, as in $B_{x,\#k}$. We summarize our notation in Appendix Table 4.

3. DNNR

Vanilla KNN predicts the target of a given data point $x$ by averaging the targets of nearby training points:

$$\eta_{\mathrm{KNN}}(x) = \frac{1}{k} \sum_{X_m \in B_{x,\#k}} Y_m. \tag{1}$$

Figure 1. (a) An illustration of KNN regression. To predict a value for a query (circle with question mark), the target values of the nearest points (red circles) are averaged. KNN's prediction is marked by the red cross. The other data points (gray circles) are not used for prediction. (b) Similar illustration of DNNR. The local gradient (gray dashed line) is estimated for each neighbor and a target value is interpolated linearly (light red crosses). The final prediction (red cross) is the average of these interpolated values.

Algorithm 1. Pseudocode of DNNR's prediction for a query point x. The feature scaling is omitted in this pseudocode. The OLS function solves an ordinary least-squares problem.

Require: a query point x, training data {(X_i, Y_i)}, number of nearest neighbors k, number of neighbors for gradient estimation k', target-value range y_min, y_max
  M ← nn(x, k)                      # indices of the k nearest neighbors of x
  predictions ← []
  for each neighbor index m ∈ M do
      A ← nn(X_m, k')               # indices of the k' nearest neighbors of X_m
      ΔX ← X_A − X_m
      ΔY ← Y_A − Y_m
      γ̂_m ← OLS(ΔX, ΔY)            # γ̂_m approximates the gradient at X_m
      ŷ_m ← Y_m + γ̂_m (x − X_m)
      predictions.append(ŷ_m)
  end for
  ŷ ← mean(predictions)
  return clip(ŷ, y_min, y_max)

This simple averaging scheme can be seen as a zeroth-order Taylor expansion around the inference point. If we knew the gradient of the function $\eta$, we could easily extend it to a first-order Taylor expansion:

$$\eta_{\mathrm{known}}(x) = \frac{1}{k} \sum_{X_m \in B_{x,\#k}} \big( Y_m + \nabla\eta(X_m)(x - X_m) \big). \tag{2}$$

Of course, we only have access to the training data, not the underlying target function. Therefore, we approximate the gradient using nearby points. We can approximate $\nabla\eta(X_m)$ by solving a least-squares problem:

$$\hat\gamma_m = \operatorname*{arg\,min}_{\gamma_m} \|A\gamma_m - q\|, \tag{3}$$

where $\hat\gamma_m \in \mathbb{R}^d$ is the estimated gradient at point $X_m$, $A \in \mathbb{R}^{k' \times d}$ contains the normalized differences $(X_i - X_m)/h_i$ as row vectors with $h_i = \|X_i - X_m\|$, $k'$ is the number of points used to approximate the gradient, $i$ indexes these $k'$ nearby points around $X_m$, and $q \in \mathbb{R}^{k'}$ denotes the normalized differences in the labels, $q_i = (Y_i - Y_m)/h_i$.
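To make the procedure concrete, the following is a minimal NumPy sketch of the least-squares gradient estimate in equation (3) inside the prediction loop of Algorithm 1 (feature scaling omitted, as in the pseudocode). It is not the reference implementation: the function names and the brute-force neighbor search are our own simplifications, and the defaults k=7 and k'=32 are the values used later in Section 5.5.

```python
import numpy as np

def estimate_gradient(X_train, y_train, x_m, y_m, k_prime):
    """Least-squares gradient estimate at a training point X_m (equation 3)."""
    dists = np.linalg.norm(X_train - x_m, axis=1)
    idx = np.argsort(dists)[1:k_prime + 1]       # k' nearest neighbors, excluding X_m itself
    h = dists[idx][:, None]                      # h_i = ||X_i - X_m||
    A = (X_train[idx] - x_m) / h                 # normalized feature differences
    q = (y_train[idx] - y_m) / h[:, 0]           # normalized label differences
    gamma, *_ = np.linalg.lstsq(A, q, rcond=None)
    return gamma

def dnnr_predict(X_train, y_train, x, k=7, k_prime=32):
    """DNNR prediction for a single query x (Algorithm 1, no feature scaling)."""
    y_min, y_max = y_train.min(), y_train.max()
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]            # indices of the k nearest neighbors of x
    preds = []
    for m in neighbors:
        gamma_m = estimate_gradient(X_train, y_train, X_train[m], y_train[m], k_prime)
        # first-order Taylor term around the neighbor X_m
        preds.append(y_train[m] + gamma_m @ (x - X_train[m]))
    return np.clip(np.mean(preds), y_min, y_max)
```

In practice, a KD-tree or ball tree would replace the brute-force neighbor search; the per-neighbor least-squares problems themselves are small (k' x d) and cheap to solve.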
The result $\hat\gamma_m$ can then substitute the real gradient in equation (2) to yield the DNNR approximation:

$$\eta_{\mathrm{DNNR}}(x) = \frac{1}{k} \sum_{X_m \in B_{x,\#k}} \big( Y_m + \hat\gamma_m (x - X_m) \big). \tag{4}$$

Using the approximate Taylor expansion, we aim to improve the prediction performance by also utilizing the local change of the function. The pseudocode of DNNR is listed in Algorithm 1. The algorithm can also be extended easily to higher orders of the Taylor expansion. In the evaluation, we will report results using the diagonal of the Hessian, i.e. the elements corresponding to $\partial^2\eta / \partial x_i^2$.

Feature Weighting. Using an isotropic distance weighs each dimension equally. This way, a neighbor may be picked based on irrelevant dimensions. A simple improvement is to scale the feature dimensions, as done in previous work (Weinberger & Tesauro, 2007). We use a common approach for a distance metric $d$ with a diagonal matrix $W$ that scales each dimension:

$$d(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j). \tag{5}$$

DNNR's predictions are differentiable w.r.t. the input dimensions, so a loss can be backpropagated to the scaling matrix $W$ and optimized using gradient descent. Inspired by Taylor's theorem, we require that nearby points predict each other well, while predictions of far points may come with a larger error. Therefore, the loss enforces a correlation between distance and the prediction error:

$$W = \operatorname*{arg\,min}_{W} \sum_{i,j \in I} \operatorname{cossim}\Big( d(X_i, X_j),\; Y_i - \hat\eta_{\mathrm{DNNR}}\big(X_{\mathrm{nn}(i,k')}\big) \Big), \tag{6}$$

where $I$ is an index set of nearby points and cossim denotes the cosine similarity. At first sight, an alternative might have been minimizing the prediction's mean squared error (MSE). However, minimizing the MSE would not alter the scaling, as the prediction is scale-invariant: downscaling a dimension will increase the gradient, and the prediction will stay the same. Therefore, the spatial inductive bias in equation (6) is needed.
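Any autograd framework can be used to optimize such a scaling. The snippet below is a loose PyTorch sketch of a single update step, not the authors' implementation: the prediction errors are passed in as fixed values, only the scaled distances are differentiated, the pairing of errors with point pairs is our simplification, and we maximize the alignment between distances and errors, which follows the stated intuition (the sign convention relative to equation (6) may differ in the reference code).

```python
import torch

def scaling_step(X, y_errors, w, lr=0.01):
    """One gradient step on a diagonal feature scaling w (sketch of eqs. (5)-(6)).

    X:        (n, d) float tensor of nearby training points.
    y_errors: (n,) float tensor of fixed per-point prediction errors.
    w:        (d,) current per-dimension scaling.
    """
    w = w.clone().requires_grad_(True)
    Xs = X * w                                    # scaling features by w corresponds to W = diag(w**2) in eq. (5)
    dists = torch.cdist(Xs, Xs)                   # pairwise scaled distances
    i, j = torch.triu_indices(len(X), len(X), offset=1)
    d_vec = dists[i, j]
    e_vec = (y_errors[i] + y_errors[j]) / 2       # our pairing of errors with point pairs
    # align distances with prediction errors via cosine similarity (maximized here)
    loss = -torch.nn.functional.cosine_similarity(d_vec, e_vec, dim=0)
    loss.backward()
    with torch.no_grad():
        return w - lr * w.grad
```

In a full implementation, the neighbor sets and prediction errors would be recomputed periodically as the scaling changes.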
4. Theoretical Analysis

We focus on the point-wise error estimate of DNNR vs. KNN regression. The proof contains two parts: the approximation error of the gradient and the point-wise prediction error. In Appendix C.2, we show that:

Lemma 1. Let $f : D \subseteq \mathbb{R}^d \to \mathbb{R}$ be of class $C^\mu$, $a \in D$, and $B(a) \subseteq D$ be some neighborhood of $a$. Suppose that around point $a$ we have $m$ neighboring points $v_k$, $k = 1, \dots, m$, with $a, v_1, \dots, v_m \in B(a) \subseteq D$. Suppose further that all $\mu$-th order derivatives are Lipschitz, i.e. for all $i \in \{1, \dots, \mu\}$,

$$\frac{\partial^i f}{\partial a_1^{l_1} \cdots \partial a_d^{l_d}} \in \operatorname{Lip}_{\vartheta_i}(B(a)) \quad \text{where } l_1 + \dots + l_d = i,$$

and we approximate the gradient locally at $a$ by $\hat\gamma = E_1 \hat\omega$ via the least-squares solution $\hat\omega = \operatorname{arg\,min}_{\omega \in \mathbb{R}^p} \|A\omega - q\|$, where

$$A = \begin{pmatrix} \nu_1^T & \bar\nu_1^T \\ \nu_2^T & \bar\nu_2^T \\ \vdots & \vdots \\ \nu_m^T & \bar\nu_m^T \end{pmatrix} \in \mathbb{R}^{m \times p}, \qquad q = \begin{pmatrix} \frac{f(a + h_1\nu_1) - f(a)}{h_1} \\ \frac{f(a + h_2\nu_2) - f(a)}{h_2} \\ \vdots \\ \frac{f(a + h_m\nu_m) - f(a)}{h_m} \end{pmatrix} \in \mathbb{R}^m, \tag{7}$$

$E_1 = (I_d \;\; 0) \in \mathbb{R}^{d \times p}$, $h_k = \|v_k - a\|$ with $h_k \nu_k = v_k - a$, $\nu_k^T = (\nu_{1k}, \dots, \nu_{dk})$, $\bar\nu_k^T$ collects the corresponding higher-order Taylor terms (powers of $h_k$ and products of the components of $\nu_k$ with multinomial weights) for orders $2 \le i \le \mu$ with $l_1 + \dots + l_d = i$, and $p$ is the total number of partial-derivative terms up to order $\mu$. Then a bound on the error in the least-squares gradient estimate is given by:

$$\|\nabla f(a) - \hat\gamma\|_2 \le \frac{\vartheta_{\max}\, h_{\max}^{\mu}}{\sigma_1 (\mu+1)!} \sqrt{\sum_{i=1}^{m} \|\nu_i\|_1^{2\mu}}, \tag{8}$$

where $\sigma_1$ is the smallest singular value of $A$, which is assumed to have $\operatorname{rank}(A) = p$, $\vartheta_{\max} = \max_{i \in \{1,\dots,\mu\}} \vartheta_i$, and $h_{\max} = \max_{1 \le k \le m} h_k$.

This lemma extends a result for the two-dimensional case (Turner et al., 2010). The gradient approximation depends on the Lipschitz constant $\vartheta_{\max}$, the distance to the neighbors $h_{\max}$, and the smallest singular value $\sigma_1$ of the matrix of normed differences $A$. The lemma also shows that by accounting for higher-order terms, e.g. picking $\mu > 1$, the gradient approximation can be made more accurate.

The following theorem is based on Theorem 3.3.1 in (Chen & Shah, 2018). We build on the same assumptions, except requiring Lipschitz instead of Hölder continuity.

Technical Assumptions ($A^{\mathrm{technical}}_{X,\rho,P_X}$): The feature space $X$ and distance $\rho$ form a separable metric space. The feature distribution $P_X$ is a Borel probability measure.

Assumption Besicovitch ($A^{\mathrm{Besicovitch}}_{\eta}$): The regression function $\eta$ satisfies the Besicovitch condition if $\lim_{r \to 0} \mathbb{E}[Y \mid X \in B_{x,r}] = \eta(x)$ for $x$ almost everywhere w.r.t. $P_X$.

Assumption Lipschitz ($A^{\mathrm{Lipschitz}(\vartheta_{\max})}_{\eta}$): The regression function $\eta$ is Lipschitz continuous with parameter $\vartheta_{\max}$ if $|\eta(x) - \eta(x')| \le \vartheta_{\max}\,\rho(x, x')$ for all $x, x' \in X$.

Using Lemma 1, we prove the following theorem in Appendix C.2:

Theorem 1 (DNNR pointwise error). Under assumptions $A^{\mathrm{technical}}_{X,\rho,P_X}$ and $A^{\mathrm{Besicovitch}}_{\eta}$, let $x \in \operatorname{supp}(P_X)$ be a feature vector, $\varepsilon > 0$ be an error tolerance in estimating the expected label $\eta(x) = \mathbb{E}[Y \mid X = x]$, and $\delta \in (0, 1)$ be a probability tolerance. Suppose that $Y \in [y_{\min}, y_{\max}]$ for some constants $y_{\min}$ and $y_{\max}$. There exists a threshold distance $h^*_{\mathrm{DNNR}} \in (0, \infty)$ such that for any smaller distance $h \in (0, h^*_{\mathrm{DNNR}})$, if the number of training points $n$ satisfies

$$n \ge \frac{8}{P_X(B_{x,h})} \log\frac{2}{\delta}, \tag{9}$$

and the number of nearest neighbors satisfies

$$\frac{2(y_{\max} - y_{\min})^2}{\varepsilon^2} \log\frac{4}{\delta} \le k \le \frac{1}{2}\, n\, P_X(B_{x,h}), \tag{10}$$

then with probability at least $1 - \delta$ over the randomness in sampling the training data, DNNR regression at point $x$ has error

$$|\hat\eta_{\mathrm{DNNR}}(x) - \eta(x)| \le \varepsilon. \tag{11}$$

If we know that $A^{\mathrm{Lipschitz}(\vartheta_{\max})}_{\eta}$ also holds, we can pick $h^*_{\mathrm{DNNR}}$ as:

$$h^*_{\mathrm{DNNR}} = \sqrt{\frac{\varepsilon}{\vartheta_{\max}(1 + \tau)}}, \tag{12}$$

where $\tau = \mathbb{E}\Big[\tfrac{1}{\sigma_1}\sqrt{\textstyle\sum_{i=1}^{m} \|\nu_i\|_1^{2\mu}} \,\Big|\, X \in B_{x,h}\Big]$, and $\nu$, $\sigma_1$, and $\vartheta_{\max}$ are defined as in Lemma 1.

The theorem states that we can bound the point-wise error locally with high probability, given that the conditions on $n$ and $k$ are fulfilled. Theorem 3.3.1 from (Chen & Shah, 2018) provides a KNN regression point-wise error bound (their theorem is in turn based on (Chaudhuri & Dasgupta, 2014)). For KNN, the restriction on the maximum distance is $h^*_{\mathrm{KNN}} = \frac{\varepsilon}{2\vartheta_{\max}}$, while the other conditions on sample size and probability are identical. To compare DNNR and KNN, it is beneficial to solve for the error tolerance $\varepsilon$:

$$\varepsilon_{\mathrm{DNNR}} = \big(h^*_{\mathrm{DNNR}}\big)^2\, \vartheta_{\max} (1 + \tau), \tag{13}$$

and for KNN:

$$\varepsilon_{\mathrm{KNN}} = 2\, \vartheta_{\max}\, h^*_{\mathrm{KNN}}. \tag{14}$$

The influence of the different variables is as follows. Both tolerances depend on the Lipschitz constant $\vartheta_{\max}$ linearly. For the distance to the nearest neighbors, the comparison is $h_{\max}$ vs. $h_{\max}^2$: the DNNR tolerance decreases quadratically in $h_{\max}$, and when $h_{\max}$ becomes small, $h_{\max}^2$ will be even smaller. For DNNR, $\tau$ represents the error in estimating the gradient; as long as $\tau < \frac{2}{h_{\max}} - 1$, the error tolerance of DNNR will be lower than for KNN. As $\tau$ scales with $\frac{1}{\sigma_1}$, an ill-conditioned matrix $A$ might increase DNNR's error.

5. Experiments

We compared DNNR against other methods, including state-of-the-art gradient boosting methods. First, we discuss which baselines we compared against and the general experimental setup. Then, we present large-scale quantitative evaluations, followed by an ablation study and further qualitative analyses.

We compared DNNR against state-of-the-art boosting methods (CatBoost, XGBoost, Gradient Boosted Trees; Dorogush et al., 2018; Chen & Guestrin, 2016), classical methods such as KNN and the multi-layer perceptron (MLP), and TabNet, a deep learning approach for tabular data (Arik & Pfister, 2021). We also included KNN-Scaled, which uses DNNR's feature weighting but KNN's averaging scheme. We used standard scaling for all datasets (zero mean, unit variance per dimension).
Each model, except Tab Net, was optimized using a grid search over multiple parameters, and the models were refit using their best parameters on the validation data before test inference. We ensured that each method had a comparable search space, which are listed in Appendix D). Due to its long training times, we had to skip the grid search for Tab Net. Approximately 4.1k CPU hours were used to run the experiments. For the larger benchmarks (Feynman and PMLB), we followed the setup in (Cava et al., 2021). Our baselines report similar or slightly better performance than on SRBench3. 5.2. Quantitative Experiments Benchmark Datasets: The goal of this benchmark is to inspect DNNR s performance on eight real-world regression datasets: Yacht, California, Protein, Airfoil, Concrete, Sarcos, CO2 Emissions, and NOX emissions. The last two datasets are part of the Gas Emission dataset. All datasets were taken from the UCI repository (Dua & Graff, 2017), except California (Kelley Pace & Barry, 1997) and Sarcos (Rasmussen & Williams, 2006). These datasets were also used in previous work (Bui et al., 2016). Some datasets also have discrete features (Yacht, Airfoil), which challenges DNNR s assumption of continuity. All the datasets were evaluated using a 10-fold split, except for the Sarcos dataset which comes with a predefined test set. Additionally, we fixed the data leakage in the Sarcos dataset by removing all test data points from the training set. In Table 1, we report the averaged mean squared error (MSE) over the 10-folds for each dataset and model. Overall, Cat Boost is the best performing method. We find that DNNR achieves the best performance on the Sarcos and California datasets and the second-order achieves the best performance on Protein. For NOx Emissions, CO2Emissions, and Protein, DNNR is within 5% percentage difference to the best performing method (see Table 3). Discrete features violate 3See results here: https://cavalab.org/srbench/blackbox/ 0.4 0.6 0.8 1.0 Accuracy Solution Random Forest Noise Level 0 0.001 0.01 Figure 2. Accuracy on the Feynman Symbolic Regression Database under three levels of noise. The marks show the percentage of solutions with R2 > 0.999 . The bars denote 95% confidence intervals. 0.6 0.7 0.8 0.9 Cat Boost DNNR-2 ord. DNNR XGBoost Grad. B. Trees LGBM Random Forest MLP KNN-Scaled Figure 3. Results on the PMLB benchmark. The markers show the median R2 performance over all datasets runs. Horizontal bars indicate the 95% bootstrapped confidence interval. DNNR s assumptions, which can affect its performance, as the results on Airfoil and Yacht indicate. On these two datasets, DNNR is not under the best-performing methods (e.g. Cat Boost: 1.25, XGBoost: 1.63, for Airfoil) but still performs better than vanilla KNN (2.30 vs. 4.25). Discrete features can render the Taylor approximation meaningless, e.g. all neighbors may have the same value in one dimension rendering the gradient zero, or a linear approximation may not be sufficient when values exhibit large jumps. Interestingly, the second-order DNNR yields better results on the Airfoil and Concrete datasets, presumably because the second-order approximates sharp gradients better. On the other datasets, DNNR delivers a significant improvement over KNN and also over KNN with DNNR feature scaling (KNN-Scaled). Feynman Benchmark As the second benchmark, we selected the Feynman Symbolic Regression Database, which consists of 119 datasets sampled from classical and quantum physics equations (Udrescu & Tegmark, 2020). 
These equations are continuous differentiable functions. The difficulty can be increased by adding noise. DNNR: Differential Nearest Neighbors Regression Table 1. The MSE on eight regression datasets averaged over 10-folds. The best-performing values are marked as bold. The standard deviations are given after the signs. For the Sarcos dataset, we only evaluate on the given test set. Protein CO2Emission California Yacht Airfoil Concrete NOx Emission Sarcos Cat Boost 11.82 0.33 1.11 0.40 0.19 0.01 0.25 0.25 1.26 0.24 14.00 5.35 14.65 1.00 1.313 Grad. B. Trees 12.15 0.41 1.25 0.44 0.20 0.01 0.33 0.15 1.76 0.51 16.03 6.13 15.60 1.00 1.813 LGBM 12.74 0.35 1.17 0.41 0.19 0.01 6.86 9.10 2.03 0.38 16.10 5.11 15.66 1.00 1.613 MLP 14.34 0.44 1.23 0.47 0.35 0.26 4.34 10.21 10.17 3.01 35.27 12.17 19.91 1.82 1.277 Rand. Forest 11.80 0.25 1.12 0.43 0.23 0.02 1.13 0.73 2.81 0.66 22.75 5.64 16.35 1.03 2.264 XGBoost 12.01 0.28 1.21 0.44 0.20 0.02 0.26 0.16 1.63 0.40 16.76 6.68 14.99 0.98 1.824 KNN 13.39 0.37 1.17 0.43 0.39 0.03 68.46 51.11 4.25 0.96 60.76 14.19 17.37 1.26 1.752 KNN-Scaled 12.54 0.57 1.20 0.44 0.19 0.02 2.66 1.92 4.14 0.94 40.25 7.00 15.41 1.21 1.770 LL 14.07 0.18 1.20 0.46 0.33 0.07 50.83 14.81 7.04 1.39 51.40 10.50 16.69 1.10 0.792 LL-Scaled 12.90 0.29 1.10 0.43 0.21 0.02 2.47 1.70 3.53 1.24 32.54 6.33 14.57 0.96 0.786 Tabnet 17.02 2.68 1.22 0.38 0.39 0.04 2.83 2.59 9.98 3.04 43.73 14.59 12.97 0.92 1.304 DNNR 12.31 0.35 1.12 0.47 0.19 0.02 1.05 0.63 2.83 0.60 36.52 18.03 13.34 0.96 0.708 DNNR-2 ord. 11.64 0.44 1.24 0.50 0.22 0.02 0.48 0.43 2.30 0.48 28.35 13.47 15.61 1.94 0.727 The evaluation for the Feynman benchmark was executed with 10 different splits for each dataset and noise level (std=0, 0.001, 0.01) similar to (Cava et al., 2021). For the first split, we divided the data into 70/5/25% train, validation, and test sets. The hyperparameter tuning was done with validation data of the first split. Then the models were refit using the best parameters and evaluated on the 25% test set. Subsequent splits (75/25% train/test) then used these hyperparameters. For the Feynman benchmark, accuracy is defined as the percentage of datasets that were solved with a coefficient of determination R2 > 0.999. We report this accuracy in Figure 2. DNNR is the second-best performing method after MLP. Cat Boost s performance is also notable, with an accuracy of more than 80%. The different noise levels had minor effects on the methods. PMLB Benchmark The PMLB benchmark contains real and synthetic datasets with categorical features, discrete targets, and noisy data in general. In total, we used 133 PMLB datasets. The evaluation setup for the PMLB datasets was similar to the Feynman benchmark. Figure 3 shows the median R2 performance of the different models with a 95% confidence interval. Cat Boost is the best performing method with an R2 median > 0.9 with DNNR secondorder closely in the second position. DNNR, XGBoost, and Gradient Boosted Trees perform similarly well. The worst-performing method is KNN regression. While adding feature weighting (KNN-Scaled) improves the R2 median considerably by over 0.1, only DNNR s additional use of gradient information yields results comparable to gradient boosting methods. 5.3. Ablation In the previous evaluations, we already included KNNScaled to measure the effect of the scaling versus the gradient information. 
We dissected DNNR even further and tested various design alternatives: such as scaling of the neighborhood (MLKR (Weinberger & Tesauro, 2007)) and regularization on the gradient estimation (Lasso (Ausset et al., 2021)). We based this analysis on the Airfoil, Concrete, and 5000 samples from Friedman-1 datasets (see Section 5.4). As before, we conducted a hyperparameters sweep for each model configuration and used a 10-fold validation for each dataset. For the Concrete and Friedman-1 datasets, using only gradient information and no feature scaling (DNNR-Unsc.) already improved over KNN s performance. For the Airfoil dataset, which contains categorical features, using gradients without scaling (DNNR-Unsc.) leads to worse results. MLKR improves KNN s performance more than DNNR s scaling for KNN. However, when using gradient estimation, MLKR is less suitable as can be seen in the difference between DNNR and DNNR-MLKR on Airfoil and Concrete. MLKR draws apart local neighborhoods as points are scaled based on similarities in the target value. The inference might than be carried out by points that are drastically different from the query, both in L2-metric and a different gradient. Furthermore, we would like to note that MLKR is also computationally more expensive than DNNR s scaling as they use Gaussian kernels resulting in a runtime quadratic in the number of samples. These results highlight that while the gradient information might be helpful for unscaled neighborhoods, scaling yields better gradients and results in better approximations. Using Lasso regularization on the gradient estimation as done in (Ausset et al., 2021) did not perform well. We speculate that the regularization limits the gradient estimation. Future work might test if DNNR benefits from Lasso regularization in high-dimensional problems as motivated by the authors. DNNR: Differential Nearest Neighbors Regression Table 2. Ablation study. We report the MSE for different variations of DNNR for three datasets. Airfoil Concrete Friedman-1 DNNR 2.83 0.60 36.52 18.03 0.01 0.00 DNNR-Unsc. 4.82 0.74 49.97 13.59 1.03 0.15 KNN-MLKR 3.32 0.84 36.85 9.89 0.40 0.07 KNN-Scaled 4.14 0.94 40.25 7.00 7.27 0.53 DNNR-MLKR 3.20 1.08 1.5e13 4e13 0.03 0.00 KNN 4.25 0.96 60.76 14.19 3.93 0.43 DNNR Rand. 24.38 3.44 115.14 20.52 6.07 0.56 DNNR-Lasso 5.25 1.05 40.24 8.43 7.33 0.52 DNNR-2 ord. 2.30 0.48 28.35 13.47 0.01 0.00 5.4. Effect of noise, #samples, and #features This analysis investigates how different data properties affect the model s performance. Such an analysis requires a controlled environment: we used the Friedman-1 dataset (Friedman, 1991). This dataset is based on the following equation: y(x) = 10x3 + 5x4 + 20 (x2 0.5)2 + 10 sin (πx0x1) + sϵ, (15) where xi is uniformly distributed in [0, 1], the noise ϵ is sampled from a standard normal distribution, and s controls the variance of the noise. Unimportant features can be added by simply sampling xj U(0, 1). Friedman-1 allowed us to test the models under different sampling conditions: the number of samples, the magnitude of noise, and the number of unimportant features. As defaults, we choose the number of samples = 5000, the number of features = 10, and the noise level = 0. Besides DNNR, we also evaluated Gradient Boosted Trees, Cat Boost, Random Forest, MLP, and KNN. For each setting, we run a 5-fold evaluation (the hyperparameters of each method are fitted on the first fold and then fixed for the remaining 4). We report the effect of each condition in Appendix Figure 8. 
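For reference, a small sketch of the data-generating function in equation (15) with the defaults used here; the function name is ours, and scikit-learn ships an equivalent generator as `sklearn.datasets.make_friedman1`.

```python
import numpy as np

def friedman1(n_samples=5000, n_features=10, noise=0.0, seed=0):
    """Sample the Friedman-1 dataset of equation (15).

    Only the first five features enter the target; the remaining
    n_features - 5 columns are uninformative U(0, 1) dimensions.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4]
         + noise * rng.standard_normal(n_samples))   # noise level s
    return X, y

# defaults used in this section: 5000 samples, 10 features, no noise
X, y = friedman1()
```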
For the number of samples, we note that DNNR s error declined rapidly. Second-best is Cat Boost. For noise, we observed two groupings. While one group (MLP & KNN) performed poorly, their MSE did not increase when adding noise. For the better performing group (DNNR, Gradient Boosted Trees, Cat Boost, Random Forest), the error did increase when adding noise. In this group, DNNR performed the best for low noise levels but was beaten by Cat Boost slightly for higher levels of noise. Increasing the number of unimportant features impacted KNN and MLP particularly. DNNR dropped from being the best method to sharing the second place with Gradient Boosted Trees as the feature scaling cannot entirely mitigate the effect of unimportant features. Tree-based methods were barely affected, as they are adept at handling unimportant features and operate on the information gain of each feature. Figure 4. Comparison between the error bound of KNN (yellow) and DNNR (blue). On the x-axis, all test data points sorted by their error bounds are plotted. The DNNR performs better than KNN and the DNNR error bound is also lower. 10 3 10 2 10 1 100 Error Tolerance [ε] #Samples [n] Figure 5. Depicts the error tolerance versus the number of training samples for the Friedman-1 dataset. KNN requires multiple orders more training points to guarantee the same error tolerance. 5.5. Application of the theoretical bound In this evaluation, we analyze the error bounds from section 4. As dataset, we use the Friedman-1 dataset introduced before in section 5.4. The synthetic dataset allows the sampling of arbitrary points. Thereby we can also simulate very dense neighborhoods. First, we apply Theorem 1 to the dataset. We pick a probability tolerance of 0.95 (δ = 0.05), and a Lipschitz constant for equation (15) of ϑmax = 40. For DNNR, we estimate a value of τ 5.59 from the data. We show the dependency between error tolerance and the number of samples required in Figure 5. In this exemplary calculation, KNN requires multiple orders of magnitude more training data than DNNR to achieve the same theoretical guarantee on the error tolerance. Still, even DNNR would require an unrealistic amount of samples, e.g. around 1015 for an error tolerance of ε = 0.1. As a more practical application, we investigated how local conditions, e.g. distance to the neighbors and the Lipschitz constant, influence the local prediction error. Therefore, we sampled a realistically sized train and test set (10.000 and 2.000 samples) and then compared the error tolerance for KNN and DNNR according to equations (13) and (14). For both methods, we choose k = 7, and for the number of DNNR: Differential Nearest Neighbors Regression (a) Feature relevances (b) Failure (error of 4.85) Figure 6. (a) Feature relevance of DNNR Unscaled on the California Housing dataset. (b) A failure of DNNR Unscaled. The red circle marks the query point. The prediction is done by approximating the gradient locally for each nearest neighbors (crosses). The circles visualize the points used for gradient approximation. As DNNR Unscaled weights each dimension equally, it does not use spatially nearby points, even though longitude and latitude are scored important. neighbors to approximate the gradient, we used k = 32. We purposely violate the conditions on k and n of Theorem 1, as we want to analyze how applicable the estimated error tolerances are in a more realistic setting. In Figure 4, we sorted the 2000 test points according to the error tolerances of KNN and DNNR respectively. 
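The per-point tolerances behind this comparison can be computed directly from equations (13) and (14). Below is a small sketch under our reading of the setup: the neighborhood radius h is taken as the distance from the query to its k-th training neighbor, and the Lipschitz constant ϑmax = 40 and τ ≈ 5.59 are the values stated above for Friedman-1.

```python
import numpy as np

def local_error_tolerances(X_train, X_test, k=7, lipschitz=40.0, tau=5.59):
    """Per-query error tolerances of equations (13) and (14)."""
    eps_dnnr, eps_knn = [], []
    for x in X_test:
        dists = np.sort(np.linalg.norm(X_train - x, axis=1))
        h = dists[k - 1]                                   # radius covering the k nearest neighbors
        eps_dnnr.append(h ** 2 * lipschitz * (1 + tau))    # eq. (13)
        eps_knn.append(2 * lipschitz * h)                  # eq. (14)
    return np.array(eps_dnnr), np.array(eps_knn)
```

Sorting the test points by these two arrays gives curves of the kind shown in Figure 4.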
For both methods, we see that the error tolerances strictly bound the actual errors by a gap of around one order of magnitude. We also observe that the error tolerances were correlated with the actual errors, e.g. as the error tolerances decrease, so does the actual error. 5.6. Feature importance & Inspecting a prediction DNNR allows inspecting which neighbors were used for the prediction and how they contributed. For the following exemplary inspection, we use the California Housing dataset (Kelley Pace & Barry, 1997). The dataset s dependent variable is the median house value per block group (a block group is an aggregation of a local area). The eight observational variables represent the location, median income, and information about the houses, such as average rooms or occupation. For this study, we analyzed DNNR Unscaled, i.e. DNNR without the feature scaling. We omitted the feature scaling, as the feature relevance would be impacted by the scaling and the DNNR with scaling is also performing so well that the error would be minor to inspect. First, we can provide a simple local feature relevance score by multiplying the estimated gradient with the difference in the input: ξm = |(x xm) ˆγm|, (16) where xm is the point where we estimate the gradient and ˆγm the locally fitted gradient. This formulation of feature importance is analogous to a linear model where one would take w x (Molnar, 2019, sec. 5.1.). It is a known property that the gradient reflects the model s sensitivity and can be used for feature importance, and variable selection (Mukherjee & Zhou, 2006; Guyon & Elisseeff, 2003). We show the distributions of the local feature importance in Figure 6a. The most important dimensions are longitude, latitude, and median income. We validate the local feature importance by applying it to variable selection. Using all dimensions, we get an MSE of around 0.34 (this is lower than in Table 1 as we do not use feature weighting). When deleting the three most important dimensions, the MSE increases to 0.99. However, keeping only the most important dimensions, the MSE slightly improves to 0.33. Therefore, we conclude that the feature importance has found the most important dimensions. We now move on to inspect a failure case of DNNR. In Figure 6b, we show how the neighborhood of a poorly predicted point can be inspected. The prediction (red circle) is off by 4.85. From looking at the projection of the data to the longitude and latitude dimensions, we can see that the prediction is based on points (crosses) far away from the query. These points might have a similar number of bedrooms but differ in the location. As we found in the previous experiment, the latitude/longitude belong to the most important features. This inspection motivates the feature scaling once again, as DNNR Unscale weights all dimension equally, it selected the nearest neighbors using less relevant dimensions. 6. Conclusion, Limitations, and Future Work Conclusion DNNR showed that local datapoint gradients carry valuable information for prediction and can be exploited using a simple Taylor expansion to provide a significant performance boost over KNN regression. In large-scale evaluations, DNNR achieved comparable results to state-ofthe-art complex gradient boosting models. An advantage of DNNR s simplicity is that we can obtain error bounds by extending KNN s theory. Our theoretical analysis illustrates the benefits of using DNNR over KNN. 
DNNR strikes a good balance between performance and transparency and may therefore be the method of choice in problems with elevated requirements for the system s interpretability. Limitations Our evaluation of DNNR points to a potential limitation on discrete data. When features or targets have the same value, the gradient is zero, and DNNR falls back to KNN s decision. On partially discrete datasets, DNNR always performs at least as well as KNN, e.g. the results of DNNR and KNN on the Yacht dataset (Table 1). DNNR also inherits some limitations from KNN, such that the L2-metric might not represent similarities optimally. Future Work DNNR was designed for regression tasks but could also be adapted for classification, e.g. by fitting DNNR: Differential Nearest Neighbors Regression the gradients of a logistic function as in (Mukherjee & Wu, 2006) or via label smoothing (M uller et al., 2019). An interesting future direction may be extending DNNR specifically to symbolic regression by utilizing the estimated gradient information. Future work could also explore the use of DNNR for data augmentation, where points could be sampled, and the label computed estimated with the local gradient. Another research direction would be to tighten the theoretical bounds or to study the effect of scaling from a theoretical perspective. Acknowledgments We thank Maximilian Granz, Oana-Iuliana Popescu, Benjamin Wild, Luis Herrmann, and David Dormagen for fruitful discussion and feedback on our manuscript. We are also grateful for our reviewers feedback. LS was supported by the Elsa-Neumann Scholarship of the state of Berlin. We thank the HPC Service of ZEDAT, Freie Universit at Berlin, for generous allocations of computation time (Bennett et al., 2020). Aha, D. W. Feature Weighting for Lazy Learning Algorithms, pp. 13 32. Springer US, Boston, MA, 1998. ISBN 978-1-4615-5725-8. doi: 10.1007/978-1-4615-5725-8 2. URL https://doi.org/10.1007/978-1-46155725-8 2. Arik, S. O. and Pfister, T. Tabnet: Attentive interpretable tabular learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6679 6687, May 2021. URL https://ojs.aaai.org/index.php/ AAAI/article/view/16826. Ausset, G., Cl emen, S., et al. Nearest neighbour based estimates of gradients: Sharp nonasymptotic bounds and applications. In International Conference on Artificial Intelligence and Statistics, pp. 532 540. PMLR, 2021. URL http://proceedings.mlr.press/ v130/ausset21a/ausset21a.pdf. Balsubramani, A., Dasgupta, S., Freund, Y., and Moran, S. An adaptive nearest neighbor rule for classification. In Neur IPS, 2019. URL https://proceedings.neurips.cc/ paper/2019/file/ a6a767bbb2e3513233f942e0ff24272c Paper.pdf. Ben-Shabat, Y. and Gould, S. Deepfit: 3d surface fitting via neural network weighted least squares. In Computer Vision ECCV 2020, pp. 20 34, Cham, 2020. Springer International Publishing. ISBN 978-3-030- 58452-8. URL https://link.springer.com/ chapter/10.1007/978-3-030-58452-8 2. Bennett, L., Melchers, B., and Proppe, B. Curta: A generalpurpose high-performance computer at zedat, freie universit at berlin. http://dx.doi.org/10.17169/refubium-26754, 2020. Berahas, A., Cao, L., Choromanski, K., and Scheinberg, K. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Foundations of Computational Mathematics, 05 2021. doi: 10.1007/s10208-021-09513-z. URL https:// doi.org/10.1007/s10208-021-09513-z. Bhatia, N. and Vandana. Survey of nearest neighbor techniques. 
International Journal of Computer Science and Information Security, 8, 07 2010. URL https: //doi.org/10.5120/16754-7073. Bui, T., Hernandez-Lobato, D., Hernandez-Lobato, J., Li, Y., and Turner, R. Deep gaussian processes for regression using approximate expectation propagation. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1472 1481, New York, New York, USA, 20 22 Jun 2016. PMLR. URL https: //proceedings.mlr.press/v48/bui16.html. Cava, W. L., Orzechowski, P., Burlacu, B., de Franca, F. O., Virgolin, M., Jin, Y., Kommenda, M., and Moore, J. H. Contemporary symbolic regression methods and their relative performance. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https: //openreview.net/forum?id=x VQMr DLy Gst. Chaudhuri, K. and Dasgupta, S. Rates of convergence for nearest neighbor classification. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/ paper/2014/file/ db957c626a8cd7a27231adfbf51e20eb Paper.pdf. Chen, G. H. and Shah, D. Explaining the success of nearest neighbor methods in prediction. Foundations and Trends in Machine Learning, 10(5-6):337 588, 2018. ISSN 1935-8237. doi: 10.1561/2200000064. URL http://dx.doi.org/10.1561/2200000064. Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. KDD 16, pp. 785 794, DNNR: Differential Nearest Neighbors Regression New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.2939785. URL https://doi.org/ 10.1145/2939672.2939785. Cleary, J. G. and Trigg, L. E. K*: An instance-based learner using an entropic distance measure. In Prieditis, A. and Russell, S. (eds.), Machine Learning Proceedings 1995, pp. 108 114. Morgan Kaufmann, San Francisco (CA), 1995. ISBN 978-1-55860-377-6. doi: https: //doi.org/10.1016/B978-1-55860-377-6.50022-0. URL https://www.sciencedirect.com/science/ article/pii/B9781558603776500220. Cover, T. and Hart, P. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21 27, 1967. doi: 10.1109/TIT.1967.1053964. URL https: //doi.org/10.1109/TIT.1967.1053964. De Brabanter, K., De Brabanter, J., De Moor, B., and Gijbels, I. Derivative estimation with local polynomial fitting. J. Mach. Learn. Res., 14(1):281 301, jan 2013. ISSN 1532-4435. URL https://www.jmlr.org/ papers/volume14/debrabanter13adeleted/debrabanter13a.pdf. Dorogush, A. V., Ershov, V., and Gulin, A. Catboost: gradient boosting with categorical features support. Ar Xiv, abs/1810.11363, 2018. URL https://arxiv.org/ abs/1810.11363. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. European Commission. Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts, 2021. URL https://eur-lex.europa.eu/legalcontent/EN/TXT/?uri=CELEX:52021PC0206. Fan, J. Design-adaptive nonparametric regression. Journal of the American statistical Association, 87(420):998 1004, 1992. URL https://doi.org/10.2307/ 2290637. Fan, J. and Gijbels, I. Local polynomial modelling and its applications. 
Number 66 in Monographs on statistics and applied probability series. Chapman & Hall, 1996. ISBN 0412983214. URL http://gso.gbv.de/DB=2.1/ CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM= ppn+19282144X&sourceid=fbw bibsonomy. Friedman, J. H. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1 67, 1991. ISSN 00905364. URL http://www.jstor.org/ stable/2241837. Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157 1182, 2003. URL https://www.jmlr.org/papers/volume3/ guyon03a/guyon03a.pdf. Hristache, M., Juditsky, A., Polzehl, J., and Spokoiny, V. Structure Adaptive Approach for Dimension Reduction. The Annals of Statistics, 29(6):1537 1566, 2001. doi: 10.1214/aos/1015345954. URL https://doi.org/ 10.1214/aos/1015345954. Kelley Pace, R. and Barry, R. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291 297, 1997. ISSN 0167-7152. doi: https://doi.org/10.1016/S0167-7152(96)00140-X. URL https://www.sciencedirect.com/science/ article/pii/S016771529600140X. Kulkarni, S. and Posner, S. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028 1039, 1995. doi: 10.1109/18.391248. URL https: //doi.org/10.1109/18.391248. Molnar, C. Interpretable Machine Learning. 2019. URL https://christophm.github.io/ interpretable-ml-book/. Mukherjee, S. and Wu, Q. Estimation of gradients and coordinate covariation in classification. Journal of Machine Learning Research, 7:2481 2514, 12 2006. URL https://jmlr.org/papers/ volume7/mukherjee06b/mukherjee06b.pdf. Mukherjee, S. and Zhou, D.-X. Learning coordinate covariances via gradients. Journal of Machine Learning Research, 7(18):519 549, 2006. URL http:// jmlr.org/papers/v7/mukherjee06a.html. M uller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/ paper/2019/file/ f1748d6b0fd9d439f71450117eba2725Paper.pdf. Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, January 2006. URL http://www.gaussianprocess.org/ gpml/. Turner, I. W., Belward, J. A., and Oqielat, M. N. Error bounds for least squares gradient estimates. SIAM Journal on Scientific Computing, 32:2146 2166, 2010. URL https://doi.org/10.1137/080744906. DNNR: Differential Nearest Neighbors Regression Udrescu, S.-M. and Tegmark, M. Ai feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16), 2020. doi: 10.1126/ sciadv.aay2631. URL https://www.science.org/ doi/abs/10.1126/sciadv.aay2631. Vivencio, D. P., Hruschka, E. R., do Carmo Nicoletti, M., dos Santos, E. B., and de O. Galv ao, S. D. C. Featureweighted k-nearest neighbor classifier. 2007 IEEE Symposium on Foundations of Computational Intelligence, pp. 481 486, 2007. URL https://doi.org/10.1109/ FOCI.2007.371516. Wang, J., Neskovic, P., and Cooper, L. N. Improving nearest neighbor rule with a simple adaptive distance measure. Pattern Recognition Letters, 28(2):207 213, 2007. ISSN 0167-8655. doi: https://doi.org/10.1016/j.patrec.2006.07.002. URL https://www.sciencedirect.com/science/ article/pii/S0167865506001917. Wang, Q., Wan, J., and Yuan, Y. Deep metric learning for crowdedness regression. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2633 2643, 2018. doi: 10.1109/ TCSVT.2017.2703920. 
URL https://doi.org/ 10.1109/TCSVT.2017.2703920. Weinberger, K. Q. and Tesauro, G. Metric learning for kernel regression. In Meila, M. and Shen, X. (eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pp. 612 619, San Juan, Puerto Rico, 21 24 Mar 2007. PMLR. URL https://proceedings.mlr.press/v2/ weinberger07a.html. Weinberger, K. Q., Blitzer, J., and Saul, L. Distance metric learning for large margin nearest neighbor classification. In Weiss, Y., Sch olkopf, B., and Platt, J. (eds.), Advances in Neural Information Processing Systems, volume 18. MIT Press, 2006. URL https://proceedings.neurips.cc/ paper/2005/file/ a7f592cef8b130a6967a90617db5681b Paper.pdf. Wettschereck, D. and Dietterich, T. G. Locally adaptive nearest neighbor algorithms. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS 93, pp. 184 191, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc. URL https://proceedings.neurips.cc/ paper/1993/file/ 5f0f5e5f33945135b874349cfbed4fb9Paper.pdf. DNNR: Differential Nearest Neighbors Regression (a) Sample 1305: Error of 4.85 (b) Zoom of 1305 (c) Sample 1220 (Error 0.02) (d) Zoom of 1220 (e) Sample 1081 (Error 0.13) (f) Zoom of 1081 (g) Sample 393 (Error 1.01) (h) Zoom of 393 Figure 7. The selected neighbors for different queries on the California dataset. The red circle marks the query point. The prediction is done by approximating the gradient locally for each nearest neighbors (crosses). The filled circles visualize the points used for gradient approximation. The neighbors are further away for higher errors (a) & (g) while close to the query for lower errors (c) & (e). 10 20 30 40 50 N. Features DNNR Random Forest Gradient Boosted Trees MLP Cat Boost KNN (a) Number of Features 0.00 0.02 0.04 0.06 0.08 0.10 Noise level DNNR Random Forest Gradient Boosted Trees MLP Cat Boost KNN (b) Noise Level 0 20000 40000 60000 80000 N. Samples DNNR Random Forest Gradient Boosted Trees MLP Cat Boost KNN (c) Number of Samples Figure 8. The effect of different parameters of the dataset. The results are obtains on the Friedman-1 dataset. The confidence intervals denote the standard derivation over multiple folds. DNNR: Differential Nearest Neighbors Regression Table 3. Mean percentage difference from the best performing model on each dataset. Protein CO2Emission California Yacht Airfoil Concrete NOx Emission Sarcos Cat Boost 1.54% 1.09% 0.00% 0.00% 0.00% 0.00% 12.91% 85.45% Grad. B. Trees 4.42% 13.71% 7.94% 32.14% 39.32% 14.53% 20.26% 156.07% KNN 15.03% 6.63% 108.47% 27065.48% 236.23% 334.12% 33.91% 147.46% KNN-neigh 7.76% 8.90% 2.65% 953.97% 227.69% 187.56% 18.79% 150.00% LGBM 9.51% 6.09% 3.17% 2623.41% 60.36% 15.05% 20.74% 127.82% LL 20.93% 8.99% 72.49% 20072.22% 457.28% 267.28% 28.69% 11.86% LL-neigh 10.83% 0.00% 12.70% 880.95% 179.35% 132.46% 12.35% 11.02% MLP 23.22% 11.99% 84.13% 1623.02% 704.43% 151.99% 53.46% 80.37% Random Forest 1.39% 1.54% 21.69% 348.81% 122.15% 62.53% 26.08% 219.77% Tabnet 46.25% 10.63% 108.47% 1024.21% 689.79% 212.47% 0.00% 84.18% XGBoost 3.17% 9.99% 6.35% 2.78% 28.72% 19.73% 15.54% 157.63% DNNR-2 ord. 0.00% 12.35% 13.76% 88.89% 81.96% 102.52% 20.37% 2.68% DNNR 5.79% 1.27% 0.00% 315.08% 123.58% 160.93% 2.86% 0.00% C. 
Approximation Bounds Notation Meaning x An input vector X The random variable of the input Y The random variable of the output k The number of nearest neighbors k The number of neighbors to estimate the gradient η(x) The expected target value: η(x) = E[Y |X = x] η(x) The gradient of η(x) w.r.t. x ηKNN(x) Estimated regression function for KNN regression ηDNNR(x) Estimated regression function for DNNR nn(x, k) Returns the indices of the k nearest neighbors of x ˆYm,x Estimated regression value for x from point Xm ϑmax Maximum Lipschitz constant A Matrix to estimate the gradient using OLS σ1 Smallest singular value of A Cµ Set of µ times differentiable continuous functions (v )µf The µ-order directional derivative of f w.r.t v En[.] Expectation over n training points (Xi, Yi)1 i n ˆγm Locally estimated gradient for point m. ˆωm Locally estimated gradient and higher order terms ε The error tolerance δ Probability tolerance hm Distance between x and point Xm ρ(x, x ) Distance metric Table 4. Notation used in this work. The proof of the approximation bounds of DANN extends the the proof of KNN approximation bounds given in (Chen & Shah, 2018, p. 68ff.) by a Taylor Approximation. For the approximation of the gradient, we will rely on results given in (Turner et al., 2010). DNNR: Differential Nearest Neighbors Regression Notation Our notation follows the one given in (Chen & Shah, 2018): C.1. General Properties We list here some inequalities used in the later proof. Jensen s Inequality Given a random variable X and a convex function f(X) then: f(E[X]) E[f(X)]. (17) Hoeffding s Inequality Let X1, . . . , Xk be independent random variables bounded between [a, b]. The empirical mean is given by: X = 1 k(X1 + . . . + Xk). Then: P( X E[ X] > t) 2 exp 2kt2 Chernoff Bound for Binomial distribution Let X = Pn i=1 Xi be a sum of n independent binary random variable each Xi = 1 with probability pi. Let µ = E[X] = Pk i=1 pi = n p, where p = 1 n Pk i=1 pi. Then P(X (1 δ)µ) exp( µδ2/2). (19) C.2. Gradient Approximation Here, we first proof two lemmas for bounding the gradient. They generalize the proof in (Turner et al., 2010, section 3.1) from 2 to d-dimensions. The first Lemma bounds terms of a Taylor series and is need for the proof of Lemma . Lemma 2 Let f : D Rd R be of class Cµ, a D, and B(a) D some neighborhood of a. Suppose that for i = 0, 1, . . . , n we have µf µa Lipϑi(B(a)) and ϑmax = maxi 1,...µ ϑi. Then for any a + hν B(a) with ||ν|| = 1 k! (ν )kf(a) f(a + hν) f(a) (µ + 1)!ϑmax||v||µ 1. (20) Proof Lemma 2. The proof starts with rearranging the Taylor Series which is given here: f(a + hν) = f(a) + h(ν )f(a) 1! + . . . + hµ (ν )µf(a) µ! + Rµ. (21) The remainder Rµ has the following form: 0 (1 t)µ(ν )µ+1f(a + thν)dt. (22) By dividing by h and rearanging some terms, we have: f(a + hν) f(a) k! (ν )kf(a) = Rµ 1 Now, we add the same quantity to both sides of the previous equation, using R 1 0 (1 t)µ 1dt = 1 µ! (ν )µf(a) f(a + hν) f(a) k! (ν )kf(a) 0 (1 t)µ 1n (ν )µf(a) (ν )µf(a + thν) o dt. DNNR: Differential Nearest Neighbors Regression Therefore, we have: k! (ν )kf(a) f(a + hν) f(a) 0 (1 t)µ 1n (ν )µ(f(a) f(a + thν)) o dt 0 (1 t)µ 1n (ν )µ(f(a) f(a + thν)) o dt 0 (1 t)µ 1 (ν )µ (f(a) f(a + thν)) dt (µ + 1)!ϑmax |ν|µ 1 . In last step, we rewrote the µ-th directional derivative using partial derivative and bounded it using the Lipschitz-continuity as follows: |(ν )µ(f(a) f(a + thν))| k1+...+kd=µ µ k1, . . . , kd ! µ(f(a) f(a + thν)) ak1 1 . . . akd d k1+...+kd=µ µ k1, . . . , kd ! 
µ |(f(a) f(a + thν))| ak1 1 . . . akd d k1+...+kd=µ µ k1, . . . , kd ϑmax|a a + thν| (Lipschitz continuity of order µ) = ϑmax|thν| (|ν1| + . . . + |νd|)µ (Multinomial theorem) = ϑmax|th| |ν|µ 1 . This finishes the proof of Lemma 2. Lemma 3 Let f : D Rd R be of class Cµ, a D, and B(a) D be some neighborhood of a. Suppose around point a we have m neighboring points vk, k = 1, . . . , m with a, v1, . . . , vm B(a) D. Suppose further that µf al1 1 ... a ld d Lipϑmax(B(a)) for l1 + . . . + ld = µ, and we approximate the gradient locally at a by E1ˆω via the least-squares solution ˆω = arg minω Rd ||Aω q||, where νT 1 νT 1 νT 2 νT 2 ... νT m νT m f(a+h1ν1) f(a) h1 f(a+h2ν2) f(a) h1... f(a+hmνm) f(a) E1 = Id0 Rd p, hk = ||vk a|| with hkνk = vk a; νT k = (ν1k, . . . , νdk), νT k = hµ 1 k l1,...,ld Qd i1 νki i 2 µ µ,l1+...ld=µ and p = Pµ i=1 (d+µ)! d! . Then a bound on the error in the least-squares gradient estimate is given by: || f(a) E1ˆω|| ϑmaxhµ max σ1(µ + 1)! i=1 ||νi||2µ 1 , (25) DNNR: Differential Nearest Neighbors Regression where σ1 is the smallest singular value of A, which is assumed to have rank(A) = p, ϑmax is as defined in Lemma 2, and hmax = max1 k m hk. Proof. Let E2 R(p d) p be the last p d rows of the identity matrix Ip and f a1 (a), . . . , f a l21 1 . . . a l2d d , . . . , 2f a l2d d , . . . , µf a ln1 1 , . . . , µf Now U ˆω can be partitioned as E1(U ˆω) E2(U ˆω) , and hence: ||U ˆω||2 = ||E1(U ˆω)||2 + ||E2(U ˆω)||2 ||E1(U ˆω)||2 = || f(a) E1ˆω||2. (27) Next, with ˆω = A q and ||A || = 1 σ2 1 , we have ||U ˆω||2 = ||U A q||2 = ||A (AU q)||2 ||A ||2||AU q||2 = 1 σ2 1 ||AU q||2. (28) Now using the result in Lemma 2, the following upper bound can be derived: ||AU q||2 = k=1 Ai,k Uk qk k=1 Ai,k Uk qk k! (νi )kf(a) f(a + hν) f(a) hµ i (µ + 1)!ϑmax||νi||µ 1 2 (using Lemma 2) (32) i=1 (hµ i ||νi||µ 1)2 (33) 2 (hµ max)2 m X i=1 ||νi||2µ 1 (34) The result follows from || f(a) E1ˆω|| ||U ˆω|| ϑmaxhµ max σ1(µ + 1)! i=1 ||νi||2µ 1 . (35) C.3. Da NN pointwise error The proof follows the one from (Chen & Shah, 2018). In (Chen & Shah, 2018), also a method to break ties, e.g. points with the same distance is proposed a common problem when the data is discrete and not continuous. We will not discuss how to break ties, as the gradient estimation assumes continuous dimensions where ties should a.s. never happen. Technical Assumptions (Atechnical X,ρ,PX ): The feature space X and distance ρ form a separable metric space. The feature distribution PX is a Borel probability measure. DNNR: Differential Nearest Neighbors Regression Assumptions Besicovitch(ABesicovitch η ): The regression function η satisfied the Besicovitch condition if limr 0 E[Y |X Bx,r] = η(x) for x almost everywhere w.r.t. PX.. Assumption Lipschitz (ALipschitz(ϑmax) η ): The regression function η is Lipschitz continuous with parameter ϑmax if |η(x) η(x )| ϑmaxρ(x, x ) for all x, x X. Lemma 4 Under assumptions Atechnical X,ρ,PX, ABesicovitch η , let x supp(PX) be a feature vector, and η(x) = E[Y |X = x] R, be the expected label value for x. Let ε > 0 be an error tolerance in estimating expected label η(x), and δ (0, 1) be a probability tolerance. Suppose that Y [ymin, ymax] for some constants ymin and ymax. Let ξ (0, 1). Then there exists a threshold distance h (0, inf) such that for any smaller distance h (0, h ) and with the number of nearest neighbors satisfying k (1 ξ)n PX(Bx,h), then with probability at least 1 2 exp kε2 2(ymax ymin)2 exp ξ2n PX(Bx,r we have |ˆηDNNR(x) η(x)| ε. 
(37) Furthermore, if the function η satisfies assumptions ALipschitz(ϑmax) η , then we can take h DNNR = r ε ϑmax (1 + τ), (38) where τ =E Pm i=1 ||νi||2µ 1 σ1 X Bx,h . ν, σ1, and ϑmax are defined as in Lemma 1. Proof of Lemma 4: Fix x supp(PX). Let ε > 0. We upper-bound the error |ˆη η(x)| with the triangle inequality: |ˆη(x) η(x)| = |ˆη(x) En[ˆη(x)] + En[ˆη(x)] η(x)| |ˆη(x) En[ˆη(x)]| | {z } + |En[ˆη(x)] η(x)| | {z } The proof now continues by showing that both 1 and 2 are below ε/2 with high probability. The proof is adapted from the KNN pointwise regression proof in (Chen & Shah, 2018, p. 68ff.). The part 1 is almost identical to the proof in (Chen & Shah, 2018, p. 68ff.). For part 2 , we will use the Taylor Approximation and then bound the gradient using the results from Lemma 1. Part 1 |ˆη(x) En[ˆη(x)]| ε Lemma 5 Under same assumption of Theorem 1, we have: P |ˆη(x) En[ˆη(x)]| ε 2(ymax ymin)2 Proof Lemma 5: As we want to apply the Hoeffding s inequality, we have to show that the ˆY are independent Probability Model: The randomness of the ˆηDNNR(x) can be described as: 1. Sample a feature vector X X from the marginal distribution of the (k + 1)-st nearest neighbor of x, let hk+1 = ρ(x, X) denote the distance between x and X. 2. Sample k feature vectors i.i.d. from PX conditioned on landing in the ball Bo x,ρ(x, X), 3. Sample n k 1 feature vectors i.i.d. from PX conditoned on landing in X \ Bo x,hk+1, 4. Randomly permute the n feature vectors sampled, DNNR: Differential Nearest Neighbors Regression 5. For each feature vector Xi generated, sample its label Yi based on the conditional distribution PY |X=Xi. Therefore, we can write the expectation over the n training points as: En[ˆη(x)] = Ehk+1(x)[E[ ˆY |X Bo x,hk+1], (40) where ˆY = Y + Y (X x) denotes the prediction for point x from a data point X with label Y and gradient Y . The points samples in step 2 are precisely the k nearest neighbor of x, and their ˆY values i.i.d. with expectation En[ˆη(x)] = E X[E[ ˆY |X Bo x,hk+1]. As they are bounded between ymin and ymax (this can be enforced by clipping), Hoeffding s inequality yields: P |ˆη(x) En[ˆη(x)]| ε 2(ymax ymin)2 This finishes the proof of Lemma 5. Part 2 |En[ˆη(x)] η(x)| ε As discussed above (40) the expectation of ˆη(x) is En[ˆη(x)] = Ehk+1(x)[E[Y |X Bo x,hk+1] (42) Suppose that we could show that there exists home h > 0 such that |E[Y |X Bo x,r] η(x)| ε 2 for all r (0, h]. (43) Then provided that hk+1 h: |En[ˆη(x) η(x)]| = |Ehk+1(x)[E[Y |X Bo x,hk+1] η(x)]| (44) Ehk+1(x)[|E[Y |X Bo x,hk+1] η(x)|] (Jensen s ineq.) (45) 2 (inequality (43)) (46) Before establishing the existence of h, we first show that for any distance r > 0, with high probability we can ensure that hk+1 r. Thus, once we show that h exists, we also know that we can ensure that hk+1 h with high probability. Lemma 6 . Let r > 0 and ξ (0, 1). For positive integer k (1 ξ)n PX(Bx,r), PX(hk+1(x) r) exp ξ2n PX(Bx,r Thus, we have hk+1(x) r with probability at least 1 exp 0.5ξ2n PX(Bx,r) . Proof of Lemma 6. Fix r > 0 and ξ (0, 1). Let Nx,r be the number of training points that land in the closed ball Bx,r. Note that Nx,r Binomial(n, PX(Bx,r)). Then by a Chernoff bound for the binomial distribution, for any integer k (1 ξ)n PX(Bx,r), we have P(Nx,r k) exp (n PX(Bx,r k)2 2n PX(Bx,r) exp (n PX(Bx,r (1 ξ)n PX(Bx,r))2 2n PX(Bx,r) = exp ξ2n PX(Bx,r)) If Nx,r k, then also hk+1(x) r. 
Part (2): $|E_n[\hat\eta(x)] - \eta(x)| \le \varepsilon/2$.

As discussed above in (40), the expectation of $\hat\eta(x)$ is
\[
E_n[\hat\eta(x)] = E_{h_{k+1}(x)}\big[ E[\hat Y \mid X \in B^o_{x,h_{k+1}}] \big]. \tag{42}
\]
Suppose that we could show that there exists some $h > 0$ such that
\[
|E[\hat Y \mid X \in B^o_{x,r}] - \eta(x)| \le \varepsilon/2 \quad \text{for all } r \in (0, h]. \tag{43}
\]
Then, provided that $h_{k+1} \le h$:
\[
|E_n[\hat\eta(x)] - \eta(x)| = \big| E_{h_{k+1}(x)}\big[ E[\hat Y \mid X \in B^o_{x,h_{k+1}}] - \eta(x) \big] \big| \tag{44}
\]
\[
\le E_{h_{k+1}(x)}\big[ \big| E[\hat Y \mid X \in B^o_{x,h_{k+1}}] - \eta(x) \big| \big] \quad \text{(Jensen's ineq.)} \tag{45}
\]
\[
\le \varepsilon/2. \quad \text{(inequality (43))} \tag{46}
\]
Before establishing the existence of $h$, we first show that, for any distance $r > 0$, we can ensure $h_{k+1} \le r$ with high probability. Thus, once we show that $h$ exists, we also know that $h_{k+1} \le h$ holds with high probability.

Lemma 6. Let $r > 0$ and $\xi \in (0, 1)$. For any positive integer $k \le (1 - \xi)\, n\, P_X(B_{x,r})$,
\[
P\big( h_{k+1}(x) > r \big) \le \exp\Big( -\frac{\xi^2 n P_X(B_{x,r})}{2} \Big).
\]
Thus, we have $h_{k+1}(x) \le r$ with probability at least $1 - \exp\big( -\tfrac{1}{2}\xi^2 n P_X(B_{x,r}) \big)$.

Proof of Lemma 6. Fix $r > 0$ and $\xi \in (0, 1)$. Let $N_{x,r}$ be the number of training points that land in the closed ball $B_{x,r}$, and note that $N_{x,r} \sim \mathrm{Binomial}(n, P_X(B_{x,r}))$. Then, by a Chernoff bound for the binomial distribution, for any integer $k \le (1 - \xi)\, n\, P_X(B_{x,r})$, we have
\[
P(N_{x,r} \le k) \le \exp\Big( -\frac{\big(n P_X(B_{x,r}) - k\big)^2}{2 n P_X(B_{x,r})} \Big) \le \exp\Big( -\frac{\big(n P_X(B_{x,r}) - (1-\xi)\, n P_X(B_{x,r})\big)^2}{2 n P_X(B_{x,r})} \Big) = \exp\Big( -\frac{\xi^2 n P_X(B_{x,r})}{2} \Big).
\]
If $N_{x,r} > k$, then also $h_{k+1}(x) \le r$. Therefore, the event $\{h_{k+1}(x) > r\} \subseteq \{N_{x,r} \le k\}$, and we have:
\[
P\big( h_{k+1}(x) > r \big) \le P(N_{x,r} \le k) \le \exp\Big( -\frac{\xi^2 n P_X(B_{x,r})}{2} \Big).
\]
Now, we show which distance $h$ ensures that inequality (43) holds. When we only know that
\[
\lim_{r \to 0} E[Y \mid X \in B^o_{x,r}] = \eta(x), \tag{52}
\]
then the definition of a limit implies that there exists $h^* > 0$ (that depends on $x$ and $\varepsilon$) such that
\[
|E[Y \mid X \in B^o_{x,h}] - \eta(x)| \le \varepsilon/2 \quad \text{for all } h \in (0, h^*), \tag{53}
\]
i.e., inequality (43) holds, and so we have $|E_n[\hat\eta(x)] - \eta(x)| \le \varepsilon/2$ as shown earlier in inequality (46).

In the following derivation, we assume $\eta$ to be Lipschitz continuous with parameter $\vartheta_{\max}$. Further, we move $\eta(x)$ inside the expectation and use its first-order Taylor expansion with $X$ as the expansion point: $\eta(x) = \eta(X) + \nabla\eta(X) \cdot (x - X) + o(|x - X|)$, where $o(|x - X|)$ collects the higher-order terms.
\[
\big| E[\hat\eta(x) \mid X \in B_{x,h}] - \eta(x) \big|
= \big| E[Y + \nabla Y \cdot (x - X) \mid X \in B_{x,h}] - \eta(x) \big|
\]
\[
= \big| E[Y + \nabla Y \cdot (x - X) - \eta(X) - \nabla\eta(X) \cdot (x - X) - o(|x - X|) \mid X \in B_{x,h}] \big|
\]
\[
= \big| E[Y - \eta(X) + (\nabla Y - \nabla\eta(X)) \cdot (x - X) - o(|x - X|) \mid X \in B_{x,h}] \big|.
\]
Now, we know that $E[Y - \eta(X) \mid X \in B_{x,h}] = 0$, as the noise term has zero mean. Hence, the above equals
\[
\big| E[(\nabla Y - \nabla\eta(X)) \cdot (x - X) - o(|x - X|) \mid X \in B_{x,h}] \big|
\le E\big[ |(\nabla Y - \nabla\eta(X)) \cdot (x - X)| + |o(|x - X|)| \;\big|\; X \in B_{x,h} \big].
\]
We can bound $|x - X| < h_{\max}$ and $|\nabla Y - \nabla\eta(X)|$ by using the results from Lemma 2. The higher-order terms $o(|x - X|)$ can be bounded by the remainder of the Taylor series: $|R_{\mu}| \le \frac{\vartheta_{\mu}|X - x|^{\mu+1}}{(\mu+1)!}$, where $\vartheta_{\mu} \le \vartheta_{\max}$. Thus,
\[
E\big[ |(\nabla Y - \nabla\eta(X)) \cdot (x - X)| + |o(|x - X|)| \;\big|\; X \in B_{x,h} \big] \tag{54}
\]
\[
\le E\Big[ \frac{\vartheta_{\max} h_{\max}^{\mu+1}}{\sigma_1(\mu+1)!} \sqrt{\sum_{i=1}^{m} \|\nu_i\|_1^{2\mu}} \;\Big|\; X \in B_{x,h} \Big] + \frac{\vartheta_{\max} h_{\max}^2}{2}
\]
\[
= \frac{\vartheta_{\max} h_{\max}^{\mu+1}}{(\mu+1)!}\, E\Big[ \frac{\sqrt{\sum_{i=1}^{m} \|\nu_i\|_1^{2\mu}}}{\sigma_1} \;\Big|\; X \in B_{x,h} \Big] + \frac{\vartheta_{\max} h_{\max}^2}{2}
= \frac{\tau\,\vartheta_{\max} h_{\max}^{\mu+1}}{(\mu+1)!} + \frac{\vartheta_{\max} h_{\max}^2}{2},
\]
where in the last step we used $\tau = E\big[ \sqrt{\textstyle\sum_{i=1}^{m} \|\nu_i\|_1^{2\mu}} / \sigma_1 \;\big|\; X \in B_{x,h} \big]$. As this proof only concerns first-order approximations ($\mu = 1$), we require
\[
\frac{\tau\,\vartheta_{\max} h_{\max}^2}{2} + \frac{\vartheta_{\max} h_{\max}^2}{2} \le \frac{\varepsilon}{2}
\quad\Longleftrightarrow\quad
h_{\max} \le \sqrt{\frac{\varepsilon}{\vartheta_{\max}(1 + \tau)}}. \tag{59}
\]
This finishes the proof of Lemma 4. Theorem 1 follows from selecting $\xi = \tfrac{1}{2}$ and observing that $n \ge \frac{8}{P_X(B_{x,h})}\log\frac{2}{\delta}$ ensures $\exp\big( -\tfrac{1}{2}\xi^2 n P_X(B_{x,h}) \big) \le \frac{\delta}{2}$, while $k \ge \frac{2(y_{\max} - y_{\min})^2}{\varepsilon^2}\log\frac{4}{\delta}$ ensures $2\exp\big( -\frac{k\varepsilon^2}{2(y_{\max} - y_{\min})^2} \big) \le \frac{\delta}{2}$.

D. Hyperparameters

Table 5. Number of hyperparameter configurations tuned for each model on the benchmark datasets. For DNNR, KNN, and MLP, we take small datasets to be n < 2000 and medium datasets to be n < 50000; the same applies for PMLB and Feynman. We were unable to tune the TabNet model due to computational constraints and used the TabNet-L configuration for the larger datasets (Sarcos, Protein, CO2 and NOx Emissions) and TabNet-S (n_d = n_a = 16) for the smaller datasets.

                 Airfoil  CO2 Emission  California  Concrete  NOx Emission  Protein  Yacht
DNNR             150      40            40          150       40            40       150
LL               50       50            50          50        50            50       50
Random Forest    12       12            12          12        12            12       12
Grad. B. Trees   48       48            48          48        48            48       48
MLP              594      72            72          594       72            72       594
CatBoost         48       48            48          48        48            48       48
XGBoost          48       48            48          48        48            48       48
LGBM             48       48            48          48        48            48       48
KNN              384      64            64          384       64            64       384
TabNet           1        1             1           1         1             1        1

Table 6. XGBoost Hyperparameters.
learning rate: [0.001, 0.01, 0.1, 0.3]
max depth: [3, 5, 10]
n estimators: [50, 100, 500, 1000]

Table 7. LightGBM Hyperparameters.
learning rate: [0.001, 0.01, 0.1, 0.3]
max depth: [3, 5, 10]
n estimators: [50, 100, 500, 1000]

Table 8. CatBoost Hyperparameters.
verbose: [False]
learning rate: [0.001, 0.01, 0.1, 0.3]
max depth: [3, 5, 10]
n estimators: [50, 100, 500, 1000]

Table 9. Gradient Boosting Hyperparameters.
learning rate: [0.001, 0.01, 0.1, 0.3]
max depth: [3, 5, 10]
n estimators: [50, 100, 500, 1000]

Table 10. Random Forest Hyperparameters.
criterion: [mse]
n estimators: [50, 100, 500, 1000]
max features: [auto, sqrt, log2]
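As a concrete example of how one of these grids could be searched, the sketch below feeds the XGBoost grid of Table 6 (4 x 3 x 4 = 48 configurations, matching the XGBoost count in Table 5) into scikit-learn's GridSearchCV; the 3-fold cross-validation, MSE scoring, and toy data are illustrative assumptions rather than the exact tuning protocol used in the experiments.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# XGBoost grid from Table 6: 4 x 3 x 4 = 48 configurations.
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1, 0.3],
    "max_depth": [3, 5, 10],
    "n_estimators": [50, 100, 500, 1000],
}
search = GridSearchCV(XGBRegressor(), param_grid,
                      scoring="neg_mean_squared_error", cv=3)

# Toy regression data standing in for one of the benchmark splits.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
search.fit(X_train, y_train)
print(search.best_params_)
```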
Table 11. MLP Hyperparameters for small datasets.
hidden layer sizes: [(25,), (50,), (100,), (250,), (25, 25), (50, 50), (100, 100), (250, 250), (25, 25, 25), (50, 50, 50), (100, 100, 100)]
alpha: [0, 0.01, 1]
batch size: [64, 128]
learning rate: [constant, invscaling, adaptive]
learning rate init: [0.001, 0.01, 0.1]
early stopping: [True]

Table 12. MLP Hyperparameters for medium datasets.
hidden layer sizes: [(128,), (128, 128), (128, 128, 128)]
alpha: [0, 0.01]
batch size: [512]
learning rate: [constant, invscaling, adaptive]
learning rate init: [0.0001, 0.001, 0.01, 0.1]
early stopping: [True]

Table 13. MLP Hyperparameters for large datasets.
hidden layer sizes: [(128,), (128, 128), (128, 128, 128)]
alpha: [0]
batch size: [512]
learning rate: [constant, adaptive]
learning rate init: [0.0001, 0.001, 0.01]
early stopping: [True]

Table 14. KNN Hyperparameters for small datasets.
n neighbors: [2, 5, 7, 10, 20, 30, 40, 50]
weights: [uniform, distance]
algorithm: [ball tree, kd tree, brute]
leaf size: [10, 30, 50, 100]
p: [1, 2]

Table 15. KNN Hyperparameters for medium datasets.
n neighbors: [2, 5, 7, 10, 25, 50, 100, 250]
weights: [distance, uniform]
leaf size: [10, 30, 50, 100]
p: [2]

Table 16. KNN Hyperparameters for large datasets.
n neighbors: [2, 3, 5, 7, 10, 12, 15, 20, 25]
weights: [distance]
leaf size: [10, 30, 50, 100]
p: [2]

Table 17. TabNet Hyperparameters for large datasets.
n d: [128]
n a: [128]
verbose: [False]
lambda sparse: [0.0001]
batch size: [4096]
virtual batch size: [128]
momentum: [0.8]
n steps: [5]
gamma: [1.5]
optimizer params: [{lr: 0.02}]
scheduler params: [{step_size: 8000, gamma: 0.9}]
max epochs: [3000]
patience: [100]

Table 18. TabNet Hyperparameters for small datasets.
n d: [8, 16]
n a: [8, 16]
verbose: [False]
lambda sparse: [0.0001]
batch size: [64, 128]
virtual batch size: [8, 16]
momentum: [0.02]
n steps: [3]
gamma: [1.3]
optimizer params: [{lr: 0.02}]
scheduler params: [{step_size: 10, gamma: 0.95}]
max epochs: [3000]
patience: [100]

Table 19. DNNR Hyperparameters for small datasets; n neighbors corresponds to k. For k' (the number of neighbors used in approximating the gradient), we use n values sampled between lower bound · d and upper bound · d.
n neighbors: [1, 2, 3, 5, 7]
upper bound: [15]
lower bound: [2]
n: [30]

Table 20. DNNR Hyperparameters for medium datasets; n neighbors corresponds to k. For k' (the number of neighbors used in approximating the gradient), we use n values sampled between lower bound · d and upper bound · d.
n neighbors: [3, 4]
upper bound: [18]
lower bound: [2]
n: [20]

Table 21. DNNR Hyperparameters for large datasets; n neighbors corresponds to k. For k' (the number of neighbors used in approximating the gradient), we use n values sampled between lower bound · d and upper bound · d.
n neighbors: [3]
upper bound: [12]
lower bound: [2]
n: [14]

Table 22. LL Hyperparameters. For the size of the neighborhood (k), we use n values sampled between lower bound · d and upper bound · d.
upper bound: [25]
lower bound: [2]
n: [50]
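The DNNR and LL grids above describe the neighborhood size for the local gradient (or hyperplane) fit only by a range [lower bound · d, upper bound · d] and a count n. Below is a minimal sketch of one way such a grid could be realized; the even integer spacing and the function name are assumptions, since the tables do not specify the exact sampling scheme.

```python
import numpy as np

def neighborhood_size_grid(d, lower_bound=2, upper_bound=15, n=30):
    """Candidate neighborhood sizes for the local fit, as in Tables 19-22:
    n values spread between lower_bound * d and upper_bound * d.
    Even integer spacing is an assumption; duplicates are dropped."""
    grid = np.linspace(lower_bound * d, upper_bound * d, num=n)
    return np.unique(np.rint(grid).astype(int))

# Example with the Table 19 settings for a hypothetical dataset with d = 8 features;
# the full DNNR search would combine this grid with k (n neighbors) in [1, 2, 3, 5, 7].
print(neighborhood_size_grid(d=8, lower_bound=2, upper_bound=15, n=30))
```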