Robust Kernel Density Estimation with Median-of-Means principle

Pierre Humbert*1, Batiste Le Bars*2, Ludovic Minvielle*3

*Authors in alphabetical order. 1Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d'Orsay, 91405, Orsay, France. 2Université Lille, CNRS, Inria, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille. 3Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, F-91190, Gif-sur-Yvette, France. Correspondence to: Batiste Le Bars, Pierre Humbert.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

In this paper, we introduce a robust nonparametric density estimator combining the popular Kernel Density Estimation method and the Median-of-Means principle (MoM-KDE). This estimator is shown to achieve robustness for a large class of anomalous data, potentially adversarial. While previous works only prove consistency results under very specific contamination models, this work provides finite-sample high-probability error bounds without any prior knowledge on the outliers. To highlight the robustness of our method, we introduce an influence function adapted to the considered O∪I framework. Finally, we show that MoM-KDE achieves competitive results when compared with other robust kernel estimators, while having a significantly lower computational complexity.

1. Introduction

Over the past years, the task of learning in the presence of outliers has become an increasingly important objective in both statistics and machine learning. Indeed, in many situations, training data can be contaminated by undesired samples, which may badly affect the resulting learning task, especially in adversarial settings. Therefore, building robust estimators and algorithms that are resilient to outliers is becoming crucial in many learning procedures. In particular, the inference of a probability density function from a contaminated random sample is of major concern. Density estimation methods are mostly divided into parametric and nonparametric techniques. Among the nonparametric family, the Kernel Density Estimator (KDE) is arguably the most known and used for both univariate and multivariate densities (Parzen, 1962; Silverman, 1986; Scott, 2015), but it is also known to be sensitive to datasets contaminated by outliers (Kim and Scott, 2011; 2012; Vandermeulen and Scott, 2014). The construction of a robust KDE is therefore an important area of research that can have useful applications, such as anomaly detection and resilience to adversarial data corruption. Yet, only a few works have proposed such a robust estimator. Kim and Scott (2012) proposed to combine KDE with ideas from M-estimation to construct the so-called Robust Kernel Density Estimator (RKDE). However, no consistency results were provided and robustness was rather shown experimentally. Later, RKDE was proven to converge to the true density, but only under the condition that the dataset remains uncorrupted (Vandermeulen and Scott, 2013). More recently, Vandermeulen and Scott (2014) proposed another robust estimator, called Scaled and Projected KDE (SPKDE). The authors proved the L1-consistency of SPKDE under a variant of Huber's ε-contamination model (Huber, 1992) where two strong assumptions are made. First, the contamination parameter ε is assumed to be known, and second, the outliers must be uniform over the support of the true density.
Unfortunately, as they did not provide rates of convergence, it still remains unclear at which speed SPKDE converges to the true density. Finally, both RKDE and SPKDE require iterative algorithms to compute their estimators, thus increasing the overall complexity of their construction. On the theoretical side, the recent work of Liu and Gao (2019) managed to obtain minimax optimal rates for kernel density estimates, but in the quite restrictive Huber's model (see Sec. 2.1).

In statistical analysis, another idea to construct robust estimators is to use the Median-of-Means principle (MoM). Introduced by Nemirovsky and Yudin (1983), Jerrum et al. (1986), and Alon et al. (1999), the MoM was first designed to estimate the mean of a real random variable. It relies on the simple idea that, rather than taking the average of all the observations, the sample is split into several non-overlapping blocks over which the mean is computed. The MoM estimator is then defined as the median of these means. Being easy to compute, the MoM has been studied by Minsker et al. (2015) and Devroye et al. (2016) to estimate the means of heavy-tailed distributions. Furthermore, due to its robustness to outliers, MoM-based estimators have recently gained a renewed interest in the machine learning community (Lecué et al., 2020b;a; Laforgue et al., 2020).

Contributions. In this paper, we propose a new robust nonparametric density estimator based on the combination of the Kernel Density Estimation method and the Median-of-Means principle (MoM-KDE). We place ourselves in a more general framework than the classical Huber contamination model, called O∪I, which gets rid of any assumption on the outliers. We demonstrate the statistical performance of the estimator through finite-sample high-confidence error bounds in the L∞-norm and show that MoM-KDE is consistent under the condition that the outlier proportion tends to 0, a necessary condition in robust kernel density estimation (Liu and Gao, 2019). Additionally, we prove the consistency in the L1-norm, which is known to reflect the global performance of the estimate (Devroye and Gyorfi, 1985). To the best of our knowledge, this is the first work that presents such results in the context of robust kernel density estimation under the O∪I framework. As a measure of robustness, we also introduce an influence function adapted to the O∪I framework. It allows us to find the number of outliers above which the MoM-KDE is less sensitive to outliers than the KDE. Finally, we demonstrate the empirical performance of MoM-KDE on both synthetic and real data and show the practical interest of such an estimator, as it has a lower computational complexity than the baseline RKDE and SPKDE.

2. Median-of-Means Kernel Density Estimation

We first recall the classical kernel density estimator. Let X_1, . . . , X_n be independent and identically distributed (i.i.d.) random variables that have a probability density function (pdf) f(·) with respect to the Lebesgue measure on R^d. The Kernel Density Estimate of f (KDE), also called the Parzen-Rosenblatt estimator, is a nonparametric estimator given by

\[ \hat{f}_n(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right), \qquad (1) \]

where h > 0 and K : R^d → R_+ is an integrable function satisfying \(\int K(u)\,du = 1\) (Tsybakov, 2008). Such a function K(·) is called a kernel and the parameter h is called the bandwidth of the estimator. The bandwidth is a smoothing parameter that controls the bias-variance tradeoff of \(\hat{f}_n(\cdot)\) with respect to the input data.
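To make Equation (1) concrete, here is a minimal NumPy sketch of the KDE evaluation; the Gaussian kernel and all function and variable names are our own illustrative choices, not taken from the authors' implementation.

```python
import numpy as np

def gaussian_kernel(u):
    # Multivariate Gaussian kernel; integrates to 1 over R^d.
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi) ** (d / 2)

def kde(x, X, h):
    """Standard KDE of Equation (1) evaluated at the query points x.

    x: (m, d) query points, X: (n, d) sample, h: bandwidth > 0.
    """
    u = (X[None, :, :] - x[:, None, :]) / h      # scaled differences, shape (m, n, d)
    return gaussian_kernel(u).mean(axis=1) / h ** X.shape[1]
```

Evaluating one query point costs O(n) kernel evaluations, which is the per-point cost referred to in Section 2.3.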
Figure 1. From left to right: true density and outliers from a uniform density, estimation with KDE, and with MoM-KDE.

While this estimator is central in statistics, a major drawback is its weakness against outliers (Kim and Scott, 2008; 2011; 2012). Indeed, as it assigns uniform weights 1/n to every kernel term K(·) regardless of whether X_i is an outlier or not, inliers and outliers contribute equally to the construction of the KDE, which results in undesired bumps over outlier locations in the final estimated density (see Figure 1). In the following, we propose a KDE-based density estimator robust to the presence of outliers in the sample set. These outliers are considered in a general framework described in the next section.

2.1. Outlier setup

Throughout the paper, we consider the O∪I framework introduced by Lecué and Lerasle (2019). This very general framework allows the presence of outliers in the dataset and relaxes the standard i.i.d. assumption on each observation. We therefore assume that the n random variables are partitioned into two (unknown) groups: a subset {X_i | i ∈ I} made of inliers, and another subset {X_i | i ∈ O} made of outliers, such that O ∩ I = ∅ and O ∪ I = {1, . . . , n}. While we suppose that the X_i, i ∈ I, are i.i.d. from a distribution that admits a density f with respect to the Lebesgue measure, no assumption is made on the outliers X_i, i ∈ O. Hence, these outlying points can be dependent or adversarial.

The O∪I framework is related to the well-known Huber's ε-contamination model (Huber, 1992), where it is assumed that data are i.i.d. with distribution g = (1 − ε) f_I + ε f_O, with ε ∈ [0, 1); the distribution f_I being related to the inliers and f_O to the outliers. However, there are several important differences. First, in the O∪I framework the proportion of outliers is fixed and equals |O|/n, whereas it is random in Huber's ε-contamination model (Lerasle, 2019). Second, the O∪I framework is less restrictive. Indeed, contrary to Huber's model, which considers that inliers and outliers are each i.i.d. from their respective distributions, the O∪I framework does not make any assumption about the outliers. This framework should not be confused with the ε-corruption model, where an adversary is allowed to directly modify the inliers set (Diakonikolas et al., 2019).

2.2. MoM-KDE

We now present our main contribution, a robust kernel density estimator based on the MoM. This estimator is essentially motivated by the fact that the classical kernel density estimation at one point corresponds to an empirical average (see Equation (1)). Therefore, the MoM principle appears to be an intuitive solution to build a robust version of the KDE. A formal definition of MoM-KDE is given below.

Definition 1. (MoM Kernel Density Estimator) Let 1 ≤ S ≤ n, and let B_1, . . . , B_S be a random partition of {1, . . . , n} into S non-overlapping blocks B_s of equal size n_s ≃ n/S. The MoM Kernel Density Estimator (MoM-KDE) of f at x_0 is given by

\[ \hat{f}_{\mathrm{MoM}}(x_0) \triangleq \mathrm{Median}\left( \hat{f}_{n_1}(x_0), \dots, \hat{f}_{n_S}(x_0) \right), \qquad (2) \]

where \(\hat{f}_{n_s}(x_0)\) is the value of the standard kernel density estimator at x_0 obtained via the samples of the s-th block B_s.

Note that \(\hat{f}_{\mathrm{MoM}}(\cdot)\) does not necessarily integrate to 1. However, as suggested by Devroye and Lugosi (2012) (Section 5.6), it can always be normalized by its integral. Broadly speaking, MoM estimators appear to be a good tradeoff between the unbiased but non-robust empirical mean and the robust but biased median (Lecué et al., 2020b). Furthermore, we remark that when S = 1, the standard KDE is recovered.
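As a concrete illustration of Definition 1, the following sketch builds MoM-KDE on top of the `kde` helper above; the partitioning scheme and the names are illustrative assumptions and not the authors' reference implementation (which is available at the GitHub URL given in Section 4).

```python
import numpy as np

def mom_kde(x, X, h, S, rng=None):
    """MoM-KDE of Definition 1 evaluated at the query points x.

    The sample X is split at random into S non-overlapping blocks of roughly
    equal size; the estimate at each query point is the median of the S
    block-wise KDE values (S = 1 recovers the standard KDE).
    """
    rng = np.random.default_rng(rng)
    blocks = np.array_split(rng.permutation(X.shape[0]), S)        # random partition of {1, ..., n}
    block_values = np.stack([kde(x, X[idx], h) for idx in blocks])  # shape (S, m)
    return np.median(block_values, axis=0)
```

As noted above, the output need not integrate to one; it can be renormalized by a numerical estimate of its integral (see Section 2.3).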
2.3. Time complexity

The complexity of MoM-KDE to evaluate one point is the same as for the standard KDE, O(n): O(S × n/S) for the blockwise evaluation and O(S) to compute the median with the median-of-medians algorithm (Blum et al., 1973). Since RKDE and SPKDE are KDEs with modified weights, they also perform the evaluation step in O(n) time. However, these weights need to be learnt, thus requiring an additional non-negligible computing capacity. Indeed, each of them relies on an iterative method, respectively the iteratively reweighted least squares algorithm and the projected gradient descent algorithm, both with a complexity of O(n_iter · n²), where n_iter is the number of iterations needed to reach a reasonable accuracy. MoM-KDE, on the other hand, does not require any learning procedure. Depending on the application at hand, MoM-KDE may require a normalization step. When using a Monte Carlo method, the complexity of such a normalization step at a given precision of 1/√(n_norm) is O(n_norm) (Weinzierl, 2000). Note that the evaluation step can be accelerated in several ways, hence potentially reducing the computational time of all these competing methods (Gray and Moore, 2003a;b; Wang and Scott, 2019; Backurs et al., 2019). Theoretical time complexities are gathered in Table 1.

Table 1. Computational complexity.

Method   | Learning         | Evaluation     | Iterative method
KDE      | --               | O(n)           | no
RKDE     | O(n_iter · n²)   | O(n)           | yes
SPKDE    | O(n_iter · n²)   | O(n)           | yes
MoM-KDE  | --               | O(n + n_norm)  | no
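The normalization step mentioned above can be sketched with a plain Monte Carlo estimate of the integral of the unnormalized estimator; the uniform sampling box below is an illustrative assumption (any region covering the effective support, or a quadrature rule in low dimension, would do), and the names are ours.

```python
import numpy as np

def mc_normalizing_constant(density, low, high, n_norm=100_000, rng=None):
    """Monte Carlo estimate of the integral of `density` over the box [low, high]^d.

    The standard error of the estimate decreases as O(1 / sqrt(n_norm)).
    """
    rng = np.random.default_rng(rng)
    low, high = np.atleast_1d(low).astype(float), np.atleast_1d(high).astype(float)
    samples = rng.uniform(low, high, size=(n_norm, low.shape[0]))
    volume = np.prod(high - low)
    return volume * density(samples).mean()

# Usage sketch: Z = mc_normalizing_constant(lambda x: mom_kde(x, X, h, S), low, high)
# and the normalized estimator is x -> mom_kde(x, X, h, S) / Z.
```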
3. Theoretical analysis

In this section, we give a finite-sample high-probability error bound in the L∞-norm for MoM-KDE under the O∪I framework. To our knowledge, this work is the first to provide such error bounds in robust kernel density estimation under this framework. In order to build this high-probability error bound, it is assumed, among other standard hypotheses, that the true density is Hölder-continuous, a smoothness property usually considered in KDE analysis (Tsybakov, 2008; Jiang, 2017; Wang et al., 2019). In addition, we show the consistency in the L1-norm. In this last result, we will see that the aforementioned assumptions are not necessary to obtain the consistency. In the following, we give the necessary definitions and assumptions to perform our non-asymptotic analysis.

3.1. Setup and assumptions

Let us first list the usual assumptions, notably on the considered kernel function, that will allow us to derive our results. They are standard in KDE analysis, and are chosen for their simplicity of comprehension (Tsybakov, 2008; Jiang, 2017). More general hypotheses could be made in order to obtain the same results, notably assuming kernels of order ℓ (see for example the works of Tsybakov (2008) and Wang et al. (2019)).

Assumption 1. (Bounded density) ‖f‖∞ < ∞.

We make the following assumptions on the kernel K, which are fairly standard.

Assumption 2. The kernel K : R^d → R is bounded by a positive constant C_ρ and integrates to 1.

Assumption 3. Let F be the class of functions of the form z ∈ R^d ↦ K(x − z). Then, F is a uniformly bounded VC class (Wang et al., 2019).

All the above assumptions are respected by most of the popular kernels, in particular the Gaussian, Exponential, Uniform, Triangular, Cosine kernels, etc. These are key properties to provide the bounds presented in the next section. Before stating our main results, we recall the definition of the Hölder class of functions.

Definition 2. (Hölder class) Let T be an interval of R^d, and let 0 < α ≤ 1 and L > 0 be two constants. We say that a function f : T → R belongs to the Hölder class Σ(L, α) if it satisfies

\[ \forall x, x' \in T, \quad |f(x) - f(x')| \leq L \, \|x - x'\|^{\alpha} . \qquad (3) \]

This definition implies a notion of smoothness on the function f, and is a convenient property to bound the bias of KDE-based estimators.

3.2. L∞ and L1 consistencies of MoM-KDE

This section states our central finding, an L∞ finite-sample error bound for MoM-KDE that proves its consistency under the O∪I framework.

Lemma 1. (L∞ error-bound of the KDE without outliers (Wang et al., 2019)) Suppose that f belongs to the class of densities P(α, L) defined as

\[ \mathcal{P}(\alpha, L) \triangleq \left\{ f \;\middle|\; f \geq 0, \ \int f(x)\,dx = 1, \ \text{and } f \in \Sigma(L, \alpha) \right\}, \qquad (4) \]

where Σ(L, α) is the Hölder class of functions on R^d (Definition 2). Grant Assumptions 1 to 3 and let h ∈ (0, 1), γ > 0, n large enough, and nh^d ≥ 1. Then with probability at least 1 − exp(−γ), we have

\[ \big\| \hat{f}_n - f \big\|_{\infty} \;\leq\; C_1 \sqrt{\frac{\gamma + \log(1/h)}{n h^d}} + C_2 h^{\alpha} , \qquad (5) \]

where \(C_2 = L \int \|u\|^{\alpha} K(u)\,du < \infty\) and C_1 is a constant that only depends on ‖f‖∞, the dimension d, and the kernel properties.

This lemma, which is verified several times in the literature (see e.g. Giné and Guillou (2002); Jiang (2017); Wang et al. (2019)), comes from the well-known bias-variance decomposition, where we separately bound the variance (see e.g. Wang et al. (2019); Kim et al. (2019)) and the bias (see e.g. Tsybakov (2008); Rigollet and Vert (2009)). It shows the consistency of the KDE without outliers, as soon as h → 0 and nh^d → ∞. We now present our main result. Its objective is to show that, even under the O∪I framework, we do not need any critical additional hypothesis besides the ones of the previous lemma to show that MoM-KDE is consistent.

Proposition 1. (L∞ error-bound of the MoM-KDE under the O∪I framework) Suppose that f belongs to the class of densities P(α, L) and grant Assumptions 1 to 3. Let S be the number of blocks, such that S ≥ 2|O| + 1. Then, for any h ∈ (0, 1), γ > 0, n/S large enough, and nh^d ≥ S, we have with probability at least 1 − exp(−γ),

\[ \big\| \hat{f}_{\mathrm{MoM}} - f \big\|_{\infty} \;\leq\; C_1 \sqrt{\frac{S\left(\log(S) + \gamma + \log(1/h)\right)}{n h^d}} + C_2 h^{\alpha} , \]

where \(C_2 = L \int \|u\|^{\alpha} K(u)\,du < \infty\), and C_1 is a constant that only depends on ‖f‖∞, the dimension d, and the kernel properties.

The proof of this proposition is given in the Appendix. In addition to S ≥ 2|O| + 1 in Proposition 1, the other conditions come from those in Lemma 1, which are necessary to obtain the rate in the uncorrupted scenario. The upper bound is minimized by taking S = 2|O| + 1. Note also that when |O| = 0 and S = 1, we exactly recover the result of Lemma 1, i.e. the rate for the KDE without outliers. This was expected since, for S = 1, the MoM-KDE is exactly the KDE.

Corollary 1. (Rate of convergence) Consider the assumptions of Proposition 1 with S = 2|O| + 1, γ = log(n), and let

\[ h = \left( \frac{S \log(n)}{n} \right)^{1/(2\alpha + d)} . \]

With probability higher than 1 − 1/n,

\[ \big\| \hat{f}_{\mathrm{MoM}} - f \big\|_{\infty} \;\lesssim\; \left( \frac{|O| \log(n)}{n} \right)^{\alpha/(2\alpha + d)} + \left( \frac{\log(n)}{n} \right)^{\alpha/(2\alpha + d)} . \]

Corollary 1, proved in the Appendix, shows that the rate is split into two terms. On the right, we have the classical minimax optimal rate of the KDE without outliers. On the left, we have a rate that depends on the outlier proportion |O|/n. It also shows that a sufficient condition to obtain consistency (error tending to 0) is to have n → ∞ (standard) and the proportion of outliers |O|/n tending to 0. This latter condition was also expected, since the minimax analysis done for the specific case of Huber's model already captured it as a necessary condition (Liu and Gao, 2019). Note further that, when there is no outlier, i.e. |O| = 0, we recover the standard KDE minimax optimal rate (Wang et al., 2019).
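To make the rate in Corollary 1 more tangible, the following display instantiates it for a Lipschitz density on the real line; the specific values α = 1 and d = 1 are an illustrative special case of the statement above, not an additional result.

```latex
% Corollary 1 with \alpha = 1 and d = 1 (Lipschitz density on the real line),
% S = 2|O| + 1 and h = (S \log n / n)^{1/3}:
\[
  \big\| \hat{f}_{\mathrm{MoM}} - f \big\|_{\infty}
  \;\lesssim\;
  \left( \frac{|O| \log n}{n} \right)^{1/3}
  + \left( \frac{\log n}{n} \right)^{1/3} ,
\]
% so that, for instance, a contamination level |O|/n of order n^{-1/2}
% still yields a vanishing sup-norm error of order (\log n / \sqrt{n})^{1/3}.
```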
Last but not least, our two main results illustrate that the convergence of the MoM-KDE only depends on the number of outliers in the dataset, and not on their "type". This estimator is therefore robust in a wide range of scenarios. In particular, the proofs of Proposition 1 and Corollary 1 are also valid under the adversarial scenario (Diakonikolas et al., 2019; Depersin and Lecué, 2021), which is a more complex case than the O∪I framework. For instance, this framework allows the inliers to be highly correlated because of the corruption step.

We now give an L1-consistency result under mild hypotheses, which is known to reflect the global performance of the estimate. Indeed, a small L1 error leads to accurate probability estimation (Devroye and Gyorfi, 1985).

Proposition 2. (L1-consistency in probability) If n/S → ∞, h → 0, nh^d → ∞, S/√(nh^d) → 0, and S ≥ 2|O| + 1, then

\[ \big\| \hat{f}_{\mathrm{MoM}} - f \big\|_{1} \;\xrightarrow[n \to \infty]{\mathbb{P}}\; 0 . \]

This result is obtained by bounding the left-hand part by the errors in the healthy blocks only, i.e. those without anomalies. Under the hypotheses of the proposition, these errors are known to converge to 0 in probability (Wang et al., 2019). The complete proof is given in the Appendix. Contrary to SPKDE (Vandermeulen and Scott, 2014), no assumption on the outlier generation process is necessary to obtain this consistency result. Moreover, while we need to assume that the proportion of outliers is perfectly known to prove the convergence of SPKDE, the MoM-KDE converges whenever the number of outliers is overestimated. Finally, note that again in Proposition 2, a condition to obtain consistency is that the fraction of outliers |O|/n tends to zero. As explained above, this assumption is natural since the fraction of outliers tending to zero has been shown to be a necessary condition for consistency in the Huber framework (Liu and Gao, 2019).
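A small numerical illustration of Propositions 1 and 2 can be sketched as follows, reusing the `kde` and `mom_kde` helpers above: a Gaussian sample is contaminated by |O| adversarial points placed far in the tail, the number of blocks is set to S = 2|O| + 1, and both estimators are compared at the outlier location. The density, bandwidth, and location choices below are our own illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def error_at_outlier_location(n=2000, n_out=100, x_star=10.0, seed=0):
    """Absolute error of KDE and MoM-KDE at the outlier location x_star.

    The true N(0, 1) density is essentially zero at x_star = 10: the plain KDE
    builds a spurious bump there, while the block-wise median suppresses it as
    long as more than half of the blocks are outlier-free (S >= 2|O| + 1).
    """
    rng = np.random.default_rng(seed)
    inliers = rng.normal(0.0, 1.0, size=(n - n_out, 1))
    X = np.vstack([inliers, np.full((n_out, 1), x_star)])
    h = n ** (-1 / 5)                        # classical univariate bandwidth order
    S = 2 * n_out + 1                        # block condition of Propositions 1 and 2
    query = np.array([[x_star]])
    truth = norm.pdf(x_star)
    return (abs(kde(query, X, h)[0] - truth),
            abs(mom_kde(query, X, h, S, rng)[0] - truth))
```

With these illustrative settings, the first returned error stays of the order of n_out/(n h) while the second is close to zero, mirroring Figure 1.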
3.3. Influence function in the O∪I framework

As a measure of robustness, we now introduce an Influence Function (IF), or sensitivity curve, adapted to the O∪I framework. It is inspired by the classical IF, first proposed by Hampel (1974), which measures how an estimator changes when the initial distribution is modified by adding a small amount of contamination at a point x*. Therefore, it provides a notion of stability in the Huber model framework (Andrews, 1986; Debruyne et al., 2008). Generalizing the IF from the Huber model to the O∪I framework is not that straightforward. Indeed, the definition of the IF for a function estimate (introduced for the Huber model in Kim and Scott (2011), Definition 1) is

\[ \mathrm{IF}_{\mathrm{Huber}}(x, x^{\star}; T, F) = \lim_{s \to 0} \frac{T(x; F_s) - T(x; F)}{s} , \]

where \(F_s = (1 - s)F + s\,\delta_{x^{\star}}\) and T is a function estimate based on F evaluated at x. This formulation is adapted to the Huber model, as it considers all samples drawn from a mixture of a true density (here F) and an outlying one (δ_{x*}). This is unfortunately not adapted to the O∪I framework. To highlight the robustness of our estimator using an IF-based idea, we adapt it to this framework and propose the following definition.

Figure 2. Influence function for MoM-KDE and KDE where n = 1000, m = 40 outliers are placed at x* = 12, and where the samples in I_n are drawn from the true density: 2/3 N(0, 1) + 1/3 N(5, 1).

Definition 3. (IF_{O∪I}) Let T_n(x_0; I_n) be a density estimator evaluated at x_0 and learned with a healthy data set I_n = {X_i}_{i=1}^n. Let m ∈ N and x* ∈ R^d. The IF_{O∪I} is defined as:

\[ \mathrm{IF}_{O \cup I}(x_0, x^{\star}, m; I_n, T_n) \triangleq \big| T_n(x_0; I_n) - T_n\big(x_0; I_n \cup \{x^{\star}\}_{i=1}^{m}\big) \big| , \]

where by healthy points we mean inliers, i.e. samples that are independently drawn from the true density function.

Given this definition, IF_{O∪I} quantifies how much the value at x_0 of an estimated density function changes whenever the healthy dataset is increased by m points located at x*. Therefore, the link with the notion of stability is made obvious: the smaller IF_{O∪I} is, the more stable and thus robust the estimator is. An illustration of the IF for MoM-KDE and KDE is given in Figure 2. It shows that the IF for MoM-KDE is lower than the one of KDE, especially near the outlying point x*. Note that the variability observed in the case of MoM-KDE is due to the block-splitting procedure. See Section C of the Appendix for more results on the IF. In the next proposition, we provide a lower bound on the number of added samples m over which the IF_{O∪I} of the MoM-KDE is lower than the one of KDE with high probability.

Proposition 3. Let x*, x_0 ∈ R^d and I_n be a healthy data set. Grant Assumptions 1 to 3 and denote

\[ a = \sum_{i \in I_n} K\!\left(\frac{X_i - x_0}{h}\right) \quad \text{and} \quad b = K\!\left(\frac{x^{\star} - x_0}{h}\right) . \]

Let S ≥ 2m + 1, with m ∈ ⟦0, n/2⟦ the number of added samples, and δ > 0 such that |b − a/n| > C_ρ √(2δS/n). Then with probability higher than 1 − 4 exp(−δ) we have:

\[ \mathrm{IF}_{O \cup I}(x_0, x^{\star}, m; I_n, \hat{f}_{\mathrm{MoM}}) \;\leq\; \mathrm{IF}_{O \cup I}(x_0, x^{\star}, m; I_n, \hat{f}_{\mathrm{KDE}}) . \]

The proof of this proposition is given in the Appendix. Given the previous proposition, the lower bound on m over which the MoM-KDE is better than the KDE is not necessarily easy to interpret. When everything is fixed except x_0 and x*, we see that the bound is low whenever

\[ |b - a/n| = \Big| K\!\Big(\frac{x^{\star} - x_0}{h}\Big) - \frac{1}{n} \sum_{i \in I_n} K\!\Big(\frac{X_i - x_0}{h}\Big) \Big| \]

is large. A sufficient condition for this is to take x* far from the sampling set I_n, i.e. take x* as an outlier. Under this condition, the bound will get even lower whenever x_0 gets closer to x*.
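Definition 3 translates directly into code. The sketch below (reusing the `kde` and `mom_kde` helpers above; names and settings are illustrative) computes the empirical IF_{O∪I} of an estimator when m copies of x* are appended to a healthy sample, in the spirit of Figure 2.

```python
import numpy as np

def influence_oi(estimator, x0, x_star, m, X_healthy, **kwargs):
    """Empirical IF_{O u I}(x0, x*, m; I_n, T_n) of Definition 3."""
    contaminated = np.vstack([X_healthy, np.tile(x_star, (m, 1))])
    return np.abs(estimator(x0, X_healthy, **kwargs)
                  - estimator(x0, contaminated, **kwargs))

# Example in the spirit of Figure 2: m = 40 copies of x* = 12 appended to
# n = 1000 healthy samples, with S = 2m + 1 = 81 blocks for MoM-KDE.
# if_kde = influence_oi(kde, x0_grid, np.array([12.0]), 40, X_healthy, h=h)
# if_mom = influence_oi(mom_kde, x0_grid, np.array([12.0]), 40, X_healthy, h=h, S=81)
```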
4. Numerical experiments

In this section, we display numerical results supporting the relevance of MoM-KDE. All experiments were run on a personal laptop computer using Python. The code of MoM-KDE is made available online¹.

¹ https://github.com/lminvielle/mom-kde. For the sake of comparison, we also implemented RKDE and SPKDE.

Comparative methods. In the following experiments, we propose to compare MoM-KDE to the classical KDE and two robust versions of KDE, called RKDE (Kim and Scott, 2012) and SPKDE (Vandermeulen and Scott, 2014). As previously explained, RKDE takes the ideas of robust M-estimation and translates them to kernel density estimation. The authors point out that the classical KDE estimator can be seen as the minimizer of a squared error loss in the Reproducing Kernel Hilbert Space (RKHS) H corresponding to the chosen kernel. Instead of minimizing this loss, they propose to minimize a robust version of it:

\[ \hat{f}_{\mathrm{RKDE}} = \operatorname*{argmin}_{g \in \mathcal{H}} \ \sum_{i=1}^{n} \rho\big( \| \phi(X_i) - g \|_{\mathcal{H}} \big) , \]

where φ is the canonical feature map associated to the RKHS and ρ(·) is taken to be either the Huber or the Hampel function, both known to bring robustness in M-estimation. The solution of the newly expressed problem is then found using the iteratively reweighted least squares algorithm. Note that this RKHS approach has also been combined with the MoM principle for robust estimation in (Lerasle et al., 2019). However, no algorithm was proposed to build this estimator, which is why no comparison is made here.

SPKDE proposes to scale and project the standard KDE in a way that decontaminates the dataset. Recalling that \(\hat{f}_n\) is the classical KDE estimator of Equation (1), this procedure is done by finding

\[ \hat{f}_{\mathrm{SPKDE}} = \operatorname*{argmin}_{g \in D_n} \ \big\| \beta \hat{f}_n - g \big\|_{2} , \]

where D_n corresponds to the convex hull of \(\{ K\left((\cdot - X_i)/h\right) \}_{i=1}^{n}\) and β is a hyperparameter that controls the robustness. The minimization is shown to be equivalent to a quadratic program over the simplex, solved via projected gradient descent.

Metrics. The performance of MoM-KDE is measured through three metrics. Two are used to measure the similarity between the estimated and the true density; one describes the performance of an anomaly detector based on the estimated density. The first one is the Kullback-Leibler divergence (Kullback and Leibler, 1951), which is the most used in robust KDE (Kim and Scott, 2008; 2011; 2012; Vandermeulen and Scott, 2014). Used to measure the similarity between distributions, it is defined as

\[ D_{\mathrm{KL}}(\hat{f} \,\|\, f) = \int \hat{f}(x) \log\!\left( \frac{\hat{f}(x)}{f(x)} \right) dx . \]

As the Kullback-Leibler divergence is non-symmetric and may take infinite values when the distributions do not share the same support, we also consider the Jensen-Shannon divergence (Endres and Schindelin, 2003; Liese and Vajda, 2006). It is a symmetrized version of D_KL, with positive values, bounded by 1 (when the logarithm is used in base 2), and has found applications in many fields, such as deep learning (Goodfellow et al., 2014) or transfer learning (Segev et al., 2017). It is defined as

\[ D_{\mathrm{JS}}(\hat{f} \,\|\, f) = \frac{1}{2} \left[ D_{\mathrm{KL}}(\hat{f} \,\|\, g) + D_{\mathrm{KL}}(f \,\|\, g) \right], \quad \text{with } g = \tfrac{1}{2}(\hat{f} + f) . \]

Motivated by real-world applications, the third metric is not related to the true density, which is usually not available in practical cases. Instead, we quantify the capacity of the learnt density to detect anomalies using the well-known Area Under the ROC Curve criterion (AUC) (Bradley, 1997). An input point x_0 is considered abnormal whenever \(\hat{f}(x_0)\) is below a given threshold.
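For completeness, here is a minimal grid-based sketch of the two divergences above, under the assumption (ours) that both densities have already been evaluated and renormalized on a common regular grid with spacing dx.

```python
import numpy as np

def kl_divergence(p, q, dx, eps=1e-12):
    """Discretized D_KL(p || q) on a common grid with spacing dx (natural log)."""
    p, q = np.maximum(p, eps), np.maximum(q, eps)
    return float(np.sum(p * np.log(p / q)) * dx)

def js_divergence(p, q, dx):
    """Discretized Jensen-Shannon divergence; base-2 logs make it bounded by 1."""
    g = 0.5 * (p + q)
    return 0.5 * (kl_divergence(p, g, dx) + kl_divergence(q, g, dx)) / np.log(2)
```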
Figure 3. Density estimation with synthetic data: (a) Uniform, (b) Regular Gaussian, (c) Thin Gaussian, (d) Adversarial Thin Gaussian. The displayed metric is the Jensen-Shannon divergence. A lower score means a better estimation of the true density.

Hyperparameters. All estimators are built using the Gaussian kernel. The number of blocks S in MoM-KDE is selected on a regular grid of 20 values between 1 and 2|O| + 1 in order to obtain the lowest D_JS. Recall that, from a theoretical point of view, we showed that if S ≥ 2|O| + 1, i.e. at least half of the blocks do not contain anomalies, consistency results and rates can be obtained (Propositions 1 and 2). However, if the proportion of anomalies gets large, S ≥ 2|O| + 1 implies a small number of samples in each block, which may lead to bad estimates. Thus, it seems that S ≥ 2|O| + 1 is not a necessary condition to obtain good results. In practice, this suggests that S could be lower than 2|O| + 1, which is why we use this grid-search strategy. The bandwidth h is chosen for KDE via the pseudo-likelihood cross-validation method (Turlach, 1993), and used for all estimators. The construction of RKDE follows exactly the indications of its authors (Kim and Scott, 2012), and ρ(·) is taken to be the Hampel function, as it empirically showed to be the most robust. For SPKDE, the true ratio of anomalies is given as an input parameter.

4.1. Results on synthetic data

To evaluate the efficiency of MoM-KDE against KDE and its robust competitors, we set up several outlier situations. In all these situations, we draw N = 1000 inliers from an equally weighted mixture of two normal distributions N(µ1, σ1) and N(µ2, σ2) with µ1 = 0, µ2 = 6, and σ1 = σ2 = 0.5. The outliers, however, are sampled through various schemes:

(a) Uniform. A uniform distribution U([µ1 − 3, µ2 + 3]), which is the classical setting used for outlier simulation.

(b) Regular Gaussian. A similar-variance normal distribution N(3, 0.5) located between the two inlier clusters.

(c) Thin Gaussian. A low-variance normal distribution N(3, 0.01) located between the two inlier clusters.

(d) Adversarial Thin Gaussian. A low-variance normal distribution N(0, 0.01) located on one of the inlier Gaussian modes. This scenario can be seen as adversarial, as an ill-intentioned agent may hide wrong points in regions of high density. It is the most challenging setting for standard robust estimators, as they are in general robust to outliers located outside the support of the density we wish to estimate.

For all situations, we consider several ratios of contamination and set the number of outliers |O| in order to obtain a ratio |O|/n ranging from 0.05 to 0.5 with 0.05-wide steps. Finally, to evaluate the pertinence of our results, for each set of parameters, data are generated 10 times. We display in Figure 3 the results over synthetic data using the D_JS score. The average scores and standard deviations over the 10 experiments are represented for each outlier scheme and ratio. Overall, the results show the good performance of MoM-KDE in all the considered situations. Furthermore, they highlight the dependency of the two competitors on the type of outliers. Indeed, as SPKDE is designed to handle uniformly distributed outliers, the algorithm struggles when confronted with differently distributed outliers (see Figure 3 b, c, d). RKDE performs generally better, but fails against adversarial contamination, which may be explained by its tendency to down-weight points located in low-density regions, which in this particular case correspond to the inliers. Results for D_KL and AUC are reported in Appendix C. Generally, they show similar results, and the same conclusions on the good performance of MoM-KDE can be made. In Appendix C, we also add comparisons with methods that first proceed to an anomaly detection step before fitting a classical KDE. However, their results are not as good as those of the robust estimators.

4.2. Results on real data

Experiments are also conducted over six classification datasets: Banana, German, Titanic, Breast-cancer, Iris, and Digits. They contain respectively n = 5300, 1000, 2201, 569, 150, and 1797 data points having d = 2, 20, 3, 30, 4, and 64 input dimensions. They are all publicly available, either from open repositories at http://www.raetschlab.org/Members/raetsch/benchmark/ (for the first three) or directly from the Scikit-learn package (for the last three) (Pedregosa et al., 2011). We follow the approach of Kim and Scott (2012) that consists in setting the class labeled 0 as outliers and the rest as inliers.

Figure 4. Anomaly detection with real datasets, measured with AUC over varying outlier proportion: (a) Banana, (b) German, (c) Titanic, (d) Breast cancer, (e) Iris, (f) Digits (O: 0, I: all), (g) Digits (O: 1, I: all), (h) Digits (O: 1, I: 0). A higher score means a better detection of the outliers. For Digits, we specify which classes are chosen to be inliers (I) and outliers (O).
To artificially control the outlier proportion, we randomly downsample the abnormal class to reach a ratio |O|/n ranging from 0.05 to 0.5 with 0.05-wide steps. When a dataset does not contain enough outliers to reach a given ratio, we similarly downsample the inliers. For each dataset and each ratio, the experiments are performed 50 times, the random downsampling resulting in different learning datasets. The empirical performance is evaluated through the capacity of each estimator to detect anomalies, which we measure with the AUC. Results are displayed in Figure 4. With the Digits dataset, we also explore additional scenarios with changing inlier and outlier classes (specified in the figure captions). Overall, results are in line with performances observed over synthetic experiments, achieving good results in comparison to its competitors. Note that even in the highest dimensional scenarios, i.e. Digits and Breast cancer (d = 64 and d = 30), Mo M-KDE still behaves well, outperforming its competitors. This behavior for high-dimensionality problems was also notice for robust KDE-based methods in a recent anomaly detection review (Domingues et al., 2018). Additional results are reported in the the Section C of the Appendix. 5. Conclusion The present paper introduced Mo M-KDE, a new efficient way to perform robust kernel density estimation. The method has been shown to be consistent in both L and L1 error-norm in presence of very generic outliers. Mo M-KDE achieved good empirical results in various situations while having a lower computational complexity than its competitors. While the present work uses the coordinate-wise median to construct its robust estimator, it might be interesting to investigate the use of other generalizations of the median in high dimension, e.g. the geometric median. Another possible extension of the proposed method is to consider a bootstrap version where several random splits are performed and thus several Mo M-KDE are aggregated. In that case, one can expect the final output to be smoother and the bound to be less dependent on the number of blocks S. On the other hand, with such approach the complexity of the method increases. First empirical visualizations of this method can be found in the Appendix. Acknowledgments Part of this research was funded by the Id AML chair of Centre Borelli from ENS Paris Saclay. Robust Kernel Density Estimation with Median-of-Means principle Alon, N., Matias, Y., and Szegedy, M. (1999). The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137 147. Andrews, D. W. (1986). Stability comparison of estimators. Econometrica: Journal of the Econometric Society, pages 1207 1235. Backurs, A., Indyk, P., and Wagner, T. (2019). Space and time efficient kernel density estimation in high dimensions. In Advances in Neural Information Processing Systems, pages 15773 15782. Blum, M., Floyd, R. W., Pratt, V. R., Rivest, R. L., and Tarjan, R. E. (1973). Time bounds for selection. Journal of Computer and System Sciences, 7:448 461. Boucheron, S. and Thomas, M. (2012). Concentration inequalities for order statistics. Electronic Communications in Probability, 17:1 12. Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145 1159. Debruyne, M., Christmann, A., Hubert, M., and Suykens, J. A. (2008). Robustness and stability of reweighted kernel based regression. Technical report. Depersin, J. and Lecué, G. (2021). 
Optimal robust mean and location estimation via convex programs with respect to any pseudo-norms. ar Xiv preprint ar Xiv:2102.00995. Devroye, L. and Gyorfi, L. (1985). Nonparametric density estimation: the L1 view. New York: John Wiley & Sons. Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R. I., et al. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695 2725. Devroye, L. and Lugosi, G. (2012). Combinatorial methods in density estimation. Springer Science & Business Media. Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., and Stewart, A. (2019). Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742 864. Domingues, R., Filippone, M., Michiardi, P., and Zouaoui, J. (2018). A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognition, 74:406 421. Endres, D. M. and Schindelin, J. E. (2003). A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858 1860. Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l Institut Henri Poincare (B) Probability and Statistics, volume 38, pages 907 921. Elsevier. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672 2680. Gray, A. G. and Moore, A. W. (2003a). Nonparametric density estimation: toward computational tractability. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 203 211. SIAM. Gray, A. G. and Moore, A. W. (2003b). Rapid evaluation of multiple density models. In International Conference on Artificial Intelligence and Statistics. Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383 393. Huber, P. J. (1992). Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492 518. Springer. Jerrum, M. R., Valiant, L. G., and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169 188. Jiang, H. (2017). Uniform convergence rates for kernel density estimation. In Proceedings of the 34th International Conference on Machine Learning, pages 1694 1703. JMLR. org. Kim, J. and Scott, C. (2008). Robust kernel density estimation. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3381 3384. IEEE. Kim, J. and Scott, C. (2011). On the robustness of kernel density M-estimators. In Proceedings of the 28th International Conference on Machine Learning, pages 697 704. Citeseer. Kim, J. and Scott, C. D. (2012). Robust kernel density estimation. The Journal of Machine Learning Research, 13(Sep):2529 2565. Kim, J., Shin, J., Rinaldo, A., and Wasserman, L. (2019). Uniform convergence of the kernel density estimator adaptive to intrinsic volume dimension. In ICML 201936th International Conference on Machine Learning, volume 97. Robust Kernel Density Estimation with Median-of-Means principle Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79 86. Laforgue, P., Staerman, G., and Clémençon, S. (2020). How robust is the median-of-means? concentration bounds in presence of outliers. ar Xiv preprint ar Xiv:2006.05240. Lecué, G. and Lerasle, M. (2019). 
Learning from MOM s principles: Le Cam s approach. Stochastic Processes and their applications, 129(11):4385 4410. Lecué, G., Lerasle, M., et al. (2020a). Robust machine learning by median-of-means: theory and practice. The Annals of Statistics, 48(2):906 931. Lecué, G., Lerasle, M., and Mathieu, T. (2020b). Robust classification via MOM minimization. Machine Learning. Lerasle, M. (2019). Lecture notes: selected topics on robust statistical learning theory. ar Xiv preprint ar Xiv:1908.10761. Lerasle, M., Szabó, Z., Mathieu, T., and Lecué, G. (2019). MONK outlier-robust mean embedding estimation by median-of-means. In International Conference on Machine Learning, pages 3782 3793. PMLR. Liese, F. and Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394 4412. Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). Isolation forest. In 2008 eighth ieee international conference on data mining, pages 413 422. IEEE. Liu, H. and Gao, C. (2019). Density estimation with contamination: minimax rates and theory of adaptation. Electronic Journal of Statistics, 13(2):3613 3653. Minsker, S. et al. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308 2335. Nemirovsky, A. S. and Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. Wiley Interscience, New-York. Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065 1076. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. The Journal of Machine Learning Research, 12:2825 2830. Rigollet, P. and Vert, R. (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154 1178. Scott, D. W. (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons. Segev, N., Harel, M., Mannor, S., Crammer, K., and El Yaniv, R. (2017). Learn on source, refine on target: a model transfer learning framework with random forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1811 1824. Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26. CRC press. Sriperumbudur, B. and Steinwart, I. (2012). Consistency and rates for clustering with DBSCAN. In International Conference on Artificial Intelligence and Statistics, pages 1090 1098. Tsybakov, A. B. (2008). Introduction to nonparametric estimation. Springer Science & Business Media. Turlach, B. A. (1993). Bandwidth selection in kernel density estimation: a review. In CORE and Institut de Statistique. Citeseer. Vandermeulen, R. and Scott, C. (2013). Consistency of robust kernel density estimators. In Conference on Learning Theory, pages 568 591. Vandermeulen, R. A. and Scott, C. (2014). Robust kernel density estimation by scaling and projection in Hilbert space. In Advances in Neural Information Processing Systems, pages 433 441. Wang, D., Lu, X., and Rinaldo, A. (2019). DBSCAN: optimal rates for density-based cluster estimation. The Journal of Machine Learning Research, 20(170):1 50. Wang, Z. and Scott, D. W. (2019). Nonparametric density estimation for high-dimensional data-algorithms and applications. Wiley Interdisciplinary Reviews: Computational Statistics, 11(4):e1461. Weinzierl, S. 
(2000). Introduction to monte carlo methods. ar Xiv preprint hep-ph/0006269. Robust Kernel Density Estimation with Median-of-Means principle A. Additional comments We begin this section with a simple example highlighting the robustness of Mo M-KDE compared with KDE (see also Figure 1 for a different visual example). Example 1. (Mo M-KDE v.s. Uniform KDE) Let the inliers be i.i.d. samples from a uniform distribution on the interval [ 1, 1] and the outliers be i.i.d. samples from another uniform distribution on [ 3, 3]. Let the kernel function be the uniform kernel, x0 = 2 and h (0, 1). Then if S > 2|O|, we obtain | ˆf Mo M(x0) f(x0)| = 0 a.s. and P | ˆfn(x0) f(x0)| = 0 = (1 h/3)|O| = 1 . Proof. Since the inliers are uniform on [ 1, 1], when x0 = 2, we have f(x0) = 0. Recall that the classical KDE at x0 = 2 with the uniform kernel and h (0, 1) is: ˆfn(x0) = 1 2nh i=1 I(x0 h Xi x0 + h) . The support of the inliers is [ 1, 1], hence, i I, I(x0 h Xi x0 + h) = 0 a.s. and without outliers we would have ˆfn(x0) = 0, resulting in no errors at x0 = 2. To prove the first equality, it suffices to observe that S > 2|O| implies that more than half the blocks do not contain outliers. Over each one of these block, the KDE at x0 is thus equal to 0 a.s. making the Mo M estimate equal to 0 as well. Finally it results in | ˆf Mo M(x0) f(x0)| = ˆf Mo M(x0) = 0, almost surely. Let us prove the second equality: P | ˆfn(x0) f(x0)| = 0 = P ˆfn(x0) = 0 i=1 I(x0 h Xi x0 + h) = 0 i O I(x0 h Xi x0 + h) = 0 = P i O, Xi / [x0 h, x0 + h] = (1 h/3)|O| . This result shows that the Mo M-KDE makes (almost surely) no error at the point x0. On the contrary, the KDE here has a non-negligible probability to make an error. We now make three comments on the behavior of the Mo M-KDE. Optimality. Characterizing the optimality i.e. obtaining lower bound/minimax rates, for our framework is a difficult task. This is why, at the moment, we characterize the optimality by comparing our rate with the one of the standard KDE, known to be optimal without outliers (a strategy already used in Lecué et al. (2020a) for example). Note that Liu and Gao (2019) study minimax rates of the KDE. Nevertheless, their framework is different. Indeed, they consider the Huber model and study the optimality under L2-convergence. Actually, to be consistent with Liu and Gao (2019), we have also been interested in finding error rates with L2-convergence. However, the median operator makes the analysis a lot more complicated, as it would require well-suited concentration inequalities for order statistics in the manner of Boucheron and Thomas (2012) but for sub-gaussian random variables. Robust Kernel Density Estimation with Median-of-Means principle 4 2 0 2 4 6 8 10 12 14 0.00 0.25 Mo M-KDE KDE outliers 4 2 0 2 4 6 8 10 12 14 0.00 0.25 Mo M-KDE + Bootstrap KDE outliers Figure 5. Illustration of Mo M-KDE + Bootstrap. 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) Figure 6. Influence function for the 10-randomized version of Mo M-KDE with S = 2m + 1 and KDE where the m outliers are placed at x = 12. The true density is 2/3 N(0, 1) + 1/3 N(5, 1). Mo M is dependent from the partition. In Figure 2, we see that the Mo M-IF is very erratic. This is essentially due to the partitioning. We actually have several ideas to mitigate this behavior. One natural idea is to perform several partitions/estimations and then aggregate them. 
However this would come at the price of additional computations and the theoretical analysis would become quickly intractable. Randomization to reduce the variance. During the process of the Mo M-KDE, the split of the dataset is arbitrary. One natural idea, is to consider different splits in order to learn a single Mo M-KDE per split, and then average them. The idea being to learn a smoother density estimator. We give an illustration of this procedure in Figure 5. We clearly see that Mo M-KDE together with Bootstrap is smoother than just one Mo M-KDE. The impact of this strategy is also illustrated in Figures 6 regarding the IF. B. Technical proofs Lemma 1 (L error-bound of the KDE without outliers - (Wang et al., 2019)) Suppose that f belongs to the class of densities P(α, L) defined as P(α, L) f | f 0, Z f(x)dx = 1, and f Σ(α, L) , where Σ(α, L) is the Hölder class of function on Rd. Grant assumptions 1 to 3 and let h (0, 1), γ > 0, n large enough, and nhd 1. Then with probability at least 1 exp( γ), we have γ + log(1/h) nhd + C2hα , where C2 = L Z u αK(u)du < and C1 is a constant that only depends on f , the dimension d, and the kernel properties. Robust Kernel Density Estimation with Median-of-Means principle Proposition 1 (L error-bound of the Mo M-KDE under the O I) Suppose that f belongs to the class of densities P(α, L) and grant assumptions 1 to 3. Let S be the number of blocks such that S 2|O| + 1. Then, for any h (0, 1), γ > 0, n/S large enough, and nhd S, we have with probability at least 1 exp( γ), ˆf Mo M f C1 S(log(S) + γ + log(1/h)) nhd + C2hα , where C2 = L Z u αK(u)du < and C1 is a constant that only depends on f , the dimension d, and the kernel properties. Proof. From the definition of the Mo M-KDE, we have the following implication (Lecué et al., 2020b) ˆf Mo M(x) f(x) ε = s=1 I ˆfns(x) f(x) > ε S/2 Thus to upper-bound the probability of the left-hand event, it suffices to upper-bound the probability of the right-hand event. Moreover, we have ˆfns(x) f(x) sup x ˆfns(x) f(x) = I ˆfns(x) f(x) > ε I sup x ˆfns(x) f(x) > ε s=1 I ˆfns(x) f(x) > ε s=1 I sup x ˆfns(x) f(x) > ε s=1 I ˆfns(x) f(x) > ε s=1 I sup x ˆfns(x) f(x) > ε , which implies that s=1 I ˆfns(x) f(x) > ε S/2 s=1 I sup x ˆfns(x) f(x) > ε S/2 Let Zs = I supx ˆfns(x) f(x) > ε and let S = s {1, , S} | Bs O = i.e. the set of indices s such that the block Bs does not contain any outliers. Since X s SC I( ) is bounded by |O|, almost surely, the following holds. s=1 I sup x ˆfns(x) f(x) > ε = s S Zs + |O| . (6) Note that S is never empty thanks to the hypothesis S 2|O| + 1. For n large enough e.g. (n/S)hd > γ and (n/S)hd > | log(γ)| (Sriperumbudur and Steinwart (2012) Thm 3.1), Lemma 1 with ε = C1 S(γ + log(1/h)) nhd + C2hα, leads to pε := P sup x ˆfn1(x) f(x) > ε e γ . Combining this last inequality with equation (6) gives Robust Kernel Density Estimation with Median-of-Means principle s=1 I sup x ˆfns(x) f(x) > ε S/2 s S Zs + |O| S/2 s S Zs S/2 |O| 1 (1 pε)|S| 1 (1 pε)S 1 (1 e γ)S . Finally, using the fact that 1 y e x (1 e x)y, we obtain ˆf Mo M f C1 S(γ + log(1/h)) nhd + C2hα ! Replacing γ by log(S) + γ > 0 gives ˆf Mo M f C1 S(log(S) + γ + log(1/h)) nhd + C2hα ! Note that we can easily extend this proof to the adversarial framework (Depersin and Lecué, 2021). Indeed, suppose Z1, , ZS defined above equation (6) are corrupted and Z 1, , Z S are the corresponding uncorrupted version of them. 
We have that Z1, , ZS are potentially not independent but Z 1, , Z S are independent and Zi = Z i for any i S (this is the adversarial framework). Then, s=1 (Zs Z s) s=1 Z s + |O| . The right-hand side is now a sum of i.i.d. bounded random variables and restarting from equation (6) we can conclude. Corollary 1 (Rate of convergence) Consider the assumptions of Proposition 1 with S = 2|O| + 1, γ = log(n) and 1/(2α+d) . Thus, with probability higher than 1 1 ˆf Mo M f |O| n log(n) α/(2α+d) + log(n) Proof. From the previous proposition we know that with probabability 1 1/n, we have ˆf Mo M f C1 S(log(S) + log(n) + log(1/h)) nhd + C2hα . Robust Kernel Density Estimation with Median-of-Means principle Replacing h by S log(n) 1/(2α+d) in this last equation gives us that ˆf Mo M f S log(n) α/(2α+d) |O| n log(n) α/(2α+d) + log(n) Proposition 2. (L1-consistency in probability) If n/S , h 0, nhd , S/ nhd 0, and S 2|O| + 1, then ˆf Mo M f 1 P n 0 . Proof. We first rewrite the Mo M-KDE as ˆf Mo M(x) = s=1 ˆfns(x)IAs(x) , where As = n x | ˆf Mo M(x) = ˆfns(x) o . Without loss of generality, we assume that Ak S s =ℓAℓ= , S s=1As = Rd, and s=1 IAs(x) = 1 . Z ˆf Mo M(x) f(x) dx = Z s=1 ˆfns(x)IAs(x) f(x) ˆfns(x) f(x) IAs(x) ˆfns(x) f(x) IAs(x)dx ˆfns(x) f(x) dx ˆfns(x) f(x) dx + X ˆfns(x) f(x) dx . (7) From the L1-consistency of the KDE in probability, if the number of anomalies grows at a small enough speed (Devroye and Gyorfi, 1985), the left part is bounded, i.e. ˆfns(x) f(x) dx X Z ˆfns(x) f(x) dx P n 0 . (8) To be more specific, since Rn = nhd, the rate given in Thm 5.1 of Devroye and Gyorfi(1985), the convergence is guarantee whenever S/Rn tends to 0 e.g. with S = O(Rτ n) and any τ (0, 1). We now upper-bound the right part of equation (7). Let consider a particular block As where s SC. In this block, the estimator fns is selected and is calculated with samples containing anomalies. As x As, fns(x) is the median (by definition), if S > 2|O|, we can always find a s S such that fns(x) fns (x) or fns(x) fns (x). Now let denote by A+ s = n x As | ˆfns(x) f(x) o and A s = n x As | ˆfns(x) < f(x) o . We have A+ s A s = As and each one of these blocks can be decomposed respectively into |S| sub-blocks (not necessarily disjoint) {As ,+ s }s S and {As , s }s S such that s S, Robust Kernel Density Estimation with Median-of-Means principle As ,+ s = n x As | ˆfns (x) ˆfns(x) f(x) o and As , s = n x As | ˆfns (x) ˆfns(x) < f(x) o . Finally, the right-hand term of equation (7) can be upper-bounded by ˆfns(x) f(x) dx X ˆfns(x) f(x) dx + Z ˆfns(x) f(x) dx ˆfns(x) f(x) dx + Z ˆfns(x) f(x) dx ˆfns (x) f(x) dx + Z ˆfns (x) f(x) dx Z ˆfns (x) f(x) dx + Z ˆfns (x) f(x) dx Since s S we have Z | ˆfns (x) f(x)|dx P n 0, we can conclude using similar arguments as those used for (8) that X As | ˆfns(x) f(x)|dx P n 0, which concludes the proof. Proposition 3. Let x , x0 Rd and In be an healthy data set. Grant assumptions 1 to 3 and denote i I K Xi x0 Let S 2m + 1 with m J0, n 2 J the number of added samples and δ > 0 such that |b a/n| > Cρ p 2δS/n, then with probability higher than 1 4 exp( δ) we have: IFO I(x0, x , m; In, ˆf Mo M) IFO I(x0, x , m; In, ˆf KDE) . 
The IFO I for the KDE is IFO I(x0, x , m, In; ˆf KDE) = 1 (n + m)hd i=1 K Xi x0 i=1 K Xi x0 i=1 K Xi x0 i=1 K Xi x0 + m n + m K x x0 i=1 K Xi x0 + m n + m K x x0 a + m n + m nb a n2/m + n i=1 K Xi x0 , and b K x x0 IFO I for the Mo M-KDE Robust Kernel Density Estimation with Median-of-Means principle Let S > 2m be the number of blocks in the Mo M-KDE, {Bs}S s=1 and { Bs }S s =1 be respectively the blocks of the contaminated data set In {x }m and the healthy data set. We have: IFO I(x0, x , m, In; ˆf Mo M) = 1 i In Bs K Xi x0 i {x } Bs K x x0 i Bs K Xi x0 i Bs K Xi x0 i Bs K Xi x0 where Bs is the block selected by the median for the healthy Mo M-KDE. The inequality is obtained by noticing that, with S > 2m, there always exists an healthy block Bs that makes it true. Finally, denoting S n + m K Xi x0 S n K Xi x0 = Z(s) Z(s ) , and using both the triangular and Hoeffding inequalities, and the fact that E(Z(s)) = E(Z(s )), we have hd IFO I(x0, x , m, In; ˆf Mo M) Z(s) Z(s ) E(Z(s)) + E(Z(s )) Z(s) E(Z(s)) + Z(s ) E(Z(s )) , and for t > 0, P Z(s) E(Z(s)) t 2 exp 2t2(n + m)) P Z(s ) E(Z(s )) t 2 exp 2t2n = 2 exp 2nst2 where ns = n/S. Furthermore, given two real-valued random variables X, Y , we know that for t > 0, P (|X| + |Y | 2t) P (|X| t) + P (|Y | t) . Therefore, we have P Z(s) E(Z(s)) + Z(s ) E(Z(s )) t 4 exp nst2 Setting t = Cρ 2δ ns with δ > 0, finally gives P Z(s) Z(s ) < t 1 4 exp( δ). We now know that with probability 1 4 exp( δ), IFO I(x0, x , m, In; ˆf Mo M) < 1 2δ ns and we seek for which value of m, this value is smaller than Robust Kernel Density Estimation with Median-of-Means principle the IFO I of the KDE i.e. 2δ ns nb a n2/m + n n2/m + n ns|nb a| m n2 ns|nb a| m n2 n|nb a| m n n|b a/n| C. Additional results As stated in the main paper, we display here the additional results containing: For synthetic data, the Kullback-Leibler divergence in both directions, i.e. DKL( ˆf, f) and DKL(f, ˆf), and the ROC AUC measuring the performance of an anomaly detector based on ˆf. Results are displayed on Figure 7. For synthetic data, a benchmark of Outlier Detection (OD) + KDE. Instead of applying directly a robust estimator, we first proceed to an outlier detection step after which we fit a standard KDE. Results are displayed on Figure 8. For Digits data, the ROC AUC measuring the performance of an anomaly detector based on ˆf. As stated in the main paper, this is done under multiple scenarios, where outliers and inliers can be chosen among the nine available classes. Here we show the AUC when the outliers are set as one class (class 2 to class 9), and inliers are set as the rest of all classes. Results are displayed on Figure 9. Additional visualizations highlighting the behavior of the Influence Function of KDE and Mo M-KDE (Figures 10 KL, ROC and AUC for Synthetic data. When considering the Kullback-Leibler divergence, results lead to a very similar conclusion as previously stated, that is, an overall good performance of Mo M-KDE while its competitors, notably SPKDE, are more data-dependent. When the density estimate ˆf is used in a simple anomaly detector, results are quite different. Indeed, when outliers are uniformly distributed, even if Mo M-KDE seems to better estimate the true density (according to DJS and DKL), this doesn t make ˆf Mo M a better anomaly detector. 
It seems that in this case, the outliers are either easily detected because distant from the density estimate, or located in dense regions, thus making them impossible to identify, and this for all density estimates provided by competitors. In the case of adversarial contamination, the conclusion is quite similar. Although Mo M-KDE better fits the true density, the situation is extremely difficult for anomaly detection, hence making all competitors yield very poor results. In the two other cases Gaussian outlier, anomaly detection results follow the density estimation. Benchmark of Outlier Detection (OD) + KDE: We proceed to the OD step using two different methods, the KDE and the Isolation Forest (IF) (Liu et al., 2008). For the first method, we fit a KDE estimate using all data and we erase the proportion ε = |O|/n of samples that have the lower density. For the IF, we proceed similarly by removing a proportion ε of outliers. Finally, we fit a KDE estimate on the new samples. The results are given in Figure 8 and show that OD + KDE (purple curve) or IF + KDE (brown) is not as efficient as RKDE or Mo M-KDE. Robust Kernel Density Estimation with Median-of-Means principle DKL( ˆf, f) (a) Uniform (b) Regular Gaussian (c) Thin Gaussian (d) Advsersarial Thin Gaussian Figure 7. Density estimation with synthetic data. The displayed metrics are the Kullback-Leibler divergence (a lower score means a better estimation of the true density) and the AUC (a higher score means a better detection of the outliers). Digits data. Results over Digits scenarios are inline with main conclusions over real data. Although from one scenario to another, all methods have varied results, the overall observation is that Mo M-KDE is either similar or better than its competitors. Influence Function. Illustrations of the behavior of the Influence Function (IF) for Mo M-KDE and KDE are given in Figures 10 and 11. They show that the IF for Mo M-KDE is lower than the one of KDE when the m outlying points are all placed at a low density point x (see Figure 10). On the other hand, Figure 11 shows that whenever the outlying points are all placed at a high density point x , the two IF exhibit similar behavior. Robust Kernel Density Estimation with Median-of-Means principle (a) Uniform (b) Regular Gaussian (c) Thin Gaussian (d) Advsersarial Thin Gaussian Figure 8. Density estimation with synthetic data. The displayed metric is the Jensen-Shannon divergence. A lower score means a better estimation of the true density. (a) O: 2, I: all (b) O: 3, I: all (c) O: 4, I: all (d) O: 5, I: all (e) O: 6, I: all (f) O: 7, I: all (g) O: 8, I: all (h) O: 9, I: all Figure 9. Anomaly detection with Digits data, measured with AUC over varying outlier proportion. A higher score means a better detection of the outliers. We specify which classes are chosen to be inliers (I) and outliers (O). Robust Kernel Density Estimation with Median-of-Means principle 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) Figure 10. Influence function for the 10-randomized version of Mo M-KDE with S = 2m + 1 and KDE where the m outliers are placed at x = 12. The true density is 2/3 N(0, 1) + 1/3 N(5, 1). 
Figure 11. Influence function for MoM-KDE with S = 2m + 1 and KDE, where the m = 100 outliers are placed at x*, and where the samples in I_n are drawn from the true density: 2/3 N(0, 1) + 1/3 N(5, 1).