Robust Kernel Density Estimation with Median-of-Means principle

Pierre Humbert*1, Batiste Le Bars*2, Ludovic Minvielle*3

*Authors in alphabetical order. 1Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d'Orsay, 91405, Orsay, France. 2Université Lille, CNRS, Inria, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille. 3Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, F-91190, Gif-sur-Yvette, France. Correspondence to: Batiste Le Bars, Pierre Humbert.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

In this paper, we introduce a robust nonparametric density estimator combining the popular Kernel Density Estimation method and the Median-of-Means principle (MoM-KDE). This estimator is shown to achieve robustness for a large class of anomalous data, potentially adversarial. While previous works only prove consistency results under very specific contamination models, this work provides finite-sample high-probability error bounds without any prior knowledge on the outliers. To highlight the robustness of our method, we introduce an influence function adapted to the considered O∪I framework. Finally, we show that MoM-KDE achieves competitive results when compared with other robust kernel estimators, while having a significantly lower computational complexity.

1. Introduction

Over the past years, the task of learning in the presence of outliers has become an increasingly important objective in both statistics and machine learning. Indeed, in many situations, training data can be contaminated by undesired samples, which may badly affect the resulting learning task, especially in adversarial settings. Therefore, building robust estimators and algorithms that are resilient to outliers is becoming crucial in many learning procedures. In particular, the inference of a probability density function from a contaminated random sample is of major concern. Density estimation methods are mostly divided into parametric and nonparametric techniques. Among the nonparametric family, the Kernel Density Estimator (KDE) is arguably the most known and used for both univariate and multivariate densities (Parzen, 1962; Silverman, 1986; Scott, 2015), but it is also known to be sensitive to datasets contaminated by outliers (Kim and Scott, 2011; 2012; Vandermeulen and Scott, 2014). The construction of a robust KDE is therefore an important area of research that can have useful applications, such as anomaly detection and resilience to adversarial data corruption. Yet, only a few works have proposed such a robust estimator. Kim and Scott (2012) proposed to combine KDE with ideas from M-estimation to construct the so-called Robust Kernel Density Estimator (RKDE). However, no consistency results were provided and robustness was rather shown experimentally. Later, RKDE was proven to converge to the true density, but only under the condition that the dataset remains uncorrupted (Vandermeulen and Scott, 2013). More recently, Vandermeulen and Scott (2014) proposed another robust estimator, called Scaled and Projected KDE (SPKDE). The authors proved the L1-consistency of SPKDE under a variant of Huber's ε-contamination model (Huber, 1992) where two strong assumptions are made. First, the contamination parameter ε is assumed to be known, and second, the outliers must be uniform over the support of the true density.
Unfortunately, as they did not provide rates of convergence, it still remains unclear at which speed SPKDE converges to the true density. Finally, both RKDE and SPKDE require iterative algorithms to compute their estimators, thus increasing the overall complexity of their construction. On the theoretical side, the recent work of Liu and Gao (2019) managed to obtain minimax optimal rates for kernel density estimates, but in the quite restrictive Huber's model (see Sec. 2.1).

In statistical analysis, another idea to construct robust estimators is to use the Median-of-Means principle (MoM). Introduced by Nemirovsky and Yudin (1983), Jerrum et al. (1986), and Alon et al. (1999), the MoM was first designed to estimate the mean of a real random variable. It relies on the simple idea that, rather than taking the average of all the observations, the sample is split into several non-overlapping blocks over which the mean is computed. The MoM estimator is then defined as the median of these means. Being easy to compute, the MoM has been studied by Minsker et al. (2015) and Devroye et al. (2016) to estimate the means of heavy-tailed distributions. Furthermore, due to its robustness to outliers, MoM-based estimators have recently gained a renewed interest in the machine learning community (Lecué et al., 2020b;a; Laforgue et al., 2020).

Contributions. In this paper, we propose a new robust nonparametric density estimator based on the combination of the Kernel Density Estimation method and the Median-of-Means principle (MoM-KDE). We place ourselves in a more general framework than the classical Huber contamination model, called O∪I, which gets rid of any assumption on the outliers. We demonstrate the statistical performance of the estimator through finite-sample high-confidence error bounds in the L∞-norm and show that MoM-KDE is consistent under the condition that the outlier proportion tends to 0, a necessary condition in robust kernel density estimation (Liu and Gao, 2019). Additionally, we prove the consistency in the L1-norm, which is known to reflect the global performance of the estimate (Devroye and Gyorfi, 1985). To the best of our knowledge, this is the first work that presents such results in the context of robust kernel density estimation under the O∪I framework. As a measure of robustness, we also introduce an influence function adapted to the O∪I framework. It allows us to find the number of outliers above which the MoM-KDE is less sensitive to outliers than the KDE. Finally, we demonstrate the empirical performance of MoM-KDE on both synthetic and real data and show the practical interest of such an estimator, as it has a lower computational complexity than the baseline RKDE and SPKDE.

2. Median-of-Means Kernel Density Estimation

We first recall the classical kernel density estimator. Let X_1, . . . , X_n be independent and identically distributed (i.i.d.) random variables that have a probability density function (pdf) f(·) with respect to the Lebesgue measure on R^d. The Kernel Density Estimate of f (KDE), also called the Parzen-Rosenblatt estimator, is a nonparametric estimator given by

\[ \hat{f}_n(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right), \qquad (1) \]

where h > 0 and K : R^d → R_+ is an integrable function satisfying \(\int K(u)\,du = 1\) (Tsybakov, 2008). Such a function K(·) is called a kernel and the parameter h is called the bandwidth of the estimator. The bandwidth is a smoothing parameter that controls the bias-variance tradeoff of \(\hat{f}_n(\cdot)\) with respect to the input data.
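To make Equation (1) concrete, here is a minimal NumPy sketch of the KDE evaluation; the Gaussian kernel and all function and variable names are our own illustrative choices, not taken from the authors' implementation.

```python
import numpy as np

def gaussian_kernel(u):
    # Multivariate Gaussian kernel; integrates to 1 over R^d.
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi) ** (d / 2)

def kde(x, X, h):
    """Standard KDE of Equation (1) evaluated at the query points x.

    x: (m, d) query points, X: (n, d) sample, h: bandwidth > 0.
    """
    u = (X[None, :, :] - x[:, None, :]) / h      # scaled differences, shape (m, n, d)
    return gaussian_kernel(u).mean(axis=1) / h ** X.shape[1]
```

Evaluating one query point costs O(n) kernel evaluations, which is the per-point cost referred to in Section 2.3.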
Figure 1. From left to right: true density and outliers from a uniform density, estimation with KDE, and with MoM-KDE.

While this estimator is central in statistics, a major drawback is its weakness against outliers (Kim and Scott, 2008; 2011; 2012). Indeed, as it assigns uniform weights 1/n to every kernel term K(·) regardless of whether X_i is an outlier or not, inliers and outliers contribute equally to the construction of the KDE, which results in undesired bumps over outlier locations in the final estimated density (see Figure 1). In the following, we propose a KDE-based density estimator robust to the presence of outliers in the sample set. These outliers are considered in a general framework described in the next section.

2.1. Outlier setup

Throughout the paper, we consider the O∪I framework introduced by Lecué and Lerasle (2019). This very general framework allows the presence of outliers in the dataset and relaxes the standard i.i.d. assumption on each observation. We therefore assume that the n random variables are partitioned into two (unknown) groups: a subset {X_i | i ∈ I} made of inliers, and another subset {X_i | i ∈ O} made of outliers, such that O ∩ I = ∅ and O ∪ I = {1, . . . , n}. While we suppose that the X_i, i ∈ I, are i.i.d. from a distribution that admits a density f with respect to the Lebesgue measure, no assumption is made on the outliers X_i, i ∈ O. Hence, these outlying points can be dependent or adversarial.

The O∪I framework is related to the well-known Huber's ε-contamination model (Huber, 1992), where it is assumed that data are i.i.d. with distribution g = (1 − ε) f_I + ε f_O, with ε ∈ [0, 1); the distribution f_I being related to the inliers and f_O to the outliers. However, there are several important differences. First, in the O∪I framework the proportion of outliers is fixed and equals |O|/n, whereas it is random in Huber's ε-contamination model (Lerasle, 2019). Second, the O∪I framework is less restrictive. Indeed, contrary to Huber's model, which considers that inliers and outliers are each i.i.d. from their respective distributions, the O∪I framework does not make any assumption about the outliers. This framework should not be confused with the ε-corruption model, where an adversary is allowed to directly modify the inliers set (Diakonikolas et al., 2019).

2.2. MoM-KDE

We now present our main contribution, a robust kernel density estimator based on the MoM. This estimator is essentially motivated by the fact that the classical kernel density estimation at one point corresponds to an empirical average (see Equation (1)). Therefore, the MoM principle appears to be an intuitive solution to build a robust version of the KDE. A formal definition of MoM-KDE is given below.

Definition 1. (MoM Kernel Density Estimator) Let 1 ≤ S ≤ n, and let B_1, . . . , B_S be a random partition of {1, . . . , n} into S non-overlapping blocks B_s of equal size n_s ≃ n/S. The MoM Kernel Density Estimator (MoM-KDE) of f at x_0 is given by

\[ \hat{f}_{\mathrm{MoM}}(x_0) \triangleq \mathrm{Median}\left( \hat{f}_{n_1}(x_0), \dots, \hat{f}_{n_S}(x_0) \right), \qquad (2) \]

where \(\hat{f}_{n_s}(x_0)\) is the value of the standard kernel density estimator at x_0 obtained via the samples of the s-th block B_s.

Note that \(\hat{f}_{\mathrm{MoM}}(\cdot)\) does not necessarily integrate to 1. However, as suggested by Devroye and Lugosi (2012) (Section 5.6), it can always be normalized by its integral. Broadly speaking, MoM estimators appear to be a good tradeoff between the unbiased but non-robust empirical mean and the robust but biased median (Lecué et al., 2020b). Furthermore, we remark that when S = 1, the standard KDE is recovered.
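As a concrete illustration of Definition 1, the following sketch builds MoM-KDE on top of the `kde` helper above; the partitioning scheme and the names are illustrative assumptions and not the authors' reference implementation (which is available at the GitHub URL given in Section 4).

```python
import numpy as np

def mom_kde(x, X, h, S, rng=None):
    """MoM-KDE of Definition 1 evaluated at the query points x.

    The sample X is split at random into S non-overlapping blocks of roughly
    equal size; the estimate at each query point is the median of the S
    block-wise KDE values (S = 1 recovers the standard KDE).
    """
    rng = np.random.default_rng(rng)
    blocks = np.array_split(rng.permutation(X.shape[0]), S)        # random partition of {1, ..., n}
    block_values = np.stack([kde(x, X[idx], h) for idx in blocks])  # shape (S, m)
    return np.median(block_values, axis=0)
```

As noted above, the output need not integrate to one; it can be renormalized by a numerical estimate of its integral (see Section 2.3).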
2.3. Time complexity

The complexity of MoM-KDE to evaluate one point is the same as for the standard KDE, O(n): O(S × n/S) for the blockwise evaluation and O(S) to compute the median with the median-of-medians algorithm (Blum et al., 1973). Since RKDE and SPKDE are KDEs with modified weights, they also perform the evaluation step in O(n) time. However, these weights need to be learnt, thus requiring an additional non-negligible computing capacity. Indeed, each of them relies on an iterative method, respectively the iteratively reweighted least squares algorithm and the projected gradient descent algorithm, both with a complexity of O(n_iter · n²), where n_iter is the number of iterations needed to reach a reasonable accuracy. MoM-KDE, on the other hand, does not require any learning procedure. Depending on the application at hand, MoM-KDE may require a normalization step. When using a Monte Carlo method, the complexity of such a normalization step at a given precision of 1/√(n_norm) is O(n_norm) (Weinzierl, 2000). Note that the evaluation step can be accelerated in several ways, hence potentially reducing the computational time of all these competing methods (Gray and Moore, 2003a;b; Wang and Scott, 2019; Backurs et al., 2019). Theoretical time complexities are gathered in Table 1.

Table 1. Computational complexity.

Method   | Learning         | Evaluation     | Iterative method
KDE      | --               | O(n)           | no
RKDE     | O(n_iter · n²)   | O(n)           | yes
SPKDE    | O(n_iter · n²)   | O(n)           | yes
MoM-KDE  | --               | O(n + n_norm)  | no
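The normalization step mentioned above can be sketched with a plain Monte Carlo estimate of the integral of the unnormalized estimator; the uniform sampling box below is an illustrative assumption (any region covering the effective support, or a quadrature rule in low dimension, would do), and the names are ours.

```python
import numpy as np

def mc_normalizing_constant(density, low, high, n_norm=100_000, rng=None):
    """Monte Carlo estimate of the integral of `density` over the box [low, high]^d.

    The standard error of the estimate decreases as O(1 / sqrt(n_norm)).
    """
    rng = np.random.default_rng(rng)
    low, high = np.atleast_1d(low).astype(float), np.atleast_1d(high).astype(float)
    samples = rng.uniform(low, high, size=(n_norm, low.shape[0]))
    volume = np.prod(high - low)
    return volume * density(samples).mean()

# Usage sketch: Z = mc_normalizing_constant(lambda x: mom_kde(x, X, h, S), low, high)
# and the normalized estimator is x -> mom_kde(x, X, h, S) / Z.
```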
3. Theoretical analysis

In this section, we give a finite-sample high-probability error bound in the L∞-norm for MoM-KDE under the O∪I framework. To our knowledge, this work is the first to provide such error bounds in robust kernel density estimation under this framework. In order to build this high-probability error bound, it is assumed, among other standard hypotheses, that the true density is Hölder-continuous, a smoothness property usually considered in KDE analysis (Tsybakov, 2008; Jiang, 2017; Wang et al., 2019). In addition, we show the consistency in the L1-norm. In this last result, we will see that the aforementioned assumptions are not necessary to obtain the consistency. In the following, we give the necessary definitions and assumptions to perform our non-asymptotic analysis.

3.1. Setup and assumptions

Let us first list the usual assumptions, notably on the considered kernel function, that will allow us to derive our results. They are standard in KDE analysis, and are chosen for their simplicity of comprehension (Tsybakov, 2008; Jiang, 2017). More general hypotheses could be made in order to obtain the same results, notably assuming kernels of order ℓ (see for example the works of Tsybakov (2008) and Wang et al. (2019)).

Assumption 1. (Bounded density) ‖f‖∞ < ∞.

We make the following assumptions on the kernel K, which are fairly standard.

Assumption 2. The kernel K : R^d → R is bounded by a positive constant C_ρ and integrates to 1.

Assumption 3. Let F be the class of functions of the form z ∈ R^d ↦ K(x − z). Then, F is a uniformly bounded VC class (Wang et al., 2019).

All the above assumptions are respected by most of the popular kernels, in particular the Gaussian, Exponential, Uniform, Triangular, Cosine kernels, etc. These are key properties to provide the bounds presented in the next section. Before stating our main results, we recall the definition of the Hölder class of functions.

Definition 2. (Hölder class) Let T be an interval of R^d, and let 0 < α ≤ 1 and L > 0 be two constants. We say that a function f : T → R belongs to the Hölder class Σ(L, α) if it satisfies

\[ \forall x, x' \in T, \quad |f(x) - f(x')| \leq L \, \|x - x'\|^{\alpha} . \qquad (3) \]

This definition implies a notion of smoothness on the function f, and is a convenient property to bound the bias of KDE-based estimators.

3.2. L∞ and L1 consistencies of MoM-KDE

This section states our central finding, an L∞ finite-sample error bound for MoM-KDE that proves its consistency under the O∪I framework.

Lemma 1. (L∞ error-bound of the KDE without outliers (Wang et al., 2019)) Suppose that f belongs to the class of densities P(α, L) defined as

\[ \mathcal{P}(\alpha, L) \triangleq \left\{ f \;\middle|\; f \geq 0, \ \int f(x)\,dx = 1, \ \text{and } f \in \Sigma(L, \alpha) \right\}, \qquad (4) \]

where Σ(L, α) is the Hölder class of functions on R^d (Definition 2). Grant Assumptions 1 to 3 and let h ∈ (0, 1), γ > 0, n large enough, and nh^d ≥ 1. Then with probability at least 1 − exp(−γ), we have

\[ \big\| \hat{f}_n - f \big\|_{\infty} \;\leq\; C_1 \sqrt{\frac{\gamma + \log(1/h)}{n h^d}} + C_2 h^{\alpha} , \qquad (5) \]

where \(C_2 = L \int \|u\|^{\alpha} K(u)\,du < \infty\) and C_1 is a constant that only depends on ‖f‖∞, the dimension d, and the kernel properties.

This lemma, which is verified several times in the literature (see e.g. Giné and Guillou (2002); Jiang (2017); Wang et al. (2019)), comes from the well-known bias-variance decomposition, where we separately bound the variance (see e.g. Wang et al. (2019); Kim et al. (2019)) and the bias (see e.g. Tsybakov (2008); Rigollet and Vert (2009)). It shows the consistency of the KDE without outliers, as soon as h → 0 and nh^d → ∞. We now present our main result. Its objective is to show that, even under the O∪I framework, we do not need any critical additional hypothesis besides the ones of the previous lemma to show that MoM-KDE is consistent.

Proposition 1. (L∞ error-bound of the MoM-KDE under the O∪I framework) Suppose that f belongs to the class of densities P(α, L) and grant Assumptions 1 to 3. Let S be the number of blocks, such that S ≥ 2|O| + 1. Then, for any h ∈ (0, 1), γ > 0, n/S large enough, and nh^d ≥ S, we have with probability at least 1 − exp(−γ),

\[ \big\| \hat{f}_{\mathrm{MoM}} - f \big\|_{\infty} \;\leq\; C_1 \sqrt{\frac{S\left(\log(S) + \gamma + \log(1/h)\right)}{n h^d}} + C_2 h^{\alpha} , \]

where \(C_2 = L \int \|u\|^{\alpha} K(u)\,du < \infty\), and C_1 is a constant that only depends on ‖f‖∞, the dimension d, and the kernel properties.

The proof of this proposition is given in the Appendix. In addition to S ≥ 2|O| + 1 in Proposition 1, the other conditions come from those in Lemma 1, which are necessary to obtain the rate in the uncorrupted scenario. The upper bound is minimized by taking S = 2|O| + 1. Note also that when |O| = 0 and S = 1, we exactly recover the result of Lemma 1, i.e. the rate for the KDE without outliers. This was expected since, for S = 1, the MoM-KDE is exactly the KDE.

Corollary 1. (Rate of convergence) Consider the assumptions of Proposition 1 with S = 2|O| + 1, γ = log(n), and let

\[ h = \left( \frac{S \log(n)}{n} \right)^{1/(2\alpha + d)} . \]

With probability higher than 1 − 1/n,

\[ \big\| \hat{f}_{\mathrm{MoM}} - f \big\|_{\infty} \;\lesssim\; \left( \frac{|O| \log(n)}{n} \right)^{\alpha/(2\alpha + d)} + \left( \frac{\log(n)}{n} \right)^{\alpha/(2\alpha + d)} . \]

Corollary 1, proved in the Appendix, shows that the rate is split into two terms. On the right, we have the classical minimax optimal rate of the KDE without outliers. On the left, we have a rate that depends on the outlier proportion |O|/n. It also shows that a sufficient condition to obtain consistency (error tending to 0) is to have n → ∞ (standard) and the proportion of outliers |O|/n tending to 0. This latter condition was also expected, since the minimax analysis done for the specific case of Huber's model already captured it as a necessary condition (Liu and Gao, 2019). Note further that, when there is no outlier, i.e. |O| = 0, we recover the standard KDE minimax optimal rate (Wang et al., 2019).
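To make the rate in Corollary 1 more tangible, the following display instantiates it for a Lipschitz density on the real line; the specific values α = 1 and d = 1 are an illustrative special case of the statement above, not an additional result.

```latex
% Corollary 1 with \alpha = 1 and d = 1 (Lipschitz density on the real line),
% S = 2|O| + 1 and h = (S \log n / n)^{1/3}:
\[
  \big\| \hat{f}_{\mathrm{MoM}} - f \big\|_{\infty}
  \;\lesssim\;
  \left( \frac{|O| \log n}{n} \right)^{1/3}
  + \left( \frac{\log n}{n} \right)^{1/3} ,
\]
% so that, for instance, a contamination level |O|/n of order n^{-1/2}
% still yields a vanishing sup-norm error of order (\log n / \sqrt{n})^{1/3}.
```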
Last but not least, our two main results illustrate that the convergence of the MoM-KDE only depends on the number of outliers in the dataset, and not on their "type". This estimator is therefore robust in a wide range of scenarios. In particular, the proofs of Proposition 1 and Corollary 1 are also valid under the adversarial scenario (Diakonikolas et al., 2019; Depersin and Lecué, 2021), which is a more complex case than the O∪I framework. For instance, this framework allows the inliers to be highly correlated because of the corruption step.

We now give an L1-consistency result under mild hypotheses, which is known to reflect the global performance of the estimate. Indeed, a small L1 error leads to accurate probability estimation (Devroye and Gyorfi, 1985).

Proposition 2. (L1-consistency in probability) If n/S → ∞, h → 0, nh^d → ∞, S/√(nh^d) → 0, and S ≥ 2|O| + 1, then

\[ \big\| \hat{f}_{\mathrm{MoM}} - f \big\|_{1} \;\xrightarrow[n \to \infty]{\mathbb{P}}\; 0 . \]

This result is obtained by bounding the left-hand part by the errors in the healthy blocks only, i.e. those without anomalies. Under the hypotheses of the proposition, these errors are known to converge to 0 in probability (Wang et al., 2019). The complete proof is given in the Appendix. Contrary to SPKDE (Vandermeulen and Scott, 2014), no assumption on the outlier generation process is necessary to obtain this consistency result. Moreover, while we need to assume that the proportion of outliers is perfectly known to prove the convergence of SPKDE, the MoM-KDE converges whenever the number of outliers is overestimated. Finally, note that again in Proposition 2, a condition to obtain consistency is that the fraction of outliers |O|/n tends to zero. As explained above, this assumption is natural since the fraction of outliers tending to zero has been shown to be a necessary condition for consistency in the Huber framework (Liu and Gao, 2019).
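A small numerical illustration of Propositions 1 and 2 can be sketched as follows, reusing the `kde` and `mom_kde` helpers above: a Gaussian sample is contaminated by |O| adversarial points placed far in the tail, the number of blocks is set to S = 2|O| + 1, and both estimators are compared at the outlier location. The density, bandwidth, and location choices below are our own illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def error_at_outlier_location(n=2000, n_out=100, x_star=10.0, seed=0):
    """Absolute error of KDE and MoM-KDE at the outlier location x_star.

    The true N(0, 1) density is essentially zero at x_star = 10: the plain KDE
    builds a spurious bump there, while the block-wise median suppresses it as
    long as more than half of the blocks are outlier-free (S >= 2|O| + 1).
    """
    rng = np.random.default_rng(seed)
    inliers = rng.normal(0.0, 1.0, size=(n - n_out, 1))
    X = np.vstack([inliers, np.full((n_out, 1), x_star)])
    h = n ** (-1 / 5)                        # classical univariate bandwidth order
    S = 2 * n_out + 1                        # block condition of Propositions 1 and 2
    query = np.array([[x_star]])
    truth = norm.pdf(x_star)
    return (abs(kde(query, X, h)[0] - truth),
            abs(mom_kde(query, X, h, S, rng)[0] - truth))
```

With these illustrative settings, the first returned error stays of the order of n_out/(n h) while the second is close to zero, mirroring Figure 1.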
3.3. Influence function in the O∪I framework

As a measure of robustness, we now introduce an Influence Function (IF), or sensitivity curve, adapted to the O∪I framework. It is inspired by the classical IF, first proposed by Hampel (1974), which measures how an estimator changes when the initial distribution is modified by adding a small amount of contamination at a point x*. Therefore, it provides a notion of stability in the Huber model framework (Andrews, 1986; Debruyne et al., 2008). Generalizing the IF from the Huber model to the O∪I framework is not that straightforward. Indeed, the definition of the IF for a function estimate (introduced for the Huber model in Kim and Scott (2011), Definition 1) is

\[ \mathrm{IF}_{\mathrm{Huber}}(x, x^{\star}; T, F) = \lim_{s \to 0} \frac{T(x; F_s) - T(x; F)}{s} , \]

where \(F_s = (1 - s)F + s\,\delta_{x^{\star}}\) and T is a function estimate based on F evaluated at x. This formulation is adapted to the Huber model, as it considers all samples drawn from a mixture of a true density (here F) and an outlying one (δ_{x*}). This is unfortunately not adapted to the O∪I framework. To highlight the robustness of our estimator using an IF-based idea, we adapt it to this framework and propose the following definition.

Figure 2. Influence function for MoM-KDE and KDE where n = 1000, m = 40 outliers are placed at x* = 12, and where the samples in I_n are drawn from the true density: 2/3 N(0, 1) + 1/3 N(5, 1).

Definition 3. (IF_{O∪I}) Let T_n(x_0; I_n) be a density estimator evaluated at x_0 and learned with a healthy data set I_n = {X_i}_{i=1}^n. Let m ∈ N and x* ∈ R^d. The IF_{O∪I} is defined as:

\[ \mathrm{IF}_{O \cup I}(x_0, x^{\star}, m; I_n, T_n) \triangleq \big| T_n(x_0; I_n) - T_n\big(x_0; I_n \cup \{x^{\star}\}_{i=1}^{m}\big) \big| , \]

where by healthy points we mean inliers, i.e. samples that are independently drawn from the true density function.

Given this definition, IF_{O∪I} quantifies how much the value at x_0 of an estimated density function changes whenever the healthy dataset is increased by m points located at x*. Therefore, the link with the notion of stability is made obvious: the smaller IF_{O∪I} is, the more stable and thus robust the estimator is. An illustration of the IF for MoM-KDE and KDE is given in Figure 2. It shows that the IF for MoM-KDE is lower than the one of KDE, especially near the outlying point x*. Note that the variability observed in the case of MoM-KDE is due to the block-splitting procedure. See Section C of the Appendix for more results on the IF. In the next proposition, we provide a lower bound on the number of added samples m over which the IF_{O∪I} of the MoM-KDE is lower than the one of KDE with high probability.

Proposition 3. Let x*, x_0 ∈ R^d and I_n be a healthy data set. Grant Assumptions 1 to 3 and denote

\[ a = \sum_{i \in I_n} K\!\left(\frac{X_i - x_0}{h}\right) \quad \text{and} \quad b = K\!\left(\frac{x^{\star} - x_0}{h}\right) . \]

Let S ≥ 2m + 1, with m ∈ ⟦0, n/2⟦ the number of added samples, and δ > 0 such that |b − a/n| > C_ρ √(2δS/n). Then with probability higher than 1 − 4 exp(−δ) we have:

\[ \mathrm{IF}_{O \cup I}(x_0, x^{\star}, m; I_n, \hat{f}_{\mathrm{MoM}}) \;\leq\; \mathrm{IF}_{O \cup I}(x_0, x^{\star}, m; I_n, \hat{f}_{\mathrm{KDE}}) . \]

The proof of this proposition is given in the Appendix. Given the previous proposition, the lower bound on m over which the MoM-KDE is better than the KDE is not necessarily easy to interpret. When everything is fixed except x_0 and x*, we see that the bound is low whenever

\[ |b - a/n| = \Big| K\!\Big(\frac{x^{\star} - x_0}{h}\Big) - \frac{1}{n} \sum_{i \in I_n} K\!\Big(\frac{X_i - x_0}{h}\Big) \Big| \]

is large. A sufficient condition for this is to take x* far from the sampling set I_n, i.e. take x* as an outlier. Under this condition, the bound will get even lower whenever x_0 gets closer to x*.
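Definition 3 translates directly into code. The sketch below (reusing the `kde` and `mom_kde` helpers above; names and settings are illustrative) computes the empirical IF_{O∪I} of an estimator when m copies of x* are appended to a healthy sample, in the spirit of Figure 2.

```python
import numpy as np

def influence_oi(estimator, x0, x_star, m, X_healthy, **kwargs):
    """Empirical IF_{O u I}(x0, x*, m; I_n, T_n) of Definition 3."""
    contaminated = np.vstack([X_healthy, np.tile(x_star, (m, 1))])
    return np.abs(estimator(x0, X_healthy, **kwargs)
                  - estimator(x0, contaminated, **kwargs))

# Example in the spirit of Figure 2: m = 40 copies of x* = 12 appended to
# n = 1000 healthy samples, with S = 2m + 1 = 81 blocks for MoM-KDE.
# if_kde = influence_oi(kde, x0_grid, np.array([12.0]), 40, X_healthy, h=h)
# if_mom = influence_oi(mom_kde, x0_grid, np.array([12.0]), 40, X_healthy, h=h, S=81)
```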
4. Numerical experiments

In this section, we display numerical results supporting the relevance of MoM-KDE. All experiments were run on a personal laptop computer using Python. The code of MoM-KDE is made available online¹.

¹ https://github.com/lminvielle/mom-kde. For the sake of comparison, we also implemented RKDE and SPKDE.

Comparative methods. In the following experiments, we propose to compare MoM-KDE to the classical KDE and two robust versions of KDE, called RKDE (Kim and Scott, 2012) and SPKDE (Vandermeulen and Scott, 2014). As previously explained, RKDE takes the ideas of robust M-estimation and translates them to kernel density estimation. The authors point out that the classical KDE estimator can be seen as the minimizer of a squared error loss in the Reproducing Kernel Hilbert Space (RKHS) H corresponding to the chosen kernel. Instead of minimizing this loss, they propose to minimize a robust version of it:

\[ \hat{f}_{\mathrm{RKDE}} = \operatorname*{argmin}_{g \in \mathcal{H}} \ \sum_{i=1}^{n} \rho\big( \| \phi(X_i) - g \|_{\mathcal{H}} \big) , \]

where φ is the canonical feature map associated to the RKHS and ρ(·) is taken to be either the Huber or the Hampel function, both known to bring robustness in M-estimation. The solution of the newly expressed problem is then found using the iteratively reweighted least squares algorithm. Note that this RKHS approach has also been combined with the MoM principle for robust estimation in (Lerasle et al., 2019). However, no algorithm was proposed to build this estimator, which is why no comparison is made here.

SPKDE proposes to scale and project the standard KDE in a way that decontaminates the dataset. Recalling that \(\hat{f}_n\) is the classical KDE estimator of Equation (1), this procedure is done by finding

\[ \hat{f}_{\mathrm{SPKDE}} = \operatorname*{argmin}_{g \in D_n} \ \big\| \beta \hat{f}_n - g \big\|_{2} , \]

where D_n corresponds to the convex hull of \(\{ K\left((\cdot - X_i)/h\right) \}_{i=1}^{n}\) and β is a hyperparameter that controls the robustness. The minimization is shown to be equivalent to a quadratic program over the simplex, solved via projected gradient descent.

Metrics. The performance of MoM-KDE is measured through three metrics. Two are used to measure the similarity between the estimated and the true density; one describes the performance of an anomaly detector based on the estimated density. The first one is the Kullback-Leibler divergence (Kullback and Leibler, 1951), which is the most used in robust KDE (Kim and Scott, 2008; 2011; 2012; Vandermeulen and Scott, 2014). Used to measure the similarity between distributions, it is defined as

\[ D_{\mathrm{KL}}(\hat{f} \,\|\, f) = \int \hat{f}(x) \log\!\left( \frac{\hat{f}(x)}{f(x)} \right) dx . \]

As the Kullback-Leibler divergence is non-symmetric and may take infinite values when the distributions do not share the same support, we also consider the Jensen-Shannon divergence (Endres and Schindelin, 2003; Liese and Vajda, 2006). It is a symmetrized version of D_KL, with positive values, bounded by 1 (when the logarithm is used in base 2), and has found applications in many fields, such as deep learning (Goodfellow et al., 2014) or transfer learning (Segev et al., 2017). It is defined as

\[ D_{\mathrm{JS}}(\hat{f} \,\|\, f) = \frac{1}{2} \left[ D_{\mathrm{KL}}(\hat{f} \,\|\, g) + D_{\mathrm{KL}}(f \,\|\, g) \right], \quad \text{with } g = \tfrac{1}{2}(\hat{f} + f) . \]

Motivated by real-world applications, the third metric is not related to the true density, which is usually not available in practical cases. Instead, we quantify the capacity of the learnt density to detect anomalies using the well-known Area Under the ROC Curve criterion (AUC) (Bradley, 1997). An input point x_0 is considered abnormal whenever \(\hat{f}(x_0)\) is below a given threshold.
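For completeness, here is a minimal grid-based sketch of the two divergences above, under the assumption (ours) that both densities have already been evaluated and renormalized on a common regular grid with spacing dx.

```python
import numpy as np

def kl_divergence(p, q, dx, eps=1e-12):
    """Discretized D_KL(p || q) on a common grid with spacing dx (natural log)."""
    p, q = np.maximum(p, eps), np.maximum(q, eps)
    return float(np.sum(p * np.log(p / q)) * dx)

def js_divergence(p, q, dx):
    """Discretized Jensen-Shannon divergence; base-2 logs make it bounded by 1."""
    g = 0.5 * (p + q)
    return 0.5 * (kl_divergence(p, g, dx) + kl_divergence(q, g, dx)) / np.log(2)
```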
Figure 3. Density estimation with synthetic data: (a) Uniform, (b) Regular Gaussian, (c) Thin Gaussian, (d) Adversarial Thin Gaussian. The displayed metric is the Jensen-Shannon divergence. A lower score means a better estimation of the true density.

Hyperparameters. All estimators are built using the Gaussian kernel. The number of blocks S in MoM-KDE is selected on a regular grid of 20 values between 1 and 2|O| + 1 in order to obtain the lowest D_JS. Recall that, from a theoretical point of view, we showed that if S ≥ 2|O| + 1, i.e. at least half of the blocks do not contain anomalies, consistency results and rates can be obtained (Propositions 1 and 2). However, if the proportion of anomalies gets large, S ≥ 2|O| + 1 implies a small number of samples in each block, which may lead to bad estimates. Thus, it seems that S ≥ 2|O| + 1 is not a necessary condition to obtain good results. In practice, this suggests that S could be lower than 2|O| + 1, which is why we use this grid-search strategy. The bandwidth h is chosen for KDE via the pseudo-likelihood cross-validation method (Turlach, 1993), and used for all estimators. The construction of RKDE follows exactly the indications of its authors (Kim and Scott, 2012), and ρ(·) is taken to be the Hampel function, as it empirically showed to be the most robust. For SPKDE, the true ratio of anomalies is given as an input parameter.

4.1. Results on synthetic data

To evaluate the efficiency of MoM-KDE against KDE and its robust competitors, we set up several outlier situations. In all these situations, we draw N = 1000 inliers from an equally weighted mixture of two normal distributions N(µ1, σ1) and N(µ2, σ2) with µ1 = 0, µ2 = 6, and σ1 = σ2 = 0.5. The outliers, however, are sampled through various schemes:

(a) Uniform. A uniform distribution U([µ1 − 3, µ2 + 3]), which is the classical setting used for outlier simulation.

(b) Regular Gaussian. A similar-variance normal distribution N(3, 0.5) located between the two inlier clusters.

(c) Thin Gaussian. A low-variance normal distribution N(3, 0.01) located between the two inlier clusters.

(d) Adversarial Thin Gaussian. A low-variance normal distribution N(0, 0.01) located on one of the inlier Gaussian modes. This scenario can be seen as adversarial, as an ill-intentioned agent may hide wrong points in regions of high density. It is the most challenging setting for standard robust estimators, as they are in general robust to outliers located outside the support of the density we wish to estimate.

For all situations, we consider several ratios of contamination and set the number of outliers |O| in order to obtain a ratio |O|/n ranging from 0.05 to 0.5 with 0.05-wide steps. Finally, to evaluate the pertinence of our results, for each set of parameters, data are generated 10 times. We display in Figure 3 the results over synthetic data using the D_JS score. The average scores and standard deviations over the 10 experiments are represented for each outlier scheme and ratio. Overall, the results show the good performance of MoM-KDE in all the considered situations. Furthermore, they highlight the dependency of the two competitors on the type of outliers. Indeed, as SPKDE is designed to handle uniformly distributed outliers, the algorithm struggles when confronted with differently distributed outliers (see Figure 3 b, c, d). RKDE performs generally better, but fails against adversarial contamination, which may be explained by its tendency to down-weight points located in low-density regions, which in this particular case correspond to the inliers. Results for D_KL and AUC are reported in Appendix C. Generally, they show similar results, and the same conclusions on the good performance of MoM-KDE can be made. In Appendix C, we also add comparisons with methods that first proceed to an anomaly detection step before fitting a classical KDE. However, their results are not as good as those of the robust estimators.

4.2. Results on real data

Experiments are also conducted over six classification datasets: Banana, German, Titanic, Breast-cancer, Iris, and Digits. They contain respectively n = 5300, 1000, 2201, 569, 150, and 1797 data points having d = 2, 20, 3, 30, 4, and 64 input dimensions. They are all publicly available, either from open repositories at http://www.raetschlab.org/Members/raetsch/benchmark/ (for the first three) or directly from the Scikit-learn package (for the last three) (Pedregosa et al., 2011). We follow the approach of Kim and Scott (2012) that consists in setting the class labeled 0 as outliers and the rest as inliers.

Figure 4. Anomaly detection with real datasets, measured with AUC over varying outlier proportion: (a) Banana, (b) German, (c) Titanic, (d) Breast cancer, (e) Iris, (f) Digits (O: 0, I: all), (g) Digits (O: 1, I: all), (h) Digits (O: 1, I: 0). A higher score means a better detection of the outliers. For Digits, we specify which classes are chosen to be inliers (I) and outliers (O).
To artificially control the outlier proportion, we randomly downsample the abnormal class to reach a ratio |O|/n ranging from 0.05 to 0.5 with 0.05-wide steps. When a dataset does not contain enough outliers to reach a given ratio, we similarly downsample the inliers. For each dataset and each ratio, the experiments are performed 50 times, the random downsampling resulting in different learning datasets. The empirical performance is evaluated through the capacity of each estimator to detect anomalies, which we measure with the AUC. Results are displayed in Figure 4. With the Digits dataset, we also explore additional scenarios with changing inlier and outlier classes (specified in the figure captions). Overall, results are in line with performances observed over synthetic experiments, achieving good results in comparison to its competitors. Note that even in the highest dimensional scenarios, i.e. Digits and Breast cancer (d = 64 and d = 30), Mo M-KDE still behaves well, outperforming its competitors. This behavior for high-dimensionality problems was also notice for robust KDE-based methods in a recent anomaly detection review (Domingues et al., 2018). Additional results are reported in the the Section C of the Appendix. 5. Conclusion The present paper introduced Mo M-KDE, a new efficient way to perform robust kernel density estimation. The method has been shown to be consistent in both L and L1 error-norm in presence of very generic outliers. Mo M-KDE achieved good empirical results in various situations while having a lower computational complexity than its competitors. While the present work uses the coordinate-wise median to construct its robust estimator, it might be interesting to investigate the use of other generalizations of the median in high dimension, e.g. the geometric median. Another possible extension of the proposed method is to consider a bootstrap version where several random splits are performed and thus several Mo M-KDE are aggregated. In that case, one can expect the final output to be smoother and the bound to be less dependent on the number of blocks S. On the other hand, with such approach the complexity of the method increases. First empirical visualizations of this method can be found in the Appendix. Acknowledgments Part of this research was funded by the Id AML chair of Centre Borelli from ENS Paris Saclay. Robust Kernel Density Estimation with Median-of-Means principle Alon, N., Matias, Y., and Szegedy, M. (1999). The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137 147. Andrews, D. W. (1986). Stability comparison of estimators. Econometrica: Journal of the Econometric Society, pages 1207 1235. Backurs, A., Indyk, P., and Wagner, T. (2019). Space and time efficient kernel density estimation in high dimensions. In Advances in Neural Information Processing Systems, pages 15773 15782. Blum, M., Floyd, R. W., Pratt, V. R., Rivest, R. L., and Tarjan, R. E. (1973). Time bounds for selection. Journal of Computer and System Sciences, 7:448 461. Boucheron, S. and Thomas, M. (2012). Concentration inequalities for order statistics. Electronic Communications in Probability, 17:1 12. Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145 1159. Debruyne, M., Christmann, A., Hubert, M., and Suykens, J. A. (2008). Robustness and stability of reweighted kernel based regression. Technical report. Depersin, J. and Lecué, G. (2021). 
Optimal robust mean and location estimation via convex programs with respect to any pseudo-norms. ar Xiv preprint ar Xiv:2102.00995. Devroye, L. and Gyorfi, L. (1985). Nonparametric density estimation: the L1 view. New York: John Wiley & Sons. Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R. I., et al. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695 2725. Devroye, L. and Lugosi, G. (2012). Combinatorial methods in density estimation. Springer Science & Business Media. Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., and Stewart, A. (2019). Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742 864. Domingues, R., Filippone, M., Michiardi, P., and Zouaoui, J. (2018). A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognition, 74:406 421. Endres, D. M. and Schindelin, J. E. (2003). A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858 1860. Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l Institut Henri Poincare (B) Probability and Statistics, volume 38, pages 907 921. Elsevier. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672 2680. Gray, A. G. and Moore, A. W. (2003a). Nonparametric density estimation: toward computational tractability. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 203 211. SIAM. Gray, A. G. and Moore, A. W. (2003b). Rapid evaluation of multiple density models. In International Conference on Artificial Intelligence and Statistics. Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383 393. Huber, P. J. (1992). Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492 518. Springer. Jerrum, M. R., Valiant, L. G., and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169 188. Jiang, H. (2017). Uniform convergence rates for kernel density estimation. In Proceedings of the 34th International Conference on Machine Learning, pages 1694 1703. JMLR. org. Kim, J. and Scott, C. (2008). Robust kernel density estimation. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3381 3384. IEEE. Kim, J. and Scott, C. (2011). On the robustness of kernel density M-estimators. In Proceedings of the 28th International Conference on Machine Learning, pages 697 704. Citeseer. Kim, J. and Scott, C. D. (2012). Robust kernel density estimation. The Journal of Machine Learning Research, 13(Sep):2529 2565. Kim, J., Shin, J., Rinaldo, A., and Wasserman, L. (2019). Uniform convergence of the kernel density estimator adaptive to intrinsic volume dimension. In ICML 201936th International Conference on Machine Learning, volume 97. Robust Kernel Density Estimation with Median-of-Means principle Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79 86. Laforgue, P., Staerman, G., and Clémençon, S. (2020). How robust is the median-of-means? concentration bounds in presence of outliers. ar Xiv preprint ar Xiv:2006.05240. Lecué, G. and Lerasle, M. (2019). 
Learning from MOM s principles: Le Cam s approach. Stochastic Processes and their applications, 129(11):4385 4410. Lecué, G., Lerasle, M., et al. (2020a). Robust machine learning by median-of-means: theory and practice. The Annals of Statistics, 48(2):906 931. Lecué, G., Lerasle, M., and Mathieu, T. (2020b). Robust classification via MOM minimization. Machine Learning. Lerasle, M. (2019). Lecture notes: selected topics on robust statistical learning theory. ar Xiv preprint ar Xiv:1908.10761. Lerasle, M., Szabó, Z., Mathieu, T., and Lecué, G. (2019). MONK outlier-robust mean embedding estimation by median-of-means. In International Conference on Machine Learning, pages 3782 3793. PMLR. Liese, F. and Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394 4412. Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). Isolation forest. In 2008 eighth ieee international conference on data mining, pages 413 422. IEEE. Liu, H. and Gao, C. (2019). Density estimation with contamination: minimax rates and theory of adaptation. Electronic Journal of Statistics, 13(2):3613 3653. Minsker, S. et al. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308 2335. Nemirovsky, A. S. and Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. Wiley Interscience, New-York. Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065 1076. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. The Journal of Machine Learning Research, 12:2825 2830. Rigollet, P. and Vert, R. (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154 1178. Scott, D. W. (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons. Segev, N., Harel, M., Mannor, S., Crammer, K., and El Yaniv, R. (2017). Learn on source, refine on target: a model transfer learning framework with random forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1811 1824. Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26. CRC press. Sriperumbudur, B. and Steinwart, I. (2012). Consistency and rates for clustering with DBSCAN. In International Conference on Artificial Intelligence and Statistics, pages 1090 1098. Tsybakov, A. B. (2008). Introduction to nonparametric estimation. Springer Science & Business Media. Turlach, B. A. (1993). Bandwidth selection in kernel density estimation: a review. In CORE and Institut de Statistique. Citeseer. Vandermeulen, R. and Scott, C. (2013). Consistency of robust kernel density estimators. In Conference on Learning Theory, pages 568 591. Vandermeulen, R. A. and Scott, C. (2014). Robust kernel density estimation by scaling and projection in Hilbert space. In Advances in Neural Information Processing Systems, pages 433 441. Wang, D., Lu, X., and Rinaldo, A. (2019). DBSCAN: optimal rates for density-based cluster estimation. The Journal of Machine Learning Research, 20(170):1 50. Wang, Z. and Scott, D. W. (2019). Nonparametric density estimation for high-dimensional data-algorithms and applications. Wiley Interdisciplinary Reviews: Computational Statistics, 11(4):e1461. Weinzierl, S. 
(2000). Introduction to monte carlo methods. ar Xiv preprint hep-ph/0006269. Robust Kernel Density Estimation with Median-of-Means principle A. Additional comments We begin this section with a simple example highlighting the robustness of Mo M-KDE compared with KDE (see also Figure 1 for a different visual example). Example 1. (Mo M-KDE v.s. Uniform KDE) Let the inliers be i.i.d. samples from a uniform distribution on the interval [ 1, 1] and the outliers be i.i.d. samples from another uniform distribution on [ 3, 3]. Let the kernel function be the uniform kernel, x0 = 2 and h (0, 1). Then if S > 2|O|, we obtain | ˆf Mo M(x0) f(x0)| = 0 a.s. and P | ˆfn(x0) f(x0)| = 0 = (1 h/3)|O| = 1 . Proof. Since the inliers are uniform on [ 1, 1], when x0 = 2, we have f(x0) = 0. Recall that the classical KDE at x0 = 2 with the uniform kernel and h (0, 1) is: ˆfn(x0) = 1 2nh i=1 I(x0 h Xi x0 + h) . The support of the inliers is [ 1, 1], hence, i I, I(x0 h Xi x0 + h) = 0 a.s. and without outliers we would have ˆfn(x0) = 0, resulting in no errors at x0 = 2. To prove the first equality, it suffices to observe that S > 2|O| implies that more than half the blocks do not contain outliers. Over each one of these block, the KDE at x0 is thus equal to 0 a.s. making the Mo M estimate equal to 0 as well. Finally it results in | ˆf Mo M(x0) f(x0)| = ˆf Mo M(x0) = 0, almost surely. Let us prove the second equality: P | ˆfn(x0) f(x0)| = 0 = P ˆfn(x0) = 0 i=1 I(x0 h Xi x0 + h) = 0 i O I(x0 h Xi x0 + h) = 0 = P i O, Xi / [x0 h, x0 + h] = (1 h/3)|O| . This result shows that the Mo M-KDE makes (almost surely) no error at the point x0. On the contrary, the KDE here has a non-negligible probability to make an error. We now make three comments on the behavior of the Mo M-KDE. Optimality. Characterizing the optimality i.e. obtaining lower bound/minimax rates, for our framework is a difficult task. This is why, at the moment, we characterize the optimality by comparing our rate with the one of the standard KDE, known to be optimal without outliers (a strategy already used in Lecué et al. (2020a) for example). Note that Liu and Gao (2019) study minimax rates of the KDE. Nevertheless, their framework is different. Indeed, they consider the Huber model and study the optimality under L2-convergence. Actually, to be consistent with Liu and Gao (2019), we have also been interested in finding error rates with L2-convergence. However, the median operator makes the analysis a lot more complicated, as it would require well-suited concentration inequalities for order statistics in the manner of Boucheron and Thomas (2012) but for sub-gaussian random variables. Robust Kernel Density Estimation with Median-of-Means principle 4 2 0 2 4 6 8 10 12 14 0.00 0.25 Mo M-KDE KDE outliers 4 2 0 2 4 6 8 10 12 14 0.00 0.25 Mo M-KDE + Bootstrap KDE outliers Figure 5. Illustration of Mo M-KDE + Bootstrap. 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) Figure 6. Influence function for the 10-randomized version of Mo M-KDE with S = 2m + 1 and KDE where the m outliers are placed at x = 12. The true density is 2/3 N(0, 1) + 1/3 N(5, 1). Mo M is dependent from the partition. In Figure 2, we see that the Mo M-IF is very erratic. This is essentially due to the partitioning. We actually have several ideas to mitigate this behavior. One natural idea is to perform several partitions/estimations and then aggregate them. 
However this would come at the price of additional computations and the theoretical analysis would become quickly intractable. Randomization to reduce the variance. During the process of the Mo M-KDE, the split of the dataset is arbitrary. One natural idea, is to consider different splits in order to learn a single Mo M-KDE per split, and then average them. The idea being to learn a smoother density estimator. We give an illustration of this procedure in Figure 5. We clearly see that Mo M-KDE together with Bootstrap is smoother than just one Mo M-KDE. The impact of this strategy is also illustrated in Figures 6 regarding the IF. B. Technical proofs Lemma 1 (L error-bound of the KDE without outliers - (Wang et al., 2019)) Suppose that f belongs to the class of densities P(α, L) defined as P(α, L) f | f 0, Z f(x)dx = 1, and f Σ(α, L) , where Σ(α, L) is the Hölder class of function on Rd. Grant assumptions 1 to 3 and let h (0, 1), γ > 0, n large enough, and nhd 1. Then with probability at least 1 exp( γ), we have γ + log(1/h) nhd + C2hα , where C2 = L Z u αK(u)du < and C1 is a constant that only depends on f , the dimension d, and the kernel properties. Robust Kernel Density Estimation with Median-of-Means principle Proposition 1 (L error-bound of the Mo M-KDE under the O I) Suppose that f belongs to the class of densities P(α, L) and grant assumptions 1 to 3. Let S be the number of blocks such that S 2|O| + 1. Then, for any h (0, 1), γ > 0, n/S large enough, and nhd S, we have with probability at least 1 exp( γ), ˆf Mo M f C1 S(log(S) + γ + log(1/h)) nhd + C2hα , where C2 = L Z u αK(u)du < and C1 is a constant that only depends on f , the dimension d, and the kernel properties. Proof. From the definition of the Mo M-KDE, we have the following implication (Lecué et al., 2020b) ˆf Mo M(x) f(x) ε = s=1 I ˆfns(x) f(x) > ε S/2 Thus to upper-bound the probability of the left-hand event, it suffices to upper-bound the probability of the right-hand event. Moreover, we have ˆfns(x) f(x) sup x ˆfns(x) f(x) = I ˆfns(x) f(x) > ε I sup x ˆfns(x) f(x) > ε s=1 I ˆfns(x) f(x) > ε s=1 I sup x ˆfns(x) f(x) > ε s=1 I ˆfns(x) f(x) > ε s=1 I sup x ˆfns(x) f(x) > ε , which implies that s=1 I ˆfns(x) f(x) > ε S/2 s=1 I sup x ˆfns(x) f(x) > ε S/2 Let Zs = I supx ˆfns(x) f(x) > ε and let S = s {1, , S} | Bs O = i.e. the set of indices s such that the block Bs does not contain any outliers. Since X s SC I( ) is bounded by |O|, almost surely, the following holds. s=1 I sup x ˆfns(x) f(x) > ε = s S Zs + |O| . (6) Note that S is never empty thanks to the hypothesis S 2|O| + 1. For n large enough e.g. (n/S)hd > γ and (n/S)hd > | log(γ)| (Sriperumbudur and Steinwart (2012) Thm 3.1), Lemma 1 with ε = C1 S(γ + log(1/h)) nhd + C2hα, leads to pε := P sup x ˆfn1(x) f(x) > ε e γ . Combining this last inequality with equation (6) gives Robust Kernel Density Estimation with Median-of-Means principle s=1 I sup x ˆfns(x) f(x) > ε S/2 s S Zs + |O| S/2 s S Zs S/2 |O| 1 (1 pε)|S| 1 (1 pε)S 1 (1 e γ)S . Finally, using the fact that 1 y e x (1 e x)y, we obtain ˆf Mo M f C1 S(γ + log(1/h)) nhd + C2hα ! Replacing γ by log(S) + γ > 0 gives ˆf Mo M f C1 S(log(S) + γ + log(1/h)) nhd + C2hα ! Note that we can easily extend this proof to the adversarial framework (Depersin and Lecué, 2021). Indeed, suppose Z1, , ZS defined above equation (6) are corrupted and Z 1, , Z S are the corresponding uncorrupted version of them. 
We have that Z1, , ZS are potentially not independent but Z 1, , Z S are independent and Zi = Z i for any i S (this is the adversarial framework). Then, s=1 (Zs Z s) s=1 Z s + |O| . The right-hand side is now a sum of i.i.d. bounded random variables and restarting from equation (6) we can conclude. Corollary 1 (Rate of convergence) Consider the assumptions of Proposition 1 with S = 2|O| + 1, γ = log(n) and 1/(2α+d) . Thus, with probability higher than 1 1 ˆf Mo M f |O| n log(n) α/(2α+d) + log(n) Proof. From the previous proposition we know that with probabability 1 1/n, we have ˆf Mo M f C1 S(log(S) + log(n) + log(1/h)) nhd + C2hα . Robust Kernel Density Estimation with Median-of-Means principle Replacing h by S log(n) 1/(2α+d) in this last equation gives us that ˆf Mo M f S log(n) α/(2α+d) |O| n log(n) α/(2α+d) + log(n) Proposition 2. (L1-consistency in probability) If n/S , h 0, nhd , S/ nhd 0, and S 2|O| + 1, then ˆf Mo M f 1 P n 0 . Proof. We first rewrite the Mo M-KDE as ˆf Mo M(x) = s=1 ˆfns(x)IAs(x) , where As = n x | ˆf Mo M(x) = ˆfns(x) o . Without loss of generality, we assume that Ak S s =ℓAℓ= , S s=1As = Rd, and s=1 IAs(x) = 1 . Z ˆf Mo M(x) f(x) dx = Z s=1 ˆfns(x)IAs(x) f(x) ˆfns(x) f(x) IAs(x) ˆfns(x) f(x) IAs(x)dx ˆfns(x) f(x) dx ˆfns(x) f(x) dx + X ˆfns(x) f(x) dx . (7) From the L1-consistency of the KDE in probability, if the number of anomalies grows at a small enough speed (Devroye and Gyorfi, 1985), the left part is bounded, i.e. ˆfns(x) f(x) dx X Z ˆfns(x) f(x) dx P n 0 . (8) To be more specific, since Rn = nhd, the rate given in Thm 5.1 of Devroye and Gyorfi(1985), the convergence is guarantee whenever S/Rn tends to 0 e.g. with S = O(Rτ n) and any τ (0, 1). We now upper-bound the right part of equation (7). Let consider a particular block As where s SC. In this block, the estimator fns is selected and is calculated with samples containing anomalies. As x As, fns(x) is the median (by definition), if S > 2|O|, we can always find a s S such that fns(x) fns (x) or fns(x) fns (x). Now let denote by A+ s = n x As | ˆfns(x) f(x) o and A s = n x As | ˆfns(x) < f(x) o . We have A+ s A s = As and each one of these blocks can be decomposed respectively into |S| sub-blocks (not necessarily disjoint) {As ,+ s }s S and {As , s }s S such that s S, Robust Kernel Density Estimation with Median-of-Means principle As ,+ s = n x As | ˆfns (x) ˆfns(x) f(x) o and As , s = n x As | ˆfns (x) ˆfns(x) < f(x) o . Finally, the right-hand term of equation (7) can be upper-bounded by ˆfns(x) f(x) dx X ˆfns(x) f(x) dx + Z ˆfns(x) f(x) dx ˆfns(x) f(x) dx + Z ˆfns(x) f(x) dx ˆfns (x) f(x) dx + Z ˆfns (x) f(x) dx Z ˆfns (x) f(x) dx + Z ˆfns (x) f(x) dx Since s S we have Z | ˆfns (x) f(x)|dx P n 0, we can conclude using similar arguments as those used for (8) that X As | ˆfns(x) f(x)|dx P n 0, which concludes the proof. Proposition 3. Let x , x0 Rd and In be an healthy data set. Grant assumptions 1 to 3 and denote i I K Xi x0 Let S 2m + 1 with m J0, n 2 J the number of added samples and δ > 0 such that |b a/n| > Cρ p 2δS/n, then with probability higher than 1 4 exp( δ) we have: IFO I(x0, x , m; In, ˆf Mo M) IFO I(x0, x , m; In, ˆf KDE) . 
The IFO I for the KDE is IFO I(x0, x , m, In; ˆf KDE) = 1 (n + m)hd i=1 K Xi x0 i=1 K Xi x0 i=1 K Xi x0 i=1 K Xi x0 + m n + m K x x0 i=1 K Xi x0 + m n + m K x x0 a + m n + m nb a n2/m + n i=1 K Xi x0 , and b K x x0 IFO I for the Mo M-KDE Robust Kernel Density Estimation with Median-of-Means principle Let S > 2m be the number of blocks in the Mo M-KDE, {Bs}S s=1 and { Bs }S s =1 be respectively the blocks of the contaminated data set In {x }m and the healthy data set. We have: IFO I(x0, x , m, In; ˆf Mo M) = 1 i In Bs K Xi x0 i {x } Bs K x x0 i Bs K Xi x0 i Bs K Xi x0 i Bs K Xi x0 where Bs is the block selected by the median for the healthy Mo M-KDE. The inequality is obtained by noticing that, with S > 2m, there always exists an healthy block Bs that makes it true. Finally, denoting S n + m K Xi x0 S n K Xi x0 = Z(s) Z(s ) , and using both the triangular and Hoeffding inequalities, and the fact that E(Z(s)) = E(Z(s )), we have hd IFO I(x0, x , m, In; ˆf Mo M) Z(s) Z(s ) E(Z(s)) + E(Z(s )) Z(s) E(Z(s)) + Z(s ) E(Z(s )) , and for t > 0, P Z(s) E(Z(s)) t 2 exp 2t2(n + m)) P Z(s ) E(Z(s )) t 2 exp 2t2n = 2 exp 2nst2 where ns = n/S. Furthermore, given two real-valued random variables X, Y , we know that for t > 0, P (|X| + |Y | 2t) P (|X| t) + P (|Y | t) . Therefore, we have P Z(s) E(Z(s)) + Z(s ) E(Z(s )) t 4 exp nst2 Setting t = Cρ 2δ ns with δ > 0, finally gives P Z(s) Z(s ) < t 1 4 exp( δ). We now know that with probability 1 4 exp( δ), IFO I(x0, x , m, In; ˆf Mo M) < 1 2δ ns and we seek for which value of m, this value is smaller than Robust Kernel Density Estimation with Median-of-Means principle the IFO I of the KDE i.e. 2δ ns nb a n2/m + n n2/m + n ns|nb a| m n2 ns|nb a| m n2 n|nb a| m n n|b a/n| C. Additional results As stated in the main paper, we display here the additional results containing: For synthetic data, the Kullback-Leibler divergence in both directions, i.e. DKL( ˆf, f) and DKL(f, ˆf), and the ROC AUC measuring the performance of an anomaly detector based on ˆf. Results are displayed on Figure 7. For synthetic data, a benchmark of Outlier Detection (OD) + KDE. Instead of applying directly a robust estimator, we first proceed to an outlier detection step after which we fit a standard KDE. Results are displayed on Figure 8. For Digits data, the ROC AUC measuring the performance of an anomaly detector based on ˆf. As stated in the main paper, this is done under multiple scenarios, where outliers and inliers can be chosen among the nine available classes. Here we show the AUC when the outliers are set as one class (class 2 to class 9), and inliers are set as the rest of all classes. Results are displayed on Figure 9. Additional visualizations highlighting the behavior of the Influence Function of KDE and Mo M-KDE (Figures 10 KL, ROC and AUC for Synthetic data. When considering the Kullback-Leibler divergence, results lead to a very similar conclusion as previously stated, that is, an overall good performance of Mo M-KDE while its competitors, notably SPKDE, are more data-dependent. When the density estimate ˆf is used in a simple anomaly detector, results are quite different. Indeed, when outliers are uniformly distributed, even if Mo M-KDE seems to better estimate the true density (according to DJS and DKL), this doesn t make ˆf Mo M a better anomaly detector. 
It seems that in this case, the outliers are either easily detected because distant from the density estimate, or located in dense regions, thus making them impossible to identify, and this for all density estimates provided by competitors. In the case of adversarial contamination, the conclusion is quite similar. Although Mo M-KDE better fits the true density, the situation is extremely difficult for anomaly detection, hence making all competitors yield very poor results. In the two other cases Gaussian outlier, anomaly detection results follow the density estimation. Benchmark of Outlier Detection (OD) + KDE: We proceed to the OD step using two different methods, the KDE and the Isolation Forest (IF) (Liu et al., 2008). For the first method, we fit a KDE estimate using all data and we erase the proportion ε = |O|/n of samples that have the lower density. For the IF, we proceed similarly by removing a proportion ε of outliers. Finally, we fit a KDE estimate on the new samples. The results are given in Figure 8 and show that OD + KDE (purple curve) or IF + KDE (brown) is not as efficient as RKDE or Mo M-KDE. Robust Kernel Density Estimation with Median-of-Means principle DKL( ˆf, f) (a) Uniform (b) Regular Gaussian (c) Thin Gaussian (d) Advsersarial Thin Gaussian Figure 7. Density estimation with synthetic data. The displayed metrics are the Kullback-Leibler divergence (a lower score means a better estimation of the true density) and the AUC (a higher score means a better detection of the outliers). Digits data. Results over Digits scenarios are inline with main conclusions over real data. Although from one scenario to another, all methods have varied results, the overall observation is that Mo M-KDE is either similar or better than its competitors. Influence Function. Illustrations of the behavior of the Influence Function (IF) for Mo M-KDE and KDE are given in Figures 10 and 11. They show that the IF for Mo M-KDE is lower than the one of KDE when the m outlying points are all placed at a low density point x (see Figure 10). On the other hand, Figure 11 shows that whenever the outlying points are all placed at a high density point x , the two IF exhibit similar behavior. Robust Kernel Density Estimation with Median-of-Means principle (a) Uniform (b) Regular Gaussian (c) Thin Gaussian (d) Advsersarial Thin Gaussian Figure 8. Density estimation with synthetic data. The displayed metric is the Jensen-Shannon divergence. A lower score means a better estimation of the true density. (a) O: 2, I: all (b) O: 3, I: all (c) O: 4, I: all (d) O: 5, I: all (e) O: 6, I: all (f) O: 7, I: all (g) O: 8, I: all (h) O: 9, I: all Figure 9. Anomaly detection with Digits data, measured with AUC over varying outlier proportion. A higher score means a better detection of the outliers. We specify which classes are chosen to be inliers (I) and outliers (O). Robust Kernel Density Estimation with Median-of-Means principle 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) 4 2 0 2 4 6 8 10 12 14 0.000 0.010 IF of Mo M-KDE IF of KDE outliers (x ) Figure 10. Influence function for the 10-randomized version of Mo M-KDE with S = 2m + 1 and KDE where the m outliers are placed at x = 12. The true density is 2/3 N(0, 1) + 1/3 N(5, 1). 
Figure 11. Influence function for MoM-KDE with S = 2m + 1 and KDE, where the m = 100 outliers are placed at x*, and where the samples in I_n are drawn from the true density: 2/3 N(0, 1) + 1/3 N(5, 1).