# sobolev_space_regularised_pre_density_models__7858f8f6.pdf Sobolev Space Regularised Pre Density Models Mark Kozdoba * 1 Binyamin Perets * 1 Shie Mannor 1 2 We propose a new approach to non-parametric density estimation that is based on regularizing a Sobolev norm of the density. This method is statistically consistent, and makes the inductive bias of the model clear and interpretable. While there is no closed analytic form for the associated kernel, we show that one can approximate it using sampling. The optimization problem needed to determine the density is non-convex, and standard gradient methods do not perform well. However, we show that with an appropriate initialization and using natural gradients, one can obtain well performing solutions. Finally, while the approach provides pre-densities (i.e. not necessarily integrating to 1), which prevents the use of loglikelihood for cross validation, we show that one can instead adapt Fisher divergence based score matching methods for this task. We evaluate the resulting method on the comprehensive recent anomaly detection benchmark suite, ADBench, and find that it ranks second best, among more than 15 algorithms. 1. Introduction Density estimation is one of the central problems in statistical learning. In recent years, there has been a tremendous amount of work in the development of parametric neural network based density estimation methods, such as Normalizing Flows (Papamakarios et al., 2021), Neural ODEs (Chen et al., 2018), and Score Based methods, (Song et al., 2021). However, the situation appears to be different for non parametric density estimation methods, (Wasserman, 2006), (Härdle et al., 2004). While there is recent work for low dimensional (one or two dimensional) data, see for instance (Takada, 2008), (Uppal et al., 2019), (Cui et al., 2020), *Equal contribution 1Technion Israel Institute of Technology, Haifa, Israel 2NVIDIA Research. Correspondence to: Mark Kozdoba , Binyamin Perets . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). (Marteau-Ferey et al., 2020), (Ferraccioli et al., 2021) (see also the survey (Kirkby et al., 2023)), there still are very few non-parametric methods applicable in higher dimensions. Compared to parametric models, non parametric methods are often conceptually simpler, and the model bias (e.g., prior knowledge, type of smoothness) is explicit. This may allow better interpretability and better regularization control in small and medium sized data regimes. Ideally, a function g produced by a density estimator should satisfy the following properties: (i) It is non-negative, (ii) it is integrable, and integrates to 1, (iii) it models well multidimensional data, and (iv) it is computationally feasible. As mentioned above, to the best of our knowledge currently there are no non-parametric estimators that satisfy (iii) and (iv). Moreover, note that even the requirements (i) and (ii) are non-trivial in the sense that they are not natural in the context of typical non-parametric constructions. Indeed, some of the above mentioned estimators simply forego either or both of these conditions even in 1d situations (see also a detailed discussion in (Marteau-Ferey et al., 2020) on this topic). In this paper we develop the first non-parametric estimator that satisfies all the properties above, with the exception of integrating to 1 (but still integrable). 
Since many tasks derived from density estimation do not require knowledge of the precise normalisation constant, this yields a practically useful and theoretically well-founded method.

Let $S = \{x_i\}_{i=1}^N \subset \mathbb{R}^d$ be a set of data points sampled i.i.d. from some unknown distribution. We study a density estimator of the following form:

$$f^* := \operatorname*{argmin}_{f \in \mathcal{H}_a} \; -\frac{1}{N}\sum_{i=1}^N \log f^2(x_i) + \|f\|^2_{\mathcal{H}_a}. \quad (1)$$

Here $\mathcal{H}_a$ is a Sobolev-type Reproducing Kernel Hilbert Space (RKHS) of functions, with a norm of the form

$$\int_{\mathbb{R}^d} f^2(x)\,dx + a \int_{\mathbb{R}^d} |(Df)(x)|^2\,dx, \quad (2)$$

where $D$ represents a combination of derivatives of a certain order. The density estimate is given by the function $(f^*)^2$. Note that $(f^*)^2$ is clearly non-negative, and $\|f^*\|_{\mathcal{H}_a} < \infty$ implies $\int_{\mathbb{R}^d} (f^*)^2(x)\,dx < \infty$. Thus $(f^*)^2$ is integrable over $\mathbb{R}^d$, although it does not necessarily integrate to 1. In this paper we refer to such functions as pre-densities. To convert such functions to densities, it is sufficient to multiply them by the appropriate constant. Note also that (1) is essentially a regularized maximum likelihood estimate, where in addition to bounding the total mass of $(f^*)^2$, we also bound the norm of some of the derivatives of $f^*$. The fact that $\mathcal{H}_a$ is an RKHS allows us to compute $f^*$ via the standard Representer Theorem. Observe that it would not be possible to control only the $L_2$ norm of $f$ and maintain computability, since $L_2$ is not an RKHS. However, adding the derivatives with any coefficient $a > 0$ makes the space an RKHS, which allows us to control smoothness; hence we call the objective the SObolev Space REgularised Pre-density estimator (SOSREP).

The objective (1) was introduced in (Good and Gaskins, 1971) and further studied in (Klonias, 1984), in the context of spline-based methods in one dimension. In this paper we generalise this approach to arbitrary dimensions, which requires resolving several theoretical and computational issues.

On the theory side, generalising the existing 1d results, we prove the asymptotic consistency of the SOSREP estimator for the SDO kernels in any fixed dimension $d$, under mild assumptions on the ground-truth density generating the $x_i$'s. In addition, we present a family of examples in which the SOSREP and the standard Kernel Density Estimator (KDE, (Wasserman, 2006), with the same kernel) provably differ by an arbitrarily large margin. Thus, SOSREP is a genuinely new estimator, not equivalent to KDE, and having some different properties. We also show that such examples occur naturally in real datasets. Due to space constraints, the discussion of these examples is deferred to the Supplementary Material, Section F.

On the computation side, we resolve several issues. First, for $d > 1$ the kernel corresponding to $\mathcal{H}_a$, which we call the SDO kernel (Single Derivative Order; see Section 4), no longer has an analytical expression. We show, however, that it can nevertheless be approximated by an appropriate sampling procedure. Next, the problem (1) is not convex in $f$, and we find that standard gradient descent optimization applied naively produces poor results. We show that this may be resolved by an appropriate initialization, and further improved by using a certain natural gradient optimisation rather than the standard one. Specifically, we will show that in our setup the natural gradient preserves some positivity properties, which are crucial for finding good solutions. Curiously, we believe this is one of the very few instances where it is clear why natural gradients perform better.
Further, note that as a result of consistency, it can also be shown that the estimator (f )2 becomes normalized, at least asymptotically with N (see Section 5). However, in general computing or even estimating the normalization constant is not straightforward in terms of numerical computation, and is outside the scope of this paper 1. Instead, we will focus on the Anomaly Detection (AD) applications, which do not require a normalization. A discussion of possible applications of SOSREP to another key task, that of generative modelling, is given in Section 7. Next, although the normalisation may not be required for the main task itself, lack of normalisation may still introduces a particular nuisance in the context of hyperparameter tuning, as it prevents the utilization of the maximum likelihood measure to establish the optimal bandwidth" parameter a. To resolve this, instead of the likelihood we consider the Fisher Divergence (FD), which uses log-likelihood gradients for divergence measurement, thereby eliminating the need for normalization. More specifically, we adapt the concept of score-matching (Hyvärinen, 2005; Song et al., 2020; Song and Ermon, 2019), a technique that has recently garnered renewed interest, to our setting. Finally, building on the above steps, we show that SOSREP achieves the remarkable performance of scoring second best on a recent comprehensive anomaly detection benchmark, (Han et al., 2022) for tabular data, which includes 47 datasets and 15 specialized AD methods. Interestingly, it is worth noting that anomaly detection has been regarded in the literature as a particularly difficult task, especially for deep density estimators (Nalisnick et al., 2019; Choi et al., 2019), and the difficulty persists even in the presence of substantial amounts data. We believe that this makes the good performance of a much more interpretable method, such as SOSREP, even more remarkable. The rest of the paper is organized as follows: Section 2 reviews related literature. Section 3 introduces the RSR estimator and treats associated optimization questions. The SDO kernel and the associated sampling approximation are discussed in Section 4. Consistency results are stated in Section 5. Section 6 contains the experimental results, and Section 7 concludes the paper. 2. Literature and Related Work As discussed in Section 1, a scheme that is equivalent to (1) was studied in (Good and Gaskins, 1971) and (Klonias, 1984); see also (Eggermont et al., 2001). However, these works concentrated solely on 1d case, and used spline methods to solve (3) in the special case that amounts to the use 1However, see Section 3.4, where an approach to an explicit normalisation at fixed finite N is discussed, along with other work were similar normalisation considerations are employed Sobolev Space Regularised Pre Density Models of one particular kernel. More recently, an estimator that is closely related to (1) was considered in (Ferraccioli et al., 2021), but was restricted to the 2 dimensional case, both theoretically and practically. In particular, to obtain solutions, their approach involves discretizing (triangulating) the domain of interest in R2. This would not be feasible in any higher dimension. In contrast to these approaches, our more general RKHS formulation in Section 3.1 allows the use of a variety of kernels, and our optimisation algorithm is suitable, and was evaluated on, high dimensional data. 
The work that is perhaps the most closely related to ours is (Marteau-Ferey et al., 2020), where the authors consider quadratic functions and regularisation with a version of an RKHS norm. The differences with respect to our work are both in scope and in details of the construction that enable computation. First, from a computational viewpoint, the objective in (Marteau-Ferey et al., 2020) requires optimisation over a certain family of non-negative matrices. While considering matrices has some advantages in terms of convexity, for a dataset of size N this requires N 2 parameters to encode such a matrix (their Theorem 1), and O(N 3) computational cost per step ((Marteau-Ferey et al., 2020) top of page 6). Further, their optimisation is a constrained one, which makes the problem significantly more difficult. It thus would be unpractical to apply their methods to something like the ADBench benchmark, and indeed, they have only experimented with a 1-dimensional Gaussian example with N = 50. In contrast, similarly to the standard kernel methods, our objective only requires N parameters, the optimisation step is O(N 2), and the optimisation can be done with standard optimisers, such as GD or Adam. Correspondingly, our evaluation is done on a state-of-theart outlier detection dataset. Second, while (Marteau-Ferey et al., 2020) is concerned with non negative functions generally, we focus on density estimation much more closely. In particular, we prove consistency of our estimator, which was not shown in (Marteau-Ferey et al., 2020) for the estimator introduced there. In addition, we prove, both theoretically and through empirical evaluation, that our method is different from KDE, which is the most common non parametric density estimator. The recent work (Tsuchida et al., 2024) may be viewed as neural network based variant of the construction in (Marteau Ferey et al., 2020). Specifically, with slight simplification, in this work densities of the form v, σ (Wt(x) + b) 2 are considered, where t : Rd RD is a fixed feature extractor, W Rn D and b RD are linear and bias terms, viewed as a shallow neural network, σ is a non-linearity, and v Rn. It is observed that for several possible choices of t, σ, the normalisation constant of such a density may be analytically computed as a function of v, W, b, thus making it an interesting density model, optimisable with maximum likelihood. Note that σ (Wt(x) + b) may be viewed as a (learnable, but shallow) feature embedding into Rn, and v as the linear combination of features, thus highlighting the analogy with the methods of (Marteau-Ferey et al., 2020) and with the approach in this paper. However, note also that the layer output dimension n here is involved in a trade-off: For higher n, the density family is more expressive, but the computation of the normalisation also becomes more complex (see Section 3.4). We note that in (Tsuchida et al., 2024) this approach was only evaluated as a density estimator on low dimensional data, and thus, in particular in view of the trade-off above, its behavior in higher dimensions is not clear at the moment. The paper (Sriperumbudur et al., 2017) studies density estimation in the extended exponential family. That is, the densities are modeled as q0(x)ef(x), where q0 is a fixed known reference density on Rd, and f varies in an RKHS H. In (Sriperumbudur et al., 2017) it is argued that maximumlikelihood estimators are impractical due to the need to compute the normalisation, and therefore a Fisher Divergence (i.e. 
score-based) objective is used for the estimation. In contrast, surprisingly, in this work we show that in fact one can use a likelihood-like objective (1) to obtain a useful consistent estimator.² We emphasize that both their estimator and ours produce unnormalised densities. The computation of the estimator of (Sriperumbudur et al., 2017) involves the solution of a certain linear system which, while conceptually simple, is computationally difficult due to the dimensions of this system. A number of variants with better computational complexity were developed in follow-up work; see (Sutherland et al., 2018), (Dai et al., 2019), and (Singh et al., 2018).

Another common group of non-parametric density estimators are the projection methods, (Wainwright, 2019). These methods have mostly been studied in the one-dimensional setting; see the survey (Kirkby et al., 2023). It is worth noting that, with the exception of (Uppal et al., 2019), the estimators produced by these methods are not densities, in the sense that they do not integrate to 1, and more importantly may take negative values. In the context of minimax bounds, projection methods in high dimensions were recently analyzed in (Singh et al., 2018), extending the classical work (Kerkyacharian and Picard, 1993). However, to the best of our knowledge, such methods have never been practically applied in high dimensions.

Consistency of the SOSREP in one dimension was shown in (Klonias, 1984), for kernels that coincide with SDO in one dimension. Our results provide $L_2$ consistency of the SOSREP in any fixed dimension $d$. While our approach generally follows the same lines as that of (Klonias, 1984), some of the estimates are done differently, since the corresponding arguments in (Klonias, 1984) were intrinsically one dimensional.

²Note that while we also use FD in the experiments, we only use it to tune a single scalar parameter, which is computationally a considerably simpler problem than learning the whole density.

3. The SOSREP Density Estimator

In this Section we describe the general SOSREP density estimation framework, formulated in an abstract Reproducing Kernel Hilbert Space. We then introduce and resolve issues related to the optimisation of the SOSREP objective, and discuss connections between convexity, positivity, and natural gradients.

3.1. The Basic Framework

Let $X$ be a set and let $\mathcal{H}$ be a Reproducing Kernel Hilbert Space (RKHS) of functions on $X$, with kernel $k : X \times X \to \mathbb{R}$. In particular, $\mathcal{H}$ is equipped with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, and for every $x \in X$ the function $k(x, \cdot) = k_x(\cdot) : X \to \mathbb{R}$ is in $\mathcal{H}$ and satisfies the reproducing property $\langle k_x, f \rangle_{\mathcal{H}} = f(x)$ for all $f \in \mathcal{H}$. The norm on $\mathcal{H}$ is given by $\|f\|^2_{\mathcal{H}} = \langle f, f \rangle_{\mathcal{H}}$, and the subscript $\mathcal{H}$ may be dropped when it is clear from context. We refer to (Schölkopf et al., 2002) for a general introduction to RKHS theory. Given a set of points $S = \{x_1, \ldots, x_N\} \subset X$, we define the SOSREP estimator as the solution to the following optimization problem:

$$f^* = \operatorname*{argmin}_{f \in \mathcal{H}} \; -\frac{1}{N}\sum_i \log f^2(x_i) + \|f\|^2_{\mathcal{H}}. \quad (3)$$

As discussed in Section 1, for appropriate spaces $\mathcal{H}$, the function $(f^*)^2$ corresponds to a pre-density (that is, $\int_{\mathbb{R}^d} (f^*)^2(x)\,dx < \infty$, but not necessarily $\int_{\mathbb{R}^d} (f^*)^2(x)\,dx = 1$). We now discuss a few basic properties of the solution to (3). First, by the Representer Theorem for RKHS, the minimizer of (3) has the form

$$f(x) = f_\alpha(x) = \sum_{i=1}^N \alpha_i k_{x_i}(x) \quad (4)$$

for some $\alpha = (\alpha_1, \ldots, \alpha_N) \in \mathbb{R}^N$. Thus one can solve (3) by optimizing over the finite-dimensional vector $\alpha$.
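To make the finite-dimensional form of the problem concrete, the following is a minimal NumPy sketch of the representer parameterization (4) and of the resulting objective in $\alpha$, using the identities $f_\alpha(x_i) = (K\alpha)_i$ and $\|f_\alpha\|^2_{\mathcal{H}} = \alpha^\top K \alpha$ discussed in Section 3.3. The Gaussian kernel is only a stand-in (the SDO kernel of Section 4 has no closed form and is approximated by sampling), and the function names are illustrative rather than taken from the released code.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Stand-in kernel for illustration; the SDO kernel of Section 4 is
    # approximated by the sampling scheme of Section 4.2.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def sosrep_objective(alpha, K):
    """Objective (3) restricted to the span of the k_{x_i}:
    L(alpha) = -(1/N) sum_i log f_alpha(x_i)^2 + alpha^T K alpha,
    using f_alpha(x_i) = (K alpha)_i and ||f_alpha||_H^2 = alpha^T K alpha."""
    f_vals = K @ alpha            # f_alpha evaluated at the training points
    return -np.mean(np.log(f_vals ** 2)) + alpha @ K @ alpha

def f_alpha(x_new, X_train, alpha, kernel=gaussian_kernel):
    """Evaluate the representer form (4) at new points; its square is the pre-density."""
    return kernel(np.atleast_2d(x_new), X_train) @ alpha
```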
Next, it is worth noting that standard RKHS problems, such as regression, typically use the term $\lambda \|h\|^2_{\mathcal{H}}$, where $\lambda > 0$ controls the regularization strength. However, due to the special structure of (3), any solution with $\lambda \neq 1$ is a rescaling by a constant of a $\lambda = 1$ solution. Thus considering only $\lambda = 1$ in (3) is sufficient. In addition, we note that any solution of (3) satisfies $\|f^*\|^2_{\mathcal{H}} = 1$. See Lemma 4 in the Supplementary Material for full details on these two points.

3.2. Convexity, Positivity, and Natural Gradients

Next, observe that the objective

$$L(f) = -\frac{1}{N}\sum_i \log f^2(x_i) + \|f\|^2_{\mathcal{H}} = -\frac{1}{N}\sum_i \log \langle f, k_{x_i} \rangle^2_{\mathcal{H}} + \|f\|^2_{\mathcal{H}} \quad (5)$$

is not convex in $f$. This is due to the fact that the scalar function $t \mapsto -\log t^2$ from $\mathbb{R}$ to $\mathbb{R}$ is not convex and is undefined at 0. However, the restriction of $t \mapsto -\log t^2$ to $t \in (0, \infty)$ is convex. Similarly, the restriction of $L$ to the positive cone of functions $C = \{f \mid f(x) \geq 0 \;\; \forall x \in X\}$ is convex. Empirically, we have found that the lack of convexity results in poor solutions found by gradient descent. Intuitively, this is caused by $f$ changing sign, which implies that $f$ must pass through zero at some points. If these points happen to be near the test set, this results in low likelihoods. At the same time, there seems to be no computationally affordable way to constrain the optimization to the positive cone $C$. Indeed, standard methods for constrained problems, such as projected gradient descent, (Nesterov, 2003), would be costly due to the need to compute the projections.

We resolve this issue in two steps. First, we use a non-negative $\alpha$ initialization, $\alpha_i \geq 0$. Note that for $f$ given by (4) with $\alpha_i \geq 0$, if the kernel is non-negative, then $f$ is non-negative. Although some kernels are non-negative, the SDO kernel, and especially its finite-sample approximation (Section 4.2), may have negative values. At the same time, there are few such values, and empirically such initialization tremendously improves the performance of the gradient descent. Second, we use the natural gradient, defined in the next section. One can show that for non-negative kernels, $C$ is in fact invariant under natural gradient steps (Supplementary Material, Section H). That is, if $f_\alpha$ given by (4) satisfies $f_\alpha \in C$, and $\alpha'$ is obtained from $\alpha$ by the natural gradient step (8), then $f_{\alpha'} \in C$. This does not seem to be true for the regular gradient. As a consequence, at least for fully non-negative kernels, by using the natural gradient we can perform optimisation in $C$ without the need for computationally costly constraints. As for the SDO kernel, although it is not completely non-negative, we observe empirically that the use of the natural gradient results in a more stable algorithm and in a performance improvement compared to standard gradient descent. A comparison of standard and natural gradients w.r.t. stability to negative values is given in Section 6.3.

3.3. Explicit Form of Gradients

In this Section we explicitly compute the regular and the natural gradients, both in the function space and in terms of $\alpha$. We are interested in the minimization of $L(f)$, defined by (5). Using the representation (4), we can equivalently consider minimization in $\alpha \in \mathbb{R}^N$. Let $K = \{k(x_i, x_j)\}_{i,j \leq N} \in \mathbb{R}^{N \times N}$ denote the empirical kernel matrix. Then standard computations show that $\|f_\alpha\|^2_{\mathcal{H}} = \langle K\alpha, \alpha \rangle_{\mathbb{R}^N}$ and that $(f_\alpha(x_1), \ldots, f_\alpha(x_N)) = K\alpha$ (as column vectors). Thus one can consider $L(f_\alpha) = L(\alpha)$ as a functional $L : \mathbb{R}^N \to \mathbb{R}$ and explicitly compute the gradient w.r.t. $\alpha$. This gradient is given in (6).
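For concreteness, here is a minimal sketch of one optimisation step in the $\alpha$ basis, implementing the standard gradient iteration (7) and, anticipating Lemma 1 below, the natural-gradient iteration (8). It assumes a precomputed kernel matrix `K`; the learning-rate value and the non-negativity remark in the comments are our own illustrative observations, not taken from the paper's released implementation.

```python
import numpy as np

def sosrep_gradient_step(alpha, K, lr=1e-2):
    """One plain gradient-descent step in alpha, iteration (7):
    grad_alpha L = 2 (K alpha - (1/N) K (K alpha)^{-1})."""
    N = K.shape[0]
    f_vals = K @ alpha
    grad = 2.0 * (K @ alpha - K @ (1.0 / f_vals) / N)
    return alpha - lr * grad

def sosrep_natural_gradient_step(alpha, K, lr=1e-2):
    """One natural-gradient step, iteration (8): in the alpha basis the
    natural gradient is 2 (alpha - (1/N) (K alpha)^{-1})."""
    N = K.shape[0]
    f_vals = K @ alpha
    # The update equals (1 - 2*lr) * alpha + (2*lr/N) / (K alpha): for a
    # non-negative kernel, alpha >= 0 entrywise and lr <= 1/2, it therefore
    # stays entrywise non-negative (a simple sufficient condition; cf.
    # Section H of the Supplementary Material).
    grad = 2.0 * (alpha - (1.0 / f_vals) / N)
    return alpha - lr * grad
```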
As detailed in 3.1, it is also useful to consider the Natural Gradient the gradient of L(f) as a function of f, directly in the space H. Briefly, a directional Fréchet derivative, (Munkres, 2018), of L at point f H in direction h H is defined as the limit Dh L(f) = limε 0 ε 1 (L(f + εh) L(f)). As a function of h, Dh L(f) can be shown to be a bounded and linear functional, and thus by the Riesz Representation Theorem, there is a vector, which we denote f L, such that Dh L(f) = f L, h for all h H. We call f L the Natural Gradient of L, since its uses the native space H. Intuitively, this definition parallels the regular gradient definition, but uses the H inner product to define the vector f L, instead of the standard, parametrization dependent inner product in RN, that is used to define αL. For the purposes of this paper, it is sufficient to note that similarly to the regular gradient, the natural gradient satisfies the chain rule, and we have f f 2 H = 2f and f g, f H = g for all g H. The explicit gradient expressions are given below: Lemma 1 (Gradients). The standard and the natural gradients of L(f) are given by αL = 2 Kα 1 N K (Kα) 1 RN and i=1 f 1(xi)kxi where for a vector v Rd, v 1 means coordinatewise inversion. If one chooses the functions kxi as a basis for the space HS = span {kxi}i N H, then α in (4) may be regarded as coefficients in this basis. For f = fα HS one can then write in this basis f L = 2 α 1 N (Kα) 1 RN. Therefore in the α-basis we have the following standard and natural gradient iterations, respectively: α α 2λ Kα 1 N K (Kα) 1 and (7) N (Kα) 1 , (8) where λ is the learning rate. 3.4. Normalisation Let us now discuss the normalisation of densities f 2 with f given by (4). Since the normalising constant for such a density is simply R Rd f 2(x)dx, we can write Z Rd f 2(x)dx = X Rd kxi(x)kxj(x)dx = α, Kα , where K RN N is the matrix with entries Kij = R kxi(x)kxj(x)dx. For the Gaussian kernel, these entries may be computed analytically, but for other kernels the situation is less clear. Another significant issue is that the operation α, Kα involves summation and multiplication of O(N 2) terms, and as such is likely to be numerically unstable for larger N. Due to these reasons, we believe that normalisation requires further study, and is out of scope for this work. It is worth noting, however, that in (Marteau-Ferey et al., 2020) a similar argument was used to achieve normalisation, with a small N and Gaussian kernels. Moreover, in (Tsuchida et al., 2024) (see Section 2), the normalisation constant also has a form α, Kα for an appropriate α. In this case, however, the matrix K depends on the parameters W, b instead of the inputs, and is of dimension n n, where n is the output size of the single hidden layer. We note in particular that since every entry of K depends on the parameters, the underlying computation graph of the autodifferentiation engine would scale as n2, thus explaining the expressivity-computational complexity trade-off of this method, as discussed in Section 2. 4. Single Derivative Order Kernel Approximation In this Section we introduce the Single Derivative Order kernel, which corresponds to norms of the form (2) discussed in Section 1. In Section 4.1 we introduce the relevant Sobolev functional spaces and derive the Fourier transform of the norm. In Section 4.2 we describe a sampling procedure that can be used to approximate the SDO. 4.1. The Kernel in Integral Form For a function f : Rd C and a tuple κ (N {0})d, let Dκ = f xκ1 1 ... 
x κd d denote the κ indexed derivative. By convention, for κ = (0, 0, . . . , 0) we set Dκf = f. Set also κ! = Qd j=1 κj! and |κ|1 = Pd j=1 κj. Set f 2 L2 = R |f(x)|2 dx. Then, for m N and a > 0 denote f 2 a = f 2 L2 + a X κ! (Dκf) 2 L2 . (9) The norm f 2 a induces a topology that is equivalent to that of a standard L2 Sobolev space of order m. We refer to Sobolev Space Regularised Pre Density Models (Adams and Fournier, 2003), (Saitoh and Sawano, 2016) for background on Sobolev spaces. However, here we are interested in properties of the norm that are finer than the above equivalence. For instance, note that for all a = 0 the norms f a are mutually equivalent, but nevertheless, a specific value of a is crucial in applications, for regularization purposes. Let Ha = n f : Rd C | f 2 a < o be the space of functions with a finite f 2 a norm. Denote by f, g Ha = f, g L2 + a X κ! (Dκf) , (Dκg) L2 (10) the inner product that induces the norm f 2 a. Theorem 2. For m > d/2 and any a > 0, the space Ha admits a reproducing kernel ka(x, y) satisfying ka x, f Ha = f(x) for all f Ha and x Rd. The kernel is given by ka(x, y) = Z Rd e2πi y x,z 1 + a (2π)2m z 2m dz = Z 1 + a (2π)2m z 2m e2πi y,z e2πi x,z dz. The proof of Theorem 2 follows the standard approach of deriving kernels in Sobolev spaces, via computation and inversion of the Fourier transform, see (Saitoh and Sawano, 2016). However, compact expressions such as (11) are only possible for some choices of derivative coefficients. Since the particular form (9) was not previously considered in the literature (except for d = 1, see below), we provide the full proof in the Supplementary Material. 4.2. Kernel Evaluation via Sampling To solve the optimization problem (3) in Ha, we need to be able to evaluate the kernel ka at various points. For d = 1, closed analytic expressions were obtained in cases m = 1, 2, 3 in (Thomas-Agnan, 1996). In particular, for m = 1, ka coincides with the Laplacian kernel kh(x, y) = e h|x y|. However, for d > 1, it seems unlikely that there are closed expressions. See (Novak et al., 2018) for a discussion of this issue for a similar family of norms. To resolve this, note that the form (11) may be interpreted as an average of the terms e2πi y,z e2πi x,z , where z is sampled from an unnormalized density wa(z) = (1 + a (2π)2m z 2m) 1 on Rd. This immediately suggest that if we can sample from wa(z), then we can approximate ka by summing over a finite set of samples zj instead of computing the full integral. In fact, a similar scheme was famously previously employed in (Rahimi and Recht, 2007), in a different con- text. There, it was observed that by Bochners s Theorem, (Rudin, 2017), any stationary kernel can be represented as k(x, y) = R ν(z)e2πi y,z e2πi x,z dz for some nonnegative measure ν. Thus, if one can sample z1, . . . , z T from ν, one can construct an approximation ˆka(x, y) = 1 t=1 cos ( zt, x + bt) cos ( zt, y + bt) , (12) where bt are additional i.i.d samples, sampled uniformly from [0, 2π]. In (Rahimi and Recht, 2007), this approximation was used as a dimension reduction for known analytic kernels, such as the Gaussian, for which the appropriate ν are known. Note that the samples zt, bt can be drawn once, and subsequently used for all x, y (at least in a bounded region, see the uniform approximation result in (Rahimi and Recht, 2007)). For the case of interest in this paper, the SDO kernel, Bochner s representation is given by (11) in Theorem 2. 
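Putting these pieces together, the following sketch approximates the SDO kernel via the random-feature form (12), drawing the frequencies $z_t$ from the unnormalised spectral density $w_a$ by the radial decomposition described next (uniform direction on the unit sphere, radius from a discretised one-dimensional density). The grid limits, the factor of $2\pi$ inside the cosines, and the overall constant are assumptions on our part: the kernel is recovered only up to a multiplicative constant, which merely rescales the pre-density.

```python
import numpy as np

def sample_sdo_frequencies(T, d, a, m, r_max=50.0, grid_size=10000, rng=None):
    """Draw z_1, ..., z_T from the unnormalised spectral density
    w_a(z) = (1 + a (2 pi)^{2m} ||z||^{2m})^{-1}, via z = r * theta:
    theta uniform on the unit sphere (normalised Gaussians), r from the
    one-dimensional density u_a(r) ~ r^{d-1} / (1 + a (2 pi r)^{2m}),
    sampled here from a fine grid as a simple stand-in for an HMC sampler."""
    rng = np.random.default_rng() if rng is None else rng
    r_grid = np.linspace(1e-6, r_max, grid_size)
    u = r_grid ** (d - 1) / (1.0 + a * (2.0 * np.pi * r_grid) ** (2 * m))
    r = rng.choice(r_grid, size=T, p=u / u.sum())
    theta = rng.standard_normal((T, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    return r[:, None] * theta                     # shape (T, d)

def sdo_kernel_approx(X, Y, Z, b):
    """Random-feature approximation (12): up to an overall constant,
    k_a(x, y) ~ (2/T) sum_t cos(2 pi <z_t, x> + b_t) cos(2 pi <z_t, y> + b_t)."""
    phi_x = np.cos(2.0 * np.pi * X @ Z.T + b)     # shape (n, T)
    phi_y = np.cos(2.0 * np.pi * Y @ Z.T + b)
    return 2.0 * phi_x @ phi_y.T / Z.shape[0]

# Hypothetical usage: the same frequencies serve all subsequent kernel evaluations.
# Z = sample_sdo_frequencies(T=2000, d=X.shape[1], a=1.0, m=X.shape[1] // 2 + 1)
# b = np.random.uniform(0.0, 2.0 * np.pi, size=Z.shape[0])
# K = sdo_kernel_approx(X, X, Z, b)
```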
Thus, to implement the sampling scheme (12) it remains to describe how one can sample from the density wa(z) on Rd. To this end, note that wa(z) is spherically symmetric, and thus can be decomposed as z = rθ, where θ is sampled uniformly from a unit sphere Sd 1 and the radius r is sampled from a one dimensional density ua(r) = rd 1 1+a(2πr)2m (see the Supplementary Material for full details on this change of variables). Next, note that sampling θ is easy. Indeed, let g1, . . . , gd be i.i.d standard Gaussians. Then θ (g1, . . . , gd)/ p P i g2 i . Thus the problem is reduced to sampling a one dimensional distribution with a single mode, with known (unnormalized) density. This can be efficiently achieved by methods such as Hamiltonian Monte Carlo (HMC). However, we found that in all cases a fine grained enough discretization of the line was sufficient. 4.3. Relations With Other Kernels We now discuss the relation between SDO and the Laplacian, Gaussian and Matérn kernels. The SDO kernel with parameter m is defined by the norm (9), which involves only derivatives of order m and we use m = d/2 in the experiments. On the other hand, it is well known that the Gaussian kernel, k(x, y) = e x y 2 2σ2 , corresponds to a Sobolev space with a norm that is analogous to (9), but involves a summation over all orders of derivatives (see (Williams and Rasmussen, 2006), Chapter 6), and thus contains only infinitely smooth functions. More recently, it has been shown that the Laplacian kernel, k(x, y) = e x y 2σ , corresponds to a Sobolev space with a norm that involves all the derivatives up to order m = d/2 , (see (Rakhlin and Zhai, 2019), Proposition 7). This norm, however, explicitly includes lower order deriva- Sobolev Space Regularised Pre Density Models Figure 1: Anomaly Detection Results on ADBench, higher is Better. SOSREP is Second Best Among 18 Algorithms tives (m << d/2) terms. Thus bounding the norm also explicitly bounds the lower derivative magnitudes. In this sense, the Laplace kernel penalises non-smoothness more heavily than the SDO. In fact, SDO with m = d/2 may be viewed as the RKHS space with minimal smoothness requirements in d dimensions. Indeed, m > d/2 is necessary for the space to be an RKHS (otherwise (11) diverges, see Supplementary Sec. D.1 and D.2 for additional details), and no additional terms are present in the norm. Finally, it is worth noting that the Matérn family of kernels with a smoothness parameter ν (Williams and Rasmussen, 2006) contains the Laplacian and Gaussian kernels as two extremes, with ν = 1 2 and ν , respectively. As shown in (Fasshauer and Ye, 2011), in fact the corresponding RKHS spaces have increasingly growing smoothness order, from d/2 to , when ν = 1 2 + p, and p varies in [0, ). In addition to the SDO kernel, we have evaluated the SOSREP with Gaussian and Laplace kernels on the Anomaly Detection task and full results can be found in the Supplementary Material N. While the Gaussain and Laplacian kernels generally perform competitively and, in most cases, SOSREP estimators are better than KDE estimators, the performance of SDO is better. The precise reasons for this are not clear at the moment, and we believe that a detailed study of the interaction between the specific kernel and SOSREP is of interest. However, it is out of the scope of the present paper and is left as a future work. 5. Consistency In this Section we describe the consistency result for the SOSREP estimators with kernels of the form (11). 
Recall from the discussion in Section 3.2 that the objective L, given by (5), is convex when restricted to the positive cone. Here we consider the following positive cone: C = C a = n f span ka xi N i=1 | f(xi) > 0 i N o . (13) Compared to Section 3.2, belonging to C requires only positivity on the data points xi rather than on all x Rd, i.e. belonging to C is a weaker requirement. The convexity of L still holds, however, due to similar considerations. In addition, we restrict the cone to the span of xi since the optimal solutions belong to that span, and our algorithms operate inside the span too. Note also that f H1 if and only if f Ha for every a > 0. With these remarks in mind, we can state the consistency result: Theorem 3. Let x1, . . . , x N be i.i.d samples from a compactly supported density v2 on Rd, such that v H1. Set a = a(N) = 1/N, and let u N = u(x1, . . . , x N; a(N)) be the minimizer of the objective (5) in the cone (13). Then u N v L2 converges to 0 in probability. In words, when N grows, and the regularisation size a(N) decays as 1/N, the SOSREP estimators u N converge to v in L2. Note that since v L2 = 1 (as its a density), and since Sobolev Space Regularised Pre Density Models u N v L2 0, it follows by the triangle inequality that u N L2 1. That is, the estimator u N becomes approximately normalized as N grows. An overview and full details of the proof are provided in Supplementary Material Sections M.1 and M.2. 6. Experiments In this section, we present the evaluation of SOSREP on the ADBench anomaly detection benchmark, and empirically test the advantage of natural gradient descent for maintaining a non-negative f. The evaluations of SOSREP with different kernels on a 2-dimensional synthetic dataset are presented in the Supplementary Material section J.1. 6.1. Anomaly Detection Results for ADbench ADbench (Han et al., 2022) is a recent Anomaly Detection (AD) benchmark evaluating 15 state-of-the-art algorithms on 47 tabular datasets. Alongside these, we also evaluate standard KDE estimators with Gaussian and Laplace kernels. The performance of all these algorithms is compared to that of SOSREP with the SDO kernel. We focus on the classical unsupervised setup, in which at train time there is no information regarding which samples are anomalies. On test, anomaly labels are revealed only for final algorithm evaluation. For all density-based approaches (KDE,SOSREP), we employ the negative of the density as the anomaly score. That is, low density values would be considered anomalies. In accordance with the ADbench paper, we evaluate each dataset by calculating the average AUCROC of that score across four runs, each with a different random seed. The seed determines the train-test split of the ADbench data, as well as the randomness of the algorithms. In Figure 1, each algorithm is represented by a box plot where the average AUC-ROC across all datasets is indicated by an orange dot. The boundaries of the box represent the 25th and 75th quantiles, while the line inside the box denotes the median. The algorithms are sorted by their average AUC-ROC. Additional evaluations are available in the Supplementary Material, Section J.2. These include a comparison of SOSREP using different kernels (Gaussian, Laplace, and SDO) against other methods on a per-dataset basis contrary to the per-method analysis provided here. Furthermore, evaluations that assess the relative ranks of the algorithms, instead of the mean AUC-ROC, can also be found in the same section. 
The results there are very similar to those in Fig. 1. As discussed in (Han et al., 2022), evaluation by rank may help mitigate potential evaluation biases arising from variations in dataset difficulty. For both AUC-ROC and rank evaluations, SOSREP emerges as the 2nd best AD method overall. Notably, this Figure 2: For each HP a, we calculate the Fisher divergence between the density learned on the training set and the density inferred on the test set. is achieved with the generic version of our method, without any pre or post-processing specifically dedicated to AD. In contrast, many other methods are specifically tailored for AD and do include extensive pre and post-processing. In addition to performing well on the standard ADBench benchmark, and perhaps even more impressively, SOSREP excels also on the more demanding setup of duplicate anomalies, which was also extensively discussed in (Han et al., 2022). Here, SOSREP rises to the forefront as the top AD method (with an average AUC-ROC of 71.6 for X5 duplicates - a lead of 4% over the closest contender). This scenario, which is analogous to numerous practical situations, is a focal point for ADbench s assessment of unsupervised AD methods due to its inherent difficulty, leading to substantial drops in performance for former leaders like Isolation Forest. Detailed explanations are available in the Supplementary Material Section J.3. As for computational cost, the entire set of 47 datasets was processed in 186 minutes using a single 3090RTX GPU and one CPU, averaging about 4 minutes per dataset. 6.2. Fisher-Divergence Based Hyperparameter Tuning As discussed in Section 1, dealing with unnormalized densities adds some complexity to hyperparameter tuning. Indeed, hyperparameter tuning typically involves the comparison of multiple models, for instance models with different smoothness control (aka bandwidth ) parameters a > 0. In probabilistic models, this would typically involve choosing the model which gives the highest likelihood to the data (note that the anomaly labels are not available at this stage). However, due to the lack of normalisation, the likelihoods are not computable. In this paper, instead of likelihoods we compute the Fisher Diveregences between the model and the data, as these do not require normalisation, and choose the model with the parameter that yields the lowest diver- Sobolev Space Regularised Pre Density Models Figure 3: Fraction of negative values for natural versus α gradient-based optimization across datasets. The X-axis represents datasets from ADbench (see Supplementary Material Section L for details). gence (more precisely, we compute the FD up to an additive constant which does not depend on the the model). Figure 2 illustrates the FD for SOSREP for different a values for a single dataset. The x-axis shows the parameter values on a logarithmic scale. For visual clarity, we also shifted our version of FD to be strictly positive and applied a logarithmic transformation. One can see that the graph is somewhat noisy, but typically there is one clear minimum value. A detailed discussion of the stages involved in the FD computation and the associated minimum selection can be found in Section I of the supplementary material. 6.3. 
Natural-Gradient vs Standard Gradient Comparison In the context of the discussion in Section 3.1, we conduct an experiment to demonstrate that the standard gradient descent may significantly amplify the fraction of negative values in a SOSREP solution, while the natural gradient keeps it constant. We have randomly chosen 15 datasets from ADBench, and for each dataset we have used 50 non negative α initializations. Then we ran both algorithms for 1000 iterations. The fraction of negative values of fα (on the train set) was measured at initialization, and in the end of each run. In Figure 3, for each dataset and for each method, we show an average of the worst 5 fractions among the 50 initializations. Thus, for instance, for the shuttle data, the initial fraction is negligible, and is unchanged by the natural gradient. However, the standard gradient ( alpha GD in the figure in blue) for this dataset yields about 70% negative values in the 5 worst cases (i.e., 10% of initializations). We conclude that the natural gradient is better at preserving non-negativity than standard gradient descent. 7. Conclusion In this paper we have developed the theory and the optimisation methods for a multi dimensional unnormalised density estimator. This is the first estimator, among those broadly based on variants of functional space normalisation, that is computationally tractable, can be applied in higher dimensions, and performs well on a useful and difficult task. In particular, we have shown that the estimator is asymptotically consistent, that it is possible to use sampling to approximate the associated kernel, that it is possible to use a version of natural gradient for optimisation, and Fisher divergence for hyperparameter selection. We have found that the combination of these techniques result in a model that has excellent performance on a state-of-the-art outlier detection benchmark. The success of the approach on the ADBench benchmarks indicates that it is capable of modelling complicated densities. It is thus natural to ask next whether the unnormalised densities introduced in this paper can be used as a basis for generative models. Generative modelling for tabular data of small and intermediate sizes is an active and difficult research area, and SOSREP estimators may provide an alternative approach to the problem. Indeed, note that most Markov Chain Monte Carlo methods (Metropolis-Hastings, Langevin and Hamiltonian MC, etc.) work with unnormalised densities by design and can in principle be applied to SOSREP estimated densities out of the box to generate samples. We believe that evaluating such approaches, and designing schemes to alleviate the local minima issues common to MCMC, is an important and promising direction for future research that is based on the present work. All code necessary to replicate the results presented in this paper, as well as the implementation needed to run SOSREP with the kernels described herein, is publicly available. You can access and download the code from our Git Hub repository at (https://github.com/bp6725/SOSREP). We hope this will enable others in the community to verify and build upon our work. Acknowledgements We would like to thank the anonymous reviewers for pointing out several relevant references, and for a number of other constructive suggestions that helped improve the quality and readability of this work. 
This work was funded by the European Union under Horizon Europe grant number 101084642 and supported by UK Research and Innovation (UKRI) under the UK government s Horizon Europe funding guarantee [grant number 101084642]. Sobolev Space Regularised Pre Density Models Impact Statement This paper presents work which has the goal of advancing the fields of Density Estimation and Outlier Detection, and studies a general method, applicable across a variety of domains. While there may be many potential societal consequences related to the applications of the fields of study as above, we believe none of them are specific to our work, and thus do not need to be specifically highlighted here. Adams, R. A. and Fournier, J. J. (2003). Sobolev spaces. Elsevier. Betancourt, M. (2017). A conceptual introduction to hamiltonian monte carlo. ar Xiv preprint ar Xiv:1701.02434. Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., Assent, I., and Houle, M. E. (2016). On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Min. Knowl. Discov., 30(4):891 927. Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in neural information processing systems, 31. Choi, H., Jang, E., and Alemi, A. A. (2019). Waic, but why? generative ensembles for robust anomaly detection. Cui, Z., Kirkby, J. L., and Nguyen, D. (2020). Nonparametric density estimation by b-spline duality. Econometric Theory, 36(2):250 291. Dai, B., Dai, H., Gretton, A., Song, L., Schuurmans, D., and He, N. (2019). Kernel exponential family estimation via doubly dual embedding. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2321 2330. PMLR. Eggermont, P. P. B., La Riccia, V. N., and La Riccia, V. (2001). Maximum penalized likelihood estimation. Volume I: Density Estimation, volume 1. Springer. Fasshauer, G. E. and Ye, Q. (2011). Reproducing kernels of generalized sobolev spaces via a green function approach with distributional operators. Numerische Mathematik, 119:585 611. Ferraccioli, F., Arnone, E., Finos, L., Ramsay, J. O., Sangalli, L. M., et al. (2021). Nonparametric density estimation over complicated domains. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B STATISTICAL METHODOLOGY, 83(2):346 368. Good, I. and Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities. Biometrika, 58(2):255 277. Grafakos, L. (2008). Classical fourier analysis, volume 2. Springer. Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. (2018). Ffjord: Free-form continuous dynamics for scalable reversible generative models. ar Xiv preprint ar Xiv:1810.01367. Guha, S., Mishra, N., Roy, G., and Schrijvers, O. (2016). Robust random cut forest based anomaly detection on streams. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 16, page 2712 2721. JMLR.org. Han, S., Hu, X., Huang, H., Jiang, M., and Zhao, Y. (2022). Adbench: Anomaly detection benchmark. Härdle, W., Müller, M., Sperlich, S., Werwatz, A., et al. (2004). Nonparametric and semiparametric models, volume 1. Springer. Hutchinson, M. (1989). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18(3):1059 1076. Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. 
Journal of Machine Learning Research, 6(4). Kerkyacharian, G. and Picard, D. (1993). Density estimation by kernel and wavelets methods: optimality of besov spaces. Statistics & Probability Letters, 18(4):327 336. Kirkby, J. L., Leitao, Á., and Nguyen, D. (2023). Spline local basis methods for nonparametric density estimation. Statistic Surveys, 17:75 118. Klonias, V. (1984). On a class of nonparametric density and regression estimators. The Annals of Statistics, pages 1263 1284. Kwon, D., Natarajan, K., Suh, S. C., Kim, H., and Kim, J. (2018). An empirical study on network anomaly detection using convolutional neural networks. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pages 1595 1598. Loève, M. (1977). Probability theory I. Springer. Lyu, S. (2012). Interpretation and generalization of score matching. Co RR, abs/1205.2629. Marteau-Ferey, U., Bach, F., and Rudi, A. (2020). Nonparametric models for non-negative functions. Advances in neural information processing systems, 33:12816 12826. Munkres, J. R. (2018). Analysis on manifolds. CRC Press. Sobolev Space Regularised Pre Density Models Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. (2019). Do deep generative models know what they don t know? Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media. Novak, E., Ullrich, M., Wo zniakowski, H., and Zhang, S. (2018). Reproducing kernels of sobolev spaces on rd and applications to embedding constants and tractability. Analysis and Applications, 16(05):693 715. Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research, 22(1):2617 2680. Rahimi, A. and Recht, B. (2007). Random features for largescale kernel machines. Advances in neural information processing systems, 20. Rakhlin, A. and Zhai, X. (2019). Consistency of interpolation with laplace kernels is a high-dimensional phenomenon. In Conference on Learning Theory, pages 2595 2623. PMLR. Rudin, W. (2017). Fourier analysis on groups. Courier Dover Publications. Saitoh, S. and Sawano, Y. (2016). Theory of reproducing kernels and applications. Springer. Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press. Singh, S., Uppal, A., Li, B., Li, C.-L., Zaheer, M., and Póczos, B. (2018). Nonparametric density estimation under adversarial losses. Advances in Neural Information Processing Systems, 31. Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32. Song, Y., Garg, S., Shi, J., and Ermon, S. (2020). Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574 584. PMLR. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021). Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations. Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyv, A., Kumar, R., et al. (2017). Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(57):1 59. Stein, E. M. and Weiss, G. (1971). Introduction to fourier analysis on euclidean spaces (pms-32), volume 32. 
In Introduction to Fourier Analysis on Euclidean Spaces (PMS-32), Volume 32. Princeton university press. Sutherland, D. J., Strathmann, H., Arbel, M., and Gretton, A. (2018). Efficient and principled score estimation with nyström kernel exponential families. In International Conference on Artificial Intelligence and Statistics, pages 652 660. PMLR. Takada, T. (2008). Asymptotic and qualitative performance of non-parametric density estimators: a comparative study. The Econometrics Journal, 11(3):573 592. Thomas-Agnan, C. (1996). Computing a family of reproducing kernels for statistical applications. Numerical Algorithms, 13(1):21 32. Tsuchida, R., Ong, C. S., and Sejdinovic, D. (2024). Squared neural families: A new class of tractable density models. Advances in Neural Information Processing Systems, 36. Uppal, A., Singh, S., and Póczos, B. (2019). Nonparametric density estimation & convergence rates for gans under besov ipm losses. Advances in neural information processing systems, 32. Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press. Wasserman, L. (2006). All of nonparametric statistics. Springer Science & Business Media. Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning. MIT press Cambridge, MA. Sobolev Space Regularised Pre Density Models A. Overview Of The Supplementary Material This Supplementary Material is organized as follows: Some basic properties of the SOSREP objective and of the related minimization problem: Section B Derivation of the Gradients of the SOSREP objective: Section C Invariance of the non-negative function cone under natural gradient steps: Section H SOSREP vs KDE comparison: Sections F and G Derivation of the integral form of SDO kernel, proof of Theorem 3: Section D.1 Additional details on sampling approximation procedure: Section D.2 Details on the Fisher Divergence estimation for SOSREP Hyperparameter tuning : I Comparison of the raw AUC-ROC metric on ADBench data: Section J.2 Discussion of an additional test regime, with duplicated anomalies: Section J.3 On-the-fly hyperparameter tuning procedure that was used to save time by finding the first stable local minimum: Section J.6 The code used for all the experiments in the paper will be publicly released with the final version of the paper. B. Basic Minimizer Properties As discussed in Section 3.1, the minimizer of the SOSREP objective (3) always has H norm 1. In addition, there is no added value in multiplying the norm by a regularization scalar, since this only rescales the solution. Below we prove these statements. Lemma 4. Define f = argmin h H 1 i log h2(xi) + h 2 H . (14) Then f satisfies f 2 = 1. Moreover, if f = argmin h H 1 i log h2(xi) + λ2 h 2 H , (15) for some λ > 0, then f = λ 1f. Proof. For any h H and a > 0, argmin a>0 1 i log(ah)2(xi) + ah 2 = (16) i log h2(xi) log a2 + a2 h 2 . (17) Taking derivative w.r.t a we have a2 + 2a h 2 = 0. (18) Thus optimal a for the problem (3) must satisfy ah 2 = a2 h 2 = 1. To conclude the proof of the first claim, choose h in (16) to be the minimizer in (14), h = f. Note that if f H = 1, then we can choose a = f 1 H = 1 to further decrease the value of the objective contradicting the fact that f is the minimizer. Sobolev Space Regularised Pre Density Models For the second claim, denoting g = λh, argmin h H 1 i log h2(xi) + λ2 h 2 (19) = λ 1 argmin g λH=H 1 i log g2(xi) + g 2 + 1 i log λ2 (20) = λ 1 argmin g H 1 i log g2(xi) + g 2 (21) = λ 1f. (22) C. 
Derivation of the Gradients, Proof Of Lemma 1 In this section we derive the expressions for standard and the natural gradients of the objective (5), as given in Lemma 1. Proof Of Lemma 1. We first derive the expression for αL in (6). Recall that f 2 H = α, Kα RN for α RN, where Kij = k(xi, xj). This follows directly from the form (4), and the fact that kx, ky = k(x, y) for all x, y H, by the reproducing property. For this term we have α α, Kα = 2Kα. Next, similarly by using (4), αf(x) = (k(x1, x), . . . , k(x N, x)) for every x Rd. Finally, we have i=1 log f 2(xi) = 1 i=1 f 2(xi) 2f(xi) αf(xi) (23) i=1 f 1(xi) αf(xi) (24) N K(f(x1), . . . , f(x N)) 1 (25) N K (Kα) 1 . (26) This yields (6). For f L, we similarly have f f 2 H = 2f, as discussed in section 3.3. Moreover, i=1 log f 2(xi) = f 1 N i=1 log f, xi 2 H (27) i=1 f, xi 2 H 2 f, xi H f f, xi H (28) i=1 f, xi 1 H xi. (29) This completes the proof. D. SDO Kernel Details In Section D.1 we provide a full proof of Theorem 2, while Section D.2 contains additional details on the sampling approximation of SDO. Sobolev Space Regularised Pre Density Models D.1. SDO Kernel Derivation We will prove a claim that is slightly more general than Theorem 2. For a tuple a Rm +, define the norm l! κ! (Dκf) 2 L2 , (30) where Dκ are the κ-indexed derivative, as discussed in Section 4.1. The SDO norm is a special case with a0 = 1, am = a, and al = 0 for 0 < l < m. Let H a be the subspace of L2 of functions with finite norm, H a = {f L2 | f a < } (31) and let the associated inner product be denoted by l! κ! (Dκf) , (Dκg) L2 . (32) Define the Fourier transform Rd f(u)e 2πi z,u du, (33) and recall that we have (see for instance (Stein and Weiss, 1971), (Grafakos, 2008)) F (Dκf) (z) = j=1 (2πizj)κj Ff(z) for all z Rd. (34) The following connection between the L2 and the derivative derived norms is well known for the standard Sobolev spaces ((Williams and Rasmussen, 2006; Saitoh and Sawano, 2016; Novak et al., 2018)). However, since (30) somewhat differs from the standard definitions, we provide the argument for completeness. Lemma 5. Set for z Rd l=1 al (2π)2l z 2l ! 1 Then for every f H a we have f 2 a = v a(z) F[f] 2 L2 . (36) Sobolev Space Regularised Pre Density Models l! κ! Dκf 2 L2 (37) l! κ! F [Dκf] 2 L2 (38) l! κ! |F [Dκf] (z)|2 |F [f] (z)|2 + j=1 (2πzj)2κj |F [f] (z)|2 = Z dz |F [f] (z)|2 j=1 (2πzj)2κj = Z dz |F [f] (z)|2 l=1 al (2π)2l X = Z dz |F [f] (z)|2 " l=1 al (2π)2l z 2l # Using the above Lemma, the derivation of the kerenl is standard. Suppose k a is the kernel corresponding to f a on H a. It remains to observe that by the reproducing property and by Lemma 5, for all x Rd f(x) = f, k a x Rd dz F[f](z)F[k ax](z)v2 a(z). (45) On the other hand, by the Fourier inversion formula, we have f(x) = Z dz F[f](z)e2πi x,z . (46) This implies that Z dz F[f](z)e2πi x,z = Z Rd dz F[f](z)F[k ax](z)v2 a(z) (47) holds for all f H a, which by standard continuity considerations yields F[k a x](z) = e 2πi x,z v2 a(z) . (48) Using Fourier inversion again we obtain k a(x, y) = Z Rd e2πi y x,z v2 a(z) dz = Z Rd e2πi y x,z 1 + Pm l=1 al (2π)2l z 2l dz. (49) D.2. Sampling Approximation As discussed in Section 4.2, we are interested in sampling points z Rd from a finite non negative measure with density given by wa(z) = (1 + a (2π)2m z 2m) 1. With a slight overload of notation, we will also denote by wa the scalar function wa : R R, wa(r) = (1 + a (2π)2mr2m) 1. 
(50) Sobolev Space Regularised Pre Density Models First, note that wa(z) depends on z only through the norm z , and thus a spherically symmetric function. Therefore, with a spherical change of variables, we can rewrite the integrals w.r.t w 2 a as follows: For any f : Rd C, Z Rd wa(z)f(z)dz = Z Sd 1 dθ wa(r)Ad 1(r)f(rθ) (51) = Ad 1(1) Z Sd 1 dθ wa(r)rd 1 f(rθ). (52) Here Sd 1 the unit sphere in Rd, θ is sampled from the uniform probability measure on the sphere, r is the radius, and Ad 1(r) = 2πd/2 Γ(d/2)rd 1 (53) is the d 1 dimensional volume of the sphere or radius r in Rd. The meaning of (52) is that to sample from w 2 a , we can sample θ uniformly from the sphere (easy), and r from a density ζ(r) = wa(r)rd 1 = rd 1 1 + a (2π)2mr2m (54) on the real line. Note that the condition m > d/2 that we impose throughout is necessary. Indeed, without this condition the decay of ζ(r) would not be fast enough at infinity, and the density would not have a finite mass. As discussed in Section 4.2, ζ(r) is a density on a real line, with a single mode and an analytic expression, which allows easy computation of the derivatives. Such distributions can be efficiently sampled using, for instance, off-the-shelf Hamiltonian Monte Carlo (HMC) samplers, (Betancourt, 2017). In our experiments we have used an even simpler scheme, by discretizing R into a grid of 10000 points, with limits wide enough to accommodate a wide range of parameters a. E. A Few Basic Properties of the Kernel Proposition 6. The kernel (11) is real valued and satisfies Ka(x, y) = Z Rd cos (2π y x, z ) 1 + a (2π)2m z 2m dz. (55) Proof. Write e2πi y x,z = cos(2π y x, z ) + i sin(2π y x, z ) and observe that sin is odd in z, while 1 + a (2π)2m z 2m is even. Proposition 7. For all x, y Rd, Kb(x, y) = b d 2m K1(b 1 2m x, b 1 2m y) (56) Proof. Write u = b 1 2m z and note that du = (b 1 2m )ddz. We have Kb(x, y) = Z Rd cos (2π y x, z ) 1 + b 2π z 2m dz (57) cos 2π D b 1 2m (y x), u E 1 + 2πu 2m du b d 2m (58) = b d 2m K1(b 1 2m x, b 1 2m y). (59) Lemma 8. There is a function c(m) of m such that for every x Rd, Z (Ka(x, y))2 dy = c(m) a d 2m . (60) Sobolev Space Regularised Pre Density Models Figure 4: (a) Distribution of Kernel Values and SDO Kernel Values Inside and Between Clusters. (b) SOSREP and KDE Loglikelihoods. the x-axis represents points in the data, arranged by clusters, y-axis shows the log-likelihood. Proof. Recall that for fixed x and a, the Fourier transform satisfies F(ka x)(z) = e 2πi x,z 1+a(2π)2m z 2m , where ka x( ) = Ka(x, ) (see eq. (48)). We thus have ka x 2 L2 = F(ka x) 2 L2 (61) = Z e 2πi x,z e 2πi x,z 1 + a(2π)2m z 2m 2 dz (62) = Z 1 1 + a(2π)2m z 2m 2 dz (63) = a d 2m Z 1 1 + (2π)2m z 2m 2 dz , (64) where we have used the variable change z = a 1 2m z . F. Difference between SOSREP and KDE Models In this Section we construct an analytic example where the SOSREP estimator may differ arbitrarily from the KDE estimator with the same kernel. Thus, the models are not equivalent, and encode different prior assumptions. Briefly, we consider a block model, with two clusters. We ll show that in this particular setting, in KDE the clusters influence each other more strongly, i.e the points in one cluster contribute to the weight of the points in other cluster, yielding more uniform models. In contrast, in SOSREP, rather surprisingly, the density does not depend on the mutual position of the clusters (in a certain sense). Note that this is not a matter of bandwith of the KDE, since both models use the same kernel. 
We believe that this property may explain the better performance of SOSREP in Anomaly Detection tasks, although further investigation would be required to verify this. Given a set of datapoints S = {xi}, for the purposes of this section the KDE estimator is the function fkde(x) = fkde,S(x) = 1 i kxi(x). (66) Let f SOSREP be the solution of (3). We will compare the ratios fkde(xi)/fkde(xj) versus the corresponding quantities for SOSREP, f 2 SOSREP (xi)/f 2 SOSREP (xj) for some pairs xi, xj. Note that these ratios do not depend on the normalization of fkde and f 2 SOSREP , and can be computed from the unnormalized versions. In particular, we do not require kxi to be normalized in (66). Sobolev Space Regularised Pre Density Models Consider a set S with two components, S = S1 S2, with S1 = {x1, . . . , x N} and S2 = {x 1, . . . x M} and with the following kernel values: k(xi, xi) = k(x j, x j) = 1 for all i N, j M k(xi, xj) = γ2 for i = j k(x i, x j) = γ 2 for i = j k(xi, x j) = βγγ for all i, j This configuration of points is a block model with two components, or two clusters. The correlations between elements in the first cluster are γ2, and are γ 2 in the second cluster. Inter-cluster correlations are βγγ . We assume that γ, γ, β [0, 1] and w.l.o.g take γ > γ . While this is an idealized scenario to allow analytic computations, settings closely approximating the configuration (67) often appear in real data. See Section F.1 for an illustration. In particular, Figure 4 show a two cluster configuration in that data, and the distribution of k(x, x ) values. The KDE estimator for K is simply fkde(xt) = 1 N + M 1 + (N 1)γ2 + Mβγγ N N + M γ2 + M N + M βγγ , (68) for xt S1, where the second, approximate equality, holds for large M, N. To simplify the presentation, we shall use this approximation. However, all computations and conclusions also hold with the precise equality. For x t S2 we similarly have fkde(x t) N N+M βγγ + M N+M γ 2, and when M = N, the density ratio is fkde(xt) fkde(x t) = γ2 + βγγ γ 2 + βγγ . (69) The derivation of the SOSREP estimator is considerably more involved. Here we sketch the argument, while full details are given in Supplementary Material Section G. First, recall from the previous section that the natural gradient in the α coordinates is given by 2 β N 1(Kβ) 1 . Since the optimizer of (3) must satisfy f L = 0, we are looking for β RN+M such that β = (Kβ) 1 (the term N 1 can be accounted for by renormalization). Due to the symmetry of K and since the minimizer is unique, we may take β = (a, . . . , a, b, . . . , b), where a is in first N coordinates and b is in the next M. Then β = (Kβ) 1 is equivalent to a, b solving the following system: ( a = a 1 1 + (N 1)γ2 + b 1Mβγγ b = a 1Nβγγ + b 1 1 + (M 1)γ 2 (70) This is a non linear system in a, b. However, it turns out that it may be explicitly solved, up to a knowledge of a certain sign variable (see Proposition 10). Moreover, for M = N, the dependence on that sign variable vanishes, and we obtain Proposition 9. Consider the kernel and point configuration described by (67), with M = N. Then for every xt S1, x s S2, f SOSREP (xt) f SOSREP (x s) = γ2 In particular, the ratio does not depend on β. It remains to compare the ratio (71) to KDE s ratio (69). If β = 0, when the clusters are maximally separated, the ratios coincide. However, let us consider the case, say, β = 1 2, and assume that γ γ. Then in the denominator of (69) the larger term is βγγ , which comes from the influence of the first cluster on the second. 
This makes the whole ratio to be of the order of a constant. On the other hand, in SOSREP there is no such influence, and the ratio (71) may be arbitrarily large. We thus expect the gap between the cluster densities to be larger for SOSREP, which is indeed the case empirically. One occurence of this on real data is illustrated in Figure 4. F.1. Evaluation of the Difference Between SOSREP and KDE for Real Data We have performed spectral clustering of the letter dataset from ADBech ((Han et al., 2022)), using the empirical SDO kernel as affinity matrix for both SOSREP and KDE. We then have chosen two clusters that most resemble the two block Sobolev Space Regularised Pre Density Models model (67) in Section F. The kernel values inside and between the clusters are shown in Figure 4a. Next, we train the SOSREP and KDE models for just these two clusters (to be compatible with the setting of Section F. The results are similar for densities trained on full data). The log of these SOSREP and KDE densities in shown in Figure 4b (smoothed by running average). By adding an appropriate constant, we have arranged that the mean of both log densities is 0 on the first cluster. Then one can clearly see that the gap between the values on the first and second cluster is larger for the SOSREP model, yielding a less uniform model, as expected from the theory. G. KDE vs SOSREP Comparison Proofs In this section we develop the ingredients required to prove Proposition 9. In section G.1 we reduce the solution of the SOSREP problem for the two block model to a solution of a non-linear system in two variables, and derive the solution of this system. In section G.2 we use these results to prove Proposition 9. G.1. Solution Of SOSREP for a 2-Block Model As discussed in section F, any SOSREP solution f must be a zero point of the natural gradient, f L = 0. Using the expressions given following Lemma 1, this implies β = 1 N (Kβ) 1. Since we are only interested in f up to a scalar normalization, we can equivalently assume simply β = (Kβ) 1. Further, by symmetry consideration we may take β = (a, . . . , a, b, . . . , b), where a is in first N coordinates and b is in the next M. Then, as mentioned in section F, β = (Kβ) 1 is equivalent to a, b solving the following system: ( a = a 1 1 + (N 1)γ2 + b 1Mβγγ b = a 1Nβγγ + b 1 1 + (M 1)γ 2 (72) It turns out that it is possible to derive an expression for the ratio of the squares of the solutions to this system in the general case. Proposition 10 (Two Variables SOSREP System). Let a, b be solutions of ( a = H11a 1 + H12b 1 b = H21a 1 + H22b 1 (73) a2/b2 = H2 11 (H21 + H12) ρ p (H21 H12)2 + 4H11H22 2H11H22 + H12 h (H21 H12) + ρ p (H21 H12)2 + 4H11H22 i for a ρ satisfying ρ {+1, 1}. Proof. Write u = a 1, v = b 1, and multiply the first and second equations by u and v respectively. Then we have ( 1 = H11u2 + H12uv 1 = H21uv + H22v2. (75) We write v = 1 H11u2 /H12u. (76) Also, from the first equation, H12uv = 1 H11u2. (77) Substituting into the second equation, 1 H11u2 + H22 (H12u)2 . (78) Sobolev Space Regularised Pre Density Models Finally setting s = u2 and multiplying by H2 12s, H2 12s = H21H12s (1 H11s) + H22 (1 H11s)2 . (79) Collecting terms, we have s2(H2 11H22 H11H12H21) + s(H12H21 H2 12 2H11H22) + H22 = 0. (80) Solving this, we get s = (H12H21 H2 12 2H11H22) p (H12H21 H2 12 2H11H22)2 4(H2 11H22 H11H12H21)H22 2(H2 11H22 H11H12H21) . 
(81) The expression inside the square root satisfies (H12H21 H2 12 2H11H22)2 4(H2 11H22 H11H12H21)H22 (82) = (H12(H21 H12) 2H11H22)2 4(H2 11H22 H11H12H21)H22 (83) = H2 12(H21 H12)2 4H11H22H12(H21 H12) + 4H11H12H21H22 (84) = H2 12(H21 H12)2 + 4H11H22H2 12 (85) = H2 12 (H21 H12)2 + 4H11H22 (86) Thus, simplifying, we have u2 = s = (H12(H21 H12) 2H11H22) + ρH12 p (H21 H12)2 + 4H11H22 2H11(H11H22 H12H21) , (87) where ρ {+1, 1}. Rewriting (76) again, we have H2 12u2 . (88) Sobolev Space Regularised Pre Density Models a2/b2 = v2/u2 = H2 12u4 (89) = 1 H12u2 H11 2 2(H11H22 H12H21) (H12(H21 H12) 2H11H22) + ρH12 p (H21 H12)2 + 4H11H22 1 2 2(H11H22 H12H21) + (H12(H21 H12) 2H11H22) ρH12 p (H21 H12)2 + 4H11H22 (H12(H21 H12) 2H11H22) + ρH12 p (H21 H12)2 + 4H11H22 2 2H12H21 + H12(H21 H12) ρH12 p (H21 H12)2 + 4H11H22 (H12(H21 H12) 2H11H22) + ρH12 p (H21 H12)2 + 4H11H22 2H21 + H21 H12 ρ p (H21 H12)2 + 4H11H22 (H12(H21 H12) 2H11H22) + ρH12 p (H21 H12)2 + 4H11H22 (H21 + H12) ρ p (H21 H12)2 + 4H11H22 (H12(H21 H12) 2H11H22) + ρH12 p (H21 H12)2 + 4H11H22 (H21 + H12) ρ p (H21 H12)2 + 4H11H22 2H11H22 + H12 h (H21 H12) + ρ p (H21 H12)2 + 4H11H22 i G.2. Proof Of Proposition 9 Similarly to the case with KDE, we will use the following approximation of the system (72) ( a = a 1Nγ2 + b 1Mβγγ b = a 1Nβγγ + b 1Mγ 2 (98) Proof. Let f be the SOSREP solution. By definition, the ratio f(xt) f(x s) is given by a2/b2 where a, b are the solutions to (98). That is, we take H12 = H21 = βγγ , H11 = γ2, and H22 = γ 2 in Proposition 10. Note that we have removed the dependence on N, since it does not affect the ratio. By Proposition 10, substituting into (74), (H21 + H12) ρ p (H21 H12)2 + 4H11H22 2H11H22 + H12 h (H21 H12) + ρ p (H21 H12)2 + 4H11H22 i H12 ρ H11H22 H11H22 + H12ρ H11H22 = γ4 2βγγ 2ργγ 2γ2γ 2 + 2βγγ ργγ (1 + βρ)2 (102) It remains to note that (β+ρ)2 (1+βρ)2 = 1 for any β and ρ {+1, 1}. Sobolev Space Regularised Pre Density Models H. Invariance of C under Natural Gradient Define the non-negative cone of functions C H by C = {f H | f(x) 0 x X} . (103) As discussed in section 3.1, the functional L(f) is convex on C. We now show that if the kernel k is non-negative, then the cone C is invariant under the natural gradient steps. In particular, this means that if one starts with initialization in C (easy to achieve), then the optimization trajectory stays in C, without a need for computationally heavy projection methods. Note that this is unlikely to be true for the standard gradient. Recall that the expression (6) for the natural gradient is given in Lemma 1. Proposition 11. Assume that k(x, x ) 0 for all x, x X and that λ < 0.5. If f C, then also f := f N PN i=1 f 1(xi)kxi i C. Proof. Indeed, by opening the brackets, f = (1 2λ) f + 2λ i=1 f 1(xi)kxi which is a non-negative combination of functions in C, thus yielding the result. I. Fisher Divergence for Hyper-Parameters Selection The handling of unnormalized models introduces a particular nuance in the context of hyperparameter tuning, as it prevents the use of the maximum likelihood of the data in order to establish the optimal parameter. When confronted with difficulties associated with normalization, it is common to resort to score-based methods. The score function is defined as s(x; a) = x log pm(x; a), (104) where pm(x; a) is a possibly unnormalized probability density on Rd, evaluated at x Rd, and dependent on the hyperparameter a. Since the normalization constant is independent of x, and s is defined via the gradient in x, s is independent of the normalization. 
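To illustrate this normalization invariance concretely, the following is a small self-contained PyTorch check, not taken from the paper; the example densities and names are ours. The score of an unnormalized log-density coincides with the score of its normalized version, since the missing log-partition term is a constant in $x$.

```python
import torch

def score(log_density, x):
    """Gradient of a scalar log-density with respect to x (the score function)."""
    x = x.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(log_density(x).sum(), x)
    return g

# Unnormalized and normalized log-densities of the same standard Gaussian.
log_p_unnorm = lambda x: -0.5 * (x ** 2).sum(dim=-1)  # omits the -log Z term
log_p_norm = lambda x: (-0.5 * (x ** 2).sum(dim=-1)
                        - 0.5 * x.shape[-1] * torch.log(torch.tensor(2 * torch.pi)))

x = torch.randn(5, 3)
# The additive constant -log Z drops out under differentiation in x,
# so both scores coincide.
assert torch.allclose(score(log_p_unnorm, x), score(log_p_norm, x))
```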
As a result, distance metrics between distributions that are based on the score function, such as the Fisher divergence, can be evaluated using unnormalized densities. The Fisher Divergence (FD) is a similarity measure between distributions based on the score function, i.e. the gradient of the log-likelihood; in particular, it does not require the normalization of the density. While the link between FD and maximum likelihood estimation has been previously studied in the context of score matching (Lyu, 2012), to the best of our knowledge this is the first application of FD to hyperparameter tuning. The divergence between the data and a model can be approximated via the methods of (Hyvärinen, 2005), which have recently been computationally improved in (Song et al., 2020) in the context of score-based generative models (Song and Ermon, 2019). In particular, we adapt the Hutchinson trace representation based methods used in (Song et al., 2020) and (Grathwohl et al., 2018) to models of the form (3). We leverage this concept to identify the choice of parameters (in our case $a$, the smoothness parameter) that minimizes the FD between the density learned on the training set and the density inferred on the test set. Specifically, we apply score matching (Hyvärinen, 2005), a particular approach to measuring the Fisher divergence between a dataset sampled from an unknown distribution and a proposed distribution model.

I.1. Score Matching and Fisher Divergence

Let $x_1, \ldots, x_N \in \mathbb{R}^D$ be independent and identically distributed samples from a distribution $p_d(x)$, and let $p_m(x; a)$ be a learned, possibly unnormalized density, where $a$ is a parameter. Score matching sets out to reduce the Fisher divergence between $p_d$ and $p_m(\cdot; a)$, formally expressed as
$$\frac{1}{2}\, \mathbb{E}_{p_d}\!\left[\|s_m(x; a) - s_d(x)\|^2\right],$$
where $s_d = \nabla_x \log p_d$ and $s_m(\cdot; a) = \nabla_x \log p_m(\cdot; a)$. As detailed in (Hyvärinen, 2005), integration by parts yields an expression that does not depend on the unknown data score $s_d$:
$$L(a; x_1, \ldots, x_N) = \frac{1}{N}\sum_{i=1}^{N}\left[\operatorname{tr}\!\left(\nabla_x s_m(x_i; a)\right) + \frac{1}{2}\|s_m(x_i; a)\|^2\right] + C.$$
In this context, $C$ is a constant independent of $a$, $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, and $\nabla_x s_m(x_i; a) = \nabla^2_x \log p_m(x_i; a)$ is the Hessian of the learned log-density evaluated at $x_i$.

I.2. The Hessian Estimation for Small Values of a

Estimating the Hessian for small values of $a$ proves to be challenging. Small values of $a$ correspond to overfitting the training data, which leads to a density that is close to zero between samples; this makes the numerical computation of the derivatives highly susceptible to large errors, and the resulting Hessian estimate is noisy. Hence, our strategy is to locate a stable local minimum of the estimated FD (as a function of $a$) with the largest possible $a$. In this context, we define a stable local minimum as a point preceded and followed by three points, each with a larger value than the focal point.

I.3. Approximating the Hessian Trace

Although this method holds promise, the computation of the Hessian trace carries a considerable computational burden. To mitigate this, we rely on two techniques. First, we use Hutchinson's trace estimator (Hutchinson, 1989), a robust estimator of the trace of any matrix $H$ through a quadratic form with a random vector $\epsilon$:
$$\operatorname{Tr}(H) = \mathbb{E}_{\epsilon}\!\left[\epsilon^T H \epsilon\right].$$
Here $\epsilon$ is any random vector on $\mathbb{R}^d$ with mean zero and covariance $I$.
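The estimator above can be combined with a finite-difference approximation of the Hessian-vector product, as described next and summarized in Algorithm 1 below. The following is a minimal Python/PyTorch sketch of this idea, not the authors' implementation; the function names and the defaults n_probe and h are illustrative.

```python
import torch

def hutchinson_score_hessian_trace(score_fn, x, n_probe=10, h=1e-3):
    """Estimate tr(grad_x s(x)) for each row of x via Hutchinson probes and a
    finite-difference Hessian-vector product: eps^T (s(x + h*eps) - s(x)) / h."""
    trace_est = torch.zeros(x.shape[0])
    for _ in range(n_probe):
        eps = torch.randn_like(x)                       # E[eps eps^T] = I
        hvp = (score_fn(x + h * eps) - score_fn(x)) / h  # approx H(x) eps
        trace_est += (hvp * eps).sum(dim=-1)
    return trace_est / n_probe

def fisher_divergence_objective(score_fn, x, n_probe=10, h=1e-3):
    """Score-matching objective, up to an additive constant:
    mean over samples of tr(grad_x s(x)) + 0.5 * ||s(x)||^2."""
    trace_term = hutchinson_score_hessian_trace(score_fn, x, n_probe, h)
    norm_term = 0.5 * (score_fn(x) ** 2).sum(dim=-1)
    return (trace_term + norm_term).mean()
```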
Hutchinson's identity reduces the computation of $\operatorname{Tr}(H)$ to computing the products $H\epsilon$ directly for a few samples of $\epsilon$, without ever forming the full matrix $H$. A similar strategy was recently employed in (Grathwohl et al., 2018) in a different context, for the trace of the Jacobian of a density transformation rather than of the score itself. In more detail, score computations can be performed efficiently and lazily using automatic differentiation, as offered in frameworks such as PyTorch. This allows us to compute a Hessian-vector product $H\epsilon$ per sample without having to calculate the entire Hessian for all samples, a tensor of dimensions $N \times (d \times d)$, in advance. More specifically, we use PyTorch's automatic differentiation to compute the score function, which is an $N \times d$ matrix, and approximate the Hessian-vector product by the finite difference
$$\nabla_x s(x_i)\, \epsilon \approx \frac{1}{h}\left(s(x_i + h\epsilon) - s(x_i)\right)$$
for a small step $h$; the result is then contracted with $\epsilon$ and summed, which is lazily computed by PyTorch (see Algorithm 1).

Algorithm 1 Calculating Hutchinson's Trace Estimator
Require: Score function $s$, small constant $h$, sample $x$, number of random vectors $n$
1: Initialize traceEstimator to 0
2: for $i = 1$ to $n$ do
3: Sample a random vector $\epsilon$ from a normal distribution
4: Compute $s(a; x + h\epsilon)$
5: Compute $s(a; x + h\epsilon) - s(a; x)$
6: Compute the inner product $\left\langle (1/h)\left(s(a; x + h\epsilon) - s(a; x)\right),\, \epsilon \right\rangle$
7: Add the result to traceEstimator
8: end for
9: Return traceEstimator $/\, n$

J. Experiments

J.1. Evaluation on 2D Synthetic Datasets with Different Kernels

This section presents the application of SOSREP with different kernels on 2D toy datasets, following the methodology of (Song and Ermon, 2019). Through these comparisons, we aim to better understand what makes SOSREP particularly effective, especially with the SDO kernel.

Figure 5: Log densities learned by SOSREP with the Laplacian, Gaussian, and SDO kernels. Columns are different synthetic datasets. For visualization, we subtracted the maximum from each log density and clipped the minimum value at -9, following (Song and Ermon, 2019). Above each panel are shown the average log-likelihoods on a test set.

In Figure 5, we evaluate the SOSREP estimator using the SDO, Gaussian, and Laplacian kernels on a number of 2D datasets. These datasets were also used in (Song and Ermon, 2019) (Figure 2) to evaluate various state-of-the-art density estimation methods. The top row in Figure 5 displays the ground truth data, with columns representing different datasets and rows corresponding to the various kernels. The color intensity indicates the values of the log-likelihood. Additionally, the average log-likelihood (llk) was computed on a test set; details are provided below. We adhered closely to the experimental protocol of (Song and Ermon, 2019), using their code to generate the data. The exceptions were the Mixture of Gaussians (MOG) and Square datasets, for which we developed our own code, as the original code was not available. All datasets were generated with N = 10000 samples. Like the method studied in (Song and Ermon, 2019), SOSREP is not normalized. For the log-likelihood computation (llk), the normalization constant was estimated using the importance sampling procedure detailed in (Song and Ermon, 2019), Section 3.2, which we replicated for our experiments in Figure 2. In 2D, such approximations are reasonably effective.
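For concreteness, the following is a minimal sketch of a generic importance-sampling estimate of the normalization constant in 2D, under our own illustrative choices (a Gaussian proposal fitted to the data); the function estimate_log_partition and its parameters are ours and are not taken from the paper or from (Song and Ermon, 2019).

```python
import numpy as np

def estimate_log_partition(pre_density, data, n_samples=100_000, scale=1.5, rng=None):
    """Estimate log Z = log of the integral of an unnormalized 2-D density q,
    via importance sampling with a Gaussian proposal fitted to the data:
    Z_hat = mean_i q(x_i) / p(x_i), with x_i drawn from the proposal p."""
    rng = np.random.default_rng() if rng is None else rng
    mean, cov = data.mean(axis=0), scale * np.cov(data.T)
    xs = rng.multivariate_normal(mean, cov, size=n_samples)
    # Log density of the Gaussian proposal at the sampled points.
    d = data.shape[1]
    diff = xs - mean
    prec = np.linalg.inv(cov)
    log_p = (-0.5 * np.einsum("ni,ij,nj->n", diff, prec, diff)
             - 0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov))))
    log_q = np.log(pre_density(xs) + 1e-300)
    log_w = log_q - log_p
    # log-mean-exp of the importance weights, for numerical stability.
    return np.logaddexp.reduce(log_w) - np.log(n_samples)

# Normalized test log-likelihood of a pre-density f^2 at x: log f^2(x) - log Z.
```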
For clarity of visualization, the maximum log-likelihood was subtracted (setting the maximum llk to 0) and the values were clipped at -9. This adjustment was made solely for visualization purposes and not for the llk computation. The bandwidth parameter $a$ for SOSREP was selected based on the best llk on a verification set, using a grid search over 10 values. As shown in Figure 2 and in comparison to Figure 2 of (Song and Ermon, 2019), SOSREP is one of the few methods that consistently produces clean fits across all datasets. The only other method achieving similar results is "DKEF-G-15" and, to a lesser extent, "DKEF-G". It is noteworthy that almost all methods in Figure 2 of (Song and Ermon, 2019) assign a relatively high likelihood to the center of the multi-ring dataset, whereas SOSREP shows a distinct drop in likelihood at this point. Moreover, SOSREP accurately captures the sharp edges in the square and cosine datasets.

Figure 6: Anomaly Detection Results on ADBench. Relative Ranking Per Dataset, Higher is Better. SOSREP is Second Best Among 18 Algorithms.

J.2. AUC-ROC Performance Analysis

In addition to AUC-ROC, we focus on a ranking system in which the raw AUC-ROC scores on each dataset are converted into rankings from 1 to 18, with 18 indicating the best performance. This reduces the bias that comes from averaging AUC-ROC scores across datasets, particularly since scores tend to be higher on easier datasets. This is crucial, as no single AD method consistently excels across all scenarios, a point elaborated in (Han et al., 2022). In Figure 6, each algorithm is represented by a box plot. The average AUC-ROC ranking across all datasets is marked by an orange dot. The 25th and 75th percentiles define the box boundaries, while the median is indicated by the internal line. Algorithms are ordered by their average AUC-ROC ranking. Figure 7 presents a heatmap of the AUC-ROC values. In this visualization, circle sizes correlate with the AUC values, and the color gradient reflects deviations from SOSREP's performance, providing a graphical representation of how AUC-ROC levels compare with those of SOSREP.

J.3. Duplicate Anomalies

Duplicate anomalies are often encountered in applications due to factors such as recording errors (Kwon et al., 2018), a circumstance termed "anomaly masking" (Campos et al., 2016; Guha et al., 2016), and they pose significant hurdles for many AD algorithms. The significance of this factor is underscored in ADBench (Han et al., 2022), where duplicate anomalies are regarded as the most difficult setup for anomaly detection, and thus attract considerable attention. To replicate this scenario, ADBench duplicates anomalies up to six times within the training and test sets and then examines the resulting shifts in the performance of the AD algorithms across the 47 datasets discussed in our work. As shown in ADBench, unsupervised methods are considerably susceptible to repeated anomalies. In particular, as shown in Figure 7a of that paper, performance degradation is directly proportional to the amount of anomaly duplication. With anomalies duplicated six times, unsupervised methods record a median AUC-ROC reduction of 16.43%, whereas for SOSREP the drop is less than 2%. This widespread performance deterioration can be attributed to two factors: 1) the inherent assumption made by these methods that anomalies are a minority of the dataset, a presumption crucial for detection.
The violation of this assumption as the number of duplicated anomalies grows triggers the observed performance downturn. 2) Methods based on nearest neighbours assume that anomalies deviate significantly from the norm (so-called "point anomalies"). However, anomaly duplication mitigates this deviation, rendering the anomalies less distinguishable. Notice that while the first factor is less of a problem for a density based AD algorithm (as long as the anomalies still do not constitute the majority of the data), the second factor can be harmful to DE based AD algorithms as well. The fact that SOSREP has the highest median and is robust to duplicate anomalies, along with a probable resistance to a high number of anomalies, not only emphasizes its strength as a DE method but also underscores its potential to serve as a state-of-the-art AD method.

Figure 7: Heatmap of AUC-ROC values. Circle size represents the absolute value, and the color the shift from SOSREP.

J.4. Comparison of SOSREP for Different Kernels

In this section, we evaluate the SOSREP density estimator with various kernels on the ADBench task. Specifically, we consider the SDO kernel given by (11), the L2 Laplacian kernel, and the Gaussian kernel. The Laplacian (aka exponential) kernel is given by
$$k(x, y) = (1/\sigma^d)\, e^{-\|x - y\|/\sigma} \quad (105)$$
for $\sigma > 0$, where $\|x - y\|$ denotes the Euclidean norm on $\mathbb{R}^d$. The Gaussian kernel is given by
$$k(x, y) = (1/\sigma^d)\, e^{-\|x - y\|^2/(2\sigma^2)}. \quad (106)$$
Generally, in this task, we observe similar behavior across all kernels, although the Gaussian kernel underperforms on a few datasets. Figure 17 presents the performance of each kernel on the ADBench benchmark (see Section 6.1). Specifically, we plot the test set AUC values on the Y-axis against the different datasets on the X-axis. Note that due to convexity, the results are nearly identical for different initializations. Figures 16 and 15 display the ADBench results in a manner consistent with Section 6.1. It is noteworthy that the Laplacian kernel also achieves impressive results. Furthermore, when considering ranks, and thus mitigating the impact of extreme AUC values, the Gaussian kernel demonstrates significantly improved performance.

J.5. Per-Dataset Comparison of SOSREP for Different Kernels

Figure 12 presents the AUC-ROC results for SOSREP with the SDO kernel, compared to the other algorithms in ADBench. The datasets are displayed on the x-axis. For each dataset, the figure shows a box plot of the quantity
$$(\text{AUC-ROC of SOSREP}) - (\text{AUC-ROC of other algorithm}), \quad (107)$$
where "other algorithm" varies over the 17 other benchmark algorithms from ADBench. Positive values indicate a performance advantage for SOSREP. The datasets are ordered by the AUC-ROC of SOSREP. Figures 13 and 14 present the same plots for SOSREP with the Gaussian and Laplacian kernels, respectively.

Figure 8: Anomaly Detection Results on ADBench. Mean Relative Ranking Per Dataset, Higher is Better.
Figure 9: Anomaly Detection Results on ADBench. Mean AUC Per Dataset, Higher is Better.
Figure 10: Comparing SOSREP results on ADBench for different kernels.
Figure 11: Evaluating SOSREP with various kernels on the ADBench benchmark.
Figure 12: AUC-ROC results for SOSREP with the SDO kernel compared to the other algorithms in ADBench.
Figure 13: AUC-ROC results for SOSREP with the Gaussian kernel compared to the other algorithms in ADBench.
Sobolev Space Regularised Pre Density Models Figure 14: AUC-ROC results for SOSREP with the Laplace kernel compared to the other algorithms in ADBench. J.6. Hyperparameter Tuning with On-the-fly Fisher-Divergence Minimization A primary task at hand involves hyperparameter tuning to select the optimal a according to the Fisher-Divergence (FD) procedure, as detailed in section I and Section 6.1. For efficient computation, we employ an on-the-fly calculation approach. The process initiates with a a value that corresponds to the order -th largest tested point. Subsequently, we explore one larger and one smaller a value. If both these points exceed the currently tested point, we continue sampling. Otherwise, we shift the tested point to a smaller value. To avoid redundant calculations, the results are continually stored, ensuring each result is computed only once. This methodology provides a balance between computational efficiency and thorough exploration of the hyperparameter space. K. Algorithm Overview In this section, we outline the procedure for our density estimation algorithm. Let X RN d represent the training set and Y RN d denote the test set for which we aim to compute the density. Initially, we establish the following hyper-parameters: 1.Nz, the number of samples Z taken for the kernel estimation. (Notice Nz is T from (12)) 2.niters, the number of iterations used in the gradient decent approximating the optimization for α. 3.lr, the learning rate applied in the gradient decent α optimization process. 4.nfd_iters, the number of iterations for the Hessian trace approximation. 5.h, the step size employed for the Hessian trace. Then, we follow Algorithm 2. L. Figure 3 - List of Datasets Those are the datasets for which we compared the fractions of negative values for alpha optimization vs natural gradients presented in Figure 3, ordered according to the X-axis in the figure : Sobolev Space Regularised Pre Density Models Algorithm 2 Density Estimation Procedure with Hypeparameter Tuning 1: repeat 2: Estimate the SDO kernel matrix via sampling as described in Section 3 and detailed in Algorithm 3. 3: Determine the optimal α as described in Section 3 and detailed in Algorithm 4. optimal_α forms f := f foptimal_α and the density estimator is F 2 (f)2. 4: Compute F 2(Y ) the density over Y as described in Section 3 and detailed in Algorithm 5. 5: Assess the Fisher-divergence as described in Supplementary Material Section I and detailed in both Algorithm 6 and Algorithm 1. 6: until For each smoothness parameter a. 7: The density estimator F 2 corresponds to the a value that yields the minimal FD. Algorithm 3 K( ) R2 N : Sampling the Multidimensional SDO Kernel Require: X, Y, a 1: θ g g 2 ; g N(0, 1) 2: r rd 1 1+a(2πr)2m using grid search. 4: b U[0, 2π] 5: samples 1 Nz cos(X Z + b) (cos(Y Z + b))T Algorithm 4 Optimal_alpha( ) RN : Calculating the Optimal Alphas Require: X, lr, niters 1: K K(X, X) 2: α [|α0|, . . . , |αN|] : αi N(0, 1) 3: for i = 1 in niters do 4: α α 2 lr ((K α) (K (1./(K α)))/Ndata) 5: end for Algorithm 5 F 2( ) RNY : Density over coordinates Y given observations X. Require: X, Y 1: α Optimal_alpha(X, . . .) 2: e K(X, Y ) 3: density (e α)2 Algorithm 6 FD( ) R : Fisher Divergence with Hessian Trace Approximation. 
Require: X(Train set), Y (Test set), nfd_iters, h 1: scores P log(F 2(X, Y )) 2: traces_sum 0 3: for i = 1 in niters do 4: ϵ U({ 1, 0, 1}size(X)) 5: shifted_scores P log(F 2(X, Y + h ϵ)) 6: traces_sum traces_sum + P((shifted_scores scores) ϵ)/h 7: end for 8: traces traces_sum/niters 9: FD E[traces + 1 2 scores 2] Sobolev Space Regularised Pre Density Models 3. Page Blocks 4. Mammography 5. Magic.gamma 7. Backdoor 9. Lymphography 12. Spam Base 13. Hepatites Sobolev Space Regularised Pre Density Models M. Consistency Theorem In this Section we state and prove the consistency result for the SOSREP estimators. Recall that for any a > 0, Ha is the RKHS with the norm given by (9) and the kernel ka(x, y) given by (11). For any f H1 define the SOSREP loss L(f) = La(f) = 1 X log f 2(xi) + f 2 Ha Note that f H1 if and only if f Ha for every a > 0. Recall from the discussion in Section 3.3 and that L is convex when restricted to the open cone C = (Ca) = n f span {kxi}N 1 | f(xi) > 0 o . (109) Note that C depends on a since the kernel kxi = ka xi depends on a. Observe also that compared to Section 3.3, here we require only positivity on the data points xi rather than on all x Rd, and we restrict the cone to the span of xi since all the solutions will be obtained there in any case. This of course does not affect the convexity. The consistency result we prove is as follows: Theorem 12. Let x1, . . . , x N be i.i.d samples from a compactly supported density v2 on Rd, such that v H1. Set a = a(N) = 1/N, and let u N = u(x1, . . . , x N; a(N)) be the minimizer of the objective (108) in the cone (109). Then u N v L2 converges to 0 in probability. In words, when N grows, and the regularisation size a(N) decays as 1/N, the the SOSREP estimators u N converge to v in L2. As discussed in the main text, note also that since v L2 = 1 (as its a density), and since u N v L2 0, it follows by the triangle inequality that u N L2 1. That is, the estimator u N becomes approximately normalized as N grows. In Section M.1 below we provide an overview of the proof, and in Section M.2 full details are provided. M.1. Overview Of The Proof As discussed in Section 2, consistency for the 1 dimensional case was shown in (Klonias, 1984) and our approach here follows similar general lines. The differences between the arguments are due to the more general multi dimensional setting here, and due to some difference in assumptions. To simplify the notation in what follows we set a := Ha , , a := , Ha. Recall that ka(x, y) denotes the kernel corresponding to Ha, The first step of the proof is essentially a stability result for the optimization of (108), given by Lemma 13 below. In this Lemma we observe that L is strongly convex and thus for any function v, we have u v Ha L(v) Ha , (110) where u is the true minimizer of L in C (i.e. the SOSREP estimator). This is particular means that if one can show that the right hand side above is small for the true density v, then the solution u must be close to v. Thus we can concentrate on analyzing the simpler expression L(v) Ha rather than working with the optimizer u directly. Remarkably, this result is a pure Hilbert space result, and it holds for any kernel. We now thus turn to the task of bounding L(v) Ha. As discussed in Section 3.3, the gradient of L in Ha is given by X f(xi) 1kxi + f. (111) Opening the brackets in L(v) 2 Ha we have L(v) 2 a = 1 N 2 X i,j v 1(xi)v 1(xj)k(xi, xj) 2 1 i v 1(xi) kxi, v a + v 2 a (112) i,j v 1(xi)v 1(xj)k(xi, xj) 2 + v 2 a . 
(113) Sobolev Space Regularised Pre Density Models By definitions, we clearly have v 2 a 1 0 when a 0. Thus we have to show that the first term in (113) concentrates around 1. To this end, we will first show that the expectation of i,j v 1(xi)v 1(xj)k(xi, xj) (114) (with respect to xi s) converges to 1. This is really the heart of the proof, as here we show why v approximately minimizes the SOSREP objective, in expectation, by exploiting the interplay between the kernel, the regularizing coefficient, and the density v. This argument is carried out in Lemmas 14, 16, and 17. Once the expectation is understood, we simply use the Chebyshev inequality to control the deviation of (114) around its mean. This requires the control of cross products of the terms in (114) and can be achieved by arguments similar to those used to control the expectation. This analysis is carried out in Propositions 18 - 20, and Lemma 21. One of the technically subtle issues throughout the proof is the presence of the terms v 1(xi), due to which some higher moments and expectations are infinite. This prevents the use of standard concentration results and requires careful analysis. This is also the reason why obtaining convergence rates and high probability bounds is difficult, although we believe this is possible. M.2. Full Proof Details Throughout this section let L2 be the L2 space of the the Lebesgue measure on Rd, L2 = n f : Rd C | R Rd |f|2 dx o . Next, we observe that L(f) is strongly convex with respect to the norm Ha (see (Nesterov, 2003) for an introduction to strong convexity). As a consequence, we have the following: Lemma 13. Let u be the minimizer of L in C . Then for every v C , u v Ha L(v) Ha . (115) Proof. The function f 7 f 2 Ha is 2-strongly convex with respect to 2 Ha. Since L(f) is obtained by adding a convex function, and multiplying by 1 2, it follows that L(f) is 1-strongly convex. Strong convexity implies that for all u, v C we have L(v) L(u), v u a v u 2 a , (116) see (Nesterov, 2003), Theorem 2.1.9. Using the Cauchy Schwartz inequality and the fact that u is a local minimum with L(u) = 0, we obtain L(v) a u v a L(v), v u a v u 2 a , (117) yielding the result. Now, suppose the samples xi are generated from a true density (f )2. Let v := f be the square root of this density. Then, to show that the estimator u is close to v, it is sufficient to show that L(v) a is small. We write L(v) explicitly: L(v) 2 a = 1 N 2 X i,j v 1(xi)v 1(xj)k(xi, xj) 2 1 i v 1(xi) kxi, v a + v 2 a (118) i,j v 1(xi)v 1(xj)k(xi, xj) 2 + v 2 a . (119) Since v is a fixed function in H1, it is clear from definition (9) that v 2 a 1 0 as a 0. Thus, to bound L(v) 2 a, it suffices to bound 1 N 2 X i,j v 1(xi)v 1(xj)k(xi, xj) 1 (120) Sobolev Space Regularised Pre Density Models with high probability over xi, when N is large and a is small. Note that the form (11) of the kernel ka implies that this is a stationary kernel, i.e. ka(x, y) = ga(x y) with Rd e2πi x,z 1 + a (2π)2m z 2m dz. (121) Lemma 14. Let ka(x, y) be the SDO kernel defined by (11). For any function v L2, and x Rd set (Kav)(x) = Z ka(x, y)v(y)dy. (122) Then Ka is a bounded operator from L2 to L2, with Ka op 1 for every a > 0. Moreover, for every v L2, Kav v L2 0 with a 0. Proof. Note first that Ka, given by (122), is a convolution operator, i.e. Kav = ga v = R ga(x y)v(y)dy. Further, by the Fourier inversion formula, the Fourier transform of ga satisfies Fga(z) = 1 1+a (2π)2m z 2m (see Section D.1 for further details), and recall that F(ga v) = Fga Fv. 
Since |Fga(z)| 1 for every z and a > 0, this implies in particular that Ka has operator norm at most 1. Next, by the Plancharel equality we have Kav v 2 L2 = F(Kav v) 2 L2 (123) 1 + a (2π)2m z 2m 1 a (2π)2m z 2m 1 + a (2π)2m z 2m = Z |F(v)(z)|2 a (2π)2m z 2m 1 + a (2π)2m z 2m Fix ε > 0. For a radius r denote by B(r) = {z | z r} the ball of radius r, and let Bc(r) be its complement. Since R |F(v)(z)|2 dz < , there is r > 0 large enough such that R Bc(r) |F(v)(z)|2 dz ε. We bound (126) on B(r) and Bc(r) separately. Choose a > 0 such that a (2π)2mr2m 1 ε 1 2 v 1 L2 . Then Z |F(v)(z)|2 a (2π)2m z 2m 1 + a (2π)2m z 2m B(r) |F(v)(z)|2 a (2π)2m z 2m 1 + a (2π)2m z 2m Bc(r) |F(v)(z)|2 a (2π)2m z 2m 1 + a (2π)2m z 2m B(r) |F(v)(z)|2 dz + Z Bc(r) |F(v)(z)|2 dz (129) ε + ε. (130) This completes the proof. Assumption 15. Assume that the density v2 is compactly supported in Rd, and denote the support by B = supp(v2). Note in particular that this implies that B is of finite Lebesgue measure, λ(B) . We treat the components with i = j and i = j in the sum (120) separately. In the case i = j set Yi = v 2(xi)k(xi, xi) = ga(0)v 2(xi), where ga was defined in (121). Note that the variable Yi has an expectation, but does not necessarily have higher moments. Nevertheless, it is still possible to bound the sum using the Marcinkiewicz-Kolmogorov strong law of large numbers, see (Loève, 1977), Section 17.4, point 4 . We record it in the following Lemma, where we also allow a to depend on N, denoting a = a(N), and assume that a(N) does not decay too fast with N. Sobolev Space Regularised Pre Density Models Lemma 16. Assume that lim N a d 2m (N)/N = 0. Then i v 2(xi) 0 (131) almost surely, with N . Note that since 2m > d by construction of the kernels, a(N) = 1/N satisfies the above decay assumption. Proof. We have Ev 2(x) = Z v 2(x) v2(x)1{B}(x)dx = λ(B) < . (132) Thus the Marcinkiewicz-Kolmogorov law implies that i v 2(xi) λ(B) (133) almost surely. Next, recall that by Proposition 7, we have ga(0) = a d 2m g1(0). Our decay assumption on a(N) implies then that ga(N)(0)/N 0, which together with (133) completes the proof. Next, for i = j, set Yij = v 1(xi)v 1(xj)k(xi, xj). Lemma 17. For every a > 0 we have |EYij| 1, and moreover EYij 1 when a 0. Proof. Recall that v2 is a density, i.e. R v2(x)dx = 1, and that the operator Ka was defined in (122). We have Z v 1(x)v 1(y)ka(x, y)v2(x)v2(y)dxdy Z v2(x)dx (134) Z ka(x, y)v(x)v(y)dxdy Z v2(x)dx (135) Z dx v(x) [(Kav)(x) v(x)] (136) v L2 (Kav) v L2 (137) = (Kav) v L2 . (138) The second statement now follows from Lemma 14. For the first statement, write |EYij| = Kav, v L2 Kav L2 v L2 1, (139) where we have used the Cauchy-Schwartz inequality and the first part of Lemma 14. It thus remains to establish that 1 N2 P i =j(Yij EYij) converges to 0 in probability. We will show this by bounding the second moment, i =j (Yij EYij) i =j,i =j (EYij Yi j EYij EYi j ) (140) and then using the Chebyshev inequality. Observe that there are three types of terms of the form (EYij Yi j EYij EYi j ) on the right hand side of (140). The first type is when {i, j} = {i , j } as sets. Second is when |{i, j} {i , j }| = 1, and the third is when {i, j} and {i , j } are disjoint. In the following three propositions, we bound each type of terms separately. Proposition 18. There is a function of m, c(m) > 0, such that EY 2 ij λ(B)a d 2m c(m) < . (141) Sobolev Space Regularised Pre Density Models Proof. 
Write EY 2 ij = Z v 2(x)v 2(y)(ka(x, y))2v2(x)v2(y)dxdy (142) = Z 1{B}(x)1{B}(y)(ka(x, y))2dxdy (143) Z 1{B}(x) ka x 2 L2 dx (144) = λ(B)a d 2m c(m), (145) where we have used Lemma 8 to compute ka x 2 L2. For the |{i, j} {i , j }| = 1 case we have Proposition 19. Let i, j, t be three distinct indices. Then |EYij Yjt| 1. (146) EYij Yjt = Z v 1(xi)v 1(xj)ka(xi, xj)v 1(xj)v 1(xt)ka(xj, xt)v2(xi)v2(xj)v2(xt)dxidxjdxt (147) = Z v(xi)ka(xi, xj)v(xt)ka(xt, xj)1{B}(xj)dxidxjdxt (148) = Z (Kav)2(xj)1{B}(xj)dxj (149) Z (Kav)2(xj)dxj (150) = Kav 2 L2 (151) where we have used Lemma 14 on the last line. Finally, for the disjoint case, Proposition 20. Let i, j, i , j be four distinct indices. Then E(Yij EYij)(Yi j EYi j ) = 0. (153) Proof. When i, j, i , j are distinct, Yij and Yi j are independent. We now collect these results to bound (140). Lemma 21. There is a function c (m, B) > 0 of m, B, such that, choosing a(N) = 1/N, we have i =j (Yij EYij) for every N > 0. Proof. Observe first that all the expressions of the form EYij EYi j are bounded by 1, by Lemma 17. Next, note that there are O(N 2) terms of the first type. Since 2m > d, we have a d 2m = a d 2m (N) N, and thus, the overall contribution of such terms to the sum in (154) is O(N 4 N N 2) = O(1/N). Similarly, there are O(N 3) terms of the second type, each bounded by constant, and thus the overall contribution is O(N 4 N 3) = O(1/N). And finally, the contribution of the terms of the third type is 0. Sobolev Space Regularised Pre Density Models We now prove the main consistency Theorem. Of Theorem 3. First, observe that by definition u v L2 u v a for any u, v H1. Next, by Lemma 13 and using (119), we have u N v L2 u N v a 1 N 2 X i,j v 1(xi)v 1(xj)k(xi, xj) 2 + v 2 a . (155) Clearly, by definition (9), we have v 2 a 1 0 as a 0. Write i,j v 1(xi)v 1(xj)k(xi, xj) 1 = 1 N 2 X i =j (Yij EYij) + N 2 N N 2 EY12 1 . (156) The last term on the right hand side converges to 0 deterministically with a and N, by Lemma 17. The first terms converges strongly, and hence in probability, to 0, by Lemma 16. And finally, the middle term converges in probability to 0 by Lemma 21 and by Chebyshev inequality. N. Comparison of SOSREP for Different Kernels In this section, we evaluate the SOSREP density estimator with various kernels on the ADBench task. Specifically, we consider the SDO kernel given by (11), the L2 Laplacian kernel, and the Gaussain kernel. The Laplacian (aka Exponential) kernel is given by k(x, y) = (1/σd) e x y /σ (157) for σ > 0, where x y is the Euclidean norm on Rd. The Gaussian kernel is given by k(x, y) = (1/σd) e x y 2/(2σ2). (158) Generally, in this task, we observe a similar behavior across all kernels, although the Gaussian kernel underperforms on a few datasets. Figure 17 presents the performance of each kernel on the ADBench benchmark (see section 6.1). Specifically, we plot the test set AUC values on the Y-axis against the different datasets on the X-axis. Note that due to convexity, the results are nearly identical for different initializations. Figures 16 and 15 display the ADbench results in a manner consistent with Section 6.1. It is noteworthy that the Laplacian kernel also achieves impressive results. Furthermore, when considering ranks and thus mitigating the impact of extreme AUC values, the Gaussian kernel demonstrates significantly improved performance. Sobolev Space Regularised Pre Density Models Figure 15: Anomaly Detection Results on ADBench. Mean Relative Ranking Per Dataset, Higher is Better. 
Figure 16: Anomaly Detection Results on ADBench. Mean AUC Per Dataset, Higher is Better.
Figure 17: Comparing SOSREP results on ADBench for different kernels.
Figure 18: Evaluating SOSREP with various kernels on the ADBench benchmark.

O. Full Table of Contents for the Supplementary Materials

1. Basic Minimizer Properties: discussion of the minimizer of the SOSREP objective and its properties; lemma on the minimization properties; detailed proofs for the properties of the minimizer.
2. Derivation of the Gradients, Proof of Lemma 1: derivation of the standard and natural gradients of the objective; proof of the lemma regarding the gradients.
3. SDO Kernel Details: full proof of the theorem on the kernel form; SDO kernel derivation; details of the sampling approximation of the SDO kernel.
4. A Few Basic Properties of the Kernel: propositions on the real-valued nature of the kernel and its basic properties, with proofs.
5. Difference between SOSREP and KDE Models: analysis of the differences between the SOSREP and KDE estimators; empirical evaluation of the difference on real data.
6. KDE vs SOSREP Comparison Proofs: development of the proofs comparing KDE and SOSREP; solution of SOSREP for a 2-block model; proof of Proposition 9.
7. Invariance of C under Natural Gradient: discussion of the invariance of the non-negative cone under natural gradient steps; proposition on invariance and its proof.
8. Fisher Divergence for Hyper-Parameters Selection: the rationale behind using the Fisher divergence for hyperparameter selection; score matching and Fisher divergence details; the Hessian estimation for small values of a; approximating the Hessian trace.
9. Experiments: detailed analysis of AUC-ROC performance; discussion of duplicate anomalies and their impact; hyperparameter tuning with on-the-fly Fisher-divergence minimization.
10. Algorithm Overview: detailed overview and breakdown of the main algorithm used in the paper; specific algorithms for sampling the multidimensional SDO kernel, calculating the optimal alphas, the density over coordinates, and the Fisher divergence with Hessian trace approximation.
11. Figure 3 - List of Datasets: listing of the datasets used for comparing the fractions of negative values for alpha optimization vs natural gradients.
12. Consistency Theorem: statement and proof of the consistency result for the SOSREP estimators; detailed proof and related lemmas.
13. Comparison of SOSREP for Different Kernels: evaluation of SOSREP with various kernels on the ADBench task; performance comparison and discussion.