# Variational Weighting for Kernel Density Ratios

Sangwoong Yoon (Korea Institute for Advanced Study, swyoon@kias.re.kr), Frank C. Park (Seoul National University / Saige Research, fcp@snu.ac.kr), Gunsu Yun (POSTECH, gunsu@postech.ac.kr), Iljung Kim (Hanyang University, iljung0810@hanyang.ac.kr), Yung-Kyun Noh (Hanyang University / Korea Institute for Advanced Study, nohyung@hanyang.ac.kr)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Kernel density estimation (KDE) is integral to a range of generative and discriminative tasks in machine learning. Drawing upon tools from the multidimensional calculus of variations, we derive an optimal weight function that reduces bias in standard kernel density estimates for density ratios, leading to improved estimates of prediction posteriors and information-theoretic measures. In the process, we shed light on some fundamental aspects of density estimation, particularly from the perspective of algorithms that employ KDEs as their main building blocks.

1 Introduction

One fundamental component for building many applications in machine learning is a correctly estimated density for prediction and estimation tasks, with examples ranging from classification [1, 2], anomaly detection [3], and clustering [4] to the generalization of value functions [5], policy evaluation [6], and estimation of various information-theoretic measures [7, 8, 9]. Nonparametric density estimators, such as the nearest neighbor density estimator or kernel density estimators (KDEs), have been used as substitutes for the probability density component within the equation of the posterior probability, or the density-ratio equation, with theoretical guarantees derived in part from the properties of the density estimators used [10, 11].

Given a specific task which uses the ratio between two densities, $p_1(x)$ and $p_2(x)$, at a point $x \in \mathbb{R}^D$, we consider the ratio handled by the ratio of their corresponding two KDEs, $\widehat{p}_1(x)$ and $\widehat{p}_2(x)$:

$$\frac{\widehat{p}_1(x)}{\widehat{p}_2(x)} \;\xrightarrow{\ \text{Estimate}\ }\; \frac{p_1(x)}{p_2(x)}. \qquad (1)$$

Each estimator is a KDE which counts the effective number of data within a small neighborhood of $x$ by averaging the kernels. The biases produced by the KDEs in the numerator and denominator [12, Theorem 6.28] are combined to produce a single bias of the ratio, as demonstrated in Fig. 1. For example, the ratio $p_1(x_0)/p_2(x_0)$ at $x_0$ in Fig. 1(a) is clearly expected to be underestimated because of the dual effects in Fig. 1(b): the underestimation of the numerator $p_1(x_0)$ and the overestimation of the denominator $p_2(x_0)$. The underestimation is attributed to the concavity of $p_1(x)$ around $x_0$, which leads to a reduced number of data being generated compared to a uniform density. The overestimation of $p_2(x_0)$ can be explained similarly.

Figure 1: Estimation of the density ratio $p_1(x_0)/p_2(x_0)$ and bias correction using the weight function $\alpha(x)$. (a) Two density functions, $p_1(x)$ and $p_2(x)$, and the point of interest $x_0$ for ratio estimation. Two regions delineated by dashed lines are magnified in (b). (b) The concavity and convexity of the density functions around $x_0$ and their KDEs. The concave density $p_1(x)$ generates less data than the uniform density of level $p_1(x_0)$ around $x_0$, resulting in an underestimation. For a similar reason, the convex density $p_2(x)$ results in an overestimation. The two biases are combined into an underestimation of the ratio. (c) A KDE augmented with a nonsymmetric weight function $\alpha(x)$ can alleviate this bias by transforming the bias of $\widehat{p}_2(x)$ from an overestimation into an appropriate underestimation.
The second derivative (the Laplacian) that creates the concavity or convexity of the underlying density is a dominant factor that causes the bias in this example. The Laplacian of the density has been used to produce equations in various bias reduction methods, such as bias correction [13, 14, 15] and smoothing of the data space [16, 17]. However, the example in Fig. 1 motivates a novel, position-dependent weight function $\alpha(x)$ to be multiplied with the kernels in order to alleviate the bias. For example, to alleviate the overestimation of $p_2(x_0)$, we can consider the $\alpha(x)$ shown in Fig. 1(c), which assigns more weight to the kernels associated with data located to the left of $x_0$, which is a low-density region. When the weighted kernels are averaged, the overestimation of $\widehat{p}_2(x_0)$ can be mitigated or potentially even turned into an underestimation. Meanwhile, the bias of $\widehat{p}_1(x_0)$ remains unchanged after applying the weights since $p_1(x)$ is symmetric around $x_0$. This allows the reversed underestimation of $p_2(x_0)$, obtained from the initial overestimation, to effectively offset or counterbalance the underestimation of $p_1(x_0)$ within the ratio. We derive the $\alpha(x)$ function that performs this alleviation over the entire data space. The appropriate information for $\alpha(x)$ comes from the geometry of the underlying densities.

The aforementioned principle of bias correction leads to novel model-based and model-free approaches. Based on an assumption on the underlying densities, we learn the parameters for the densities' first and second derivatives and then variationally adjust $\alpha(x)$ for the estimator to create the variationally weighted KDE (VWKDE). We note that the model for those densities and their derivatives need not be exact because the goal is not to achieve precise density estimation but rather to accurately capture a well-behaved $\alpha(x)$ for the KDE ratios.

Applications include classification with posterior information and information-theoretic measure estimates using density-ratio estimation. Calibration of posteriors [18] has been of interest to many researchers, in part to provide a measure of correctness of the prediction. Plug-in estimators of information-theoretic measures, such as the Kullback-Leibler (K-L) divergence, can also be advantageous. For K-L divergence estimation, similar previous formulations for the variational approach have included optimizing a functional bound with respect to a function constrained within a reproducing kernel Hilbert space (RKHS) [19, 20, 21]. These and other methods that use weighted kernels (e.g., [22, 23, 24]) take advantage of the flexibility offered by universal approximator functions in the form of linear combinations of kernels. These methods, however, do not adequately explain why the weight optimization leads to an improved performance. Based on a derivation of how the bias is produced, we provide an explicit modification of the weights for standard kernel density estimates, with details of how the estimation is improved.

The remainder of the paper is organized as follows. In Section 2, we introduce the variational formulation for the posterior estimator and explain how to minimize the bias. Section 3 shows how a weight function can be derived using the calculus of variations, which is then extended to general density-ratio and K-L divergence estimation in Section 4. Experimental results are presented in Section 5. Finally, we conclude with a discussion in Section 6.
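To make the plain ratio estimator in Eq. (1) concrete, the following minimal sketch (an illustration, not the authors' released code) builds two standard Gaussian-kernel KDEs in one dimension and compares their ratio to the true density ratio; the specific means, variances, and bandwidth are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def kde(x, samples, h):
    """Standard KDE: average of Gaussian kernels with bandwidth h (1-D)."""
    return norm.pdf((x - samples[:, None]) / h).mean(axis=0) / h

# Two example densities (illustrative values only).
p1, p2 = norm(0.0, 1.1), norm(1.0, 0.9)
x1 = p1.rvs(1000, random_state=rng)
x2 = p2.rvs(1000, random_state=rng)

x = np.linspace(-3, 3, 61)
h = 0.5
ratio_hat = kde(x, x1, h) / kde(x, x2, h)   # Eq. (1): plug-in KDE ratio
ratio_true = p1.pdf(x) / p2.pdf(x)

# The curvature (Laplacian) of p1 and p2 produces a systematic gap between
# ratio_hat and ratio_true; the weight function alpha(x) introduced below
# is designed to cancel the leading term of this gap.
print(np.abs(ratio_hat - ratio_true).mean())
```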
2 Variationally Weighted KDE for Ratio Estimation

The KDE $\widehat{p}(x) = \frac{1}{N}\sum_{j=1}^{N} k_h(x, x_j)$ is conventionally the average of kernels. The average roughly represents the count of data within a small region around $x$, the size of which is determined by a bandwidth parameter $h$. The amount of convexity and concavity inside the region determines the bias of estimation, as depicted in Fig. 1(a),(b).

2.1 Plug-in estimator of posterior with weight

We consider a weighted KDE as a plug-in component adjusted for reliable ratio estimation using a positive and twice differentiable weight function $\alpha(x) \in \mathcal{A}$ with $\mathcal{A} = \{\alpha : \mathbb{R}^D \to \mathbb{R}_+ \mid \alpha \in C^2(\mathbb{R}^D)\}$. For two given sets of i.i.d. samples, $\mathcal{D}_1 = \{x_i\}_{i=1}^{N_1} \sim p_1(x)$ for class $y=1$ and $\mathcal{D}_2 = \{x_i\}_{i=N_1+1}^{N_1+N_2} \sim p_2(x)$ for class $y=2$, we use the following weighted KDE formulation:

$$\widehat{p}_1(x) = \frac{1}{N_1}\sum_{j=1}^{N_1} \alpha(x_j)\, k_h(x, x_j), \qquad \widehat{p}_2(x) = \frac{1}{N_2}\sum_{j=N_1+1}^{N_1+N_2} \alpha(x_j)\, k_h(x, x_j). \qquad (2)$$

Here, the two estimators use a single $\alpha(x)$. The kernel function $k_h(x, x')$ is a positive, symmetric, normalized, isotropic, and translation-invariant function with bandwidth $h$. The weight function $\alpha(x)$ informs the region that should be emphasized, and a constant function $\alpha(x) = c$ reproduces the ratios from the conventional KDE. We let their plug-in posterior estimator be $f(x)$, and the function can be calculated using

$$f(x) = \widehat{P}(y=1\,|\,x) = \frac{\widehat{p}_1(x)}{\widehat{p}_1(x) + \gamma\,\widehat{p}_2(x)}, \qquad (3)$$

with a constant $\gamma \in \mathbb{R}$ determined by the class priors.

2.2 Bias of the posterior estimator

We are interested in reducing the expectation of the squared bias:

$$\mathbb{E}[\mathrm{Bias}(x)^2] = \int \big( \mathbb{E}_{\mathcal{D}_1,\mathcal{D}_2}[f(x)] - P(y=1\,|\,x) \big)^2 p(x)\, dx. \qquad (4)$$

The problem of finding the optimal weight function can be reformulated as the following equation in Proposition 1.

Proposition 1. With small $h$, the expectation of the squared bias in Eq. (4) is minimized by any $\alpha(x)$ that eliminates the following function

$$B_{\alpha;p_1,p_2}(x) = (\nabla \log \alpha\,|_x)^\top h(x) + g(x), \qquad (5)$$

at every point $x$. Here, $h(x) = \frac{\nabla p_1}{p_1} - \frac{\nabla p_2}{p_2}$ and $g(x) = \frac{1}{2}\Big( \frac{\nabla^2 p_1}{p_1} - \frac{\nabla^2 p_2}{p_2} \Big)$, with gradient and Laplacian operators $\nabla$ and $\nabla^2$, respectively. All derivatives are with respect to $x$.

The derivation of Eq. (5) begins with the expectation of the weighted KDE:

$$\mathbb{E}_{\mathcal{D}_1}[\widehat{p}_1(x)] = \mathbb{E}_{x' \sim p_1(x)}[\alpha(x')\, k_h(x, x')] = \int \alpha(x')\, p_1(x')\, k_h(x, x')\, dx' \qquad (6)$$

$$= \alpha(x)\, p_1(x) + \frac{h^2}{2}\, \nabla^2[\alpha(x) p_1(x)] + O(h^3). \qquad (7)$$

Along with the similar expansion for $\mathbb{E}_{\mathcal{D}_2}[\widehat{p}_2(x)]$, the plug-in posterior can be perturbed with small $h$:

$$\mathbb{E}_{\mathcal{D}_1,\mathcal{D}_2}[f(x)] \approx \frac{\mathbb{E}_{\mathcal{D}_1}[\widehat{p}_1(x)]}{\mathbb{E}_{\mathcal{D}_1}[\widehat{p}_1(x)] + \gamma\,\mathbb{E}_{\mathcal{D}_2}[\widehat{p}_2(x)]} \qquad (8)$$

$$= f(x) + \frac{h^2}{2}\, \frac{\gamma p_1(x) p_2(x)}{(p_1(x) + \gamma p_2(x))^2} \left( \frac{\nabla^2[\alpha(x) p_1(x)]}{\alpha(x) p_1(x)} - \frac{\nabla^2[\alpha(x) p_2(x)]}{\alpha(x) p_2(x)} \right) + O(h^3)$$

$$= f(x) + \frac{h^2}{2}\, P(y=1|x)\, P(y=2|x)\, B_{\alpha;p_1,p_2}(x) + O(h^3), \qquad (9)$$

with the substitution

$$B_{\alpha;p_1,p_2}(x) \triangleq \frac{\nabla^2[\alpha(x) p_1(x)]}{\alpha(x) p_1(x)} - \frac{\nabla^2[\alpha(x) p_2(x)]}{\alpha(x) p_2(x)}. \qquad (10)$$

The point-wise leading-order bias can be written as

$$\mathrm{Bias}(x) = \frac{h^2}{2}\, P(y=1|x)\, P(y=2|x)\, B_{\alpha;p_1,p_2}(x). \qquad (11)$$

Here, $B_{\alpha;p_1,p_2}(x)$ includes the second derivatives of $\alpha(x) p_1(x)$ and $\alpha(x) p_2(x)$. Because the two classes share the weight function $\alpha(x)$, Eq. (10) can be simplified into two terms without the second derivative of $\alpha(x)$:

$$B_{\alpha;p_1,p_2}(x) = 2\,\frac{\nabla \alpha(x)^\top}{\alpha(x)} \left( \frac{\nabla p_1(x)}{p_1(x)} - \frac{\nabla p_2(x)}{p_2(x)} \right) + \frac{\nabla^2 p_1(x)}{p_1(x)} - \frac{\nabla^2 p_2(x)}{p_2(x)}, \qquad (12)$$

which, up to a constant factor of two that does not affect its elimination, leads to Eq. (5) in the Proposition. The detailed derivation in this section can be found in Appendix A.

Figure 2: Posterior estimates with KDEs and VWKDEs for two 20-dimensional homoscedastic Gaussian densities. (a) Bias corrected by VWKDE. (b) Bias and variance of posterior estimates depending on the bandwidth for the standard KDE and for (c) the VWKDE.
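The weighted estimators of Eqs. (2)-(3) can be written in a few lines. The sketch below is an illustration under the assumption of a 1-D Gaussian kernel (it is not the authors' implementation): it accepts an arbitrary weight function `alpha`, and a constant weight recovers the conventional KDE plug-in posterior.

```python
import numpy as np
from scipy.stats import norm

def weighted_kde(x, samples, alpha, h):
    """Eq. (2): (1/N) * sum_j alpha(x_j) k_h(x, x_j) with a Gaussian kernel (1-D)."""
    w = alpha(samples)                                   # weights at the data points
    k = norm.pdf((x[None, :] - samples[:, None]) / h) / h
    return (w[:, None] * k).mean(axis=0)

def plugin_posterior(x, d1, d2, alpha, h, gamma=1.0):
    """Eq. (3): hat P(y=1|x) = p1_hat / (p1_hat + gamma * p2_hat),
    both estimators sharing the same weight function alpha."""
    p1 = weighted_kde(x, d1, alpha, h)
    p2 = weighted_kde(x, d2, alpha, h)
    return p1 / (p1 + gamma * p2)

# A constant weight reproduces the standard (unweighted) KDE plug-in posterior.
const_alpha = lambda s: np.ones_like(s)
```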
2.3 Plug-in estimator of K-L divergence with weight

In order to estimate $\mathrm{KL}(p_1\|p_2) = \mathbb{E}_{x \sim p_1}\!\big[ \log \frac{p_1(x)}{p_2(x)} \big]$, we consider the following plug-in estimator:

$$\widehat{\mathrm{KL}}(p_1\|p_2) = \frac{1}{N_1}\sum_{i=1}^{N_1} \log \frac{\widehat{p}_1(x_i)}{\widehat{p}_2(x_i)}, \qquad (13)$$

using the points $x_i \in \mathcal{D}_1$ for Monte Carlo averaging. When we calculate $\widehat{p}_1$ at $x_i$, we exclude $x_i$ from the KDE samples: we use $\widehat{p}_1(x_i) = \frac{1}{N_1-1}\sum_{j=1}^{N_1} \alpha(x_j)\, k_h(x_i, x_j)\, \mathbb{1}(i \neq j)$ with the indicator function $\mathbb{1}(I)$, which is 1 if $I$ is true and 0 otherwise.

Proposition 2. With small $h$, the expectation of the squared bias is minimized by finding any $\alpha(x)$ that eliminates the same function $B_{\alpha;p_1,p_2}(x)$ as in Eq. (5) of Proposition 1 at each point $x$.

In the task of estimating the K-L divergence, Proposition 2 claims that we obtain the same $B_{\alpha;p_1,p_2}(x)$ as in Eq. (5) during the derivation of the bias. The pointwise bias in the K-L divergence estimator can be written as

$$\mathrm{Bias}(x) = \frac{h^2}{2}\, \frac{p_1(x)}{p_2(x)}\, B_{\alpha;p_1,p_2}(x), \qquad (14)$$

with $B_{\alpha;p_1,p_2}(x)$ equivalent to Eq. (5).

3 Variational Formulation and Implementation

Now we consider the $\alpha(x)$ that minimizes the mean squared error of the estimation:

$$\int \big( (\nabla \log \alpha\,|_x)^\top h(x) + g(x) \big)^2 r(x)\, dx, \qquad (15)$$

with $h(x) = \frac{\nabla p_1}{p_1} - \frac{\nabla p_2}{p_2}$. The $r(x)$ function depends on the problem: $r(x) = P(y=1|x)^2 P(y=2|x)^2 p(x)$ for posterior estimation and $r(x) = \big(\frac{p_1(x)}{p_2(x)}\big)^2 p(x)$ for K-L divergence estimation, with the total density $p(x) = (p_1(x) + \gamma p_2(x)) P(y=1)$. The calculus of variations for optimizing the functional in Eq. (15) provides an equation that the optimal $\alpha(x)$ should satisfy:

$$\nabla \cdot \Big[ r \big( (\nabla \log \alpha)^\top h + g \big)\, h \Big] = 0. \qquad (16)$$

The detailed derivation of this equation can be found in Appendix B.

3.1 Gaussian density and closed-form solution for α(x)

A simple analytic solution for this optimality condition can be obtained for two homoscedastic Gaussian density functions. The density functions have two different means, $\mu_1 \in \mathbb{R}^D$ and $\mu_2 \in \mathbb{R}^D$, but share a single covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$:

$$p_1(x) = \mathcal{N}(x; \mu_1, \Sigma), \qquad p_2(x) = \mathcal{N}(x; \mu_2, \Sigma). \qquad (17)$$

One solution for this homoscedastic setting can be obtained as

$$\alpha(x) = \exp\!\Big( -\tfrac{1}{2}\,(x - \bar{\mu})^\top A\,(x - \bar{\mu}) \Big), \qquad (18)$$

with $\bar{\mu} = \frac{\mu_1 + \mu_2}{2}$ and $A = b\Big( I - \frac{\Sigma^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top \Sigma^{-1}}{\|\Sigma^{-1}(\mu_1 - \mu_2)\|^2} \Big) - \Sigma^{-1}$, using an arbitrary constant $b$. Due to the choice of $b$, the solution is not unique. All the solutions produce a zero bias. The detailed derivation can be found in Appendix C. The reduction of the bias using Eq. (18) with estimated parameters is shown in Fig. 2.
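As a sanity check on the closed form, the short sketch below (illustrative only; the dimension, means, and covariance are arbitrary) evaluates $(\nabla \log \alpha)^\top h(x) + g(x)$ at random points for the particular solution with $b$ chosen so that $\log\alpha(x) = \tfrac{1}{2}(x-\bar\mu)^\top\Sigma^{-1}(x-\bar\mu)$, and confirms numerically that the leading bias term vanishes for two homoscedastic Gaussians.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 5
mu1, mu2 = rng.normal(size=D), rng.normal(size=D)
L = rng.normal(size=(D, D))
Sigma = L @ L.T + D * np.eye(D)            # arbitrary SPD covariance
Sinv = np.linalg.inv(Sigma)
mu_bar = 0.5 * (mu1 + mu2)

def score(x, mu):
    """Gradient of log N(x; mu, Sigma)."""
    return -Sinv @ (x - mu)

for _ in range(5):
    x = rng.normal(size=D)
    h = score(x, mu1) - score(x, mu2)       # h(x) = grad p1/p1 - grad p2/p2
    # g(x) = (1/2)(lap p1/p1 - lap p2/p2); the common -tr(Sigma^{-1}) terms cancel,
    # leaving half the difference of squared score norms.
    g = 0.5 * (score(x, mu1) @ score(x, mu1) - score(x, mu2) @ score(x, mu2))
    grad_log_alpha = Sinv @ (x - mu_bar)    # particular solution of Eq. (18)
    print(abs(grad_log_alpha @ h + g))      # numerically zero: leading bias vanishes
```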
3.2 Implementation

In this work, we propose a model-free method and a model-based method. The model-free approach uses the information of $\widehat{\nabla \log p}_1(x)$ and $\widehat{\nabla \log p}_2(x)$ estimated by a score matching neural network [25]. We obtain the second derivative, $\widehat{\nabla^2 \log p}$, by automatic differentiation of the neural network for the scores. We then obtain $\widehat{\frac{\nabla^2 p}{p}}$ using the identity $\widehat{\frac{\nabla^2 p}{p}} = \widehat{\nabla^2 \log p} + \widehat{\nabla \log p}^\top \widehat{\nabla \log p}$. With the outputs of the neural networks for $\widehat{\nabla \log p}$ and $\widehat{\frac{\nabla^2 p}{p}}$, we train a new network for the function $\alpha(x; \theta)$ with neural network parameters $\theta$.

On the other hand, the model-based approach uses a coarse Gaussian model for the class-conditional densities. The Gaussian functions for each class have their estimated parameters $\widehat{\mu}_1, \widehat{\mu}_2 \in \mathbb{R}^D$ and $\widehat{\Sigma}_1, \widehat{\Sigma}_2 \in \mathbb{R}^{D \times D}$. We use the score information from these parameters: $\widehat{\nabla \log p}_1(x) = -\widehat{\Sigma}_1^{-1}(x - \widehat{\mu}_1)$ and $\widehat{\nabla \log p}_2(x) = -\widehat{\Sigma}_2^{-1}(x - \widehat{\mu}_2)$. In the model-based approach, we let the log of $\alpha(x)$ be a function within the RKHS with basis kernels $\kappa_\sigma(\cdot, \cdot)$ with kernel parameter $\sigma$. We let $\log \alpha(x; \theta) = \sum_{i=1}^{N_1+N_2} \theta_i\, \kappa_\sigma(x, x_i)$ with parameters $\theta = \{\theta_1, \ldots, \theta_{N_1+N_2}\}$ for optimization.

The weight function $\alpha(x)$ is obtained by optimizing the following objective function with the $N_1$ data points from $p_1(x)$ and the $N_2$ data points from $p_2(x)$:

$$\sum_{i=1}^{N_1+N_2} \Big[ \big( \nabla \log \alpha(x_i;\theta)^\top \widehat{h}(x_i) \big)^2 + 2\, \nabla \log \alpha(x_i;\theta)^\top \widehat{h}(x_i)\, \widehat{g}(x_i) \Big], \qquad (19)$$

with the substitutions $\widehat{h}(x) = \widehat{\nabla \log p}_1(x) - \widehat{\nabla \log p}_2(x)$ and $\widehat{g}(x) = \frac{1}{2}\Big( \widehat{\frac{\nabla^2 p_1}{p_1}}(x) - \widehat{\frac{\nabla^2 p_2}{p_2}}(x) \Big)$; this is the empirical counterpart of the squared term $\big((\nabla\log\alpha)^\top \widehat{h} + \widehat{g}\big)^2$ with the $\theta$-independent $\widehat{g}^{\,2}$ term dropped. In the model-based method, the addition of an $\ell_2$ regularizer, $\lambda \sum_{i=1}^{N_1+N_2} \theta_i^2$, with a small positive constant $\lambda$ makes the optimization of (19) quadratic. When there are fewer than 3,000 samples, we use all of them as basis points. Otherwise, we randomly sample 3,000 points from $\{x_i\}_{i=1}^{N_1+N_2}$. A brief summary of the implementation process is shown in Algorithms 1 and 2.¹

¹Code is available at https://github.com/swyoon/variationally-weighted-kernel-density-estimation

Algorithm 1 Model-free
Input: $x$, $\{x_i\}_{i=1}^{N_1} \sim p_1$, $\{x_i\}_{i=N_1+1}^{N_1+N_2} \sim p_2$
Output: Ratio $\widehat{R}(x)$ $\big(= \widehat{p}_1(x)/\widehat{p}_2(x)\big)$
Procedure:
1. Estimate $\widehat{\nabla \log p}_1$ using $\{x_i\}_{i=1}^{N_1} \sim p_1$.
2. Estimate $\widehat{\nabla \log p}_2$ using $\{x_i\}_{i=N_1+1}^{N_1+N_2} \sim p_2$.
3. Obtain $\alpha(x)$ that minimizes Eq. (19).
4. $\widehat{R}(x) = \dfrac{\frac{1}{N_1}\sum_{i=1}^{N_1} \alpha(x_i)\, k_h(x, x_i)}{\frac{1}{N_2}\sum_{i=N_1+1}^{N_1+N_2} \alpha(x_i)\, k_h(x, x_i)}$
Return $\widehat{R}(x)$

Algorithm 2 Model-based
Input: $x$, $\{x_i\}_{i=1}^{N_1} \sim p_1$, $\{x_i\}_{i=N_1+1}^{N_1+N_2} \sim p_2$
Output: Ratio $\widehat{R}(x)$ $\big(= \widehat{p}_1(x)/\widehat{p}_2(x)\big)$
Procedure:
1. Estimate $\widehat{\mu}_1, \widehat{\Sigma}_1$ using $\{x_i\}_{i=1}^{N_1} \sim p_1$.
2. Estimate $\widehat{\mu}_2, \widehat{\Sigma}_2$ using $\{x_i\}_{i=N_1+1}^{N_1+N_2} \sim p_2$.
3. Use $\widehat{\nabla \log p}_c(x) = -\widehat{\Sigma}_c^{-1}(x - \widehat{\mu}_c)$ to obtain $\alpha(x)$ that minimizes Eq. (19).
4. $\widehat{R}(x) = \dfrac{\frac{1}{N_1}\sum_{i=1}^{N_1} \alpha(x_i)\, k_h(x, x_i)}{\frac{1}{N_2}\sum_{i=N_1+1}^{N_1+N_2} \alpha(x_i)\, k_h(x, x_i)}$
Return $\widehat{R}(x)$
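For the model-based route, a minimal sketch of the quadratic fit is shown below. It assumes a Gaussian basis kernel $\kappa_\sigma$, a ridge term, and a closed-form solve; the function name, kernel choice, and hyperparameters are illustrative assumptions and may differ from the released implementation.

```python
import numpy as np

def fit_model_based_alpha(X, h_hat, g_hat, Z, sigma=1.0, lam=1e-3):
    """Fit log alpha(x) = sum_m theta_m * k_sigma(x, z_m) by minimizing the
    empirical objective of Eq. (19) plus an l2 regularizer lam * ||theta||^2.

    X:     (N, D) sample locations      h_hat: (N, D) estimated h(x_i)
    g_hat: (N,)   estimated g(x_i)      Z:     (M, D) basis points
    """
    diff = X[:, None, :] - Z[None, :, :]                      # (N, M, D)
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * sigma**2))    # Gaussian basis kernel
    grad_K = -diff / sigma**2 * K[..., None]                  # grad_x k_sigma(x_i, z_m)
    # G[i, m] = grad_x k_sigma(x_i, z_m) . h_hat(x_i), so (G @ theta)_i equals
    # grad log alpha(x_i)^T h_hat(x_i).
    G = np.einsum('nmd,nd->nm', grad_K, h_hat)
    # Objective ||G theta||^2 + 2 (G theta)^T g_hat + lam ||theta||^2 is quadratic
    # in theta, with the closed-form minimizer below.
    theta = -np.linalg.solve(G.T @ G + lam * np.eye(Z.shape[0]), G.T @ g_hat)

    def log_alpha(x_query):
        d = x_query[:, None, :] - Z[None, :, :]
        return np.exp(-np.sum(d**2, axis=-1) / (2 * sigma**2)) @ theta
    return log_alpha
```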
4 Interpretation of α(x) for Bias Reduction

The process of finding $\alpha(x)$ that minimizes the square of $B_{\alpha;p_1,p_2}(x)$ in Eq. (5) can be understood from various perspectives through reformulation.

4.1 Cancellation of the bias

The second term $g(x)$ in Eq. (5) repeatedly appears in the bias of nonparametric procedures using discrete labels [26]. In our derivation, this term is obtained with a constant $\alpha(x)$, or equivalently with no weight function. The role of the weight function is to control the first term, $(\nabla\log\alpha)^\top h$, based on the gradient information in order to let the first term cancel the second.

4.2 Cancellation of flow in a mechanical system

The equation for each class can be compared with the mechanical convection-diffusion equation, $\frac{\partial u}{\partial t} = -v^\top \nabla u + D\,\nabla^2 u$, which is known as the equation for Brownian motion under gravity [27] or the advective diffusion equation of an incompressible fluid [28]. In the equation, $u$ is the mass of the fluid, $t$ is the time, $v$ is the direction of convection, and $D$ is the diffusion constant. The amount of mass change is the sum of the convective movement of mass along the direction opposite to $\nabla u$ and the diffusion from the neighborhood. We reorganize Eq. (5) into the following equation establishing the difference between the convection-diffusion equations of two fluids:

$$B_{\alpha;p_1,p_2}(x) = \Big[ \big( \nabla\log\alpha + \tfrac{1}{2}\nabla\log p_1 \big)^\top \nabla\log p_1 + \tfrac{1}{2}\nabla^2\log p_1 \Big] - \Big[ \big( \nabla\log\alpha + \tfrac{1}{2}\nabla\log p_2 \big)^\top \nabla\log p_2 + \tfrac{1}{2}\nabla^2\log p_2 \Big]. \qquad (20)$$

According to the equation, we can consider two different fluid mass functions, $u_1(x) = \log p_1(x)$ and $u_2(x) = \log p_2(x)$, and the original convection movement along the directions $v_1^* = -\tfrac{1}{2}\nabla\log p_1$ and $v_2^* = -\tfrac{1}{2}\nabla\log p_2$. If we find an $\alpha(x)$ that modifies the convection directions $v_1^*$ and $v_2^*$ to $v_1 = v_1^* - \nabla\log\alpha$ and $v_2 = v_2^* - \nabla\log\alpha$ such that a mass change in one fluid is compensated by the change of the other, in other words, such that $\frac{\partial u_1}{\partial t} = \frac{\partial u_2}{\partial t}$, then this $\alpha(x)$ is the weight function that minimizes the leading term of the bias for ratio estimation.

4.3 Prototype modification in reproducing kernel Hilbert space (RKHS)

A positive definite kernel function has its associated RKHS. A classification using the ratio of KDEs corresponds to a prototype classification in the RKHS that determines which of the two class means is closer to the mapped point [29, Section 1.2]. The application of a weight function corresponds to finding a prototype different from the mean [30]. The relationship between the newly found prototype and the KDEs has been previously discussed [31].

Figure 3: Estimation results of the LPDR $\log(p_1/p_2)$ and the K-L divergence. (a) Estimation of the LPDR at each point; the estimation bias from the true LPDR is reduced by using VWKDE. (b) Squared bias and variance of the estimation with respect to the bandwidth $h$; the bias is significantly reduced without increasing the variance. (c) Mean and standard deviation of K-L divergence estimates with respect to the bandwidth $h$.

Figure 4: K-L divergence estimation results for synthetic distributions (panels: 2D/20D/50D Isotropic, 10D/20D Non-isotropic Heteroscedastic, and 50D Varying Mean Diff; horizontal axes show the number of samples per class or the distance between means, vertical axes the estimated K-L divergence). (NN) Nearest-neighbor estimator; (NNG) NN estimator with metric learning [32]; (KLIEP) Direct importance estimation [21]; (NNGarcia) Risk-based f-divergence estimator [33]; (NNWang) Bias-reduced NN estimator [34]; (MINE) Mutual Information Neural Estimation [35]; (Ensemble) Weighted ensemble KDE estimator [36]; (von Mises) KDE estimator with von Mises expansion bias correction [37]; (KDE) KDE estimator; (VWKDE-MB, VWKDE-MF) Model-based and model-free versions of the proposed estimator.

5 Experiments

5.1 Estimation of log probability density ratio and K-L divergence in 1D

We first demonstrate in Fig. 3 how the use of VWKDE alters log probability density ratio (LPDR) and K-L divergence estimation toward a better estimate. We use two 1-dimensional Gaussians, $p_1(x)$ and $p_2(x)$, with means 0 and 1 and variances $1.1^2$ and $0.9^2$, respectively. We draw 1,000 samples from each density and construct KDEs and model-based VWKDEs for both LPDR and K-L divergence estimation. For the LPDR evaluation, we draw a separate 1,000 samples from each density, and the average squared bias and the variance at those points are calculated and presented in Fig. 3(b). The K-L divergence estimation result is shown in Fig. 3(c), where the true K-L divergence can be calculated analytically as $\mathrm{KL}(p_1\|p_2) \approx 0.664$, and the estimated values are compared with this true value. KDE-based LPDR estimation exhibits a severe bias, but this is effectively reduced by using VWKDE as an alternative plug-in. Although VWKDE slightly increases the variance of the estimation, the reduction of the bias is substantial in comparison. Note that since the bias is small over a wide range of $h$, VWKDE yields a K-L divergence estimate which is relatively insensitive to the choice of $h$.
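For reference, a minimal sketch of the plug-in K-L estimator of Eq. (13) with leave-one-out evaluation on the 1-D setup of this section is given below, using a plain KDE (constant weight); the bandwidth is an illustrative choice, and the closed-form value for two 1-D Gaussians is included as a check.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p1, p2 = norm(0.0, 1.1), norm(1.0, 0.9)
x1 = p1.rvs(1000, random_state=rng)
x2 = p2.rvs(1000, random_state=rng)
h = 0.6

def kde_at(points, samples, h, loo=False):
    k = norm.pdf((points[:, None] - samples[None, :]) / h) / h
    if loo:                      # leave-one-out: drop k_h(x_i, x_i) on the diagonal
        np.fill_diagonal(k, 0.0)
        return k.sum(axis=1) / (samples.size - 1)
    return k.mean(axis=1)

# Eq. (13): Monte Carlo average of log(p1_hat / p2_hat) over samples from p1.
kl_hat = np.mean(np.log(kde_at(x1, x1, h, loo=True) / kde_at(x1, x2, h)))

# Closed form for two 1-D Gaussians; ~0.664 for the values used here.
m1, s1, m2, s2 = 0.0, 1.1, 1.0, 0.9
kl_true = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(kl_hat, kl_true)
```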
5.2 Synthetic distributions

We run the VWKDE-based K-L divergence estimator along with other state-of-the-art estimators to estimate the K-L divergence $\mathrm{KL}(p_1\|p_2)$ between two synthetic Gaussian distributions $p_1$ and $p_2$ having $\mu_1$ and $\mu_2$ as their mean vectors and $\Sigma_1$ and $\Sigma_2$ as their covariance matrices, respectively.

Figure 5: Estimation of the K-L divergence between two non-Gaussian densities, $p_1(x)$ and $p_2(x)$ (panels: 2D Non-Gaussian and 20D Non-Gaussian; estimated K-L divergence versus the number of samples per class). Each density is a mixture of three Gaussians, as shown in the 2-dimensional density contour in the figure on the left. These are the true densities but are very dissimilar to a single Gaussian model. The figure in the middle shows the estimation with 2-dimensional data, and the figure on the right shows the estimation with 20-dimensional data. With 20-dimensional data, the remaining 18 dimensions are identical-mean isotropic Gaussians without correlation to the first two dimensions.

Figure 6: Detection of an artificially injected defect (MNIST digit "3"). The first panel shows the image with an injected defect. The remaining three panels show the detection scores of different methods.

We use three different settings for $p_1$ and $p_2$: Isotropic (Iso), Non-isotropic Heteroscedastic (NH), and Varying Mean Diff (VMD). In Iso, $\Sigma_1 = \Sigma_2 = I$ for an identity matrix $I$, and $\mu_1 = 0$ for a zero vector $0$. The first element of $\mu_2$ is 2, while the other elements are uniformly zero, resulting in $\mathrm{KL}(p_1\|p_2) = 1$. In NH, $\mu_1 = \mu_2 = 0$ and $\Sigma_1 = I$. $\Sigma_2$ is a matrix with one pair of off-diagonal elements $(\Sigma_2)_{1,2} = (\Sigma_2)_{2,1} = 0.1$, while the other off-diagonal elements are zero. The diagonal elements have a constant value $\omega$, which is determined to yield $\mathrm{KL}(p_1\|p_2) \approx 1.0$ with $\omega = 0.7502$ (10D) and $\mathrm{KL}(p_1\|p_2) \approx 0.5$ with $\omega = 0.8632$ (20D). In VMD, the first element of $\mu_2$ takes various values between 0 and 2, while $\mu_1 = 0$ and the other elements of $\mu_2$ remain zero; $\Sigma_1 = \Sigma_2 = I$. In Iso and NH, we vary the sample size, and in VMD, we use 2,000 samples per distribution. We repeat each experiment 30 times and display the mean and the standard deviation in Figure 4.

In the upper left panel of Figure 4, we observe that almost all algorithms estimate the low-dimensional K-L divergence reliably, but the results deteriorate dramatically with high dimensionality, as shown in the upper middle and right panels. As shown in the lower left and lower middle panels, most of the baseline methods fail to produce estimates near the true value when the data are correlated. The model-based VWKDE estimator is the only estimator that recovers the true value in the 20D NH case.

Figure 5 shows the K-L divergence estimation for non-Gaussian densities. In this example, the model for the score function in the model-based VWKDE is different from the data-generating density, yet the estimator still shows very reliable estimates.
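The ground-truth values in these synthetic settings follow from the standard closed form of the K-L divergence between two multivariate Gaussians; the helper below states that identity (the function name and the example covariances are illustrative, not taken from the paper's code).

```python
import numpy as np

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """KL( N(mu1, Sigma1) || N(mu2, Sigma2) ) in closed form."""
    D = mu1.shape[0]
    S2inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ Sigma1)
                  + diff @ S2inv @ diff
                  - D
                  + np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)))

# Example usage with arbitrary 10-D covariances (values here are illustrative only).
D = 10
mu1, mu2 = np.zeros(D), np.zeros(D)
Sigma1 = np.eye(D)
Sigma2 = 0.8 * np.eye(D)
Sigma2[0, 1] = Sigma2[1, 0] = 0.1
print(gaussian_kl(mu1, Sigma1, mu2, Sigma2))
```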
5.3 Unsupervised optical surface inspection

We apply the proposed K-L divergence estimation using VWKDE to the inspection of surface integrity based on optical images. Most previous works have formulated the inspection in a supervised setting [38, 39, 40, 41]; however, the defect patterns are often diverse, and the training data do not include all possible defect patterns. In real applications, identification and localization of unseen defect patterns are important. In this example, we apply the model-based VWKDE.

Detection of defective surfaces. Following the representations of previous works on image classification [42, 43], we extract random small patches from each image $I$ and assume that those patches are independently generated data. We use the probability density $p_I(x)$ for a patch $x \in \mathbb{R}^D$ from $I$. Given $N$ normal surface images $\mathcal{D} = \{I_i\}_{i=1}^{N}$ and a query image $I^*$, we determine whether $I^*$ is a defective surface according to the following decision function $f(I^*)$ and a predefined threshold:

$$f(I^*) = \min_{I_i \in \mathcal{D}} \widehat{\mathrm{KL}}(p_{I^*} \| p_{I_i}). \qquad (21)$$

Table 1: Performances for defect surface detection (left, mAUC) and defect localization (right, mAP). mAUC and mAP are averaged over the six surface types of DAGM. The DAGM dataset is provided with labels, and only the CNN used the labels for training. The unseen defect is the artificially injected MNIST digit "3".

| mAUC | DAGM Defect | Unseen Defect |
|---|---|---|
| VWKDE | 0.785 ± 0.002 | 0.967 ± 0.003 |
| KDE | 0.734 ± 0.005 | 0.926 ± 0.003 |
| NN-1 | 0.628 ± 0.002 | 0.813 ± 0.001 |
| NN-10 | 0.540 ± 0.003 | 0.614 ± 0.002 |
| NNWang | 0.605 ± 0.002 | 0.657 ± 0.004 |
| MMD | 0.618 ± 0.003 | 0.615 ± 0.008 |
| OSVM | 0.579 ± 0.001 | 0.538 ± 0.000 |
| CNN | 0.901 ± 0.011 | 0.809 ± 0.029 |

| mAP | DAGM Defect | Unseen Defect |
|---|---|---|
| VWKDE | 0.369 ± 0.005 | 0.903 ± 0.007 |
| KDE | 0.294 ± 0.004 | 0.849 ± 0.006 |
| NN-1 | 0.095 ± 0.008 | 0.488 ± 0.002 |
| NN-10 | 0.081 ± 0.004 | 0.254 ± 0.002 |
| NNWang | 0.029 ± 0.005 | 0.024 ± 0.000 |
| MMD | 0.151 ± 0.006 | 0.032 ± 0.001 |
| OSVM | 0.249 ± 0.012 | 0.444 ± 0.009 |
| CNN | 0.699 ± 0.037 | 0.564 ± 0.060 |

Defect localization. Once a defective surface is detected, the spot of the defect can be localized by inspecting the LPDR score $\log(p_{I^*}(x)/p_{I_m}(x))$ between the query image $I^*$ and the image $I_m$ with $I_m = \arg\min_{I_i \in \mathcal{D}} \widehat{\mathrm{KL}}(p_{I^*}\|p_{I_i})$. The location of a patch $x$ with a large LPDR score is considered to be the defect location. Note that a similar approach has been used for the witness function in statistical model criticism [44, 45].

For the evaluation of the algorithm, we use a publicly available dataset for surface inspection: DAGM². The dataset contains six distinct types of normal and defective surfaces. The defective samples are not used in training, but they are used in searching for the decision thresholds. We extract 900 patches per image, and each patch is transformed into a four-dimensional feature vector. The detection is then performed and compared with many well-known criteria: diverse K-L divergence estimators as well as the maximum mean discrepancy (MMD) [46] and the one-class support vector machine (OSVM) [47]. In addition, the result of training convolutional neural networks (CNNs) is presented for comparison with a supervised method. In DAGM, the testing data have defect patterns similar to those in the training data. To demonstrate unseen defect patterns, we artificially generate defective images by superimposing a randomly selected 15% of the normal testing images with a small image of the MNIST digit 3 at a random location (see Figure 6).

Table 1 presents the area under the receiver operating characteristic curve (AUC) for the detection as well as the mean average precision (mAP) for the localization. The CNNs, which use labels for training, show good performance only in detecting and localizing DAGM defects. The K-L divergence estimation with VWKDE shows the best performance among the unsupervised methods, and it provides significantly better performance both at identifying unseen defects and at localizing them. Figure 6 shows one example of how well the proposed method localizes the position of an unseen defect.
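The image-level score of Eq. (21) can be assembled from the estimators above. The sketch below is schematic: it assumes each image is represented by its set of patch feature vectors, uses a plain Gaussian-kernel plug-in estimator in place of the VWKDE-based one, and omits the leave-one-out correction for brevity; the helper names are hypothetical.

```python
import numpy as np

def estimate_kl(patches_q, patches_ref, h=0.5):
    """Placeholder plug-in KL estimate between two patch sets (Eq. (13) style).
    In the paper this role is played by the KDE- or VWKDE-based estimator."""
    def kde(points, samples):
        d2 = ((points[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * h**2)).mean(axis=1) + 1e-12
    # The Gaussian kernel's normalization constant cancels in the log ratio.
    return np.mean(np.log(kde(patches_q, patches_q) / kde(patches_q, patches_ref)))

def defect_score(query_patches, normal_images):
    """Eq. (21): f(I*) = min over normal images of KL(p_I* || p_Ii);
    a large score means the query is unlike every normal surface."""
    return min(estimate_kl(query_patches, ref) for ref in normal_images)
```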
6 Conclusion In this paper, we have shown how a weighted kernel formulation for the plug-in densities could be optimized to mitigate the bias in consideration of the geometry of densities. The underlying mechanism uses the information from the first derivatives to alleviate the bias due to the second derivatives. In our experiments, a simple choice of Gaussian density model for obtaining the first and second derivatives led to a reliable reduction of bias. This insensitivity to the exactness due to a coarse model is nonintuitive considering the traditional dilemma prevalent in many conventional methods; a coarse and inexact model enjoys a small variance but at the cost of large bias. In our work, the usage of the coarse model had no effect on the flexibility of the plug-in estimator, while the high dimensional bias was tackled precisely. 2Deutsche Arbeitsgemeinschaft für Mustererkennung (The German Association for Pattern Recognition). Limitations of this study include the computational overhead for score learning using parametric or neural network methods and no benefit for the asymptotic convergence rate because it depends on the convergence rate of KDE. Using a non-flexible parametric model rather than a flexible one provides a consistent benefit to improve the KDE. Acknowledgments and Disclosure of Funding SY was supported by a KIAS Individual Grant (AP095701) via the Center for AI and Natural Sciences at Korea Institute for Advanced Study. SY and FCP were supported in part by IITP-MSIT (2021-002068, 2022-0-00480), ATC+ (20008547), SRRC NRF (RS-2023-00208052), and SNU Institute for Engineering Research. GY was partly supported by IITP-MSIT (2019-0-01906), and IK and YKN was supported by NRF/MSIT (No. 2018R1A5A7059549, 2021M3E5D2A01019545), IITP/MSIT (IITP-2021-0-02068, 2020-0-01373, RS-2023-00220628). YKN was supported by Samsung Research Funding & Incubation Center for Future Technology (SRFC-IT1901-13) partly in the derivation of the weight dependency on the bias and its mechanical interpretation. [1] Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. A fair classifier using kernel density estimation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15088 15099. Curran Associates, Inc., 2020. [2] Edward Gan and Peter Bailis. Scalable kernel density classification via threshold-based pruning. In SIGMOD 17: Proceedings of the 2017 ACM International Conference on Management of Data, pages 945 959, 2017. [3] Longin Jan Latecki, Aleksandar Lazarevic, and Dragoljub Pokrajac. Outlier detection with kernel density functions. In Petra Perner, editor, Machine Learning and Data Mining in Pattern Recognition, pages 61 75, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. [4] Dirceu Scaldelai, Luiz Carlos Matioli, Solange Regina dos Santos, and Mariana Kleina. Multicluster KDE: a new algorithm for clustering based on multivariate kernel density estimation. Journal of Applied Statistics, 49(1):98 121, 2020. [5] Dirk Ormoneit and Saunak Sen. Kernel-based reinforcement learning. Machine Learning, 49:161 178, 2002. [6] Haanvid Lee, Jongmin Lee, Yunseon Choi, Wonseok Jeon, Byung-Jun Lee, Yung-Kyun Noh, and Kee-Eung Kim. Local metric learning for off-policy evaluation in contextual bandits with continuous actions. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 3913 3925. Curran Associates, Inc., 2022. 
[7] Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabas Poczos, and Larry Wasserman. Nonparametric estimation of renyi divergence and friends. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 919 927, Bejing, China, 22 24 Jun 2014. PMLR. [8] Dongxin Xu and Deniz Erdogmuns. Renyi s entropy, divergence and their nonparametric estimators. In: Information Theoretic Learning. Information Science and Statistics, 2010. [9] Jose C. Principe, Dongxin Xu, Qun Zhao, and John W. Fisher. Learning from examples with information theoretic criteria. Journal of VLSI Signal Processing Systems, 26(1-2):75 113, 2000. [10] Kumar Sricharan, Raviv Raich, and Alfred O. Hero III. Estimation of nonlinear functionals of densities with confidence. IEEE Transactions on Information Theory, 58(7):4135 4159, 2012. [11] Ziv Goldfeld, Kristjan H. Greenewald, Jonathan Niles-Weed, and Yury Polyanskiy. Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory, 66(7):4368 4391, 2020. [12] Larry Wasserman. All of nonparametric statistics: a concise course in nonparametric statistical inference. Springer, 2005. [13] Peter Hall. Effect of bias estimation on coverage accuracy of bootstrap confidence intervals for a probability density. The Annals of Statistics, 20(2):675 694, 1992. [14] Peter Hall and Byeong U. Park. New methods for bias correction at endpoints and boundaries. The Annals of Statistics, 30(5):1460 1479, 2002. [15] M. C. Jones, David F. Signorini, and Nils Lid Hjort. On multiplicative bias correction in kernel density estimation. Sankhy a: The Indian Journal of Statistics, Series A (1961-2002), 61(3):422 430, 1999. [16] David Ruppert and Daren B. H. Cline. Bias reduction in kernel density estimation by smoothed empirical transformations. The Annals of Statistics, 22(1):185 210, 1994. [17] Yung-Kyun Noh, Masashi Sugiyama, Kee-Eung Kim, Frank Park, and Daniel D Lee. Generative local metric learning for kernel regression. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2452 2462. Curran Associates, Inc., 2017. [18] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321 1330. PMLR, 06 11 Aug 2017. [19] Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 304 313, New York, New York, USA, 20 22 Jun 2016. PMLR. [20] Xuan Long Nguyen, Martin J Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1089 1096. Curran Associates, Inc., 2008. [21] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. 
Annals of the Institute of Statistical Mathematics, 60(4):699 746, 2008. [22] Vladimir N Vapnik and Sayan Mukherjee. Support vector method for multivariate density estimation. In Advances in Neural Information Processing Systems, pages 659 665, 2000. [23] Mark Girolami and Chao He. Probability density estimation from optimally condensed data samples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1253 1264, 2003. [24] Le Song, Xinhua Zhang, Alex Smola, Arthur Gretton, and Bernhard Schölkopf. Tailoring density estimation via reproducing kernel moment matching. In Proceedings of the 25th international conference on Machine learning, pages 992 999. ACM, 2008. [25] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR, volume 115, pages 574 584, 2020. [26] Yung-Kyun Noh, Byoung-Tak. Zhang, and Daniel D. Lee. Generative local metric learning for nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(1):106 118, 2018. [27] S. Chandrasekhar. Stochastic problems in physics and astronomy. Reviews of Modern Physiscs, 15:1 89, 1943. [28] Scott A. Socolofsky and Gerhard H. Jirka. Environmental fluid mechanics. Part I: Mass transfer and diffusion. Engineering-lectures [online]. 2004. Karlsruhe : Inst. für Hydromechanik. 2nd ed. 2002. [29] Bernhard. Schölkopf and Alex J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, December 2002. [30] Krikamol Muandet and Bernhard Schölkopf. A unifying view of support measure machines, support vector machines, and parzen window classifiers. Unpublished, 2020. [31] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, 36(3):1171 1220, 2008. [32] Yung-Kyun Noh, Masashi Sugiyama, Song Liu, Marthinus C. du Plessis, Frank C. Park, and Daniel D. Lee. Bias reduction and metric learning for nearest-neighbor estimation of Kullback-Leibler divergence. Neural computation, page 1, 2018. [33] Darıo Garcıa-Garcıa, Ulrike von Luxburg, and Raúl Santos-Rodrıguez. Risk-based generalizations of f-divergences. In Proceedings of the 28th International Conference on Machine Learning, ICML, pages 417 424, 2011. [34] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392 2405, 2009. [35] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Devon Hjelm, and Aaron Courville. Mutual information neural estimation. In International Conference on Machine Learning, pages 530 539, 2018. [36] Kevin Moon, Kumar Sricharan, Kristjan Greenewald, and Alfred Hero. Ensemble estimation of information divergence. Entropy, 20(8):560, 2018. [37] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, and James M Robins. Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. ar Xiv preprint ar Xiv:1411.4342, 2014. [38] Franz Pernkopf and Paul O Leary. Visual inspection of machined metallic high-precision surfaces. EURASIP Journal on Advances in Signal Processing, 2002(7):650750, 2002. [39] Fabian Timm and Erhardt Barth. 
Non-parametric texture defect detection using weibull features. In Image Processing: Machine Vision Applications IV, volume 7877, page 78770J. International Society for Optics and Photonics, 2011. [40] Daniel Weimer, Bernd Scholz-Reiter, and Moshe Shpitalni. Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection. CIRP Annals, 65(1):417 420, 2016. [41] Seunghyeon Kim, Wooyoung Kim, Yung-Kyun Noh, and Frank C Park. Transfer learning for automated optical inspection. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 2517 2524. IEEE, 2017. [42] Barnabás Póczos, Liang Xiong, Dougal J Sutherland, and Jeff Schneider. Nonparametric kernel estimators for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2989 2996. IEEE, 2012. [43] GA Kaminka et al. Randomized distribution feature for image classification. In ECAI 2016: 22nd European Conference on Artificial Intelligence, 29 August-2 September 2016, The Hague, The Netherlands-Including Prestigious Applications of Artificial Intelligence (PAIS 2016), volume 285, page 426. IOS Press, 2016. [44] James R Lloyd and Zoubin Ghahramani. Statistical model criticism using kernel two sample tests. In Advances in Neural Information Processing Systems, pages 829 837, 2015. [45] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280 2288, 2016. [46] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723 773, 2012. [47] Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443 1471, 2001. [48] John. C. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, 2000. [49] ABA. Graf, O. Bousquet, G. Rätsch, and B. Schölkopf. Prototype classification: Insights from machine learning. Neural Computation, 21(1):272 300, January 2009. [50] Adam Kowalczyk. Maximal margin perceptron. In Smola, Bartlett, Schölkopf, and Schuurmans, editors, Advances in Large Margin Classifiers, pages 75 113, 2000. A Bias Derivation of the Posterior Estimator The expectation of the weighted KDE is obtained from the following equation: ED1[bp1(x)] = Ex p1(x)[α(x )kh(x, x )] = Z α(x )p1(x )kh(x, x )dx (22) = Z α(x )p1(x ) 1 dx = Z α(x + hz)p1(x + hz)K(z)dz, (23) with the substitution z = x x h , or x = hz + x to produce dx = h Ddz, and K x x h = h Dkh(x, x ) with normalized and isotropic K(z). We apply Taylor expansion on the term α(x + hz)p1(x + hz) around x, and with the assumption that |hz| is small, α(x + hz)p1(x + hz) = α(x)p1(x) + hz [α(x)p1(x)] + h2 2 z [α(x)p1(x)]z + O(h3), (24) using the Hessian operator . Now the integration yields the expectation with respect to K(z) that satisfies R K(z)dz = 1, R z K(z)dz = 0, and R zz K(z)dz = I: ED1[bp1(x)] = α(x)p1(x) + h2 2 2[α(x)p1(x)] + O(h3), (25) with the Laplacian operator 2. 
Along with the expansion for $\mathbb{E}_{\mathcal{D}_2}[\widehat{p}_2(x)]$, the plug-in posterior can be perturbed assuming a small $h$:

$$\mathbb{E}_{\mathcal{D}_1,\mathcal{D}_2}[f(x)] \approx \frac{\mathbb{E}_{\mathcal{D}_1}[\widehat{p}_1(x)]}{\mathbb{E}_{\mathcal{D}_1}[\widehat{p}_1(x)] + \gamma\,\mathbb{E}_{\mathcal{D}_2}[\widehat{p}_2(x)]} \qquad (26)$$

$$= f(x) + \frac{h^2}{2}\,\frac{\gamma p_1(x)p_2(x)}{(p_1(x)+\gamma p_2(x))^2}\left( \frac{\nabla^2[\alpha(x)p_1(x)]}{\alpha(x)p_1(x)} - \frac{\nabla^2[\alpha(x)p_2(x)]}{\alpha(x)p_2(x)} \right) + O(h^3)$$

$$= f(x) + \frac{h^2}{2}\, P(y=1|x)\,P(y=2|x)\, B_{\alpha;p_1,p_2}(x) + O(h^3), \qquad (27)$$

giving the point-wise leading-order bias with respect to $h$:

$$\mathrm{Bias}(x) = \frac{h^2}{2}\, P(y=1|x)\,P(y=2|x)\, B_{\alpha;p_1,p_2}(x). \qquad (28)$$

Here, $B_{\alpha;p_1,p_2}(x)$ is as follows:

$$B_{\alpha;p_1,p_2}(x) \triangleq \frac{\nabla^2[\alpha(x)p_1(x)]}{\alpha(x)p_1(x)} - \frac{\nabla^2[\alpha(x)p_2(x)]}{\alpha(x)p_2(x)}, \qquad (29)$$

which includes the second derivatives of $\alpha(x)p_1(x)$ and $\alpha(x)p_2(x)$. Because the two classes use the same weight function $\alpha(x)$, Eq. (29) can be decomposed into two terms without the second derivative of $\alpha(x)$:

$$B_{\alpha;p_1,p_2}(x) = 2\,\frac{\nabla\alpha(x)^\top}{\alpha(x)}\left( \frac{\nabla p_1(x)}{p_1(x)} - \frac{\nabla p_2(x)}{p_2(x)} \right) + \frac{\nabla^2 p_1(x)}{p_1(x)} - \frac{\nabla^2 p_2(x)}{p_2(x)}. \qquad (30)$$

B Solution of the Calculus of Variations

For the optimization of Eq. (15) with respect to $\alpha(x)$, we first make the substitution $\beta = \log\alpha$ and apply a calculus-of-variations technique for the optimal $\beta(x)$. We express the objective functional as $\int m(x; \beta, \nabla\beta)\, dx$ using

$$m(x; \beta, \nabla\beta) = \big( \nabla\beta|_x^\top h(x) + g(x) \big)^2 r(x). \qquad (31)$$

With the substitution $\beta' = \nabla\beta$ for notational abbreviation, we apply the Euler-Lagrange equation for the $m(x; \beta, \beta')$ containing both $\beta$ and $\beta'$:

$$\frac{\partial m(x;\beta,\beta')}{\partial \beta} - \nabla_x \cdot \frac{\partial m(x;\beta,\beta')}{\partial \beta'} = 0, \qquad (32)$$

where the divergence is $\nabla_x \cdot \frac{\partial m}{\partial \beta'} = \sum_{i=1}^{D} \frac{\partial}{\partial x_i}\frac{\partial m}{\partial \beta'_i}$ with the $i$-th component $\beta'_i$ of $\beta'$, and $D$ is the dimensionality. The first term is $\frac{\partial m(x;\beta,\beta')}{\partial \beta} = 0$. The derivative in the second term is $\frac{\partial m(x;\beta,\beta')}{\partial \beta'} = 2\,r\,(\beta'^\top h + g)\,h$. After we substitute $\beta(x)$ back with $\log\alpha(x)$, we obtain the equation for the optimal $\alpha(x)$ function:

$$\nabla\cdot\Big[ r\big( (\nabla\log\alpha)^\top h + g \big)\, h \Big] = 0, \qquad (33)$$

which is Eq. (16).

C Analytic Solution for Two Homoscedastic Gaussians

We consider the following two homoscedastic Gaussians

$$p_1(x) = \mathcal{N}(x;\mu_1,\Sigma), \qquad p_2(x) = \mathcal{N}(x;\mu_2,\Sigma), \qquad (34)$$

with a common covariance matrix $\Sigma$. In order to obtain the zero divergence in Eq. (16), a suitable vector field can be obtained in the form

$$\nabla\log\alpha = \Sigma^{-1}x + v(x), \qquad (35)$$

because the inner product of $\nabla\log\alpha$ with $h = \nabla\log p_1 - \nabla\log p_2 = \Sigma^{-1}(\mu_1-\mu_2)$ should equal the negative of $g(x)$, which is $g(x) = x^\top\Sigma^{-2}(\mu_2-\mu_1) + \tfrac{1}{2}\big(\mu_1^\top\Sigma^{-2}\mu_1 - \mu_2^\top\Sigma^{-2}\mu_2\big)$. The equation $(\nabla\log\alpha)^\top h = -g(x)$ gives

$$v(x) = -\tfrac{1}{2}\,\Sigma^{-1}(\mu_1+\mu_2). \qquad (36)$$

Therefore, one possible solution for $\nabla\log\alpha(x)$ is

$$\nabla\log\alpha(x) = \Sigma^{-1}(x - \bar\mu), \qquad (37)$$

with the mean of the two class-conditional means, $\bar\mu = \frac{\mu_1+\mu_2}{2}$. Therefore, one particular solution for $\alpha(x)$ is

$$\alpha(x) = \exp\!\Big( \tfrac{1}{2}(x-\bar\mu)^\top\Sigma^{-1}(x-\bar\mu) \Big). \qquad (38)$$

This solution is not unique, and any $\log\alpha$ of the form

$$\log\alpha(x) = \tfrac{1}{2}(x-\bar\mu)^\top\Sigma^{-1}(x-\bar\mu) + l(x), \qquad (39)$$

with $l(x)$ satisfying $\nabla l^\top\,\Sigma^{-1}(\mu_1-\mu_2) = 0$ is also a solution. One technique for finding such an $l(x)$ is to pick any differentiable seed function $l_0(x)$ and consider its derivative $\nabla l_0$ with the $a = \Sigma^{-1}(\mu_1-\mu_2)$ component subtracted: $\big(I - \frac{a a^\top}{\|a\|^2}\big)\nabla l_0$. The $l(x)$ is a function whose derivative satisfies $\nabla l = \big(I - \frac{a a^\top}{\|a\|^2}\big)\nabla l_0$. For example, if we choose $l_0(x) = \mathbb{1}^\top x$, then $l(x)$ is a function that satisfies $\nabla l = \big(I - \frac{a a^\top}{\|a\|^2}\big)\mathbb{1} = \mathbb{1} - \frac{a^\top\mathbb{1}}{\|a\|^2}\,a$, and we can take $l(x) = \mathbb{1}^\top x - \frac{a^\top\mathbb{1}}{\|a\|^2}\,a^\top x$. If we choose $l_0(x) = \frac{1}{2}\|x\|^2$, we get $l(x) = \frac{1}{2}\|x\|^2 - \frac{(a^\top x)^2}{2\|a\|^2}$ after similar calculations. Now the choice of

$$l_0(x) = -\frac{b}{2}\,\|x-\bar\mu\|^2 \qquad (40)$$

gives us $l(x) = -\frac{b}{2}\,(x-\bar\mu)^\top\big(I - \frac{a a^\top}{\|a\|^2}\big)(x-\bar\mu)$, which produces our analytic weight function in Eq. (18):

$$\alpha(x) = \exp\!\Big( -\tfrac{1}{2}(x-\bar\mu)^\top A\,(x-\bar\mu) \Big), \qquad (41)$$

with $\bar\mu = \frac{\mu_1+\mu_2}{2}$ and $A = b\Big( I - \frac{\Sigma^{-1}(\mu_1-\mu_2)(\mu_1-\mu_2)^\top\Sigma^{-1}}{\|\Sigma^{-1}(\mu_1-\mu_2)\|^2} \Big) - \Sigma^{-1}$, with an arbitrary constant $b$.

D Posterior Prediction for Various Algorithms

Fig. 7 shows the posterior prediction results of various algorithms. For the estimation with support vector machines, [48] is used.
For neural network estimation, 2-layer fully connected networks with 100 nodes for each layer were used minimizing the mean square error of the sigmoid output. Both VWKDE-MB and VWKDE-MF show superior results to other methods. Figure 7: Posterior predictions with various algorithms for 20-dimensional Gaussians. D.1 Kernel density estimation in high dimensions We also note the difficulty of density estimation with KDE in high dimensional space. Fig. 8 shows a onedimensional slice of two 20-dimensional Guassians. The maximum density in this slice is on the order of 10 8. Meanwhile, the KDE with 5,000 data points per class shows densities on the order of about 10 10 with a bandwidth of h = 0.8. 4 3 2 1 0 1 2 3 4 0.0 True p1 True p2 KDE p1 KDE p2 Figure 8: Underlying density functions and their KDE predictions. The bandwidth should be chosen to be sufficiently large because no pairs are nearby in a high-dimensional space. Although the estimated density with h = 0.8 is reasonably smooth, the KDE differs from the true density by several orders of magnitude with 5,000 data points. However, the patterns of relative overestimation and underestimation depicted in Fig. 1 are evident, and the proposed method can be applied even with inexact KDEs. E Fluid Flow Interpretation of Making Bias The change of concentration u(t) at time t due to the convection and diffusion can be written as t = v u + D 2u, (42) with the direction of convection v and the diffusion constant D . The first term represents the convection, and the second term represents the diffusion. The bias Eq. (30) can be reformulated as Eq. (30) = α = log α + 1 Eq. (44) and Eq. (45) can be understood as the two flows with concentrations u1 = log p1 and u2 = log p2, respectively, and the convection direction for u1 is v1 = log α + 1 2 log p1 , and the direction for u2 is v2 = log α + 1 2 log p2 . Without the weight, the convection directions are v1 = 1 2 log p1 and v2 = 1 2 log p2 but with weight, they change to v1 = log α + 1 2 log p1 and v2 = log α + 1 The role of α is to modify the directions of convection toward log α together, and in the example shown in Fig. 1, the change of u1 and u2 due to diffusion are negative and positive, respectively. The direction of convection can control the change of u1 and u2, and the α makes the difference between u1 t as small as possible. F Prototype Classification Interpretation of the Weighted Kernel Methods in Reproducing Kernel Hilbert Space (RKHS) The kernel algorithm for classification can be viewed as prototype algorithms in the RKHS due to the decomposition of positive definite kernel functions [31, 30, 29, Section 1.2]: k(x, x ) = ϕ(x), ϕ(x ) , ϕ(x), ϕ(x ) RKHS, (47) with the inner product operator ., . defined in RKHS. Given a dataset {xi, yi}N i=1, xi RD, yi {0, 1}, the classification using two prototypes w1 = 1 N1 P {i;yi=1} αiϕ(xi) and w0 = 1 N0 P {i;yi=0} αiϕ(xi) in RKHS determines which of the prototypes has a smaller distance to the ϕ(x) than the other. Here, N1, N0 are the numbers of data of classes 1 and 0, respectively. The classification using KDEs with the pointwise weights can be compared with the prototype classification in RKHS using the following derivation: {i;yi=1} αik(xi, x) 1 {i;yi=0} αik(xi, x) {i;yi=1} αi ϕ(xi), ϕ(x) 1 {i;yi=0} αi ϕ(xi), ϕ(x) {i;yi=1} αiϕ(xi), ϕ(x) {i;yi=0} αiϕ(xi), ϕ(x) = 1I ([ w1, ϕ(x) w0, ϕ(x) ] > θ) (51) = 1I ||w1 ϕ(x)||2 ||w0, ϕ(x)||2 > θ with a predetermined threshold θ. In Eq. (52), θ = θ (||w1||2 ||w0||2). With uniform weight αi = 1 for all i = 1, . . 
. , N, the classification is simply the comparison of two KDEs bp1 = 1 N1 P {i;yi=1} k(xi, x) and bp0 = 1 N0 P {i;yi=0} k(xi, x), which correspond to the prototype classification using two empirical means w1 = 1 N1 P {i;yi=1} ϕ(xi) and w0 = 1 N0 P {i;yi=0} ϕ(xi) in RKHS. Originating from this correspondence, one suggestion of the unification for the KDE and RKHS is presented in [30]. The explanations about the prototypes for various kernelized algorithms can be found in [49]. The prototypes of SVMs are known to be the closest two points within the convex hull of different classes [50]. Despite all these discussions, it is clear that the modification of the densities using a weight function will not improve the density estimation performance from the perspective of KDE. Despite the poor density estimation performance, the modification improves the classification or information-theoretic measure estimation for the KDE plug-in algorithms, but not necessarily the KDE itself. The improvement is partly supported by the prototype models in RKHS. G Least Square Approach for Binary Classification Reducing the bias of the posterior equation in Eq. (3) corresponds to the least square of the prediction error. The optimal square error is achieved with the Bayes classifier, which classifies a datum according to the posterior probability. The posterior probability of x0 being generated from p1(x) can be written as: P(y = 1|x0) = p1(x0)p(y = 1) p0(x0)p(y = 0) + p1(x0)p(y = 1) (53) = p1(x0) γp0(x0) + p1(x0), (54) where γ = p(y = 0)/p(y = 1). The least square error with L = Z (f(x) y)2p(x, y)dydx, (55) is achieved with the following prediction function f(x) = E[y = 1|x] = P(y = 1|x). (56) An accurate estimation of posterior is essential for successful classification. We construct a classifier based on the KDE density estimates bp0(x), bp1(x). f(x) = bp1(x) γbp0(x) + bp1(x) (57) = 1 1 + γ(bp0(x)/bp1(x)), (58) and consider the deviation of f(x) from the true E[y = 1|x]. H Details on Optical Surface Inspection Experiments We use a widely used public surface inspection dataset provided by DAGM3 for experiments. The dataset contains six distinct textile surface types and associated defect types. There are 1,150 images per class, half of which is for training and the remaining is for testing. Approximately 13% of total images are defective, and for each defective image, a masking image which roughly encloses the defective region are provided. The dataset is originally proposed for a supervised setting. We extract 900 patches of size 32 32 from each image, using a sliding window with step size 16. In each patch, we apply Gaussian smoothing and Scharr kernel to obtain a gradient distribution which can capture the texture information. To encode the gradient distribution as a feature vector, we compute its mean, standard deviation, skewness, and kurtosis. As a result, a surface image is transformed into a set of 900 four-dimensional vectors, or a 900 4 matrix. Feature vectors are standardized and whitened by aggregating all the patches from the same surface type. VWKDE can be time-consuming as a large number of KL divergences need to be computed. Therefore, we take a two-pass approach when applying VWKDE. Given a query image, we first apply KDE-based KL divergence estimator to obtain rough estimates of KL divergences. Then, we take k images with the lowest KL divergences and apply VWKDE-based KL divergence estimator to the k images to finally select the image with the lowest KL divergence. 
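A minimal sketch of this two-pass scoring is given below; it assumes two scoring callables, `kl_kde` (fast, KDE-based) and `kl_vwkde` (slower, VWKDE-based), and the names and the value of `k` are illustrative rather than taken from the released code.

```python
def two_pass_score(query, normal_images, kl_kde, kl_vwkde, k=5):
    """First pass: rank all normal images with the fast KDE-based KL estimate.
    Second pass: re-score only the k closest candidates with VWKDE and return
    the minimum, i.e. the decision function of Eq. (21)."""
    rough = sorted(normal_images, key=lambda ref: kl_kde(query, ref))
    return min(kl_vwkde(query, ref) for ref in rough[:k])
```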
This two-pass method enables us to have the best of both worlds: the speed of KDE and the accuracy of VWKDE.

The optimal bandwidth for bias reduction methods such as Ensemble [36], von Mises [37], and VWKDE is usually larger than that of the other methods. We use the bandwidth with the maximum leave-one-out log-likelihood of the KDE for the other methods, but for these three methods we use a heuristic bandwidth-selection rule that takes the maximum log-likelihood bandwidth computed on only 25% of randomly selected data.

A convolutional neural network (CNN), which takes a 32×32 patch as input and predicts whether the patch is defective, is trained. For training, we label patches with 75% overlap with the defect mask as defective and patches without any overlap with the defect mask as normal. Patches that belong to neither class are discarded. Due to class imbalance, normal patches are undersampled to yield a defect-to-normal ratio of 1:4. A CNN is trained for each surface type separately. The structure of our CNN is Conv(20)-Conv(20)-MaxPool-Conv(20)-Conv(20)-MaxPool-FC(20)-DropOut-FC(1), where Conv is a 3×3 convolution layer, MaxPool is a 2×2 max pooling layer, FC is a fully connected layer, and DropOut is a dropout operation with probability 0.5. We use the binary cross-entropy loss as the objective function and ADAM for optimization.

In unsupervised defect localization, we threshold the log probability density ratio (LPDR) estimates to obtain detection results. We threshold the LPDR estimates dynamically at 90% of the maximum LPDR observed in the image. Then, we use the image's K-L divergence estimate as a confidence score for the detection. For the CNN, we use the output probability for a patch as a detection score with a threshold of 0.9, and the maximum defect probability among the patches in an image is used as a confidence score. Note that, in this experiment, we generate one detection per image because the DAGM dataset is constrained to have at most one defect per image. However, this condition can be relaxed in future work with other datasets. The intersection-over-union (IOU) is computed between a detection and the true defect mask. A detection with IOU larger than 0.1 is considered a correct detection. This threshold is lower than a typical threshold in object detection (0.5) because the defect mask is weakly labelled and usually larger than the precise defect region. Using the confidence score assigned to each detection, we compute the average precision as in the PASCAL VOC challenge and then take the average over surface types.

³Deutsche Arbeitsgemeinschaft für Mustererkennung (The German Association for Pattern Recognition). Data access: https://hci.iwr.uni-heidelberg.de/node/3616
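For reference, the four-dimensional patch feature described above can be computed along the following lines. This is a sketch assuming OpenCV and SciPy are available; the smoothing kernel size, the use of the gradient magnitude (the paper does not specify whether the distribution is of magnitudes or of directional gradients), and the helper names are illustrative assumptions.

```python
import numpy as np
import cv2
from scipy.stats import skew, kurtosis

def patch_feature(patch):
    """Map a 32x32 grayscale patch to (mean, std, skewness, kurtosis) of its
    gradient distribution, following the description in Appendix H."""
    smoothed = cv2.GaussianBlur(patch.astype(np.float32), (5, 5), 1.0)
    gx = cv2.Scharr(smoothed, cv2.CV_32F, 1, 0)
    gy = cv2.Scharr(smoothed, cv2.CV_32F, 0, 1)
    grad = np.sqrt(gx**2 + gy**2).ravel()
    return np.array([grad.mean(), grad.std(), skew(grad), kurtosis(grad)])

def image_to_patches(img, size=32, step=16):
    """Sliding-window patch extraction, as in the 900-patches-per-image setup."""
    return np.stack([patch_feature(img[r:r + size, c:c + size])
                     for r in range(0, img.shape[0] - size + 1, step)
                     for c in range(0, img.shape[1] - size + 1, step)])
```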