# nearest_neighbor_dirichlet_mixtures__6ca9c4e6.pdf

Journal of Machine Learning Research 24 (2023) 1-46 Submitted 2/21; Published 8/23

Nearest Neighbor Dirichlet Mixtures

Shounak Chattopadhyay shounak.chattopadhyay@duke.edu Department of Statistical Science Duke University Durham, NC 27708-0251, USA

Antik Chakraborty antik015@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47907, USA

David B. Dunson dunson@duke.edu Department of Statistical Science Duke University Durham, NC 27708-0251, USA

Editor: Erik Sudderth

There is a rich literature on Bayesian methods for density estimation, which characterize the unknown density as a mixture of kernels. Such methods have advantages in terms of providing uncertainty quantiﬁcation in estimation, while being adaptive to a rich variety of densities. However, relative to frequentist locally adaptive kernel methods, Bayesian approaches can be slow and unstable to implement in relying on Markov chain Monte Carlo algorithms. To maintain most of the strengths of Bayesian approaches without the computational disadvantages, we propose a class of nearest neighbor-Dirichlet mixtures. The approach starts by grouping the data into neighborhoods based on standard algorithms. Within each neighborhood, the density is characterized via a Bayesian parametric model, such as a Gaussian with unknown parameters. Assigning a Dirichlet prior to the weights on these local kernels, we obtain a pseudo-posterior for the weights and kernel parameters. A simple and embarrassingly parallel Monte Carlo algorithm is proposed to sample from the resulting pseudo-posterior for the unknown density. Desirable asymptotic properties are shown, and the methods are evaluated in simulation studies and applied to a motivating data set in the context of classiﬁcation.

Keywords: Bayesian, Density estimation, Distributed computing, Embarrassingly parallel, Kernel density estimation, Mixture model, Quasi-posterior, Scalable

1. Introduction

Bayesian nonparametric methods provide a useful alternative to black box machine learning algorithms, having potential advantages in terms of characterizing uncertainty in inferences and predictions. However, computation can be slow and unwieldy to implement. Hence, it is important to develop simpler and faster Bayesian nonparametric approaches, and hybrid methods that borrow the best of both worlds. For example, if one could use the Bayesian machinery for uncertainty quantiﬁcation and reduction of mean square errors

c 2023 Shounak Chattopadhyay, Antik Chakraborty, David B. Dunson.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0116.html.

Chattopadhyay, Chakraborty and Dunson

through shrinkage, while incorporating algorithmic aspects of machine learning approaches, one may be able to engineer a highly eﬀective hybrid. The focus of this article is on proposing such an approach for density estimation, motivated by the successes and limitations of nearest neighbor algorithms and Bayesian mixture models. Nearest neighbor algorithms are popular due to a combination of simplicity and performance. Given a set of n observations X (n) = (X1, . . . , Xn) in Rp, the density at x is estimated as ˆfknn(x) = k/(n Vp Rp k), where k is the number of neighbors of x in X (n), Rk = Rk(x) is the distance of x from its kth nearest neighbor in X (n), and Vp is the volume of the p-dimensional unit ball (Loftsgaarden and Quesenberry, 1965; Mack and Rosenblatt, 1979). Refer to Biau and Devroye (2015) for an overview of related estimators and corresponding theory. Nearest neighbor density estimators are a type of locally adaptive kernel density estimators. The literature on such methods identiﬁes two broad classes: balloon estimators and sample smoothing estimators; see Scott (2015); Terrell and Scott (1992) for an overview. Balloon estimators characterize the density at a query point x using a bandwidth function h(x); classical examples include the naive k-nearest neighbor density estimator (Loftsgaarden and Quesenberry, 1965) and its modiﬁcation in Mack and Rosenblatt (1979). More elaborate balloon estimators face challenges in terms of choice of h(x) and obtaining density estimators that do not integrate to 1. Sample smoothing estimators use n diﬀerent bandwidths h(Xi), one for each sample point Xi, to estimate the density at a query point x globally. By construction, sample smoothing estimators are bona ﬁde density functions integrating to 1. To ﬁt either the balloon or the sample smoothing estimator, one may compute an initial pilot density estimator employing a constant bandwidth and then use this pilot to estimate the bandwidth function (Breiman et al., 1977; Abramson, 1982). Another example of a locally adaptive density estimator is the local likelihood density estimator (Loader, 1996, 2006; Hjort and Jones, 1996), which ﬁts a polynomial model in the neighborhood of a query point x to estimate the density at x, estimating the parameters of the local polynomial by maximizing a penalized local log-likelihood function. The above methods produce a point estimate of the density without uncertainty quantiﬁcation (UQ). Alternatively, there is a Bayesian literature on locally adaptive kernel methods, which express the unknown density as:

h=1 πh K(x; θh), θh P0, (πh)m h=1 Q0, (1)

which is a mixture of m components, with the hth having probability weight πh and kernel parameters θh; by allowing the location and bandwidth to vary across components, local adaptivity is obtained. A Bayesian speciﬁcation is completed with prior P0 for the kernel parameters and Q0 for the weights. In practice, it is common to rely on an over-ﬁtted mixture model (Rousseau and Mengersen, 2011), which chooses m as a pre-speciﬁed ﬁnite upper bound on the number of components, and lets

π = (π1, . . . , πm)T Dirichlet(α, . . . , α). (2)

Augmenting component indices ci {1, . . . , m} for i = 1, . . . , n, a simple Gibbs sampler can be used for posterior computation, alternating between sampling (i) ci from a multinomial

Nearest Neighbor Dirichlet Mixtures

conditional posterior, for i = 1, . . . , n; (ii) θh | P0(θh) Q i:ci=h K(Xi; θh); and (iii) π | Dirichlet(α + n1, . . . , α + nm), with nh = Pn i=1 I(ci = h) for h = 1, . . . , m. Relative to frequentist locally adaptive methods, Bayesian approaches are appealing in automatically providing a characterization of uncertainty in estimation, while having excellent practical performance for a broad variety of density shapes and dimensions. However, implementation typically relies on Markov chain Monte Carlo (MCMC), with the Gibbs sampler sketched above providing an example of a common algorithm used in practice. Unfortunately, current MCMC algorithms for posterior sampling in mixture models tend to face issues with slow mixing, meaning the sampler can take a very large number of iterations to adequately explore diﬀerent posterior modes and obtain suﬃciently accurate posterior summaries. MCMC ineﬃciency has motivated a literature on faster approaches, including sequential approximations (Wang and Dunson, 2011; Zhang et al., 2014) and variational Bayes (Blei and Jordan, 2006). These methods are order dependent, tend to converge to local modes, and/or lack theory support. Newton and Zhang (1999); Newton (2002) instead rely on predictive recursion. Such estimators are fast to compute and have theory support, but are also order dependent and do not provide a characterization of uncertainty. Alternatively, one can use a Polya tree as a conjugate prior (Lavine, 1992, 1994), and there is a rich literature on related multiscale and recursive partitioning approaches, such as the optional Polya tree (Wong and Ma, 2010). However, Polya trees have disadvantages in terms of sensitivity to a base partition and a tendency to favor spiky/erratic densities. These disadvantages are inherited by most of the computationally fast modiﬁcations. This article develops an alternative to current locally adaptive density estimators, obtaining the practical advantages of Bayesian approaches in terms of uncertainty quantiﬁcation and a tendency to have relatively good performance for a wide variety of true densities, but without the computational disadvantage due to the use of MCMC. This is accomplished with a Nearest Neighbor-Dirichlet Mixture (NN-DM) model. The basic idea is to rely on fast nearest neighbor search algorithms to group the data into local neighborhoods, and then use these neighborhoods in deﬁning a Bayesian mixture model-based approach. Section 2 outlines the NN-DM approach and describes implementation details for Gaussian kernels. Section 3 provides some theory support for NN-DM. Section 4 contains simulation experiments comparing NN-DM with a rich variety of competitors in univariate and multivariate examples, including an assessment of UQ performance. Section 5 contains a real data application, and Section 6 a discussion.

2. Methodology

2.1 Nearest Neighbor Dirichlet Mixture Framework

Let d(x1, x2) denote a distance metric between data points x1, x2 X. For X = Rp, the Euclidean distance is typically chosen. For each i {1, 2, . . . , n}, let Xi[j] denote the jth nearest neighbor to Xi in the data X (n) = (X1, . . . , Xn) such that d(Xi, Xi[1]) . . . d(Xi, Xi[n]), with ties broken by increasing order of indices. By convention, we deﬁne Xi[1] = Xi. The indices on the k nearest neighbors to Xi are denoted as Ni = {j : d(Xi, Xj) d(Xi, Xi[k])}. Denote the set of data points in the ith neighborhood by

Chattopadhyay, Chakraborty and Dunson

Si = {Xj : j Ni}. In implementing the proposed method, we typically let the number of neighbors k vary as a function of n. When necessary, we use the notation kn to express this dependence. However, we routinely drop the n subscript for notational simplicity. Fixing x X, we model the density of the data within the ith neighborhood using

fi(x) = K(x; θi), θi P0, (3)

where θi are parameters speciﬁc to neighborhood i that are given a global prior distribution P0. To combine the fi(x)s into a single global f(x), similarly to equations (1)-(2), we let

i=1 πifi(x), π = (πi)n i=1 Dirichlet(α, . . . , α), θi P0. (4)

The key diﬀerence relative to standard Bayesian mixture model (1) is that in (4) we include one component for each data sample and assume that only the data in the k-nearest neighborhood of sample i will inform about θi. In contrast, (1) lacks any sample dependence, and we infer allocation of samples to mixture components in a posterior inference phase. Given the restriction that only data in the ith neighborhood Si inform about θi, the pseudo-posterior density eΠ1(θi; Si, P0) of θi with data Si and prior P0 is

eΠ1(θi; Si, P0) P0(θi) Y

j Ni K(Xj; θi), (5)

where the right-hand side of (5) is motivated from Bayes theorem. This pseudo-posterior is in a simple analytic form if P0 is conjugate to K(x; θ). The prior P0 can involve unknown parameters and borrows information across neighborhoods; this reduces the large variance problem common to nearest neighbor estimators. Since the neighborhoods are overlapping, proposing a pseudo-posterior update for π under (4) is not straightforward. However, one can deﬁne the number of eﬀective members in the ith neighborhood Si similar in spirit to the number of points in the hth cluster in mixture models of the form (1). By convention, we deﬁne the point Xi that generated its neighborhood Si to be an eﬀective member of that neighborhood. For any other data point Xj to be a eﬀective member of the neighborhood generated by Xi for j = i, we require Xj Si but Xj / Su for all u = 1, . . . , n such that u / {i, j}. That is, Xj lies in the neighborhood generated by Xi but does not lie in the neighborhood of any other Xu for u / {i, j}. In Section 3.2, we show that the number of eﬀective member points deﬁned as above approaches 1 as n . This motivates the following Dirichlet pseudo-posterior density eΠ2(π; X (n)) for the neighborhood weights π:

eΠ2(π; X (n)) = Dirichlet(π | α + 1, . . . , α + 1), (6)

where Dirichlet(p | q1, . . . , qd) denotes the density of the Dirichlet distribution evaluated at p with parameters (q1, . . . , qd). We provide a justiﬁcation for the pseudo-posterior update (6) in Section 3.2. This distribution is inspired from the conditional posterior on the kernel weights in the Dirichlet mixture of equations (1)-(2), but we use n components and ﬁx the eﬀective number of samples allocated to each component at one.

Nearest Neighbor Dirichlet Mixtures

Based on equations (3)-(6), our nearest neighbor-Dirichlet mixture produces a pseudoposterior distribution for the unknown density f(x) through simple distributions for the parameters characterizing the density within each neighborhood and for the weights. To generate independent Monte Carlo samples from the pseudo-posterior for f, one can simply draw independent samples of (θi)n i=1 and π from (5) and (6) respectively, and plug these samples into the expression for f(x) in (4). The resulting mechanism can be described as

θi ind eΠ1( ; Si, P0) for i = 1, . . . , n,

π eΠ2( ; X (n))

i=1 πi K(x; θi).

In (7), we denote the induced pseudo-posterior distribution on f by f eΠ. Although this is not exactly a coherent fully Bayesian posterior distribution, we claim that it can be used as a practical alternative to such a posterior in practice. This claim is backed up by theoretical arguments, simulation studies, and a real data application in the sequel.

2.2 Illustration with Gaussian Kernels

Suppose we have independent and identically distributed (iid) observations X (n) from the density f, where Xi Rp for i = 1, . . . , n and f is an unknown density function with respect to the Lebesgue measure on Rp for p 1. Let Rp p + denote the set of all real-valued p p positive deﬁnite matrices. Fix x Rp. We proceed by setting K(x; θ) to be the multivariate Gaussian density φp(x; η, Σ), given by

φp(x; η, Σ) = (2π) p/2|Σ| 1/2 exp { (x η)TΣ 1(x η)/2},

where θ = (η, Σ), η Rp and Σ Rp p + . We ﬁrst compute the neighborhoods Ni corresponding to Xi as in Section 2.1 and place a normal-inverse Wishart (NIW) prior on θi = (ηi, Σi), given by (ηi, Σi) NIWp(µ0, ν0, γ0, Ψ0) independently for i = 1, . . . , n. That is, we let

ηi | Σi N µ0, Σi

, Σi IWp(γ0, Ψ0),

with µ0 Rp, ν0 > 0, γ0 > p 1 and Ψ0 Rp p + ; for details about parametrization see Section I of the Appendix. Monte Carlo samples from the pseudo-posterior of f(x) can be obtained using Algorithm 1. The corresponding steps for the univariate case are provided in Section H of the Appendix. Although the pseudo-posterior distribution of f(x) lacks an analytic form, we can obtain a simple form for its pseudo-posterior mean by integrating over the pseudo-posterior distribution of (θi)n i=1 and π. Recall the deﬁnitions of µi and Ψi from Step 2 of Algorithm 1 and deﬁne Λi = {νn(γn p + 1)} 1(νn + 1) Ψi. Then the pseudo-posterior mean of f(x) is given by

i=1 tγn p+1(x; µi, Λi), (8)

Chattopadhyay, Chakraborty and Dunson

Step 1: For i = 1, . . . , n, compute the neighborhood Ni for data point Xi Rp

according to distance d( , ) with (k 1) nearest neighbors in X i = X (n) \ {Xi}.

Step 2: Update the parameters for neighborhood Ni to (µi, νn, γn, Ψi), where νn = ν0 + k, γn = γ0 + k,

ν0µ0 + k Xi , Xi = 1

j Ni Xj, and

Ψi = Ψ0 + X

j Ni (Xj Xi)(Xj Xi)T + kν0

νn ( Xi µ0)( Xi µ0)T.

Step 3: To compute the t-th Monte Carlo sample f(t)(x) of f(x), sample π(t) Dirichlet(α + 1, . . . , α + 1) and (η(t) i , Σ(t) i ) NIWp(µi, νn, γn, Ψi) independently for i = 1, . . . , n, and set

i=1 π(t) i φp x; η(t) i , Σ(t) i .

Algorithm 1: Nearest neighbor-Dirichlet mixture algorithm to obtain Monte Carlo samples from the pseudo-posterior of f(x) with Gaussian kernel and normal-inverse Wishart prior.

where tγ(x; µ, Λ) for x Rp is the p-dimensional Student s t-density with degrees of freedom γ > 0, location µ Rp and scale matrix Λ Rp p + . We proceed with using Gaussian kernels and NIW conjugate priors when implementing the NN-DM for the remainder of the paper.

2.3 Hyperparameter Choice

The hyperparameters in the prior for the neighborhood-speciﬁc parameters need to be chosen carefully we found results to be sensitive to γ0 and Ψ0. If non-informative values are chosen for these key hyperparameters, we tend to inherit typical problems of nearest neighbor estimators including lack of smoothness and high variance. Suppose Σ IWp(γ0, Ψ0) and for i, j = 1, . . . , p, let Σij and Ψ0, ij denote the i, jth entry of Σ and Ψ0, respectively. Then Σjj IG(γ /2, Ψ0, jj/2) where γ = γ0 p+1. For p = 1, the IWp(γ0, Ψ0) density simpliﬁes to an IG(γ0/2, γ0δ2 0/2) density with δ2 0 = Ψ0/γ0. Thus borrowing from the univariate case, we set Ψ0, jj = γ δ2 0 and Ψ0, ij = 0 for all i = j, which implies that Ψ0 = (γ δ2 0) Ip and we use leave-one-out cross-validation to select the optimum δ2 0. With p dimensional data, we recommend ﬁxing γ0 = p which implies a multivariate Cauchy prior predictive density. We choose the leave-one-out log-likelihood as the criterion function for cross-validation, which is closely related to minimizing the Kullback-Leibler divergence between the true and estimated density (Hall, 1987; Bowman, 1984). The explicit expression for the pseudo-posterior mean in (8) makes cross-validation computationally eﬃcient. The description of a fast implementation is provided in Section G of the Appendix.

Nearest Neighbor Dirichlet Mixtures

The proposed method has substantially faster runtime if one uses a default choice of hyperparameters. In particular, we found the default values µ0 = 0p, ν0 = 0.001, γ0 = p, and Ψ0 = Ip to work well across a number of simulation cases, especially when the true density is smooth. Although using cross-validation to estimate Ψ0 can lead to improved performance when the underlying density is spiky, cross-validation provides little to no gains for smooth true densities. Furthermore, with low sample size and increasing number of dimensions, we found this improvement to diminish rapidly. In order to obtain desirable uncertainty quantiﬁcation in simulations and applications, we found small values of α to work well. As a default value, we recommend using α = 0.001 for small samples and moderate dimensions.

The other key tuning parameter for NN-DM is the number of nearest neighbors k = kn. The pseudo-posterior mean in (8) reduces to a single tγn p+1 kernel if kn = n. In contrast, kn = 1 provides a sample smoothing kernel density estimate with a speciﬁc bandwidth function (Terrell and Scott, 1992). Therefore, the choice of k can impact the smoothness of the density estimate. To assess the sensitivity of the NN-DM estimate to the choice of k, we investigate how the out-of-sample log-likelihood of a test set changes with respect to k in Section 4.7. These simulations suggest that the proposed method is quite robust to the exact choice of k. In practice with ﬁnite samples and small dimensions, we recommend a default choice of kn = n1/3 + 1 and kn = 10 for univariate and multivariate cases, respectively. These values led to good performance across a wide variety of simulation cases as described in Section 4.

3.1 Asymptotic Properties

There is a rich literature on asymptotic properties of the posterior measure for an unknown density under Bayesian models, providing a frequentist justiﬁcation for Bayesian density estimation; refer, for example to Ghosal et al. (1999), Ghosal and van der Vaart (2007). Unfortunately, the tools developed in this literature rely critically on the mathematical properties of fully Bayes posteriors, providing theoretical guarantees for a computationally intractable exact posterior distribution under a Bayesian model. Our focus is instead on providing frequentist asymptotic guarantees for our computationally eﬃcient NN-DM approach, with this task made much more complex by the dependence across neighborhoods induced by the use of a nearest neighbor procedure.

We ﬁrst focus on proving pointwise consistency of the pseudo-posterior of f(x) induced by (7) for each x [0, 1]p, using Gaussian kernels as in Section 2.2. We separately study the mean and variance of the NN-DM pseudo-posterior distribution, ﬁrst showing that the pseudo-posterior mean in (8) is pointwise consistent and then that the pseudo-posterior variance vanishes asymptotically. The key idea behind our proof is to show that the pseudoposterior mean is asymptotically close to a kernel density estimator with suitably chosen bandwidth for ﬁxed p and kn at a desired rate. The proof then follows from standard arguments leading to consistency of kernel density estimators. The NN-DM pseudoposterior mean mimics a kernel density estimator only in the asymptotic regime; in ﬁnite sample simulation studies (refer to Section 4), NN-DM has much better performance. The detailed proofs of all results in this section are in the Appendix.

Chattopadhyay, Chakraborty and Dunson

Consider independent and identically distributed data X (n) from a ﬁxed unknown density f0 with respect to the Lebesgue measure on Rp equipped with the Euclidean metric, inducing the measure Pf0 on B(Rp). We use e E{f(x)}, f var{f(x)}, and epr{f(x) B} to denote the mean of f(x), variance of f(x), and probability of the event {f(x) B} for B B(Rp), respectively, under the pseudo-posterior distribution of f(x) implied by (7). We make the following regularity assumptions on f0:

Assumption 1 (Compact support) f0 is supported on [0, 1]p.

Assumption 2 (Bounded gradient) f0 is continuous on [0, 1]p with || f0(x)||2 L for all x [0, 1]p and some ﬁnite L > 0.

Assumption 3 (Bounded sup-norm) || log(f0)|| < .

Our asymptotic analysis relies on analyzing the behaviour of the pseudo-posterior updates within each nearest neighborhood. We leverage on key results from Biau and Devroye (2015); Evans et al. (2002) which are based on the assumption that the true density has compact support as in Assumption 1. Assumption 2 ensures that the kernel density estimator has ﬁnite expectation. Versions of this assumption are common in the kernel density literature; for example, refer to Tsybakov (2009). Assumptions 1 and 3 imply the existence of 0 < a1, a2 < such that 0 < a1 < f0(x) < a2 < for all x [0, 1]p, which is referred to as a positive density condition by Evans (2008); Evans et al. (2002). This is used to establish consistency of the proposed method and justify the choice of the pseudo-posterior distribution of the weights. These assumptions are standard in the literature studying frequentist asymptotic properties of nearest neighbor and Bayesian density estimators. For i = 1, . . . , n, recall the deﬁnitions of µi and Λi from (8):

νn Xi, Λi = νn + 1 νn(γn p + 1)Ψi,

where νn = ν0 + kn, γn = γ0 + kn, and Xi, Ψi are as in Algorithm 1. Deﬁne the bandwidth matrix

Hn = h2 n Ip, where h2 n = (νn + 1)(γ0 p + 1)

νn(γn p + 1) δ2 0. (9)

We have suppressed the dependence of µi and Λi on n for notational convenience. It is immediate that h2 n 0 if kn as n . Fix x [0, 1]p. To prove consistency of the pseudo-posterior mean, we ﬁrst show that ˆfn(x) and f K(x) = (1/n) Pn i=1 tγn p+1(x; Xi, Hn) are asymptotically close, that is we show that EPf0( | ˆfn(x) f K(x)| ) 0 as n . To obtain this result, we approximate µi by Xi and Λi by Hn using successive applications of the mean value theorem. Finally, we exploit the convergence of f K(x) to the true value f0(x) to obtain the consistency of ˆfn(x). The proof of convergence of f K(x) to f0(x) is provided in Section F of the Appendix. The precise statement regarding the consistency of the pseudo-posterior mean is given in the following theorem. Let a b denote the minimum of a and b.

Theorem 4 Fix x [0, 1]p. Let kn = o(ni0) with i0 = {2/(p2 + p + 2)} {4/(p + 2)2} such that kn as n , and ν0 = o{n 2/pk(2/p)+1 n }. Then, ˆfn(x) f0(x) in Pf0-probability as n .

Nearest Neighbor Dirichlet Mixtures

We now look at the pseudo-posterior variance of f(x). We let

Rn = Γ{(γn p + 2)/2}

Γ{(γn p + 1)/2}

νn + 2 4πνn(γn p + 2)

p/2 and Dn = (γn p + 1)(νn + 2)

2(γn p + 2)(νn + 1). (10)

For i = 1, . . . , n, let Bi = DnΛi and deﬁne

bfvar(x) = 1

i=1 tγn p+2(x; µi, Bi). (11)

As n , we have Dn 1/2. Analogous steps to the ones used in the proof of Theorem 4 can be used to imply that bfvar(x) f0(x) in Pf0-probability. Also, as n , k(p 1)/2 n Rn = O(1) using Stirling s approximation. We now provide an upper bound on the pseudoposterior variance of f(x) which shows convergence of the pseudo-posterior variance to 0.

Theorem 5 Let Hn be the bandwidth matrix deﬁned in (9). Let Rn, Dn be as in (10) and bfvar be as in (11). Under Assumptions 1-3 with x, kn, and ν0 as in Theorem 4, we have

f var{f(x)} Rn D p/2 n bfvar(x) |Hn|1/2

1 n(α + 1) + 1 + 1

This implies f var{f(x)} 0 in Pf0-probability as n .

Refer to Sections B and C in the Appendix for proofs of Theorems 4 and 5, respectively. Pointwise pseudo-posterior consistency follows from Theorems 4 and 5, as shown below.

Theorem 6 Let f0 satisfy Assumptions 1-3 with x, kn and ν0 as in Theorem 4. Fix ϵ > 0 and deﬁne the ϵ-ball around f0(x) by Uϵ = {y : |y f0(x)| ϵ}. Let epr{f(x) Uc ϵ } denote the probability of the set Uc ϵ under the pseudo-posterior distribution of f(x) as induced by (7). Then epr{f(x) Uc ϵ } 0 in Pf0-probability as n .

Proof Fix ϵ > 0 and consider the ϵ-ball Uϵ = {y : |y f0(x)| ϵ}. Then by Chebychev s inequality, we have epr{f(x) Uc ϵ } [{ ˆfn(x) f0(x)}2 + f var{f(x)}]/ϵ2 0 in Pf0-probability as n , using Theorems 4 and 5.

We next focus on the limiting distribution of f(x) for the univariate case. From Section H of the Appendix, the pseudo-posterior distribution of (ηi, σ2 i ) for i = 1, . . . , n is given by NIG(µi, νn, γn/2, γnδ2 i /2), where µi, νn, γn are as before and

γnδ2 i = γ0δ2 0 + X

j Ni (Xj Xi)2 + knν0

νn ( Xi µ0)2.

We establish in Theorem 7 that the limiting distribution of f(x) is a Gaussian distribution with appropriate centering and scaling. This allows interpretation of 100(1 β)% pseudocredible intervals as 100(1 β)% frequentist conﬁdence intervals on average for large n.

Chattopadhyay, Chakraborty and Dunson

Theorem 7 Fix x [0, 1]. Suppose f0 satisﬁes Assumptions 1-3 and also satisﬁes |f(4) 0 (x)| C0 for all x [0, 1] for some ﬁnite C0 > 0. Let kn satisfy kn = o(n2/7) such that n 2/9kn , hn be as in (9) satisfying hn 0, and α = αn , as n . For t R, deﬁne

Gn(t) = epr

f0(x) + h2 nf(2) 0 (x)

Then, we have

lim n EPf0{Gn(t)} = Φ t ; 0, f0(x)

where Φ(t ; 0, σ2) denotes the cumulative distribution function of the N(0, σ2) density.

For a proof of Theorem 7, we refer the reader to Section D of the Appendix.

3.2 Pseudo-Posterior Distribution of Weights

We investigate the rationale behind the pseudo-posterior update (6) of the weight π, which has a symmetric prior distribution π Dirichlet(α, . . . , α) as motivated in Section 1. As discussed in Section 1, the conditional update for the weights π in a ﬁnite Bayesian mixture model with m components given the cluster allocation indices {c1, . . . , cn} is obtained by Dirichlet(α + n1, . . . , α + nm), where α is the prior concentration parameter and nh = Pn i=1 I(ci = h) is the number of data points allocated to the hth cluster. This is not true in our case as the kn-nearest neighborhoods have considerable overlap between them. Instead, we consider the number of eﬀective member data points in each of these neighborhoods. Deﬁne the kn-nearest neighborhood of Xi to be the set Si = {Xj : d(Xi, Xj) d(Xi, Xi[kn])} where Xi[kn] is the kn-th nearest neighbor of Xi in the data X (n), following the notation in Section 2.1. We assume d( , ) is the Euclidean metric from here on, and let Ri = d(Xi, Xi[kn]) = ||Xi Xi[kn]||2 denote the distance of Xi from its kn-th nearest neighbor in X (n). Let Ni denote the number of eﬀective members in Si as deﬁned in Section 2.1. Then, we can express Ni as

u/ {i,j} {Xj / Su}

where I(A) is the indicator function of the set A. Under X1, . . . , Xn iid f0, we have

EPf0(N1) = 1 + (n 1)Pf0

u=3 {X2 / Su}

by symmetry. Furthermore, Ni are identically distributed for i = 1, . . . , n. We now state a result which provides a motivation for our choice of the pseudo-posterior update of π. For two sequences of real numbers (an) and (bn), we write an bn if |an/bn| c0 as n for some constant c0 > 0.

Nearest Neighbor Dirichlet Mixtures

Theorem 8 Suppose X1, . . . , Xn iid f0 with f0 satisfying Assumptions 1-3. Furthermore, suppose that kn ni0 ϵ for some ϵ (0, i0), where i0 is as deﬁned in Theorem 4. Then,

lim n n Pf0

u=3 {X2 / Su}

Proof of Theorem 8 is in Section E of the Appendix. The above theorem suggests we asymptotically have only one eﬀective member per neighborhood Si, namely the point Xi that itself generated this neighborhood. This result motivates our choice of the pseudoposterior update of the weight vector π. We illustrate uncertainty quantiﬁcation of the proposed method in ﬁnite samples in Section 4.4 with this choice of pseudo-posterior update of the weight vector π.

4. Simulation Experiments

4.1 Preliminaries

In this section, we compare the performance of the proposed density estimator with several other standard density estimators through several numerical experiments. We evaluate estimation performance based on the expected L1 distance (Devroye and Gyorﬁ, 1985). For the pair (f0, ˆf), where f0 is the true data generating density and ˆf is an estimator, the expected L1 distance is deﬁned as L1(f0, ˆf) = EPf0{ R |f0(x) ˆf(x)| dx}. We compute an estimate ˆL1(f0, ˆf) of L1(f0, ˆf) in two steps. First, we sample n training points X1, . . . , Xn f0 and obtain ˆf based on this sample, and then further sample nt independent test points Xn+1, . . . , Xn+nt f0 and compute

ˆf(Xn+i) f0(Xn+i) 1

In the second step, to approximate the expectation with respect to Pf0, the ﬁrst step is repeated R times. Letting ˆLr denote the estimate for the rth replicate, we compute the ﬁnal estimate as ˆL1(f0, ˆf) = (1/R) PR r=1 ˆLr. Then, it follows that ˆL1(f0, ˆf) L1(f0, ˆf) as nt, R , by the law of large numbers. In our experiments, we set nt = 500 and R = 20. We let 0p and 1p denote the vector with all entries equal to 0 and the vector with all entries equal to 1 in Rp, respectively, for p 1. All simulations were carried out using the R programming language (R Core Team, 2018). For Dirichlet process mixture models, we collect 2, 000 Markov chain Monte Carlo (MCMC) samples after discarding a burn-in of 3, 000 samples using the dirichletprocess package (J. Ross and Markwick, 2019). The default implementation of the Dirichlet process mixture model in p dimensions in the dirichletprocess package uses multivariate Gaussian kernels and has the base measure as NIWp(0p, p, p, Ip) with the Dirichlet concentration parameter having the Gamma(2, 4) prior (West, 1992). For the nearest neighbor-Dirichlet mixture, 1, 000 Monte Carlo samples are taken. For the kernel density estimator, we select the bandwidth by the default plug-in method hpi for univariate cases and Hpi for multivariate cases (Sheather and Jones, 1991; Wand and Jones, 1994) using the package ks (Duong, 2020). We additionally consider the k-nearest neighbor estimator studied in Mack

Chattopadhyay, Chakraborty and Dunson

and Rosenblatt (1979), setting the number of neighbors k = n1/2, and the variational Bayes (VB) approximation to Dirichlet process mixture models (Blei and Jordan, 2006). We also compare with the optional Polya tree (OPT) (Wong and Ma, 2010) using the package PTT. For univariate cases, we consider the recursive predictive density estimator (RD) from Hahn et al. (2018), Polya tree mixtures (PTM) using the package DPpackage (Jara et al., 2011), and the sample smoothing kernel density estimator (A-KDE) using the package quantreg. Lastly, we also compare with the local likelihood density estimator (LLDE) using the package locfit for both univariate and multivariate cases. Dirichlet process mixture model hyperparameter values are kept the same in both the MCMC and variational Bayes implementations, with the number of components of the variational family set to 10 for all cases. We denote the nearest neighbor-Dirichlet mixture, Dirichlet process mixture (DPM) implemented with MCMC, kernel density estimator, variational Bayes approximation to the DPM, and k-nearest neighbor density estimator by NN-DM, DP-MC, KDE, DP-VB, and KNN, respectively, in tables and ﬁgures.

4.2 Univariate Cases

We set n = 200, 500 with kn = n1/3 + 1. We consider 10 choices of f0 from the R package benchden (Mildenberger and Weinert, 2012); the speciﬁc choices are Cauchy (CA), claw (CW), double exponential (DE), Gaussian (GS), inverse exponential (IE), lognormal (LN), logistic (LO), skewed bimodal (SB), symmetric Pareto (SP), and sawtooth (ST) with default choices of the corresponding parameters. The prior hyperparameter choices for the proposed method are µ0 = 0, ν0 = 0.001, γ0 = 1; δ2 0 is chosen via the cross-validation method of Section 2.3. Detailed numerical results are deferred to Table 5 in the Appendix. Instead, in Figure 1, we provide a visual summary of the performance of each method under consideration by forming a box plot of the estimated L1 errors of the methods across all the data generating densities. Methods with lower median as indicated by the solid line of the box plot, and smaller overall spread are preferable as they provide higher accuracy and also maintain such accuracy across a collection of true density cases. Results of KNN are omitted in Figure 1 due to much higher values compared to other methods. For the KDE and RD estimator, the plot and the table exclude the results for the heavy-tailed densities CA, IE, and SP due to very high L1 errors.

Overall, a major advantage of the proposed method is its versatility among the considered methods. The Bayesian nonparametric methods DP-MC, DP-VB, PTM, OPT, and RD are often close to NN-DM in terms of their performance when the true densities are smooth and do not display locally spiky behavior. However, the NN-DM performs better than other methods in densities where such local behavior is present and performs very close to the best estimator for either the smooth heavy-tailed or thin-tailed densities. The KDE and RD perform well when data are generated from a smooth underlying density. However, there are some cases where the error for KDE and RD is very high. For instance, when n = 500 and f0 is the standard Cauchy (CA) density, the estimated L1 error for the KDE is 38501.85 and the algorithm for the RD estimate did not converge. Both the KDE and RD also perform poorly in very spiky multi-modal densities such as the ST. Compared to the LLDE and the A-KDE, the NN-DM displays similar performance in heavy-tailed and smooth densities when n = 200, with the NN-DM performing better for the spiky densities.

Nearest Neighbor Dirichlet Mixtures

A KDE DP MC DP VB KDE LLDE NN DM OPT PTM RD Method

A KDE DP MC DP VB KDE LLDE NN DM OPT PTM RD Method

Figure 1: Box plots of ˆL1(f0, ˆf) for the 10 diﬀerent choices of the true density f0 and diﬀerent estimators ˆf for univariate data. The box plots for KDE and RD exclude the heavy-tailed cases CA, IE, and SP.

However, when n = 500, the NN-DM shows signiﬁcant improvements over the LLDE and the A-KDE for spiky densities such as the CW and the ST. In Figure 2, we show the performance of the NN-DM estimator ˆfn (with hyperparameters chosen as described earlier) relative to the posterior mean under a DP-MC with default or hand-tuned hyperparameters, when 500 data points are generated from the sawtooth (ST) density. The Dirichlet process mixture with default hyperparameters is unable to detect the multiple spikes, merging adjacent modes to form larger clusters, perhaps due to inadequate mixing of the Markov chain Monte Carlo sampler or to the Gaussian kernels used in the mixture. As a result, we had to hand-tune the hyperparameters for the Dirichlet process mixture to obtain comparable performance with the NN-DM (without hand-tuning). We obtained the best results when changing the hyperparameters of the base measure of the DPMC to NIG(0, 0.01, 1, 1) while keeping the prior on α the same as before. This illustrates the deﬁciency of the DP-MC in estimating densities with spiky local behavior unless we hand-tune the hyperparameters, which requires knowledge of the true density. We also compare the performance of the two methods with a smoother test density in Figure 3, where the data are generated from a skewed bimodal (SB) distribution. Both the estimates are comparable, but the nearest neighbor-Dirichlet mixture provides better uncertainty quantiﬁcation. Similar results are obtained for n = 1000, and hence are omitted.

4.3 Multivariate Cases

For the multivariate cases, we consider n = 200 and 1000. The number of neighbors is set to k = 10 and the dimension p is chosen from {2, 3, 4, 6}. Recall the deﬁnition of φp(x; µ, Σ) from Section 2.2 and let Φ(x) be the cumulative distribution function of the standard Gaussian density. Let S0 = ρ 1p1T p + (1 ρ) Ip with ρ = 0.8. Let x = (x1, . . . , xp)T. We consider the following cases. (1) Mixture of Gaussians (MG): f0(x) = 0.4 φp(x; m1, S0) + 0.6 φp(x; m2, S0), where m1 =

Chattopadhyay, Chakraborty and Dunson

-10 -5 0 5 10 Points

-10 -5 0 5 10 Points

-10 -5 0 5 10 Points

Figure 2: Plot comparing density estimates for the NN-DM and DP-MC for n = 500 samples generated from the sawtooth (ST) density. Shaded regions correspond to 95% (pseudo) posterior credible intervals. The true density is displayed using dotted lines. The top panel shows the performance of DP-MC with default hyperparameters on the left and with hand-tuned hyperparameters on the right. The bottom panel shows the performance of the NN-DM.

Nearest Neighbor Dirichlet Mixtures

-2 0 2 Points

-2 0 2 Points

Figure 3: Similar to Figure 2, with data of sample size n = 500 generated from the skewed bimodal (SB) density. Left panel shows the DP-MC ﬁt and the right panel shows the NN-DM ﬁt.

2 1p, m2 = 2 1p. (2) Skew normal (SN): f0(x) = 2φp(x; m0, S0)Φ{s T 0W 1(x m0)} (Azzalini, 2005), where W is the diagonal matrix with diagonal entries W 2 ii = S0, ii for i = 1, . . . , p. We choose m0 = 0p and the skewness parameter vector s0 = 0.5 1p. (3) Multivariate t-distribution (T): f0(x) = td0(x; m , S0) is the density of the p-dimensional multivariate Student s t-distribution. We set d0 = 10 and m = 1p. (4) Mixture of multivariate skew t-distributions (MST): f0(x) = 0.25 td0(x; m1, S0, s0) + 0.75 td0(x; m2, S0, s0). Here, td( ; µ, S, s) is the skew t-density (Azzalini, 2005) with parameters d, µ, S, s, with d0, s0 deﬁned as before and m1, m2 the same as in the ﬁrst case. (5) Multivariate Cauchy (MVC): f0(x) {1 + (x µ )TS 1 0 (x µ )} where µ = 0p. (6) Multivariate Gamma (MVG): f0(x) cΦ(F1(x1), . . . , Fp(xp) | S0) Qp j=1 fj(xj; γj1, γj2) where fj and Fj denote the density and distribution function of the univariate gamma distribution with shape parameter γj1 and rate parameter γj2, respectively, for j = 1, . . . , p and cΦ( | Γ) is as described in Song (2000). This is a Gaussian copula based construction of the multivariate gamma distribution. We set γj1 = γj2 = 1 for j = 1, . . . , p. The hyperparameters for the nearest neighbor-Dirichlet mixture are chosen as µ0 = 0p, ν0 = 0.001, γ0 = p, and Ψ0 = {(γ0 p + 1)δ2 0}Ip = δ2 0 Ip, where the optimal δ2 0 is chosen via cross-validation as described in Section 2.3. Default hyperparameters as described in Section 4.1 are chosen for the MCMC and VB implementations of the DPM. Similar to the univariate case, we defer the numerical results to Table 6 in the Appendix and in Figure 4 display a visual summary consisting of box plot of estimated L1 errors over the densities considered. The proposed method is very robust against a wide selection of true distributions, with its L1 error scaling nicely with the dimension. The KDE shows a noticeably sharp decline in performance - when the dimension is changed from 2 to 6, the average increase in L1 error is by factors of about 5 and 7 for sample sizes 200 and 1000, respectively. This is possibly due to lack of adaptive density estimation in higher dimensions using a single bandwidth matrix, since data in Rp become increasingly sparse with increasing p. As in the univariate case, we had to exclude the MVC density for the KDE due to the algorithm not converging. The performances of NN-DM, DP-MC, and

Chattopadhyay, Chakraborty and Dunson

DP MC DP VB KDE LLDE NN DM OPT Method

n = 200, p = 2

DP MC DP VB KDE LLDE NN DM OPT Method

n = 1000, p = 2

DP MC DP VB KDE LLDE NN DM OPT Method

n = 200, p = 3

DP MC DP VB KDE LLDE NN DM OPT Method

n = 1000, p = 3

DP MC DP VB KDE LLDE NN DM OPT Method

n = 200, p = 4

DP MC DP VB KDE LLDE NN DM OPT Method

n = 1000, p = 4

DP MC DP VB KDE LLDE NN DM Method

n = 200, p = 6

DP MC DP VB KDE LLDE NN DM Method

n = 1000, p = 6

Figure 4: Box plots of ˆL1(f0, ˆf) for the 6 diﬀerent choices of the true density f0 and diﬀerent estimators ˆf for multivariate data. The box plots for KDE and LLDE exclude the MVC density. The box plots for p = 6 exclude results from OPT.

Nearest Neighbor Dirichlet Mixtures

DP-VB are quite competitive across densities, with NN-DM faring better than the DP-VB when estimating densities such as the MVC and the MVG. Furthermore, the NN-DM is hit the least signiﬁcantly by the curse of dimensionality out of the three. This is particularly prominent for the DP-MC when the true density is either MG or MST with n = 200 and p = 6, and for the DP-VB when the true density is MVC. It is also important to keep in mind that the NN-DM provides similar results compared to the DP-MC while being at least an order of magnitude faster, as illustrated in Section 4.6. The performance of the OPT is hit quite signiﬁcantly as the number of dimensions increases, along with the algorithm not converging for p = 6. The LLDE provides competitive results with the NNDM in lower dimensions. However, in higher dimensions, the LLDE often does not converge, indicating lack of stability of the algorithm. We reported the average of the replicates for which the algorithm did converge. The results suggest that the performance of the LLDE is also aﬀected quite drastically with increasing dimensions. When compared across all data generating cases considering the variation in densities, dimensions and sample sizes, the proposed method is seen to be more versatile than its competitors.

4.4 Accuracy of Uncertainty Quantiﬁcation

In this section, we assess frequentist coverage of 95% pseudo-posterior credible intervals for the NN-DM and compare with coverage based on the 95% posterior credible intervals obtained from DP-MC and DP-VB. Ghosal and van der Vaart (2017) recommend investigating the frequentist coverage of Bayesian credible intervals. We do not include frequentist coverage for Polya tree mixtures (PTMs) and the optional Polya tree (OPT) due to the lack of available code. We consider the cases p {1, 2} in our experiments with sample size n = 500. For each choice of density f0, we ﬁx nt = 200 test points Xt = {Xt1, . . . , Xtnt} generated from the density f0. With these ﬁxed test points, we generate n = 500 data points in our sample for Rcov = 200 times and check the coverage of posterior/pseudo-posterior credible intervals obtained from the three methods. We implement the DP-MC with base measure NIWp(0p, 0.01, p, Ip) and a Gamma(2, 4) prior on the concentration parameter as in West (1992). These choices of hyperparameters were seen to give better frequentist coverage results than using the default values used in Sections 4.2 and 4.3. Same choices of hyperparameters are maintained for DP-VB. For the NN-DM, we take k = 8 in the univariate case and k = 5 in the bivariate case, α = 0.001, and other hyperparameters chosen as before. We report the average coverage probability and average length of the (pseudo) credible intervals across all the points in the test data Xt in Tables 1 and 2 for the univariate and bivariate cases, respectively.

For univariate densities, both the DP-MC and DP-VB display severe under-coverage. In most of the cases, the DP-VB and NN-DM have similar width of (pseudo) credible intervals but the DP-VB displays dramatically lower coverage than the NN-DM. The under-coverage displayed by the DP-MC may be due to MCMC mixing issues. The NN-DM shows near nominal coverage in the smooth Gaussian (GS) and lognormal (LN) densities, while also attaining near nominal coverage in the skewed bimodal (SB), claw (CW), and sawtooth (ST) densities which are multi-modal. The shortcomings of DP-MC and DP-VB are especially noticeable when dealing with spiky densities such as the claw or sawtooth. For bivariate cases considered in Table 2 we see a similar trend; the NN-DM method provides uniformly

Chattopadhyay, Chakraborty and Dunson

Method CA CW DE GS IE

NN-DM 0.75 (0.05) 0.89 (0.21) 0.75 (0.06) 0.92 (0.08) 0.81 (0.11) DP-MC 0.48 (0.02) 0.06 (0.01) 0.35 (0.02) 0.37 (0.01) 0.39 (0.04) DP-VB 0.33 (0.05) 0.18 (0.07) 0.28 (0.07) 0.79 (0.05) 0.14 (0.04)

Method LN LO SB SP ST

NN-DM 0.92 (0.17) 0.81 (0.03) 0.88 (0.10) 0.72 (0.01) 0.91 (0.05) DP-MC 0.31 (0.05) 0.55 (0.03) 0.46 (0.03) 0.46 (0.01) 0.64 (0.03) DP-VB 0.19 (0.15) 0.10 (0.03) 0.40 (0.10) 0.20 (0.01) 0.07 (0.01)

Table 1: Comparison of the frequentist coverage of 95% (pseudo) posterior credible intervals of the nearest neighbor-Dirichlet mixture and the MCMC and variational implementations of the Dirichlet process mixture for univariate data. Average length of the intervals are also provided for each case within parentheses. Number of replications and sample size are Rcov = 200 and ncov = 500, respectively.

Method MG MST MVC MVG SN T

NN-DM 0.92 (0.04) 0.88 (0.03) 0.69 (0.03) 0.80 (0.31) 0.92 (0.06) 0.88 (0.03) DP-MC 0.53 (0.01) 0.56 (0.01) 0.47 (0.01) 0.41 (0.16) 0.39 (0.02) 0.55 (0.01) DP-VB 0.56 (0.03) 0.58 (0.03) 0.18 (0.02) 0.55 (0.26) 0.49 (0.05) 0.57 (0.02)

Table 2: Comparison of the frequentist coverage of 95% (pseudo) posterior credible intervals of the nearest neighbor-Dirichlet mixture and the MCMC and variational implementations of the Dirichlet process mixture for bivariate data. Average length of the intervals are also provided for each case within parentheses. Number of replications and sample size are Rcov = 200 and ncov = 500, respectively.

better uncertainty quantiﬁcation across all the densities considered. It is clear that in terms of frequentist uncertainty quantiﬁcation, the NN-DM displays vastly superior coverage to the DP-MC and the DP-VB without inﬂating the interval width.

4.5 Comparison for High Dimensional Data

In addition to the above experiments, we performed a simulation experiment for highdimensional data. Speciﬁcally, we set n = 1000, p = 50, and consider the same set of true densities in Section 4.3. We compared results from the proposed NN-DM method and the DP-VB. Due to severe computational time, we did not consider the DP-MC in this scenario. We also tried optional Polya trees (Wong and Ma, 2010) using the PTT package; however, the current implementation of the method breaks down in this high-dimensional setup. Due to numerical instability in estimating the L1 error in higher dimensions, we evaluate the methods in terms of their out-of-sample log-likelihood (OOSLL) instead (Gneiting and Raftery, 2007), on a test set of 500 data points. We report the average OOSLL over 30

Nearest Neighbor Dirichlet Mixtures

Method MG SN T MST MVC MVG

NN-DM 1.75 1.74 1.84 1.84 1.31 1.36 DP-VB 1.75 1.74 1.86 1.84 1.34 1.36

Table 3: Out-of-sample log-likelihood ( 104) of NN-DM and DP-VB on a test set of 500 points for 6 diﬀerent multivariate densities considered in Section 4.3, for n = 1000 and p = 50. Greater out-of-sample log-likelihood is better.

replications in Table 3. The results indicate that both methods perform very similarly in terms of out-of-sample ﬁt to the data, with the NN-DM outperforming the DP-VB when the true density is MVC. We also observed that for this experiment, the NN-DM methods with default choice of hyperparameters and with cross-validated choice of Ψ0 have almost identical performance. For the NN-DM, we set k = 12 after carrying out a sensitivity analysis on k by considering k = 5, 7, 10, 15, and 20. The best results for the NN-DM were obtained for k {7, 10, 12} with negligible diﬀerence in out-of-sample log-likelihoods between these three choices, with k = 12 performing the best.

4.6 Runtime Comparison

With n data points in p dimensions, the initial nearest neighbor allocation into n neighborhoods can be carried out in O(n log n) steps (Vaidya, 1986; Ma and Li, 2019). Once the neighborhoods are determined with kn points in each neighborhood, obtaining the neighborhood speciﬁc empirical means and covariance matrices has O(nknp+nknp2) = O(nknp2) complexity. Obtaining the pseudo-posterior mean (8) then requires inversion of n such p p matrices to evaluate the multivariate t-density, with a runtime of O(np3). Therefore, the total runtime to obtain the pseudo-posterior mean is of the order O(nknp2 + np3). When we are interested in uncertainty quantiﬁcation, we require Monte Carlo samples of the NNDM, which are independently drawn from its pseudo-posterior. This involves sampling the Dirichlet weights, the neighborhood speciﬁc unknown mean and covariance matrix parameters of the Gaussian kernel, and evaluating a Gaussian density for each neighborhood, as outlined in Algorithm 1. To obtain M Monte Carlo samples, the combined complexity of this step is thus O(Mn + Mnp3) = O(Mnp3). Overall the runtime complexity to obtain NN-DM samples is therefore O(Mnp3 + nknp2 + np3). For high dimensional scenarios, this runtime can be greatly improved by using a low rank matrix factorization of both the neighborhood speciﬁc empirical covariance matrices and the sampled covariance matrix parameters to make matrix inversion more eﬃcient (Golub and van Loan, 1996). We now provide a detailed simulation study of runtimes of the proposed method, with all the simulations carried out on an M1 Mac Book Pro with 16 GB of RAM. We ﬁrst focus on some runtime experiments comparing NN-DM and DP-MC. In our experiments, we focus on p = 1 and p = 4. The runtime for NN-DM consists of the time to estimate δ2 0 by cross-validation as in Section 2.3 and then drawing samples from its pseudoposterior. For both dimensions, the sample size is varied from n = 200 to n = 1500 in increments of 100. Data are generated from the standard Gaussian density (GS) for p = 1

Chattopadhyay, Chakraborty and Dunson

and from a mixture of skew t-distributions with the parameters as described for the case MST in Section 4.3 for p = 4. For p = 1, we evaluate the two methods at 500 test points, while for p = 4 we evaluate the methods at 200 test points. The hyperparameters are kept the same as in Sections 4.2 and 4.3. We took 1000 Monte Carlo samples for the NN-DM and 2500 MCMC samples for the DP-MC with a burn-in of 1500 samples. We provide a ﬁgure summarising the results in Figure 5. In the top panel of Figure 5, we plot the average of the logarithm (base 10) of the run times of each approach for 10 independent replications. Corresponding L1 errors of the two methods is included in the bottom panel of Figure 5.

In Figure 5, the NN-DM is at least an order of magnitude faster than DP-MC. The time saved becomes more pronounced in the multivariate case, where for sample size 1500 the NN-DM is 50 times faster. The gain in computing time does not come at the cost of accuracy as can be seen from the right panel; the proposed method maintains the same order of L1 error as the DP-MC in the univariate case and often outperforms the DP-MC in the multivariate case. We did not implement the Monte Carlo sampler for the proposed algorithm in parallel, but such a modiﬁcation would substantially improve runtime. Bypassing cross-validation and choosing default hyperparameters instead as outlined in Section 2.3, NN-DM took 3.3 seconds and 16.4 seconds when p = 1 and p = 4, respectively, with sample size n = 1500. In the same scenario, DP-MC took 99.4 seconds and 1618.1 seconds for p = 1 and p = 4, respectively. Thus the NN-DM with default hyperparameters is about 30 times faster when p = 1 and almost 100 times faster when p = 4.

We also compare the runtime of the proposed method with three recent implementations of the DPM, namely the packages bnpy (Hughes and Sudderth, 2014), DPMMSub Clusters (Dinari et al., 2019), and vdpgm (Kurihara et al., 2006) available for download at https:// kenichikurihara.com/variational-dirichlet-process-gaussian-mixture-model/. These three packages implement variational approximations of the DPM posterior with different modiﬁcations. We also include the DP-MC and OPT for comparison. All the runtime results are comparable only up to machine and coding language diﬀerences. Amongst the competitor package implementations, the NNDM and dirichletprocess packages are the only ones providing (pseudo) posterior samples of the density estimate at a test point. We consider the average runtime of R = 10 replicates to ﬁt a training data set of iid N(0, 1) entries with n = 1500 and p = 4. For the NN-DM, DP-MC, and OPT, we consider 1000 (pseudo) posterior samples. Table 4 provides the runtimes for the diﬀerent packages considered. Overall, the fastest implementation is observed for the PTT package. The next fastest implementations are the NNDM without cross-validation (CV), DPMMSub Clusters, and bnpy. The runtime for NNDM with CV closely follows the previous implementations, with both NNDM with and without CV providing (pseudo) posterior samples. The major improvement in runtime for NN-DM is mainly due to the fact that neighborhood allocations are ﬁxed here which is not the case for DP-MC.

4.7 Sensitivity to the Choice of k

In this subsection, we investigate the role of kn = k in ﬁnite samples for the proposed method. We consider n = 200 samples from the SP density in the univariate case and the MG density in the bivariate case. In each case, we ﬁx a test set of nt = 500 points, and evaluate the out-of-sample log-likelihood (OOSLL) of the test points for 20 diﬀerent integer

Nearest Neighbor Dirichlet Mixtures

400 800 1200 Sample size

Log runtime in seconds

DP MC NN DM

400 800 1200 Sample size

Log runtime in seconds

DP MC NN DM

400 800 1200 Sample size

DP MC NN DM

400 800 1200 Sample size

DP MC NN DM

Figure 5: Runtime comparison of DP-MC and NN-DM in univariate case and for 4dimensional data. Top panel shows runtimes in log10 scale whereas bottom panel shows corresponding L1 error. Sample size n is varied from 200 to 1500 in increments of 100.

Package (Language) Average Runtime (s) Provides Samples? bnpy (Python) 5.79 No DPMMSub Clusters (Julia) 4.33 No vdpgm (MATLAB) 58.38 No NNDM (RCpp and R, with CV) 18.96 Yes NNDM (Rcpp and R, without CV) 3.52 Yes dirichletprocess (R) 1068.48 Yes PTT (Rcpp and R) 0.59 No

Table 4: Table comparing the average runtimes of diﬀerent packages for n = 1500, p = 4. NN-DM runtimes are provided both with and without cross-validation (CV), using the package NNDM developed by the authors.

values of k ranging from 2 to 50. Finally, we report results averaged from 10 independent replicates of this setup. We note that for each considered value of k, the parameter δ2 0 was estimated using leave-one-out cross-validation. Figure 6 shows how the OOSLL averaged over replicates changes as a function of k for each density considered. The original OOSLL values of the test data points were scaled by the number of test points nt = 500 for better representability.

Chattopadhyay, Chakraborty and Dunson

0 10 20 30 40 50 Values of k

Average out of sample log likelihood

Choice of k for p = 1

0 10 20 30 40 50 Values of k

Average out of sample log likelihood

Choice of k for p = 2

Figure 6: Average out-of-sample log-likelihood of 500 test points for the NN-DM as a function of k for one-dimensional and two-dimensional data. Number of samples and number of replications are n = 200 and R = 10, respectively.

For the univariate SP density, the optimal value of k which maximizes the average OOSLL is bk = 9. This is close to the choice of k = 6 as taken in Section 4.2. For the bivariate MG density, we observe that the choice of k maximizing the OOSLL is bk = 12, which is also close to the choice of k = 10 as taken in Section 4.3. For both the univariate and the bivariate case, the out-of-sample log-likelihood of the test set shows little variation with changing k. This indicates that the estimates obtained from the proposed method are quite robust to the particular choice of k.

5. Application

We apply the proposed density estimator to binary classiﬁcation. Consider data D = {(Xi, Yi) : i = 1, . . . , n}, where Xi Rp are p-dimensional feature vectors and Yi {0, 1} are binary class labels. To predict the probability that y0 = 1 for a test point x0, we use Bayes rule:

pr(y0 = 1 | x0) = f1(x0) pr(y0 = 1) f0(x0) pr(y0 = 0) + f1(x0) pr(y0 = 1) , (16)

where fj(x0) is the feature density at x0 in class j and pr(y0 = j) is the marginal probability of class j, for j = 0, 1. Based on nt test data, we let bpr(y0 = 1) = (1/nt) Pnt i=1 Yi, with bpr(y0 = 0) = 1 bpr(y0 = 1). We use either the NN-DM pseudo-posterior mean ˆfn( ), the DP-MC posterior mean ˆf DP( ), or the DP-VB posterior mean ˆf VB( ) for estimating the within class densities, and compare their classiﬁcation performances in terms of sensitivity, speciﬁcity, and probabilistic calibration. We omit the KDE as to the best of our knowledge, no routine R implementation is available for data having more than 6 dimensions. The high time resolution universe survey data (Keith et al., 2010) contain information on sampled pulsar stars. Pulsar stars are a type of neutron stars and their radio emissions are detectable from the Earth. These stars have gained considerable interest from the scientiﬁc community due to their several applications (Lorimer and Kramer, 2012). The data are publicly available from the University of California at Irvine machine learning repository.

Nearest Neighbor Dirichlet Mixtures

500 1000 1500 Sample size

Sensitivity

DP MC DP VB NN DM

500 1000 1500 Sample size

Specificity

DP MC DP VB NN DM

Figure 7: Sensitivity and speciﬁcity of the NN-DM, DP-MC, and DP-VB for the high time resolution universe survey data.

Stars are classiﬁed into pulsar and non-pulsar groups according to 8 attributes (Lyon, 2016). There are a total of 17898 instances of stars, among which 1639 are classiﬁed as pulsar stars. We create a test data set of 200 stars, among which 23 are pulsar stars. The training size is then varied from 300 to 1800 in increments of 300, each time adding 300 training points by randomly sampling from the entire data leaving out the initial test set. In Figure 7, we plot the sensitivity and speciﬁcity of the three methods in consideration. All the methods exhibit similar sensitivity across various training sizes; the DP-MC has marginally better speciﬁcity for training sizes 1200 and 1500, while the NN-DM has better speciﬁcity for training sizes 300 and 600. Both the NN-DM and the DP-MC exhibit higher speciﬁcity and sensitivity than the DP-VB across all training sample sizes considered. We also compare the methods using the Brier score, a proper scoring rule (Gneiting and Raftery, 2007) for probabilistic classiﬁcation. Suppose for nt test points and the ith Monte Carlo sample, pi denotes the sampled nt 1 probability vector for a generic method. We compute the normalized Brier score for the ith sample as (1/nt) ||pi Yt||2 2, where Yt is the vector of class labels in the test set. Then with T samples of pi, i = 1, . . . , T, we compute the mean Brier score for the three methods considered. The mean Brier score for each training size is shown in the right panel of Figure 8, which naturally shows a declining trend with increasing training size. There is little to choose between the three classiﬁers in terms of mean Brier score; the proposed method fairs equally well in terms of calibration of estimated test set probabilities with the MCMC implementation of the Dirichlet process. In the left panel of Figure 8, the receiver operating characteristic curve of the methods is shown for 1800 training samples. The area under the curve (AUC) for the NN-DM, the DP-MC and the DP-VB are 0.96, 0.95 and 0.96, respectively. For 1800 training samples, the computation time for the proposed method is about 13 minutes while for the DP-MC it is approximately 5 hours. Hence, the proposed method is much faster, even without exploiting parallel computation. We also ﬁtted the proposed method using the training set of all 17698 points; DP-MC was too slow in this case. The sensitivity and speciﬁcity of the proposed method

Chattopadhyay, Chakraborty and Dunson

0.00 0.25 0.50 0.75 1.00 1 Specificity

Sensitivity

DP MC DP VB NN DM

NN DM AUC = 0.96, DP MC AUC = 0.95, DP VB AUC = 0.96

500 1000 1500 Sample size

Mean Brier score

DP MC DP VB NN DM

Figure 8: Left plot shows the receiver operating characteristic curve of the NN-DM, DPMC, and DP-VB with 1800 training samples. Area under the curve is abbreviated as AUC. Right plot shows normalized Brier scores for the methods with varying training sample size.

50 100 150 Sample size

Out of sample log likelihood

DP MC DP VB NN DM

Pulsar stars

400 800 1200 1600 Sample size

Out of sample log likelihood

DP MC DP VB NN DM

Non pulsar stars

Figure 9: Left and right plots show the out-of-sample log-likelihoods of NN-DM, DP-MC, and DP-VB for the two diﬀerent star types.

increased to 0.99 and 0.91, respectively. We additionally evaluated the methods in terms of the out-of-sample log-likelihood. The results are displayed in Figure 9. While the methods perform comparably in terms of their classiﬁcation performance, NN-DM achieves a better ﬁt overall, especially for the signiﬁcantly less prevalent pulsar star type.

6. Discussion

The proposed nearest neighbor-Dirichlet mixture provides a useful alternative to Bayesian density estimation based on Dirichlet mixtures with much faster computational speed and stability in avoiding MCMC. MCMC can have very poor performance in mixture models and other multimodal cases, due to diﬃculty in mixing, and hence can lead to posterior

Nearest Neighbor Dirichlet Mixtures

inferences that are unreliable. There is a recent literature attempting to scale up MCMCbased analyses in model-based clustering contexts including for Dirichlet process mixtures; refer, for example to Song et al. (2020) and Ni et al. (2020). However, these approaches are complex to implement and are primarily focused on the problem of clustering, while we are instead focused on ﬂexible modeling of unknown densities. The main conceptual disadvantage of the proposed approach is the lack of a coherent Bayesian posterior updating rule. However, we have shown that nonetheless the resulting pseudo-posterior can have appealing behavior in terms of frequentist asymptotic properties, ﬁnite sample performance, and accuracy in uncertainty quantiﬁcation. In addition, it is important to keep in mind that Bayesian kernel mixtures have key disadvantages that are diﬃcult to remove within a fully coherent Bayesian modeling framework. These include a strong sensitivity to the choice of kernel and prior on the weights on these kernels; refer, for example to Miller and Dunson (2019). There are several important next steps. The ﬁrst is to develop fast and robust algorithms for using the nearest neighbor-Dirichlet mixture not just for density estimation but also as a component of more complex hierarchical models. For example, one may want to model the residual density in regression nonparametrically or treat a random eﬀects distribution as unknown. In such settings, one can potentially update other parameters within a Bayesian model using Markov chain Monte Carlo, while using algorithms related to those proposed in this article to update the nonparametric part conditionally on these other parameters.

Acknowledgments

R package NNDM available at https://github.com/shounakchattopadhyay/NN-DM was used for the numerical experiments. This research was partially supported by grants R01ES027498 and R01ES028804 of the United States National Institutes of Health and grants N00014-161-2147 and N00014-21-1-2510 of the United States Oﬃce of Naval Research. The authors would also like to thank the three anonymous reviewers, whose feedback led to a substantial improvement of the paper.

Chattopadhyay, Chakraborty and Dunson

Appendix A. Prerequisites

We ﬁrst introduce some notation with accompanying technical details which will be used hereafter. We denote the Frobenius norm and determinant of A Rp p by ||A||F = {tr(ATA)}1/2 and |A|, respectively. For v Rp, one has ||vv T||F = ||v||2 2 where ||a||2 = (a Ta)1/2 is the Euclidean norm of a. For two symmetric matrices A, B Rp p, we say that A B if A B is positive semi-deﬁnite, that is x T(A B)x 0 for all x Rp, x = 0p where 0p = (0, . . . , 0)T. For a real symmetric matrix A , let the eigenvalues of A be e1(A ), . . . , ep(A ), arranged such that e1(A ) . . . ep(A ). If A B, then it follows by the min-max theorem (Teschl, 2009) that for each j = 1, . . . , p, we have ej(A) ej(B). In particular, we have |A| |B| and ||A||F ||B||F .

Now consider a true data generating density X1, . . . , Xn iid f0 satisfying Assumptions 1-3 as in Section 3.1. Let X (n) = (X1, . . . , Xn) and suppose f0 induces the measure Pf0 on the Borel σ-ﬁeld on Rp, denoted by B(Rp). We form the k-nearest neighborhood of Xi using the Euclidean norm for i = 1, . . . , n. We also let k depend on n and express this dependence as kn when required. However, we routinely drop this dependence for notational simplicity. For Xi, let Qi be its kth nearest neighbor in X (n) (for k = 1, Qi = Xi) and let Ri be the distance between Xi and Qi, given by Ri = ||Xi Qi||2. Deﬁne the ball

Bi = {y [0, 1]p : 0 < ||y Xi||2 < Ri}

and the probability

G(Xi, Ri) = Z

Bi f0(u) du

of the ball Bi under Pf0. Let Y (i) 1 = Xi and Y (i) 2 , . . . , Y (i) k 1 denote the rest of the interior points in Bi. Let the mean Xi and covariance matrix Si of the ith neighborhood be

j=1 Y (i) j + Qi

j=1 (Y (i) j Xi)(Y (i) j Xi)T + (Qi Xi)(Qi Xi)T

We observe that (Y (i) 2 , . . . , Y (i) k 1, Qi) is identically distributed for i = 1, . . . , n since X1, . . . , Xn are independent and identically distributed. Thus we only consider the case i = 1 from here on. For sake of brevity, denote Y (1) u by Yu for u = 2, . . . , k 1 and Q1 by Q. Conditional on X1 = x1 [0, 1]p and R1 = r1 > 0, following Mack and Rosenblatt (1979), the conditional joint density of Y2, . . . , Yk 1 and Q is

p(y2, . . . , yk 1, q | x1, r1) =

f0(yj) G(x1, r1)I (yj B1)

f0(q) G (x1, r1)I (||q x1|| = r1) ,

Nearest Neighbor Dirichlet Mixtures

where G (x1, r1) = G(x1, r1)/ r1 and I(A) denotes the indicator function of the event A B(Rp). Thus conditional on X1 and R1, the random variables Y2, . . . , Yk 1 are independent and identically distributed, and independent of Q. Let the function ρ(x1, r1) = rκ1 1 where κ1 is a non-negative integer. This function can be identiﬁed with φ( ) in equation (11) of Mack and Rosenblatt (1979). In the results that follow, we will require the expectation of ρ(x1, r1) under Pf0 for diﬀerent choices of κ1. To that end, we shall repeatedly make use of the equation (12) from Mack and Rosenblatt (1979) adapted to our setting:

EPf0{Rκ1 1 | X1 = x1} =

(n 1)! (k 2)!(n k)!

( t Cpf0(x1)

κ1/p + o(tκ1/p)

tk 2(1 t)n kdt. (17)

Finally, we let e E and f var denote the expectation and variance, respectively, of the NNDM estimator f(x) under the pseudo-posterior density eΠ, described in (7). Conditioning notation under eΠ is as usual; for example, the conditional expectation

e E{f(x) | π1, . . . , πn} =

i=1 πi e E{φp(x; ηi, Σi)},

where the expectation e E{φp(x; ηi, Σi)} is with respect to the pseudo-posterior density of (ηi, Σi) as described in Section 2.2.

Appendix B. Proof of Theorem 4

Suppose X1, . . . , Xn are independent and identically distributed random variables generated from the density f0 supported on [0, 1]p satisfying Assumptions 1-3. For i = 1, . . . , n, recall the deﬁnitions of µi and Λi from (8):

νn Xi, Λi = νn + 1 νn(γn p + 1)Ψi.

We want to show that ˆfn(x) = (1/n) Pn i=1 tγn p+1(x; µi, Λi) f0(x) in Pf0-probability as n for any x [0, 1]p, where ˆfn(x) is as described in (8). We ﬁrst prove two propositions involving successive mean value theorem type approximations to ˆfn(x), which will imply the ﬁnal result. We now state the two propositions, with accompanying proofs, before stating the ﬁnal theorem.

Proposition 9 Fix x [0, 1]p. Let f A(x) = (1/n) Pn i=1 tγn p+1(x; Xi, Λi). Also, let k = o(ni1) with i1 = 2/(p2 + p + 2) and ν0 = o(n 1/pk(1/p)+1). Then, we have EPf0( | ˆfn(x) f A(x)| ) 0 as n .

Proof Since the (Λi)n i=1 are identically distributed and (µi)n i=1 are identically distributed, we have EPf0( | ˆfn(x) f A(x)| ) EPf0{ |tγn p+1(x; µ1, Λ1) tγn p+1(x; X1, Λ1)| }. The multivariate mean value theorem now implies that

EPf0( | ˆfn(x) f A(x)| ) EPf0

n |Λ1| 1/2 || tγn p+1(ξ; 0p, Ip)||2 ||Λ 1/2 1 (X1 µ1)||2 o , (18)

Chattopadhyay, Chakraborty and Dunson

where tγn p+1(ξ; 0p, Ip) = [ tγn p+1(x; 0p, Ip)/ x]ξ for some ξ in the convex hull of Λ 1/2 1 (x X1) and Λ 1/2 1 (x µ1). Using standard results and the min-max theorem, we have

||Λ 1/2 1 (X1 µ1)||2 ||Λ 1/2 1 ||F ||X1 µ1||2.

If we let Hn = H = {νn(γn p + 1)} 1(νn + 1)Ψ0 = h2Ip where h2 = h2 n = {νn(γn p + 1)} 1{(νn+1)(γ0 p+1)} δ2 0 following the choice of Ψ0 from Section 2.3, then it is clear that Λ1 H. Therefore, we have ||Λ 1/2 1 (X1 µ1)||2 ||H 1/2||F ||X1 µ1||2. Straightforward calculations show that ||H 1/2 ||F = h 1p1/2 and ||X1 µ1||2 R1 + {ν 1 n (1 + ||µ0||2 )ν0} where R1 = ||X1 X1[k]||2. Using Theorem 2.4 from Biau and Devroye (2015) for p 2 and (17) for p = 1, one gets

EPf0(R2 1) d2 p

for an appropriate constant dp > 0. Thus, we have EPf0(R1) {EPf0(R2 1)}1/2 dp(k/n)1/p

for suﬃciently large n. This implies that

E(||X1 µ1||2) dp

We also have |Λ1| 1/2 |H| 1/2 = h p. Finally, simple calculations yield that

|| tγn p+1(ξ; 0p, Ip)||2 L1,n,p

where L1,n,p > 0 satisﬁes L1,n,p (2π) p/2e 1/2 as n . Plugging all these back in (18), we obtain a ﬁnite constant L2,n,p > 0 such that

EPf0( | ˆfn(x) f A(x)| ) L2,n,p(n i1k)(p2+p+2)/(2p) + o{(n i1k)(p2+p+2)/(2p)}, (21)

which goes to 0 as n , completing the proof.

We now provide the second mean value theorem type approximation which approximates the random bandwidth matrix Λi in f A(x) by H = Hn for each i = 1, . . . , n.

Proposition 10 Fix x [0, 1]p. Let f K(x) = (1/n) Pn i=1 tγn p+1(x; Xi, H). Also, let k = o(ni2) with i2 = 4/(p + 2)2 and ν0 = o{n 2/pk(2/p)+1}. Then, we have EPf0( |f A(x) f K(x)| ) 0 as n .

Proof Using the identically distributed properties of (Λi)n i=1 and (Xi)n i=1, we obtain EPf0( |f A(x) f K(x)| ) EPf0( |tγn p+1(x; X1, Λ1) tγn p+1(x; X1, H)| ). Using the multivariate mean value theorem, we obtain that

EPf0 ( |tγn p+1(x; X1, Λ1) tγn p+1(x; X1, H)| ) EPf0( ||M1||F ||Λ1 H||F ), (22)

where M1 = [ {tγn p+1(x; X1, Σ)}/ Σ]Σ0 for some Σ0, with Σ0 in the convex hull of Λ1 and H. Since Λ1 H, we immediately have Σ0 H as well. Using the deﬁnitions of Λ1 and H, we have

||Λ1 H||F (νn + 1) νn(γn p + 1)

j N1 (Xj X1)(Xj X1)T

Nearest Neighbor Dirichlet Mixtures

Since || P j N1(Xj X1)(Xj X1)T||F P j N1 ||(Xj X1)(Xj X1)T||F = P j N1 ||Xj X1||2 2 P j N1 R2 1 = k R2 1, we get for suﬃciently large n the following:

EPf0( ||Λ1 H||F ) EPf0(R2 1) + o k

using (19) and ν0 = o n 2/pk(2/p)+1 . Taking partial derivatives of log{tγn p+1(x; X1, Σ)} with respect to Σ evaluated at Σ0 and taking Frobenius norm of both sides, we obtain

||t 1 γn p+1(x; X1, Σ0) M1||F h 2(γn + 1)

for suﬃciently large n. We now observe that

tγn p+1(x; X1, Σ0) cp,γn p+1|Σ0| 1/2 cp,γn p+1|H| 1/2 = h pcp,γn p+1,

where cp,β = (πβ) p/2{Γ(β/2)} 1Γ{(β + p)/2} for p 1, β > 0. Note that cp,β (2π) p/2

as β for any p 1. This immediately implies that ||M1||F h (p+2)cp,γn p+1(γn +1) for suﬃciently large n. Plugging all these back in equation (22), we obtain for suﬃciently large n, a ﬁnite L3,n,p > 0 such that

EPf0( |f A(x) f K(x)| ) L3,n,p(n i2k)(p+2)2/(2p) + o{(n i2k)(p+2)2/(2p)}, (25)

which goes to 0 as n , proving the proposition.

We now prove Theorem 4.

Proof [Theorem 4] EPf0( | ˆfn(x) f K(x)| ) EPf0( | ˆfn(x) f A(x)| )+EPf0( |f A(x) f K(x)| ) by the triangle inequality. Using Propositions 9 and 10, we obtain that EPf0( | ˆfn(x) f K(x)| ) 0 as n . From Section F of the Appendix, we obtain f K(x) f0(x) in Pf0-probability. This immediately implies that given the conditions on k, ν0, and for any x [0, 1]p, we have ˆfn(x) f0(x) in Pf0-probability.

Appendix C. Proof of Theorem 5

Proof Fix x [0, 1]p. For i = 1, . . . , n, let zi = φp(x ; ηi, Σi) and suppose z(n) = (z1, . . . , zn)T. Then, we have f(x) = Pn i=1 πizi = z(n)Tπ(n) where π(n) = (π1, . . . , πn)T. We begin with the identity

f var{f(x)} = f var[ e E{f(x) | z(n)}] + e E[ f var{f(x) | z(n)}]. (26)

Chattopadhyay, Chakraborty and Dunson

We start with the ﬁrst term on the right hand side of (26). Observe that z1, . . . , zn are independent under eΠ and e E(πi) = 1/n for i = 1, . . . , n. Thus, we have

f var[ e E{f(x) | z(n)}] = f var

i=1 f var(zi)

i=1 e E(z2 i )

i=1 Rn|Bi| 1/2tγn p+2(x; µi, Bi),

since for i = 1, . . . , n, we have

e E(z2 i ) = Rn|Bi| 1/2tγn p+2(x; µi, Bi), (27)

Rn = Γ{(γn p + 2)/2}

Γ{(γn p + 1)/2}

νn + 2 4πνn(γn p + 2)

p/2 , Bi = DnΛi,

and Dn = {2(γn p+2)(νn+1)} 1(γn p+1)(νn+2). To obtain (27), we integrate over the pseudo-posterior distribution of (ηi, Σi)n i=1, namely NIW(µi, νn, γn, Ψi). For i = 1, . . . , n, since |Λi| |Hn|, we have |Bi| Dp n|Hn|. Letting bfvar(x) = (1/n) Pn i=1 tγn p+2(x; µi, Bi), we have

f var[ e E{f(x) | z(n)}] Rn D p/2 n bfvar(x) n|Hn|1/2 . (28)

We now analyze the second term on the right hand side of (26). Recall that π(n) is independent of z(n) under eΠ. Let Σπ denote the pseudo-posterior covariance matrix of π(n). Standard results yield Σπ = Vn{(1 Cn)In+Cn1n1T n}, where Vn = (n 1)/[n2{n(α+1)+1}], and Cn = 1/(n 1). Then, we have

e E[ f var{f(x) | z(n)}] = e E[z(n)T Σπ z(n)]. (29)

Using the expression for Σπ along with (29), we obtain,

e E[ f var{f(x) | z(n)}] = 1 n(α + 1) + 1 e E

i=1 (zi z)2 )

where z = (1/n) Pn i=1 zi. We now have

e E[ f var{f(x) | z(n)}] = 1 n{n(α + 1) + 1}

i=1 e E(z2 i ) n e E( z2)

1 n{n(α + 1) + 1}

i=1 e E(z2 i )

= 1 n{n(α + 1) + 1}

i=1 Rn|Bi| 1/2tγn p+2(x; µi, Bi),

Nearest Neighbor Dirichlet Mixtures

using (27). Using |Bi| Dp n|H| for i = 1, . . . , n as before, we have

e E[ f var{f(x) | z(n)}] Rn D p/2 n bfvar(x) {n(α + 1) + 1}|Hn|1/2 . (31)

Combining (28) and (31) and putting the results back in (26), we have the desired result. If we let n , we immediately obtain that f var{f(x)} 0 in Pf0-probability.

Appendix D. Proof of Theorem 7

Proof We have iid data X (n) = (X1, . . . , Xn) such that X1, . . . , Xn iid f0, with f0 satisfying Assumptions 1-3 for p = 1. Given the NN-DM estimator f(x) = Pn i=1 πiφ(x; ηi, σ2 i ), we deﬁne the simpliﬁed NN-DM density estimator to be

i=1 φ(x; ηi, σ2 i ),

The simpliﬁed estimator g(x) can be interpreted as a version of f(x) with the Dirichlet weights being replaced by their pseudo-posterior mean. That is, g(x) = e E{f(x) | (η1, σ2 1), . . . , (ηn, σ2 n)}. The pseudo-posterior distribution of g(x) is induced through the pseudo-posterior distributions of {(ηi, σ2 i )}n i=1. The pseudo-posterior mean is of the form

where λi = {(νn + 1)/νn}1/2δi. Let hn = (νnγn) 1/2(νn + 1)1/2(γ0δ2 0)1/2. Then

(nhn)1/2 EPf0| ˆfn(x) f K(x)| 0, (32)

for kn = o(n2/7) and kn as n from Section B of the Appendix, where

f K(x) = 1 nhn

We want to investigate the asymptotic distribution of f(x) as n . For that, we ﬁrst investigate the asymptotic distribution of the simpliﬁed NN-DM estimator g(x), and then show that f(x) and g(x) are asymptotically close in Pf0-probability. To derive the asymptotic distribution of g(x), we begin with the asymptotic distribution of f K(x), which can be expressed as f K(x) = n 1 Pn i=1 uin, where uin = h 1 n tγn{(x Xi)/hn}. Using Lyapunov s central limit theorem and denoting convergence in distribution under f0 by d0, we have f K(x) EPf0{f K(x)}

[var Pf0{f K(x)}]1/2 d0 N(0, 1)

if (Pn i=1 ρin)1/r

(Pn i=1 τ 2 in)1/2 0, as n , (33)

Chattopadhyay, Chakraborty and Dunson

for some r > 2, where ρin = E|uin E(uin)|r and τ 2 in = E{uin E(uin)}2 for i = 1, . . . , n. By standard calculations, we have

τ 2 in = f0(x)

Z t2 γn(u)du + o 1

Z t3 γn(u)du + o 1

It is straightforward to see that R tr γn(u)du/ R tγn(u)du = O(1) for any r 1. So, Lyapunov s condition is satisﬁed as the ratio in this case satisﬁes O{(nhn) 1/6} and nhn . Additionally, |τ 2 in {f0(x)/hn} R φ2(u) du| 0. So by a combination of Lyapunov s central limit theorem and Slutsky s theorem, we have

(nhn)1/2 h f K(x) EPf0{f K(x)} i d0 N 0, f0(x)

since R φ2(u) du = (2π1/2) 1. From the calculations in Section F of the Appendix, we can expand the Taylor series to two more terms to obtain

f K(x) f0(x) h2 nf 0 (x) 2

since |f(4) 0 (x)| C0 for all x [0, 1]. Thus,

f0(x) + h2 nf(2) 0 (x)

d0 N 0, f0(x)

provided n 2/9kn as n , implying (nhn)1/2h4 n 0. We now argue that (nhn)1/2|g(x) ˆfn(x)| 0 in Pf0-probability. For this, we ﬁrst look at

h nhn{g(x) ˆfn(x)}2i = nhn EPf0

h e E n (g(x) ˆfn(x))2oi

= nhn EPf0 [ f var{g(x)}] ,

since e E{g(x)} = ˆfn(x). The pseudo-posterior variance of g(x) is given by

f var{g(x)} = 1

i=1 f var{Zi(x)},

where Zi(x) = φ(x; ηi, σ2 i ) for i = 1, . . . , n. It is straightforward to show that

f var{Zi(x)} | n| 1

eλ2 i t2γn+1

Nearest Neighbor Dirichlet Mixtures

as n , where eλ2 i = λ2 i /2 for i = 1, . . . , n and

n = u2 γn u2γn+1 1 (2π)1/2 ,

with ud = Γ{(d + 1)/2}/{(dπ)1/2Γ(d/2)} being the normalizing constant of the Student s t-density with degrees of freedom d > 0. Using Stirling s approximation, n 0 as n . This immediately implies nhn f var{g(x)} | n| vg(x),

1 eλi t2γn+1

Using the techniques of Section B of the Appendix, it can be shown that EPf0{vg(x)} f0(x) as n . Therefore, we have

h nhn{g(x) ˆfn(x)}2i = nhn EPf0

h e E n (g(x) ˆfn(x))2oi

= nhn EPf0 [ f var {g(x)}]

EPf0{| n|vg(x)}

as n . A simple application of Chebychev s inequality implies (nhn)1/2|g(x) ˆfn(x)| 0 in Pf0-probability as n . Combining this with (32) and (35) and using Slutsky s theorem, we obtain the desired result for g(x). We now demonstrate that f(x) and g(x) are asymptotically close to derive the same result for f(x). We start out with

h (nhn)1/2{f(x) g(x)} i = nhn EPf0 [ f var {f(x) g(x)}]

We now focus on the term inside EPf0 in (37) and further decompose it as

Zi(x) | π(n) )#

Zi(x) | π(n) )#

where π(n) = (π1, . . . , πn)T. In (38), let Ξ1n and Ξ2n be as follows:

Zi(x) | π(n) )#

Chattopadhyay, Chakraborty and Dunson

Ξ2n = f var

Zi(x) | π(n) )#

Thus, we can write (37) as

h (nhn)1/2{f(x) g(x)} i = nhn EPf0(Ξ1n) + nhn EPf0(Ξ2n). (39)

It is straightforward to see that Ξ1n = Pn i=1 f var(πi) f var{Zi(x)}. As n , we use the fact that πi Beta(αn + 1, (n 1)(αn + 1)) under eΠ and (36), to get

nhn EPf0(Ξ1n) n n n(αn + 1) + 1 eΞ1n, (40)

for some eΞ1n satisfying eΞ1n f0(x) and n 0 as n . Therefore, nhn EPf0(Ξ1n) 0 as n . For the second part, let di(x) = (1/λi) tγn{(x µi)/λi} so that the pseudoposterior mean ˆfn(x) = (1/n) Pn i=1 di(x). We ﬁrst observe that

Ξ2n = f var

i=1 πidi(x)

= 1 n(αn + 1) + 1

i=1 {di(x) ˆfn(x)}2 #

1 n(αn + 1) + 1

i=1 d2 i (x)

It now follows from some algebra that

i=1 d2 i (x) d0(n)γn

1 λi t2γn+1

where d0(n) (2π) 1/2 as n . Therefore, we have

nhn EPf0(Ξ2n) O 1 2π1/2 n n(αn + 1) + 1 eΞ2n

for some eΞ2n satisfying eΞ2n f0(x) as n . By the conditions of the theorem, we have nhn EPf0(Ξ2n) 0 as n . This, along with (40) substituted in (39) provides (nhn)1/2 |f(x) g(x)| 0 in Pf0-probability. This implies the desired result for f(x) using Slutsky s theorem. As a result, we can interpret pseudo-credible intervals to be frequentist conﬁdence intervals, on average, asymptotically.

Nearest Neighbor Dirichlet Mixtures

Appendix E. Proof of Theorem 8

E.1 A Property of the kn-Nearest Neighbor Distance

Suppose X1, . . . , Xn iid f0 with f0 a density on Rp satisfying Assumptions 1-3. We denote the induced probability measure Pf0 by P0 in this section for the sake of convenience. We deﬁne the smoothed k-nearest neighborhood of Xi as Bi = {y Rp : ||Xi y||2 Ri}, where Ri = ||Xi Xi[kn]||2 is the Euclidean distance between Xi and the kn-nearest neighbor of Xi for i = 1, . . . , n. By symmetry, R1, . . . , Rn are identically distributed. Suppose rn = (kn/n)1/p and deﬁne the quasi-neighborhood Bi(r) = {y Rp : ||Xi y||2 r}, where the random variables Ri have been replaced by r 0. Let

{y : ||y x||2 r}

The positive density condition on f0 obtained from Assumptions 1 and 3 (Evans et al., 2002; Evans, 2008) ensures the existence of A > 1 and ρ > 0 such that for all 0 r ρ and for all x [0, 1]p, rp

A ωx(r) Arp. (42)

We ﬁrst state a Lemma proving some important properties of R1. We next use this Lemma to prove Theorem 8. Recall that two non-negative sequences (an) and (bn) are said to be asymptotically equivalent if |an/bn| c0 for some c0 > 0, denoted by an bn.

Lemma 11 Deﬁne i0 = {2/(p2+p+2)} {4/(p+2)2} as in Theorem 4. Assume kn ni0 ϵ

for some ϵ (0, i0). Suppose δ > 0 satisﬁes

δ < (1 i0 + ϵ) 1 1.

1/p , cn = 1 (Ae)1/p r1+δ n , and n0 =

2 δ(1 i0 + ϵ) + 1

) 1 i0 ϵ + 1.

Then, the following results hold:

(i) pn = P0(R1 cn) = O

n=n0+1 pn < . That is, {pn} n=1 is summable.

(iii) P limsup n {R1 cn} = 0.

(iv) ncp n as n .

Chattopadhyay, Chakraborty and Dunson

(i) Note that cn ρ for suﬃciently large n. From Lemma 4.1 of Evans et al. (2002) we have

P0(R1 cn | Xi = x) =

0 (kn 1) n 1 kn 1

ykn 2(1 y)n kndy

for any x [0, 1]p, using (42). This immediately implies

P0(R1 cn) = Z

[0,1]p P0(Ri cn | Xi = x) f0(x) dx O

(ii) For n > n0, we have pn = O{n (1+Θn)} for a sequence Θn , Θn > 0. This ensures that P n=n0+1 pn < .

(iii) Since P n=1 pn < , a direct application of the ﬁrst Borel-Cantelli lemma proves the statement.

(iv) We have, using the condition on δ,

Ae k1+δ n nδ

We now use the above Lemma to prove Theorem 8. The key idea is to leverage the fact that Ri > cn for all i = 1, . . . , n with probability 1 for all but ﬁnite n.

E.2 Number of Eﬀective Member Points in Each Neighborhood

We now prove Theorem 8. Proof Using (iii) from Lemma 11, for i = 1, . . . , n, we have an integer e Ni such that for all n e Ni, P0(Ri > cn) = 1. However, since R1, . . . , Rn are identically distributed, e N1 = . . . = e Nn = e N, say. Thus, for all i = 1, . . . , n, we have P0(Ri > cn) = 1 for all n e N. This immediately implies that P0 [Tn i=1{Ri > cn}] = 1 P0 [Sn i=1{Ri cn}] 1 Pn i=1 P0[Ri cn] = 1 for all n e N, which shows P0 [Tn i=1{Ri > cn}] = 1. Therefore,

Nearest Neighbor Dirichlet Mixtures

n Qn = n P0

i=3 {X2 / Bi},

i=1 {Ri > cn}

i=3 {X2 / Bi(cn)}

i=3 {||X2 Xi|| > cn}

= n Z θn 2 n (x) f0(x) dx,

where θn(x) = 1 ωx(cn), since X1, . . . , Xn iid f0. Using (42), we have θn(x) 1 (cp n/A) for all x [0, 1]p. Given the conditions on k, it follows that as n ,

log n + n log 1 cp n A

for all x [0, 1]p, where ξ = 1 (1 + ϵ i0)(1 + δ) > 0. Therefore, we have

n Qn = n Z θn 2 n (x)f0(x) dx

n 2 Z f0(x) dx

= O exp log n + n log 1 cp n A

as n . This proves the result.

Appendix F. Proof of Consistency of Kernel Estimator

Deﬁne the standard multivariate t-density with d > 0 degrees of freedom to be gd(x) = td(x; 0p, Ip). Since H = Hn = h2 n Ip as deﬁned in Section 3.1 is diagonal, it immediately follows that tγn p+1(x; µ, H) = h p n gγn p+1{h 1 n (x µ)}. The following lemma proves the consistency of any such generic kernel density estimator with t kernel depending on n, say

f K(x) = 1 nwp

i=1 gγn p+1

where the bandwidth w = wn satisﬁes wn 0 and nwp n as n , with independent and identically distributed data X1, . . . , Xn f0 satisfying Assumptions 1-3.

Lemma 12 Suppose wn is a sequence satisfying wn 0 and nwp n as n . Let f K(x) = (nwp n) 1 Pn i=1 gγn p+1{w 1 n (x Xi)}. Then f K(x) f0(x) in Pf0-probability for each x [0, 1]p.

Chattopadhyay, Chakraborty and Dunson

Proof It is enough to show that EPf0{f K(x)} f0(x) and var Pf0{f K(x)} 0 as n . Let us start ﬁrst with EPf0{f K(x)}. We have

EPf0{f K(x)} = EPf0

wp n gγn p+1

[0,1]p 1 wp n gγn p+1

ip gγn p+1(u) f0(x + wnu) du,

ip gγn p+1(u) {f0(x) + wnu T f0(ξ)} du

ip gγn p+1(u) du + wn

ip gγn p+1(u)u T f0(ξ) du,

= f0(x){1 on(1)} + wn On(1),

using the mean value theorem and Polya s theorem (P olya, 1920) along with Assumption 2 to bound f0( ). As n , this implies that EPf0{f K(x)} f0(x) since wn 0 as n . The variance may be dealt with in a similar manner. Following the same steps as before we get

var Pf0{f K(x)} = 1

wp n gγn p+1

w2p n g2 γn p+1

[0,1]p g2 γn p+1

ip g2 γn p+1(u){f0(x) + wnu T f0(ξ)} du,

which shows that the variance goes to 0 as n , since nwp n as n .

For the nearest neighbor-Dirichlet mixture, recall f K(x) = (1/n) Pn i=1 tγn p+1(x; Xi, Hn) from Section 3.1 of the main document, where Hn = h2 n Ip and h2 n = {νn(γn p+1)} 1{(νn+ 1)(γ0 p+1)}δ2 0. Here, the bandwidth hn satisﬁes hn 0 and nhp n as n . Lemma 12 then shows that f K(x) converges to f0(x) in Pf0-probability as n .

Appendix G. Cross-validation

G.1 Algorithm for Leave-one-out Cross-validation

Consider independent and identically distributed data X1, . . . , Xn Rp f with f having the nearest neighbor-Dirichlet mixture formulation. The prior of the neighborhood parameters (ηi, Σi) following Sections 2.2 and 2.3 is (ηi, Σi) NIWp(µ0, ν0, γ0, Ψ0) where Ψ0 = (γ δ2 0) Ip with γ = γ0 p + 1. We use the pseudo-posterior mean in (8) to compute

Nearest Neighbor Dirichlet Mixtures

leave-one-out log-likelihoods L(δ2 0) for diﬀerent choices of the hyperparameter δ2 0, choosing δ2 0,CV = arg supδ2 0L(δ2 0) to maximize this criteria. The details of the computation of L(δ2 0) for a ﬁxed δ2 0 are provided in Algorithm 2.

Step 1: Consider data X (n) = (X1, . . . , Xn) where Xi Rp, p 1.

Fix the number of neighbors k and other hyperparameters µ0, ν0, γ0.

Step 2: For i {1, . . . , n}, consider the data set leaving out the ith data point, given by X i = (X1, . . . , Xi 1, Xi+1, . . . , Xn).

Compute the pseudo-posterior mean density estimate at Xi, namely ˆf i(Xi), using X i and (8); let Li(δ2 0) = ˆf i(Xi).

Step 3: Compute the leave-one-out log-likelihood given by

L(δ2 0) = 1

i=1 log{Li(δ2 0)}.

Step 4: For δ2 0 > 0, obtain δ2 0,CV = arg sup δ2 0 L(δ2 0).

Algorithm 2: Leave-one-out cross-validation for choosing the hyperparameter δ2 0 in nearest neighbor-Dirichlet mixture method.

G.2 Fast Implementation of Cross-validation

In Algorithm 2, the nearest neighborhood speciﬁcation for each X i is diﬀerent for i = 1, . . . , n. However, we bypass this computation by initially forming a neighborhood of size (k + 1) for each data point using the entire data and storing the respective neighborhood means and covariance matrices. Suppose for Xi, the indices of the (k + 1)-nearest neighbors are given by e Ni = {j {1, . . . , n} : ||Xi Xj||2 ||Xi Xi[k+1]||2}, arranged in increasing order according to their distance from Xi with Xi[1] = Xi. Deﬁne the neighborhood mean mi = {1/(k + 1)} P

j e Ni Xj and the neighborhood covariance matrix Si = (k + 1) 1{P

j e Ni(Xj mi)(Xj mi)T}. Then, to form a k-nearest neighborhood

for the new data X i, a single pass over the initial neighborhoods e Ni is suﬃcient to update the new neighborhood means and covariance matrices. Below, we describe the update for the neighborhood means m( i) j and covariance matrices S( i) j for j = 1, . . . , n and j = i, considering the data X i. For j = 1, . . . , n and j = i, we have,

( (1/k){(k + 1)mj Xj[k+1]} if i / e Nj, (1/k){(k + 1)mj Xi} if i e Nj.

( Sj {(k + 1)/k}(mj Xj[k+1])(mj Xj[k+1])T if i / e Nj, Sj {(k + 1)/k}(mj Xi)(mj Xi)T if i e Nj. (43)

Chattopadhyay, Chakraborty and Dunson

Appendix H. Algorithm with Gaussian Kernels for Univariate Data

For p = 1, we have a univariate Gaussian density φ(x; ηi, σ2 i ) in neighborhood i and normalinverse gamma priors (ηi, σ2 i ) NIG(µ0, ν0, γ0/2, γ0δ2 0/2) independently for i = 1, . . . , n, with µ0 R and ν0, γ0, δ2 0 > 0. That is,

ηi | σ2 i N µ0, σ2 i ν0

, σ2 i IG γ0

2 , γ0δ2 0 2

Monte Carlo samples from the pseudo-posterior of the unknown density f at any point x can be generated following the steps of Algorithm 3.

Step 1: Compute the k-nearest neighborhood Ni for Xi with Xi[1] = Xi, using the distance d( , ).

Step 2: Update the parameters for neighborhood Ni to (µi, νn, γn/2, γnδ2 i /2), where νn = ν0 + k, γn = γ0 + k,

µi = ν0µ0 + k Xi

νn , Xi = 1

and γnδ2 i = γ0δ2 0 + P j Ni Xj Xi 2 + kν0ν 1 n µ0 Xi 2 .

Step 3: To compute the t-th Monte Carlo sample of f(x), sample π(t) Dirichlet(α + 1, . . . , α + 1) and (η(t) i , σ(t)2 i ) NIG µi, νn, γn/2, γnδ2 i /2

independently for i = 1, . . . , n, and set

i=1 π(t) i φ(x; η(t) i , σ(t)2 i ).

Algorithm 3: Nearest neighbor-Dirichlet mixture algorithm to obtain Monte Carlo samples from the pseudo-posterior of f(x) with Gaussian kernel and normal-inverse gamma prior.

Appendix I. Inverse Wishart Parametrization

The parametrization of the inverse Wishart density deﬁned on the set of all p p matrices with real entries used in this article is given as follows. Suppose γ > p 1 and Ψ is a p p positive deﬁnite matrix. If Σ IWp(γ, Ψ), then Σ has the following density function:

|Σ| (γ+p+1)/2 etr 1

2ΨΣ 1 if Σ is positive deﬁnite,

0 otherwise,

Nearest Neighbor Dirichlet Mixtures

where Γp( ) is the multivariate gamma function given by

Γp(a) = πp(p 1)/4 p Y

j=1 Γ a + 1 j

for a (p 1)/2 and the function etr (A) = exp {tr(A)} for a square matrix A. When p = 1, the IWp(γ, Ψ) density is the same as the IG(γ/2, γδ2/2) density, where δ2 = Ψ/γ. The IWp(γ, Ψ) distribution has mean Ψ/(ν p 1) for ν > p + 1 and mode Ψ/(ν + p + 1).

Appendix J. L1 Error Tables in Sections 4.2 and 4.3

Sample size Estimator CA CW DE GS IE LN LO SB SP ST

200 NN-DM 0.20 0.31 0.19 0.12 0.36 0.20 0.13 0.16 0.30 0.31 NN-DM (default) 0.21 0.37 0.17 0.12 0.34 0.20 0.14 0.17 0.31 0.32 DP-MC 0.17 0.37 0.14 0.10 0.36 0.22 0.13 0.23 0.27 0.55 KDE - 0.37 0.16 0.12 - 0.18 0.11 0.18 - 0.52 KNN 5.99 0.58 0.59 0.28 3.46 0.54 0.48 0.39 6.02 0.46 DP-VB 0.20 0.35 0.15 0.08 0.53 0.25 0.11 0.11 0.44 0.57 RD - 0.35 0.13 0.12 - 0.16 0.11 0.16 - 0.53 PTM 0.29 0.27 0.18 0.13 0.38 0.22 0.13 0.20 0.40 0.39 LLDE 0.19 0.36 0.14 0.10 - 0.15 0.10 0.18 - 0.55 OPT 0.32 0.36 0.28 0.17 0.55 0.21 0.02 0.18 0.75 0.52 A-KDE 0.22 0.35 0.15 0.14 0.46 0.16 0.11 0.17 0.38 0.53

500 NN-DM 0.16 0.17 0.13 0.08 0.30 0.16 0.10 0.10 0.24 0.20 NN-DM (default) 0.16 0.36 0.12 0.09 0.30 0.17 0.10 0.12 0.25 0.22 DP-MC 0.11 0.35 0.10 0.08 0.27 0.18 0.09 0.13 0.22 0.53 KDE - 0.32 0.11 0.08 - 0.15 0.08 0.11 - 0.51 KNN 3.62 0.47 0.48 0.27 3.39 0.40 0.30 0.31 5.64 0.35 DP-VB 0.14 0.33 0.11 0.05 0.48 0.19 0.08 0.08 0.45 0.55 RD - 0.32 0.10 0.09 - 0.13 0.09 0.10 - 0.50 PTM 0.24 0.19 0.14 0.10 0.32 0.19 0.11 0.14 0.32 0.30 LLDE 0.17 0.35 0.11 0.08 - 0.15 0.08 0.14 - 0.53 OPT 0.27 0.31 0.16 0.12 0.51 0.16 0.01 0.14 0.72 0.46 A-KDE 0.16 0.32 0.10 0.10 0.40 0.13 0.10 0.10 0.27 0.52

Table 5: Comparison of the methods in terms of L1 error in the univariate case. Number of test points and replications considered are nt = 500 and R = 20, respectively.

Chattopadhyay, Chakraborty and Dunson

Density MG MST MVC MVG SN T

Sample size Dimension 2 3 4 6 2 3 4 6 2 3 4 6 2 3 4 6 2 3 4 6 2 3 4 6

200 NN-DM 0.29 0.39 0.46 0.63 0.27 0.37 0.43 0.59 0.33 0.48 0.52 0.62 0.33 0.50 0.61 0.74 0.23 0.32 0.40 0.56 0.26 0.33 0.38 0.52 NN-DM (default) 0.29 0.40 0.47 0.65 0.28 0.38 0.45 0.61 0.37 0.49 0.61 0.80 0.38 0.50 0.62 0.75 0.24 0.34 0.42 0.58 0.26 0.33 0.40 0.53 DP-MC 0.20 0.36 0.62 0.67 0.22 0.42 0.55 0.68 0.31 0.46 0.51 0.60 0.34 0.52 0.61 0.73 0.16 0.18 0.25 0.32 0.21 0.26 0.34 0.44 KDE 0.28 0.52 0.79 1.29 0.27 0.55 0.72 1.21 - - - - 0.43 0.77 0.90 1.09 0.22 0.50 0.78 1.26 0.26 0.53 0.74 1.21 KNN 1.93 3.82 4.80 18.83 7.28 8.22 9.25 11.05 4.92 7.31 15.24 20.5 3.30 3.65 4.75 5.54 2.21 5.16 8.37 10.04 2.97 6.37 10.22 17.54 DP-VB 0.29 0.38 0.41 0.50 0.24 0.36 0.44 0.59 0.48 0.85 1.28 1.69 0.45 0.58 0.71 0.86 0.17 0.23 0.31 0.52 0.19 0.29 0.34 0.46 LLDE 0.32 0.52 0.93 1.82 0.28 0.67 1.02 11.38 0.53 - - - 0.32 0.45 0.75 0.97 0.16 0.22 0.29 0.65 0.17 0.26 0.35 2.05 OPT 0.58 0.79 1.04 - 0.57 1.08 1.31 - 0.59 0.82 1.01 - 0.43 0.69 0.86 - 0.48 0.68 0.88 - 0.57 0.78 1.10 -

1000 NN-DM 0.18 0.26 0.34 0.46 0.19 0.27 0.32 0.44 0.28 0.33 0.46 0.50 0.29 0.44 0.49 0.62 0.15 0.22 0.29 0.42 0.17 0.24 0.30 0.38 NN-DM (default) 0.22 0.30 0.37 0.48 0.21 0.29 0.33 0.44 0.31 0.41 0.52 0.66 0.36 0.48 0.55 0.68 0.22 0.29 0.36 0.45 0.22 0.28 0.33 0.39 DP-MC 0.08 0.39 0.57 0.58 0.11 0.18 0.21 0.47 0.26 0.39 0.48 0.54 0.21 0.38 0.51 0.64 0.06 0.07 0.10 0.15 0.09 0.15 0.17 0.28 KDE 0.16 0.32 0.52 0.96 0.16 0.35 0.53 1.04 - - - - 0.33 0.66 0.73 0.93 0.13 0.32 0.53 1.05 0.14 0.32 0.52 0.90 KNN 0.92 2.62 4.01 15.28 5.96 6.48 7.04 9.63 4.68 6.29 13.7 17.04 2.01 2.59 3.88 4.09 1.89 4.39 6.84 8.15 2.37 5.30 9.66 13.28 DP-VB 0.25 0.29 0.33 0.36 0.15 0.24 0.25 0.45 0.42 0.74 0.82 0.91 0.38 0.54 0.61 0.77 0.10 0.12 0.15 0.20 0.10 0.15 0.18 0.31 LLDE 0.31 0.37 0.52 1.02 0.22 0.40 0.46 1.18 0.47 - - - 0.29 0.39 0.51 0.94 0.07 0.10 0.14 0.27 0.10 0.15 0.23 0.38 OPT 0.39 0.72 1.00 - 0.43 0.84 1.10 - 0.51 0.72 0.89 - 0.30 0.60 0.85 - 0.31 0.50 0.76 - 0.33 0.53 0.94 -

Table 6: Comparison of the methods in terms of L1 error in the multivariate case. Number of test points and replications considered are nt = 500 and R = 20, respectively.

Nearest Neighbor Dirichlet Mixtures

Ian S Abramson. On bandwidth variation in kernel estimates-a square root law. The Annals of Statistics, 10(4):1217 1223, 1982.

Adelchi Azzalini. The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32(2):159 188, 2005.

G erard Biau and Luc Devroye. Lectures on the Nearest Neighbor Method. Springer, 2015.

David M Blei and Michael I Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121 143, 2006.

Adrian W Bowman. An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353 360, 1984.

Leo Breiman, William Meisel, and Edward Purcell. Variable kernel estimates of multivariate densities. Technometrics, 19(2):135 144, 1977.

Luc Devroye and Laszlo Gyorﬁ. Nonparametric Density Estimation: the L1 view. Wiley Series in Probability and Statistics, 1985.

Or Dinari, Angel Yu, Oren Freifeld, and John Fisher. Distributed MCMC inference in Dirichlet process mixture models using Julia. In 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 518 525, 2019. doi: 10.1109/CCGRID.2019.00066.

Tarn Duong. ks: Kernel Smoothing, 2020. URL https://CRAN.R-project.org/package =ks. R package version 1.11.7.

Dafydd Evans. A law of large numbers for nearest neighbour statistics. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 464(2100):3175 3192, 2008.

Dafydd Evans, Antonia J Jones, and Wolfgang M Schmidt. Asymptotic moments of near neighbour distance distributions. Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 458(2028):2839 2849, 2002.

Subhashis Ghosal and Aad van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics, 35(2):697 723, 2007.

Subhashis Ghosal and Aad van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.

Subhashis Ghosal, Jayanta K Ghosh, and RV Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143 158, 1999.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007.

Chattopadhyay, Chakraborty and Dunson

Gene H Golub and Charles F van Loan. Matrix Computations. John Hopkins University Press, 3rd edition, 1996.

P Richard Hahn, Ryan Martin, and Stephen G Walker. On recursive Bayesian predictive distributions. Journal of the American Statistical Association, 113(523):1085 1093, 2018.

Peter Hall. On Kullback-Leibler loss and density estimation. The Annals of Statistics, 15 (4):1491 1519, 1987.

Nils Lid Hjort and M Chris Jones. Locally parametric nonparametric density estimation. The Annals of Statistics, pages 1619 1647, 1996.

Michael C Hughes and Erik B Sudderth. Bnpy: Reliable and scalable variational inference for Bayesian nonparametric models. In Proceedings of the NIPS Probabilistic Programimming Workshop, Montreal, QC, Canada, pages 8 13, 2014.

Gordon J. Ross and Dean Markwick. dirichletprocess: Build Dirichlet Process Objects for Bayesian Modelling, 2019. URL https://CRAN.R-project.org/package=dirichletpr ocess. R package version 0.3.1.

Alejandro Jara, Timothy Hanson, Fernando Quintana, Peter M uller, and Gary Rosner. DPpackage: Bayesian semiand nonparametric modeling in R. Journal of Statistical Software, 40(5):1 30, 2011. URL http://www.jstatsoft.org/v40/i05/.

MJ Keith, A Jameson, W Van Straten, M Bailes, S Johnston, M Kramer, A Possenti, SD Bates, NDR Bhat, M Burgay, et al. The High Time Resolution Universe Pulsar Survey I, System conﬁguration and initial discoveries. Monthly Notices of the Royal Astronomical Society, 409(2):619 627, 2010.

Kenichi Kurihara, Max Welling, and Nikos Vlassis. Accelerated variational Dirichlet process mixtures. In B. Sch olkopf, J. Platt, and T. Hoﬀman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006. URL https://proceedings. neurips.cc/paper/2006/file/2bd235c31c97855b7ef2dc8b414779af-Paper.pdf.

Michael Lavine. Some aspects of Polya tree distributions for statistical modelling. Annals of Statistics, 20(3):1222 1235, 1992.

Michael Lavine. More aspects of Polya tree distributions for statistical modelling. The Annals of Statistics, 22(3):1161 1176, 1994.

Clive Loader. Local regression and likelihood. Springer Science & Business Media, 2006.

Clive R Loader. Local likelihood density estimation. The Annals of Statistics, 24(4):1602 1618, 1996.

Don O Loftsgaarden and Charles P Quesenberry. A nonparametric estimate of a multivariate density function. The Annals of Mathematical Statistics, 36(3):1049 1051, 1965.

Duncan Ross Lorimer and Michael Kramer. Handbook of Pulsar Astronomy, 2012.

Nearest Neighbor Dirichlet Mixtures

Robert James Lyon. Why are pulsars hard to ﬁnd? Ph D thesis, The University of Manchester (United Kingdom), 2016.

Hengzhao Ma and Jianzhong Li. A true O(n log n) algorithm for the all-k-nearest-neighbors problem. In International Conference on Combinatorial Optimization and Applications, pages 362 374. Springer, 2019.

YP Mack and Murray Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of Multivariate Analysis, 9(1):1 15, 1979.

Thoralf Mildenberger and Henrike Weinert. The benchden package: Benchmark densities for nonparametric density estimation. Journal of Statistical Software, 46(14):1 14, 2012. URL http://www.jstatsoft.org/v46/i14/.

Jeﬀrey W Miller and David B. Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113 1125, 2019.

Michael A Newton. On a nonparametric recursive estimator of the mixing distribution. Sankhy a: The Indian Journal of Statistics, Series A, 64(2):306 322, 2002.

Michael A Newton and Yunlei Zhang. A recursive algorithm for nonparametric analysis with missing data. Biometrika, 86(1):15 26, 1999.

Yang Ni, Yuan Ji, and Peter M uller. Consensus Monte Carlo for random subsets using shared anchors. Journal of Computational and Graphical Statistics, 29(4):1 12, 2020.

Georg P olya. Uber den zentralen grenzwertsatz der wahrscheinlichkeitsrechnung und das momentenproblem. Mathematische Zeitschrift, 8(3-4):171 181, 1920.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018. URL https://www.R-project.org/.

Judith Rousseau and Kerrie Mengersen. Asymptotic behaviour of the posterior distribution in overﬁtted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689 710, 2011.

David W Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.

Simon J Sheather and Michael C Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):683 690, 1991.

Hanyu Song, Yingjian Wang, and David B. Dunson. Distributed Bayesian clustering using ﬁnite mixture of mixtures. ar Xiv preprint, page ar Xiv:2003.13936, 2020.

Peter Xue-Kun Song. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2):305 320, 2000.

George R Terrell and David W Scott. Variable kernel density estimation. The Annals of Statistics, pages 1236 1265, 1992.

Chattopadhyay, Chakraborty and Dunson

Gerald Teschl. Mathematical methods in quantum mechanics. Graduate Studies in Mathematics, 99:106, 2009.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer New York, NY, 1st edition, 2009.

Pravin M. Vaidya. An optimal algorithm for the all-nearest-neighbors problem. In 27th Annual Symposium on Foundations of Computer Science, pages 117 122, 1986.

Matt P Wand and M Chris Jones. Multivariate plug-in bandwidth selection. Computational Statistics, 9(2):97 116, 1994.

Lianming Wang and David B. Dunson. Fast Bayesian inference in Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 20(1):196 216, 2011.

Mike West. Hyperparameter estimation in Dirichlet process mixture models. Duke University ISDS Discussion Paper# 92-A03, 1992.

Wing H Wong and Li Ma. Optional Polya tree and Bayesian inference. The Annals of Statistics, 38(3):1433 1459, 2010.

Xiaole Zhang, David J Nott, Christopher Yau, and Ajay Jasra. A sequential algorithm for fast ﬁtting of Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 23(4):1143 1162, 2014.