# Parallelizing MCMC with Random Partition Trees

Xiangyu Wang (Dept. of Statistical Science, Duke University, xw56@stat.duke.edu), Fangjian Guo (Dept. of Computer Science, Duke University, guo@cs.duke.edu), Katherine A. Heller (Dept. of Statistical Science, Duke University, kheller@stat.duke.edu), David B. Dunson (Dept. of Statistical Science, Duke University, dunson@stat.duke.edu)

The modern scale of data has brought new challenges to Bayesian inference. In particular, conventional MCMC algorithms are computationally very expensive for large data sets. A promising approach to this problem is embarrassingly parallel MCMC (EP-MCMC), which first partitions the data into multiple subsets and runs independent sampling algorithms on each subset. The subset posterior draws are then aggregated via some combining rule to obtain the final approximation. Existing EP-MCMC algorithms are limited by approximation accuracy and difficulty in resampling. In this article, we propose a new EP-MCMC algorithm, PART, that solves these problems. The new algorithm applies random partition trees to combine the subset posterior draws; the resulting estimator is distribution-free, easy to resample from, and able to adapt to multiple scales. We provide theoretical justification and extensive experiments illustrating empirical performance.

1 Introduction

Bayesian methods are popular for their success in analyzing complex data sets. However, for large data sets, Markov chain Monte Carlo (MCMC) algorithms, widely used in Bayesian inference, can suffer from huge computational expense. With large data there is increasing time per iteration, increasing time to convergence, and difficulty processing the full data on a single machine due to memory limits. To ameliorate these concerns, various methods such as stochastic gradient Monte Carlo [1] and sub-sampling based Monte Carlo [2] have been proposed. Among the directions that have been explored, embarrassingly parallel MCMC (EP-MCMC) seems most promising. EP-MCMC algorithms typically divide the data into multiple subsets and run independent MCMC chains simultaneously on each subset. The posterior draws are then aggregated according to some rule to produce the final approximation. This approach is clearly more efficient, as each chain involves a much smaller data set and the sampling is communication-free. The key to a successful EP-MCMC algorithm lies in the speed and accuracy of the combining rule.

Existing EP-MCMC algorithms can be roughly divided into three categories. The first relies on asymptotic normality of posterior distributions. [3] propose a consensus Monte Carlo algorithm, which produces the final approximation by weighted averaging over all subset draws. This approach is effective when the posterior distributions are close to Gaussian, but can suffer from severe bias when skewness or multiple modes are present. The second category relies on calculating an appropriate variant of a mean or median of the subset posterior measures [4, 5]. These approaches rely on asymptotics (the size of the data increasing to infinity) to justify accuracy, and lack guarantees in finite samples. The third category relies on the product density equation (PDE) in (1).
Assuming X is the observed data and θ is the parameter of interest, when the observations are iid conditioned on θ, for any partition $X = X^{(1)} \cup X^{(2)} \cup \cdots \cup X^{(m)}$ the following identity holds,

$$p(\theta\mid X) \;\propto\; \pi(\theta)\,p(X\mid\theta) \;\propto\; p(\theta\mid X^{(1)})\,p(\theta\mid X^{(2)})\cdots p(\theta\mid X^{(m)}), \qquad (1)$$

if the priors on the full data and the subsets satisfy $\pi(\theta) = \prod_{i=1}^m \pi_i(\theta)$. [6] propose using kernel density estimation on each subset posterior and then combining via (1). They use an independent Metropolis sampler to resample from the combined density. [7] apply the Weierstrass transform directly to (1) and develop two sampling algorithms based on the transformed density. These methods guarantee that the approximate density converges to the true posterior density as the number of posterior draws increases. However, as both are kernel-based, the two methods suffer from two major drawbacks. The first is the inefficiency of resampling. Kernel density estimators are essentially mixture distributions. Assuming we have collected 10,000 posterior samples on each machine, multiplying just two densities already yields a mixture distribution containing $10^8$ components, each associated with a different weight. Resampling requires the independent Metropolis sampler to search over an exponential number of mixture components, and it is likely to get stuck at one good component, resulting in high rejection rates and slow mixing. The second is the sensitivity to bandwidth choice, with one bandwidth applied to the whole space.

In this article, we propose a novel EP-MCMC algorithm termed parallel aggregation random trees (PART), which solves the above two problems. The algorithm inhibits the explosion of mixture components so that the aggregated density is easy to resample. In addition, the density estimator is able to adapt to multiple scales and thus achieves better approximation accuracy. In Section 2, we motivate the new methodology and present the algorithm. In Section 3, we present error bounds and prove consistency of PART in the number of posterior draws. Experimental results are presented in Section 4. Proofs and part of the numerical results are provided in the supplementary materials.

2 Method

Recall the PDE identity (1) in the introduction. When the data set X is partitioned into m subsets $X = X^{(1)} \cup \cdots \cup X^{(m)}$, the posterior distribution of the ith subset can be written as

$$f^{(i)}(\theta) \;\propto\; \pi(\theta)^{1/m}\,p(X^{(i)}\mid\theta), \qquad (2)$$

where π(θ) is the prior assigned to the full data set. Assuming observations are iid given θ, the relationship between the full-data posterior and the subset posteriors is captured by

$$p(\theta\mid X) \;\propto\; \pi(\theta)\prod_{i=1}^m p(X^{(i)}\mid\theta) \;\propto\; \prod_{i=1}^m f^{(i)}(\theta). \qquad (3)$$

Due to the flaws of applying kernel-based density estimation to (3) mentioned above, we propose to use random partition trees, or multi-scale histograms. Let $\mathcal{F}_K$ be the collection of all partitions of $\mathbb{R}^p$ formed by K disjoint rectangular blocks, where a rectangular block takes the form $A_k \overset{\mathrm{def}}{=} (l_{k,1}, r_{k,1}] \times (l_{k,2}, r_{k,2}] \times \cdots \times (l_{k,p}, r_{k,p}] \subseteq \mathbb{R}^p$ for some $l_{k,q} < r_{k,q}$. A K-block histogram is then defined as

$$\hat f^{(i)}(\theta) \;=\; \sum_{k=1}^{K} \frac{n_k^{(i)}}{N\,|A_k|}\,\mathbf{1}(\theta \in A_k), \qquad (4)$$

where $\{A_k : k = 1, 2, \ldots, K\} \in \mathcal{F}_K$ are the blocks and $N$, $n_k^{(i)}$ are the total number of posterior samples on the ith subset and the number of those inside block $A_k$, respectively (assuming the same N across subsets). We use $|\cdot|$ to denote the area of a block.
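To make the estimator in (4) concrete, the following Python sketch builds a K-block histogram from subset posterior draws when the rectangular blocks are already given. It is only a minimal reference implementation under assumed conventions (blocks encoded by arrays of lower and upper edges, helper names hypothetical), not the authors' MATLAB code.

```python
import numpy as np

def histogram_density(samples, lows, highs):
    """K-block histogram of (4): piecewise-constant density on fixed rectangular blocks.

    samples : (N, p) posterior draws from one subset
    lows, highs : (K, p) lower/upper edges of the K blocks, (low, high] per dimension
    Returns block counts n_k and block densities n_k / (N * |A_k|).
    """
    N = samples.shape[0]
    # membership: theta lies in A_k iff low < theta <= high coordinate-wise
    inside = np.all((samples[:, None, :] > lows[None]) &
                    (samples[:, None, :] <= highs[None]), axis=2)   # (N, K)
    counts = inside.sum(axis=0)                                     # n_k
    volumes = np.prod(highs - lows, axis=1)                         # |A_k|
    densities = counts / (N * volumes)
    return counts, densities

def evaluate(theta, lows, highs, densities):
    """Evaluate the histogram estimate at a single point theta (zero outside all blocks)."""
    hit = np.all((theta > lows) & (theta <= highs), axis=1)
    return densities[hit].sum()   # blocks are disjoint, so at most one term is nonzero
```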
Assuming each subset posterior is approximated by a K-block histogram, if the partition $\{A_k\}$ is restricted to be the same across all subsets, then the aggregated density after applying (3) is still a K-block histogram (illustrated in the supplement),

$$\hat p(\theta\mid X) \;=\; \frac{1}{Z}\prod_{i=1}^m \hat f^{(i)}(\theta) \;=\; \sum_{k=1}^K w_k\, g_k(\theta), \qquad (5)$$

where $Z = \sum_{k=1}^K \prod_{i=1}^m n_k^{(i)}/|A_k|^{m-1}$ is the normalizing constant, $w_k = \frac{1}{Z}\prod_{i=1}^m n_k^{(i)}/|A_k|^{m-1}$ are the updated weights, and $g_k(\theta) = \mathrm{unif}(\theta; A_k)$ is the block-wise uniform distribution. Common histogram blocks across subsets control the number of mixture components, leading to simple aggregation and resampling procedures. Our PART algorithm consists of space partitioning followed by density aggregation, with aggregation simply multiplying densities across subsets for each block and then normalizing.

2.1 Space Partitioning

To find good partitions, our algorithm recursively bisects (not necessarily evenly) a previous block along a randomly selected dimension, subject to certain rules. Such partitioning is multi-scale and related to wavelets [8]. Assume we are currently splitting the block A along dimension q, and denote the posterior samples in A from the ith subset by $\{\theta_j^{(i)}\}_{j\in A}$. The cut point on dimension q is determined by a partition rule $\varphi(\{\theta_{j,q}^{(1)}\}, \{\theta_{j,q}^{(2)}\}, \ldots, \{\theta_{j,q}^{(m)}\})$. The resulting two blocks are subject to further bisection under the same procedure until one of the following stopping criteria is met: (i) $n_k/N < \delta_\rho$ or (ii) the area of the block $|A_k|$ becomes smaller than $\delta_{|A|}$. The algorithm returns a tree with K leaves, each corresponding to a block $A_k$. Details are provided in Algorithm 1.

Algorithm 1 Partition tree algorithm
1: procedure BUILDTREE({θ_j^(1)}, {θ_j^(2)}, …, {θ_j^(m)}, φ(·), δ_ρ, δ_a, N, L, R)
2:   D ← {1, 2, …, p}
3:   while D not empty do
4:     Draw q uniformly at random from D                ▷ randomly choose the dimension to cut
5:     θ*_q ← φ({θ_{j,q}^(1)}, {θ_{j,q}^(2)}, …, {θ_{j,q}^(m)}),  T.n^(i) ← cardinality of {θ_j^(i)} for all i
6:     if θ*_q − L_q > δ_a, R_q − θ*_q > δ_a and min(Σ_j 1(θ_{j,q}^(i) ≤ θ*_q), Σ_j 1(θ_{j,q}^(i) > θ*_q)) > N δ_ρ for all i then
7:       L′ ← L, L′_q ← θ*_q, R′ ← R, R′_q ← θ*_q       ▷ update left and right boundaries
8:       T.L ← BUILDTREE({θ_j^(1) : θ_{j,q}^(1) ≤ θ*_q}, …, {θ_j^(m) : θ_{j,q}^(m) ≤ θ*_q}, …, N, L, R′)
9:       T.R ← BUILDTREE({θ_j^(1) : θ_{j,q}^(1) > θ*_q}, …, {θ_j^(m) : θ_{j,q}^(m) > θ*_q}, …, N, L′, R)
10:      return T
11:    else
12:      D ← D \ {q}                                    ▷ try cutting along another dimension
13:    end if
14:  end while
15:  T.L ← NULL, T.R ← NULL, return T                   ▷ leaf node
16: end procedure

In Algorithm 1, $\delta_{|A|}$ becomes the minimum edge length of a block, $\delta_a$ (possibly different across dimensions). The quantities $L, R \in \mathbb{R}^p$ are the left and right boundaries of the samples, which are taken as the sample minimum/maximum when the support is unbounded. We consider two choices for the partition rule φ(·): the maximum (empirical) likelihood partition (ML) and the median/KD-tree partition (KD).

Maximum Likelihood Partition (ML) The ML-partition searches for partitions by greedily maximizing the empirical log-likelihood at each iteration. For m = 1 we have

$$\theta^* = \varphi_{ML}(\{\theta_{j,q},\, j = 1, \ldots, n\}) = \arg\max_{n_1+n_2=n,\; A_1 \cup A_2 = A} \left(\frac{n_1}{n|A_1|}\right)^{n_1}\left(\frac{n_2}{n|A_2|}\right)^{n_2}, \qquad (6)$$

where $n_1$ and $n_2$ are the counts of posterior samples in $A_1$ and $A_2$, respectively. The solution to (6) falls inside the set $\{\theta_{j,q}\}$. Thus, a simple linear search after sorting the samples suffices (by book-keeping the ordering, sorting the whole block once is enough for the entire procedure). For m > 1, we have

$$\varphi_{q,ML}(\cdot) = \arg\max_{\theta^* \in \bigcup_{i=1}^m\{\theta_{j,q}^{(i)}\}} \prod_{i=1}^m \left(\frac{n_1^{(i)}}{n^{(i)}|A_1|}\right)^{n_1^{(i)}}\left(\frac{n_2^{(i)}}{n^{(i)}|A_2|}\right)^{n_2^{(i)}}, \qquad (7)$$

similarly solved by a linear search. This step is dominated by sorting and takes O(n log n) time.
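To illustrate the linear search behind (6) and (7), here is a Python sketch of the cut-point selection along one dimension. It is an illustrative reimplementation under assumed interfaces (one array of coordinates per subset, block edges passed explicitly), not the authors' code; for clarity it scans candidate cuts directly rather than using the sorted incremental bookkeeping that gives the stated O(n log n) cost.

```python
import numpy as np

def ml_cut(subset_samples_q, left, right, delta_a, delta_rho, N):
    """Maximum-likelihood cut along one dimension, in the spirit of (6)-(7).

    subset_samples_q : list of m 1-D arrays, the q-th coordinates of the samples
                       of each subset falling in the current block
    left, right      : block boundaries along dimension q
    Returns the candidate cut maximizing the summed empirical log-likelihood,
    or None if no admissible cut exists. The block volume along the other
    dimensions is a common constant and cancels, so only edge lengths along q matter.
    """
    candidates = np.unique(np.concatenate(subset_samples_q))
    # keep cuts that leave both sides wider than delta_a
    ok = (candidates - left > delta_a) & (right - candidates > delta_a)
    best_val, best_cut = -np.inf, None
    for theta in candidates[ok]:
        val, admissible = 0.0, True
        for s in subset_samples_q:
            n = len(s)
            n1 = int(np.sum(s <= theta))
            n2 = n - n1
            if min(n1, n2) <= N * delta_rho:      # population rule of Algorithm 1
                admissible = False
                break
            val += (n1 * np.log(n1 / (n * (theta - left))) +
                    n2 * np.log(n2 / (n * (right - theta))))
        if admissible and val > best_val:
            best_val, best_cut = val, float(theta)
    return best_cut
```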
Median/KD-Tree Partition (KD) The median/KD-tree partition cuts at the empirical median of the posterior samples. When there are multiple subsets, the median is taken over the pooled samples to force $\{A_k\}$ to be the same across subsets. Searching for the median takes O(n) time [9], which is faster than the ML-partition, especially when the number of posterior draws is large. The same partitioning strategy is adopted by KD-trees [10].

2.2 Density Aggregation

Given a common partition, Algorithm 2 aggregates all subsets in one stage. However, assuming a single good partition for all subsets is overly restrictive when m is large. Hence, we also consider pairwise aggregation [6, 7], which recursively groups subsets into pairs, combines each pair with Algorithm 2, and repeats until one final set is obtained. The run time of PART is dominated by space partitioning (BUILDTREE); normalization and resampling are very fast.

Algorithm 2 Density aggregation algorithm (drawing N′ samples from the aggregated posterior)
1: procedure ONESTAGEAGGREGATE({θ_j^(1)}, {θ_j^(2)}, …, {θ_j^(m)}, φ(·), δ_ρ, δ_a, N, N′, L, R)
2:   T ← BUILDTREE({θ_j^(1)}, {θ_j^(2)}, …, {θ_j^(m)}, φ(·), δ_ρ, δ_a, N, L, R),  Z ← 0
3:   ({A_k}, {n_k^(i)}) ← TRAVERSELEAF(T)
4:   for k = 1, 2, …, K do
5:     w_k ← ∏_{i=1}^m n_k^(i) / |A_k|^(m−1),  Z ← Z + w_k      ▷ multiply inside each block
6:   end for
7:   w_k ← w_k / Z for all k                                     ▷ normalize
8:   for t = 1, 2, …, N′ do
9:     Draw k with weights {w_k}, then draw θ_t ∼ g_k(θ)
10:  end for
11:  return {θ_1, θ_2, …, θ_{N′}}
12: end procedure

2.3 Variance Reduction and Smoothing

Random Tree Ensemble Inspired by random forests [11, 12], the full posterior is estimated by averaging T independent trees output by Algorithm 1. Smoothing and averaging can reduce variance and yield better approximation accuracy. The trees can be built in parallel, and resampling in Algorithm 2 only additionally requires picking a tree uniformly at random.

Local Gaussian Smoothing As another approach to increasing smoothness, the block-wise uniform distribution in (5) can be replaced by a Gaussian distribution $g_k = N(\theta; \mu_k, \Sigma_k)$, with mean and covariance estimated locally from the samples within the block. A multiplied Gaussian approximation is used: $\Sigma_k = \big(\sum_{i=1}^m \hat\Sigma_k^{(i)\,-1}\big)^{-1}$ and $\mu_k = \Sigma_k\big(\sum_{i=1}^m \hat\Sigma_k^{(i)\,-1}\hat\mu_k^{(i)}\big)$, where $\hat\Sigma_k^{(i)}$ and $\hat\mu_k^{(i)}$ are estimated from the ith subset. We apply both random tree ensembles and local Gaussian smoothing in all applications of PART in this article unless explicitly stated otherwise.

3 Theory

In this section, we provide consistency theory (in the number of posterior samples) for histograms and the aggregated density. For simplicity of exposition, we do not consider the variance reduction and smoothing modifications in these developments, but extensions are possible. Section 3.1 provides error bounds on ML and KD-tree partitioning-based histogram density estimators constructed from N independent samples from a single joint posterior; modified bounds can be obtained for MCMC samples by incorporating the mixing rate, but are not considered here. Section 3.2 then provides corresponding error bounds for our PART-aggregated density estimators in the one-stage and pairwise cases. Detailed proofs are provided in the supplementary materials.

Let f(θ) be a p-dimensional posterior density function. Assume f is supported on a measurable set $\Omega \subseteq \mathbb{R}^p$. Since one can always transform Ω to a bounded region by scaling, we simply assume $\Omega = [0, 1]^p$, as in [8, 13], without loss of generality. We also assume that $f \in C^1(\Omega)$.
3.1 Space partitioning

Maximum likelihood partition (ML) For a given K, the ML partition solves the following problem:

$$\hat f_{ML} = \arg\max_{\{A_k\}\in\mathcal{F}_K} \frac{1}{N}\sum_{k=1}^K n_k \log\frac{n_k}{N|A_k|}, \quad \text{s.t.}\quad n_k/N \ge c_0\rho,\;\; |A_k| \ge \rho/D, \qquad (8)$$

for some $c_0$ and $\rho$, where $D = \|f\|_\infty < \infty$. We have the following result.

Theorem 1. Choose $\rho = 1/K^{1+1/(2p)}$. For any $\delta > 0$, if the sample size satisfies $N > 2(1-c_0)^{-2}K^{1+1/(2p)}\log(2K/\delta)$, then with probability at least $1 - \delta$, the optimal solution to (8) satisfies

$$D_{KL}(f\,\Vert\,\hat f_{ML}) \;\le\; (C_1 + 2\log K)\,K^{-\frac{1}{2p}} + C_2 \max\{\log D,\, 2\log K\}\sqrt{\frac{K}{N}\log(3eN)},$$

where $C_1 = \log D + 4pLD$ with $L = \|\nabla f\|_\infty$ and $C_2 = 48\sqrt{p+1}$.

When multiple densities $f^{(1)}(\theta), \ldots, f^{(m)}(\theta)$ are present, our goal of imposing the same partition on all functions requires solving a different problem,

$$(\hat f^{(i)}_{ML})_{i=1}^m = \arg\max_{\{A_k\}\in\mathcal{F}_K} \sum_{i=1}^m\sum_{k=1}^K n_k^{(i)} \log\frac{n_k^{(i)}}{N_i|A_k|}, \quad \text{s.t.}\quad n_k^{(i)}/N_i \ge c_0\rho,\;\; |A_k| \ge \rho/D, \qquad (9)$$

where $N_i$ is the number of posterior samples for function $f^{(i)}$. A result similar to Theorem 1 for (9) is provided in the supplementary materials.

Median partition/KD-tree (KD) The KD-tree $\hat f_{KD}$ cuts at the empirical median along the chosen dimensions. We have the following result.

Theorem 2. For any $\varepsilon > 0$, define $r_\varepsilon = \log_2\big(1 + \frac{1}{2+3L/\varepsilon}\big)$. For any $\delta > 0$, if $N > 32e^2(\log K)^2 K\log(2K/\delta)$, then with probability at least $1 - \delta$, we have

$$\|\hat f_{KD} - f\|_1 \;\le\; \varepsilon + pLK^{-r_\varepsilon/p} + 4e\log K\sqrt{\frac{K}{N}\log\frac{2K}{\delta}}.$$

If f(θ) is further lower bounded by some constant $b_0 > 0$, we can then obtain an upper bound on the KL-divergence. Define $r_{b_0} = \log_2\big(1 + \frac{1}{2+3L/b_0}\big)$; we have

$$D_{KL}(f\,\Vert\,\hat f_{KD}) \;\le\; \frac{pLD}{b_0}\,K^{-r_{b_0}/p} + 8e\log K\sqrt{\frac{K}{N}\log\frac{2K}{\delta}}.$$

When there are multiple functions and the median partition is performed on the pooled data, the partition might not happen at the empirical median of each subset. However, as long as the partition quantiles are upper and lower bounded by α and 1 − α for some α ∈ [1/2, 1), we can establish results similar to Theorem 2. The result is provided in the supplementary materials.

3.2 Posterior aggregation

The previous section provides estimation error bounds on individual posterior densities, through which we can bound the distance between the true posterior conditional on the full data set and the aggregated density via (3). Assume we have m density functions $\{f^{(i)}, i = 1, 2, \ldots, m\}$ and intend to approximate their aggregated density $f_I = \prod_{i\in I} f^{(i)}\big/\int\prod_{i\in I} f^{(i)}$, where $I = \{1, 2, \ldots, m\}$. Notice that for any $I' \subseteq I$, $f_{I'} = p(\theta\mid\bigcup_{i\in I'} X^{(i)})$. Let $D = \max_{I'\subseteq I}\|f_{I'}\|_\infty$, i.e., D is an upper bound on all posterior densities formed by a subset of X. Also define $Z_{I'} = \int\prod_{i\in I'} f^{(i)}$. These quantities depend only on the model and the observed data (not on the posterior samples). We write $\hat f$ for either $\hat f_{ML}$ or $\hat f_{KD}$, as the following results apply to both methods. The one-stage aggregation (Algorithm 2) first obtains an approximation for each $f^{(i)}$ (via either ML-partition or KD-partition) and then computes $\hat f_I = \prod_{i\in I}\hat f^{(i)}\big/\int\prod_{i\in I}\hat f^{(i)}$.

Theorem 3 (One-stage aggregation). Denote the average total variation distance between $f^{(i)}$ and $\hat f^{(i)}$ by ε. Assume the conditions in Theorems 1 and 2, and that for the ML-partition $N \ge 32c_0^{-1}\sqrt{2(p+1)}\,K^{3/2} + 1$ and for the KD-partition $N > 128e^2K(\log K)^2\log(K/\delta)$. Then with high probability the total variation distance between $f_I$ and $\hat f_I$ is bounded by

$$\|f_I - \hat f_I\|_1 \;\le\; \frac{2}{Z_I}\,m\,(2D)^{m-1}\varepsilon,$$

where $Z_I$ is a constant that does not depend on the posterior samples.

The approximation error of Algorithm 2 increases dramatically with the number of subsets. To ameliorate this, we introduce the pairwise aggregation strategy in Section 2, for which we have the following result.

Theorem 4 (Pairwise aggregation).
Denote the average total variation distance between $f^{(i)}$ and $\hat f^{(i)}$ by ε. Assume the conditions in Theorem 3. Then with high probability the total variation distance between $f_I$ and $\hat f_I$ is bounded by

$$\|f_I - \hat f_I\|_1 \;\le\; (4C_0D)^{\log_2 m + 1}\,\varepsilon,$$

where $C_0 = \max_{I'\subseteq I} \frac{Z_{I'}\,Z_{I\setminus I'}}{Z_I}$ is a constant that does not depend on the posterior samples.

4 Experiments

In this section, we evaluate the empirical performance of PART¹ and compare the two algorithms PART-KD and PART-ML to the following posterior aggregation algorithms.

1. Simple averaging (average): each aggregated sample is an arithmetic average of M samples coming from M subsets.
2. Weighted averaging (weighted): also called the consensus Monte Carlo algorithm [3], where each aggregated sample is a weighted average of M samples. The weights are optimally chosen for a Gaussian posterior.
3. Weierstrass rejection sampler (Weierstrass): subset posterior samples are passed through a rejection sampler based on the Weierstrass transform to produce the aggregated samples [7]. We use its R package² for the experiments.
4. Parametric density product (parametric): aggregated samples are drawn from a multivariate Gaussian, which is a product of Laplace approximations to the subset posteriors [6].
5. Nonparametric density product (nonparametric): the aggregated posterior is approximated by a product of kernel density estimates of the subset posteriors [6]. Samples are drawn with an independent Metropolis sampler.
6. Semiparametric density product (semiparametric): similar to the nonparametric method, but with subset posteriors estimated semiparametrically [6, 14].

All experiments except the two toy examples use adaptive MCMC [15, 16]³ for posterior sampling. For PART-KD/ML, one-stage aggregation (Algorithm 2) is used only for the toy examples (results from pairwise aggregation are provided in the supplement). For the other experiments, pairwise aggregation is used, which draws 50,000 samples for intermediate stages and halves $\delta_\rho$ after each stage to refine the resolution (the value of $\delta_\rho$ listed below is for the final stage). The random ensemble of PART consists of 40 trees.

4.1 Two Toy Examples

The two toy examples highlight the performance of our methods in terms of (i) recovering multiple modes and (ii) correctly locating posterior mass when subset posteriors are heterogeneous. The PART-KD/PART-ML results are obtained from Algorithm 2 without local Gaussian smoothing.

¹ MATLAB implementation available from https://github.com/richardkwo/random-tree-parallel-MCMC
² https://github.com/wwrechard/weierstrass
³ http://helios.fmi.fi/~lainema/mcmc/

Bimodal Example Figure 1 shows an example consisting of m = 10 subsets. Each subset consists of 10,000 samples drawn from a mixture of two univariate normals $0.27\,N(\mu_{i,1}, \sigma_{i,1}^2) + 0.73\,N(\mu_{i,2}, \sigma_{i,2}^2)$, with the means and standard deviations slightly different across subsets, given by $\mu_{i,1} = -5 + \epsilon_{i,1}$, $\mu_{i,2} = 5 + \epsilon_{i,2}$ and $\sigma_{i,1} = 1 + |\delta_{i,1}|$, $\sigma_{i,2} = 4 + |\delta_{i,2}|$, where $\epsilon_{i,l} \sim N(0, 0.5)$, $\delta_{i,l} \sim N(0, 0.1)$ independently for $i = 1, \ldots, 10$ and $l = 1, 2$. The resulting true combined posterior (red solid) consists of two modes with different scales. In Figure 1, the left panel shows the subset posteriors (dashed) and the true posterior; the right panel compares the results of the various methods to the truth. A few are omitted in the graph: average and weighted average overlap with parametric, and Weierstrass overlaps with PART-KD/PART-ML.
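The bimodal setup can be reproduced with a few lines of code. The sketch below simulates the m = 10 subset draws and computes the true combined posterior on a grid via the product rule (3); it follows the description above but is otherwise illustrative (the second argument of N(·, ·) is treated as a variance here, and the seed and grid are arbitrary choices), not the authors' experiment script.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, n = 10, 10_000                       # subsets and draws per subset

# Per-subset mixture parameters as described above
# (N(0, 0.5) and N(0, 0.1) interpreted with variances 0.5 and 0.1 -- an assumption).
eps = rng.normal(0.0, np.sqrt(0.5), size=(m, 2))
dlt = rng.normal(0.0, np.sqrt(0.1), size=(m, 2))
mu = np.column_stack([-5 + eps[:, 0], 5 + eps[:, 1]])
sd = np.column_stack([1 + np.abs(dlt[:, 0]), 4 + np.abs(dlt[:, 1])])
w = np.array([0.27, 0.73])

# "Subset posterior" draws: 10,000 samples per subset from its own mixture
comp = rng.choice(2, size=(m, n), p=w)
subset_draws = rng.normal(mu[np.arange(m)[:, None], comp],
                          sd[np.arange(m)[:, None], comp])

# True combined posterior on a grid: multiply the subset densities pointwise
# as in (3) and renormalize numerically.
grid = np.linspace(-10, 15, 2001)
log_prod = np.zeros_like(grid)
for i in range(m):
    dens_i = (w[0] * norm.pdf(grid, mu[i, 0], sd[i, 0]) +
              w[1] * norm.pdf(grid, mu[i, 1], sd[i, 1]))
    log_prod += np.log(dens_i)
true_combined = np.exp(log_prod - log_prod.max())
true_combined /= true_combined.sum() * (grid[1] - grid[0])   # normalize on the grid
```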
Figure 1: Bimodal posterior combined from 10 subsets. Left: the true posterior and subset posteriors (dashed). Right: aggregated posterior output by various methods compared to the truth. Results are based on 10,000 aggregated samples.

Rare Bernoulli Example We consider $N = 10{,}000$ Bernoulli trials $x_i \overset{iid}{\sim} \mathrm{Ber}(\theta)$ split into m = 15 subsets. The parameter θ is chosen to be 2m/N, so that on average each subset contains only 2 successes. Because of the random partitioning, the subset posteriors are rather heterogeneous, as plotted in dashed lines in the left panel of Figure 2. The prior is set to $\pi(\theta) = \mathrm{Beta}(\theta; 2, 2)$. The right panel of Figure 2 compares the results of the various methods. PART-KD, PART-ML and Weierstrass capture the true posterior shape, while parametric, average and weighted average are all biased. The nonparametric and semiparametric methods produce flat densities near zero (not visible in Figure 2 due to the scale).

Figure 2: The posterior for the probability θ of a rare event. Left: the full posterior (solid) and m = 15 subset posteriors (dashed). Right: aggregated posterior output by various methods. All results are based on 20,000 aggregated samples.

4.2 Bayesian Logistic Regression

Synthetic dataset The dataset $\{(x_i, y_i)\}_{i=1}^N$ consists of $N = 50{,}000$ observations in p = 50 dimensions. All features $x_i \in \mathbb{R}^{p-1}$ are drawn from $N_{p-1}(0, \Sigma)$ with p = 50 and $\Sigma_{k,l} = 0.9^{|k-l|}$. The model intercept is set to 3 and the other coefficients $\theta_j^*$ are drawn randomly from $N(0, 5^2)$. Conditional on $x_i$, $y_i \in \{0, 1\}$ follows $p(y_i = 1) = 1/(1 + \exp(-\theta^{*\top}[1, x_i]))$. The dataset is randomly split into m = 40 subsets. For both the full chain and the subset chains, we run adaptive MCMC for 200,000 iterations after 100,000 burn-in iterations. Thinning by 4 results in T = 50,000 samples. The samples from the full chain (denoted $\{\theta_j\}_{j=1}^T$) are treated as the ground truth. To compare the accuracy of different methods, we resample T points $\{\hat\theta_j\}$ from each aggregated posterior and then compare them using the following metrics: (1) the RMSE of the posterior mean, $\big\|\frac{1}{T}\sum_j\hat\theta_j - \frac{1}{T}\sum_j\theta_j\big\|_2$; (2) the approximate KL divergences $D_{KL}(p(\theta)\,\Vert\,\hat p(\theta))$ and $D_{KL}(\hat p(\theta)\,\Vert\,p(\theta))$, where $\hat p$ and $p$ are both approximated by multivariate Gaussians; (3) the posterior concentration ratio, defined as $r = \sqrt{\sum_j\|\hat\theta_j - \theta^*\|_2^2\big/\sum_j\|\theta_j - \theta^*\|_2^2}$, which measures how the posterior spreads out around the true value $\theta^*$ (r = 1 being ideal). The results are provided in Table 1. Figure 4 shows $D_{KL}(p\,\Vert\,\hat p)$ versus the length of the subset chains supplied to the aggregation algorithm. The results of PART are obtained with $\delta_\rho = 0.001$, $\delta_a = 0.0001$ and 40 trees. Figure 3 showcases the aggregated posterior for two parameters in terms of joint and marginal distributions.

| Method | RMSE | $D_{KL}(p\,\Vert\,\hat p)$ | $D_{KL}(\hat p\,\Vert\,p)$ | r |
|---|---|---|---|---|
| PART (KD) | 0.587 | 3.95 × 10² | 6.45 × 10² | 3.94 |
| PART (ML) | 1.399 | 8.05 × 10¹ | 5.47 × 10² | 9.17 |
| average | 29.93 | 2.53 × 10³ | 5.41 × 10⁴ | 184.62 |
| weighted | 38.28 | 2.60 × 10⁴ | 2.53 × 10⁵ | 236.15 |
| Weierstrass | 6.47 | 7.20 × 10² | 2.62 × 10³ | 39.96 |
| parametric | 10.07 | 2.46 × 10³ | 6.12 × 10³ | 62.13 |
| nonparametric | 25.59 | 3.40 × 10⁴ | 3.95 × 10⁴ | 157.86 |
| semiparametric | 25.45 | 2.06 × 10⁴ | 3.90 × 10⁴ | 156.97 |

Table 1: Accuracy of posterior aggregation on logistic regression.

Figure 3: Posterior of θ₁ and θ₁₇ (panels: PART-KD, PART-ML).
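For reference, the three evaluation metrics above can be computed from the sampled chains as follows. This is a sketch with hypothetical array inputs, where the two KL terms use moment-matched multivariate Gaussian approximations, as stated in the text.

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) between the moment-matched Gaussian approximations."""
    p = mu0.shape[0]
    d = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(np.linalg.solve(S1, S0)) +
                  d @ np.linalg.solve(S1, d) - p + logdet1 - logdet0)

def evaluate(full_draws, agg_draws, theta_true):
    """full_draws, agg_draws : (T, p) samples from the full chain / aggregated posterior
    theta_true : (p,) true parameter vector (known for the synthetic data)."""
    mu_p, S_p = full_draws.mean(0), np.cov(full_draws, rowvar=False)
    mu_q, S_q = agg_draws.mean(0), np.cov(agg_draws, rowvar=False)
    rmse = np.linalg.norm(mu_q - mu_p)                  # distance between posterior means
    kl_pq = gaussian_kl(mu_p, S_p, mu_q, S_q)           # D_KL(p || p_hat)
    kl_qp = gaussian_kl(mu_q, S_q, mu_p, S_p)           # D_KL(p_hat || p)
    r = np.sqrt(((agg_draws - theta_true) ** 2).sum() /
                ((full_draws - theta_true) ** 2).sum()) # concentration ratio
    return rmse, kl_pq, kl_qp, r
```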
Real datasets We also run experiments on two real datasets: (1) the Covertype dataset⁴ [17] consists of 581,012 observations in 54 dimensions, and the task is to predict the type of forest cover from cartographic measurements; (2) the MiniBooNE dataset⁵ [18, 19] consists of 130,065 observations in 50 dimensions, and the task is to distinguish electron neutrinos from muon neutrinos using experimental data. For both datasets, we reserve 1/5 of the data as the test set. The training set is randomly split into m = 50 and m = 25 subsets for Covertype and MiniBooNE, respectively. Figure 5 shows the prediction accuracy versus total runtime (parallel subset MCMC + aggregation time) for the different methods. For each MCMC chain, the first 20% of iterations are discarded before aggregation as burn-in. The aggregated chain is required to be of the same length as the subset chains. As a reference, we also plot the results for the full chain and for lasso [20] run on the full training set.

Figure 4: Approximate KL divergence between the full chain and the combined posterior versus the length of subset chains (panels: PART-KD, PART-ML).

Figure 5: Prediction accuracy versus total runtime (running chain + aggregation) on the Covertype and MiniBooNE datasets (semiparametric is not compared due to its long running time). Plots against the length of chain are provided in the supplement.

5 Conclusion

In this article, we propose a new embarrassingly parallel MCMC algorithm, PART, that can efficiently draw posterior samples for large data sets. PART is simple to implement, efficient in subset combining and has theoretical guarantees. Compared to existing EP-MCMC algorithms, PART has substantially improved performance. Possible future directions include (1) exploring other multi-scale density estimators which share similar properties with partition trees but have better approximation accuracy, and (2) developing a tuning procedure for choosing good $\delta_\rho$ and $\delta_a$, which are essential to the performance of PART.

⁴ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
⁵ https://archive.ics.uci.edu/ml/machine-learning-databases/00199

References

[1] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.
[2] Dougal Maclaurin and Ryan P. Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2014.
[3] Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. In EFaBBayes 250 conference, volume 16, 2013.
[4] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Scalable and robust Bayesian inference via the median posterior. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
[5] Sanvesh Srivastava, Volkan Cevher, Quoc Tran-Dinh, and David B. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38, 2015.
[6] Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel MCMC.
In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 623-632, Corvallis, Oregon, 2014. AUAI Press.
[7] Xiangyu Wang and David B. Dunson. Parallel MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.
[8] Linxi Liu and Wing Hung Wong. Multivariate density estimation based on adaptive partitioning: Convergence rate, variable selection and spatial adaptation. arXiv preprint arXiv:1401.2597, 2014.
[9] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448-461, 1973.
[10] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
[11] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[12] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[13] Xiaotong Shen and Wing Hung Wong. Convergence rate of sieve estimates. The Annals of Statistics, pages 580-615, 1994.
[14] Nils Lid Hjort and Ingrid K. Glad. Nonparametric density estimation with a parametric start. The Annals of Statistics, pages 882-904, 1995.
[15] Heikki Haario, Marko Laine, Antonietta Mira, and Eero Saksman. DRAM: efficient adaptive MCMC. Statistics and Computing, 16(4):339-354, 2006.
[16] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive Metropolis algorithm. Bernoulli, pages 223-242, 2001.
[17] Jock A. Blackard and Denis J. Dean. Comparative accuracies of neural networks and discriminant analysis in predicting forest cover types from cartographic variables. In Proc. Second Southern Forestry GIS Conf., pages 189-199, 1998.
[18] Byron P. Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543(2):577-584, 2005.
[19] M. Lichman. UCI machine learning repository, 2013.
[20] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.