# nonparametric_bayesian_aggregation_for_massive_data__e23db48b.pdf Journal of Machine Learning Research 20 (2019) 1-81 Submitted 10/17; Revised 4/19; Published 9/19 Nonparametric Bayesian Aggregation for Massive Data Zuofeng Shang zuofengshang@gmail.com Department of Mathematical Sciences New Jersey Institute of Technology Newark, NJ 07102, USA Botao Hao hao22@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47906, USA Guang Cheng chengg@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47906, USA Editor: Francis Bach We develop a set of scalable Bayesian inference procedures for a general class of nonparametric regression models. Specifically, nonparametric Bayesian inferences are separately performed on each subset randomly split from a massive dataset, and then the obtained local results are aggregated into global counterparts. This aggregation step is explicit without involving any additional computation cost. By a careful partition, we show that our aggregated inference results obtain an oracle rule in the sense that they are equivalent to those obtained directly from the entire data (which are computationally prohibitive). For example, an aggregated credible ball achieves desirable credibility level and also frequentist coverage while possessing the same radius as the oracle ball. Keywords: Credible region, divide-and-conquer, Gaussian process prior, linear functional, nonparametric Bayesian inference 1. Introduction With rapid development in modern technology, massive data sets are becoming more and more common. An important feature of massive data is their large volume which hinders applications of traditional statistical methods. For example, due to huge data amount and limited CPU memory, it is often impossible to process the entire data in a single machine. In the parallel computing environment, a common practice is to distribute massive data to multiple processors, and then aggregate local results in an efficient way. A series of frequentist methods such as Kleiner et al. (2011); Mc Donald et al. (2010); Zhang et al. (2015a); Zhao et al. (2016) have been proposed in this Divide-and-Conquer (D&C) framework. In Bayesian community, there are quite a few computational or methodological works developed for massive data such as scalable algorithms for Bayesian variable selection Boom et al. (2015); Wang et al. (2014) and scalable posterior sampling in parametric models Wang and Dunson (2013); Wang et al. (2015). Theoretical guarantees of D&C methods have been 2019 Zuofeng Shang, Botao Hao, Guang Cheng. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-641.html. Shang, Hao, Cheng recently obtained in robust estimation Minsker et al. (2017), posterior interval estimation Srivastava et al. (2018), credible sets of signal in Gaussian white noise Szab o and van Zanten (2019); Szabo and van Zanten (2018). Rather, the present paper puts focus on uncertainty quantification of the model parameter in general nonparametric regression, primarily in theoretical aspects. For instance, how to aggregate individual posterior means into a global one that maintains frequentist optimality? How to aggregate individual credible balls into a global one with a minimal possible radius? And how many divisions and what kind of priors should be chosen to guarantee Bayesian and frequentist validity of the aggregated ball? We attempt to address these questions in a univariate nonparametric regression setup. Specifically, we develop a set of aggregation procedures in Bayesian nonparametric regression. As a first step, nonparametric Bayesian regression is separately fitted based on each subsample randomly split from a massive dataset. A variety of finite sample valid credible balls (credible intervals) for regression functions (their linear functionals Rivoirard et al. (2012), e.g., local values) are then constructed from each individual posterior distribution based on MCMC. In the second step, we aggregate these credible balls (credible intervals) into global counterparts analytically without involving any additional computation. For example, the center of an aggregated ball is obtained by weighted averaging Fourier coefficients of all individual (approximate) posterior modes, while the radius is given through an explicit formula on individual radii. A notable advantage of this distributed strategy is its dramatically faster computational speed, and this computational advantage becomes more obvious as data size grows. Our aggregation procedures are proven to obtain an oracle rule in the sense that they are equivalent to those obtained directly from the entire data, i.e., called as oracle results which are computationally prohibitive in practice. For example, our aggregated posterior means are proven to achieve optimal estimation rate, and our aggregated credible ball achieves desirable credibility level and also frequentist coverage while possessing asymptotically the same radius as the oracle ball. These oracle results hold when the assigned Gaussian process priors in each subset are properly chosen and the number of subsets does not grow too fast. A fundamental theory underlying Bayesian aggregation is a uniform version of nonparametric Gaussian approximation theorem, also called as Bernstein-von Mises theorem. Developed based on our recent work Shang and Cheng (2017), this theory states that a sequence of individual posterior distributions converge to Gaussian processes uniformly over the number of subsets. The rest of this paper is organized as follows. Section 3 describes our Bayesian nonparametric model with a Gaussian process prior, based on which our main results are developed in Section 4. Specifically, a uniform nonparametric Gaussian approximation theorem is established in Section 4.1, and all the Bayesian aggregation procedures together with their theoretical guarantee are provided in Sections 4.2 4.6. Section 5 provides a simulation study to justify our methods. Section 6 applies the proposed procedures to a real dataset of large size. Main proofs are provided in Appendix. Other results and additional plots are given in a supplementary document Shang and Cheng. Nonparametric Bayesian Aggregation for Massive Data 2. Nonparametric Bayesian Aggregation: An Illustration In this section, we provide a concrete example to demonstrate the intuition of our nonparametric Bayesian aggregation procedure. Our example is based on the special uniform design and periodic Sobolev space which makes our aggregation procedure explicit and easy to understand. Section 2.1 describes our nonparametric Bayesian model, and Section 2.2 demonstrates our algorithm and its numeric performance. General aggregation procedures will be proposed in Sections 3 and 4 with asymptotic properties investigated as well. 2.1. Nonparametric Bayesian model Suppose that we observe the data Zi = (Yi,Xi), i = 1,...,N, generated from the following Gaussian regression model with uniform design Yi f,Xi N(f(Xi),1), X1,...,XN iid Unif[0,1]. (1) Randomly split {1,2,...,N} into s subsets I1,I2,...,Is with I1 = = Is = n (so N = ns). Denote Dj = {Zi i Ij} the j-th subsample for j = 1,...,s and D = s j=1Dj the entire sample. Suppose that f belongs to an m-order periodic Sobolev space Sm 0 [0,1] where Sm 0 [0,1] is the collection of all functions on [0,1] of the form 2 k=1 fk cos(2πkx) + 2 k=1 gk sin(2πkx) (2) with real coefficients fk,gk satisfying k=1 (f2 k + g2 k)(2πk)2m < . (3) Here, m > 1/2 is a constant describing the smoothness of the functions. Wahba (1990) Wahba (1990) introduced a Gaussian process (GP) prior on f which has an interesting smoothing spline interpretation. Specifically, she assumed that the coefficients fk,gk in (2) are independent and normally distributed as follows: fk,gk N (0,[(2πk)2m+β + nλ(2πk)2m] 1), k = 1,2,..., (4) where β > 1 and λ 0 are predecided constants. In particular, β represents the relative smoothness of the prior to the parameter space and λ represents the amount of rescaling. Rescaling priors are also considered by Szab o and van Zanten (2019); Szabo and van Zanten (2018) for constructing credible sets of signals in Gaussian white noise. It can be examined that if f satisfies (2) and (4), then f is a Gaussian process with mean zero and isotropic covariance function K0(x,x ) = 2 k=1 cos(2πk(x x )) (2πk)2m+β + nλ(2πk)2m , x,x [0,1]. (5) Wahba Wahba (1990) showed that the above GP prior (4) generates a posterior distribution corresponding to a penalized likelihood function (with λ the penalty parameter). This Shang, Hao, Cheng provides a Bayesian interpretation for smoothing splines. Below we provide some details to justify this argument. Let Πλ denote the probability distribution of f under (4). To derive the posterior distribution, we need to find the prior density of f. Unlike the parametric settings where the prior densities are Radon-Nikodym (RN) derivatives w.r.t. Lebesgue measure, in the current infinite-dimensional setting it is impossible to do so since there is no Lebesgue measure on Sm 0 [0,1] (see Hunt et al. (1992)). Instead, we need to characterize the prior density of f as an RN derivative w.r.t. other kinds of measures such as Gaussian measure. Following Wahba Wahba (1990), Πλ and Π Π0 (corresponding to λ = 0) are equivalent probability measures, and the RN derivative of Πλ w.r.t. Π is dΠ (f) = k=1 (1 + nλ(2πk) β) 1 exp( nλ k=1 (f2 k + g2 k)(2πk)2m) = k=1 (1 + nλ(2πk) β) 1 exp( nλ 0 f(m)(x)2dx) = k=1 (1 + nλ(2πk) β) 1 exp( nλ 2 J(f)), (6) where J(f) = 1 0 f(m)(x)2dx. Note that k=1 (1 + nλ(2πk) β) 1 converges thanks to β > 1 so that (6) is a valid expression. (6) provides an expression for the prior density of f, which induces the following posterior distribution for f given subsample j: d P(f Dj) P(Dj f)dΠλ(f) 2 i Ij (Yi f(Xi))2 nλ 2 J(f) dΠ(f), j = 1,...,s. (7) Recall that Ij indexes the j-th subsample. The right hand-side of (7) corresponds to penalized likelihood function ℓj(f) = 1 2n i Ij(Yi f(Xi))2 λ 2J(f) which has been well studied in smoothing spline literature (Wahba (1990)). Theoretically, we recommend to choose λ N 2m 2m+β which will be proven to yield optimal Bayesian inference; see Sections 3 and 4. The duality between the posterior and smoothing spline, i.e., (7), enables us to easily choose λ for practical use, e.g., GCV considered by Wahba (1990). 2.2. Nonparamtric Bayesian Aggregation First of all, we calculate fj,n = E{f Dj}, j = 1,...,s, the posterior means based on individual posterior distributions (7). Then we construct a (1 α)-th credible ball centering at fj,n with radius rj,n(α). That is, rj,n(α) > 0 such that P(f Sm 0 [0,1] f fj,n L2 rj,n(α) Dj) = 1 α, where L2 is the usual L2-norm, i.e., f L2 = 1 0 f(x)2dx. In practice, fj,n and rj,n(α) can be both estimated by the posterior samples. For instance, generate M independent samples fj1,...,fj M from (7); estimate fj,n by their average and estimate rj,n(α) by the (1 α)-th percentile of fjl fj,n L2 for 1 l M. We postpone the computational details of the sampling procedure to Section 8.5. We next present a concrete aggregation scheme (procedures (1) (3) below) to construct a credible ball based on these individual results { fj,n,rj,n(α)}s j=1. Specifically, an aggregated Nonparametric Bayesian Aggregation for Massive Data credible ball for f, denoted RN(α), is constructed with its center/radius obtained through weighted averaging the individual centers/radii. Unlike simple averaging commonly used in frequentist setting (see Zhang et al. (2015a)), our procedures for posterior mean aggregation and radius aggregation are weighted averaging with weights ws,N,λ,k defined in (10). These weights are used to calibrate the prior effect such that the aggregation procedure can have satisfactory asymptotic property. The details of our procedure are demonstrated as follows: 1. Posterior mean aggregation. For j-th subsample and k 1, find 0 fj,n(x)cos(2πkx)dx, gj,n,k = 0 fj,n(x)sin(2πkx)dx, (8) where fj,n is the posterior mean based on subsample j. Then we aggregate these quantities through the following formulas: f N,λ,k = s j=1 fj,n,k/s, g N,λ,k = s j=1 gj,n,k/s. (9) In the end, we let f N,λ(x) = k=1 ws,N,λ,k { f N,λ,k 2cos(2πkx) + g N,λ,k 2sin(2πkx)}, (10) where ws,N,λ,k = s(2πk)2m+β+N(1+λ(2πk)2m) (2πk)2m+β+N(1+λ(2πk)2m) for k 1. 2. Posterior radius aggregation. Aggregate the radii rj,n(α) through the following formula: Á Á Á ÀAN,s 1 s s j=1 rj,n(α)2 + BN,s, (11) C2/D2s 4m+2β 1 BN,s = (2C1 2D1 C2/D2s 1 2(2m+β) )N 2m+β 1 0 (1 + (2πx)2m + (2πx)2m+β) kdx, k = 1,2, 0 (1 + (2πx)2m) kdx, k = 1,2. 3. Aggregated credible ball: RN(α) = {f Sm 0 [0,1] f f N,λ L2 r N(α)}. (13) Algorithms based on weighted averaging have been proposed in numerous computational aspects. For instance, Huang and Gelman (2005); Neiswanger et al. (2013); Scott et al. (2016) proposed computational procedures for efficiently aggregating local MCMC samples in which Shang, Hao, Cheng the aggregation steps involve proper weight averaging. Such algorithms are particularly useful to produce MCMC samples from the oracle posterior which can be used for various inferential purposes, e.g., estimation and testing. The present paper focuses on inferences, e.g., construction of credible balls, in a special class of nonparametric regression models, and has more extensive theoretical guarantees. In practice, one can approximate the integral (8) through discretization; see Section 8.5. Theorem 3 will show that RN(α) given in (13) asymptotically covers 1 α mass of the posterior based on the full data set and includes the true function with probability approaching one. More theoretical study on RN(α) such as its center and radius can be found in Sections 4.2 and 4.3. Note that these sections present an aggregation procedure in a more general context, which covers (13) as a special case. A toy simulation study was carried out to examine the proposed procedures (1) (3). Specifically, we examine the computing time and coverage probability (CP) of RN(α) for various choices of s. The CP is defined as the relative frequency of the sets that cover the truth. We choose m = β = 2 in our GP prior (4). Results are summarized in Figure 1. Plot (a) displays the true function f0 under which data were generated. Plot (b) displays how the CP varies as γ = log(s)/log(N). Plot (c) displays that the computing time decreases when γ increases. There seems to be a transition for CP vs. γ, i.e., CP is uniformly close to one when 0 γ < 0.3 and approaches zero when γ > 0.4. In conclusion, RN(α) possesses both satisfactory frequentist coverage and computational efficiency when γ 0.2. Other choices of γ either lower CP or slow down the computing. Thus, under a proper choice of s, our aggregation procedure can maintain good statistical properties and reduce computing burden at the same time. Careful readers may have noticed that the CP approaches one rather than the credibility level (1 α). This issue can be addressed by a modified aggregated set proposed in Section 4.4. More comprehensive simulation results are provided in Section 5 to examine various aggregation procedures such as the pointwise credible intervals. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 10 20 30 40 50 60 Figure 1. Examination of our aggregation procedures (1) (3). Results are based on N = 1200 observations generated from (1) and a GP prior (4) with m = β = 2 and λ = N 2/3. (a) True regression function f0(x) = 2.4β30,17(x) + 1.6β3,11(x), where βa,b is the probability density function for Beta(a, b). (b) Coverage probability (CP) of RN(0.95) vs. γ. (c) Computing time (in seconds) of RN(0.95) vs. γ. Nonparametric Bayesian Aggregation for Massive Data 3. A Nonparametric Bayesian Framework Based on General Design and Space In this section, we introduce a more general Bayesian nonparametric framework based on general design and function space under which the aggregation results will be obtained. Suppose that the data {Yi,Xi}N i=1 follow a nonparametric regression model: Yi f,Xi ind. N(f(Xi),σ2), X1,...,XN iid π(x), (14) where π( ) is a probability density on I = (0,1), and f belongs to an m-order Sobolev space Sm(I): Sm(I) = {f L2(I) f(0),f(1),...,f(m 1)are abs. cont. and f(m) L2(I)}. (15) In particular, Sm 0 [0,1] is a proper subset of Sm(I). Throughout, we let m > 1/2 such that Sm(I) is a reproducing kernel Hilbert space (RKHS). For technical convenience, assume σ2 = 1 and 0 < infx I π(x) supx I π(x) < . When σ2 is unknown, our approach can still be applied with σ2 replaced by its consistent estimate. For any f,g Sm(I), define V (f,g) = E{f(X)g(X)} and J(f,g) = 1 0 f(m)(x)g(m)(x)dx. Following Shang et al. (2013), there exists a sequence of eigenfunctions ϕ1,ϕ2,... Sm(I) and a sequence of eigenvalues 0 = ρ1 = ρ2 = = ρm < ρm+1 ρm+2 such that ρν ν2m and V (ϕν,ϕµ) = δνµ, J(ϕν,ϕµ) = ρνδνµ, ν,µ 1, (16) where δνµ is the Kronecker s delta. We next place a prior distribution Πλ on f, where Πλ is a probability measure on Sm(I) and λ 0 is a hyperparameter. Similar to Section 2, we will characterize Πλ through its Radon-Nikodym (RN) derivative w.r.t. Π, with Π a pre-given probability measure Π on Sm(I). Specifically, assume that the RN derivative of Πλ w.r.t. Π satisfies dΠ (f) exp( nλ 2 J(f)), (17) where J(f) is defined in (3). Interestingly, it is possible to explicitly construct Πλ and Π such that (17) holds. To see this, let Gλ( ) = ν=m+1 wνϕν( ), (18) where wν s are independent of the observations satisfying wν N(0,1/(ρ1+β/(2m) ν +nλρν)),ν > m. Let G( ) = Gλ=0( ). Suppose Πλ and Π are probability measures induced by Gλ and G, i.e., Πλ(S) = P(Gλ S) and Π(S) = P(G S) for any measurable S Sm(I). It follows by H ajek s lemma (see Shang and Cheng (2017)) that (17) holds. In (18), λ 0 and β > 1 are both hyper-parameters characterizing the smoothness of the prior. It is easy to check that the sample path of Gλ belongs to Sm(I) for any β > 1 almost surely. As demonstrated in a simulation study, the GCV-selected λ is sufficient to provide satisfactory results. Shang, Hao, Cheng 4. Main Results In this section, we present a series of main results that are built upon a uniform Gaussian approximation theorem (Section 4.1). Three classes of aggregation procedures are then proposed: aggregated credible balls in both strong and weak topology, and aggregated credible intervals for linear functionals. These results can be classified into two types: finite sample construction (Sections 4.3, 4.4 and 4.5) and asymptotic construction (Section 4.6). The former construction is often time-consuming since its radius (interval length) is obtained through s posterior sampling, while the latter employs a large-sample limit of the radius given by an explicit formula. The computational gain will be illustrated by the simulations in Section 5. Similar to Section 2, let I1,I2,...,Is be a random partition of {1,2,...,N} such that s j=1Ij = {1,2,...,N} with Ij = n for j = 1,...,s and N = ns. 4.1. A Uniform Gaussian Approximation Theorem A fundamental theory underlying Bayesian aggregation is developed in this section. It is a uniform version of Gaussian approximation theorem that characterizes the limit shapes of a sequence of individual posterior distributions. This uniform validity holds if the number of posterior distributions does not grow too fast. Also, Bayesian aggregation procedures possess frequentist validity if λ is chosen properly. Similar to (7), we note that each sub-posterior distribution can be written as d P(f Dj) exp(nℓjn(f))dΠ(f), where ℓjn(f) = n 1 i Ij(Yi f(Xi))2 (λ/2)J(f). Define fj,n = arg max f Sm(I)ℓjn(f), j = 1,...,s. (19) Suppose that fj,n admits the following Fourier expansion: fj,n( ) = ν=1 f(j) ν ϕν( ), 1 j s. (20) Define h = λ1/(2m) with h = N 1 2m+β . We remark that h is an optimal choice for our aggregation procedure as will be shown later. Theorem 1 (Uniform Gaussian Approximation) Suppose that f0 admits a Fourier expansion f0( ) = ν=1 f0 ν ϕν( ) which further satisfies Condition (S) ν=1 f0 ν 2ρ 1+ β 1 If the following holds 3 2 1.866,1 < β < 2m + 1 2m 1,s = o(N β 1 2m+β ) and h h , (21) then we have as N , sup S S max 1 j s P(S Dj) P0j(S) = OPf0 ( s N 4m2+2mβ 10m+1 4m(2m+β) (log N) 5 2 ), (22) Nonparametric Bayesian Aggregation for Massive Data where S is the Borel σ-algebra on Sm(I) with respect to Π, and P0j s are GPs defined by P0j(S) = S exp( n 2 f fj,n 2)dΠ(f) Sm(I) exp( n 2 f fj,n 2)dΠ(f) , S S. (23) Proof of Theorem 1 is rooted in Shang and Cheng (2017) who essentially considered s = 1. Substantial efforts have been made here to quantify a range of partition size s such that local posteriors can be uniformly approximated by GPs. The explicit structure of the GPs provides a guideline for our aggregation procedures which will be introduced in subsequent sections. It should be emphasized that our aggregation of GPs is weighted-averaging which is different from product-based ones such as Cao and Fleet (2014). Condition (S) amounts to requiring known regularity of the truth f0 Sm+ β 1 2 (I). This can be seen from the inequality ν=1 f0 ν 2ν2m+β 1 < since ρν ν2m. This condition essentially means that f0 has derivatives up to order m + β 1 2 (when this order is integervalued). Combined with (21) this means that the regularity of f0 belongs to (m,2m+ 1 4m 1), i.e., the truth function is jointly confined by both functional space and the prior. The -norm used in (23) is defined as follows. For any g, g Sm(I), define g, g = V (g, g) + λJ(g, g) (24) and its squared norm g 2 = g,g . Clearly, , is a valid inner product on Sm(I). Remark 1 We remark that (21) can be replaced by a more general rate condition: nh2m+1 1, an = O( rn), bn 1, r2 nbn r2 n, n r2 nbn = o(1), where rn = (nh) 1/2 + hm, rn = (nh/log 2s) 1/2 + hm+ β 1 2 ,an = n 1/2h 6m 1 4m rn log N,bn = n 1/2h 6m 1 4m (log N)3/2. Here, we provide a technical explanation for the terms rn, rn,an,bn. Specifically, rn can be viewed as the rate of convergence of local ordinary penalized MLE (19), rn can be viewed as the posterior contraction rate of the local Bayesian mode, an,bn are error bounds of the higher-order remainders in the Taylor expansions of the individual penalized likelihood functions. Uniform Gaussian approximation for general h (not necessarily h h ) can be established under such condition. Theorem 3.5 in Shang and Cheng (2017) shows that P0j (conditional on Dj) is induced by a Gaussian process, denoted as W j, in the sense that P0j(S) = P(W j S Dj) for any S S. Define τ 2 ν = ρ 1+ β 2m ν , ν 1. (25) Then we have W j( ) = ν=1 (an,ν f(j) ν + bn,ντνvν)ϕν( ), j = 1,2,...,s, where an,ν = n(1 + λρν)(τ 2 ν + n(1 + λρν)) 1, bn,ν = (τ 2 ν + n(1 + λρν)) 1/2 and vν N(0,τ 2 ν ). For convenience, define the mean functions of W j as fj,n( ) = ν=1 an,ν f(j) ν ϕν( ), j = 1,...,s, (26) Shang, Hao, Cheng such that we can re-express W j as W j = fj,n + Wn, j = 1,...,s, where Wn( ) = ν=1 bn,ντνvνϕν( ) is a zero-mean GP. Note that the posterior mode fj,n is very close to fj,n since fj,n fj,n = o Pf0(1) uniformly for 1 j s; see the proof of Theorem 3. The above characterization of W j is useful for the subsequent Bayesian aggregation procedures. 4.2. Aggregated posterior means In this section, we propose a method to aggregate the posterior means fj,n = E{f Dj}, for j = 1,...,s. The aggregated mean function, denoted as f N,λ( ), can be viewed as a nonparametric Bayesian estimate of f, and will be used to construct aggregated credible balls/intervals to be introduced later. Our aggregation procedure is f N,λ( ) = ν=1 s j=1 fj,n,ϕν ϕν( ). (27) Note that when the model is Gaussian and f Sm 0 (0,1), (27) becomes (10). Next we will show that the aggregation procedure (27) yields minimax optimality in the following theorem. Theorem 2 Under conditions of Theorem 1, the following result holds: max 1 j s fj,n fj,n = OPf0 ( rn s N 4m2+2mβ 10m+1 4m(2m+β) (log N) 5 2 ), (28) If, in addition, 3/2 < β < 2m + 1/(2m) 3/2 and s satisfies 4m2+2mβ 11m+1 8m(2m+β) (log N) 3 then it holds that f N,λ f0 2 = OPf0 (N 2m+β 1 2(2m+β) ), (30) where f 2 = V (f) denotes the V -norm. According to van der Vaart et al. (2008b), the rate in (30) is minimax optimal given Condition (S). 4.3. Aggregated credible region in strong topology In this section, we construct an aggregated credible region based on s individual credible regions (w.r.t. a weighted ℓ2-norm). Specifically, s radii are combined in an explicit manner. This aggregated region possesses nominal posterior mass asymptotically, and is further proven to cover the true function with probability tending to one. This nice frequentist Nonparametric Bayesian Aggregation for Massive Data property is achieved as long as s is not diverging fast and the assigned GP prior in each subset is chosen by setting h h , i.e., λ N 2m/(2m+β). The conservative frequentist coverage can be improved to the nominal level if we use a weaker norm in defining credible region; see Section 4.4. Based on each subset Dj, the individual credible ball is constructed as follows: Rj,n(α) = {f Sm(I) f fj,n 2 rj,n(α)}. The credible ball centers around the posterior mean fj,n, while its radius rj,n(α) is directly sampled from MCMC such that P(Rj,n(α) Dj) = 1 α for any α (0,1). We will construct an aggregated region centering at f N,λ with radius explicitly constructed as follows: s j=1 r2 j,n(α) ζ1,n ζk,n = ν=1 ( n τ 2ν + n(1 + λρν)) k for k = 1,2. The final aggregated credible region is obtained as RN(α) = {f Sm(I) f f N,λ 2 r N(α)}. (32) Our theorem below confirms that RN(α) indeed possesses (asymptotic) posterior mass (1 α), and more importantly, proves that it covers the true function f0 with probability tending to one. Theorem 3 Suppose that f0 satisfies Condition (S), m > 1+ 3 2 , 3/2 < β < 2m+1/(2m) 3/2, β 1 2m+β ), (29) and h h . Then for any α (0,1), P(RN(α) D) = 1 α + o Pf0(1) and limn Pf0(f0 RN(α)) = 1. From the proof of Theorem 3, we point out that when s = 1, the posterior mass of the aggregated credible region is exactly 1 α, consistent with Shang and Cheng (2017). This remark also applies to other aggregated procedures to be presented later. Remark 2 When h h , the radius of the aggregated ball r N(α) N 2m+β 1 2(2m+β) according to the discussions in Section 4.6. This is the optimal rate at which a posterior ball contracts based on the entire sample; see van der Vaart et al. (2008b). 4.4. Aggregated credible region in weak topology In this section, we invoke a weaker norm (than that used in Section 4.3) to construct an aggregated credible region. Under this new norm (inspired by Castillo et al. (2013, 2014)), it is proven that the frequentist coverage exactly matches with the asymptotic credibility level. The requirement on s and h in this section remains the same as Section 4.3. We define a weaker norm than 2, denoted ω. For any f Sm(I) with f = ν fνϕν, define f 2 ω = ν=1 ωνf2 ν , where ων = (ν(log 2ν)) τ for some constant τ > 1. Since ων < 1 for Shang, Hao, Cheng all ν 1, we have f ω f 2. Under the new ω-norm, each individual (1 α) credible region is constructed as Rω j,n(α) = {f Sm(I) f fj,n ω rω,j,n(α)}, where rω,j,n(α) is directly obtained from posterior sampling such that P(Rω j,n(α) Dj) = 1 α. Under ω-norm, the aggregated credible region is constructed as: Rω N(α) = {f Sm(I) f f N,λ ω rω,N(α)}, (33) where the radius is given as s j=1 r2 ω,j,n(α). (34) Interestingly, Section 4.6 illustrates that the aggregated radius rω,N(α) contracts at root-N rate. Our theorem below shows that the frequentist covergage of Rω N(α) exactly matches with the asymptotic posterior mass, both of which achieve the nominal level (1 α). Theorem 4 Suppose that f0 satisfies Condition (S), m > 1 + 3/2, 2 β < (2m 1)2 β 1 2m+β ), s = o(N 4m2+2mβ 12m+1 8m(2m+β) (log N) 3 2 ), and h h . Then for any α (0,1), P(Rω N(α) D) = 1 α + o Pf0(1) and limn Pf0(f0 Rω N(α)) = 1 α. 4.5. Aggregated credible interval for linear functional In this section, we construct aggregated credible intervals for a class of linear functionals of f, denoted as F(f). Examples include the evaluation functional, i.e., F(f) = f(x), and integral functional, i.e., F(f) = 1 0 f(x)dx. Specifically, the interval is centered at F( f N,λ) with an length aggregated through s lengths obtained from posterior sampling. Posterior and frequentist coverage properties of this aggregated interval depends on the functional form F( ). Again, our theory holds when s is mildly diverging and h h . Let F Sm(I) R be a linear Π-measurable functional satisfying the following Condition (F): supν 1 F(ϕν) < , and there exist constants κ > 0 and r [0,1] such that for any f Sm(I), F(f) κh r/2 f . (35) It follows by Shang and Cheng (2017) that the evaluation functional satisfies Condition (F) with r = 1 and the integral functional satisfies Condition (F) with r = 0. Based on each Dj, we obtain from posterior samples the following (1 α) credible interval: CIF j,n(α) = {f Sm(I) F(f) F( fj,n) r F,j,n(α)}, where r F,j,n(α) is a radius such that P(CIF j,n(α) Dj) = 1 α. The aggregated credible interval is constructed as CIF N(α) = {f Sm(I) F(f) F( f N,λ) r F,N(α)} (36) Nonparametric Bayesian Aggregation for Massive Data r F,N(α) = θ1,N s j=1 r F,j,n(α)2 and θ2 k,n = ν=1 (τ 2ν + n(1 + λρν))k for k = 1,2. (37) The shrinking rate of r F,N(α) depends on the functional form F; see Section 4.6. Our theorem below investigates the asymptotic properties of CIN F (α) in terms of both posterior and frequentist coverage. Theorem 5 Suppose that f0 = ν=1 f0 ν ϕν satisfies Condition (S ): ν=1 f0 ν 2ν2m+β < , Ef0{ϵ4 X} M4 a.s. for some constant M4 > 0, Nkθ2 k,N h r for k = 1,2, m > 1 + 2 β < (2m 1)2 2m , s = o(N β 1 2m+β ), s = o(N 4m2+2mβ 12m+1 8m(2m+β) (log N) 3 2 ), (29) and h h . Then for any α (0,1), P(CIF N(α) D) = 1 α + o Pf0(1), and liminf N Pf0(f0 CIF N(α)) 1 α given that Condition (F) holds. Moreover, if 0 < ν=1 F(ϕν)2 < , then lim N Pf0(f0 CIF N(α)) = 1 α. Note that Condition (S ) is slightly stronger than Condition (S) required in Theorem 1. Indeed, this condition essentially means that f0 has derivatives up to order m + β 2 (when this order is integer-valued). Hence, Theorem 5 requires a more smooth true function f0. It was shown in Shang and Cheng (2017) that the integral functional Fx(f) = x 0 f(z)dz for any x [0,1] satisfies (35) with r = 0 and 0 < ν=1 Fx(ϕν)2 < . Therefore, the (1 α)-th credible interval of Fx(f) achieves exactly (1 α) frequentist coverage, while that for the evaluation functional is more conservative. These theoretical findings will be empirically verified in Section 5 . 4.6. Asymptotic aggregated inference In practice, the centers f N,λ, F( f N,λ) and the radii rj,n(α), rω,j,n(α), r F,j,n(α) in Sections 4.3 4.5 are directly obtained from posterior samples. Sometimes posterior sampling is time consuming and inefficient, particularly as s . This computational consideration motivates us to propose an asymptotic approach in which one replaces the above centers/radii by their large sample limits. Our new asymptotic inference procedures dramatically improve the computing speed, as displayed in simulations; see Section 5. Define f N,λ( ) = ν=1 s j=1 fj,n,ϕν ϕν( ). (38) Clearly, f N,λ is a counterpart of f N,λ (27) with fj,n therein replaced by fj,n. By a careful examination of the proofs of Theorems 3 5, it can be shown that the following limits hold: f N,λ f N,λ = o Pf0(N 1/2h 1/4), max 1 j s nr2 j,n(α) ζ1,n 2ζ2,n zα = o Pf0(1), max 1 j s nrω,j,n(α) cα = o Pf0(1), max 1 j s r F,j,n(α)/θ1,n zα/2 = o Pf0(1), (39) Shang, Hao, Cheng where zα = Φ 1(1 α) with Φ( ) being the c.d.f. of standard normal random variable, and cα > 0 satisfies P( ν=1 dνη2 ν cα) = 1 α with ην being independent standard normal random variables. It yields from (39) that the following approximation relationships hold uniformly for 1 j s: Á Á Àζ1,n + 2ζ2,nzα n , rω,j,n(α) cα n and r F,j,n(α) θ1,nzα/2, which further implies (by the aggregation formulae (31), (34) and (37)) r N(α) r N(α) = Á Á Àζ1,N + 2ζ2,Nzα N , rω,N(α) r ω,N(α) = cα r F,N(α) r F,N(α) = θ1,Nzα/2. Thus, we have the following asymptotic counterparts of RN(α), Rω N(α) and CIF N(α): R N(α) = {f Sm(I) f f N,λ 2 r N(α)}, (41) R ω N (α) = {f Sm(I) f f N,λ ω r ω,N(α)}, (42) CI F N (α) = {f Sm(I) F(f) F( f N,λ) r F,N(α)}. (43) Our theorem below shows that the posterior coverage and frequentist coverage of the above computationally efficient alternatives remain the same as those for RN(α), Rω N(α) and CIF N(α) under the same set of conditions. Theorem 6 Suppose that all assumptions in Theorems 3 5 hold. Then for any α (0,1), R N(α), R ω N (α) and CI F N (α) possess exactly the same posterior and frequentist properties as RN(α), Rω N(α) and CIF N(α), respectively. As a byproduct, (40) implies the contraction rate of each aggregated credible ball/interval in Sections 4.3 4.6. It is easy to see that rω,N(α) N 1/2. As for r F,N(α), it depends on the functional form F. For example, when F is an evaluation functional, it holds that θ2 1,N (Nh) 1, leading to N 2m+β 1 2(2m+β) when h h ; when F is an integral functional, we have r F,N(α) N 1/2 since θ2 1,N N 1. As for r N(α), it can be shown by a simple fact ζ1,N,ζ2,N h 1 that r N(α) (Nh) 1/2 N 2m+β 1 2(2m+β) when h h . This contraction rate turns out to be optimal based on the entire sample; see van der Vaart et al. (2008b). However, if we choose h in the scale of subsample size n, e.g., h n 1 2m+β , similar arguments show that r N(α) N 2m+β 1 2(2m+β) s 1 2(2m+β) . Hence, such a region contracts faster than the optimal rate, which results in unsatisfactory frequentist coverage. Table 1 summarizes six aggregated credible regions/intervals from Sections 4.3 4.5 in terms of their centers and radii. Nonparametric Bayesian Aggregation for Massive Data Table 1. Summary of Aggregated (1 α) Credible Regions/Intervals Type Name Notation Center Radius Finite-sample strong CR for f RN(α) f N,λ r N(α) weak CR for f Rω N(α) f N,λ rω,N(α) CI for F(f) CIF N(α) F( f N,λ) r F,N(α) Asymptotic strong CR for f R N(α) f N,λ r N(α) weak CR for f R ω N (α) f N,λ r ω,N(α) CI for F(f) CI F N (α) F( f N,λ) r F,N(α) 5. Simulation Study In this section, statistical properties of the proposed aggregated procedures are examined using a simulation study. We generated samples from the following model Yij = f0(Xij) + ϵij, i = 1,2,...,n,j = 1,2,...,s, (44) where Xij iid Unif[0,1], ϵij iid N(0,1), and ϵij are independent of Xij. The true regression function was chosen to be f0(x) = 2.4β30,17(x) + 1.6β3,11(x), where βa,b is the probability density function for Beta(a,b). Consider GP prior f n ν=1 wνϕν, where wν are defined in (18). The proposed Bayesian procedures were examined. Specifically, we computed the frequentist coverage proportions (CP) of the credible regions (32), (33), (41), (42), and credible intervals (36), (43). In particular, (32), (33) and (36) were constructed based on posterior samples, as described in Sections 4.2 4.5; whereas (41), (42) and (43) were constructed based on asymptotic theory developed in Section 4.6. To ease presentation, we call (32) and (33) as finite-sample credible regions (FCR), and call (41) and (42) as asymptotic credible regions (ACR). The calculation of CP was based on 500 independent experiments. Specifically, the CP is the proportion of the credible regions/intervals containing f0/F(f0) (for a linear functional F). Two types of F were considered: (1) the evaluation functional Fx(f) = f(x) for any x [0,1], and (2) the integral functional Fx(f) = x 0 f(z)dz for any x [0,1]. In both cases, we consider Fx with x being 15 evenly spaced points in [0.05,0.95]. To make the study more complete, a set of credibility levels were examined, i.e., 1 α = 0.1,0.3,0.5,0.7,0.9,0.95. In each experiment, N = 1200 independent samples were generated from the model (44). For ACR and FCR, we chose the number of divisions s = 1,2,3,4,5,6,8,10,12,15,20,24,30,40,60. Define γ = log s/log N. Note that s = 1 (equivalently, γ = 0) means no division. Figure 2 demonstrates the results for FCR and ACR based on strong topology, i.e., (32) and (41). The red dotted line indicates the (1 α) credibility level. It can be seen that the CP of both FCR and ACR is above the credibility levels when γ is small, while it suddenly drops to zero as γ is beyond some threshold, say 0.3. This observation supports our theory that s should not grow too fast, and that the credible regions based on strong Shang, Hao, Cheng 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 Figure 2. CP of ACR and FCR based on strong topology. Dotted red lines indicate credibility levels. topology tends to be more conservative. Figure 3 demonstrates the results for FCR and ACR based on weak topology, i.e., (33) and (42). We observe that the CP of both ACR and FCR approaches the desired credibility levels when γ 0.3, but quickly drops to zero when γ becomes large. This observation also supports our theory that the use of weak topology leads to a more satisfactory frequentist coverage. For credible intervals of linear functionals, we chose the number of divisions s = 1,6,15,60. Figures 4 and 5 display the results for evaluation functional and integral functional, respectively, based on posterior samples. It can be seen that when s = 60, the CP of the credible intervals for the evaluation functional drops to zero at most of the x points, indicating the failure in covering the true values of the function. However, when s = 1,6,15, the CP is Nonparametric Bayesian Aggregation for Massive Data 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.4 0.8 Figure 3. CP of ACR and FCR based on weak topology. Dotted red lines indicate credibility levels. Shang, Hao, Cheng above the credibility levels except for the points where the true function f0 has peaks; see (a) of Figure 1. The observation that the CP stays above (1 α) coincides with our theory that the credible interval of the evaluation functional is conservative. On the other hand, it can be seen that when s = 60, the CP of the credible intervals for the integral functional becomes far below the credibility levels at most x. However, when s = 1,6,15, the CP is close to the credibility levels at all x. This finding coincides with our theory that the the credible interval of the integral functional achieves exactly (1 α) frequentist coverage. The above results also support our claim that s cannot grow too fast for guaranteeing frequency validity. Credible intervals based on asymptotic theory, i.e., (43), were summarized in Figures 11 and 12 of the supplement document Shang and Cheng. Interpretations of these results are similar to those based on finite posterior samples. The supplement document Shang and Cheng also includes Figures 13 16 which demonstrate how the radii/lengths of the aggregated credible regions/intervals change along with γ, the size of the subsample. It can be observed that when γ 0.3, indicating that the full sample is divided into at most twelve subsamples, the radii of the aggregated regions/intervals are almost identical to the radii of the regions/intervals directly constructed from the full sample, i.e., γ = 0. This means that our aggregated procedures, based on a suitable amount of divisions, indeed mimic the oracle procedures. However, when γ increases to 0.6, the distinctions between the the aggregated and oracle procedures quickly become obvious. We also repeated the above study for N = 1800 and 2400. The plots corresponding to these studies are given in supplement document; see Section S.8.6 of Shang and Cheng. The interpretations of these additional results are similar as above. To the end of this section, computing efficiency is investigated. Figure 6 displays the results based on a single experiment for various choices of N. Specifically, we look at the value of the quantity ρ = 1 (T/T0) versus a collection of γ s for FCR and ACR, where T0 (T) is the computing time without using D&C (based on D&C). We observe that T is substantially smaller than T0, and this computation efficiency (as reflected by the value of ρ) becomes more obvious as γ grows for each fixed N. This can also be seen as N grows for each fixed γ. However, this reduction in computing time does not affect the performances of the aggregated credible regions when 0 γ 0.3, as demonstrated in Figures 2, 3, 13 16. 6. Real Data Analysis In this section, we apply our methods to Million Song Data (MSD) and Flight Delay Data (FDD). 6.1. Million Song Data As a real application, we apply our aggregation procedure to analyze MSD. The MSD is a perfect example of large dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Each observation is a song track released between the year 1922 and 2011. The response variable Yi is the year when the song was released and the covariate Xi is the timbre average of the song. The main purpose is to explore a relationship, denoted as f, between song features and years in a nonparametric regression model, i.e., year = f(timbre)+error. The above model is useful to Nonparametric Bayesian Aggregation for Massive Data 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 4. CP of Fx(f) = f(x) against x based on posterior samples of f. Dotted red lines indicate credibility levels. Shang, Hao, Cheng 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 5. CP of Fx(f) = x 0 f(z)dz against x based on posterior samples of f. Dotted red lines indicate credibility levels. Nonparametric Bayesian Aggregation for Massive Data 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Figure 6. ρ versus γ based on FCR and ACR for single experiment. predict production year based on song timbre. Due to enormous sample size, processing the entire data is infeasible. In frequentist setting, a distributed kernel ridge regression method was proposed by Zhang et al. (2015a,b) for estimation purposes (without quantifying uncertainty). In the Bayesian setup, we applied our aggregation procedure to construct 95% credible sets for f based on a subset of N = 10,000 songs released from the year 1996 to 2010. We randomly split the observations to s = 5,10,20 subsets. We also compared our results with the baseline method in which all ten thousand observations were used. Credible sets are displayed as gray areas in Figure 7. We find that the shapes of all credible sets are overall the same when the timbre ranges from -4 to 4, e.g., all display a W-shape, although the results are a bit sensitive near the endpoints. Therefore, the overall pattern of the sets appears to be insensitive to the above selections of s. 6.2. Flight Delay Data We applied our aggregation procedure to one more real data set, the FDD. The data consists of flight arrival and departure information for all commercial flights within the United States, from October 1987 to April 2008. The main purpose is to find the key factors that have an Shang, Hao, Cheng 1990 1995 2000 2005 2010 1990 1995 2000 2005 2010 1990 1995 2000 2005 2010 1990 1995 2000 2005 2010 Figure 7. 95% Credible sets (grey areas) for f based on a subset of 10,000 samples in Million Song Data. The first plot refers to the baseline method where the whole samples were used. The rest three plots refer to the aggregation procedure which was applied to 5, 10, 20 random splits. impact on the flight delay. We considered the relationship (denoted f) between month and the length of the flight delay, i.e., length of flight delay = f(month)+error. Negative length of delay implies that the flight arrived earlier. We applied the same Bayesian aggregation procedure as described in MSD to a randomly selected subset of N = 10,000 flight information in the year 2007. We randomly split the observations to s = 10,100,500 subsamples, based on which the aggregated credible sets for f were constructed. We also compared the results with the baseline where all the ten thousand samples were used. Credible sets are displayed as gray areas in Figure 8. Again, the shapes of the four credible sets appear to be almost the same for all s. 6.3. Computation Efficiency We compare the overall execute computation time of both MSD and FDD on different numbers of splits, e.g. computational time per machine number of machines in Figure 9-10. It can be seen that the computing time dramatically decreases as the number of splits increases, which reflects the scalability of our proposed algorithm. Nonparametric Bayesian Aggregation for Massive Data 2 4 6 8 10 12 10 5 0 5 10 15 20 2 4 6 8 10 12 10 5 0 5 10 15 20 2 4 6 8 10 12 10 5 0 5 10 15 20 2 4 6 8 10 12 10 5 0 5 10 15 20 Figure 8. 95% Credible sets (grey areas) for f based on a subset of 10,000 samples in Flight Delay Data. The first plot refers to the baseline method where the whole samples were used. The rest three plots refer to the aggregation procedure which was applied to 10, 100, 500 random splits. 7. Conclusions This paper proposes algorithms for aggregating individual posterior results such as modes, balls, intervals, into their global counterparts. The algorithms are easy-to-implement which are particularly useful in big data scenarios. We also experimented the proposed algorithms through simulated and real data sets. A notable contribution of this article is to provide rigorously justified theoretical guarantees. The major tool for proving our theoretical results is a uniform Gaussian approximation theorem which shows that the individual posterior distributions converge uniformly to Gaussian processes provided that the number of subsets is not too large. Acknowledgments Shang s research is sponsored by NSF grants DMS 1821157 and DMS 1764280. Guang Cheng would like to acknowledge support by NSF DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research (ONR N00014-18-2759). In addition, Guang Cheng is a member of Institute for Advanced Study, Princeton and visiting Fellow of SAMSI for the Deep Learning Program in the Fall of 2019; he would like to thank both Institutes for their hospitality. Shang, Hao, Cheng Figure 9. Computational time of aggregation procedures for MSD. Figure 10. Computational time of aggregation procedures for FDD. Nonparametric Bayesian Aggregation for Massive Data 8. APPENDIX This appendix section contains the proofs of the main results. Section 8.1 contains proof of Theorem 1 and relevant preliminary results. Section 8.2 includes the proof of Theorem 2. Sections 8.3 and 4.4 includes the proofs of Theorems 3 and 4, i.e., coverage properties of the credible sets based on strong and weak topology respectively. All proofs crucially depend on an eigensystem designed for simultaneous diagonalization of the two bilinear functionals U,V induced from likelihood and prior, respectively. In fact, (ϕν,ρν) is a solution of the following ordinary differential system (whose existence and uniqueness is guaranteed by Birkhoff(1908)): ( 1)mϕ(2m) ν ( ) = ρνπ( )ϕν( ), ϕ(j) ν (0) = ϕ(j) ν (1) = 0, j = m,m + 1,...,2m 1, (1) Properties of this eigen-system are summarized in Proposition 1, whose proof can be found in (Shang et al., 2013, Proposition 2.2). Proposition 1 It holds that supν N ϕν < , and that the sequence ρν is nondecreasing with ρ1 = = ρm = 0, and ρν > 0 for µ > m. Moreover, ρν ν2m and V (ϕµ,ϕν) = δµν, J(ϕµ,ϕν) = ρµδµν, µ,ν N, (2) where δµν is the Kronecker s delta. In particular, any f Sm(I) admits a Fourier expansion f = ν V (f,ϕν)ϕν with convergence held in the -norm. 8.1. Proofs in Section 4.1 The proof of Theorem 1 requires the following technical result which derives a local contraction rate rn uniformly over s: rn = (nh/log 2s) 1/2 + hm+ β 1 2 . The proof can be found in (Shang and Cheng). Proposition 1 If f0 satisfies Condition (S) and the following Rate Condition (R) holds: nh2m+1 1, an = O( rn), bn 1, r2 nbn r2 n. Let a 0 be a fixed constant. Then for any ε (0,1), there exist positive constants M ,N s.t. for any n N , Pf0 (max 1 j s{E{ f f0 a I( f f0 M rn) Dj} M s2 exp( n r2 n/log(2s))) ε (3) We remark that Proposition 1 significantly generalizes the classical results in Ghosal et al. (2000); van der Vaart et al. (2008a). Proof [Proof of Theorem 1] Let M1,M2 be large positive constants. For any fixed constant a 0, consider three events: E n = {max 1 j s fj,n f0 M1 rn} E n = {max 1 j s E{ f f0 a I( f f0 M2 rn) Dj} M2s2 exp( n r2 n/log(2s))} E n = {max 1 j s E0j{ f f0 a I( f f0 M2 rn)} M2 exp( n r2 n)} Shang, Hao, Cheng where E0j means expectation taken under P0j. It follows from Shang and Cheng and Proposition 1 that we can choose M1 > M2 (both large enough) s.t. Pf0(E n E n) 1 ε1/2 where ε1 > 0 is an arbitrary constant. Meanwhile, by (Shang and Cheng) we have, on E n, for any 1 j s, E0j{ f f0 a I( f f0 M2 rn)} = f f0 M2 rn f f0 a exp( n 2 f fj,n 2)dΠ(f) Sm(I) exp( n 2 f fj,n 2)dΠ(f) f f0 M2 rn f f0 a exp( n 2 f fj,n 2)dΠ(f) f f0 rn exp( n 2 f fj,n 2)dΠ(f) exp( ((M2 M1)2/2 (M1 + 1)2/2 c3/4)n r2 n)C(a,Π), (4) where c3 > 0 is a universal constant and C(a,Π) = Sm(I) f f0 adΠ(f). We can choose M2 > C(a,Π) so that the quantity (4) is less than M2 exp( n r2 n). So E n implies E n , so that Pf0(E n ) Pf0(E n E n) 1 ε1/2. Define En = E n E n E n , then it can be seen that Pf0(En) 1 ε1. Let Tj be defined as 2n i Ij [( f)(Xi)2 EX{( f)(X)2}]. (5) Following Lemma 9, for any 1 j s, ℓjn(f) ℓjn( fj,n) + 1 2 f fj,n 2 = Tj(f). (6) It follows from the proof of Proposition 1 that on En, for any f Sm(I) satisfying f f0 M2 rn and 1 j s, Tj(f) D r2 nbn, (7) where D = D(M1,M2) is a positive constant depending only on M1,M2. Recall that our assumption says that ε2 n D r2 nbn = o(1). For 1 j s, define Jnj1 = Sm(I) exp(n(ℓjn(f) ℓjn( fj,n)))dΠ(f), Jnj2 = Sm(I) exp( n 2 f fj,n 2)dΠ(f), Jnj1 = f f0 M2 rn exp(n(ℓjn(f) ℓjn( fj,n)))dΠ(f), Jnj2 = f f0 M2 rn exp( n 2 f fj,n 2)dΠ(f). For simplicity, let ε3 = M2s2 exp( n r2 n/log(2s)). On En (with a = 0) and for any 1 j s, 0 Jnj1 Jnj1 Jnj1 M2s2 exp( n r2 n/log(2s)) = ε3, 0 Jnj2 Jnj2 Jnj2 exp( n r2 n) ε3. Nonparametric Bayesian Aggregation for Massive Data By some algebra, it can be shown that the above inequalities lead to (1 ε3) Jnj2 Jnj1 Jnj2 Jnj1 1 1 ε3 Jnj2 Jnj1 . (8) Meanwhile, on En and for any 1 j s, using (7) and the elementary inequality exp(x) 1 2 x for x log 2, we get that Jnj2 Jnj1 f f0 M2 rn exp( n 2 f fj,n 2) exp(n Tj(f)) 1 dΠ(f) leading to that 1 1 + 2ε2 Jnj2 Jnj1 1 1 2ε2 . (9) Combining (8) and (9), on En and for any 1 j s, 1 ε3 1+2ε2 Jnj2 Jnj1 1 (1 2ε2)(1 ε3). When n is large, ε3 ε2 and both quantities are small, the above inequalities lead to 1 + 2ε2 1 Jnj2 Jnj1 1 1 (1 2ε2)(1 ε3) 1 4ε2 (10) For simplicity, denote Rnj(f) = n Tj(f). For any S S, let S = S {f Sm(I) f f0 M2 rn}. Then on En, we get that max1 j s P(S Dj) P0j(S) max1 j s P(S Dj) P0j(S ) + 2ε3. Moreover, it follows from (10) that on En and for any 1 j s, P(S Dj) P0j(S ) = S exp(n(ℓjn(f) ℓjn( fj,n))) Jnj1 exp( n 2 f fj,n 2) 2 f fj,n 2) exp(Rnj(f)) Jnj1 1 Jnj2 dΠ(f) 2 f fj,n 2) exp(Rnj(f)) 1 2 f fj,n 2) exp(Rnj(f)) 1 Jnj1 1 Jnj2 dΠ(f) 2ε2 S exp( n 2 f fj,n 2)dΠ(f) Jnj1 1 Jnj2 S exp( n 2 f fj,n 2)dΠ(f) 2ε2 + exp(ε2) Jnj2 Jnj1 1 2ε2 + 4ε2 exp(ε2) 14ε2. Shang, Hao, Cheng Note that the right hand side is free of S. Then we get that on En, sup S S max1 j s P(S Dj) P0j(S) 14ε2 + 2ε3 16ε2. This implies that for sufficiently large n, Pf0 (sup S S max 1 j s P(S Dj) P0j(S) > 16ε2) Pf0(Ec n) + Pf0 (En,sup S S max 1 j s P(S Dj) P0j(S) > 16ε2) = Pf0(Ec n) ε1. The desirable result follows by the simple fact ε2 s N 4m2+2mβ 10m+1 4m(2m+β) (log N) 5 2 when h h . 8.2. Proofs in Section 4.2 Proof [Proof of Theorem 2] We first show (28). Let An = {f Sm(I) f f0 M rn} and Bj = {f Sm(I) d P(f Dj) d P0j(f)} for 1 j s. By Proposition 1, Theorem 1 and (4) with a = 1 therein, we can choose M > 0 sufficiently large such that max 1 j s E(f Dj) E0j(f) = max 1 j s (f f0)d P(f Dj) (f f0)d P0j(f) max 1 j s An (f f0)d P(f Dj) + max 1 j s An (f f0)d P0j(f) + max 1 j s Acn (f f0)(d P(f Dj) d P0j(f)) max 1 j s E{ f f0 I(f An) Dj} + max 1 j s E0j{ f f0 I(f An)} +M rn max 1 j s Acn d P(f Dj) d P0j(f) = OPf0 (s2 exp( n r2 n/log(2s)) + exp( n r2 n) + rn s N 4m2+2mβ 10m+1 4m(2m+β) (log N) 5 2 ) = OPf0 ( rn s N 4m2+2mβ 10m+1 4m(2m+β) (log N) 5 2 ) OPf0(LN), where the second last equality uses Theorem 1 and the fact that, uniformly for j, Acn d P(f Dj) d P0j(f) = P(Ac n Bj Dj) P0j(Ac n Bj) + P(Ac n Bc j Dj) P0j(Ac n Bc j) . Then (28) follows from the trivial fact that E0j{f} = E(W j Dj) = fj,n. Next we show (30). By direct examinations we can verify the following Rate Conditions (R): n r2 nbn = o(1),N r2 Nb N = o(1),Nh1/2a2 N = o(1),Nh1/2a2 n = o(1). Define Remj,n = fj,n f0 Sj,n(f0) for j = 1,2,...,s. It follows by Lemma 6 of Shang and Cheng that max1 j s Remj,n = OPf0(an). Nonparametric Bayesian Aggregation for Massive Data It is easy to see that a N,ν/an,ν s for all ν 1. Then it holds from (38) that f N,λ f N,λ 2 = ν 1 (a N,ν s j=1 ( fj,n fj,n),ϕν 2 (1 + λρν) s j=1 ( fj,n fj,b) 2 = OPf0 (s2L2 N) = o Pf0(N 1h 1/2). (11) The last equality owes to the condition s4 log(2s) = o(N 4m2+2mβ 11m+1 2m(2m+β) (log N) 5) and β > 3/2. By direct examinations, we have f N,λ f0 = ν=1 s j=1 V ( fj,n,ϕν) f0 ν ϕν s j=1 V (Remj,n + f0 + Sj,n(f0),ϕν) f0 ν ϕν = ν=1 a N,νV (1 s j=1 Remj,n,ϕν)ϕν + ν=1 (a N,ν 1)f0 ν ϕν + ν=1 a N,νV ( 1 N i=1 ϵi KXi,ϕν)ϕν ν=1 a N,νV (Pλf0,ϕν)ϕν. (12) Denote the four terms in the above equation by T1,T2,T3,T4. Since a N,ν 1, it is easy to see that T1 2 2 = ν=1 a2 N,ν V (1 s j=1 Remj,n,ϕν) 2 s j=1 Remj,n,ϕν) 2 = 1 s j=1 Remj,n 2 2 (max 1 j s Remj,n )2 = OPf0(a2 n). Using h N 1/(2m+β) and a direct algebra we get that T2 2 2 = ν=1 (a N,ν 1)2 f0 ν 2 ν=1 ( ν2m+β ν2m+β + N(1 + λν2m)) 2 f0 ν 2 = o(N 2m+β 1 2m+β ) = o(N 1h 1). Meanwhile, it follows by Proposition Shang and Cheng that T4 2 2 = ν=1 a2 N,ν f0 ν 2 ( λρν 1 + λρν ) 2 ν=1 f0 ν 2 ( λρν 1 + λρν ) 2 ν=1 f0 ν 2(hν)2m+β 1 (hν)2m β+1 (1 + (hν)2m)2 = o(N 2m+β 1 2m+β ) = o(N 1h 1). Define R(x,x ) = ν=1 a N,ν ϕν(x)ϕν(x ) 1+λρν for any x,x I. Also define Rx( ) = R(x, ). It is easy to see that Rx Sm(I) for any x I. Then it can be shown that T3 = 1 N N i=1 ϵi RXi, Shang, Hao, Cheng T3 2 2 = V (T3,T3) = 1 N i=1 ϵ2 i V (RXi,RXi) + 2 N2 i 0, we get by (11) that, with Pf0-probability approaching one, f N,λ f0 2 r N(α). Meanwhile, it follows by Shang and Cheng that for N,λ f0 SN,λ(f0) 2 = OPf0(a N) and 1 s s j=1 fj,n f0 1 s s j=1 Sj,n(f0) 2 = OPf0(an), where SN,λ(f0) = 1 N N i=1 ϵi KXi Pλf0. Note that SN,λ(f0) = 1 s s j=1 Sj,n(f0), which leads to for N,λ 1 s s j=1 fj,n 2 = OPf0(an + a N). Since a N,ν 1, we get Nonparametric Bayesian Aggregation for Massive Data N for N,λ f N,λ 2 = N ν=1 a2 N,νV for N,λ 1 s j=1 fj,n,ϕν 2 (1 + λρν) N ν=1 V for N,λ 1 s j=1 fj,n,ϕν 2 (1 + λρν) = N for N,λ 1 s j=1 fj,n 2 = OPf0(Na2 n + Na2 N) (21) = o Pf0(h 1/2), (by condition Nh1/2a2 n + Nh1/2a2 N = o(1)) Using (11) we get that N for N,λ f N,λ 2 2 = o Pf0(h 1/2). Since E{ WN, for N,λ f N,λ 2 2 D} = ν 1 b2 N,νV ( for N,λ f N,λ,ϕν)2 for N,λ f N,λ 2 2/N = o Pf0(N 2h 1/2), we have that N W or f N,λ 2 2 = N WN 2 2 + o Pf0(h 1/2). It follows by P (N WN 2 2 ζ1,N 2ζ2,N zα) 1 α, (14) and (20) that P(RN(α) D) = 1 α + o Pf0(1). This completes the proof. 8.4. Proofs in Section 4.4 Before proving Theorem 4, let us present a preliminary lemma. Lemma 3 As N , N WN 2 ω d ν=1 dνη2 ν, and n Wn 2 ω d ν=1 dνη2 ν, where ην are independent standard normal random variables. Proof [Proof of Theorem 4] By direct examinations, one can show that Rate Conditions (R ): n r2 nbn = o(1), N r2 Nb N = o(1), Na2 N = o(1) and Na2 n = o(1) are all satisfied. We first have the following fact: max 1 j s nrω,j,n(α) cα = o Pf0(1), (22) where cα > 0 satisfies P( ν=1 dνη2 ν cα) = 1 α with ην being independent standard normal random variables. It follows from (22) that Nrω,N(α)2 = cα + o Pf0(1). (23) By Theorem 2 and the condition s = o(N 4m2+2mβ 12m+1 8m(2m+β) (log N) 3 2 ) we have the following max1 j s n j 2 ω = max1 j s n j 2 2 = OPf0(n L2 N) = o Pf0(1). Also, for arbitrarily small ε (0,1), P( Wn, j ω 2 j 2 ω/(nε) Dj) ε. The proof of (22) is then similar to the proof of (15) and details are omitted. Let T1,T2,T3,T4 be defined in (12). It follows from the proof of Theorem 3 that T1 2 ω T1 2 2 = OPf0(a2 n), so N T1 2 ω = OPf0(Na2 n) = o Pf0(1) due to the condition Na2 n = o(1). It follows by condition h N 1/(2m+β), dominated convergence theorem and direct Shang, Hao, Cheng examinations, T2 2 ω = ν=1 dν(a N,ν 1)2 f0 ν 2 N 2 ν=1 dν ν2m+β+1 (1 + (hν)2m + (hν)2m+β)2 ν2m+β 1 f0 ν 2 (1 + (hν)2m + (hν)2m+β)2 ν2m+β 1 f0 ν 2 = o(N 1), T4 2 ω = ν=1 dνa2 N,ν ( λρν 1 + λρν ) 2 f0 ν 2 ν=1 dν (hν)2m β+1 (1 + (hν)2m + (hν)2m+β)2 f0 ν 2(hν)2m+β 1 (1 + (hν)2m + (hν)2m+β)2 f0 ν 2ν2m+β 1 = o(N 1). By direct examination it can be shown that T3 = 1 N N i=1 ϵi ν=1 ϕν(Xi)ϕν 1+λρν+N 1τ 2ν . It follows by Shang and Cheng (2017) that, as N , N T3 2 ω d ν=1 dνη2 ν. By the above analysis on T1 through T4, and N f N,λ f N,λ 2 ω = OPf0(Ns2L2 N) = o Pf0(1), we get that N f N,λ f0 2 ω d ν=1 dνη2 ν. It follows by (23) that lim N Pf0(f0 Rω N(α)) = 1 α. It follows by N for N,λ f N,λ 2 2 = OPf0(Na2 N + Na2 n) = o Pf0(1) (see (21)), P(N WN 2 2 cα) 1 α, (23) and (14) that P(Rω N(α) D) = 1 α + o Pf0(1). Proof is completed. 8.5. Computational Details In this subsection, we provide some computational details relating to Section 2.2. For convenience, we rewrite model (2.1) as following: Yji = f(Xji) + ϵji, j = 1,...,s, i = 1,...,n. Calculation of posterior means. In order to calculate the posterior mean fj,n, we have to generate samples of f from its posterior distribution P(f {Yji,Xji}n i=1). In practice, directly sampling the function f from P(f {Yji,Xji}n i=1) is impossible. Instead, we generate some samples from (f(Xj1),...,f(Xjn)) . As n is large, (f(Xj1),...,f(Xjn)) can represent the whole curve of f. Firstly, let us derive the posterior distribution for (f(Xj1),...,f(Xjn)) . For the j-th subsample, the likelihood function is written by Yj1,...,Yjn Xj1,...,Xjn N((f(Xj1),...,f(Xjn)) ,In). Since f follows a GP prior with mean zero and covariance function K0, where K0 is given in (5), the prior of (f(Xj1),...,f(Xjn)) is multivariate Gaussian: (f(Xj1),...,f(Xjn)) N(0,Kj), where Kj is the covariance matrix satisfying K0(Xj1,Xj1), K0(Xj1,Xjn) K0(Xjn,Xj1) K0(Xjn,Xjn) Nonparametric Bayesian Aggregation for Massive Data K0(x,x ) involves an infinite summation which is practically infeasible. Instead, the infinite sum is approximated by a finite one, i.e., K0(x,x ) 2 M k=1 cos(2πk(x x )) (2πk)2m+β + nλ(2πk)2m . In our numerical study, we found that M = 100 can already provide a good approximation. Due to the conjugacy, the posterior distribution of (f(Xj1),...,f(Xjn)) also follows a multivariate Gaussian distribution (f(Xj1),...,f(Xjn)) {Yji,Xji}n i=1 N(Kj(Kj+ 1 n In) 1(Yj1,...,Yjn) ,Kj(Kj+ 1 Next we generate M independent samples, denoted (f(l)(Xj1),...,f(l)(Xjn)) ,l = 1,...,M, from above multivariate Gaussian distribution. Therefore, the posterior mean can be approximated by ( fjn(Xj1),..., fjn(Xjn)) = ( 1 M l=1 f(l)(Xj1),..., 1 M l=1 f(l)(Xjn)) . Calculation of posterior radius. Once we have M independent samples {(f(l)(Xj1),...,f(l)(Xjn))}M l=1, we are able to approximate f(l) fj,n L2 by n i=1 (f(l)(Xji) fjn(Xji))2) 1 2 , for l = 1,...,M. Finally, the radius rj,n(α) is approximated by the upper α-th percentile of {L1,...,LM}. Calculation of the integral. We approximate (8) by n i=1 fj,n(Xji)cos(2πk Xji)dx, gj,n,k n i=1 fj,n(Xji)sin(2πk Xji)dx. In (12), Ck and Dk also involve two integrals. Since they are independent of samples, any numerical method for integral calculation is applicable. We also approximate f N,λ(x) in (10) by f N,λ(x) M k=1 ws,N,λ,k { f N,λ,k 2cos(2πkx) + g N,λ,k 2sin(2πkx)}. George D Birkhoff. Boundary value and expansion problems of ordinary linear differential equations. Transactions of the American Mathematical Society, 9(4):373 395, 1908. Willem van den Boom, Galen Reeves, and David B Dunson. Scalable approximations of marginal posteriors in variable selection. ar Xiv preprint ar Xiv:1506.06629, 2015. Robert H Cameron and William T Martin. Transformations of weiner integrals under translations. Annals of Mathematics, pages 386 396, 1944. Shang, Hao, Cheng Yanshuai Cao and David J Fleet. Generalized product of experts for automatic and principled fusion of gaussian process predictions. ar Xiv preprint ar Xiv:1410.7827, 2014. Isma el Castillo, Richard Nickl, et al. Nonparametric bernstein von mises theorems in gaussian white noise. The Annals of Statistics, 41(4):1999 2028, 2013. Isma el Castillo, Richard Nickl, et al. On the bernstein von mises phenomenon for nonparametric bayes procedures. The Annals of Statistics, 42(5):1941 1969, 2014. Yuan Shih Chow and Henry Teicher. Probability theory: independence, interchangeability, martingales. Springer Science & Business Media, 2012. Peter de Jong. A central limit theorem for generalized quadratic forms. Probability Theory and Related Fields, 75(2):261 277, 1987. Subhashis Ghosal, Jayanta K Ghosh, Aad W Van Der Vaart, et al. Convergence rates of posterior distributions. Annals of Statistics, 28(2):500 531, 2000. Jorgen Hoffmann-Jorgensen, Lawrence A Shepp, and Richard M Dudley. On the lower tail of gaussian seminorms. The Annals of Probability, pages 319 342, 1979. Zaijing Huang and Andrew Gelman. Sampling for bayesian computation with large datasets. Available at SSRN 1010107, 2005. Brian R Hunt, Tim Sauer, and James A Yorke. Prevalence: a translation-invariant almost every on infinite-dimensional spaces. Bulletin of the American mathematical society, 27 (2):217 238, 1992. Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I Jordan. Bootstrapping big data. In Advances in neural information processing systems, workshop: Big learning: Algorithms, systems, and tools for learning at scale, 2011. Michael R Kosorok. Introduction to empirical processes and semiparametric inference. Springer, 2008. James Kuelbs, Wenbo V Li, and Werner Linde. The gaussian measure of shifted balls. Probability Theory and Related Fields, 98(2):143 162, 1994. Cheng Li, Sanvesh Srivastava, and David B Dunson. Simple, scalable and accurate posterior interval estimation. Biometrika, 104(3):665 680, 2017. Wenbo Li et al. A gaussian correlation inequality and its applications to small ball probabilities. Electronic Communications in Probability, 4:111 118, 1999. Ryan Mc Donald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 456 464. Association for Computational Linguistics, 2010. Nonparametric Bayesian Aggregation for Massive Data Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B Dunson. Robust and scalable bayes via a median of subset posterior measures. The Journal of Machine Learning Research, 18(1):4488 4527, 2017. Carl N Morris et al. Natural exponential families with quadratic variance functions: statistical theory. The Annals of Statistics, 11(2):515 529, 1983. Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel mcmc. ar Xiv preprint ar Xiv:1311.4780, 2013. Iosif Pinelis et al. Optimum bounds for the distributions of martingales in banach spaces. The Annals of Probability, 22(4):1679 1706, 1994. Vincent Rivoirard, Judith Rousseau, et al. Bernstein von mises theorem for linear functionals of the density. The Annals of Statistics, 40(3):1489 1523, 2012. Walter Rudin et al. Principles of mathematical analysis, volume 3. Mc Graw-hill New York, 1964. Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I George, and Robert E Mc Culloch. Bayes and big data: The consensus monte carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78 88, 2016. Hao Botao Shang, Zuofeng and Guang Cheng. Supplementary document to nonparametric bayesian aggregation for massive data. . Zuofeng Shang and Guang Cheng. Gaussian approximation of general non-parametric posterior distributions. Information and Inference: A Journal of the IMA, 7(3):509 529, 2017. Zuofeng Shang, Guang Cheng, et al. Local and global asymptotic inference in smoothing spline models. The Annals of Statistics, 41(5):2608 2638, 2013. Sanvesh Srivastava, Cheng Li, and David B Dunson. Scalable bayes via barycenter in wasserstein space. The Journal of Machine Learning Research, 19(1):312 346, 2018. Botond Szabo and Harry van Zanten. Adaptive distributed methods under communication constraints. ar Xiv preprint ar Xiv:1804.00864, 2018. Botond Szab o and Harry van Zanten. An asymptotic analysis of distributed nonparametric methods. Journal of Machine Learning Research, 20(87):1 30, 2019. Sara Van De Geer and SA Van De Geer. Empirical Processes in M-estimation. Cambridge UP, 2006. Aad W van der Vaart, J Harry van Zanten, et al. Reproducing kernel hilbert spaces of gaussian priors. In Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, pages 200 222. Institute of Mathematical Statistics, 2008a. Shang, Hao, Cheng Aad W van der Vaart, J Harry van Zanten, et al. Rates of contraction of posterior distributions based on gaussian process priors. The Annals of Statistics, 36(3):1435 1463, 2008b. Grace Wahba. Spline models for observational data, volume 59. Siam, 1990. Xiangyu Wang and David B Dunson. Parallelizing mcmc via weierstrass sampler. ar Xiv preprint ar Xiv:1312.4605, 2013. Xiangyu Wang, Peichao Peng, and David B Dunson. Median selection subset aggregation for parallel inference. In Advances in neural information processing systems, pages 2195 2203, 2014. Xiangyu Wang, Fangjian Guo, Katherine A Heller, and David B Dunson. Parallelizing mcmc with random partition trees. In Advances in neural information processing systems, pages 451 459, 2015. Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299 3340, 2015a. Yuchen Zhang, Martin Wainwright, and Michael Jordan. Distributed estimation of generalized matrix rank: Efficient algorithms and lower bounds. In International Conference on Machine Learning, pages 457 465, 2015b. Tianqi Zhao, Guang Cheng, and Han Liu. A partially linear framework for massive heterogeneous data. Annals of statistics, 44(4):1400, 2016. Nonparametric Bayesian Aggregation for Massive Data Supplementary document to Nonparametric Bayesian Aggregation for Massive Data This supplementary document is structured as follows. Section S.8.1 contains the proofs of Lemmas 2 and 3. Section S.8.2 contains the proofs of the main results in Section 4.5 and 4.6 that were not included in the main paper. Section S.8.3 proves Proposition 1, i.e., a uniform contraction rate result. Preliminary results relevant to the proof of Proposition 1 are provided in Section S.8.4. Section S.8.5 includes a result that characterizes the posterior tail moments of f f0 a for any a 0. Section S.8.6 includes additional simulation results supplementary to Section 5. S.8.1. Proofs of Lemmas 2 and 3 Proof [Proof of Lemma 2] We only show the first limit distribution since the proof of the second one is similar. Let ην = τνvν. Then ην is a sequence of iid standard normals. Note that Wn 2 2 = ν=1 η2 ν τ 2ν + n(1 + λρν). Let Un = (n Wn 2 2 ζ1,n)/ 2ζ2,n, then we have n(η2 ν 1) τ 2ν + n(1 + λρν). By straightforward calculations and Taylor s expansion of log(1 x), it can be shown that the logarithm of the moment generating function of Un equals log E{exp(t Un)} = t2/2 + O (t3ζ 3/2 2,n ζ3,n). (S.1) Without loss of generality, assume that N = na for some a 1. Then α1 = min{1/(2m + β),a/(2m + β)} = 1/(2m + β). It follows by (Shang and Cheng, 2017, Lemma S.1) that ζ2,n nα1 and ζ3,n nα1, so the remainder term in (S.1) is O(n α1/2) = o(1). So limn E{exp(t Un)} = exp(t2/2). Proof is completed. Proof [Proof of Lemma 3] The proof follows by moment generating function approach and direct calculations. S.8.2. Proofs in Sections 4.5 and 4.6 This section contains the proofs in Sections 4.5 and 4.6. Shang, Hao, Cheng Proofs in Section 4.5 Proof [Proof of Theorem 5] Recall in the proof of Theorem 4 we showed that Rate Conditions (R ) are satisfied. It is easy to see that F(Wn) d= N(0,θ2 1,n), and F(WN) d= N(0,θ2 1,N). (S.2) For 1 j s, define RF j,n(α) = {f Sm(I) F(f) F( fj,n) r F,j,n(α)}. It follows by The- orem 1 that max1 j s 1 α P0j(RF j,n(α)) = o Pf0(1). Since s = o(N 4m2+2mβ 12m+1 8m(2m+β) (log N) 3 2 ), it can be examined that NL2 N = o(1). Together with the condition h r Nθ2 1,N and the fact θk,N θk,n, one can verify that h r Nθ2 1,N Nθ2 1,n = o(L 2 N θ2 1,n). So we have by (35) and Theorem 2 that max 1 j s F( j) = OPf0(h r/2LN) = o Pf0(θ1,n). Combined with (S.2) we get that P0j(RF j,n(α)) = P( F(Wn) F( j) r F,j,n(α) Dj) = Φ(r F,j,n(α) + F( j) θ1,n ) + Φ(r F,j,n(α) F( j) = 2Φ(r F,j,n(α) θ1,n ) 1 + o Pf0(1), uniformly for 1 j s. The above argument leads to Φ(r F,j,n(α)/θ1,n) = 1 α/2 + o Pf0(1) uniformly for 1 j s, which further leads to the following max 1 j s r F,j,n(α)/θ1,n zα/2 = o Pf0(1). (S.3) Consider the decomposition (12) with T1,T2,T3,T4 being defined therein. It follows by (13) and rate condition Na2 n = o(1) that N T1 2 = OPf0(Na2 n) = o Pf0(1). Meanwhile, it follows by Condition (S ), N 1 h2m+β and λ = h2m and direct examinations that N T2 2 = N ν=1 (a N,ν 1)2 f0 ν 2(1 + λρν) N ν=1 ( ν2m+β ν2m+β + N(1 + λν2m)) 2 f0 ν 2(1 + λν2m) (hν)2m+β + (hν)4m+β (1 + (hν)2m + (hν)2m+β)2 f0 ν 2ν2m+β = o(1), N T4 2 = N ν=1 a2 N,ν ( λρν 1 + λρν ) 2 f0 ν 2(1 + λρν) 1 + (hν)2m f0 ν 2ν2m+β = o(1). Nonparametric Bayesian Aggregation for Massive Data By (11) and Ns2L2 N = o(1) we get f N,λ f N,λ = o Pf0(N 1/2). Therefore, f N,λ f0 T3 f N,λ f N,λ + T1 +T2 +T4 = o Pf0(N 1/2). If follows from (35) that F( f N,λ f0) F(T3) = o Pf0(h r/2N 1/2). Note that F(T3) = 1 N N i=1 ϵi F(RXi), where the kernel R is defined in the proof of Theorem 3. We will derive asymptotic distribution for F(T3). Let s2 N = V arf0( N i=1 ϵi F(RXi)). It is easy to show that s2 N = N3 ν=1 (τ 2ν + N(1 + λρν))2 = N3θ2 2,N. Clearly, by uniform boundedness of ϕν and F(ϕν), we get F(Rx) = ν=1 a N,ν ϕν(x)F(ϕν) 1 + λρν h 1, where the is free of x I, and Ef0{ϵ2F(RX)2} = N2 ν=1 (τ 2ν + N(1 + λρν))2 = N2θ2 2,N. (S.4) Then for any δ > 0, by condition Ef0{ϵ4 X} M4 a.s., N i=1 Ef0{ϵ2 i F(RXi)2I( ϵi F(RXi) δs N)} N s2 N (δs N) 2Ef0{ϵ4F(RX)4} N s2 N (δs N) 2h 2Ef0{ϵ2F(RX)2} δ 2N 1h 2+r = o(1), where the last o(1)-term follows by h h and 2 r < 2m + β. By Lindeberg s central limit theorem, as N , F(T3) N i=1 ϵi F(RXi) d N(0,1). (S.5) By condition N2θ2 2,N h r, we have F( f N,λ f0 T3) Nθ2,N = o Pf0 (h r/2N 1/2 Nθ2,N ) = o Pf0(1). It follows by (S.3) that r F,N(α) = θ1,N s j=1 r F,j,n(α)2/θ2 1,n = θ1,Nzα/2(1 + o Pf0(1)), (S.6) leading to that r F,N(α) Nθ2,N = θ1,N Nθ2,N zα/2(1 + o Pf0(1)). Shang, Hao, Cheng It can be shown that θ2 1,N Nθ2 2,N = ν=1 F(ϕν)2 1+λρν+N 1τ 2ν ν=1 F(ϕν)2 (1+λρν+N 1τ 2ν )2 1, together with (S.5) we get that Pf0( F(f0) F( f N,λ) r F,N(α)) = Pf0 F( f N,λ f0 T3) Nθ2,N + F(T3) Nθ2,N r F,N(α) Pf0 F( f N,λ f0 T3) Nθ2,N + F(T3) Nθ2,N zα/2(1 + o Pf0(1)) Notice that when 0 < ν=1 F(ϕν)2 < , θ2 1,N Nθ2 2,N 1, leading to that the probability in (S.7) approaches exactly 1 α. In the end, we show that P(RF N(α) D) = 1 α + o Pf0(1), where RF N(α) = {f Sm(I) F(f) F( f N,λ) r F,N(α)}. By rate condition N(a2 N + a2 n) = o(1), proof of (21) leading to for N,λ f N,λ = OPf0(a N + an), and (35) we have F( for N,λ f N,λ) θ1,N = OPf0 (h r/2(a N + an) θ1,N ) = o Pf0(1), where the last o(1)-term follows by condition Nθ2 1,N h r and Rate Condition (R ). From (S.6) we get that P0(RF N(α)) = P(W or RF N(α) D) = P( F(W or) F( f N,λ) r F,N(α) D) = P F( for N,λ f N,λ) θ1,N + F(WN) θ1,N r F,N(α) = 1 α + o Pf0(1). (S.8) So it follows from (14) that P(RF N(α) D) = 1 α + o Pf0(1). Proof is completed. Proofs in Section 4.6 Proof [Proof of Theorem 6] It follows from (20) that r N(α) r N(α) = o Pf0(N 1h 1/2), which together with (19) leads to that limn Pf0(f0 R N(α)) = 1. It follow from Lemma 2, (14) and the proof of Theorem 3 that P(R N(α) D) = 1 α + o Pf0(1). It follows from (23) that rω,N(α)2 r ω,N(α)2 = o Pf0(N 1). Then the desired results on R ω N (α) directly follow from the proof of Theorem 4. Nonparametric Bayesian Aggregation for Massive Data It follows by (S.6) that r F,N(α) = r F,N(α)(1 + o Pf0(1)). Then the desired results on CI F N (α) follow from (S.7) and (S.8). S.8.3. Proofs of Proposition 1 and relevant results The goal of this section is to prove Proposition 1 and relevant results. Before proofs, we exactly describe the Fr echet derivatives of the likelihood function that will be technically useful. Suppose that (Y,X) follows model (14) based on f. Let g,gk Sm(I) for k = 1,2. For j = 1,2,...,s, the Fr echet derivative of ℓjn can be identified as Dℓjn(g)g1 = 1 n i Ij (Yi g(Xi)) KXi,g1 Pλg,g1 = Sj,n(g),g1 . Define Sλ(g) = E{Sj,n(g)}. We also use DSλ and D2Sλ to represent the secondand third-order Fr echet derivatives of Sλ. Note that Sj,n( fj,n) = 0, and Sj,n(f) can be expressed as Sj,n(f) = 1 n i Ij (Yi f(Xi))KXi Pλf. (S.9) The Fr echet derivative of Sj,n is denoted DSj,n(g)g1g2. These derivatives can be explicitly written as D2ℓjn(g)g1g2 = DSj,n(g)g1g2 = 1 n i Ij g1(Xi)g2(Xi) Pλg1,g2 , The proof of Proposition 1 requires a series of preliminary lemmas. Define Hm(b) = {f Sm(I) J(f) b2}. We first state a basic lemma about a concentration phenomenon of smoothing spline estimates in the distributed setup. Lemma 4 If b,r,h,M are positives satisfying the following Rate Condition (H): 1. h1/2r 1, 2. c2 KM1/2rh 1/2B(h) 1/2, where B(h) = A(h,2) with A(h,ε) given in (S.19), then, for any 1 j s, the following two results hold: 1. supf Hm(b) Pf ( fj,n f δn) 2exp( Mnhr2), where δn = bhm +2c K(Cϵ +M)r with Cϵ = E{( ϵ + 1)2 exp( ϵ + 1)} an absolute constant; 2. supf Hm(b) Pf ( fj,n f Sj,n(f) > an) 2exp( Mnhr2), where an = c2 KM1/2h 1/2r B(h)δn. Here, Sj,n(f) is the Fr echet derivative of the likelihood function ℓjn(f); see (S.9) for its exact expression. Lemma 5 For any fixed constants M > 1 and b > 0, let r = (nh/log 2s) 1/2,δn = bhm + 2c K(Cϵ + M)r, (S.10) Shang, Hao, Cheng an = c2 KM1/2h 1/2r B(h)δn. (S.11) Then as n , Pf0 (max 1 j s fj,n f0 δn) 6s N M 0, and Pf0 (max 1 j s fj,n f0 Sj,n(f0) > an) 8s N M 0. Proof [Proof of Lemma 5] The result is a straightforward consequence of Lemma 4. Lemma 6 It holds that max 1 j s fj,n f0 Sj,n(f0) = OPf0(an). (S.12) Proof [Proof of Lemma 6] The proof follows by Lemma 5, and simple fact that B(h) h 2m 1 Lemma 7 Under Condition (S), we get max1 j s fj,n f0 = OPf0( rn). Proof [Proof of Lemma 7] Recall that Sj,n(f0) = 1 n i Ij (Yi f0(Xi))KXi Pλf0. It was shown by Shang et al. (2013) that Pλϕν = λϕν 1+λϕν ϕν. Since f0 satisfies Condition (S), Pλf0 2 = ν=1 f0 ν λρν 1 + λρν ϕν, ν=1 f0 ν λρν 1 + λρν ϕν = ν=1 f0 ν 2 λ2ρ2 ν 1 + λρν 2m ν=1 f0 ν 2ρ 1+ β 1 2m ν (λρν)1 β 1 2m 1 + λρν = O(h2m+β 1), where the last equation follows by λ = h2m, supx 0 x1 β 1 2m 1+x < , and Condition (S). On the other side, it follows by the proof of (S.22) that Pf0 max 1 j s i Ij (Yi f0(Xi))KXi L(M)n(nh/log 2s) 1/2 2sexp( Mnh(nh/log 2s) 1) = (2s)1 M 0, as M , where L(M) = c K(Cϵ + M). This implies that max 1 j s i Ij (Yi f0(Xi))KXi = OPf0(n(nh/log 2s) 1/2), Nonparametric Bayesian Aggregation for Massive Data max 1 j s Sj,n(f0) = OPf0((nh/log 2s) 1/2 + hm+ β 1 2 ) = OPf0( rn). Together with (S.12) of Lemma 6 and the rate condition an rn, we get that max1 j s fj,n f0 = OPf0( rn). Consider a function class G = {g Sm(I) g 1,J(g,g) c 2 K h 2m+1}. (S.13) Lemma 8 For any fixed constant M > 1, as n , Pf0 (max 1 j ssup g G Zj,n(g) B(h) M log N) 1, where Zj,n(g) = 1 n i Ij[ψj,n(Zi;g)KXi E{ψj,n(Zi;g)KXi}], ψj,n(Zi;g) = c 1 K h1/2g(Xi). Proof [Proof of Lemma 8] It is easy to see that ψj,n(Zi;g) satisfies the Lipschitz continuity condition (S.20). Then the result directly follows by Lemma 12 (see appendix). Lemma 9 For j = 1,...,s, 1. ℓjn(f) ℓjn( fj,n) = Ij,n(f), where Ij,n(f) = 1 0 1 0 s DSj,n( fj,n + ss (f fj,n))(f fj,n)(f fj,n)dsds for any f Sm(I); 2. Ij,n(f) = Tj(f) 1 2 f fj,n 2, where recall that (see 5) 2n i Ij [(f fj,n)(Xi)2 EX{(f fj,n)(X)2}]. (S.14) Proof [Proof of Lemma 9] Let f = f fj,n. Therefore, Ij,n(f) = 1 0 s i Ij ( f)(Xi)2dsds λJ( f, f)/2 2n i Ij [( f)(Xi)2 EX{( f)(X)2}] 1 By Taylor s expansion in terms of Fr echet derivatives, ℓjn(f) ℓjn( fj,n) = Sj,n( fj,n)(f fj,n) + Ij,n(f) = Ij,n(f). Shang, Hao, Cheng Lemma 10 There exists a universal constant c3 > 0 s.t. Π( f f0 rn) exp( c3 r 2 2m+β 1 n ), where recall that Π is the probability measure induced by G. Proof [Proof of Lemma 10] Note that λ r 4m 2m+β 1 n . Then it follows by Lemma 13 (with dn therein replaced by rn) and the proof of Theorem 7 that Π( f f0 rn) = P( G f0 rn) P(V (G f0) r2 n/2,λJ(G f0) r2 n/2) P(V (G f0) r2 n/2,J(G f0) r 2(β 1) 2m+β 1 n /2) = P( V ( G f0) r2 n/2, J( G f0) r 2(β 1) 2m+β 1 n /2) P( V ( G ω) (1/ 2 1/2)2 r2 n, J( G ω) (1/ 2(β 1) 2m+β 1 n ) exp( ω 2 β/2) P( V ( G) (1/ 2 1/2)2 r2 n, J( G) (1/ 2(β 1) 2m+β 1 n ) exp( ω 2 β/2)P( V ( G) (1/ 2 1/2)2 r2 n/2) P( J( G) (1/ 2(β 1) 2m+β 1 n /2) exp( c3 r 2 2m+β 1 n ), where c3 > 0 is a universal constant. Proof [Proof of Proposition 1] Fix any ε (0,1). Let M1 be a large constant so that (thanks to Lemma 7) the event E n = {max 1 j s fj,n f0 M1 rn} (S.15) has probability approaching one. Meanwhile, for a fixed constant M > 1, define E n = {max 1 j ssup g G Zj,n(g) B(h) M log N}. (S.16) By Lemma 8 we have that E n has Pf0-probability approaching one. Thus, it holds that, when n becomes large, Pf0(En) 1 ε/2, where En = E n E n. In the rest of the proof we simply assume that En holds. For some positive constant M0, it follows by Theorem 7 that max 1 j s E{ f f0 a I( f f0 M0rn) Dj} = OPf0(s2 exp( nr2 n)). Let C > M1 be a constant to be further determined later, then we have that max 1 j s E{ f f0 a I( f f0 2C rn) Dj} max 1 j s E{ f f0 a I( f f0 M0rn) Dj} + max 1 j s E{ f f0 a I(2C rn f f0 M0rn) Dj}. Nonparametric Bayesian Aggregation for Massive Data The first term is OPf0(s2 exp( nr2 n)). Thus, when n is sufficiently large, Pf0 (max 1 j s E{ f f0 a I( f f0 M0rn) Dj} M s2 exp( nr2 n)/2) ε/2 for a large constant M > 0. Next we only need to handle the second term. Let f = f fj,n. It follows by Lemma 9 that Ij,n(f) = Tj(f) 1 2 f 2, and ℓjn(f) ℓjn( fj,n) = Ij,n(f). Therefore, E{ f f0 a I(f An) Dj} = An f f0 a exp(n(ℓjn(f) ℓjn( fj,n)))dΠ(f) Sm(I) exp(n(ℓjn(f) ℓjn( fj,n)))dΠ(f) = An f f0 a exp(n Ij,n(f))dΠ(f) Sm(I) exp(n Ij,n(f))dΠ(f) , where An = {f Sm(I) 2C rn f f0 M0rn}. Let Jj1 = Sm(I) exp(n Ij,n(f))dΠ(f), Jj2 = An f f0 a exp(n Ij,n(f))dΠ(f). Then on En and for f f0 rn, we have f fj,n f f0 + fj,n f0 (M1 + 1) rn. Let dn = c K(M1 + 1)h 1/2 rn. It follows by similar arguments as above (S.23) that d 1 n f G. Note that on En and for f f0 rn, for all 1 j s, Tj(f) = 1 2n i Ij [( f)(Xi)2 EX{( f)(X)2}] = 1 2n i Ij [( f)(Xi)KXi EX{( f)(X)KX}], f 1 2n f i Ij [( f)(Xi)KXi EX{( f)(X)KX}] = c Kh 1/2dn f 2 n Zj,n(d 1 n f) c Kh 1/2dn f D(c K,M,M1) n 1/2h 6m 1 log N D(c K,M,M1) r2 nbn, (S.17) where D(c K,M,M1) is constant depending only on c K,M1,M. It follows that on En and for all 1 j s, Jj1 f f0 rn exp(n Ij,n(f))dΠ(f) = f f0 rn exp(n Tj(f) n 2 f fj,n 2)dΠ(f) exp( [D(c K,M,M1)bn + (M1 + 1)2/2]n r2 n)Π( f f0 rn). Shang, Hao, Cheng Since Π( f f0 rn) exp( c3 r 2 2m+β 1 n ) (Lemma 10), together with rn (nh) 1/2 + 2 2n 2m+β 1 2(2m+β) , we get that n r 2+ 2 2m+β 1 n n(4n 2m+β 1 2m+β )1+ 1 2m+β 1 = 4. Therefore, r 2 2m+β 1 n n r2 n/4, leading to Π( f f0 rn) exp( c3 4 n r2 n). (S.18) This implies by rate conditions bn 1 that, on En and for any 1 j s, Jj1 exp( [D(c K,M,M1)bn + (M1 + 1)2/2 + c3/4]n r2 n) exp( [D(c K,M,M1) + (M1 + 1)2/2 + c3/4]n r2 n). Next we handle Jj2. The idea is similar to how we handle Jj1 but with technical difference. Let f = f fj,n. Note that r2 n r2 n log(2s), and hence, on En, for any f An, i.e., f f0 M0rn, we get that f = fj,n f fj,n f0 + f f0 M1 rn + M0rn (M0 + M1)rn log(2s). Let d n = c K(M0 + M1)h 1/2rn log(2s). Then d 1 n f G. Using previous similar arguments handling Tj(f), we have that on En, for any f An and 1 j s, 2 n c Kh 1/2d n B(h) 1 2c2 K(M0 + M1)2M1/2n 1/2h 1r2 n B(h)(log N)3/2 D(c K,M,M0,M1) n 1/2r2 nh 6m 1 4m (log N)3/2 = D(c K,M,M0,M1) r2 nbn D(c K,M,M0,M1) r2 n, where D(c K,M,M0,M1) is constant only depending on c K,M,M0,M1 and the last inequality follows by rate condition r2 nbn r2 n. It is easy to see that on En and for any f An and 1 j s, fj,n f f f0 fj,n f0 (2C M1) rn, leading to that Jj2 exp( ((2C M1)2 2 D(c K,M,M0,M1))n r2 n)C(a,Π), where C(a,Π) = Sm(I) f f0 adΠ(f) is the ath prior moment of f f0 which is finite. Choose C > M1 to be large such that 2 1 + D(c K,M,M1) + D(c K,M,M0,M1) + (M1 + 1)2/2 + c3/4. Therefore, on En, max 1 j s E{ f f0 a I(f An) Dj} max1 j s Jj2 min1 j s Jj1 exp( n r2 n)C(a,Π). So we get that Pf0 (max 1 j s E{ f f0 a I(f An) Dj} exp( n r2 n)C(a,Π)) Pf0(Ec n) ε/2. Nonparametric Bayesian Aggregation for Massive Data By r2 n r2 n log(2s), the above leads to that Pf0 (max 1 j s E{ f f0 a I( f f0 2C rn) Dj} (M + C(a,Π))s2 exp( n r2 n/log(2s))) ε. Proof is completed. S.8.4. Proofs of other results in Section S.8.3 Let N(ε,G, ) be the ε-packing number in terms of supremum norm, where recall that the space G is defined in (S.13). The following result can be found in Van De Geer and Van De Geer (2006). Lemma 11 There exists a universal constant c0 > 0 s.t. for any ε > 0, log N(ε,G, ) c0( 2c 1 K )1/mh 2m 1 For r 0, define Ψ(r) = r 0 log(1 + exp(x 1/m))dx. For arbitrary ε > 0, define A(h,ε) = 32 2c 1 K cm 0 h (2m 1)/2Ψ( 1 2 c Kc m 0 h(2m 1)/2ε) log (1 + exp(2c0(( 2) 1c Kh(2m 1)/2ε) 1/m)), (S.19) where τ = log 1.5 0.6368. We have the following useful lemma. Lemma 12 For any 1 j s and f Sm(I), suppose that ψj,n,f(z;g) is a measurable function defined upon z = (y,x) Y I and g G satisfying ψj,n,f(z;0) = 0 and the following Lipschitz continuity condition: for any i Ij and g1,g2 G, ψj,n,f(Zi;g1) ψj,n,f(Zi;g2) c 1 K h1/2 g1 g2 . (S.20) Then for any constant t 0 and n 1, sup f Sm(I) Pf (sup g G Zj,n,f(g) f > t) 2exp( t2 where B(h) = A(h,2) and Zj,n,f(g) = 1 n i Ij [ψj,n,f(Zi;g)KXi Ef{ψj,n,f(Zi;g)KXi}]. Proof [Proof of Lemma 12] For any f Sm(I) and n 1, and any g1,g2 G, we get that (ψj,n,f(Zi;g1) ψj,n,f(Zi;g2))KXi c 1 K h1/2 g1 g2 c Kh 1/2 = g1 g2 . Shang, Hao, Cheng By Theorem 3.5 of Pinelis et al. (1994), for any t > 0, Pf ( Zj,n,f(g1) Zj,n,f(g2) t) 2exp( t2 8 g1 g2 2 ). Then by Lemma 8.1 in Kosorok (2008), we have Zj,n,f(g1) Zj,n,f(g2) ψ2 where ψ2 denotes the Orlicz norm associated with ψ2(s) = exp(s2) 1. Recall τ = log 1.5 0.6368. Define φ(x) = ψ2(τx). Then it can be shown by elementary calculus that φ(1) 1/2, and for any x,y 1, φ(x)φ(y) φ(xy). By a careful examination of the proof of Lemma 8.2, it can be shown that for any random variables ξ1,...,ξl, max 1 i l ξi ψ2 2 τ ψ 1 2 (l)max 1 i l ξi ψ2. (S.21) Next we use a chaining argument. Let T0 T1 T2 T = G be a sequence of finite nested sets satisfying the following properties: for any Tq and any s,t Tq, s t ε2 q; each Tq is maximal in the sense that if one adds any point in Tq, then the inequality will fail; the cardinality of Tq is upper bounded by log Tq log N(ε2 q,G, ) c0( 2c 1 K )1/mh (2m 1)/(2m)(ε2 q) 1/m, where c0 > 0 is absolute constant; each element tq+1 Tq+1 is uniquely linked to an element tq Tq which satisfies tq tq+1 ε2 q. For arbitrary sk+1,tk+1 Tk+1 with sk+1 tk+1 ε, choose two chains (both being of length k + 2) tq and sq with tq,sq Tq for 0 q k + 1. The ending points s0 and t0 satisfy s0 t0 k q=0 [ sq sq+1 + tq tq+1 ] + sk+1 tk+1 2 k q=0 ε2 q + ε 5ε, Nonparametric Bayesian Aggregation for Massive Data and hence, Zj,n,f(s0) Zj,n,f(t0) f ψ2 5 24ε. It follows by the proof of Theorem 8.4 of Kosorok (2008) and (S.21) that max sk+1,tk+1 Tk+1 Zj,n,f(sk+1) Zj,n,f(tk+1) (Zj,n,f(s0) Zj,n,f(t0)) ψ2 XXXXXXXXXXXXXXX max u Tq+1,v Tq u, v link each other Zj,n,f(u) Zj,n,f(v) XXXXXXXXXXXXXXXψ2 k q=0 ψ 1 2 (N(2 q 1ε,G, )) max u Tq+1,v Tq u, v link each other Zj,n,f(u) Zj,n,f(v) ψ2 log (1 + N(ε2 q 1,G, ))ε2 q log (1 + exp(c0c 1/m K h (2m 1)/(2m)(ε2 q) 1/m))ε2 q log (1 + exp(c0c 1/m K h (2m 1)/(2m)x 1/m))dx 6 τ c 1 K cm 0 h (2m 1)/2Ψ(1 2c Kc m 0 h(2m 1)/2ε). On the other hand, XXXXXXXXXXXXXXX max u,v T0 u v 5ε Zj,n,f(u) Zj,n,f(v) f XXXXXXXXXXXXXXXψ2 2 τ ψ2( T0 2) max u,v T0 u v 5ε Zj,n,f(u) Zj,n,f(v) f ψ2 2 τ ψ 1 2 (N(ε,G, )2)(5 Therefore, XXXXXXXXXXXXXXX max s,t Tk+1 s t ε Zj,n,f(s) Zj,n,f(t) XXXXXXXXXXXXXXXψ2 6 τ c 1 K cm 0 h (2m 1)/2Ψ(1 2c Kc m 0 h(2m 1)/2ε) τ ψ 1 2 (N(ε,G, )2)(5 6 τ c 1 K cm 0 h (2m 1)/2Ψ(1 2c Kc m 0 h(2m 1)/2ε) log (1 + exp(2c0(c Kh(2m 1)/2ε) 1/m)) Now for any g1,g2 G with g1 g2 ε/2. Let k 2, hence, 21 k 1 g1 g2 /ε. Since Tk is maximal , there exist sk,tk Tk s.t. max{ g1 sk , g2 tk } ε2 k. It is Shang, Hao, Cheng easy to see that sk tk ε. So Zj,n,f(g1) Zj,n,f(g2) Zj,n,f(g1) Zj,n,f(sk) + Zj,n,f(g2) Zj,n,f(tk) + Zj,n,f(sk) Zj,n,f(tk) 4 nε2 k + max u,v Tk u v ε Zj,n,f(u) Zj,n,f(v) . Therefore, letting k we get that XXXXXXXXXXXXXXXX sup g1,g2 G g1 g2 ε/2 Zj,n,f(g1) Zj,n,f(g2) XXXXXXXXXXXXXXXXψ2 XXXXXXXXXXXXXXX max u,v Tk u v ε Zj,n,f(u) Zj,n,f(v) XXXXXXXXXXXXXXXψ2 4 nε2 k/ log 2 + A(h,ε) A(h,ε). Taking ε = 2 in the above inequality, we get that XXXXXXXXXXXXXXXX sup g1,g2 G g1 g2 1 Zj,n,f(g1) Zj,n,f(g2) XXXXXXXXXXXXXXXXψ2 A(h,2) = B(h). By Lemma 8.1 in Kosorok (2008), we have Pf (sup g G Zj,n,f(g) t) 2exp( t2 Note that the right hand side in the above does not depend on f. This completes the proof. Proof [Proof of Lemma 4] Let f Hm(b) be the parameter based on which the data are drawn. It is easy to see that DSλ(f)g = E{g(X)KX} Pλg, g Sm(I). Therefore, for any g, g Sm(I), DSλ(f)g, g = g, g , implying DSλ(f) = id. The proof of (1) is finished in two parts. Part I: For any f Sm(I), define an operator mapping Sm(I) to Sm(I): T1f(g) = g + Sλ(f + g), g Sm(I). First observe that, under Pf with f Hm(b), Sλ(f) = Pλf = sup g =1 Pλf,g Let r1n = bhm. Let B(r1n) = {g Sm(I) g r1n} be the r1n-ball. For any g B(r1n), using DSλ(f) = id, it is easy to see that T1f(g) = Sλ(f) bhm = r1n. Therefore, T1f Nonparametric Bayesian Aggregation for Massive Data maps B(r1n) to itself. For any g1,g2 B(r1n), by Taylor s expansion we have T1f(g1) T1f(g2) = g1 g2 + Sλ(f + g1) Sλ(f + g2) 0 DSλ(f + g2 + sg)gds = 0. This shows that T1f is a contraction mapping which maps B(r1n) into B(r1n). By contraction mapping theorem (see Rudin et al. (1964)), T1f has a unique fixed point g B(r1n) satisfying T1f(g ) = g . Let fλ = f + g . Then Sλ(fλ) = 0 and fλ f r1n. Part II: For any f Hm(b), under (14) with f being the truth, let fλ be the function obtained in Part I s.t. fλ f r1n. Define an operator T2f(g) = g + Sj,n(fλ + g), g Sm(I). Rewrite T2f as T2f(g) = [DSj,n(fλ)g DSλ(fλ)g] + Sj,n(fλ). Denote the above two terms by I1f,I2f, respectively. For any i Ij, let Ri = (Yi fλ(Xi))KXi Ef{(Y fλ(X))KX}. Obviously, Ef{(Y fλ(X))KX} = sup g =1 Ef{(Y fλ(X))KX},g = sup g =1 Ef{(Y fλ(X))g(X)} f fλ r1n. Therefore, Ri c Kh 1/2 Yi fλ(Xi) + r1n which leads to that E {exp( Ri c Kh 1/2 )} E (exp( ϵi + 1)) Cϵ, where Cϵ = E{( ϵ + 1)2 exp( ϵ + 1)}. Let δ = hr/c K. By condition rh1/2 1, we have E{exp(δ Ri ) 1 δ Ri } E{(δ Ri )2 exp(δ Ri )} c2 KCϵδ2h 1. It follows by Theorem 3.2 of Pinelis et al. (1994) that, for L(M) = c K(Cϵ + M), Pf i Ij Ri f L(M)nr 2exp( L(M)δnr + c2 KCϵnh 1δ2) = 2exp( Mnhr2). (S.22) We note that the right hand side in (S.22) does not depend on f. Moreover, it is easy to see that Sj,n(fλ) = Sj,n(fλ) Sλ(fλ) = 1 n i Ij Ri. Let En,1 = { Sj,n(fλ) L(M)r}, then supf Hm(C) Pf(Ec n,1) 2exp( Mnhr2). Define ψj,n(Xi;g) = c 1 K h1/2g(Xi), i Ij, and Zj,n(g) = 1 n i Ij[ψj,n(Xi;g)KXi Ef{ψj,n(Xi;g)KXi}]. By Lemma 12, supf Hm(b) Pf(Ec n,2) 2exp( Mnhr2), where En,2 = {supg G Zj,n(g) Mnhr2B(h)}. Shang, Hao, Cheng For any g Sm(I)/{0}, let g = g/d n, where d n = c Kh 1/2 g . It follows that g c Kh 1/2 g = c Kh 1/2 g /d n = 1, and J( g, g) = d 2 n J(g,g) = h 2m λJ(g,g) c2 Kh 1 g 2 h 2m g 2 c2 Kh 1 g 2 c 2 K h 2m+1. Therefore, g G. Consequently, on En,2, for any g Sm(I)/{0}, we get Zj,n( g) Mnhr2B(h), which leads to that DSj,n(fλ)g DSλ(fλ)g = 1 n i Ij [g(Xi)KXi E{g(Xi)KXi}] f c2 KM1/2rh 1/2B(h) g g /2, (S.23) where the last inequality follows by condition c2 KM1/2rh 1/2B(h) 1/2. Note that the above inequality also holds for g = 0. Let r2n = 2L(M)r. Therefore, it follows by (S.23) that, for any f Hm(b), on En = En,1 En,2 and for any g B(r2n), T2f(g) g /2 + r2n/2 r2n. Meanwhile, for any g1,g2 B(r2n), replacing g by g1 g2 in (S.23), we get that T2f(g1) T2f(g2) g1 g2 /2. Therefore, for any f Hm(b), on En, T2f is a contraction mapping from B(r2n) to itself. By contraction mapping theorem, there exists uniquely an element g B(r2n) s.t. T2f(g ) = g . Let fj,n = fλ + g . Clearly, Sj,n( fj,n) = 0, and hence, fj,n is the maximizer of ℓjn; see (19). So we get that, on En, fj,n f f fλ f + fj,n fλ r1n + r2n = bhm + 2L(M)r. The desired conclusion follows by the trivial fact: supf Hm(b) Pf(Ec n) 4exp( Mnhr2). Proof of (1) is completed. Next we show (2). For any f Hm(b), let fj,n be the penalized MLE of f obtained by (19). Let gn = fj,n f, δn = bhm + 2L(M)r, d n = c Kh 1/2δn. On En, we have gn f δn. Let g = gn/d n. Clearly, g G. Then we get that Sj,n(f + gn) Sj,n(f) (Sλ(f + gn) Sλ(f)) = 1 n i Ij [gn(Xi)KXi EX{gn(X)KX}] nh Zj,n( g) c2 KM1/2h 1/2r B(h)δn = an. (S.24) Since Sj,n(f + gn) = 0 and DSλ(f) = id, from (S.24) we have on En, an Sj,n(f) + DSλ(f)gn + 0 s D2Sλ(f + ss gn)gngndsds = Sj,n(f) gn which implies that fj,n f Sn,λ(f) an. Since supf Hm(b C) Pf(Ec n) 4exp( Mnhr2), proof of (2) is completed. Nonparametric Bayesian Aggregation for Massive Data S.8.5. An initial contraction rate Theorem 7 below states that the s posterior measures uniformly contract at rate rn = (nh) 1/2 + hm, where recall that h = λ1/(2m). This is an initial rate result that holds irrespective the diverging rate of s. Theorem 7 (An Initial Contraction Rate) Suppose f0 = ν=1 f0 ν ϕν satisfies Condition (S). Let a 0 be a fixed constant. If rn = o(h3/2), h1/2 log N = o(1), nh2m+1 1, then there exists a universal constant M > 0 s.t. max 1 j s E{ f f0 a I( f f0 Mrn) Dj} = OPf0(s2 exp( nr2 n)) as n , no matter s is fixed or diverges at any rate. Before proving Theorem 7, we present a preliminary lemma. Let { ϕν ν 1} be a bounded orthonormal basis of L2(I) under usual L2 inner product. For any b [0,β], define Hb = { ν=1 fν ϕν ν=1 f2 ν ρ1+b/(2m) ν < }. Then Hb can be viewed as a version of Sobolev space with regularity m + b/2. Define G = ν=1 vν ϕν, a centered GP, and f0 = ν=1 f0 ν ϕν. Define V (f,g) = f,g L2 = 1 0 f(x)g(x)dx, the usual L2 inner product, J(f) = ν=1 V (f, ϕν) 2ρν, a functional on H0. For simplicity, denote V (f) = V (f,f). Clearly, f0 Hβ. Since G is a Gaussian process with covariance function r(s,t) = E{ G(s) G(t)} = m ν=1 σ2 ν ϕν(s) ϕν(t) + ν>m ρ (1+ β 2m) ν ϕν(s) ϕν(t), it follows by van der Vaart et al. (2008a) that Hβ is the RKHS of G. For any Hb with 0 b β, define inner product ν=1 fν ϕν, ν=1 gν ϕν b = m ν=1 σ 2 ν fνgν + ν>m fνgνρ 1+ b Let b be the norm corresponding to the above inner product. The following lemma is used in the proof of Theorem 7. Its proof can be found in Shang and Cheng (2017). Lemma 13 Let dn be any positive sequence. If Condition (S) holds, then there exists ω Hβ such that 1. V (ω f0) 1 2. J(ω f0) 1 2(β 1) 2m+β 1 n , 3. ω 2 β = O(d 2 2m+β 1 n ). Shang, Hao, Cheng To ease reading, we sketch the proof of Theorem 7. We first show the following result: for any ε > 0, as n , max 1 j s f f0 ε f f0 ad P(f Dj) = OPf0(s2 exp( nr2 n)) (S.25) To show (S.25), we can rewrite the posterior density of f by p(f Dj) = i Ij(pf/pf0)(Zi)exp( nλJ(f)/2)dΠ(f) Sm(I) i Ij(pf/pf0)(Zi)exp( nλJ(f)/2)dΠ(f), 1 j s, where recall that pf(z) is the probability density of Z = (Y,X) under f. For 1 j s, define Ij1 = Sm(I) i Ij (pf/pf0)(Zi)exp( nλJ(f)/2)dΠ(f), (S.26) Ij2 = An f f0 a i Ij (pf/pf0)(Zi)exp( nλ 2 J(f))dΠ(f), (S.27) I j2 = A n f f0 a i Ij (pf/pf0)(Zi)exp( nλ 2 J(f))dΠ(f), (S.28) where An = {f Sm(I) f f0 2δn} and A n = {f Sm(I) f f0 2Mrn}, with the quantities δn,M specified later. Using Le Cam s uniformly consistent test Ghosal et al. (2000), we will show that max1 j s Ij2/Ij1 is of an exponential order (in the sense of Pf0). Then (S.25) holds by taking a = 0 in Ij2. The proof of Theorem 7 will be completed by decomposing I j2/Ij1 into three terms based on an auxiliary event {f Sm(I) f f0 ε} with each term of an exponential order. Proof [Proof of Theorem 7] Note that there exists a universal constant c > 0 such that Ψ(x) c x1 1/(2m) for any 0 < x < 1. Therefore, there exists a universal constant c > 0 s.t. B(h) c h (2m 1)/(4m). Define Bn = {f Sm(I) V (f f0) r2 n,J(f f0) r 2(β 1) 2m+β 1 n }. Then Ij1 Bn i Ij (pf/pf0)(Zi)exp( nλJ(f)/2)dΠ(f) = Bn exp( i Ij Ri(f,f0))exp( nλJ(f)/2)dΠ(f), where Ri(f,f0) = log (pf(Zi)/pf0(Zi)) = Yi(f(Xi) f0(Xi)) f(Xi)2/2 + f0(Xi)2/2 for any i Ij. Define dΠ (f) = dΠ(f)/Π(Bn), a reduced probability measure on Bn. By Jensen s Nonparametric Bayesian Aggregation for Massive Data inequality, log Bn exp( i Ij Ri(f,f0))exp( nλJ(f)/2)dΠ (f) i Ij Ri(f,f0) nλJ(f)/2 dΠ (f) = Bn i Ij [Ri(f,f0) Ef0{Ri(f,f0)}]dΠ (f) +n Bn Ef0{Ri(f,f0)}dΠ (f) Bn nλJ(f) = Jj1 + Jj2 + Jj3. For any f Bn, f f0 2 = V (f f0) + λJ(f f0) r2 n + λr 2(β 1) 2m+β 1 n . By (Shang and Cheng, 2017, Lemma A.9) and the condition h 3/2rn = o(1), we can choose n to be sufficiently large so that f f0 ch 1/2 f f0 c h 1r2n + h2m 1 1. It follows by Taylor s expansion and Ef0{Yi f0(Xi) Xi} = 0, that for any f Bn, Ef0{Ri(f,f0)} = Ef0{(f(X) f0(X))2}/2 r2 n/2. Therefore, Jj2 nr2 n/2 for any 1 j s. Since r2 n = o(1), we can choose n to be large so that Ef0{Ri(f,f0)} 1. Meanwhile, for any f Bn, for some s [0,1], we have Ri(f,f0) = Yi(f(Xi) f0(Xi)) f(Xi)2/2 + f0(Xi)2/2 = Yi f0(Xi) 1 2(f f0)(Xi) (f f0)(Xi) Yi f0(Xi) + 1/2 = ϵi + 1/2. We have used f f0 1 in the above inequalities. For any 1 i N, define Ai = { ϵi 2log N}. It is easy to check that Pf0( N i=1Ac i) 0, as N . Define ξi = Bn Ri(f,f0)dΠ (f) IAi, we get that ξi 2log N + 1/2, a.s. It can also be shown by r2 n 1/n 1/N that, as n,N , Ef0{ Bn Ri(f,f0)dΠ (f) IAc i } Ef0{( ϵi + 1/2) IAc i } Cϵ(1/N + 1/N2) r2 n, where Cϵ is an absolute constant. Let δ = 1/( nrn). Note that by the condition h1/2 log N = o(1) we have δ log N = (log N)/( nrn) h1/2 log N = o(1), we can let n be large so that δ(2log N + 1) 1. Let di = ξi Ef0{ξi} for i Ij, then it is easy to see that di ξi + Ef0{ξi} 2log N + 1, a.s. Shang, Hao, Cheng Let ei = Ef0{exp(δ di ) 1 δ di }. It can be shown using inequality exp(x) 1 x x2 exp(x) for x 0 and Cauchy-Schwartz inequality that ei Ef0{δ2d2 i exp(δ di )} eδ2Ef0{d2 i } eδ2Ef0{ξ2 i } eδ2 Bn Ef0{Ri(f,f0)2}dΠ (f) eδ2 Bn Ef0{( ϵi + 1/2)2(f f0)(Xi)2}dΠ (f) e Cϵδ2r2 n, where the last step follows from V (f f0) r2 n for any f Bn. Therefore, it follows by (Pinelis et al., 1994, Theorem 3.2) that Pf0 max 1 j s i Ij [ξi Ef0{ξi}] 4 nrn log N s Pf0 i Ij [ξi Ef0{ξi}] 4 nrn log N 2sexp( 4 nrn(log N)δ + e Cϵδ2nr2 n) 2s/N2 0, as N . (S.29) Since nrn log N, we can let n be large so that 4 nrn log N nr2 n. Since on N i=1Ai, Jj1 = i Ij [ξi Ef0{ξi}] n Ef0{ Bn Ri(f,f0)dΠ (f) IAc i }, we get from (S.29) that with Pf0-probability approaching one, for any 1 j s, Jj1 4 nrn log N nr2 n 2nr2 n. Meanwhile, for any f Bn, J(f) (1 + J(f0)1/2)2. Therefore, Jj3 (1+J(f0)1/2)2 2 nλ. So, with probability approaching one, for any 1 j s, Ij1 exp( 5nr2 n/2 (1 + J(f0)1/2)2 2 nλ)Π(Bn). To proceed, we need a lower bound for Π(Bn). It follows by Lemma 13 by replacing dn therein by rn, by Gaussian correlation inequality (see Theorem 1.1 of Li et al. (1999)), by Cameron-Martin theorem (see Cameron and Martin (1944) or (Kuelbs et al., 1994, eqn Nonparametric Bayesian Aggregation for Massive Data (4.18))) and (Hoffmann-Jorgensen et al., 1979, Example 4.5) that Π(Bn) = P(V (G f0) r2 n,J(G f0) r 2(β 1) 2m+β 1 n ) = P( V ( G f0) r2 n, J( G f0) r 2(β 1) 2m+β 1 n ) P( V ( G ω) r2 n/4, J( G ω) r 2(β 1) 2m+β 1 n /4) 2 ω 2 β)P( V ( G) r2 n/4, J( G) r 2(β 1) 2m+β 1 n /4) 2 ω 2 β)P( V ( G) r2 n/8)P( J( G) r 2(β 1) 2m+β 1 n /8) exp( c1r 2/(2m+β 1) n ), (S.30) where c1 > 0 is a universal constant. Since β > 1 and r2 n = (nh) 1 + λ n 2m/(2m+1), we get r2 n λ and nr n1 2m(2m+β) (2m+1)(2m+β 1) > 1, so nr2 n > r 2 2m+β 1 n . Consequently, with Pf0-probability approaching one min 1 j s Ij1 exp( c2nr2 n), (S.31) where c2 = 5/2 + (1 + J(f0)1/2)2/2 + c1. Let b = 2 c2 + 1 and C b2/4. Next we examine Ij2 defined in (S.27) with An = {f Sm(I) f f0 2δn}, for δn = bhm + 2c K(Cϵ + C)r, r = rnh 1/2. By the condition h 3/2rn = o(1) and B(h) h (2m 1)/(4m) it can be easily checked that the Rate Condition (H): is satisfied (when n becomes large) with M therein replaced by C. For 1 j s, define test φj,n = I( fj,n f0 δn). It follows by part (1) of Theorem 4 that for any 1 j s, Ef0{φj,n} = Pf0( fj,n f0 δn) 2exp( Cnr2 n), sup f Hm(b) f f0 2δn Ef{1 φj,n} = sup f Hm(b) f f0 2δn Pf( fj,n f0 < δn) sup f Hm(b) f f0 2δn Pf( fj,n f δn) 2exp( Cnr2 n). An immediate consequence is Ef0{max1 j s φj,n} 2sexp( Cnr2 n), which implies max1 j s φj,n = OPf0(sexp( Cnr2 n)). Shang, Hao, Cheng Note that for any f An/Hm(b), J(f) > b2. Since nh2m+1 1 leads to r2 n = (nh) 1 + λ 2λ, it then holds that, for any 1 j s, Ef0{Ij2(1 φj,n)} = An f f0 a Ef{1 φj,n}exp( nλJ(f)/2)dΠ(f) = An/Hm(b) f f0 a Ef{1 φj,n}exp( nλJ(f)/2)dΠ(f) + An Hm(b) f f0 a Ef{1 φj,n}exp( nλJ(f)/2)dΠ(f) (exp( b2nλ/2) + 2exp( Cnr2 n))C(a,Π) 3exp( b2nr2 n/4)C(a,Π), where the last inequality follows by C b2/4 and λ r2 n/2. So Ef0{max 1 j s Ij2(1 φj,n)} s j=1 Ef0{Ij2(1 φj,n)} 3sexp( b2nr2 n/4)C(a,Π), which implies max1 j s Ij2(1 φj,n) = OPf0(sexp( b2nr2 n/4)). On the other hand, as n , Ef0{max 1 j s Ij2} s Sm(I) f f0 2dΠ(f) which implies that max1 j s Ij2 = o Pf0(s). Therefore, max 1 j s Ij2 Ij1 φj,n max1 j s Ij2 max1 j s φj,n min1 j s Ij1 = OPf0(s2 exp( nr2 n)). (S.32) By the above arguments and (S.31), we have max 1 j s An f f0 ad P(f Dj) = max 1 j s Ij2 Ij1 max 1 j s Ij2 Ij1 φj,n + max 1 j s Ij2(1 φj,n) Ij1 = OPf0(s2 exp( nr2 n)) + OPf0(sexp( b2nr2 n/4)exp(c2nr2 n)) = OPf0(s2 exp( nr2 n)). By condition rnh 3/2 = o(1) and the trivial fact δn rnh 1/2, we have that h 1/2δn = o(1). Therefore, eventually f f0 ε f f0 ad P(f Dj) An f f0 ad P(f Dj) for all 1 j s, which implies that (S.25) holds. Now we will prove the theorem. Let I j2 be defined as in (S.28) with A n = {f Sm(I) f f0 2Mrn} for a fixed number satisfying M > max{2,J(f0)1/2+ 2(c2 + 1),1+ f0 } (M will be further described). Let A n1 = {f Sm(I) V (f f0) M2r2 n,λJ(f f0) M2r2 n} and A n2 = {f Sm(I) λJ(f f0) M2r2 n}. For any f A n2, it can be shown that λ(J(f)1/2 + J(f0)1/2) (λJ(f))1/2 + J(f0)1/2rn, Nonparametric Bayesian Aggregation for Massive Data which leads to λJ(f) (M J(f0)1/2)2r2 n. So we have Ef0{max 1 j s A n2 f f0 a i Ij (pf/pf0)(Zi)exp( nλ 2 J(f))dΠ(f)} s j=1 Ef0{ A n2 f f0 a i Ij (pf/pf0)(Zi)exp( nλ 2 J(f))dΠ(f)} = s A n2 f f0 a exp( nλ 2 J(f))dΠ(f)} sexp( (M J(f0)1/2)2nr2 n/2)C(a,Π), which leads to that max 1 j s A n2 f f0 a i Ij (pf/pf0)(Zi)exp( nλ 2 J(f))dΠ(f) = OPf0(sexp( (M J(f0)1/2)2nr2 n/2)). (S.33) It follows from (S.31) and (S.33) that max 1 j s 1 Ij1 A n2 f f0 a i Ij (pf/pf0)(Zi)exp( nλ 2 J(f))dΠ(f) = OPf0 (sexp( (M J(f0)1/2)2nr2 n/2 + c2nr2 n)) = OPf0(sexp( nr2 n)), (S.34) where the last inequality follows by (M J(f0)1/2)2 > 2(c2 + 1). To continue, we need to build uniformly consistent test. Let d2 H(Pf,Pg) = 1 d Pg)2 be the squared Hellinger distance between the two probability measures Pf(z) and Pg(z). Recall that their corresponding probability density functions are pf and pg, respectively. Nextwe present a lemma showing the local equivalence of V and d2 H. Lemma 14 Let ε (0,1) satisfy ε2 + 32εexp(1/2)Cϵ 2, where Cϵ = E{exp( ϵ )}. Then for any f,g Sm(I) satisfying f g ε, V (f g)/16 d2 H(Pf,Pg) 3V (f g)/16. Let ε satisfy the conditions in Lemma 14. Define Fn = {f Sm(I) f f0 ε/2,J(f) (M + J(f0)1/2)2r2 nλ 1}. Let Pn = {Pf f Fn} and D(δ,Pn,d H) be the δ-packing number in terms of d H. Since r2 n λ which leads to (M +J(f0)1/2)rnh m > M +J(f0)1/2 > ε+ f0 , it can be easily checked that Fn (M + J(f0)1/2)rnh m T , where T = {f Sm(I) f 1,J(f) 1}. For any f,g Fn with f g ε, it follows by Lemma 14 that D(δ,Pn,d H) D(4δ/ 3,Fn,d V ), where d V is the distance induced by V , i.e., d V (f,g) = V 1/2(f g). And hence, it follows by (Kosorok, 2008, Theorem 9.21) that log D(δ,Pn,d H) log D(4δ/ 3,(M + J(f0)1/2)rnh m T ,d V ) c V ( δ (M + J(f0)1/2)rnh m ) Shang, Hao, Cheng where c V is a universal constant only depending on the regularity level m. This implies that for any δ > 2rn, log D(δ/2,Pn,d H) log D(rn,Pn,d H) c V (M + J(f0)1/2)1/mh 1 c V (M + J(f0)1/2)1/mnr2 n, where the last inequality follows by the fact r2 n (nh) 1. Thus, the right side of the above inequality is constant in δ. By (Ghosal et al., 2000, Theorem 7.1), with δ = Mrn/4, there exists test φj,n and a universal constant k0 > 0 satisfying Ef0{ φj,n} = Pf0 φj,n exp(c V (M + J(f0)1/2)1/mnr2 n)exp( k0nδ2) 1 exp( k0nδ2) = exp(c V (M + J(f0)1/2)nr2 n k0M2nr2 n/16) 1 exp( k0M2nr2n/16) , and, combined with Lemma 14, sup f Fn d V (f,f0) 4δ Ef{1 φj,n} = sup f Fn d V (f,f0) 4δ sup f Fn d H(Pf,Pf0) δ exp( k0nδ2) = exp( k0M2nr2 n/16). This implies that Ef0{max 1 j s f Fn d V (f,f0) 4δ f f0 a i Ij (pf/pf0)(Zi)exp( nλJ(f)/2)dΠ(f)(1 φj,n)} s j=1 f Fn d V (f,f0) 4δ f f0 a Ef0{ i Ij (pf/pf0)(Zi)(1 φj,n)}dΠ(f) = s j=1 f Fn d V (f,f0) 4δ f f0 a Ef{1 φj,n}dΠ(f) sexp( k0M2nr2 n/16)C(a,Π). max 1 j s f Fn d V (f,f0) 4δ f f0 a i Ij (pf/pf0)(Zi)exp( nλJ(f)/2)dΠ(f)(1 φj,n) = OPf0 (sexp( k0M2nr2 n/16)). Nonparametric Bayesian Aggregation for Massive Data Meanwhile, it follows by (S.31) and (S.35) that max 1 j s A n1, f f0 ε/2 f f0 ad P(f Dj)(1 φj,n) max 1 j s Fn,d V (f,f0) 4δ f f0 ad P(f Dj)(1 φj,n) max 1 j s f Fn d V (f,f0) 4δ f f0 a i Ij(pf/pf0)(Zi)exp( nλJ(f)/2)dΠ(f)(1 φj,n) min 1 j s Ij1 = OPf0 (sexp( k0M2nr2 n/16 + c2nr2 n)) = OPf0 (sexp( nr2 n)). Choose the constant M to be even bigger so that c V (M + J(f0)1/2) + 1 + c2 < k0M2/16. Similar to (S.32) we get max 1 j s A n1, f f0 ε/2 f f0 ad P(f Dj) φj,n = OPf0(s2 exp( nr2 n)). max 1 j s A n1, f f0 ε/2 f f0 ad P(f Dj) = OPf0(s2 exp( nr2 n)). (S.36) Together with (S.25), (S.32) and (S.36), we get max 1 j s A n f f0 ad P(f Dj) max 1 j s A n1 f f0 ad P(f Dj) + max 1 j s A n2 f f0 ad P(f Dj) max 1 j s A n1, f f0 ε/2 f f0 ad P(f Dj) + max 1 j s f f0 >ε/2 f f0 ad P(f Dj) + max 1 j s A n2 f f0 ad P(f Dj) = OPf0(s2 exp( nr2 n)). This completes the proof. Proof [Proof of Lemma 14] For any f,g Sm(I) with f g ε, define Z(f,g) = 1 2[Y (f(X) g(X)) f(X)2/2 + g(X)2/2], where recall and Z = (Y,X). It is easy to see by direct calculations that d2 H(Pf,Pg) = 1 Eg{exp( Z(f,g))}. By Taylor s expansion, for some random t [0,1], 1 Eg{exp( Z(f,g))} = Eg{ Z(f,g)} 1 2Eg{ Z(f,g)2} 1 6Eg{exp(t Z(f,g)) Z(f,g)3}. We will analyze the terms on the right side of the equation. Shang, Hao, Cheng Define ξ = Y A(g(X)). By Morris et al. (1983) we get Eg{ξ X} = 0 and Eg{ξ2 X} = 1. By Taylor s expansion, Z(f,g) = 1 2[ξ(f(X) g(X)) 1 2(f(X) g(X))2. Then we get that Eg{ Z(f,g)} = 1 4V (f g) and Eg{ Z(f,g)2} = Eg{(1 2ξ(f(X) g(X)) 1 4(f(X) g(X))2)} = 1 4Eg{ξ2(f(X) g(X))2} 1 4Eg{ξ(f(X) g(X))3} + 1 16Eg{(f(X) g(X))4} = 1 4V (f g) + 1 16Eg{(f(X) g(X))4}. Since f g ε < 1 and Z(f,g) 1 2( ξ + 1/2) f(X) g(X) , we get Eg{exp(t Z(f,g)) Z(f,g)3} Eg{exp( Z(f,g) ) Z(f,g) 3} Eg{exp(ε ξ /2 + ε/4)( ξ /2 + 1/4)3 f(X) g(X) 3} = 6Eg {exp(ε ξ /2 + ε/4) 1 3! ( ξ /2 + 1/4)3 f(X) g(X) 3} 6Eg{exp(ε ξ /2 + ε/4)exp( ξ /2 + 1/4) f(X) g(X) 3} 6exp(ε/4 + /4)Eg{exp( ξ ) f(X) g(X) 3} 6εexp(1/2)CϵV (f g). It also holds that Eg{(f(X) g(X))4} ε2V (f g). Therefore, for any f,g Sm(I) with f g ε, d2 H(Pf,Pg) V (f g)/8 32Eg{(f(X) g(X))4} + 1 6Eg{exp(t Z(f,g)) Z(f,g)3} (εCϵ exp(1/2) + ε2/32)V (f g) < V (f g)/16, which implies V (f g)/16 d2 H(Pf,Pg) 3V (f g)/16. This proves Lemma 14. Nonparametric Bayesian Aggregation for Massive Data S.8.6. Additional Plots in Section 5 Radius of the credible sets/intervals 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 11. CP of Fx(f) = f(x) against x based on asymptotic theory. Results on larger N Simulation results about credible regions/intervals in Section 5 are based on N = 1200. This section repeated the same study for N = 1800,2400. Results are summarized in following plots. Shang, Hao, Cheng 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 12. CP of Fx(f) = x 0 f(z)dz against x based on asymptotic theory. 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.08 0.10 0.12 0.14 0.16 0.18 0.95 0.90 0.70 0.50 0.30 0.10 Figure 13. Radius of credible region (32) against γ. Legend indicates the credibility levels 1 α. Nonparametric Bayesian Aggregation for Massive Data 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.05 0.10 0.15 0.95 0.90 0.70 0.50 0.30 0.10 Figure 14. Radius of credible region (33) against γ. Legend indicates the credibility levels 1 α. Shang, Hao, Cheng 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.10 0.20 0.30 0.95 0.90 0.70 0.50 0.30 0.10 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.10 0.20 0.30 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.10 0.20 0.30 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.10 0.20 0.30 Figure 15. Radius of credible interval (36) for pointwise functional Fx(f) = f(x) against γ. Legend indicates the credibility levels 1 α. Four values of x are considered. Nonparametric Bayesian Aggregation for Massive Data 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.02 0.04 0.06 0.08 0.95 0.90 0.70 0.50 0.30 0.10 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.02 0.04 0.06 0.08 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.02 0.04 0.06 0.08 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.00 0.02 0.04 0.06 0.08 Figure 16. Radius of credible interval (36) for integral functional Fx(f) = x 0 f(z)dz against γ. Legend indicates the credibility levels 1 α. Four values of x are considered. Shang, Hao, Cheng 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 Figure 17. N = 1800: CP of ACR and FCR based on strong topology. Nonparametric Bayesian Aggregation for Massive Data 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 Figure 18. N = 1800: CP of ACR and FCR based on weak topology. Shang, Hao, Cheng 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 19. N = 1800: CP of Fx(f) = f(x) against x based on posterior samples of f. Nonparametric Bayesian Aggregation for Massive Data 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 20. N = 1800: CP of Fx(f) = f(x) against x based on asymptotic theory. Shang, Hao, Cheng 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 21. N = 1800: CP of Fx(f) = x 0 f(z)dz against x based on posterior samples of f. Nonparametric Bayesian Aggregation for Massive Data 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 22. N = 1800: CP of Fx(f) = x 0 f(z)dz against x based on asymptotic theory. Shang, Hao, Cheng 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 Figure 23. N = 2400: CP of ACR and FCR based on strong topology. Nonparametric Bayesian Aggregation for Massive Data 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.4 0.8 Figure 24. N = 2400: CP of ACR and FCR based on weak topology. Shang, Hao, Cheng 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 25. N = 2400: CP of Fx(f) = f(x) against x based on posterior samples of f. Nonparametric Bayesian Aggregation for Massive Data 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 26. N = 2400: CP of Fx(f) = f(x) against x based on asymptotic theory. Shang, Hao, Cheng 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 27. N = 2400: CP of Fx(f) = x 0 f(z)dz against x based on posterior samples of f. Nonparametric Bayesian Aggregation for Massive Data 0.2 0.4 0.6 0.8 0.2 0.2 0.6 1.0 s=1 s=6 s=15 s=60 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 0.2 0.4 0.6 0.8 0.0 0.4 0.8 Figure 28. N = 2400: CP of Fx(f) = x 0 f(z)dz against x based on asymptotic theory.