# an_asymptotic_analysis_of_distributed_nonparametric_methods__87399382.pdf

Journal of Machine Learning Research 20 (2019) 1-30 Submitted 11/17; Revised 10/18; Published 6/19

An asymptotic analysis of distributed nonparametric methods

Botond Szab o b.t.szabo@math.leidenuniv.nl Mathematical Institute Leiden University 2333 CA Leiden The Netherlands

Harry van Zanten hvzanten@uva.nl Korteweg-de Vries Institute for Mathematics University of Amsterdam Science Park 105-107 1098 XG Amsterdam The Netherlands

Editor: Manfred Opper

We investigate and compare the fundamental performance of several distributed learning methods that have been proposed recently. We do this in the context of a distributed version of the classical signal-in-Gaussian-white-noise model, which serves as a benchmark model for studying performance in this setting. The results show how the design and tuning of a distributed method can have great impact on convergence rates and validity of uncertainty quantiﬁcation. Moreover, we highlight the diﬃculty of designing nonparametric distributed procedures that automatically adapt to smoothness.

Keywords: distributed learning, nonparametric models, high-dimensional models, Gaussian processes, convergence rates

1. Introduction

Both in statistics and machine learning there has been substantial interest in the design and study of distributed statistical or learning methods in recent years. One driving reason is the fact that in certain applications datasets have become so large that it is often unfeasible, or computationally undesirable, to carry out the analysis on a single machine. In a distributed method the data are divided over a cluster consisting of several machines and/or cores. The machines in the cluster then process their data locally, after which the local results are somehow aggregated on a central machine to ﬁnally produce the overall outcome of the statistical analysis. Distributed methods are not only used for computational reasons, but are for instance also of interest in situations where privacy is important and it is undesirable that all data are handled at a single location. Moreover, there are applications in which data are by construction gathered at multiple locations and ﬁrst processed locally, before being combined at a central location.

c 2019 Botond Szab o and Harry van Zanten.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-666.html.

Szab o and Van Zanten

Over the last years a variety of distributed methods have been proposed. Recent examples include Consensus Monte Carlo (Scott et al. (2016)), WASP (Srivastava et al. (2015)), distributed GP s (Deisenroth and Ng (2015)), and methods proposed in Shang and Cheng (2015), Jordan et al. (2016), Lee et al. (2015), Volgushev et al. (2017), to mention but a few. Most papers on distributed methods do extensive experiments on simulated, benchmark and real data to numerically assess and compare the performance of the various methods. Some papers also derive a number of theoretical properties. Theoretical results on the performance of distributed methods are not yet widely available however and there is certainly no common theoretical framework in place that allows a clear theoretical comparison of methods and the development of an understanding of fundamental performance guarantees and limitations. Clearly we can not consider the complete list of all existing methods in this paper. We limit ourselves to a number of representative methods that are Bayesian in nature, allowing for a meaningful comparison. Since a better theoretical understanding of distributed methods can help to pinpoint fundamental diﬃculties and opportunities, we develop a framework in this paper which allows us to study and compare the performance of various methods. We are in particular interested in high-dimensional, or nonparametric problems. It is by now well known that the performance of learning or statistical methods in such settings depends crucially on wether or not a method succeeds in realizing the correct bias-variance trade-oﬀ, or, in diﬀerent terminology, succeeds in balancing underand overﬁtting. For classical, non-distributed settings we have a rather well-developed understanding of how methods should be tuned to achieve a proper bias-variance trade-oﬀ. For distributed methods however, such theory is currently not yet available. To be able to develop relevant theory we study an idealized model, which is a distributed version of the canonical signal-in-white-noise model that serves as an important benchmark model in mathematical statistics (see for instance Tsybakov (2009); Johnstone (2017); Gin e and Nickl (2016)). The model is on the one hand rich enough to be interesting, in the sense that it is really distributed in nature and the unknown object that needs to be learned is truly inﬁnite-dimensional. On the other hand it is tractable enough to allow detailed mathematical analysis. In the non-distributed case the signal-in-white-noise model is well known to be very closely related to other nonparametric models, such as nonparametric regression and density estimation. (This can be made very precise in the context of Le Cam s theory of limits of experiments (e.g. Le Cam (2012), Brown and Low (1996), Nussbaum (1996)).) Similarly, the distributed signal-in-white-noise model that we consider in this paper provides a uniﬁed framework to compare methods that were originally introduced in diﬀerent settings. Therefore, although our theoretical results are all obtained in the context of this relatively simple benchmark model and we consider only a number of representative distributed methods, we believe that the conclusions that we draw are relevant more generally. We introduce the model in Section 2. It is not diﬃcult to see that if the number of machines m is relatively large with respect to the total sample size, or signal-to-noise-ratio n, then doing things completely naively in the distributed case leads to a sub-optimal bias-variance trade-oﬀ(see also the simulation example in Section 2). In particular, just computing the usual estimators on every local machine and then averaging them on the central machine typically leads to a global estimator with a bias that is too large. To achieve good performance, the trade-oﬀhas to be

An asymptotic analysis of distributed nonparametric methods

adjusted somehow. This can in principle be done in various ways. For instance by locally choosing the wrong settings for tuning parameters on purpose, or, in a Bayesian setting, by adjusting the likelihood (e.g. raising it to some power) or by adjusting the prior. In Section 3 we study to what degree various methods that have been proposed in the literature succeed in ultimately achieving the right trade-oﬀ. We will see that some are more successful than others in this respect. An important observation that we make is that the methods that are shown to work well in Section 3 all use information on aspects of the true signal that are in principle unknown, such as its degree of regularity. A key question is whether in distributed settings it is fundamentally possible to set tuning parameters correctly in a purely data-driven way, without using such information. In the non-distributed setting it is well known that such adaptive methods indeed exist (e.g. Tsybakov (2009) or Gin e and Nickl (2016)). In the distributed case that we study here however, this is much less clear. In Section 4 we show that using a distributed version of a standard adaptation method that is known to work in the non-distributed case, such as maximum marginal likelihood empirical Bayes, can lead to sub-optimal results in the distributed setting. We will argue that this seems to be a fundamental issue and that we expect that correct automatic setting of tuning parameters in distributed methods is fundamentally more challenging than in the classical, non-distributed case. We believe this is an important issue and want to highlight it as an important and interesting topic for future research. To further study the fundamental potential and limitations of distributed methods, one should also take into account that there are typically computational and/or communication cost restrictions in such settings. In fact, without such restrictions the matter of obtaining good strategies is essentially non-existent, since we could simply communicate all data to a central machine and then apply an existing optimal non-distributed method. The main goal of this paper, however, is to study and compare the theoretical performance of a number of existing distributed methods. The methods we consider are all divide-andconquer algorithms in which computational or communication limitations are not explicitly imposed, however such restrictions do motivate the methods and all methods implicitly or explicitly consider situations in which simply aggregating and handling all data in one machine is not an option. Formally taking into account communication restrictions in the theoretical analysis is an important next step, but turns out to be rather delicate. We recently obtained some ﬁrst results in the follow-up paper Szabo and van Zanten (2019), see also Zhu and Laﬀerty (2018) for related work. The remainder of the paper is organized as follows. In the next section we introduce the distributed version of the signal-in-white-noise model and provide a simple simulation example to show that in a distributed setting, naively combining inferences from local machines into a global estimator may produce misleading results. In Section 3 we study the performance of a number of Bayesian procedures for signal reconstruction in the distributed signal-in-white-noise model introduced in Section 2. We include a number of methods that have recently been proposed in the literature. We show that some succeed in obtaining the appropriate bias-variance trade-oﬀ, but others do not. Moreover, the ones that do produce good results are all non-adaptive, in the sense that they use knowledge of the smoothness of the unkown signal to set their tuning parameters. In the ﬁnal Section 4 we consider the more realistic setting in which this smoothness is unknown. We study a distributed method

Szab o and Van Zanten

that has been proposed for data-driven tuning of the hyperparameters and show that there exist diﬃcult signals , which this method can not recover in the distributed model at an optimal rate. We argue that this appears to be a fundamental issue, and that designing procedures that automatically adapt to smoothness is fundamentally more challenging in the distributed framework. Mathematical proofs are collected in appendix Sections A and B.

2. Distributed signal-in-white-noise model

Consider the problem of estimating a signal in Gaussian white noise. This is the continuous regression-type problem in which we observe a signal X = (Xt : t [0, 1]) satisfying a stochastic diﬀerential equation

0 f(s) ds + σ n Wt,

where W is a Brownian motion, modelling the white noise , and f is an unknown signal, modelled by a square integrable function, that needs to be recovered from the data. The natural number n is the signal-to-noise ratio. Its size aﬀects the diﬃculty of the problem and n can be seen as playing the role of sample size in this problem. By expanding the data in a ﬁxed orthonormal basis of L2[0, 1] (for instance the classical Fourier basis), it is seen that the statistical problem of recovering the signal f from the data X is equivalent to the problem of recovering the sequence of (Fourier) coeﬃcients θ ℓ2

from noisy observations Y1, Y2, . . . , satisfying

n Zi, i = 1, 2, . . . , (2.1)

where the Zi are independent standard normal variables. This is the usual setting in which there is a single observer that observes every coeﬃcient θi with additive Gaussian noise with variance σ2/n. See for instance Tsybakov (2009); Johnstone (2017); Gin e and Nickl (2016) for more details on this classical model. In the distributed version of the model we divide the precision budget n over m diﬀerent observers, so that each one observes the signal in Gaussian noise with variance σ2m/n, independent of the others. In other words, observer j has data Y j 1 , Y j 2 , . . . satisfying

Y j i = θi +

n Zj i , i = 1, 2, . . . , (2.2)

where the Zj i are independent, standard Gaussian random variables. We call the m independent sub-problems in which the signal-to-noise ratio is σ2m/n the local problems. Note that the classical, non-distributed signal-in-white-noise model is obtained again from the distributed model by aggregating all the local data. Indeed, if for j = 1, 2, . . . we deﬁne Yi = m 1 Pm j=1 Y j i , then (2.1) holds, with Zi = m 1/2 Pm j=1 Zj i independent standard normal variables. This model has been studied extensively in the literature, serving as a canonical model for understanding the performance of high-dimensional or nonparametric

An asymptotic analysis of distributed nonparametric methods

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

True signal

Figure 1: True signal.

statistical procedures. It is well known for instance that if the true signal θ belongs to an ellipsoid or a hyper rectangle of the form

{θ ℓ2 : X i2βθ2 i M2} or {θ ℓ2 : sup i (i1+2βθ2 i ) M2}

for some β, M > 0, then the optimal rate of convergence of estimators (relative to the ℓ2norm) is of the order n β/(1+2β). Moreover, there exist so-called adaptive estimators, which achieve this rate without using knowledge about the parameters β or M that describe the complexity, or regularity of the true signal. See, for instance, Tsybakov (2009) or Gin e and Nickl (2016). Our central question is whether or not the same results can be obtained in the distributed setting in which each of the m diﬀerent observers ﬁrst separately make inference about the signal, and then the local estimates are aggregated into one joint estimator. For technical convenience we will quantify regularity using hyper rectangles in the remainder of the paper, but the analoguous results can be obtained for ellipsoids. The speciﬁc examples of distributed procedures that we consider in this paper are about distributed Bayesian methods. These methods have in common that each local observer ﬁrst chooses a prior distribution and computes the corresponding local posterior distribution using the local data (or an appropriate modiﬁcation). In the next step the m local posteriors are somehow aggregated into a global posterior-type distribution, which is then used to produce an estimate of the signal and/or a quantiﬁcation of the associated uncertainty. In general there is no guarantee that this aggregated posterior resembles the posterior distribution that would be obtained in the non-distributed setting, using all the data at once. In particular, it is not clear beforehand how a distributed Bayes method should be constructed in order to have good theoretical properties, like optimal convergence rates, reliable uncertainty quantiﬁcation or adaptation properties. In this paper we investigate various distributed methods that have been proposed from this point of view. To see that interesting things can happen it is exemplifying to compare the results of a distributed and a non-distributed (Bayesian) analysis of simulated data. Concretely, we consider a true signal θ consisting of the Fourier coeﬃcients of the function shown in Figure 1. For this signal we simulate data according to (2.2), with σ = 1, m = 40 and n = 120 40 = 4800. Then for every local observer a Bayesian procedure is carried out with a Gaussian prior on θ, postulating that the coordinates θi are independent and N(0, i 1 2α)-distributed. The hyperparameter α, which describes the regularity of the prior,

Szab o and Van Zanten

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

Distributed adaptive method (Method VII)

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

Non distributed adaptive method

Figure 2: Signal reconstruction using the distributed method (left) and the non-distributed method (right).

is determined using a distributed version of maximum marginal likelihood, as described in Section 4. This analysis leads to m = 40 local posterior distributions. These are then combined to produce an overall posterior distribution for the signal. The precise procedure is described in Section 4. The resulting estimator for the signal, together with pointwise 95% credible intervals, is shown in the left plot in Figure 2. The corresponding non-distributed result is obtained by ﬁrst aggregating all local data as in (2.1) and then carrying out the same Bayesian procedure on these complete data. The resulting non-distributed reconstruction of the signal is shown on the right in Figure 2. The non-distributed version of this method was studied theoretically for instance in Knapik et al. (2016) and Szab o et al. (2015), where it was shown that the method is adaptive and rate-optimal. The simulation suggests however that an apparently reasonable distributed analogue of the method does not necessarily inherit these favourable properties. The procedure seems to be underﬁtting and the credible intervals appear to be too narrow. We will argue that this is in some sense a fundamental issue and in the next sections we will study various proposed distributed methods to investigate to what degree they succeed in avoiding or solving these problems.

3. Results for non-adaptive procedures

In this section we study the performance of a number of Bayesian procedures for signal reconstruction in the distributed signal-in-white-noise model introduced in Section 2. All methods involve putting a prior distribution on the unknown signal θ ℓ2 in each local problem and then combining the resulting local posteriors into one global posterior-type distribution. To be able to compare the various methods we consider the same Gaussian process (GP) prior in every case, namely the prior

i=1 N(0, i 1 2α), (3.1)

which postulates that the coeﬃcients θi of the signal θ are independent and N(0, i 1 2α)- distributed. The hyper parameter α > 0 essentially controls the regularity of the prior, since

An asymptotic analysis of distributed nonparametric methods

it basically controls how fast the Fourier coeﬃcients decrease. (Some of the methods we consider use exactly this prior, others modify it in a certain way with the aim of achieving better performance.) The global posterior-type distribution depends on all the data Y = (Y j i : j = 1, . . . , m; i = 1, 2, . . .) and is denoted by Π( | Y). It is generally some type of average of the local posteriors, but its precise construction diﬀers between proposed methods. We will see that this can have a signiﬁcant eﬀect on the performance. We take an asymptotic perspective and investigate in every case the rate at which the global posterior contracts around the true signal as n relative to the ℓ2-norm, which is as usual deﬁned by θ 2 2 = P θ2 i . For a sequence of positive numbers εn 0 we say that the global posterior contracts at the rate εn around the true signal θ0 if for all sequences Mn , Eθ0Π(θ ℓ2 : θ θ0 2 > Mnεn | Y) 0

as n . This means that asymptotically, all posterior mass is concentrated in balls around the true signal θ0 with ℓ2-radius of the order εn. Additionally, we study how well the posterior quantiﬁes the remaining uncertainty. Speciﬁcally, we consider the coverage probabilities of credible balls around the global posterior mean. These credible sets are constructed by ﬁrst computing the mean ˆθ of the global posterior Π( | Y). Then for a level γ (0, 1), the posterior is used to determine the radius rγ such that the ball around ˆθ with radius rγ receives 1 γ posterior mass, i.e.

Π(θ : θ ˆθ 2 rγ|Y) = 1 γ.

For L > 0, the credible set ˆC(L) is subsequently deﬁned by

ˆC(L) = {θ : θ ˆθn 2 Lrγ}. (3.2)

(The extra constant L gives some added ﬂexibility, for L = 1 we obtain an exact 1 γ credible set.) We are interested in the coverage probabilities Pθ0(θ0 ˆC(L)). If this tends to 0 as n , the credible sets are asymptotically not frequentist conﬁdence sets, hence give a misleading quantiﬁcation of the uncertainty. Ideally, the coverage probabilities stay bounded away from 0 as n . In the non-distributed case m = 1 it is well known that both the rate at which the posterior contracts around the truth and the behaviour of the coverage probabilities of credible sets depend crucially on how the hyper parameter α is tuned. The correct biasvariance trade-oﬀis achieved if α is in accordance with the regularity of the unknown signal. To make this precise, we will consider signals belonging to hyper rectangles of the form

Hβ(M) = n θ ℓ2 : sup i (i1+2βθ2 i ) M2o (3.3)

for some β, M > 0. It is shown for instance in Knapik et al. (2011) for the non-distributed case that if θ0 Hβ(M) and we set α = β, then the posterior contracts around θ0 at the optimal rate n β/(1+2β). Moreover, for L large enough it then holds that Pθ0(θ0 ˆC(L)) γ. Hence, in the non-distributed case it is optimal to choose the hyper parameter α in such a way that the regularity α of the prior matches the regularity β of the true signal. In the remainder of this section we investigate distributed methods from this point of view. We will see that the diﬀerent proposed methods lead to diﬀerent behaviours in terms

Szab o and Van Zanten

of contraction rates and coverage. We stress that the results in this section are non-adaptive, in the sense that we allow the tuning parameter α and other aspects of the constructions to use knowledge of the regularity β of the true signal. This is of course not realistic. It is however important to ﬁrst understand for every method whether ideally, if the value of β is given to us by an oracle, it is possible to tune the method optimally. Whether this is also possible adaptively, without knowing β, is then the next natural question, which we address in Section 4.

3.1. Naive averaging of local posterior draws

Recall that we have m local observers that each have a dataset Yj = (Y j 1 , Y j 2 , . . .) of noisy coeﬃcients satisfying (2.2). The aim is to recover the true sequence of coeﬃcients θ. As a starting point, and to have a baseline case to compare the other methods to, we analyse the naive distributed approach in which in every local problem we simply use the prior Π( | α) deﬁned by (3.1), with α = β equal to the regularity of the true sequence θ, in the sense of (3.3). Every local observer then computes its corresponding local posterior, Πj( | Yj). By Bayes formula this is given by

dΠj(θ | Yj) p(Yj | θ) dΠ(θ | β),

where the likelihood for the jth local problem is given by

p(Yj | θ) Y e 1

2 n(Y j i θi)2

σ2m . (3.4)

Finally these local posteriors are combined into a global, average posterior ΠI( | Y) by postulating that a draw from this global posterior is generated by ﬁrst drawing once from each local posterior and then averaging these m independent draws. (Formally, this means that the global posterior ΠI( | Y) is the convolution of the rescaled local posteriors Π1(m | Y1), . . . , Πm(m | Ym)). This distributed method is conceptually very simple, but it turns out that neither from the point of view of contraction rates, nor from the point of view of uncertainty quantiﬁcation it performs very well. The reason is basically that although the choice α = β of the tuning parameter of the prior correctly matches squared bias, variance and posterior spread in the local problems, the averaging procedure results in a global posterior for which the spread and the variance of the mean are too small relative to the squared bias. The following theorem asserts that for every smoothness level β > 0 there exist β-regular truths for which the contraction rate of the posterior deteriorates substantially and for which the uncertainty quantiﬁcation by the credible sets (3.2) constructed from the global posterior ΠI( | Y) is useless, no matter how far they are blown up by a constant L > 0.

Theorem 1 (naive averaging) For every β, M > 0 there exists a θ0 Hβ(M) such that for small enough c > 0,

Eθ0ΠI(θ : θ θ0 2 cm

β 1+2β n β 1+2β |β, Y) 0

as m and n/m . Furthermore, for all L > 0 it holds that

Pθ0 θ0 ˆC(L) 0.

An asymptotic analysis of distributed nonparametric methods

Proof The proof of the theorem is given in Section A.1.

In the literature several less naive distributed strategies have been proposed. These methods either change the local likelihoods in a certain way, and/or the priors that are locally used, and/or the way that the local posteriors are aggregated. In the next few sections we investigate whether such strategies can improve the bad asymptotic performance of the naive averaging method.

3.2. Adjusted local likelihoods and averaging

One perspective on the bad performance of the naive method is to say that since the sample size n/m in the local problems is too small, the inﬂuence of the data on the local posterior is too small, resulting in a variance (and spread) that is too small relative to the squared bias in the global posterior. A possible way to remedy this that has been proposed in several papers is to raise the local likelihoods to the power m, in order to mimic the situation that we have sample size n in the local problems. This generalized Bayesian approach for the local problems has for instance been considered in the distributed context by Srivastava et al. (2015). They combine it with a diﬀerent aggregation method however, which we consider in Section 3.4. In this section we still consider the simple averaging scheme, in order to isolate the eﬀect of adjusting the local likelihoods. So in method II all local observers use the prior Π( | α) again, with α = β equal to the regularity of the truth. They now each compute a generalized local posterior Πj( | Yj), deﬁned by

d Πj(θ | Yj) p(Yj | θ) m dΠ(θ | β).

As before the global posterior ΠII( | Y) is deﬁned by postulating that a draw from this global posterior is generated by ﬁrst drawing once from each local generalized posterior and then averaging these m independent draws. The following theorem states that this method indeed improves the naive approach of Section 3.1. The global posterior now contracts at the optimal rate for every β-regular truth. Unfortunately, the bad behaviour of the credible sets has not been remedied. For this approach the uncertainty quantiﬁcation is in fact misleading for all β-regular truths.

Theorem 2 (adjusted likelihoods + averaging) For all β, M > 0 and all sequences Mn ,

sup θ0 Hβ(M) Eθ0ΠII(θ : θ θ0 2 Mnn β 1+2β |Y) 0

as n, m . However, for all θ0 Hβ(M) and all L > 0 it holds that

Pθ0 θ0 ˆC(L) 0.

Proof The proof is given in Section A.2.

Szab o and Van Zanten

3.3. Adjusted priors and averaging

Adjusting the likelihood as in the preceding section resulted in a correct trade-oﬀbetween the bias and the variance of the global posterior mean, yielding an optimal posterior contraction rate. The spread of the posterior remained too small in comparison however, resulting in credible sets with zero asymptotic coverage. Instead of raising the local posteriors to the power m, as considered in the preceding section, we could alternatively raise the prior density to the power 1/m. This has for instance been proposed in the context of the Consensus Monte Carlo approach by Scott et al. (2016), in combination with simple averaging of the local posteriors. In this section we investigate the performance of this method in terms of posterior contraction and uncertainty quantiﬁcation in our distributed signal-in-white-noise model. The prior Π( | α) that we use in the local problems is again a product of centered Gaussians with variance i 1 2α. Raising the corresponding densities to the power 1/m has the eﬀect of multiplying the ith prior variance by m. Hence, in our case raising the prior density to the power 1/m is the same as multiplicative rescaling, postulating that θ is a-priori distributed according to Π( | α, m), where

Π( |α, τ) =

i=1 N(0, τi 1 2α) (3.5)

for α, τ > 0. Rescaled GPs have also been considered by Shang and Cheng (2015), who have used them in the distributed setting to construct global credible sets from local ones. Using rescaling we can actually obtain good results if the prior regularity α is not exactly equal to the true regularity β. By using a scaling diﬀerent from τ = m we can somehow compensate for the mismatch between α and β, at least in the range β 1 + 2α. In the non-distributed setting this is a well-known phenomenon, see for instance van der Vaart and van Zanten (2007); Knapik et al. (2011); Szab o et al. (2013). The distributed procedure that we consider in this section then takes the following form. Every local observer uses the rescaled prior Π( |α, τ) deﬁned by (3.5), with α > 0 and

where β is the regularity of the truth. Next the (normal, unadjusted) corresponding posteriors are computed and they are averaged into a global posterior ΠIII( | Y) as in the preceding sections. (Note that if in the local problems the prior regularity α = β is used, then τ = m, so the method corresponds to raising the prior density to the power 1/m.) The following theorem gives the posterior contraction and coverage results for this method.

Theorem 3 (adjusted priors + averaging) Suppose β, M > 0 and β 1 + 2α. Then for all sequences Mn ,

sup θ0 Hβ(M) Eθ0ΠIII(θ : θ θ0 2 > Mnn β 1+2β |Y) 0

An asymptotic analysis of distributed nonparametric methods

as n . Moreover, for all γ (0, 1) it holds that

sup θ0 Hβ(M) Pθ0 θ0 ˆC(L) γ

for large enough L > 0.

Proof See Section A.3.

So adjusting the prior in this way actually works better than adjusting the likelihood. Not only do we get optimal contraction rates, but the credible sets that this method produces have asymptotic frequentist coverage too. The proof shows that the credible sets have optimal radius of the order n β/(1+2β) as well.

3.4. Adjusted local likelihoods and Wasserstein barycenters

In Section 3.2 we saw that raising the local likelihoods to the power m and then averaging the corresponding generalized posteriors yields optimal contraction rates, but can produce badly performing credible sets. In this section we study the approach considered by Minsker et al. (2014); Srivastava et al. (2015) in the context of their WASP method, which consists in aggregating the local posteriors not by simple averaging, but by computing their Wasserstein barycenter. The generalized local posteriors Πj( | Yj), as deﬁned in Section 3.2, are (Gaussian) measures on ℓ2. The 2-Wasserstein distance W2(µ, ν) between two probability measures µ and ν on ℓ2 is deﬁned by

W 2 2 (µ, ν) = inf γ

Z Z x y 2 2 γ(dx, dy),

where the inﬁmum is over all measures γ on ℓ2 ℓ2 with marginals µ and ν. The corresponding 2-Wasserstein barycenter of m probability measures µ1, . . . , µm on ℓ2 is then deﬁned by

µ = argmin µ 1 m

j=1 W 2 2 (µ, µj),

where the minimum is over all probability measures on ℓ2 with ﬁnite second moments. There exist eﬀective algorithms to compute Wasserstein barycenters in many cases, see for instance Cuturi and Doucet (2014) and the references therein. Having this notion at our disposal the distributed method we consider in this section proceeds as follows. In every local problem the prior Π( | α) is used, with α = β equal to the regularity of the truth. Next, the corresponding generalized posteriors Πj( | Yj) are computed locally, which involves raising the likelihood to the power m as described in Section 3.2. Finally, the global posterior ΠIV ( | Y) is constructed as the 2-Wasserstein barycenter of the local measures Π1( | Y1), . . . , Πm( | Ym). The following theorem asserts that this method results in optimal posterior contraction rates and correct quantiﬁcation of uncertainty.

Szab o and Van Zanten

Theorem 4 (adjusted likelihoods + barycenters) For all β, M > 0 and all sequences Mn ,

sup θ0 Hβ(M) Eθ0ΠIV (θ : θ θ0 2 > Mnn β 1+2β |Y) 0

as n . Moreover, for all γ (0, 1) it holds that

sup θ0 Hβ(M) Pθ0 θ0 ˆC(L) γ

for large enough L > 0.

Proof See Section A.4.

3.5. Product of Gaussian process experts

The proofs of the theorems presented so far show that since in our context the global posterior is always a Gaussian measure, the behaviour of the procedure can be understood by analyzing three central quantities: the bias of the posterior mean, the variance of the posterior mean, and the spread of the posterior. Depending on how these quantities are related we have found diﬀerent behaviours: sub-optimal posterior contraction and bad coverage of credible sets (Section 3.1), optimal posterior contraction but bad coverage of credible sets (Section 3.2), and optimal posterior contraction and also good coverage of credible sets (Sections 3.3 and 3.4). In principle it is now straightforward to analyze diﬀerent methods as well, provided the three central quantities can be controlled. As an illustration we consider in this section the single-layer version of the product-of-Gaussian-process-expert (Po E) model, introduced in Ng and Deisenroth (2014) and a generalization proposed in Cao and Fleet (2014). An interesting fact is that we will encounter a combination of behaviours that we have not seen yet: sub-optimal contraction rates, but good coverage of credible sets. These methods were introduced to deal with the distributed non-parametric regression model, but for the sake of comparison we analyze them in the context of our distributed signal-in-white-noise model, which can be thought of as an idealized version of the regression model. The idea of the basic version of the Gaussian Po E model is to employ a Gaussian prior in every local machine, compute the corresponding posterior densities and approximate the global posterior density by multiplying and normalizing these. In our inﬁnite-dimensional setting this does not make sense strictly speaking, since we can not express priors and posteriors on ℓ2 in terms of densities with respect to some generic dominating measure. We could remedy this by considering a truncated version of our distributed model, where we assume we only observe the ﬁrst n noisy coeﬃcients Y j i in every machine, say, and focus on making inference about the ﬁrst n true coeﬃcients θi. This would make the setting ﬁnitedimensional, allowing us to write prior and posterior densities with respect to the Lebesgue measure. Alternatively, we can stay in the inﬁnite-dimensional setting of the paper and just reason formally and still arrive at a well-deﬁned global Po E posterior . This is the approach we follow here.

An asymptotic analysis of distributed nonparametric methods

Indeed, say that as before we use the prior Π( | α) given by (3.1) in every local machine, with α = β equal to the regularity of the true signal. This prior has formal density proportional to

2 θ2 i i 1 2β .

By completing the square we see that the product of this expression with the local likelihood given by (3.4) is, still formally, proportional to

2 θ2 i i 1 2β e 1

2 n(Y j i θi)2

2 θ2 i (i1+2β+ n mσ2 )+θi n Y j i mσ2 .

Taking the product over j we then obtain the formal density of the Po E posterior, which is proportional to

2 θ2 i (mi1+2β+ n

σ2 )+θi n Pm j=1 Y j i mσ2 .

Now this last expression is, up to a constant, the density of a product of Gaussians with means ˆθi and variances t2 i given by

ˆθi = nm 1 P Y j i n + σ2mi1+2β , t2 i = σ2

n + σ2mi1+2β .

The latter is in fact a well-deﬁned Gaussian measure on ℓ2, so we can now simply deﬁne the global Po E posterior ΠV ( | Y) as the latter measure. We see that the expressions for the global mean and spread are in fact the same as what we found in Section A.1 for the naive averaging method. As a consequence, the negative result of Theorem 1 holds for the basic version of the Gaussian Po E model as well.

Theorem 5 (product of Gaussian experts) For every β, M > 0 there exists a θ0 Hβ(M) such that for small enough c > 0,

Eθ0ΠV (θ : θ θ0 2 cm

β 1+2β n β 1+2β |Y) 0

as m and n/m . Furthermore, for all L > 0 it holds that

Pθ0 θ0 ˆC(L) 0.

One can generalize the Po E model by raising the local posterior densities to some power before multiplying and normalizing them, as proposed in Cao and Fleet (2014). In the subsequent analysis we consider the choice 1/m for the power, as suggested in Deisenroth and Ng (2015). Adapting the preceding analysis for the ordinary Po E model we see that for this generalized Po E model the global posterior ΠV I( | Y) is in our setting again a product of Gaussians, but now with means and variances given by

ˆθi = nm 1 P Y j i n + σ2mi1+2β , t2 i = σ2m n + σ2mi1+2β .

So the global posterior mean is unaltered compared to the basic Po E model, but the global posterior spread has been blown up by a factor m. As a result, there still exists the same

Szab o and Van Zanten

Method Description Optimal rate Coverage I naive averaging no no II adjusted likelihoods, averaging yes no III adjusted priors, averaging yes yes IV adjusted likelihoods, barycenter yes yes V product of experts no no VI generalized product of experts no yes

Table 1: Performance of the various non-adaptive methods.

class of truths as in Section A.1 for which the squared bias and the variance of the posterior mean will be incorrectly balanced, resulting in a sub-optimal rate of posterior contraction. However, the larger posterior spread ensures that we do have asymptotic coverage of credible sets. It should be noted however that these sets have a diameter that is sub-optimal, i.e. they are too conservative.

Theorem 6 (generalized product of Gaussian experts) For every β, M > 0 there exists a θ0 Hβ(M) such that for small enough c > 0,

Eθ0ΠV I(θ : θ θ0 2 cm

β 1+2β n β 1+2β |Y) 0

as m and n/m . However, for all γ (0, 1) it holds that

sup θ0 Hβ(M) Pθ0 θ0 ˆC(L) γ

for large enough L > 0.

Proof The proof of the theorem can be found in Section A.5.

3.6. Summary of results for non-adaptive methods

We have seen that the various methods for aggregation of the local posteriors can give quite diﬀerent results. The methods we considered produce diﬀerent global posterior measures. Depending on the relation between the bias and variance of the global posterior mean and the spread of this global posterior, the posterior contraction rate and coverage probabilities of credible sets can have diﬀerent behaviours. We summarize our ﬁndings in Table 1. This is certainly not meant to be an exhaustive list of methods, but rather an illustration of how the design of distributed procedures can aﬀect their fundamental performance. Simulations further illustrate the theoretical results. We have considered a true signal θ consisting of the Fourier coeﬃcients of the function shown in the left panel of Figure 3. This is a signal which has regularity β = 1 in the sense of (3.3). For this signal we simulated data according to (2.2), with σ = 1, n = 4800 and m = 40, i.e. we considered a distributed setting with m = 40 machines. For the sake of comparison, the right panel of Figure 3 shows the

An asymptotic analysis of distributed nonparametric methods

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

True signal

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

Non distributed method

Figure 3: Left: true signal. Right: posterior mean (blue solid curve) and 95% pointwise credible bands (dashed blue curves) for the non-distributed method.

signal reconstruction and uncertainty quantiﬁcation for the non-distributed method which ﬁrst aggregates all data in a single machine and then computes the posterior corresponding to the prior Π( | α) deﬁned by (3.1), with α = β. This is a method which is known to have an optimal convergence rate and correct quantiﬁcation of uncertainty. This classical, non-distributed result should be compared to Figure 4, which visualizes the posteriors generated by each of the distributed methods I VI. In accordance with our theoretical results, we see that the results of methods III and IV are comparable with the non-distributed method. Methods I, V and VI have worse signal reconstruction. The posterior mean of Method II is comparable to that of the optimal methods, but the uncertainty is underestimated. An important observation to make is that the methods that achieve the same optimal performance as non-distributed methods, all use information about the regularity β of the unkown signal, mostly through the setting of tuning parameters in the priors. In that sense, they are non-adaptive. They serve as useful results that indicate what is possible in principle if we have certain oracle knowledge about the truth we are trying to learn. To understand what realistic procedures can achieve this has to be combined with insight into what can be learned about this oracle knowledge from the data. In the next section we address this issue in the context of our distributed signal-in-white-noise model.

4. Results for adaptive procedures

In the non-distributed case it is well known that there exist adaptive methods that achieve the same optimal performance as non-adaptive procedures, without using knowledge of the regularity β of the unkown signal. These methods somehow succeed in correctly trading oﬀbias, variance (and spread in Bayesian methods) in a purely data-driven manner. For several such result in the context of the signal-in-white-noise model, see, for instance, Gin e and Nickl (2016) and the references therein. For distributed methods the issue of adaptation appears to be a lot more subtle. In this paper we only have a ﬁrst, negative result on adaptive properties of distributed methods. So now we do not assume that we know the true regularity β of the unknown signal. As before we employ the prior Π( | α) in the local machines. To tune the regularity parameter α

Szab o and Van Zanten

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

0.0 0.2 0.4 0.6 0.8 1.0

0.4 0.2 0.0 0.2 0.4

Figure 4: Global posterior mean (solid red curve) and 95% pointwise credible bands (dashed red curves) for each of the methods I VI.

An asymptotic analysis of distributed nonparametric methods

of the prior we consider a distributed version of maximum marginal likelihood estimator, as proposed by Deisenroth and Ng (2015). The usual, non-distributed version of that method would use the maximizer of the map

α 7 log Z m Y

j=1 p(Yj | θ) Π(dθ | α)

as tuning parameter. Maximizing this function however requires having all data available in a central machine. In the distributed setting, Deisenroth and Ng (2015) argue that this map is well approximated by the map

j=1 log Z p(Yj | θ) Π(dθ | α) .

Now every term in the sum just depends on one of the local machines and this function can be maximized on the central machine by repeatedly asking the local machines for function evaluations and gradients of the local log-marginal likelihoods

log Z p(Yj | θ) Π(dθ | α).

The resulting estimator is denoted by ˆα, i.e.

ˆα = argmax α [0,log n]

j=1 log Z p(Yj | θ) Π(dθ | α) .

(We maximize over a compact interval to ensure that the maximizer exists.) It turns out that in the distributed setting, the local machines are in general not able to learn enough about the true signal regularity β. The following lemma asserts that there exist diﬃcult signals for which the estimator ˆα overestimates the regularity.

Lemma 7 For β, M > 0, consider a signal θ0 ℓ2 such that

( M2i 1 2β if i (n/(σ2 m))1/(1+2β), 0 else. (4.1)

Then θ0 Hβ(M) and if M is small enough, then

Pθ0(ˆα β + 1/2) 1 (4.2)

if n/m and m .

Proof The proof is given in Section B.1.

In view of Lemma 7 it is perhaps not surprising that if the approximated maximum marginal likelihood estimator ˆα is used to tune the local prior that is used in every machine, sub-optimal performance is obtained for certain truths. Intuitively this is because due to

Szab o and Van Zanten

the smaller signal-to-noise ratio, or sample size in the local machines, certain truths may appear more regular than they really are. It turns out that using the estimator ˆα in combination with any of the methods considered in the preceding section indeed leads to sub-optimal rates and bad coverage probabilities for certain truths. As an illustration we present a rigorous statement for the method of Section 3.4, but similar results can be derived for the others methods as well. So suppose that in every local problem the prior Π( | α) is used, the corresponding generalized posterior Πj( | Yj) is computed locally (which involves raising the local likelihood to the power m), and then the tuning parameter α is substituted by the estimator ˆα deﬁned above. In the central machine, the global posterior ΠV II( | Y) is constructed as the 2-Wasserstein barycenter of the local posterior measures Π1( | ˆα, Y1), . . . , Πm( | ˆα, Ym).

Theorem 8 For β, M > 0 and θ0 as in Lemma 7 we have, for some c > 0,

Eθ0ΠV II(θ : θ θ0 2 c(n/ m) β 1+2β |Y) 0

as m and n/m . Furthermore, for all L > 0 it holds that

Pθ0 θ0 ˆC(L) 0.

Proof See Section B.2.

A simulation illustrating the theoretical result of the theorem is given in Figure 2. The left panel visualizes the posterior generated by method VII, in the same distributed setting, and using the same simulated data as considered in Section 3.6. So when combined with a data-driven tuning method like the distributed version of maximum marginal likelihood considered here, even the distributed methods that perform well in the non-adaptive setting loose their favourable properties. None of the methods yields a procedure that automatically adapts to regularity and achieves the optimal nondistributed rate. This does not imply of course that such an adaptive method does not exist. We expect however that the matter is delicate and that fundamental limitations exist. The issue appears to be similar to that of the existence of adaptive conﬁdence sets. To achieve adaptation in our distributed setting the local machines must be able to learn the global regularity of the signal from the limited local data that they have available. Analogous to the adaptive conﬁdence problem we expect that this is in general only possible under additional assumptions on the true signal, like the self-similarity or polished tail conditions proposed for instance in Gin e and Nickl (2010), Bull (2012), Szab o et al. (2015), Nickl and Szab o (2016), Belitser et al. (2017). Making these admittedly somewhat loose claims mathematically precise takes considerably more eﬀort. Recent work shows that in a distributed setting with communication restrictions, some degree of adaptation to smoothness is in principle possible, but requires diﬀerent kinds of algorithms (see Szabo and van Zanten (2019)). Many open questions remain at the moment however. It is for instance unclear how the possibility of adaptation, or purely data-driven tuning, is related the degree of communication or the amount of central computation allowed. It would in particular be interesting to better understand the theoretical performance of distributed methods which allow multiple rounds of communication (e.g. Shamir et al. (2014), Heinze et al. (2016), Wang et al. (2017a,b), Lu et al. (2016)).

An asymptotic analysis of distributed nonparametric methods

Appendix A. Proofs for Section 3

A.1. Proof of Theorem 1

By completing the square we see that under the local posterior Πj( | Yj) the coeﬃcients θi are independent and Gaussian, with mean ˆθj i and variance s2 i given by

ˆθj i = n n + σ2mi1+2β Y j i , s2 i = σ2m n + σ2mi1+2β .

Hence the global posterior ΠI( | Y) is Gaussian as well, and under that measure the coeﬃcients θi are independent and have mean ˆθi and variance t2 i given by

j=1 ˆθj i , t2 i = s2 i m.

For the global posterior mean we have, for every θ0 ℓ2,

Eθ0 ˆθi θ0,i = σ2mi1+2β

n + σ2mi1+2β θ0,i, Varθ0 ˆθi = σ2n (n + σ2mi1+2β)2 ,

Eθ0 ˆθ θ0 2 2 = X σ4m2i2+4β

(n + σ2mi1+2β)2 θ2 0,i + X σ2n (n + σ2mi1+2β)2 .

By Lemma 9 the second, variance term is of the order

m 1/(1+2β))n 2β/(1+2β),

as n/m . For θ2 0,i = Mi 1 2β, by the same lemma, the ﬁrst, squared bias term is proportional to (n/m) 2β/(1+2β). For the global spread, again in view of Lemma 9, we have

X t2 i = X σ2

n + σ2mi1+2β m 1/(1+2β))n 2β/(1+2β). (A.1)

By the triangle inequality we have

ΠI( θ θ0 2 cm

β 1+2β n β 1+2β |Y)

ΠI(θ : Eθ0 ˆθ θ0 2 cm

β 1+2β n β 1+2β ˆθ Eθ0 ˆθ 2 θ ˆθ 2|Y).

It follows from the bounds on the variance and squared bias of the posterior mean that for θ0 as chosen above, the quantity

Eθ0 ˆθ θ0 2 cm

β 1+2β n β 1+2β ˆθ Eθ0 ˆθ 2

appearing in the posterior probability is with Pθ0-probability tending to one bounded from

below by cm

β 1+2β n β 1+2β for c > 0 small enough. Then by the upper bound for the posterior spread and Chebyshev s inequality we obtain the ﬁrst statement of the theorem.

Szab o and Van Zanten

For the coverage we note that the radius rγ of the credible set is a multiple of m 1/(2+4β)n β/(1+2β), which follows from the Gaussianity of the posterior and (A.1). Then by similar computations as above we get that for the same truth θ0,

Pθ0(θ0 ˆC(L)) = Pθ0( ˆθ θ0 2 Lrγ)

Pθ0 ˆθ Eθ0 ˆθ 2 Eθ0 ˆθ θ0 2 Lrγ

Pθ0 ˆθ Eθ0 ˆθ 2 cm

β 1+2β n β 1+2β

2β 1+2β Eθ0 ˆθ Eθ0 ˆθ 2 2 m 1 0.

This completes the proof of the theorem.

A.2. Proof of Theorem 2

Raising the local likelihood (3.4) to the power m makes it proportional to

2 n(Y j i θi)2

which is the likelihood for the case m = 1. It follows that under the generalized local posterior Πj( | Yj) the coeﬃcients θi are independent and Gaussian, with mean ˆθj i and variance s2 i given by

ˆθj i = n n + σ2i1+2β Y j i , s2 i = σ2

n + σ2i1+2β .

Hence the global posterior ΠII( | Y) is again Gaussian, and under this global measure the coeﬃcients θi are independent and have mean ˆθi and variance t2 i given by

j=1 ˆθj i , t2 i = s2 i m.

For the global posterior mean we have in this case, for every θ0 ℓ2,

Eθ0 ˆθi θ0,i = σ2i1+2β

n + σ2i1+2β θ0,i, Varθ0 ˆθi = σ2n (n + σ2i1+2β)2 ,

Eθ0 ˆθ θ0 2 2 = X σ4i2+4β

(n + σ2i1+2β)2 θ2 0,i + X σ2n (n + σ2i1+2β)2 .

For all θ0 Hβ(M), in view of Lemma 9, the squared bias term is bounded by

M2 X σ4i1+2β

(n + σ2i1+2β)2 M2n 2β/(1+2β)

for large n, and the variance term behaves like a constant times n 2β/(1+2β) as well. The global spread P t2 i is of the order m 1n 2β/(1+2β) for large n, again following from Lemma 9.

An asymptotic analysis of distributed nonparametric methods

For Mn and θ0 Hβ(M) we now have, by the triangle inequality,

ΠII(θ : θ θ0 2 Mnn β/(1+2β) | Y)

ΠII(θ : θ ˆθ 2 Mnn β/(1+2β) ˆθ Eθ0 ˆθ 2 θ0 Eθ0 ˆθ 2 | Y).

By the bounds on the bias and the variance of the posterior mean derived above the quantity on the right of the inequality in the last posterior probability is bounded from below by (Mn/2)n β/(1+2β) with Pθ0-probability tending to one as n, m , uniformly in θ0 Hβ(M). By Chebychev s inequality, and the bound on the posterior spread, we conclude that the ﬁrst statement of the theorem holds. For the second statement we ﬁrst note that by Chebychev s inequality and by the upper bound on the posterior spread the radius rγ of the credible set is for large n bounded by Cm 1/2n β/(1+2β) for some C > 0. Hence, since the posterior mean is Gaussian and the Gaussian measure of a ball of a ﬁxed size is maximal if the ball is centered at the mean (a consequence of Anderson s inequality, e.g. Lifshits (1995), Section 11), we have

Pθ0 θ0 ˆC(L) Pθ0 ˆθ θ0 2 CLm 1/2n β/(1+2β)

Pθ0 ˆθ Eθ0 ˆθ 2 CLm 1/2n β/(1+2β) .

By Chebychev s inequality,

Pθ0 ˆθ Eθ0 ˆθ 2 2 X σ2 i a q

for all a > 0, where σ2 i = Varθ0 ˆθi. Above we saw that P σ2 i n 2β/(1+2β). Similarly, it is easily seen that P σ4 i n( 1 4β)/(1+2β). Hence by taking a = n(1/4)/(1+2β), for instance, we see that for c > 0 small enough,

Pθ0 ˆθ Eθ0 ˆθ 2 cn β/(1+2β) 0

as n . But then also

Pθ0 ˆθ Eθ0 ˆθ 2 CLm 1/2n β/(1+2β) 0

A.3. Proof of Theorem 3

In this case the jth local posterior is a product of Gaussians with means and variances given by

ˆθj i = n n + σ2mτ 1i1+2α Y j i , s2 i = σ2m n + σ2mτ 1i1+2α .

As before the global posterior is Gaussian as well, and under that measure the coeﬃcients θi are independent and have mean ˆθi and variance t2 i given by

j=1 ˆθj i , t2 i = s2 i m.

Szab o and Van Zanten

For the global posterior mean we have, for every θ0 ℓ2,

Eθ0 ˆθi θ0,i = σ2mτ 1i1+2α

n + σ2mτ 1i1+2α θ0,i, Varθ0 ˆθi = σ2n (n + σ2mτ 1i1+2α)2 ,

Eθ0 ˆθ θ0 2 2 = X σ4m2τ 2i2+4α

(n + σ2mτ 1i1+2α)2 θ2 0,i + X σ2n (n + σ2mτ 1i1+2α)2 .

Then in view of Lemma 9, we see that for β < 1 + 2α and uniformly for θ0 Hβ(M), the squared bias term is bounded by a constant times

M2(τn/m) 2β/(1+2α).

Similarly, the variance term and the posterior spread P t2 i both behave like a constant times

(τ/m)1/(1+2α)n 2α/(1+2α)

as n . The choice τ = mn2(α β)/(1+2β) balances these quantities, so that all three are of the order n 2β/(1+2β). By exactly the same reasoning as in Section A.2, the fact that the squared bias bound and the variance and spread are of the same order implies the ﬁrst statement of the theorem. For the coverage statement we ﬁrst note that the squared credible set radius r2 γ is the 1 γ quantile of the distribution of P t2 i Z2 i , with t2 i as above and Zi independent standard normals. This distribution has mean P t2 i n 2β/(1+2β) and variance

nn 2β/(1+2β),

following from Lemma 9. As the standard deviation is of smaller order than the mean, it follows from Chebychev s inequality that rγ cn β/(1+2β) for some c > 0. For the coverage probability we then have

Pθ0 θ0 ˆC(L) Pθ0 ˆθ θ0 2 c Ln β/(1+2β) n2β/(1+2β)

c2L2 Eθ0 ˆθ θ0 2 2.

By the bounds on the bias and variance of the posterior mean the right-hand side is smaller than γ for L large enough, uniformly for θ0 Hβ(L).

A.4. Proof of Theorem 4

As we saw in Section A.2, the jth local generalized posterior is a product of Gaussians with means ˆθj i and variances s2 i given by

ˆθj i = n n + σ2i1+2β Y j i , s2 i = σ2

n + σ2i1+2β .

In other words, the jth local measure is a Gaussian measure on ℓ2 with mean ˆθj = (ˆθj i )i and (diagonal) covariance operator R : ℓ2 ℓ2 given by (Rx)i = s2 i xi, which is the same for

An asymptotic analysis of distributed nonparametric methods

every local machine. The Wasserstein barycenter of a ﬁnite collection of Gaussian measures is a Gaussian measure again (e.g. Agueh and Carlier (2011)). By Theorem 3.5 of Gelbrich (1990) the squared 2-Wasserstein distance between the jth local measure and a Gaussian measure on ℓ2 with mean µ and covariance operator K is given by

ˆθj µ 2 2 + tr(R) + tr(K) 2tr p

It follows that the barycenter ΠIV ( | Y) of the local generalized posteriors is the Gaussian measure on ℓ2 with mean ˆθ equal to the average of the local means ˆθj and covariance operator equal to R. In other words, the global posterior is a product of Gaussians with means and variances given by

j=1 ˆθj i , t2 i = s2 i .

So the global posterior mean is the same as in Section A.2 and the posterior spread P t2 i is a factor m larger. It then follows from the considerations in Section A.2 that the squared bias of the global posterior mean is bounded by a constant times M2n 2β/(1+2β), uniformly for θ0 Hβ(M). Moreover, the variance term P s2 i and the posterior spread P t2 i behave like a multiple of n 2β/(1+2β) as well. As was explained in Section A.3, this leads to the statement of the theorem.

A.5. Proof of Theorem 6

The proof of the ﬁrst statement is the same as in Section A.1, since the mean of the global posterior is the same as for the naive averaging method. For the second statement, we observe that for θ0 Hβ(M), the squared bias term for the posterior mean satisﬁes

X σ4m2i2+4β

(n + σ2mi1+2β)2 θ2 0,i M2 X σ4m2i1+2β

(n + σ2mi1+2β)2 M2(n/m) 2β/(1+2β).

for n/m . As was shown in Section A.1 the variance of the posterior mean behaves as m 1/(1+2β)n 2β/(1+2β). Since the spread P t2 i of the posterior is a factor m larger than in Section A.1, it is of the same order (n/m) 2β/(1+2β) as the squared bias term. Since squared bias and spread are of the same order, the variance is of smaller order, and

1+2β X t2 i

is of lower order than P t2 i , the coverage statement can be proved as in Section A.3.

A.6. Technical Lemma

Lemma 9 For s, t > 0 with st > 1 and r < st 1 consider the function f(x) = xr(xs+1) t

and set fν = P k=1 ν 1f(k/ν). Then as ν ,

(i) If r > 1, then fν R 0 f(x)dx.

Szab o and Van Zanten

(ii) If r = 1, then fν log ν.

(iii) If r < 1, then fν νr 1 P k=1 k r.

Proof Assertions (ii) and (iii), along with (i) for 1 < r 0 are proved in Lemma A.1 of Szab o et al. (2013), hence it remains to verify assertion (i) for 0 < r < st 1. Note that

k=1 ν 1f(k/ν) < fν <

k=1 ν 1f(k/ν) +

k=Nν+1 ν 1f(k/ν). (A.2)

Since the function f(x) is continuous on [0, N] it is Riemann integrable (see for instance Theorem 6.8 of Rudin (1976)), hence the right Riemann sum converges to the integral, i.e. PNν k=1 ν 1f(k/ν) R N 0 f(x)dx as ν (for simplicity assume that ν N). Furthermore, for every ν > 0,

k=Nν+1 ν 1f(k/ν) νst r 1 X

k=Nν+1 kr st νst r 1 Z

Nν xr stdx 1 st r 1Nr+1 st

and R N+1 f(x)dx R N+1 xr stdx N st+r+1/(st r 1). Therefore by choosing N suﬃciently large, both sides of (A.2) gets arbitrarily close to R 0 f(x)dx as ν , concluding the proof of the statement.

Appendix B. Proofs for Section 4

B.1. Proof of Lemma 7

The estimator ˆα is the maximizer of the random map α 7 P j ℓj(α), where

ℓj(α) = log Z p(Yj | θ) Π(dθ | α).

The asymptotic behaviour of the local log-marginal likelihood ℓj has been studied in Knapik et al. (2016). Denote the derivative of ℓj with respect to α by ℓj and let k = n/(σ2m) be the local sample size . Moreover, for l > 0, deﬁne

α = inf{α > 0 : hk(α) > l} p

hk(α) = 1 + 2α k1/(1+2α) log k

k2i1+2αθ2 0,i log i (k + i1+2α)2 .

Note that the expectation Eθ0 ℓj(α) does not depend on j. It is proved in Section 5.3 of Knapik et al. (2016) that if l is smaller than some universal threshold, then for every j

lim inf k inf α α

1 + 2α k1/(1+2α) log k Eθ0 ℓj(α) = δ > 0,

An asymptotic analysis of distributed nonparametric methods

Eθ0 sup α α

1 + 2α k1/(1+2α) log k| ℓj(α) Eθ0 ℓj(α)| e C log k

for constants δ, C > 0. But then we also have

lim inf k inf α α

1 + 2α k1/(1+2α) log k Eθ0 X

j ℓj(α) > mδ

and Eθ0 sup α α

1 + 2α k1/(1+2α) log k

j ℓj(α) Eθ0 X

j ℓj(α) me C log k.

By Markov s inequality, it follows that with probability at least 1 C1 exp( C2 log k) the map α 7 P j ℓj(α) is strictly increasing on the interval [0, α]. Hence, on that event we have ˆα α. It remains to show that α β + 1/2. To that end it suﬃces to prove that hk(α) l for all α β + 1/2. To see this, suppose ﬁrst that α < β. Deﬁne Nβ = (n/(σ2 m))1/(1+2β)

and Mα = k1/(1+2α). By deﬁnition of θ0 we then have

k2i2α 2β log i

(i1+2α + k)2

i=Nβ i2α 2β log i + M2k2

i=Mα i 2 2α 2β log i

M2 MαN2α 2β β log Mα Mα log Mα + M2 k2M 1 2α 2β α log Mα Mα log Mα M2

for n, m large enough. Hence, if M is small enough, then hk l for α < β. For β α β + 1/2 we have

i=Nβ i 2 2α 2β log i

M2k2N 1 2α 2β β log Nβ Mα log Mα

= M2(n/σ2) 2α 1+2α 2α 1+2β m 2+ 1 1+2α+ 1+2α+2β

2(1+2β) log Nβ

log Mα m 2α 1+2α log m

for n/m large enough. Together, this shows that if both n/m and m are large enough, then indeed hk(α) l for all α β + 1/2.

B.2. Proof of Theorem 8

In view of the proof of Theorem 4 the jth local generalized posterior is a product of Gaussians with means ˆθj i and variances s2 i given by

ˆθj i = n n + σ2i1+2ˆα Y j i , s2 i = σ2

n + σ2i1+2ˆα .

Szab o and Van Zanten

Using again that the Wasserstein barycenter of a ﬁnite collection of Gaussian measures is a Gaussian measure in combination with the explicit expression for the 2-Wasserstein distance between Gaussians (see Section A.4) we see that the global posterior is a product of Gaussians with means ˆθi and variances t2 i given by

j=1 ˆθj i , t2 i = s2 i .

The posterior mean can be written as ˆθ = ˆθ(ˆα), where ˆθ(α) is the estimator with a ﬁxed choice α for the hyperparameter, i.e.

n n + σ2i1+2α Y j i .

For ﬁxed α we also deﬁne the corresponding expectation E(α) = Eθ0 ˆθ(α). Then by the triangle inequality,

ˆθ θ0 2 E(ˆα) θ0 2 E(ˆα) ˆθ(ˆα) 2.

We have the explicit expressions

E(α) θ0 2 2 = X

σ4i2+4αθ2 0,i (n + σ2i1+2α)2

E(α) ˆθ(α) 2 2 = X

σ2n (n + σ2i1+2α)2

j=1 Zj i 2 .

Since the ﬁrst expression is increasing in α and the second one is decreasing, we see that on the event A = {ˆα β + 1/2} it holds that

σ4θ2 0,ii4+4β

(n + σ2i2+2β)2

σ2n (n + σ2i2+2β)2

j=1 Zj i 2 .

By deﬁnition of θ0, the square of the ﬁrst term on the right is bounded from below by

i (n/(σ2 m)) 1/(1+2β)

(n + σ2i2+2β)2 .

In view of Lemma 9 we see that it is of the order M2(n/ m) 2β/(1+2β). The square of the second term can be written as

σ2n (n + σ2i2+2β)2 U2 i ,

An asymptotic analysis of distributed nonparametric methods

with the Ui independent and standard normal under Pθ0. Again in view of Lemma 9 it is easily seen that the mean and variance of this sum behave as n (1+2β)/(2+2β) and n (3+4β)/(2+2β), respectively. Hence the standard deviation is of smaller order than the mean for large n, so that by Chebychev s inequality the square of the second term is of stochastic order n (1+2β)/(2+2β). Since this is of smaller order than (n/ m) 2β/(1+2β), we conclude that for the global posterior mean we have, for some constant c > 0,

Pθ0( ˆθ θ0 2 c(n/ m) β/(1+2β)) 1

as n/m and m . The spread P t2 i of the global posterior is on the event A bounded by X

1 n + i2+2β ,

which is of the order n (1+2β)/(2+2β) (n/ m) 2β/(1+2β) as well, see Lemma 9. The conclusions of the theorem now follow.

Acknowledgments

The research leading to these results has received funding from the Netherlands Science foundation NWO and from the European Research Council under ERC Grant Agreement 320637.

Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904 924, 2011.

Eduard Belitser et al. On coverage and local radial rates of credible sets. The Annals of Statistics, 45(3):1124 1151, 2017.

Lawrence D. Brown and Mark G. Low. Asymptotic equivalence of nonparametric regression and white noise. Ann. Statist., 24(6):2384 2398, 1996.

Adam D. Bull. Honest adaptive conﬁdence bands and self-similar functions. Electron. J. Statist., 6:1490 1516, 2012.

Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions. Ar Xiv e-prints, October 2014.

Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 685 693, Bejing, China, 22 24 Jun 2014. PMLR.

Marc Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1481 1490, Lille, France, 07 09 Jul 2015. PMLR.

Szab o and Van Zanten

Matthias Gelbrich. On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185 203, 1990.

Evarist Gin e and Richard Nickl. Conﬁdence bands in density estimation. Ann. Statist., 38 (2):1122 1170, 04 2010. doi: 10.1214/09-AOS738.

Evarist Gin e and Richard Nickl. Mathematical foundations of inﬁnite-dimensional statistical models. Cambridge University Press. 2016.

Christina Heinze, Brian Mc Williams, and Nicolai Meinshausen. Dual-loco: Distributing statistical estimation using random projections. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artiﬁcial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 875 883, Cadiz, Spain, 09 11 May 2016. PMLR.

I. M. Johnstone. Gaussian estimation: Sequence and wavelet models. Book draft, 2017.

M. I. Jordan, J. D. Lee, and Y. Yang. Communication-Eﬃcient Distributed Statistical Inference. Ar Xiv e-prints, May 2016.

B. T. Knapik, A. W. van der Vaart, and J. H. van Zanten. Bayesian inverse problems with Gaussian priors. Ann. Statist., 39(5):2626 2657, 10 2011.

B. T. Knapik, B. T. Szab o, A. W. Vaart, and J. H. Zanten. Bayes procedures for adaptive inference in inverse problems for the white noise model. Probability Theory and Related Fields, 164(3):771 813, 2016.

Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer, 2012.

J. D. Lee, Y. Sun, Q. Liu, and J. E. Taylor. Communication-eﬃcient sparse regression: a one-shot approach. Ar Xiv e-prints, March 2015.

Mikhail Anatolevich Lifshits. Gaussian random functions. Springer, 1995.

J. Lu, G. Cheng, and H. Liu. Nonparametric Heterogeneity Testing For Massive Data. Ar Xiv e-prints, January 2016.

S. Minsker, S. Srivastava, L. Lin, and D. B. Dunson. Robust and scalable Bayes via a median of subset posterior measures. Ar Xiv e-prints, March 2014.

J. W. Ng and M. P. Deisenroth. Hierarchical Mixture-of-Experts Model for Large-Scale Gaussian Process Regression. Ar Xiv e-prints, December 2014.

Richard Nickl and Botond Szab o. A sharp adaptive conﬁdence ball for self-similar functions. Stochastic Processes and their Applications, 126(12):3913 3934, 2016.

Michael Nussbaum. Asymptotic equivalence of density estimation and Gaussian white noise. Ann. Statist., 24(6):2399 2430, 1996.

W. Rudin. Principles of Mathematical Analysis. International series in pure and applied mathematics. Mc Graw-Hill, 1976.

An asymptotic analysis of distributed nonparametric methods

Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. Mc Culloch. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11:78 88, 2016.

Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-eﬃcient distributed optimization using an approximate newton-type method. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML 14, pages II 1000 II 1008. JMLR.org, 2014.

Z. Shang and G. Cheng. A Bayesian splitotic theory for nonparametric models. Ar Xiv e-prints, August 2015.

Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artiﬁcial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 912 920, San Diego, California, USA, 09 12 May 2015. PMLR.

B. Szabo and H. van Zanten. Adaptive distributed methods under communication constraints. Ar Xiv e-prints, 2019.

B. T. Szab o, A. W. van der Vaart, and J. H. van Zanten. Empirical Bayes scaling of Gaussian priors in the white noise model. Electron. J. Statist., 7:991 1018, 2013.

Botond Szab o, A. W. van der Vaart, and J. H. van Zanten. Frequentist coverage of adaptive nonparametric Bayesian credible sets. Ann. Statist., 43(4):1391 1428, 08 2015.

Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer, New York, 2009.

Aad van der Vaart and J. H. van Zanten. Bayesian inference with rescaled Gaussian process priors. Electron. J. Statist., 1:433 448, 2007.

S. Volgushev, S.-K. Chao, and G. Cheng. Distributed inference for quantile regression processes. Ar Xiv e-prints, January 2017.

Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Eﬃcient distributed learning with sparsity. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3636 3645, International Convention Centre, Sydney, Australia, 06 11 Aug 2017a. PMLR.

Jialei Wang, Jason Lee, Mehrdad Mahdavi, Mladen Kolar, and Nati Srebro. Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artiﬁcial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1150 1158, Fort Lauderdale, FL, USA, 20 22 Apr 2017b. PMLR.

Szab o and Van Zanten

Y. Zhu and J. Laﬀerty. Distributed Nonparametric Regression under Communication Constraints. Ar Xiv e-prints, March 2018.