Journal of Machine Learning Research 26 (2025) 1-65. Submitted 11/23; Revised 9/24; Published 7/25.

Frequentist Guarantees of Distributed (Non)-Bayesian Inference

Bohan Wu, bw2766@columbia.edu, Department of Statistics, Columbia University, New York, NY 10027, USA

César A. Uribe, cauribe@rice.edu, Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA

Editor: Daniel Roy

Abstract

We establish frequentist properties, i.e., posterior consistency, asymptotic normality, and posterior contraction rates, for the distributed (non-)Bayesian inference problem for a set of agents connected over a network. These results are motivated by the need to analyze large, decentralized datasets, where distributed (non-)Bayesian inference has become a critical research area across multiple fields, including statistics, machine learning, and economics. Our results show that, under appropriate assumptions on the communication graph, distributed (non-)Bayesian inference retains parametric efficiency while enhancing robustness in uncertainty quantification. We also explore the trade-off between statistical efficiency and communication efficiency by examining how the design and size of the communication graph impact the posterior contraction rate. Furthermore, we extend our analysis to time-varying graphs and apply our results to exponential family models, distributed logistic regression, and decentralized detection models.

Keywords: Distributed inference, Bayesian theory, Bernstein-von Mises, communication efficiency, estimation over networks

1. Introduction

Modern datasets are frequently generated and stored by distributed systems, including social media, sensor networks, blockchain, and cloud-based databases. However, transmission costs make analyzing these datasets on a centralized machine prohibitively expensive and, in some cases, infeasible.
To address this challenge, researchers have turned to distributed algorithms that enable decentralized data-driven decision-making under communication constraints (Borkar and Varaiya, 1982; Tsitsiklis and Athans, 1984; Gubner, 1993). In such systems, a set of agents operates within a communication network structure, where each agent can only share information locally with its neighbors. The agents sequentially analyze the data, with each agent performing inference independently and sharing the results through edges defined by the network structure, which may vary over time (Nedić et al., 2017; Uribe et al., 2022b).

©2025 Bohan Wu, César A. Uribe. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1504.html.

Decentralized or distributed Bayesian inference originates in statistics (DeGroot, 1974; Gilardoni and Clayton, 1993). However, it wasn't until the massive advances in computing power in the past decade that the ideas of distributed inference started regaining interest in the statistics community (Uribe et al., 2022a,b). There is a growing line of work on distributed Bayesian inference, which aims to develop scalable and efficient algorithms for posterior computation on large datasets (Jordan et al., 2018). One of the main challenges in this area is to design data-parallel procedures that can handle massive datasets by breaking them into smaller blocks that can be processed independently on individual machines. Much of the current literature focuses on one-shot or embarrassingly parallel approaches, which involve only one round of communication between local machines and a central node at the end of the computational pipeline.
These approaches compute estimators or posterior samples in parallel on local machines, then communicate the estimates to a central node to form a global estimator or approximation to the posterior (e.g., by computing a Wasserstein barycenter of the posteriors computed on local machines). From the Markov chain Monte Carlo (MCMC) perspective, there have been several developments in parallel MCMC methods for distributed Bayesian inference (Neiswanger et al., 2013; Wang and Dunson, 2013; Minsker et al., 2014; Wang et al., 2015; Rabinovich et al., 2015; Scott et al., 2016; Li et al., 2017; Minsker et al., 2017). These methods draw samples from the subset posteriors on parallel agents and combine the samples to obtain an approximation to the posterior measure for the complete data. From the variational Bayes perspective, algorithms such as stochastic variational inference (Hoffman et al., 2013) have been proposed for distributed Bayesian inference. These algorithms distribute the data across machines, implement the local variational updates in parallel through stochastic gradient descent (SGD), and update the global variational parameters as a weighted average of local optima. The variational interpretation of the Bayes rule (Walker, 2006) allows one to pass between the variational optimization problem and the posterior in both directions. In parallel to the success in the statistics community, distributed Bayesian inference has also been studied in microeconomics under the name of non-Bayesian social learning. Notable works in this area have focused on its axiomatic foundations (Epstein et al., 2010), conditions for achieving consensus (Acemoglu et al., 2011), various learning rules (Golub and Sadler, 2017), and the effects of information aggregation (Molavi et al., 2018). This cross-disciplinary interest underscores the broad relevance and applicability of distributed Bayesian inference techniques.
Here, the agents represent individuals seeking to learn about an underlying state of the world θ. Unlike traditional Bayesian approaches, where each agent makes inferences based on the full data, the non-Bayesian learning model (as economists call it) captures how individuals make inferences in the presence of other decision-makers, often with limited access to information and with interactions mediated by social networks. The Bayesian distributed learning framework offers a promising solution that retains the desirable properties of Bayesian learning, such as ease of uncertainty quantification and flexibility, while incorporating a form of information aggregation that aligns with realistic behavioral assumptions. Indeed, the distributed Bayes rule has been shown to reflect reasonable assumptions about individual behavior in society (Jadbabaie et al., 2012). The distributed framework is analytically tractable under certain distributional assumptions and computationally feasible in general. For a more comprehensive literature review, see Molavi et al. (2018). Although economic theory addresses social learning in various strategic environments, it is almost entirely behavioral, with little data involved. Recent work in the signal processing community has explored the statistical properties of social learning in greater depth, much of which focuses on finite parameter spaces. For example, Braca et al. (2010) studies the relative efficiency of binary testing based on a distributed test statistic under local alternative hypotheses and establishes an asymptotic normality result for the test statistic. In the non-Bayesian social learning setting, Shahrampour et al. (2015) assumes a finite parameter space and provides non-asymptotic bounds on the KL divergence between distributed and centralized beliefs. Lalitha et al. (2014) establishes the exponential convergence of beliefs to the truth from a large-deviation perspective. Bordignon et al.
(2021) studies the asymptotic normality of agents' beliefs in the regime of vanishing step sizes. In contrast, Inan et al. (2022) investigates the consistency and convergence rate of agents' beliefs toward the truth when the communication graph is random. The technique for establishing concentration bounds in the discrete, finite parameter space relies heavily on bounding the KL divergence between two candidate distributions. Work beyond the finite, discrete parameter space has been sparse. One example is Uribe et al. (2022a,b), which establishes consistency and concentration bounds for $p_t^j(\theta)$ when $\Theta$ is discrete or compact, respectively. Our work studies the asymptotic properties of non-Bayesian social learning in the regime of continuous parameter spaces (as subsets of $\mathbb{R}^p$), with the sample size (or time) $t$ tending to infinity. Although this regime is underexplored in the social learning literature, it is arguably more natural from the perspective of classical asymptotic theory (Van der Vaart, 2000). Establishing such a theory fills a gap in the existing social learning literature and builds a connection with the distributed Bayesian inference literature in statistics. Distributed Bayesian procedures have attracted substantial interest across disciplines such as electrical engineering, statistics, and economics. Yet, the broad adoption of these methods in the statistical community has been hindered by a lack of rigorous analysis of their statistical properties. Moreover, understanding these properties is key to deepening our knowledge of the consensus behavior of agents within varying communication patterns, a topic of interest to the electrical engineering and economics communities. In this paper, we investigate the distributed Bayesian procedures that arise from applying the stochastic mirror descent (SMD) algorithm to statistical estimation problems (Uribe et al., 2022a,b).
Our work fills a crucial gap by establishing the frequentist properties of such distributed Bayesian procedures, including posterior consistency, asymptotic normality, and posterior contraction rates. We also explore the trade-off between statistical efficiency and communication cost by investigating the relationship between the posterior contraction rate and the structure of the communication graph. Furthermore, we illustrate practical applications of the Bernstein-von Mises results to emphasize their utility in uncertainty quantification. Ultimately, we hope to stimulate further interest in distributed Bayesian methods within the statistical community and to establish a solid theoretical foundation for distributed Bayesian inference in fields such as economics and electrical engineering.

Table 1: Table of Notation

Functions: $D_{\mathrm{KL}}$, Kullback-Leibler divergence; $L$, loss function; $\langle \cdot, \cdot \rangle$, $L_2$ inner product, as in $\langle f, g \rangle = \int f(x) g(x)\, dx$.

Probability distributions: $P, p$, data-generating measure and density; $Pf(X)$, expectation of $f(X)$ when $X \sim P$, same as $\mathbb{E}_P f(X)$; $\Pi, \pi$, prior measure and density; $P, p$, posterior measure and density; $P_t^j, p_t^j$, posterior measure and density for agent $j$ at iteration (time) $t$.

Others: $G, A, A_{ij}$, graph, adjacency matrix, and its $(i, j)$-th entry; $\lambda_j(A)$, $j$-th largest eigenvalue of $A$; $\mathbf{1}$, the vector of all ones.

The rest of the paper is outlined as follows: Section 2 introduces the distributed Bayesian inference problem from an optimization-centric viewpoint and rigorously defines the distributed Bayesian posterior. In Section 3, we outline sufficient conditions for the consistency and asymptotic normality of the distributed Bayesian posterior. Section 4 establishes the consistency of the distributed Bayesian posterior (Theorem 3). Section 5 establishes the Bernstein-von Mises theorems under both correct (Theorem 10) and incorrect model specifications (Theorem 14).
Section 6 provides both abstract and concrete upper bounds on the posterior contraction rate (Theorem 15 and Theorem 18), with an emphasis on model misspecification. Our analysis is extended to time-varying graphs in Section 7, where we establish posterior contraction rates under various communication frequency regimes (Theorem 20). In Section 8, we demonstrate the practical use of our findings by establishing Bernstein-von Mises results for three statistical models, including the exponential family (Proposition 24), logistic regression models (Proposition 25), and the distributed detection problem (Proposition 27), a canonical problem in electrical engineering. The paper concludes in Section 9 with a discussion of future research directions following our findings.

2. Preliminaries

Suppose we observe a sequence of i.i.d. random variables $X_1, X_2, \ldots$, all taking values in a probability space $(\mathcal{X}, \mathcal{B}, P_0)$, where the true distribution $P_0$ is unknown. Moreover, let a family of probability distributions be given in the form $\{P_\theta : \theta \in \Theta\}$, where $(\Theta, \mathcal{A})$ is a measurable space. Each $P_\theta$ is a probability measure defined on $(\mathcal{X}, \mathcal{B})$, and the mapping $\theta \mapsto P_\theta(B)$ is measurable for every $B \in \mathcal{B}$. We refer to $\{P_\theta : \theta \in \Theta\}$ as the statistical model. The parameter space $\Theta$ is typically taken as a subset of a Euclidean or Hilbert space to avoid any measurability issues with $\{P_\theta : \theta \in \Theta\}$ (Ghosal and Van der Vaart, 2017). The centralized statistical estimation problem is to find a subset $\Theta_0 \subseteq \Theta$ such that for $\theta_0 \in \Theta_0$, $P_{\theta_0}$ is the closest to $P_0$ with respect to a metric. Geometrically, the goal is to find a point in the subset of the probability measures $\{P_\theta : \theta \in \Theta\}$ closest to $P_0$ under a given topology. The topology is often defined by a divergence on the space of probabilities, such as the Kullback-Leibler (KL) divergence, Rényi divergence, etc. The definitions of the KL divergence, Rényi divergence, and other divergence functions are reviewed in Section A.1 of the Appendix.
Arguably, the most natural estimation problem to consider is based on the KL divergence:
$$\theta_0 \in \arg\min_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \| P_\theta). \quad (2.1)$$
Equation (2.1) is equivalent to maximum likelihood estimation (MLE) at the population level. For example, if the true distribution $P_0$ is standard normal $N(0, 1)$ and the parametric family $P_\theta$ is the normal location family $N(\theta, 1)$, $\theta \in [-1, 1]$, the optimal value is $\theta_0 = 0$. There are two main challenges in solving (2.1). First, since $P_0$ is unknown, it is typically replaced with samples drawn from the distribution. The resulting approximation error influences the sample complexity of MLE, which has been extensively studied in the statistics literature (Van der Vaart, 2000). The second challenge is computational. When the set $\Theta$ is discrete, non-smooth, or disconnected, default approaches such as first-order optimization methods cannot be directly applied to solving Equation (2.1), even at the population level. Instead of minimizing over $\Theta$, a common approach is to lift the problem by minimizing over the probability distributions on $\Theta$. Let $\Delta_\Theta$ be the space of probability density functions defined on $\Theta$. Then
$$\min_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \| P_\theta) = \min_{p \in \Delta_\Theta} \int_\Theta p(\theta)\, D_{\mathrm{KL}}(P_0 \| P_\theta)\, d\theta, \quad (2.2)$$
where $\Delta_\Theta$ is the (hypothetical) space of all probability distributions over $\Theta$, with $\int_\Theta p(\theta)\, d\theta = 1$ and $p(\theta) \ge 0$. If $\theta_0$ solves Equation (2.1), $p = \delta_{\theta_0}$ solves Equation (2.2); thus the two problems are equivalent. The advantage, however, is that Equation (2.2) is a continuous optimization problem regardless of the topology in which $\Theta$ is defined; that is, $\Theta$ could be a finite discrete set. Equation (2.2) introduces an equivalent formulation of the statistical estimation as a minimization problem over the space of probability measures on $\Theta$. Given the linearity of the expectation functional, the Riesz representation theorem implies the existence of an inner product $\langle \cdot, \cdot \rangle$ that characterizes the expectation over $\Theta$.
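As a sanity check on the equivalence between (2.1) and population-level MLE, the Gaussian location example above can be reproduced numerically. The following sketch is illustrative and not from the paper (sample size and grid are arbitrary choices): it minimizes the empirical negative log-likelihood of $N(\theta, 1)$ over $\theta \in [-1, 1]$ for data drawn from $P_0 = N(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)  # samples from P0 = N(0, 1)

# Empirical surrogate for D_KL(P0 || P_theta), up to a theta-free constant:
# the average negative log-likelihood of N(theta, 1) is 0.5 * mean((x - theta)^2) + const.
thetas = np.linspace(-1.0, 1.0, 201)
neg_loglik = np.array([0.5 * np.mean((x - th) ** 2) for th in thetas])

theta_hat = thetas[np.argmin(neg_loglik)]
assert abs(theta_hat) < 0.1  # close to the population minimizer theta0 = 0
```

Since $D_{\mathrm{KL}}(P_0 \| P_\theta)$ equals the expected negative log-likelihood up to a constant free of $\theta$, the empirical minimizer approaches $\theta_0 = 0$ as the sample grows.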
We then reformulate the KL-minimization problem in terms of linear stochastic optimization, up to a constant:
$$\min_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \| P_\theta) \overset{c}{=} \min_{p \in \Delta_\Theta} \mathbb{E}_{P_0} \langle p, -\log p_\theta(x) \rangle. \quad (2.3)$$
To see this, note that
$$\begin{aligned} \min_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \| P_\theta) &= \min_{p \in \Delta_\Theta} \int_\Theta p(\theta)\, D_{\mathrm{KL}}(P_0 \| P_\theta)\, d\theta \\ &= \min_{p \in \Delta_\Theta} \int_{\mathcal{X}} \int_\Theta p_0(x)\, p(\theta) \log \frac{p_0(x)}{p_\theta(x)}\, d\theta\, dx, \quad \text{by Fubini's theorem} \\ &= \min_{p \in \Delta_\Theta} \int_{\mathcal{X}} \int_\Theta p_0(x)\, p(\theta) \left[\log p_0(x) - \log p_\theta(x)\right] d\theta\, dx \\ &\overset{c}{=} \min_{p \in \Delta_\Theta} \int_{\mathcal{X}} \int_\Theta p_0(x)\, p(\theta) \left[-\log p_\theta(x)\right] d\theta\, dx. \end{aligned}$$
In the reformulated problem, $\Delta_\Theta$ is the probability simplex over $\Theta$, and $\langle p, -\log p_\theta(x) \rangle$ represents the inner product between the simplex vector $p$ and the negative log-likelihood $-\log p_\theta(x)$, taken with respect to the expectation under $P_0$. The resulting problem can be efficiently solved using computational methods from the stochastic optimization literature. One approach is the stochastic mirror descent (SMD) algorithm (Uribe et al., 2022a,b, Algorithm 6), which iterates through a KL-regularized optimization problem:
$$p_{t+1} = \arg\min_{p \in \Delta_\Theta} \left\{ \langle p, -\log p_\theta(x_{t+1}) \rangle + \alpha D_{\mathrm{KL}}(p \| p_t) \right\}. \quad (2.4)$$
At time $t$, the agent's goal is to maximize the expected log-likelihood while staying close to the belief $p_t$. The trade-off between these two objectives is governed by the learning rate $\alpha$. Equation (2.4) is also the variational representation of generalized Bayesian inference:
$$p_{t+1} = \arg\min_{p \in \Delta_\Theta} D_{\mathrm{KL}}(p \| \tilde{p}), \qquad \tilde{p}(\theta) \propto (p_\theta(x_{t+1}))^{1/\alpha}\, p_t(\theta). \quad (2.5)$$
A generalized posterior is a probability distribution $p(\theta \mid X_1, \ldots, X_t) \propto \pi(\theta)\, L_t(\theta)$, where $\pi$ is a prior and $L_t$ is a surrogate likelihood (Miller, 2021). The stochastic mirror descent update in Equation (2.4) produces a generalized Bayesian posterior:
$$p_{t+1}(\theta) \propto (p_\theta(x_{t+1}))^{1/\alpha}\, p_t(\theta). \quad (2.6)$$
The result follows from a standard variational Bayes argument (Knoblauch et al., 2022). The result above suggests a fascinating interplay between stochastic optimization and generalized Bayesian inference. The parameter $\alpha$ serves as both the step size in the SMD update and a temperature parameter in the generalized Bayes rule. When $\alpha$ is set to 1, we recover the standard Bayes theorem.
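The SMD/generalized-Bayes update (2.6) is straightforward to implement on a finite grid over $\Theta$. The minimal sketch below is illustrative and not from the paper (the Bernoulli model, grid, and sample size are arbitrary choices); it shows that with $\alpha = 1$ the update is exactly the standard Bayes rule and the belief concentrates around the truth.

```python
import numpy as np

def smd_update(p, lik, alpha):
    """One SMD step as in (2.6): p_{t+1}(theta) proportional to lik(theta)**(1/alpha) * p_t(theta)."""
    q = lik ** (1.0 / alpha) * p
    return q / q.sum()

# Illustrative finite grid over Theta with a Bernoulli(theta) model; data from theta0 = 0.7.
thetas = np.linspace(0.05, 0.95, 19)
p = np.full(len(thetas), 1.0 / len(thetas))  # uniform prior on the grid
rng = np.random.default_rng(1)
for x in rng.binomial(1, 0.7, size=500):
    lik = thetas ** x * (1 - thetas) ** (1 - x)
    p = smd_update(p, lik, alpha=1.0)  # alpha = 1 recovers the standard Bayes update

assert abs(thetas[np.argmax(p)] - 0.7) < 0.1  # belief concentrates near the truth
```

Taking $\alpha \ne 1$ in `smd_update` tempers the likelihood, which is the generalized-Bayes reading of the step size discussed above.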
In the sequel, we explore the connection between generalized Bayes and SMD to formulate a distributed Bayesian posterior based on the distributed SMD algorithm.

2.1 Distributed Bayesian Inference

In this section, we introduce the distributed Bayesian inference problem. Consider a set of $m$ independent agents, represented by the index $j = 1, \ldots, m$. Each agent independently observes a sequence of random variables at discrete time steps $t = 0, 1, 2, \ldots$. The random variable observed by agent $j$ at time $t$ is denoted by $X_t^j$. The random process observed by agent $j$ is denoted by $X^j$. The collection of random variables all agents observe at time $t$ is denoted by $X_t$. In general, each $X^j$ may be endowed with a different probability space with a common parameter space. For simplicity, we assume that the random variables $X_t^j$ are i.i.d. draws from the probability space $(\mathcal{X}, \mathcal{B}, P)$ with density $p$. Given a common set of parameters $\Theta$, the private parametric statistical model that agent $j$ can access is defined as $\mathcal{P}_\Theta^j = \{P_\theta^j : \theta \in \Theta\}$. Each model $\mathcal{P}_\Theta^j$ has the same support for all $j \in [m]$. At time $t$, the agents interact through an undirected communication graph $G_t = (V, E_t)$, where $V = [m]$ denotes the set of agents. An edge $(j, i) \in E_t$ implies that agent $j$ can communicate with agent $i$ at time $t$. The weighted adjacency matrix associated with $G_t$ is denoted by $A_t$. We assume that $A_t$ is a doubly stochastic matrix obtained by normalizing the matrix $A_t'$, where $[A_t']_{ij} = 1$ if there exists a communication link between agent $i$ and agent $j$, and $[A_t']_{ij} = 0$ otherwise. For undirected graphs, one can always construct a doubly stochastic matrix compatible with the graph by suitably normalizing the adjacency matrix, so $A_t$ is guaranteed to exist. In our framework, agents share information by communicating their beliefs, represented by $p_t^j(\theta)$. This notation represents the posterior distribution of $\theta$ as perceived by agent $j$ at time $t$.
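A doubly stochastic weight matrix compatible with an undirected graph can be built in several standard ways. The sketch below uses the Metropolis-Hastings weighting from the consensus literature (a common construction, not one prescribed by the paper), which yields a symmetric doubly stochastic matrix with positive diagonal on any undirected graph; the toy path graph is illustrative.

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic weights for an undirected graph via the Metropolis-Hastings rule:
    A[i, j] = 1 / (1 + max(deg_i, deg_j)) on edges; the diagonal absorbs the remainder."""
    m = adj.shape[0]
    deg = adj.sum(axis=1)
    A = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and adj[i, j]:
                A[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        A[i, i] = 1.0 - A[i].sum()
    return A

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # path graph on 3 agents: 1 -- 2 -- 3
A = metropolis_weights(adj)
assert np.allclose(A.sum(axis=0), 1) and np.allclose(A.sum(axis=1), 1)  # doubly stochastic
assert np.all(np.diag(A) > 0)  # positive diagonal, as required by Assumption 1(b)
```

Any such matrix can serve as $A_t$ in the updates that follow.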
Similar to the centralized case, we formulate distributed statistical inference as a distributed optimization problem:
$$\min_{\theta \in \Theta} \frac{1}{m} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \| P_\theta^j). \quad (2.7)$$
For each agent $j \in [m]$, let $\pi_j$ represent their initial belief or prior; therefore, $p_0^j = \pi_j$. Similar to the statistical estimation problem for a single agent, the distributed statistical inference problem admits a reformulation via the distributed stochastic mirror descent algorithm (Uribe et al., 2022a,b). The belief of agent $j$ at time $t+1$ is obtained through the following mirror descent update:
$$p_{t+1}^j = \arg\min_{p \in \Delta_\Theta} \left\{ \langle -\log p_\theta^j(x_{t+1}^j), p \rangle + \sum_{i=1}^m [A_t]_{ij}\, D_{\mathrm{KL}}(p \| p_t^i) \right\}, \quad (2.8)$$
which is solved via the update
$$p_{t+1}^j(\theta) \propto p_\theta^j(x_{t+1}^j) \prod_{i=1}^m \left(p_t^i(\theta)\right)^{[A_t]_{ij}}. \quad (2.9)$$
The derivation from (2.8) to (2.9) is identical to that of the single-agent case, (2.4)-(2.6). Equation (2.9) is aptly referred to as the distributed Bayes rule, as it generalizes the Bayes rule to a distributed setting. This distributed Bayes rule induces a distributed Bayesian posterior $p_{t+1}^j$, which might be viewed as the belief of each agent after updating its prior belief based on the new data and the information received from its neighbors. Despite its name, the distributed Bayesian posterior is not a standard posterior distribution. The classic Bayes rule is a special case of the distributed Bayes rule where the communication graph has no edges, i.e., $A_t = I_m$. Then each agent updates independently,
$$p_{t+1}^j(\theta) \propto p_\theta^j(x_{t+1}^j)\, p_t^j(\theta).$$
At the other extreme, where the communication graph $G_t$ is fully connected, each agent can communicate with every other agent. Then the distributed Bayes rule effectively acts as a weighted Bayesian update rule, with equal weights assigned to all agents. The resulting posterior is a product of the likelihood of the incoming data and a tempered posterior with power $1/m$:
$$p_{t+1}^j(\theta) \propto p_\theta^j(x_{t+1}^j) \prod_{i=1}^m \prod_{k=1}^{t} \left(p_\theta^i(x_k^i)\right)^{1/m} \prod_{i=1}^m \pi_i(\theta)^{1/m}.$$
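A small simulation of the distributed Bayes rule (2.9) on a grid illustrates how agents' beliefs concentrate around the common parameter even though each agent observes only its own data stream. All modeling choices below (path graph, Gaussian likelihoods, grid resolution, run length) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
m, T = 3, 400
thetas = np.linspace(-2.0, 2.0, 81)  # grid over Theta; true parameter theta0 = 0.5
A = np.array([[2/3, 1/3, 0.0],       # doubly stochastic weights for a path graph
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])
logP = np.log(np.full((m, len(thetas)), 1.0 / len(thetas)))  # row j: agent j's log-belief

for _ in range(T):
    x = rng.normal(0.5, 1.0, size=m)  # agent j observes x_t^j ~ N(0.5, 1)
    loglik = np.array([-0.5 * (x[j] - thetas) ** 2 for j in range(m)])
    logP = A.T @ logP + loglik        # distributed Bayes rule (2.9) in log space
    logP -= logP.max(axis=1, keepdims=True)              # for numerical stability
    logP -= np.log(np.exp(logP).sum(axis=1, keepdims=True))  # renormalize each belief

P = np.exp(logP)
modes = thetas[P.argmax(axis=1)]
assert np.all(np.abs(modes - 0.5) < 0.2)  # every agent's belief peaks near theta0
```

Working in log space turns the geometric mixing of neighbors' beliefs in (2.9) into the matrix product `A.T @ logP`, which is why the update is one line.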
These examples show that a distributed Bayesian posterior can adapt to the underlying communication structure, balancing individualistic updates (when there is no communication) and collective updates (when communication happens according to the network). Our strategy is to analyze $p_{t+1}^j(\theta)$ through the lens of the generalized posterior. In the next section, we provide the measure-theoretic definition of the distributed Bayesian posterior as a generalized Bayesian posterior.

2.2 Distributed Bayesian Posterior

This section provides a rigorous definition of the distributed Bayesian posterior we study throughout this paper. Define the prior measure $\Pi$ as follows: for $B \in \mathcal{A}$,
$$\Pi(B) = \frac{\int_B \prod_{j=1}^m \pi_j(\theta)^{1/m}\, d\theta}{\int_\Theta \prod_{j=1}^m \pi_j(\theta)^{1/m}\, d\theta}.$$
For each $t$ and $j$, denote $z_t^j = \int_\Theta p_\theta^j(x_t^j) \prod_{k=1}^{t-1} \prod_{i=1}^m p_\theta^i(x_k^i)^{[\prod_{\tau=k}^{t-1} A_\tau]_{ij}}\, \Pi(d\theta)$ and assume $z_t^j < \infty$. The distributed Bayesian posterior is defined as the probability measure $P_t^j$ such that for $B \in \mathcal{A}$,
$$P_t^j(B) = \frac{1}{z_t^j} \int_B p_\theta^j(x_t^j) \prod_{k=1}^{t-1} \prod_{i=1}^m p_\theta^i(x_k^i)^{[\prod_{\tau=k}^{t-1} A_\tau]_{ij}}\, \Pi(d\theta). \quad (2.10)$$
The measure-theoretic definition of the distributed Bayesian posterior is equivalent to the recursive formulation (2.9). The definition shows that the distributed Bayesian posterior is the product of the prior distribution and a surrogate likelihood. The asymptotic properties of posteriors with surrogate likelihoods have been studied in Miller (2021), which establishes sufficient assumptions for the generalized posterior to achieve consistency, asymptotic normality (Bernstein-von Mises), and correct frequentist coverage. However, one key difference between the distributed Bayesian posterior and the setting in Miller (2021) is that the latter depends only on a sample of size $t$. In contrast, the surrogate likelihood used in the distributed Bayesian posterior depends on the sample of size $mt$, the statistical models $\mathcal{P}_\Theta^j$, and the communication graph structure $G$.
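To see why the product form (2.10) matches the recursion (2.9), one can unroll the recursion for two steps; the display below is an illustrative sketch starting from $p_0^i = \pi_i$, not a derivation quoted from the paper.

```latex
p_1^j(\theta) \propto p_\theta^j(x_1^j) \prod_{i=1}^m \pi_i(\theta)^{[A_0]_{ij}},
\qquad
p_2^j(\theta) \propto p_\theta^j(x_2^j) \prod_{i=1}^m p_\theta^i(x_1^i)^{[A_1]_{ij}}
\prod_{l=1}^m \pi_l(\theta)^{[A_0 A_1]_{lj}}.
```

Iterating, the data term from time $k$ acquires the exponent $[\prod_{\tau=k}^{t-1} A_\tau]_{ij}$ appearing in (2.10), while the prior exponents $[A_0 \cdots A_{t-1}]_{lj}$ approach $1/m$ as $t$ grows (cf. Lemma 1), which is consistent with defining $\Pi$ through the geometric average $\prod_j \pi_j(\theta)^{1/m}$.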
Therefore, the theoretical analysis of the distributed Bayesian posterior requires careful adjustments of the standard results for generalized posteriors to these new constraints and challenges. We define the following surrogate loss functions $f_t^j$, $f_t$, $f$ on $\Theta$:
$$f_t^j(\theta) = -\frac{1}{t} \left( \log p_\theta^j(x_t^j) + \sum_{k=1}^{t-1} \sum_{i=1}^m \Big[\prod_{\tau=k}^{t-1} A_\tau\Big]_{ij} \log p_\theta^i(X_k^i) \right), \quad j \in [m], \quad (2.11)$$
$$f_t(\theta) = -\frac{1}{mt} \sum_{k=1}^{t} \sum_{i=1}^m \log p_\theta^i(X_k^i), \quad (2.12)$$
$$f(\theta) = -\frac{1}{m} \sum_{i=1}^m P_0 \log p_\theta^i. \quad (2.13)$$
The function $f_t^j$ plays the role of a generalized likelihood; $f_t$ corresponds to the empirical mean of $-\log p_\theta^j(X)$ and is a special case of $f_t^j$ when the adjacency matrix has all weights equal to $1/m$. Informally, $f_t^j$ and $f_t$ are asymptotically indistinguishable under mild assumptions, but $f_t$ has much nicer statistical behavior, so one should expect the distributed Bayesian posterior to behave like a generalized posterior with likelihood given by $f_t$.

3. Assumptions

Throughout the paper, we impose the following assumptions: (i) $\Theta$ is an open subset of $\mathbb{R}^p$ with the standard Euclidean metric $d$. (ii) There exists a unique parameter $\theta_0 \in \Theta$ that minimizes Problem (2.7), that is,
$$\theta_0 = \arg\min_{\theta \in \Theta} \frac{1}{m} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \| P_\theta^j). \quad (3.1)$$
The remaining assumptions fall into four categories: communication graph structure, regularity of private statistical models, prior mass assumptions, and consistent testing assumptions. The latter three are standard in Bayesian asymptotics with a well-established history (see Appendix Section A.2). The assumptions on the communication graph are relaxed in Section 7 to allow for temporal dependence. We assume that the communication graph $G_t$, $t \ge 1$, is static and undirected, i.e., $G_t = G = (V, E)$.

Assumption 1 (Graph Assumptions) The communication graph $G_t$ is static and undirected, i.e., $G_t = G = (V, E)$ for all $t \ge 1$. Moreover, $G$ and the adjacency matrix $A$ satisfy: (a) $A$ is symmetric and row stochastic (and hence doubly stochastic). (b) $A$ has positive diagonal entries, i.e., $a_{ii} > 0$ for all $i \in [m]$. (c) $G$ is connected, i.e., a path exists from any agent $i$ to any agent $j$.
The assumption of connectedness is standard in the literature on network communication to ensure information flow between agents (Shahrampour et al., 2015; Nedić et al., 2017). The three assumptions ensure that the Markov chain with transition matrix $A$ is irreducible and aperiodic. We proceed to establish a result on the convergence of $A^t$ to $\frac{1}{m}\mathbf{1}\mathbf{1}^\top$ and an upper bound on the rate of this convergence.

Lemma 1 [Acemoglu et al. (2011); Nedić et al. (2017)] Let Assumption 1 hold. The matrix $A$ satisfies
$$\sum_{k=1}^{t} \sum_{j=1}^m \left| [A^{t-k}]_{ij} - \frac{1}{m} \right| \le \frac{16\, m^2 \log m}{\nu},$$
where $\nu$ is the smallest positive entry in $A$.

Lemma 1 states that, for fixed $A$, the sum $\sum_{k=1}^{t} \sum_{j=1}^m |[A^{t-k}]_{ij} - \frac{1}{m}|$ is uniformly bounded in $t$ and scales in the order of $m^2 \log m$. The argument for Lemma 1 relies crucially on the geometric convergence rate of an irreducible, aperiodic Markov chain. See Section C.4 in the Appendix for the proof. The next group of assumptions concerns the smoothness of the log-likelihoods of the private statistical models.

Assumption 2 (Regularity Assumptions) The regularity assumptions on the statistical model $\mathcal{P}_\Theta^j = \{P_\theta^j, \theta \in \Theta\}$ involve, for every $j \in [m]$ and almost surely $[P_0]$: (a) For every $\theta \in \Theta$, $P_0 |\log p_\theta^j| < \infty$, i.e., the expectation of $|\log p_\theta^j(X)|$ is finite under $X \sim P_0$. (b) The mapping $\theta \mapsto -\log p_\theta^j(x)$ is convex in a neighborhood of $\theta_0$ for every $x$, and the entrywise inequality $f'(\theta_0 - \epsilon) < 0 < f'(\theta_0 + \epsilon)$ holds for small $\epsilon > 0$. (c) The statistical model $\mathcal{P}_\Theta^j$ is differentiable in quadratic mean at $\theta_0$ with nonsingular Fisher information matrix $V_{\theta_0}^j$. (d) There exists a measurable function $s^j$ with $P_0 [s^j(X)]^2 < \infty$ such that, for every $\theta_1, \theta_2$ in a neighborhood of $\theta_0$, $|\log p_{\theta_1}^j(x) - \log p_{\theta_2}^j(x)| \le s^j(x)\, \|\theta_1 - \theta_2\|$. (e) The mapping $\theta \mapsto \log p_\theta^j(x)$ is twice continuously differentiable in a neighborhood of $\theta_0$ for every $x$, and the Fisher information matrix $V_{\theta_0}^j$ exists and is nonsingular.
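The uniform-in-$t$ boundedness asserted in Lemma 1 is easy to observe numerically. The sketch below is illustrative (a 4-cycle with Metropolis-type weights; it checks only the boundedness and geometric tail decay, not the specific $m^2 \log m$ constant): it computes the partial sums $\sum_{k=1}^{t}\sum_{j=1}^m |[A^{t-k}]_{ij} - 1/m|$ for increasing $t$.

```python
import numpy as np

# Doubly stochastic matrix for a 4-cycle (each agent: two neighbors plus a self-loop).
m = 4
A = np.zeros((m, m))
for i in range(m):
    A[i, (i + 1) % m] = A[i, (i - 1) % m] = 1 / 3
    A[i, i] = 1 / 3

def partial_sum(t, i=0):
    """sum_{k=1}^{t} sum_j |[A^{t-k}]_{ij} - 1/m|, the quantity bounded in Lemma 1."""
    total, P = 0.0, np.eye(m)  # P runs through A^0, A^1, ..., A^{t-1}
    for _ in range(t):
        total += np.abs(P[i] - 1 / m).sum()
        P = P @ A
    return total

vals = [partial_sum(t) for t in (5, 20, 80)]
assert vals[0] <= vals[1] <= vals[2]  # partial sums increase in t ...
assert vals[2] - vals[1] < 1e-6       # ... but converge: the tail decays geometrically
```

The geometric decay of the increments reflects the spectral gap of $A$, which is the mechanism behind the proof sketch described above.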
(f) The mapping $\theta \mapsto \log p_\theta^j(x)$ is three times continuously differentiable in a neighborhood of $\theta_0$ for every $x$, the Fisher information matrix $V_{\theta_0}^j$ exists and is nonsingular, and the third-order partial derivatives are uniformly bounded by an integrable function in a neighborhood of $\theta_0$.

The regularity conditions described in Assumption 2 are common prerequisites for the asymptotic analysis of posterior distributions (Van der Vaart, 2000). The intuition is that the differentiability or convexity of the log-likelihood ensures the consistency of the maximum likelihood estimators, and differentiability in quadratic mean enables a valid second-order Taylor expansion around the truth. This allows us to describe the asymptotic properties of the posterior using properties of the MLEs. The next group of assumptions is typically referred to as the prior mass or prior thickness assumption.

Assumption 3 (Prior Assumptions) For every $j \in [m]$, the prior distribution $\Pi_j$ satisfies: (a) $\Pi_j(U_\epsilon) > 0$ for all $\epsilon > 0$, where $U_\epsilon = \{\theta \in \Theta : \frac{1}{m} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \| P_\theta^j) < \epsilon\}$. (b) The density $\pi_j$ is continuous and positive at $\theta_0$.

The first assumption states that the prior puts a sufficient amount of mass in a KL neighborhood of the target distribution $P_0$. An extra continuity assumption on the density $\pi$ at $\theta_0$ is sometimes assumed to connect the first assumption to a statement about the neighborhood of $\theta_0$ (Van der Vaart, 2000). Uniform, consistent testing assumptions enable $\theta_0$ to be distinguished with a sequence of test functions. This assumption ensures that asymptotically negligible mass is placed outside a neighborhood of $\theta_0$.

Assumption 4 (Uniform Consistent Testing) Let $\Theta \subseteq \mathbb{R}^p$. For each $j \in [m]$: (a) For all $\epsilon > 0$, there exists $\delta > 0$ such that $\lim_{t \to \infty} P_0\left( \inf_{\|\theta - \theta_0\| > \delta} f_t^j(\theta) - f_t^j(\theta_0) \ge \epsilon \right) = 1$. (b) There exists a sequence of test functions $\varphi_t$ such that $P_{\theta_0}^j \varphi_t(X^{(mt)}) \to 0$ and $\sup_{\|\theta - \theta_0\| \ge \epsilon} P_\theta^j \left(1 - \varphi_t(X^{(mt)})\right) \to 0$ for every $\epsilon > 0$.
Assumption 4(b) is attributed to Theorem 10.1 in Van der Vaart (2000), and Assumption 4(a) is given first in Ghosh and Ramamoorthi (2003). The two assumptions in Assumption 4 play an identical role in the theory but are slightly different; we mainly use Assumption 4(a).

4. Posterior Consistency

This section establishes posterior consistency for the distributed Bayesian posterior. Theorem 3, to be presented next, provides a general concentration result for the distributed Bayesian posterior $P_t^j$ over the measurable space $(\Theta, \mathcal{A})$. The proof of Theorem 3 takes its cues from Schwartz's theorem (Schwartz, 1965). This fundamental theorem implies two supporting findings. Corollary 4 shows that Theorem 3 holds under model misspecification and alternative definitions of the neighborhood around the true parameter. Moreover, Lemma 5 lays down more user-friendly sufficient conditions for the regularity assumption on the log-likelihood (Assumption 2(a)) in Theorem 3. First, we introduce a supporting lemma to demonstrate that the surrogate loss functions $f_t^j$ and $f_t$ are asymptotically equivalent, and both converge to the population loss function $f$. This lemma suggests that, under the conditions of a connected communication graph and a first-moment condition satisfied by the statistical model, the distributed Bayesian posterior $P_t^j$ becomes asymptotically equivalent to a posterior derived from $f_t$.

Lemma 2 Let Assumptions 1 and 2(a) hold. Then $f_t^j$ and $f_t$ converge to $f$ on $\Theta$ in $[P_0]$-probability.

The proof can be found in Section C.1 of the Appendix. The argument is based on a distributed version of the law of large numbers and the geometric convergence of the adjacency matrices described in Lemma 1. We are ready to state the main result of this section.

Theorem 3 (Consistency) Let $\theta_0 \in \Theta$ be defined as in (3.1), $\epsilon > 0$, and denote $U_\epsilon = \{\theta \in \Theta : \frac{1}{m} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \| P_\theta^j) < \epsilon\}$. Moreover, let Assumptions 1, 2(a), 3(a) hold.
Then the distributed Bayesian posterior $P_t^j$ defined in (2.10) has the following property:
$$\lim_{t \to \infty} P_t^j(U_\epsilon) = 1, \quad \text{in } [P_0]\text{-probability, for each } \epsilon > 0 \text{ and } j \in [m].$$
See Section C.1 for the proof. The proof uses the standard technique based on Schwartz's theorem (Miller, 2021). The primary insight is that, given the defined regularity assumptions on the private statistical models and the presence of a connected communication graph, the distributed Bayesian posterior concentrates at the same point as the $\frac{1}{m}$-geometric average of the individual Bayesian posteriors. Theorem 3 does not assume correct model specification. When the model is misspecified, the distributed Bayesian posterior concentrates around the unique minimizer $\theta_0$ of Problem (2.7), which is assumed to exist. Under additional assumptions, the neighborhoods on the space of probability measures used in stating Theorem 3 can be substituted with neighborhoods on $\Theta$.

Corollary 4 Let $(\Theta, d)$ be a metric space, $\theta_0 \in \Theta$, $\epsilon > 0$, and denote $N_\epsilon = \{\theta \in \Theta : d(\theta, \theta_0) < \epsilon\}$. Moreover, let Assumptions 1, 2(a), 2(b), 3(c) hold. Then the distributed Bayesian posterior $P_t^j$ defined in (2.10) has the following property:
$$\lim_{t \to \infty} P_t^j(N_\epsilon) = 1, \quad \text{in } [P_0]\text{-probability, for each } \epsilon > 0 \text{ and } j \in [m].$$
The result follows directly from Theorem 3 of Miller (2021); hence, the proof is omitted. In practice, Assumption 2(a) is challenging to verify due to the lack of information on the moments of the unknown distribution. The expectation may not exist under a heavy-tailed distribution like the Cauchy distribution. In the result below, we provide an information-theoretic equivalent of Assumption 2(a) that is easier to check.

Lemma 5 Assumption 2(a) is equivalent to either of the following assumptions: (a) $P_0$ is absolutely continuous with respect to $P_\theta^j$ for all $\theta \in \Theta$;
When $P_0$ has a density $p_0$, the absolute continuity assumption is equivalent to requiring that if $p_0(x) > 0$, then $p^j_\theta(x) > 0$ for every $x$ (Cover, 1999). The second assumption is the minimal assumption for the problem of distributed statistical estimation to be tractable. Both assumptions are easy to check, as illustrated by the following example.

Example 1 Assume that $P^j_\theta = N(\theta, \sigma_j^2)$ for known $\sigma_j^2 > 0$, and $P_0 = N(\theta_0, \sigma_0^2)$ for unknown mean $\theta_0$ and variance $\sigma_0^2$. A Gaussian distribution with positive variance has support on the whole real line, so $P_0$ is absolutely continuous with respect to $P^j_\theta$ for all $j$. For the second assumption, the KL divergence between $P^j_\theta$ and $P_0$ has an explicit formula:
\[ D_{\mathrm{KL}}(P^j_\theta \,\|\, P_0) = \frac{(\theta - \theta_0)^2}{2\sigma_0^2} + \frac{1}{2}\left( \frac{\sigma_j^2}{\sigma_0^2} - 1 - \log \frac{\sigma_j^2}{\sigma_0^2} \right). \]
For finite $\sigma_j^2$, $\sigma_0^2$, and $\theta_0$, the KL divergence is finite for every $\theta \in \mathbb{R}$.

5. Asymptotic Normality

The Bernstein-von Mises (BvM) theorem states that, under certain conditions on the prior, the posterior distribution approximates a Gaussian distribution centered at a consistent estimator, such as the maximum likelihood estimator, as the amount of data increases. This theorem is pivotal in Bayesian statistics for at least two reasons. First, it provides a quantitative description of how the posterior contracts to the truth. Second, it justifies Bayesian credible sets as valid frequentist confidence sets; i.e., sets of posterior probability $1-\alpha$ contain the true parameter at confidence level $1-\alpha$. In this section, we establish Bernstein-von Mises theorems for the distributed Bayesian posterior defined in Equation (2.10). Our results address the asymptotic normality of the distributed Bayesian posterior under both correct and incorrect model specifications. We consider scenarios where all agents accurately specify the model, as well as scenarios where some agents gather observations from a true distribution that does not belong to their statistical models.
Theorem 10 provides sufficient conditions for the distributed posterior to converge to a normal distribution centered around a sequence of M-estimators $\hat\theta^j_t$. Theorem 14 gives an analogous result under classical assumptions, with the normal approximation centered around $\theta_0$ (defined in Equation (3.1)). The supporting lemmas are provided before the main results. Given the abstract nature of the assumptions, Theorem 10 is supplemented by Corollary 11 and Corollary 12, which outline more user-friendly assumptions that guarantee the same Bernstein-von Mises (BvM) results.

The Bernstein-von Mises argument relies critically on the following sequence of M-estimators:
\[ \hat\theta^j_t = \arg\min_{\theta \in \Theta} f^j_t(\theta). \quad (5.1) \]
The existence and consistency of $\hat\theta^j_t$ is not always guaranteed; one set of sufficient conditions is provided as follows.

Lemma 6 Let Assumptions 1, 2(a), 2(f), and 3(b) hold. Then the probability that the equation $\nabla f^j_t(\hat\theta^j_t) = 0$ has at least one solution converges to 1 as $t \to \infty$, and there exists a sequence of solutions $\hat\theta^j_t$ such that $\hat\theta^j_t \xrightarrow{P_0} \theta_0$.

This is a direct consequence of Theorem 5.42 of Van der Vaart (2000); thus, the proof is omitted. Lemma 6 states that the sequence of M-estimators exists under third-order smoothness assumptions on the log-likelihood. From now on, we assume the existence of M-estimators $\hat\theta^j_t$ that satisfy (5.1) for every $t, j$. The high-order smoothness assumption on the private log-likelihoods (Assumption 2(f)) is a strong assumption for guaranteeing consistency. It can be replaced with a more relaxed and amenable convexity assumption (Assumption 2(b)).

Lemma 7 Let Assumptions 1, 2(a), and 2(b) hold. Then $\hat\theta^j_t \xrightarrow{P_0} \theta_0$.

The next two lemmas are useful in the proof of Theorem 10.

Lemma 8 Let Assumptions 1, 2(c), and 4(a) hold, and let $\hat\theta^j_t \xrightarrow{P_0} \theta_0$.
Then for every $\epsilon > 0$, there exists $\delta > 0$ such that
\[ \lim_{t \to \infty} P_0\Big( \inf_{\|\theta - \hat\theta^j_t\| > \delta} \big( f^j_t(\theta) - f^j_t(\hat\theta^j_t) \big) \geq \epsilon \Big) = 1. \]

The result states that if a sequence of estimators $\hat\theta^j_t$ is consistent at $\theta_0$, then the existence of uniformly consistent tests at $\theta_0$ (Assumption 4(a)) implies an analogous result at $\hat\theta^j_t$.

Lemma 9 Let $\hat\theta_t \in \mathbb{R}^p$ be such that $\hat\theta_t \xrightarrow{P_0} \theta_0$ for some $\theta_0 \in \mathbb{R}^p$, and let $\pi_t$ be a density with respect to Lebesgue measure on $\mathbb{R}^p$. Suppose $q_t$ is the density of $a_t(\theta - \hat\theta_t)$, where $\theta \sim \pi_t$ and $a_t^{-1} = o(1)$. If $\int |q_t(x) - q(x)|\,dx \to 0$ in $[P_0]$-probability for some probability density $q$, then for every $\epsilon > 0$, $\lim_{t \to \infty} \Pi_t(N_\epsilon) = 1$, where $\Pi_t$ is the measure with density $\pi_t$ and $N_\epsilon$ is the neighborhood of $\theta_0$ defined in Corollary 4.

The proofs of Lemmas 6, 7, 8, and 9 can be found in Section C.2 of the Appendix. Lemmas 6 and 7 rely on the usual consistency argument for M-estimators. Lemmas 8 and 9 use measure-theoretic arguments. Our first main result of the section provides general sufficient conditions under which a distributed Bayesian posterior exhibits asymptotic normality and an asymptotically correct Laplace approximation, along with posterior concentration at $\theta_0$.

Theorem 10 (BvM) Let $\Theta$ be an open subset of $\mathbb{R}^p$. Assume that there exists $\theta_0 \in \Theta$ such that $P_0 = P^j_{\theta_0}$ for every $j \in [m]$. Moreover, let Assumptions 1, 2(a), 2(c), 2(d), 3(b), and 4(a) hold. If the sequence $\hat\theta^j_t$ defined in (5.1) satisfies $\hat\theta^j_t \xrightarrow{P_0} \theta_0$, then
\[ f^j_t(\theta) = f^j_t(\hat\theta^j_t) + \tfrac{1}{2}(\theta - \hat\theta^j_t)^T \hat V^j_t (\theta - \hat\theta^j_t) + r^j_t(\theta - \hat\theta^j_t), \quad (5.2) \]
where $\hat V^j_t$ is a sequence of matrices that converges in probability to the average Fisher information matrix $V_{\theta_0} = \frac{1}{m}\sum_{i=1}^m V^i_{\theta_0}$, and $|r^j_t(h)| = O(\|h\|^3)$ for large enough $t$.

Moreover, under Equation (5.2) and Assumptions 3(b) and 4(a), as $t \to \infty$ we have
\[ \int_{B_\epsilon(\theta_0)} p^j_t(\theta)\,d\theta \xrightarrow{P_0} 1 \quad \text{for every } \epsilon > 0; \quad (5.3) \]
that is, the distributed Bayesian posterior $p^j_t$ is weakly consistent around $\theta_0$. Let $q^j_t$ be the density of $\sqrt{t}(\theta - \hat\theta^j_t)$ when $\theta \sim P^j_t$. Then,
\[ \int \big| q^j_t(x) - N(x; 0, V_{\theta_0}^{-1}) \big|\,dx \xrightarrow{P_0} 0. \]
(5.4)

That is, the total variation distance between $q^j_t$ and $N(0, V_{\theta_0}^{-1})$ vanishes in $[P_0]$-probability.

See Section C.2 for the proof. The assumptions of Theorem 10 involve structural and statistical conditions in line with the Bernstein-von Mises literature. These include a connected communication graph (Assumption 1), a bounded entropy condition on the private statistical models (Assumption 2(a)), a distributed version of the differentiability in quadratic mean (DQM) assumption (Assumption 2(c)), and a Lipschitz-gradient regularity assumption on the log-likelihood around $\theta_0$ (Assumption 2(d)). The M-estimators $\hat\theta^j_t$ are assumed to be consistent, with sufficient conditions outlined in Lemmas 6 and 7. These regularity assumptions are fairly mild and apply to, for example, most exponential family models. They serve as the foundation for Equation (5.2), which mirrors the local asymptotic normality assumption in classical BvM theory (Le Cam and Yang, 2000). Additionally, we impose a prior mass assumption (Assumption 3(b)) and an assumption on uniformly consistent testing (Assumption 4(a)). The prior mass assumption (Assumption 3(b)) is slightly different from the one used in Theorem 3, but the two are equivalent when $f$ is continuous at $\theta_0$. These, together with Equation (5.2), are the main sufficient conditions for the BvM results in Equations (5.3) and (5.4).

The result establishes a parametric sampling complexity: after $t$ observations per agent, all agents achieve parametric efficiency when performing distributed inference over a connected network, comparable to inference on a single centralized node. While parametric efficiency is not surprising in light of BvM, our result also shows that communication among agents enhances the robustness of uncertainty quantification. In Theorem 10, the precision matrix of the limiting Gaussian is the average of the Fisher information matrices $V_{\theta_0}$ across agents.
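The averaging effect can be made concrete with a small numerical sketch (our own illustration, not from the paper): for scalar Gaussian models $N(\theta, \sigma_j^2)$, the per-agent Fisher information is $1/\sigma_j^2$, and the limiting posterior precision of every agent is the network average, which caps the damage a single high-variance (e.g., contaminated) agent can do.

```python
m = 3
sigma2 = [1.0, 1.0, 100.0]              # agent 3 is contaminated: huge noise variance
fisher = [1.0 / s2 for s2 in sigma2]    # per-agent Fisher information I_j = 1/sigma_j^2
V_avg = sum(fisher) / m                 # average Fisher information V_{theta_0}

# Limiting asymptotic variance per effective sample under the distributed BvM:
var_distributed = 1.0 / V_avg
# Variance agent 3 would face if it relied only on its own private model:
var_isolated = 1.0 / fisher[2]

print(var_distributed, var_isolated)
assert var_distributed < var_isolated   # communication shrinks the contaminated agent's uncertainty
```

Here the contaminated agent's limiting variance drops from 100 to roughly 1.5, because its posterior precision is pulled toward the network average rather than its own degraded Fisher information.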
The intuition is that each agent's uncertainty about the ground truth eventually aligns with the network's average level of uncertainty, regardless of the agent's initial uncertainty or the specific statistical model it uses. This suggests that communication can act as a protective mechanism in adversarial settings, where the posteriors of a few agents might suffer from abnormally high uncertainty due to data contamination.

In the social learning context, Theorem 10 extends existing asymptotic results to a large-time, continuous-$\Theta$ setting. Previous results, such as Uribe et al. (2022a,b), established consistency and concentration bounds for $p^j_t(\theta)$ when $\Theta$ is discrete or compact. In comparison, Theorem 10 allows $\Theta$ to be an open subset of $\mathbb{R}^p$. Beyond recovering posterior consistency, Theorem 10 provides a richer asymptotic characterization of the distributed posterior: it guarantees (1) the frequentist coverage of posterior credible sets and (2) (rate-)efficient and robust inference under the distributed posterior.

The differentiability in quadratic mean (DQM) assumption (Assumption 2(c)) in Theorem 10 may be difficult to verify in practical settings. We therefore replace the abstract DQM assumption with a second-order smoothness assumption on the private log-likelihoods (Assumption 2(e)).

Corollary 11 Theorem 10 holds if Assumption 2(c) is replaced with Assumption 2(e).

Both the differentiability in quadratic mean (DQM) assumption (Assumption 2(c)) and the Lipschitz-gradient assumption (Assumption 2(d)) can be replaced with a third-order smoothness assumption (Assumption 2(f)), which is often called the classical condition for asymptotic normality of M-estimators (Van der Vaart, 2000).

Corollary 12 Theorem 10 holds if Assumptions 2(c) and 2(d) are replaced with Assumption 2(f).

The proofs of Corollary 11 and Corollary 12 are based on bounding the second- and third-order terms in the Taylor expansion of $f^j_t(\theta)$, respectively.
More model regularity assumptions can be substituted by bounding higher-order terms in the Taylor expansion.

The classical Bernstein-von Mises theorem relies on a stochastic version of the local asymptotic normality (LAN) assumption (Van der Vaart, 2000). For social learning, we introduce a distributed version of the stochastic LAN condition. A distributed family of statistical models $(\{P^j_\theta\}_{j \in [m]}, G)$ satisfies stochastic LAN at $\theta \in \Theta$ if, for every $j \in [m]$ and with respect to a scaling factor $\epsilon^j_t \to 0$, there exist a random vector $\Delta^j_{t,\theta}$, bounded in probability, and a non-singular matrix $V_\theta$ such that for every compact subset $K \subset \mathbb{R}^p$, as $t \to \infty$,
\[ \sup_{h \in K} \Big| -t f^j_t(\theta + \epsilon^j_t h) + t f^j_t(\theta) - h^T V_\theta \Delta^j_{t,\theta} + \tfrac{1}{2} h^T V_\theta h \Big| \xrightarrow{P_0} 0. \quad (5.5) \]

The random vector $\Delta^j_{t,\theta}$ is often called the local sufficient statistic, and $V_\theta$ in our case corresponds to the average Fisher information. It is worth noting that the matrix $V_\theta$ is required to be the same across all agents. The scaling factor $\epsilon^j_t$ is chosen to ensure that $\Delta^j_{t,\theta}$ is bounded in probability and $V_\theta$ converges to a non-singular matrix. A default choice is $\epsilon^j_t = 1/\sqrt{t}$. However, an agent-dependent choice of $\epsilon^j_t$ is available if any private statistical model has a different rate of contraction. A sufficient condition for the distributed LAN is a distributed version of the differentiability in quadratic mean (DQM) condition.

Lemma 13 Let Assumptions 1, 2(a), and 2(c) hold. Then the distributed family of statistical models $(\{P^j_\Theta\}_{j \in [m]}, G)$ is stochastically locally asymptotically normal (LAN) at $\theta_0$.

We now extend the BvM result to the case where the true data-generating distribution satisfies $P_0 \neq P^j_{\theta_0}$ for some $j \in [m]$; in other words, we allow at least one agent to have a misspecified model.

Theorem 14 (Misspecified BvM) Let $\theta_0 \in (\Theta, d)$ be the true parameter defined in Equation (3.1). Moreover, let Assumptions 3(b) and 4(b) hold, and assume that the distributed stochastic LAN condition (5.5) holds at $\theta_0$.
Let a sequence of constants $\epsilon^j_t$ satisfy that for every $M_t \to \infty$,
\[ P^j_t\big( \theta : d(\theta, \theta_0) \geq M_t \epsilon^j_t \big) \xrightarrow{P_0} 0. \quad (5.6) \]
If $q^j_t$ is the density of $(\theta - \theta_0)/\epsilon^j_t$ when $\theta \sim P^j_t$, then
\[ \int_\Theta \big| q^j_t(x) - N(x; 0, V_{\theta_0}^{-1}) \big|\,dx \xrightarrow{P_0} 0, \]
for $V_{\theta_0}$ defined in Equation (5.5).

Since the covariance matrix $V_{\theta_0}^{-1}$ in the misspecified Bernstein-von Mises theorem fails to match the average of the sandwich covariance matrices, the posterior credible sets derived from $P^j_t$ do not have valid frequentist coverage. While these sets may be properly centered at $\hat\theta^j_t$, their width may be inaccurate, and they do not typically correspond to confidence sets at level $1-\alpha$.

A sequence $\epsilon^j_t$ that satisfies Equation (5.6) is called a posterior contraction rate of the distributed Bayesian posterior $P^j_t$. This quantity determines the convergence rate of $P^j_t$ to the unknown parameter $\theta_0$. Results controlling the contraction rates are provided in the next section.

6. Contraction Rate

Contraction rates quantify the speed at which a posterior distribution approaches the true parameter of the data-generating distribution. Controlling the contraction rates not only refines our understanding of posterior consistency but also helps control the sampling complexity in the misspecified Bernstein-von Mises theorem (Theorem 14). Unlike the previous sections, which focus on asymptotic results, this section provides non-asymptotic bounds and scaling laws for the posterior contraction rates. Our results involve the sample size $t$, the dimension $p$, and the number of agents $m$. For two positive sequences $x_t$ and $y_t$, we write $x_t \lesssim y_t$ to denote the existence of a constant $c$, independent of $t$, such that $x_t \leq c y_t$. Furthermore, we write $x_t \asymp y_t$ when $x_t \lesssim y_t$ and $y_t \lesssim x_t$.

Let $(\Theta, d)$ be a metric space. A sequence of constants $\epsilon^j_t$ is a posterior contraction rate for $P^j_t$ at the parameter $\theta_0$ if, for every $M_t \to \infty$,
\[ P^j_t\big( \theta : d(\theta, \theta_0) \geq M_t \epsilon^j_t \big) \xrightarrow{P_0} 0. \quad (6.1) \]
The posterior contraction rate is not a unique quantity: any rate slower than a contraction rate is also a contraction rate.
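Definition (6.1) can be illustrated in a conjugate toy case (our own sketch, outside the paper's formal results): for a $N(\theta, 1)$ model with a flat-prior posterior $N(\bar x_t, 1/t)$, the posterior mass outside a ball of radius $M_t/\sqrt{t}$ around $\theta_0$ vanishes as $M_t \to \infty$, so $\epsilon_t = 1/\sqrt{t}$ is a contraction rate.

```python
import math

def normal_tail_outside(center, sd, radius):
    """P(|X| >= radius) for X ~ N(center, sd^2), via the error function."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - (cdf((radius - center) / sd) - cdf((-radius - center) / sd))

# Posterior N(xbar_t, 1/t) with xbar_t - theta0 of order 1/sqrt(t); as M_t grows,
# the posterior mass outside the M_t / sqrt(t) ball around theta0 = 0 vanishes.
for t, M_t in [(10, 2.0), (1000, 4.0), (100000, 6.0)]:
    center = 1.0 / math.sqrt(t)          # hypothetical posterior-mean offset
    mass_out = normal_tail_outside(center, 1.0 / math.sqrt(t), M_t / math.sqrt(t))
    print(t, mass_out)
```

The printed masses shrink from roughly 0.16 toward essentially zero, matching the requirement that the escape probability tends to 0 for any diverging sequence $M_t$.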
Although the fastest rate is desirable, it may be hard to find. The natural goal is to establish a tight upper bound for the optimal rate. We refer to this upper bound as the contraction rate.

We define a probability measure $\bar P_\theta$ on the product space $\mathcal{X}^m$. For a measurable set $A \subset \mathcal{X}^m$, we have
\[ \bar P_\theta(A) = \frac{\int_A \prod_{j=1}^m \big[ p^j_\theta(x^j) \big]^{1/m}\, dx^1 \cdots dx^m}{\int_{\mathcal{X}^m} \prod_{j=1}^m \big[ p^j_\theta(x^j) \big]^{1/m}\, dx^1 \cdots dx^m}. \]
The density is given by $\bar p_\theta(x) \propto \prod_{j=1}^m [p^j_\theta(x^j)]^{1/m}$ for $x \in \mathcal{X}^m$. Given the prior measure $\Pi$, we define the distributed ideal posterior $\bar P_t$ as the posterior measure corresponding to $\bar P_\theta$ and $\Pi$. For a measurable set $B \subset \Theta$, we have
\[ \bar P_t(B) = \frac{\int_B \prod_{k=1}^t \bar p_\theta(x_k)\, \Pi(d\theta)}{\int_\Theta \prod_{k=1}^t \bar p_\theta(x_k)\, \Pi(d\theta)} = \frac{\int_B \prod_{k=1}^t \prod_{j=1}^m [p^j_\theta(x^j_k)]^{1/m}\, \Pi(d\theta)}{\int_\Theta \prod_{k=1}^t \prod_{j=1}^m [p^j_\theta(x^j_k)]^{1/m}\, \Pi(d\theta)}. \quad (6.2) \]

Let $D_\rho(p \,\|\, q)$ denote the $\rho$-Rényi divergence, as defined in Section A.1 of the Appendix. We establish a contraction rate of the distributed Bayesian posterior $P^j_t$ of the form
\[ \epsilon^2_{m,t} + \frac{1}{mt}\, P_0 D_{\mathrm{KL}}(P^j_t \,\|\, \bar P_t) + \frac{1}{m^2 t} \sum_{i=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^i_{\theta_0}). \quad (6.3) \]
Here $\epsilon_{m,t}$ is the contraction rate of the distributed ideal posterior $\bar P_t$. The second term quantifies the approximation error when the distributed Bayesian posterior is approximated by the corresponding distributed ideal posterior under the true distribution $P_0$. The last term captures the minimum average discrepancy between the distributed models and the true data-generating distribution.

The main theorem of the section is stated under the prior mass and testing framework. The assumptions are sub-exponential refinements of the assumptions for the Bernstein-von Mises theorems: (a) the prior is required to put a minimal amount of mass in a neighborhood of the true parameter; (b) restricted to a subset of the parameter space, there exists a test function that can distinguish the truth from the complement of its neighborhood; (c) the prior is essentially supported on the subset described in (b).

Theorem 15 (Contraction Rate) Suppose $\epsilon_{m,t}$ is a sequence such that $mt\epsilon^2_{m,t} \gtrsim 1$.
Let $C_0, C_1, C_2, C_3 > 0$ be constants such that $C_0 > C_2 + C_3 + 2$. Let the following assumptions hold:

1. For any $\epsilon > \epsilon_{m,t}$, there exists a set $\Theta_t(\epsilon)$ and a test function $\varphi_t$ such that
\[ \bar P_{\theta_0} \varphi_t(X^{(mt)}) + \sup_{\substack{\theta \in \Theta_t(\epsilon) \\ d(\theta, \theta_0) \geq C_1 \epsilon^2}} \bar P_\theta\big(1 - \varphi_t(X^{(mt)})\big) \leq \exp(-C_0 t \epsilon^2). \quad (C1) \]

2. For any $\epsilon > \epsilon_{m,t}$, the set $\Theta_t(\epsilon)$ above satisfies
\[ \Pi\big( \Theta_t(\epsilon)^c \big) \leq \exp(-C_0 t \epsilon^2). \quad (C2) \]

3. For some constant $\rho > 1$,
\[ \Pi\Big( \theta : \frac{1}{m} \sum_{j=1}^m D_\rho(P^j_{\theta_0} \,\|\, P^j_\theta) \leq C_3 \epsilon^2_{m,t} \Big) \geq \exp(-C_2 t \epsilon^2_{m,t}). \quad (C3) \]

Then, for the distributed Bayesian posterior $P^j_t$ defined in (2.10), we have
\[ P_0 P^j_t\, d(\theta, \theta_0) \leq C \Big( \epsilon^2_{m,t} + \gamma^2_{j,m,t} + \frac{1}{m^2 t} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^j_{\theta_0}) \Big) \quad (6.4) \]
for some constant $C$ depending on $C_0, C_1$, where the quantity $\gamma^2_{j,m,t}$ is defined as
\[ \gamma^2_{j,m,t} = \frac{1}{mt}\, P_0 D_{\mathrm{KL}}(P^j_t \,\|\, \bar P_t). \]

The proof of Theorem 15 can be found in Section C.2 of the Appendix. It is a consequence of supporting lemmas that make use of the Gibbs variational representation and the sub-exponential decay of $d(\theta, \theta_0)$ under $\bar P_t$. Assumptions (C1) and (C2) refine Assumption 4 on uniformly consistent testing: they state that there is a sequence of tests whose sum of Type I and Type II errors decreases exponentially with the sample size, where the alternative hypothesis is taken in a set that is large enough under the prior. Assumption (C3) refines the prior mass Assumption 3 by stating that the prior mass decreases exponentially away from a $\rho$-Rényi neighborhood of the true distribution. This assumption is slightly stronger than the equivalent assumption stated with the KL neighborhood because $D_\rho(P \,\|\, Q) \geq D_{\mathrm{KL}}(P \,\|\, Q)$ for $\rho > 1$.

The contraction rate is the sum of three terms. The first term, $\epsilon^2_{m,t}$, is the contraction rate of the distributed ideal posterior. The second term, $\gamma^2_{j,m,t}$, characterizes the distance between the distributed Bayesian posterior $P^j_t$ and the ideal posterior $\bar P_t$. A larger or less connected communication graph means more deviation between the two distributions, which slows the contraction rate.
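The ordering $D_\rho \geq D_{\mathrm{KL}}$ for $\rho > 1$ is easy to confirm numerically on discrete distributions; the sketch below (our own check, not part of the paper) compares the two divergences for a pair of Bernoulli laws.

```python
import math

def kl(p, q):
    """KL divergence D_KL(Bernoulli(p) || Bernoulli(q)), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def renyi(p, q, rho):
    """rho-Renyi divergence D_rho(Bernoulli(p) || Bernoulli(q)), rho != 1."""
    s = p ** rho * q ** (1 - rho) + (1 - p) ** rho * (1 - q) ** (1 - rho)
    return math.log(s) / (rho - 1)

p, q = 0.7, 0.4
d_kl = kl(p, q)
for rho in [1.5, 2.0, 4.0]:
    assert renyi(p, q, rho) >= d_kl   # Renyi dominates KL for every rho > 1
print(d_kl, renyi(p, q, 2.0))
```

The Rényi divergence is also nondecreasing in $\rho$ (with $D_\rho \to D_{\mathrm{KL}}$ as $\rho \to 1$), which is why a $\rho$-Rényi neighborhood is contained in the corresponding KL neighborhood.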
The last term, $\frac{1}{m^2 t} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^j_{\theta_0})$, penalizes the rate by the average discrepancy between the truth and its distributed approximation. By Markov's inequality, the upper bound on $P_0 P^j_t\, d(\theta, \theta_0)$ is indeed a contraction rate for the distributed Bayesian posterior $P^j_t$. This also allows us to obtain a point estimate $\hat\theta$ that converges to the KL minimizer $\theta_0$ at the same rate for convex loss functions.

Corollary 16 Under the assumptions of Theorem 15, for any diverging sequence $M_t \to \infty$, we have
\[ P_0 P^j_t\Big( d(\theta, \theta_0) > M_t \Big( \epsilon^2_{m,t} + \gamma^2_{j,m,t} + \frac{1}{m^2 t} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^j_{\theta_0}) \Big) \Big) \to 0. \]
Furthermore, if $d(\theta, \theta_0)$ is convex in $\theta$, the distributed posterior mean $\hat\theta = \int_\Theta \theta\, dP^j_t(\theta)$ satisfies
\[ P_0\, d(\hat\theta, \theta_0) \leq C \Big( \epsilon^2_{m,t} + \gamma^2_{j,m,t} + \frac{1}{m^2 t} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^j_{\theta_0}) \Big), \]
where $C$ is the same constant as in (6.4).

The contraction rate defined in Theorem 15 is somewhat abstract because terms such as $\epsilon^2_{m,t}$ and $\gamma^2_{j,m,t}$ do not directly inform the design of the underlying communication network. For practical purposes, it is preferable to characterize the rate in terms of design parameters such as $m$, $t$, and $\nu$ to guide the design of a statistically efficient network structure. To this end, the following section offers more concrete upper bounds on the terms $\epsilon^2_{m,t}$ and $\gamma^2_{j,m,t}$.

The upper bounds for $\epsilon^2_{m,t}$ can be directly borrowed from the existing theory on posterior contraction rates under model misspecification (Kleijn and van der Vaart, 2012). For finite-dimensional models $\{\bar P_\theta, \theta \in \Theta\}$, the optimal contraction rate is given by $t^{-1/2}$. However, this result does not follow from Theorem 15 because it requires a more restrictive metric entropy assumption involving the second-order KL divergence; see, for example, Theorem 2.2 of Kleijn and van der Vaart (2012). The general theory for posterior contraction rates typically combines a prior mass assumption, often in the form of (C3), with either a model entropy assumption or a consistent testing assumption in the form of (C1) and (C2).
For a review of the theory of posterior contraction rates, see Chapter 8 of Ghosal and Van der Vaart (2017) and the references therein. As touched upon in Section 2.2, one should expect the distributed Bayesian posterior to be well approximated by the distributed ideal posterior as the sample size increases. The next result provides a uniform bound on the approximation error $\gamma^2_{j,m,t}$.

Lemma 17 Let Assumptions 1 and 2(a) hold. For $P^j_t$ defined in (2.10) and $\bar P_t$ defined in (6.2), we have
\[ \gamma^2_{j,m,t} \lesssim \frac{16\, m \log m}{\nu} \Big( |P_0 \log p_0| + \max_{i \in [m]} \inf_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \,\|\, P^i_\theta) \Big). \]

The upper bound on $\gamma^2_{j,m,t}$ consists of two factors: the first depends on the structure of the graph, and the second depends on the graph and the worst-case model misspecification error, denoted by $\max_{i \in [m]} \inf_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \,\|\, P^i_\theta)$. The intuition is that there is a tradeoff between the size of the communication network and statistical efficiency, and the communication cost can be much higher if one or more agents in the network use misspecified models. If all models are correctly specified, the contraction rate degrades with $m$ at a rate of $m \log m$. If the term $\max_{i \in [m]} \inf_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \,\|\, P^i_\theta)$ scales at a rate of $m^2$, then the contraction rate degrades at a much higher rate of $m^3 \log m$.

If we ignore the constants and focus on the scaling law, the contraction rate of the distributed Bayesian posterior is a function of the sample size, the number of agents, the smallest positive adjacency weight (spectral gap), the worst-case model misspecification error, and the average model misspecification error. We formalize this in the following result.

Theorem 18 (Practical Contraction Rate) Let Assumptions 1 and 2(a), and the assumptions of Theorem 15, hold. Let the distributed ideal posterior $\bar P_t$ satisfy a contraction rate of $\epsilon^2_{m,t} \asymp t^{-1}$. For the distributed Bayesian posterior $P^j_t$ defined in (2.10), we have:
\[ P_0 P^j_t\, d(\theta, \theta_0) \lesssim \frac{1}{t} + \frac{m \log m}{\nu} \Big( 1 + \max_{i \in [m]} \inf_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \,\|\, P^i_\theta) \Big) + \frac{1}{m^2 t} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^j_{\theta_0}). \]
The result follows directly from Theorem 15 and Lemma 17; thus, the proof is omitted. This theorem outlines how the design parameters determine the contraction rate of distributed Bayesian posteriors. The second term of the formula is arguably the most interesting: as the number of agents $m$ increases, the contraction rate degrades proportionally to $m \log m$ and inversely in $\nu$, the smallest positive adjacency weight. This shows that both the scale of the agent network and the minimal communication bandwidth (spectral gap) critically influence the posterior's contraction rate. Additionally, the error scales linearly in the worst-case and average model misspecification errors, which shows that deviations from the ideal model assumptions penalize the contraction rate.

7. Extension to Time-Varying Graphs

This section extends our theory to a specific time-varying connectivity scenario: an alternation between a fully connected graph and a graph of fully isolated agents. While this is a simplistic setting, it is theoretically rich enough to reveal a phase-transition behavior of inference under time-varying communication. Let $G_t = (V, E_t)$, $t \geq 1$, be a sequence of time-varying graphs, where the node set $V$ is fixed but the edge set $E_t$ changes over time.

Assumption 5 Let Assumption 1 be satisfied for a graph $G$ with adjacency matrix $A$. We assume that $G_t$ and $A_t$ are independent random graphs and matrices such that $G_t = G$ and $A_t = A$ with probability $\lambda \in [0, 1]$, respectively, and $A_t = I_m$ otherwise, where $I_m$ is the identity matrix.

Assumption 5 proposes a setting that is useful for several reasons. Theoretically, it leads to a rigorous study of the tradeoff between statistical efficiency and communication cost, allowing us to explore what constitutes an optimal level of communication given a targeted level of statistical efficiency. On the practical side, this setting can be conceptualized as a network experiencing complete communication failure with probability $1 - \lambda$ at each step.
In such scenarios, the primary objective is to quantify how this level of intermittent connectivity impacts the overall learning quality within the system. We provide an analogue of Lemma 1 in the setting of Assumption 5.

Proposition 19 Let Assumption 5 hold and $m \geq 2$. Then, with probability 1, the sequence of adjacency matrices $(A_t)$ obeys a scaling law with three regimes: if $\lambda \geq \frac{2}{m}$, the consensus bound scales as $\frac{1}{\nu}\big(16 m^2 \log m + 8 m^2 \log \frac{1}{\lambda}\big)$; if $0 < \lambda < \frac{2}{m}$, the bound is independent of $\lambda$; and if $\lambda = 0$, consensus is never reached. In these inequalities, $\nu$ denotes the smallest positive entry of $A$.

Proposition 19 provides the scaling laws for three communication regimes. Consensus can be reached as long as communication occurs with non-zero probability. However, suppose the goal is to achieve optimal scaling behavior relative to the size of the communication graph ($m$). Then communication needs to happen with a minimum probability of $\frac{2}{m}$, with a higher frequency being more desirable. On the other hand, if there is a need to adhere to the low-frequency regime ($0 < \lambda < \frac{2}{m}$) due to constraints such as high communication costs, the optimal strategy becomes one of minimal communication. Notably, within this regime, a decrease in communication frequency does not affect the rate of attaining consensus among the agents.

Recall that the contraction rate of $P^j_t$ decomposes into three pieces:
\[ \epsilon^2_{m,t} + \frac{1}{mt}\, P_0 D_{\mathrm{KL}}(P^j_t \,\|\, \bar P_t) + \frac{1}{m^2 t} \sum_{i=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^i_{\theta_0}). \]
The structure of the communication network affects the contraction rate through the second term. The following corollary illustrates this.

Corollary 20 Let Assumptions 5 and 2(a) hold. For $P^j_t$ defined in (2.10) and $\bar P_t$ defined in (6.2), the following holds with probability 1:
If $\lambda \geq \frac{2}{m}$,
\[ \frac{1}{mt}\, P_0 D_{\mathrm{KL}}(P^j_t \,\|\, \bar P_t) \lesssim \frac{16 m \log m + 8 m \log \frac{1}{\lambda}}{\nu} \Big( |P_0 \log p_0| + \max_{i \in [m]} \inf_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \,\|\, P^i_\theta) \Big). \]
If $0 < \lambda < \frac{2}{m}$,
\[ \frac{1}{mt}\, P_0 D_{\mathrm{KL}}(P^j_t \,\|\, \bar P_t) \lesssim \frac{4 m^2}{\nu} \Big( |P_0 \log p_0| + \max_{i \in [m]} \inf_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \,\|\, P^i_\theta) \Big). \]

In Corollary 20, the phrase "with probability 1" is relative to the measure defined over the sequence $A_1, A_2, \ldots$
, where each $A_t$ is an independent random matrix taking one of two deterministic values ($A$ with probability $\lambda$, and $I_m$ otherwise). Importantly, this is the same probability measure that underlies Proposition 19, and we continue to use this measure in the sequel.

Varying the communication frequency $\lambda$ affects the statistical efficiency of distributed Bayesian inference over the network. In the following result, we demonstrate the impact of $\lambda$ on the contraction rates.

Corollary 21 Let Assumptions 5 and 2(a), and the assumptions of Theorem 15, hold. Let $\epsilon^2_{m,t} \asymp t^{-1}$. Then for the distributed Bayesian posterior $P^j_t$ defined in (2.10), the following holds with probability 1:
If $\lambda \geq \frac{2}{m}$,
\[ P_0 P^j_t\, d(\theta, \theta_0) \lesssim \frac{1}{t} + \frac{m \log m + m \log \frac{1}{\lambda}}{\nu} \Big( 1 + \max_{i \in [m]} \inf_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \,\|\, P^i_\theta) \Big) + \frac{1}{m^2 t} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^j_{\theta_0}). \]
If $0 < \lambda < \frac{2}{m}$,
\[ P_0 P^j_t\, d(\theta, \theta_0) \lesssim \frac{1}{t} + \frac{m^2}{\nu} \Big( 1 + \max_{i \in [m]} \inf_{\theta \in \Theta} D_{\mathrm{KL}}(P_0 \,\|\, P^i_\theta) \Big) + \frac{1}{m^2 t} \sum_{j=1}^m D_{\mathrm{KL}}(P_0 \,\|\, P^j_{\theta_0}). \]

The distributed ideal posterior $\bar P_t$ does not depend on the communication graph; thus, Theorem 15 holds, and the assumption that $\epsilon^2_{m,t} \asymp t^{-1}$ remains justified. Since the only term in the contraction rate that depends on the communication graph is $\gamma^2_{j,m,t}$, the argument for Corollary 21 builds directly on Theorem 15 and Corollary 20. See Section C.3 in the Appendix for the proof.

Corollary 21 is, to our knowledge, the first result on the impact of time-varying communication networks on the efficiency of distributed statistical inference. The result somewhat defies intuition. One might naturally expect a communication-statistical tradeoff, in which increasing the communication frequency leads to faster posterior contraction rates. While higher communication frequencies ($\lambda \geq \frac{2}{m}$) do indeed result in faster contraction rates, scaled by a factor of $\log \frac{1}{\lambda}$, the situation is different when the frequency falls below $\frac{2}{m}$. In this regime, the communication frequency $\lambda$ no longer has any effect, and the contraction rate depends solely on the number of agents, at a rate of $m^2$ rather than $m \log m$.
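The two regimes can be probed with a small simulation (our own sketch under Assumption 5, using a hypothetical 4-agent ring with Metropolis weights rather than the paper's graphs): at each step the mixing matrix is $A$ with probability $\lambda$ and the identity otherwise, and we track how far the running product is from the perfect-consensus matrix $\frac{1}{m}\mathbf{1}\mathbf{1}^T$.

```python
import random

def consensus_error(lam, steps=40, seed=0):
    """Max-entry deviation of the product A_steps ... A_1 from the all-1/m matrix,
    when A_t = A (4-agent ring, Metropolis weights) w.p. lam and A_t = I otherwise."""
    m = 4
    A = [[0.50, 0.25, 0.00, 0.25],
         [0.25, 0.50, 0.25, 0.00],
         [0.00, 0.25, 0.50, 0.25],
         [0.25, 0.00, 0.25, 0.50]]
    rng = random.Random(seed)
    P = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(steps):
        if rng.random() < lam:                       # communication round: P <- A @ P
            P = [[sum(A[i][k] * P[k][j] for k in range(m)) for j in range(m)]
                 for i in range(m)]
        # otherwise A_t = I_m and the running product is unchanged
    return max(abs(P[i][j] - 1.0 / m) for i in range(m) for j in range(m))

# More frequent communication -> closer to consensus after the same number of steps.
errs = [consensus_error(lam) for lam in (0.0, 0.3, 0.9)]
print(errs)
assert errs[0] > errs[1] > errs[2]
```

With $\lambda = 0$ the product stays at the identity (no consensus), while each communication round contracts the deviation geometrically, so larger $\lambda$ yields a smaller error after the same horizon.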
This suggests the possibility of a phase-transition phenomenon at $\lambda = \frac{2}{m}$, worth exploring in future research. In practical terms, if the goal is to balance communication efficiency and statistical performance in a large network, a communication frequency of $\lambda = \frac{2}{m}$ appears to be the optimal choice.

8. Illustrative Examples

8.1 Exponential Family Distributions

Let the distributed statistical models $(\{P^j_\Theta\}_{j \in [m]}, G)$ be well-specified, and let Assumption 1 be satisfied for $G$. Let $\eta : \Theta \to \mathbb{R}^p$ and $T^j : \mathcal{X} \to \mathbb{R}^p$ be sufficient statistics, and let $\psi_j : \Theta \to \mathbb{R}$ and $h : \mathcal{X} \to \mathbb{R}$ be normalizing functions. We assume that $P^j_\theta$ is a member of the canonical exponential family
\[ p^j_\theta(x) = h(x) \exp\big( \langle \theta, T^j(x) \rangle - \psi_j(\theta) \big). \quad (8.1) \]
The exponential family includes the commonly used Gaussian, exponential, gamma, chi-square, Beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, inverse Wishart, and geometric distributions. In this section, we consider only the canonical exponential family and refer interchangeably to the exponential and canonical exponential families.

Exponential family distributions are convenient for Bayesian inference because of their conjugacy properties. For example, the posterior is a member of the exponential family if the prior and likelihood are members of the exponential family. This property is preserved in the distributed setting. Consider a scenario where the beliefs of all agents at time $t$ belong to the natural exponential family. In this setting, the belief of agent $i$ concerning the parameter $\theta$ can be written as
\[ p^i_t(\theta) \propto \exp\big( \langle \theta, \chi^i_t \rangle - B^i(\theta) \big), \]
where $\chi^i_t$ represents the sufficient statistic for agent $i$ at time $t$, and $B^i(\theta)$ is the log-partition function.
Wu and Uribe We update the one-step-ahead posterior pj t+1 with the distributed Bayes rule, pj t+1(θ) log pθ(Xj t+1) i=1 pi t(θ) exp( θ, T(Xj t+1) ψj(θ)) exp( θ, i=1 Aijχi t θ, T j(Xj t+1) + i=1 Aijχi t ψj(θ) The distribution pj t+1 is a member of the exponential family with sufficient statistic T j(Xj t+1)+ Pm i=1 Aijχi t and the log-partition function ψj(θ) + Pm i=1 Bi(θ). This provides an easy-toimplement algorithm to learn the distributed Bayesian posterior. Let the prior be a member of the natural exponential family. π(θ) = g(u) exp θ, u ψ0(θ) . (8.2) Leveraging the conjugate property of the exponential family, we derive a closed-form expression for the density of the distributed Bayesian posterior P j t . Lemma 22 Let assumption (1) hold. Assume that the likelihood pi θ has an exponential family form given by Equation (8.1). Assume that the prior Π has an exponential family form given by Equation (8.2). Then the distributed Bayesian posterior P j t defined in (2.10) is given by the following formula: pj t(θ) = h(X(mt)) exp( θ, χj t + u Bj t (θ) ψ0(θ)), (8.3) i=1 [At k ji ]T i(Xi k), Bj t (θ) = i=1 [At k ji ]ψi(θ), (8.4) are the sufficient statistics and the log-partition function, respectively. An exponential family is full-rank if no linear combination of the sufficient statistic is constant. For example, we say the distribution pi θ in Equation (8.1) is full-rank if no linear combination of the D dimensional sufficient statistic T i(X) = [T i 1(X), , T i D(X)] leads to a constant. This is a mild assumption on the form of the exponential family. Lemma 23 Assume that int(Θ) = . Assume that the likelihood Pj θ given by Equation (8.1) and prior Π given by Equation (8.2) belong to exponential families with full rank. Then P j t belongs to an exponential family of full rank. Moreover, the gradient of the log-partition function Bj t is invertible on the interior of Θ. 
Lemma 23 establishes the conditions under which the distributed Bayesian posterior belongs to a full-rank exponential family. Critically, the invertibility of $\nabla B^j_t$ is instrumental for constructing a sequence of consistent M-estimators for $\theta_0$. These estimators serve as the centering sequence for the Laplace approximation, which is key to proving our Bernstein-von Mises result. Let $\hat\theta^j_t$ be the M-estimator corresponding to $f^j_t$, as defined in Equation (2.11):
\[ \hat\theta^j_t = \arg\max_{\theta \in \mathbb{R}^p}\ \langle \theta, \chi^j_t \rangle - B^j_t(\theta). \]
The closed-form expression for $\hat\theta^j_t$ is
\[ \hat\theta^j_t = (\nabla_\theta B^j_t)^{-1}(\chi^j_t), \quad (8.5) \]
where $\chi^j_t$ and $B^j_t$ are defined in Equation (8.4). We now state the asymptotic property of the distributed Bayesian posterior for exponential family distributions.

Proposition 24 Let $\Theta$ be an open subset of $\mathbb{R}^p$. Assume that there exists $\theta_0 \in \Theta$ such that $P_0 = P^j_{\theta_0}$ for every $j \in [m]$. Moreover, let Assumptions 1 and 2(a) hold. Let the sequence of estimators $\hat\theta^j_t$ be defined in Equation (8.5), and let $q^j_t$ be the density of $\sqrt{t}(\theta - \hat\theta^j_t)$ when $\theta \sim P^j_t$. Then
\[ \int_\Theta \big| q^j_t(x) - N(x; 0, V_{\theta_0}^{-1}) \big|\, dx \xrightarrow{P_0} 0, \]
where $V_{\theta_0}$ is the average of the covariances of the $T^i$ evaluated at $\theta_0$, i.e., $V_{\theta_0} = \frac{1}{m} \sum_{i=1}^m \mathrm{Cov}_{\theta_0}(T^i)$.

The BvM result suggests a sample-based Laplace approximation to the distributed Bayesian posterior belonging to the canonical exponential family. One example of such an approximation is the normal distribution with mean given by the moment estimator $\hat\theta^j_t$ and covariance given by $\frac{1}{t}\big( \frac{1}{m} \sum_{i=1}^m \hat V(T^i) \big)^{-1}$, where $\hat V(T^i)$ is the bootstrapped sample covariance of $T^i$. By Proposition 24 and Slutsky's theorem, the total variation distance between $P^j_t$ and the normal distribution $N\big( \hat\theta^j_t,\, \frac{1}{t}\big( \frac{1}{m} \sum_{i=1}^m \hat V(T^i) \big)^{-1} \big)$ converges to zero asymptotically. The Laplace approximation enables direct calculation of an asymptotically valid credible region. Let $\chi^2_{\alpha,p}$ be the critical value of a $\chi^2$ distribution with $p$ degrees of freedom.
If $\theta \sim N\big(\hat\theta^j_t, V_{\theta_0}^{-1}\big)$, a credible region for $\theta$ at level $1-\alpha$ is given by the set of $\theta$ that satisfies
$$
(\hat\theta^j_t - \theta)^\top \Big(\frac{1}{m}\sum_{i=1}^m \hat V(T^i)\Big)\, (\hat\theta^j_t - \theta) \le \chi^2_{\alpha,p},
$$
where $\hat V(T^i)$ can be replaced by any consistent estimator of $\mathrm{Cov}(T^i)$. By Proposition 24 and Slutsky's theorem, the credible region has asymptotic coverage $1-\alpha$ under $P^j_t$.

8.2 Distributed Logistic Regression

We consider i.i.d. observations $D^j_k = (X^j_k, Y^j_k)$ for $k \in [t]$ and $j \in [m]$, where the covariates satisfy $X^j_k \in \mathcal{X}$ for $\mathcal{X} \subseteq \mathbb{R}^p$ and the responses satisfy $Y^j_k \in \{0,1\}$. Let $\theta_0 \in \mathbb{R}^p$ be the true and unknown parameter. The data are generated by a logistic regression model:
$$
Y^j_k \sim \mathrm{Ber}(\nu^j_k), \qquad \log\frac{\nu^j_k}{1-\nu^j_k} = \theta_0^\top X^j_k. \tag{8.6}
$$
For a logistic regression model with coefficient $\theta$, the conditional distribution $p_\theta(y^j_k \mid x^j_k)$ after marginalizing out $\nu^j_k$ is expressed as
$$
p_\theta(y^j_k \mid x^j_k) = \exp\big(\langle \theta, x^j_k y^j_k\rangle - \sigma(\langle \theta, x^j_k\rangle)\big), \tag{8.7}
$$
where $\sigma(\eta) = \log(1 + e^\eta)$. The function $\sigma(\eta)$ is strictly convex, since $\sigma''(\eta) = \frac{e^\eta}{(1+e^\eta)^2} > 0$. Conditioned on the covariates, the model (8.7) takes the canonical form of the exponential family, which allows us to apply the results from Section 8.1. For each agent $j$, we let the statistical model $\mathcal{P}^j_\Theta$ be a logistic regression model (8.7) with prior $\pi(\theta)$ supported on $\mathbb{R}^p$. By aggregating the log-likelihoods, we obtain the closed-form expression for the distributed Bayesian posterior:
$$
p^j_t(\theta) \propto \exp\big(\langle \theta, T^j_t\rangle - B^j_t(\theta) + \log\pi(\theta)\big), \tag{8.8}
$$
where
$$
T^j_t = \sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ji}\, X^i_k Y^i_k, \qquad B^j_t(\theta) = \sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ji}\, \sigma(\langle \theta, X^i_k\rangle).
$$
In settings where sampling from the distribution (8.8) is expensive, a Laplace approximation serves as a useful surrogate for posterior inference. Next, we present the BvM result that provides this approximate distribution. Let $\hat\theta^j_t$ be the solution to the maximization problem
$$
\hat\theta^j_t = \arg\max_{\theta \in \mathbb{R}^p}\; \langle \theta, T^j_t\rangle - B^j_t(\theta). \tag{8.9}
$$
The objective function in (8.9) is strictly concave, which guarantees the uniqueness of $\hat\theta^j_t$; moreover, we can use gradient-based optimization methods to compute $\hat\theta^j_t$.

Proposition 25 Let $\Theta = \mathbb{R}^p$.
Let $D^j_k = (X^j_k, Y^j_k)$, $k \in [t]$, $j \in [m]$, be a sequence of i.i.d. paired random variables generated according to the model (8.6), and let Assumptions 1 and 2(a) hold. Under these conditions, the estimator sequence $\hat\theta^j_t$ defined in Equation (8.9) converges to $\theta_0$ in $[P_0]$-probability. Define $q^j_t$ as the density of $\sqrt{t}(\theta - \hat\theta^j_t)$, where $\theta \sim P^j_t$. If the following assumptions are satisfied:

i) $P_{\theta_0}\big[X^1_1 X^{1\top}_1\big]$ exists and is finite and nonsingular;

ii) $P_{\theta_0}\big|X^1_{k,a}\, X^1_{k,b}\, X^1_{k,c}\big| < \infty$ holds for all $a, b, c \in [p]$;

then we have
$$
\int \big| q^j_t(x) - N(0, \hat V_{\theta_0}^{-1}) \big|\, dx \xrightarrow{P_0} 0. \tag{8.10}
$$
The matrix $\hat V_\theta$ is computed as
$$
\hat V_\theta = \frac{1}{mt}\sum_{i=1}^{m} X^{i\top}\, W^i(\theta)\, X^i, \tag{8.11}
$$
where $W^i(\theta)$ is the diagonal matrix defined by
$$
W^i(\theta) = \mathrm{diag}\left( \frac{e^{\sum_{j=0}^p \theta_j x^i_{1j}}}{\big(1 + e^{\sum_{j=0}^p \theta_j x^i_{1j}}\big)^2},\ \dots,\ \frac{e^{\sum_{j=0}^p \theta_j x^i_{tj}}}{\big(1 + e^{\sum_{j=0}^p \theta_j x^i_{tj}}\big)^2} \right).
$$
The proposition relies on two model-specific assumptions. Assumption i) concerns the regularity of the private Fisher information; it is a standard prerequisite in generalized linear models (GLMs) to ensure that the model parameter $\theta$ is identifiable (Example 16.8, Van der Vaart (2000)). Assumption ii) imposes a more stringent condition requiring bounded third moments for the covariates. This assumption is used to verify Assumption 2(f) and is generally reasonable in most applications. Let $\chi^2_{\alpha,p}$ be the critical value from a $\chi^2$ distribution with $p$ degrees of freedom. Proposition 25 yields an asymptotically valid credible region for the parameter $\theta$ under the probability measure $P^j_t$ at level $1-\alpha$: the region is given by the set of $\theta$ that satisfies
$$
(\hat\theta^j_t - \theta)^\top\, \hat V_{\theta_0}\, (\hat\theta^j_t - \theta) \le \chi^2_{\alpha,p},
$$
where $\hat\theta^j_t$ is the estimator in Equation (8.9) and $\hat V_{\theta_0}$ is defined in Equation (8.11). Proposition 25 strengthens the robustness of distributed logistic regression models: averaging the Fisher information in the approximate covariance ensures that extreme covariate values do not overly influence the approximate uncertainty.
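To make the procedure concrete, the following sketch computes a logistic-regression M-estimator by Newton-Raphson on an aggregated log-likelihood and forms the chi-square credible-region statistic from the plug-in curvature matrix $X^\top W(\theta) X$, in the spirit of Proposition 25. The data-generating parameters, the use of pooled (uniform-weight) data in place of the network-weighted aggregation, and the hard-coded critical value $\chi^2_{0.05,2} \approx 5.991$ are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Hypothetical sketch: Newton-Raphson for the logistic M-estimator, followed
# by the quadratic-form credible-region statistic built from X^T W(theta) X.
rng = np.random.default_rng(1)
m, t, p = 4, 250, 2
theta0 = np.array([1.0, -0.5])                    # assumed ground truth
X = rng.normal(size=(m * t, p))                   # pooled covariates
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta0)))

theta = np.zeros(p)
for _ in range(25):                               # Newton-Raphson iterations
    mu = 1 / (1 + np.exp(-X @ theta))
    grad = X.T @ (y - mu)                         # score of the log-likelihood
    W = mu * (1 - mu)                             # diagonal of W(theta)
    H = X.T @ (W[:, None] * X)                    # curvature, Eq.-(8.11)-style
    theta = theta + np.linalg.solve(H, grad)

# Credible-region statistic (theta_hat - theta0)^T V_hat (theta_hat - theta0);
# under the BvM regime it is approximately chi^2 with p degrees of freedom.
mu = 1 / (1 + np.exp(-X @ theta))
V_hat = X.T @ (((mu * (1 - mu))[:, None]) * X)
stat = float((theta - theta0) @ V_hat @ (theta - theta0))
chi2_95_2 = 5.991        # chi^2 critical value, alpha = 0.05, p = 2 d.o.f.
in_region = stat <= chi2_95_2   # holds with asymptotic probability 0.95
```

The Newton iteration converges because the aggregated log-likelihood is strictly concave, matching the uniqueness claim for (8.9).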
The model-specific assumptions i) and ii) also serve as robustness checks for when the distributed logistic regression model attains the desired asymptotic property. In practice, checking the assumptions and using the Laplace approximation provides an efficient and reliable approach to statistical inference in distributed logistic regression models, making it more resilient to various data irregularities.

8.3 Distributed Detection

In this example, we extend the distributed source-location scenario from Nedić et al. (2017) to demonstrate our theoretical results in a real-world context. Consider a communication network spread over the unit square $[0,1] \times [0,1]$, comprised of $m$ sensor agents located at points $Z^j$, $j \in [m]$. The agents are interconnected via a connected undirected graph $G$, represented by the adjacency matrix $A$. The goal is to locate a target at an unknown position $\theta_0 \in (0,1) \times (0,1)$. This target generates data $X^1_t, \dots, X^m_t \sim \delta_{\theta_0}$ at time $t$, as seen by the agents. However, agents do not receive the data $X^j_t$ directly; instead, the $j$th agent receives a noisy version of $|X^j_t - Z^j|$, the Euclidean distance between the data point $X^j_t$ and the agent's location $Z^j$. The goal is for the agents to collectively identify the unknown location parameter $\theta_0$. Due to noise and data corruption during transmission, the $j$th agent employs a statistical model to describe the observed signal $|X^j_t - Z^j|$. Specifically, this signal is modeled as a normal distribution $N(|\theta - Z^j|, \sigma_j^2)$, $\sigma_j > 0$, truncated to the interval $[0, |Z^j| + 1]$. Let the parameter space be $\Theta = (0,1)^2$. For each $\theta \in \Theta$, the $j$th agent's statistical model is given by
$$
p^j_\theta(X^j_k) = \frac{\phi\left(\frac{|X^j_t - Z^j| - |\theta - Z^j|}{\sigma_j}\right)}{\sigma_j\left[\Phi\left(\frac{|Z^j| + 1 - |\theta - Z^j|}{\sigma_j}\right) - \Phi\left(\frac{-|\theta - Z^j|}{\sigma_j}\right)\right]}\; \mathbb{I}\big(0 \le |X^j_t - Z^j| \le |Z^j| + 1\big). \tag{8.12}
$$
Before collecting any data, the agents collectively decide on a uniform prior for the unknown location parameter $\theta$:
$$
\pi(\theta) = \mathbb{I}\big(\theta \in [0,1]^2\big). \tag{8.13}
$$
The statistical models in Equation (8.12) are truncated normal distributions with fixed support, which belong to the exponential family. The sufficient statistic for $|\theta - Z^j|$ is $|X^j_t - Z^j|$. We can express the generalized likelihood function $f^j_t(\theta)$, up to a constant, as
$$
f^j_t(\theta) = \frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ji}\, \frac{\big(|X^i_k - Z^i| - |\theta - Z^i|\big)^2}{2\sigma_i^2}.
$$
The M-estimators for the loss function $f^j_t$ are denoted by $\hat\theta^j_t$; they minimize $f^j_t(\theta)$ over the parameter space $[0,1]^2$:
$$
\hat\theta^j_t = \arg\min_{\theta \in [0,1]^2} f^j_t(\theta). \tag{8.14}
$$
Due to the continuity and strict convexity of the objective function $f^j_t$, there exists a unique $\hat\theta^j_t$, which enables efficient numerical methods such as the Newton-Raphson algorithm for its calculation. Care is needed when $f^j_t$ achieves its minimum at the corners of $[0,1]^2$, namely at the points $(0,0)$, $(0,1)$, $(1,0)$, $(1,1)$: in these scenarios, $\hat\theta^j_t$ does not belong to $\Theta$. We include this technicality primarily to ensure that our definition of $\hat\theta^j_t$ satisfies the conditions of the argmax theorem (Theorem 3.2.2 in van der Vaart and Wellner (2023)). The subsequent results confirm that the sequence of estimators $\hat\theta^j_t$ is consistent for $\theta_0$, allowing us to disregard these extreme corner cases with probability tending to one.

Lemma 26 Let $\theta_0 \in \Theta$ be given. Under Assumption 2(a), the sequence of estimators $\hat\theta^j_t$ defined in Equation (8.14) converges to $\theta_0$ in $[P_0]$-probability. Moreover, $\lim_{t\to\infty} P_0(\hat\theta^j_t \in \Theta) = 1$.

We can now apply the Bernstein-von Mises theorem to the distributed location-detection problem.

Proposition 27 Let the private statistical models be given by Equation (8.12) and the prior by Equation (8.13). Let Assumptions 1 and 2(a) hold. For the sequence of estimators $\hat\theta^j_t$ defined in Equation (8.14), define $q^j_t$ as the density of $\sqrt{t}(\theta - \hat\theta^j_t)$, where $\theta \sim P^j_t$.
Then we have
$$
\int \big| q^j_t(x) - N(0, V_{\theta_0}^{-1}) \big|\, dx \xrightarrow{P_0} 0, \tag{8.15}
$$
where $V_{\theta_0}$ is given by
$$
V_{\theta_0} = \frac{1}{m}\sum_{j=1}^{m} \frac{(\theta_0 - Z^j)(\theta_0 - Z^j)^\top}{\sigma_j^4\, |\theta_0 - Z^j|^2}\; c_j, \tag{8.16}
$$
where the scalar factor $c_j$ collects the truncation-correction terms involving $\phi(|\theta_0 - Z^j|/\sigma_j)$ and $\Phi(|\theta_0 - Z^j|/\sigma_j)$ arising from the normalizing constant of the truncated normal model (8.12). Let $\chi^2_{\alpha,p}$ denote the critical value from a chi-squared distribution with $p$ degrees of freedom. Proposition 27 delineates an asymptotically valid credible region for the parameter $\theta$ under the probability measure $P^j_t$ at level $1-\alpha$. The credible region is defined as all values of $\theta$ that satisfy the inequality
$$
(\hat\theta^j_t - \theta)^\top\, \hat V_{\theta_0}\, (\hat\theta^j_t - \theta) \le \chi^2_{\alpha,p},
$$
where $\hat\theta^j_t$ represents the estimator defined in Equation (8.14), and $\hat V_{\theta_0}$ serves as the empirical analogue of the covariance matrix $V_{\theta_0}$ defined in Equation (8.16), with the true parameter $\theta_0$ replaced by its estimate $\hat\theta^j_t$.

9. Discussion and Future Directions

We have studied the frequentist statistical properties of distributed (non-)Bayesian inference in terms of posterior consistency, Bernstein-von Mises theorems, and posterior contraction rates. Our results provide the first rigorous insights into the statistical efficiency of distributed Bayesian inference and its dependence on the design parameters of the underlying communication network, such as the number of agents and the network topology. These promising results offer several avenues for future research. Future work should investigate the frequentist statistical properties of the distributed Bayesian posterior under more complex time-varying network structures, such as networks with single-link failures (Shahrampour et al., 2015) or other random graph structures such as Erdős-Rényi graphs (Erdős et al., 1960). Understanding how distributed Bayesian inference adapts to such scenarios will be crucial for applying theoretical insights in unpredictable or adversarial settings. Another compelling direction is to explore the frequentist coverage properties of the distributed Bayesian posterior.
This could include analyses of confidence intervals and hypothesis tests when the underlying model is either well-specified or misspecified. Understanding how the distributed Bayesian posterior aligns with frequentist criteria can offer additional validation and robustness checks for the model. There are various settings in which to extend our results within the context of social learning. For example, Bordignon et al. (2021) and Hu et al. (2023) introduce a step size in the distributed update rule (2.8) and establish an asymptotic normality result under a vanishing step size with fixed $n$. One could combine our results with theirs to study limiting distributions under joint asymptotic regimes of time and step size. Another extension could involve proving a Bernstein-von Mises theorem for a discrete, finite parameter space $\Theta$, as explored in Bordignon et al. (2021) and Hu et al. (2023). Additionally, agent-specific weights could be introduced in the likelihood term during the posterior update, similar to Bordignon et al. (2023), to explore how such an updating rule affects the limiting distribution and contraction rate compared to using agent-independent weights. One could also connect our work with the microeconomic theory of social learning, such as the results in Molavi et al. (2018). Insights from non-Bayesian frameworks, such as variants of the DeGroot model (De Groot, 1974; Acemoglu et al., 2011), could shed light on the convergence properties of distributed Bayesian methods, especially in settings where agents have heterogeneous prior beliefs or are influenced by external signals. Our work adds to the literature on frequentist distributed inference that focuses on communication efficiency. Future studies could integrate insights from this body of work, including communication-efficient methods (Jordan et al., 2018) and high-dimensional distributed statistical inference (Battey et al., 2015), to design and analyze distributed Bayesian methods.
Lastly, our theoretical findings could inform the design of communication networks in practical applications. Specifically, an analytical understanding of how the number of agents and communication costs interact could guide the construction of more efficient and robust distributed Bayesian systems.

10. Acknowledgments

The authors thank the anonymous reviewers and the AE for their very constructive comments. This work is funded by National Science Foundation CISE Awards #2213568 and #2443064 and a Google Scholar Research Award.

References

Daron Acemoglu, Munther A. Dahleh, Ilan Lobel, and Asuman Ozdaglar. Bayesian learning in social networks. The Review of Economic Studies, 78(4):1201-1236, 2011.

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed estimation and inference with statistical guarantees. arXiv preprint arXiv:1509.05457, 2015.

Peter J. Bickel and Bas J. K. Kleijn. The semiparametric Bernstein-von Mises theorem. The Annals of Statistics, 40:206-237, 2012.

Peter J. Bickel and Joseph A. Yahav. Some contributions to the asymptotic theory of Bayes solutions. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 11:257-276, 1969.

Natalia A. Bochkina and Peter J. Green. The Bernstein-von Mises theorem and nonregular models. The Annals of Statistics, 42(5):1850-1878, 2014.

Virginia Bordignon, Vincenzo Matta, and Ali H. Sayed. Adaptation in online social learning. In European Signal Processing Conference (EUSIPCO), 2021.

Virginia Bordignon, Mert Kayaalp, Vincenzo Matta, and Ali H. Sayed. Social learning with non-Bayesian local updates. In 31st European Signal Processing Conference (EUSIPCO), pages 1-5. IEEE, 2023.

Vivek Borkar and Pravin Varaiya. Asymptotic agreement in distributed estimation. IEEE Transactions on Automatic Control, 27(3):650-655, 1982.
Paolo Braca, Stefano Marano, Vincenzo Matta, and Peter Willett. Asymptotic optimality of running consensus in testing binary hypotheses. IEEE Transactions on Signal Processing, 58, 2010.

Lucien Le Cam and Grace Lo Yang. Asymptotics in Statistics: Some Basic Concepts. Springer Science & Business Media, 2000.

Ismaël Castillo and Judith Rousseau. A Bernstein-von Mises theorem for smooth functionals in semiparametric models. The Annals of Statistics, 43:2353-2383, 2015.

Thomas M. Cover. Elements of Information Theory. John Wiley & Sons, 1999.

Morris H. De Groot. Reaching a consensus. Journal of the American Statistical Association, 69(345):118-121, 1974.

Larry G. Epstein, Jawwad Noor, and Alvaro Sandroni. Non-Bayesian learning. The BE Journal of Theoretical Economics, 10:1-20, 2010.

Paul Erdős, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5(1):17-60, 1960.

Subhashis Ghosal and Aad van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.

Jayanta K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer-Verlag, 2003.

Gustavo L. Gilardoni and Murray K. Clayton. On reaching a consensus using DeGroot's iterative pooling. The Annals of Statistics, 21(1):391-401, 1993.

Benjamin Golub and Evan Sadler. Learning in social networks. Available at SSRN 2919146, 2017.

John A. Gubner. Distributed estimation and quantization. IEEE Transactions on Information Theory, 39(4):1456-1459, 1993.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013.

Ping Hu, Virginia Bordignon, Stefan Vlaski, and Ali H. Sayed. Optimal aggregation strategies for social learning over graphs. IEEE Transactions on Information Theory, 69, 2023.
Yunus Inan, Mert Kayaalp, Emre Telatar, and Ali H. Sayed. Social learning under randomized collaborations. In IEEE International Symposium on Information Theory (ISIT), 2022.

Ali Jadbabaie, Pooya Molavi, Alvaro Sandroni, and Alireza Tahbaz-Salehi. Non-Bayesian social learning. Games and Economic Behavior, 76:210-225, 2012.

Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 2018.

Anya Katsevich. Improved scaling with dimension in the Bernstein-von Mises theorem for two statistical models. arXiv preprint arXiv:2308.06899, 2023.

Bas J. K. Kleijn and Aad W. van der Vaart. The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354-381, 2012.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. An optimization-centric view on Bayes' rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132):1-109, 2022.

Anusha Lalitha, Anand Sarwate, and Tara Javidi. Social learning and distributed hypothesis testing. In 2014 IEEE International Symposium on Information Theory, pages 551-555. IEEE, 2014.

Pierre-Simon Laplace. Supplément au mémoire sur les approximations des formules qui sont fonction de très grands nombres. Mémoires de l'Académie Royale des sciences de Paris, 1810:559-565, 1809.

Lucien Le Cam. On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. Univ. California Publ. Statist., 1:277-330, 1953.

Cheng Li, Sanvesh Srivastava, and David B. Dunson. Simple, scalable and accurate posterior interval estimation. Biometrika, 104:665-680, 2017.

Marco Avella Medina, José Luis Montiel Olea, Cynthia Rush, and Amilcar Velez.
On the robustness to misspecification of α-posteriors and their variational approximations. Journal of Machine Learning Research, 23(147):1-51, 2022.

Jeffrey W. Miller. Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research, 22(168), 2021.

Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning, pages 1656-1664. PMLR, 2014.

Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B. Dunson. Robust and scalable Bayes via a median of subset posterior measures. Journal of Machine Learning Research, 18(124):1-40, 2017.

Pooya Molavi, Alireza Tahbaz-Salehi, and Ali Jadbabaie. A theory of non-Bayesian social learning. Econometrica, 86:445-490, 2018.

Angelia Nedić, Alex Olshevsky, and César A. Uribe. Fast convergence rates for distributed non-Bayesian learning. IEEE Transactions on Automatic Control, 62:5538-5553, 2017.

Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780, 2013.

Maxim Panov and Vladimir Spokoiny. Finite sample Bernstein-von Mises theorem for semiparametric problems. Bayesian Analysis, 10(3):665-710, 2015.

Maxim Rabinovich, Elaine Angelino, and Michael I. Jordan. Variational consensus Monte Carlo. Advances in Neural Information Processing Systems, 28, 2015.

Jeffrey S. Rosenthal. Convergence rates for Markov chains. SIAM Review, 37(3):387-405, 1995.

Lorraine Schwartz. On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 4:10-26, 1965.

Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm.
International Journal of Management Science and Engineering Management, 11:78-88, 2016.

Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. Distributed detection: Finite-time analysis and impact of network topology. IEEE Transactions on Automatic Control, 61:3256-3268, 2015.

Xiaotong Shen. Asymptotic normality of semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association, 97:222-235, 2002.

John Tsitsiklis and Michael Athans. Convergence and asymptotic agreement in distributed decision problems. IEEE Transactions on Automatic Control, 29(1):42-50, 1984.

César A. Uribe, Alexander Olshevsky, and Angelia Nedić. Non-asymptotic concentration rates in cooperative learning, part I: Variational non-Bayesian social learning. IEEE Transactions on Control of Network Systems, 2022a.

César A. Uribe, Alexander Olshevsky, and Angelia Nedić. Non-asymptotic concentration rates in cooperative learning, part II: Inference on compact hypothesis sets. IEEE Transactions on Control of Network Systems, 2022b.

Aad W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 2023.

Stephen G. Walker. Bayesian inference via a minimization rule. Sankhyā: The Indian Journal of Statistics, pages 542-553, 2006.

Xiangyu Wang and David B. Dunson. Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.

Xiangyu Wang, Fangjian Guo, Katherine A. Heller, and David B. Dunson. Parallelizing MCMC with random partition trees. Advances in Neural Information Processing Systems, 28, 2015.

Yixin Wang and David M. Blei. Frequentist consistency of variational Bayes.
Journal of the American Statistical Association, 114(527):1147-1161, 2019.

Appendix A. Review of Concepts

A.1 Distances and Divergences

This section reviews a class of divergence functions.

Definition 28 (Rényi divergence) Let $\rho > 0$ and $\rho \neq 1$. The $\rho$-Rényi divergence between two probability measures $P_1$ and $P_2$ is defined as
$$
D_\rho(P_1 \,\|\, P_2) = \begin{cases} \dfrac{1}{\rho - 1} \log \displaystyle\int \Big(\dfrac{dP_1}{dP_2}\Big)^{\rho-1} dP_1 & \text{if } P_1 \ll P_2, \\ +\infty & \text{otherwise.} \end{cases}
$$
The relations between the Rényi divergence and other divergence functions are summarized below:

1. As $\rho \to 1$, the Rényi divergence converges to the Kullback-Leibler divergence, defined as
$$
D_{KL}(P_1 \,\|\, P_2) = \begin{cases} \displaystyle\int \log \dfrac{dP_1}{dP_2}\, dP_1 & \text{if } P_1 \ll P_2, \\ +\infty & \text{otherwise.} \end{cases}
$$

2. When $\rho = 1/2$, the Rényi divergence is related to the Hellinger distance by
$$
D_{1/2}(P_1 \,\|\, P_2) = -2 \log\big(1 - H(P_1, P_2)\big),
$$
where the Hellinger distance is defined as $H(P_1, P_2) = \frac{1}{2}\int \big(\sqrt{dP_1} - \sqrt{dP_2}\big)^2 = 1 - \int \sqrt{dP_1\, dP_2}$.

3. When $\rho = 2$, the Rényi divergence is related to the $\chi^2$-divergence by
$$
D_2(P_1 \,\|\, P_2) = \log\big(1 + \chi^2(P_1 \,\|\, P_2)\big),
$$
where the $\chi^2$-divergence is defined as $\chi^2(P_1 \,\|\, P_2) = \int \dfrac{(dP_1)^2}{dP_2} - 1$.

Definition 29 (Total variation) The total variation distance between two probability measures $P_1$ and $P_2$ is defined as
$$
TV(P_1, P_2) = \frac{1}{2} \int |dP_1 - dP_2|.
$$

A.2 An (Incomplete) Summary of Previous BvM Results

The history of the Bernstein-von Mises (BvM) theorem is marked by steady evolution, expanding its reach to wider contexts under weaker assumptions. The origins of the theorem trace back to Laplace (1809). Earlier works on the BvM theorem are restricted to well-specified, i.i.d. statistical models with fixed, finite-dimensional parameters, and use assumptions that involve up to fourth-order derivatives of the log-likelihood (Le Cam, 1953; Bickel and Yahav, 1969; Le Cam and Yang, 2000). This classical Bernstein-von Mises theorem for i.i.d., finite-dimensional data was brilliantly summarized in (Van der Vaart, 2000, Chapter 10).
The theorem relies on three assumptions: the existence of a sequence of consistent estimators, local asymptotic normality (LAN), and uniformly consistent testing. One typical sufficient condition for LAN is differentiability in quadratic mean (DQM), which involves only first-order derivatives of the log-likelihood to achieve the desired quadratic expansion. In addition to the LAN assumption, Schwartz (1965) proposed the assumption of uniformly consistent tests around the true parameter $\theta_0$. The contemporary wave of research has extended BvM theorems beyond these canonical settings. These studies have broached areas such as model misspecification (Kleijn and van der Vaart, 2012), semiparametric models (Shen, 2002; Bickel and Kleijn, 2012; Panov and Spokoiny, 2015; Castillo and Rousseau, 2015), and parameters placed at the boundary of the parameter space (Bochkina and Green, 2014). Among these explorations, a particularly relevant line of inquiry has focused on BvM results for non-standard Bayes procedures (Knoblauch et al., 2022), including generalized posteriors (Miller, 2021), Bayes rules derived from optimization perspectives, and variational Bayesian posteriors (Wang and Blei, 2019; Medina et al., 2022). Another emerging thread of work establishes BvM theorems for high-dimensional models, with the most recent results allowing the dimension to grow at the order $d^2 \ll n$ (Panov and Spokoiny, 2015; Katsevich, 2023).

Appendix B. Simulating Distributed Bayesian Logistic Regression

We illustrate the practical implications of our theory through an exponential family model, for which the results are explicitly proven in Section 8.1. We posit an improper prior $\pi(\theta) \propto 1$. Then the distributed posterior is given by
$$
p^j_t(\theta) \propto \exp\big(\langle \theta, T^j_t\rangle - B^j_t(\theta)\big), \tag{B.1}
$$
where
$$
T^j_t = \sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ji}\, X^i_k Y^i_k, \qquad B^j_t(\theta) = \sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ji}\, \sigma(\langle \theta, X^i_k\rangle).
$$
The gradient of the energy function is given by
$$
\nabla_\theta \log p^j_t(\theta) = \sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ji}\, X^i_k Y^i_k - \sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ji}\, \frac{\exp\langle \theta, X^i_k\rangle}{1 + \exp\langle \theta, X^i_k\rangle}\, X^i_k,
$$
where $A^0_{jj} = 1$ and $A^0_{ji} = 0$ for $i \neq j$. With $\nabla_\theta \log p^j_t(\theta)$ in hand, we implement the Langevin Monte Carlo (LMC) algorithm to sample from the distributed posterior $p^j_t(\theta)$. Since $\log p^j_t(\theta)$ is strictly concave, the samples generated by LMC are guaranteed to converge to the stationary distribution $p^j_t(\theta)$. The final histograms were generated using 1,000 samples from LMC. Figure 1 shows the ten marginals of the exact posterior sampled via LMC and of the BvM approximation derived in Proposition 25 for one agent; the two are closely aligned. For the problem setting, we simulate a static connected graph with 10 nodes, generating 500 data points under a logistic regression model with the ground truth $\theta_0 \in \mathbb{R}^{10}$ sampled from a standard normal distribution.

Figure 1: Histograms of the ten marginals computed via Langevin Monte Carlo vs. the BvM approximation for a 10-dimensional Bayesian logistic regression example (exact posterior and BvM approximation for Agent 4).

Figure 2: Communication graph A.

Appendix C. Proofs

C.1 Proofs of Results in Section 4

Proof [Lemma 2] Define the vector-valued functions $\phi_t(\theta) = \big(f^1_t(\theta), \dots, f^m_t(\theta)\big)^\top$ and $\log p_\theta(X_t) = \big(\log P^1_\theta(X^1_t), \dots, \log P^m_\theta(X^m_t)\big)^\top$. Since $P_0|\log P^i_\theta| = P_0(\log P^i_\theta)^+ + P_0(\log P^i_\theta)^- < \infty$, we have
$$
f^j_t(\theta) = -\frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ji}\, \log p^i_\theta(X^i_k) = -\frac{1}{t}\sum_{k=1}^{t} \big\langle [A^{t-k}]_{j,\cdot},\, \log p_\theta(X_k) \big\rangle
$$
and
$$
\phi_t(\theta) = -\frac{1}{t}\sum_{k=1}^{t} A^{t-k} \log p_\theta(X_k).
$$
By Lemma 34 (the distributed law of large numbers),
$$
\phi_t(\theta) \xrightarrow{P_0} -\frac{1}{m}\sum_{i=1}^{m} P_0 \log P^i_\theta\; \mathbf{1} = f(\theta)\mathbf{1},
$$
which implies the coordinate-wise convergence $f^j_t(\theta) \xrightarrow{P_0} f(\theta)$ for every $j \in [m]$.

Proof [Theorem 3] Let $\epsilon > 0$ be given. Define $\mu^j_t(B) = \int_B \exp(-t f^j_t(\theta))\, \Pi(d\theta)$, so that $P^j_t(B) = \mu^j_t(B)/\mu^j_t(\Theta)$ and $\mu^j_t(\Theta) = z^j_t < \infty$. Throughout the proof we work on $\Theta' = \Theta \setminus U_{\mathrm{null}}$, where $\pi(\theta) = 0$ for every $\theta \in U_{\mathrm{null}}$; this null set is of little importance because it has zero weight under $P^j_t$. We first show that there exists $\alpha > 0$ such that $\inf_{\theta \in U^c_\epsilon} f^j_t(\theta) \ge f(\theta_0) + \alpha$ for all $t$ large enough. By Lemma 2, there exists $T$ such that $|f^j_t(\theta) - f(\theta)| < \epsilon/2$ for $t > T$. Given that
$$
\inf_{\theta \in U^c_\epsilon} f^j_t(\theta) - f(\theta_0) = \inf_{\theta \in U^c_\epsilon} \big\{ [f^j_t(\theta) - f(\theta)] + [f(\theta) - f(\theta_0)] \big\},
$$
we obtain $\inf_{\theta \in U^c_\epsilon} f^j_t(\theta) - f(\theta_0) > \epsilon/2$ for $t > T$, because $f(\theta) - f(\theta_0) > \epsilon$ for $\theta \in U^c_\epsilon$; the claim thus holds with $\alpha = \epsilon/2$. It follows that for $t > T$ and all $\theta \in U^c_\epsilon$, $\exp(-t(f^j_t(\theta) - f(\theta_0) - \alpha)) \le 1$. Therefore, for $t > T$,
$$
\exp\big(t(f(\theta_0) + \alpha)\big)\, \mu^j_t(U^c_\epsilon) = \int_{U^c_\epsilon} \exp\big(-t(f^j_t(\theta) - f(\theta_0) - \alpha)\big)\, \Pi(d\theta) \le \int_{U^c_\epsilon} \Pi(d\theta) \le 1.
$$
For any $\theta \in A_{\alpha/2}$, $\lim_t f^j_t(\theta) - f(\theta_0) - \alpha = f(\theta) - f(\theta_0) - \alpha < -\alpha/2 < 0$; thus $\exp(-t(f^j_t(\theta) - f(\theta_0) - \alpha)) \to \infty$ as $t \to \infty$. By Fatou's lemma, since $\Pi(A_{\alpha/2}) > 0$ by Assumption 2,
$$
\liminf_{t} e^{t(f(\theta_0)+\alpha)} \mu^j_t(A_{\alpha/2}) = \liminf_{t} \int_{A_{\alpha/2}} \exp\big(-t(f^j_t(\theta) - f(\theta_0) - \alpha)\big)\, \Pi(d\theta) = \infty.
$$
Since $\mu^j_t(\Theta) \ge \mu^j_t(A_{\alpha/2})$, we obtain $e^{t(f(\theta_0)+\alpha)}\mu^j_t(\Theta) \to \infty$. Combining the two results,
$$
1 - P^j_t(U_\epsilon) = P^j_t(U^c_\epsilon) = \frac{\mu^j_t(U^c_\epsilon)}{\mu^j_t(\Theta)} = \frac{\exp(t(f(\theta_0)+\alpha))\,\mu^j_t(U^c_\epsilon)}{\exp(t(f(\theta_0)+\alpha))\,\mu^j_t(\Theta)} \to 0.
$$

Proof [Lemma 5] Let $j$ and $\theta$ be fixed. If $P_0 \ll P^j_\theta$, then $D_{KL}(P_0 \,\|\, P^j_\theta) < \infty$, which implies that $P_0 \log p^j_\theta > -\infty$. Likewise, the expectation of $\log p_0$ under $P_0$ is finite, i.e., $P_0 \log p_0 < \infty$. Since the KL divergence is non-negative, $D_{KL}(P_0 \,\|\, P^j_\theta) \ge 0$, the expectation of $\log p^j_\theta$ under $P_0$ is at most that of $\log p_0$. Therefore, $-\infty < P_0 \log p^j_\theta \le P_0 \log p_0 < \infty$, which completes the proof.
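The distributed law of large numbers invoked above (Lemma 34) can be checked numerically. In the hypothetical sketch below, generic i.i.d. vectors $v_k$ with heterogeneous per-agent means stand in for the log-likelihood evaluations $\log p_\theta(X_k)$, and the mixing matrix is an assumed doubly stochastic choice satisfying Assumption 1; the consensus-weighted time average $\frac{1}{t}\sum_{k=1}^t A^{t-k} v_k$ then approaches the network-wide mean at every agent.

```python
import numpy as np

# Numerical illustration of the distributed LLN: the recursion
# acc_{t+1} = A acc_t + v_{t+1} yields acc_t = sum_k A^{t-k} v_k, and
# (1/t) acc_t converges to the global average of the per-agent means.
rng = np.random.default_rng(2)
m, t_max = 4, 4000

# Lazy complete-graph mixing, doubly stochastic (assumed for illustration).
A = np.full((m, m), 1 / (2 * (m - 1)))
np.fill_diagonal(A, 0.5)

mu = np.array([1.0, 2.0, 3.0, 4.0])   # heterogeneous per-agent means
acc = np.zeros(m)                     # running sums sum_k A^{t-k} v_k
for _ in range(t_max):
    v = mu + rng.normal(size=m)       # i.i.d. innovations with E[v] = mu
    acc = A @ acc + v                 # one consensus + innovation step

f_t = acc / t_max
# Every coordinate approaches the network-wide mean (1/m) sum_i mu_i = 2.5.
```

Because $A^s$ converges geometrically to the uniform averaging matrix, old observations enter every agent's average with nearly equal weights, which is exactly the mechanism behind the coordinate-wise convergence $f^j_t(\theta) \to f(\theta)$.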
C.2 Proofs of Results in Section 5

Proof [Lemma 7] Suppose $U$ is a neighborhood of $\theta_0$ within which $\theta \mapsto -\log P^i_\theta$ is convex for all $i \in [m]$. Being a linear combination of convex mappings, $f^j_t(\theta)$ retains convexity on $U$. As a convex function, $f^j_t$ is Lebesgue-almost surely differentiable on $U$. By Lemma 2, $f^j_t \xrightarrow{P_0} f$, and since $f$ is also convex, it is likewise differentiable a.s. on $U$. Given that $\nabla f^j_t(\theta)$ is non-decreasing in every coordinate a.s. on $U$, the events $\nabla f^j_t(\theta_0 - \epsilon) < -\eta$ and $\hat\theta^j_t \le \theta_0 - \epsilon$ together imply $\nabla f^j_t(\hat\theta^j_t) < -\eta$, which has probability tending to 0 for every $\eta > 0$ since $\nabla f^j_t(\hat\theta^j_t) = o_d(1)$. This shows that for every $\epsilon, \eta > 0$,
$$
P\big( \|\hat\theta^j_t - \theta_0\| \le \epsilon \big) + o(1) \ge P\big( \nabla f^j_t(\theta_0 - \epsilon) < -\eta,\ \nabla f^j_t(\theta_0 + \epsilon) > \eta \big).
$$
Since $\nabla f(\theta_0 - \epsilon) < 0 < \nabla f(\theta_0 + \epsilon)$, taking $2\eta$ to be the smallest coordinate of $-\nabla f(\theta_0 - \epsilon)$ and $\nabla f(\theta_0 + \epsilon)$ makes the right-hand side converge to 1.

Proof [Lemma 8] Using the consistency of $\hat\theta^j_t$ together with Lemma 13 (with scaling factor $\epsilon^j_t = 1/\sqrt{t}$),
$$
f^j_t(\hat\theta^j_t) = f^j_t(\theta_0) + \frac{1}{\sqrt{t}}(\hat\theta^j_t - \theta_0)^\top V_{\theta_0} \Delta^j_{t,\theta_0} + \frac{1}{2}(\hat\theta^j_t - \theta_0)^\top V_{\theta_0} (\hat\theta^j_t - \theta_0) + o_d(1) = f^j_t(\theta_0) + o_d(1),
$$
by applications of Slutsky's theorem. By Lemma 2, $f^j_t(\theta_0)$ converges to $f(\theta_0)$ in probability, so $f^j_t(\hat\theta^j_t) = f(\theta_0) + o_d(1)$. Assumption 4(a) ensures the existence of $\delta > 0$ such that
$$
\inf_{\theta \in B_\epsilon(\hat\theta^j_t)^c} \big(f^j_t(\theta) - f^j_t(\hat\theta^j_t)\big) \ge \inf_{\theta \in B_{2\epsilon}(\theta_0)^c} \big(f^j_t(\theta) - f^j_t(\hat\theta^j_t)\big) = \inf_{\theta \in B_{2\epsilon}(\theta_0)^c} \big(f^j_t(\theta) - f(\theta_0)\big) + o_d(1) > 2\delta + o_d(1) > \delta
$$
for $t$ large; hence $\inf_{\theta \in B_\epsilon(\hat\theta^j_t)^c}\big(f^j_t(\theta) - f^j_t(\hat\theta^j_t)\big) > \delta$ with probability tending to 1 under $[P_0]$.

Proof [Lemma 9] For any $\epsilon > 0$ and $\delta > 0$,
$$
Q_t(B_\delta(0)) = \Pi_t\big(B_{\delta/a_t}(\theta_t)\big) \le \Pi_t\big(B_\epsilon(\theta_0)\big)
$$
for all $t$ sufficiently large. Therefore, since $Q_t \to Q$ in total variation distance, we obtain
$$
Q(B_\delta(0)) = \lim_{t} Q_t(B_\delta(0)) \le \liminf_{t} \Pi_t(B_\epsilon(\theta_0)).
$$
Taking $\delta$ arbitrarily large shows that $\lim_{t} \Pi_t(B_\epsilon(\theta_0)) = 1$.

Proof [Theorem 10] For each $j \in [m]$, the sequence of estimators $\hat\theta^j_t$ is chosen to satisfy the estimating equation
$$
\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, \nabla \log p^i_{\hat\theta^j_t}(X^i_k) = 0.
$$
The function $f^j_t(\theta)$ admits the following quadratic expansion.
$$
f^j_t(\theta) = -\frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, \log p^i_\theta(X^i_k) = -\frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\Big\{ \log p^i_{\hat\theta^j_t}(X^i_k) + \nabla\log p^i_{\hat\theta^j_t}(X^i_k)^\top(\theta - \hat\theta^j_t) - \frac{1}{2}(\theta - \hat\theta^j_t)^\top\, \nabla\log p^i_{\hat\theta^j_t}(X^i_k)\, \nabla\log p^i_{\hat\theta^j_t}(X^i_k)^\top\, (\theta - \hat\theta^j_t) + O_p(|\theta - \hat\theta^j_t|^3) \Big\}
$$
$$
= f^j_t(\hat\theta^j_t) + \frac{1}{2}(\theta - \hat\theta^j_t)^\top \underbrace{\frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, \nabla\log p^i_{\hat\theta^j_t}(X^i_k)\, \nabla\log p^i_{\hat\theta^j_t}(X^i_k)^\top}_{\hat V^j_t}\, (\theta - \hat\theta^j_t) + \underbrace{O_p(|\theta - \hat\theta^j_t|^3)}_{r^j_t(\theta - \hat\theta^j_t)},
$$
where the first-order term vanishes because $\hat\theta^j_t$ solves the estimating equation above. The sequence of local estimates $\hat\theta^j_t \in \Theta$ satisfies the consistency requirement. Denoting the remainder by $r^j_t(\theta - \hat\theta^j_t)$, it immediately holds that $r^j_t(h) = O_p(|h|^3)$. By Theorem 7.2 of Van der Vaart (2000), Assumption 2(c) implies the existence of the nonsingular Fisher information matrix $V^i_{\theta_0}$ for every $i \in [m]$. The average Fisher information matrix, defined as $V_{\theta_0} = \frac{1}{m}\sum_{i=1}^m V^i_{\theta_0}$, is then also nonsingular and positive definite. It remains to show that $\hat V^j_t \xrightarrow{P_0} V_{\theta_0}$. We represent the sequence of matrices $\hat V^j_t$ as
$$
\hat V^j_t = \frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, \nabla\log p^i_{\hat\theta^j_t}(X^i_k)\, \nabla\log p^i_{\hat\theta^j_t}(X^i_k)^\top.
$$
Using the Lipschitz continuity of the gradient (Assumption 2(d)), we can decompose $\hat V^j_t$ as
$$
\hat V^j_t = \frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, \big(\nabla\log p^i_{\theta_0}(X^i_k) + s^i(X^i_k)(\hat\theta^j_t - \theta_0)\big)\big(\nabla\log p^i_{\theta_0}(X^i_k) + s^i(X^i_k)(\hat\theta^j_t - \theta_0)\big)^\top.
$$
Applying the distributive property, we can split this into three terms:
$$
\frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, \nabla\log p^i_{\theta_0}(X^i_k)\, \nabla\log p^i_{\theta_0}(X^i_k)^\top + (\hat\theta^j_t - \theta_0)\, \frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, s^i(X^i_k)\, \nabla\log p^i_{\theta_0}(X^i_k)^\top + (\hat\theta^j_t - \theta_0)(\hat\theta^j_t - \theta_0)^\top\, \frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, [s^i(X^i_k)]^2.
$$
By Lemma 34, we have
$$
\frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, \nabla\log p^i_{\theta_0}(X^i_k)\, \nabla\log p^i_{\theta_0}(X^i_k)^\top \xrightarrow{P_0} V_{\theta_0}, \qquad \frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, s^i(X^i_k)\, \nabla\log p^i_{\theta_0}(X^i_k)^\top \xrightarrow{P_0} \frac{1}{m}\sum_{i=1}^{m} P_0\big[s^i\, \nabla\log p^i_{\theta_0}\big],
$$
and
$$
\frac{1}{t}\sum_{k=1}^{t}\sum_{i=1}^{m} [A^{t-k}]_{ij}\, [s^i(X^i_k)]^2 \xrightarrow{P_0} \frac{1}{m}\sum_{i=1}^{m} P_0\big[s^i\big]^2.
$$
By the Cauchy-Schwarz inequality, $P_0\big[s^i\, \nabla\log p^i_{\theta_0}\big] \le \sqrt{P_0[s^i]^2\; P_0\big|\nabla\log p^i_{\theta_0}\big|^2} < \infty$. Finally, since $\hat\theta^j_t - \theta_0 = o_p(1)$, the second and the third terms are $o_p(1)$. Combining all these results, we have $\hat V^j_t \xrightarrow{P_0} V_{\theta_0}$.

Note that $q^j_t(x) = p^j_t(\hat\theta^j_t + x/\sqrt{t})\, t^{-p/2}$. Define
$$
g^j_t(x) = \exp\big(-t[f^j_t(\hat\theta^j_t + x/\sqrt{t}) - f^j_t(\hat\theta^j_t)]\big)\, \pi(\hat\theta^j_t + x/\sqrt{t}) = \exp\big(t f^j_t(\hat\theta^j_t)\big)\, q^j_t(x)\, z^j_t\, t^{p/2},
$$
recalling that zj t < by assumption and define g0(x) = exp( 1 2x T Vθ0x)π(θ0) Wu and Uribe Let α (0, λ), where λ is the smallest eigenvalue of Vθ0. Let ϵ > 0 small enough that ϵ < α/(2c0), ϵ < ϵ0 and π(θ) 2π(θ0) for all θ B2ϵ(θ0). Let α = lim inft infθ Bϵ(ˆθj t )c(fj t (θ) fj t (ˆθj t)), noting that δ > 0 with asymptotic probability 1 by lemma 8. Let Aj t = V j t αI and A0 = Vθ0 αI, define 2x T Aj tx)2π(θ0) if |x| < ϵ 2tδ)2π(ˆθj t + x/ t) if |x| ϵ h0(x) = exp( 1 2x T A0x)2π(θ0) We show that 1. gj t g0 and hj t h0 almost surely R hj t converges to R h0 3. gj t = |gj t | hj t with asymptotic probability 1, and 4. gj t , g0, hj t, h0 L1(dx) with asymptotic probability 1. By the generalized dominated convergence theorem, these results imply that R gj t P0 R g0 and R |(gj t )1/m g0|dx P0 0 with [P0] probability. Assuming these results are true, we want to show how asymptotic normality follows. Since R qj t = 1, the definition of gj t implies that exp(tfj t (ˆθj t))tp/2zj t = Z gj t Z g0 = π(θ0) (2π)p/2 |m Vθ0|1/2 . where |Vθ0| = |det(Vθ0)| and therefore zj t exp( tfj t (ˆθj t))π(θ0) |m Vθ0|1/2 (2π as t . For any aj t a R, we have that R |aj tgj t ag0| 0 since Z |aj tgj t ag0| Z |aj tgj t aj tg0| + Z |aj tg0 ag0| = |aj t| Z |gj t g0| + |aj t a| Z |g0| 0 Therefore, if we let aj t = (exp(tfj t (ˆθj t))tp/2zj t ) 1 and a = (π(θ0) (2π)p/2 |Vθ0|1/2 ) 1, we have shown that aj t a and thus Z |qj t (x) |Vθ0|1/2 (2π)p/2 exp( 1 2x T Vθ0x)|dx P0 0. This shows asymptotic normality. Finally, the posterior consistency at θ0 follows from Lemma 9. It remains to show statements 1 to 4 above. 1) Fix x Rp. For t sufficiently large, |x| ϵ t. It follows that hj t(x) = exp( 1 2x T Aj tx)2π(θ) exp( 1 2x T Vθ0x)2π(θ0) = h0(x). Bv M for Distributed Bayes since Aj t A0 almost surely. For gj t , first note that π(ˆθj t + x/ t) π(θ0) since π is continuous at θ0 and θj t θ0 and x/ t 0. 
By the uniform convergence of fj t to ft, t(fj t (ˆθj t + x/ t) fj t (ˆθj t)) = 1 2x T V j t x + trj t(x/ because V j t Vθ0 and |x/ t| < ϵ0 for all t sufficiently large. We then infer a bound on rj t, t|3 = c0|x|3/ as t . Therefore, gj t (x) g0(x). 2) By the definition of hj t, letting Bt = Bϵ t(0), Z hj t = Z 2x T Aj tx)2π(θ0)dx + Z Bc t exp( 1 2nδ)2π(ˆθj t + x/ Since Aj t A0 almost surely and A0 is positive definite, for all t sufficiently large, Aj t is also positive definite and the first term equals 2π(θ0)(2π)p/2 |Aj t|1/2 P(|(Aj t) 1/2Z| < ϵ t) a.s. 2π(θ0)(2π)p/2 |A0|1/2 = Z h0. where Z N(0, 1). The second integral converges to 0, since it is nonnegative and has an upper bound as Z 2tδ)2π(ˆθj t + x/ t)dx = exp 1 2 nδ np/2 0. using the fact that π(ˆθj t + x/ t)t p/2 is the density of t(θ ˆθj t) where θ π. 3) With P0 probability going to 1, |θj t θ0| < ϵ thus the bound on rj t applies, infθ Bϵ(ˆθj t )c(fj t (θ) fj t (ˆθj t)) > δ. Let x Rp be given again. If |x| > ϵ t, then fj t (ˆθj t + x/ t) fj t (ˆθj t) > δ/2, and thus, gj t (x) exp( 1 2tδ)π(ˆθj t + x/ t) = hj t(x). In the meantime, if |x| < ϵ t, then π(ˆθj t + x/ t) 2π(θ0) by our choice of ϵ, and t(fj t (ˆθj t + x/ t) fj t (ˆθj t)) 1 2x T V j t x + trj t(x/ 2x T V j t x 1 2x T Aj tx. since |trj t(x/ t)| 1/2α|x|2, by the fact that |x/ t| < ϵ < ϵ0 and ϵ < α/(2c0). Therefore, lim t P0 gj t (x) hj t(x) = 1. 4) Since Vθ0 and A0 are positive definite, R g0 and R h0 are finite. Since R hj t R h0, we have limt P0( R gj t R hj t < ) = 1. The measurability of gj t , hj t with respect to P0 follows from the measurability of fj t and π. Wu and Uribe Proof [Corollary 11] In the proof of Theorem 10, the differential in quadratic means assumption (Assumption 2(c)) is only used to establish Equation (5.2). Assumption 2(e) states that each log pi θ is twice continuously differentiable with nonsingular Hessian matrices. 
Since fj t (θ) is a positive linear combination of log pi θ, fj t (θ) is also twice continuously differentiable with a nonsingular Hessian matrix. Therefore, Equation (5.2) holds when we replace Assumption 2(c) with Assumption 2(e). Proof [Corollary 12] Denote , as the tensor inner product, and as the tensor product. Let Assumption 2(f) hold. We have a third - order Taylor expansion of fj t (θ) around ˆθj t with nonsingular quadratic terms: fj t (θ) = fj t (ˆθj t) + 1 2(θ θj t)T 2 θfj t (ˆθj t)(θ θj t) + 1 6 3 θfj t ( θ), (θ θj t) (θ θj t) (θ θj t) , for some θ between θ0 and ˆθj t. Under Assumption 2(f), we have | 3 θ log pj θ(x)| φj(x) for integrable φ in a neighborhood N j of θ0. Define N = m j=1N j. The consistency assumption implies that P0(ˆθj t N) 1. For all θ N, we have P0| 3 θfj t ( θ)| P0 1 i=1 [At k ji ]| 3 log pj θ(Xj k)| P0 1 i=1 [At k ji ]φj(Xj k) < . By the uniform law of large numbers (Theorem 1.3.3, Ghosh and Ramamoorthi (2003)) and Lemma 34, we have 3 θfj t ( θ) 1 i=1 P0 3 log pi θ In the limit, we have 1 m Pm i=1 P0| 3 log pi θ| 1 m Pm i=1 P0φi, hence 3 θfj t ( θ) is asymptotically tight and the cubic term is op(1). We also have that 2 θfj t (ˆθj t) 2 θfj t (θ0) + 1 i=1 [At k ji ]| φj(Xj k), ˆθj t θ0 |, hence 2 θfj t (ˆθj t) = 2 θfj t (θ0) + op(1). Finally, we have fj t (ˆθj t) = fj t (θ0) + op(1) by the continuous mapping theorem (Theorem 18.11, Van der Vaart (2000)). Putting the terms together, we conclude that fj t (θ) = fj t (θ0) + 1 2(θ θj t)T 2 θfj t (θ0)(θ θj t) + op(1). Thus, Equation (5.2) holds when substitute Assumption 2(c) and 2(d). Bv M for Distributed Bayes Proof [Lemma 13] By Theorem 7.2 from Van der Vaart (2000), differentiability in quadratic means implies that P0 log pj θ0 = 0 and the Fisher information matrix V j θ0 = P0 log pj θ0( log pj θ0)T exists and is non-singular. 
By the central limit theorem, we have k=1 log pj θ0(Xj k) d N(0, V j θ0) Let s define Vθ0 as the average Fisher information matrix, i.e., Vθ0 = 1 m Pm j=1 V j θ0. If we choose ϵj t = 1 t, then by Taylor expansion, tfj t (θ0 + ϵj th) i=1 [At k ij ] log pi θ0+ϵj th(Xi k) i=1 [At k ij ] log pi θ0(Xi k) + h T log pi θ0(Xi k) 2h T log pi θ0(Xi k) log pi θ0(Xi k)T t h + Op(| h = tfj t (θ0) h T Pt k=1 Pm i=1[At k ij ] log pi θ0(Xi k) 2h T Pt k=1 Pm i=1[At k ij ] log pi θ0(Xi k) log pi θ0(Xi k)T Define j t,θ0 and Vt as: i=1 [At k ij ] log pi θ0(Xi k) i=1 [At k ij ] log pi θ0(Xi k) log pi θ0(Xi k) T ! Substituting these terms in the Taylor expansion, we have: tfj t (θ0 + ϵj th) = tfj t (θ0) h T Vθ0 j t,θ0 + 1 2h T V j t h + op(|h2|). By Lemma 35 (distributed central limit theorem), we have j t,θ0 d N(0, (m Vθ0) 1), thus j t,θ0 is bounded in [P0]-probability by Prokhorov s theorem (Theorem 2.4, Van der Vaart (2000)). By Lemma 34 (distributed law of large number), V j t P0 Vθ0. Wu and Uribe This establishes the stochastic LAN: tfj t (θ0 + ϵj th) + tfj t (θ0) h T Vθ0 j t,θ0 + 1 2h T Vθ0h = sup h K op(|h|) 0. Proof [Theorem 14] Since P j t (A) is uniformly bounded by 1 in L , the set {P j t (A)}t N is uniformly integrable. Then assumption (5.6) implies that P0P j t θ : d(θ, θ0) Mtϵj t 0. The rest of the proof is split into two parts: in the first part, we prove the assertion conditional on an arbitrary compact set K Rp, and in the second part, we use this to prove the asymptotic normality convergence. Throughout the proof, we denote the posterior for hj t = (θ θ0)/ϵj t when θ P j t by Qj t. For all Borel sets B, the posterior for hj t follows from that for θ by Qj t(hj t B) = P j t θ θ0 ϵj t B . The assumption that (5.5) holds at θ0 means that for all j [m], there exists non-singular V j θ0 and asymptotically tight sequence j t,θ0 such that for every compact set K Rp, as t , tfj t (θ0 + ϵj th) + tfj t (θ0) h T V j θ0 j t,θ0 + 1 2h T V j θ0h P0 0. 
(C.1) We denote the normal distribution N( j t,θ, V 1 θ0 ) by Φj t. For a compact subset K Rp such that Qj t(hj t K) > 0, we define a conditional measure Qj K,t of Qj t by Qj K,t(B) = Qj t(B K)/Qj t(K). Similarly, we define a conditional measure Φj K,t corresponding to Φj t. Let K Rp be a compact subset of Rp. For every open neighborhood U Θ of θ0, θ0 + Kϵj t U for large enough t. Since θ0 is an interior point of Θ, for large enough t, the random function ψj t : K K 7 R ψj t (h, h ) = φt(h) qj t (h ) qj t (h) exp tfj t (θ0 + ϵj th) tfj t (θ0 + ϵj th ) + is well-defined, where φt : K R is the Lebesgue density of the distribution N( j t,θ0, V 1 θ0 ), qj t : K R is the Lebesgue density of the prior for the centered and rescaled parameter hj t, and fj t : K R is the random functions defined in (2.11). By the distributed LAN assumption (C.1), we have for every h K, tfj t (θ0 + ϵj th) tfj t (θ0) = h T Vθ0 j t,θ0 1 2h T Vθ0h + o P0(1). Bv M for Distributed Bayes For any two sequences ht, h t K, limt qj t (ht)/qj t (h t) = 1 implies that φt(h t) φt(ht) qj t (ht) + tfj t (θ0 + ϵj tht) tfj t (θ0 + ϵj th t) = (h t ht)T Vθ0 j t,θ0 1 2h T t Vθ0ht + 1 2h t T Vθ0h t + o P0(1) 2(h t j t,θ0)T Vθ0(h t j t,θ0) 1 2(ht j t,θ0)T Vθ0(ht j t,θ0) as t . Since all functions ψj t depend continuously on (h, h ) and K K is compact, we conclude that, sup h,h K ψj t (h, h ) P0 0, as t , (C.2) where the convergence is in outer probability. Assume that K contains a neighborhood of 0 (to guarantee that Φj t(K) > 0) and let Ξj t denote the event that Qj t(K) > 0. Let η > 0 be given and define the events: sup h,h K ψj t (h, h ) η Consider the inequality (recall that the total-variation norm || ||TV is bounded by 2): P0 Qj K,t Φj K,t TV 1Ξj t P0 Qj K,t Φj K,t TV 1Ωj t Ξj t + 2P0(Ξj t \ Ωj t). (C.3) As a result of Equation (C.2), the second term is o(1) as t . The first term on the r.h.s. 
is calculated as follows: 1 2P0 Qj K,t Φj K,t TV 1Ωj t Ξj t = P0 1 dΦj K,t d Qj K,t + d Qj K,t1Ωj t Ξj t φj K,t(h)qj t (h )sj t(h ) φj K,t(h )qj t (h)sj t(h) dΦj K,t(h ) + d Qj K,t(h)1Ωj t Ξj t, where sj t : K R is the function defined as sj t(h) = exp(tfj t (θ0 + ϵj th)), and qj t : K R is the Lebesgue density of the prior for the centered and rescaled parameter hj t. Note that for all h, h K, φj K,t(h)/φj K,t(h ) = φt(h)/φt(h ), since on K, φj K,t differs from φt only by a normalization factor. Using Jensen s inequality (with respect to the Φj K,t-expectation) for the convex function x 7 (1 x)+, we derive: Wu and Uribe 1 2P0 Qj K,t Φj K,t TV 1Ωj t Ξj t 1 sj t(h )qj t (h )φt(h ) sj t(h )qj t (h )φt(h ) + dΦj K,t(h )d Qj K,t(h )1Ωj t Ξj t Ωj t Ξj t sup h,h K ψj t (h, h )dΦj K,t(h )d Qj K,t(h ) Combination with (C.3) shows that for all compact K Rp containing a neighborhood of 0, P0 Qj K,t Φj K,t TV 1Ξj t 0. (C.4) Now, let (Kn) be a sequence of balls centered at 0 with radii Mn . For each n 1, the Equation (C.4) holds. Hence, the intuition is that if we choose a sequence of balls (Kt) that traverses the sequence Kn slowly enough, the convergence of the expected total variational distance to zero can still be guaranteed. Moreover, the corresponding events Ξj t = {Qj t(Kt) > 0} satisfy P0(Ξj t) 1 as a result of assumption (5.6). We conclude that there exists a sequence of radii (Mt) such that Mt and P0 Qj Kt,t Φj Kt,t TV P0 0 (where the conditional probabilities on the l.h.s. are well-defined on sets of probability growing to one). The total variation distance between a measure and its conditional version on a set K satisfies Q QK TV 2Q(Kc). Combining this with assumption (5.6) and the Hellinger integral test, we conclude that P0 Qj t Φj t TV 0, which implies the main result. Proofs of Results in Section 6 Lemma 30 For any function f 0 and two probability measures P, Q, we have Z f(x)d Q(x) DKL(Q P) + log Z exp(f(x))d P(x). 
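Lemma 30 is a Donsker–Varadhan-type variational bound on expectations under $Q$ in terms of the KL divergence and a log-moment under $P$, with equality when $\exp(f)$ is proportional to $dQ/dP$. Before turning to its proof, the inequality can be sanity-checked numerically on a discrete space; the measures `p`, `q` and the function `f` below are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

# Arbitrary probability measures P, Q and a nonnegative function f on a 3-point space.
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.5, 0.25, 0.25])
f = np.array([1.0, 0.5, 2.0])

kl = np.sum(q * np.log(q / p))                # D_KL(Q || P)
lhs = np.sum(q * f)                           # integral of f dQ
rhs = kl + np.log(np.sum(p * np.exp(f)))      # D_KL(Q || P) + log integral of e^f dP
assert lhs <= rhs                             # the inequality of Lemma 30

# Equality holds when exp(f) is proportional to dQ/dP (a constant shift keeps f >= 0):
f_star = np.log(q / p) + 3.0
lhs_star = np.sum(q * f_star)
rhs_star = kl + np.log(np.sum(p * np.exp(f_star)))
assert np.isclose(lhs_star, rhs_star)
```

This is the same inequality that drives the proofs of Lemmas 31 and 32, applied with $f = amt\,d(\theta,\theta_0)$.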
Proof [Lemma 30] By the definition of the KL divergence,

$D_{KL}(Q\|P) + \log\int \exp(f(x))\,dP(x) = \int \log\Big(\frac{dQ(x)}{dP(x)}\int \exp(f(y))\,dP(y)\Big)\,dQ(x)$
$= \int \log\Big(\frac{dQ(x)\int \exp(f(y))\,dP(y)}{\exp(f(x))\,dP(x)}\Big)\,dQ(x) + \int f(x)\,dQ(x) = D_{KL}(Q\|\tilde P) + \int f(x)\,dQ(x) \ge \int f(x)\,dQ(x),$

where $\tilde P$ is given by $d\tilde P(x) = \exp(f(x))\,dP(x)\big/\int \exp(f(y))\,dP(y)$.

Lemma 31 For $P^j_t$ defined in (2.10) and $P_t$ defined in (6.2), we have the following upper bound on the expected loss under $P^j_t$ and $P_0$:

$P_0 P^j_t\, d(\theta,\theta_0) \le \inf_{a>0}\frac{1}{amt}\Big(P_0 D_{KL}(P^j_t\|P_t) + \log P_{\theta_0}P_t\, e^{amt\,d(\theta,\theta_0)} + \frac{1}{m}\sum_{j=1}^m D_{KL}(P_0\|P^j_{\theta_0})\Big).$

Proof [Lemma 31] By Lemma 30, we have

$amt\,P^j_t\,d(\theta,\theta_0) \le D_{KL}(P^j_t\|P_t) + \log P_t\exp(amt\,d(\theta,\theta_0)),$ for all $a > 0$.

Taking $P_0$-expectations on both sides, we have

$amt\,P_0P^j_t\,d(\theta,\theta_0) \le P_0D_{KL}(P^j_t\|P_t) + P_0\log P_t\exp(amt\,d(\theta,\theta_0)).$

By Lemma 30 and the tensorization property of the KL divergence, we have

$P_0\log P_t\exp(amt\,d(\theta,\theta_0)) \le \frac{1}{m}\sum_{j=1}^m D_{KL}(P_0\|P^j_{\theta_0}) + \log P_{\theta_0}P_t\exp(amt\,d(\theta,\theta_0)).$

Putting it all together, we have

$P_0P^j_t\,d(\theta,\theta_0) \le \inf_{a>0}\frac{1}{amt}\Big(P_0D_{KL}(P^j_t\|P_t) + \frac{1}{m}\sum_{j=1}^m D_{KL}(P_0\|P^j_{\theta_0}) + \log P_{\theta_0}P_t\exp(amt\,d(\theta,\theta_0))\Big).$

Lemma 32 Let the assumptions of Theorem 15 hold. We have

$P_{\theta_0}P_t\big(d(\theta,\theta_0) > C_1\epsilon^2\big) \le \exp(-C_0t\epsilon^2) + \exp(-\lambda t\epsilon^2) + 2\exp(-t\epsilon^2),$

for all $\epsilon^2 \ge \epsilon^2_{m,t}$, where $\lambda = \rho - 1$ in assumption (C3).

Proof [Lemma 32] We define the sets

$U = \{\theta : d(\theta,\theta_0) > C_1\epsilon^2\},\qquad K_t = \Big\{\theta : \frac{1}{m}\sum_{j=1}^m D_{1+\lambda}(P^j_{\theta_0}\|P^j_\theta) \le C_3\epsilon^2_{m,t}\Big\}.$

Let $\tilde\Pi$ be the probability measure defined as $\tilde\Pi(B) = \Pi(B\cap K_t)/\Pi(K_t)$. Define the event

$A_t = \Big\{X^{(mt)} : \int \frac{p_\theta}{p_{\theta_0}}(X^{(mt)})\,d\tilde\Pi(\theta) \le \exp(-(C_3+1)t\epsilon^2)\Big\}.$

Let $\Theta_t(\epsilon)$ and $\varphi_t$ be the set and testing function in (C1). We bound $P_{\theta_0}P_t(U)$ by

$P_{\theta_0}P_t(U) \le P_{\theta_0}\varphi_t(X^{(mt)}) + P_{\theta_0}(A_t) + P_{\theta_0}\Big[(1-\varphi_t(X^{(mt)}))\,\mathbb I\{(A_t)^c\}\,\frac{\int_U \frac{p_\theta}{p_{\theta_0}}(X^{(mt)})\,d\Pi(\theta)}{\int_\Theta \frac{p_\theta}{p_{\theta_0}}(X^{(mt)})\,d\Pi(\theta)}\Big].$

By (C1), we bound the first term by $P_{\theta_0}\varphi_t(X^{(mt)}) \le \exp(-C_0t\epsilon^2)$.
By Jensen s inequality, we have Dρ(Pθ0 Pθ) = 1 ρ 1 log Z [d P i θ0] ρ m [d P i θ0] 1 m ρ 1 ρ 1 log Z [d P i θ0]ρ[d P i θ0]1 ρdµ = 1 i=1 Dρ(Pi θ0 Pi θ) Using the definitions of event At and Markov inequality, we have Pθ0(At) = Pθ0 pθ pθ0 (X(mt))d Π(θ) λ < exp(λ(C3 + 1)tϵ2 ! exp( λ(C3 + 1)tϵ2)Pθ0 pθ pθ0 (X(mt))d Π(θ) λ exp( λ(C3 + 1)tϵ2) Z X mt (pθ0(x(mt)))λ (pθ(x(mt)))λ d Pθ0(x(mt)) = exp( λ(C3 + 1)tϵ2) Z Θ exp(λD1+λ( k=1 Pθ))d Π(θ) = exp( λ(C3 + 1)tϵ2) Z Kt exp(λt D1+λ(Pθ0 Pθ))d Π(θ) exp( λ(C3 + 1)tϵ2) Z Kt exp(λt 1 j=1 D1+λ(Pj θ0 Pj θ))d Π(θ) exp( λ(C3 + 1)tϵ2 + λC3tϵ2 m,t) exp( λtϵ2). Let s analyze the third term. Conditioned on (At)c, we have pθ pθ0 (X(mt))dΠ(θ) Π(Kt) Z pθ pθ0 (X(mt))d Π(θ) exp( (C2 + C3 + 1)tϵ2), Bv M for Distributed Bayes where the last inequality follows from C3. It follows that Pθ0(1 φt(X(mt)))I(At)c U pθ pθ0 (X(mt))dΠ(θ) R Θ pθ pθ0 (X(mt))dΠ(θ) exp((C2 + C3 + 1)tϵ2)Pθ0 pθ pθ0 (X(mt))dΠ(θ) exp((C2 + C3 + 1)tϵ2) U Θt(ϵ) Pθ(1 φt(X(mt)))dΠ(θ) + Π (Θt(ϵ)c) exp((C2 + C3 + 1)tϵ2)(exp( Ctϵ2) + exp( Ctϵ2)), where the last inequality follows from C1 and C2. Since C0 > C3 + C2 + 2, we obtain the upper bound for the third term. Pθ0(1 φt(X(mt)))I(At)c U pθ pθ0 (X(mt))dΠ(θ) R Θ pθ pθ0 (X(mt))dΠ(θ) 2 exp( tϵ2). Putting the bounds all together, we have Pθ0Pt(U) exp( C0tϵ2) + exp( λtϵ2) + 2 exp( tϵ2) Lemma 33 (Sub-exponential Bound) Let the random variable X satisfy P(X δ) c1 exp( c2δ), for all δ δ0 > 0. Then, for any 0 < a 1 2c2, we have E exp(a X) exp(aδ0) + c1, Proof [Lemma 33] Define Y = exp(a X) for some 0 < a 1 2c2. For any C > 0, C P(Y y)dy = C + Z a log y)dy C + c1 C y c2/ady. Take C = exp(aδ0). Since a 1 2c2, we have EY exp(at) + c1 exp( aδ0) exp(aδ0) + c1. Proof [Proof of Theorem 15] By Lemma 32, for all δ δ0, we have Pθ0Pt(d(θ, θ0) > δ) c1 exp( c2δ). Wu and Uribe Take c1 = 4, c2 = mt min(λ, 1)/C1 as C0 > C1 + C2 + 2 > 1 and δ0 = C1mϵ2 m,t. By Lemma 33, we have Pθ0Pt exp(amtd(θ, θ0)) exp(amt C1ϵ2 m,t) + 4, for all a min(λ, 1)/(2C1). 
Take a = min(λ, 1)/(2C1). By Lemma 31, we have P0P j t d(θ, θ0) mtγ2 j,m,t + log(4 + exp(a C1mtϵ2 m,t)) + 1 m Pm j=1 DKL(P0 Pj θ0) a + C1ϵ2 m,t + 4 exp( a C1mtϵ2 m,t) amt + Pm j=1 DKL(P0 Pj θ0) ϵ2 m,t + γ2 j,m,t + 1 m2t j=1 DKL(P0 Pj θ0) for some C that depends only on C0, C1, λ. Proof [Corollary 16] The first result is a consequence of Markov s inequality. d(θ, θ0) > Mt(ϵ2 m,t + γ2 j,m,t + 1 m2t j=1 DKL(P0 Pj θ0)) P0P j t d(θ, θ0) Mt(ϵ2 m,t + γ2 j,m,t + 1 m2t Pm j=1 DKL(P0 Pj θ0)) C The second result follows from Jensen s inequality P0d(P j t θ, θ0) P0Ptd(θ, θ0) C(ϵ2 m,t + γ2 j,m,t + 1 m2t j=1 DKL(P0 Pj θ0)) Proof [Lemma 17] We have P0DKL(P j t Pt) = P0P j t i=1 [At k ij ] log Pi θ(Xi k) 1 i=1 log Pi θ(Xi k) i=1 [At k ij ] log Pi θ(Xi k) 1 i=1 log Pi θ(Xi k) i=1 [At k ij ]P0 log Pi θ 1 i=1 P0 log Pi θ [At k ij ] 1 P j t P0 log Pi θ [At k ij ] 1 n P0 log p0 P j t DKL(P0 Pi θ) o . Bv M for Distributed Bayes For each j, we have P0DKL(P j t Pt) i=1 |[At k ij ] 1 m| P0 log p0 max i [m] inf θ Θ DKL(P0 Pi θ) i=1 |[At k ij ] 1 m| |P0 log p0| + max i [m] inf θ Θ DKL(P0 Pi θ) . By Lemma 1, we get P0DKL(P j t Pt) 16m2 log m |P0 log p0| + 1 i=1 DKL(P0 Pi θ0) By Assumption 2(a), we have γ2 j,m,t 16m log m |P0 log p0| + max i [m] inf θ Θ DKL(P0 Pi θ) . C.3 Proofs of Results in Section 7 Proof [Proposition 19] Define T [t] as the set of τ where Gτ = G. This allows us to break the left-hand side of the inequality into two parts: τ [k,t 1] T A h A|[k,t 1] T |i As t , |T |/t a.s. λ and |[k, t 1] T | a.s. λ(t k). Then with probability 1, we have = lim sup t Define δ = max(|λ2(A)|, |λm(A)|). The spectral radius of A is 1, thus δ < 1 by the Perron Frobenius theorem. Under the assumptions 1, by the standard property of stochastic matrices (see e.g. Rosenthal (1995)), the diagonalizable matrix A satisfies e T i Aλt 1 m1 1 mδλt. (C.5) for any i [m], using the fact that 1 m1 is the stationary distribution of the Markov chain with transition matrix A. Assume that λ c m. 
For any t k t = log m c +log λ λ log δ , mδλ(t k) mδλ t m Wu and Uribe Since e T i Aλt 1 m1 1 2 by the double stochasticity of A, we use (C.5) to break the quantity of interest Pt k=1 Pm j=1 |[Aλ(t k) ij ] 1 m| into two parts. For any i [m], j=1 |[Aλ(t k) ij ] 1 k=1 e T i Aλ(t k) 1 k=1 e T i Aλ(t k) 1 k>t t e T i Aλ(t k) 1 k=1 mδλ(t k) + 2 t mδλ t 1 δ + 2 t c λ(1 δ) + 2 log m c + 2 log λ λ log δ . Noting that 1 δ log δ and δ = 1 ν 4m2 for any doubly stochastic matrix A, we get j=1 | h Aλ(t k)i m| c + 2 log m c + log λ λ(1 δ) 4m2c + 8m2 log m c + 8m2 log λ λν . m, the optimal bound is achieved when c = 1 j=1 | h Aλ(t k)i m| 8m2(1 + log m 2 ) + 8m2 log λ λν 16m2 log m + 8m2 log λ m, the optimal bound is achieved when c = λ, j=1 | h Aλ(t k)i Finally, when λ = 0, = lim sup t Bv M for Distributed Bayes Proof [Corollary 20] We have P0DKL(P j t Pt) = P0P j t ij log Pi θ(Xi k) 1 i=1 log Pi θ(Xi k) ij log Pi θ(Xi k) 1 i=1 log Pi θ(Xi k) ij P0 log Pi θ 1 i=1 P0 log Pi θ P j t P0 log Pi θ n P0 log p0 P j t DKL(P0 Pi θ) o . For each j, we have P0DKL(P j t Pt) m| P0 log p0 max i [m] inf θ Θ DKL(P0 Pi θ) m| |P0 log p0| + max i [m] inf θ Θ DKL(P0 Pi θ) . This implies that P0DKL(P j t Pt) lim sup t |P0 log p0| + 1 i=1 DKL(P0 Pi θ0) Applying Proposition 19 gives us the desired results. Proof [Corollary 21] Since Pt does not depend on the communication graph, Theorem 15 still holds. This implies the following upper bound on the contraction rate: P0P j t d(θ, θ0) ϵ2 m,t + γ2 j,m,t + 1 m2t j=1 DKL(P0 Pj θ0). (C.6) The only term depending on the communication graph is γ2 j,m,t. By Corollary 21, we have the following upper bounds with probability 1 (with respect to the probability measure of A1, A2, ). If λ 2 γ2 j,m,t 16m log m + 8m log λ |P0 log p0| + max i [m] inf θ Θ DKL(P0 Pi θ) . Wu and Uribe If 0 < λ < 2 γ2 j,m,t 4m2 |P0 log p0| + max i [m] inf θ Θ DKL(P0 Pi θ) . 
Combining the upper bounds with Equation (C.6), we have: If λ 2 P0P j t d(θ, θ0) 1 t +16m log m + 8m log λ |P0 log p0| + max i [m] inf θ Θ DKL(P0 Pi θ) + 1 j=1 DKL(P0 Pj θ0). If 0 < λ < 2 P0P j t d(θ, θ0) 1 |P0 log p0| + max i [m] inf θ Θ DKL(P0 Pi θ) + 1 m2t j=1 DKL(P0 Pj θ0). This is the desired result after we remove the constants. C.4 Proofs of Results in Section 8 Proof [Lemma 22] By Definition (2.10), log pj t(θ) = tfj t (θ) + π(θ) + const. for fj t defined in Equation (2.11). We can compute fj t (θ) explicitly , fj t (θ) = 1 i=1 [At k ji ] log pi θ(Xi t) i=1 [At k ji ] θ, T i(Xi k) ψi(θ) + const i=1 [At k ji ]T i(Xi k) i=1 [At k ji ]ψi(θ) This implies that log pj t(θ) = tfj t (θ) + π(θ) + const i=1 [At k ji ]T i(Xi k) i=1 [At k ji ]ψi(θ) + θ, u ψ0(θ) + const i=1 [At k ji ]T i(Xi k) + u i=1 [At k ji ]ψi(θ) ψ0(θ) + const. Bv M for Distributed Bayes This shows that pj t is a member of the exponential family of the form (8.3) with sufficient statistic χj t + u and log-partition function Bj t (θ) + ψ0(θ) as defined in Equation (8.4). Proof [Lemma 23] By Lemma 22, P j t is a member of the exponential family given by Equation (8.3). The full rankness of P j t follows directly from the definition of full-rank exponential family and the compact representation (8.3). By Lemma 4.5 of Van der Vaart (2000), the log-partition function Bj t : Θ R is infinitely differentiable with the gradient being the mean parameter of χj t and the Hessian is the covariance matrix of χj t. Since P j t is full-rank, it is equivalent to the nonsingularity of the covariance matrix for χj t Van der Vaart (2000). This implies that the gradient θBj t is one-to-one in the interior of Θ, thus, θBj t is invertible. Proof [Proposition 24] By Lemma (23), the sequence of estimators ˆθj t is well-defined with [P0] probability tending to 1. 
By Theorem 4.6 of Van der Vaart (2000), the sequence of estimators ˆθj t has a limiting distribution t(ˆθj t θ0) d N(0, Ij 1 θ0 ), where Ij θ0 is the Fisher information for Pj θ at θ0. This corresponds to the Sandwich covariance matrix. Ij θ0 = h φ 1 θ0 Covθ0(χj t)φ 1 θ0 i 1 = Covθ0(χj t). Since Bj t is infinitely differentiable in a neighborhood of θ0, the Fisher information Ij θ must be uniformly bounded. Thus, log pj θ is Lipschitz around θ0 and Assumption 2(d) is satisfied. When int(Θ) = , the exponential family Pi Θ satisfies the differential in the quadratic mean Assumption 2(c) by Example 7.7 of Van der Vaart (2000). Given that θ0 int(Θ) and π have support on Θ, the prior π puts positive mass on every neighborhood of θ0, thus Assumption 3(b) is satisfied. Under the full-rank assumptions on Pj θ, the Hessian 2 θ log pj θ is negative definite for all θ. Thus, the Hessian of fj t is negative definite as a positive linear combination of 2 θ log pj θ. We have that fj t is strictly concave on Θ with a unique maximizer at θj t. The strict concavity implies that lim t P0( inf θ ˆθj t >δ |fj t (θ) fj t (ˆθj t)| ϵ) = 1 By Lemma 7, θj t is consistent at θ0. Then Assumption 4(a) is satisfied as a consequence of Lemma 8. The assumptions of Theorem 10 are all satisfied. Considering the sequence of random variables θ P j t centered at the moment estimator ˆθj t, we have Z Θ |qj t (x) N(0, V 1 θ0 )|dx P0 0, where qj t is the density of t(θ ˆθj t), and Vθ0 = 1 m Pm i=1 Cov(T i). Wu and Uribe Proof [Proposition 25] The prior π is assumed to have full support over Rp, which trivially satisfies Assumption 3(a). Define functions fj t , ft, and f on Θ as per Equations (2.11) (2.13). By Assumption (2)(a), the function f exists and we note that fj t (θ) P0 f(θ). The estimators ˆθj t and the true parameter θ0 are respectively given by: ˆθj t = arg min θ Θ fj t (θ), θ0 = arg min θ Θ f(θ). 
The gradient of fj t is given by fj t (θ) = 1 t T j t + 1 [At k ji ]Xi ke θ,Xi k 1 + e θ,Xi k , and the Hessian is 2fj t (θ) = 1 [At k ji ]Xi k Xi T k e θ,Xi k 1 + e θ,Xi k 2 . Given that 2fj t (θ) > 0, the function fj t is strictly convex in θ. This implies that the minimizer ˆθj t is unique, and we have ˆθj t int(Θ). The gradient of f is given by " X1 1e θ,X1 1 1 + e θ,X1 1 X1 1Y 1 1 1 + e θ,X1 1 e θ0,X1 1 1 + e θ0,X1 1 The differentiation and expectation are interchanged by the Lebesgue-dominated convergence theorem since | f(θ)| |X1 1| and P0|X1 1| < by Assumption ii). Since f(θ) is strictly increasing in θ, we have f(θ ϵ) < 0 < f(θ + ϵ) in each coordinate. Hence, Assumption 2(b) is satisfied. By Lemma 7, we have ˆθj t P0 θ0. Let E = {η R : |σ(η)| < }. The set E is open and nonempty. Additionally, η is identifiable, as σ(η) is a one-to-one function. Trivially, the set Θ is open and convex, and θT x E for all θ Θ and x X. For all η E, we have 0 < σ(η) < 1 and |σ (η)| = |σ(1 σ(η))(1 2σ(η))2 2σ2(1 σ(η))2 (1 σ(η))2 | 3. After algebraic manipulations, we have 3 θ log pj θ(. | xj k)a,b,c = σ ( θ, xj k )xj k,axj k,bxj k,c 3xj k,axj k,bxj k,c, for all θ Θ, xj k X and a, b, c [p]. Thus, 3 θ log pj θ is uniformly bounded by an integrable function. Bv M for Distributed Bayes The Fisher information for agent j is given by V j θ0 = P0 2 θ log pj θ(. | Xj k) = P0 Xj T k e θ,Xj k Xj k 1 + e θ,Xj k 2 By Assumption i), V j θ0 exists and is nonsingular. Hence, Assumption 2(f) is satisfied. Assumption 2(c), 4(a) are satisfied with the same argument as Proposition 24. By Corollary 12 and Slusky s theorem, we have Z qj t (x) N(0, ˆV 1 θ0 ) dx P0 0. where ˆVθ0 is the finite-sample version of 1 m Pm j=1 V j θ0. Specifically, i=1 Xi T diag e Pp j=0 θjxi 1j 1 + e Pp j=0 θjxi 1j 2 , . . . , e Pp j=0 θjxi tj 1 + e Pp j=0 θjxi tj 2 Proof [Lemma 26] By Lemma 2, fj t (θ) P0 f(θ) for each θ Θ. For all θ [0, 1]2, fj t are uniformly bounded. 
|fj t (θ)| 1 i=1 [At k ji ] " (|Xj t θ|)2 i=1 [At k ji ] " 1 σj2 + log " 1 σj2 + log !# 1 + 16m2 log m The gradient fj t (θ) is uniformly bounded over [0, 1]2. To see this, consider fj t (θ) = A + B, where A, B are given by i=1 [At k ji ] σj2 |Xj t Zj|(θ Zj) i=1 [At k ji ] (θ Zj) h φ |Zj|+ 1 σj φ |θ Zj| σj|θ Zj| h Φ |Zj|+ 1 σj Φ |θ Zj| Wu and Uribe Applying the following upper bounds on A and B, i=1 [At k ji ] i=1 [At k ji ] φ |Zj|+ 1 σj Φ |Zj|+ 1 fj t (θ) = A + B 2 σj2 + φ( |Zj|+ 1 σj Φ( |Zj|+ 1 1 + 16m2 log m Then fj t is uniformly equicontinuous in t, as fj t is Lipschitz. By Lemma 36, we have fj t f P0 0. Note that ˆθj t = arg minθ [0,1]2 fj t (θ) and θ0 = arg minθ [0,1]2 f(θ). By the Argmax theo- rem (Theorem 3.2.2, van der Vaart and Wellner (2023)), we have ˆθj t P0 θ0. Since θ0 Θ, for small enough ϵ, we have P0(|ˆθj t θ0| > ϵ) = P0(|ˆθj t θ0| > ϵ, ˆθj t / Θ) + P0(|ˆθj t θ0| > ϵ, ˆθj t Θ) = P0(ˆθj t / Θ) + P0(|ˆθj t θ0| > ϵ, ˆθj t Θ) 0, thus P0(ˆθj t / Θ) 0. Proof [Proposition 27] The prior π is assumed to have full support over Θ, which trivially satisfies Assumption 3(a). From the proof of Lemma 26, we proved that fj t is uniformly bounded for all t, thus fj t is uniformly equicontinuous. By definition, there exists δ > 0 such that θ ˆθj t > δ implies that |fj t (θ) fj t (ˆθj t)| > ϵ. This verifies Assumption 4(a). The gradient of log pj θ is given by log pj θ(Xj t ) = θ Zj |θ Zj| |Xj t Zj| φ |Zj|+ 1 σj φ |θ Zj| σj Φ |θ Zj| This allows us to derive the Fisher information of pj θ. V j θ = P0[ θ log pj θ θ log pj T θ ] = (θ Zj)(θ Zj)T |θ Zj| |θ0 Zj| φ( |Zj|+ 1 σj ) φ( |θ Zj| σj ) Φ( |θ Zj| Bv M for Distributed Bayes At θ0, the Hessian simplifies to V j θ0 = (θ0 Zj)(θ0 Zj)T σj4|θ0 Zj|2 " φ( |Zj|+ 1 σj ) φ( |θ0 Zj| σj ) Φ( |θ0 Zj| which is nonsingular. Thus, Assumption 2(e) is satisfied. 
Reusing the argument in the proof of Lemma 26, we have sup θ [0,1]2 log pj θ 2 σj2 + φ( |Zj|+ 1 σj Φ( |Zj|+ 1 This implies that log pj θ is a continuously differentiable function over a compact domain [0, 1]2. Thus, it is Lipschitz continuous. Assumption 2(d) is satisfied. The Hessian for fj t is simply the average of agent Fisher information: 2fj t (θ) = 1 i=1 [At k ji ]V i θ . Recall that in the proof of Since fj t is uniformly equicontinuous, We define the event Et as E = {X(mt) : ˆθj t Θ}. Conditioning on Et, by Corollary 12 and Slusky s theorem, we have Z qj t (x) N(0, V 1 θ0 ) dx P0 0, for qj t defined as the density of t(θ ˆθj t) and Vθ0 defined in Equation (8.16). By Lemma 26, limt P0(Et) = 1. This completes our proof. Appendix C. Supporting Results Proof [Lemma 1] Define δ = max(|λ2(A)|, |λm(A)|). The spectral radius of A is 1, thus δ < 1 by the Perron-Frobenius theorem. Under the assumptions 1, by the standard property of stochastic matrices (see e.g. Rosenthal (1995)), the diagonalizable matrix A satisfies m1 1 mδt (C.7) for any i [m], using the fact that 1 m1 is the stationary distribution of the Markov chain with transition matrix A. For any t k t = log m mδt k mδ t m Wu and Uribe Since e T i At 1 m1 1 2 by the double stochasticity of A, we use (C.7) to break the quantity of interest Pt k=1 Pm j=1 |[At k ij ] 1 m| into two parts, that is, for any i [m], j=1 |[At k ij ] 1 k=1 e T i At k 1 k=1 e T i At k 1 k>t t e T i At k 1 k=1 mδt k + 2 t mδ t 1 δ + 2 t 2 1 δ + 2 log m Noting that 1 δ log δ and δ = 1 ν 4m2 for any doubly stochastic matrix A, we get j=1 |[At k ij ] 1 m| 2 + 2 log m 2 1 δ 4 log m 1 δ 16m2 log m Lemma 34 (Distributed Law of Large Numbers) Assume that Si t, t 1 are i.i.d. random variables where E[|Si t|] exists and is finite, for all i [m]. Under Assumption 1, for any j [m], the random variables k=1 [At k ij ]Si t converge in probability to 1 m Pm i=1 E[Si 1] as t . Proof Let Zt = [Z1 t , , Zm t ]T and St = [S1 t , , Sm t ]T . 
Since $E[|S^i_t|]$ exists and is finite, we may write

$Z_t = \frac{1}{t}\sum_{k=1}^t A^{t-k}S_k.$

Using the property of stochastic matrices (Rosenthal, 1995), the diagonalizable matrix $A$ satisfies

$\Big\|e_i^T A^t - \frac{1}{m}\mathbf 1^T\Big\|_1 \le m\delta^t$

for all $i\in[m]$ and some $\delta < 1$. This implies that

$\Big\|Z_t - \frac{1}{t}\sum_{k=1}^t \frac{1}{m}\mathbf 1\mathbf 1^T S_k\Big\| \le \frac{1}{t}\sum_{k=1}^t \Big\|\Big(A^{t-k} - \frac{1}{m}\mathbf 1\mathbf 1^T\Big)S_k\Big\|$
$\le \frac{1}{t}\sum_{k=1}^t \max_{i\in[m]}\Big\|e_i^TA^{t-k} - \frac{1}{m}\mathbf 1^T\Big\|\,\max_{i\in[m]}|S^i_k|$, by the Cauchy–Schwarz inequality,
$\le \frac{1}{t}\sum_{k=1}^t m\delta^{t-k}\max_{i\in[m]}|S^i_k| \le \frac{m}{t}\sum_{k=1}^t\sum_{i=1}^m \delta^{t-k}|S^i_k| \le \frac{m\delta^c}{t}\sum_{k=1}^{t-c}\sum_{i=1}^m |S^i_k| + \frac{m}{t}\sum_{k=t-c+1}^{t}\sum_{i=1}^m |S^i_k|,$

for some fixed constant $c$. By the strong law of large numbers, $\frac{1}{t}\sum_{k=1}^t \frac{1}{m}\mathbf 1\mathbf 1^T S_k - \frac{1}{m}\sum_{i=1}^m E[S^i_1]\mathbf 1 = o_d(1)$. Then

$\Big\|Z_t - \frac{1}{m}\sum_{i=1}^m E[S^i_1]\mathbf 1\Big\| \le \Big\|Z_t - \frac{1}{t}\sum_{k=1}^t\frac{1}{m}\mathbf 1\mathbf 1^TS_k\Big\| + o_d(1) \le \frac{m\delta^c}{t}\sum_{k=1}^{t}\sum_{i=1}^m |S^i_k| + o_d(1) \le m\delta^c\sum_{i=1}^m E[|S^i_1|] + o_d(1),$

where the last two steps use that the sum over the final $c$ rounds has expectation $mc\sum_i E[|S^i_1|]/t \to 0$, and then apply the strong law of large numbers. Since this holds for arbitrarily large $c$ and $\delta < 1$, we conclude that $Z_t - \frac{1}{m}\sum_{i=1}^m E[S^i_1]\mathbf 1 = o_d(1)$, that is, $Z_t \to \frac{1}{m}\sum_{i=1}^m E[S^i_1]\mathbf 1$ in probability.

Lemma 35 (Distributed Central Limit Theorem) Assume that $S^i_t$, $t\ge 1$, are i.i.d. random variables such that $E[S^i_t] = \mu_i$ and $\mathrm{Cov}[S^i_t] = \Sigma_i$ exist and are finite, for all $i\in[m]$. Under Assumption 1, for any $j\in[m]$, the random variables

$\Big(\frac{t}{m^2}\sum_{i=1}^m \Sigma_i\Big)^{-1/2}\sum_{k=1}^t\sum_{i=1}^m [A^{t-k}]_{ij}(S^i_k - \mu_i)$

converge in distribution to a standard normal distribution as $t\to\infty$.

Proof The proof applies the Lindeberg central limit theorem. Without loss of generality, assume that $\mu_i = 0$. Let $\Sigma_{m,t}$ be the sum of covariances:

$\Sigma_{m,t} = \sum_{k=1}^t\sum_{i=1}^m \mathrm{Cov}\big([A^{t-k}]_{ij}S^i_k\big) = \sum_{k=1}^t\sum_{i=1}^m [A^{t-k}]_{ij}^2\,\Sigma_i.$

The matrix $\Sigma_{m,t}$ is asymptotically equivalent to the scaling factor $\Sigma'_{m,t} = \frac{t}{m^2}\sum_{i=1}^m\Sigma_i$:

$(\Sigma'_{m,t})^{-1}\Sigma_{m,t} = m^2\Big(\sum_{i=1}^m\Sigma_i\Big)^{-1}\sum_{i=1}^m\Sigma_i\,\frac{1}{t}\sum_{k=1}^t[A^{t-k}]_{ij}^2.$

By Assumption 1, $\lim_{k\to\infty}[A^k]_{ij} = \frac{1}{m}$, thus $\lim_{k\to\infty}[A^k]_{ij}^2 = \frac{1}{m^2}$, and the Cesàro averages satisfy $\frac{1}{t}\sum_{k=1}^t[A^{t-k}]_{ij}^2 \to \frac{1}{m^2}$. Then we have

$\lim_{t\to\infty}(\Sigma'_{m,t})^{-1}\Sigma_{m,t} = m^2\Big(\sum_{i=1}^m\Sigma_i\Big)^{-1}\sum_{i=1}^m\Sigma_i\cdot\frac{1}{m^2} = I.$

It remains to verify the Lindeberg condition

$\lim_{t\to\infty}\sum_{k=1}^t\sum_{i=1}^m E\Big[\big\|\Sigma_{m,t}^{-1/2}[A^{t-k}]_{ij}S^i_k\big\|^2\,\mathbb I\big\{\|\Sigma_{m,t}^{-1/2}[A^{t-k}]_{ij}S^i_k\| > \epsilon\big\}\Big] = 0,\qquad \forall\epsilon > 0.$

By the asymptotic equivalence above, the condition is equivalent to

$\lim_{t\to\infty}\sum_{k=1}^t\sum_{i=1}^m E\Big[\big\|(\Sigma'_{m,t})^{-1/2}[A^{t-k}]_{ij}S^i_k\big\|^2\,\mathbb I\big\{\|(\Sigma'_{m,t})^{-1/2}[A^{t-k}]_{ij}S^i_k\| > \epsilon\big\}\Big] = 0,\qquad \forall\epsilon > 0.$
Since $\lim_{t}\Sigma'_{m,t} = \infty$ and $\lim_{t}\Sigma_{m,t} = \infty$, the indicators $\mathbb I\{\|(\Sigma'_{m,t})^{-1/2}[A^{t-k}]_{ij}S^i_k\| > \epsilon\}$ converge to 0 almost surely. Because $\max_{i\in[m]}\sup_t [A^t]_{ij}^2 < \infty$, we have

$\max_{i\in[m]}\sup_t\, t\,E\Big[\big\|(\Sigma'_{m,t})^{-1/2}[A^{t-k}]_{ij}S^i_k\big\|^2\Big] = \max_{i\in[m]}\sup_t E\Big[\Big\|\Big(\frac{1}{m^2}\sum_{i=1}^m\Sigma_i\Big)^{-1/2}[A^{t-k}]_{ij}S^i_k\Big\|^2\Big] \le m\Big(\max_{i\in[m]}\sup_t[A^t]_{ij}^2\Big)\Big\|\Big(\frac{1}{m}\sum_{i=1}^m\Sigma_i\Big)^{-1}\Big\|_{op}\max_{i\in[m]}\|\Sigma_i\|_{op} < \infty.$

By the Lebesgue dominated convergence theorem, we have

$\lim_{t\to\infty} t\,E\Big[\big\|(\Sigma'_{m,t})^{-1/2}[A^{t-k}]_{ij}S^i_k\big\|^2\,\mathbb I\big\{\|(\Sigma'_{m,t})^{-1/2}[A^{t-k}]_{ij}S^i_k\| > \epsilon\big\}\Big] = 0.$

The Cesàro averages therefore also converge to 0:

$\lim_{t\to\infty}\frac{1}{t}\sum_{k=1}^t t\,E\Big[\big\|(\Sigma'_{m,t})^{-1/2}[A^{t-k}]_{ij}S^i_k\big\|^2\,\mathbb I\big\{\|(\Sigma'_{m,t})^{-1/2}[A^{t-k}]_{ij}S^i_k\| > \epsilon\big\}\Big] = 0,\qquad \forall\epsilon > 0.$

This completes the proof.

Lemma 36 (Miller (2021)) Suppose that $h_n : E \to F$ for $n\in\mathbb N$, where $E$ is a totally bounded space and $F$ is a normed space. If $h_n$ converges pointwise and is equicontinuous, then it converges uniformly.
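Both the geometric mixing inequality (C.7) used in the proof of Lemma 1 and the distributed law of large numbers (Lemma 34) can be checked by simulation. The sketch below uses a hypothetical 3-agent doubly stochastic mixing matrix $A$ and Gaussian agent signals, chosen only for illustration: it verifies $\|e_i^T A^t - \frac{1}{m}\mathbf 1^T\|_1 \le m\delta^t$ and that each agent's weighted running average $Z_t = \frac{1}{t}\sum_{k=1}^t A^{t-k}S_k$ approaches the network-wide mean $\frac{1}{m}\sum_i E[S^i_1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3
# Doubly stochastic, symmetric (hence diagonalizable) mixing matrix: lazy complete graph.
A = np.full((m, m), 0.25) + 0.25 * np.eye(m)
mu = np.array([1.0, 2.0, 3.0])                 # per-agent means E[S^i_1]

# Mixing bound (C.7): ||e_i^T A^t - (1/m) 1^T||_1 <= m * delta^t.
delta = max(abs(np.linalg.eigvalsh(A)[:-1]))   # second-largest eigenvalue modulus
for t in (1, 5, 10):
    At = np.linalg.matrix_power(A, t)
    gap = np.abs(At - 1.0 / m).sum(axis=1).max()
    assert gap <= m * delta**t

# Distributed LLN (Lemma 34): Z_t = (1/t) sum_k A^{t-k} S_k -> (1/m) sum_i mu_i * 1.
t_max = 20000
w = np.zeros(m)
for _ in range(t_max):
    s = mu + rng.standard_normal(m)            # fresh i.i.d. signals S_k
    w = A @ w + s                              # W_t = A W_{t-1} + S_t = sum_k A^{t-k} S_k
z = w / t_max
assert np.allclose(z, mu.mean(), atol=0.05)    # every agent's estimate is near 2.0
```

The recursion `w = A @ w + s` avoids forming the powers $A^{t-k}$ explicitly; unrolling it recovers $W_t = \sum_{k=1}^t A^{t-k}S_k$ exactly.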