# Decentralized Langevin Dynamics for Bayesian Learning

Anjaly Parayil¹, He Bai², Jemin George¹, and Prudhvi Gurram¹,³
¹CCDC Army Research Laboratory, Adelphi, MD 20783, USA
²Oklahoma State University, Stillwater, OK 74078, USA
³Booz Allen Hamilton, McLean, VA 22102, USA
¹panjaly05@gmail.com, jemin.george.civ@mail.mil, pkgurram@ieee.org, ²he.bai@okstate.edu

Motivated by decentralized approaches to machine learning, we propose a collaborative Bayesian learning algorithm taking the form of decentralized Langevin dynamics in a non-convex setting. Our analysis shows that the initial KL-divergence between the Markov chain and the target posterior distribution decreases exponentially, while the error contribution to the overall KL-divergence from the additive noise decreases polynomially in time. We further show that the polynomial term experiences a speed-up with the number of agents, and we provide sufficient conditions on the time-varying step-sizes to guarantee convergence to the desired distribution. The performance of the proposed algorithm is evaluated on a wide variety of machine learning tasks. The empirical results show that the performance of individual agents with locally available data is on par with the centralized setting, with a considerable improvement in the convergence rate.

1 Introduction

With the recent advances in computational infrastructure, there has been an increase in the use of larger machine learning models with millions of parameters. Even though there is a parallel increase in the size of training datasets for these models, there is a significant disparity between the amount of existing data and the data required to train the large models to avoid overfitting and provide good generalization performance. Models trained in point-estimate settings such as Maximum A Posteriori (MAP) neglect any associated epistemic uncertainties and make overconfident predictions. The Bayesian learning framework provides a principled way to avoid over-fitting and to model uncertainties by estimating the posterior distribution of the model parameters. However, analytically obtaining or sampling from the exact posterior is often impossible due to the intractability of the evidence. Therefore, one needs to resort to approximate Bayesian methods such as Markov Chain Monte Carlo (MCMC) sampling techniques. To this effect, we focus on a specific class of MCMC methods, called Langevin dynamics, to sample from the posterior distribution and perform Bayesian machine learning. Langevin dynamics derives motivation from diffusion approximations and uses the information of a target density to efficiently explore the posterior distribution over parameters of interest [1]. Langevin dynamics, in essence, is the steepest-descent flow of the relative entropy functional, or the KL-divergence, with respect to the Wasserstein metric [2-4]. Just as the gradient flow converges exponentially fast under a gradient-domination condition, Langevin dynamics converges exponentially fast to the stationary target distribution if the relative entropy functional satisfies the log-Sobolev inequality [3-5]. The Unadjusted Langevin Algorithm (ULA) is a popular inexact first-order discretized implementation of the Langevin dynamics without an acceptance/rejection criterion.
Analysis of the convergence properties of the ULA and other Langevin approximations has been a topic of active research over the past several years [6-12]. Reference [3] shows that a bias exists in the ULA for any arbitrarily small (fixed) step-size, even for a Gaussian target distribution. Controlling the bias and establishing exponential convergence of the KL-divergence for strongly log-concave, smooth target distributions using the ULA is discussed in [3, 6, 7, 9-11]. Non-asymptotic bounds on the variation error of the Langevin approximations for smooth log-concave target distributions have been established by [6] and [8]. Assuming a Lipschitz-continuous Hessian, [6] introduces a modified version of the Langevin algorithm requiring fewer iterations to achieve the same precision level. Tight relations between Langevin Monte Carlo for sampling and gradient descent for optimization for (strongly) log-concave target distributions are presented in [7]. Similarly, using the notion of gradient flows over probability space and the KL-divergence, [9] analyzes the non-asymptotic convergence of discretized Langevin diffusion. These results were improved and extended, with particular emphasis on the scalability of the approach with the dimension, smoothness, and curvature of the function of interest, in [10-12]. Compared to log-concave ULA settings, where local properties replicate the global behavior and optimal values are attained in a single pass, non-convex objective functions naturally require multiple passes through the training data. Analysis of the ULA in such cases often requires assuming that the negative log of the target distribution satisfies some dissipative property [13-17] or contractivity condition [18], or limiting the non-convexity to a local region [19, 20]. In particular, [13] makes the first attempt at analyzing non-asymptotic convergence in a non-convex setting and shows that SGLD tracks the continuous Langevin diffusion in the quadratic Wasserstein distance for empirical risk minimization. Recent work [14, 19, 21] compares the computational efficiency of sampling algorithms to that of optimization methods in the non-convex setting. The approach is extended to relaxed dissipativity conditions and dependent data streams, with sharper convergence estimates uniform in the number of iterations, in [15, 16]. More recently, it has been shown that the convergence is polynomial in the dimension and the error tolerance [17, 18, 20]. Besides the ULA, higher-order Langevin diffusions for accelerated sampling algorithms are presented in [22, 23]. Analysis of the leapfrog implementation of Hamiltonian Monte Carlo (HMC) for strongly log-concave target distributions is presented in [24] and [25]. Following the introduction of a stochastic gradient-based Langevin approach for Bayesian inference in [26], stochastic gradient-based Langevin diffusions and other HMC schemes are presented in [27-31].

Related Work: The approaches discussed so far assume a centralized entity to process large datasets. However, communication challenges associated with transferring large amounts of data to a central location and the associated privacy issues motivate a decentralized implementation over its centralized counterparts [32, 33]. A master-slave architecture for distributed MCMC via moment sharing is presented in [34]. Data-parallel MCMC algorithms for large-scale Bayesian posterior sampling are presented in [35-38].
These parallel MCMC schemes [39-42] are not applicable in the decentralized setting, since they require a central node to aggregate and combine the samples from the individual chains generated by the computing nodes in a final post-processing step to obtain an approximation of the true posterior. Recently, decentralized stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC) methods for strongly log-concave posterior distributions were presented in [43].

Contribution: In this paper, we draw on the recent ULA literature and develop a decentralized learning algorithm based on the centralized ULA in a non-convex setting. We consider the problem of collaboratively inferring the global posterior of a parameter of interest based on independent data sets distributed among a network of $n$ agents. The communication topology between the agents can be any undirected connected graph, including the master-slave topology as a special case. We propose a decentralized ULA (D-ULA) that incorporates an average-consensus process into the ULA with time-varying step-sizes. In this algorithm, each agent shares its current Markov chain sample with neighboring agents at each time step. We show that the resulting distribution of the averaged sample converges to the true posterior asymptotically. We provide theoretical analysis of the convergence rate and step-size conditions to achieve a speed-up of convergence with respect to the number of agents. Empirical results show that the performance of our proposed algorithm is on par with the centralized ULA, with a considerable improvement in the convergence rate, for three different machine learning tasks.

Notation: Let $\mathbb{R}^{n \times m}$ denote the set of $n \times m$ real matrices. For a vector $\phi$, $\phi_i$ is the $i$-th entry of $\phi$. An $n \times n$ identity matrix is denoted as $I_n$, and $1_n$ denotes an $n$-dimensional vector of all ones. For $p \in [1, \infty]$, the $p$-norm of a vector $x$ is denoted as $\|x\|_p$. For matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$, $A \otimes B \in \mathbb{R}^{mp \times nq}$ denotes their Kronecker product. For a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ of order $n$, $\mathcal{V} = \{v_1, \ldots, v_n\}$ represents the agents or nodes, and the communication links between the agents are represented as $\mathcal{E} = \{e_1, \ldots, e_\ell\} \subseteq \mathcal{V} \times \mathcal{V}$. Let $\mathcal{A} = [a_{i,j}] \in \mathbb{R}^{n \times n}$ be the adjacency matrix with entries $a_{i,j} = 1$ if $(v_i, v_j) \in \mathcal{E}$ and zero otherwise. Define $\Delta = \mathrm{diag}(\mathcal{A} 1_n)$ as the in-degree matrix and $L = \Delta - \mathcal{A}$ as the graph Laplacian.

2 Problem formulation

Consider a connected network of $n$ agents, each with a randomly distributed set of $m_i$ data items, $X_i = \{x_i^j\}_{j=1}^{m_i}$, $i = 1, \ldots, n$. Here $x_i^j \in \mathbb{R}^{d_x}$ is the $j$-th data element in the set of $m_i$ data items available to the $i$-th agent. Let $w \in \mathbb{R}^{d_w}$ be the parameter vector associated with the model and $p(w)$ the prior on the model parameters. The global posterior distribution of $w$ given the $n$ independent data sets distributed among the agents can be expressed as

$$p(w \,|\, X_1, \ldots, X_n) \propto p(w) \prod_{i=1}^{n} p(X_i \,|\, w) = \prod_{i=1}^{n} \underbrace{p(X_i \,|\, w)\, p(w)^{\frac{1}{n}}}_{\text{local posterior}} \qquad (1)$$

In the optimization literature, the prior $p(w)$ regularizes the parameter, and the likelihood $p(X_i \,|\, w)$ represents the local cost function available to each agent. Here, the set of $n$ independent data sets is distributed among the $n$ agents, each with a size of $m_i$, $i = 1, \ldots, n$. At the risk of abusing notation, define $p(w \,|\, X_i)$ as the local posterior distribution. Thus, the global posterior can be written as the product of the local posteriors:

$$p(w \,|\, X_1, \ldots, X_n) \propto \prod_{i=1}^{n} p(w \,|\, X_i). \qquad (2)$$
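To make the factorization in (1)-(2) concrete, the following minimal numerical sketch checks that the sum of the local log-posteriors recovers the global unnormalized log-posterior. It assumes a toy scalar model with a standard normal prior and a unit-variance Gaussian likelihood; all names and constants here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: scalar w, prior p(w) = N(0, 1), likelihood p(x | w) = N(w, 1).
def log_prior(w):
    return -0.5 * w**2                      # up to an additive constant

def log_lik(X, w):
    return -0.5 * np.sum((X - w)**2)        # up to an additive constant

n = 5                                                       # number of agents
data = [rng.normal(1.0, 1.0, size=20) for _ in range(n)]    # X_1, ..., X_n
w = 0.3

# Local log-posterior of agent i, per (1): log p(X_i | w) + (1/n) log p(w)
local = [log_lik(Xi, w) + log_prior(w) / n for Xi in data]

# Global log-posterior: log p(w) + sum_i log p(X_i | w)
glob = log_prior(w) + sum(log_lik(Xi, w) for Xi in data)

assert np.isclose(sum(local), glob)   # eq. (2): product of local posteriors
```

The same additive structure is what lets each agent work only with its own local log-posterior gradient in the algorithm developed below.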
The main issue with point estimates obtained from optimization schemes such as maximum-likelihood and maximum a posteriori estimation is that they fail to capture the parameter uncertainty and can potentially over-fit the data. This paper aims at developing a method for collaborative Bayesian learning from large-scale datasets distributed among a networked set of agents, as a solution to the numerous issues associated with point-estimation schemes. In particular, we present a decentralized version of the unadjusted Langevin algorithm to obtain samples from the global posterior $p(w \,|\, X_1, \ldots, X_n)$ in a distributed fashion. For ease of notation, we use $X$ to denote the entire data set; thus, the global posterior can be written as $p(w \,|\, X)$.

3 Decentralized unadjusted Langevin algorithm

To efficiently explore the global posterior $p(w \,|\, X)$, we first rewrite the target distribution in terms of an energy function $U$ as follows [1, 44, 45]:

$$p(w \,|\, X) \propto \exp(-U(w)), \qquad (3)$$

where $U$ is the analogue of the potential energy, given by

$$U(w) \triangleq -\log p(w \,|\, X). \qquad (4)$$

The Langevin algorithm is a well-known family of gradient-based Monte Carlo sampling algorithms. The sample obtained using the Unadjusted Langevin Algorithm (ULA) at a given time instant $k$ is given by [19]

$$w(k+1) = w(k) - \alpha_k \nabla U(w(k)) + \sqrt{2\alpha_k}\, v(k), \qquad (5)$$

where $\alpha_k$ is the algorithm step-size, $w(k)$ represents the sample obtained at the $k$-th time instant, and $v(k)$ is a $d_w$-dimensional, independent, zero-mean, unit-variance Gaussian sequence, i.e., $v(k) \sim \mathcal{N}(0, I_{d_w})$, $\forall k \geq 0$. Now substituting (4) yields

$$w(k+1) = w(k) + \alpha_k \nabla \log p(w(k) \,|\, X) + \sqrt{2\alpha_k}\, v(k). \qquad (6)$$

Substituting (1) yields

$$w(k+1) = w(k) + \alpha_k \sum_{i=1}^{n} \left( \nabla \log p(X_i \,|\, w(k)) + \frac{1}{n} \nabla \log p(w(k)) \right) + \sqrt{2\alpha_k}\, v(k). \qquad (7)$$

The samples obtained using the continuous-time version of the centralized ULA given in (7) have been shown to converge exponentially to the target posterior distribution [46] for a certain class of distributions with exponential tails. The convergence properties of the ULA have been widely studied for log-concave target distributions [6, 8-12]. Non-asymptotic analysis of the centralized ULA without the strong log-concavity assumption on the target distribution is presented in [13-20]. However, when the data is distributed among $n$ agents and there is no central agent to pool all the local gradients, exact implementation of the above ULA is difficult, if not impossible. Therefore, we propose the following decentralized ULA:

$$w_i(k+1) = w_i(k) - \beta_k \sum_{j=1}^{n} a_{i,j} \left( w_i(k) - w_j(k) \right) + \alpha_k n \left( \nabla \log p(X_i \,|\, w_i(k)) + \frac{1}{n} \nabla \log p(w_i(k)) \right) + \sqrt{2\alpha_k}\, v_i(k), \qquad (8)$$

where $a_{i,j}$ denotes the entries of the adjacency matrix corresponding to the communication network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, $\beta_k$ is the consensus step-size, and the $v_i(k)$ are $d_w$-dimensional, independent, zero-mean Gaussian sequences with variance $n$, i.e., $v_i(k) \sim \mathcal{N}(0_{d_w}, n I_{d_w})$, $\forall i \in \mathcal{I}$.

Remark 1. Compared to the parallel MCMC setting, our formulation does not require a central coordinator; each computing node reconstructs an approximation to the posterior relying only on its individually available data set and prior information (incorporated as $\nabla \log p(X_i \,|\, w_i(k)) + \frac{1}{n} \nabla \log p(w_i(k))$ in (8)) and on interactions with its one-hop neighbors as dictated by the undirected communication graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ (the term $\sum_{j=1}^{n} a_{i,j}(w_i(k) - w_j(k))$ in (8)). Here $a_{i,j}$ is the $(i,j)$-th entry of the $n \times n$ adjacency matrix $\mathcal{A}$: $a_{i,j} = 1$ if the $i$-th node can communicate with the $j$-th node and zero otherwise. A similar technique is used in decentralized supervised learning [32, 47, 48].
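As a concrete illustration of the per-agent update (8), the following minimal sketch runs D-ULA on the toy scalar Gaussian model from the earlier snippet, over a five-agent ring. The model, seed, constants, and step-size parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}    # adjacency lists
data = [rng.normal(1.0, 1.0, size=20) for _ in range(n)]    # X_i per agent

def grad_log_lik(Xi, w):        # d/dw log p(X_i | w) for likelihood N(x; w, 1)
    return np.sum(Xi - w)

def grad_log_prior(w):          # d/dw log p(w) for prior N(0, 1)
    return -w

a, b, d1, d2 = 0.001, 0.3, 0.05, 0.6    # Condition 1: 1/2 + d1 < d2 < 1
w = rng.normal(size=n)                  # initial samples w_i(0)

for k in range(5000):
    alpha = a / (k + 1)**d2
    beta = b / (k + 1)**d1
    w_new = np.empty_like(w)
    for i in range(n):
        consensus = sum(w[i] - w[j] for j in ring[i])   # sum_j a_ij (w_i - w_j)
        drift = alpha * n * (grad_log_lik(data[i], w[i])
                             + grad_log_prior(w[i]) / n)
        noise = np.sqrt(2 * alpha) * rng.normal(scale=np.sqrt(n))  # v_i ~ N(0, n)
        w_new[i] = w[i] - beta * consensus + drift + noise
    w = w_new

print(w)   # all agents end up close to one another near the posterior mode
```

Each iteration uses only one-hop communication (the `ring[i]` neighbors) and the locally available `data[i]`, mirroring Remark 1.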
Define $\mathbf{w}(k) \triangleq [w_1^\top(k), \ldots, w_n^\top(k)]^\top \in \mathbb{R}^{nd_w}$ and $\mathbf{v}(k) \triangleq [v_1^\top(k), \ldots, v_n^\top(k)]^\top \in \mathbb{R}^{nd_w}$. Now (8) can be written as

$$\mathbf{w}(k+1) = \mathbf{w}(k) - \beta_k \left( L \otimes I_{d_w} \right) \mathbf{w}(k) - \alpha_k n\, g(\mathbf{w}(k), X) + \sqrt{2\alpha_k}\, \mathbf{v}(k), \qquad (9)$$

where $L$ is the network Laplacian and

$$g(\mathbf{w}(k), X) \triangleq \begin{bmatrix} g_1(w_1(k), X_1) \\ \vdots \\ g_n(w_n(k), X_n) \end{bmatrix} = \begin{bmatrix} \nabla U_1(w_1(k), X_1) \\ \vdots \\ \nabla U_n(w_n(k), X_n) \end{bmatrix},$$

where $U_i(w, X_i) \triangleq -\log p(w \,|\, X_i)$ and $p(w \,|\, X_i)$ is the local posterior given in (1). Define the network weight matrix $W_k \triangleq (I_n - \beta_k L)$. Thus, the proposed decentralized ULA can be written as

$$\mathbf{w}(k+1) = (W_k \otimes I_{d_w})\, \mathbf{w}(k) - \alpha_k n\, g(\mathbf{w}(k), X) + \sqrt{2\alpha_k}\, \mathbf{v}(k). \qquad (10)$$

If we ignore the additive noise term, then the decentralized ULA of (10) can be considered a consensus optimization algorithm aimed at solving the problem $\min_w U(w, X)$, where

$$U(w, X) \triangleq \sum_{i=1}^{n} U_i(w, X_i). \qquad (11)$$

Denote by $p^*$ the stationary probability distribution corresponding to the global posterior distribution, i.e., the target distribution. It then follows from (3) that

$$p^*(\cdot) = \exp\left( -U(\cdot, X) + C \right), \qquad (12)$$

for some positive constant $C$ corresponding to the normalizing constant. Now note that the centralized ULA for generating samples from the target distribution $p^*$ of (12) is given as [9]

$$w(k+1) = w(k) - \alpha_k \nabla U(w(k), X) + \sqrt{2\alpha_k}\, v(k). \qquad (13)$$

The continuous-time limit of (13) can be obtained as the following stochastic differential equation (SDE), known as the Langevin equation [49]:

$$dw(t) = -\nabla U(w(t), X)\, dt + \sqrt{2}\, dB_t, \qquad (14)$$

where $B_t$ is a $d_w$-dimensional Brownian motion. The pseudocode of the proposed decentralized ULA is given in Algorithm 1, where $\alpha_k = \frac{a}{(k+1)^{\delta_2}}$ and $\beta_k = \frac{b}{(k+1)^{\delta_1}}$ (see Condition 1 in S1). We refer to material from the supplementary sections with the prefix S.
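To see the consensus step of (10) in matrix form, here is a small sketch that builds the Laplacian of the five-agent ring used in the experiments, forms $W_k = I_n - \beta_k L$ under the step-size schedules given above, and checks that $W_k$ is symmetric and doubly stochastic, so the network average $\bar w(k)$ is preserved by the consensus step. The step-size constants are illustrative assumptions.

```python
import numpy as np

n = 5
A = np.zeros((n, n))
for i in range(n):                      # undirected, unweighted ring
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = Delta - A

a, b, d1, d2 = 0.001, 0.3, 0.05, 0.6    # must satisfy 1/2 + d1 < d2 < 1

for k in [0, 10, 100]:
    alpha_k = a / (k + 1)**d2
    beta_k = b / (k + 1)**d1
    W_k = np.eye(n) - beta_k * L
    # W_k is symmetric with rows and columns summing to one, so the
    # consensus step neither creates nor destroys the average sample.
    assert np.allclose(W_k.sum(axis=0), 1) and np.allclose(W_k.sum(axis=1), 1)
    # Mixing is governed by the spectrum: eigenvalue 1 on the consensus
    # subspace, and the rest strictly inside the unit circle for small beta_k.
    second = np.sort(np.abs(np.linalg.eigvalsh(W_k)))[-2]
    print(k, alpha_k, beta_k, second)
```

The gap between 1 and the second-largest eigenvalue magnitude controls how quickly disagreement between agents is damped, which is the quantity the consensus analysis below tracks.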
4 Main Results

Though our proposed algorithm is built on the ULA, the analysis of even the centralized ULA (C-ULA) for non-log-concave target distributions requires assuming that the negative log of the target distribution satisfies some dissipative property [13-17] or contractivity condition [18], or limiting the non-convexity to a local region [19, 20]. The analysis of D-ULA is nontrivial compared to the existing non-convex consensus-optimization and non-log-concave ULA literature because: (i) the consensus analysis and the results in Theorem 1 are novel, since we use time-varying step-sizes $\alpha_k$ and $\beta_k$ and provide an explicit consensus rate in terms of the step-size decay rates (see (25)); (ii) compared to the existing C-ULA analysis for non-log-concave target distributions, the continuous-time approximation to the D-ULA contains an additional consensus error term $\zeta(\cdot)$ in (21) that complicates the analysis. Requirements on the time-varying step-sizes are also not straightforward to obtain, as the existing literature is focused on fixed step-sizes.

Analysis of the proposed distributed ULA given in (10) requires that the sequences $\{\alpha_k\}$ and $\{\beta_k\}$ be selected as (see Condition 1 in S1)

$$\alpha_k = \frac{a}{(k+1)^{\delta_2}} \quad \text{and} \quad \beta_k = \frac{b}{(k+1)^{\delta_1}}, \qquad (15)$$

where $a > 0$, $b > 0$, $\delta_1 \geq 0$, and $\frac{1}{2} + \delta_1 < \delta_2 < 1$. Furthermore, we make the following three assumptions (formally stated in S1): (i) the gradients $\nabla U_i$ are Lipschitz continuous with Lipschitz constants $L_i > 0$, $i = 1, \ldots, n$; (ii) the communication network is a connected undirected graph; and (iii) there exists a positive constant $\mu_g < \infty$ such that the disagreement on the gradient among the distributed agents, denoted as $\tilde g(\mathbf{w}_k, X)$, satisfies $\mathbb{E}\left[ \|\tilde g(\mathbf{w}_k, X)\|_2^2 \,\middle|\, \mathbf{w}_k \right] \leq n \mu_g (1+k)^{\delta_2}$ a.s., where $\tilde g(\mathbf{w}_k, X) = g(\mathbf{w}_k, X) - \left( \frac{1}{n} 1_n 1_n^\top \otimes I_{d_w} \right) g(\mathbf{w}_k, X)$.

From the proposed distributed ULA given in (10), the average dynamics is given as

$$\bar w(k+1) = \bar w(k) - \alpha_k \sum_{i=1}^{n} \nabla U_i(w_i(k), X_i) + \sqrt{2\alpha_k}\, \bar v(k), \qquad (16)$$

where $\bar w(k) = \frac{1}{n} \sum_{i=1}^{n} w_i(k)$ and $\bar v(k) = \frac{1}{n} \sum_{i=1}^{n} v_i(k)$ is a zero-mean, unit-variance Gaussian random vector. Now adding and subtracting $\alpha_k \sum_{i=1}^{n} \nabla U_i(\bar w(k), X_i) = \alpha_k \nabla U(\bar w(k), X)$ yields

$$\bar w(k+1) = \bar w(k) - \alpha_k \nabla U(\bar w(k), X) - \alpha_k \zeta(\bar w(k), \tilde{\mathbf{w}}(k)) + \sqrt{2\alpha_k}\, \bar v(k), \qquad (17)$$

where $U(\bar w(k), X)$ is defined in (11), the consensus error $\tilde{\mathbf{w}}(k)$ is defined as $\tilde{\mathbf{w}}(k) = \left( \left( I_n - \frac{1}{n} 1_n 1_n^\top \right) \otimes I_{d_w} \right) \mathbf{w}(k)$, and $\zeta(\bar w(k), \tilde{\mathbf{w}}(k))$ is defined as

$$\zeta(\bar w(k), \tilde{\mathbf{w}}(k)) = \sum_{i=1}^{n} \left( \nabla U_i(\bar w(k) + \tilde w_i(k), X_i) - \nabla U_i(\bar w(k), X_i) \right). \qquad (18)$$

For all $k \geq 0$, let $[t_k, t_{k+1})$ denote the current time interval, i.e., $t \in [t_k, t_{k+1})$, where $t_k$ is defined as $t_k = \sum_{j=0}^{k-1} \alpha_j$. Here, $t_{k+1} = t_k + \alpha_k$. Define $\omega(t)$ as

$$\omega(t) = \tilde{\mathbf{w}}(k), \quad t \in [t_k, t_{k+1}), \; \forall k \geq 0. \qquad (19)$$

Now (17) can be written as

$$\bar w(t_{k+1}) = \bar w(t_k) - \alpha_k \nabla U(\bar w(t_k), X) - \alpha_k \zeta(\bar w(t_k), \tilde{\mathbf{w}}(t_k)) + \sqrt{2} \left( B_{t_{k+1}} - B_{t_k} \right), \qquad (20)$$

where $B_t$ is a $d_w$-dimensional Brownian motion. Thus, for $t_k \leq t < t_{k+1}$, (20) is the discretization of

$$d\bar w(t) = -\nabla U(\bar w(t_k), X)\, dt + \sqrt{2}\, dB_t - \zeta(\bar w(t_k), \omega(t))\, dt. \qquad (21)$$

Let $\bar w(t)$ in (21) admit a probability distribution $p_t(\bar w)$ for $t_k \leq t < t_{k+1}$. Here we aim to show that $p_{t_k}(\bar w) \to p^*$ as $k \to \infty$.

4.1 Kullback-Leibler (KL) divergence and log-Sobolev inequality

Sampling can be viewed as optimization in the space of measures, where the objective function attains its minimum at the target distribution. Following [3, 4, 19, 20], we use the relative entropy, or the KL-divergence, of $p_t(\bar w)$ with respect to the target distribution $p^*$, denoted by $F(p_t(\bar w))$, as the objective, i.e.,

$$F(p_t(\bar w)) = \int p_t(\bar w) \log \frac{p_t(\bar w)}{p^*(\bar w)}\, d\bar w. \qquad (22)$$

Algorithm 1 Decentralized ULA (D-ULA)
1: Initialization: $\mathbf{w}(0) = [w_1^\top(0), \ldots, w_n^\top(0)]^\top$
2: Input: $a$, $b$, $\delta_1$, and $\delta_2$
3: for $k \geq 0$ do
4:   for $i = 1$ to $n$ do
5:     Sample $v_i(k) \sim \mathcal{N}(0, n I_{d_w})$ and compute $g_i(w_i(k), X_i)$
6:     Compute $\hat w_i(k) = \sum_{j=1}^{n} a_{i,j} (w_i(k) - w_j(k))$
7:     Update $w_i(k+1) = w_i(k) - \beta_k \hat w_i(k) - \alpha_k n\, g_i(w_i(k), X_i) + \sqrt{2\alpha_k}\, v_i(k)$
8:   end for
9: end for

The KL-divergence is non-negative and is minimized at the target distribution, i.e., $F(p_t(\bar w)) \geq 0$ and $F(p_t(\bar w)) = 0$ if and only if $p_t = p^*$. The property of $p^*$ that we rely on to show convergence of the proposed algorithm is that it satisfies a log-Sobolev inequality. Consider a Sobolev space defined by the weighted norm $\int g(\bar w)^2 p^*(\bar w)\, d\bar w$, where $p^*(\bar w) \propto \exp(-U(\bar w))$. We say that $p^*(\bar w)$ satisfies a log-Sobolev inequality if there exists a constant $\rho_U > 0$ such that for any smooth function $g$ satisfying $\int g(\bar w)\, p^*(\bar w)\, d\bar w = 1$, we have

$$\int g(\bar w) \log g(\bar w)\, p^*(\bar w)\, d\bar w \leq \frac{1}{2\rho_U} \int \frac{\|\nabla g(\bar w)\|_2^2}{g(\bar w)}\, p^*(\bar w)\, d\bar w, \qquad (23)$$

where $\rho_U$ is the log-Sobolev constant. Let $g(\bar w) = \frac{p_t(\bar w)}{p^*(\bar w)}$. Thus, we have

$$F(p_t(\bar w)) = \mathbb{E}_{p_t(\bar w)}\left[ \log \frac{p_t(\bar w)}{p^*(\bar w)} \right] \leq \frac{1}{2\rho_U}\, \mathbb{E}_{p_t(\bar w)}\left[ \left\| \nabla \log \frac{p_t(\bar w)}{p^*(\bar w)} \right\|_2^2 \right]. \qquad (24)$$

Now we present our first result, which shows that the average-consensus error $\tilde{\mathbf{w}}_k$ is decreasing at the rate $O\!\left( \frac{1}{(k+1)^{\delta_2 - 2\delta_1}} \right)$ (see (S96) for an explicit expression). This implies that the individual samples $w_i(k)$ are converging to $\bar w_k$, which is possible only because of the decaying step-size $\alpha_k$, which also multiplies the additive Gaussian noise.

Theorem 1. Consider the decentralized ULA (D-ULA) given in Algorithm 1 under Assumptions 1-3. Then, for the average-consensus error defined as $\tilde{\mathbf{w}}_k = \left( I_{nd_w} - \frac{1}{n} 1_n 1_n^\top \otimes I_{d_w} \right) \mathbf{w}_k$, there holds

$$\mathbb{E}\left[ \|\tilde{\mathbf{w}}_{k+1}\|_2^2 \right] \leq W_3 \exp\left( -W_1 (k+1)^{1-\delta_1} \right) + \frac{W_2}{(k+1)^{\delta_2 - 2\delta_1}}, \qquad (25)$$

where $W_1$, $W_2$, and $W_3$ are positive constants defined in (S87)-(S90). The detailed proof of Theorem 1 is given in S3.
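As an empirical sanity check on the polynomial term in (25), the following sketch reuses the toy Gaussian ring model from the earlier snippets, averages the squared consensus error over repeated runs, and rescales it by $(k+1)^{\delta_2 - 2\delta_1}$; an approximately flat rescaled tail is consistent with the rate in Theorem 1. All constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, T = 5, 100, 2000
A = np.zeros((n, n))
for i in range(n):                          # five-agent ring
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1
L = np.diag(A.sum(axis=1)) - A
P = np.eye(n) - np.ones((n, n)) / n         # projector onto disagreement space

a, b, d1, d2 = 0.001, 0.3, 0.05, 0.6        # 1/2 + d1 < d2 < 1
data = [rng.normal(1.0, 1.0, size=20) for _ in range(n)]
S = np.array([Xi.sum() for Xi in data])     # per-agent data sums
m = 20                                      # local data count

err = np.zeros(T)
for _ in range(trials):
    w = rng.normal(size=n)                  # scalar parameter per agent
    for k in range(T):
        alpha, beta = a / (k + 1)**d2, b / (k + 1)**d1
        # -n * g(w, X) for the Gaussian toy model: grad log-lik + (1/n) grad log-prior
        drift = (S - m * w) - w / n
        w = w - beta * (L @ w) + alpha * n * drift \
            + np.sqrt(2 * alpha * n) * rng.normal(size=n)
        err[k] += np.sum((P @ w)**2) / trials

ks = np.array([200, 500, 1000, 2000])
print(err[ks - 1] * ks**(d2 - 2 * d1))      # roughly flat if the O(.) rate holds
```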
Now we present our main result, which shows that the KL-divergence between $p_t$ and $p^*$ is in fact decreasing.

Theorem 2. Consider the decentralized ULA (D-ULA) given in Algorithm 1 under Assumptions 1-3, with $\alpha_k$, $\beta_k$ given in Condition 1 and $a$ in $\alpha_k$ selected as in (26), where $\gamma > 2$. Given that the target distribution satisfies the log-Sobolev inequality (24) with a constant $\rho_U > 0$ and has a bounded second moment, i.e., $\int \|\bar w\|_2^2\, p^*(\bar w)\, d\bar w \leq c_1$ for some bounded positive constant $c_1$, then for all initial distributions $p_{t_0}(\bar w)$ satisfying $F(p_{t_0}(\bar w)) \leq c_2$, we have

$$F(p_{t_{k+1}}(\bar w)) \leq \left( F(p_{t_0}(\bar w)) + C_{F_1} \right) \exp\left( -\rho_U \sum_{\ell=0}^{k} \alpha_\ell \right) + \frac{1}{n^{\gamma-2}} \frac{C_{F_2}}{(k+1)^{\delta_2 - 2\delta_1}} + C_{F_3} \exp\left( -\frac{\rho_U a}{1 - \delta_2} (k+1)^{1-\delta_2} \right), \qquad (27)$$

where the positive constants $C_{F_1}$, $C_{F_2}$, $C_{F_3}$ and the associated parameters are defined in (S224)-(S232). The proof of Theorem 2 is given in S4.

Compared to the existing results, by using a decaying step-size we are able to remove the constant bias term present in the KL-divergence due to the additive noise. In (27), the constants $C_{F_1}$ and $C_{F_3}$ are dominated by the consensus error, while the additive noise contributes most to $C_{F_2}$. Note that the exponential convergence rate for the initial KL-divergence $F(p_{t_0}(\bar w))$ is similar to what is currently known in the literature [4, 9, 19]. More importantly, the constant bias term present in the existing results for the KL-divergence between the actual and target distributions, which is absorbed into the constant $C_{F_2}$, is decreasing, and its decay experiences a speed-up with the number of agents due to the $n^{\gamma-2}$ term. Even though this speed-up increases with $\gamma$, an increasing $\gamma$ in fact decreases the exponential rates of the first and third terms in (27). Furthermore, the constants $C_{F_1}$, $C_{F_2}$, and $C_{F_3}$ are polynomial in the problem dimension $d_w$.

Corollary 1. For the decentralized ULA (D-ULA) given in Algorithm 1 under the conditions of Theorem 2 and an error tolerance $\epsilon \in (0, 1)$, there holds

$$F(p_{t_k}(\bar w)) \leq \epsilon, \quad \forall k \geq k^*, \qquad (28)$$

where

$$k^* \triangleq \max\left\{ \left( \frac{1-\delta_2}{a \rho_U} \log \frac{2 Q_1}{\epsilon} \right)^{\frac{1}{1-\delta_2}},\; \left( \frac{2 Q_2}{\epsilon} \right)^{\frac{1}{\delta_2 - 2\delta_1}} \right\},$$

$Q_1 = F(p_{t_0}(\bar w)) + C_{F_1} \exp\left( \frac{a \rho_U}{1 - \delta_2} \right) + C_{F_3}$, and $Q_2 = \frac{C_{F_2}}{n^{\gamma-2}}$. Corollary 1 follows from Theorem 2, and its proof is given in S5. Corollary 1 provides the minimum number of iterations required to decrease the KL-divergence below a given error tolerance $\epsilon$.

5 Numerical experiments

We apply the proposed algorithm to perform decentralized Bayesian learning for Gaussian mixture modeling, logistic regression, and classification, and we empirically compare our proposed algorithm to the centralized ULA (C-ULA). In all the experiments, we use a network of five agents in an undirected, unweighted ring topology for the decentralized setting. Additional details of all the experiments, including step-sizes and the number of epochs, are provided in the Supplementary material (see S6).

5.1 Parameter estimation for Gaussian mixture

In this section, we compare the efficiency of D-ULA against C-ULA for parameter estimation of a multimodal Gaussian mixture with tied means [26]. The Gaussian mixture is given by

$$\theta_1 \sim \mathcal{N}(0, \sigma_1^2); \quad \theta_2 \sim \mathcal{N}(0, \sigma_2^2); \quad x_i \sim \tfrac{1}{2} \mathcal{N}(\theta_1, \sigma_x^2) + \tfrac{1}{2} \mathcal{N}(\theta_1 + \theta_2, \sigma_x^2),$$

where $\sigma_1^2 = 10$, $\sigma_2^2 = 1$, $\sigma_x^2 = 2$, and $w \triangleq [\theta_1, \theta_2]^\top \in \mathbb{R}^2$. For the centralized setting, similar to [26], 100 data samples are drawn from the model with $\theta_1 = 0$ and $\theta_2 = 1$. The 100 available data samples are randomly divided into 5 sets of 20 samples that are made available to the agents in the decentralized network. The posterior distribution of the parameters is bimodal, with negatively correlated modes at $w = [0, 1]^\top$ and $w = [1, -1]^\top$. As shown in Figure 1, the posteriors estimated by D-ULA and C-ULA replicate the true posterior distribution of the parameters.
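For readers who want to reproduce this experiment, the following sketch generates the tied-means mixture data and implements the local log-likelihood and log-prior gradients that plug into the D-ULA update (8). The data split and seeding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
s1, s2, sx = 10.0, 1.0, 2.0                    # sigma_1^2, sigma_2^2, sigma_x^2

# 100 draws from the mixture with theta_1 = 0, theta_2 = 1, split over 5 agents
comp = rng.random(100) < 0.5
x = np.where(comp, rng.normal(0.0, np.sqrt(sx), 100),
                   rng.normal(1.0, np.sqrt(sx), 100))
data = np.split(rng.permutation(x), 5)         # X_1, ..., X_5

def grad_log_lik(theta, X):
    """Gradient of log p(X | theta) for the two-component tied-means mixture."""
    t1, t2 = theta
    # Unnormalized component densities; shared constants cancel in the ratio.
    p1 = np.exp(-(X - t1) ** 2 / (2 * sx))
    p2 = np.exp(-(X - t1 - t2) ** 2 / (2 * sx))
    r2 = p2 / (p1 + p2)                        # responsibility of component 2
    g1 = np.sum(((1 - r2) * (X - t1) + r2 * (X - t1 - t2)) / sx)
    g2 = np.sum(r2 * (X - t1 - t2) / sx)
    return np.array([g1, g2])

def grad_log_prior(theta):
    return -theta / np.array([s1, s2])         # independent Gaussian priors

# These gradients replace the toy Gaussian ones in the earlier D-ULA sketch,
# with each agent i holding w_i = [theta_1, theta_2] and data[i].
```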
The quality of the estimated posteriors is compared using an approximate Wasserstein measure [50]. Since the exact metric is computationally complex, we resort to the Sinkhorn distance and Sinkhorn's algorithm introduced in [50], which in essence quantify the cost incurred while mapping the estimated posterior to the true posterior using a transport matrix. The regularization parameter $\lambda$ in the Sinkhorn algorithm is set to 0.1. The experiments are performed for networks of size 1, 5, and 10, and the corresponding Sinkhorn distances $d_M$ are 0.259, 0.251, and 0.244, respectively.

5.2 Bayesian logistic regression

We compare the performance of D-ULA and C-ULA for Bayesian inference of logistic regression models using the a9a dataset available at the UCI machine learning repository (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a9a). The dataset contains 32561 observations and 123 parameters. We use a Laplace prior with a scale of 1 on the parameters. Test accuracy averaged over 50 runs for both approaches is shown in Figure 2. During each run, we chose a random 80% of the data for training and the remaining 20% for testing, as in [26]. For D-ULA, we consider networks with 5, 10, and 25 agents. The training data for each case is divided into random sets of equal size and made available to the agents in the decentralized network. During each run, the same 20% partition is used to test the performance of C-ULA and D-ULA. Test results over ten epochs averaged over 50 runs indicate that the performance of D-ULA is comparable to that of C-ULA.

Figure 1: (a) True posterior; (b) posterior estimated by C-ULA; (c)-(g) posteriors estimated by the D-ULA agents. [Plots omitted; axes show $\theta_1$ and $\theta_2$.]

Figure 2: Test accuracy averaged over 50 runs: (a) centralized ULA; (b) set of 5 agents; (c) set of 10 agents; (d) set of 25 agents.

Figures 2b, 2c, and 2d are zoomed into the first 1200 iterations to better show the faster convergence with increasing network size. The corresponding accuracy values for C-ULA and for D-ULA networks with 5, 10, and 25 agents are 83.89%, 84.38%, 84.5637%, and 84.5637%, respectively. The insets of Figures 2b, 2c, and 2d indicate faster convergence of D-ULA compared to C-ULA. The test accuracies of all the agents in the D-ULA networks settle to the same accuracy level, as shown in the insets. The shaded region in the figures indicates one standard deviation.

Table 1: Probability of predicted labels (mean/standard deviation)

|                 | SGD   | C-ULA | Agent 1 | Agent 2 | Agent 3 | Agent 4 | Agent 5 |
|-----------------|-------|-------|---------|---------|---------|---------|---------|
| MNIST mean      | 0.974 | 0.968 | 0.973   | 0.972   | 0.972   | 0.973   | 0.973   |
| MNIST std. dev. | 0.078 | 0.086 | 0.079   | 0.08    | 0.08    | 0.08    | 0.08    |
| SVHN mean       | 0.849 | 0.604 | 0.659   | 0.6588  | 0.653   | 0.663   | 0.651   |
| SVHN std. dev.  | 0.154 | 0.169 | 0.188   | 0.189   | 0.188   | 0.19    | 0.187   |

5.3 Bayesian learning for handwritten digit classification and OOD detection

In this section, we present decentralized Bayesian learning as a potential strategy to recognize handwritten digits in images. For this, we use the MNIST data set containing 60000 gray-scale images of the 10 digits (0-9) for training and 10000 images for testing. Each agent in D-ULA trains its own neural network, which is a randomly initialized LeNet-5 [51] with a Kaiming-uniform prior [52] on the parameters of the network. Each agent has access to 12000 randomly chosen training samples. The test accuracies obtained using stochastic gradient descent (SGD), C-ULA, and the 5 agents of D-ULA after 10 epochs are 98.15%, 98.16%, 98.52%, 98.52%, 98.39%, 98.45%, and 98.47%, respectively.
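The per-agent updates in this experiment operate on neural-network parameters, using the mini-batch variant noted at the end of this section; a minimal PyTorch-style sketch of one mini-batch D-ULA step is given below. This is an illustrative reading of update (8), not the authors' released code: the function name, the rescaling of the batch loss by $m_i / |\text{batch}|$, and the substitution of a standard normal prior (in place of the Kaiming-uniform prior, which is not differentiable in this simple form) are all assumptions.

```python
import copy
import math
import torch
import torch.nn.functional as F

def dula_step(models, neighbors, batches, m_i, k,
              a=1e-6, b=0.3, d1=0.05, d2=0.6):
    """One mini-batch D-ULA step for all n agents (hypothetical sketch).

    models:    n torch modules with identical architectures, one per agent
    neighbors: adjacency lists of the undirected communication graph
    batches:   one (inputs, labels) mini-batch per agent
    m_i:       local dataset size; the batch loss is rescaled by m_i / |batch|
    """
    n = len(models)
    alpha, beta = a / (k + 1) ** d2, b / (k + 1) ** d1
    # Snapshot w_j(k) for every agent before anyone updates.
    snaps = [copy.deepcopy(m.state_dict()) for m in models]

    for i, model in enumerate(models):
        inputs, labels = batches[i]
        model.zero_grad()
        # Mini-batch estimate of -log p(X_i | w_i), rescaled to the full set.
        nll = F.cross_entropy(model(inputs), labels, reduction="sum")
        ((m_i / len(labels)) * nll).backward()

        with torch.no_grad():
            for name, p in model.named_parameters():
                # beta_k * sum_j a_ij (w_i(k) - w_j(k))
                consensus = sum(snaps[i][name] - snaps[j][name]
                                for j in neighbors[i])
                # alpha_k * n * (grad log-lik + (1/n) grad log-prior), with an
                # assumed N(0, 1) prior so that grad log p(w) = -w.
                drift = alpha * n * (-p.grad - snaps[i][name] / n)
                # sqrt(2 alpha_k) v_i(k), with v_i ~ N(0, n I)
                noise = math.sqrt(2 * alpha * n) * torch.randn_like(p)
                p.add_(-beta * consensus + drift + noise)
```

Calling `dula_step` once per iteration with fresh mini-batches mirrors Algorithm 1, with the log-likelihood gradient replaced by its rescaled mini-batch estimate.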
Next, we explore the efficacy of the proposed algorithm in detecting out-of-distribution (OOD) samples, or outliers, in the datasets. We train each LeNet-5 neural network on the MNIST training data set and test it on the MNIST test data set for the normalcy class and on the Street View House Numbers (SVHN) test data set (http://ufldl.stanford.edu/housenumbers/) for OOD data. The SVHN data set is similar to MNIST but consists of color images of the 10 digits (0-9), with extra confusing digits around the central digit of interest; we converted the images to gray-scale for this experiment. Networks trained on MNIST are expected to give relatively low prediction probabilities for SVHN data samples. Table 1 summarizes the mean and standard deviation of the probabilities of the predicted labels for all the approaches. Since SGD produces a maximum a posteriori point estimate, it fails to recognize out-of-sample data and gives high prediction probabilities even for OOD SVHN data. On the other hand, C-ULA and D-ULA show improved performance in detecting OOD SVHN data by giving lower prediction probabilities for SVHN data while giving high prediction probabilities for MNIST test data, as seen in Table 1. Plots of the probability densities of the predicted labels for all the approaches are provided in the Supplementary material.

The decentralized ULA results in Sections 5.2 and 5.3 were obtained using a mini-batch version of the proposed D-ULA algorithm, where the log-likelihood was estimated from random mini-batches of $X_i$ for agent $i$, $i = 1, \ldots, n$. Although our theoretical analysis is based on the likelihood from the entire $X_i$, the empirical results in these two sections show that the mini-batch D-ULA algorithm is also effective. This is plausible since the additive noise $\sqrt{2\alpha_k}\, v_i(k)$ in (9) will dominate the noise in the local posterior term as $k$ increases.

6 Conclusion

In this paper, we present a decentralized collaborative approach for a group of agents to sample the posterior distribution of a parameter of interest from locally available data sets. We assume an undirected, connected communication topology between the agents. We propose a decentralized unadjusted Langevin algorithm with time-varying step-sizes and establish conditions on the step-sizes for asymptotic convergence to the target distribution. The algorithm also exhibits a guaranteed speed-up in convergence with the number of agents. We conducted three experiments, on Gaussian mixtures, logistic regression, and image classification. The experimental results demonstrate that the proposed algorithm offers improved accuracy with an enhanced speed of convergence. The results of the last experiment also suggest a potential application of the proposed algorithm to outlier detection.

Broader Impact

This work presents a basic line of research on reducing the computational complexity, enhancing the speed of convergence, and addressing potential privacy issues associated with centralized Bayesian learning. The experiments and empirical results cover a broad set of applications, including parameter estimation for local non-convex models, logistic regression, image classification, and outlier detection. We have used publicly available datasets, which have no implications for machine learning bias, fairness, or ethics. Hence, we believe that this section about the potential negative impact of our work on society is not applicable to the proposed work.

Acknowledgement

This work was supported by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-16-2-0008. The work of the second author was supported in part by the National Science Foundation under Grant No. 1925147.
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

[1] R. M. Neal, "MCMC using Hamiltonian dynamics," arXiv e-prints, arXiv:1206.1901, 2012.
[2] R. Jordan, D. Kinderlehrer, and F. Otto, "The variational formulation of the Fokker-Planck equation," SIAM Journal on Mathematical Analysis, vol. 29, no. 1, pp. 1-17, 1998.
[3] A. Wibisono, "Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem," in Proceedings of the 31st Conference On Learning Theory, vol. 75. PMLR, Jul. 2018, pp. 2093-3027.
[4] S. Vempala and A. Wibisono, "Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices," in Advances in Neural Information Processing Systems, 2019, pp. 8094-8106.
[5] F. Otto and C. Villani, "Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality," Journal of Functional Analysis, vol. 173, pp. 361-400, 2000.
[6] A. S. Dalalyan, "Theoretical guarantees for approximate sampling from smooth and log-concave densities," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 79, no. 3, pp. 651-676, 2017.
[7] A. Dalalyan, "Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent," in Proceedings of the 2017 Conference on Learning Theory, vol. 65. PMLR, 2017, pp. 678-689.
[8] X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan, "Underdamped Langevin MCMC: A non-asymptotic analysis," in Proceedings of the 31st Conference On Learning Theory, vol. 75, Jul. 2018, pp. 300-323.
[9] X. Cheng and P. L. Bartlett, "Convergence of Langevin MCMC in KL-divergence," Proceedings of Machine Learning Research, no. 83, pp. 186-211, 2018.
[10] A. Durmus and E. Moulines, "Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm," arXiv e-prints, arXiv:1605.01559, 2016.
[11] A. Durmus and E. Moulines, "Nonasymptotic convergence analysis for the unadjusted Langevin algorithm," The Annals of Applied Probability, vol. 27, no. 3, pp. 1551-1587, 2017.
[12] A. Durmus and E. Moulines, "High-dimensional Bayesian inference via the unadjusted Langevin algorithm," Bernoulli, vol. 25, no. 4A, pp. 2854-2882, 2019.
[13] M. Raginsky, A. Rakhlin, and M. Telgarsky, "Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis," in Proceedings of the 2017 Conference on Learning Theory, vol. 65. PMLR, Jul. 2017, pp. 1674-1703.
[14] P. Xu, J. Chen, D. Zou, and Q. Gu, "Global convergence of Langevin dynamics based algorithms for nonconvex optimization," in Advances in Neural Information Processing Systems, 2018, pp. 3122-3133.
[15] Y. Zhang, Ö. D. Akyildiz, T. Damoulas, and S. Sabanis, "Nonasymptotic estimates for stochastic gradient Langevin dynamics under local conditions in nonconvex optimization," arXiv preprint, arXiv:1910.02008, 2019.
[16] N. H. Chau, É. Moulines, M. Rásonyi, S. Sabanis, and Y. Zhang, "On stochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case," arXiv preprint, arXiv:1905.13142, 2019.
[17] W. Mou, N. Flammarion, M. J. Wainwright, and P. L. Bartlett, "Improved bounds for discretization of Langevin diffusions: near-optimal rates without convexity," arXiv preprint, arXiv:1907.11331, 2019.
[18] M. B. Majka, A. Mijatović, and L. Szpruch, "Non-asymptotic bounds for sampling algorithms without log-concavity," arXiv preprint, arXiv:1808.07105, 2018.
[19] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan, "Sampling can be faster than optimization," Proceedings of the National Academy of Sciences, vol. 116, no. 42, pp. 20881-20885, 2019.
[20] X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett, and M. I. Jordan, "Sharp convergence rates for Langevin dynamics in the nonconvex setting," arXiv preprint, arXiv:1805.01648, 2018.
[21] K. Talwar, "Computational separations between sampling and optimization," in Advances in Neural Information Processing Systems, 2019, pp. 14997-15007.
[22] Y.-A. Ma, N. Chatterji, X. Cheng, N. Flammarion, P. Bartlett, and M. I. Jordan, "Is there an analog of Nesterov acceleration for MCMC?" arXiv e-prints, Feb. 2019.
[23] W. Mou, Y.-A. Ma, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan, "High-order Langevin diffusion yields an accelerated MCMC algorithm," arXiv e-prints, Aug. 2019.
[24] O. Mangoubi and N. Vishnoi, "Dimensionally tight bounds for second-order Hamiltonian Monte Carlo," in Advances in Neural Information Processing Systems, 2018, pp. 6027-6037.
[25] O. Mangoubi and A. Smith, "Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions 2: Numerical integrators," in Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama, Eds., vol. 89. PMLR, Apr. 2019, pp. 586-595.
[26] M. Welling and Y. W. Teh, "Bayesian learning via stochastic gradient Langevin dynamics," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 681-688.
[27] S. Patterson and Y. W. Teh, "Stochastic gradient Riemannian Langevin dynamics on the probability simplex," in Advances in Neural Information Processing Systems 26, 2013, pp. 3102-3110.
[28] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven, "Bayesian sampling using stochastic gradient thermostats," in Advances in Neural Information Processing Systems 27, 2014, pp. 3203-3211.
[29] C. Chen, N. Ding, and L. Carin, "On the convergence of stochastic gradient MCMC algorithms with high-order integrators," in Advances in Neural Information Processing Systems 28, 2015, pp. 2278-2286.
[30] Y.-A. Ma, T. Chen, and E. Fox, "A complete recipe for stochastic gradient MCMC," in Advances in Neural Information Processing Systems 28, 2015, pp. 2917-2925.
[31] A. S. Dalalyan and A. Karagulyan, "User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient," Stochastic Processes and their Applications, vol. 129, no. 12, pp. 5278-5311, 2019.
[32] J. George, T. Yang, H. Bai, and P. Gurram, "Distributed stochastic gradient method for non-convex problems with applications in supervised learning," in IEEE 58th Conference on Decision and Control (CDC), 2019, pp. 5538-5543.
[33] V. Kungurtsev, "Stochastic gradient Langevin dynamics on a distributed network," arXiv preprint, arXiv:2001.00665, 2020.
[34] M. Xu, B. Lakshminarayanan, Y. W. Teh, J. Zhu, and B. Zhang, "Distributed Bayesian posterior sampling via moment sharing," in Advances in Neural Information Processing Systems 27, 2014, pp. 3356-3364.
[35] S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and R. E. McCulloch, "Bayes and big data: The consensus Monte Carlo algorithm," International Journal of Management Science and Engineering Management, vol. 11, pp. 78-88, 2016.
[36] M. Rabinovich, E. Angelino, and M. I. Jordan, "Variational consensus Monte Carlo," in Advances in Neural Information Processing Systems, 2015, pp. 1207-1215.
[37] S. L. Scott, "Comparing consensus Monte Carlo strategies for distributed Bayesian computation," Brazilian Journal of Probability and Statistics, vol. 31, no. 4, pp. 668-685, 2017.
[38] L. J. Rendell, A. M. Johansen, A. Lee, and N. Whiteley, "Global consensus Monte Carlo," arXiv e-prints, Jul. 2018.
[39] X. Wang and D. B. Dunson, "Parallelizing MCMC via Weierstrass sampler," arXiv e-prints, arXiv:1312.4605, 2013.
[40] W. Neiswanger, C. Wang, and E. Xing, "Asymptotically exact, embarrassingly parallel MCMC," in 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014, pp. 623-632.
[41] X. Wang, F. Guo, K. A. Heller, and D. B. Dunson, "Parallelizing MCMC with random partition trees," in Advances in Neural Information Processing Systems 28, 2015, pp. 451-459.
[42] A. Chowdhury and C. Jermaine, "Parallel and distributed MCMC via shepherding distributions," in 21st International Conference on Artificial Intelligence and Statistics, 2018, pp. 1819-1827.
[43] M. Gürbüzbalaban, X. Gao, Y. Hu, and L. Zhu, "Decentralized stochastic gradient Langevin dynamics and Hamiltonian Monte Carlo," arXiv e-prints, arXiv:2007.00590, 2020.
[44] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, "Hybrid Monte Carlo," Physics Letters B, vol. 195, no. 2, pp. 216-222, 1987.
[45] B. Leimkuhler and S. Reich, Simulating Hamiltonian Dynamics, ser. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2004.
[46] G. O. Roberts and R. L. Tweedie, "Exponential convergence of Langevin distributions and their discrete approximations," Bernoulli, vol. 2, no. 4, pp. 341-363, 1996.
[47] J. George and P. Gurram, "Distributed stochastic gradient descent with event-triggered communication," in AAAI 2020, 2020, pp. 7169-7178.
[48] N. Singh, D. Data, J. George, and S. Diggavi, "SPARQ-SGD: Event-triggered and compressed communication in decentralized stochastic optimization," in IEEE 59th Conference on Decision and Control (CDC), 2020.
[49] D. S. Lemons and A. Gythiel, "Paul Langevin's 1908 paper 'On the Theory of Brownian Motion' ['Sur la théorie du mouvement brownien,' C. R. Acad. Sci. (Paris) 146, 530-533 (1908)]," American Journal of Physics, vol. 65, no. 11, pp. 1079-1081, 1997.
[50] M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," in Advances in Neural Information Processing Systems, 2013, pp. 2292-2300.
[51] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[52] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.
[53] I. Gutman and W. Xiao, "Generalized inverse of the Laplacian matrix and some applications," Bulletin, Classe des Sciences Mathématiques et Naturelles, Sciences mathématiques, vol. 129, no. 29, pp. 15-23, 2004.
[54] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, ser. Computer Science. MIT Press, 2009.
[55] S. Kar, J. Moura, and H. Poor, "Distributed linear parameter estimation: asymptotically efficient adaptive strategies," SIAM Journal on Control and Optimization, vol. 51, no. 3, pp. 2200-2229, 2013.
[56] G. Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, ser. Texts in Applied Mathematics. Springer New York, 2014.
[57] L. Ambrosio, N. Gigli, and G. Savaré, Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.
[58] T. Chen, E. Fox, and C. Guestrin, "Stochastic gradient Hamiltonian Monte Carlo," in International Conference on Machine Learning, 2014, pp. 1683-1691.