HOGWILD!-Gibbs Can Be PanAccurate

Constantinos Daskalakis (EECS & CSAIL, MIT, costis@csail.mit.edu), Nishanth Dikkala (EECS & CSAIL, MIT, nishanthd@csail.mit.edu), Siddhartha Jayanti (EECS & CSAIL, MIT, jayanti@mit.edu)

Supported by NSF awards CCF-1617730 and IIS-1741137, a Simons Investigator Award, a Google Faculty Research Award, and an MIT-IBM Watson AI Lab research grant. Also supported by the Department of Defense (DoD) through the NDSEG Program. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

Asynchronous Gibbs sampling has recently been shown to be a fast-mixing and accurate method for estimating probabilities of events on a small number of variables of a graphical model satisfying Dobrushin's condition [DSOR16]. We investigate whether it can be used to accurately estimate expectations of functions of all the variables of the model. Under the same condition, we show that the synchronous (sequential) and asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between their (multivariate) samples remains bounded by $O(\tau \log n)$, where n is the number of variables in the graphical model and τ is a measure of the asynchronicity. A similar bound holds for any constant power of the Hamming distance. Hence, the expectation of any function that is Lipschitz with respect to a power of the Hamming distance can be estimated with a bias that grows only logarithmically in n. Going beyond Lipschitz functions, we consider the bias arising from asynchronicity in estimating the expectation of polynomial functions of all variables in the model. Using recent concentration-of-measure results [DDK17, GLP17, GSS18], we show that the bias introduced by the asynchronicity is of smaller order than the standard deviation of the function value already present in the true model. We perform experiments on a multiprocessor machine to empirically illustrate our theoretical findings.

1 Introduction

The increasingly ambitious applications of data analysis, and the corresponding growth in the size of the data that needs to be processed, have brought important scalability challenges to machine learning algorithms. Fundamental methods such as Gradient Descent and Gibbs sampling, which were designed with a sequential computational model in mind, are now applied to datasets of ever larger size. As such, there has recently been increased interest in techniques for parallelizing these methods. These algorithms, however, are inherently sequential and difficult to parallelize.

HOGWILD!-SGD, proposed by Niu et al. [NRRW11], is a lock-free asynchronous execution of stochastic gradient descent that has been shown to converge under the right sparsity conditions. Several variants of this method and extensions of the asynchronous-execution approach have since been proposed, and have found success in a broad range of applications, from PageRank approximation to deep learning and recommender systems [YHSD12, NO14, MBDC15, MPP+15, LWR+15, DSZOR15]. Similar to HOGWILD!-SGD, a lock-free asynchronous execution of Gibbs sampling, called HOGWILD!-Gibbs, was proposed by Smola and Narayanamurthy [SN10] and empirically shown to work well on several models [ZR14]. Johnson et al. [JSW13] provide sufficient conditions under which they show theoretically that HOGWILD!-Gibbs produces samples with the correct mean in Gaussian models, while Terenin et al. [TSD15] propose a modification to the algorithm that is shown to converge under some strong assumptions on asynchronous computation.

Algorithm 1: Gibbs Sampling
  Input: set of variables V, initial configuration $x_0 \in S^{|V|}$, distribution π
  for t = 1 to T do
    Sample i uniformly from {1, 2, ..., n};
    Sample $X_i \sim \Pr_\pi[\cdot \mid X_{-i} = x_{-i,t-1}]$ and set $x_{i,t} = X_i$;
    For all $j \neq i$, set $x_{j,t} = x_{j,t-1}$;
  end
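For concreteness, here is a minimal Python sketch of Algorithm 1 specialized to an Ising model with no external field (defined in Section 2); the dense weight-matrix representation and the function name are our own illustrative choices, not from the paper.

```python
import numpy as np

def gibbs_ising(theta, T, rng=None):
    """Sequential Gibbs sampling (Algorithm 1) on an Ising model with no
    external field.  theta is a symmetric (n x n) matrix of edge weights
    with zero diagonal; returns the configuration after T site updates.
    A sketch under the stated assumptions, not the authors' code."""
    rng = np.random.default_rng() if rng is None else rng
    n = theta.shape[0]
    x = rng.choice([-1, 1], size=n)              # uniformly random start
    for _ in range(T):
        i = rng.integers(n)                      # pick a site uniformly
        s = theta[i] @ x - theta[i, i] * x[i]    # local field sum_j theta_ij x_j
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * s))  # Pr[x_i = +1 | x_{-i}]
        x[i] = 1 if rng.random() < p_plus else -1
    return x
```

The conditional used above follows from the Ising model of Section 2: $\Pr[x_i = +1 \mid x_{-i}] \propto \exp(\sum_j \theta_{ij} x_j)$, which normalizes to the logistic expression in the code.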
In a more recent paper, De Sa et al. [DSOR16] propose the study of HOGWILD!-Gibbs under a stochastic model of asynchronicity in graphical models with discrete variables. Whenever the graphical model satisfies Dobrushin's condition, they show that the mixing time of the asynchronous Gibbs sampler is similar to that of the sequential (synchronous) one. Moreover, they establish that the asynchronous Gibbs sampler accurately estimates probabilities of events on a sublinear number of variables; in particular, events on up to $O(\varepsilon n / \log n)$ variables can be estimated within variational distance ε, where n is the total number of variables in the graphical model (Lemma 2, [DSOR16]).

Our Results. Our goal in this paper is to push the theoretical understanding of HOGWILD!-Gibbs to the estimation of functions of all the variables in a graphical model. In particular, we are interested in whether HOGWILD!-Gibbs can be used to accurately estimate the expectations of such functions. Results from [DSOR16] imply that accurate estimation is possible whenever the function under consideration is Lipschitz with a good Lipschitz constant with respect to the Hamming metric. Under the same Dobrushin condition used in [DSOR16] (see Definition 3), and under a stochastic model of asynchronicity with weaker assumptions (see Section 2.1), we show that one can do better than the bounds implied by [DSOR16], even for functions with bad Lipschitz constants. For instance, consider quadratic functions on an Ising model, which is a binary graphical model and serves as a canonical example of Markov random fields [LPW09, MS10, Fel04, DMR11, GG86, Ell93]. Under appropriate normalization, these functions take values in the range $[-n^2, n^2]$ and have a Lipschitz constant of order n. Given this, the results of [DSOR16] would imply that we can estimate quadratic functions on the Ising model within an error of $O(n)$. We improve this error to $\tilde{O}(\sqrt{n})$. In particular, we show the following in our paper:

- Starting at the same initial configuration, the executions of the sequential and the asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between the multivariate samples that the two samplers maintain is bounded by $O(\tau \log n)$, where n is the number of variables in the graphical model and τ is a measure of the average contention in the asynchronicity model of Section 2.1. See Lemma 2. More generally, the expectation of the d-th power of the Hamming distance is bounded by $C(d, \tau) \log^d n$, for some function $C(d, \tau)$. See Lemma 3.

- It follows from Lemmas 2 and 3 that, if a function f of the variables of a graphical model is K-Lipschitz with respect to the d-th power of the Hamming distance, then the bias in the expectation of f introduced by HOGWILD!-Gibbs under the asynchronicity model of Section 2.1 is bounded by $K \cdot C(d, \tau) \log^d n$. See Corollary 1.

- Next, we improve the bounds of Corollary 1 for functions that are degree-d polynomials of the variables of the graphical model.
Low-degree polynomials on graphical models are a natural class of functions that are of interest in many statistical tasks performed on graphical models (see, for instance, [DDK18]). For simplicity we show these improvements for the Ising model, but our results extend to general graphical models. We show, in Theorem 4, that the bias introduced by HOGWILD!-Gibbs in the expectation of a degree-d polynomial of the Ising model is bounded by $O((n \log n)^{(d-1)/2})$. This improves upon the bound of Corollary 1 by a factor of about $(n/\log n)^{(d-1)/2}$, as the Lipschitz constant with respect to the Hamming distance of a degree-d polynomial of the Ising model can be as large as $O(n^{d-1})$. Importantly, the bias of $O((n \log n)^{(d-1)/2})$ that we show is introduced by the asynchronicity is of a lower order of magnitude than the standard deviation of degree-d polynomials of the Ising model, which is $O(n^{d/2})$ (see Theorem 2) and which is already experienced by the sequential sampler. Moreover, in Theorem 5, we show that the asynchronous Gibbs sampler does not add variance of a higher order to its samples. Thus, our results suggest that running Gibbs sampling asynchronously leads to a valid bias-variance tradeoff.

Our bounds for the expected Hamming distance between the sequential and the asynchronous Gibbs samplers follow from coupling arguments, while our improvements for polynomial functions of Ising models follow from a combination of our Hamming bounds and recent concentration-of-measure results for polynomial functions of the Ising model [DDK17, GLP17, GSS18].

In Section 5, we illustrate our theoretical findings by performing experiments on a multicore machine. We experiment with graphical models over two kinds of graphs. The first is the $\sqrt{n} \times \sqrt{n}$ grid graph (which we wrap into a torus for degree regularity), where each node has 4 neighbors; the second is the clique over n nodes. We first study how valid the assumptions of the asynchronicity model are. The main assumption is that the average contention parameter τ does not grow with the number of nodes in the graph: it is a constant that depends on the hardware being used. We observe that this is indeed the case in practice; the expected contention grows linearly with the number of processors on the machine but remains constant with respect to n (see Figures 1 and 2). Next, we look at quadratic polynomials over graphical models associated with both the grid and clique graphs. We estimate their expected values under the sequential Gibbs sampler and HOGWILD!-Gibbs and measure the bias (absolute difference) between the two. Our theory predicts that this bias should scale as $\sqrt{n}$, and we observe that this is indeed the case (Figure 3). Our experiments are described in greater detail in Section 5.

2 The Model and Preliminaries

In this paper, we consider the Gibbs sampling algorithm as applied to discrete graphical models. The models will be defined on a graph G = (V, E) with |V| = n nodes and will represent a probability distribution π. We use S to denote the range of values each node in V can take. For any configuration $X \in S^{|V|}$, $\pi_i(\cdot \mid X_{-i})$ denotes the conditional distribution of variable i given the values of all other variables of the state X. In Section 4, we will look at Ising models, a particular class of discrete binary graphical models with pairwise local correlations. We consider the Ising model on a graph G = (V, E) with n nodes. This is a distribution over $\Omega = \{\pm 1\}^n$ with a parameter vector $\vec{\theta} \in \mathbb{R}^{|V|+|E|}$.
$\vec{\theta}$ has a parameter corresponding to each edge $e \in E$ and each node $v \in V$. The probability mass function assigned to a configuration x is

$$p(x) = \exp\left(\sum_{v \in V} \theta_v x_v + \sum_{e=(u,v) \in E} \theta_e x_u x_v - \Phi(\vec{\theta})\right),$$

where $\Phi(\vec{\theta})$ is the log-partition function of the distribution. We say an Ising model has no external field if $\theta_v = 0$ for all $v \in V$. For ease of exposition, we focus on the case with no external field in this paper. However, the results extend to Ising models with external fields when the functions under consideration (in Section 4) are appropriately chosen to be centered; see [DDK17].

Throughout the paper we will focus on bounded functions defined on the discrete space $S^{|V|}$. For a function f, we use $\|f\|_\infty$ to denote the maximum absolute value of the function over its domain. We use [n] to denote the set {1, 2, ..., n}. In Section 4, we study polynomial functions over the Ising model. Since $x_i^2 = 1$ always holds in an Ising model, any polynomial function of degree d can be represented as a multilinear function of degree d, and we will refer to the two interchangeably in the context of Ising models.

Definition 1 (Polynomial/Multilinear Functions of the Ising Model). A degree-d polynomial defined on n variables $x_1, \ldots, x_n$ is a function of the form

$$f_a(x) = \sum_{S \subseteq [n]:\, |S| \le d} a_S \prod_{i \in S} x_i,$$

where $a : 2^{[n]} \to \mathbb{R}$ is a coefficient vector. We will use a to denote the coefficient vector of such a multilinear function and $\|a\|_\infty$ to denote the maximum element of a in absolute value. Note that we use permutations of the subscripts to refer to the same coefficient, i.e., $a_{ijk}$ is the same as $a_{jik}$.

We now give a formal definition of Dobrushin's uniqueness condition, also known as the high-temperature regime. First we define the influence of a node j on a node i.

Definition 2 (Influence in Graphical Models). Let π be a probability distribution over some set of variables V. Let $B_j$ denote the set of state pairs (X, Y) which differ only in their value at variable j. Then the influence of node j on node i is defined as

$$I(j, i) = \max_{(X,Y) \in B_j} \left\|\pi_i(\cdot \mid X_{-i}) - \pi_i(\cdot \mid Y_{-i})\right\|_{TV}.$$

Now we are ready to state Dobrushin's condition.

Definition 3 (Dobrushin's Uniqueness Condition). Consider a distribution π defined on a set of variables V, and let

$$\alpha = \max_{i \in V} \sum_{j \in V} I(j, i).$$

π is said to satisfy Dobrushin's uniqueness condition if α < 1.

We have the following result from [DSOR16] about the mixing time of the Gibbs sampler for a model satisfying Dobrushin's condition.

Theorem 1 (Mixing Time of Sequential Gibbs Sampling). Assume that we run Gibbs sampling on a distribution that satisfies Dobrushin's condition, α < 1. Then the mixing time of sequential Gibbs sampling is bounded by

$$t_{\mathrm{mix\text{-}seq}}(\varepsilon) \le \frac{n}{1-\alpha} \log\left(\frac{n}{\varepsilon}\right).$$

Definition 4. For any discrete state space $S^{|V|}$ over the set of variables V, the Hamming distance between $x, y \in S^{|V|}$ is defined as $d_H(x, y) = \sum_{i \in V} \mathbf{1}\{x_i \neq y_i\}$.

Definition 5 (The greedy coupling between two Gibbs sampling chains). Consider two instances of Gibbs sampling associated with the same discrete graphical model π over the state space $S^{|V|}$: $X_0, X_1, \ldots$ and $Y_0, Y_1, \ldots$. The following coupling procedure is known as the greedy coupling. Start chain 1 at $X_0$ and chain 2 at $Y_0$, and in each time step t choose a node $v \in V$ uniformly at random to update in both chains. Without loss of generality, assume that S = {1, 2, ..., k}. Let $p(i_1)$ denote the probability that the first chain sets $X_{t,v} = i_1$ and let $q(i_2)$ be the probability that the second chain sets $Y_{t,v} = i_2$. Plot the points $P(i) = \sum_{j=1}^{i} p(j)$ and $Q(i) = \sum_{j=1}^{i} q(j)$ for all $i \in [k]$ on the interval [0, 1], and also set $P(0) = Q(0) = 0$ and $P(k+1) = Q(k+1) = 1$. Couple the updates according to the following rule: draw a number x uniformly at random from [0, 1]; if $x \in [P(i_1), P(i_1+1)]$ and $x \in [Q(i_2), Q(i_2+1)]$, choose $X_{t,v} = i_1$ and $Y_{t,v} = i_2$.
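In effect, both chains invert a single shared uniform draw through their respective conditional CDFs, which makes them agree as often as the overlap of the two conditionals allows. A small Python sketch of one such coupled update (the function name is ours):

```python
import numpy as np

def greedy_coupled_update(p, q, rng):
    """One greedy-coupling step from Definition 5: p and q are the two
    chains' conditional distributions of the chosen node v (length-k
    arrays summing to 1).  A single shared uniform draw is inverted
    through both cumulative distributions."""
    P, Q = np.cumsum(p), np.cumsum(q)   # P(1), ..., P(k) and Q(1), ..., Q(k)
    x = rng.random()                    # shared uniform draw in [0, 1]
    i1 = int(np.searchsorted(P, x))     # value chosen by chain 1 (0-indexed)
    i2 = int(np.searchsorted(Q, x))     # value chosen by chain 2
    return i1, i2

# For a binary state space, the chains disagree with probability exactly
# d_TV(p, q), matching the guarantee of Lemma 1 below:
rng = np.random.default_rng(0)
p, q = np.array([0.7, 0.3]), np.array([0.5, 0.5])
draws = [greedy_coupled_update(p, q, rng) for _ in range(100_000)]
print(np.mean([i1 != i2 for i1, i2 in draws]))  # approx 0.2 = d_TV(p, q)
```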
We state an important property of this coupling, which holds under Dobrushin's condition, in the following lemma.

Lemma 1. The greedy coupling (Definition 5) satisfies the following property. Let $X_0, Y_0 \in S^{|V|}$ and consider two executions of Gibbs sampling associated with distribution π, starting at $X_0$ and $Y_0$ respectively and coupled using the greedy coupling. Suppose that in step t = 1, node i is chosen to be updated in both chains. Then

$$\Pr[X_{1,i} \neq Y_{1,i}] \le \left\|\pi_i(\cdot \mid X_0^{-i}) - \pi_i(\cdot \mid Y_0^{-i})\right\|_{TV}. \tag{1}$$

2.1 Modeling Asynchronicity

We use the asynchronicity model from [NRRW11] and [DSOR16]. HOGWILD!-Gibbs is a multithreaded algorithm where each thread performs a Gibbs update on the state of a graph stored in shared memory (typically in RAM). We view each processor's write as occurring at a distinct time instant, and each write begins the next time step of the process. Assuming that the writes are all serialized, one can speak of the state of the system after t writes; we denote this as time t. HOGWILD! is modeled as a stochastic system adapted to a natural filtration $\mathcal{F}_t$, which contains all events that have occurred until time t.

Some of these writes happen based on a read performed a few steps earlier, and hence correspond to updates based on stale values in the local cache of the processor. The staleness is modeled stochastically using a random variable $\tau_{i,t}$ that denotes the delay associated with the read performed on node i at time step t: the value of node i used in the update at time t is $Y_{i,t} = X_{i,\, t - \tau_{i,t}}$. Delays across different node reads can be correlated; however, the delay distribution is independent of the configuration of the model at time t. The model imposes two restrictions on the delay distributions. First, the expected value of each delay distribution is bounded by τ, which we think of as a constant compared to n in this paper; we call τ the average contention parameter associated with a HOGWILD!-Gibbs execution. [DSOR16] impose a second restriction, which bounds the tails of the distribution of $\tau_{i,t}$. We do not need this assumption for our results: [DSOR16] need it to show that the HOGWILD! chain mixes fast, whereas our coupling arguments avoid the need for the HOGWILD! chain to mix and use only the mixing-time bounds for the sequential Gibbs sampling chain.

Let $\mathcal{T}$ denote the set of all delay distributions. We refer to the sequential Gibbs sampler associated with a distribution π as $G_\pi$, and to the HOGWILD! Gibbs sampler with delay distributions $\mathcal{T}$ associated with π as $H^{\mathcal{T}}_\pi$. Note that $H^{\mathcal{T}}_\pi$ is a time-inhomogeneous Markov chain and might not converge to a stationary distribution.
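This delay model is easy to simulate in software: each update reads a stale copy of the state, with per-coordinate delays drawn independently of the configuration. The sketch below uses Poisson delays with mean τ purely as an illustrative choice; the model only requires $\mathbb{E}[\tau_{i,t}] \le \tau$.

```python
import numpy as np

def hogwild_gibbs_ising(theta, T, tau, rng=None):
    """Simulation of HOGWILD!-Gibbs under the stochastic delay model of
    Section 2.1.  The full write history is kept so that the update at
    time t can read coordinate i from time t - tau_{i,t}.  Poisson(tau)
    delays are an assumption for illustration only.  Memory is O(T n)."""
    rng = np.random.default_rng() if rng is None else rng
    n = theta.shape[0]
    history = [rng.choice([-1, 1], size=n)]   # history[t] = state after t writes
    for t in range(1, T + 1):
        i = rng.integers(n)
        delays = rng.poisson(tau, size=n)                 # E[delay] = tau
        read_times = np.maximum(t - 1 - delays, 0)        # stale read times
        stale = np.array([history[rt][j] for j, rt in enumerate(read_times)])
        s = theta[i] @ stale - theta[i, i] * stale[i]     # stale local field
        new_state = history[-1].copy()
        new_state[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * s)) else -1
        history.append(new_state)
    return history[-1]
```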
2.2 Concentration of Polynomials on Ising Models

Here we state a known concentration-of-measure result for polynomial functions on Ising models satisfying Dobrushin's condition.

Theorem 2 (Concentration of Measure for Polynomial Functions of the Ising Model, [DDK17, GLP17, GSS18]). Consider an Ising model p without external field on a graph G = (V, E) satisfying Dobrushin's condition with Dobrushin parameter α < 1. Let $f_a$ be a degree-d polynomial over the Ising model, and let $X \sim p$. Then there is a constant $c(\alpha, d)$ such that

$$\Pr\left[|f_a(X) - \mathbb{E}[f_a(X)]| > t\right] \le 2\exp\left(-\frac{c(\alpha, d)\, t^{2/d}}{\|a\|_\infty^{2/d}\, n}\right).$$

As a corollary, this also implies $\mathrm{Var}[f_a(X)] \le C_3(d, \alpha)\, n^d$.

3 Bounding the Expected Hamming Distance Between Coupled Executions of HOGWILD! and Sequential Gibbs Samplers

In this section, we show that under the greedy coupling of the sequential and asynchronous chains, the expected Hamming distance between the two chains at any time t is small. This forms the basis for our accurate-estimation results of Section 4. We begin with Lemma 2.

Lemma 2. Let π denote a discrete probability distribution on n variables (nodes) with Dobrushin parameter α < 1. Let $G_\pi = X_0, X_1, \ldots, X_t, \ldots$ denote the execution of the sequential Gibbs sampler on π and $H^{\mathcal{T}}_\pi = Y_0, Y_1, \ldots, Y_t, \ldots$ the HOGWILD! Gibbs sampler associated with π, with $X_0 = Y_0$. Suppose the two chains run coupled in a greedy manner, and let $K_t$ denote all events that have occurred until time t in this coupled execution. Then, for all $t \ge 0$, under the greedy coupling of the two chains,

$$\mathbb{E}[d_H(X_t, Y_t) \mid K_0] \le \frac{\tau\alpha \log n}{1-\alpha}.$$

At a high level, the proof proceeds by studying the expected change in the Hamming distance under one step of the coupled execution of the chains. We can bound the expected change using the Dobrushin parameter and the property of the greedy coupling (Lemma 1). We then show that the expected change is negative whenever the Hamming distance between the two chains is above $O(\log n)$ to begin with. This allows us to argue that when the two chains start at the same configuration, the expected Hamming distance remains bounded by $O(\log n)$.

Next, we generalize the above lemma to bound the d-th moment of the Hamming distance between $X_t$ and $Y_t$ obtained from the coupled executions.

Lemma 3 (d-th moment bound on Hamming). Consider the same setting as that of Lemma 2. Then, for all $t \ge 0$, under the greedy coupling of the two chains,

$$\mathbb{E}\left[d_H(X_t, Y_t)^d \mid K_0\right] \le C(\tau, \alpha, d) \log^d n,$$

where C(.) is some function of the parameters τ, α, and d.

The proof of Lemma 3 has a similar flavor to that of Lemma 2. It is, however, more involved to bound the expected increase in the d-th power of the Hamming distance, and some careful analysis is required to see that the bound does not scale polynomially in n.

4 Estimating Global Functions Using HOGWILD! Gibbs Sampling

To begin with, we observe that our Hamming moment bounds from Section 3 imply that we can accurately estimate functions or events of the graphical model if they are Lipschitz. We show this below in Corollary 1, which quantifies the error attainable when estimating expectations of Lipschitz functions using HOGWILD!-Gibbs.

Corollary 1. Let π denote the distribution associated with a graphical model over the set of variables V (|V| = n) taking values in a discrete space $S^n$. Assume that the model satisfies Dobrushin's condition with Dobrushin parameter α < 1. Let $f : S^{|V|} \to \mathbb{R}$ be a function such that, for all $x, y \in S^{|V|}$, $|f(x) - f(y)| \le K\, d_H(x, y)^d$. Let $X \sim \pi$ and let $Y_0, Y_1, \ldots, Y_t$ denote an execution of HOGWILD!-Gibbs sampling on π with average contention parameter τ. For $t > \frac{n}{1-\alpha}\log\left(\frac{2\|f\|_\infty n}{K}\right)$,

$$|\mathbb{E}[f(Y_t)] - \mathbb{E}[f(X)]| \le K\left(C(\tau, \alpha, d) \log^d n + 1\right).$$

We note that the results of [DSOR16] can be used to obtain Corollary 1 when the function is Lipschitz with respect to the Hamming distance. The above corollary provides a simple way to bound the bias introduced by HOGWILD! in the estimation of Lipschitz functions.
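Lemma 2 is also easy to probe empirically: run the sequential and HOGWILD! chains from the same start under the greedy coupling (shared node choices and shared uniform draws) and track the Hamming distance over time. A self-contained Python sketch under the same illustrative assumptions as the earlier snippets (dense weight matrix, Poisson delays):

```python
import numpy as np

def coupled_hamming_trace(theta, T, tau, rng=None):
    """Greedily coupled run of a sequential chain X and a delayed
    (HOGWILD!-style) chain Y from the same start: both chains use the
    same node choice and the same uniform draw at every step.
    Returns the Hamming distance d_H(X_t, Y_t) for each t."""
    rng = np.random.default_rng() if rng is None else rng
    n = theta.shape[0]
    x = rng.choice([-1, 1], size=n)
    y = x.copy()
    y_hist = [y.copy()]                       # write history of the Y chain
    trace = []
    for _ in range(T):
        i = rng.integers(n)
        u = rng.random()                      # shared uniform draw (greedy coupling)
        s_x = theta[i] @ x - theta[i, i] * x[i]
        x[i] = 1 if u < 1.0 / (1.0 + np.exp(-2.0 * s_x)) else -1
        delays = rng.poisson(tau, size=n)     # illustrative delay distribution
        stale = np.array([y_hist[max(len(y_hist) - 1 - d, 0)][j]
                          for j, d in enumerate(delays)])
        s_y = theta[i] @ stale - theta[i, i] * stale[i]
        y[i] = 1 if u < 1.0 / (1.0 + np.exp(-2.0 * s_y)) else -1
        y_hist.append(y.copy())
        trace.append(int(np.sum(x != y)))     # d_H(X_t, Y_t)
    return trace
```

In such simulations one expects the trace to plateau at a level that is logarithmic in n, in line with Lemma 2.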
However, many functions of interest over graphical models are not Lipschitz with good Lipschitz constants. In many cases, even when the Lipschitz constants are bad, there is still hope for more accurate estimation. As it turns out, Dobrushin's condition provides such cases. We will focus on one such case: polynomial functions of the Ising model. Our goal will be to accurately estimate the expected values of constant-degree polynomials over the Ising model. Using the bounds from Lemmas 2 and 3, we now proceed to bound the bias introduced when computing polynomial functions of the Ising model using HOGWILD! Gibbs sampling.

We first remark that linear functions (degree-1 polynomials) suffer zero bias in their expected values under HOGWILD!-Gibbs. This is because, under zero external field, Ising models satisfy $\mathbb{E}\left[\sum_i a_i X_i\right] = 0$, since each node individually has equal probability of being +1 or -1. This symmetry is maintained by HOGWILD!-Gibbs, since the delays are configuration-agnostic: the delays when a node is +1 and when it is -1 can be coupled perfectly, leaving the symmetry intact. Therefore, we start our investigation at quadratic polynomials. Theorem 3 states the bound we show for the bias in the computation of degree-2 polynomials of the Ising model.

Theorem 3 (Bias in Quadratic Functions of the Ising Model computed using HOGWILD!-Gibbs). Consider the quadratic function $f_a(x) = \sum_{i,j:\, i \neq j} a_{ij} x_i x_j$ over an Ising model p on n nodes satisfying Dobrushin's condition with Dobrushin parameter α < 1. Let $X_t$ denote the state of the sequential sampler $G_p$ and $Y_t$ the state of the HOGWILD! sampler $H^{\mathcal{T}}_p$ with average contention parameter τ, both started at the same configuration. Then, for $t > \frac{n}{1-\alpha}\log(2\|a\|_\infty n)$, under the greedy coupling of the two chains,

$$|\mathbb{E}[f_a(X_t) - f_a(Y_t)]| \le \frac{c_2\, \|a\|_\infty\, \tau\alpha \log n}{(1-\alpha)^{3/2}}\, (n \log n)^{1/2}.$$

The main intuition behind the proof is that we can improve upon the bound implied by the Lipschitz constant by appealing to strong concentration-of-measure results about functions of graphical models under Dobrushin's condition [DDK17, GLP17, GSS18]. We extend the ideas in the above proof to bound the bias introduced by the HOGWILD! Gibbs algorithm when computing the expected values of a degree-d polynomial of the Ising model in high temperature. Our main result concerning d-linear functions is Theorem 4.

Theorem 4 (Bias in degree-d polynomials computed using HOGWILD!-Gibbs). Consider a degree-d polynomial of the form $f_a(x) = \sum_{i_1, i_2, \ldots, i_d} a_{i_1 i_2 \ldots i_d}\, x_{i_1} x_{i_2} \cdots x_{i_d}$, in the same setting as that of Theorem 3. Then, for $t > \frac{(d+1)n}{1-\alpha}\log n$, under the greedy coupling of the two chains,

$$|\mathbb{E}[f_a(X_t) - f_a(Y_t)]| \le c\, \|a\|_\infty\, (n \log n)^{(d-1)/2}.$$

Next, we show that we can accurately estimate the expectations above, by showing that the variance of these functions under the asynchronous model is comparable to their variance under the sequential model.

Theorem 5 (Variance of degree-d polynomials computed using HOGWILD!-Gibbs). Consider a high-temperature Ising model p on n nodes with Dobrushin parameter α < 1. Let $f_a(x)$ be a degree-d polynomial function, and let $Y_0, Y_1, \ldots, Y_t$ denote a run of HOGWILD! Gibbs sampling associated with p. Then, for $t > \frac{(d+1)n}{1-\alpha}\log(n^2)$,

$$\mathrm{Var}[f_a(Y_t)] \le \|a\|_\infty^2\, C(d, \alpha, \tau)\, n^d.$$

4.1 Going Beyond Ising Models

We presented results for accurate estimation of polynomial functions over the Ising model. However, the results can be extended to more general graphical models satisfying Dobrushin's condition. A main ingredient here was concentration of measure: if the class of functions we consider has d-th-order bounded differences in expectation, then we indeed get concentration of measure for these functions (Theorem 1.2 of [GSS18]). This, combined with the techniques in our paper, would allow similar gains in accurate estimation of such functions on general graphical models.
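Theorem 3 suggests a direct empirical check, which Section 5 carries out on real hardware: estimate $\mathbb{E}[f_a]$ from many sequential runs and many HOGWILD! runs and compare the means. In simulation one can mimic this as follows, reusing the hypothetical gibbs_ising and hogwild_gibbs_ising helpers from the earlier sketches; note that the complete bilinear function used in Section 5 can be computed in O(n) time, since $\sum_{i \neq j} x_i x_j = (\sum_i x_i)^2 - n$ when $x_i^2 = 1$. The run count of 5000 mirrors the experiments.

```python
import numpy as np

def complete_bilinear(x):
    """f(x) = sum_{i != j} x_i x_j = (sum_i x_i)^2 - n, using x_i^2 = 1."""
    s = x.sum()
    return s * s - len(x)

def estimate_bias(theta, T, tau, runs=5000, seed=0):
    """Monte Carlo estimate of |E[f(X_T)] - E[f(Y_T)]| for the sequential
    sampler X and the HOGWILD! sampler Y (both sketched earlier).
    Theorem 3 predicts a bias of order sqrt(n log n), far below the
    Theta(n) standard deviation that f already has under the true model."""
    rng = np.random.default_rng(seed)
    seq = [complete_bilinear(gibbs_ising(theta, T, rng)) for _ in range(runs)]
    hog = [complete_bilinear(hogwild_gibbs_ising(theta, T, tau, rng))
           for _ in range(runs)]
    return abs(np.mean(seq) - np.mean(hog))
```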
5 Experiments

We show the results of experiments run on a machine with four 10-core Intel Xeon E7-4850 CPUs to demonstrate the practical validity of our theory. In our experiments, we focused on two Ising models: Curie-Weiss and the Grid. The Curie-Weiss model CW(n, α) is the Ising model corresponding to the complete graph on n vertices, with edges of weight $\beta = \frac{\alpha}{n-1}$. The Grid($k^2$, α) model is the Ising model corresponding to the k-by-k grid, with the left side connected to the right and the top connected to the bottom to form a torus (a four-regular graph); the edge weights are α/4. The total influence of each of these models is at most α, so we chose α = 0.5 to ensure Dobrushin's condition. To generate samples, we start at a uniformly random configuration and run the Markov chains for $T = 10n\log_2 n$ steps to ensure mixing.

In our first experiment (Figure 1), we validate the modeling assumption that the average delay of a read, τ, is a constant. Computing the exact delays in a real run of HOGWILD! is not possible, but we approximate the delays by making processes log read and write operations to a lock-free queue as they execute the HOGWILD! updates. We present two plots of the average delay of a read in a HOGWILD! run of the CW(n, 0.5) Markov chain with respect to n: four asynchronous processors were used to generate the first plot, while twenty were used for the second. We notice that the average delay depends on the number of asynchronous processes, but is constant with respect to n, as assumed in our model.

Next, we plot (in Figure 2) the relationship between the number of asynchronous processors used in a HOGWILD! execution and the delay parameter τ. For this plot, we estimated τ by the average empirical delay over HOGWILD! runs of CW(n, 0.5) models, with n ranging from 100 to 1000 in increments of one hundred. The plot shows a linear relationship and suggests that the delay per additional processor is approximately 0.4 steps.

Figure 1: Average delay of reads for the CW(n, 0.5) model. Four asynchronous processors were used on the left, while twenty were used on the right.

Figure 2: Average delay of reads for the CW(n, 0.5) model as the number of processors used varies.

The primary purpose of our work is to demonstrate that polynomial statistics computed from samples of a HOGWILD! run of Gibbs sampling approximate those computed from a sequential run. Our third experiment demonstrates exactly this. We plot (in Figure 3, left) the empirical expectations of the complete bilinear function $f(X_1, \ldots, X_n) = \sum_{i \neq j} X_i X_j$ as we vary the number of nodes n in a Curie-Weiss model. Each red point is the empirical mean of f computed over 5000 samples from the HOGWILD! Markov chain corresponding to CW(n, 0.5), and each blue point is the empirical mean produced from 5000 sequential runs of the same chain. Our theory (Theorem 3) predicts that the bias, i.e., the vertical difference in height between red and blue points at any given value of n, will be on the order of the standard deviation divided by $\sqrt{n}$ (the standard deviation is $\Theta(n)$ and the bias is $\tilde{O}(\sqrt{n})$). We plot error bars of this order and find that the HOGWILD! means fall inside the error bars, corroborating our theory. We show that theory and practice coincide even for sparse graphs by making the same plot for the Grid(n, 0.5) model on the right of the same figure.

Figure 3: Means (with appropriately scaled error bars) of the complete bilinear function computed over 5000 sequential and HOGWILD! runs of CW(n, 0.5) (left) and Grid(n, 0.5) (right).
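For reference, both experimental models are straightforward to construct. The sketch below (helper names are ours) builds their weight matrices and checks that the row sums are at most α = 0.5; since for an Ising model the influence $I(j, i)$ is at most $|\theta_{ij}|$, a row-sum bound of α certifies Dobrushin's condition for both graphs.

```python
import numpy as np

def curie_weiss(n, alpha=0.5):
    """CW(n, alpha): complete graph on n vertices, edge weights alpha / (n - 1)."""
    theta = np.full((n, n), alpha / (n - 1))
    np.fill_diagonal(theta, 0.0)
    return theta

def grid_torus(k, alpha=0.5):
    """Grid(k^2, alpha): k-by-k grid wrapped into a torus (4-regular),
    edge weights alpha / 4.  Assumes k >= 3 so wrap edges are distinct."""
    n = k * k
    theta = np.zeros((n, n))
    for r in range(k):
        for c in range(k):
            v = r * k + c
            for dr, dc in ((0, 1), (1, 0)):           # right and down neighbors
                u = ((r + dr) % k) * k + (c + dc) % k
                theta[v, u] = theta[u, v] = alpha / 4.0
    return theta

# Row sums upper-bound the total influence of an Ising model, so values
# equal to 0.5 certify Dobrushin's condition for both experimental graphs:
assert np.allclose(curie_weiss(100).sum(axis=1), 0.5)
assert np.allclose(grid_torus(10).sum(axis=1), 0.5)
```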
6 Acknowledgements

We thank Prof. Srinivas Devadas and Xiangyao Yu for helping us gain access to and program on their multicore machines.

References

[DDK17] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Concentration of multilinear functions of the Ising model with applications to network data. In Advances in Neural Information Processing Systems 30, NIPS '17. Curran Associates, Inc., 2017.
[DDK18] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising models. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, Philadelphia, PA, USA, 2018. SIAM.
[DMR11] Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Evolutionary trees and the Ising model on the Bethe lattice: A proof of Steel's conjecture. Probability Theory and Related Fields, 149(1):149–189, 2011.
[DSOR16] Christopher De Sa, Kunle Olukotun, and Christopher Ré. Ensuring rapid mixing and low bias for asynchronous Gibbs sampling. In JMLR Workshop and Conference Proceedings, volume 48, page 1567. NIH Public Access, 2016.
[DSZOR15] Christopher M. De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of HOGWILD!-style algorithms. In Advances in Neural Information Processing Systems, pages 2674–2682, 2015.
[Ell93] Glenn Ellison. Learning, local interaction, and coordination. Econometrica, 61(5):1047–1071, 1993.
[Fel04] Joseph Felsenstein. Inferring Phylogenies. Sinauer Associates, Sunderland, 2004.
[GG86] Stuart Geman and Christine Graffigne. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, pages 1496–1517. American Mathematical Society, 1986.
[GLP17] Reza Gheissari, Eyal Lubetzky, and Yuval Peres. Concentration inequalities for polynomials of contracting Ising models. arXiv preprint arXiv:1706.00121, 2017.
[GSS18] Friedrich Götze, Holger Sambale, and Arthur Sinulis. Higher order concentration for functions of weakly dependent random variables. arXiv preprint arXiv:1801.06348, 2018.
[JSW13] Matthew Johnson, James Saunderson, and Alan Willsky. Analyzing Hogwild parallel Gaussian Gibbs sampling. In Advances in Neural Information Processing Systems, pages 2715–2723, 2013.
[LPW09] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
[LWR+15] Ji Liu, Stephen J. Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.
[MBDC15] Ioannis Mitliagkas, Michael Borokhovich, Alexandros G. Dimakis, and Constantine Caramanis. FrogWild!: Fast PageRank approximations on graph engines. Proceedings of the VLDB Endowment, 8(8):874–885, 2015.
[MPP+15] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.
[MS10] Andrea Montanari and Amin Saberi. The spread of innovations in social networks. Proceedings of the National Academy of Sciences, 107(47):20196–20201, 2010.
[NO14] Cyprien Noel and Simon Osindero. Dogwild!: Distributed Hogwild for CPU & GPU. In NIPS Workshop on Distributed Machine Learning and Matrix Computations, 2014.
[NRRW11] Feng Niu, Benjamin Recht, Christopher Ré, and Stephen Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[SN10] Alexander Smola and Shravan Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2):703–710, 2010.
[TSD15] Alexander Terenin, Daniel Simpson, and David Draper. Asynchronous Gibbs sampling. arXiv preprint arXiv:1509.08999, 2015.
[YHSD12] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 765–774. IEEE, 2012.
[ZR14] Ce Zhang and Christopher Ré. DimmWitted: A study of main-memory statistical analytics. Proceedings of the VLDB Endowment, 7(12):1283–1294, 2014.