# Distributed Nonparametric Regression under Communication Constraints

Yuancheng Zhu¹, John Lafferty²

**Abstract.** This paper studies the problem of nonparametric estimation of a smooth function with data distributed across multiple machines. We assume an independent sample from a white noise model is collected at each machine, and an estimator of the underlying true function needs to be constructed at a central machine. We place limits on the number of bits that each machine can use to transmit information to the central machine. Our results give both asymptotic lower bounds and matching upper bounds on the statistical risk under various settings. We identify three regimes, depending on the relationship among the number of machines, the size of the data available at each machine, and the communication budget. When the communication budget is small, the statistical risk depends solely on this communication bottleneck, regardless of the sample size. In the regime where the communication budget is large, the classic minimax risk of the non-distributed estimation setting is recovered. In an intermediate regime, the statistical risk depends on both the sample size and the communication budget.

¹Department of Statistics, Wharton School, University of Pennsylvania. ²Department of Statistics and Data Science, Yale University. Correspondence to: Yuancheng Zhu.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## 1. Introduction

Classic statistical theory studies the difficulty of estimation under various models, and attempts to find the optimal estimation procedures. Such studies usually assume that all of the collected data are available for constructing the estimators. In this paper, we study the problem of statistical estimation with data residing at multiple machines. Estimation in distributed settings is becoming common in modern data analysis tasks, as data are often collected or stored at different locations. In order to obtain an estimate of some statistical functional, information needs to be gathered and aggregated from the multiple locations to form the final estimate. However, the communication between machines may be limited. For instance, there may be a communication budget that limits how much information can be transmitted. In this setting, it is important to understand how the statistical risk of estimation degrades as the communication budget becomes more limited.

A similar problem, called the CEO problem, was first studied in the electrical engineering community from a rate-distortion-theory perspective (Berger et al., 1996; Viswanathan & Berger, 1997). More recently, several studies have focused on more specific statistical tasks and models; see, for example, Zhang et al. (2013a), Shamir (2014), Battey et al. (2015), Braverman et al. (2016), Diakonikolas et al. (2017), Fan et al. (2017), and Lee et al. (2017), which treat mean estimation, regression, principal eigenspace estimation, discrete density estimation, and other problems. Most of this existing research focuses on parametric and discrete models, where the parameter of interest has a finite dimension. While there are also studies of nonparametric problems and models (Zhang et al., 2013b; Blanchard & Mücke, 2016; Chang et al., 2017; Shang & Cheng, 2017), the fundamental limits of distributed nonparametric estimation are still under-explored.
In this paper, we consider a fundamental nonparametric estimation task: estimating a smooth function in the white noise model. We assume observation of the random process
$$dY(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW(t), \qquad 0 \le t \le 1, \tag{1.1}$$
where $1/\sqrt{n}$ is the noise level, $W(t)$ is a standard Wiener process, and $f$ is the underlying function to be estimated. The white noise model is a centerpiece of nonparametric estimation, being asymptotically equivalent to nonparametric regression and density estimation (Brown & Low, 1996; Nussbaum, 1996). We intentionally express the noise level as $1/\sqrt{n}$ to reflect the connection between the white noise model and a nonparametric regression problem with $n$ evenly spaced observations. We focus on the important case where the regression function lies in the Sobolev space $\mathcal{F}(\alpha, c)$ of order $\alpha$ and radius $c$; the exact definition of this function space is given in the following section.

In a distributed setting, instead of observing a single sample path $Y(t)$, we assume there are $m$ machines, each of which observes an independent copy of the stochastic process. That is, the $j$th machine observes
$$dY_j(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW_j(t), \qquad 0 \le t \le 1,$$
for $j = 1, \ldots, m$, where the $W_j(t)$'s are mutually independent standard Wiener processes. Furthermore, each machine has a budget of $b$ bits to communicate with a central machine, where a final estimate $\hat f$ is formed based on the messages received from the $m$ machines. Specifically, we denote by $\Pi_j$ the message that the $j$th machine sends to the central estimating machine; each $\Pi_j$ can be viewed as a (possibly random) functional of the stochastic process $Y_j(t)$. In this way, the tuple $(n, m, b)$ defines a problem instance for the function class $\mathcal{F}(\alpha, c)$. We use the minimax risk
$$R(n, m, b; \mathcal{F}(\alpha, c)) = \inf_{\hat f,\, \Pi_{1:m}}\; \sup_{f \in \mathcal{F}(\alpha, c)} \mathbb{E}\, \big\| \hat f(\Pi_1, \ldots, \Pi_m) - f \big\|^2$$
to quantify the hardness of distributed estimation of $f$ in the Sobolev space $\mathcal{F}(\alpha, c)$.

The main contribution of the paper is to identify the following three asymptotic regimes.

- An insufficient regime, where $mb \lesssim n^{\frac{1}{2\alpha+1}}$. Under this scaling, the total number of bits, $mb$, is insufficient to preserve the classical, non-distributed, minimax rate of convergence for the sample size $n$ on a single machine. Therefore, the communication budget becomes the main bottleneck, and we have
$$R(n, m, b; \mathcal{F}(\alpha, c)) \asymp (mb)^{-2\alpha}.$$
- A sufficient regime, where $b \gtrsim (mn)^{\frac{1}{2\alpha+1}}$. In this case, the number of bits allowed per machine is relatively large, and we have the minimax risk
$$R(n, m, b; \mathcal{F}(\alpha, c)) \asymp (mn)^{-\frac{2\alpha}{2\alpha+1}}.$$
Note that this is also the optimal convergence rate if all the data were available at the central machine.
- An intermediate regime, where $b \lesssim (mn)^{\frac{1}{2\alpha+1}}$ and $mb \gtrsim n^{\frac{1}{2\alpha+1}}$. In this regime, the minimax risk depends on all three parameters, and scales according to
$$R(n, m, b; \mathcal{F}(\alpha, c)) \asymp (mnb)^{-\frac{\alpha}{\alpha+1}}.$$

Together, these three regimes give a sharp characterization of the statistical behavior of distributed nonparametric estimation for the Sobolev space $\mathcal{F}(\alpha, c)$ under communication constraints, covering the full range of parameters and problem settings. The Bayesian framework adopted in this paper to establish the lower bounds is different from the techniques used in previous work, which typically rely on Fano's lemma and the strong data processing inequality. Finally, we note that an essentially equivalent set of minimax convergence rates is obtained in a simultaneously and independently written paper by Szabo & van Zanten (2018).
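To make the three-regime characterization above concrete, the following small Python sketch returns the order of the minimax risk for a given configuration $(n, m, b, \alpha)$. It is purely illustrative: constants, log factors, and boundary cases are ignored, and the function name is ours rather than anything defined in the paper.

```python
import numpy as np

def minimax_rate(n, m, b, alpha):
    """Order of the minimax risk for a Sobolev ball of smoothness alpha
    under the three-regime characterization (constants ignored)."""
    if m * b <= n ** (1.0 / (2 * alpha + 1)):
        return (m * b) ** (-2.0 * alpha)                      # insufficient: communication-limited
    if b >= (m * n) ** (1.0 / (2 * alpha + 1)):
        return (m * n) ** (-2.0 * alpha / (2 * alpha + 1))    # sufficient: centralized rate recovered
    return (m * n * b) ** (-alpha / (alpha + 1.0))            # intermediate regime

# Fixing m and b and letting n grow, the risk decays at the centralized rate,
# then more slowly, and eventually saturates because of the communication budget.
for n in [1e3, 1e5, 1e7, 1e9, 1e11]:
    print(f"n = {n:.0e}, risk order = {minimax_rate(n, m=10, b=100, alpha=1.0):.2e}")
```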
The paper is organized as follows. In the next section, we explain our notation and give a brief introduction to nonparametric estimation over a Sobolev space, both in the usual non-distributed setting and in a distributed setting. In Section 3, we state our main results on the risk of distributed nonparametric estimation with communication constraints. We outline the proof strategy for the lower bounds in Section 3.1, deferring some of the technical details and proofs to the supplementary material. In Section 4, we show achievability of the lower bounds by a particular distributed protocol and estimator. We conclude the paper with a discussion of possible directions for future work.

## 2. Problem formulation

The Sobolev space of order $\alpha$ and radius $c$ is defined by
$$\mathcal{F}(\alpha, c) = \Big\{ f : [0, 1] \to \mathbb{R} \;:\; f^{(\alpha - 1)} \text{ is absolutely continuous and } \int_0^1 \big( f^{(\alpha)}(t) \big)^2\, dt \le c^2 \Big\}.$$
Intuitively, it is a space of functions having a certain degree of smoothness, increasing with $\alpha$. The periodic Sobolev space is defined by
$$\widetilde{\mathcal{F}}(\alpha, c) = \mathcal{F}(\alpha, c) \cap \big\{ f : f^{(j)}(0) = f^{(j)}(1),\; j = 0, 1, \ldots, \alpha - 1 \big\}.$$

The white noise model (1.1) can be reformulated as an infinite Gaussian sequence model. Let $(\varphi_i)_{i=1}^{\infty}$ be the trigonometric basis, and let
$$\theta_i = \int_0^1 \varphi_i(t) f(t)\, dt, \qquad i = 1, 2, \ldots$$
be the Fourier coefficients of $f$. It is known that $f$ belongs to $\widetilde{\mathcal{F}}(\alpha, c)$ if and only if the sequence $\theta$ belongs to the Sobolev ellipsoid $\Theta(\alpha, c)$, defined as
$$\Theta(\alpha, c) = \Big\{ \theta : \sum_{i=1}^{\infty} a_i^2 \theta_i^2 \le \frac{c^2}{\pi^{2\alpha}} \Big\}, \qquad a_i = \begin{cases} i^{\alpha} & \text{if } i \text{ is even} \\ (i - 1)^{\alpha} & \text{if } i \text{ is odd.} \end{cases}$$
To ease the analysis, we will assume $a_i = i^{\alpha}$ and write $\tilde c^2$ in place of $c^2 / \pi^{2\alpha}$. Expanding the observed process $Y(t)$ in terms of the same basis, we obtain the Gaussian sequence
$$y_i = \int_0^1 \varphi_i(t)\, dY(t) \sim N(\theta_i, 1/n), \qquad i = 1, 2, \ldots.$$
Given an estimator $\hat\theta$ of $\theta$, we can form a corresponding estimator of $f$ by
$$\hat f(t) = \sum_{i=1}^{\infty} \hat\theta_i\, \varphi_i(t),$$
and the squared errors satisfy $\|\hat\theta - \theta\|^2 = \|\hat f - f\|^2$. In this way, estimating the function $f$ in the white noise model is equivalent to estimating the means $\theta$ in the Gaussian sequence model. The minimax risk of estimating $f$ over the periodic Sobolev space is defined as
$$R(n; \widetilde{\mathcal{F}}(\alpha, c)) = \inf_{\hat f} \sup_{f \in \widetilde{\mathcal{F}}(\alpha, c)} \mathbb{E}\, \|\hat f - f\|^2,$$
which, as just shown, is equal to the minimax risk of estimating $\theta$ over the Sobolev ellipsoid in the corresponding Gaussian sequence model,
$$R(n; \Theta(\alpha, \tilde c)) = \inf_{\hat\theta} \sup_{\theta \in \Theta(\alpha, \tilde c)} \mathbb{E}\, \|\hat\theta - \theta\|^2.$$
It is well known that the asymptotic minimax risk scales according to
$$R(n; \widetilde{\mathcal{F}}(\alpha, c)) = R(n; \Theta(\alpha, \tilde c)) \asymp n^{-\frac{2\alpha}{2\alpha+1}}$$
as $n \to \infty$ (Tsybakov, 2008).

In a distributed setting, we suppose there are $m$ machines, and the $j$th machine independently observes $Y_j(t)$ such that
$$dY_j(t) = f(t)\, dt + \frac{1}{\sqrt{n}}\, dW_j(t), \qquad 0 \le t \le 1,$$
for $j = 1, \ldots, m$. Equivalently, in terms of the Gaussian sequence model, the $j$th machine observes the data
$$X_{ij} \sim N(\theta_i, 1/n), \qquad i = 1, 2, \ldots.$$
We further assume there is a central machine where a final estimator needs to be computed based on messages received from the $m$ local machines. Local machine $j$ sends a message of length $b_j$ bits to the central machine; we denote this message by $\Pi_j$. Then $\Pi_j = \Pi_j(X_{1j}, X_{2j}, \ldots)$ can be viewed as a (possibly random) mapping from $\mathbb{R}^{\infty}$ to $\{1, 2, \ldots, 2^{b_j}\}$. The final estimator $\hat\theta$ is then a functional of the collection of messages. The mechanism can be summarized by the following diagram:
$$Y_j(t) \;\longrightarrow\; (X_{1j}, X_{2j}, \ldots) \;\xrightarrow{\;b_j \text{ bits}\;} \Pi_j \;\longrightarrow\; \hat\theta(\Pi_1, \ldots, \Pi_m), \qquad j = 1, \ldots, m.$$
We suppose that the communication is restricted by one of two types of constraints: an individual constraint, where $b_j \le b$ for each $j = 1, \ldots, m$ and a given budget $b$; or a sum constraint, where $\sum_{j=1}^{m} b_j \le mb$.
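As an illustration of this observation model, the following sketch simulates the distributed Gaussian sequence data. The truncation to finitely many coefficients, the rescaling used to place $\theta$ inside the ellipsoid, and the helper names are our own simplifications of the infinite-dimensional model, with $a_i = i^{\alpha}$ as above.

```python
import numpy as np

def sample_theta(alpha, c, num_coef, rng):
    """Draw a mean sequence theta inside the (simplified) Sobolev ellipsoid
    {theta : sum_i i^{2 alpha} theta_i^2 <= c^2}, truncated to num_coef terms."""
    i = np.arange(1, num_coef + 1)
    theta = rng.standard_normal(num_coef) / i ** (alpha + 0.5)   # decaying coefficients
    radius = np.sqrt(np.sum(i ** (2 * alpha) * theta ** 2))
    return theta * (0.9 * c / max(radius, 1e-12))                # rescale to lie inside the ball

def distributed_observations(theta, n, m, rng):
    """X[i, j] ~ N(theta_i, 1/n), independently across coordinates i and machines j."""
    num_coef = len(theta)
    return theta[:, None] + rng.standard_normal((num_coef, m)) / np.sqrt(n)

rng = np.random.default_rng(0)
theta = sample_theta(alpha=1.0, c=1.0, num_coef=200, rng=rng)
X = distributed_observations(theta, n=1000, m=10, rng=rng)   # machine j holds column X[:, j]
print(X.shape)
```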
We call the set of mappings $\Pi_1, \ldots, \Pi_m$ together with the estimator $\hat\theta$ a distributed protocol, and denote by $\Gamma_{\mathrm{ind}}(m, b)$ and $\Gamma_{\mathrm{sum}}(m, b)$ the collections of all such protocols operating under the individual constraint and the sum constraint, respectively. We note here that for simplicity we consider only one round of communication. A variant is to allow multiple rounds of communication, in which the local machines have access to a blackboard to which the central machine broadcasts information back to the distributed nodes.

The minimax risk of the distributed estimation problem under the communication constraint is defined by
$$R(n, m, b; \Theta(\alpha, c)) = \inf_{(\Pi_1, \ldots, \Pi_m, \hat\theta) \in \Gamma(m, b)}\; \sup_{\theta \in \Theta(\alpha, c)} \mathbb{E}\, \big\|\hat\theta(\Pi_1, \ldots, \Pi_m) - \theta\big\|^2. \tag{2.1}$$
Here $\Gamma$ represents either $\Gamma_{\mathrm{ind}}$ or $\Gamma_{\mathrm{sum}}$. In fact, it will become clear that the minimax risks under the two types of constraints are asymptotically equivalent.

## 3. Lower bounds for distributed estimation

In what follows, we work in an asymptotic regime where the tuple $(n, m, b)$ goes to infinity while satisfying certain relationships, and show how the minimax risk for the distributed estimation problem scales accordingly. The main result can be summarized in the following theorem.

**Theorem 3.1.** Let $R(n, m, b; \Theta(\alpha, c))$ be defined as in (2.1) with $\Gamma = \Gamma_{\mathrm{sum}}$, and let $C > 0$ denote a constant that may depend on $\alpha$ and $c$.

1. If $b\,(mn)^{-\frac{1}{2\alpha+1}} \to \infty$, then
$$\liminf_{mn \to \infty}\; (mn)^{\frac{2\alpha}{2\alpha+1}}\, R(n, m, b; \Theta(\alpha, c)) \ge C.$$
2. If $b\,(mn)^{-\frac{1}{2\alpha+1}} = O(1)$ and $mb\,n^{-\frac{1}{2\alpha+1}} \to \infty$, then
$$\liminf_{mn \to \infty}\; (mnb)^{\frac{\alpha}{\alpha+1}}\, R(n, m, b; \Theta(\alpha, c)) \ge C.$$
3. If $mb\,n^{-\frac{1}{2\alpha+1}} = O(1)$, then
$$\liminf_{mn \to \infty}\; (mb)^{2\alpha}\, R(n, m, b; \Theta(\alpha, c)) \ge C.$$

**Remark 3.1.** The lower bounds are valid for both the sum constraint and the individual constraint. In fact, the individual constraint is more stringent than the sum constraint, so in terms of lower bounds it suffices to prove them for the sum constraint.

**Remark 3.2.** To put the result more concisely, we can write
$$R(n, m, b; \Theta(\alpha, c)) \asymp \begin{cases} (mn)^{-\frac{2\alpha}{2\alpha+1}} & \text{if } b\,(mn)^{-\frac{1}{2\alpha+1}} \to \infty \\ (mnb)^{-\frac{\alpha}{\alpha+1}} & \text{if } b\,(mn)^{-\frac{1}{2\alpha+1}} = O(1) \text{ and } mb\,n^{-\frac{1}{2\alpha+1}} \to \infty \\ (mb)^{-2\alpha} & \text{if } mb\,n^{-\frac{1}{2\alpha+1}} = O(1). \end{cases}$$
There are multiple ways to interpret this main result, and here we illustrate one of the many possibilities. Fixing $m$ and $b$, and viewing the minimax risk as a function of $n$, the sample size on each machine, we have
$$R(n, m, b; \Theta(\alpha, c)) \asymp \begin{cases} n^{-\frac{2\alpha}{2\alpha+1}}\, m^{-\frac{2\alpha}{2\alpha+1}} & \text{if } n \lesssim b^{2\alpha+1}/m \\ n^{-\frac{\alpha}{\alpha+1}}\, (mb)^{-\frac{\alpha}{\alpha+1}} & \text{if } b^{2\alpha+1}/m \lesssim n \lesssim (mb)^{2\alpha+1} \\ (mb)^{-2\alpha} & \text{if } n \gtrsim (mb)^{2\alpha+1}. \end{cases}$$
This indicates that when the configuration of machines and communication budget stays the same, as we increase the sample size at each machine, the risk first decays at the optimal rate, with exponent $\frac{2\alpha}{2\alpha+1}$ in $n$. Once the sample size is large enough, the convergence rate slows down to an exponent of $\frac{\alpha}{\alpha+1}$. Eventually, the sample size exceeds a threshold beyond which any further increase will not decrease the risk, due to the communication constraint.

**Remark 3.3.** This work can be viewed as a natural generalization of Zhu & Lafferty (2017), where the authors consider estimation over a Sobolev space with a single remote machine and communication constraints. Specifically, by setting $m = 1$ we recover the main results in Zhu & Lafferty (2017) up to a constant factor. However, with more than one machine, it is non-trivial to uncover the minimax convergence rate, especially in the intermediate regime.

### 3.1. Proof of the lower bounds

We now proceed to outline the proof of the lower bounds in Theorem 3.1. Most existing results rely on Fano's lemma and the strong data processing inequality (Zhang et al., 2013a; Braverman et al., 2016).
An extension of this information-theoretic approach is used by Szabo & van Zanten (2018) in the nonparametric setting to obtain essentially the same lower bounds as we establish here. In contrast, we develop the Bayesian framework for deriving minimax lower bounds (Johnstone, 2017), circumventing the need for both Fano's lemma and the strong data processing inequality, and associating the lower bounds with the solution of an optimization problem.

We consider a prior distribution $\pi(\theta)$ asymptotically supported on the parameter space $\Theta$. For any estimator $\hat\theta$ that follows the distributed protocol, we have
$$\sup_{\theta \in \Theta} \mathbb{E}_\theta\, \|\hat\theta - \theta\|^2 \;\ge\; \int_{\Theta} \mathbb{E}_\theta\, \|\hat\theta - \theta\|^2\, d\pi(\theta). \tag{3.1}$$
That is, the worst-case risk associated with $\hat\theta$ is bounded from below by the integrated risk. We specifically consider the Gaussian prior distribution with $\theta_i \sim N(0, \sigma_i^2)$ independently for $i = 1, \ldots, \ell$, and $\mathbb{P}(\theta_i = 0) = 1$ for $i = \ell + 1, \ldots$, where the sequence $\sigma_i$ satisfies $\sum_{i=1}^{\ell} i^{2\alpha} \sigma_i^2 \le \tilde c^2$. We make (3.1) precise in the following lemma, whose proof can be found in the supplementary material.

**Lemma 3.1.** Suppose that a sequence of Gaussian prior distributions for $\theta$ and an estimator $\hat\theta$ satisfy
$$\frac{\sum_{i=1}^{\ell} i^{2\alpha} \sigma_i^2}{\max_{1 \le i \le \ell} i^{2\alpha} \sigma_i^2} = O(\ell) \qquad \text{and} \qquad \int_{\Theta} \mathbb{E}_\theta\big[\|\hat\theta - \theta\|^2\big]\, d\pi(\theta) = O(\ell^{\delta}) \tag{3.2}$$
for some $\delta > 0$ as $\ell \to \infty$. Then
$$\sup_{\theta \in \Theta} \mathbb{E}_\theta\, \|\hat\theta - \theta\|^2 \;\ge\; \int_{\Theta} \mathbb{E}_\theta\big[\|\hat\theta - \theta\|^2\big]\, d\pi(\theta)\, \big(1 + o(1)\big).$$

The next step is to lower bound the integrated risk $\int_{\Theta} \mathbb{E}_\theta[\|\hat\theta - \theta\|^2]\, d\pi(\theta)$. Lemma 3.2 is derived from a result that appears in Wang et al. (2010); for completeness we include the proof in the supplementary material.

**Lemma 3.2.** Suppose $\theta_i \sim N(0, \sigma_i^2)$ and $X_{ij} \sim N(\theta_i, \varepsilon^2)$ for $i = 1, \ldots, \ell$ and $j = 1, \ldots, m$. Let $\Pi_j : \mathbb{R}^{\ell} \to \{1, \ldots, M_j\}$ be a (random) mapping, which takes up to $M_j$ different values, and let $\hat\theta : \{1, \ldots, M_1\} \times \cdots \times \{1, \ldots, M_m\} \to \mathbb{R}^{\ell}$ be an estimator based on the messages created by $\Pi_1, \ldots, \Pi_m$. Under the constraint that $\frac{1}{m} \sum_{j=1}^{m} \log M_j \le b$, the risk $\mathbb{E}\, \|\hat\theta - \theta\|^2$ is lower bounded by the value $L(m, b, \varepsilon; \sigma)$ of the following optimization problem:
$$\begin{aligned}
\min_{d_i,\; i = 1, \ldots, \ell} \quad & \sum_{i=1}^{\ell} d_i \\
\text{subject to} \quad & \sum_{i=1}^{\ell} \left[ \frac{1}{2} \log \frac{\sigma_i^2}{d_i} + \frac{m}{2} \log \frac{m/\varepsilon^2}{1/\sigma_i^2 + m/\varepsilon^2 - 1/d_i} \right] \le mb, \\
& \frac{\sigma_i^2\, \varepsilon^2/m}{\sigma_i^2 + \varepsilon^2/m} \le d_i \le \sigma_i^2 \qquad \text{for } i = 1, \ldots, \ell.
\end{aligned} \tag{3.3}$$

Combining Lemmas 3.1 and 3.2, we obtain the asymptotic lower bound
$$R(n, m, b; \Theta(\alpha, \tilde c)) \;\ge\; L\big(m, b, n^{-1/2}; \sigma\big)\, \big(1 + o(1)\big)$$
for sequences $\sigma_i$ satisfying $\sum_{i=1}^{\ell} i^{2\alpha} \sigma_i^2 \le \tilde c^2$ and $\sum_{i=1}^{\ell} i^{2\alpha} \sigma_i^2 \big/ \max_{1 \le i \le \ell} i^{2\alpha} \sigma_i^2 = O(\ell)$ as $\ell \to \infty$.
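As a numerical sanity check on the prior construction (not part of the proof), one can sample from a Gaussian prior of the kind used in the regimes below and observe that the normalized ellipsoid norm of a draw concentrates around one as $\ell$ grows; this is the sense in which the prior is asymptotically supported on (an arbitrarily small enlargement of) the ellipsoid. The variance sequence in this sketch is the one used later for the communication-limited regime, and the constants are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, c_tilde = 1.0, 1.0

for ell in [100, 10_000, 1_000_000]:
    i = np.arange(1, ell + 1)
    sigma2 = c_tilde ** 2 / (i ** (2 * alpha) * ell)     # so that sum_i i^{2a} sigma_i^2 = c_tilde^2
    theta = rng.standard_normal(ell) * np.sqrt(sigma2)    # one draw from the prior
    ratio = np.sum(i ** (2 * alpha) * theta ** 2) / c_tilde ** 2
    print(ell, ratio)                                     # fluctuations around 1 shrink as ell grows
```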
Next, based on the optimization problem formulated above, we work under three different regimes and derive three forms of lower bounds on the minimax risk. The key is to choose an appropriate sequence of prior variances $\sigma_i^2$ for each regime, as we now illustrate.

1. Suppose that $d_1, \ldots, d_\ell$ is a feasible solution to the problem (3.3). Since the second term in the first constraint of (3.3) is nonnegative whenever $d_i \le \sigma_i^2$, we have
$$mb \;\ge\; \sum_{i=1}^{\ell} \left[ \frac{1}{2} \log \frac{\sigma_i^2}{d_i} + \frac{m}{2} \log \frac{m/\varepsilon^2}{1/\sigma_i^2 + m/\varepsilon^2 - 1/d_i} \right] \;\ge\; \sum_{i=1}^{\ell} \frac{1}{2} \log \frac{\sigma_i^2}{d_i}.$$
Therefore, by Jensen's inequality,
$$\sum_{i=1}^{\ell} d_i \;\ge\; \ell \Big( \prod_{i=1}^{\ell} d_i \Big)^{1/\ell} \;\ge\; \ell \exp\left( \frac{\sum_{i=1}^{\ell} \log \sigma_i^2 - 2mb}{\ell} \right).$$
Consider an asymptotic regime where $mb\, n^{-\frac{1}{2\alpha+1}} = O(1)$, and pick a sequence of corresponding prior distributions with $\ell = \gamma mb$ for some constant $\gamma$ and $\sigma_i^2 = \frac{\tilde c^2}{i^{2\alpha} \ell}$ for $i = 1, \ldots, \ell$. Note that this choice satisfies condition (3.2). With such a choice of the prior distribution, using $\sum_{i=1}^{\ell} \log i \le \ell \log \ell$, we have
$$\sum_{i=1}^{\ell} d_i \;\ge\; \ell \cdot \frac{\tilde c^2}{\ell}\, \ell^{-2\alpha}\, e^{-2mb/\ell} \;=\; \tilde c^2 e^{-2/\gamma}\, \ell^{-2\alpha} \;\asymp\; (mb)^{-2\alpha}.$$

2. Again suppose that $d_1, \ldots, d_\ell$ is a feasible solution to the problem (3.3). This time we take another viewpoint of the first constraint. Since $-\log(1 - x) \ge x$ for $x < 1$, each summand satisfies
$$\frac{1}{2} \log \frac{\sigma_i^2}{d_i} + \frac{m}{2} \log \frac{m/\varepsilon^2}{1/\sigma_i^2 + m/\varepsilon^2 - 1/d_i} \;\ge\; \frac{m}{2} \cdot \frac{\varepsilon^2}{m} \Big( \frac{1}{d_i} - \frac{1}{\sigma_i^2} \Big) \;=\; \frac{\varepsilon^2}{2} \Big( \frac{1}{d_i} - \frac{1}{\sigma_i^2} \Big),$$
so the first constraint implies $\sum_{i=1}^{\ell} \big( \frac{1}{d_i} - \frac{1}{\sigma_i^2} \big) \le \frac{2mb}{\varepsilon^2}$. Minimizing $\sum_{i=1}^{\ell} d_i$ under this constraint, either by writing the Lagrangian or by the Cauchy–Schwarz inequality $\big(\sum_i d_i\big)\big(\sum_i 1/d_i\big) \ge \ell^2$, gives
$$\sum_{i=1}^{\ell} d_i \;\ge\; \frac{\ell^2}{\sum_{i=1}^{\ell} 1/\sigma_i^2 + 2mb/\varepsilon^2}.$$
Now consider a regime where $b\,(mn)^{-\frac{1}{2\alpha+1}} = O(1)$ and $mb\, n^{-\frac{1}{2\alpha+1}} \to \infty$. Pick a sequence of corresponding prior distributions with $\ell = (\gamma mbn)^{\frac{1}{2\alpha+2}}$ for some constant $\gamma$ and $\sigma_i^2 = \frac{\tilde c^2}{\sum_{i'=1}^{\ell} (i')^{2\alpha}}$ for $i = 1, \ldots, \ell$, which satisfies condition (3.2). With this choice, and replacing $\varepsilon^2$ by $\frac{1}{n}$,
$$\sum_{i=1}^{\ell} d_i \;\ge\; \frac{\ell^2}{\ell \sum_{i=1}^{\ell} i^{2\alpha} / \tilde c^2 + 2mbn} \;=\; \frac{\gamma\, \tilde c^2 (2\alpha + 1)}{\gamma + 2 \tilde c^2 (2\alpha + 1)}\; \ell^{-2\alpha}\, (1 + o(1)) \;\asymp\; (mbn)^{-\frac{\alpha}{\alpha+1}}.$$

3. For the last regime, where $b\,(mn)^{-\frac{1}{2\alpha+1}} \to \infty$, we use the constraint $d_i \ge \frac{\sigma_i^2\, \varepsilon^2/m}{\sigma_i^2 + \varepsilon^2/m}$ and write, with $\varepsilon^2 = \frac{1}{n}$,
$$\frac{\sigma_i^2\, \varepsilon^2/m}{\sigma_i^2 + \varepsilon^2/m} = \frac{\sigma_i^2 \cdot \frac{1}{mn}}{\sigma_i^2 + \frac{1}{mn}}.$$
Let $\ell = (\gamma mn)^{\frac{1}{2\alpha+1}}$ and $\sigma_i^2 = \frac{\tilde c^2}{\sum_{i'=1}^{\ell} (i')^{2\alpha}}$, which satisfies (3.2). Then
$$\sum_{i=1}^{\ell} d_i \;\ge\; \ell\, \frac{\frac{\tilde c^2}{\sum_{i=1}^{\ell} i^{2\alpha}} \cdot \frac{1}{mn}}{\frac{\tilde c^2}{\sum_{i=1}^{\ell} i^{2\alpha}} + \frac{1}{mn}} \;=\; \frac{\ell}{mn} \cdot \frac{\tilde c^2 (2\alpha + 1)}{\tilde c^2 (2\alpha + 1) + \gamma}\, (1 + o(1)) \;\asymp\; (mn)^{-\frac{2\alpha}{2\alpha+1}}.$$

Thus, combining the previous three scenarios, we conclude the lower bounds in Theorem 3.1.

## 4. Achievability

In this section, we describe how the lower bound can be achieved by a particular distributed protocol. Unlike for the lower bound, we work under the individual constraint on the communication budget instead of the sum constraint; a protocol satisfying the individual constraint automatically satisfies the sum constraint.

### 4.1. High-level idea

In nonparametric estimation theory, it is known that for the Gaussian sequence model $X_i \sim N\big(\theta_i, \frac{1}{n}\big)$, $i = 1, 2, \ldots$, with $\theta \in \Theta(\alpha, c)$, the optimal scaling of the $\ell_2$ risk is $n^{-\frac{2\alpha}{2\alpha+1}}$, and this can be achieved by truncating the sequence at $i = O\big(n^{\frac{1}{2\alpha+1}}\big)$. That is, the estimator
$$\hat\theta_i = \begin{cases} X_i & \text{if } i \le n^{\frac{1}{2\alpha+1}} \\ 0 & \text{if } i > n^{\frac{1}{2\alpha+1}} \end{cases}$$
has worst-case risk $\sup_{\theta \in \Theta(\alpha, c)} \mathbb{E}_\theta\, \|\hat\theta - \theta\|^2 \asymp n^{-\frac{2\alpha}{2\alpha+1}}$. We build on this simple but rate-optimal estimator in our distributed protocol. Before carefully defining and analyzing the protocol, we first give a high-level idea of how it is designed.

In our distributed setting, we have a total budget of $mb$ bits to communicate from the local machines to the central machine, which means that we can transmit $O(mb)$ random variables to a certain degree of precision. In the first regime, where $mb \lesssim n^{\frac{1}{2\alpha+1}}$, the communication budget is so small that the total number of bits is smaller than the effective dimension for the noise level $1/n$. In this case, we let each machine transmit information about a unique set of $O(b)$ components of $\theta$. Thus, at the central machine, we can decode and obtain information about the first $O(mb)$ components of $\theta$. This is equivalent to truncating a centralized Gaussian sequence at $i = O(mb)$, and gives us a convergence rate of $(mb)^{-2\alpha}$.

In the second regime ($b \lesssim (mn)^{\frac{1}{2\alpha+1}}$ and $mb \gtrsim n^{\frac{1}{2\alpha+1}}$), we have a larger budget at our disposal, and can thus afford to transmit more than one random variable containing information about each $\theta_i$. Suppose that for a specific $i$ we quantize and transmit $X_{ij}$ for $k$ different values of $j$, namely at $k$ different machines. The budget of $O(mb)$ random variables then allows us to acquire information about the first $O\big(\frac{mb}{k}\big)$ components of $\theta$. After aggregating at the central machine, we have
$$Z_i \sim N\Big( \theta_i, \frac{1}{nk} \Big) \quad \text{for } i = 1, \ldots, O\Big(\frac{mb}{k}\Big),$$
and no information about $\theta_i$ for $i \gtrsim \frac{mb}{k}$. Now consider the effect of choosing different values of $k$. Choosing a smaller $k$, we are able to estimate more components of $\theta$, but each at a lower accuracy. On the other hand, a larger $k$ leads to fewer components being estimated, but with smaller error. We know from nonparametric estimation theory that the tradeoff is optimized when $(kn)^{\frac{1}{2\alpha+1}} \asymp \frac{mb}{k}$. This gives the optimal choice $k \asymp (mb)^{\frac{2\alpha+1}{2\alpha+2}}\, n^{-\frac{1}{2\alpha+2}}$, with risk scaling as $(mbn)^{-\frac{\alpha}{\alpha+1}}$.

In the last regime, we have $b \gtrsim (mn)^{\frac{1}{2\alpha+1}}$. In this case, the number of bits available at each machine is larger than the effective dimension associated with the global noise level $\frac{1}{mn}$. We simply quantize and transmit the first $O\big((mn)^{\frac{1}{2\alpha+1}}\big)$ of the $X_{ij}$ from each machine to the central machine, where we decode and average the received random variables.
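The truncation estimator that the protocol builds on is easy to simulate. The following sketch is our own illustration (the fixed sequence $\theta$ and the Monte Carlo setup are arbitrary choices): for a $\theta$ inside a Sobolev ellipsoid, the empirical risk of the truncation estimator decays at roughly the rate $n^{-2\alpha/(2\alpha+1)}$.

```python
import numpy as np

def truncation_estimator(y, n, alpha):
    """Keep the first ~ n^{1/(2 alpha + 1)} empirical coefficients and zero out the rest."""
    cutoff = int(np.ceil(n ** (1.0 / (2 * alpha + 1))))
    theta_hat = np.zeros_like(y)
    theta_hat[:cutoff] = y[:cutoff]
    return theta_hat

rng = np.random.default_rng(0)
alpha, num_coef, trials = 1.0, 4000, 200
i = np.arange(1, num_coef + 1)
theta = 0.5 / i ** (alpha + 0.51)          # a fixed sequence inside a Sobolev ellipsoid
for n in [100, 1000, 10000]:
    risks = []
    for _ in range(trials):
        y = theta + rng.standard_normal(num_coef) / np.sqrt(n)
        risks.append(np.sum((truncation_estimator(y, n, alpha) - theta) ** 2))
    # empirical risk next to the theoretical order n^{-2 alpha/(2 alpha + 1)}
    print(n, np.mean(risks), n ** (-2 * alpha / (2 * alpha + 1)))
```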
### 4.2. Algorithm

First we state a lemma describing and analyzing a simple scalar quantization method.

**Lemma 4.1.** Suppose that $X$ is a random variable supported on $[-c, c]$, and that $U \sim \mathrm{Unif}(0, \delta)$ independently of $X$, for some constant $\delta > 0$. Let $G(u, \delta) = \{u + i\delta : i = 0, \pm 1, \pm 2, \ldots\}$ be a grid of points with base point $u$ and spacing $\delta$, and define
$$q(x; u, \delta) = \arg\min_{g \in G(u, \delta)} |x - g|.$$
Let $E = q(X; U, \delta) - X$. Then $X$ and $E$ are independent, and $E \sim \mathrm{Unif}\big(-\frac{\delta}{2}, \frac{\delta}{2}\big)$.

*Proof.* Condition on the event that $X = x$. For $\epsilon \in \big(-\frac{\delta}{2}, \frac{\delta}{2}\big)$ we have
$$\begin{aligned}
\mathbb{P}\big(E \in (\epsilon, \epsilon + d\epsilon) \mid X = x\big) &= \mathbb{P}\big(q(X) \in (x + \epsilon,\; x + \epsilon + d\epsilon) \mid X = x\big) \\
&= \mathbb{P}\big(U \in \big(x + \epsilon - \delta \lfloor (x + \epsilon)/\delta \rfloor,\; x + \epsilon - \delta \lfloor (x + \epsilon)/\delta \rfloor + d\epsilon\big) \mid X = x\big) \\
&= \mathbb{P}\big(U \in \big(x + \epsilon - \delta \lfloor (x + \epsilon)/\delta \rfloor,\; x + \epsilon - \delta \lfloor (x + \epsilon)/\delta \rfloor + d\epsilon\big)\big) \;=\; \frac{d\epsilon}{\delta}.
\end{aligned}$$
We thus conclude that $E \mid X \sim \mathrm{Unif}\big(-\frac{\delta}{2}, \frac{\delta}{2}\big)$, and therefore $E$ and $X$ are independent. $\square$

By this lemma, we know that with a public source of randomness, we can transmit a random variable $X$ supported on $[-c, c]$ using $\log_2 \frac{2c}{\delta}$ bits, so that the central machine receives $X + E$ with $E \sim \mathrm{Unif}\big(-\frac{\delta}{2}, \frac{\delta}{2}\big)$ independent of $X$. We are now ready to describe the algorithm for estimating $\theta$.

Inputs:
- $\alpha$: order of the Sobolev space.
- $c$: radius of the Sobolev space.
- $X_{ij}$: independent $N\big(\theta_i, \frac{1}{n}\big)$ random variables, $i = 1, 2, \ldots$, at machine $j$, for $j = 1, \ldots, m$.
- $b$: number of bits for communication at each machine.

1. Set
$$\delta = \max\Big\{ (mb)^{-\frac{2\alpha+1}{2}},\; n^{-\frac{1}{2}} \Big\}, \qquad b_0 = \Big\lceil \log_2 \frac{2c}{\delta} \Big\rceil, \qquad \tilde b = \lfloor b / b_0 \rfloor, \qquad k = \Big( (mb)^{\frac{2\alpha+1}{2\alpha+2}}\, n^{-\frac{1}{2\alpha+2}} \vee 1 \Big) \wedge m.$$
2. At the $j$th machine (for $j = 1, \ldots, m$), let $I_j = \big\{ \lceil (ms + j)/k \rceil : s = 0, \ldots, \tilde b - 1 \big\}$.
   (a) Generate a random seed shared with the central machine.
   (b) For $i \in I_j$, generate $U_{ij} \sim \mathrm{Unif}(0, \delta)$ independently based on the seed.
   (c) For $i \in I_j$, Winsorize $X_{ij}$ at $[-c, c]$ and quantize:
$$\widetilde X_{ij} = q\big( (X_{ij} \wedge c) \vee (-c);\; U_{ij}, \delta \big).$$
   (d) Transmit the quantized random variables $\{ \widetilde X_{ij} : i \in I_j \}$ to the central machine, using $\lfloor b / b_0 \rfloor\, b_0 \le b$ bits.
3. At the central machine, decode the messages and construct the estimator
$$\hat\theta_i = \begin{cases} \dfrac{1}{|\{j : i \in I_j\}|} \displaystyle\sum_{j :\, i \in I_j} \widetilde X_{ij} & \text{if } i \le \dfrac{m \tilde b}{k} \wedge (mn)^{\frac{1}{2\alpha+1}} \\ 0 & \text{otherwise}. \end{cases}$$

A graphical illustration of the algorithm is shown in Figure 1. We must also note that while the algorithm is rate optimal, it is not adaptive, in the sense that it requires knowledge of the parameter $\alpha$.

*Figure 1.* Allocation of the communication budget in the three regimes: $mb \lesssim n^{\frac{1}{2\alpha+1}}$ (left), $b \lesssim (mn)^{\frac{1}{2\alpha+1}}$ and $mb \gtrsim n^{\frac{1}{2\alpha+1}}$ (middle), and $b \gtrsim (mn)^{\frac{1}{2\alpha+1}}$ (right). Each dot represents a random variable $X_{ij} \sim N(\theta_i, \frac{1}{n})$; the $j$th row corresponds to the $j$th machine and the $i$th dot in that row to coordinate $i$. A black dot indicates that the random variable is quantized and transmitted to the central machine; otherwise no communication budget is spent on it. In the first regime, each $\theta_i$ is estimated on at most one machine; in the second regime, it is estimated on multiple but not all machines; in the last regime, all random variables associated with $\theta_i$ are quantized and transmitted, before truncating at some position.
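The dithered quantizer of Lemma 4.1, used in Step 2(c) above, can be implemented in a few lines. The sketch below is our own illustration (function names are not from the paper); the object actually transmitted is the integer grid index, which takes on the order of $2c/\delta$ values and hence costs about $\log_2 \frac{2c}{\delta}$ bits.

```python
import numpy as np

def dithered_quantize(x, u, delta, c):
    """Winsorize x to [-c, c], then round to the nearest point of the grid {u + k*delta}.
    With U ~ Unif(0, delta) shared by encoder and decoder, the error is
    Unif(-delta/2, delta/2) and independent of x (Lemma 4.1)."""
    x = np.clip(x, -c, c)
    return u + np.round((x - u) / delta) * delta

def encode(x, u, delta, c):
    """Integer index that is actually transmitted (about log2(2c/delta) bits)."""
    x = np.clip(x, -c, c)
    return int(np.round((x - u) / delta))

def decode(index, u, delta):
    return u + index * delta

rng = np.random.default_rng(0)
c, delta = 1.0, 0.05
u = rng.uniform(0.0, delta)            # shared dither from the public random seed
x = rng.normal(0.3, 0.1)
idx = encode(x, u, delta, c)
print(x, decode(idx, u, delta), decode(idx, u, delta) - x)   # error lies in (-delta/2, delta/2]
```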
### 4.3. Analysis

We now analyze the statistical risk of the algorithm described in the previous section. Suppose that $\theta \in \Theta(\alpha, c)$. First notice that the Winsorization in Step 2(c) makes $X_{ij}$ bounded prior to quantization, and it can only decrease the risk. Write $i^* = \frac{m \tilde b}{k} \wedge (mn)^{\frac{1}{2\alpha+1}}$. The risk of the final estimator satisfies
$$\mathbb{E}_\theta\big[ \|\hat\theta - \theta\|^2 \big] = \sum_{i=1}^{i^*} \mathbb{E}\Bigg( \frac{1}{k} \sum_{j :\, i \in I_j} (X_{ij} - \theta_i) + \frac{1}{k} \sum_{j :\, i \in I_j} E_{ij} \Bigg)^2 + \sum_{i = i^* + 1}^{\infty} \theta_i^2,$$
where $E_{ij}$ denotes the uniform error introduced by quantizing $X_{ij}$; by Lemma 4.1 the $E_{ij}$ are mutually independent and independent of the $X_{ij}$. Recalling the definitions of $\delta$, $k$, and $\tilde b$ from the algorithm, we therefore have
$$\mathbb{E}_\theta\big[ \|\hat\theta - \theta\|^2 \big] \;\lesssim\; \frac{i^*}{nk} + \frac{i^*\, \delta^2}{k} + c^2\, (i^*)^{-2\alpha}, \qquad i^* = \frac{m \tilde b}{k} \wedge (mn)^{\frac{1}{2\alpha+1}}.$$
We now analyze the risk in the three regimes respectively.

In the first regime, where $mb \lesssim n^{\frac{1}{2\alpha+1}}$, we have
$$k = \Big( (mb)^{\frac{2\alpha+1}{2\alpha+2}}\, n^{-\frac{1}{2\alpha+2}} \vee 1 \Big) \wedge m = 1, \qquad \delta = \max\Big\{ (mb)^{-\frac{2\alpha+1}{2}},\, n^{-\frac{1}{2}} \Big\} = (mb)^{-\frac{2\alpha+1}{2}},$$
and consequently, with $b_0 \asymp \log(mb)$,
$$\mathbb{E}_\theta\big[ \|\hat\theta - \theta\|^2 \big] \;\lesssim\; \frac{m \tilde b}{n} + \frac{m \tilde b}{(mb)^{2\alpha+1}} + c^2 (m \tilde b)^{-2\alpha} \;=\; O\Big( (mb)^{-2\alpha} \big(\log(mb)\big)^{2\alpha} \Big).$$

In the second regime, where $b \lesssim (mn)^{\frac{1}{2\alpha+1}}$ and $mb \gtrsim n^{\frac{1}{2\alpha+1}}$, we have
$$k = (mb)^{\frac{2\alpha+1}{2\alpha+2}}\, n^{-\frac{1}{2\alpha+2}}, \qquad \delta = n^{-\frac{1}{2}},$$
so that $b_0 \asymp \log n$ and $i^* = \frac{m \tilde b}{k} \asymp (mbn)^{\frac{1}{2\alpha+2}} / \log n$. The truncation bias dominates the other two terms, and it follows that
$$\mathbb{E}_\theta\big[ \|\hat\theta - \theta\|^2 \big] \;\lesssim\; c^2 \Big( \frac{m \tilde b}{k} \Big)^{-2\alpha} (1 + o(1)) \;=\; O\Big( (mbn)^{-\frac{\alpha}{\alpha+1}} (\log n)^{2\alpha} \Big).$$

In the last regime, where $b \gtrsim (mn)^{\frac{1}{2\alpha+1}}$, we have
$$k = \Big( (mb)^{\frac{2\alpha+1}{2\alpha+2}}\, n^{-\frac{1}{2\alpha+2}} \vee 1 \Big) \wedge m = m, \qquad \delta = n^{-\frac{1}{2}},$$
so that $i^* = \tilde b \wedge (mn)^{\frac{1}{2\alpha+1}}$ and
$$\mathbb{E}_\theta\big[ \|\hat\theta - \theta\|^2 \big] \;\lesssim\; \frac{(mn)^{\frac{1}{2\alpha+1}}}{mn} + \frac{(mn)^{\frac{1}{2\alpha+1}}}{mn} + c^2 (i^*)^{-2\alpha} \;=\; \begin{cases} O\Big( (mn)^{-\frac{2\alpha}{2\alpha+1}} \Big) & \text{if } b \gtrsim (mn)^{\frac{1}{2\alpha+1}} \log n \\ O\Big( (mn)^{-\frac{2\alpha}{2\alpha+1}} (\log n)^{2\alpha} \Big) & \text{otherwise}. \end{cases}$$
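For intuition, the following Monte Carlo sketch runs a simplified version of the protocol end to end and reports the empirical risk next to the theoretical rate. The parameter choices are illustrative and are not the exact constants of Section 4.2: in particular $\delta$ is simply set below the noise level, the per-machine budget is enforced only on average, and the fixed sequence $\theta$ is an arbitrary member of a Sobolev ellipsoid.

```python
import numpy as np

def simulate_protocol(theta, n, m, b, alpha, c, rng):
    """Simplified distributed protocol: each of the first i_max coefficients is
    dither-quantized on k machines and averaged at the center."""
    delta = 1.0 / np.sqrt(n)                                   # quantization error below the noise level
    bits_per_coef = int(np.ceil(np.log2(2 * c / delta)))       # bits per transmitted variable
    k = int(round((m * b) ** ((2 * alpha + 1) / (2 * alpha + 2)) * n ** (-1.0 / (2 * alpha + 2))))
    k = min(m, max(1, k))                                      # clip to [1, m] as in the algorithm
    i_max = min(len(theta), (m * (b // bits_per_coef)) // k)   # coefficients we can afford on average
    theta_hat = np.zeros_like(theta)
    for i in range(i_max):
        x = theta[i] + rng.standard_normal(k) / np.sqrt(n)     # the k local observations of coordinate i
        u = rng.uniform(0.0, delta)                            # shared dither for this coordinate
        q = u + np.round((np.clip(x, -c, c) - u) / delta) * delta
        theta_hat[i] = q.mean()                                # central machine averages the decoded values
    return theta_hat

rng = np.random.default_rng(0)
alpha, c, n, m, b = 1.0, 1.0, 1000, 50, 64
i = np.arange(1, 2001)
theta = 0.5 / i ** (alpha + 0.51)                              # a fixed sequence inside a Sobolev ellipsoid
theta_hat = simulate_protocol(theta, n, m, b, alpha, c, rng)
risk = np.sum((theta_hat - theta) ** 2)
# empirical risk and the (mnb)^{-alpha/(alpha+1)} rate: same order, up to constants and log factors
print(risk, (m * n * b) ** (-alpha / (alpha + 1)))
```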
## 5. Future directions

One interesting direction for future work is to study adaptivity in distributed estimation. An adaptive protocol $(\Pi, \hat\theta)$ satisfies
$$\frac{ \sup_{\theta \in \Theta(\alpha, c)} \mathbb{E}\big[ \|\hat\theta(\Pi) - \theta\|^2 \big] }{ \inf_{(\Pi', \check\theta) \in \Gamma(m, b)}\, \sup_{\theta \in \Theta(\alpha, c)} \mathbb{E}\big[ \|\check\theta(\Pi') - \theta\|^2 \big] } \;<\; \infty \qquad \text{for (almost) all } \alpha \text{ and } c.$$
That is, the protocol should be minimax optimal for all $\theta$ without prior knowledge of the parameter space in which $\theta$ resides. While we had conjectured that this may not be possible with only one round of communication, Szabo & van Zanten (2018) recently developed an adaptive estimator using a modification of Lepski's method.

A second interesting direction for future work is distributed estimation of other functionals. For instance, one might study the sum of squares (or $\ell_2$ norm) of the mean of a normal random vector. It would be of interest to understand the minimax risk of estimating the norm of the mean in a distributed setting, and to develop optimal distributed protocols for this functional.

Finally, other nonparametric problems should be considered in the distributed estimation setting. For example, it will be interesting to study nonparametric estimation of functions with varying smoothness (e.g., over Besov bodies), and with shape constraints such as monotonicity and convexity.

## Acknowledgment

Research supported in part by ONR grant N00014-12-10762 and NSF grant DMS-1513594.

## References

Battey, H., Fan, J., Liu, H., Lu, J., and Zhu, Z. Distributed estimation and inference with statistical guarantees. arXiv preprint arXiv:1509.05457, 2015.

Berger, T., Zhang, Z., and Viswanathan, H. The CEO problem. IEEE Transactions on Information Theory, 42(3):887–902, 1996.

Blanchard, G. and Mücke, N. Parallelizing spectral algorithms for kernel learning. arXiv preprint arXiv:1610.07487, 2016.

Braverman, M., Garg, A., Ma, T., Nguyen, H. L., and Woodruff, D. P. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC '16, pp. 1011–1020, 2016.

Brown, L. D. and Low, M. G. Asymptotic equivalence of nonparametric regression and white noise. Annals of Statistics, 24(6):2384–2398, 1996.

Chang, X., Lin, S.-B., and Zhou, D.-X. Distributed semi-supervised learning with kernel ridge regression. Journal of Machine Learning Research, 18(46):1–22, 2017.

Diakonikolas, I., Grigorescu, E., Li, J., Natarajan, A., Onak, K., and Schmidt, L. Communication-efficient distributed learning of discrete distributions. In Advances in Neural Information Processing Systems, pp. 6394–6404, 2017.

Fan, J., Wang, D., Wang, K., and Zhu, Z. Distributed estimation of principal eigenspaces. arXiv preprint arXiv:1702.06488, 2017.

Johnstone, I. M. Gaussian estimation: Sequence and wavelet models. Unpublished manuscript, 2017.

Lee, J. D., Liu, Q., Sun, Y., and Taylor, J. E. Communication-efficient sparse regression. Journal of Machine Learning Research, 18(5):1–30, 2017.

Nussbaum, M. Asymptotic equivalence of density estimation and Gaussian white noise. Annals of Statistics, pp. 2399–2430, 1996.

Shamir, O. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS '14, pp. 163–171, 2014.

Shang, Z. and Cheng, G. Computational limits of a distributed algorithm for smoothing spline. Journal of Machine Learning Research, 18(1):3809–3845, 2017.

Szabo, B. and van Zanten, H. Adaptive distributed methods under communication constraints. arXiv preprint arXiv:1804.00864, 2018.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Series in Statistics, 1st edition, 2008.

Viswanathan, H. and Berger, T. The quadratic Gaussian CEO problem. IEEE Transactions on Information Theory, 43(5):1549–1559, 1997.

Wang, J., Chen, J., and Wu, X. On the sum rate of Gaussian multiterminal source coding: New proofs and results. IEEE Transactions on Information Theory, 56(8):3946–3960, 2010.

Zhang, Y., Duchi, J., Jordan, M. I., and Wainwright, M. J. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pp. 2328–2336, 2013a.

Zhang, Y., Duchi, J., and Wainwright, M. Divide and conquer kernel ridge regression. In Conference on Learning Theory, pp. 592–617, 2013b.

Zhu, Y. and Lafferty, J. Quantized minimax estimation over Sobolev ellipsoids. Information and Inference, 2017.