# A Resilient Distributed Boosting Algorithm

Yuval Filmus*1, Idan Mehalel*1, Shay Moran*2,3

Given a learning task where the data is distributed among several parties, communication is one of the fundamental resources which the parties would like to minimize. We present a distributed boosting algorithm which is resilient to a limited amount of noise. Our algorithm is similar to classical boosting algorithms, although it is equipped with a new component, inspired by Impagliazzo's hard-core lemma (Impagliazzo, 1995), adding a robustness quality to the algorithm. We also complement this result by showing that resilience to any asymptotically larger noise is not achievable by a communication-efficient algorithm.

1. Introduction

Most work in learning theory focuses on designing efficient learning algorithms which generalize well. New considerations arise when the training data is spread among several parties: speech recorded on different smartphones, medical data gathered from several clinics, and so on. In such settings, it is important to minimize not only the computational complexity, but also the communication complexity. Apart from practical considerations of limited bandwidth, minimizing the communication complexity also limits the amount of data being exposed to prying ears. This motivates designing distributed learning algorithms, which improve on the naive idea of sending all training data to a single party.
In the classical PAC model, distributed learning has been studied mostly in the realizable setting, where it was shown that distributed implementations of boosting algorithms can learn any VC class with communication complexity which is polynomial in the description length (in bits) of a single example (Balcan, Blum, Fine, and Mansour, 2012; Daumé, Phillips, Saha, and Venkatasubramanian, 2012b; Kane, Livni, Moran, and Yehudayoff, 2019).

*Equal contribution. 1The Henry and Marilyn Taub Faculty of Computer Science, Technion, Haifa, Israel. 2Faculty of Mathematics, Technion, Haifa, Israel. 3Google Research, Israel. Correspondence to: Idan Mehalel. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

In this work we deviate from the realizable setting and allow a small amount of noise in the input sample. In our setting there are k players and a center, who are given a domain U of size |U| = n and a concept class H over U with VC dimension d ≪ n. For a labelled input sample S distributed among the players, let OPT := OPT(S) ∈ ℕ denote the number of examples in S which are misclassified by the best hypothesis in H. In most parts of the paper, we require that OPT ≤ polylog n. The goal of the parties is to learn together a classifier f that has at most OPT errors on S, while using poly(d, k, log|S|, log n) bits of communication. Note that log n is the number of bits needed to encode a single point in U, and thus polylog n means polynomial in the description length of a single example.1

Main result. Our main result, formally stated in Theorems 2.2 and 2.3, asserts the following: for every VC class, if the minimal error of an hypothesis satisfies OPT ≤ polylog n, then a simple robust variant of classical boosting learns it with poly(d, k, log|S|, log n) communication complexity.
Conversely, when OPT ∉ polylog n, there exist one-dimensional VC classes for which any learning algorithm has super-polylogarithmic communication complexity. The novelty of our algorithm lies in a non-standard usage of boosting that identifies small hard sets on which every hypothesis from the class has large error. This kind of usage resembles (and is inspired by) Impagliazzo's hard-core lemma (Impagliazzo, 1995), in particular its proof using the method of multiplicative weights. Our negative result is a slight extension of the argument appearing in (Kane, Livni, Moran, and Yehudayoff, 2019).

We note that our positive result can alternatively be obtained by a reduction to semi-agnostic learning (Bun, Kamath, Steinke, and Wu, 2019), that is, agreeing on a classifier with at most c · OPT errors for some constant c. Semi-agnostic learning is possible using poly(d, k, log|S|, log n) bits of communication by the works of Balcan, Blum, Fine, and Mansour (2012); Chen, Balcan, and Chau (2016). Given a semi-agnostic communication protocol with a constant approximation factor c and communication complexity poly(d, k, log|S|, log n), one can proceed as follows: execute the semi-agnostic protocol to obtain a hypothesis f, and have each player broadcast her examples that f misclassifies. Then, the players modify f on the misclassified points and output an optimal hypothesis f′. If there exists an hypothesis in the class whose error is at most polylog(n), then the communication cost of this step is poly(d, k, log|S|, log n), and thus the overall communication complexity is poly(d, k, log|S|, log n).

1We refer the reader to (Kane, Livni, Moran, and Yehudayoff, 2019; Braverman, Kol, Moran, and Saxena, 2019) for a more thorough discussion regarding the choice of polylog n as a yardstick for communication efficiency.
The advantage of our approach is the simplicity of our protocol, which is a simple modification of the classical boosting approach that makes it resilient to mild noise. This is in contrast with semi-agnostic learning protocols, which rely on non-trivial subroutines (e.g. the distributed implementation of Bregman projection in the protocol by Chen, Balcan, and Chau (2016)).

Empirical loss versus population loss. From a technical perspective, this work focuses on distributed empirical risk minimization with efficient communication complexity; that is, the objective is to design an efficient distributed protocol which minimizes the empirical loss. While this deviates from the main objective in statistical learning of minimizing the population loss, we focus on the empirical loss for the following reasons:

(i) Efficient communication implies generalization: As discussed in (Kane, Livni, Moran, and Yehudayoff, 2019; Braverman, Kol, Moran, and Saxena, 2019), Occam's razor and sample compression arguments can be naturally used to bound the generalization gap, i.e. the absolute difference between the empirical and population losses, of efficient distributed learning algorithms. In a nutshell, the bound follows by arguing that the output hypothesis is determined by the communication transcript of the protocol. Hence, the communication complexity of the protocol upper-bounds the description length of the output hypothesis, which translates to a bound on the generalization gap via Occam's razor or sample compression. In particular, this reasoning applies to the algorithm we present in this work, and hence it generalizes. Thus, for communication-efficient protocols, the empirical loss is a good proxy for the population loss.

(ii) Focusing on empirical loss simplifies the exposition: while it is possible to translate our results to the setting of population loss, this introduces additional probabilistic machinery and complicates the presentation without introducing any new ideas.
Further, empirical risk minimization is a natural and classical problem, and previous work on distributed PAC learning focused on it, at least implicitly (Kane, Livni, Moran, and Yehudayoff, 2019; Vempala, Wang, and Woodruff, 2020; Braverman, Kol, Moran, and Saxena, 2019).

Paper organization. In Section 2 we formally define the model and give an overview of our results and related work. Section 3 contains brief preliminaries. We prove the upper bound in Section 4 and the lower bound in Section 5. The paper closes with Section 6, which discusses directions for future research.

2. Model and results

Following (Balcan, Blum, Fine, and Mansour, 2012), we consider a distributed setting consisting of k players numbered 1, ..., k, and a center. Each player can communicate only with the center. An hypothesis class H over a universe 𝒰 is given, and a finite domain set U ⊆ 𝒰 of size n is given as well. We denote the VC-dimension of H by d := d(H). The finite domain U is known in advance to the center and to all players. A pair z := (x, y), where x ∈ U and y ∈ {±1}, is called an example. A sequence of examples z_1, ..., z_m is called a sample, and is denoted by S. For a classifier f : U → {±1}, let E_S(f) denote the number of examples in S that f misclassifies: E_S(f) := Σ_{(x,y)∈S} 1[f(x) ≠ y]. Let OPT be the number of misclassified examples in S with respect to the best hypothesis in H: OPT = OPT(S, H) := min_{h∈H} E_S(h). In most parts of the paper we require that OPT ≤ polylog n. In our setting, a sample S is adversarially distributed between the k players into k subsamples S_1, ..., S_k. Note that the center gets no input. We use the notation S = (S_i)_{i=1}^k to clarify that player i has a fraction S_i of the sample, and that concatenating all the S_i's yields the entire input sample S. The goal is to learn H, which we define as follows:

Definition 2.1. Let H be a concept class over a (possibly infinite) universe 𝒰 and let k denote the number of players.
For a function T : ℕ → ℕ we say that H is learnable under the promise OPT ≤ T(n) if there exists a communication complexity bound C(d, k, n, m) ∈ poly(d, k, log n, log m) such that for every finite U ⊆ 𝒰 of size n = |U|, there exists a distributed algorithm π = π(U) that satisfies the following. For every input sample S = (S_i)_{i=1}^k with m examples from U, if OPT = OPT(S, H) ≤ T(n), then the k parties and the center agree on an output hypothesis f which satisfies E_S(f) ≤ OPT with probability at least 2/3 (over the randomness of the protocol π, when randomized), while transmitting at most C(d, k, n, m) bits.

Let us make a few remarks in order to clarify some choices made in the above definition.

1. Infinite classes. The above definition allows one to handle natural infinite classes H such as Euclidean halfspaces. The finite subdomain U ⊆ 𝒰 models a particular instance of the learning task defined by H. For example, if H is the class of halfspaces in R^d, and we use an encoding of real numbers with B bits, then U consists of all possible 2^{dB} points in R^d that can be encoded. The universal quantification over U serves to make the definition scalable and independent of the encoding of the input points.

2. The protocol may depend on U ⊆ 𝒰. This possible dependence reflects the fact that when designing algorithms in practice, one knows how the domain points are being encoded as inputs.2

2.2. Results

Our positive result is stated in the following theorem.

Theorem 2.2 (Positive Result). Let H be a concept class with d(H) < ∞ and let T = T(n) ≤ polylog n. Then, H is learnable under the promise OPT ≤ T(n), and this is achieved by a simple variant of classical boosting. Furthermore, the algorithm is deterministic and oblivious to T and OPT.

The protocol we use to prove Theorem 2.2 is a resilient version of realizable-case boosting.
It is resilient in the sense that it can be applied to any input sample, including samples that are not realizable by the class H. Moreover, as long as the input sample is sufficiently close to being realizable, this variant of boosting enjoys guarantees similar to those of the fully realizable case. This feature of our protocol is not standard among boosting algorithms for the realizable case, which are typically vulnerable to noise (Dietterich, 2000; Long and Servedio, 2010). Our protocol can be implemented in the no-center model, in which the players can communicate directly (see (Balcan, Blum, Fine, and Mansour, 2012) for a more thorough discussion of these two models), by having one of the players play the part of the center. It also admits a randomized, computationally efficient implementation, assuming oracle access to a PAC learning algorithm for H in the centralized setting (see Section 4 for further discussion). On the other hand, the protocol is improper. This is unavoidable: a result by Kane, Livni, Moran, and Yehudayoff (2019) shows that even in the realizable case (i.e. T(n) = 0), some VC classes cannot be properly learned by communication-efficient protocols.

As mentioned in the introduction, the positive result of Theorem 2.2 can also be proved by reduction to semi-agnostic learning. However, our direct approach results in a simpler protocol. The following negative result shows that the assumption OPT ≤ polylog n made by Theorem 2.2 is necessary for allowing communication-efficient learning, even if the protocol is allowed to be randomized and improper.

Theorem 2.3 (Negative Result). Let H = {h_n : n ∈ ℕ}, where h_n(i) = 1 if and only if i = n, be the class of singletons over ℕ. If T(n) = log^{ω(1)}(n) then H is not learnable under the promise that OPT ≤ T(n), even when there are only k = 2 players.

2We remark, however, that the protocol appearing in Section 4 is uniform in U, that is, it can accept U as an additional input.
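The singletons class of Theorem 2.3, and the quantities E_S and OPT defined above, are easy to compute on a toy instance. The following is a minimal sketch (ours; the domain size, sample, and helper names are illustrative, not part of the paper):

```python
from typing import Callable, List, Tuple

Example = Tuple[int, int]  # (point x in U, label y in {-1, +1})

def empirical_errors(f: Callable[[int], int], S: List[Example]) -> int:
    """E_S(f): the number of examples in S that f misclassifies."""
    return sum(1 for x, y in S if f(x) != y)

def opt(H: List[Callable[[int], int]], S: List[Example]) -> int:
    """OPT(S, H): errors of the best hypothesis in H on S."""
    return min(empirical_errors(h, S) for h in H)

# Toy instance: the singletons class over U = {0, ..., 4}.
U = range(5)
H = [(lambda i: (lambda x: 1 if x == i else -1))(i) for i in U]
S = [(2, 1), (0, -1), (3, -1), (2, 1), (4, 1)]  # (4, 1) is a noisy example
print(opt(H, S))  # -> 1: h_2 errs only on (4, 1)
```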
When there are two players, our model is equivalent to the standard two-party communication model (Yao, 1979; Kushilevitz and Nisan, 1996; Rao and Yehudayoff, 2020), in which two players, Alice and Bob, communicate through a direct channel, and this is the setting in which we prove Theorem 2.3.

Our results are in fact more general than stated. The algorithm used to prove Theorem 2.2 outputs a hypothesis making at most OPT many mistakes using OPT · poly(d, k, log m, log n) communication (without having to know OPT in advance). The lower bound used to prove Theorem 2.3 shows that for any value T(n) and for any algorithm that learns the class of singletons, there exists an input sample with OPT ≤ T(n) on which the communication complexity of the protocol is Ω(T(n)).

2.3. Related Work

Originally, distributed learning was studied from the point of view of parallel computation (a partial list includes (Bshouty, 1997; Collins, Schapire, and Singer, 2002; Zinkevich, Weimer, Smola, and Li, 2010; Long and Servedio, 2011)). The focus was on reducing the time complexity rather than the communication complexity. More recent work aims at minimizing communication (Balcan, Blum, Fine, and Mansour, 2012; Daumé, Phillips, Saha, and Venkatasubramanian, 2012a;b; Blum, Heinecke, and Reyzin, 2021). In (Balcan, Blum, Fine, and Mansour, 2012), privacy aspects of such learning tasks are discussed as well. A related natural model of distributed learning was proposed in (Balcan, Blum, Fine, and Mansour, 2012). In this model, there are k entities and a center, and each entity i can draw examples from a distribution D_i. The goal is to learn a good hypothesis with respect to the mixture distribution D = (1/k) Σ_{i=1}^k D_i. The communication topology in this model is a star: all entities can communicate only with the center.
In this work, we consider a slightly different model, studied by Daumé, Phillips, Saha, and Venkatasubramanian (2012b;a); Kane, Livni, Moran, and Yehudayoff (2019); Braverman, Kol, Moran, and Saxena (2019), which we call the adversarial model. In this model, a sample S is given and partitioned freely among k players by an adversary. While this model might seem less natural, it is more general than the model of Balcan, Blum, Fine, and Mansour (2012), and our main contribution is a protocol that can be applied to this general model. The work by Lazarevic and Obradovic (2001) suggested a framework for using boosting in distributed environments. In (Chen, Balcan, and Chau, 2016), a clever analysis of Smooth Boosting (Kale, 2007) is used to give an efficient semi-agnostic boosting protocol. Kane, Livni, Moran, and Yehudayoff (2019) characterize which classes can be learned in the distributed and proper setting, and give some bounds for different distributed learning tasks. In (Braverman, Kol, Moran, and Saxena, 2019), tight lower and upper bounds on the communication complexity of learning halfspaces are given, using geometric tools.

3. Preliminaries

We use log for the base-2 logarithm. Let U be the domain set and let S ⊆ U × {±1} be a sample. We follow the standard definitions of empirical loss and loss: L_S(f) := (1/|S|) Σ_{(x,y)∈S} 1[f(x) ≠ y] and L_p(f) := Pr_{(x,y)∼p}[f(x) ≠ y], respectively, where p is a probability distribution over U × {±1}. We now briefly overview some relevant technical tools.

Boosting. The seminal work of Freund and Schapire (1997) used the AdaBoost algorithm to boost a weak learner into a strong one. In this work we use a simplified version of AdaBoost (see (Schapire and Freund, 2013)): given a distribution p over a sample S, an α-weak hypothesis with respect to p is a hypothesis h which is better than a random guess by an additive factor of α: Pr_{(x,y)∼p}[h(x) = y] ≥ 1/2 + α.
The boosting algorithm requires oracle access to an α-weak learner, which is an algorithm that returns α-weak hypotheses. Given such a weak learner, the boosting algorithm operates as follows: it receives as input a sample S, and initializes the weight W_1(z := (x, y)) of every example z ∈ S to be 1. In each iteration t ∈ 1, ..., T, it then uses the weak learner to obtain an α-weak hypothesis h_t with respect to the distribution p_t on S, which is defined by the weight function W_t, i.e. the probability of each example z is proportional to W_t(z). The weights are then updated according to the performance of the weak hypothesis h_t on each example: W_{t+1}(z) = W_t(z) · 2^{−1[h_t(x)=y]}. After T iterations, the algorithm returns the classifier f(x) = sign(Σ_{t=1}^T h_t(x)). We have the following upper bound3 on the value of T required for f to satisfy E_S(f) = 0.

Theorem 3.1 (Freund and Schapire (1997)). Let T ≥ 6 log|S|, and assume that in every iteration t, a hypothesis h_t which is 1/15-weak with respect to the current distribution p_t is provided to the variant of AdaBoost described above. Then for any (x, y) ∈ S we have (1/T) Σ_{t=1}^T 1[h_t(x) ≠ y] ≤ 1/3.

An immediate corollary is that if f is the classifier returned by AdaBoost and T ≥ 6 log|S|, then E_S(f) = 0.

Small ϵ-approximations. Let H ⊆ {±1}^U be a concept class of VC-dimension d < ∞, let p be a distribution over examples in U × {±1}, and let ϵ > 0. The seminal uniform convergence theorem of Vapnik and Chervonenkis (1971) implies that a random i.i.d. sample S of size |S| = O(d/ϵ^2) drawn from p satisfies, with positive probability, (∀h ∈ H): |L_S(h) − L_p(h)| ≤ ϵ. Crucially, note that |S| depends only on d and ϵ. In particular, for every distribution p there exists such a sample in its support.

Communication complexity. Our negative result applies already when there are only two players, in which case our model is equivalent to the standard two-party communication model (Yao, 1979; Kushilevitz and Nisan, 1996).
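The weight-halving AdaBoost variant described in the boosting paragraph above can be sketched in a few lines. This is our own rendering, not the paper's distributed protocol: `erm` (empirical risk minimization over a small toy class) stands in for the assumed weak-learner oracle, and on a realizable sample it is trivially weak:

```python
import math

def boost(S, weak_learner):
    """Simplified AdaBoost: weights start at 1 and are halved whenever the
    current weak hypothesis is correct on the example; the output is the
    majority vote sign(sum_t h_t(x)) after T >= 6 log|S| rounds."""
    W = [1.0] * len(S)
    T = math.ceil(6 * math.log2(len(S)))
    hs = []
    for _ in range(T):
        total = sum(W)
        p = [w / total for w in W]      # the distribution p_t on S
        h = weak_learner(S, p)          # assumed alpha-weak w.r.t. p_t
        hs.append(h)
        W = [w / 2 if h(x) == y else w for w, (x, y) in zip(W, S)]
    return lambda x: 1 if sum(h(x) for h in hs) >= 0 else -1

# Toy run: singletons over {0, ..., 3}, realizable sample, ERM as weak learner.
H = [(lambda i: (lambda x: 1 if x == i else -1))(i) for i in range(4)]
def erm(S, p):  # hypothesis of minimum weighted error over H
    return min(H, key=lambda h: sum(pi for pi, (x, y) in zip(p, S) if h(x) != y))

S = [(0, -1), (1, -1), (2, 1), (3, -1)]
f = boost(S, erm)
print(all(f(x) == y for x, y in S))  # -> True
```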
One of the standard problems in the two-party communication model is set disjointness. In this problem, Alice gets a string x ∈ {0,1}^n, Bob gets a string y ∈ {0,1}^n, and the goal is to compute the following function DISJ_n(x, y):

DISJ_n(x, y) = 0 if x_i = y_i = 1 for some i, and 1 otherwise.

3This formulation of the theorem appears explicitly as Lemma 2 in (Kane, Livni, Moran, and Yehudayoff, 2019).

The randomized communication complexity of DISJ_n is known to be large:

Theorem 3.2 (Razborov (1990); Kalyanasundaram and Schnitger (1992)). The randomized communication complexity of DISJ_n is Θ(n).

4. A resilient boosting protocol

In this section we use our boosting variant to prove Theorem 2.2, which follows from the next theorem.

Theorem 4.1. Let H be an hypothesis class with VC dimension d < ∞, let k be the number of players, and let T(n) ≤ polylog(n). The protocol Accurately Classify, described in Figure 2, is a learning protocol under the promise OPT ≤ T(n), and has communication complexity O(OPT · k log|S| (d log n + log|S|)), where S is its input sample. Furthermore, if S contains no contradicting examples (that is, examples (x, +1), (x, −1)) then the classifier f which the protocol outputs is consistent (i.e. satisfies E_S(f) = 0).

Accurately Classify relies on the Boost Attempt protocol, appearing in Figure 1, which is similar to classical boosting. To prove the theorem, we first argue that if Boost Attempt does not get stuck (i.e. it reaches Item 3 in Figure 1), then it simulates boosting and enjoys the guarantees stated in Theorem 3.1. Then, we take into account what happens when Boost Attempt does get stuck; in this case we adopt the perspective inspired by Impagliazzo's hard-core lemma to remove a small subsample of the input which is hard in the sense that every hypothesis in H has large error on it. Finally, we analyze the total communication cost of the two protocols.

Lemma 4.2.
If protocol Boost Attempt, described in Figure 1, outputs a classifier f, then E_S(f) = 0.

Proof. We show that if Boost Attempt does not stop at step 2(e) of some iteration, then in every iteration t, the provided hypothesis h_t is a 1/50-weak hypothesis with respect to the current distribution p_t in the boosting process: recall from the preliminaries that p_t is a distribution on S, which is defined by the weight function W_t, i.e. the probability of each example z is proportional to W_t(z). To establish the above we use two crucial properties of h_t:

- The hypothesis h_t satisfies L_{D_t}(h_t) ≤ 1/100, where D_t is the distribution defined in step 2(c), i.e. it is the mixture of the uniform distributions over the S'_i's weighted by W_t^{(i)}/W_t.

Boost Attempt: Boosting that may get stuck

Setting: There are k players and a center, and H is a known hypothesis class over a domain U.

Input: A distributed sample S := (S_i)_{i=1}^k, where S_i = (x_1^i, y_1^i), ..., (x_{|S_i|}^i, y_{|S_i|}^i) for i ∈ [k].

Output: Either all players agree on a classifier f : U → {±1} which makes no errors on S, or each player i holds a subsample S'_i ⊆ S_i such that the concatenated sample S' = (S'_i)_{i=1}^k is not realizable. The center holds S'.

1. Initialize: Each player i initializes W_1(z_j^i) = 1 for all 1 ≤ j ≤ |S_i|.

2. For t := 1, ..., T = 6 log|S|:

(a) For all i ∈ [k], let p_t^i be the distribution over S_i defined by p_t^i(z_j^i) = W_t(z_j^i)/W_t^{(i)}, where W_t^{(i)} = Σ_{1≤j≤|S_i|} W_t(z_j^i). Each player i sends to the center a 1/100-approximation w.r.t. p_t^i of minimal size, denoted by S'_i = ẑ_1^i, ..., ẑ_{|S'_i|}^i.

(b) Each player i sends W_t^{(i)} to the center.

(c) Let S' = (S'_i)_{i=1}^k. Let D_t be the distribution on S' defined by D_t(ẑ_j^i) = (1/|S'_i|) · (W_t^{(i)}/W_t), where W_t = Σ_{i=1}^k W_t^{(i)} is the total sum of weights.

(d) If there is ĥ ∈ H such that L_{D_t}(ĥ) ≤ 1/100 then: The center sets h_t := ĥ and sends h_t to all players.

(e) Else: Output S'.

(f) Each player i updates W_{t+1}(z_j^i) = W_t(z_j^i) · 2^{−1[h_t(x_j^i)=y_j^i]} for every z_j^i ∈ S_i.

3.
Output the classifier f(x) = sign(Σ_{t=1}^T h_t(x)).

Figure 1. A boosting protocol that may get stuck when the input sample is not realizable.

Accurately Classify: A learning protocol

Setting: There are k players and a center, and H is a known hypothesis class over a domain U.

Input: A distributed sample S := (S_i)_{i=1}^k. (Below we treat each S_i as a multiset.)

Output: A classifier f : U → {±1}.

1. Initialize: The center initializes a multiset D := ∅.

2. While Boost Attempt((S_i)_{i=1}^k) returns a non-realizable subsample S' := (S'_i)_{i=1}^k:

(a) The center updates D := D ∪ S'.

(b) Each player updates S_i := S_i \ S'_i.

3. Let g be the classifier returned by Boost Attempt.

4. For every x ∈ U, let n_+(x) be the number of times that the example (x, +1) occurs in D, and define n_−(x) similarly.

5. Output the classifier f : U → {±1} defined for every x ∈ U as follows: f(x) = +1 if n_+(x) ≥ 1 and n_+(x) ≥ n_−(x); f(x) = −1 if n_−(x) ≥ 1 and n_−(x) > n_+(x); and f(x) = g(x) otherwise.

Figure 2. A resilient, improper, deterministic learning protocol.

- S'_i is a 1/100-approximation of the distribution p_t^i on S_i, defined by p_t^i(z_j^i) = W_t(z_j^i)/W_t^{(i)}, and hence |L_{S'_i}(h) − L_{p_t^i}(h)| ≤ 1/100 for all h ∈ H.

Let p_t be the normalization of the weights in iteration t, that is p_t(z_j^i) = W_t(z_j^i)/W_t. So:

L_{p_t}(h_t) = Σ_{i∈[k]} Σ_{z_j^i ∈ S_i} (W_t(z_j^i)/W_t) 1[h_t(x_j^i) ≠ y_j^i]
= Σ_{i∈[k]} (W_t^{(i)}/W_t) L_{p_t^i}(h_t)
≤ Σ_{i∈[k]} (W_t^{(i)}/W_t) (L_{S'_i}(h_t) + 1/100)
= Σ_{i∈[k]} (W_t^{(i)}/W_t) Σ_{ẑ_j^i ∈ S'_i} 1[h_t(x̂_j^i) ≠ ŷ_j^i]/|S'_i| + 1/100
= L_{D_t}(h_t) + 1/100 ≤ 1/100 + 1/100 = 1/50.

Since a hypothesis with error at most 1/50 is in particular 1/15-weak, by Theorem 3.1 a total of 6 log|S| iterations are enough to output a classifier f that satisfies E_S(f) = 0.

Next, we consider the case in which Boost Attempt does get stuck. In this case, note that the small sample S' sent to the center is not realizable.

Observation 4.3. Let D be a distribution over a sample S. If for all h ∈ H it holds that L_D(h) > 1/100, then S is not realizable.

The following observation states that Boost Attempt is called at most OPT times by Accurately Classify.

Observation 4.4.
Let S be a non-realizable sample, and let S' be a non-realizable subsample of S. Then for all h ∈ H, E_S(h) > E_{S\S'}(h). That is, if we remove any non-realizable subsample from S, then the number of mistakes of every hypothesis decreases by at least 1.

We are now ready to prove Theorem 4.1. The main part is analysing the communication complexity of Boost Attempt.

Proof of Theorem 4.1. First we show correctness, and then analyze the communication complexity.

Correctness. The loop in Accurately Classify is executed as long as Boost Attempt returns a non-realizable sample. Due to Observation 4.4, after at most OPT iterations, Boost Attempt will return a classifier, since the input sample will then be realizable. This classifier makes zero errors on the input to Boost Attempt, due to Lemma 4.2. Consequently, the classifier f returned by Accurately Classify makes the least number of errors among all possible classifiers. Furthermore, if S contains no contradicting examples, then E_S(f) = 0.

Communication. We first analyze the communication complexity of Boost Attempt and show that it is upper bounded by O(k log|S| (d log n + log|S|)). First, it has 6 log|S| = O(log|S|) iterations. In each iteration, k many 1/100-approximations are sent to the center in step 2(a), each taking O(d log n) bits to encode, according to (Vapnik and Chervonenkis, 1971). Then, the sums of weights of each player are sent to the center in step 2(b). This requires O(k log|S|) communication: indeed, the initial weight of each element is 1, and in each iteration it might be halved. There are O(log|S|) iterations, so the weight of any element may decrease down to 2^{−6 log|S|} = |S|^{−6}. So, encoding the sums of weights in step 2(b) requires O(k log|S|) bits. Steps 2(c)–(e) can now be executed by the center, with zero communication.
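The final relabeling rule of Accurately Classify (steps 4–5 in Figure 2), on which the correctness argument above relies, reads as follows in code. This is our own rendering with illustrative names; `D` is the multiset of removed hard examples and `g` the classifier returned by the last call to Boost Attempt:

```python
from collections import Counter

def relabel(D, g):
    """Steps 4-5 of Accurately Classify: override g by a majority label
    over the removed hard multiset D, falling back to g elsewhere."""
    n_plus = Counter(x for x, y in D if y == +1)
    n_minus = Counter(x for x, y in D if y == -1)

    def f(x):
        if n_plus[x] >= 1 and n_plus[x] >= n_minus[x]:
            return +1
        if n_minus[x] >= 1 and n_minus[x] > n_plus[x]:
            return -1
        return g(x)

    return f

# Point 1 was removed twice with label +1 and once with -1; point 2 once with -1.
f = relabel([(1, 1), (1, -1), (1, 1), (2, -1)], g=lambda x: 1)
print(f(1), f(2), f(3))  # -> 1 -1 1
```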
Now, if the condition in step 2(d) does not hold, a non-realizable sample S', which is the concatenation of the 1/100-approximations S'_i, is output by Boost Attempt. This step requires k bits of communication, in which the center indicates to each of the players that the condition does not hold. Also notice that this step happens at most once and hence increases the total communication complexity by at most k bits. If the condition does hold and the protocol continues, then each player updates its weights with zero communication. Thus, we get a total of O(k log|S| (d log n + log|S|)) communication used in Boost Attempt. Accurately Classify executes Boost Attempt at most OPT times due to Observation 4.4, and hence the total communication used by Accurately Classify is O(OPT · k log|S| (d log n + log|S|)).

A computationally efficient implementation. We defined Boost Attempt as a communication-efficient deterministic protocol. However, as currently formulated, the protocol is not computationally efficient, since step 2(a) requires finding a 1/100-approximation, which cannot be done efficiently in general. Vapnik and Chervonenkis (1971) proved that a random sample of size O(d/ϵ^2) is an ϵ-approximation with high probability. This can be used to make our protocol efficient at the cost of making it randomized. Furthermore, notice that in step 2(d), a weak hypothesis for the distribution D_t on S' is found by the center. This step can also be implemented efficiently provided that H admits an efficient agnostic PAC learner in the centralized setting.

5. A complementing negative result

In this section we prove Theorem 2.3.

Theorem (Theorem 2.3, restated). Let H = {h_n : n ∈ ℕ}, where h_n(i) = 1 if and only if i = n, be the class of singletons over ℕ. If T(n) = log^{ω(1)} n then H is not learnable under the promise that OPT ≤ T(n), even when there are only k = 2 players.
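The proof that follows reduces learning the singletons to computing the set-disjointness function from Section 3. For reference, a one-line rendering of DISJ_n (the example strings are our own):

```python
def disj(x: str, y: str) -> int:
    """DISJ_n(x, y): 0 if x_i = y_i = 1 for some i, and 1 otherwise."""
    return 0 if any(a == b == "1" for a, b in zip(x, y)) else 1

print(disj("0110", "1010"), disj("0110", "1001"))  # -> 0 1
```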
The proof uses a mapping suggested in (Kane, Livni, Moran, and Yehudayoff, 2019) together with Theorem 3.2, the well-known communication lower bound for set disjointness.

Lemma 5.1 (Kane, Livni, Moran, and Yehudayoff (2019)). Let x, y ∈ {0,1}^n, and let w(x) denote the Hamming weight of a binary string x. Let H be the class of singletons over [n] (it contains exactly all hypotheses that assign 1 to a single i ∈ [n] and −1 to all other elements). Then, there are mappings F_a, F_b : {0,1}^n → ([n] × {±1})^n taking boolean n-vectors to samples such that the combined sample S := F_a(x); F_b(y) satisfies:

1. If DISJ_n(x, y) = 1 then E_S(f) ≥ w(x) + w(y) for any classifier f (not necessarily from H).

2. If DISJ_n(x, y) = 0 then the optimal h ∈ H satisfies E_S(h) = w(x) + w(y) − 2.

The proof follows by letting F_a(x) = ((i, (−1)^{1−x_i}) : i ∈ [n]) and F_b(y) = ((i, (−1)^{1−y_i}) : i ∈ [n]). These mappings are used in (Kane, Livni, Moran, and Yehudayoff, 2019) to prove a reduction to set disjointness, in order to show that agnostic classification requires Ω(n) communication under some conditions. A slight modification of their proof results in the bound of Theorem 2.3.

Proof of Theorem 2.3. Let n ∈ ℕ and set U = [n]. Given a randomized improper learning protocol π(U) for H under the promise that OPT ≤ T(n), we construct the following protocol for DISJ_r, where r = ⌊T(n)/2⌋:

1. Let x, y ∈ {0,1}^r denote the inputs for DISJ_r.

2. Publish w(x), w(y).

3. Extend x, y to strings x', y' ∈ {0,1}^n by adding n − r zeroes to each.

4. Construct S := F_a(x'); F_b(y') as described in Lemma 5.1.

5. Execute π(S) and let f be the hypothesis it outputs.

6. Output 1 if and only if E_S(f) ≥ w(x) + w(y).

Note that by construction, OPT is at most 2r ≤ T(n) (because any singleton h_i with i ≤ r has error at most 2r on S). So, OPT ≤ T(n), and therefore Lemma 5.1 implies that this protocol solves set disjointness correctly with probability at least 2/3. Thus, by Theorem 3.2, its communication complexity is Ω(r) = Ω(T(n)).
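The mappings F_a = F_b of Lemma 5.1 are easy to sanity-check on small instances. The sketch below is ours (the strings and helper names are illustrative); `singleton(i)` plays the role of the hypothesis h_i:

```python
def F(x: str):
    """Lemma 5.1 mapping: coordinate i of x becomes the example
    (i, +1) if x_i = 1, and (i, -1) if x_i = 0."""
    return [(i, 1 if b == "1" else -1) for i, b in enumerate(x)]

def errors(f, S):
    return sum(1 for x, y in S if f(x) != y)

def singleton(i):
    return lambda x: 1 if x == i else -1

# Disjoint inputs (DISJ = 1): every singleton errs on >= w(x) + w(y) examples.
x, y = "1100", "0011"
S = F(x) + F(y)
w = x.count("1") + y.count("1")  # w(x) + w(y) = 4
print(min(errors(singleton(i), S) for i in range(4)) >= w)  # -> True

# Intersecting inputs (DISJ = 0): the singleton at a common 1-coordinate
# achieves exactly w(x) + w(y) - 2 errors.
x, y = "1100", "0110"  # common 1-coordinate at index 1
S = F(x) + F(y)
w = x.count("1") + y.count("1")  # 4
print(errors(singleton(1), S))  # -> 2
```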
We now wrap up the proof by showing that the communication complexity of π is not in poly(log n, log|S| = log n, k = 2) = polylog(n). Indeed, the communication complexity of the protocol for DISJ_r constructed above exceeds that of π by at most the 2 log r bits used to publish w(x) and w(y). Thus, the communication complexity of π is also Ω(r) = Ω(T(n)), which is super-polylogarithmic in n since, by assumption, T(n) = log^{ω(1)} n.

6. Suggestions for future research

Characterizing agnostic learning. Our main result can be viewed as an agnostic learning protocol whose communication complexity depends linearly on OPT. There are concept classes in which such dependence is necessary, as shown by Theorem 2.3. It is also easy to see that there are classes for which this dependence can be avoided, for example finite classes. Is there a natural characterization of those classes which are learnable without any promise on OPT? Are there infinite classes with this property?

The approximation factor in semi-agnostic learning. Balcan, Blum, Fine, and Mansour (2012) and Chen, Balcan, and Chau (2016) give efficient semi-agnostic learners that approximate the error of the best hypothesis from the class up to a multiplicative factor of c ≥ 4. A simple alteration of the constants in their proofs improves the approximation factor to 2 + α for every α > 0 (at the cost of a higher communication complexity, which deteriorates as α → 0). Can the multiplicative factor be further improved, say to some c < 2?

Bounded communication complexity and generalization. It is interesting to further explore the relationship between the communication complexity and the generalization capacity of distributed learning protocols.

Acknowledgments

We thank an anonymous ALT 2022 reviewer for pointing out that Theorem 2.2 can be proved by reduction to semi-agnostic learning.

References

Balcan, M. F., Blum, A., Fine, S., and Mansour, Y. Distributed learning, communication complexity and privacy. In Conference on Learning Theory, pp. 26.1–26.22. JMLR Workshop and Conference Proceedings, 2012.

Blum, A., Heinecke, S., and Reyzin, L.
Communication-aware collaborative learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021.

Braverman, M., Kol, G., Moran, S., and Saxena, R. R. Convex set disjointness, distributed learning of halfspaces, and LP feasibility. arXiv preprint arXiv:1909.03547, 2019.

Bshouty, N. H. Exact learning of formulas in parallel. Machine Learning, 26(1):25–41, 1997.

Bun, M., Kamath, G., Steinke, T., and Wu, S. Z. Private hypothesis selection. Advances in Neural Information Processing Systems, 32:156–167, 2019.

Chen, S.-T., Balcan, M.-F., and Chau, D. H. Communication efficient distributed agnostic boosting. In Artificial Intelligence and Statistics, pp. 1299–1307. PMLR, 2016.

Collins, M., Schapire, R. E., and Singer, Y. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1):253–285, 2002.

Daumé, III, H., Phillips, J., Saha, A., and Venkatasubramanian, S. Protocols for learning classifiers on distributed data. In Artificial Intelligence and Statistics, pp. 282–290. PMLR, 2012a.

Daumé, III, H., Phillips, J. M., Saha, A., and Venkatasubramanian, S. Efficient protocols for distributed classification and optimization. In International Conference on Algorithmic Learning Theory, pp. 154–168. Springer, 2012b.

Dietterich, T. G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Impagliazzo, R. Hard-core distributions for somewhat hard problems. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pp. 538–545. IEEE, 1995.

Kale, S. Boosting and hard-core set constructions: a simplified approach. In Electronic Colloquium on Computational Complexity (ECCC), volume 14, pp. 131.
Citeseer, 2007.

Kalyanasundaram, B. and Schnitger, G. The probabilistic communication complexity of set intersection. SIAM Journal on Discrete Mathematics, 5(4):545–557, 1992.

Kane, D., Livni, R., Moran, S., and Yehudayoff, A. On communication complexity of classification problems. In Conference on Learning Theory, pp. 1903–1943. PMLR, 2019.

Kushilevitz, E. and Nisan, N. Communication Complexity. Cambridge University Press, 1996. doi: 10.1017/CBO9780511574948.

Lazarevic, A. and Obradovic, Z. The distributed boosting algorithm. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 311–316, 2001.

Long, P. and Servedio, R. Algorithms and hardness results for parallel large margin learning. Advances in Neural Information Processing Systems, 24:1314–1322, 2011.

Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287–304, 2010.

Rao, A. and Yehudayoff, A. Communication Complexity: and Applications. Cambridge University Press, 2020. doi: 10.1017/9781108671644.

Razborov, A. A. On the distributional complexity of disjointness. In International Colloquium on Automata, Languages, and Programming, pp. 249–253. Springer, 1990.

Schapire, R. E. and Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012.

Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

Vempala, S. S., Wang, R., and Woodruff, D. P. The communication complexity of optimization. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1733–1752. SIAM, 2020.

Yao, A. C.-C. Some complexity questions related to distributive computing (preliminary report). In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, pp. 209–213, 1979.

Zinkevich, M., Weimer, M., Smola, A. J., and Li, L.
Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, 2010.