Do Outliers Ruin Collaboration?

Mingda Qiao
Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University, Beijing, China. Correspondence to: Mingda Qiao.
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We consider the problem of learning a binary classifier from n different data sources, among which at most an η fraction are adversarial. The overhead is defined as the ratio between the sample complexity of learning in this setting and that of learning the same hypothesis class on a single data distribution. We present an algorithm that achieves an O(ηn + ln n) overhead, which is proved to be worst-case optimal. We also discuss the potential challenges to the design of a computationally efficient learning algorithm with a small overhead.

1. Introduction

Consider the following real-world scenario: we would like to train a speech recognition model based on labeled examples collected from different users. For this particular application, a high average accuracy over all users is far from satisfactory: a model that is correct on 99.9% of the data may still go seriously wrong for a small yet non-negligible 0.1% fraction of the users. Instead, a more desirable objective would be finding personalized speech recognition solutions that are accurate for every single user.

There are two major challenges to achieving this goal, the first being user heterogeneity: a model trained exclusively for users with a particular accent may fail miserably for users from another region. This challenge hints that a successful learning algorithm should be adaptive: more samples need to be collected from users with atypical data distributions. Equally crucial is that a small fraction of the users are malicious (e.g., they are controlled by a competing corporation); these users intend to mislead the speech recognition model into generating inaccurate or even ludicrous outputs.

Motivated by these practical concerns, we propose the Robust Collaborative Learning model and study from a theoretical perspective the complexity of learning in the presence of untrusted collaborators. In our model, a learning algorithm interacts with n different users, each associated with a data distribution D_i. As mentioned above, a successful learning algorithm should, ideally, find personalized classifiers f_1, f_2, . . . , f_n for the different distributions, such that

err_{D_i}(f_i) ≜ Pr_{x∼D_i}[f_i(x) ≠ f*(x)] < ϵ

holds for every i ∈ [n], where f*(x) denotes the true label of sample x. Further complicating the situation is that the algorithm can only interact with the data distributions via the users, each of which is either truthful or adversarial. A truthful user always provides the learning algorithm with independent samples drawn from his distribution together with the correct labels, whereas the labeled samples collected from adversarial users are arbitrary.

In the presence of malicious users, it is clearly impossible to learn an accurate classifier for every single distribution: an adversary may choose to provide no information about his data distribution. Therefore, a more realistic objective is to satisfy all the truthful users, i.e., to learn n classifiers f_1, f_2, . . . , f_n such that err_{D_i}(f_i) < ϵ holds for every truthful user i.
Naïvely, one could ignore the prior knowledge that samples from truthful users are labeled by the same function, and run n independent copies of the same learning algorithm for the n users. This straightforward approach clearly needs at least n times as many samples as that required by learning on a single data distribution. Following the terminology of Blum et al. (2017), we say that this naïve algorithm leads to an Ω(n) sample complexity overhead. The notion of overhead measures the extent to which learning benefits from the collaboration and sharing of information among different parties. Blum et al. (2017) proposed a learning algorithm that achieves an O(ln n) overhead for the case that all users are truthful, i.e., η = 0. We are then interested in answering the following natural question: can we still achieve a sublinear overhead for the case that η > 0, at least when η is sufficiently small? In other words, do adversaries ruin the efficiency of collaboration?

1.1. Model and Preliminaries

Similar to the classic Probably Approximately Correct (PAC) learning framework due to Valiant (1984), we consider the binary classification problem on a set X. The hypothesis class F is a collection of binary functions on X with VC-dimension d. The elements of X are labeled by an unknown target function f* ∈ F; this is known as the realizable setting of PAC learning.

Suppose that D is a probability distribution on the set X. Let O_F denote the oracle that, given a set S = {(x_i, y_i)} of labeled examples, either returns a classifier f ∈ F that is consistent with the examples (i.e., f(x_i) = y_i for every (x_i, y_i) ∈ S) or returns ⊥ if F contains no such consistent classifier. A classic result in PAC learning states that if m = Θ((d ln(1/ϵ) + ln(1/δ))/ϵ) independent labeled samples S = {(x_i, f*(x_i)) : i ∈ [m]} are drawn from D, then with probability at least 1 − δ, the inequality err_D(f) < ϵ holds for every possible output f = O_F(S) (Blumer et al., 1989).

In the Robust Collaborative Learning setting, we consider n different data distributions D_1, D_2, . . . , D_n supported on X. A learning algorithm interacts with these distributions via n user oracles O_1, O_2, . . . , O_n, each of which operates in one of two different modes: truthful or adversarial. Upon each call to a truthful oracle O_i, a sample x is drawn from distribution D_i and the labeled sample (x, f*(x)) is returned. On the other hand, an adversarial oracle O_i may output an arbitrary pair in X × {0, 1} each time; our results hold even if the adversarial oracles are allowed to collude and know the samples previously drawn by truthful oracles.

We define (ϵ, δ, η)-learning in the Robust Collaborative Learning model as the task of learning an ϵ-accurate classifier for each truthful user with probability 1 − δ, under the assumption that at most an η fraction of the oracles are adversarial.

Definition 1.1 ((ϵ, δ, η)-learning). Algorithm A is an (ϵ, δ, η)-learning algorithm if A, given a concept class F and access to n user oracles O_1, O_2, . . . , O_n among which at most ηn oracles are adversarial, outputs functions f_1, f_2, . . . , f_n : X → {0, 1} such that, with probability at least 1 − δ, err_{D_i}(f_i) < ϵ holds simultaneously for every truthful oracle O_i.
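To make the interaction model concrete, the following Python snippet is a minimal sketch of the two oracle modes and of a brute-force stand-in for the consistency oracle O_F on a small finite hypothesis class. It is illustrative only; the class and function names (TruthfulOracle, AdversarialOracle, consistency_oracle) are hypothetical and not part of the paper.

```python
import random

class TruthfulOracle:
    """A truthful user: returns (x, f_star(x)) with x drawn from the user's distribution D_i."""
    def __init__(self, draw_x, f_star):
        self.draw_x = draw_x            # zero-argument callable sampling x from D_i
        self.f_star = f_star            # the common target function
    def sample(self):
        x = self.draw_x()
        return x, self.f_star(x)

class AdversarialOracle:
    """An adversarial user: may return an arbitrary labeled pair on every call."""
    def __init__(self, strategy):
        self.strategy = strategy        # any zero-argument callable returning (x, y)
    def sample(self):
        return self.strategy()

def consistency_oracle(hypotheses, labeled_samples):
    """Brute-force O_F for a small finite class: return a consistent f, else None."""
    for f in hypotheses:
        if all(f(x) == y for x, y in labeled_samples):
            return f
    return None                         # plays the role of the "no consistent classifier" output

# Tiny usage example: X = {0, 1}, the target labels x with x itself, D uniform on X.
oracle = TruthfulOracle(draw_x=lambda: random.randrange(2), f_star=lambda x: x)
print(oracle.sample())
```

In the paper O_F is an abstract oracle that may be far more efficient than the exhaustive search above; the brute-force version is only meaningful for tiny finite classes.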
We also formally define the sample complexity of (ϵ, δ, η)-learning.

Definition 1.2 (Sample Complexity). Let M_A(F, {O_i}) denote the expected number of times that algorithm A calls the oracles O_1, O_2, . . . , O_n in total, when it runs on hypothesis class F and user oracles {O_i}. The sample complexity of (ϵ, δ, η)-learning a concept class with VC-dimension d from n users is defined as

m_{n,d}(ϵ, δ, η) ≜ inf_A sup_{F, {O_i}} M_A(F, {O_i}).

Here the infimum is over all (ϵ, δ, η)-learning algorithms A, and the supremum is taken over all hypothesis classes F with VC-dimension d and all user oracles O_1, O_2, . . . , O_n among which at most an η fraction are adversarial.

The overhead of Robust Collaborative Learning is defined as the ratio between the sample complexity m_{n,d}(ϵ, δ, η) and its counterpart in the classic PAC learning setting, m_{1,d}(ϵ, δ, 0). To simplify the notation and restrict our attention to the dependence of the overhead on the parameters n, d and η, we fix ϵ = δ = 0.1 in our definition of overhead; the definition only changes by a constant factor when 0.1 is replaced by other sufficiently small constants.

Definition 1.3 (Overhead). For n, d ∈ ℕ and η ∈ [0, 1], the sample complexity overhead of Robust Collaborative Learning is defined as

o(n, d, η) ≜ m_{n,d}(ϵ, δ, η) / m_{1,d}(ϵ, δ, 0),

where ϵ = δ = 0.1.

Following our definition of the overhead, the results in (Blum et al., 2017) imply that when all users are truthful (i.e., when η = 0) and n = O(d), o(n, d, 0) = O(ln n). They also proved the tightness of this bound in the special case that n = Θ(d).

1.2. Our Results

Information-theoretically, collaboration can be robust. In Section 3, we present our main positive result: a learning algorithm that achieves an O(ηn + ln n) sample complexity overhead when n = O(d). Our result recovers the O(ln n) overhead upper bound due to Blum et al. (2017) for the special case η = 0. In Section 4, we complement our positive result with a lower bound, which states that an Ω(ηn) overhead is inevitable in the worst case. In light of the previous Ω(ln n) overhead lower bound for the special case that n = Θ(d) (Blum et al., 2017), our learning algorithm achieves an optimal overhead when the parameters n and d differ by a bounded constant factor.

Our characterization of the sample complexity of Robust Collaborative Learning indicates that efficient cooperation is possible even if a small fraction of arbitrary outliers are present. Moreover, the overhead is largely determined by ηn, the maximum possible number of adversaries. Our results suggest that for practical applications, the learning algorithm could greatly benefit from a relatively clean pool of data sources.

Computationally, outliers may ruin collaboration. Our study focuses on the sample complexity of Robust Collaborative Learning, yet also important in practice is the amount of computational power required by the learning task. Indeed, the algorithm that we propose in Section 3 is inefficient due to an exhaustive enumeration of the set of truthful users, which takes exponential time. In Section 5, we provide evidence that hints at a time-sample complexity tradeoff in Robust Collaborative Learning. Informally, we conjecture that any learning algorithm with a sublinear overhead must run in super-polynomial time. In other words, while the presence of adversaries does not seriously increase the sample complexity of learning, it may still ruin the efficiency of collaboration by significantly increasing the computational burden of this learning task. We support our conjecture with known hardness results in computational complexity theory.
2. Related Work

Most related to our work is the recent Collaborative PAC Learning model proposed by Blum et al. (2017). They also considered the task of learning the same binary classifier on different data distributions, yet all users are assumed to be truthful in their model. In fact, the Robust Collaborative Learning model reduces to the personalized setting of their model when η = 0. Here the word personalized emphasizes the assumption that each user may receive a specialized classifier tailored to his distribution. In addition to the personalized setting, they also studied the centralized setting, in which all the n users should receive the same classifier. They proved that a poly-logarithmic overhead is still achievable in this more challenging setting. In our Robust Collaborative Learning model, however, centralized learning is in general impossible due to the indistinguishability between truthful and adversarial users. The following simple impossibility result holds for extremely simple concept classes and even when infinitely many samples are available.

Proposition 2.1. For any ϵ ∈ [0, 1), δ ∈ [0, 1/2) and η ∈ (0, 1], no algorithm (ϵ, δ, η)-learns any concept class of VC-dimension d ≥ 2, under the restriction that all users should receive the same classifier.

Proof of Proposition 2.1. Let x_0 and x_1 be two different samples that can be shattered by F. Choose f_0, f_1 ∈ F such that f_0(x_0) = f_0(x_1) = f_1(x_0) = 0 and f_1(x_1) = 1. Let n be large enough such that ηn ≥ 1. Construct (degenerate) distributions D_1, D_2, . . . , D_n such that D_1(x_1) = D_2(x_1) = 1 and D_i(x_0) = 1 for each 3 ≤ i ≤ n. Consider the following two cases:

- The target function is f_0. The only adversarial user, O_1, misleads the learning algorithm by outputting the labeled example (x_1, 1).
- The target function is f_1. The only adversarial user, O_2, misleads the learning algorithm by outputting the labeled example (x_1, 0).

Note that in both cases, oracles O_1 and O_2 always return (x_1, 1) and (x_1, 0) respectively, while all other oracles return (x_0, 0). Consequently, no algorithm can distinguish these two cases with success probability strictly greater than 1/2. Thus, any learning algorithm would have a failure probability of at least 1/2.
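To make the indistinguishability argument above concrete, here is a tiny sketch (purely illustrative; the function name and the string encodings of x_0 and x_1 are hypothetical) that builds the oracle transcripts of the two cases and checks that they coincide, so any single shared classifier must fail in one of them.

```python
def transcript(target_label_of_x1, adversary, n):
    """Per-user oracle responses when the target labels x1 with
    `target_label_of_x1` and user `adversary` (1 or 2) is adversarial."""
    out = {}
    for i in range(1, n + 1):
        if i == adversary:
            out[i] = ("x1", 1 - target_label_of_x1)  # the adversary flips the label of x1
        elif i in (1, 2):
            out[i] = ("x1", target_label_of_x1)      # truthful report of x1
        else:
            out[i] = ("x0", 0)                        # both f0 and f1 label x0 with 0
    return out

n = 5
case_f0 = transcript(target_label_of_x1=0, adversary=1, n=n)  # target f0, O1 lies
case_f1 = transcript(target_label_of_x1=1, adversary=2, n=n)  # target f1, O2 lies
assert case_f0 == case_f1  # identical transcripts: the two cases are indistinguishable
```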
A related line of research is multi-task learning (Caruana, 1997; Baxter, 2000; Ben-David et al., 2002; 2003), which studies the problem of learning multiple related tasks simultaneously with significantly fewer samples. Most work in this direction assumes a certain relation (e.g., a transfer function) between the given learning tasks. In contrast to multi-task learning, our work focuses on the problem of learning the same classifier on multiple data distributions, without assuming any similarity between these underlying distributions.

Also relevant to our study is the work on robust statistics, i.e., the study of learning and estimation in the presence of noisy data and arbitrary outliers; see Lai et al. (2016), Charikar et al. (2017) and Diakonikolas et al. (2016; 2017; 2018), and the references therein, for some recent work in this line of research. Classic problems in this regime include the estimation of the mean and covariance of a high-dimensional distribution, given a dataset consisting of samples drawn from the distribution and a small fraction of arbitrary outliers. Our model differs from this line of research in that we consider a general classification setting, and the learning algorithm is allowed to sample different sources adaptively, instead of learning from a given dataset of fixed size.

3. An Iterative Learning Algorithm

In this section, we present an iterative (ϵ, δ, η)-learning algorithm that achieves an O(ηn + ln n) overhead when n = O(d). Here n is the number of users, and d denotes the VC-dimension of the hypothesis class F. Since F can be large and even infinite, we assume that the algorithm accesses F via an oracle O_F that, given a set S = {(x_i, y_i)} of labeled examples, either returns a classifier f ∈ F such that f(x_i) = y_i holds for each pair (x_i, y_i) ∈ S, or returns ⊥ if F does not contain any consistent function. The algorithm interacts with the underlying data distributions D_1, D_2, . . . , D_n via n example oracles O_1, O_2, . . . , O_n, among which at most an η fraction are adversarial.

Algorithm 1 Iterative Robust Collaborative Learning
Input: Parameters n, d, ϵ, δ, η.
Output: Classifiers f_1, f_2, . . . , f_n.
  r ← 1; G_1 ← [n];
  while ⌊ηn⌋ ≤ |G_r|/10 do
    δ_r ← δ/(5r²);
    f̂_r ← Candidate(G_r, d, ϵ, δ_r);
    G_{r+1} ← Test(G_r, f̂_r, ϵ, δ_r);
    Set f_i ← f̂_r for each i ∈ G_r \ G_{r+1};
    r ← r + 1;
  end while
  for i ∈ G_r do
    S_i ← Θ((d ln(1/ϵ) + ln(n/δ))/ϵ) labeled samples from O_i;
    f_i ← O_F(S_i);
  end for
  Output f_1, f_2, . . . , f_n.

3.1. Algorithm

Our algorithm is formally described in Algorithms 1 through 3. The main algorithm proceeds in rounds and maintains a set G_r of the indices of active users at the beginning of round r, i.e., users who have not received an ϵ-accurate classifier so far. When ⌊ηn⌋, the maximum possible number of adversaries, does not exceed |G_r|/10, the algorithm invokes subroutine Candidate to find a candidate classifier f̂_r. Then, Algorithm 1 calls the validation procedure Test to check whether f̂_r is accurate for each user i ∈ G_r (with respect to accuracy threshold ϵ). If so, the algorithm marks the output for user i as f̂_r; otherwise, user i stays in set G_{r+1} for the next round. When the proportion of adversaries among the active users reaches 1/10, the algorithm learns for the remaining users independently: for each active user, it draws samples from his oracle and outputs an arbitrary classifier that is consistent with his data.

3.2. Analysis of Subroutines

Subroutine Candidate (Algorithm 2) is the key to the sample efficiency of our algorithm, as it enables us to learn a candidate classifier that is accurate simultaneously for a constant fraction of the active users, using only a nearly-linear number of samples (with respect to the parameters |G| and d). Subroutine Test (Algorithm 3) further checks whether the learned classifier is accurate enough for each active user. This allows us to determine whether a user should remain active in the next iteration. We devote this subsection to the analysis of these two subroutines.

Algorithm 2 Candidate(G, d, ϵ, δ)
Input: Index set G, parameters d, ϵ and δ.
Output: Candidate classifier f̂ ∈ F.
  M ← Θ((d ln(1/ϵ) + ln(2^{|G|}/δ))/ϵ + |G| ln(|G|/δ));
  for i ∈ G do
    S_i ← 4M/|G| labeled samples from O_i;
  end for
  𝒢 ← {H ⊆ G : |H| ≥ (9/10)|G|};
  for H ∈ 𝒢 do
    f̂_H ← O_F(∪_{i∈H} S_i);
    if f̂_H ≠ ⊥ then
      Output f̂_H;
    end if
  end for

Algorithm 3 Test(G, f̂, ϵ, δ)
Input: Set G of indices, candidate function f̂, parameters ϵ and δ.
Output: Set G′ of surviving indices.
  for i ∈ G do
    S_i ← Θ(ln(|G|/δ)/ϵ) samples from O_i;
    θ_i ← (1/|S_i|) Σ_{(x,y)∈S_i} I[f̂(x) ≠ y];
  end for
  Output G′ = {i ∈ G : θ_i > (3/4)ϵ}.
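The following Python sketch mirrors the structure of Algorithms 1–3. It is an illustration under simplifying assumptions, not the paper's implementation: the Θ(·) sample sizes are passed in as explicit integers (m_candidate, m_test, m_single), the round-dependent confidence schedule δ_r is omitted, `oracles` is assumed to be a dict mapping each user index to an object with a sample() method (as in the earlier snippet), and `oracle_F` is any callable mapping a list of labeled examples to a consistent classifier or None (for instance, functools.partial(consistency_oracle, hypotheses) from the earlier sketch).

```python
from itertools import combinations
from math import floor

def candidate(G, oracles, oracle_F, samples_per_user):
    """Algorithm 2 (sketch): draw the same number of samples from every active user,
    then search for a large subset H whose pooled samples admit a consistent classifier."""
    S = {i: [oracles[i].sample() for _ in range(samples_per_user)] for i in G}
    min_size = -(-9 * len(G) // 10)                      # ceiling of (9/10)|G|
    for size in range(len(G), min_size - 1, -1):
        for H in combinations(sorted(G), size):          # exhaustive search: exponential time
            pooled = [pair for i in H for pair in S[i]]
            f = oracle_F(pooled)
            if f is not None:                             # a consistent classifier was found
                return f
    return None

def test(G, f_hat, oracles, eps, samples_per_user):
    """Algorithm 3 (sketch): estimate each active user's error of f_hat and keep
    (as still-active) the users whose empirical error exceeds (3/4) * eps."""
    survivors = set()
    for i in G:
        S = [oracles[i].sample() for _ in range(samples_per_user)]
        theta = sum(f_hat(x) != y for x, y in S) / len(S)
        if theta > 0.75 * eps:
            survivors.add(i)
    return survivors

def robust_collaborative_learning(oracles, oracle_F, eps, eta,
                                  m_candidate, m_test, m_single):
    """Algorithm 1 (sketch): iterate Candidate/Test while the adversaries are few
    relative to the active users, then learn separately for the leftover users.
    Termination of the while-loop relies on the progress guaranteed by Lemma 3.1."""
    n = len(oracles)
    classifiers = {}
    G = set(oracles)                                      # indices of active users
    while G and floor(eta * n) <= len(G) / 10:
        f_hat = candidate(G, oracles, oracle_F, m_candidate)
        if f_hat is None:                                 # cannot happen under the model's promise
            break
        survivors = test(G, f_hat, oracles, eps, m_test)
        for i in G - survivors:
            classifiers[i] = f_hat                        # these users exit with f_hat
        G = survivors
    for i in G:                                           # learn independently for the rest
        S = [oracles[i].sample() for _ in range(m_single)]
        classifiers[i] = oracle_F(S)
    return classifiers
```

The exhaustive search over subsets H inside `candidate` is exactly the step that makes the algorithm computationally expensive; this is the subject of Section 5.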
Lemma 3.1. Suppose G ⊆ [n] denotes the indices of |G| users, among which at most |G|/10 are adversarial. Let f̂ denote the output of Candidate(G, d, ϵ, δ). With probability 1 − δ, the following two conditions hold simultaneously for at least |G|/2 indices i ∈ G: (1) err_{D_i}(f̂) ≤ ϵ/2; (2) oracle O_i is truthful.

The proof of Lemma 3.1 relies on the following technical claim, which enables us to relate the union of several equal-size datasets to the samples drawn from the uniform mixture of the corresponding distributions.

Claim 3.2. Suppose m = Ω(n ln(n/δ)) balls are thrown into n bins independently and uniformly at random. Then with probability 1 − δ, no bin contains more than 2m/n balls.

Proof of Claim 3.2. Let random variable X denote the number of balls in a fixed bin, so X/m is the average of m i.i.d. Bernoulli random variables with mean 1/n. The Chernoff bound implies that

Pr[X > 2m/n] ≤ e^{−Ω(m/n)} = e^{−Ω(ln(n/δ))} ≤ δ/n,

where the last step holds if we choose a sufficiently large hidden constant in m = Ω(n ln(n/δ)). The claim follows from a union bound over the n bins.

Proof of Lemma 3.1. Let G′ denote the indices of the truthful users in G. By assumption, |G′| ≥ (9/10)|G| and F contains a function f* that is consistent with ∪_{i∈G′} S_i. This guarantees that Algorithm 2 returns some f̂_H as its output (at the latest when H = G′), so the function Candidate is well-defined.

Recall that in Algorithm 2, we set

M = Θ((d ln(1/ϵ) + ln(2^{|G|}/δ))/ϵ + |G| ln(|G|/δ)).

Consider the following thought experiment. For each non-empty H ⊆ G, we draw a sequence A_H of M integers, each of which is chosen uniformly and independently at random from H. We also draw M samples from oracle O_i for each i ∈ G. If all users in H are truthful, these samples together with A_H naturally specify a realization of drawing M samples from the uniform mixture distribution D_H ≜ (1/|H|) Σ_{i∈H} D_i: we arrange the M samples drawn from each distribution into a queue, and when we would like to draw the j-th sample from D_H, we simply take the next unused sample from the queue of user A_H(j).

For a fixed non-empty subset H ⊆ G that only contains truthful users, the VC theorem implies that with probability 1 − δ/2^{|G|+1} (over the randomness in both the samples and the choice of A_H), when we draw M samples from the uniform mixture D_H as described above, any function f ∈ F that is consistent with the labeled samples satisfies err_{D_H}(f) ≤ ϵ/10. By a union bound over the 2^{|G|} different sets H ⊆ G, the above holds for every H ⊆ G simultaneously with probability 1 − 2^{|G|} · δ/2^{|G|+1} = 1 − δ/2.

Recall that in Algorithm 2, we first query each oracle O_i to obtain a training set S_i of size 4M/|G| for each i ∈ G. Then we find a set H ⊆ G and a classifier f̂_H ∈ F such that: (1) |H| ≥ (9/10)|G|; (2) f̂_H is consistent with all labeled samples in ∪_{i∈H} S_i. Suppose that H is the set associated with the output of Algorithm 2, and let H′ = {i ∈ H : O_i is truthful}. Note that |H′| ≥ |H| − |G|/10 ≥ (4/5)|G|.

The crucial observation is that since M = Ω(|G| ln(|G|/δ)), Claim 3.2 implies that with probability at least 1 − δ/2, each index i ∈ H′ appears at most 2M/|H′| ≤ 4M/|G| times in A_{H′}. In other words, ∪_{i∈H′} S_i is a superset of the M samples that are supposed to be drawn from D_{H′} (in our thought experiment). Since f̂_H is consistent with ∪_{i∈H′} S_i, a union bound shows that with probability 1 − δ/2 − δ/2 = 1 − δ, we have

err_{D_{H′}}(f̂_H) = (1/|H′|) Σ_{i∈H′} err_{D_i}(f̂_H) ≤ ϵ/10.

This further implies that err_{D_i}(f̂_H) ≤ ϵ/2 holds for at least |G|/2 indices i ∈ H′; otherwise, we would have

err_{D_{H′}}(f̂_H) ≥ (1/|H′|) · (|H′| − |G|/2) · (ϵ/2) ≥ (3/8) · (ϵ/2) > ϵ/10,

which leads to a contradiction. Here the second step applies |H′| ≥ (4/5)|G|. This proves the lemma.
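As a quick sanity check of Claim 3.2 (purely illustrative and not part of the proof), the snippet below throws m = ⌈c · n ln(n/δ)⌉ balls into n bins repeatedly and reports how often the maximum load exceeds 2m/n; the constant c and all parameter values are arbitrary choices, and with a reasonable c the violation frequency falls far below δ.

```python
import math
import random

def max_load_violation_rate(n=50, delta=0.1, c=4, trials=2000, seed=0):
    """Empirical frequency of the event 'some bin holds more than 2m/n balls'."""
    rng = random.Random(seed)
    m = math.ceil(c * n * math.log(n / delta))
    threshold = 2 * m / n
    violations = 0
    for _ in range(trials):
        loads = [0] * n
        for _ in range(m):
            loads[rng.randrange(n)] += 1
        violations += max(loads) > threshold
    return violations / trials

print(max_load_violation_rate())  # expected to be (much) smaller than delta = 0.1
```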
The following lemma, which directly follows from a Chernoff bound and a union bound, states that with probability 1 − δ, Test(G, f̂, ϵ, δ) correctly determines whether f̂ has an O(ϵ) error for each user in G.

Lemma 3.3. Let G′ denote the output of Test(G, f̂, ϵ, δ). With probability 1 − δ, the following holds simultaneously for every i ∈ G such that O_i is truthful: (1) if err_{D_i}(f̂) > ϵ, then i ∈ G′; (2) if err_{D_i}(f̂) ≤ ϵ/2, then i ∉ G′.

Proof of Lemma 3.3. Fix a truthful oracle O_i with i ∈ G. Recall that Algorithm 3 sets

θ_i = (1/|S_i|) Σ_{(x,y)∈S_i} I[f̂(x) ≠ y].

Note that θ_i is the average of Ω(ln(|G|/δ)/ϵ) independent Bernoulli random variables, each with mean err_{D_i}(f̂). Thus, the Chernoff bound implies that with probability 1 − δ/|G|, the following two conditions hold simultaneously: (1) if err_{D_i}(f̂) > ϵ, then θ_i > (3/4)ϵ; (2) if err_{D_i}(f̂) ≤ ϵ/2, then θ_i ≤ (3/4)ϵ. The lemma follows from a union bound over all i ∈ G.

3.3. Correctness and Sample Complexity

Now we are ready to prove our main result.

Theorem 3.4. For any ϵ, δ ∈ (0, 1] and η ∈ [0, 1], Algorithm 1 is an (ϵ, δ, η)-learning algorithm and takes at most

O( (d ln(1/ϵ) · (ηn + ln n) + n ln(n/δ)) / ϵ )

samples in expectation.

By Theorem 3.4, the sample complexity m_{n,d}(ϵ, δ, η) reduces to O(d(ηn + ln n)) when ϵ and δ are constants and n ≤ C·d for some constant C. Therefore, when n = O(d), we have the following overhead upper bound:

o(n, d, η) = O(d(ηn + ln n)) / Θ(d) = O(ηn + ln n).

Proof of Theorem 3.4. The proof proceeds by applying Lemmas 3.1 and 3.3 iteratively. In each round r, Lemma 3.1 guarantees that with probability 1 − δ_r, the learned classifier f̂_r has an error below ϵ/2 for at least |G_r|/2 truthful users. By Lemma 3.3, for each such user, the validation error θ_i is below (3/4)ϵ, so these users exit the algorithm by receiving f̂_r as their classifier, and the number of active users decreases by at least a factor of 2. Therefore, the while-loop in Algorithm 1 terminates after at most ⌊log₂ n⌋ + 1 iterations. Finally, the algorithm satisfies the remaining active users by drawing Θ((d ln(1/ϵ) + ln(n/δ))/ϵ) samples from each of them. Thus, the VC theorem guarantees that for each such truthful user, the learned classifier is ϵ-accurate with probability at least 1 − δ/(3n). By a union bound, with probability at least

1 − Σ_{r=1}^{∞} 2δ_r − n · δ/(3n) ≥ 1 − (2δ/5) Σ_{r=1}^{∞} 1/r² − δ/3 ≥ 1 − δ,

Algorithm 1 returns an ϵ-accurate classifier for each truthful user.

It remains to bound the sample complexity of Algorithm 1. In round r, the number of active users is at most |G_r| ≤ n/2^{r−1}. Recall that δ_r = δ/(5r²). The number of samples that Candidate and Test draw in round r is then upper bounded by

O( (d ln(1/ϵ) + |G_r| ln(|G_r|/δ_r)) / ϵ ) = O( (d ln(1/ϵ) + 2^{−(r−1)} n ln(n/δ)) / ϵ ).

Therefore, the number of samples drawn in the O(ln n) iterations is upper bounded by

Σ_{r=1}^{⌊log₂ n⌋+1} O( (d ln(1/ϵ) + 2^{−(r−1)} n ln(n/δ)) / ϵ ) = O( (d ln(1/ϵ) ln n + n ln(n/δ)) / ϵ ).    (1)

When the while-loop in Algorithm 1 terminates, it holds that |G_r| ≤ 10ηn = O(ηn). After that, we learn on the remaining distributions separately, using

O( ηn (d ln(1/ϵ) + ln(n/δ)) / ϵ )    (2)

samples in total. Adding (1) and (2) gives the desired sample complexity upper bound.

4. Overhead Lower Bound

In this section, we show that an Ω(ηn + ln n) overhead is unavoidable when n = Θ(d). Therefore, the overhead achieved by Algorithm 1 is optimal up to a constant factor when the number of users is commensurate with the complexity of the hypothesis class. Formally, we have the following theorem.

Theorem 4.1. For any n, d ∈ ℕ, ϵ ∈ (0, 1/2) and δ, η ∈ (0, 1), m_{n,d}(ϵ, δ, η) = Ω(ηnd/ϵ).

Theorem 4.1 directly implies the following lower bound on the overhead:

o(n, d, η) = Ω(ηnd) / Θ(d) = Ω(ηn).

Combining this with the previous lower bound o(n, d, η) = Ω(ln n) when n = Θ(d) and η = 0 (Blum et al., 2017), we obtain the desired worst-case lower bound of Ω(ηn + ln n). (Blum et al. proved the Ω(ln n) lower bound for the special case that n = d, yet their proof directly implies the same lower bound when n = Θ(d).)

Proof of Theorem 4.1. Assume without loss of generality that ηn is an integer between 1 and n − 1. We consider the binary classification problem on the set X = [d] ∪ {⊥}, while the hypothesis class F contains all the 2^d binary functions on X that map ⊥ to 0. The target function f* is chosen uniformly at random from F. Suppose that for (1 − η)n − 1 truthful users, the data distribution is the degenerate distribution on {⊥}, so these truthful users provide no information about the correct classifier f*. On the other hand, the data distribution of the only remaining truthful user i* satisfies D_{i*}(x) = 2ϵ/d for every x ∈ [d] and D_{i*}(⊥) = 1 − 2ϵ. By construction, a learning algorithm must draw Ω(d/ϵ) samples from D_{i*} in order to learn an ϵ-accurate classifier with a non-trivial success probability 1 − δ.

Now suppose that each of the ηn adversarial users tries to pretend that he is the truthful user i*. More specifically, each adversarial user i chooses a function f_i ∈ F uniformly at random, and answers the queries as if he were the truthful user with a different target function f_i. In other words, upon each request from the learning algorithm, oracle O_i draws x from D_{i*} and returns (x, f_i(x)). Recall that the actual target function f* is also uniformly distributed in F, so from the perspective of the learning algorithm, the truthful user i* is indistinguishable from the other ηn adversarial users. Therefore, an (ϵ, δ, η)-learning algorithm must guarantee that each of these (ηn + 1) users receives an ϵ-accurate classifier with probability at least 1 − δ. The sample complexity lower bound from PAC learning theory implies that we must draw Ω(d/ϵ) samples from each such user, and thus

(ηn + 1) · Ω(d/ϵ)

samples in total.
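As an illustration of this construction (a sketch only; the function name, the string encoding of the special point ⊥ as "bot", and the representation of oracles as plain zero-argument callables are all simplifying assumptions), the snippet below builds the oracles of the hard instance: one informative truthful user, many uninformative truthful users, and ηn adversaries that imitate the informative user with random target functions of their own.

```python
import random

def make_hard_instance(n, d, eps, num_adversaries, seed=0):
    """Oracles for the lower-bound instance in the proof of Theorem 4.1.
    The domain is {0, ..., d-1} plus a special point 'bot'; a target function is
    represented as a dict mapping each point to its label (with 'bot' always 0)."""
    rng = random.Random(seed)

    def random_target():
        return {**{x: rng.randint(0, 1) for x in range(d)}, "bot": 0}

    def mimic(target):
        # Sample from D_{i*}: each x in [d] has probability 2*eps/d, 'bot' the rest.
        def oracle():
            x = rng.randrange(d) if rng.random() < 2 * eps else "bot"
            return x, target[x]
        return oracle

    f_star = random_target()                      # the true target, uniform over F
    oracles = [mimic(f_star)]                     # the informative truthful user i*
    oracles += [mimic(random_target()) for _ in range(num_adversaries)]          # imitators
    oracles += [(lambda: ("bot", 0)) for _ in range(n - 1 - num_adversaries)]    # uninformative
    rng.shuffle(oracles)                          # the learner does not know who is who
    return oracles, f_star
```

Since every imitating adversary looks exactly like i* with a freshly drawn random target, a learner that satisfies i* with probability 1 − δ must in effect solve ηn + 1 independent PAC problems of dimension d, which is where the (ηn + 1) · Ω(d/ϵ) bound comes from.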
5. Discussion: A Computationally Efficient Algorithm?

Although Algorithm 1 is proved to achieve an optimal sample complexity overhead in certain cases, the algorithm is computationally inefficient and of limited practical use when there are a large number of users. In particular, subroutine Candidate performs an exhaustive search over all user subsets of size at least (9/10)|G|, and thus may potentially call oracle O_F exponentially many times. In contrast, the naïve approach that learns for different users separately, though incurring an Ω(n) overhead, only makes n calls to oracle O_F. Naturally, one may wonder whether we can achieve the best of both worlds by finding a computationally efficient learning algorithm with a small overhead. We conjecture that such an algorithm, unfortunately, does not exist.

Conjecture 5.1. For any α > 1 and β < 1, no learning algorithm that makes polynomially many calls to oracle O_F achieves an O(n^β) overhead when ηn = Ω(n^{α−1}).

In words, when there is a non-trivial number of adversaries, any efficient learning algorithm would incur a nearly-linear overhead. We remark that it is necessary to assume α > 1, since when ηn, the maximum possible number of adversaries, is a constant, the learning algorithm can enumerate the subsets of adversarial users in polynomial time, thus achieving an optimal overhead efficiently. Proving or refuting Conjecture 5.1 would greatly further our understanding of the impact of arbitrary outliers on collaborative learning.

The key to our sample-efficient learning algorithm is that subroutine Candidate identifies a large user group such that some classifier f̂ ∈ F is consistent with all their labeled samples. Lemma 3.1 further guarantees that f̂ is ϵ-accurate for at least half of the users. This allows us to satisfy almost all the users in O(ln n) iterations, resulting in the ln n term in the overhead.

We note that finding a group of users with consistent datasets generalizes the problem of finding a large clique in a graph: for an undirected graph with vertices labeled from 1 to n, we construct the user oracles O_1, O_2, . . . , O_n such that O_i and O_j produce conflicting labels on the same data point if and only if the edge (i, j) is absent from the graph. Then a group of users have consistent datasets if and only if they form a clique in the corresponding graph. Unfortunately, Zuckerman (2006) proved that even if the graph is known to contain a clique of size Ω(n) (analogously, in our setting, we know that a large fraction of users have non-conflicting datasets), it is still NP-hard to find a clique of size Ω(n^{1−β}) for any β < 1. This indicates that, following the approach of Algorithm 1, a computationally efficient algorithm can only find accurate classifiers for at most O(n^{1−β}) users in each iteration. As a result, Ω(n^β) iterations would be necessary to satisfy all the n users. The algorithm consequently incurs an Ω(n^β) overhead.
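The reduction sketched above can be written down directly. The snippet below is illustrative only: the particular encoding of "conflicting labels" (one fresh data point per non-edge, labeled 0 by one endpoint and 1 by the other) is one simple choice, not necessarily the one intended by the author, and the function names are hypothetical.

```python
from itertools import combinations

def datasets_from_graph(num_vertices, edges):
    """For each pair {i, j} that is NOT an edge, introduce a point 'x_i_j' that
    user i labels 0 and user j labels 1, so exactly those two users conflict."""
    edge_set = {frozenset(e) for e in edges}
    datasets = {i: [] for i in range(1, num_vertices + 1)}
    for i, j in combinations(range(1, num_vertices + 1), 2):
        if frozenset((i, j)) not in edge_set:
            point = f"x_{i}_{j}"
            datasets[i].append((point, 0))
            datasets[j].append((point, 1))
    return datasets

def consistent(group, datasets):
    """True iff the users in `group` admit a single labeling of all their points."""
    labels = {}
    for i in group:
        for x, y in datasets[i]:
            if labels.setdefault(x, y) != y:
                return False
    return True

# Example: a 4-cycle 1-2-3-4; {1, 2} is a clique (consistent), {1, 3} is not.
ds = datasets_from_graph(4, [(1, 2), (2, 3), (3, 4), (4, 1)])
assert consistent({1, 2}, ds) and not consistent({1, 3}, ds)
```

On such an instance, finding a maximum group of mutually consistent users is exactly the maximum clique problem, which is the source of the hardness evidence cited above.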
References

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research (JAIR), 12(1):149-198, 2000.

Ben-David, S., Gehrke, J., and Schuller, R. A theoretical framework for learning from a pool of disparate data sources. In International Conference on Knowledge Discovery and Data Mining (KDD), pp. 443-449, 2002.

Ben-David, S., Schuller, R., et al. Exploiting task relatedness for multiple task learning. Lecture Notes in Computer Science (LNCS), pp. 567-580, 2003.

Blum, A., Haghtalab, N., Procaccia, A. D., and Qiao, M. Collaborative PAC learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2389-2398, 2017.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929-965, 1989.

Caruana, R. Multitask learning. Machine Learning, 28(1):41-75, 1997.

Charikar, M., Steinhardt, J., and Valiant, G. Learning from untrusted data. In Symposium on Theory of Computing (STOC), pp. 47-60, 2017.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), pp. 655-664, 2016.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning (ICML), pp. 999-1008, 2017.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Robustly learning a Gaussian: Getting optimal error, efficiently. In Symposium on Discrete Algorithms (SODA), pp. 2683-2702, 2018.

Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), pp. 665-674, 2016.

Valiant, L. G. A theory of the learnable. Communications of the ACM (CACM), 27(11):1134-1142, 1984.

Zuckerman, D. Linear degree extractors and the inapproximability of max clique and chromatic number. In Symposium on Theory of Computing (STOC), pp. 681-690, 2006.