# Collaborative Learning with Different Labeling Functions

Yuyang Deng*¹  Mingda Qiao*²

**Abstract.** We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions, while minimizing the number of samples drawn from them in total. Unlike in the usual collaborative learning setup, it is not assumed that there exists a single classifier that is simultaneously accurate for all distributions. We show that, when the data distributions satisfy a weaker realizability assumption, which appeared in (Crammer & Mansour, 2012) in the context of multi-task learning, sample-efficient learning is still feasible. We give a learning algorithm based on Empirical Risk Minimization (ERM) on a natural augmentation of the hypothesis class, and the analysis relies on an upper bound on the VC dimension of this augmented class. In terms of computational efficiency, we show that ERM on the augmented hypothesis class is NP-hard, which gives evidence against the existence of computationally efficient learners in general. On the positive side, for two special cases, we give learners that are both sample- and computationally efficient.

*Authors are listed in alphabetical order. ¹Pennsylvania State University, State College, PA, USA. ²University of California, Berkeley, Berkeley, CA, USA. Correspondence to: Yuyang Deng, Mingda Qiao.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

In recent years, the remarkable success of data-driven machine learning has transformed numerous domains using the vast and diverse datasets collected from the real world. An ever-increasing volume of decentralized data is generated on a multitude of distributed devices, such as smartphones and personal computers. To better utilize these distributed data shards, we are faced with a challenge: how can we effectively learn from these heterogeneous and noisy data sources?

Collaborative PAC Learning (Blum et al., 2017) is a theoretical framework that abstracts the challenge above. In this model, there are $n$ data distributions $D_1, D_2, \ldots, D_n$, from which we can adaptively sample. We are asked to learn $n$ classifiers $\hat f_1, \hat f_2, \ldots, \hat f_n$, such that each $\hat f_i$ has an error of at most $\epsilon$ on $D_i$. The goal is to minimize the number of labeled examples that we sample from the $n$ distributions in total. Note that if we ignore the potential connection among the $n$ learning tasks and solve them separately, the sample complexity is necessarily linear in $n$. Previously, Blum et al. (2017) introduced a sample-efficient algorithm for the case where all distributions admit the same labeling function, i.e., some classifier in the hypothesis class has a zero error on every $D_i$. Their algorithm has an $O((d + n)\log n)$ sample complexity, where $d$ is the VC dimension of the hypothesis class.¹ When $d$ is large, the overhead of the sample complexity is significantly reduced from $n$ to $\log n$.

However, in the real world, it is often too strong an assumption that every data distribution is consistent with the same ground truth classifier. This is especially true when we are learning for a diverse population consisting of multiple subgroups, each with different demographics and preferences. In light of this, we study a model of collaborative learning with different labeling functions.
In particular, we aim to determine the conditions under which sample-efficient learning is viable when the data from different sources are labeled differently, and to find the optimal sample complexity. The contributions of this work are summarized as follows; see Section 3 for formal statements of our results.

- We formalize a model of collaborative learning with different labeling functions, and identify a sufficient condition, termed $(k, \epsilon)$-realizability, for sample-efficient collaborative learning. This realizability assumption was used by Crammer & Mansour (2012) in the context of multi-task learning. Under this assumption, we give a learning algorithm with sample complexity $O(kd\log(n/k) + n\log n)$. This algorithm is based on Empirical Risk Minimization (ERM) over an augmentation of the hypothesis class.
- We show that the ERM problem over the augmented hypothesis class is NP-hard whenever $k \ge 3$, and NP-hard for a specific hypothesis class when $k = 2$. This rules out efficient learners based on ERM, as well as strongly proper learners that always output at most $k$ different classifiers in the hypothesis class.
- Finally, we identify two cases in which computationally efficient learning is possible. When all distributions share the same marginal distribution on $\mathcal{X}$, we give a simple polynomial-time algorithm that matches the information-theoretic bound. When the hypothesis class satisfies a 2-refutability assumption, we give a different algorithm based on approximate graph coloring, which outperforms the naïve approach with an $\Omega(nd)$ sample complexity.

¹ For brevity, we treat the accuracy and confidence parameters as constants here.

## 2. Problem Setup

We adopt the following standard model of binary classification. The hypothesis class $\mathcal{F} \subseteq \{0,1\}^{\mathcal{X}}$ is a family of binary functions over the instance space $\mathcal{X}$. A data distribution $D$ is a distribution over $\mathcal{X} \times \{0,1\}$. The population error of a function $f: \mathcal{X} \to \{0,1\}$ on data distribution $D$ is defined as
$$\operatorname{err}_D(f) := \Pr_{(x,y)\sim D}\left[f(x) \ne y\right].$$
A dataset is a multiset with elements in $\mathcal{X} \times \{0,1\}$. The training error of $f: \mathcal{X} \to \{0,1\}$ on dataset $S = \{(x_i, y_i)\}_{i\in[m]}$ is defined as
$$\operatorname{err}_S(f) := \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}\left\{f(x_i) \ne y_i\right\}.$$

The learning algorithm is given sample access to $n$ data distributions $D_1, D_2, \ldots, D_n$. At each step, the algorithm is allowed to choose one of the $n$ distributions (possibly depending on the previous samples) and draw a labeled example from it. The algorithm may terminate at any time and return $n$ functions $\hat f_1, \hat f_2, \ldots, \hat f_n$. The learning algorithm is $(\epsilon, \delta)$-PAC if it satisfies
$$\Pr\left[\operatorname{err}_{D_i}(\hat f_i) \le \epsilon,\ \forall i \in [n]\right] \ge 1 - \delta.$$
The sample complexity of the algorithm is the expected number of labeled examples sampled in total.

Note that the above is almost the same as the personalized setup (i.e., the algorithm may output different classifiers for different distributions) of the model of Blum et al. (2017), except that in their model it is additionally assumed that there exists a classifier in $\mathcal{F}$ with a zero error on every $D_i$.

## 3. Overview of Our Results

**A sufficient condition for sample-efficient learning.** We start by stating a sufficient condition for $n$ distributions to be learnable with a sample complexity that is (almost) linear in some parameter $k$ instead of in $n$.

**Definition 3.1** ($(k, \epsilon)$-Realizability). Distributions $D_1, D_2, \ldots, D_n$ are $(k, \epsilon)$-realizable with respect to hypothesis class $\mathcal{F}$ if there exist $f^*_1, f^*_2, \ldots, f^*_k \in \mathcal{F}$ such that $\min_{j\in[k]} \operatorname{err}_{D_i}(f^*_j) \le \epsilon$ holds for every $i \in [n]$.

In words, $(k, \epsilon)$-realizability states that we can find $k$ classifiers in $\mathcal{F}$ such that, on each of the $n$ distributions, at least one of the classifiers achieves a population error of at most $\epsilon$.
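To make the definition concrete, the following is a minimal Python sketch that checks $(k, \epsilon)$-realizability by brute force, assuming a finite hypothesis class and distributions represented as explicit probability tables; all names and the toy data are illustrative and not part of the paper's formal setup.

```python
# Illustrative brute-force check of (k, eps)-realizability for a *finite*
# hypothesis class and distributions given as explicit probability tables.
from itertools import combinations

def err(f, dist):
    """Population error of f on a distribution given as {(x, y): prob}."""
    return sum(p for (x, y), p in dist.items() if f(x) != y)

def is_k_eps_realizable(hypotheses, distributions, k, eps):
    """True iff some k hypotheses cover every distribution within error eps."""
    for chosen in combinations(hypotheses, k):
        if all(min(err(f, D) for f in chosen) <= eps for D in distributions):
            return True
    return False

# Toy example over X = {0, 1, 2}: two threshold-like labelers.
f1 = lambda x: int(x >= 1)
f2 = lambda x: int(x >= 2)
D1 = {(0, 0): 0.5, (1, 1): 0.5}   # labeled according to f1
D2 = {(1, 0): 0.5, (2, 1): 0.5}   # labeled according to f2
print(is_k_eps_realizable([f1, f2], [D1, D2], k=2, eps=0.0))  # True
print(is_k_eps_realizable([f1, f2], [D1, D2], k=1, eps=0.0))  # False
```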
Our first result is a general algorithm that efficiently learns under the $(k, \epsilon)$-realizability assumption.

**Theorem 3.2.** Suppose that $D_1, D_2, \ldots, D_n$ are $(k, \epsilon)$-realizable with respect to hypothesis class $\mathcal{F}$. For any $\delta > 0$, there is an $(8\epsilon, \delta)$-PAC algorithm with sample complexity
$$O\!\left(\frac{kd\log(n/k)\log(1/\epsilon) + n\log k\log(1/\epsilon) + n\log(n/\delta)}{\epsilon}\right).$$

Viewing $\epsilon$ and $\delta$ as constants, the sample complexity reduces to $O(kd\log(n/k) + n\log n)$. When $d$ is large, the overhead in the sample complexity is $k\log(n/k)$, which interpolates between the $O(\log n)$ overhead at $k = 1$ (shown by Blum et al. (2017)) and the $O(n)$ overhead at $k = n$ (where the $n$ learning tasks are essentially unrelated, and a linear overhead is unavoidable).

The factor $8$ in the PAC guarantee can be replaced by any fixed constant that is strictly greater than $1$, at the cost of a different hidden constant in the sample complexity. This follows from straightforward modifications to our proof.

While we prove Theorem 3.2 (as well as the other positive results in the paper) under the assumption that the learning algorithm is given the value of $k$, this assumption can be removed via a standard doubling trick: We consider a sequence of guesses on the value of $k$: $1 = k_1 < k_2 < k_3 < \cdots$, where each $k_{i+1}$ is the smallest value of $k$ such that the sample complexity bound is at least twice the bound for $k_i$. Then, we run the learning algorithm with $k$ set to $k_1, k_2 - 1, k_2, k_3 - 1, k_3, \ldots$ in order, and terminate the algorithm as soon as we are convinced that the actual $k$ is larger than the current guess. This procedure succeeds as soon as the guess exceeds the actual value of $k$, and the sample complexity only increases by a constant factor.

We prove Theorem 3.2 using the following natural augmentation of the instance space and the hypothesis class.

**Definition 3.3** ($(G, k)$-Augmentation). Let $\mathcal{F}$ be a hypothesis class over instance space $\mathcal{X}$. For a finite set $G$ and $k \in [|G|]$, the $(G, k)$-augmentation of $\mathcal{F}$ is the hypothesis class $\mathcal{F}_{G,k}$ over $\mathcal{X}' := G \times \mathcal{X}$ defined as
$$\mathcal{F}_{G,k} := \left\{g_{\mathbf{f},\mathbf{c}} : \mathbf{f} \in \mathcal{F}^k,\ \mathbf{c} \in [k]^G\right\},$$
where, for $\mathbf{f} = (f_1, f_2, \ldots, f_k)$ and $\mathbf{c} = (c_i)_{i\in G}$, $g_{\mathbf{f},\mathbf{c}}$ is the function that maps $(i, x) \in \mathcal{X}'$ to $f_{c_i}(x)$. When $G = [n]$ for some integer $n$, we use the shorthands $\mathcal{F}_{n,k}$ and "$(n, k)$-augmentation".

Definition 3.3 becomes more natural in light of the following observation. For each $i \in [n]$, let $D'_i$ be the distribution of $((i, x), y)$ when $(x, y)$ is drawn from $D_i$. Then, for any $\mathbf{f} \in \mathcal{F}^k$ and $\mathbf{c} \in [k]^n$, we have
$$\operatorname{err}_{D'_i}(g_{\mathbf{f},\mathbf{c}}) = \Pr_{((i,x),y)\sim D'_i}\left[g_{\mathbf{f},\mathbf{c}}(i, x) \ne y\right] = \Pr_{(x,y)\sim D_i}\left[f_{c_i}(x) \ne y\right] = \operatorname{err}_{D_i}(f_{c_i}).$$
In particular, when distributions $D_1, D_2, \ldots, D_n$ are $(k, \epsilon)$-realizable w.r.t. $\mathcal{F}$, by definition, there exist $k$ classifiers $f_1, f_2, \ldots, f_k \in \mathcal{F}$ and $n$ numbers $c_1, c_2, \ldots, c_n \in [k]$ such that $\operatorname{err}_{D_i}(f_{c_i}) \le \epsilon$. Then, the corresponding $g_{\mathbf{f},\mathbf{c}}$ has a population error of at most $\epsilon$ on every $D'_i$. This reduces the problem to an instance of collaborative learning on hypothesis class $\mathcal{F}_{n,k}$ and distributions $D'_1$ through $D'_n$, with a single unknown classifier that is simultaneously $\epsilon$-accurate for all $D'_i$ (i.e., the $(1, \epsilon)$-realizability assumption).

Our proof of Theorem 3.2 first upper bounds the VC dimension of $\mathcal{F}_{n,k}$ by a function of $n$, $k$, and the VC dimension of $\mathcal{F}$. Then, we adapt an algorithm of Blum et al. (2017) to achieve the sample complexity bound.
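As a quick illustration of Definition 3.3, the sketch below builds $g_{\mathbf{f},\mathbf{c}}$ from a tuple of base classifiers and an assignment $\mathbf{c}$; it is only meant to show the dispatching behavior of the augmented class (indices are 0-based, and the example functions are made up).

```python
# Sketch of Definition 3.3: g_{f,c} acts on pairs (i, x) by dispatching x to
# the base classifier whose index is c[i].

def augment(fs, c):
    """Return g_{f,c} : (i, x) -> f_{c[i]}(x)."""
    return lambda i, x: fs[c[i]](x)

# Example: n = 3 "distributions", k = 2 base classifiers.
fs = [lambda x: int(x > 0), lambda x: int(x > 5)]
c = {0: 0, 1: 0, 2: 1}      # groups 0 and 1 use fs[0]; group 2 uses fs[1]
g = augment(fs, c)
print(g(0, 3), g(2, 3))     # 1 0 -- the same x is labeled differently per group
```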
**A sample complexity lower bound.** Complementary to Theorem 3.2, our next result shows that the sample complexity can be lower bounded in terms of the sample complexity for the $(1, 0)$-realizable case.

**Theorem 3.4.** Let $m(n, d, \epsilon, \delta)$ denote the optimal sample complexity of $(\epsilon, \delta)$-learning on a hypothesis class of VC dimension $d$ and $n$ distributions that are $(1, 0)$-realizable. Then, under the $(k, 0)$-realizability assumption, the sample complexity is lower bounded by $\Omega(k)\cdot m(\lfloor n/k\rfloor, d, \epsilon, O(\delta/k))$.

Theorem 3.1 of Blum et al. (2017) bounds $m(n, d, \epsilon, \delta)$ by
$$O\!\left(\frac{\log n}{\epsilon}\big((d + n)\log(1/\epsilon) + n\log(n/\delta)\big)\right),$$
which contains a $(d\log n)/\epsilon$ term. Assuming that this term is unavoidable (i.e., $m(n, d, \epsilon, \delta) = \Omega((d\log n)/\epsilon)$), Theorem 3.4 gives a lower bound of $\Omega\!\left(\frac{kd\log(n/k)}{\epsilon}\right)$ for the $(k, 0)$-realizable case. In other words, the leading term of the sample complexity in Theorem 3.2 is necessary. Proving such an $\Omega(d\log n)$ lower bound, however, is still an open problem, even under the additional restriction that the learner must output the same function for all the $n$ distributions (see Problem 2 in the COLT 2023 open problem of Awasthi et al. (2023)).

**Intractability of ERM and proper learning.** A downside of Theorem 3.2 is that the learning algorithm might not be computationally efficient, even if there is a computationally efficient learner for $\mathcal{F}$ in the usual PAC learning setup. Concretely, our learning algorithm requires Empirical Risk Minimization (ERM) on $\mathcal{F}_{n,k}$, the $(n, k)$-augmentation of $\mathcal{F}$. The straightforward approach involves enumerating all partitions of $[n]$ into $k$ sets, which takes exponential time.² Our next result shows that this ERM problem generalizes certain intractable discrete optimization problems, and is unlikely to be efficiently solvable.

² There is a faster algorithm via dynamic programming, though its runtime is still $2^{\Omega(n)}$.

We first give a formal definition of an ERM oracle.

**Definition 3.5** (ERM Oracle). An ERM oracle for hypothesis class $\mathcal{F} \subseteq \{0,1\}^{\mathcal{X}}$ is an oracle that, given any dataset $S$, returns $f \in \arg\min_{f\in\mathcal{F}} \operatorname{err}_S(f)$.

To state the hardness result rigorously, we need to consider a parametrized family of hypothesis classes instead of a fixed one.

**Definition 3.6** (Regular Hypothesis Family). A regular hypothesis family is a family $\{(\mathcal{X}_d, \mathcal{F}_d)\}_{d\in\mathbb{N}}$ that satisfies the following for every $d$:
- $\mathcal{F}_d$ is a collection of binary functions over $\mathcal{X}_d$ with VC dimension at least $d$.
- There is an efficient algorithm that, given $d$, outputs $x_1, x_2, \ldots, x_d \in \mathcal{X}_d$ that are shattered by $\mathcal{F}_d$.

**Remark 3.7.** The first condition prevents the family from containing only simple classes with bounded VC dimensions. The second condition allows us to efficiently find witnesses for the VC dimension. Note that the second condition holds for natural hypothesis classes such as halfspaces and parity functions, the VC dimension of which can be lower bounded in a constructive way.

We will show that the following decision version of ERM is already hard for $\mathcal{F}_{n,k}$: instead of finding $f \in \arg\min_{f\in\mathcal{F}_{n,k}} \operatorname{err}_S(f)$, we are only required to decide whether $\min_{f\in\mathcal{F}_{n,k}} \operatorname{err}_S(f)$ is $0$ or not.

**Problem 1** (ERM over Augmented Classes). For a regular hypothesis family $\{(\mathcal{X}_d, \mathcal{F}_d)\}_{d\in\mathbb{N}}$, an instance of the ERM problem consists of parameters $(d, n, k)$ and $n$ datasets $S_1, S_2, \ldots, S_n \subseteq \mathcal{X}_d \times \{0,1\}$. The goal is to decide whether there exist classifiers $f_1, f_2, \ldots, f_k \in \mathcal{F}_d$ such that for every $i \in [n]$, $\min_{j\in[k]} \operatorname{err}_{S_i}(f_j) = 0$.
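The exponential-time enumeration mentioned above can be made concrete as follows: the sketch decides the zero-error version of ERM over $\mathcal{F}_{n,k}$ by trying every assignment $\mathbf{c} \in [k]^n$ and querying an ERM oracle for $\mathcal{F}$ on each merged part. Here `erm_oracle` is an assumed black box matching Definition 3.5, and the code is only meant to make the $k^n$ blow-up concrete, not to suggest a practical algorithm.

```python
# Brute-force decision version of ERM over the augmented class F_{n,k}:
# enumerate the assignment c in [k]^n of datasets to classifiers and check
# each part for zero training error via an ERM oracle for the base class F.
from itertools import product

def train_error(f, dataset):
    """Fraction of examples in `dataset` (a list of (x, y) pairs) that f mislabels."""
    return sum(f(x) != y for x, y in dataset) / len(dataset) if dataset else 0.0

def augmented_erm_has_zero_error(datasets, k, erm_oracle):
    """Decide whether min over F_{n,k} of the training error is 0 (Problem 1)."""
    n = len(datasets)
    for c in product(range(k), repeat=n):                  # k^n assignments
        parts = [[ex for i in range(n) if c[i] == j for ex in datasets[i]]
                 for j in range(k)]
        if all(train_error(erm_oracle(part), part) == 0.0 for part in parts):
            return True
    return False
```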
**Remark 3.8.** Problem 1 is equivalent to deciding whether there exists a classifier $f \in \mathcal{F}_{n,k}$ with a zero training error on the dataset $\{((i, x), y) : i \in [n],\ (x, y) \in S_i\}$. Therefore, if we could efficiently implement the ERM oracle for $\mathcal{F}_{n,k}$, we would be able to solve Problem 1 efficiently as well.

Now we are ready to state our intractability result.

**Theorem 3.9.** For any regular hypothesis family, ERM over augmented classes (Problem 1) is NP-hard for any $k \ge 3$. Furthermore, there exists a regular hypothesis family on which Problem 1 is polynomial-time solvable for $k = 1$ but NP-hard for $k = 2$.

One might argue that Theorem 3.9 only addresses the worst case, and does not exclude the possibility of efficiently implementing ERM (with high probability) over datasets that are randomly drawn. In Appendix D.3, we state and prove a distributional analogue of Theorem 3.9, which shows that it is also unlikely for an efficient (and possibly randomized) algorithm to succeed on randomly drawn samples.

Recall that a proper learner is one that always returns hypotheses in the hypothesis class. In our setup, we say that a learning algorithm is strongly proper if, when executed under the $(k, \epsilon)$-realizability assumption, it always outputs $n$ functions $\hat f_1, \ldots, \hat f_n \in \mathcal{F}$ such that $|\{\hat f_1, \ldots, \hat f_n\}| \le k$. Note that the $(k, \epsilon)$-realizability assumption implies that it is always possible to find accurate classifiers that satisfy this constraint. Unfortunately, our proof of Theorem 3.9 also implies that, unless RP = NP, no strongly proper learner can be computationally efficient in general.

**Efficient algorithms for special cases.** Despite the computational hardness in the general case, we identify two special cases in which computationally efficient learners exist, assuming an efficient ERM oracle for $\mathcal{F}$. The first case is when the $n$ data distributions share the same marginal over $\mathcal{X}$.

**Theorem 3.10.** Suppose that $D_1, D_2, \ldots, D_n$ are $(k, \epsilon)$-realizable and have the same marginal distribution on $\mathcal{X}$. Fix a constant $\alpha > 0$. For any $\delta > 0$, there is a $((3 + \alpha)\epsilon, \delta)$-PAC algorithm that runs in $\mathrm{poly}(n, k, 1/\epsilon, \log(1/\delta))$ time, makes at most $k$ calls to an ERM oracle for $\mathcal{F}$, and has a sample complexity of
$$O\!\left(\frac{kd\log(1/\epsilon) + n\log(n/\delta)}{\epsilon}\right).$$

Our algorithm for the theorem above follows a similar approach to the lifelong learning algorithms in (Balcan et al., 2015; Pentina & Urner, 2016).

Our next positive result applies to hypothesis classes that are 2-refutable, in the sense that whenever a dataset cannot be perfectly fit by $\mathcal{F}$, it contains two labeled examples that explain this inconsistency.

**Definition 3.11** (2-Refutability). A hypothesis class $\mathcal{F} \subseteq \{0,1\}^{\mathcal{X}}$ is 2-refutable if, for any dataset $S$ such that $\min_{f\in\mathcal{F}} \operatorname{err}_S(f) > 0$, there is $S' = \{(x_1, y_1), (x_2, y_2)\} \subseteq S$ such that $\min_{f\in\mathcal{F}} \operatorname{err}_{S'}(f) > 0$.

The following gives examples of natural hypothesis classes that are 2-refutable, and shows that 2-refutability is preserved under certain operations; a small code sketch of the first example follows the list.

**Example 3.12.** The following hypothesis classes are 2-refutable:
- $\mathcal{F} = \{0,1\}^{\mathcal{X}}$. Any dataset that cannot be perfectly fit by $\mathcal{F}$ must contain both $(x, 0)$ and $(x, 1)$ for some $x \in \mathcal{X}$.
- $\mathcal{F} = \{f: \mathcal{X} \to \{0,1\} : \sum_{x\in\mathcal{X}} f(x) \le 1\}$. Any dataset that cannot be perfectly fit by $\mathcal{F}$ must contain $(x_1, 1)$ and $(x_2, 1)$ for different $x_1, x_2 \in \mathcal{X}$.
- $\mathcal{F} = \{f \circ g : f \in \mathcal{F}'\}$, where $\mathcal{F}' \subseteq \{0,1\}^{\mathcal{X}'}$ is 2-refutable and $g: \mathcal{X} \to \mathcal{X}'$ is fixed.
- $\mathcal{F} = \{f \oplus g : f \in \mathcal{F}'\}$, where $\mathcal{F}' \subseteq \{0,1\}^{\mathcal{X}}$ is 2-refutable, $g: \mathcal{X} \to \{0,1\}$ is fixed, and $\oplus$ denotes pointwise XOR.
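For instance, for the first class above ($\mathcal{F} = \{0,1\}^{\mathcal{X}}$), a 2-refutation witness can be found by a single pass over the dataset, as in the following sketch (datasets are lists of $(x, y)$ pairs; the code is illustrative and not part of the paper's algorithms).

```python
# For F = {0,1}^X, a dataset is unrealizable exactly when it contains both
# (x, 0) and (x, 1) for some x, and that pair is a 2-refutation witness.

def refuting_pair_all_functions(dataset):
    """Return a conflicting pair {(x, 0), (x, 1)} if one exists, else None."""
    seen = {}                      # x -> first label observed for x
    for x, y in dataset:
        if x in seen and seen[x] != y:
            return ((x, seen[x]), (x, y))
        seen.setdefault(x, y)
    return None

print(refuting_pair_all_functions([(1, 0), (2, 1), (1, 1)]))  # ((1, 0), (1, 1))
print(refuting_pair_all_functions([(1, 0), (2, 1)]))          # None
```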
Assuming that the hypothesis class is 2-refutable and the data distributions are $(k, 0)$-realizable, ERM on the augmented class $\mathcal{F}_{n,k}$ reduces to graph coloring, in light of the following definition and simple lemma.

**Definition 3.13** (Conflict Graph). The conflict graph induced by datasets $S_1, S_2, \ldots, S_n$ and hypothesis class $\mathcal{F}$ is an undirected graph $G = ([n], E)$, where $\{i, j\} \in E$ if and only if $\min_{f\in\mathcal{F}} \operatorname{err}_{S_i\cup S_j}(f) > 0$.

**Lemma 3.14.** Let $\mathcal{F}$ be a 2-refutable hypothesis class, and let datasets $S_1, \ldots, S_n$ satisfy $\min_{f\in\mathcal{F}} \operatorname{err}_{S_i}(f) = 0$ for every $i \in [n]$. Let $V'$ be an independent set in the conflict graph induced by $S_1, S_2, \ldots, S_n$ and $\mathcal{F}$. Then, for $S' = \bigcup_{i\in V'} S_i$, it holds that $\min_{f\in\mathcal{F}} \operatorname{err}_{S'}(f) = 0$.

*Proof.* Suppose for a contradiction that $\min_{f\in\mathcal{F}} \operatorname{err}_{S'}(f)$ is non-zero. Since $\mathcal{F}$ is 2-refutable, there exist $i_1, i_2 \in V'$, $(x_1, y_1) \in S_{i_1}$ and $(x_2, y_2) \in S_{i_2}$ such that no classifier in $\mathcal{F}$ correctly labels both examples. If $i_1 = i_2$, this contradicts the assumption $\min_{f\in\mathcal{F}} \operatorname{err}_{S_{i_1}}(f) = 0$. If $i_1 \ne i_2$, then $i_1$ and $i_2$ must be neighbours in the conflict graph, which contradicts the independence of $V'$.

Assuming that the datasets are drawn from distributions that are $(k, 0)$-realizable, the induced conflict graph must be $k$-colorable. If we could find a valid $k$-coloring efficiently, each color would correspond to an independent set of the graph. By Lemma 3.14, we can call the ERM oracle for $\mathcal{F}$ to find a consistent function. Combining the functions for the $k$ different colors gives a solution to the ERM problem over the augmented class $\mathcal{F}_{n,k}$.

Unfortunately, graph coloring is NP-hard when $k \ge 3$. Nevertheless, there are efficient algorithms for approximate coloring, i.e., coloring a graph using few colors when the graph is promised to be $k$-colorable for some small $k$. The definition below, together with Theorem 3.16, gives a way of systematically translating an approximate coloring algorithm into an efficient algorithm for collaborative learning.

**Definition 3.15.** For $k \ge 3$, let $c^*_k \in (0, 1]$ denote any constant such that any $k$-colorable graph with $n$ vertices can be efficiently colored with $O(n^{c^*_k})$ colors.

A result of Karger, Motwani and Sudan (Karger et al., 1998) shows that we can take $c^*_k = 1 - \frac{3}{k+1} + \epsilon$ for any $\epsilon > 0$. For $k = 3$, a more recent breakthrough of Kawarabayashi & Thorup (2017) gives $c^*_3 = 0.19996$.

**Theorem 3.16.** Suppose that $D_1, D_2, \ldots, D_n$ are $(k, 0)$-realizable with respect to a 2-refutable hypothesis class $\mathcal{F}$. For any $\delta > 0$, there is an $(\epsilon, \delta)$-PAC algorithm that runs in $\mathrm{poly}(n, k, 1/\epsilon, \log(1/\delta))$ time, makes $\mathrm{poly}(n)$ calls to an ERM oracle for $\mathcal{F}$, and has a sample complexity of
$$O\!\left(\frac{\big(d\log(1/\epsilon) + n\big)\log n + n\log(1/\delta)}{\epsilon}\right) \ \text{if } k = 2, \qquad\text{and}\qquad O\!\left(\frac{\big(d\log(1/\epsilon) + n\big)\, n^{c^*_k} + n\log(1/\delta)}{\epsilon}\right) \ \text{if } k \ge 3.$$

Note that when $k = 2$, the sample complexity is as good as the one in Theorem 3.2. When $k \ge 3$, the overhead increases from $\log n$ to $\mathrm{poly}(n)$. Nevertheless, this overhead is still sub-linear in $n$ for any fixed $k$.
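To make the pipeline behind Theorem 3.16 concrete, here is a hedged Python sketch of the conflict graph of Definition 3.13 together with a simple greedy coloring; the greedy step is only a stand-in for the approximate coloring algorithms referenced in Definition 3.15, and `is_realizable` is an assumed black box deciding whether $\min_{f\in\mathcal{F}}\operatorname{err}_S(f) = 0$ (e.g., via an ERM oracle).

```python
# Conflict graph (Definition 3.13) plus a greedy coloring; each color class
# is an independent set, so by Lemma 3.14 its datasets can be merged and fed
# to a single ERM call for the base class F.

def conflict_graph(datasets, is_realizable):
    """Edges {i, j} such that S_i together with S_j cannot be perfectly fit."""
    n = len(datasets)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if not is_realizable(datasets[i] + datasets[j]):
                edges.add((i, j))
    return edges

def greedy_coloring(n, edges):
    """Assign each vertex the smallest color unused by its neighbors."""
    neighbors = {v: set() for v in range(n)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    color = {}
    for v in range(n):
        taken = {color[u] for u in neighbors[v] if u in color}
        color[v] = next(c for c in range(n) if c not in taken)
    return color  # vertices sharing a color form an independent set
```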
## 4. Related Work

Most closely related to our work are the previous studies of Collaborative PAC Learning (Blum et al., 2017; Nguyen & Zakynthinou, 2018; Chen et al., 2018; Qiao, 2018) and related fields such as multi-task learning (Hanneke & Kpotufe, 2022), multi-distribution learning (Haghtalab et al., 2022; Awasthi et al., 2023; Peng, 2023; Zhang et al., 2023), federated learning (McMahan et al., 2017; Mohri et al., 2019; Cheng et al., 2023) and multi-source domain adaptation (Mansour et al., 2008; Konstantinov & Lampert, 2019; Mansour et al., 2021).

In multi-task learning and multi-source domain adaptation, there are $n$ distributions, each with a fixed number of samples, and the goal is to use these samples to learn a hypothesis that has a small risk on some target distribution. A line of works (Ben-David et al., 2010; Konstantinov & Lampert, 2019; Mansour et al., 2021) studied the generalization risk in this scenario, but their bounds all depend on the discrepancy among the $n$ distributions and contain non-vanishing residual constants. To avoid this residual constant, Hanneke & Kpotufe (2022) considered a Bernstein condition on the hypothesis class and certain transferability assumptions between the $n$ source distributions and the target distribution. They studied the minimax rate of this learning scenario and gave a nearly-optimal adaptation algorithm. In particular, some works studied multi-task linear regression (Yang et al., 2020; Huang et al., 2023). Yang et al. (2020) considered linear regression from multiple distributions, where all tasks share the same input feature covariate $X \in \mathbb{R}^{m\times d}$ but have different labeling functions. They designed the Hard Parameter Sharing estimator and established an excess risk upper bound of $O\!\big(\frac{d\sigma^2}{m} + \text{heterogeneity}\big)$. Recently, Huang et al. (2023) proposed the $s$-sparse heterogeneity assumption on the labeling functions in multi-task linear regression, and designed an algorithm that achieves an $O\!\big(\frac{s\sigma^2}{m_i} + \frac{d\sigma^2}{\sum_{i=1}^n m_i}\big)$ excess risk on the $i$-th task. Notice that when $s$ is smaller than $d$, this bound is strictly sharper than the individual learning bound $O\!\big(\frac{d\sigma^2}{m_i}\big)$. The difference between our scenario and multi-task learning is that we allow the learner to draw an arbitrary number of samples from each distribution, instead of assuming that each distribution only provides a fixed number of samples.

Federated learning is another relevant learning scenario, in which $n$ players, each with their own underlying distribution and a fixed number of samples drawn from it, aim to learn model(s) that have a small risk on every player's distribution. A line of works aims at studying its statistical properties (Chen et al., 2023; Cheng et al., 2023; Mohri et al., 2019). Chen et al. (2023) studied the minimax risk of federated learning in the logistic regression setting, and showed that the minimax risk is controlled by the heterogeneity among the $n$ distributions and their labeling functions. Cheng et al. (2023) studied the risk bound of federated learning in the linear regression setting, in an asymptotic fashion where the dimension of the model goes to infinity. Similar to multi-task learning, federated learning also assumes that each player only has a fixed number of samples, and the analysis does not give a PAC learning bound.

Multi-distribution learning was recently proposed by Haghtalab et al. (2022), where $n$ players try to learn a single model $\hat f$ that has an $\epsilon$ excess error on the worst-case distribution among the $n$ players, i.e., $\max_{i\in[n]} \operatorname{err}_{D_i}(\hat f) \le \epsilon + \mathrm{OPT}$. They gave an algorithm with sample complexity $O\!\big(\frac{d\log n}{\epsilon^2} + \frac{nd\log(d/\epsilon)}{\epsilon}\big)$, and proved a lower bound of $\Omega\!\big(\frac{d+n}{\epsilon^2}\big)$. Note that this is a more pessimistic learning guarantee than ours, since the value $\mathrm{OPT}$ can be very large. Very recently, two concurrent papers (Peng, 2023; Zhang et al., 2023) gave algorithms that match this lower bound, resolving some of the open problems formulated in (Awasthi et al., 2023).
We defer the discussion of other related work in the theoretical computer science literature, including mixture learning from batches, the computational hardness of learning, and approximate coloring, to Appendix A.

## 5. Sample Complexity Upper Bound

The key step in our proof of Theorem 3.2 is the following upper bound on the VC dimension of $\mathcal{F}_{n,k}$.

**Lemma 5.1.** For any $n \ge k \ge 1$ and hypothesis class $\mathcal{F}$ of VC dimension $d$, the VC dimension of $\mathcal{F}_{n,k}$ is at most $O(kd + n\log k)$.

Lemma 5.1 implies that, in general, $\mathcal{F}_{G,k}$ has a VC dimension of $O(kd + |G|\log k)$. To gain some intuition behind the bound in Lemma 5.1, suppose that $\mathcal{F}$ is a finite class of size $2^d$. By Definition 3.3, the size of $\mathcal{F}_{n,k}$ is at most $|\mathcal{F}|^k \cdot k^n = 2^{kd + n\log_2 k}$, and the bound in the lemma immediately follows. The actual proof, of course, needs to deal with the case that $\mathcal{F}$ is larger or even infinite.

The lemma improves a previous result of Crammer & Mansour (2012), which upper bounds the VC dimension of $\mathcal{F}_{n,k}$ (called the class of "hard $k$-shared task classifiers") by $O(kd\log(nkd) + n\log k)$. Note that this bound can be looser than the one in Lemma 5.1 by a logarithmic factor.

*Proof of Lemma 5.1.* We will show that, for some integer $m \ge kd$ to be chosen later, no $m$ instances in $\mathcal{X}' = [n]\times\mathcal{X}$ can be shattered by $\mathcal{F}_{n,k}$. This upper bounds the VC dimension of $\mathcal{F}_{n,k}$ by $m - 1$.

Fix a set $S$ of $m$ elements in $\mathcal{X}'$. To bound the number of ways in which $S$ can be labeled by $\mathcal{F}_{n,k}$, we also fix $\mathbf{c}^* \in [k]^n$ and focus on the classifiers in $\mathcal{F}_{n,k}$ associated with $\mathbf{c}^*$. Note that $\mathbf{c}^*$ naturally partitions $S$ into $S_1 \cup S_2 \cup \cdots \cup S_k$, where $S_j := \{(i, x) \in S : c^*_i = j\}$. Furthermore, let $X_j := \{x \in \mathcal{X} : (i, x) \in S_j \text{ for some } i \in [n]\}$ be the projection of $S_j$ to $\mathcal{X}$. Note that we have $\sum_{j=1}^k |S_j| = |S| = m$.

Let $\Phi(\cdot)$ be the growth function of hypothesis class $\mathcal{F}$. Then, for each fixed $\mathbf{c}^* \in [k]^n$, the number of ways in which $S$ can be labeled by classifiers in $\{g_{\mathbf{f},\mathbf{c}^*} : \mathbf{f} \in \mathcal{F}^k\} \subseteq \mathcal{F}_{n,k}$ is at most $\prod_{j=1}^k \Phi(|X_j|)$. By the Sauer–Shelah–Perles lemma, we have the upper bound
$$\Phi(m) \le \bar\Phi(m) := \begin{cases} e^m, & m \le d,\\ \left(\frac{em}{d}\right)^d, & m > d.\end{cases}$$
It can be verified that the function $m \mapsto \ln\bar\Phi(m)$ is monotone increasing and concave on $[0, +\infty)$. It then follows from $\sum_{j=1}^k |X_j| \le m$ that
$$\sum_{j=1}^k \ln\bar\Phi(|X_j|) \;\le\; k\,\ln\bar\Phi\!\left(\frac{1}{k}\sum_{j=1}^k |X_j|\right) \ \text{(concavity)} \;\le\; k\,\ln\bar\Phi\!\left(\frac{m}{k}\right) \ \text{(monotonicity)} \;=\; kd\,\ln\frac{em}{kd}. \quad (m \ge kd)$$
Then, summing over the $k^n$ different choices of $\mathbf{c}^*$, the logarithm of the growth function of $\mathcal{F}_{n,k}$ at $m$ is at most
$$n\ln k + kd\ln\frac{em}{kd}.$$
For some sufficiently large $m = O(n\log k + kd)$, the above is strictly smaller than $\ln(2^m) = m\ln 2$, which means that the $m$ points in $S$ cannot be shattered by $\mathcal{F}_{n,k}$.

Given Lemma 5.1, Theorem 3.2 essentially follows from the learning algorithm of Blum et al. (2017) for the personalized setup, with slight modifications. For completeness, we state the algorithm and prove its correctness in Appendix B.

## 6. Evidence of Intractability

In this section, we sketch the proof of Theorem 3.9.

### 6.1. Reduction from Graph Coloring

We prove the first part (the $k \ge 3$ case) by a reduction from graph $k$-coloring, which is well known to be NP-hard.

*Proof sketch of Theorem 3.9 (the first part).* For every $k$-coloring instance $G = (V, E)$, we construct an instance of Problem 1 with parameters $k$ and $d = n = |V|$. Without loss of generality, we assume $V = [n]$. By Definition 3.6, we can efficiently find $n$ instances $x_1, x_2, \ldots, x_n \in \mathcal{X}_d$ that are shattered by $\mathcal{F}_d$. For each $i \in [n]$, we define the $i$-th dataset as
$$S_i := \{(x_i, 1)\} \cup \{(x_j, 0) : \{i, j\} \in E\}.$$
It can be shown that $G$ has a $k$-coloring if and only if the ERM instance is feasible, i.e., the $n$ datasets can be perfectly fit by $k$ classifiers from $\mathcal{F}_d$.
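The construction of the datasets $S_i$ from the graph is mechanical; the sketch below spells it out, representing the shattered points simply by their (0-based) indices. This is illustrative code, not tied to any concrete hypothesis family.

```python
# Build the reduction's datasets: vertex i gets {(x_i, 1)} plus {(x_j, 0)}
# for every neighbor j.  Shattered points are represented by their indices.

def coloring_to_erm_datasets(n, edges):
    """Return the datasets S_1, ..., S_n used in the reduction (0-indexed)."""
    neighbors = {i: set() for i in range(n)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return [[(i, 1)] + [(j, 0) for j in sorted(neighbors[i])] for i in range(n)]

# Triangle: it needs 3 colors, so no 2 classifiers can fit all three datasets.
print(coloring_to_erm_datasets(3, [(0, 1), (1, 2), (0, 2)]))
# [[(0, 1), (1, 0), (2, 0)], [(1, 1), (0, 0), (2, 0)], [(2, 1), (0, 0), (1, 0)]]
```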
This reduces $k$-coloring to Problem 1 and proves the first part of Theorem 3.9.

### 6.2. Hardness under Two Labeling Functions

Next, we deal with the $k = 2$ case. Since 2-coloring can be solved efficiently, we must reduce from a different NP-hard problem. Intuitively, we want the problem to correspond to partitioning a set into $k = 2$ parts. This motivates our reduction from a version of subset sum.

*Proof sketch of Theorem 3.9 (the second part).* For each integer $d \ge 1$, we consider the instance space $\mathcal{X}_d := [d] \times \{0, 1, \ldots, 2^d\}$ and the following hypothesis class:
$$\mathcal{F}_d := \left\{f_\theta : \theta \in \{0, 1, \ldots, 2^d\}^d,\ \sum_{i=1}^d \theta_i \le 2^d\right\},$$
where $f_\theta$ is defined as $f_\theta(i, j) = \mathbb{1}\{j \le \theta_i\}$. In other words, each $f_\theta \in \mathcal{F}_d$ can be viewed as a direct product of $d$ threshold functions, subject to the constraint that the thresholds sum up to at most $2^d$. It can be verified that $\{(\mathcal{X}_d, \mathcal{F}_d)\}_{d\in\mathbb{N}}$ is a regular hypothesis family, and that the $k = 1$ case of Problem 1 is easy.

Now, we consider a specific choice of the datasets: for each $i \in [n]$, the $i$-th dataset contains exactly one data point of the form $(i, a_i)$ with label $1$. The key observation is that the ERM problem for $k = 2$ is equivalent to deciding whether $\{a_1, \ldots, a_n\}$ can be partitioned into two parts, each with sum at most $2^d$. This problem can be shown to be NP-hard by a reduction from the standard subset sum problem.

## 7. Efficient Learning via Approximate Coloring

In this section, we prove Theorem 3.16, which gives efficient learning algorithms when the hypothesis class is 2-refutable.

### 7.1. The Learning Algorithm

The two cases are proved via a common strategy. In each iteration, we carefully choose a parameter $m$ and draw $m$ samples from each distribution. We use an approximate coloring algorithm to color the conflict graph (Definition 3.13) induced by the datasets. For each color that is used by considerably many vertices, we combine the corresponding datasets and fit a classifier to this joint dataset. The key is to argue that this classifier must be accurate for many distributions. Finally, we repeat the above on the distributions that have not yet received an accurate classifier.

We formally define a meta-algorithm in Algorithm 1. In the $r$-th iteration of the while-loop, we draw $m^{(r)}$ samples from each of the remaining distributions in $G^{(r)}$. We then build the conflict graph based on these datasets, and compute a $\gamma^{(r)}$-coloring of the graph. The vertices that receive color $i$ are denoted by $G_i$, and $\hat g_i$ is chosen as an arbitrary classifier in $\mathcal{F}$ that is consistent with $S_v$ for every $v \in G_i$. This choice is always possible by Lemma 3.14. The algorithm is under-specified in three aspects: the number of samples $m^{(r)}$, the number of colors $\gamma^{(r)}$, as well as the algorithm for computing a $\gamma^{(r)}$-coloring. We will specify these choices when we prove Theorem 3.16 later.

**Algorithm 1** Collaborative Learning via Approximate Coloring
1: **Input:** 2-refutable hypothesis class $\mathcal{F}$. Sample access to $D_1, \ldots, D_n$. Parameters $k, \epsilon, \delta, c$.
2: **Output:** Hypotheses $\hat f_1, \hat f_2, \ldots, \hat f_n$.
3: $r \gets 1$; $G^{(1)} \gets [n]$;
4: **while** $G^{(r)} \ne \emptyset$ **do**
5:   $\delta^{(r)} \gets \delta/r^2$;
6:   Set parameters $m^{(r)}$ and $\gamma^{(r)}$ according to $|G^{(r)}|, d, \epsilon, \delta^{(r)}$;
7:   **for** $i \in G^{(r)}$ **do**
8:     Draw $m^{(r)}$ samples from $D_i$ to form $S_i$;
9:   **end for**
10:   $(G^{(r)}, E) \gets$ conflict graph of $\{S_i : i \in G^{(r)}\}$;
11:   Compute a $\gamma^{(r)}$-coloring of $(G^{(r)}, E)$; let $G_i \subseteq G^{(r)}$ denote the set of vertices with color $i$;
12:   $G^{(r+1)} \gets G^{(r)}$;
13:   **for** $i \in [\gamma^{(r)}]$ such that $|G_i| \ge |G^{(r)}|/(2\gamma^{(r)})$ **do**
14:     Find $\hat g_i \in \mathcal{F}$ such that $\operatorname{err}_{S_v}(\hat g_i) = 0$ for all $v \in G_i$;
15:     **for** $v \in G_i$ **do**
16:       Draw $c\cdot\frac{\ln(|G^{(r)}|/\delta^{(r)})}{\epsilon}$ samples from $D_v$ to form $S_v$;
17:       **if** $\operatorname{err}_{S_v}(\hat g_i) \le \epsilon/2$ **then**
18:         $\hat f_v \gets \hat g_i$;
19:         $G^{(r+1)} \gets G^{(r+1)} \setminus \{v\}$;
20:       **end if**
21:     **end for**
22:   **end for**
23:   $r \gets r + 1$;
24: **end while**
25: **Return** $\hat f_1, \hat f_2, \ldots, \hat f_n$;

### 7.2. A Key Technical Lemma

We state and prove a key technical lemma in the analysis of Algorithm 1. The lemma states that whenever we find an approximate coloring of the conflict graph, the coloring can guide us to cluster the datasets and learn classifiers that are accurate on average. The remainder of the proof is relatively standard and thus deferred to Appendix F.

**Lemma 7.1.** There is a universal constant $c > 0$ such that the following holds. In the $r$-th iteration of the while-loop, if $m^{(r)}$ is at least
$$\frac{c\,\gamma^{(r)}}{|G^{(r)}|}\cdot\frac{d\ln(1/\epsilon) + |G^{(r)}| + \ln(1/\delta^{(r)})}{\epsilon} + c\,\ln\frac{|G^{(r)}|}{\delta^{(r)}},$$
it holds with probability $1 - \delta^{(r)}/6$ that, for every $i \in [\gamma^{(r)}]$ such that $|G_i| \ge |G^{(r)}|/(2\gamma^{(r)})$,
$$\frac{1}{|G_i|}\sum_{v\in G_i} \operatorname{err}_{D_v}(\hat g_i) \le \epsilon/8.$$

The proof is based on techniques similar to the analysis of (Qiao, 2018) for a different variant of collaborative learning, in which a small fraction of the data sources are adversarial.

*Proof.* For brevity, we omit the superscripts in $m^{(r)}$, $\delta^{(r)}$ and $\gamma^{(r)}$. Let $M := \frac{m|G^{(r)}|}{4\gamma}$. Consider the following thought experiment: We draw $M$ independent samples $z^{(i)}_1, z^{(i)}_2, \ldots, z^{(i)}_M$ from each $D_i$. (Recall that in Algorithm 1, only $m \le M$ data points are actually drawn to form the dataset $S_i$.) Independently, for each non-empty $U \subseteq G^{(r)}$, we choose a sequence $A^{(U)} \in U^M$ uniformly at random. We may then consider the fictitious dataset $S^{(U)} = \{S^{(U)}_1, \ldots, S^{(U)}_M\}$ defined as
$$S^{(U)}_i := z^{(j)}_k, \quad\text{where } j = A^{(U)}_i,\ k = \sum_{l=1}^{i} \mathbb{1}\left\{A^{(U)}_l = j\right\}.$$
In words, for each $i \in U$, if entry $i$ appears $t$ times in the sequence $A^{(U)}$, then $S^{(U)}$ contains the first $t$ data points collected from $D_i$ (namely, $z^{(i)}_1, \ldots, z^{(i)}_t$). It can be easily verified that, over the randomness in all $z^{(i)}$ and $A^{(U)}$, each fictitious dataset $S^{(U)}$ is identically distributed as $M$ samples from the uniform mixture $D_U := \frac{1}{|U|}\sum_{i\in U} D_i$.

The rest of the proof consists of two parts: First, we show that with high probability, every $S^{(U)}$ is representative for distribution $D_U$. Formally, any classifier in $\mathcal{F}$ with a zero training error on $S^{(U)}$ must have an $O(\epsilon)$ population error on $D_U$. Then, we show that for each color $i \in [\gamma]$, the actual datasets (each of size $m$) collected from the distributions with color $i$ can simulate the fictitious dataset $S^{(G_i)}$.

**Step 1: Fictitious datasets are representative.** For each fixed non-empty $U \subseteq G^{(r)}$, Theorems 28.3 and 28.4 in (Shalev-Shwartz & Ben-David, 2014) imply that for some universal constant $c > 0$,
$$\Pr\left[\forall f\in\mathcal{F},\ \operatorname{err}_{S^{(U)}}(f) = 0 \implies \operatorname{err}_{D_U}(f) \le \epsilon/8\right]$$
is at least $1 - (8/\epsilon)^d\, e^{-\epsilon M/c}$. For $M \ge c\cdot\frac{d\ln(8/\epsilon) + |G^{(r)}|\ln 2 + \ln(12/\delta)}{\epsilon}$, the right-hand side above is at least $1 - \frac{\delta}{12\cdot 2^{|G^{(r)}|}}$. Then, a union bound over the $2^{|G^{(r)}|} - 1$ choices of $U$ shows that, with probability at least $1 - \delta/12$, for all non-empty $U \subseteq G^{(r)}$, any classifier in $\mathcal{F}$ that is consistent with $S^{(U)}$ has an error of at most $\epsilon/8$ on distribution $D_U$.

**Step 2: Fictitious datasets can be simulated.** Fix $i \in [\gamma]$ such that $|G_i| \ge |G^{(r)}|/(2\gamma)$. Recall that the classifier $\hat g_i$ has a zero training error on $T_i := \bigcup_{v\in G_i} S_v$. In the first step, we showed that any $f \in \mathcal{F}$ that achieves a zero training error on $S^{(G_i)}$ must have a small population error on $D_{G_i}$.
Thus, it suffices to argue that $T_i \supseteq S^{(G_i)}$ with high probability.

Recall that we computed the coloring solely based on the datasets, which are independent of the indices $A^{(U)}$. Therefore, conditioning on the realization of $G_1, G_2, \ldots, G_\gamma$, each $A^{(G_i)}$ is still uniformly distributed over $G_i^M$. In particular, for every $i \in [\gamma]$ and $v \in G_i$, the number of times $v$ appears in $A^{(G_i)}$, denoted by $n_{i,v}$, follows the binomial distribution $\mathrm{Binomial}(M, 1/|G_i|)$. As long as $n_{i,v} \le m$ for every $(i, v)$ pair, each $T_i$ (which contains the first $m$ data points from each $D_v$) will be a superset of $S^{(G_i)}$ (which contains the first $n_{i,v}$ data points from $D_v$).

There are at most $|G^{(r)}|$ such $(i, v)$ pairs. The probability for each pair to violate the condition is at most
$$\Pr_{X\sim\mathrm{Binomial}(M,\,1/|G_i|)}\left[X > m\right] = \Pr_{X\sim\mathrm{Binomial}(M,\,1/|G_i|)}\left[X > \tfrac{4\gamma M}{|G^{(r)}|}\right] \quad (M = m|G^{(r)}|/(4\gamma))$$
$$\le \Pr_{X\sim\mathrm{Binomial}(M,\,2\gamma/|G^{(r)}|)}\left[X > \tfrac{4\gamma M}{|G^{(r)}|}\right]. \quad (|G_i| \ge |G^{(r)}|/(2\gamma))$$
By a Chernoff bound, the last expression is at most $\exp\left(-\frac{2M\gamma}{3|G^{(r)}|}\right)$, which can be made smaller than $\frac{\delta}{12|G^{(r)}|}$ since $M \ge c\cdot\frac{|G^{(r)}|}{4\gamma}\ln\frac{|G^{(r)}|}{\delta}$ for sufficiently large $c$. By a union bound, the aforementioned condition holds for all $(i, v)$ pairs with probability at least $1 - \delta/12$.

Finally, the lemma follows from the two steps above and another union bound.

## 8. Discussion on Open Problems

**Tighter sample complexity bounds.** The most obvious open problem is to either improve the $kd\log(n/k)/\epsilon$ term in the sample complexity bound of Theorem 3.2 or prove a matching lower bound. In light of Theorem 3.4, it is sufficient to prove a lower bound of $\Omega((d\log n)/\epsilon)$ for the personalized setup of collaborative learning (i.e., the $(1, 0)$-realizable case). Conversely, any improvement on this term implies a better algorithm for the $(1, 0)$-realizable case.

**A stronger hardness result.** The NP-hardness is proved either for the ERM problem (in Theorem 3.9), or against learners that are strongly proper in the sense that they always return at most $k$ different classifiers (recall the discussion in Section 3). Our result does not rule out efficient learners that are neither ERM-based nor strongly proper.³ Can we prove the intractability of sample-efficient learning directly, at least for specific hypothesis classes?

³ In fact, the distributions that we constructed in the proof of Theorem 3.4 can be easily learned by an improper algorithm.

**Conflict graphs with bounded degrees.** Our Theorem 3.16 gives a computationally efficient learner based on approximate coloring. It is also known that $k$-colorable graphs with maximum degree bounded by $\Delta$ can be colored with a smaller number of colors (e.g., $O(\Delta^{1/3})$ colors when $k = 3$ (Karger et al., 1998)). It is interesting to identify natural assumptions on the data distributions that ensure this small-degree property of the conflict graph, and to explore whether that leads to a lower sample complexity.

**Efficient learner for concrete hypothesis classes.** Even when $\mathcal{F}$ is simply the class of all binary functions on an instance space of size $d$ and the data distributions are $(k, 0)$-realizable for $k = 3$, we do not have a computationally efficient learner that achieves the information-theoretic bound in Theorem 3.2. For this setup, since $\mathcal{F}$ is 2-refutable, Theorem 3.16 gives an algorithm with sample complexity of roughly $d\cdot n^{0.19996}/\epsilon$. Can we improve the overhead from $\mathrm{poly}(n)$ to $\mathrm{polylog}(n)$ via an efficient learner?

## Acknowledgements

We thank the anonymous reviewers of ICML for pointers to related work and suggestions that helped improve the presentation.

## Impact Statement

This paper presents theoretical work which refines an existing framework of machine learning.
While there might be potential societal consequences of our work, none of them is sufficiently concrete and imminent to be specifically highlighted here.

## References

Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Arora, S., Chlamtac, E., and Charikar, M. New approximation guarantee for chromatic number. In Symposium on Theory of Computing (STOC), pp. 215–224, 2006.

Awasthi, P., Haghtalab, N., and Zhao, E. Open problem: The sample complexity of multi-distribution learning for VC classes. In Conference on Learning Theory (COLT), pp. 5943–5949, 2023.

Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Conference on Learning Theory (COLT), pp. 191–210, 2015.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.

Berger, B. and Rompel, J. A better performance guarantee for approximate graph coloring. Algorithmica, 5(1-4):459–466, 1990.

Blum, A. New approximation algorithms for graph coloring. Journal of the ACM (JACM), 41(3):470–516, 1994.

Blum, A. and Karger, D. An O(n^{3/14})-coloring algorithm for 3-colorable graphs. Information Processing Letters, 61(1):49–53, 1997.

Blum, A., Haghtalab, N., Procaccia, A. D., and Qiao, M. Collaborative PAC learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2389–2398, 2017.

Chen, J., Zhang, Q., and Zhou, Y. Tight bounds for collaborative PAC learning via multiplicative weights. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3602–3611, 2018.

Chen, S., Zheng, Q., Long, Q., and Su, W. J. Minimax estimation for personalized federated learning: An alternative between FedAvg and local training? Journal of Machine Learning Research, 24(262):1–59, 2023. URL http://jmlr.org/papers/v24/21-0224.html.

Cheng, G., Chadha, K., and Duchi, J. Federated asymptotics: a model to compare federated learning algorithms. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 10650–10689, 2023.

Chlamtac, E. Approximation algorithms using hierarchies of semidefinite programming relaxations. In Foundations of Computer Science (FOCS), pp. 691–701, 2007.

Crammer, K. and Mansour, Y. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems (NIPS), pp. 1475–1483, 2012.

Das, A., Jain, A., Kong, W., and Sen, R. Efficient list-decodable regression using batches. In International Conference on Machine Learning (ICML), pp. 7025–7065, 2023.

Diakonikolas, I., Kane, D., Manurangsi, P., and Ren, L. Cryptographic hardness of learning halfspaces with Massart noise. Advances in Neural Information Processing Systems (NeurIPS), 35:3624–3636, 2022.

Diakonikolas, I., Kane, D., and Ren, L. Near-optimal cryptographic hardness of agnostically learning halfspaces and ReLU regression under Gaussian marginals. In International Conference on Machine Learning (ICML), pp. 7922–7938, 2023.

Feldman, V. Optimal hardness results for maximizing agreements with monomials. In Conference on Computational Complexity (CCC), pp. 226–236, 2006.

Feldman, V., Gopalan, P., Khot, S., and Ponnuswami, A. K. New results for learning noisy parities and halfspaces. In Foundations of Computer Science (FOCS), pp. 563–574, 2006.
Guruswami, V. and Raghavendra, P. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

Haghtalab, N., Jordan, M., and Zhao, E. On-demand sampling: Learning optimally from multiple distributions. In Advances in Neural Information Processing Systems (NeurIPS), pp. 406–419, 2022.

Hanneke, S. and Kpotufe, S. A no-free-lunch theorem for multitask learning. The Annals of Statistics, 50(6):3119–3143, 2022.

Huang, X., Xu, K., Lee, D., Hassani, H., Bastani, H., and Dobriban, E. Optimal heterogeneous collaborative linear regression and contextual bandits. arXiv preprint arXiv:2306.06291, 2023.

Jain, A., Sen, R., Kong, W., Das, A., and Orlitsky, A. Linear regression using heterogeneous data batches. arXiv preprint arXiv:2309.01973, 2023.

Karger, D., Motwani, R., and Sudan, M. Approximate graph coloring by semidefinite programming. Journal of the ACM (JACM), 45(2):246–265, 1998.

Kawarabayashi, K.-I. and Thorup, M. Coloring 3-colorable graphs with less than n^{1/5} colors. Journal of the ACM (JACM), 64(1):1–23, 2017.

Kearns, M. J., Schapire, R. E., and Sellie, L. M. Toward efficient agnostic learning. In Annual Workshop on Computational Learning Theory (COLT), pp. 341–352, 1992.

Koch, C., Strassle, C., and Tan, L.-Y. Properly learning decision trees with queries is NP-hard. In Foundations of Computer Science (FOCS), pp. 2383–2407, 2023a.

Koch, C., Strassle, C., and Tan, L.-Y. Superpolynomial lower bounds for decision tree learning and testing. In Symposium on Discrete Algorithms (SODA), pp. 1962–1994, 2023b.

Kong, W., Somani, R., Kakade, S., and Oh, S. Robust meta-learning for mixed linear regression with small batches. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4683–4696, 2020a.

Kong, W., Somani, R., Song, Z., Kakade, S., and Oh, S. Meta-learning for mixed linear regression. In International Conference on Machine Learning (ICML), pp. 5394–5404, 2020b.

Konstantinov, N. and Lampert, C. Robust learning from untrusted sources. In International Conference on Machine Learning (ICML), pp. 3488–3498, 2019.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems (NIPS), pp. 1041–1048, 2008.

Mansour, Y., Mohri, M., Ro, J., Suresh, A. T., and Wu, K. A theory of multiple-source adaptation with limited target labeled data. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 2332–2340, 2021.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1273–1282, 2017.

Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. In International Conference on Machine Learning (ICML), pp. 4615–4625, 2019.

Nguyen, H. and Zakynthinou, L. Improved algorithms for collaborative PAC learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7631–7639, 2018.

Peng, B. The sample complexity of multi-distribution learning. arXiv preprint arXiv:2312.04027, 2023.

Pentina, A. and Urner, R. Lifelong learning with weighted majority votes. In Advances in Neural Information Processing Systems (NIPS), pp. 3619–3627, 2016.

Qiao, M. Do outliers ruin collaboration? In International Conference on Machine Learning (ICML), pp. 4180–4187, 2018.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Wigderson, A. Improving the performance guarantee for approximate graph coloring. Journal of the ACM (JACM), 30(4):729–735, 1983.

Yang, F., Zhang, H. R., Wu, S., Ré, C., and Su, W. J. Precise high-dimensional asymptotics for quantifying heterogeneous transfers. arXiv preprint arXiv:2010.11750, 2020.

Zhang, Z., Zhan, W., Chen, Y., Du, S. S., and Lee, J. D. Optimal multi-distribution learning. arXiv preprint arXiv:2312.05134, 2023.

## A. Additional Discussion on Related Work

**Mixture learning from batches.** Another recent line of work (Kong et al., 2020b;a; Das et al., 2023; Jain et al., 2023) studied learning mixtures of linear regressions from data batches. In these setups, there are $k$ unknown linear regression models, and each data batch consists of labeled examples produced by one of the linear models chosen randomly. This line of work gave trade-offs between the number of batches and the batch size under which the parameters of the $k$ linear models are efficiently learnable. In comparison, our model allows the learner to adaptively sample from the data distributions, whereas the batch sizes are fixed in the model of learning with batches. We also note that the results for learning with batches require assumptions on the marginal distribution, such as Gaussianity or certain hypercontractivity and condition number properties. Also, except for the recent work of Jain et al. (2023), they all required the marginal distribution of the instance to be the same across all batches.

**Computational hardness of learning.** There is a huge body of work on the computational hardness of learning. Early work along this line showed that, under standard complexity-theoretic assumptions, it is hard to properly and agnostically learn halfspaces (Feldman et al., 2006; Guruswami & Raghavendra, 2009) and Boolean disjunctions (Kearns et al., 1992; Feldman, 2006). More recent work has obtained a finer-grained understanding of this computational hardness. It is now known that many natural hypothesis classes are hard to learn even under additional assumptions, e.g., learning halfspaces under Massart noise (Diakonikolas et al., 2022), agnostically learning halfspaces under Gaussian marginals (Diakonikolas et al., 2023), and properly learning decision trees using membership queries (Koch et al., 2023a;b).

**Approximate coloring.** Approximate coloring is the problem of finding a valid coloring of a given graph with as few colors as possible. This line of work was initiated by Wigderson (1983), who gave an efficient algorithm that colors a 3-colorable graph with $n$ vertices using $O(\sqrt{n})$ colors. This result was later improved by a series of works (Berger & Rompel, 1990; Blum, 1994; Karger et al., 1998; Blum & Karger, 1997; Arora et al., 2006; Chlamtac, 2007; Kawarabayashi & Thorup, 2017). The best known upper bound of $O(n^{0.19996})$ is due to Kawarabayashi and Thorup (Kawarabayashi & Thorup, 2017). The analogous problem for $k$-colorable graphs (where $k \ge 4$) has also been studied.

## B. Proofs for Section 5

The algorithm is formally defined in Algorithm 2, and follows the same strategy as Algorithm 1 of Blum et al. (2017). Recall that, for each data distribution $D_i$, $D'_i$ denotes the distribution of $((i, x), y)$ when $(x, y) \sim D_i$.
Therefore, on Line 7, to sample from the mixture distribution $\frac{1}{|G^{(r)}|}\sum_{i\in G^{(r)}} D'_i$, it suffices to sample $i$ from $G^{(r)}$ uniformly at random, draw $(x, y) \sim D_i$, and then use the labeled example $((i, x), y)$.

The algorithm maintains $G^{(r)}$ as the set of active distributions at the beginning of the $r$-th round. The algorithm samples from the uniform mixture of the active distributions, learns a classifier $g_{\mathbf{f},\mathbf{c}} \in \mathcal{F}_{G^{(r)},k}$ via ERM, and then tests whether the learned classifier is good enough for each distribution in $G^{(r)}$. If the learned classifier achieves an $O(\epsilon)$ empirical error on $D_i$, we use it as the answer $\hat f_i$; otherwise, $D_i$ stays active for the next round. Finally, the iteration terminates whenever the number of active distributions drops below $k$, at which point we naïvely learn on the $k$ remaining distributions separately.

The analysis of Algorithm 2 is straightforward, and relies on the following definition of a "good event".

**Definition B.1.** Let $E_{\mathrm{good}}$ denote the event that the following happen simultaneously when Algorithm 2 is executed:
- Whenever Line 8 is reached, it holds that $\operatorname{err}_{D'}(g_{\mathbf{f},\mathbf{c}}) \le 2\epsilon$.
- Whenever Line 11 is reached, it holds that: (1) $\operatorname{err}_{D_i}(f_{c_i}) \le 4\epsilon$ implies $\operatorname{err}_{S_i}(f_{c_i}) \le 6\epsilon$; (2) $\operatorname{err}_{D_i}(f_{c_i}) > 8\epsilon$ implies $\operatorname{err}_{S_i}(f_{c_i}) > 6\epsilon$.
- Whenever Line 22 is reached, it holds that $\operatorname{err}_{D_i}(\hat f_i) \le 2\epsilon$.

Then, Theorem 3.2 is a consequence of the following three lemmas.

**Lemma B.2.** For some universal constant $c$, when Algorithm 2 is executed with parameter $c$, $\Pr\left[E_{\mathrm{good}}\right] \ge 1 - \delta$.

**Algorithm 2** Collaborative Learning for $(k, \epsilon)$-Realizable Distributions
1: **Input:** Hypothesis class $\mathcal{F}$. Sample access to $D_1, \ldots, D_n$. Parameters $k, \epsilon, \delta, c$.
2: **Output:** Hypotheses $\hat f_1, \hat f_2, \ldots, \hat f_n$.
3: $r \gets 1$; $G^{(1)} \gets [n]$;
4: **while** $|G^{(r)}| > k$ **do**
5:   $\delta^{(r)} \gets \delta/r^2$;
6:   $d^{(r)} \gets c\cdot(kd + |G^{(r)}|\log k)$;
7:   Draw $c\cdot\frac{d^{(r)}\ln(1/\epsilon) + \ln(1/\delta^{(r)})}{\epsilon}$ samples from $D' := \frac{1}{|G^{(r)}|}\sum_{i\in G^{(r)}} D'_i$ to form $S$;
8:   $g_{\mathbf{f},\mathbf{c}} \gets \arg\min_{g\in\mathcal{F}_{G^{(r)},k}} \operatorname{err}_S(g)$;
9:   $G^{(r+1)} \gets \emptyset$;
10:   **for** $i \in G^{(r)}$ **do**
11:     Draw $c\cdot\frac{\ln(|G^{(r)}|/\delta^{(r)})}{\epsilon}$ samples from $D_i$ to form $S_i$;
12:     **if** $\operatorname{err}_{S_i}(f_{c_i}) \le 6\epsilon$ **then**
13:       $\hat f_i \gets f_{c_i}$;
14:     **else**
15:       $G^{(r+1)} \gets G^{(r+1)} \cup \{i\}$;
16:     **end if**
17:   **end for**
18:   $r \gets r + 1$;
19: **end while**
20: **for** $i \in G^{(r)}$ **do**
21:   Draw $c\cdot\frac{d\ln(1/\epsilon) + \ln(k/\delta)}{\epsilon}$ samples from $D_i$ to form $S_i$;
22:   $\hat f_i \gets \arg\min_{f\in\mathcal{F}} \operatorname{err}_{S_i}(f)$;
23: **end for**
24: **Return** $\hat f_1, \ldots, \hat f_n$;

*Proof.* Whenever Line 8 is reached, by Lemma 5.1, for some sufficiently large constant $c > 0$, the VC dimension of $\mathcal{F}_{G^{(r)},k}$ is upper bounded by $d^{(r)} = c\cdot(kd + |G^{(r)}|\log k)$. Then, it follows from Theorem 5.7 in (Anthony & Bartlett, 1999) that, for some universal constant $c > 0$, the first condition is violated at the $r$-th round with probability at most $\delta^{(r)}/6$. By a union bound over all possible $r$, the first condition holds with probability at least $1 - \sum_{r=1}^{+\infty}\delta^{(r)}/6 \ge 1 - \delta/3$.

Again, by Theorem 5.7 in (Anthony & Bartlett, 1999), the probability for the third condition to be violated for a specific $i$ is at most $\delta/(3k)$. Since $|G^{(r)}| \le k$ in the last for-loop, by a union bound, the third condition holds with probability at least $1 - k\cdot\frac{\delta}{3k} = 1 - \delta/3$.

Finally, a Chernoff bound shows that the second condition holds for a specific $r$ and $i \in G^{(r)}$ with probability at least
$$1 - 2\exp\left(-\Omega\big(c\ln(|G^{(r)}|/\delta^{(r)})\big)\right),$$
which can be made greater than $1 - \frac{\delta^{(r)}}{6|G^{(r)}|}$ for sufficiently large $c$. By a union bound over all $r$ and $i \in G^{(r)}$, the second condition holds with probability at least
$$1 - \sum_{r=1}^{+\infty} |G^{(r)}|\cdot\frac{\delta^{(r)}}{6|G^{(r)}|} \ge 1 - \delta/3.$$
The lemma follows from the three claims above and yet another union bound.

**Lemma B.3.** When event $E_{\mathrm{good}}$ happens, the output of Algorithm 2 satisfies $\operatorname{err}_{D_i}(\hat f_i) \le 8\epsilon$ for every $i \in [n]$.
*Proof.* We assign a classifier as $\hat f_i$ for some $i \in [n]$ either right after Line 11 or on Line 22. In either case, event $E_{\mathrm{good}}$ guarantees that $\hat f_i$ is $8\epsilon$-accurate on $D_i$.

**Lemma B.4.** When event $E_{\mathrm{good}}$ happens, Algorithm 2 terminates with a sample complexity of
$$O\!\left(\frac{kd\log(n/k)\log(1/\epsilon) + n\log k\log(1/\epsilon) + n\log(n/\delta)}{\epsilon}\right).$$

*Proof.* We first control the size of $G^{(r)}$ in each round $r$. We claim that, when event $E_{\mathrm{good}}$ happens, if $G^{(r+1)}$ is defined during the execution of Algorithm 2, it holds that $|G^{(r+1)}| \le |G^{(r)}|/2$. Indeed, the first condition of $E_{\mathrm{good}}$ guarantees that for $D' = \frac{1}{|G^{(r)}|}\sum_{i\in G^{(r)}} D'_i$,
$$\frac{1}{|G^{(r)}|}\sum_{i\in G^{(r)}} \operatorname{err}_{D_i}(f_{c_i}) = \frac{1}{|G^{(r)}|}\sum_{i\in G^{(r)}} \operatorname{err}_{D'_i}(g_{\mathbf{f},\mathbf{c}}) = \operatorname{err}_{D'}(g_{\mathbf{f},\mathbf{c}}) \le 2\epsilon.$$
By Markov's inequality, it holds for at least half of the values $i \in G^{(r)}$ that $\operatorname{err}_{D_i}(f_{c_i}) \le 4\epsilon$. Then, the second condition of $E_{\mathrm{good}}$ guarantees that $|G^{(r+1)}| \le |G^{(r)}|/2$. It follows immediately that $|G^{(r)}| \le 2^{1-r}\,n$.

Then, the sample complexity of the $r$-th iteration of the while-loop is upper bounded by
$$c\cdot\frac{d^{(r)}\ln(1/\epsilon) + \ln(1/\delta^{(r)})}{\epsilon} + |G^{(r)}|\cdot c\cdot\frac{\ln(|G^{(r)}|/\delta^{(r)})}{\epsilon} = O\!\left(\frac{(kd + 2^{-r}n\log k)\log(1/\epsilon) + \log(1/\delta) + \log r}{\epsilon} + \frac{2^{-r}n\big(\log(2^{-r}n) + \log(1/\delta) + \log r\big)}{\epsilon}\right).$$
Since the while-loop terminates whenever $|G^{(r)}| \le k$, there are at most $O(\log(n/k))$ iterations, and summing the above over $r = 1, 2, \ldots, O(\log(n/k))$ gives
$$O\!\left(\frac{\big(kd\log(1/\epsilon) + \log(1/\delta)\big)\log(n/k)}{\epsilon} + \frac{n\log k\log(1/\epsilon)}{\epsilon} + \frac{\log^2(n/k)}{\epsilon} + \frac{n\log n + n\log(1/\delta)}{\epsilon}\right) = O\!\left(\frac{kd\log(n/k)\log(1/\epsilon) + n\log k\log(1/\epsilon) + n\log(n/\delta)}{\epsilon}\right).$$
Finally, the last for-loop of the algorithm takes $O\!\left(\frac{kd\log(1/\epsilon) + k\log(k/\delta)}{\epsilon}\right)$ samples in total, which is always dominated by the above. Therefore, we have the desired upper bound on the sample complexity.

Finally, we put all the pieces together and prove Theorem 3.2.

*Proof of Theorem 3.2.* Lemmas B.2, B.3, and B.4 together imply that, conditioning on an event that happens with probability at least $1 - \delta$, Algorithm 2 returns $8\epsilon$-accurate classifiers for each of the $n$ distributions, while the number of samples is upper bounded by some
$$M = O\!\left(\frac{kd\log(n/k)\log(1/\epsilon) + n\log k\log(1/\epsilon) + n\log(n/\delta)}{\epsilon}\right).$$
To control the sample complexity, i.e., the (unconditional) expectation of the number of samples, we simply terminate the algorithm whenever the number of samples exceeds $M$. The resulting algorithm still satisfies the $(8\epsilon, \delta)$-PAC guarantee, and satisfies the desired upper bound on the sample complexity.

## C. Sample Complexity Lower Bound

We prove Theorem 3.4 in this section. Our proof is based on a simple observation: learning $n$ distributions under $(k, 0)$-realizability is at least as hard as learning $k$ unrelated instances, where each instance consists of learning $n/k$ distributions under $(1, 0)$-realizability. Our actual proof is essentially the same as the lower bound proof in (Blum et al., 2017), and formalizes the intuition above. Formally, assuming that a learning algorithm $A$ $(\epsilon, \delta)$-PAC learns $n$ distributions under $(k, 0)$-realizability, we use $A$ to construct another learner $A'$, which is $(\epsilon, O(\delta/k))$-PAC for $n/k$ distributions that are $(1, 0)$-realizable. Furthermore, the sample complexity of $A'$ is an $O(1/k)$ fraction of that of $A$. For completeness, we state the reduction in the following.

*Proof of Theorem 3.4.* Fix parameters $n$, $k$, $d$, $\epsilon$, and $\delta$. Let $n' := \lfloor n/k\rfloor$ and $\delta' := \frac{10\delta}{9k}$. Let $\mathcal{F}$ be a hypothesis class with VC dimension $d$, and let $\mathcal{D}_{\mathrm{hard}}$ be a distribution over hard instances for collaborative learning on $n'$ distributions. Formally, $\mathcal{D}_{\mathrm{hard}}$ is a distribution over $n'$ data distributions $(D_1, D_2, \ldots, D_{n'})$ such that:
- Every $(D_1, \ldots, D_{n'})$ in the support of $\mathcal{D}_{\mathrm{hard}}$ is $(1, 0)$-realizable with respect to $\mathcal{F}$.
- If a learning algorithm achieves an $(\epsilon, \delta')$-PAC guarantee when learning $\mathcal{F}$ on $D_1, \ldots, D_{n'}$ drawn from $\mathcal{D}_{\mathrm{hard}}$, it must take $m(n', d, \epsilon, \delta')$ samples in expectation.

Now, let $A$ be an $(\epsilon, \delta)$-PAC learning algorithm over $k\cdot n'$ distributions under $(k, 0)$-realizability. For brevity, we relabel the $k\cdot n'$ distributions as $D_{i,j}$, where $i \in [k]$ and $j \in [n']$. In the following, we construct another learning algorithm $A'$ that learns $n'$ distributions (denoted by $D^{\mathrm{actual}}_1, \ldots, D^{\mathrm{actual}}_{n'}$) drawn from $\mathcal{D}_{\mathrm{hard}}$ by simulating $A$:

1. For each $i \in [k]$, independently draw $(D_{i,1}, \ldots, D_{i,n'})$ from $\mathcal{D}_{\mathrm{hard}}$.
2. Sample $i^*$ from $[k]$ uniformly at random.
3. Simulate algorithm $A$ on distributions $(D_{i,j})_{i\in[k], j\in[n']}$, except that $D_{i^*,1}$ through $D_{i^*,n'}$ are replaced by the $n'$ actual distributions $D^{\mathrm{actual}}_1, \ldots, D^{\mathrm{actual}}_{n'}$. In other words, whenever $A$ requires a sample from $D_{i,j}$, we truly sample from $D_{i,j}$ if $i \ne i^*$; otherwise, we sample from $D^{\mathrm{actual}}_j$, and forward the sample to $A$.
4. When $A$ terminates and outputs $(\hat f_{i,j})_{i\in[k], j\in[n']}$, we test whether $\operatorname{err}_{D_{i,j}}(\hat f_{i,j}) \le \epsilon$ holds for all $i \ne i^*$ and $j \in [n']$. If so, we output $\hat f_{i^*,1}$ through $\hat f_{i^*,n'}$ as the answer; otherwise, we repeat the procedure above.

In each repetition of the above procedure, from the perspective of algorithm $A$, it runs on $k\cdot n'$ distributions divided into $k$ groups, where each group consists of $n'$ distributions drawn from $\mathcal{D}_{\mathrm{hard}}$. Clearly, the $k\cdot n'$ distributions together satisfy $(k, 0)$-realizability. Intuitively, it is impossible for $A$ to tell the index $i^*$ that corresponds to the actual instance, so the actual instance only suffers from an $O(1/k)$ fraction of the error probability as well as the sample complexity.

Let $M$ denote the expected number of samples that $A$ draws on such a random instance. Analogous to Claims 4.3 and 4.4 from (Blum et al., 2017)⁴, we have the following guarantee for the constructed learner $A'$:

**Claim 1.** Assuming $\delta \le 0.1$, on a random instance drawn from $\mathcal{D}_{\mathrm{hard}}$, $A'$ achieves an $(\epsilon, \frac{10\delta}{9k})$-PAC guarantee and draws at most $\frac{10M}{9k}$ samples in expectation.

By our assumption on $\mathcal{D}_{\mathrm{hard}}$, we have $\frac{10M}{9k} \ge m(n', d, \epsilon, \delta')$. Hence, we conclude that $A$ takes at least $\frac{9k}{10}\, m(n', d, \epsilon, \delta') = \Omega(k)\cdot m(\lfloor n/k\rfloor, d, \epsilon, O(\delta/k))$ samples in expectation.

⁴ While the two claims in (Blum et al., 2017) were proved for a concrete construction of hard instances, the proof only relies on the symmetry and independence among the instances, and thus can be applied to our case without modification.

## D. Proofs for Section 6

In this section, we prove Theorem 3.9 as well as a distributional version of it.

### D.1. Reduction from Graph Coloring

We prove the first part (the $k \ge 3$ case) by a reduction from graph coloring, which is well known to be NP-hard.

*Proof of Theorem 3.9 (the first part).* Fix an arbitrary regular hypothesis family $\{(\mathcal{X}_d, \mathcal{F}_d)\}_{d\in\mathbb{N}}$. We will show that if, for any fixed $k \ge 3$, there is a polynomial-time algorithm that solves Problem 1, then the same algorithm can be used to solve graph $k$-coloring efficiently. This implies the first part of the theorem.

Given an instance $G = (V, E)$ of the $k$-coloring problem, we construct an instance of Problem 1 with parameters $k$ and $d = n = |V|$. Without loss of generality, we assume $V = [n]$, since we can always relabel the vertices. By Definition 3.6, the VC dimension of $\mathcal{F}_d$ is at least $d = n$, and we can efficiently find $n$ instances $x_1, x_2, \ldots, x_n \in \mathcal{X}_d$ that are shattered by $\mathcal{F}_d$.
D. Proofs for Section 6

In this section, we prove Theorem 3.9 as well as a distributional version of it.

D.1. Reduction from Graph Coloring

We prove the first part (the $k \ge 3$ case) by a reduction from graph coloring, which is well known to be NP-hard.

Proof of Theorem 3.9 (the first part). Fix an arbitrary regular hypothesis family $\{(X_d, F_d)\}_{d \in \mathbb{N}}$. We will show that if, for some fixed $k \ge 3$, there is a polynomial-time algorithm that solves Problem 1, the same algorithm can be used to solve graph $k$-coloring efficiently. This implies the first part of the theorem.

Given an instance $G = (V, E)$ of the $k$-coloring problem, we construct an instance of Problem 1 with parameters $k$ and $d = n = |V|$. Without loss of generality, we assume $V = [n]$, since we can always relabel the vertices. By Definition 3.6, the VC dimension of $F_d$ is at least $d = n$, and we can efficiently find $n$ instances $x_1, x_2, \ldots, x_n \in X_d$ that are shattered by $F_d$. For each $v \in [n]$, we define the $v$-th dataset as
$$S_v := \{(x_v, 1)\} \cup \{(x_u, 0) : \{u, v\} \in E\}.$$
We will show in the following that $G$ has a $k$-coloring if and only if the ERM instance is feasible, i.e., the $n$ datasets can be perfectly fit by $k$ classifiers from $F_d$.

From coloring to classifiers. Suppose that $c: V \to [k]$ is a valid $k$-coloring of $G$. Since $F_d$ shatters $x_1, x_2, \ldots, x_n$, we can find $f_1, f_2, \ldots, f_k \in F_d$ such that $f_i(x_v) = 1\{c(v) = i\}$ holds for every $i \in [k]$ and $v \in [n]$. Then, for every $v \in [n]$, the dataset $S_v$ is perfectly fit by classifier $f_{c(v)}$, since $f_{c(v)}(x_v) = 1\{c(v) = c(v)\} = 1$ and $f_{c(v)}(x_u) = 1\{c(u) = c(v)\} = 0$ for every neighbor $u$ of $v$.

From classifiers to coloring. Conversely, let $f_1, f_2, \ldots, f_k \in F_d$ be $k$ classifiers such that each dataset $S_v$ is consistent with one of the classifiers. We can then choose a labeling $c: V \to [k]$ such that $S_v$ is perfectly fit by $f_{c(v)}$. Now we show that $c$ is a valid $k$-coloring. Indeed, suppose that $\{u, v\} \in E$ is an edge and $c(u) = c(v) = i$. Then, $f_i$ must correctly label both $(x_u, 1) \in S_u$ and $(x_u, 0) \in S_v$, which is impossible.

Finally, note that the above reduction works for any $k \ge 3$ and any regular family of hypothesis classes, while $k$-coloring is NP-hard for every $k \ge 3$. This proves the first part of Theorem 3.9.
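A minimal Python sketch of this reduction is given below; it assumes hypothetical inputs (an adjacency list for $G$ and a list of shattered points), and it also shows how a feasible set of classifiers is translated back into a coloring, as in the second direction of the proof.

```python
def coloring_to_erm_instance(adj, shattered_points):
    """Build the datasets S_v from a graph, following the reduction above.

    adj:              adjacency list of G = (V, E), with V = range(n).
    shattered_points: x_1, ..., x_n in X_d, assumed shattered by F_d.
    Returns a list of datasets, each a list of (instance, label) pairs.
    """
    datasets = []
    for v, x_v in enumerate(shattered_points):
        S_v = [(x_v, 1)] + [(shattered_points[u], 0) for u in adj[v]]
        datasets.append(S_v)
    return datasets


def erm_solution_to_coloring(datasets, classifiers):
    """Recover a labeling c from k classifiers that perfectly fit the datasets.

    classifiers: hypothetical callables f_1, ..., f_k from F_d.
    Returns, for every v, the index of a classifier consistent with S_v;
    the proof above shows this labeling is a proper k-coloring of G.
    """
    coloring = []
    for S_v in datasets:
        for idx, f in enumerate(classifiers):
            if all(f(x) == y for x, y in S_v):
                coloring.append(idx)
                break
        else:
            raise ValueError("no classifier fits this dataset; instance infeasible")
    return coloring
```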
D.2. Hardness under Two Labeling Functions

Next, we deal with the $k = 2$ case. Since 2-coloring can be solved efficiently, we must reduce from a different NP-hard problem. Intuitively, we want the problem to correspond to partitioning a set into $k = 2$ parts. This motivates our reduction from (a variant of) subset sum.

Proof of Theorem 3.9 (the second part). We start by defining the regular hypothesis family. For each integer $d \ge 1$, we consider the instance space $X_d := [d] \times \{0, 1, \ldots, 2^d\}$ and the following hypothesis class:
$$F_d := \left\{ f_\theta : \theta \in \{0, 1, \ldots, 2^d\}^d,\ \sum_{i=1}^d \theta_i \le 2^d \right\},$$
where $f_\theta$ is defined as $f_\theta(i, j) = 1\{j \le \theta_i\}$. In other words, each $f_\theta \in F_d$ can be viewed as a direct product of $d$ threshold functions, subject to the constraint that the thresholds sum up to at most $2^d$. It can be easily verified that $\{(X_d, F_d)\}_{d \in \mathbb{N}}$ satisfies Definition 3.6, since for every $d$, the instances $(1, 1), (2, 1), \ldots, (d, 1)$ are shattered by $F_d$, witnessed by $\{f_\theta : \theta \in \{0, 1\}^d\} \subseteq F_d$.

Vanilla ERM is easy. We first show that the $k = 1$ case of Problem 1 is easy. Indeed, when $k = 1$, Problem 1 reduces to deciding whether $S_1 \cup S_2 \cup \cdots \cup S_n \subseteq X_d \times \{0, 1\}$ can be perfectly fit by a hypothesis in $F_d$. By construction of $F_d$, this can be done via the following two steps:

- First, check whether there exist $i \in [d]$ and $0 \le j_1 < j_2 \le 2^d$ such that both $((i, j_1), 0)$ and $((i, j_2), 1)$ are in the dataset. Also check whether the dataset contains $((i, 0), 0)$ for any $i \in [d]$. If either condition holds, report "no solution".
- Then, for each $i \in [d]$, let $\theta_i$ denote the largest value $j \in \{0, 1, \ldots, 2^d\}$ such that the dataset contains $((i, j), 1)$; let $\theta_i = 0$ if no such labeled example exists. If $\sum_{i=1}^d \theta_i \le 2^d$, the function $f_\theta$ is a valid solution; otherwise, report "no solution".

The correctness of this procedure is immediate given the definition of $F_d$.

Construction of datasets. We consider a specific choice of the datasets: for each $i \in [n]$, the $i$-th dataset contains exactly one data point of the form $(i, a_i)$ with label 1. In the rest of the proof, the key observation is that the datasets with indices in $T \subseteq [n]$ can be simultaneously satisfied by a hypothesis in $F_d$ if and only if $\sum_{i \in T} a_i \le 2^d$.

The ERM problem for $k = 2$ is then equivalent to deciding whether $\{a_i : i \in [n]\}$ can be partitioned into two sets, each of which sums up to at most $2^d$. This problem can be easily shown to be NP-hard via a standard reduction from the subset sum problem.

From subset sum to a special case. We first reduce the general subset sum problem to a special case, in which the $n$ numbers sum up to $2^{n+1}$ and the target value is exactly $2^n$. Let $(\{a_i\}_{i \in [m]}, t)$ be an instance of subset sum (i.e., deciding whether there exists $T \subseteq [m]$ such that $\sum_{i \in T} a_i = t$). Let $s = \sum_{i=1}^m a_i$ and pick $n = \max\{m + 2, \lceil \log_2 s \rceil + 1\}$, so that $n - m \ge 2$ and $2^n > s$. We pad $n - m$ numbers onto the instance, such that
$$a_{m+1} = a_{m+2} = \cdots = a_{n-2} = 0, \qquad a_{n-1} = 2^n - t, \qquad a_n = 2^n - (s - t).$$
Note that the numbers now sum up to
$$\sum_{i=1}^n a_i = s + (2^n - t) + \left[2^n - (s - t)\right] = 2^{n+1}.$$
Also note that the size of the instance increases at most polynomially after the padding: a natural representation of the original subset sum instance $(\{a_i\}_{i \in [m]}, t)$ takes at least $m + \log_2 s$ bits. In the new instance, there are $n = O(m + \log s)$ numbers that sum up to $2^{n+1}$, so its representation takes at most $O(n^2)$ bits, which is at most quadratic in the size of the original instance.

We claim that $(\{a_i\}_{i \in [n]}, 2^n)$ has the same answer as $(\{a_i\}_{i \in [m]}, t)$. Indeed, if $\sum_{i \in T} a_i = t$ for some $T \subseteq [m]$, then $T \cup \{n - 1\}$ would be a feasible solution to the new instance. Conversely, suppose that $T \subseteq [n]$ is a subset such that $\sum_{i \in T} a_i = 2^n$. Since $a_{n-1} + a_n = 2^{n+1} - s > 2^n$, $T$ must contain exactly one of $n - 1$ and $n$. Without loss of generality, we have $n - 1 \in T$, and $T \setminus \{n - 1\}$ would then give $\sum_{i \in T \setminus \{n-1\}} a_i = 2^n - (2^n - t) = t$.
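The padding step can be written out in a few lines of Python; this is only an illustrative sketch, assuming an input instance with non-negative integers and $0 \le t \le s$.

```python
import math

def pad_subset_sum(a, t):
    """Pad a subset-sum instance (a, t) into the special form used above:
    n numbers summing to 2^(n+1) with target exactly 2^n.
    The original instance is feasible iff the padded one has a subset
    summing to 2^n.
    """
    m = len(a)
    s = sum(a)
    # Choose n so that n - m >= 2 and 2^n > s.
    n = max(m + 2, math.ceil(math.log2(s)) + 1 if s > 0 else m + 2)
    padded = list(a)
    padded += [0] * (n - m - 2)           # a_{m+1} = ... = a_{n-2} = 0
    padded += [2**n - t, 2**n - (s - t)]  # a_{n-1} and a_n
    assert sum(padded) == 2**(n + 1)
    return padded, n
```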
From the special case to ERM. Then, we construct a collaborative learning instance with $n$ distributions and set $d = n$ in the definition of $X_d$ and $F_d$. The $i$-th dataset only contains $((i, a_i), 1)$. We claim that the datasets can be fit by $k = 2$ functions in $F_d$ if and only if the subset sum instance $(\{a_i\}_{i \in [n]}, 2^n)$ is feasible.

For the "if" direction, suppose that $T \subseteq [n]$ satisfies $\sum_{i \in T} a_i = 2^n$. Then, we define $\theta^{(1)}$ and $\theta^{(2)}$ as
$$\theta^{(1)}_i = \begin{cases} a_i, & i \in T, \\ 0, & i \notin T, \end{cases} \qquad \theta^{(2)}_i = \begin{cases} a_i, & i \notin T, \\ 0, & i \in T. \end{cases}$$
Clearly, both $f_{\theta^{(1)}}$ and $f_{\theta^{(2)}}$ are in $F_d$. The $i$-th dataset is consistent with the first function if $i \in T$, and with the second otherwise.

Conversely, suppose the datasets can be perfectly fit by two hypotheses $f_{\theta^{(1)}}, f_{\theta^{(2)}} \in F_d$. Let $T := \{i \in [n] : f_{\theta^{(1)}}(i, a_i) = 1\}$ be the indices of the data that are consistent with the former. We then have $\theta^{(1)}_i \ge a_i$ for every $i \in T$ and thus $\sum_{i \in T} a_i \le \sum_{i \in T} \theta^{(1)}_i \le 2^n$. The same argument, when applied to $[n] \setminus T$ and $\theta^{(2)}$, implies $\sum_{i \in [n] \setminus T} a_i \le \sum_{i \in [n] \setminus T} \theta^{(2)}_i \le 2^n$. Since the $a_i$'s sum up to $2^{n+1}$, we conclude that each summation must be equal to $2^n$, i.e., $T$ is a feasible solution to the subset sum instance.

D.3. Hardness of the Distributional Version of ERM

As we mentioned earlier, Theorem 3.9 and its proof are arguably of a worst-case nature, and do not exclude the possibility of efficiently performing ERM (with high probability) over datasets that are randomly drawn. Indeed, for the first part of the proof (via a reduction from graph coloring), when the graph $G = (V, E)$ is dense, each dataset constructed in the proof would be of size $\Omega(|V|) = \Omega(d)$, whereas in the context of sample-efficient collaborative learning, the datasets tend to be much smaller. Unfortunately, we can also prove the hardness of this distributional version, which we formally define below.

Problem 2 (ERM over Randomly Drawn Datasets). This is a variant of Problem 1, in which we specify $n$ distributions $D_1, D_2, \ldots, D_n$ over $X_d \times \{0, 1\}$ and a parameter $m$. Each dataset $S_i$ consists of $m$ samples independently drawn from $D_i$.

We first note that the $k = 2$ part of Theorem 3.9 still holds for Problem 2. This is because our proof shows that Problem 1 is hard (for a specific hypothesis family) even if all the datasets are of size 1. Then, the same construction can be used to reduce subset sum to Problem 2, in which each distribution is a degenerate distribution.

To prove the hardness of Problem 2 when $k \ge 3$ and the hypothesis family is arbitrary, we reduce from the following variant of the graph $k$-coloring problem, in which the graph is promised to have maximum degree $O(k)$.

Problem 3. Given a graph $G = (V, E)$ with maximum degree at most $2k - 1$, decide whether $G$ can be $k$-colored.

Lemma D.1. Problem 3 is NP-hard.

Proof. We reduce from the usual $k$-coloring. Let $G = (V, E)$ be an instance of $k$-coloring. We give an efficient algorithm that transforms $G$ into a graph $G'$ that is a valid instance for Problem 3. Furthermore, $G'$ is $k$-colorable if and only if $G$ is $k$-colorable. This immediately implies the NP-hardness of Problem 3.

For each node $v \in V$, we split it into $|V| - 1$ copies $v^{(1)}, v^{(2)}, \ldots, v^{(|V|-1)}$. We also add $|V| - 2$ cliques of size $k - 1$, denoted by $C_{v,1}, C_{v,2}, \ldots, C_{v,|V|-2}$. Every vertex in clique $C_{v,i}$ is also linked to $v^{(i)}$ and $v^{(i+1)}$. Note that this forces $v^{(1)}, v^{(2)}, \ldots, v^{(|V|-1)}$ to take the same color in a valid $k$-coloring. Then, for every edge $\{u, v\} \in E$, we link some copy $u^{(i)}$ of $u$ to some copy $v^{(j)}$ of $v$, so that no copy of any vertex is used twice. The resulting graph $G' = (V', E')$ satisfies
$$|V'| = |V| \cdot \left[|V| - 1 + (|V| - 2)(k - 1)\right] = O(k|V|^2),$$
and the maximum degree is $1 + 2(k - 1) = 2k - 1$. The equivalence between the $k$-colorability of $G$ and $G'$ is immediate from our construction.
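The degree-bounding transformation of Lemma D.1 is mechanical and can be sketched as follows; the vertex encodings ("copy", "clique") are an arbitrary choice made for this illustration.

```python
import itertools

def bound_degree(vertices, edges, k):
    """Sketch of the Lemma D.1 transformation: split each vertex into copies
    chained by (k-1)-cliques, so the resulting graph has maximum degree 2k - 1
    and is k-colorable iff the original graph is.
    Returns the new edge set as a set of frozensets of vertex encodings.
    """
    N = len(vertices)
    new_edges = set()
    for v in vertices:
        # Connector cliques C_{v,1}, ..., C_{v,N-2} chaining copies v^(1..N-1).
        for i in range(1, N - 1):
            clique = [("clique", v, i, a) for a in range(k - 1)]
            for x, y in itertools.combinations(clique, 2):
                new_edges.add(frozenset((x, y)))  # edges inside the (k-1)-clique
            for x in clique:
                new_edges.add(frozenset((x, ("copy", v, i))))      # link to v^(i)
                new_edges.add(frozenset((x, ("copy", v, i + 1))))  # link to v^(i+1)
    # For each original edge {u, v}, connect one unused copy of u to one of v.
    next_free = {v: 1 for v in vertices}
    for u, v in edges:
        new_edges.add(frozenset((("copy", u, next_free[u]), ("copy", v, next_free[v]))))
        next_free[u] += 1
        next_free[v] += 1
    return new_edges
```

Each copy touches at most two connector cliques plus one cross edge, giving degree at most $2(k-1) + 1 = 2k - 1$, matching the proof.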
Now we prove a strengthening of Theorem 3.9.

Theorem D.2. Unless NP = RP, no polynomial-time (possibly randomized) algorithm for Problem 2 achieves the following guarantee for $k \ge 3$ and $m = \Omega(k \log n)$: with probability at least $1/\mathrm{poly}(d, n, m)$ over the randomness in $S_1, \ldots, S_n$, if $S_1, \ldots, S_n$ admits a feasible solution, the algorithm outputs a feasible solution with probability at least $1/\mathrm{poly}(d, n, m)$.

Indeed, the guarantee required in Theorem D.2 seems minimal for a useful ERM oracle: it only needs to succeed on a non-negligible fraction of instances, and the definition of success is merely to output a feasible solution (if one exists) with a non-negligible probability. Still, even such a guarantee is unlikely to be achievable efficiently under the standard computational hardness assumption that NP ≠ RP.

Proof. We prove the contrapositive: the existence of such an algorithm implies NP = RP. Suppose that $A$ is an efficient algorithm with the desired guarantee. We derive an efficient algorithm for Problem 3. Given an instance of Problem 3, the reduction from the proof of Theorem 3.9 produces $n = |V|$ datasets $\hat S_1, \hat S_2, \ldots, \hat S_n$, each of size at most $2k$. We define the $i$-th data distribution as the uniform distribution over $\hat S_i$. When $m$ samples are drawn from each $D_i$, every element in the support of every $D_i$ appears at least once, except with probability at most
$$2kn\cdot\left(1 - \frac{1}{2k}\right)^m \le 2kn\cdot\exp\!\left(-\frac{m}{2k}\right),$$
which can be made much smaller than the $1/\mathrm{poly}(d, n, m)$ term in the theorem statement for some appropriate $m = \Omega(k \log n)$. In other words, except with a negligibly small probability, the randomly drawn datasets $S_1, \ldots, S_n$ coincide with the intended datasets $\hat S_1, \ldots, \hat S_n$ (when both are viewed as sets rather than multisets). Then, the hypothetical algorithm $A$ for Problem 2 must output the correct answer with probability larger than $1/\mathrm{poly}(d, n, m)$ when the sampled datasets are $\hat S_1, \ldots, \hat S_n$. We repeat $A$ on $\hat S_1, \ldots, \hat S_n$ for $O(\mathrm{poly}(d, n, m))$ times and check whether it ever outputs a feasible solution. If so, we output "Yes"; we output "No" otherwise. This gives an efficient randomized algorithm for Problem 3 that: (1) when the input graph is $k$-colorable, outputs "Yes" with probability at least $1/2$; (2) when the graph is not $k$-colorable, always outputs "No". This implies NP = RP in light of Lemma D.1.

E. An Efficient Algorithm for Identical Marginals

In this section, we prove Theorem 3.10, which addresses the special case in which $D_1, D_2, \ldots, D_n$ share the same marginal distribution over $X$. In this case, we show that there is a simple algorithm that efficiently clusters the distributions and learns accurate classifiers using far fewer than the $nd$ samples needed when each distribution is learned separately. The algorithm is formally defined as Algorithm 3, and follows an approach similar to the lifelong learning algorithms of (Balcan et al., 2015; Pentina & Urner, 2016).

Algorithm 3 Efficient Clustering under Identical Marginals
1: Input: Hypothesis class $F$. Sample access to $D_1, \ldots, D_n$. Parameters $k, \epsilon, \delta, \alpha, c$.
2: Output: Hypotheses $\hat f_1, \hat f_2, \ldots, \hat f_n$.
3: $\hat F \leftarrow \emptyset$;
4: for $i \in [n]$ do
5:   Draw $\frac{c\ln(n|\hat F|/\delta)}{\epsilon}$ samples from $D_i$ to form dataset $S$;
6:   $\hat f \leftarrow \arg\min_{f \in \hat F} \mathrm{err}_S(f)$;
7:   if $\mathrm{err}_S(\hat f) \le \left(3 + \frac{2}{3}\alpha\right)\epsilon$ then
8:     $\hat f_i \leftarrow \hat f$;
9:   else
10:    Draw $\frac{c\left[d\ln(1/\epsilon) + \ln((|\hat F| + 1)/\delta)\right]}{\epsilon}$ samples from $D_i$ to form dataset $S$;
11:    $\hat f_i \leftarrow \arg\min_{f \in F} \mathrm{err}_S(f)$;
12:    $\hat F \leftarrow \hat F \cup \{\hat f_i\}$;
13:  end if
14: end for
15: Return: $\hat f_1, \hat f_2, \ldots, \hat f_n$;

The algorithm maintains a list $\hat F$ of classifiers from the class $F$. For each distribution $D_i$, we first test whether any classifier in $\hat F$ is accurate enough on it. If so, we set $\hat f_i$ to the best classifier in $\hat F$; otherwise, we draw fresh samples to learn an accurate classifier for $D_i$ and add it to $\hat F$.
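For readers who prefer code, here is a minimal Python sketch of Algorithm 3. The helpers sample, erm_full_class, erm_over_list, and emp_error are hypothetical interfaces standing in for the sampling oracles, the ERM oracle over $F$, the minimization over the maintained list, and the training-error computation; the analysis below shows that, under $(k, \epsilon)$-realizability, the list never grows beyond $k$ classifiers.

```python
import math

def cluster_identical_marginals(sample, erm_full_class, erm_over_list, emp_error,
                                n, d, k, eps, delta, alpha, c):
    """Sketch of Algorithm 3 under identical marginals.

    sample(i, num)          -- draws num labeled examples from D_i.
    erm_full_class(S)       -- ERM oracle over the whole class F.
    erm_over_list(F_hat, S) -- classifier in F_hat with the least training error.
    emp_error(f, S)         -- training error of f on S.
    k is not used by the algorithm itself; it only enters the analysis
    (|F_hat| never exceeds k under (k, eps)-realizability).
    """
    F_hat = []              # the maintained list of classifiers
    f_out = [None] * n
    for i in range(n):
        # Test whether some classifier already in F_hat is accurate enough on D_i.
        m_test = math.ceil(c * math.log(n * max(len(F_hat), 1) / delta) / eps)
        S = sample(i, m_test)
        if F_hat:
            f_best = erm_over_list(F_hat, S)
            if emp_error(f_best, S) <= (3 + 2 * alpha / 3) * eps:
                f_out[i] = f_best
                continue
        # Otherwise, learn a fresh classifier for D_i and remember it.
        m_learn = math.ceil(c * (d * math.log(1 / eps)
                                 + math.log((len(F_hat) + 1) / delta)) / eps)
        S = sample(i, m_learn)
        f_new = erm_full_class(S)
        f_out[i] = f_new
        F_hat.append(f_new)
    return f_out
```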
Now we analyze Algorithm 3. We first define a good event that implies the accuracy and sample efficiency of the algorithm.

Definition E.1. Let $E_{\mathrm{good}}$ denote the event that the following happen simultaneously when Algorithm 3 is executed:

- Whenever Line 5 is reached, it holds for every $f \in \hat F$ that: (1) $\mathrm{err}_{D_i}(f) \le (3 + \alpha/3)\epsilon$ implies $\mathrm{err}_S(f) \le (3 + \frac{2}{3}\alpha)\epsilon$; (2) $\mathrm{err}_{D_i}(f) > (3 + \alpha)\epsilon$ implies $\mathrm{err}_S(f) > (3 + \frac{2}{3}\alpha)\epsilon$.
- Whenever Line 11 is reached, it holds that $\mathrm{err}_{D_i}(\hat f_i) \le (1 + \alpha/3)\epsilon$.

We show that $E_{\mathrm{good}}$ happens with high probability, and that it implies that Algorithm 3 is $((3 + \alpha)\epsilon, \delta)$-PAC and has a small sample complexity.

Lemma E.2. For any fixed $\alpha > 0$, there exists a sufficiently large $c > 0$ such that when Algorithm 3 is executed with parameters $\alpha$ and $c$, $\Pr[E_{\mathrm{good}}] \ge 1 - \delta$.

Proof. We upper bound the probability that each of the two conditions in $E_{\mathrm{good}}$ is violated. For the first condition, we fix $i \in [n]$ and $f \in \hat F$. We note that $\mathrm{err}_S(f)$ is the average of $\frac{c\ln(n|\hat F|/\delta)}{\epsilon}$ independent Bernoulli random variables, each with expectation $\mathrm{err}_{D_i}(f)$. By a Chernoff bound, for a sufficiently large constant $c$ (that depends on $\alpha$), it holds with probability at least $1 - \frac{\delta}{2n|\hat F|}$ that: (1) $\mathrm{err}_{D_i}(f) \le (3 + \alpha/3)\epsilon$ implies $\mathrm{err}_S(f) \le (3 + \frac{2}{3}\alpha)\epsilon$; (2) $\mathrm{err}_{D_i}(f) > (3 + \alpha)\epsilon$ implies $\mathrm{err}_S(f) > (3 + \frac{2}{3}\alpha)\epsilon$. By a union bound over all $f \in \hat F$, the first condition of $E_{\mathrm{good}}$ holds for a specific $i \in [n]$ with probability at least $1 - \delta/(2n)$. By another union bound, the first condition holds for all $i \in [n]$ with probability at least $1 - \delta/2$.

For the second condition, suppose that we reach Line 11 at the $i$-th iteration of the for-loop, and $|\hat F| = r - 1$ at that point. Recall that the $(k, \epsilon)$-realizability of $D_1$ through $D_n$ implies that there exists $f \in F$ such that $\mathrm{err}_{D_i}(f) \le \epsilon$. Then, by Theorem 5.7 of (Anthony & Bartlett, 1999), for some sufficiently large $c$, we have $\mathrm{err}_{D_i}(\hat f_i) \le (1 + \alpha/3)\epsilon$ with probability at least $1 - \delta/(4r^2)$. Since $|\hat F|$ is incremented whenever Line 11 is reached, we only need a union bound over $r = 1, 2, \ldots, n$, and the probability that the second condition is violated is upper bounded by $\sum_{r=1}^{n} \frac{\delta}{4r^2} < \frac{\delta}{2}$.

Finally, yet another union bound gives $\Pr[E_{\mathrm{good}}] \ge 1 - \delta/2 - \delta/2 = 1 - \delta$.

Lemma E.3. When $E_{\mathrm{good}}$ happens, the outputs of Algorithm 3 satisfy $\mathrm{err}_{D_i}(\hat f_i) \le (3 + \alpha)\epsilon$ for every $i \in [n]$.

Proof. Fix $i \in [n]$, and consider the $i$-th iteration of the for-loop in Algorithm 3. If the condition $\mathrm{err}_S(\hat f) \le (3 + \frac{2}{3}\alpha)\epsilon$ holds, the first condition in the definition of $E_{\mathrm{good}}$ implies $\mathrm{err}_{D_i}(\hat f) \le (3 + \alpha)\epsilon$. Then, by setting $\hat f_i$ to $\hat f$, we guarantee that $\hat f_i$ is $(3 + \alpha)\epsilon$-accurate for $D_i$. Otherwise, we pick $\hat f_i$ on Line 11, in which case the second condition of $E_{\mathrm{good}}$ gives $\mathrm{err}_{D_i}(\hat f_i) \le (1 + \alpha/3)\epsilon \le (3 + \alpha)\epsilon$.

Lemma E.4. When $E_{\mathrm{good}}$ happens, Algorithm 3 runs in $\mathrm{poly}(n, d, 1/\epsilon, \log(1/\delta))$ time, makes at most $k$ calls to the ERM oracle, and takes
$$O\!\left(\frac{kd\log(1/\epsilon) + n\log(n/\delta)}{\epsilon}\right)$$
samples in total.

Proof. The key step of the proof is to show that, when $E_{\mathrm{good}}$ happens, $|\hat F|$ never exceeds $k$ throughout the execution of Algorithm 3.

Upper bounding $|\hat F|$. Suppose towards a contradiction that $|\hat F| > k$ at the end of Algorithm 3, while $E_{\mathrm{good}}$ happens. By the definition of $(k, \epsilon)$-realizability (Definition 3.1), there exist $f^\star_1, \ldots, f^\star_k \in F$ and $c \in [k]^n$ such that $\mathrm{err}_{D_i}(f^\star_{c_i}) \le \epsilon$ holds for every $i \in [n]$. By the pigeonhole principle, there exist $i < j$ such that $c_i = c_j$ and Algorithm 3 increments $|\hat F|$ on both the $i$-th and the $j$-th iterations of the for-loop. During the $i$-th iteration, we add $\hat f_i$ to $\hat F$. By the second condition in the definition of $E_{\mathrm{good}}$, we have $\mathrm{err}_{D_i}(\hat f_i) \le (1 + \alpha/3)\epsilon$.

Now, we use the fact that $D_i$ and $D_j$ share the same marginal over $X$, which we denote by $D_x$. Define the function $p_i: X \to [0, 1]$ as $p_i(x') := \Pr_{(x, y) \sim D_i}[y = 1 \mid x = x']$, i.e., the conditional expectation of $y$ given $x$ according to $D_i$. Define $p_j(x') := \Pr_{(x, y) \sim D_j}[y = 1 \mid x = x']$ analogously. Note that we have the following relation for every $f: X \to \{0, 1\}$ and $i' \in \{i, j\}$:
$$\mathrm{err}_{D_{i'}}(f) = \Pr_{(x, y) \sim D_{i'}}[f(x) \ne y] = \mathbb{E}_{x \sim D_x}\!\left[\Pr_{y \sim \mathrm{Bernoulli}(p_{i'}(x))}[f(x) \ne y]\right] = \mathbb{E}_{x \sim D_x}\!\left[|f(x) - p_{i'}(x)|\right].$$
Since $c_i = c_j$, for every $x \in X$ we have
$$\hat f_i(x) - p_j(x) = \left[\hat f_i(x) - p_i(x)\right] + \left[p_i(x) - f^\star_{c_i}(x)\right] + \left[f^\star_{c_j}(x) - p_j(x)\right].$$
It then follows from the triangle inequality that
$$\mathrm{err}_{D_j}(\hat f_i) = \mathbb{E}_{x \sim D_x}\!\left[\left|\hat f_i(x) - p_j(x)\right|\right] \le \mathbb{E}_{x \sim D_x}\!\left[\left|\hat f_i(x) - p_i(x)\right|\right] + \mathbb{E}_{x \sim D_x}\!\left[\left|p_i(x) - f^\star_{c_i}(x)\right|\right] + \mathbb{E}_{x \sim D_x}\!\left[\left|f^\star_{c_j}(x) - p_j(x)\right|\right] = \mathrm{err}_{D_i}(\hat f_i) + \mathrm{err}_{D_i}(f^\star_{c_i}) + \mathrm{err}_{D_j}(f^\star_{c_j}) \le (3 + \alpha/3)\epsilon.$$
The last step applies $\mathrm{err}_{D_i}(\hat f_i) \le (1 + \alpha/3)\epsilon$, $\mathrm{err}_{D_i}(f^\star_{c_i}) \le \epsilon$, and $\mathrm{err}_{D_j}(f^\star_{c_j}) \le \epsilon$. The first inequality was proved earlier; the second and the third follow from our choice of $f^\star_1, \ldots, f^\star_k$ and $c \in [k]^n$.

Finally, by the first condition in the definition of $E_{\mathrm{good}}$, $\mathrm{err}_{D_j}(\hat f_i) \le (3 + \alpha/3)\epsilon$ implies that, during the $j$-th iteration of the for-loop, we have $\mathrm{err}_S(\hat f_i) \le (3 + \frac{2}{3}\alpha)\epsilon$ on Line 5. Then, $|\hat F|$ is not incremented during the $j$-th iteration, which leads to a contradiction.

Oracle calls, runtime, and sample complexity. We first note that $|\hat F|$ is incremented each time we call the ERM oracle on Line 11. Therefore, we make at most $k$ calls to the ERM oracle.
Other than this step, the remainder of Algorithm 3 can clearly be implemented in polynomial time. Finally, to bound the sample complexity, we note that in every iteration of the for-loop,
$$\frac{c\ln(n|\hat F|/\delta)}{\epsilon} = O\!\left(\frac{\log(n/\delta)}{\epsilon}\right)$$
samples are drawn. In addition, each time $|\hat F|$ is incremented, another $O\!\left(\frac{d\log(1/\epsilon) + \log(k/\delta)}{\epsilon}\right)$ samples are drawn. Therefore, the total sample complexity is upper bounded by
$$n\cdot O\!\left(\frac{\log(n/\delta)}{\epsilon}\right) + k\cdot O\!\left(\frac{d\log(1/\epsilon) + \log(k/\delta)}{\epsilon}\right) = O\!\left(\frac{kd\log(1/\epsilon) + n\log(n/\delta)}{\epsilon}\right).$$

Finally, we put everything together to prove Theorem 3.10.

Proof of Theorem 3.10. By Lemmas E.2, E.3, and E.4, conditioning on an event that happens with probability at least $1 - \delta$, Algorithm 3 returns $(3 + \alpha)\epsilon$-accurate classifiers for all the $n$ distributions, and the runtime, the number of ERM oracle calls, and the number of samples are bounded accordingly. To control the (unconditional) sample complexity, runtime, and number of oracle calls, we simply terminate the algorithm when any of these quantities exceeds the corresponding bound. The resulting algorithm is still $((3 + \alpha)\epsilon, \delta)$-PAC, and satisfies the desired upper bounds on the sample complexity, runtime, and number of oracle calls.

F. Proofs for Section 7

F.1. Technical Lemmas

We state and prove a few additional technical lemmas that will be useful for proving the two cases of Theorem 3.16. As in the analyses in the previous sections, we define a good event that implies the success of Algorithm 1.

Definition F.1. Let $E_{\mathrm{good}}$ denote the event that the following happen simultaneously when Algorithm 1 is executed:

- The condition in Lemma 7.1 holds at every iteration $r$.
- Whenever Line 16 is reached, $\mathrm{err}_{D_v}(\hat g_i) \le \epsilon/4$ implies $\mathrm{err}_{S_v}(\hat g_i) \le \epsilon/2$, and $\mathrm{err}_{D_v}(\hat g_i) > \epsilon$ implies $\mathrm{err}_{S_v}(\hat g_i) > \epsilon/2$.

Lemma F.2. When Algorithm 1 is executed with a sufficiently large constant $c$, it holds that $\Pr[E_{\mathrm{good}}] \ge 1 - \delta$.

Proof. By Lemma 7.1, the probability that the condition in Lemma 7.1 is violated in the $r$-th iteration is at most $\delta^{(r)}/6$. Summing over all $r$ gives $\sum_{r=1}^{+\infty} \frac{\delta^{(r)}}{6} < \delta/3$. By the same argument as in the proof of Lemma B.2, the probability that the second condition is violated is also at most $\delta/3$. By a union bound, $\Pr[E_{\mathrm{good}}] \ge 1 - \delta/3 - \delta/3 \ge 1 - \delta$.

Analogous to Lemma B.3, we have the following lemma, which states that event $E_{\mathrm{good}}$ guarantees that the classifiers returned by the algorithm are accurate.

Lemma F.3. When $E_{\mathrm{good}}$ happens, the output of Algorithm 1 satisfies $\mathrm{err}_{D_i}(\hat f_i) \le \epsilon$ for every $i \in [n]$.

Finally, we prove that the number of active distributions, $|G^{(r)}|$, decreases at an exponential rate, so the while-loop is executed at most $O(\log n)$ times. This will be useful for upper bounding the sample complexity.

Lemma F.4. When event $E_{\mathrm{good}}$ happens, $|G^{(r+1)}| \le \frac{3}{4}|G^{(r)}|$ holds at the end of the $r$-th iteration of the while-loop.

Proof. Consider the $r$-th iteration of the while-loop. For brevity, we drop the superscript in $\gamma^{(r)}$. Since $\sum_{i=1}^{\gamma} |G_i| = |G^{(r)}|$, we have
$$\sum_{i=1}^{\gamma} |G_i|\cdot 1\!\left\{|G_i| \ge |G^{(r)}|/(2\gamma)\right\} = |G^{(r)}| - \sum_{i=1}^{\gamma} |G_i|\cdot 1\!\left\{|G_i| < |G^{(r)}|/(2\gamma)\right\} \ge |G^{(r)}| - \gamma\cdot\frac{|G^{(r)}|}{2\gamma} = |G^{(r)}|/2.$$
Fix $i \in [\gamma]$ that satisfies $|G_i| \ge |G^{(r)}|/(2\gamma)$. By the first condition in the definition of $E_{\mathrm{good}}$, event $E_{\mathrm{good}}$ implies that $\frac{1}{|G_i|}\sum_{v \in G_i} \mathrm{err}_{D_v}(\hat g_i) \le \epsilon/8$. By Markov's inequality, there are at least $|G_i|/2$ elements $v \in G_i$ such that $\mathrm{err}_{D_v}(\hat g_i) \le \epsilon/4$. Then, by the second condition in the definition of $E_{\mathrm{good}}$, every such element $v$ will not appear in $G^{(r+1)}$. Therefore, we conclude that
$$|G^{(r+1)}| \le |G^{(r)}| - \frac{1}{2}\sum_{i=1}^{\gamma} |G_i|\cdot 1\!\left\{|G_i| \ge |G^{(r)}|/(2\gamma)\right\} \le |G^{(r)}| - \frac{|G^{(r)}|}{4} = \frac{3}{4}|G^{(r)}|.$$
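For concreteness, iterating Lemma F.4 gives the bound on the number of rounds that is used in the next two subsections:
$$|G^{(r)}| \le \frac{3}{4}|G^{(r-1)}| \le \cdots \le \left(\frac{3}{4}\right)^{r-1}|G^{(1)}| = \left(\frac{3}{4}\right)^{r-1} n,$$
which drops below 1 (so the while-loop must have terminated) once $r > 1 + \log_{4/3} n$, i.e., after $O(\log n)$ rounds.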
F.2. The Bipartite Case

We start with the simpler case $k = 2$. In this case, we set $m^{(r)}$ according to Lemma 7.1 and set $\gamma^{(r)} = 2$ in Algorithm 1. Furthermore, the coloring algorithm is simply the efficient algorithm for 2-coloring.

Proof of Theorem 3.16 (the $k = 2$ case). In light of Lemmas F.2 and F.3, it remains to upper bound the sample complexity of Algorithm 1 under event $E_{\mathrm{good}}$. As a simple corollary of Lemma F.4, we have $|G^{(r)}| \le (3/4)^{r-1} n$ at the $r$-th iteration of the while-loop. The sample complexity of the $r$-th iteration of the while-loop is upper bounded by
$$m^{(r)}|G^{(r)}| + |G^{(r)}|\cdot\frac{c\ln(|G^{(r)}|/\delta^{(r)})}{\epsilon} \lesssim \frac{d\log(1/\epsilon) + |G^{(r)}| + \log(1/\delta^{(r)})}{\epsilon} + \frac{|G^{(r)}|\log(|G^{(r)}|/\delta^{(r)})}{\epsilon} \lesssim \frac{d\log(1/\epsilon) + (3/4)^r n + \log(1/\delta) + \log r}{\epsilon} + \frac{(3/4)^r n\left[\log((3/4)^r n) + \log(1/\delta) + \log r\right]}{\epsilon}.$$
Since the while-loop terminates when $G^{(r)}$ is empty, there are at most $O(\log n)$ iterations, and summing the above over all rounds gives
$$O\!\left(\frac{d\log(1/\epsilon)\log n + n + \log(1/\delta)\log n + \log^2 n + n\log n + n\log(1/\delta)}{\epsilon}\right) = O\!\left(\frac{\left[d\log(1/\epsilon) + n\right]\log n + n\log(1/\delta)}{\epsilon}\right).$$
Therefore, we have the desired sample complexity upper bound.

F.3. The General Case

When $k \ge 3$, we can no longer find a $k$-coloring efficiently. Instead, we compute an approximate coloring with $O(|G^{(r)}|^{c_k})$ colors, where $|G^{(r)}|$ is the number of vertices in the graph. The hope is that, as long as $c_k < 1$, we can still combine the datasets from vertices that share the same color and thus use the data more efficiently. Formally, let $\alpha$ be a constant such that there is an efficient algorithm that colors every $k$-colorable graph using at most $\alpha\cdot|V|^{c_k}$ colors, where $|V|$ is the number of vertices. We set $\gamma^{(r)} = \alpha\cdot|G^{(r)}|^{c_k}$ and set $m^{(r)}$ according to Lemma 7.1.

Proof of Theorem 3.16 (the $k \ge 3$ case). Again, we focus on upper bounding the sample complexity. The number of samples drawn in the $r$-th round is at most
$$m^{(r)}|G^{(r)}| + |G^{(r)}|\cdot\frac{c\ln(|G^{(r)}|/\delta^{(r)})}{\epsilon} \lesssim \frac{|G^{(r)}|^{c_k}\left[d\log(1/\epsilon) + |G^{(r)}| + \log(1/\delta^{(r)})\right]}{\epsilon} + \frac{|G^{(r)}|\log(|G^{(r)}|/\delta^{(r)})}{\epsilon}.$$
Plugging $|G^{(r)}| \le (3/4)^{r-1} n$ into the above and summing over $r = 1, 2, \ldots$ gives
$$O\!\left(\frac{d n^{c_k}\log(1/\epsilon) + n^{1+c_k} + n^{c_k}\log(1/\delta) + n\log n + n\log(1/\delta)}{\epsilon}\right) = O\!\left(\frac{\left[d\log(1/\epsilon) + n\right] n^{c_k} + n\log(1/\delta)}{\epsilon}\right).$$
Therefore, we have the desired sample complexity upper bound.
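Since Algorithm 1 is defined in the main text and is not reproduced in this appendix, the following Python sketch only illustrates the round structure that the lemmas above rely on: the active distributions are grouped by an (exact or approximate) coloring, one classifier is learned per color class, and a distribution is retired once the empirical test from Definition F.1 passes. Every helper name here (color_graph, learn_group, test_error) is an assumed placeholder, not the paper's actual interface.

```python
def one_round(active, color_graph, learn_group, test_error, gamma, eps):
    """Illustrative sketch of one while-loop round in the style of Section 7.

    active:      the current set G^{(r)} of distribution indices.
    color_graph: partitions `active` into at most gamma groups
                 (gamma = 2 for k = 2, or alpha * |active|^{c_k} for k >= 3).
    learn_group: learns one classifier g_i from the pooled samples of a group.
    test_error:  empirical error estimate err_{S_v}(g_i) on a fresh test set.
    Returns the classifiers assigned this round and the remaining active set.
    """
    assignments = {}
    groups = color_graph(active, gamma)
    for group in groups:
        g_i = learn_group(group)               # one ERM call per color class
        for v in group:
            if test_error(v, g_i) <= eps / 2:  # the epsilon/2 test on Line 16
                assignments[v] = g_i           # v becomes inactive next round
    remaining = [v for v in active if v not in assignments]
    return assignments, remaining
```

Lemma F.4 guarantees that, under the good event, `remaining` shrinks by a constant factor per round, which is what drives the $O(\log n)$ iteration count used in both cases above.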