Derandomizing Multi-Distribution Learning

Kasper Green Larsen, Department of Computer Science, Aarhus University, larsen@cs.au.dk
Omar Montasser, Department of Statistics and Data Science, Yale University, omar.montasser@yale.edu
Nikita Zhivotovskiy, Department of Statistics, University of California, Berkeley, zhivotovskiy@berkeley.edu
(This work was primarily done while the author was a FODSI-Simons postdoc at UC Berkeley.)

Abstract

Multi-distribution or collaborative learning involves learning a single predictor that works well across multiple data distributions, using samples from each during training. Recent research on multi-distribution learning, focusing on binary loss and finite VC dimension classes, has shown near-optimal sample complexity, achieved with oracle-efficient algorithms. That is, these algorithms are computationally efficient given an efficient ERM for the class. Unlike in classical PAC learning, where the optimal sample complexity is achieved with deterministic predictors, current multi-distribution learning algorithms output randomized predictors. This raises the question: can these algorithms be derandomized to produce a deterministic predictor for multiple distributions? Through a reduction to discrepancy minimization, we show that derandomizing multi-distribution learning is computationally hard, even when ERM is computationally efficient. On the positive side, we identify a structural condition enabling an efficient black-box reduction, converting existing randomized multi-distribution predictors into deterministic ones.

1 Introduction

We consider the problem of multi-distribution learning, where there are k unknown data distributions P = {D_1, ..., D_k} over X × {−1, 1}, where X is an input domain and {−1, 1} are the possible labels. The goal is to learn a classifier f : X → {−1, 1} that satisfies

er_P(f) := max_i er_{D_i}(f) ≤ min_{h∈H} max_i er_{D_i}(h) + ε, where er_{D_i}(f) = Pr_{(x,y)∼D_i}[f(x) ≠ y].    (1)

Here H ⊆ {−1, 1}^X is the benchmark hypothesis class of VC-dimension d that the learner competes against, and min_{h∈H} max_i er_{D_i}(h) is the optimal worst-case error that can be achieved with classifiers from H. The framework of multi-distribution learning, introduced by Haghtalab et al. [10], is a natural generalization of agnostic PAC learning [22, 21, 5], and captures several important applications such as min-max fairness [12, 18, 15, 8, 20] and group distributionally robust optimization [16]. In the realizable setting, where min_{h∈H} er_P(h) = 0, there is a learning algorithm using Õ((d + k)/ε) samples that produces such a deterministic classifier f; see e.g. [4, 7, 13]. Here, and throughout the paper, Õ hides factors that are poly ln(dk/(εδ)).

In the more challenging agnostic setting, where OPT := min_{h∈H} er_P(h) is greater than 0, recent works show that the sample complexity is Õ((d + k)/ε²) [10, 2, 14, 23]. We refer the reader to Table 1 in [23] for a detailed sample complexity comparison of prior algorithms. Importantly, the guarantee provided by all existing algorithms is slightly different from objective (1) above. Concretely, all previous algorithms do not produce a deterministic classifier f : X → {−1, 1}, but instead output a distribution F over H such that

max_i E_{f∼F}[er_{D_i}(f)] ≤ min_{h∈H} max_i er_{D_i}(h) + ε.    (2)
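To make the difference between guarantees (1) and (2) concrete in code, the following minimal Python sketch (ours, not from the paper; distributions are represented as explicit probability tables, which is only feasible for toy examples) evaluates both objectives over a finite domain:

```python
# Illustrative helpers (not from the paper): each distribution D_i is a dict
# mapping a labeled example (x, y) to its probability.

def er(f, D):
    """Error of a deterministic classifier f under a single distribution D."""
    return sum(p for (x, y), p in D.items() if f(x) != y)

def worst_case_error(f, dists):
    """Objective (1): worst-case error of a single deterministic classifier f."""
    return max(er(f, D) for D in dists)

def randomized_worst_case(F, dists):
    """Objective (2): F is a list of (weight, classifier) pairs; the error is
    averaged over F separately for each D_i before taking the max over i."""
    return max(sum(w * er(f, D) for w, f in F) for D in dists)
```

The crucial difference is the order of the max and the expectation: in (2), the averaging over F happens inside the max over distributions.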
Since classical PAC bounds, which involve learning from a single distribution, are achieved using deterministic predictors, it is somewhat unsatisfactory to always output a randomized predictor in the multi-distribution case. Observe that because, as in (2), we want optimal performance simultaneously for all distributions, even using a randomized algorithm is somewhat problematic. Indeed, assume that in practice we want to sample a single f̂ according to F and use it as our predictor. If we seek a guarantee like (1) for f̂, then the best we can guarantee from (2) is obtained via Markov's inequality and a union bound over all k distributions, ensuring that

er_{D_i}(f̂) ≤ 2k (min_{h∈H} max_i er_{D_i}(h) + ε) for all i,

with probability at least 1/2, which is, of course, too conservative. Let us also remark that there are examples of distributions F for which this is essentially tight. Consider e.g. an input domain X = {x_1, ..., x_k} and k hypotheses h_1, ..., h_k such that h_i(x_i) = −1 and h_i(x_j) = 1 for j ≠ i. Let D_i be the distribution that returns (x_i, 1) with probability 1. Then for the uniform distribution F = k^{−1} Σ_i h_i over classifiers, we have max_i E_{f∼F}[er_{D_i}(f)] = 1/k, but for any single f in the support of F, we have er_P(f) = max_i er_{D_i}(f) = 1. The example also shows that for every fixed distribution D_i, if we sample an f from F, then with probability 1/k its error exceeds the expectation by a factor of k for that distribution D_i. There may thus be a large gap between the guarantees of a deterministic and a randomized classifier, i.e., the bounds in (1) and (2) are quite different.

The main focus of our work is on replacing the random classifiers in previous works on agnostic multi-distribution learning by deterministic classifiers, and on understanding the inherent complexity of doing so. In particular, we are interested in understanding any inherent statistical or computational gaps in multi-distribution learning between deterministic classifiers and randomized classifiers.

Our contributions. Our first contribution is a strong negative result towards derandomizing previous classifiers. Recall that the complexity class BPP denotes bounded-error probabilistic polynomial time (we refer to the monograph [19] as a standard reference discussing computational complexity classes); that is, problems that have polynomial-time randomized algorithms that are correct with probability at least 2/3 on every input. It is conjectured that P = BPP, and thus most likely BPP ≠ NP. Recall that a set of n points is shattered by H if each of the 2^n possible labelings of the points can be realized by some h ∈ H. Our negative result is then the following.

Theorem 1. If BPP ≠ NP, then as n = min{d, k, 1/ε} tends to infinity, for every hypothesis class H of VC-dimension d for which one can find n points shattered by H in polynomial time, any multi-distribution learning algorithm for H that, on the set of k input distributions P = {D_1, ..., D_k}, with probability at least 2/3 produces a deterministic classifier f : X → {−1, 1} with er_P(f) ≤ min_{h∈H} er_P(h) + ε, must have either n^{ω(1)} (i.e., super-polynomial) training time, or f must have n^{ω(1)} evaluation time.

We remark that this computational hardness result holds even when the class H admits efficient Empirical Risk Minimization (ERM), and even when the distributions are known to the learning algorithm. This highlights that the hardness stems neither from the need to sample from the underlying distributions nor from the hardness of ERM, but from the computational problem of deciding which label to assign to the points of the input domain.
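The gap example given earlier in this section (the uniform distribution F over h_1, ..., h_k) is easy to verify numerically. A self-contained sketch, with all names ours:

```python
# The gap example: D_i puts all its mass on (x_i, +1), and h_i errs exactly on x_i.
k = 5
dists = [{(i, 1): 1.0} for i in range(k)]

def h(i):
    return lambda x: -1 if x == i else 1

def er(f, D):
    return sum(p for (x, y), p in D.items() if f(x) != y)

# Guarantee (2): averaging over the uniform mixture before the max gives 1/k.
assert max(sum(er(h(j), D) for j in range(k)) / k for D in dists) == 1 / k
# Guarantee (1) fails badly for every single member: worst-case error 1.
assert all(max(er(h(j), D) for D in dists) == 1 for j in range(k))
```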
Note that the assumption in Theorem 1 that one can find a set of n shattered points in polynomial time is not restrictive. Finding such points is trivial for many H: e.g., simply choose the points 0, e_1, ..., e_{d−1} ∈ R^{d−1} = X for linear classifiers, which have VC-dimension d. More generally, the standard result on the class of classifiers induced by positive halfspaces in R^d shows that this class has VC-dimension d, and for any set of points such that at most d of its points lie on a single hyperplane, any subset of size d of this set is shattered. Similar properties are also known for the classes induced by balls in R^p and by positive sets in the plane defined by polynomials of degree at most p − 1. See [9] for a detailed exposition of these examples.

While this might have been the end of the story, our NP-hardness proof fortunately highlights a path to circumventing the lower bound. In particular, the proof carefully uses data distributions D_1, ..., D_k for which D_i(y | x) varies between the distributions. Here D_i(y | x) denotes the conditional distribution of the label y of a sample (x, y) given x ∈ X. We thus consider the following restricted version of collaborative learning, in which D_i(y | x) = D_j(y | x) for all x, i, j. That is, the k different distributions may vary arbitrarily over X, but the label y of any x ∈ X follows the same distribution under all D_i. As a particular model of label-consistent learning, one may think of a deterministic labeling setup, where it is assumed that there is an f* : X → {−1, 1} such that y = f*(x) across all distributions, while no assumption is made that f* belongs to H. Remarkably, in terms of sample complexity, in the case of a single distribution, the case of deterministic labeling is almost as hard as the general agnostic case, as shown in [3]. Thus, we believe our label-consistent multi-distribution learning setup is quite natural and interesting.

Furthermore, this restriction turns out to be sufficient for derandomizing multi-distribution learning algorithms. In particular, we give a new algorithm, Algorithm 1, that uses a randomized multi-distribution learning algorithm (i.e., an algorithm that, given the training data, outputs a randomized predictor as in (2)) as a black box, and produces from it a deterministic classifier, as in (1).

Theorem 2. For any finite domain X, if the data distributions D_1, ..., D_k are label-consistent, then given a multi-distribution learning algorithm A that uses m(k, d, OPT, ε, δ) samples and t(k, d, OPT, ε, δ) training time to produce, with probability 1 − δ, a distribution F over classifiers from H satisfying max_i E_{f∼F}[er_{D_i}(f)] ≤ OPT + ε, Algorithm 1 produces, with probability 1 − δ, a classifier f : X → {−1, 1} with er_P(f) ≤ OPT + ε using sample complexity m(k, d, OPT, ε/2, δ/2) + O(k ln²(k/(εδ))/ε²). Using the additional ideas in Section 3.1, the training time of Algorithm 1 is t(k, d, OPT, ε/2, δ/2) + O(k/ε² + ln(|X|/δ)). If the evaluation time of hypotheses in H is bounded by s, then the evaluation time of the classifier f is bounded by O(s|F|) + O(ln(k/δ) ln(|X|/ε)).

Note that several of the previous randomized multi-distribution learning algorithms are indeed computationally efficient as long as ERM is efficient over H.
This includes the algorithm in [23], which has a near-optimal sample complexity of m(k, d, OPT, ε, δ) = Õ((d + k)/ε²) with t(k, d, OPT, ε, δ) = poly(k, d, ε^{−1}, ln(1/δ)) · t_ERM and |F| = poly(k, d, ε^{−1}, ln(1/δ)), where t_ERM denotes the time complexity of ERM over H. Plugging this into Theorem 2 gives a polynomial-time deterministic multi-distribution learning algorithm.

We view the restriction to finite domains X in Theorem 2 as rather mild, as any realistic implementation of a learning algorithm requires an input representation that can be stored on a computer. Moreover, our running-time dependency on |X| is only logarithmic. Even so, in Section 3.2 we give some initially promising directions for extending our algorithm to infinite X.

Discussion of implications. Prior work has shown that in agnostic multi-distribution learning, a sample complexity of Ω(dk/ε²), which is worse than the optimal Õ((d + k)/ε²) sample complexity, is unavoidable with proper learning algorithms, i.e., algorithms restricted to outputting a classifier in the class H [23, Theorem 18]. In contrast, our negative result in Theorem 1 implies that there is no sample-efficient and oracle-efficient multi-distribution learning algorithm that aggregates multiple ERM predictors in polynomial time. For example, our result rules out the simple majority-vote aggregation approach (which is feasible in the realizable setting, when OPT = 0). Note, however, that this does not rule out the existence of computationally inefficient aggregation approaches for constructing deterministic predictors. That is, putting computational efficiency aside, it is still an open question whether there exists a sample-efficient and oracle-efficient multi-distribution learning algorithm that outputs a deterministic predictor, and we know from the lower bound of Zhang et al. [23, Theorem 18] that this predictor must be improper.

2 Hardness of derandomization

In this section, we prove that it is NP-hard to derandomize multi-distribution learning in the most general setup of input distributions D_i over X × {−1, 1}. In particular, the hardness proof carefully exploits that different data distributions may assign different labels to the same x ∈ X. Our NP-hardness proof goes via a reduction from Discrepancy Minimization. In Discrepancy Minimization, we are given as input an n × n matrix A with 0/1 entries. The goal is to find a coloring z ∈ {−1, 1}^n such that every entry of Az is as small as possible in absolute value. Formally, we seek to minimize ‖Az‖_∞. The seminal work by Charikar et al. [6] showed NP-hardness of computing the best coloring. In full detail, their result is as follows.

Theorem 3 ([6]). There is a constant c > 0 such that it is NP-hard to distinguish whether an input matrix A ∈ {0, 1}^{n×n} has ‖Az‖_2 ≥ cn for all z ∈ {−1, 1}^n, or whether there exists z ∈ {−1, 1}^n with Az = 0.

Since ‖Az‖_∞ ≥ ‖Az‖_2/√n, this similarly implies that it is NP-hard to distinguish whether all z have ‖Az‖_∞ ≥ c√n, or there is a z with Az = 0.

Let us now use Theorem 3 to prove our hardness result, Theorem 1. We remark that NP-hardness is formally defined in a uniform model of computation, where a Turing machine takes an encoded input on a tape and decides language membership. As we believe our reduction is clear without going into such formalities, we have deferred a discussion of how to formalize multi-distribution learning in a uniform model of computation to Appendix 4.
Proof. Let n = min{d, k/2, c²/(4ε²)} and let H be an arbitrary hypothesis class of VC-dimension d for which we can find a set of n points that are shattered by H in n^{O(1)} time; this is possible by our assumption. Let A denote an arbitrary multi-distribution learning algorithm that outputs deterministic classifiers. Given a matrix A ∈ {0, 1}^{n×n} such that either ‖Az‖_∞ ≥ c√n for all z ∈ {−1, 1}^n, or there exists a z ∈ {−1, 1}^n with Az = 0, we will now use the learning algorithm to correctly distinguish these two cases with probability at least 2/3, thus concluding that its running time is super-polynomial unless BPP = NP.

Start by computing an arbitrary set x_1, ..., x_n of n points that are shattered by H. Now define 2n distributions D_1^+, D_1^−, ..., D_n^+, D_n^−. The distributions D_i^+ and D_i^− are both defined from the i-th row of A. If m_i denotes the number of ones in the i-th row of A, we let D_i^+ return the sample (x_j, 1) with probability 1/m_i for each j with a_{i,j} = 1. The distribution D_i^− similarly returns (x_j, −1) with probability 1/m_i for each j with a_{i,j} = 1. Observe that these distributions can be described using n bits each.

Now consider running the multi-distribution learning algorithm on the distributions D_1^+, D_1^−, ..., D_n^+, D_n^− to obtain a deterministic classifier f : X → {−1, 1}. Evaluate f on x_1, ..., x_n and compute er_P(f). This can be done trivially in polynomial time using the definitions of the distributions. If er_P(f) < 1/2 + ε, then output that there exists z ∈ {−1, 1}^n such that Az = 0. Otherwise, output that no such z exists. Clearly this runs in polynomial time. It thus remains to argue correctness.

Consider first the case where there exists z ∈ {−1, 1}^n with Az = 0. Since z has inner product 0 with every row of A, it assigns 1 to precisely half of the non-zero entries of the i-th row and −1 to the remaining half. The labeling z of x_1, ..., x_n thus has er_{D_i^+}(z) = er_{D_i^−}(z) = 1/2, and er_P(z) = 1/2. Furthermore, since x_1, ..., x_n are shattered by H, it follows that min_{h∈H} er_P(h) ≤ 1/2. By correctness of the learning algorithm, it must hold with probability at least 2/3 that we correctly output that there exists z with Az = 0.

Consider next the case where every z ∈ {−1, 1}^n has ‖Az‖_∞ ≥ c√n. It follows that there is a row a_i such that the vector v = (f(x_1), ..., f(x_n)) has |v^T a_i| ≥ c√n. Let σ = sign(v^T a_i). Then

er_{D_i^{−σ}}(f) = (1/m_i) Σ_{j: a_{i,j}=1} 1{f(x_j) = σ} = (1/m_i) Σ_{j: a_{i,j}=1} (f(x_j)σ + 1)/2 = 1/2 + σ v^T a_i/(2m_i) = 1/2 + |v^T a_i|/(2m_i) ≥ 1/2 + c/(2√n),

where the last step uses m_i ≤ n. Since we chose n = min{d, k/2, c²/(4ε²)}, we have c/(2√n) ≥ ε, and thus we return with probability 1 that all z ∈ {−1, 1}^n have ‖Az‖_∞ ≥ c√n.

Let us end by observing that the distributions used in the above hardness result have OPT ≥ 1/2. The proof can be modified to prove lower bounds for smaller OPT′ by adding a dummy point x_0 and letting all distributions return (x_0, 1) with probability 1 − 2OPT′, and each of the points in the above distributions with probability 2OPT′/m_i. This reduces the value of OPT to around OPT′. However, we then also need to reduce n to min{d, k/2, (c · OPT′/ε)²}. This agrees well with the fact that for realizable multi-distribution learning, i.e., OPT = 0, it is in fact possible to compute a deterministic classifier in polynomial time.
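For intuition, the reduction in the proof above is mechanical enough to render in a few lines of Python. This is an illustrative sketch only: `learn_deterministic` stands in for a hypothetical deterministic multi-distribution learner (whose efficient existence Theorem 1 rules out under BPP ≠ NP), and all other names are ours. We assume every row of A contains at least one 1.

```python
def distinguish(A, learn_deterministic, eps):
    """Reduction from discrepancy minimization: returns True iff we claim that
    some coloring z in {-1,1}^n has Az = 0. Points are identified with 0..n-1."""
    dists = []
    for row in A:
        support = [j for j, a in enumerate(row) if a == 1]
        m_i = len(support)
        dists.append({(j, +1): 1.0 / m_i for j in support})  # D_i^+
        dists.append({(j, -1): 1.0 / m_i for j in support})  # D_i^-
    f = learn_deterministic(dists)  # deterministic classifier {0..n-1} -> {-1, 1}
    # er_P(f) is computable exactly, since the 2n distributions are known.
    er_P = max(sum(p for (x, y), p in D.items() if f(x) != y) for D in dists)
    return er_P < 0.5 + eps
```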
3 Deterministic multi-distribution learner

In this section, we give our algorithm for derandomizing multi-distribution learners for label-consistent distributions, i.e., we assume D_i(y | x) = D_j(y | x) for all i, j, x. We start by presenting the high-level ideas of our algorithm. Recall that we defined OPT = min_{h∈H} er_P(h), where P = {D_1, ..., D_k}.

First, consider running any of the previous randomized multi-distribution learners, producing a distribution F over hypotheses in H satisfying max_i E_{f∼F}[er_{D_i}(f)] ≤ OPT + ε/2. Consider randomly rounding this distribution to a deterministic classifier f̂ as follows: for every x ∈ X independently (recall that we focus on finite domains), sample an f ∼ F and let f̂(x) = f(x). For any distribution D_i, we clearly have E_{f̂}[er_{D_i}(f̂)] = E_{f∼F}[er_{D_i}(f)] ≤ OPT + ε/2. However, as also discussed in the introduction, it is not clear that we can union bound over all k distributions and argue that er_{D_i}(f̂) ≤ OPT + ε for all of them simultaneously.

Notice, however, that the independent choice of f̂(x) for each x gets us most of the way. Indeed, if we let Z_x be a random variable (determined by f̂(x)) giving Pr_{y∼D(y|x)}[f̂(x) ≠ y], then er_{D_i}(f̂) = Σ_{x∈X} D_i(x) Z_x, where D_i(x) denotes the probability of x under D_i and D_i(y | x) gives the conditional distribution of the label y given x. Now notice that D_i(x)Z_x is a random variable taking values in {(1/2 − |β_x|)D_i(x), (1/2 + |β_x|)D_i(x)}, where β_x = Pr_{y∼D_1(y|x)}[y = 1] − 1/2 denotes the bias of the label of x. Furthermore, these random variables are independent, and we also have E[er_{D_i}(f̂)] = E_{f∼F}[er_{D_i}(f)]. Thus, by Hoeffding's inequality,

Pr[|er_{D_i}(f̂) − E_{f∼F}[er_{D_i}(f)]| > ε/2] < 2 exp(−ε² / (2 Σ_{x∈X} (2β_x D_i(x))²)).

Examining this expression closely, we observe that this probability is small if β_x D_i(x) is small for all x ∈ X. Using this observation, our algorithm starts by drawing Õ(ε^{−2}) samples from each distribution D_i and collecting all x for which the empirical fraction of 1's and −1's is so biased towards either 1 or −1 that the majority label almost certainly equals sign(β_x). We then let f̂(x) equal this majority label for all such x, and put these x into a set T. What remains is all x ∉ T. Here we show that these x have so little bias, i.e., β_x D_i(x) is so small, that the random rounding strategy above suffices. The full algorithm is shown as Algorithm 1.

Algorithm 1: DETERMINISTICLEARNER(P, ε, δ, A)
Input: Distributions P = {D_1, ..., D_k}, precision ε > 0, failure probability δ > 0, randomized multi-distribution learner A.
Result: Classifier f̂ : X → {−1, 1}.
1. Let C > 0 be a large enough constant.
2. Let γ = Ck/(εδ).
3. Let T = ∅.
4. Run A with P = {D_1, ..., D_k} and H as input, precision ε/2 and failure probability δ/2, to obtain a distribution F over classifiers in H.
5. For i = 1, ..., k:
6.   Draw m = C ln²(γ)/ε² samples {(x_j, y_j)}_{j=1}^m from D_i.
7.   For every x ∈ X \ T such that n_{i,x} := |{j : x_j = x}| > 0, let ρ_{i,x} = (|{j : x_j = x ∧ y_j = 1}| − |{j : x_j = x ∧ y_j = −1}|)/n_{i,x}.
8.   For every x ∈ X \ T such that n_{i,x} > 0: if |ρ_{i,x}| > √(ln(γ)/n_{i,x}), add x to T and let f̂(x) = sign(ρ_{i,x}).
9. For every x ∈ X \ T, independently draw an f ∼ F and let f̂(x) = f(x).
10. Return f̂.

Before giving the formal analysis of the algorithm, note that storing the classifier f̂ is quite expensive, as we need to remember the random choice of f̂(x) for every x ∈ X \ T. This is one place where we use the assumption that X is finite. Note, however, that even for finite X, storing |X| random choices to represent the classifier might be infeasible. Furthermore, the sampling of f̂(x) for every x also adds |X| to the running time, which is again too expensive. We propose a method for reducing the storage and running time requirements later in this section. For now, we analyse Algorithm 1 without worrying about |X|.
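Before turning to the analysis, here is a compact illustrative rendering of Algorithm 1 (ours, not the authors' implementation). It assumes distribution objects with a `.sample()` method and a black-box learner `A` returning a finite weighted support; passing H is elided, and the constant C is a placeholder. The rounding in step 9 is implemented lazily and memoized, so that each f̂(x) is fixed once, matching the algorithm's semantics.

```python
import math, random
from collections import Counter, defaultdict

def deterministic_learner(dists, eps, delta, A, C=100.0):
    k = len(dists)
    gamma = C * k / (eps * delta)
    m = int(C * math.log(gamma) ** 2 / eps ** 2)
    F = A(dists, eps / 2, delta / 2)       # list of (weight, hypothesis) pairs
    f_hat, T = {}, set()
    for D in dists:                        # steps 5-8: find heavily biased points
        counts = defaultdict(Counter)      # x -> Counter of observed labels
        for _ in range(m):
            x, y = D.sample()
            counts[x][y] += 1
        for x, cnt in counts.items():
            if x in T:
                continue
            n = cnt[1] + cnt[-1]
            rho = (cnt[1] - cnt[-1]) / n
            if abs(rho) > math.sqrt(math.log(gamma) / n):
                T.add(x)                   # majority label almost surely equals sign(beta_x)
                f_hat[x] = 1 if rho > 0 else -1
    weights, hs = zip(*F)
    def predict(x):                        # step 9: independent rounding outside T
        if x not in f_hat:
            f_hat[x] = random.choices(hs, weights=weights)[0](x)
        return f_hat[x]
    return predict
```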
Analysis. In our analysis, we separately handle x ∈ T and x ∉ T. The two technical results we need are stated next. First, define the bias β_x of an x ∈ X as Pr_{y∼D_1(y|x)}[y = 1] − 1/2. We say that an x is heavily biased if β_x² D_i(x) > ε²/(8 ln(4k/δ)) for at least one i, and lightly biased otherwise. Intuitively, our algorithm ensures that T contains all heavily biased x and that all predictions made on x ∈ T are correct. This is stated in the following lemma.

Lemma 4. It holds with probability at least 1 − δ/4 that every heavily biased x is in T, and furthermore, for every x ∈ T, we have f̂(x) = sign(β_x).

Next, we also show that random rounding outside T suffices.

Lemma 5. Assume every heavily biased x is in T after the for-loop. Then with probability at least 1 − δ/4 over the random choice of f̂(x) for x ∈ X \ T, it holds for all i that

E_{(x,y)∼D_i}[1{x ∉ T ∧ f̂(x) ≠ y}] ≤ E_{f∼F}[E_{(x,y)∼D_i}[1{x ∉ T ∧ f(x) ≠ y}]] + ε/2.

Before giving the proofs of Lemma 4 and Lemma 5, let us use these two results to complete the proof of Theorem 2.

Proof of Theorem 2. From a union bound together with Lemma 4 and Lemma 5, we have with probability 1 − δ that all of the following hold:
- The invocation of A in step 4 of Algorithm 1 returns a distribution F with max_i E_{f∼F}[er_{D_i}(f)] ≤ OPT + ε/2.
- For every x ∈ T, we have f̂(x) = sign(β_x).
- For every distribution D_i, E_{(x,y)∼D_i}[1{x ∉ T ∧ f̂(x) ≠ y}] ≤ E_{f∼F}[E_{(x,y)∼D_i}[1{x ∉ T ∧ f(x) ≠ y}]] + ε/2.

Assume now that all of the above hold. We rewrite er_P(f̂) by splitting the contributions to the error into x ∈ T and x ∉ T:

er_P(f̂) = max_i er_{D_i}(f̂) = max_i ( E_{(x,y)∼D_i}[1{x ∈ T ∧ f̂(x) ≠ y}] + E_{(x,y)∼D_i}[1{x ∉ T ∧ f̂(x) ≠ y}] ).

Using that f̂(x) = sign(β_x) for x ∈ T, we have E_{(x,y)∼D_i}[1{x ∈ T ∧ f̂(x) ≠ y}] = min_{z∈{−1,1}^T} E_{D_i}[1{x ∈ T ∧ z(x) ≠ y}]. Thus the above is bounded by

max_i ( min_{z∈{−1,1}^T} E_{D_i}[1{x ∈ T ∧ z(x) ≠ y}] + E_{f∼F}[E_{D_i}[1{x ∉ T ∧ f(x) ≠ y}]] ) + ε/2.

Since every f in the support of F is a deterministic classifier, we have min_{z∈{−1,1}^T} E_{D_i}[1{x ∈ T ∧ z(x) ≠ y}] ≤ E_{f∼F}[E_{D_i}[1{x ∈ T ∧ f(x) ≠ y}]]. We therefore have

er_P(f̂) ≤ max_i ( E_{f∼F}[E_{D_i}[1{x ∈ T ∧ f(x) ≠ y}]] + E_{f∼F}[E_{D_i}[1{x ∉ T ∧ f(x) ≠ y}]] ) + ε/2 = max_i E_{f∼F}[er_{D_i}(f)] + ε/2 ≤ OPT + ε.

This completes the proof of Theorem 2.

Proof of Lemma 4. We first define the two types of failures that may occur:
- For every i and every x with β_x² D_i(x) > ε²/(8 ln(4k/δ)), let E¹_{i,x} denote the event that n_{i,x} < (C/2)β_x^{−2} ln γ.
- For every i, let E²_i denote the event that there is an x with n_{i,x} > 0 and |β_x − ρ_{i,x}/2| > √(ln(γ)/(16 n_{i,x})).

Assume first that none of the events occur. Consider a heavily biased x. Then there is an i for which β_x² D_i(x) > ε²/(8 ln(4k/δ)). Since E¹_{i,x} does not occur, we have n_{i,x} ≥ (C/2)β_x^{−2} ln γ. Since E²_i does not occur, we also have |β_x − ρ_{i,x}/2| ≤ √(ln(γ)/(16 n_{i,x})). Hence |ρ_{i,x}| ≥ 2|β_x| − 2√(ln(γ)/(16 n_{i,x})). But |β_x| ≥ √((C/2) ln(γ)/n_{i,x}), and thus |ρ_{i,x}| ≥ (√(2C) − 1/2)√(ln(γ)/n_{i,x}). For C large enough, this is at least √(ln(γ)/n_{i,x}), which puts x in T during step 8 of Algorithm 1. Thus every heavily biased x is in T.

Secondly, note that when an x is added to T in iteration i of the for-loop, we have |ρ_{i,x}| > √(ln(γ)/n_{i,x}). Since E²_i does not occur, we have |β_x − ρ_{i,x}/2| ≤ √(ln(γ)/(16 n_{i,x})). But this implies β_x ∈ [ρ_{i,x}/2 − √(ln(γ)/(16 n_{i,x})), ρ_{i,x}/2 + √(ln(γ)/(16 n_{i,x}))]. Since |ρ_{i,x}| > √(ln(γ)/n_{i,x}), every number in this interval has the same sign as ρ_{i,x}, i.e., f̂(x) = sign(ρ_{i,x}) = sign(β_x).

Thus what remains is to bound the probability of these events. For E¹_{i,x}, fix an i and x with β_x² D_i(x) > ε²/(8 ln(4k/δ)). We have E[n_{i,x}] = D_i(x)m > ε²m/(8β_x² ln(4k/δ)) > Cβ_x^{−2} ln γ. For C large enough, we get from a Chernoff bound that Pr[E¹_{i,x}] = Pr[n_{i,x} < (C/2)β_x^{−2} ln γ] < γ^{−2}. For E²_i, let us first condition on an outcome of the values n_{i,x} for all x.
Then for every x, we have that p_{i,x} := |{j : x_j = x ∧ y_j = 1}| − |{j : x_j = x ∧ y_j = −1}| is distributed as the sum of n_{i,x} independent ±1 random variables taking the value 1 with probability β_x + 1/2. Hence E[p_{i,x}] = 2β_x n_{i,x}. Since ρ_{i,x} = p_{i,x}/n_{i,x}, it follows from Hoeffding's inequality that

Pr[|β_x − ρ_{i,x}/2| > √(ln(γ)/(16 n_{i,x}))] = Pr[|2β_x n_{i,x} − p_{i,x}| > 2n_{i,x}√(ln(γ)/(16 n_{i,x}))] < 2γ^{−2}.

For any fixed values of the n_{i,x}, there are at most m distinct x with a non-zero n_{i,x}. A union bound over all of them implies Pr[E²_i | {n_{i,x}}_x] ≤ 2mγ^{−2}. Since this upper bound holds for any outcome of the n_{i,x}, we also have Pr[E²_i] ≤ 2mγ^{−2}.

We now observe that for every i, there are at most ε^{−2} · 8 ln(4k/δ) distinct x with β_x² D_i(x) > ε²/(8 ln(4k/δ)). Hence Pr[∪_x E¹_{i,x}] ≤ ε^{−2} · 8 ln(4k/δ) · γ^{−2}. A union bound over all i finally implies

Pr[(∪_i ∪_x E¹_{i,x}) ∪ (∪_i E²_i)] ≤ kγ^{−2}(ε^{−2} · 8 ln(4k/δ) + 2m).

Since γ = Ck/(εδ) and m = C ln²(γ)/ε², we have for large enough C that this probability is bounded by δ/4.

Proof of Lemma 5. Fix a distribution D_i. Observe that for any x ∈ X \ T, the distribution of f̂(x) is the same as that of f(x) for f ∼ F. Hence

E_{f̂}[E_{(x,y)∼D_i}[1{x ∉ T ∧ f̂(x) ≠ y}]] = E_{f∼F}[E_{(x,y)∼D_i}[1{x ∉ T ∧ f(x) ≠ y}]].

Denote this expectation by µ. If we let Z_x be the random variable (as a function of f̂(x)) taking the value Pr_{y∼D_i(y|x)}[f̂(x) ≠ y], then

E_{(x,y)∼D_i}[1{x ∉ T ∧ f̂(x) ≠ y}] = Σ_{x∈X\T} D_i(x) Z_x.

Observe that Z_x is either 1/2 − |β_x| or 1/2 + |β_x|, depending on whether f̂(x) = sign(β_x) or not. Hence D_i(x)Z_x ∈ [D_i(x)(1/2 − |β_x|), D_i(x)(1/2 + |β_x|)], and the Z_x are independent. Using Hoeffding's inequality and that all x ∉ T are lightly biased, we thus get

Pr[Σ_{x∈X\T} D_i(x)Z_x > µ + ε/2] ≤ exp(−2(ε/2)² / Σ_{x∈X\T}(2|β_x|D_i(x))²) ≤ exp(−ε² / (Σ_{x∈X\T} D_i(x) ε²/ln(4k/δ))) ≤ exp(−ln(4k/δ)) = δ/(4k).

A union bound over all D_i completes the proof.

3.1 Reducing storage and time

The above description of Algorithm 1 requires storing an independent random choice of f̂(x) for every x ∈ X. This is infeasible for large X, both in terms of space usage and in terms of the time needed for making these random choices. Instead, we can reduce the storage requirements by using an r-wise independent hash function q : X → Y, for a sufficiently large output domain Y, to perform the random rounding. Recall that an r-wise independent hash function hashes any set of up to r distinct keys x_1, ..., x_r independently and uniformly at random into Y. Such a hash function can be implemented using O(r ln(|X||Y|)) bits of space and evaluated in O(r ln(|X||Y|)) time by, e.g., interpreting an x ∈ X as an index into [|X|] = {0, ..., |X| − 1} and letting q(x) = Σ_{i=0}^{r−1} α_i x^i (mod p) for a prime p = |Y| > |X| and coefficients α_i independent and uniformly random in [p]. Using fast multiplication algorithms, q(x) can be evaluated in time O(r ln(|X||Y|)), even when ln(|X||Y|) bits do not fit in a machine word. The time to sample the hash function is only O(r ln(|X||Y|)) (we just need the random coefficients of the polynomial).

Instead of storing f̂(x) for every x ∈ X \ T explicitly, the learning algorithm instead stores q and the distribution F. Given this information, we evaluate f̂(x) by computing q(x) and letting f̂(x) = 1 if q(x) ≤ ⌊Pr_{f∼F}[f(x) = 1] · |Y|⌋ − 1, and f̂(x) = −1 otherwise. Since q(x) is uniform over Y for any x, we have Pr[f̂(x) = 1] = ⌊Pr_{f∼F}[f(x) = 1] · |Y|⌋/|Y|. This probability satisfies Pr_{f∼F}[f(x) = 1] − 1/|Y| ≤ ⌊Pr_{f∼F}[f(x) = 1] · |Y|⌋/|Y| ≤ Pr_{f∼F}[f(x) = 1], and is thus almost the same rounding probability as in Algorithm 1. Since previous multi-distribution learning algorithms also store F, this only adds O(r ln(|X||Y|)) bits to the storage.
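A minimal sketch of this derandomized rounding (ours; the polynomial hash and the floor-based threshold follow the description above, with p playing the role of |Y|):

```python
import random

def make_rwise_hash(r, p):
    """r-wise independent q : [|X|] -> [p]: a uniformly random polynomial of
    degree r-1 over Z_p, for a prime p exceeding |X|."""
    coeffs = [random.randrange(p) for _ in range(r)]
    def q(x):
        acc = 0
        for a in reversed(coeffs):   # Horner evaluation of sum_i a_i x^i mod p
            acc = (acc * x + a) % p
        return acc
    return q

def rounded_classifier(F, q, p):
    """f_hat(x) = 1 iff q(x) <= floor(Pr_F[f(x)=1] * p) - 1, so over the random
    choice of q, Pr[f_hat(x) = 1] is within 1/p of the rounding probability
    used by Algorithm 1."""
    def f_hat(x):
        prob_one = sum(w for w, h in F if h(x) == 1)
        return 1 if q(x) <= int(prob_one * p) - 1 else -1
    return f_hat
```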
What remains is to determine an r and a |Y| for which this is sufficient for the guarantees of Algorithm 1. We will show that r = 2 ln(4k/δ) and |Y| = Θ(ε^{−3} ln(k/δ)) suffice if we increase the sample complexity of Algorithm 1 by a logarithmic factor. Observe that the O(r ln(|X| ln(k/δ)/ε)) extra bits are only proportional to storing O(ln(k/δ)) samples from X, provided that ln(k/δ)/ε is no larger than a polynomial in |X|. The space overhead is thus very minor.

We only give an outline of how to modify the proof in the previous section to work with r-wise independence, as it follows the previous proof rather uneventfully. First, redefine the threshold for being heavily biased to β_x² D_i(x) > ε²/(C′ ln²(4k/δ)) for a large enough constant C′. For the proof of Lemma 4 to still go through, this requires us to increase m by a factor C′ ln γ, i.e., to CC′ ln³(γ)/ε², and also to increase γ by a factor C′, to CC′k/(εδ). Then the only change to the proof is that we have an event E¹_{i,x} for every i and every x with β_x² D_i(x) > ε²/(C′ ln²(4k/δ)); otherwise, all conditions in the events E¹_{i,x} and E²_i remain the same. Thus the proof still goes through if we can argue that Pr[E¹_{i,x}] ≤ γ^{−2}. So fix an i and an x with β_x² D_i(x) > ε²/(C′ ln²(4k/δ)). Then E[n_{i,x}] = D_i(x)m > ε²m/(C′β_x² ln²(4k/δ)) > Cβ_x^{−2} ln γ. This is the same lower bound on E[n_{i,x}] as in the previous proof, and thus we can complete the same steps. Finally, note that we finished the proof of Lemma 4 with a union bound. There we needed kγ^{−2}(ε^{−2} · 8 ln(4k/δ) + 2m) < δ/4, and this is still the case for our new m and γ.

Now, for the proof of Lemma 5, we used Hoeffding's inequality. This requires the random rounding to be independent across different x. With our modified approach, the roundings are only r-wise independent, and we thus need the following variant of Hoeffding's inequality for r-wise independent random variables.

Theorem 6 ([17]). Let Z_1, ..., Z_n be a sequence of r-wise independent random variables, r ≥ 2, with |Z_i − E[Z_i]| ≤ 1 for all outcomes. Let Z = Σ_i Z_i with E[Z] = µ, and let σ²(Z) = Σ_i σ²(Z_i) denote the variance of Z. Then the following holds for even r and any Q ≥ max{r, σ²(Z)}:

Pr[|Z − µ| ≥ T] ≤ (rQ/(e^{2/3} T²))^{r/2}.

If we repeat the proof of Lemma 5, we now define Z_x as the random variable (as a function of the random choice of q) taking the value Pr_{y∼D_i(y|x)}[f̂(x) ≠ y]. Note that Z_x ∈ {1/2 − |β_x|, 1/2 + |β_x|}. This also implies that |Z_x − E[Z_x]| ≤ 2|β_x| for all outcomes of Z_x. When all heavily biased x are in T, we have β_x² D_i(x) ≤ ε²/(C′ ln²(4k/δ)) for all x ∉ T. This implies |β_x| ≤ ε/(ln(4k/δ)√(C′ D_i(x))). Now let α = 2ε/(ln(4k/δ)√C′) and write

E_{(x,y)∼D_i}[1{x ∉ T ∧ f̂(x) ≠ y}] = Σ_{x∈X\T} D_i(x)Z_x = α Σ_{x∈X\T} D_i(x)Z_x/α.

The random variable D_i(x)Z_x/α thus satisfies |D_i(x)Z_x/α − E[D_i(x)Z_x/α]| ≤ 2D_i(x)|β_x|/α ≤ √(D_i(x)) ≤ 1 for all outcomes. This also gives us σ²(D_i(x)Z_x/α) ≤ D_i(x), and thus Σ_{x∈X\T} σ²(D_i(x)Z_x/α) ≤ Σ_{x∈X\T} D_i(x) ≤ 1. Now consider the expected value (writing a ± b for the interval [a − b, a + b]):

E_q[Σ_{x∈X\T} D_i(x)Z_x] = Σ_{x∈X\T} D_i(x) E_q[Pr_{y∼D_i(y|x)}[f̂(x) ≠ y]] ∈ Σ_{x∈X\T} D_i(x)(E_{f∼F}[Pr_{y∼D_i(y|x)}[f(x) ≠ y]] ± 1/|Y|) ⊆ E_{f∼F}[E_{(x,y)∼D_i}[1{x ∉ T ∧ f(x) ≠ y}]] ± 1/|Y|.

Letting µ = E_{f∼F}[E_{(x,y)∼D_i}[1{x ∉ T ∧ f(x) ≠ y}]], we then have by Theorem 6 with Q = r that

Pr[|Σ_{x∈X\T} D_i(x)Z_x − µ| ≥ αT] ≤ (rQ/(e^{2/3}(T − α/|Y|)²))^{r/2}.

Inserting T = ε/(2α) and using r = 2 ln(4k/δ) and |Y| ≥ 4α²/ε gives T − α/|Y| ≥ ε/(4α), and thus finally implies

Pr[ E_{(x,y)∼D_i}[1{x ∉ T ∧ f̂(x) ≠ y}] ≥ E_{f∼F}[E_{(x,y)∼D_i}[1{x ∉ T ∧ f(x) ≠ y}]] + ε/2 ] ≤ (256/(e^{2/3} C′))^{r/2} ≤ e^{−r/2} = δ/(4k).

Here, the last inequality follows for C′ large enough.
Thus, if we increase the sample complexity to m(k, d, OPT, ε/2, δ/2) + O(k ln³(k/(εδ))/ε²), then we may sample and store a hash function using only O(ln(k/δ) ln(|X| ln(k/δ)/ε)) extra bits and O(ln(k/δ) ln(|X| ln(k/δ)/ε)) time.

3.2 Infinite Input Domains

In the above presentation of our algorithm, we have assumed a finite input domain X. While we believe this is a very reasonable assumption, we here present some initial ideas for how this restriction might be circumvented. Assume that the black-box randomized multi-distribution learner A always outputs a distribution F over a finite number of classifiers in H, and let m be an upper bound on the size of the support. Then since H has VC-dimension d, the dual VC-dimension is at most 2^{d+1} [1]. By Sauer-Shelah, this implies that the number of distinct ways an x ∈ X may be labeled by the support of F is bounded by m^{2^{d+1}}, i.e., finite. We believe that treating just the distinct ways x is labeled by the hypotheses in the support should be sufficient to recover our results for finite X.

Acknowledgments

Kasper Green Larsen is co-funded by a DFF Sapere Aude Research Leader Grant No. 9064-00068B by the Independent Research Fund Denmark and co-funded by the European Union (ERC, TUCLA, 101125203). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. Omar Montasser was supported by a FODSI-Simons postdoctoral fellowship at UC Berkeley.

References

[1] P. Assouad. Densité et dimension. Annales de l'Institut Fourier, 33:233-282, 1983.
[2] P. Awasthi, N. Haghtalab, and E. Zhao. Open problem: The sample complexity of multi-distribution learning for VC classes. In Proceedings of the Thirty Sixth Annual Conference on Learning Theory (COLT 2023), volume 195 of Proceedings of Machine Learning Research, pages 5943-5949. PMLR, 2023.
[3] S. Ben-David and R. Urner. The sample complexity of agnostic learning under deterministic labels. In Conference on Learning Theory, pages 527-542. PMLR, 2014.
[4] A. Blum, N. Haghtalab, A. D. Procaccia, and M. Qiao. Collaborative PAC learning. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 2392-2401, 2017.
[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929-965, 1989.
[6] M. Charikar, A. Newman, and A. Nikolov. Tight hardness results for minimizing discrepancy. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2011), pages 1607-1614. SIAM, 2011.
[7] J. Chen, Q. Zhang, and Y. Zhou. Tight bounds for collaborative PAC learning via multiplicative weights. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 3602-3611, 2018.
[8] E. Diana, W. Gill, M. Kearns, K. Kenthapadi, and A. Roth. Minimax group fairness: Algorithms and experiments. In AIES '21: AAAI/ACM Conference on AI, Ethics, and Society, pages 66-76. ACM, 2021.
[9] S. Floyd and M. K. Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Mach. Learn., 21(3):269-304, 1995.
[10] N. Haghtalab, M. I. Jordan, and E. Zhao. On-demand sampling: Learning optimally from multiple distributions. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
[11] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA, 1994.
[12] M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pages 4615-4625. PMLR, 2019.
[13] H. L. Nguyen and L. Zakynthinou. Improved algorithms for collaborative PAC learning. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 7642-7650, 2018.
[14] B. Peng. The sample complexity of multi-distribution learning. CoRR, abs/2312.04027, 2023.
[15] G. N. Rothblum and G. Yona. Multi-group agnostic PAC learnability. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), volume 139 of Proceedings of Machine Learning Research, pages 9107-9115. PMLR, 2021.
[16] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks. In 8th International Conference on Learning Representations (ICLR 2020). OpenReview.net, 2020.
[17] J. P. Schmidt, A. Siegel, and A. Srinivasan. Chernoff-Hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics, 8(2):223-250, 1995.
[18] S. Shekhar, G. Fields, M. Ghavamzadeh, and T. Javidi. Adaptive sampling for minimax fair classification. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pages 24535-24544, 2021.
[19] M. Sipser. Introduction to the theory of computation. SIGACT News, 27(1):27-29, 1996.
[20] C. J. Tosh and D. Hsu. Simple and near-optimal algorithms for hidden stratification and multi-group learning. In International Conference on Machine Learning (ICML 2022), volume 162 of Proceedings of Machine Learning Research, pages 21633-21657. PMLR, 2022.
[21] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, 1984.
[22] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.
[23] Z. Zhang, W. Zhan, Y. Chen, S. S. Du, and J. D. Lee. Optimal multi-distribution learning. CoRR, abs/2312.05134, 2023.
4 Uniform Model of Computation

For a fully formalized NP-hardness proof, we technically need to define an input encoding of a multi-distribution learning problem and argue that the sampling steps may be simulated by a Turing machine. Furthermore, details such as whether the hypothesis class H is part of the input or known to the algorithm also need to be formalized. In this section, we discuss various choices one could make. We note that similar discussions and formalizations of learning in a uniform model of computation have been carefully carried out in classic learning theory books [11].

First, we find it most natural that H is part of the learning problem, i.e., not an input to the algorithm, but allowed to be "hard-coded" into the algorithm. This is the best match to standard learning problems, where e.g. the Support Vector Machine learning algorithm, or Logistic Regression via gradient descent, knows that we are working with linear models. Similarly, the input domain seems best modeled by letting it be known to the algorithm. One tweak could be that if the input consists of d-dimensional vectors, then d could be part of the input to the algorithm. This again matches how most natural learning algorithms work for arbitrary d (and our proof needs d to grow for our n to grow).

Now, regarding modeling multi-distribution learning, we find that the following uniform computational model most accurately matches what the community thinks of as multi-distribution learning (here stated for the input domain being n-dimensional vectors and the hypothesis class being linear models). A solution to multi-distribution learning with linear models is a special Turing machine M. M receives as input a number n on the input tape. In addition to a standard input/output tape and a tape with random bits, M has a "sample" tape, a "target distribution" tape, and a special "sample" state. When M enters the "sample" state, the bits on the "target distribution" tape are interpreted as an index i, and the contents of the "sample" tape are replaced by a binary description of a fresh sample from the distribution D_i (the D_i are only accessible through the "sample" state). A natural assumption here would be that each D_i is only supported on points with integer coordinates bounded by n in magnitude. This gives a natural binary representation of each sample using n log n bits, plus one bit for the label. M runs until terminating in a special halt state, with the promise that regardless of which n distributions D_1, ..., D_n over the input domain are used for generating samples in the "sample" state, it holds with probability at least 2/3 over the samples and the random bits on the tape that the output tape contains a binary encoding of a hyperplane with error at most τ + 1/n for every distribution D_i (where τ denotes the optimal worst-case error achievable by a hyperplane). A bit more generally, we could also let M terminate with an encoding of a Turing machine on its output tape. That Turing machine, upon receiving the encoding of n and an n-dimensional point on its input tape, outputs a prediction on its tape. This allows more general hypotheses than just outputting something from H.

The above special states and tapes are introduced to most accurately represent multi-distribution learning. Now observe that our reduction from discrepancy minimization still goes through. Given such a special Turing machine M for multi-distribution learning, observe that we can obtain a standard (randomized) Turing machine M′ for discrepancy minimization from it.
Concretely, in discrepancy minimization, the input is the integer n and an n × n binary matrix A. As mentioned in our reduction, we can easily compute n shattered points for linear models, e.g. just the standard basis e_1, ..., e_n. Now do as in our reduction and interpret each row of A as two distributions over e_1, ..., e_n. M′ can now simulate the "sample" state, "sample" tape, and "target distribution" tape of M, as it can itself use its random tape to generate samples from the distributions. In this way, M′ can simulate M without the need for special tapes and states, and by the guarantees of M (as in our reduction), it can distinguish whether A has discrepancy 0 or Ω(√n) by using the final output hypothesis of M, evaluating it on e_1, ..., e_n, and computing the error on each of the (known) distributions D_i obtained from the input matrix A.

Note that the reduction would also hold if we rephrased multi-distribution learning such that the algorithm receives some binary encoding of D_1, ..., D_n as input. This would make the reduction even more straightforward, as we need not worry about samples. However, we feel that the above definition, with a special state and tapes for sampling, more accurately represents multi-distribution learning from a learning-theoretic perspective. We thus prefer the slightly more complicated reduction above to better model the problem.