Sample Complexity of Forecast Aggregation

Tao Lin, Harvard University, Cambridge, MA 02138, tlin@g.harvard.edu
Yiling Chen, Harvard University, Cambridge, MA 02138, yiling@seas.harvard.edu

Abstract

We consider a Bayesian forecast aggregation model where $n$ experts, after observing private signals about an unknown binary event, report their posterior beliefs about the event to a principal, who then aggregates the reports into a single prediction for the event. The signals of the experts and the outcome of the event follow a joint distribution that is unknown to the principal, but the principal has access to i.i.d. samples from the distribution, where each sample is a tuple of the experts' reports (not signals) and the realization of the event. Using these samples, the principal aims to find an $\varepsilon$-approximately optimal aggregator, where optimality is measured in terms of the expected squared distance between the aggregated prediction and the realization of the event. We show that the sample complexity of this problem is at least $\tilde{\Omega}(m^{n-2}/\varepsilon)$ for arbitrary discrete distributions, where $m$ is the size of each expert's signal space. This sample complexity grows exponentially in the number of experts $n$. But if the experts' signals are independent conditioned on the realization of the event, then the sample complexity is significantly reduced, to $\tilde{O}(1/\varepsilon^2)$, which does not depend on $n$. Our results can be generalized to non-binary events. The proofs of our results use a reduction from the distribution learning problem and reveal the fact that forecast aggregation is almost as difficult as distribution learning.

1 Introduction

Suppose you want to know whether it will rain tomorrow. A Google search on weather returns a 40% probability of rain. The weather forecasting app on your phone shows 85%. And one of your friends, who is an expert in meteorology, predicts 65%. How do you aggregate these different predictions into a single, accurate prediction? This problem is called forecast aggregation or forecast combination [7, 16, 45]. It has innumerable applications in fields ranging from economics, statistics, and operations research to machine learning, decision theory, and of course, climate science.

A straightforward solution to forecast aggregation is to take the (unweighted) average of all experts' forecasts. Simple as it is, the unweighted average performs surprisingly well in practice, as observed by many forecast aggregation works from the 1960s to the 2020s (e.g., [7, 38, 35, 30, 16, 45, 34]). Naturally, when past data of expert forecasts and event outcomes (e.g., historical weather forecasts and outcomes) are available, one may hope to leverage such data to learn more accurate aggregators. While adjusting the weights of experts in averaging according to their individual historical accuracy has led to improved accuracy for the aggregated prediction [49], interestingly, more sophisticated data-driven methods [23, 35, 49] were often outperformed by the unweighted average. The dominance of simple aggregators over more sophisticated data-driven methods is observed so often in empirical applications that it is termed the "forecast combination puzzle" [44, p.428]. There are many potential explanations for the forecast combination puzzle [43, 23, 15].
In some scenarios, the past events are different from the future events in fundamental ways (e.g., geopolitical forecasting), and hence past data may not be informative in predicting the future. Another widely accepted conjecture is that the amount of past data is not large enough for a data-intensive method to perform well. Indeed, the sample sizes in many empirical forecast aggregation works are small by today's big-data standards (24 in [27], 69 in [42], 87 in [6], 100 in [30]). However, there are aggregation settings where future events are arguably similar to past events and we do have abundant data (for instance, the forecasting of weather, stock prices [21], and some forecasting competitions on Kaggle¹). Such settings are well suited for data-driven methods. It is hence tempting to ask how much data is needed for data-driven aggregators to achieve high accuracy in these settings.

In this paper, we initiate the study of the sample complexity of forecast aggregation, building upon a standard Bayesian forecasting model [37, 48, 24]. In this model, there are $n$ experts who observe private signals about an unknown binary event and make posterior predictions about the event. The experts' signals and the event are jointly distributed according to some underlying unknown distribution (information structure) $P$, which determines the optimal way to aggregate the experts' predictions. With sample access to the unknown distribution $P$, we ask the following question:

How many samples do we need to approximately learn the optimal aggregator?

Our model favors the use of data-driven aggregation methods because historical tasks are i.i.d. copies of future tasks. We show, however, that even in this benign model, optimal aggregation in general needs exponentially many samples. One may thus expect that data-driven methods can hardly perform well in more realistic, non-i.i.d. scenarios. Nevertheless, for some special, yet interesting, families of information structures, the sample complexity of forecast aggregation is significantly lower.

Main results (1) If $P$ can be an arbitrary discrete distribution, then at least $\tilde{\Omega}(m^{n-2}/\varepsilon)$ samples are needed, and $\tilde{O}(m^n/\varepsilon^2)$ samples are sufficient, to learn an $\varepsilon$-optimal aggregator with high probability, where $m$ is the number of signals an expert can possibly observe.² (2) If the experts' signals are conditionally independent, then the sample complexity is exponentially reduced and, surprisingly, does not depend on the number of experts or the number of signals: $\tilde{O}(1/\varepsilon^2)$ samples are sufficient, and $\Omega(1/\varepsilon)$ samples are necessary, to learn an $\varepsilon$-optimal aggregator with high probability.

Main techniques The main technical part of our paper is to prove the $\tilde{\Omega}(m^{n-2}/\varepsilon)$ lower bound on the sample complexity of forecast aggregation for the general case, via a reduction from the distribution learning problem. It is known that learning a discrete distribution with support size $|S|$ to within total variation distance $\varepsilon$ requires $\Omega(|S|/\varepsilon^2)$ samples. By reducing distribution learning to forecast aggregation, we obtain the lower bound on the sample complexity of the latter problem. This reduction is highly nontrivial. To carry it out, we define a new distribution learning problem that is different from the one in the literature and is of independent interest. The reduction also reveals an interesting fact: learning to aggregate optimally on some distribution is almost as difficult as learning the distribution itself. This is a little surprising because one might initially think that aggregation should be easier than distribution learning; we show that this is not the case.
Structure of the paper We discuss related works in Section 1.1. Section 2 introduces our model. Section 3 includes some preliminaries, including the distribution learning problem, which we will use in the proofs. Section 4 studies the sample complexity for general distributions. Section 5 focuses on the conditional independence case. Section 6 summarizes how our results can be generalized to non-binary (multi-outcome) events. Section 7 concludes and discusses future directions.

1.1 Related Works

Data-driven aggregation Data-driven approaches to forecast aggregation date back to perhaps the first paper on forecast aggregation [7] and have been a standard practice in the literature (see, e.g., surveys [16, 45] and more recent works [51, 42, 4, 47, 22, 40]). Many of these works focus on specific weighted average aggregators like linear pooling (weighted arithmetic mean) [51] and logarithmic pooling (normalized weighted geometric mean) [42, 40], and the goal is to estimate the optimal weights from data. However, those weighted average aggregators are not necessarily the optimal (Bayesian) aggregator unless the underlying information structure satisfies some strict conditions (e.g., experts' forecasts are equal to the true probability of the event plus normally distributed noise [48]). Our work aims to understand the sample complexity of Bayesian aggregation, namely how many samples are needed to approximate the Bayesian aggregator.

¹ https://www.kaggle.com/c/m5-forecasting-accuracy
² The $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ notations omit logarithmic factors.

Our work is closely related to a recent work [4] on Bayesian aggregation via online learning. For the case of conditionally independent experts, [4] shows that the Bayesian aggregator can be approximated with regret $\varepsilon = O(n/\sqrt{T})$. By an online-to-batch conversion, this implies a $T = O(n^2/\varepsilon^2)$ sample complexity upper bound for the batch learning setting. Our paper studies the batch learning setting. For the conditional independence case, we obtain an improved bound of $\tilde{O}(1/\varepsilon^2)$.

Robust forecast aggregation Recent works on robust forecast aggregation [2, 39, 18, 3] also study information aggregation problems where the principal does not know the underlying information structure. They take a worst-case approach, assuming that the information structure is chosen adversarially. This often leads to negative results: e.g., a bad approximation ratio [2, 39] or a degenerate maximin aggregator that solely relies on a random expert's opinion [18, 3]. In contrast, we assume sample access to the unknown information structure. Our sample complexity approach is orthogonal and complementary to the robust forecast aggregation approach.

Sample complexity of mechanism design Our work may remind the reader of a long line of research on the sample complexity of revenue maximization in mechanism design (e.g., [17, 20, 36, 19, 5, 11, 29, 10, 25, 50, 28]). In particular, [28] gives a general framework to bound the sample complexity for mechanism design problems that satisfy a strong monotonicity property, but this property is not satisfied by our forecast aggregation problem. A key observation from this line of works is that the number of samples needed to learn an $\varepsilon$-optimal auction for $n \ge 1$ independent bidders is increasing in $n$, because when there are more bidders, although we can obtain higher revenue, the optimal auction benchmark is also improved.
We see a similar phenomenon: the sample complexity of forecast aggregation increases with the number of experts in the general case, but not in the case where experts are conditionally independent.

2 Model

2.1 Forecast Aggregation

There are $n \ge 2$ experts and one principal. The principal wants to predict the probability that a binary event $\omega \in \Omega = \{0, 1\}$ happens ($\omega = 1$), based on information provided by the experts. For example, $\omega$ may represent whether it will rain tomorrow. We present binary events to simplify notation. All our results can be generalized to multi-outcome events with $|\Omega| > 2$ (see Section 6). We also refer to $\omega$ as the "state of the world". Each expert $i = 1, \ldots, n$ observes some private signal $s_i \in S_i$ that is correlated with $\omega$, where $S_i$ denotes the space of all possible signals of expert $i$. We assume for now that $S_i$ is finite, with size $|S_i| = m$. We relax this assumption in Section 5, where we consider conditionally independent signals. Let $S = S_1 \times \cdots \times S_n$ be the joint signal space of all experts; $|S| = m^n$. Let $P$ be a distribution over $S \times \Omega$, namely, a joint distribution of signals $s = (s_1, \ldots, s_n)$ and event $\omega$. Since the space $S \times \Omega$ is discrete, we can use $P(\cdot)$ to denote the probability: $P(s, \omega) = \Pr_P[s, \omega]$. Signals of different experts can be correlated conditioned on $\omega$. We assume that each expert $i$ knows the marginal joint distribution of their own signal $s_i$ and $\omega$, $P(s_i, \omega)$. Neither any expert nor the principal knows the entire distribution $P$.

Each expert $i$ reports to the principal a forecast (or prediction) $r_i$ for the event $\omega$, which is equal to the conditional probability of $\omega = 1$ given their signal $s_i$:³

$r_i = P(\omega = 1 \mid s_i) = \frac{P(\omega=1)\,P(s_i \mid \omega=1)}{P(\omega=1)\,P(s_i \mid \omega=1) + P(\omega=0)\,P(s_i \mid \omega=0)}$.   (1)

We note that $r_i$ depends on $s_i$ and $P$, but not on $\omega$ or the other experts' signals $s_{-i}$. Let $r = (r_1, \ldots, r_n) \in [0, 1]^n$ denote the reports (joint report) of all experts. We sometimes use $r_{-i}$ to denote the reports of all experts except $i$.

³ One may wonder whether the experts are willing to report $r_i = P(\omega = 1 \mid s_i)$ truthfully. This can be guaranteed by a proper scoring rule. For example, we can reward each expert the Brier score $C - |r_i - \omega|^2$ after seeing the realization of $\omega$ [9]. Each expert maximizes its expected reward by reporting its belief truthfully.

The principal aggregates the experts' reports $r$ into a single forecast $f(r)$ using some aggregation function, or aggregator, $f : [0, 1]^n \to [0, 1]$. We measure the performance of an aggregator by the mean squared loss:

$L_P(f) = E_P\big[|f(r) - \omega|^2\big]$.   (2)

The notation $E_P[\cdot]$ makes it explicit that the expectation is over the random draw of $(s, \omega) \sim P$, followed by letting $r_i = P(\omega = 1 \mid s_i)$. We omit $P$ and write $E[\cdot]$ when it is clear from the context. Let $f^*$ be the optimal aggregator with respect to $P$, which minimizes $L_P(f)$:

$f^* = \operatorname{argmin}_{f : [0,1]^n \to [0,1]} L_P(f) = \operatorname{argmin}_{f : [0,1]^n \to [0,1]} E_P\big[|f(r) - \omega|^2\big]$.   (3)

We have the following characterization of $f^*$ and $L_P(f)$: $f^*$ is equal to the Bayesian aggregator, which computes the posterior probability of $\omega = 1$ given all the reports $r = (r_1, \ldots, r_n)$; and the difference between the loss of any aggregator $f$ and the loss of $f^*$ is equal to their expected squared difference.

Lemma 2.1. The optimal aggregator $f^*$ and any aggregator $f$ satisfy:
- $f^*(r) = P(\omega = 1 \mid r)$, for almost every $r$;
- $L_P(f) - L_P(f^*) = E_P\big[|f(r) - f^*(r)|^2\big]$.

An aggregator $f$ is $\varepsilon$-optimal (with respect to $P$) if $L_P(f) \le L_P(f^*) + \varepsilon$. By Lemma 2.1, this is equivalent to $E_P\big[|f(r) - f^*(r)|^2\big] \le \varepsilon$. For an $\varepsilon$-optimal $f$, we also say it $\varepsilon$-approximates $f^*$.
Definition 2.2. An aggregator $f$ is $\varepsilon$-optimal (with respect to $P$) if $E_P\big[|f(r) - f^*(r)|^2\big] \le \varepsilon$.

Discussion of the benchmark $f^*$ Our benchmark, the Bayesian aggregator $f^*$, is common in the forecast aggregation literature (e.g., [27, 26, 45]). It is stronger than the typical best-expert benchmark in no-regret learning (e.g., [13, 22]), but weaker than the omniscient aggregator that has access to the experts' signals: $f^{\mathrm{omni}}(s) = P(\omega = 1 \mid s)$. If there is a one-to-one mapping between signals $s$ and reports $r$, then $f^{\mathrm{omni}}$ and $f^*$ are the same. Otherwise, $f^{\mathrm{omni}}$ could be much stronger than $f^*$, and an $\varepsilon$-approximation to $f^{\mathrm{omni}}$ using experts' reports only is not always possible.⁴ In contrast, an $\varepsilon$-approximation to $f^*$ is always achievable (in fact, achieved by $f^*$ itself). The difference between $f^*$ and $f^{\mathrm{omni}}$ is known as the difference between "aggregating forecasts" and "aggregating information sets" in the literature [27, p.198-199], [26, p.168-169], [45, p.143].

⁴ This has been noted by [2, 4, 39]. They give an XOR example where $\omega = s_1 \oplus s_2$ and $s_1, s_2$ are i.i.d. Uniform$\{0,1\}$ distributed. The experts always report $r_i = 0.5$; $f^{\mathrm{omni}}(s_1, s_2) = \omega$ and $f^*(r_1, r_2) = 0.5$, so $L_P(f^{\mathrm{omni}}) = 0$ but $L_P(f^*) = 0.25 > 0$. No aggregator that uses experts' reports only can do better than $f^*$.

2.2 Sample Complexity of Forecast Aggregation

The principal has access to $T$ i.i.d. samples of forecasts and event realizations drawn from the underlying unknown distribution $P$:

$S_T = \big( (r^{(1)}, \omega^{(1)}), \ldots, (r^{(T)}, \omega^{(T)}) \big)$, where $(s^{(t)}, \omega^{(t)}) \sim P$ and $r_i^{(t)} = P(\omega = 1 \mid s_i^{(t)})$.   (4)

Here, we implicitly regard $P$ as a distribution over $(r, \omega)$ instead of $(s, \omega)$. The principal uses the samples $S_T$ to learn an aggregator $\hat{f} = \hat{f}_{S_T}$, in order to approximate $f^*$. Our main question is:

How many samples are necessary and sufficient for finding an $\varepsilon$-optimal aggregator $\hat{f}$ (with probability at least $1 - \delta$ over the random draw of samples)?

The answer to the above question depends on the family of distributions we are interested in. Let $\mathcal{P}$ denote a family of distributions over $S \times \Omega$. It could be the set of all distributions over $S \times \Omega$; in Section 5 we will instead consider only distributions where the signals are independent conditioned on $\omega$. We define the sample complexity of forecast aggregation, with respect to $\mathcal{P}$, formally:

Definition 2.3. The sample complexity of forecast aggregation (with respect to $\mathcal{P}$) is the minimum function $T_{\mathcal{P}}(\cdot, \cdot)$ of $\varepsilon, \delta \in (0, 1)$ such that: if $T \ge T_{\mathcal{P}}(\varepsilon, \delta)$, then for any distribution $P \in \mathcal{P}$, with probability at least $1 - \delta$ over the random draw of $T$ samples $S_T$ from $P$ (and over the randomness of the learning procedure if it is randomized), we can obtain an aggregator $\hat{f} = \hat{f}_{S_T}$ satisfying $E_P\big[|\hat{f}(r) - f^*(r)|^2\big] \le \varepsilon$.

The principal is assumed to know the family of distributions $\mathcal{P}$ but not the specific distribution $P \in \mathcal{P}$. There should be at least two different distributions in $\mathcal{P}$; otherwise, the principal knows what the distribution is and there is no need for learning.

3 Preliminaries

In this section, we briefly introduce some notions that will be used in our analysis of the sample complexity of forecast aggregation, including some definitions of distances between distributions, the distribution learning problem, and the distinguishing-distributions problem.

Distances between distributions We recall two distance metrics for discrete distributions: the total variation distance and the (squared) Hellinger distance.

Definition 3.1. Let $D_1, D_2$ be two distributions on a discrete space $X$. The total variation distance between $D_1$ and $D_2$ is $d_{\mathrm{TV}}(D_1, D_2) = \frac{1}{2}\sum_{x \in X} |D_1(x) - D_2(x)|$. The squared Hellinger distance between $D_1$ and $D_2$ is $d_H^2(D_1, D_2) = \frac{1}{2}\sum_{x \in X}\big(\sqrt{D_1(x)} - \sqrt{D_2(x)}\big)^2 = 1 - \sum_{x \in X}\sqrt{D_1(x)\,D_2(x)}$.
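Both distances in Definition 3.1 are straightforward to compute for discrete distributions; a minimal sketch (representing distributions as dictionaries is our illustrative choice):

```python
from math import sqrt

def d_tv(D1, D2):
    """Total variation distance: (1/2) * sum_x |D1(x) - D2(x)|."""
    support = set(D1) | set(D2)
    return 0.5 * sum(abs(D1.get(x, 0.0) - D2.get(x, 0.0)) for x in support)

def d_h2(D1, D2):
    """Squared Hellinger distance: 1 - sum_x sqrt(D1(x) * D2(x))."""
    support = set(D1) | set(D2)
    return 1.0 - sum(sqrt(D1.get(x, 0.0) * D2.get(x, 0.0)) for x in support)

D1 = {"a": 0.5, "b": 0.3, "c": 0.2}
D2 = {"a": 0.4, "b": 0.4, "c": 0.2}
# Sanity check of a relation from Appendix A (Fact A.1): d_TV <= sqrt(2) * d_H.
assert d_tv(D1, D2) <= sqrt(2) * sqrt(d_h2(D1, D2))
```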
The total variation distance has the following well-known property, which upper bounds the difference between the expected values of a function on two distributions:

Fact 3.2. For any function $h : X \to [0, 1]$, $|E_{x \sim D_1} h(x) - E_{x \sim D_2} h(x)| \le d_{\mathrm{TV}}(D_1, D_2)$.

In Appendix A we give some properties of the Hellinger distance that will be used in the proofs.

Distribution learning in total variation distance Our analysis of the sample complexity of the forecast aggregation problem will leverage the sample complexity of another learning problem: learning discrete distributions in total variation distance. We review this problem below. Let $\mathcal{D}$ be a family of distributions over $X$. The sample complexity of learning distributions in $\mathcal{D}$ within total variation distance $\varepsilon$ is the minimum function $T_{\mathcal{D}}^{\mathrm{TV}}(\varepsilon, \delta)$ such that: if $T \ge T_{\mathcal{D}}^{\mathrm{TV}}(\varepsilon, \delta)$, then for any distribution $D \in \mathcal{D}$, with probability at least $1 - \delta$ over the random draw of $T$ samples from $D$, we can obtain (from the $T$ samples) a distribution $\hat{D}$ such that $d_{\mathrm{TV}}(\hat{D}, D) \le \varepsilon$.

Proposition 3.3 (e.g., [12, 33]). Let $\mathcal{D}_{\mathrm{all}}$ be the set of all distributions over $X$. Then, $T_{\mathcal{D}_{\mathrm{all}}}^{\mathrm{TV}}(\varepsilon, \delta) = \Theta\big(\frac{|X| + \log(1/\delta)}{\varepsilon^2}\big)$. In particular, the upper bound can be achieved by using the empirical estimate $\hat{D}_{\mathrm{emp}}$ (which is the uniform distribution over the $T$ samples). The lower bound holds regardless of what learning algorithm is used.

Distinguishing distributions Another learning problem that we will leverage is the problem of distinguishing (two) distributions: given samples from a distribution randomly chosen from $\{D_1, D_2\}$, we are to guess whether the samples are from $D_1$ or $D_2$. The sample complexity of distinguishing distributions is characterized by the squared Hellinger distance. It is known that at least $T = \Omega\big(\frac{1}{d_H^2(D_1, D_2)}\log\frac{1}{\delta}\big)$ samples are needed to distinguish two distributions with probability at least $1 - \delta$. See Appendix A for a formal statement of this result.

4 Sample Complexity for General Distributions

In this section we characterize the sample complexity of forecast aggregation for general distributions $P$. We give an exponential (in the number of experts, $n$) upper bound and an exponential lower bound on the sample complexity, as follows:

Theorem 4.1. Let $\mathcal{P}_{\mathrm{all}}$ be the set of all distributions over $S \times \Omega$, with $|S| = m^n$. Suppose $n \ge 2$. The sample complexity of forecast aggregation with respect to $\mathcal{P}_{\mathrm{all}}$ satisfies

$\tilde{\Omega}\Big(\frac{m^{n-2} + \log(1/\delta)}{\varepsilon}\Big) \le T_{\mathcal{P}_{\mathrm{all}}}(\varepsilon, \delta) \le \tilde{O}\Big(\frac{m^n + \log(1/\delta)}{\varepsilon^2}\Big)$.

This theorem is for $n \ge 2$. When there is only one expert ($n = 1$), there is no need to learn to aggregate, because the optimal aggregator $f^*$ simply outputs the forecast given by the only expert: $f^*(r_1) = P(\omega = 1 \mid r_1) = P(\omega = 1 \mid s_1) = r_1$. The sample complexity is 0 in this case.

There is a gap in the dependency on $\varepsilon$ between the upper bound and the lower bound in Theorem 4.1. We conjecture that the tight dependency on $\varepsilon$ is $\frac{1}{\varepsilon}$ (so the lower bound is tight). See Section 7 for a detailed discussion of this conjecture, where we show that the dependency on $\varepsilon$ in the upper bound can be improved to $\frac{1}{\varepsilon}$ for a large family of distributions.

4.1 Proof of the Upper Bound

In this subsection we prove the $\tilde{O}\big(\frac{m^n + \log(1/\delta)}{\varepsilon^2}\big)$ upper bound in Theorem 4.1. This is a direct corollary of the distribution learning result introduced in Section 3. We regard $P$ as a distribution over $r$ and $\omega$ instead of over $s$ and $\omega$. Then $P$ is a discrete distribution with support size at most $2m^n$, because each possible report $r_i \in [0, 1]$ corresponds to some discrete signal $s_i$ in $S_i$. Let $\hat{P}_{\mathrm{emp}}$ be the empirical distribution of reports and event realizations: $\hat{P}_{\mathrm{emp}} = \mathrm{Uniform}\big( (r^{(1)}, \omega^{(1)}), \ldots, (r^{(T)}, \omega^{(T)}) \big)$. By Proposition 3.3, with probability at least $1 - \delta$ over the random draw of $T = O\big(\frac{2m^n + \log(1/\delta)}{\varepsilon^2}\big)$ samples, we have $d_{\mathrm{TV}}(\hat{P}_{\mathrm{emp}}, P) \le \varepsilon$. According to Fact 3.2, and by the definition of $L_P(f)$, we have, for any aggregator $f : [0, 1]^n \to [0, 1]$,

$\big| L_{\hat{P}_{\mathrm{emp}}}(f) - L_P(f) \big| = \big| E_{\hat{P}_{\mathrm{emp}}}\big[|f(r) - \omega|^2\big] - E_P\big[|f(r) - \omega|^2\big] \big| \le d_{\mathrm{TV}}(\hat{P}_{\mathrm{emp}}, P) \le \varepsilon$.

Therefore, if we pick the empirically optimal aggregator $\hat{f}_{\mathrm{emp}} = \operatorname{argmin}_f L_{\hat{P}_{\mathrm{emp}}}(f)$, we get

$L_P(\hat{f}_{\mathrm{emp}}) \le L_{\hat{P}_{\mathrm{emp}}}(\hat{f}_{\mathrm{emp}}) + \varepsilon \le L_{\hat{P}_{\mathrm{emp}}}(f^*) + \varepsilon \le L_P(f^*) + 2\varepsilon$,

which means that $\hat{f}_{\mathrm{emp}}$ is a $2\varepsilon$-optimal aggregator for $P$.
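The upper-bound argument above is constructive. A minimal sketch of the learner it describes (the function names are ours): tabulate the empirical distribution of $(r, \omega)$ and output, for each observed report profile, the empirical frequency of $\omega = 1$, which is exactly the minimizer of the empirical squared loss; the value on unseen profiles is an arbitrary choice.

```python
from collections import defaultdict

def learn_empirical_aggregator(samples):
    """samples: list of (r, omega) pairs, where r is a tuple of n reports in [0, 1].

    Returns f_hat minimizing the empirical squared loss: for each observed report
    profile r, f_hat(r) is the empirical frequency of omega = 1 given r.
    """
    counts = defaultdict(lambda: [0, 0])  # r -> [count(omega = 0), count(omega = 1)]
    for r, omega in samples:
        counts[r][omega] += 1

    def f_hat(r):
        c0, c1 = counts.get(r, (0, 0))
        return c1 / (c0 + c1) if c0 + c1 > 0 else 0.5  # default on unseen profiles
    return f_hat

# Usage: f_hat = learn_empirical_aggregator([((0.6, 0.8), 1), ((0.6, 0.8), 0), ...])
```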
4.2 Proof of the Lower Bound

In this subsection we prove the $\tilde{\Omega}\big(\frac{m^{n-2} + \log(1/\delta)}{\varepsilon}\big)$ lower bound in Theorem 4.1. The main idea is a reduction from the distribution learning problem (defined in Section 3) for a specific family $\mathcal{D}$ of distributions over the joint signal space $S = S_1 \times \cdots \times S_n$. We construct a corresponding family of distributions $\mathcal{P} = \{P_D : D \in \mathcal{D}\}$ for the forecast aggregation problem such that, if we can obtain an $\varepsilon$-optimal aggregator $\hat{f}$ for $P_D$, then we can convert $\hat{f}$ into a distribution $\hat{D}$ with $d_{\mathrm{TV}}(\hat{D}, D) \le O(\sqrt{\varepsilon})$. We then prove that learning $\mathcal{D}$ within total variation distance $\varepsilon_{\mathrm{TV}} = O(\sqrt{\varepsilon})$ requires $\Omega\big(\frac{m^{n-2} + \log(1/\delta)}{\varepsilon_{\mathrm{TV}}^2}\big) = \Omega\big(\frac{m^{n-2} + \log(1/\delta)}{\varepsilon}\big)$ samples. This gives the sample complexity lower bound for the forecast aggregation problem for $\mathcal{P}$ (and hence $\mathcal{P}_{\mathrm{all}}$). We will need a family of distributions $\mathcal{D}$ that satisfies the following three properties:

Definition 4.2. We say a family of distributions $\mathcal{D}$ is:
1. $B$-uniformly bounded, if $D(s) \le \frac{B}{|S|} = \frac{B}{m^n}$ for all $s \in S$ and all $D \in \mathcal{D}$, where $B \ge 1$ is a constant;
2. of same marginal across distributions, if for any $D, D' \in \mathcal{D}$, any $i$, and any $s_i \in S_i$, $D(s_i) = D'(s_i)$;
3. of distinct marginals across signals, if for any $D \in \mathcal{D}$, any $i$, and any $s_i \ne s_i' \in S_i$, $D(s_i) \ne D(s_i')$.

How do we construct the family $\mathcal{P}$? For each distribution $D \in \mathcal{D}$, we construct the distribution $P_D$ as follows:
- the marginal distribution of $\omega$ is Uniform$\{0, 1\}$, i.e., $P_D(\omega = 0) = P_D(\omega = 1) = \frac{1}{2}$;
- conditioning on $\omega = 0$, the joint signal $s$ is uniformly distributed: $P_D(s \mid \omega = 0) = \frac{1}{|S|} = \frac{1}{m^n}$ for all $s \in S$;
- conditioning on $\omega = 1$, the joint signal is distributed according to $D$: $P_D(s \mid \omega = 1) = D(s)$ for all $s \in S$.

The family $\mathcal{P}$ is $\{P_D : D \in \mathcal{D}\}$. We show that if we can obtain $\varepsilon$-optimal aggregators for distributions in $\mathcal{P}$, then we can learn the distributions in $\mathcal{D}$ within total variation distance $(1+B)^2\sqrt{\varepsilon}$, and thus the sample complexity of the former problem is lower bounded by the sample complexity of the latter:

Lemma 4.3. Let $\mathcal{D}$ be a family of distributions that satisfies the three properties in Definition 4.2. Let $\mathcal{P} = \{P_D : D \in \mathcal{D}\}$ be defined as above. Then, $T_{\mathcal{P}}(\varepsilon, \delta) \ge T_{\mathcal{D}}^{\mathrm{TV}}\big((1+B)^2\sqrt{\varepsilon},\, \delta\big)$.

Proof sketch of Lemma 4.3. The full proof is in Appendix E.1. We give a sketch here. According to the definition of $P_D$, by Bayes' rule, we have

$P_D(\omega = 1 \mid s) = \frac{\frac{1}{2}\,P_D(s \mid \omega=1)}{\frac{1}{2}\,P_D(s \mid \omega=0) + \frac{1}{2}\,P_D(s \mid \omega=1)} = \frac{D(s)}{\frac{1}{m^n} + D(s)}$.   (6)

The distinct-marginals-across-signals property in Definition 4.2 ensures that there is a one-to-one mapping between signal $s_i$ and report $r_i = P_D(\omega = 1 \mid s_i)$, and hence a one-to-one mapping between joint signal $s$ and joint report $r$. So, the Bayesian aggregator $f^*$ satisfies $f^*(r) = P_D(\omega = 1 \mid r) = P_D(\omega = 1 \mid s) = \frac{D(s)}{1/m^n + D(s)}$. This gives

$D(s) = \frac{1}{m^n} \cdot \frac{f^*(r)}{1 - f^*(r)}$.   (7)

Suppose we have obtained an $\varepsilon$-optimal aggregator $\hat{f}$ for $P_D$, i.e., $E_{P_D}\big[|\hat{f}(r) - f^*(r)|^2\big] \le \varepsilon$; then we convert $\hat{f}$ into $\hat{D}$ by letting $\hat{D}(s) = \frac{1}{m^n} \cdot \frac{\hat{f}(r)}{1 - \hat{f}(r)}$.
The total variation distance between $\hat{D}$ and $D$ is:

$d_{\mathrm{TV}}(\hat{D}, D) = \frac{1}{2}\sum_{s \in S}\big|\hat{D}(s) - D(s)\big| \overset{(7)}{=} \frac{1}{2}\sum_{s \in S}\frac{1}{m^n}\Big|\frac{\hat{f}(r)}{1-\hat{f}(r)} - \frac{f^*(r)}{1-f^*(r)}\Big|$.   (8)

The $B$-uniformly bounded property in Definition 4.2 ensures $D(s) = O(\frac{1}{m^n})$, which has two consequences: (1) $P_D(s) = O(\frac{1}{m^n})$; (2) $f^*(r)$ is bounded away from 1, which implies $\big|\frac{\hat{f}(r)}{1-\hat{f}(r)} - \frac{f^*(r)}{1-f^*(r)}\big| = O\big(|\hat{f}(r) - f^*(r)|\big)$ due to the Lipschitz property of the function $\frac{x}{1-x}$ for $x$ bounded away from 1. These two consequences imply

$d_{\mathrm{TV}}(\hat{D}, D) = O\Big(\sum_{s \in S}\frac{1}{m^n}\big|\hat{f}(r) - f^*(r)\big|\Big) = O\Big(\sum_{s \in S} P_D(s)\big|\hat{f}(r) - f^*(r)\big|\Big) = O\Big(E\big[|\hat{f}(r) - f^*(r)|\big]\Big) \overset{\text{Jensen's inequality}}{\le} O\Big(\sqrt{E\big[|\hat{f}(r) - f^*(r)|^2\big]}\Big) = O(\sqrt{\varepsilon})$.

So, we obtain $d_{\mathrm{TV}}(\hat{D}, D) \le O(\sqrt{\varepsilon}) = (1+B)^2\sqrt{\varepsilon}$.

There is a subtlety, however: in the distribution learning problem we are given samples from $D$, which are signals $\{s^{(t)}\}_{t=1}^T$. We need to convert them into the corresponding reports $\{r^{(t)}\}_{t=1}^T$ as the samples for the forecast aggregation problem. To do this we would seem to need to know $P_D$ or $D$, which we do not. So, we make use of the same-marginal-across-distributions property in Definition 4.2: because all the distributions $D \in \mathcal{D}$ have the same marginal probability $D(s_i) = D'(s_i)$ (but possibly different joint probabilities $D(s) \ne D'(s)$), we are able to compute the report

$r_i^{(t)} = P_D(\omega = 1 \mid s_i^{(t)}) = \frac{\frac{1}{2}\,P_D(s_i^{(t)} \mid \omega=1)}{\frac{1}{2}\,P_D(s_i^{(t)} \mid \omega=0) + \frac{1}{2}\,P_D(s_i^{(t)} \mid \omega=1)} = \frac{D(s_i^{(t)})}{\frac{1}{m} + D(s_i^{(t)})}$

separately for each expert $i$ without knowing $D$, since we know what the family $\mathcal{D}$ is. This allows us to reduce the distribution learning problem for $\mathcal{D}$ to the forecast aggregation problem for $\mathcal{P}$. It remains to find a family of distributions $\mathcal{D}$ that satisfies the three properties in Definition 4.2 and requires many samples to learn:

Proposition 4.4. There exists a family of distributions $\mathcal{D}$ that satisfies the three properties in Definition 4.2 (with $B = e + \frac{1}{2}$) and requires $T_{\mathcal{D}}^{\mathrm{TV}}(\varepsilon_{\mathrm{TV}}, \delta) = \Omega\big(\frac{m^{n-2} + \log(1/\delta)}{\varepsilon_{\mathrm{TV}}^2}\big)$ samples to learn.

The above sample complexity is smaller than the $\Omega\big(\frac{|S| + \log(1/\delta)}{\varepsilon_{\mathrm{TV}}^2}\big)$ lower bound in Proposition 3.3 because we are restricting to a smaller set of distributions than the set of all distributions over $S$. The proof of Proposition 4.4 is analogous to a textbook proof of Proposition 3.3, which uses reductions from the distinguishing-distributions problem. See details in Appendix E.2.

Finishing the proof of Theorem 4.1: by Lemma 4.3 and Proposition 4.4, plugging in $\varepsilon_{\mathrm{TV}} = (1+B)^2\sqrt{\varepsilon}$ with $B = e + \frac{1}{2}$, we obtain the lower bound on the sample complexity of forecast aggregation for $\mathcal{P}$ (and hence for $\mathcal{P}_{\mathrm{all}}$):

$T_{\mathcal{P}}(\varepsilon, \delta) \ge T_{\mathcal{D}}^{\mathrm{TV}}\big((1+B)^2\sqrt{\varepsilon},\, \delta\big) = \Omega\Big(\frac{m^{n-2} + \log(1/\delta)}{\big((1+B)^2\sqrt{\varepsilon}\big)^2}\Big) = \Omega\Big(\frac{m^{n-2} + \log(1/\delta)}{\varepsilon}\Big)$.
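The two conversions at the heart of the reduction, from signals to reports (possible because all $D \in \mathcal{D}$ share the same marginals) and from a learned aggregator back to a distribution estimate via equation (7), are mechanical. A minimal sketch, in which `D_marginal` (the family-wide marginal $D(s_i)$) and the other helpers are our own assumed interfaces:

```python
def signal_to_report(D_marginal, m, i, si):
    """r_i = P_D(omega = 1 | s_i) = D(s_i) / (1/m + D(s_i)). Uses only the marginal
    D(s_i), which is the same for every D in the family (Definition 4.2, item 2)."""
    d = D_marginal(i, si)
    return d / (1.0 / m + d)

def aggregator_to_distribution(f_hat, reports_of, signal_profiles, m, n):
    """Invert equation (7): D_hat(s) = (1/m^n) * f_hat(r) / (1 - f_hat(r)), where r
    is the report profile induced by s (one-to-one by Definition 4.2, item 3)."""
    D_hat = {}
    for s in signal_profiles:
        fr = min(f_hat(reports_of(s)), 1.0 - 1e-12)  # guard against division by zero
        D_hat[s] = fr / (1.0 - fr) / (m ** n)
    return D_hat
```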
5 Sample Complexity for Conditionally Independent Distributions

Section 4 proved that learning $\varepsilon$-optimal aggregators for all discrete distributions needs exponentially many samples. As shown in the proof, this large sample complexity arises because the experts' signals can be arbitrarily correlated conditioned on the event $\omega$, and accurate estimation of such correlation requires many samples. So, in this section we restrict attention to the case where the experts' signals are conditionally independent. It turns out that an $\varepsilon$-optimal aggregator can be learned using only $O\big(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon\delta}\big)$ samples in this case, which does not depend on $n$. The assumption of a discrete signal space can be relaxed here. We also investigate two special and interesting families of conditionally independent distributions that admit an even smaller sample complexity of $O\big(\frac{1}{\varepsilon}\log\frac{1}{\delta}\big)$.

5.1 General Conditionally Independent Distributions

Let $P$ be a conditionally independent distribution over $S \times \Omega$, namely, $P(s \mid \omega) = \prod_{i=1}^n P(s_i \mid \omega)$ for all $s \in S$ and $\omega \in \{0, 1\}$. Here, $S_i$ can be a continuous space, in which case $P(\cdot \mid \omega)$ represents a density function. We introduce some additional notation. Let $p = P(\omega = 1)$ be the prior probability of $\omega = 1$. For technical convenience we assume $p \in (0, 1)$. Define

$\rho = \frac{p}{1-p} = \frac{P(\omega=1)}{P(\omega=0)} \in (0, +\infty)$.   (9)

We will be working with ratios like $\frac{r_i}{1-r_i}$ and $\frac{p}{1-p}$ a lot in this section. We will use the following characterization of the optimal aggregator $f^*$ for conditionally independent distributions:

Lemma 5.1 ([8]). For a conditionally independent distribution $P$, given signals $s = (s_i)_{i=1}^n$ with corresponding reports $r = (r_i)_{i=1}^n$, where $r_i = P(\omega = 1 \mid s_i)$, the posterior probability of $\omega = 1$ is:

$f^*(r) = P(\omega = 1 \mid r) = P(\omega = 1 \mid s) = \frac{1}{1 + \rho^{\,n-1}\prod_{i=1}^n \frac{1-r_i}{r_i}}$.

(Define $f^*(r) = 0$ if $\rho^{\,n-1}\prod_{i=1}^n \frac{1-r_i}{r_i} = +\infty$, i.e., if some $r_i = 0$.)

Lemma 5.1 implies that one way to learn $f^*$ is to simply learn the value of $\rho$. If we can learn $\rho$ with accuracy $\frac{\varepsilon}{n}$, then we can learn $\rho^{n-1}$ with accuracy $\varepsilon$ and obtain an $\hat{f}$ that is $\varepsilon$-close to $f^*$ for every possible input $r \in [0, 1]^n$. However, by standard concentration inequalities, learning $\rho$ with accuracy $\frac{\varepsilon}{n}$ requires $O(\frac{n^2}{\rho\varepsilon^2})$ samples, which is larger than the $O(\frac{1}{\varepsilon^2})$ bound we will prove. The key is that we do not actually need $\hat{f}(r)$ to be close to $f^*(r)$ for every $r \in [0, 1]^n$; we only need the expectation $E\big[|\hat{f}(r) - f^*(r)|^2\big] \le \varepsilon$. This allows us to prove a smaller sample complexity bound, using a pseudo-dimension argument. The main result of this section is that the sample complexity of forecast aggregation with respect to all conditionally independent distributions is between $\Omega\big(\frac{1}{\varepsilon}\log\frac{1}{\delta}\big)$ and $O\big(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon\delta}\big)$:

Theorem 5.2. Let $\mathcal{P}_{\mathrm{ind}}$ be the set of all conditionally independent distributions over $S \times \Omega$. Suppose $n \ge 2$. The sample complexity of forecast aggregation with respect to $\mathcal{P}_{\mathrm{ind}}$ satisfies

$\Omega\Big(\frac{1}{\varepsilon}\log\frac{1}{\delta}\Big) \le T_{\mathcal{P}_{\mathrm{ind}}}(\varepsilon, \delta) \le O\Big(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon\delta}\Big)$.

We provide the main ideas of the proof of Theorem 5.2 here. The upper bound is a corollary of our theorem for multi-outcome events (Theorem C.1), so we only give a sketch. We note that, according to Lemma 5.1, the optimal aggregator has the form $f^*(r) = \frac{1}{1+\rho^{n-1}\prod_{i=1}^n \frac{1-r_i}{r_i}}$. We consider the class of aggregators $\mathcal{F} = \big\{ f^\theta : f^\theta(r) = \frac{1}{1+\theta^{\,n-1}\prod_{i=1}^n \frac{1-r_i}{r_i}} \big\}$, parameterized by $\theta \in (0, +\infty)$. The class of loss functions $\mathcal{G} = \big\{ g^\theta : g^\theta(r, \omega) = |f^\theta(r) - \omega|^2 \big\}$ associated with $\mathcal{F}$ has pseudo-dimension $\mathrm{Pdim}(\mathcal{G}) = O(1)$. By the known result (e.g., [1]) that the pseudo-dimension gives a sample complexity upper bound for the uniform convergence of a class of functions, we conclude that the empirically optimal aggregator in $\mathcal{F}$ must be $O(\varepsilon)$-optimal on the true distribution (with probability at least $1 - \delta$), given $O\big(\frac{1}{\varepsilon^2}\big(\mathrm{Pdim}(\mathcal{G}) + \log\frac{1}{\varepsilon\delta}\big)\big)$ samples.
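Concretely, the empirical-risk-minimization step over the one-parameter class $\mathcal{F}$ can be sketched as follows (a grid search over $\theta$ stands in for the ERM step; all names are hypothetical):

```python
import numpy as np

def f_theta(theta, r, n):
    """f^theta(r) = 1 / (1 + theta^(n-1) * prod_i (1 - r_i) / r_i)."""
    r = np.clip(np.asarray(r, dtype=float), 1e-12, 1 - 1e-12)
    odds_inv = theta ** (n - 1) * np.prod((1 - r) / r)
    return 1.0 / (1.0 + odds_inv)

def fit_theta(samples, n, grid):
    """Pick the theta in the grid minimizing the empirical squared loss."""
    def emp_loss(theta):
        return np.mean([(f_theta(theta, r, n) - w) ** 2 for r, w in samples])
    return min(grid, key=emp_loss)

# Usage sketch: theta_hat = fit_theta(samples, n=5, grid=np.geomspace(0.01, 100, 200))
```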
We prove the $\Omega\big(\frac{1}{\varepsilon}\log\frac{1}{\delta}\big)$ lower bound by a reduction from the distinguishing-distributions problem (introduced in Section 3). We construct two conditionally independent distributions $P_1, P_2$ over the space $\Omega \times S$ that differ by $d_H^2(P_1, P_2) = O(\varepsilon)$ in squared Hellinger distance. Specifically, $P_1$ has prior $P_1(\omega = 1) = 0.5 - O(\frac{1}{n})$ and $P_2$ has prior $P_2(\omega = 1) = 0.5 + O(\frac{1}{n})$; the conditional distributions of each signal, $P_1(s_i \mid \omega)$ and $P_2(s_i \mid \omega)$, differ by $O(\frac{\varepsilon}{n})$ in squared Hellinger distance; taking the product of $n$ signals, $P_1(s \mid \omega)$ and $P_2(s \mid \omega)$ differ by $O(\varepsilon)$. The distinguishing-distributions problem asks: given $T$ samples from either $P_1$ or $P_2$, tell which distribution the samples come from. We show that if we can solve the forecast aggregation problem, namely $\varepsilon$-approximate $f^*(r) = \frac{1}{1+\rho^{n-1}\prod_{i=1}^n \frac{1-r_i}{r_i}}$, then we can estimate $\rho$ with accuracy $O(\sqrt{\varepsilon})$ and hence distinguish $P_1$ and $P_2$. But distinguishing $P_1$ and $P_2$ requires $\Omega\big(\frac{1}{d_H^2(P_1, P_2)}\log\frac{1}{\delta}\big)$ samples. This gives the lower bound we claimed. See details in Appendix F.1.

5.2 Strongly and Weakly Informative Experts

While the sample complexity of forecast aggregation for general conditionally independent distributions is $O\big(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon\delta}\big)$, under further assumptions this bound can be improved. In particular, we find two special yet natural families of conditionally independent distributions that admit $O\big(\frac{1}{\varepsilon}\log\frac{1}{\delta}\big)$ sample complexity. In these two cases, the experts are either very "informative" or very "non-informative". Roughly speaking, an expert is very informative if the conditional distributions of the expert's signal under the events $\omega = 0$ and $\omega = 1$ are significantly different, so the expert's prediction $r_i$ is "away from" the prior $p$. An expert is very non-informative if the opposite is true. Intuitively, an expert being informative should help aggregation and hence reduce the sample complexity. Interestingly, though, we show that even if the experts are non-informative, the sample complexity of forecast aggregation can also be reduced. See details in Appendix B.

6 Extension: Multi-Outcome Events

Our main results regarding the sample complexity of forecast aggregation for binary events (Theorems 4.1 and 5.2) can be generalized to multi-outcome events with $|\Omega| > 2$. We prove that: for general distributions, the sample complexity is

$\tilde{\Omega}\Big(\frac{m^{n-2}}{\varepsilon}\Big) = \Omega\Big(\frac{m^{n-2} + \log(1/\delta)}{\varepsilon}\Big) \le T_{\mathcal{P}}(\varepsilon, \delta) \le O\Big(\frac{|\Omega|\,m^n + \log(1/\delta)}{\varepsilon^2}\Big) = \tilde{O}\Big(\frac{|\Omega|\,m^n}{\varepsilon^2}\Big)$;

for conditionally independent distributions, the sample complexity is

$\tilde{\Omega}\Big(\frac{1}{\varepsilon}\Big) = \Omega\Big(\frac{1}{\varepsilon}\log\frac{1}{\delta}\Big) \le T_{\mathcal{P}_{\mathrm{ind}}}(\varepsilon, \delta) \le O\Big(\frac{|\Omega|\log|\Omega| \cdot \log(1/\varepsilon) + \log(1/\delta)}{\varepsilon^2}\Big) = \tilde{O}\Big(\frac{|\Omega|\log|\Omega|}{\varepsilon^2}\Big)$.

See Appendix C for details.

7 Conclusion and Discussion

In this work, we showed an exponential gap between the sample complexity of forecast aggregation for general distributions, $\tilde{\Omega}(\frac{m^{n-2}}{\varepsilon})$, and for conditionally independent distributions, $\tilde{O}(\frac{1}{\varepsilon^2})$. This gap is due to the need to estimate the conditional correlation between experts in the general case, which is not needed in the conditional independence case. Notably, the $\tilde{O}(\frac{1}{\varepsilon^2})$ bound for conditionally independent distributions does not depend on the number of experts. We discuss the dependency of the sample complexity on $\varepsilon$ and other directions for future work:

The dependency on $\varepsilon$ An open question left by our work is the dependency of the sample complexity on the parameter $\varepsilon$. We conjecture that the tight dependency should be $\frac{1}{\varepsilon}$ (so our lower bounds are tight). This is supported by the following evidence:

Theorem 7.1. For the case of $|\Omega| = 2$ and for general distributions, if the distribution $P$ has a minimum joint probability $\min_{(s,\omega) \in S \times \Omega} P(s, \omega) > \frac{c}{m^n}$ for some $c > 0$, then the sample complexity of forecast aggregation is at most $O\big(\frac{m^n}{c\varepsilon}\big(n\log m + \log\frac{1}{\delta}\big)\big) = \tilde{O}\big(\frac{n m^n}{c\varepsilon}\big)$.

In particular, this theorem can be applied to distributions that are close to uniform, where $P(s, \omega) \approx \frac{1}{|S \times \Omega|} = \frac{1}{2m^n}$, giving a bound of $\tilde{O}(\frac{n m^n}{\varepsilon})$.⁵ Notably, the set of distributions we constructed in the proof of the $\tilde{\Omega}(\frac{m^{n-2}}{\varepsilon})$ lower bound in Theorem 4.1 is also close to uniform. This means that close-to-uniform distributions have a tight sample complexity bound of the form $\tilde{\Theta}(\frac{f(n,m)}{\varepsilon})$, not $\tilde{\Theta}(\frac{f(n,m)}{\varepsilon^2})$.

⁵ This bound has a better dependency on $\varepsilon$ but a worse dependency on $n$ than the $\tilde{O}(\frac{m^n}{\varepsilon^2})$ bound in Theorem 4.1.
Moreover, since close-to-uniform distributions are the most difficult distributions to learn in the distribution learning problem, it is likely that they are also the most difficult distributions for the forecast aggregation problem, and therefore the tight sample complexity of forecast aggregation should be determined by the sample complexity for those distributions, which is $\tilde{\Theta}(\frac{f(n,m)}{\varepsilon})$.

Other future directions
- The middle ground between fully correlated experts and conditionally independent experts: an example is the "partial evidence" model in [4]. Applying [4]'s results, one can show that the sample complexity of forecast aggregation in the partial evidence model is at most $O(\frac{n^2}{\varepsilon^2})$.⁶ Giving a lower bound for the partial evidence model and exploring other intermediate models is open.
- Weaker benchmark: since the Bayesian aggregator needs exponentially many samples to approximate, can we find a weaker yet meaningful benchmark with a small sample complexity?
- Samples vs. experts: in reality, obtaining samples of experts' historical forecasts can be difficult, while recruiting experts is easy. Can we achieve better aggregation by recruiting more experts instead of collecting more samples? How many experts do we need?
- Eliciting more information: previous works on information elicitation and aggregation have noticed that better aggregation can be achieved by eliciting more information than agents' own predictions, for example, also eliciting each agent's prediction about other agents' predictions (e.g., [41, 47, 31, 14]). One can ask whether and how eliciting more information can help to reduce the sample complexity of information aggregation.
- Continuous distributions: in our model the random variable $\omega$ to be predicted is discrete. One can study a setting where $\omega$ is a continuous random variable and the experts report, e.g., the means of their posterior beliefs about $\omega$. The results for continuous random variables might be very different from the results in this work.
- Other loss functions: we focused on the squared loss $E[|f(r) - \omega|^2]$ due to its popularity in machine learning and its useful property that the difference between the squared losses of any aggregator and the optimal aggregator is equal to their expected squared difference (Lemma 2.1). Alternatively, one can consider other loss functions like the logarithmic loss $E[-\omega\log f(r) - (1-\omega)\log(1-f(r))]$ and the absolute loss $E[|f(r) - \omega|]$. There might be some technical challenges in the analysis of sample complexity for those loss functions, though: e.g., the logarithmic loss can be unbounded [4, 40], and the absolute loss does not enjoy a property like Lemma 2.1.

⁶ In particular, [4] studies an online learning setting with logarithmic loss. Their regret bound can be converted to our sample complexity bound by an online-to-batch conversion and Pinsker's inequality.

Acknowledgments and Disclosure of Funding

We would like to thank Yannai Gonczarowski, Ariel Procaccia, Milind Tambe, David Parkes, Bo Waggoner, Rafael Frongillo, Grant Schoenebeck, Yuqing Kong, and anonymous reviewers for their helpful comments. This research is based upon work supported in part by the National Science Foundation under grants no. IIS-2007887 and no. IIS-2147187.

References

[1] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1st edition, November 1999.
[2] Itai Arieli, Yakov Babichenko, and Rann Smorodinsky. Robust forecast aggregation. Proceedings of the National Academy of Sciences, 115(52), December 2018.
[3] Itai Arieli, Yakov Babichenko, Inbal Talgam-Cohen, and Konstantin Zabarnyi. Universally Robust Information Aggregation for Binary Decisions.
In Proceedings of the 24th ACM Conference on Economics and Computation, pages 118–118, London, United Kingdom, July 2023. ACM.
[4] Yakov Babichenko and Dan Garber. Learning Optimal Forecast Aggregation in Partial Evidence Environments. Mathematics of Operations Research, 46(2):628–641, May 2021.
[5] Maria-Florina Balcan, Tuomas Sandholm, and Ellen Vitercik. Sample complexity of automated mechanism design. Advances in Neural Information Processing Systems, 29, 2016.
[6] Jonathan Baron, Barbara A. Mellers, Philip E. Tetlock, Eric Stone, and Lyle H. Ungar. Two Reasons to Make Aggregated Probability Forecasts More Extreme. Decision Analysis, 11(2):133–145, June 2014.
[7] John M. Bates and Clive W. J. Granger. The Combination of Forecasts. Journal of the Operational Research Society, 20(4):451–468, December 1969.
[8] Robert F. Bordley. A Multiplicative Formula for Aggregating Probability Assessments. Management Science, 28(10):1137–1148, 1982.
[9] Glenn W. Brier. Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1):1–3, January 1950.
[10] Johannes Brustle, Yang Cai, and Constantinos Daskalakis. Multi-Item Mechanisms without Item Independence: Learnability via Robustness. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 715–761, Virtual Event, Hungary, July 2020. ACM.
[11] Yang Cai and Constantinos Daskalakis. Learning Multi-Item Auctions with (or without) Samples. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 516–527, Berkeley, CA, October 2017. IEEE.
[12] Clément L. Canonne. A short note on learning discrete distributions, 2020.
[13] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, 2006.
[14] Yi-Chun Chen, Manuel Mueller-Frank, and Mallesh Pai. The Wisdom of the Crowd and Higher-Order Beliefs. In Proceedings of the 24th ACM Conference on Economics and Computation, pages 450–450, London, United Kingdom, July 2023. ACM.
[15] Gerda Claeskens, Jan R. Magnus, Andrey L. Vasnev, and Wendun Wang. The forecast combination puzzle: A simple theoretical explanation. International Journal of Forecasting, 32(3):754–762, July 2016.
[16] Robert T. Clemen. Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5(4):559–583, January 1989.
[17] Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC '14), pages 243–252, New York, NY, 2014. ACM Press.
[18] Henrique De Oliveira, Yuhta Ishii, and Xiao Lin. Robust Merging of Information. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 341–342, Budapest, Hungary, July 2021. ACM.
[19] Nikhil R. Devanur, Zhiyi Huang, and Christos-Alexandros Psomas. The sample complexity of auctions with side information. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 426–439, Cambridge, MA, June 2016. ACM.
[20] Peerapong Dhangwatnotai, Tim Roughgarden, and Qiqi Yan. Revenue maximization with a single sample. Games and Economic Behavior, 91:318–333, May 2015.
[21] R. Glen Donaldson and Mark Kamstra. Forecast combining with neural networks.
Journal of Forecasting, 15(1):49–61, January 1996.
[22] Rafael Frongillo, Robert Gomez, Anish Thilagar, and Bo Waggoner. Efficient Competitions and Online Learning with Strategic Forecasters. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 479–496, Budapest, Hungary, July 2021. ACM.
[23] Véronique Genre, Geoff Kenny, Aidan Meyler, and Allan Timmermann. Combining expert forecasts: Can anything beat the simple average? International Journal of Forecasting, 29(1):108–121, January 2013.
[24] John Geweke and Charles Whiteman. Chapter 1: Bayesian Forecasting. In Handbook of Economic Forecasting, volume 1, pages 3–80. Elsevier, 2006.
[25] Yannai A. Gonczarowski and S. Matthew Weinberg. The Sample Complexity of Up-to-ε Multidimensional Revenue Maximization. Journal of the ACM, 68(3):1–28, March 2021.
[26] Clive W. J. Granger. Invited review: combining forecasts, twenty years later. Journal of Forecasting, 8(3):167–173, July 1989.
[27] Clive W. J. Granger and Ramu Ramanathan. Improved methods of combining forecasts. Journal of Forecasting, 3(2):197–204, April 1984.
[28] Chenghao Guo, Zhiyi Huang, Zhihao Gavin Tang, and Xinzhi Zhang. Generalizing complex hypotheses on product distributions: Auctions, prophet inequalities, and Pandora's problem. In Conference on Learning Theory, pages 2248–2288. PMLR, 2021.
[29] Chenghao Guo, Zhiyi Huang, and Xinzhi Zhang. Settling the sample complexity of single-parameter revenue maximization. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC 2019), pages 662–673, Phoenix, AZ, USA, 2019. ACM Press.
[30] Sunil Gupta and Peter C. Wilton. Combination of Forecasts: An Extension. Management Science, 33(3):356–372, 1987.
[31] Yuqing Kong, Yunqi Li, Yubo Zhang, Zhihuan Huang, and Jinzhao Wu. Eliciting Thinking Hierarchy without a Prior. In Advances in Neural Information Processing Systems, volume 35, pages 13329–13341. Curran Associates, Inc., 2022.
[32] Jasper Lee. Lecture 11: Distinguishing (discrete) distributions, 2020.
[33] Jasper Lee. Lecture 12: Learning discrete distributions, 2020.
[34] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, January 2020.
[35] Spyros Makridakis and Robert L. Winkler. Averages of Forecasts: Some Empirical Results. Management Science, 29(9):987–996, September 1983.
[36] Jamie H. Morgenstern and Tim Roughgarden. On the pseudo-dimension of nearly optimal auctions. Advances in Neural Information Processing Systems, 28, 2015.
[37] Peter A. Morris. Decision Analysis Expert Use. Management Science, 20(9):1233–1241, 1974.
[38] Paul Newbold and Clive W. J. Granger. Experience with Forecasting Univariate Time Series and the Combination of Forecasts. Journal of the Royal Statistical Society, Series A (General), 137(2):131–146, 1974.
[39] Eric Neyman and Tim Roughgarden. Are You Smarter Than a Random Expert? The Robust Aggregation of Substitutable Signals. In Proceedings of the 23rd ACM Conference on Economics and Computation, pages 990–1012, Boulder, CO, USA, July 2022. ACM.
[40] Eric Neyman and Tim Roughgarden. No-regret learning with unbounded losses: The case of logarithmic pooling. arXiv preprint arXiv:2202.11219, 2022.
[41] Dražen Prelec, H. Sebastian Seung, and John McCoy. A solution to the single-question crowd wisdom problem. Nature, 541(7638):532–535, January 2017.
[42] Ville A. Satopää, Jonathan Baron, Dean P. Foster, Barbara A. Mellers, Philip E. Tetlock, and Lyle H. Ungar. Combining multiple probability predictions using a simple logit model. International Journal of Forecasting, 30(2):344–356, April 2014.
[43] Jeremy Smith and Kenneth F. Wallis. A Simple Explanation of the Forecast Combination Puzzle. Oxford Bulletin of Economics and Statistics, 71(3):331–355, June 2009.
[44] James H. Stock and Mark W. Watson. Combination forecasts of output growth in a seven-country data set. Journal of Forecasting, 23(6):405–430, September 2004.
[45] Allan Timmermann. Chapter 4: Forecast Combinations. In Handbook of Economic Forecasting, volume 1, pages 135–196. Elsevier, 2006.
[46] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York; London, 2009.
[47] Juntao Wang, Yang Liu, and Yiling Chen. Forecast Aggregation via Peer Prediction. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 9(1):131–142, October 2021.
[48] Robert L. Winkler. Combining Probability Distributions from Dependent Information Sources. Management Science, 27(4):479–488, 1981.
[49] Robert L. Winkler and Spyros Makridakis. The Combination of Forecasts. Journal of the Royal Statistical Society, Series A (General), 146(2):150, 1983.
[50] Chunxue Yang and Xiaohui Bei. Learning Optimal Auctions with Correlated Valuations from Samples. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11716–11726. PMLR, July 2021.
[51] Yuhong Yang. Combining Forecasting Procedures: Some Theoretical Results. Econometric Theory, 20(1):176–222, 2004.

A Additional Preliminaries

A.1 Properties of the Hellinger Distance

We review some useful properties of the Hellinger distance. First, the Hellinger distance gives upper bounds on the total variation distance:

Fact A.1 (e.g., Lemma 2.3 in [46]).
- $d_{\mathrm{TV}}(D_1, D_2) \le \sqrt{2}\,d_H(D_1, D_2)$.
- $1 - d_{\mathrm{TV}}^2(D_1, D_2) \ge \big(1 - d_H^2(D_1, D_2)\big)^2$. (This inequality implies the first one.)

Second, we will use the following lemma to upper bound the squared Hellinger distance between two distributions that are close to each other:

Lemma A.2. Let $D_1$ and $D_2$ be two distributions on $X$ satisfying $1 - \varepsilon \le \frac{D_2(x)}{D_1(x)} \le 1 + \varepsilon$ for all $x \in X$. Then, $d_H^2(D_1, D_2) \le \frac{1}{2}\varepsilon^2$.

Proof. By definition,

$d_H^2(D_1, D_2) = \frac{1}{2}\sum_{x \in X}\big(\sqrt{D_1(x)} - \sqrt{D_2(x)}\big)^2 = \frac{1}{2}\sum_{x \in X} D_1(x)\Big(1 - \sqrt{\tfrac{D_2(x)}{D_1(x)}}\Big)^2$.

If $1 - \varepsilon \le \frac{D_2(x)}{D_1(x)} < 1$, then we have $\big(1 - \sqrt{\tfrac{D_2(x)}{D_1(x)}}\big)^2 \le \big(1 - \sqrt{1-\varepsilon}\big)^2 \le \big(1 - (1-\varepsilon)\big)^2 = \varepsilon^2$. If $1 + \varepsilon \ge \frac{D_2(x)}{D_1(x)} > 1$, then we have $\big(\sqrt{\tfrac{D_2(x)}{D_1(x)}} - 1\big)^2 \le \big(\sqrt{1+\varepsilon} - 1\big)^2 \le \big((1+\varepsilon) - 1\big)^2 = \varepsilon^2$. These two cases together imply $d_H^2(D_1, D_2) \le \frac{1}{2}\sum_{x \in X} D_1(x)\,\varepsilon^2 = \frac{1}{2}\varepsilon^2$.

Third, we will use a property of the Hellinger distance between distributions defined on a product space. Suppose $D_1$ and $D_2$ are two distributions over a product space $X \times Y$. They can be decomposed into the marginal distribution of $x \in X$ and the conditional distribution of $y \in Y$ given $x$, namely, $D_1(x, y) = D_{1,x}(x)\,D_{1,y|x}(y|x)$ and $D_2(x, y) = D_{2,x}(x)\,D_{2,y|x}(y|x)$. Then, the squared Hellinger distance between $D_1$ and $D_2$ satisfies the following:

Lemma A.3. $d_H^2(D_1, D_2) \le d_H^2(D_{1,x}, D_{2,x}) + \max_{x \in X} d_H^2(D_{1,y|x}, D_{2,y|x})$. In particular, if $x$ and $y$ are independent, then $d_H^2(D_1, D_2) \le d_H^2(D_{1,x}, D_{2,x}) + d_H^2(D_{1,y}, D_{2,y})$.
Proof. By definition,

$d_H^2(D_1, D_2) = 1 - \sum_{x,y}\sqrt{D_1(x,y)\,D_2(x,y)} = 1 - \sum_{x}\sqrt{D_{1,x}(x)\,D_{2,x}(x)}\sum_{y}\sqrt{D_{1,y|x}(y|x)\,D_{2,y|x}(y|x)}$
$= 1 - \sum_{x}\sqrt{D_{1,x}(x)\,D_{2,x}(x)} + \sum_{x}\sqrt{D_{1,x}(x)\,D_{2,x}(x)}\Big(1 - \sum_{y}\sqrt{D_{1,y|x}(y|x)\,D_{2,y|x}(y|x)}\Big)$
$= d_H^2(D_{1,x}, D_{2,x}) + \sum_{x}\sqrt{D_{1,x}(x)\,D_{2,x}(x)}\;d_H^2(D_{1,y|x}, D_{2,y|x})$
$\le d_H^2(D_{1,x}, D_{2,x}) + \Big(\sum_{x}\sqrt{D_{1,x}(x)\,D_{2,x}(x)}\Big)\max_{x \in X} d_H^2(D_{1,y|x}, D_{2,y|x})$
$\le d_H^2(D_{1,x}, D_{2,x}) + 1 \cdot \max_{x \in X} d_H^2(D_{1,y|x}, D_{2,y|x})$,

where the last inequality is because $\sum_{x \in X}\sqrt{D_{1,x}(x)\,D_{2,x}(x)} = 1 - d_H^2(D_{1,x}, D_{2,x}) \le 1$.

Finally, let $D^{\otimes T}$ denote the distribution of $T$ i.i.d. samples from $D$, namely, the product of $T$ independent copies of $D$. We have the following lemma that relates $d_H^2(D_1^{\otimes T}, D_2^{\otimes T})$ to $d_H^2(D_1, D_2)$:

Lemma A.4 (e.g., [32]). $d_H^2(D_1^{\otimes T}, D_2^{\otimes T}) = 1 - \big(1 - d_H^2(D_1, D_2)\big)^T \le T\,d_H^2(D_1, D_2)$.

A.2 Distinguishing Distributions

Let $D_1, D_2$ be two distributions over a discrete space $X$. A distribution $D_i$ is chosen uniformly at random from $\{D_1, D_2\}$. Then, we are given $T$ samples from $D_i$ and want to guess whether the distribution is $D_1$ or $D_2$. It is known that at least $T = \Omega\big(\frac{1}{d_H^2(D_1, D_2)}\log\frac{1}{\delta}\big)$ samples are needed to guess correctly with probability at least $1 - \delta$, no matter how we guess. Formally:

Lemma A.5 (e.g., [32]). Let $j \in \{1, 2\}$ be the index of the distribution we guess based on the samples. The probability of making a mistake when distinguishing $D_1$ and $D_2$ using $T$ samples, namely $\Pr[j \ne i] = \frac{1}{2}\Pr[j \ne i \mid i = 1] + \frac{1}{2}\Pr[j \ne i \mid i = 2]$, is at least:
- $\Pr[j \ne i] \ge \frac{1}{2} - \sqrt{\frac{T}{2}}\,d_H(D_1, D_2)$;
- $\Pr[j \ne i] \ge \frac{1}{4}\big(1 - d_H^2(D_1, D_2)\big)^{2T} \ge \frac{1}{4}e^{-4T\,d_H^2(D_1, D_2)}$, assuming $d_H^2(D_1, D_2) \le \frac{1}{2}$.

The second item implies that, in order to achieve $\Pr[j \ne i] \le \delta$, we must have $T \ge \frac{1}{4\,d_H^2(D_1, D_2)}\log\frac{1}{4\delta}$.

We provide a proof of this lemma for completeness:

Proof. Let $D_1^{\otimes T}$ and $D_2^{\otimes T}$ denote the distributions of $T$ i.i.d. samples from $D_1$ and $D_2$, respectively. The draw of $T$ samples from $D_1$ or $D_2$ is equivalent to the draw of one sample from $D_1^{\otimes T}$ or $D_2^{\otimes T}$. Given one sample from $D_1^{\otimes T}$ or $D_2^{\otimes T}$, the probability of making a mistake when guessing the distribution is at least:

$\Pr[j \ne i] = \frac{1}{2}\big(\Pr[j = 2 \mid i = 1] + \Pr[j = 1 \mid i = 2]\big) = \frac{1}{2}\big(1 - \Pr[j = 1 \mid i = 1] + \Pr[j = 1 \mid i = 2]\big)$
$= \frac{1}{2}\Big(1 - \big(E_{D_1^{\otimes T}}[\mathbb{1}\{j = 1\}] - E_{D_2^{\otimes T}}[\mathbb{1}\{j = 1\}]\big)\Big) \overset{\text{Fact 3.2}}{\ge} \frac{1}{2}\big(1 - d_{\mathrm{TV}}(D_1^{\otimes T}, D_2^{\otimes T})\big)$.   (12)

We then upper bound $d_{\mathrm{TV}}(D_1^{\otimes T}, D_2^{\otimes T})$ in two ways, which will prove the two items of the lemma, respectively. First, according to the first item of Fact A.1, we have $d_{\mathrm{TV}}(D_1^{\otimes T}, D_2^{\otimes T}) \le \sqrt{2}\,d_H(D_1^{\otimes T}, D_2^{\otimes T})$. By Lemma A.4, $d_H(D_1^{\otimes T}, D_2^{\otimes T}) \le \sqrt{T}\,d_H(D_1, D_2)$. The above two inequalities give $d_{\mathrm{TV}}(D_1^{\otimes T}, D_2^{\otimes T}) \le \sqrt{2T}\,d_H(D_1, D_2)$. This proves the first item of the lemma. Second, according to the second item of Fact A.1 and Lemma A.4, we have

$1 - d_{\mathrm{TV}}^2(D_1^{\otimes T}, D_2^{\otimes T}) \ge \big(1 - d_H^2(D_1^{\otimes T}, D_2^{\otimes T})\big)^2 = \big(1 - d_H^2(D_1, D_2)\big)^{2T}$.

Since $1 - d_{\mathrm{TV}}^2(D_1^{\otimes T}, D_2^{\otimes T}) = \big(1 + d_{\mathrm{TV}}(D_1^{\otimes T}, D_2^{\otimes T})\big)\big(1 - d_{\mathrm{TV}}(D_1^{\otimes T}, D_2^{\otimes T})\big) \le 2\big(1 - d_{\mathrm{TV}}(D_1^{\otimes T}, D_2^{\otimes T})\big)$, we have $1 - d_{\mathrm{TV}}(D_1^{\otimes T}, D_2^{\otimes T}) \ge \frac{1}{2}\big(1 - d_H^2(D_1, D_2)\big)^{2T}$. Plugging into (12), we get $\Pr[j \ne i] \ge \frac{1}{4}\big(1 - d_H^2(D_1, D_2)\big)^{2T}$. When $d_H^2(D_1, D_2) \le \frac{1}{2}$, we use the inequality $1 - x \ge e^{-2x}$ for $0 < x < \frac{1}{2}$ to conclude that $\Pr[j \ne i] \ge \frac{1}{4}\big(e^{-2\,d_H^2(D_1, D_2)}\big)^{2T} = \frac{1}{4}e^{-4T\,d_H^2(D_1, D_2)}$.
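The tensorization identity in Lemma A.4 is easy to verify numerically on small examples; a minimal sketch:

```python
from math import sqrt
from itertools import product

def d_h2(D1, D2):
    """Squared Hellinger distance between two discrete distributions (dicts)."""
    xs = set(D1) | set(D2)
    return 1.0 - sum(sqrt(D1.get(x, 0.0) * D2.get(x, 0.0)) for x in xs)

def power(D, T):
    """The T-fold product distribution of T i.i.d. samples from D."""
    out = {}
    for xs in product(D, repeat=T):
        p = 1.0
        for x in xs:
            p *= D[x]
        out[xs] = p
    return out

D1, D2, T = {"a": 0.5, "b": 0.5}, {"a": 0.6, "b": 0.4}, 4
lhs = d_h2(power(D1, T), power(D2, T))
rhs = 1.0 - (1.0 - d_h2(D1, D2)) ** T     # the exact identity of Lemma A.4
assert abs(lhs - rhs) < 1e-9 and lhs <= T * d_h2(D1, D2)
```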
B Special Cases: Strongly and Weakly Informative Experts

In this section we investigate two special families of conditionally independent distributions that admit $O\big(\frac{1}{\varepsilon}\log\frac{1}{\delta}\big)$ sample complexity, which is smaller than the $O\big(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon\delta}\big)$ bound for general conditionally independent distributions. In these two cases, the (signals of) experts are either very "informative" or very "non-informative".

Definition B.1. Let $\gamma \in [0, \infty]$ be a parameter. For an expert $i$, we say its signal $s_i \in S_i$ is
- $\gamma$-strongly informative if either $\frac{P(s_i \mid \omega=1)}{P(s_i \mid \omega=0)} \ge 1 + \gamma$ or $\frac{P(s_i \mid \omega=1)}{P(s_i \mid \omega=0)} \le \frac{1}{1+\gamma}$ holds;
- $\gamma$-weakly informative if $\frac{1}{1+\gamma} \le \frac{P(s_i \mid \omega=1)}{P(s_i \mid \omega=0)} \le 1 + \gamma$.

An expert $i$ is $\gamma$-strongly (or $\gamma$-weakly) informative if all of its signals in $S_i$ are $\gamma$-strongly (or $\gamma$-weakly) informative.⁷

⁷ We note that an expert can be neither $\gamma$-strongly informative nor $\gamma$-weakly informative for any $\gamma$.

A signal $s_i$ being $\gamma$-strongly (or $\gamma$-weakly) informative implies that its corresponding report $r_i$ will be "$\gamma$-away from" (or "$\gamma$-close to") the prior $p = P(\omega = 1)$, in the $\frac{r_i}{1-r_i}$ and $\frac{p}{1-p}$ ratio form. Specifically, if $s_i$ is $\gamma$-strongly informative, then from Equation (1) we have

$\frac{r_i}{1-r_i} = \frac{P(s_i \mid \omega=1)}{P(s_i \mid \omega=0)} \cdot \frac{p}{1-p} \ge (1+\gamma)\rho$ or $\le \frac{1}{1+\gamma}\rho$.   (13)

As $\gamma$ gets larger, a $\gamma$-strongly informative signal (expert) is more informative for predicting whether $\omega = 1$ or $0$. This should make aggregation easier. If $s_i$ is $\gamma$-weakly informative, then:

$\frac{1}{1+\gamma}\rho \le \frac{r_i}{1-r_i} \le (1+\gamma)\rho$.   (14)

As $\gamma$ gets smaller, a $\gamma$-weakly informative signal (expert) is less informative for predicting $\omega$, but in this case the report $r_i$ will be close to the prior $p$, which allows us to estimate the $\rho^{n-1}$ term in the optimal aggregator $f^*(r) = \frac{1}{1+\rho^{n-1}\prod_{i=1}^n \frac{1-r_i}{r_i}}$ better. These are some intuitions for why both strongly and weakly informative signals can reduce the sample complexity of forecast aggregation. Formally, for $\gamma$-strongly informative experts with not-too-small $\gamma$, we have the following result:

Theorem B.2. If $n \ge 32\log\frac{2}{\varepsilon}$ and all experts are $\gamma$-strongly informative with $\frac{\gamma}{1+\gamma} \ge 8\sqrt{\varepsilon}$, then the sample complexity of forecast aggregation is $O\big(\frac{1}{\varepsilon n (\gamma/(1+\gamma))^2}\log\frac{1}{\delta}\big)$.

We remark that the conditions of the theorem, $n \ge 32\log\frac{2}{\varepsilon}$ and $\frac{\gamma}{1+\gamma} \ge 8\sqrt{\varepsilon}$, are easier to satisfy as the number of experts $n$ increases, if the informativeness parameter $\gamma$ of each expert is a constant or does not decrease with $n$ (which we believe is a reasonable assumption given that experts are independent of each other). Also, if $\gamma$ is fixed or increasing, then as $n$ increases the sample complexity decreases. The proof of Theorem B.2 is in Appendix G.1. Roughly speaking, we divide each expert's signal set into two sets, $S_i^1$ and $S_i^0$: signals that are more likely to occur under $\omega = 1$ (i.e., $\frac{P(s_i \mid \omega=1)}{P(s_i \mid \omega=0)} \ge 1+\gamma$) and under $\omega = 0$ (i.e., $\frac{P(s_i \mid \omega=1)}{P(s_i \mid \omega=0)} \le \frac{1}{1+\gamma}$). If the realized $\omega$ is 1, then one may expect to see $\Omega\big(\frac{\gamma}{1+\gamma}n\big)$ more $S_i^1$ signals than $S_i^0$ signals from the $n$ experts, because the probabilities of these two types of signals differ by $\Omega(\frac{\gamma}{1+\gamma})$ for each expert. If $\omega$ is 0, then one may expect to see more $S_i^0$ signals than $S_i^1$ signals. So, by checking which type of signals occurs more often, we can tell whether $\omega = 0$ or $1$.
Since experts are weakly informative, each of their reports ri 1 ri is around ρ in the range [ 1 1+γ ρ, (1 + γ)ρ] [ 1 1+ 1 n ρ, (1 + 1 n)ρ]. Taking the product, we have Qn i=1 ri 1 ri [ ρn e , eρn], which is in a bounded range. This allows us to use Chernoff bound to argue that the ρn (or ρn 1) term in the optimal aggregator f (r) = 1 1+ρn 1 Qn i=1 1 ri ri can be esti- mated, with O( ε) accuracy, using only O( e ( ε)2 log 1 δ ) samples of Qn i=1 ri 1 ri . The aggregator using the estimate ˆρn 1, ˆf(r) = 1 1+ˆρn 1 Qn i=1 1 ri ri , turns out to be O(ε)-optimal. C Extension: Multi-Outcome Events In this section we generalize our model from binary events to multi-outcome events. The event space now becomes Ω= {1, 2, . . . , |Ω|} with |Ω| > 2. The joint distribution of event ω Ωand experts signals s = (si)n i=1 S is still denoted by P, which belongs to some class of distributions P. The size of each expert s signal space is still assumed to be |Si| = m < + for general distributions and can be infinite for conditionally independent distributions (where s1, . . . , sn are independent conditioned on ω). After observing signal si, expert i reports its posterior belief of the event given si, which is ri = (rij)j Ωwhere rij = P(ω = j | si). An aggregator now is a vector-valued function f = (fj)j Ωthat maps the joint report r = (ri)n i=1 = (rij)ij to a probability distribution f(r) over Ω, where fj(r) is the aggregated predicted probability for ω = j. We assume fj(r) 0 and P j Ωfj(r) = 1. The definition of the (expected) loss of an aggregator f becomes: LP (f) = EP h X fj(r) 1[ω = j] 2 i . (15) It is easy to see that the optimal aggregator f = argminf LP (f) in the multi-outcome case is still the Bayesian aggregator (this is a generalization of Lemma 2.1): f = (f j )j Ω, f j (r) = P(ω = j | r) (16) and the difference between the losses of f and f satisfies LP (f) LP (f ) = EP h X fj(r) f j (r) 2 i . (17) So, an ε-optimal aggregator f is an aggregator that satisfies EP P j Ω|fj(r) f j (r)|2 ε. Using T = TP(ε, δ) samples {(r(t), ω(t))}T t=1 from P, the principal wants to find an ε-optimal aggregator ˆf with probability at least 1 δ, for any distribution P in a class P. We give lower bounds and upper bounds on the sample complexity: Theorem C.1. The sample complexity of forecast aggregation for multi-outcome events is: For the class P of general distributions: Ω mn 2+log(1/δ) ε TP(ε, δ) O |Ω|mn+log(1/δ) For the class Pind of conditionally independent distributions: Ω 1 δ TPind(ε, δ) O |Ω| log |Ω| C.1 Proof of Theorem C.1 Before proving the theorem, we note that the loss of any aggregator f satisfying fj(r) 0 and P j Ωfj(r) = 1 is bounded by [0, 2]: fj(r) 1[ω = j] 2 X fj(r) 1[ω = j] 1 + X j Ω fj(r) = 2. (18) fj(r) f j (r) 2 2. (19) C.1.1 Lower Bounds The lower bounds directly follow from the lower bounds for the binary case (Theorem 4.1 and Theorem 5.2) because the binary case is a special case of the multi-outcome case. Specifically, we can regard any binary event distribution P as a multi-outcome distribution that puts probability only on outcomes {1, 2} Ω. If we can learn an ε-optimal aggregator ˆf for the multi-outcome case: E P j Ω| ˆfj(r) f j (r)|2 ε, then this aggregator satisfies ˆfj(r) f j (r) 2i ε E h ˆf1(r) f 1 (r) 2 + (1 ˆf1(r)) (1 f 1 (r)) 2i = E h 2 ˆf1(r) f 1 (r) 2i ε E h ˆf1(r) f 1 (r) 2i ε So, the aggregator ˆf1( ) is an ε 2-optimal aggregator for the binary case. C.1.2 Upper Bounds General distributions: the O |Ω|mn+log(1/δ) ε2 upper bound. 
C.1.2 Upper Bounds

General distributions: the $O\big(\frac{|\Omega|m^n+\log(1/\delta)}{\varepsilon^2}\big)$ upper bound. The proof for general distributions in the multi-outcome case is the same as the proof for general distributions in the binary case (Section 4.1), except for three differences: (1) $P$ (regarded as a distribution over reports $r$ and event $\omega$) now is a discrete distribution with support size at most $|\Omega|m^n$; (2) the loss $|f(r)-\omega|^2\in[0,1]$ in the binary case is replaced by the loss $\sum_{j\in\Omega}|f_j(r)-\mathbb 1[\omega=j]|^2\in[0,2]$ in the multi-outcome case; (3) the $\varepsilon$ in the binary case is replaced by $\frac\varepsilon2$ because the loss is now upper bounded by 2. Thus, the bound $O\big(\frac{m^n+\log(1/\delta)}{\varepsilon^2}\big)$ in the binary case becomes $O\big(\frac{|\Omega|m^n+\log(1/\delta)}{(\varepsilon/2)^2}\big) = O\big(\frac{|\Omega|m^n+\log(1/\delta)}{\varepsilon^2}\big)$ in the multi-outcome case.

Conditionally independent distributions: the $O\big(\frac{|\Omega|\log|\Omega|+\log(1/\delta)}{\varepsilon^2}\big)$ upper bound. The sample complexity upper bound for conditionally independent distributions is proved by a pseudo-dimension argument. We first show in Lemma C.2 that the optimal aggregator $f^*$ belongs to some parametric family of aggregators. Then, we upper bound the pseudo-dimension of the loss functions associated with this parametric family of aggregators, which gives the desired sample complexity upper bound.

Lemma C.2. For a multi-outcome conditionally independent distribution $P$, given signals $s = (s_i)_{i=1}^n$ with corresponding reports $r = (r_i)_{i=1}^n = (r_{ij})_{ij}$ where $r_{ij} = P(\omega=j\mid s_i)$, the posterior probability of $\omega = j$ satisfies:

$$f^*_j(r) = P(\omega=j\mid r) = P(\omega=j\mid s) = \frac{\frac{1}{P(\omega=j)^{n-1}}\prod_{i=1}^n r_{ij}}{\sum_{k\in\Omega}\frac{1}{P(\omega=k)^{n-1}}\prod_{i=1}^n r_{ik}}.$$

Proof. By Bayes' rule and the fact that $s_1,\ldots,s_n$ are independent conditioning on $\omega$,

$$P(\omega=j\mid s) = \frac{P(\omega=j)\prod_{i=1}^nP(s_i\mid\omega=j)}{\sum_{k\in\Omega}P(\omega=k)\prod_{i=1}^nP(s_i\mid\omega=k)} = \frac{\frac{1}{P(\omega=j)^{n-1}}\prod_{i=1}^nP(\omega=j)P(s_i\mid\omega=j)}{\sum_{k\in\Omega}\frac{1}{P(\omega=k)^{n-1}}\prod_{i=1}^nP(\omega=k)P(s_i\mid\omega=k)} = \frac{\frac{1}{P(\omega=j)^{n-1}}\prod_{i=1}^n\frac{P(\omega=j)P(s_i\mid\omega=j)}{P(s_i)}}{\sum_{k\in\Omega}\frac{1}{P(\omega=k)^{n-1}}\prod_{i=1}^n\frac{P(\omega=k)P(s_i\mid\omega=k)}{P(s_i)}} = \frac{\frac{1}{P(\omega=j)^{n-1}}\prod_{i=1}^n r_{ij}}{\sum_{k\in\Omega}\frac{1}{P(\omega=k)^{n-1}}\prod_{i=1}^n r_{ik}}.$$

Since the above expression depends only on the $r_i$'s but not on the $s_i$'s, we have $f^*_j(r) = P(\omega=j\mid r) = P(\omega=j\mid s)$, equal to the displayed formula. (For the special binary event case,

$$f^*(r) = P(\omega=1\mid r) = \frac{\frac{1}{P(\omega=1)^{n-1}}\prod_{i=1}^n r_i}{\frac{1}{P(\omega=1)^{n-1}}\prod_{i=1}^n r_i + \frac{1}{P(\omega=0)^{n-1}}\prod_{i=1}^n(1-r_i)} = \frac{1}{1+\big(\frac{P(\omega=1)}{P(\omega=0)}\big)^{n-1}\prod_{i=1}^n\frac{1-r_i}{r_i}}.$$

This proves Lemma 5.1.)

According to Lemma C.2, the optimal aggregator $f^* = (f^*_j)_{j\in\Omega}$ belongs to the following family of aggregators, parameterized by $\theta = (\theta_j)_{j\in\Omega}\in\mathbb R^{|\Omega|}_+$:

$$\mathcal F = \Big\{f^\theta = (f^\theta_j)_{j\in\Omega}\ \Big|\ f^\theta_j(r) = \frac{\theta_j\prod_{i=1}^n r_{ij}}{\sum_{k\in\Omega}\theta_k\prod_{i=1}^n r_{ik}}\Big\}.$$

(The optimal aggregator $f^*$ has parameter $\theta_j = \frac{1}{P(\omega=j)^{n-1}}$.) Let $\mathcal G$ be the family of loss functions associated with $\mathcal F$, which is also parameterized by $\theta\in\mathbb R^{|\Omega|}_+$:

$$\mathcal G = \Big\{g^\theta\ \Big|\ g^\theta(r,\omega) = \sum_{j\in\Omega}\big(f^\theta_j(r)-\mathbb 1[\omega=j]\big)^2\Big\}. \qquad (21)$$
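To make the family $\mathcal F$ and its loss family $\mathcal G$ from (21) concrete, here is a minimal Python sketch computing $f^\theta_j(r)$ and $g^\theta(r,\omega)$. Names are ours; outcomes are indexed from 0 here rather than from 1 as in the paper.

```python
import numpy as np

def f_theta(theta: np.ndarray, reports: np.ndarray) -> np.ndarray:
    """f^theta_j(r) = theta_j * prod_i r_ij / sum_k theta_k * prod_i r_ik.
    reports: shape (n, |Omega|), row i is expert i's posterior over outcomes."""
    prods = reports.prod(axis=0)       # x_j = prod_i r_ij
    weights = theta * prods
    return weights / weights.sum()

def g_theta(theta: np.ndarray, reports: np.ndarray, omega: int) -> float:
    """Quadratic loss sum_j (f^theta_j(r) - 1[omega = j])^2."""
    pred = f_theta(theta, reports)
    onehot = np.zeros_like(pred)
    onehot[omega] = 1.0
    return float(((pred - onehot) ** 2).sum())

# Example: 3 experts, 2 outcomes, uniform-prior parameters theta = (1, 1).
reports = np.array([[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]])
print(f_theta(np.ones(2), reports), g_theta(np.ones(2), reports, omega=0))
```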
We recall the definition of the pseudo-dimension of a family of functions:

Definition C.3 (e.g., [1]). Let $H$ be a family of functions from input space $X$ to the real numbers $\mathbb R$. Let $x^{(1)},\ldots,x^{(d)}\in X$ be $d$ inputs and $t^{(1)},\ldots,t^{(d)}\in\mathbb R$ be $d$ "thresholds". Let $b = (b^{(1)},\ldots,b^{(d)})\in\{-1,+1\}^d$ be a vector of labels. We say $b$ can be generated by $H$ (on inputs $x^{(1)},\ldots,x^{(d)}$ with thresholds $t^{(1)},\ldots,t^{(d)}$) if there exists a function $h_b\in H$ such that $h_b(x^{(i)}) > t^{(i)}$ if $b^{(i)} = +1$ and $h_b(x^{(i)}) < t^{(i)}$ if $b^{(i)} = -1$ (namely, $\mathrm{sign}(h_b(x^{(i)})-t^{(i)}) = b^{(i)}$) for every $i\in\{1,\ldots,d\}$. The pseudo-dimension of $H$, $\mathrm{Pdim}(H)$, is the largest integer $d$ for which there exist $d$ inputs and $d$ thresholds such that all the $2^d$ label vectors in $\{-1,+1\}^d$ can be generated by $H$.

Pseudo-dimension gives a sample complexity upper bound for the uniform convergence of the empirical means of a family of functions to their true means:

Theorem C.4 (e.g., [1]). Let $H$ be a family of functions from $X$ to $[0,U]$ with $\mathrm{Pdim}(H) = d$. For any distribution $D$ on $X$, with probability at least $1-\delta$ over the random draw of $T = O\big(\frac{U^2}{\varepsilon^2}(d+\log\frac1\delta)\big)$ samples $x^{(1)},\ldots,x^{(T)}$ from $D$, we have: for every $h\in H$, $\big|E_{x\sim D}[h(x)] - \frac1T\sum_{t=1}^T h(x^{(t)})\big|\le\varepsilon$.

We now upper bound the pseudo-dimension of $\mathcal G$:

Lemma C.5. $\mathrm{Pdim}(\mathcal G) \le O(|\Omega|\log|\Omega|)$.

Proof. Suppose $\mathrm{Pdim}(\mathcal G) = d$. By definition, there exist $d$ inputs $(r^{(1)},\omega^{(1)}),\ldots,(r^{(d)},\omega^{(d)})$ and $d$ thresholds $t^{(1)},\ldots,t^{(d)}\in\mathbb R$ such that all the $2^d$ label vectors $b\in\{-1,+1\}^d$ can be generated by some function $g^{\theta_b}\in\mathcal G$. We will count how many label vectors can actually be generated by $\mathcal G$.

Consider the $\ell$-th input $(r^{(\ell)},\omega^{(\ell)})$ and threshold $t^{(\ell)}$. To simplify notation we omit the superscript $(\ell)$, so we have $(r,\omega)$ and $t$. Write $x = (x_j)_{j\in\Omega}$ where $x_j = \prod_{i=1}^n r_{ij}$. For any function $g^\theta\in\mathcal G$, we have

$$g^\theta(r,\omega) = \sum_{j\in\Omega}\Big(\frac{\theta_jx_j}{\sum_{k\in\Omega}\theta_kx_k}-\mathbb 1[\omega=j]\Big)^2 = \frac{\sum_{j\in\Omega}(\theta_jx_j)^2 - 2\theta_\omega x_\omega\sum_{k\in\Omega}\theta_kx_k + \big(\sum_{k\in\Omega}\theta_kx_k\big)^2}{\big(\sum_{k\in\Omega}\theta_kx_k\big)^2}.$$

By definition, the set of parameters that give input $(r,\omega)$ label $+1$ is $\{\theta\in\mathbb R^{|\Omega|}_+ \mid g^\theta(r,\omega) > t\}$. By the above equation, this set is equal to the set

$$\Big\{\theta\in\mathbb R^{|\Omega|}_+\ \Big|\ \sum_{j\in\Omega}(\theta_jx_j)^2 - 2\theta_\omega x_\omega\sum_{k\in\Omega}\theta_kx_k + \Big(\sum_{k\in\Omega}\theta_kx_k\Big)^2 > t\Big(\sum_{k\in\Omega}\theta_kx_k\Big)^2\Big\}.$$

We note that the above set is the solution set to a quadratic-form inequality $\theta^\top A\theta > 0$ for some matrix $A\in\mathbb R^{|\Omega|\times|\Omega|}$. Similarly, the set of parameters that give input $(r,\omega)$ label $-1$ is the solution set to $\theta^\top A\theta < 0$. These two sets share a boundary: the solution set to $\theta^\top A\theta = 0$, which is a hyper-ellipsoid in $\mathbb R^{|\Omega|}$. In other words, the input $(r,\omega)$ and threshold $t$ define a hyper-ellipsoid which divides the parameter space $\mathbb R^{|\Omega|}_+$ into two regions such that all the parameters in one region generate the same label for that input.

Enumerating all inputs $(r^{(1)},\omega^{(1)}),\ldots,(r^{(d)},\omega^{(d)})$, they define $d$ hyper-ellipsoids in total, dividing the parameter space $\mathbb R^{|\Omega|}_+$ into several regions. Within each region, all the parameters generate the same label for each input and hence generate the same label vector. So, the number of label vectors that can be generated by all the parameters in $\mathbb R^{|\Omega|}_+$ is upper bounded by the number of regions. The number of regions into which $d$ hyper-ellipsoids divide $\mathbb R^{|\Omega|}_+$ is of order $O(d^{|\Omega|})$. Hence, to generate all the $2^d$ label vectors we need $2^d\le O(d^{|\Omega|})$. This gives $d\le O(|\Omega|\log|\Omega|)$.

By Theorem C.4 and Lemma C.5, plugging in $d = \mathrm{Pdim}(\mathcal G)\le O(|\Omega|\log|\Omega|)$ and $U = 2$ (because $g^\theta(r,\omega)$ is bounded by $[0,2]$ according to (18)), we obtain the following: with probability at least $1-\delta$ over the random draw of $T = O\big(\frac{|\Omega|\log|\Omega|+\log(1/\delta)}{\varepsilon^2}\big)$ samples $(r^{(1)},\omega^{(1)}),\ldots,(r^{(T)},\omega^{(T)})$ from $P$, we have for every $g^\theta\in\mathcal G$,

$$\Big|E_P\big[g^\theta(r,\omega)\big] - \frac1T\sum_{t=1}^T g^\theta(r^{(t)},\omega^{(t)})\Big|\le\varepsilon.$$

This is equivalent to: for every $f^\theta\in\mathcal F$,

$$\Big|E_P\Big[\sum_{j\in\Omega}\big(f^\theta_j(r)-\mathbb 1[\omega=j]\big)^2\Big] - \frac1T\sum_{t=1}^T\sum_{j\in\Omega}\big(f^\theta_j(r^{(t)})-\mathbb 1[\omega^{(t)}=j]\big)^2\Big|\le\varepsilon \ \iff\ \big|L_P(f^\theta) - L_{\hat P_{emp}}(f^\theta)\big|\le\varepsilon,$$

where $\hat P_{emp}$ is the empirical distribution $\mathrm{Uniform}\big\{(r^{(1)},\omega^{(1)}),\ldots,(r^{(T)},\omega^{(T)})\big\}$. By the same logic as the proof of the upper bound in Theorem 4.1 (Section 4.1), the empirically optimal aggregator $\hat f = \arg\min_{f^\theta\in\mathcal F}L_{\hat P_{emp}}(f^\theta)$ is $2\varepsilon$-optimal.
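The empirical-risk-minimization step above can be sketched in a few lines of Python. This is a simplified illustration (a generic numerical optimizer on synthetic data), not the paper's procedure; all names and the synthetic sample generator are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, K, T = 3, 2, 500   # experts, outcomes, samples

# Hypothetical synthetic samples (reports, omega); in the paper these come from P.
samples = [(rng.uniform(0.1, 0.9, size=(n, K)), int(rng.integers(K))) for _ in range(T)]
samples = [(r / r.sum(axis=1, keepdims=True), w) for r, w in samples]  # rows sum to 1

def f_theta(theta, reports):
    weights = np.exp(theta) * reports.prod(axis=0)  # exp keeps parameters positive
    return weights / weights.sum()

def empirical_loss(theta):
    total = 0.0
    for reports, omega in samples:
        pred = f_theta(theta, reports)
        total += ((pred - np.eye(K)[omega]) ** 2).sum()
    return total / T

theta_hat = minimize(empirical_loss, x0=np.zeros(K), method="Nelder-Mead").x
print("learned theta (log-scale):", theta_hat)
```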
D Missing Proofs from Section 2

D.1 Proof of Lemma 2.1

To prove the first item, $f^*(r) = P(\omega=1\mid r)$, we simply note that $f^* = \arg\min_f E_r\big[E[|f(r)-\omega|^2\mid r]\big]$ should minimize $E[|f(r)-\omega|^2\mid r]$ for almost every $r$. This gives $f^*(r) = E[\omega\mid r] = P(\omega=1\mid r)$.

To prove the second item, $L_P(f)-L_P(f^*) = E_P\big[|f(r)-f^*(r)|^2\big]$, we note that, by the definition of $L_P(\cdot)$ and the fact that $f^*(r) = E[\omega\mid r]$ proven above,

$$\begin{aligned}
L_P(f)-L_P(f^*) &= E|f(r)-\omega|^2 - E|f^*(r)-\omega|^2 \\
&= E[f(r)^2] - 2E[f(r)\omega] - E[f^*(r)^2] + 2E[f^*(r)\omega] \\
&= E[f(r)^2] - 2E_r\big[f(r)E[\omega\mid r]\big] - E[f^*(r)^2] + 2E_r\big[f^*(r)E[\omega\mid r]\big] \\
&= E[f(r)^2] - 2E_r\big[f(r)f^*(r)\big] - E[f^*(r)^2] + 2E_r\big[f^*(r)^2\big] \\
&= E\big[f(r)^2 - 2f(r)f^*(r) + f^*(r)^2\big] = E|f(r)-f^*(r)|^2.
\end{aligned}$$

E Missing Proofs from Section 4

E.1 Proof of Lemma 4.3

Recall that we have a family $\mathcal D$ of distributions over $S = S_1\times\cdots\times S_n$ satisfying the three properties in Definition 4.2 ($B$-uniformly bounded, same marginal across distributions, and distinct marginals across signals). We have constructed from $\mathcal D$ the family of distributions $\mathcal P = \{P_D : D\in\mathcal D\}$ for the forecast aggregation problem as follows:

$$P_D(\omega=0) = P_D(\omega=1) = \tfrac12;\qquad P_D(\cdot\mid\omega=0) = \mathrm{Uniform}(S),\ \text{namely } P_D(s\mid\omega=0) = \tfrac1{|S|} = \tfrac1{m^n};\qquad P_D(\cdot\mid\omega=1) = D,\ \text{namely } P_D(s\mid\omega=1) = D(s). \qquad (22)$$

We want to show that $\varepsilon$-optimal aggregation with respect to $\mathcal P$ implies learning $\mathcal D$ within total variation distance $(1+B)^2\sqrt\varepsilon$, and hence $T_{\mathcal P}(\varepsilon,\delta)\ge T^{TV}_{\mathcal D}\big((1+B)^2\sqrt\varepsilon,\ \delta\big)$.

First, we have the following observations about $\mathcal D$ and $\mathcal P$:

Fact E.1. For a family of distributions $\mathcal D$ that satisfies the three properties in Definition 4.2:

1. Given signal $s_i$, expert $i$'s report is $r_i = P_D(\omega=1\mid s_i) = \frac{D(s_i)}{\frac1m+D(s_i)}$, which is the same for all distributions $D\in\mathcal D$.
2. For every expert $i$, given different signals $s_i\ne s'_i$, its reports satisfy $r_i\ne r'_i$. So, there is a one-to-one mapping between $s_i$ and $r_i$ for every $i\in\{1,\ldots,n\}$, and also a one-to-one mapping between the joint signal $s = (s_1,\ldots,s_n)$ and the joint report $r = (r_1,\ldots,r_n)$.
3. For any joint signal $s$, with corresponding joint report $r$, we have $D(s) = \frac{f^*(r)}{m^n(1-f^*(r))}$.

Proof. We prove the three items one by one:

1. By definition, the marginal distribution of the joint signal, $P_D(s)$, is

$$P_D(s) = P_D(\omega=0)P_D(s\mid\omega=0) + P_D(\omega=1)P_D(s\mid\omega=1) = \tfrac12\Big(\tfrac1{m^n}+D(s)\Big). \qquad (23)$$

Fixing $s_i$ and summing over $s_{-i} = (s_1,\ldots,s_{i-1},s_{i+1},\ldots,s_n)\in S_{-i}$, we get $P_D(s_i) = \sum_{s_{-i}\in S_{-i}}P_D(s) = \tfrac12\big(\tfrac1m+D(s_i)\big)$. So, given signal $s_i$, expert $i$ reports

$$r_i = P_D(\omega=1\mid s_i) = \frac{P_D(\omega=1)P_D(s_i\mid\omega=1)}{P_D(s_i)} = \frac{D(s_i)}{\frac1m+D(s_i)}, \qquad (24)$$

and this is the same for all $D\in\mathcal D$ since $D(s_i) = D'(s_i)$ by the "same marginal across distributions" property.

2. Given $s_i\ne s'_i$, by the "distinct marginals across signals" property, we have $D(s_i)\ne D(s'_i)$. Since $r_i = \frac{D(s_i)}{\frac1m+D(s_i)}$ and $x\mapsto\frac{x}{\frac1m+x}$ is a strictly increasing function of $x$, it follows that $r_i\ne r'_i$.

3. According to item 2, there is a one-to-one mapping between $s = (s_1,\ldots,s_n)$ and $r = (r_1,\ldots,r_n)$; in other words, observing signals $s_1,\ldots,s_n$ is equivalent to observing reports $r_1,\ldots,r_n$. Therefore, by Bayes' rule we have

$$f^*(r) = P_D(\omega=1\mid r) = P_D(\omega=1\mid s) = \frac{P_D(\omega=1)P_D(s\mid\omega=1)}{P_D(s)} \ \overset{\text{by (22), (23)}}{=}\ \frac{D(s)}{\frac1{m^n}+D(s)}. \qquad (25)$$

Rearranging, we obtain $D(s) = \frac{f^*(r)}{m^n(1-f^*(r))}$.
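A small numerical check of Fact E.1 for a tiny instance ($n = 2$ experts, $m = 3$ signals) follows; the random $D$ here is illustrative and need not satisfy Definition 4.2.

```python
import numpy as np
from itertools import product

n, m = 2, 3
rng = np.random.default_rng(1)
D = rng.dirichlet(np.ones(m ** n)).reshape(m, m)  # a distribution over joint signals

def report(i: int, s_i: int) -> float:
    marg = D.sum(axis=1 - i)[s_i]     # D(s_i), the marginal of expert i
    return marg / (1 / m + marg)      # Fact E.1, item 1

print([report(0, k) for k in range(m)])
for s in product(range(m), repeat=n):
    f_star = D[s] / (1 / m ** n + D[s])            # Equation (25)
    recovered = f_star / (m ** n * (1 - f_star))   # Fact E.1, item 3
    assert abs(recovered - D[s]) < 1e-12
```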
Claim E.2. If we have an aggregator $\hat f$ that is $\varepsilon$-optimal with respect to $P_D$, then we can find a distribution $\hat D$ such that $d_{TV}(\hat D, D)\le(1+B)^2\sqrt\varepsilon$.

Proof. Because $D$ is $B$-uniformly bounded, from (25) we can verify that $f^*(r)$ satisfies $f^*(r)\le\frac{B}{1+B}$. So, we can assume $\hat f(r)\le\frac{B}{1+B}$ as well (if $\hat f(r) > \frac{B}{1+B}$, we can set $\hat f(r)$ to $\frac{B}{1+B}$; this only reduces the approximation error $E\big[|\hat f(r)-f^*(r)|^2\big]$). Define $\hat D$ by letting $\hat D(s) = \frac{\hat f(r)}{m^n(1-\hat f(r))}$ for each $s\in S$, where $r$ is the report profile corresponding to $s$ (cf. Fact E.1). Then,

$$d_{TV}(\hat D, D) = \frac12\sum_{s\in S}\big|\hat D(s)-D(s)\big| = \frac12\sum_{s\in S}\frac{1}{m^n}\Big|\frac{\hat f(r)}{1-\hat f(r)}-\frac{f^*(r)}{1-f^*(r)}\Big|.$$

Because $\hat f(r), f^*(r)\le\frac{B}{1+B}$ and the function $\frac{x}{1-x}$ has derivative $\frac{1}{(1-x)^2}\le(1+B)^2$ when $x\le\frac{B}{1+B}$, we have

$$d_{TV}(\hat D, D) \ \le\ \frac{1}{2m^n}(1+B)^2\sum_{s\in S}\big|\hat f(r)-f^*(r)\big| \ \overset{\text{by (23)}}{\le}\ (1+B)^2\sum_{s\in S}P_D(s)\big|\hat f(r)-f^*(r)\big| = (1+B)^2\,E_{P_D}\big[|\hat f(r)-f^*(r)|\big].$$

By Jensen's inequality $(E[X])^2\le E[X^2]$ and by the assumption that $\hat f$ is $\varepsilon$-optimal, we have $\big(E_{P_D}|\hat f(r)-f^*(r)|\big)^2\le E_{P_D}|\hat f(r)-f^*(r)|^2\le\varepsilon$. Thus, $d_{TV}(\hat D, D)\le(1+B)^2\sqrt\varepsilon$, which proves the claim.

Now, we present the reduction from learning $\mathcal D$ in total variation distance to forecast aggregation for $\mathcal P$. We use $x^{(1)},\ldots,x^{(T)}\in S$ to denote the samples from $D$. From $x^{(1)},\ldots,x^{(T)}$ we construct the samples $(r^{(1)},\omega^{(1)}),\ldots,(r^{(T)},\omega^{(T)})$ for the forecast aggregation problem. After obtaining a solution $\hat f$ to the latter problem, we convert it into a solution $\hat D$ to the former. Details are as follows:

Input: $T$ i.i.d. samples $x^{(1)},\ldots,x^{(T)}$ from an unknown distribution $D\in\mathcal D$.

1. Draw $T$ samples $\omega^{(1)},\ldots,\omega^{(T)}\sim\mathrm{Uniform}\{0,1\}$.
2. For each $t = 1,\ldots,T$, do the following: if $\omega^{(t)} = 0$, draw $s^{(t)}\sim\mathrm{Uniform}(S)$; if $\omega^{(t)} = 1$, let $s^{(t)} = x^{(t)}$. For each $i$, compute $r^{(t)}_i = \frac{D(s^{(t)}_i)}{\frac1m+D(s^{(t)}_i)}$. Let $r^{(t)} = (r^{(t)}_1,\ldots,r^{(t)}_n)$.
3. Feed $S_T = \big\{(r^{(1)},\omega^{(1)}),\ldots,(r^{(T)},\omega^{(T)})\big\}$ to the forecast aggregation problem. Obtain solution $\hat f$.
4. Convert $\hat f$ into $\hat D$ according to Claim E.2.

Output: $\hat D$.

We remark that, in the second step of the reduction, the report $r^{(t)}_i$ can be computed even though $D$ is unknown, because the marginal $D(s^{(t)}_i)$ is the same for all $D\in\mathcal D$ and hence known.

Using the above reduction, we show that the sample complexity of $\varepsilon$-optimal forecast aggregation for $\mathcal P$ cannot be smaller than the sample complexity of learning $\mathcal D$ within total variation distance $(1+B)^2\sqrt\varepsilon$, which proves Lemma 4.3.

Proof of Lemma 4.3. First, we verify that the distribution of the samples $S_T$ in the above reduction is exactly the distribution of $T$ i.i.d. samples $\{(r^{(1)},\omega^{(1)}),\ldots,(r^{(T)},\omega^{(T)})\}$ from $P_D$. This is because: (1) the distribution of $\omega^{(t)}$ is $\mathrm{Uniform}\{0,1\}$, as defined in $P_D$; (2) given $\omega^{(t)} = 0$, the distribution of $s^{(t)}$ is $\mathrm{Uniform}(S)$, as defined in $P_D$; (3) given $\omega^{(t)} = 1$, the distribution of $s^{(t)}$ is the same as the distribution of $x^{(t)}$, which is $D$, because the random draws of $\omega^{(t)}$ and $x^{(t)}$ are independent; (4) according to Fact E.1, the report $r^{(t)}_i = \frac{D(s^{(t)}_i)}{\frac1m+D(s^{(t)}_i)} = P_D(\omega=1\mid s^{(t)}_i)$, as desired.

Then, by the definition of the sample complexity of forecast aggregation, if we are given $T = T_{\mathcal P}(\varepsilon,\delta)$ samples $S_T$ for the forecast aggregation problem, then with probability at least $1-\delta$ we can find an $\varepsilon$-optimal aggregator $\hat f$ with respect to $P_D$. According to Claim E.2, we can convert $\hat f$ into a $\hat D$ such that $d_{TV}(\hat D, D)\le(1+B)^2\sqrt\varepsilon$. By the definition of the sample complexity $T^{TV}_{\mathcal D}(\cdot,\delta)$ of distribution learning, $T$ must be at least $T\ge T^{TV}_{\mathcal D}\big((1+B)^2\sqrt\varepsilon,\ \delta\big)$, which proves the lemma.
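The sample-construction step of the reduction can be sketched in Python as below. This is a minimal sketch under our own naming; `marginal[i]` is the (known, shared) marginal $D(s_i)$ for expert $i$, which by Fact E.1 is identical for all $D$ in the family.

```python
import numpy as np
from itertools import product

def reduce_samples(xs, m, n, marginal, rng):
    """Turn T samples xs from an unknown D into T forecast-aggregation
    samples (r, omega) distributed as i.i.d. draws from P_D."""
    out = []
    joint = list(product(range(m), repeat=n))
    for x in xs:
        omega = int(rng.integers(2))                   # omega ~ Uniform{0,1}
        s = joint[rng.integers(len(joint))] if omega == 0 else x  # Uniform(S) vs D
        r = tuple(marginal[i][s[i]] / (1 / m + marginal[i][s[i]]) for i in range(n))
        out.append((r, omega))
    return out

rng = np.random.default_rng(0)
xs = [tuple(rng.integers(3, size=2)) for _ in range(5)]  # placeholder "samples from D"
marginal = [[1 / 3] * 3, [1 / 3] * 3]                    # uniform marginals, for illustration
print(reduce_samples(xs, m=3, n=2, marginal=marginal, rng=rng))
```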
E.2 Proof of Proposition 4.4

To prove Proposition 4.4 we will construct a family of distributions $\mathcal D$ that satisfies the three properties in Definition 4.2 and requires $T^{TV}_{\mathcal D}(\varepsilon_{TV},\delta) = \Omega\big(\frac{m^{n-2}+\log(1/\delta)}{\varepsilon_{TV}^2}\big)$ samples to learn. For simplicity, we write $\varepsilon = \varepsilon_{TV}$. For technical convenience, we assume $\varepsilon < \frac1{40}$ and $\delta < 0.01$.

E.2.1 Part 1: Constructing $\mathcal D$

We index the distributions $D_z\in\mathcal D$ by $z$; the meaning of $z$ will be defined later. Without loss of generality, we assume $|S_i| = m$ to be an even integer and denote $S_i = \{1,\ldots,m\}$. We will define $D_z$ to be a distribution over the joint signal space $S = S_1\times\cdots\times S_n = \{1,\ldots,m\}^n$. In the following we will call a joint signal $s\in\{1,\ldots,m\}^n$ simply a "signal".

We write a signal $s\in\{1,\ldots,m\}^n$ as $s = (b, x, y)$, where $b\in\{1,\ldots,m\}^{n-2}$ and $x,y\in\{1,\ldots,m\}$. We sort all the $m^n$ signals by the lexicographical order, from $(1,\ldots,1,1)$, $(1,\ldots,1,2)$, ..., to $(m,\ldots,m,m)$, and number the signals from 1 to $m^n$, using $\mathrm{num}(s) = \mathrm{num}(b,x,y)\in\{1,\ldots,m^n\}$ to denote their numbers. We divide the whole signal space into $m^{n-2}$ "buckets", each of size $m^2$, denoted by $B_b = \{(b,x,y) : x,y\in\{1,\ldots,m\}\}$ for $b\in\{1,\ldots,m\}^{n-2}$.

We first define a base distribution $D_{base}$, then construct the distributions $D_z$ by modifying the probabilities of the base distribution within each bucket $B_b$. Let $\gamma = 1+\frac{1}{m^n}$. The base distribution is defined as follows:

$$D_{base}(s) = \frac{\gamma^{\mathrm{num}(s)}}{W},\qquad W = \sum_{s}\gamma^{\mathrm{num}(s)} = \sum_{\ell=1}^{m^n}\gamma^\ell.$$

Because $1\le\gamma^\ell\le\gamma^{m^n} = \big(1+\frac{1}{m^n}\big)^{m^n}\le e$, we have

$$m^n\ \le\ W\ \le\ e\,m^n. \qquad (26)$$

We assign a sign $z_b\in\{+1,-1\}$ to each bucket $B_b$, and let $z$ be a vector of length $m^{n-2}$ that includes the signs of all buckets: $z = (z_b)_b$, $z_b\in\{+1,-1\}$. There are $2^{m^{n-2}}$ different $z$'s in total, and hence $2^{m^{n-2}}$ different distributions $D_z$ in $\mathcal D$. Let $c = 20$, so that $c\varepsilon\le\frac12$. For each $z$, we define $D_z$ as follows: within each bucket $B_b$, for each element $(b,x,y)\in B_b$, let

$$D_z(b,x,y) = \begin{cases} D_{base}(b,x,y) + z_b\frac{c\varepsilon}{W} & \text{if } x\le\frac m2 \text{ and } y\le\frac m2,\\ D_{base}(b,x,y) - z_b\frac{c\varepsilon}{W} & \text{if } x\le\frac m2 \text{ and } y>\frac m2,\\ D_{base}(b,x,y) - z_b\frac{c\varepsilon}{W} & \text{if } x>\frac m2 \text{ and } y\le\frac m2,\\ D_{base}(b,x,y) + z_b\frac{c\varepsilon}{W} & \text{if } x>\frac m2 \text{ and } y>\frac m2. \end{cases} \qquad (27)$$

Claim E.3. The family of distributions $\mathcal D = \{D_z\}_z$ defined above satisfies the three properties in Definition 4.2: $B$-uniformly bounded with $B = e+\frac12$, same marginal across distributions, distinct marginals across signals.

Proof. B-uniformly bounded: For any $s$ and any $D_z$, by definition,

$$D_z(s)\ \le\ D_{base}(s)+\frac{c\varepsilon}{W}\ \overset{(26)}{\le}\ \frac{e}{m^n}+\frac{1/2}{m^n} = \frac{e+1/2}{m^n}.$$

So the distribution is $B$-uniformly bounded with $B = e+\frac12$.

Same marginal across distributions: Consider each $D_z(s_i)$. We want to show that $D_z(s_i)$ does not depend on $z$, and in fact $D_z(s_i) = D_{base}(s_i)$. If $i\in\{1,\ldots,n-2\}$, namely $s_i$ is a component of the vector $b$, then we have

$$D_z(s_i) = \sum_{s_{-i}\in S_{-i}}D_z(s_i,s_{-i}) = \sum_{b:\ b_i=s_i}\ \sum_{x=1}^m\sum_{y=1}^m D_z(b,x,y).$$

We notice that, fixing any $b$, the numbers of $+z_b\frac{c\varepsilon}{W}$ and $-z_b\frac{c\varepsilon}{W}$ terms in the summation $\sum_{x=1}^m\sum_{y=1}^m D_z(b,x,y)$ are the same. So they cancel out, and we obtain

$$D_z(s_i) = \sum_{b:\ b_i=s_i}\sum_{x=1}^m\sum_{y=1}^m D_{base}(b,x,y) = D_{base}(s_i).$$

If $i = n-1$, namely $s_i = x$, then we have $D_z(s_i) = \sum_b\sum_{y=1}^m D_z(b,x,y)$. Fixing any $b$, the numbers of $+z_b\frac{c\varepsilon}{W}$ and $-z_b\frac{c\varepsilon}{W}$ terms in the summation $\sum_{y=1}^m D_z(b,x,y)$ are the same. So they cancel out, and we obtain $D_z(s_i) = \sum_b\sum_{y=1}^m D_{base}(b,x,y) = D_{base}(s_i)$. Finally, if $i = n$, namely $s_i = y$, then by a similar argument we have $D_z(s_i) = \sum_b\sum_{x=1}^m D_{base}(b,x,y) = D_{base}(s_i)$.

Distinct marginals across signals: By the "same marginal across distributions" property above we have $D_z(s_i) = D_{base}(s_i)$. So, to prove distinct marginals across signals we only need to prove $D_{base}(s_i)\ne D_{base}(s'_i)$ for $s_i\ne s'_i$. Without loss of generality assume $s_i < s'_i$. By the definition $D_{base}(s) = \frac{\gamma^{\mathrm{num}(s)}}{W}$ and the fact that $\mathrm{num}(s_i,s_{-i}) < \mathrm{num}(s'_i,s_{-i})$ for any $s_{-i}\in S_{-i}$, we have

$$D_{base}(s_i) = \sum_{s_{-i}\in S_{-i}}D_{base}(s_i,s_{-i}) < \sum_{s_{-i}\in S_{-i}}D_{base}(s'_i,s_{-i}) = D_{base}(s'_i),$$

so $D_{base}(s_i)\ne D_{base}(s'_i)$.
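The construction of $D_z$ can be sketched in Python for small $n, m$; note that the exact quadrant sign pattern in (27) was partially garbled in extraction, so the $+,-,-,+$ arrangement below is one valid choice that preserves all single-coordinate marginals, and the indexing conventions are ours.

```python
import numpy as np

def build_Dz(n, m, eps, z, c=20):
    """D_base weights signal number by gamma^num(s); D_z perturbs each m^2
    bucket by +/- c*eps/W in a pattern whose rows and columns sum to zero."""
    N = m ** n
    gamma = 1 + 1 / N
    weights = gamma ** np.arange(1, N + 1)        # gamma^num(s), lexicographic order
    W = weights.sum()
    D = (weights / W).reshape([m] * n)            # D_base
    half = m // 2
    sign = np.zeros((m, m))
    sign[:half, :half] = sign[half:, half:] = 1.0   # the +,-,-,+ quadrants
    sign[:half, half:] = sign[half:, :half] = -1.0
    for b, zb in z.items():                       # z maps each bucket index to +/-1
        D[b] += zb * c * eps / W * sign
    return D

D = build_Dz(n=3, m=4, eps=1e-3, z={(0,): +1, (1,): -1, (2,): +1, (3,): -1})
assert abs(D.sum() - 1) < 1e-12   # still a distribution; marginals match D_base
```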
E.2.2 Part 2: Sample Complexity Lower Bound of Learning $\mathcal D$

Overview. We want to prove that the sample complexity of learning the family of distributions $\mathcal D = \{D_z\}_z$ defined above is at least $T^{TV}_{\mathcal D}(\varepsilon,\delta) = \Omega\big(\frac{m^{n-2}+\log(1/\delta)}{\varepsilon^2}\big)$. This proof is analogous to a textbook proof of Proposition 3.3 (the sample complexity of learning all distributions), which uses reductions from the distinguishing-distributions problem. Roughly speaking, if one can learn the unknown distribution $D_z$ well, then one must be able to guess most of the components of the sign vector $z = (z_b)_b$ correctly, meaning that one can distinguish whether the distribution on bucket $B_b$ has $z_b = +1$ or $z_b = -1$. However, since the two alternatives are $O(\varepsilon)$-close to each other, distinguishing them requires $\Omega(\frac{1}{\varepsilon^2})$ samples. On average, only $O(\frac{T}{m^{n-2}})$ samples fall into a given bucket (because there are $m^{n-2}$ buckets in total and the distribution $D_z$ is close to uniform). We thus need $O(\frac{T}{m^{n-2}}) = \Omega(\frac{1}{\varepsilon^2})$, which gives $T = \Omega(\frac{m^{n-2}}{\varepsilon^2})$. Ignoring logarithmic terms, this proves the proposition.

Formal argument. First, we note that if we can learn $D_z$ very well, then we can guess the vector $z$ correctly on a large fraction of its $m^{n-2}$ components. Formally, suppose we obtain from $T$ samples a distribution $\hat D$ such that $d_{TV}(\hat D, D_z)\le\varepsilon$. We find the distribution $D_w$ in $\mathcal D$, $w = (w_b)_b\in\{+1,-1\}^{m^{n-2}}$, that is closest to $\hat D$ in total variation distance. By definition, we have $d_{TV}(D_w,\hat D)\le d_{TV}(D_z,\hat D)\le\varepsilon$. Hence, by the triangle inequality,

$$d_{TV}(D_w, D_z)\ \le\ d_{TV}(D_w,\hat D) + d_{TV}(\hat D, D_z)\ \le\ 2\varepsilon.$$

Let $\# = \big|\{b \mid w_b\ne z_b\}\big|$ be the number of differing components of $w$ and $z$. We claim:

Claim E.4. $d_{TV}(D_w, D_z)\le2\varepsilon$ implies $\#\le\frac{2W}{cm^2}$.

Proof. Whenever we have a differing component $w_b\ne z_b$, this component contributes the following amount to the total variation distance between $D_w$ and $D_z$:

$$\frac12\sum_{x,y}\big|D_w(b,x,y)-D_z(b,x,y)\big| = \frac12\cdot m^2\cdot\frac{2c\varepsilon}{W} = \frac{m^2c\varepsilon}{W}. \qquad (28)$$

So, the number of differing components of $w$ and $z$ is at most $\#\le\frac{d_{TV}(D_w,D_z)}{m^2c\varepsilon/W}\le\frac{2\varepsilon W}{m^2c\varepsilon} = \frac{2W}{cm^2}$.

We first show the $\Omega\big(\frac{\log(1/\delta)}{\varepsilon^2}\big)$ part of the sample complexity lower bound, and then the $\Omega\big(\frac{m^{n-2}}{\varepsilon^2}\big)$ part. Together, they give the lower bound $\max\big\{\Omega\big(\frac{\log(1/\delta)}{\varepsilon^2}\big),\ \Omega\big(\frac{m^{n-2}}{\varepsilon^2}\big)\big\} = \Omega\big(\frac{m^{n-2}+\log(1/\delta)}{\varepsilon^2}\big)$.

Consider the distribution $D_{+1}\in\mathcal D$ whose index is the all-$(+1)$ vector and the distribution $D_{-1}\in\mathcal D$ whose index is the all-$(-1)$ vector. According to (28), the total variation distance between $D_{+1}$ and $D_{-1}$ is $d_{TV}(D_{+1},D_{-1}) = m^{n-2}\cdot\frac{m^2c\varepsilon}{W}$, because $+1$ and $-1$ differ in all $m^{n-2}$ components. Since $W\le e\,m^n$ by (26), we have $d_{TV}(D_{+1},D_{-1})\ge\frac{c\varepsilon}{e} > 2\varepsilon$. Consider the distinguishing-distributions problem (defined in Section 3) where we want to distinguish $D_{+1}$ and $D_{-1}$. If we can learn from samples a distribution $\hat D$ that is $\varepsilon$-close in total variation distance to the unknown distribution $D_{+1}$ or $D_{-1}$, then we can perfectly tell whether the unknown distribution is $D_{+1}$ or $D_{-1}$, because the two distributions are more than $2\varepsilon$ away from each other in total variation distance. Lemma A.5 implies that, to distinguish $D_{+1}$ and $D_{-1}$ with probability $1-\delta$, the number of samples must be at least $\Omega\big(\frac{\log(1/\delta)}{d^2_H(D_{+1},D_{-1})}\big) = \Omega\big(\frac{\log(1/\delta)}{\varepsilon^2}\big)$. This proves the $\Omega\big(\frac{\log(1/\delta)}{\varepsilon^2}\big)$ part.

We then prove the $\Omega\big(\frac{m^{n-2}}{\varepsilon^2}\big)$ part. Suppose we first draw the vector $z$ from $\{+1,-1\}^{m^{n-2}}$ uniformly at random, then draw $T$ samples from $D_z$. We obtain $D_w$ as above. Consider the expected number of differing components of $w$ and $z$ in this two-step random-draw procedure:

$$E_{z,\ T\text{ samples}}[\#] = E\Big[\sum_b\mathbb 1\{z_b\ne w_b\}\Big] = \sum_b E\big[\mathbb 1\{z_b\ne w_b\}\big]. \qquad (29)$$

We consider each term $E[\mathbb 1\{z_b\ne w_b\}]$ in the above summation. Suppose that, among the $T$ samples drawn from $D_z$, $T_b$ of them fall into the bucket $B_b$. Then $T_b$ follows the $\mathrm{Binomial}(T, D(B_b))$ distribution with

$$D(B_b) = \sum_{(b,x,y)\in B_b}D_z(b,x,y) = \frac1W\sum_{(b,x,y)\in B_b}\gamma^{\mathrm{num}(b,x,y)}.$$

(Notice that the $\pm z_b\frac{c\varepsilon}{W}$ terms cancel out in the summation, and hence $D(B_b)$ does not depend on $z$.) Let $D_{z_b}$ denote the $B_b$-part of the distribution $D_z$, namely $D_z$ conditioned on $B_b$:

$$D_{z_b}(s) = \frac{D_z(s)}{D(B_b)},\qquad s\in B_b.$$
We think of the random draw of the vector $z$ and the $T$ samples as follows: first, we draw $T_b$ from $\mathrm{Binomial}(T, D(B_b))$; second, we draw $z_b\in\{+1,-1\}$ uniformly at random; third, we draw $T_b$ samples from the conditional distribution $D_{z_b}$; fourth, we draw the remaining vector $z_{-b}$ and the remaining $T-T_b$ samples (which are the samples outside of $B_b$). Writing only the first two steps explicitly, we have

$$E\big[\mathbb 1\{z_b\ne w_b\}\big] = E_{T_b}\Big[E\big[\mathbb 1\{z_b\ne w_b\}\mid z_b,T_b\big]\Big] = E_{T_b}\Big[\tfrac12\Pr\big(z_b\ne w_b\mid z_b=+1,T_b\big)+\tfrac12\Pr\big(z_b\ne w_b\mid z_b=-1,T_b\big)\Big]. \qquad (30)$$

Claim E.5. For any $T_b$, $\tfrac12\Pr(z_b\ne w_b\mid z_b=+1,T_b)+\tfrac12\Pr(z_b\ne w_b\mid z_b=-1,T_b)\ \ge\ \tfrac12 - 2c\varepsilon\sqrt{T_b}$.

Proof. We notice that $\tfrac12\Pr(z_b\ne w_b\mid z_b=+1,T_b)+\tfrac12\Pr(z_b\ne w_b\mid z_b=-1,T_b)$ is the probability that we make a mistake when guessing the sign $z_b$ using $w_b$, if (1) $z_b$ is chosen from $\{-1,+1\}$ uniformly at random; (2) we are given $T_b$ samples from $D_{z_b}$; (3) we then draw the remaining vector $z_{-b}$ and the remaining samples; (4) finally, we use the $D_w$ computed from all samples to get $w_b$. Steps (3) and (4) define a randomized function that maps the $T_b$ samples of $D_{z_b}$ to $w_b\in\{-1,+1\}$, and therefore, according to the first item of Lemma A.5, we have

$$\tfrac12\Pr\big(z_b\ne w_b\mid z_b=+1,T_b\big)+\tfrac12\Pr\big(z_b\ne w_b\mid z_b=-1,T_b\big)\ \ge\ \tfrac12 - \sqrt{\tfrac{T_b}{2}}\ d_H\big(D_{z_b=+1},D_{z_b=-1}\big). \qquad (31)$$

Then we bound $d_H(D_{z_b=+1},D_{z_b=-1})$, using Lemma A.2. For any $s\in B_b$, we have, on the one hand,

$$\frac{D_{z_b=+1}(s)}{D_{z_b=-1}(s)}\ \ge\ \frac{\gamma^{\mathrm{num}(s)}-c\varepsilon}{\gamma^{\mathrm{num}(s)}+c\varepsilon}\ \ge\ 1-2c\varepsilon,$$

because $\frac{a-b}{a+b} = 1-\frac{2b}{a+b}\ge1-2b$ for $a = \gamma^{\mathrm{num}(s)}\ge1$ and $b = c\varepsilon$. On the other hand,

$$\frac{D_{z_b=+1}(s)}{D_{z_b=-1}(s)}\ \le\ \frac{\gamma^{\mathrm{num}(s)}+c\varepsilon}{\gamma^{\mathrm{num}(s)}-c\varepsilon}\ \le\ 1+4c\varepsilon,$$

because $\frac{a+b}{a-b} = 1+\frac{2b}{a-b}\le1+4b$ for $a = \gamma^{\mathrm{num}(s)}\ge1$ and $b = c\varepsilon\le\frac12$. Therefore, by Lemma A.2 we have

$$d^2_H\big(D_{z_b=+1},D_{z_b=-1}\big)\ \le\ \tfrac12(4c\varepsilon)^2 = 8c^2\varepsilon^2. \qquad (32)$$

Combining (31) and (32) proves our claim.

By (30) and Claim E.5, we have $E\big[\mathbb 1\{z_b\ne w_b\}\big]\ge E_{T_b}\big[\frac12-2c\varepsilon\sqrt{T_b}\big]$. Summing over $b$, (29) becomes

$$E[\#]\ \ge\ \frac{m^{n-2}}{2} - 2c\varepsilon\sum_b E\big[\sqrt{T_b}\big]\ \ge\ \frac{m^{n-2}}{2} - 2c\varepsilon\sum_b\sqrt{E[T_b]}$$

(by Jensen's inequality, $E[\sqrt X]\le\sqrt{E[X]}$). Because $E[T_b] = T\,D(B_b) = \frac TW\sum_{(b,x,y)\in B_b}\gamma^{\mathrm{num}(b,x,y)}\le\frac TW m^2\gamma^{m^n}$, we have

$$E[\#]\ \ge\ \frac{m^{n-2}}{2} - 2c\varepsilon\,m^{n-2}\sqrt{\frac{T\,m^2\gamma^{m^n}}{W}}.$$

Now, consider the probability with which we can obtain $\hat D$ such that $d_{TV}(\hat D, D_z)\le\varepsilon$. We will show that this probability is at most $0.99 < 1-\delta$ if $T$ is less than $10^{-5}\frac{m^{n-2}}{\varepsilon^2}$. Recall from Claim E.4 that $d_{TV}(\hat D, D_z)\le\varepsilon$ implies $\# = \sum_b\mathbb 1\{z_b\ne w_b\}\le\frac{2W}{cm^2}$. So, the probability is at most

$$\Pr\Big[\#\le\frac{2W}{cm^2}\Big] = \Pr\Big[\sum_b\mathbb 1\{z_b=w_b\}\ge m^{n-2}-\frac{2W}{cm^2}\Big]\ \le\ \frac{E\big[\sum_b\mathbb 1\{z_b=w_b\}\big]}{m^{n-2}-\frac{2W}{cm^2}}\ \le\ \frac{\frac12+2c\varepsilon\sqrt{\frac{eT}{m^{n-2}}}}{1-\frac{2e}{c}}\ <\ 0.99\ <\ 1-\delta,$$

where we used Markov's inequality, $\gamma^{m^n}\le e$, and $m^n\le W\le e\,m^n$ by (26), together with $c = 20$, $T < 10^{-5}\frac{m^{n-2}}{\varepsilon^2}$, and $\delta < 0.01$. This means that, in order to obtain such a $\hat D$ with probability at least $1-\delta$, at least $10^{-5}\frac{m^{n-2}}{\varepsilon^2}$ samples are needed.

F Missing Proofs from Section 5

F.1 Proof of Theorem 5.2: the $\Omega\big(\frac1\varepsilon\log\frac1\delta\big)$ Lower Bound

The proof uses a reduction from the distinguishing-distributions problem (defined in Section 3). We construct two conditionally independent distributions $P^1, P^2$ over the space $\Omega\times S_1\times\cdots\times S_n$ with each $|S_i| = 2$, $S_i = \{a,b\}$. Given $T$ samples from either $P^1$ or $P^2$, we want to tell which distribution the samples come from. We will show that, if we can solve the forecast aggregation problem, then we can distinguish the two distributions (with high probability), which requires $T = \Omega\big(\frac{1}{d^2_H(P^1,P^2)}\log\frac1\delta\big)$ samples according to Lemma A.5. Let $c = 32$. We assume $\varepsilon < 2^{-18}$, so that $c\sqrt\varepsilon\le\frac1{16}$.
For $P^1$, we let $P^1(\omega=1) = p_1 = 0.5-\frac{1}{16n}+\frac{c\sqrt\varepsilon}{n}$. For $P^2$, we let $P^2(\omega=1) = p_2 = 0.5-\frac{1}{16n}-\frac{c\sqrt\varepsilon}{n}$.

We require that, in the forecast aggregation problem under both distributions $P^1$ and $P^2$, whenever expert $i$ sees signal $a$ or $b$, she reports $r_a = 0.5$ or $r_b = 0$, respectively. The report $r_b = 0$ forces $P(b\mid\omega=1) = 0$ and hence $P(a\mid\omega=1) = 1$, and the report $r_a = 0.5$ forces $P(a\mid\omega=0) = \frac{p}{1-p}$. This gives the following conditional probabilities $P^1(\cdot\mid\omega)$, $P^2(\cdot\mid\omega)$:

$$\big(P^1(a\mid\omega=0),\ P^1(a\mid\omega=1),\ P^1(b\mid\omega=0),\ P^1(b\mid\omega=1)\big) = \Big(1-\tfrac{\frac1{4n}-\frac{4c\sqrt\varepsilon}{n}}{1+\frac1{8n}-\frac{2c\sqrt\varepsilon}{n}},\ 1,\ \tfrac{\frac1{4n}-\frac{4c\sqrt\varepsilon}{n}}{1+\frac1{8n}-\frac{2c\sqrt\varepsilon}{n}},\ 0\Big), \qquad (34)$$

$$\big(P^2(a\mid\omega=0),\ P^2(a\mid\omega=1),\ P^2(b\mid\omega=0),\ P^2(b\mid\omega=1)\big) = \Big(1-\tfrac{\frac1{4n}+\frac{4c\sqrt\varepsilon}{n}}{1+\frac1{8n}+\frac{2c\sqrt\varepsilon}{n}},\ 1,\ \tfrac{\frac1{4n}+\frac{4c\sqrt\varepsilon}{n}}{1+\frac1{8n}+\frac{2c\sqrt\varepsilon}{n}},\ 0\Big). \qquad (35)$$

Given $T$ samples from the unknown distribution $P\in\{P^1,P^2\}$, each of which is a tuple of $\omega^{(t)}$ and all experts' signals $s^{(t)}_i\in\{a,b\}$, we feed the corresponding reports $r^{(t)}_i\in\{r_a,r_b\}$ and $\omega^{(t)}$ to the forecast aggregation problem and obtain a solution $\hat f$, which is an $\varepsilon$-optimal aggregator. We want to use $\hat f$ to estimate the prior $p = P(\omega=1)\in\{p_1,p_2\}$ so that we can tell apart $P^1$ and $P^2$.

Recall from Lemma 5.1 that $f^*(r) = \frac{1}{1+\rho^{n-1}\prod_{i=1}^n\frac{1-r_i}{r_i}}$, where $\rho = \frac{p}{1-p}$. Writing $\rho$ in terms of $f^*(r)$, we have

$$\rho = \Big(\Big(\frac{1}{f^*(r)}-1\Big)\prod_{i=1}^n\frac{r_i}{1-r_i}\Big)^{\frac{1}{n-1}}.$$

In particular, when $r_i = r_a = 0.5$ for all $i\in\{1,\ldots,n\}$, we have $\rho = \big(\frac{1}{f^*(r_{0.5})}-1\big)^{\frac1{n-1}}$, where $r_{0.5} = (0.5,\ldots,0.5)$. So, we estimate $\rho$ by $\hat\rho = \big(\frac{1}{\hat f(r_{0.5})}-1\big)^{\frac1{n-1}}$.

Now, we want to argue that, if $\hat f$ is $\varepsilon$-optimal, then $|\hat\rho-\rho|$ is at most $O(\frac{\sqrt\varepsilon}{n})$. Consider the function $h(x) = \big(\frac1x-1\big)^{\frac1{n-1}}$, whose derivative is $h'(x) = \frac{1}{n-1}\big(\frac1x-1\big)^{\frac1{n-1}-1}\cdot\big(-\frac1{x^2}\big)$. By definition, we have

$$\hat\rho-\rho = h\big(\hat f(r_{0.5})\big)-h\big(f^*(r_{0.5})\big). \qquad (36)$$

Claim F.1. $\frac12\le f^*(r_{0.5})\le\frac23$.

Proof. For $P\in\{P^1,P^2\}$, its $\rho = \frac{p}{1-p}$ satisfies

$$1\ \ge\ \rho\ \ge\ \rho^{n-1}\ \ge\ \rho^n\ \ge\ \Big(\frac{0.5-\frac1{16n}-\frac{c\sqrt\varepsilon}{n}}{0.5+\frac1{16n}+\frac{c\sqrt\varepsilon}{n}}\Big)^n = \Big(\frac{1-\frac1{8n}-\frac{2c\sqrt\varepsilon}{n}}{1+\frac1{8n}+\frac{2c\sqrt\varepsilon}{n}}\Big)^n\ \ge\ \Big(1-\frac{\frac14+4c\sqrt\varepsilon}{n}\Big)^n\ \ge\ 1-\frac14-4c\sqrt\varepsilon\ \ge\ \frac12, \qquad (37)$$

where in the last three transitions we used the inequalities $\frac{1-x}{1+x} > 1-2x$ and $(1-x/n)^n\ge1-x$ for $x\in(0,1)$, and the fact that $c\sqrt\varepsilon\le\frac1{16}$. So,

$$f^*(r_{0.5}) = \frac{1}{1+\rho^{n-1}}\in\Big[\frac{1}{1+1},\ \frac{1}{1+\frac12}\Big] = \Big[\frac12,\ \frac23\Big],$$

which proves the claim.

With Claim F.1, we can without loss of generality assume $\frac12\le\hat f(r_{0.5})\le\frac23$ as well (otherwise, we can truncate $\hat f(r_{0.5})$ to this range; this only reduces the approximation error $E\big[|\hat f(r)-f^*(r)|^2\big]$).

Claim F.2. For $\frac12\le x\le\frac23$, $|h'(x)|\le\frac{8}{n-1}$.

Proof. $|h'(x)| = \frac{1}{n-1}\big(\frac{1-x}{x}\big)^{\frac1{n-1}-1}\frac{1}{x^2}$. For $\frac12\le x\le\frac23$ we have $\frac12\le\frac{1-x}{x}\le1$, so $\big(\frac{1-x}{x}\big)^{\frac1{n-1}-1}\le2^{1-\frac1{n-1}}\le2$ and $\frac1{x^2}\le\frac{1}{(1/2)^2} = 4$. Hence $|h'(x)|\le\frac{8}{n-1}$.

Claim F.3. If $\hat f$ is $\varepsilon$-optimal, then $|\hat f(r_{0.5})-f^*(r_{0.5})| < 2\sqrt\varepsilon$.

Proof. If $\hat f$ is $\varepsilon$-optimal, i.e., $E\big[|\hat f(r)-f^*(r)|^2\big]\le\varepsilon$, then, by Jensen's inequality $E[X]^2\le E[X^2]$, we have

$$\sqrt\varepsilon\ \ge\ E\big[|\hat f(r)-f^*(r)|\big] = \sum_rP(r)\,\big|\hat f(r)-f^*(r)\big|\ \ge\ P(r_{0.5})\,\big|\hat f(r_{0.5})-f^*(r_{0.5})\big|. \qquad (38)$$

For both $P\in\{P^1,P^2\}$, we have

$$P(r_{0.5}) = p\,P(r_{0.5}\mid\omega=1)+(1-p)\,P(r_{0.5}\mid\omega=0) = p\,P(a\mid\omega=1)^n+(1-p)\,P(a\mid\omega=0)^n\ \ge\ p\cdot1+(1-p)\Big(\frac{1-\frac1{8n}-\frac{2c\sqrt\varepsilon}{n}}{1+\frac1{8n}+\frac{2c\sqrt\varepsilon}{n}}\Big)^n\ \overset{\text{by (37)}}{>}\ \frac12.$$

Plugging $P(r_{0.5}) > \frac12$ into (38), we get $|\hat f(r_{0.5})-f^*(r_{0.5})| < 2\sqrt\varepsilon$.

From (36), Claim F.1, Claim F.2, and Claim F.3, we get

$$|\hat\rho-\rho| = \big|h\big(\hat f(r_{0.5})\big)-h\big(f^*(r_{0.5})\big)\big|\ \le\ \frac{8}{n-1}\big|\hat f(r_{0.5})-f^*(r_{0.5})\big|\ <\ \frac{8}{n-1}\cdot2\sqrt\varepsilon = \frac{16\sqrt\varepsilon}{n-1}. \qquad (39)$$

Since $p = \frac{\rho}{1+\rho}$ as a function of $\rho$ has a bounded derivative $\frac{\partial p}{\partial\rho} = \frac{1}{(1+\rho)^2}\le1$, Equation (39) implies $|\hat p-p| < \frac{16\sqrt\varepsilon}{n-1}$ if we use $\hat p = \frac{\hat\rho}{1+\hat\rho}$ as an estimate of $p$. This allows us to tell apart $P^1$ and $P^2$ because the difference between $p_1$ and $p_2$ is greater than twice our estimation error $|\hat p-p|$:

$$|p_1-p_2| = \frac{2c\sqrt\varepsilon}{n}\ \ge\ \frac{64\sqrt\varepsilon}{2(n-1)} = \frac{32\sqrt\varepsilon}{n-1}\ >\ 2\,|\hat p-p|.$$

Therefore, we can tell apart $P^1$ and $P^2$ by checking whether $p_1$ or $p_2$ is closer to $\hat p$.

Finally, we upper bound the squared Hellinger distance between $P^1$ and $P^2$. This will give the sample complexity lower bound we want.

Claim F.4. $d^2_H(P^1,P^2)\le O(c^2\varepsilon)$.

Proof. For the marginal distributions of $\omega$, $P^1_\omega$ and $P^2_\omega$, according to Lemma A.2 and the fact that $\frac{P^2_\omega(\omega)}{P^1_\omega(\omega)} = 1\pm O\big(\frac{c\sqrt\varepsilon}{n}\big)$, we have

$$d^2_H\big(P^1_\omega,P^2_\omega\big)\ \le\ O\Big(\Big(\frac{c\sqrt\varepsilon}{n}\Big)^2\Big) = O\Big(\frac{c^2\varepsilon}{n^2}\Big). \qquad (40)$$

Given $\omega = 0$ or $1$, we consider the conditional distributions of each $s_i$, $P^1_{s_i|\omega}$ and $P^2_{s_i|\omega}$.
For $s_i = a$, the ratio $\frac{P^2(a\mid\omega)}{P^1(a\mid\omega)}$ equals 1 for $\omega = 1$, and for $\omega = 0$ it satisfies

$$1\ \ge\ \frac{P^2(a\mid\omega=0)}{P^1(a\mid\omega=0)}\ \ge\ \frac{1-\frac1{8n}-\frac{2c\sqrt\varepsilon}{n}}{1+\frac1{8n}+\frac{2c\sqrt\varepsilon}{n}}\cdot\frac{1+\frac1{8n}-\frac{2c\sqrt\varepsilon}{n}}{1-\frac1{8n}+\frac{2c\sqrt\varepsilon}{n}} = 1-O\Big(\frac{c\sqrt\varepsilon}{n}\Big).$$

For $s_i = b$, both conditional probabilities are 0 for $\omega = 1$, and for $\omega = 0$ we have

$$1\ \ge\ \frac{P^1(b\mid\omega=0)}{P^2(b\mid\omega=0)} = \frac{\frac1{4n}-\frac{4c\sqrt\varepsilon}{n}}{\frac1{4n}+\frac{4c\sqrt\varepsilon}{n}}\cdot\frac{1+\frac1{8n}+\frac{2c\sqrt\varepsilon}{n}}{1+\frac1{8n}-\frac{2c\sqrt\varepsilon}{n}}\ \ge\ \frac{1-16c\sqrt\varepsilon}{1+16c\sqrt\varepsilon} = 1-O(c\sqrt\varepsilon).$$

So, $d^2_H\big(P^1_{s_i|\omega},P^2_{s_i|\omega}\big)$ can be upper bounded as follows:

$$d^2_H\big(P^1_{s_i|\omega},P^2_{s_i|\omega}\big) = \Big(\sqrt{P^1(a\mid\omega)}-\sqrt{P^2(a\mid\omega)}\Big)^2+\Big(\sqrt{P^1(b\mid\omega)}-\sqrt{P^2(b\mid\omega)}\Big)^2 = P^1(a\mid\omega)\Big(1-\sqrt{\tfrac{P^2(a\mid\omega)}{P^1(a\mid\omega)}}\Big)^2+P^2(b\mid\omega)\Big(1-\sqrt{\tfrac{P^1(b\mid\omega)}{P^2(b\mid\omega)}}\Big)^2\ \le\ O\Big(\frac{c^2\varepsilon}{n^2}\Big)+O\Big(\frac1n\Big)\cdot O(c^2\varepsilon) = O\Big(\frac{c^2\varepsilon}{n}\Big), \qquad (41)$$

using $P^2(b\mid\omega)\le O(\frac1n)$. Since $P^1 = P^1_\omega\prod_{i=1}^nP^1_{s_i|\omega}$ and $P^2 = P^2_\omega\prod_{i=1}^nP^2_{s_i|\omega}$, we have, by Lemma A.3 and Lemma A.4,

$$d^2_H(P^1,P^2)\ \le\ d^2_H\big(P^1_\omega,P^2_\omega\big)+\max_{\omega\in\{0,1\}}d^2_H\Big(\prod_{i=1}^nP^1_{s_i|\omega},\ \prod_{i=1}^nP^2_{s_i|\omega}\Big)\ \le\ d^2_H\big(P^1_\omega,P^2_\omega\big)+\max_{\omega\in\{0,1\}}\Big\{n\,d^2_H\big(P^1_{s_i|\omega},P^2_{s_i|\omega}\big)\Big\}\ \overset{\text{(40), (41)}}{\le}\ O\Big(\frac{c^2\varepsilon}{n^2}\Big)+n\cdot O\Big(\frac{c^2\varepsilon}{n}\Big) = O(c^2\varepsilon).$$

Therefore, according to Lemma A.5, to tell apart $P^1$ and $P^2$ with probability at least $1-\delta$ we need at least

$$T = \Omega\Big(\frac{1}{d^2_H(P^1,P^2)}\log\frac1\delta\Big) = \Omega\Big(\frac1\varepsilon\log\frac1\delta\Big)$$

samples. This concludes the proof.

G Missing Proofs from Section B

G.1 Proof of Theorem B.2

G.1.1 Additional Notations and Lemmas

We introduce some additional notation and lemmas for the proof. Let $\mu_0$ be the expected average report of all experts conditioning on $\omega = 0$:

$$\mu_0 = \frac1n\sum_{i=1}^nE[r_i\mid\omega=0] = \frac1n\sum_{i=1}^nE_{s_i|\omega=0}\big[P(\omega=1\mid s_i)\mid\omega=0\big], \qquad (42)$$

which is the expected prediction of $\omega$ given expert $i$'s signal $s_i$, where $s_i$ is distributed conditioning on $\omega = 0$, averaged over all experts. Symmetrically, let

$$\mu_1 = \frac1n\sum_{i=1}^nE[1-r_i\mid\omega=1] = \frac1n\sum_{i=1}^nE_{s_i|\omega=1}\big[P(\omega=0\mid s_i)\mid\omega=1\big]. \qquad (43)$$

Recall that $p = P(\omega=1)$.

Fact G.1. $(1-p)\mu_0 = p\mu_1$.

Proof. For each expert $i$, by the law of total expectation and the fact that $r_i = P(\omega=1\mid s_i)$, we have:

$$(1-p)\,E[r_i\mid\omega=0]+p\,E[r_i\mid\omega=1] = P(\omega=0)E[r_i\mid\omega=0]+P(\omega=1)E[r_i\mid\omega=1] = E[r_i] = E_{s_i}\big[P(\omega=1\mid s_i)\big] = P(\omega=1) = p.$$

Subtracting $p$ from both sides, we get $(1-p)\,E[r_i\mid\omega=0]-p\,E[1-r_i\mid\omega=1] = 0$. Averaging over all experts $i\in\{1,\ldots,n\}$, we conclude that

$$(1-p)\,\frac1n\sum_{i=1}^nE[r_i\mid\omega=0]-p\,\frac1n\sum_{i=1}^nE[1-r_i\mid\omega=1] = 0.$$

Lemma G.2. If $(1-p)\mu_0 = p\mu_1 < \frac\varepsilon2$, then the averaging aggregator $f_{avg}(r) = \frac1n\sum_{i=1}^nr_i$ is $\varepsilon$-optimal.

Proof. If $(1-p)\mu_0 = p\mu_1 < \frac\varepsilon2$, then the expected loss of $f_{avg}$ is at most

$$L_P(f_{avg}) = E\big[|f_{avg}(r)-\omega|^2\big] = p\,E\big[(f_{avg}(r)-1)^2\mid\omega=1\big]+(1-p)\,E\big[(f_{avg}(r)-0)^2\mid\omega=0\big]\ \le\ p\,E\big[1-f_{avg}(r)\mid\omega=1\big]+(1-p)\,E\big[f_{avg}(r)\mid\omega=0\big] = p\mu_1+(1-p)\mu_0\ <\ \varepsilon,$$

which implies that $f_{avg}$ is $\varepsilon$-optimal.

The following lemma says that $O(\frac1\varepsilon\log\frac1\delta)$ samples are sufficient to tell whether the mean of a random variable is below $\varepsilon$ or above $\frac\varepsilon2$:

Lemma G.3. Given $T = \frac{40}{\varepsilon}\log\frac2\delta$ i.i.d. samples $X^{(1)},\ldots,X^{(T)}$ of a random variable $X\in[0,1]$ with unknown mean $E[X] = \mu$, with probability at least $1-\delta$ we can tell whether $\mu < \varepsilon$ or $\mu\ge\frac\varepsilon2$. This can be done by checking whether the empirical mean $\hat\mu = \frac1T\sum_{t=1}^TX^{(t)}$ is $<\frac34\varepsilon$ or $\ge\frac34\varepsilon$.

Proof. If $\mu\ge\varepsilon$, using the multiplicative version of the Chernoff bound, we have

$$\Pr\Big[\hat\mu<\frac34\varepsilon\Big]\ \le\ \Pr\Big[\hat\mu<\frac34\mu\Big]\ \le\ e^{-\frac{(1/4)^2\mu T}{2}}\ \le\ e^{-\frac{\varepsilon T}{32}}\ \le\ \frac\delta2.$$

Namely, with probability at least $1-\frac\delta2$, it holds that $\hat\mu\ge\frac34\varepsilon$.

If $\mu < \varepsilon$, then using the additive version of the Chernoff-Hoeffding theorem, we have

$$\Pr\Big[\hat\mu<\mu-\frac\varepsilon4\Big]\le e^{-D(\mu-\frac\varepsilon4\|\mu)T},\qquad \Pr\Big[\hat\mu>\mu+\frac\varepsilon4\Big]\le e^{-D(\mu+\frac\varepsilon4\|\mu)T},$$

where $D(x\|y) = x\ln\frac xy+(1-x)\ln\frac{1-x}{1-y}$. Using the inequality $D(x\|y)\ge\frac{(x-y)^2}{2y}$ for $x\le y$ and $D(x\|y)\ge\frac{(x-y)^2}{2x}$ for $x\ge y$, we obtain:

$$\Pr\Big[\hat\mu<\mu-\frac\varepsilon4\Big]+\Pr\Big[\hat\mu>\mu+\frac\varepsilon4\Big]\ \le\ e^{-\frac{(\varepsilon/4)^2}{2\mu}T}+e^{-\frac{(\varepsilon/4)^2}{2(\mu+\varepsilon/4)}T}\ \le\ 2e^{-\frac{\varepsilon T}{40}}\ \le\ \delta.$$

By a union bound, with probability at least $1-\delta$, we have $|\hat\mu-\mu|\le\frac\varepsilon4$.

Combining the cases $\mu\ge\varepsilon$ and $\mu<\varepsilon$, we conclude that, with probability at least $1-\delta$: if $\hat\mu<\frac34\varepsilon$, then we must have $\mu<\varepsilon$; if $\hat\mu\ge\frac34\varepsilon$, then we have $\mu\ge\varepsilon$, or $\varepsilon>\mu\ge\hat\mu-\frac\varepsilon4\ge\frac\varepsilon2$; in either case, $\mu\ge\frac\varepsilon2$.
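To make Lemma G.3 concrete, here is a minimal Python simulation of the test; the sample size $\frac{40}{\varepsilon}\log\frac2\delta$ and threshold $\frac34\varepsilon$ are from the lemma, while the Bernoulli test distributions are our illustrative choice.

```python
import numpy as np

def mean_is_large(samples: np.ndarray, eps: float) -> bool:
    """Lemma G.3's test: empirical mean >= (3/4)*eps."""
    return samples.mean() >= 0.75 * eps

rng = np.random.default_rng(0)
eps, delta = 0.05, 0.01
T = int(40 / eps * np.log(2 / delta))
small = rng.binomial(1, eps / 4, size=T)   # mean eps/4, well below eps/2
large = rng.binomial(1, 2 * eps, size=T)   # mean 2*eps, well above eps
print(mean_is_large(small, eps), mean_is_large(large, eps))  # typically False, True
```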
The last lemma we will use shows how to estimate the unknown value of $\rho = \frac{p}{1-p} = \frac{P(\omega=1)}{P(\omega=0)}$ with relative accuracy $\Delta$ using $T = O\big(\frac{1}{(1-p)\mu_0\,n\,\Delta^2}\log\frac1\delta\big)$ samples. Notice that, if one simply uses the empirical ratio $\frac{\sum_{t=1}^T\mathbb 1\{\omega^{(t)}=1\}}{\sum_{t=1}^T\mathbb 1\{\omega^{(t)}=0\}}$ to estimate $\rho$, then by a Chernoff bound this needs roughly a factor of $n$ more samples than what we claim. This sub-optimality arises because one only uses the $\omega^{(t)}$'s in the samples to estimate $\rho$, wasting the reports $r^{(t)}_i$. By using the $r^{(t)}_i$'s to estimate $\rho$, we can reduce the number of samples by a factor of $n$. The basic idea is the following: according to Fact G.1, we have $\rho = \frac{p}{1-p} = \frac{\mu_0}{\mu_1} = \frac{E[\sum_{i=1}^nr_i\mid\omega=0]}{E[\sum_{i=1}^n(1-r_i)\mid\omega=1]}$. The numerator $E[\sum_{i=1}^nr_i\mid\omega=0]$ and the denominator $E[\sum_{i=1}^n(1-r_i)\mid\omega=1]$ can be estimated from the $r^{(t)}_i$'s in samples where $\omega^{(t)} = 0$ and $1$, respectively. The total number of $r^{(t)}_i$'s is $Tn$, because we have $n$ experts per sample. This reduces the needed number of samples by a factor of $n$. Formally:

Lemma G.4. For a conditionally independent distribution $P$ and any $\Delta\le\frac12$, we can estimate $\rho = \frac{P(\omega=1)}{P(\omega=0)}$ with relative accuracy $\frac{|\hat\rho-\rho|}{\rho}\le4\Delta$ (equivalently, $\frac{\hat\rho}{\rho}\in[1-4\Delta,\ 1+4\Delta]$) with probability at least $1-\delta$ using

$$T = O\Big(\frac{1}{(1-p)\mu_0\,n\,\Delta^2}\log\frac1\delta+\frac{1}{\min\{p,1-p\}}\log\frac1\delta\Big)$$

samples of $(\omega^{(t)},r^{(t)})$'s, by letting $\hat\rho = \frac{\frac{1}{\#_0}\sum_{t:\omega^{(t)}=0}\sum_{i=1}^nr^{(t)}_i}{\frac{1}{\#_1}\sum_{t:\omega^{(t)}=1}\sum_{i=1}^n(1-r^{(t)}_i)}$, where $\#_0$ and $\#_1$ are the numbers of samples with $\omega^{(t)} = 0$ and $1$, respectively.

Proof. Recall that $p = P(\omega=1)$, $1-p = P(\omega=0)$, and $\rho = \frac{p}{1-p}$. According to Fact G.1 (which says $(1-p)\mu_0 = p\mu_1$), we have

$$\rho = \frac{p}{1-p} = \frac{\mu_0}{\mu_1} = \frac{\sum_{i=1}^nE[r_i\mid\omega=0]}{\sum_{i=1}^nE[1-r_i\mid\omega=1]}. \qquad (44)$$

Consider the following way of estimating $\rho$ from $T$ samples $(\omega^{(t)},r^{(t)})_{t=1}^T$. Let $\#_0,\#_1$ be the numbers of samples where $\omega^{(t)} = 0, 1$, respectively:

$$\#_0 = \sum_{t=1}^T\mathbb 1\{\omega^{(t)}=0\},\qquad \#_1 = \sum_{t=1}^T\mathbb 1\{\omega^{(t)}=1\},\qquad \hat\rho = \frac{\frac{1}{\#_0}\sum_{t:\omega^{(t)}=0}\sum_{i=1}^nr^{(t)}_i}{\frac{1}{\#_1}\sum_{t:\omega^{(t)}=1}\sum_{i=1}^n(1-r^{(t)}_i)}. \qquad (45)$$

Comparing the $\hat\rho$ in (45) with the $\rho$ in (44): $\frac{1}{\#_0}\sum_{t:\omega^{(t)}=0}\sum_{i=1}^nr^{(t)}_i$ is an (unbiased) estimate of the numerator $\sum_{i=1}^nE[r_i\mid\omega=0] = n\mu_0$, and $\frac{1}{\#_1}\sum_{t:\omega^{(t)}=1}\sum_{i=1}^n(1-r^{(t)}_i)$ is an (unbiased) estimate of the denominator $\sum_{i=1}^nE[1-r_i\mid\omega=1] = n\mu_1$. We use Chernoff bounds to argue that the relative accuracy of the two estimates is within $\Delta$ with high probability if $\#_0$ and $\#_1$ are large enough.

Suppose that, when drawing the $T$ samples, we draw all the $\omega^{(t)}$'s first (hence $\#_0,\#_1$ are determined), and then draw all the $r^{(t)}_i$'s. After all the $\omega^{(t)}$'s are drawn, the $r^{(t)}_i$'s become independent, because the signals $s^{(t)}_i$ are conditionally independent given $\omega^{(t)}$. Therefore, we can use Chernoff bounds:

$$\Pr\Big[\Big|\sum_{t:\omega^{(t)}=0}\sum_{i=1}^nr^{(t)}_i-\#_0n\mu_0\Big|>\Delta\cdot\#_0n\mu_0\Big]\le2e^{-\frac{\#_0n\mu_0\Delta^2}{3}},\qquad \Pr\Big[\Big|\sum_{t:\omega^{(t)}=1}\sum_{i=1}^n(1-r^{(t)}_i)-\#_1n\mu_1\Big|>\Delta\cdot\#_1n\mu_1\Big]\le2e^{-\frac{\#_1n\mu_1\Delta^2}{3}}.$$

Requiring $\delta\ge2e^{-\frac{\#_0n\mu_0\Delta^2}{3}}$ and $\delta\ge2e^{-\frac{\#_1n\mu_1\Delta^2}{3}}$, namely

$$\#_0\ge\frac{3}{n\mu_0\Delta^2}\log\frac2\delta,\qquad \#_1\ge\frac{3}{n\mu_1\Delta^2}\log\frac2\delta, \qquad (46)$$

we have, with probability at least $1-2\delta$, both of the following hold:

$$\frac{1}{\#_0}\sum_{t:\omega^{(t)}=0}\sum_{i=1}^nr^{(t)}_i\in\big[(1-\Delta)n\mu_0,\ (1+\Delta)n\mu_0\big],\qquad \frac{1}{\#_1}\sum_{t:\omega^{(t)}=1}\sum_{i=1}^n(1-r^{(t)}_i)\in\big[(1-\Delta)n\mu_1,\ (1+\Delta)n\mu_1\big]. \qquad (47)$$

Then, we argue that (46) is satisfied with high probability if $T$ is large enough. This is again done by a Chernoff bound: since $E[\#_j] = E\big[\sum_{t=1}^T\mathbb 1\{\omega^{(t)}=j\}\big] = T\,P(\omega=j)$ for $j = 0,1$, we have

$$\Pr\Big[\#_0\le T(1-p)-\tfrac12T(1-p)\Big]\le2e^{-\frac{T(1-p)(1/2)^2}{3}},\qquad \Pr\Big[\#_1\le Tp-\tfrac12Tp\Big]\le2e^{-\frac{Tp(1/2)^2}{3}}. \qquad (48)$$

So, if we are given $T\ge\frac{12}{\min\{p,1-p\}}\log\frac2\delta$ samples, then we can ensure that with probability at least $1-2\delta$, it holds that $\#_0\ge\frac12T(1-p)$ and $\#_1\ge\frac12Tp$. Then, in order for (46) to be satisfied, we can let

$$\frac12T(1-p)\ge\frac{3}{n\mu_0\Delta^2}\log\frac2\delta,\qquad \frac12Tp\ge\frac{3}{n\mu_1\Delta^2}\log\frac2\delta, \qquad (49)$$

i.e.,

$$T\ \ge\ \max\Big\{\frac{6}{(1-p)\mu_0\,n\,\Delta^2}\log\frac2\delta,\ \frac{6}{p\mu_1\,n\,\Delta^2}\log\frac2\delta\Big\}\ \overset{(1-p)\mu_0=p\mu_1}{=}\ \frac{6}{(1-p)\mu_0\,n\,\Delta^2}\log\frac2\delta. \qquad (50)$$

Both (49) and (50) are satisfied when $T\ge\frac{6}{(1-p)\mu_0n\Delta^2}\log\frac2\delta+\frac{12}{\min\{p,1-p\}}\log\frac2\delta$.

To conclude: if we are given $T = \frac{6}{(1-p)\mu_0n\Delta^2}\log\frac2\delta+\frac{12}{\min\{p,1-p\}}\log\frac2\delta$ samples, then with probability at least $1-4\delta$, (47) holds, which implies

$$\frac{1-\Delta}{1+\Delta}\rho\ \le\ \hat\rho\ \le\ \frac{1+\Delta}{1-\Delta}\rho,\quad\text{and hence (using }\Delta\le\tfrac12\text{)}\quad(1-4\Delta)\rho\ \le\ \hat\rho\ \le\ (1+4\Delta)\rho,\ \text{i.e., } \frac{|\hat\rho-\rho|}{\rho}\le4\Delta.$$
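The estimator of Lemma G.4 is a two-line computation; here is a Python sketch with a synthetic smoke test (the synthetic reports are not calibrated posteriors, so this only exercises the formula, and it assumes both outcomes appear in the sample).

```python
import numpy as np

def estimate_rho(omegas: np.ndarray, reports: np.ndarray) -> float:
    """omegas: shape (T,); reports: shape (T, n), r_i^(t) = P(omega=1 | s_i^(t)).
    Ratio of the average total report mass under omega=0 to the average
    total (1 - report) mass under omega=1, as in Equation (45)."""
    num = reports[omegas == 0].sum(axis=1).mean()        # estimates n * mu_0
    den = (1 - reports[omegas == 1]).sum(axis=1).mean()  # estimates n * mu_1
    return num / den

rng = np.random.default_rng(0)
omegas = rng.integers(2, size=1000)
reports = np.clip(rng.normal(0.5 + 0.1 * omegas[:, None], 0.1, (1000, 4)), 0.01, 0.99)
print(estimate_rho(omegas, reports))
```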
G.1.2 The Proof

We want to show the $O\big(\frac{1}{\varepsilon n(\gamma/(1+\gamma))^2}\log\frac1\delta\big)$ sample complexity upper bound for the case where experts have $\gamma$-strongly informative signals.

We first use $O(\frac1\varepsilon\log\frac1\delta)$ samples to tell whether $(1-p)\mu_0 = p\mu_1 < \frac\varepsilon2$ or $(1-p)\mu_0 = p\mu_1\ge\frac\varepsilon4$. We note that

$$(1-p)\mu_0 = P(\omega=0)\,E\Big[\frac1n\sum_{i=1}^nr_i\ \Big|\ \omega=0\Big] = E\Big[\mathbb 1\{\omega=0\}\cdot\frac1n\sum_{i=1}^nr_i\Big],$$

which is the expectation of the random variable $X = \mathbb 1\{\omega=0\}\frac1n\sum_{i=1}^nr_i\in[0,1]$. So, according to Lemma G.3, we can tell whether $(1-p)\mu_0 < \frac\varepsilon2$ or $\ge\frac\varepsilon4$ with probability at least $1-\delta$ using $O(\frac1\varepsilon\log\frac1\delta)$ samples of $X$. If $(1-p)\mu_0 = p\mu_1 < \frac\varepsilon2$, then according to Lemma G.2, the averaging aggregator $f_{avg}(r) = \frac1n\sum_{i=1}^nr_i$ is $\varepsilon$-optimal; we have hence obtained an $\varepsilon$-optimal aggregator in this case. So, in the following proof, we assume $(1-p)\mu_0 = p\mu_1\ge\frac\varepsilon4$.

For each expert $i$, let $S^1_i = \big\{s_i\in S_i : \frac{P(s_i|\omega=1)}{P(s_i|\omega=0)}\ge1+\gamma\big\}$ be its set of $\gamma$-strongly informative signals that are more likely to be realized under $\omega = 1$ than under $\omega = 0$. Let $S^0_i = S_i\setminus S^1_i = \big\{s_i\in S_i : \frac{P(s_i|\omega=1)}{P(s_i|\omega=0)}\le\frac{1}{1+\gamma}\big\}$ be the set of signals that are more likely to be realized under $\omega = 0$. Since $\frac{r_i}{1-r_i} = \frac{P(s_i|\omega=1)}{P(s_i|\omega=0)}\rho$ by Equation (1), whenever an expert receives a signal in $S^1_i$, its report satisfies

$$\frac{r_i}{1-r_i}\ \ge\ (1+\gamma)\rho,\qquad s_i\in S^1_i; \qquad (51)$$

and whenever it receives a signal in $S^0_i$, its report satisfies

$$\frac{r_i}{1-r_i}\ \le\ \frac{1}{1+\gamma}\rho,\qquad s_i\in S^0_i. \qquad (52)$$

We will use the notation $P(S^u_i\mid\omega) = P(s_i\in S^u_i\mid\omega) = \sum_{s_i\in S^u_i}P(s_i\mid\omega)$, for $u\in\{0,1\}$.

Given a set of $n$ signals $s_1,\ldots,s_n$, one per expert, let $X_1 = \sum_{i=1}^n\mathbb 1\{s_i\in S^1_i\}$ be the total number of signals that belong to the $S^1_i$ sets, and similarly let $X_0 = \sum_{i=1}^n\mathbb 1\{s_i\in S^0_i\}$. We have $X_0+X_1 = n$, and by definition,

$$E[X_1\mid\omega=1] = \sum_{i=1}^nP(S^1_i\mid\omega=1)\ \ge\ (1+\gamma)\sum_{i=1}^nP(S^1_i\mid\omega=0) = (1+\gamma)\,E[X_1\mid\omega=0], \qquad (53)$$

$$E[X_0\mid\omega=0] = \sum_{i=1}^nP(S^0_i\mid\omega=0)\ \ge\ (1+\gamma)\sum_{i=1}^nP(S^0_i\mid\omega=1) = (1+\gamma)\,E[X_0\mid\omega=1]. \qquad (54)$$

Claim G.5. At least one of $E[X_1\mid\omega=1]$ and $E[X_0\mid\omega=0]$ is $\ge\frac n2$.

Proof. Suppose on the contrary that both $E[X_1\mid\omega=1]$ and $E[X_0\mid\omega=0]$ are $<\frac n2$. Then, from (54) we have

$$E[X_0\mid\omega=1]\ \le\ \frac{1}{1+\gamma}E[X_0\mid\omega=0]\ <\ \frac n2.$$

This implies $n = E[X_0+X_1\mid\omega=1] < \frac n2+\frac n2 = n$, a contradiction.

Let $u\in\{0,1\}$ be an index such that $E[X_u\mid\omega=u]\ge\frac n2$; Claim G.5 guarantees that such a $u$ exists. We construct a hypothetical aggregator $f_{hypo}$ that, having access to $\rho$ and $E[X_u\mid\omega]$, predicts whether $\omega = 0$ or $1$ by counting the number $X_u$ of signals that belong to the $S^u_i$ sets and comparing it with its expectations under $\omega = 0$ and $1$, namely $E[X_u\mid\omega=0]$ and $E[X_u\mid\omega=1]$, respectively. Specifically, given reports $r = (r_1,\ldots,r_n)$ as input, with corresponding unobserved signals $s = (s_1,\ldots,s_n)$, $f_{hypo}$ does the following (a code sketch appears below):

(1) If $u = 1$, count how many reports $r_i$ satisfy $\frac{r_i}{1-r_i}\ge(1+\gamma)\rho$; if $u = 0$, count how many reports $r_i$ satisfy $\frac{r_i}{1-r_i}\le\frac{1}{1+\gamma}\rho$. According to (51) and (52), this number is exactly equal to the number of signals that belong to the $S^u_i$ sets, $X_u$.

(2) Then, check whether $X_u$ is closer (in terms of absolute difference) to $E[X_u\mid\omega=u]$ or $E[X_u\mid\omega=1-u]$. If $X_u$ is closer to $E[X_u\mid\omega=u]$, output $f_{hypo}(r) = u$; otherwise, output $f_{hypo}(r) = 1-u$.
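A Python sketch of $f_{hypo}$ for the case $u = 1$ follows (the $u = 0$ case is symmetric). All arguments are assumed given, as in the text; the function name and signature are ours.

```python
import numpy as np

def f_hypo(reports, rho, gamma, ex_u_given_u, ex_u_given_other, u=1):
    """Count reports whose odds clear the threshold, then compare the count
    with the two conditional expectations of X_u (steps (1) and (2) above)."""
    odds = np.asarray(reports) / (1 - np.asarray(reports))
    if u == 1:
        x_u = int((odds >= (1 + gamma) * rho).sum())   # X_1: signals in the S^1_i sets
    else:
        x_u = int((odds <= rho / (1 + gamma)).sum())   # X_0: signals in the S^0_i sets
    closer_to_u = abs(x_u - ex_u_given_u) <= abs(x_u - ex_u_given_other)
    return u if closer_to_u else 1 - u
```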
We claim that $f_{hypo}$ is $\varepsilon$-optimal:

Claim G.6. Given $\frac{\gamma}{1+\gamma}\ge8\sqrt{\frac{2\log(2/\varepsilon)}{n}}$ and $E[X_u\mid\omega=u]\ge\frac n4$, $f_{hypo}$ is $\varepsilon$-optimal.

Proof. Let $a = \sqrt{\frac n2\log\frac2\varepsilon}$. Given either $\omega = 0$ or $1$, consider the conditional random draw of signals $s_1,\ldots,s_n$. Because $X_u = \sum_{i=1}^n\mathbb 1\{s_i\in S^u_i\}$ and the random variables $\mathbb 1\{s_i\in S^u_i\}$, $i = 1,\ldots,n$, are $[0,1]$-bounded and independent conditioning on $\omega$, by Hoeffding's inequality we have

$$\Pr\Big[\big|X_u-E[X_u\mid\omega]\big|\ge a\ \Big|\ \omega\Big]\ \le\ 2e^{-\frac{2a^2}{n}}. \qquad (56)$$

Then with probability at least $1-2e^{-\frac{2a^2}{n}} = 1-\varepsilon$ (by the choice of $a$), it holds that

$$\big|X_u-E[X_u\mid\omega]\big| < a. \qquad (57)$$

Consider the difference between $E[X_u\mid\omega=u]$ and $E[X_u\mid\omega=1-u]$. By (53) and (54), we have

$$E[X_u\mid\omega=1-u]\ \le\ \frac{1}{1+\gamma}E[X_u\mid\omega=u] = \Big(1-\frac{\gamma}{1+\gamma}\Big)E[X_u\mid\omega=u].$$

By the assumption $E[X_u\mid\omega=u]\ge\frac n4$,

$$E[X_u\mid\omega=u]-E[X_u\mid\omega=1-u]\ \ge\ \frac{\gamma}{1+\gamma}\,E[X_u\mid\omega=u]\ \ge\ \frac{\gamma}{1+\gamma}\cdot\frac n4.$$

By the assumption $\frac{\gamma}{1+\gamma}\ge8\sqrt{\frac{2\log(2/\varepsilon)}{n}}$, we have $\frac{\gamma}{1+\gamma}\cdot\frac n4\ge2\sqrt{2n\log\frac2\varepsilon} = 4a$. Therefore,

$$E[X_u\mid\omega=u]-E[X_u\mid\omega=1-u]\ \ge\ 4a. \qquad (58)$$

Because we already had $|X_u-E[X_u\mid\omega]| < a$ (which happened with probability at least $1-\varepsilon$), if $X_u$ turns out to be closer to $E[X_u\mid\omega=u]$ it must be that $\omega = u$; if $X_u$ turns out to be closer to $E[X_u\mid\omega=1-u]$ it must be that $\omega = 1-u$. In either case, our output $f_{hypo}(r)$ is equal to $\omega$, incurring loss 0. If $|X_u-E[X_u\mid\omega]| < a$ did not happen, our loss is at most 1. Therefore, the expected loss of our aggregator $f_{hypo}$ is at most

$$L_P(f_{hypo}) = E_\omega\Big[E\big[|f_{hypo}(r)-\omega|^2\mid\omega\big]\Big]\ \le\ E_\omega\big[(1-\varepsilon)\cdot0+\varepsilon\cdot1\big] = \varepsilon.$$

Since the expected loss of the optimal aggregator $f^*$ is non-negative, $f_{hypo}$ is $\varepsilon$-optimal.

In the remaining proof, we show how to use samples to learn a real aggregator $\hat f$ that implements the same functionality as the hypothetical aggregator $f_{hypo}$ and hence is $\varepsilon$-optimal. We have two learning tasks. First, we need to estimate $\rho$, so that we can implement step (1) of $f_{hypo}$, which tells apart $\frac{r_i}{1-r_i}\ge(1+\gamma)\rho$ and $\frac{r_i}{1-r_i}\le\frac{1}{1+\gamma}\rho$. Second, we need to find an index $u\in\{0,1\}$ such that $E[X_u\mid\omega=u]\ge\frac n4$ and estimate $E[X_u\mid\omega=u]$, so that we can implement step (2) of $f_{hypo}$, which tells whether $X_u$ is closer to $E[X_u\mid\omega=u]$ or $E[X_u\mid\omega=1-u]$. We show that these two tasks can be achieved using $O\big(\frac{1}{\varepsilon n(\gamma/(1+\gamma))^2}\log\frac1\delta+\frac1\varepsilon\log\frac1\delta\big)$ samples, with probability at least $1-O(\delta)$.

Task 1: estimate $\rho$, using $T_1$ samples. We want to use samples to obtain an estimate $\hat\rho$ of $\rho$ such that $\frac{1}{1+\gamma}\rho < \hat\rho < (1+\gamma)\rho$. Then, by checking whether $\frac{r_i}{1-r_i} > \hat\rho$ or $\frac{r_i}{1-r_i} < \hat\rho$, we can tell apart $\frac{r_i}{1-r_i}\ge(1+\gamma)\rho$ and $\frac{r_i}{1-r_i}\le\frac{1}{1+\gamma}\rho$. Using Lemma G.4 with $4\Delta = \frac{\gamma}{1+\gamma}$, we obtain such a $\hat\rho$ with probability at least $1-\delta$ using

$$T_1 = O\Big(\frac{1}{(1-p)\mu_0\,n\,(\gamma/(1+\gamma))^2}\log\frac1\delta+\frac{1}{\min\{p,1-p\}}\log\frac1\delta\Big) = O\Big(\frac{1}{\varepsilon n(\gamma/(1+\gamma))^2}\log\frac1\delta+\frac1\varepsilon\log\frac1\delta\Big)$$

samples (recall that $\min\{p,1-p\}\ge(1-p)\mu_0 = p\mu_1\ge\frac\varepsilon4$). The $\hat\rho$ then satisfies

$$\hat\rho\ \le\ \Big(1+\frac{\gamma}{1+\gamma}\Big)\rho\ <\ (1+\gamma)\rho\qquad\text{and}\qquad\hat\rho\ \ge\ \Big(1-\frac{\gamma}{1+\gamma}\Big)\rho = \frac{1}{1+\gamma}\rho,$$

as desired.

Task 2: find $u$ such that $E[X_u\mid\omega=u]\ge\frac n4$ and estimate $E[X_u\mid\omega=u]$, using $T_2 = O(\frac1\varepsilon\log\frac1\delta)$ samples. First, we show how to use $T_2$ samples to estimate both $E[X_0\mid\omega=0]$ and $E[X_1\mid\omega=1]$ with accuracy $a = \sqrt{\frac n2\log\frac2\varepsilon}$. By the same argument as in the proof of Lemma G.4 (Equations (48) and (49)), we know that with probability at least $1-2\delta$ over the random draws of

$$T_2\ \ge\ \frac{12}{\min\{p,1-p\}}\log\frac2\delta \qquad (59)$$

samples, the numbers of samples $(\omega^{(t)},r^{(t)}_1,\ldots,r^{(t)}_n)$ with $\omega^{(t)} = 0$ and $\omega^{(t)} = 1$, denoted by $\#_0$ and $\#_1$, must satisfy $\#_0\ge\frac12(1-p)T_2$ and $\#_1\ge\frac12pT_2$.

We consider the samples where $\omega^{(t)} = 0$; there are $\#_0n$ reports $r^{(t)}_i$ in them. Suppose we have accomplished Task 1. Then, for each $r^{(t)}_i$, we can tell whether the corresponding signal $s^{(t)}_i$ belongs to $S^0_i$ by checking whether $\frac{r^{(t)}_i}{1-r^{(t)}_i} < \hat\rho$.
So, we can exactly compute the total number of such signals in the $t$-th sample, $X_0(t) = \sum_{i=1}^n\mathbb 1\{s^{(t)}_i\in S^0_i\}$, whose expected value is $E[X_0\mid\omega=0]$. Because signals are independent given $\omega^{(t)} = 0$, by Hoeffding's inequality we have

$$\Pr\Big[\Big|\sum_{t:\omega^{(t)}=0}X_0(t)-\#_0\,E[X_0\mid\omega=0]\Big|\ge\#_0a\Big]\ \le\ 2e^{-\frac{2(\#_0a)^2}{\#_0n}} = 2e^{-\frac{2\#_0a^2}{n}}.$$

Plugging in $a = \sqrt{\frac n2\log\frac2\varepsilon}$ and $\#_0\ge\frac12(1-p)T_2$, we get

$$\Pr\Big[\Big|\frac{1}{\#_0}\sum_{t:\omega^{(t)}=0}X_0(t)-E[X_0\mid\omega=0]\Big|\ge a\Big]\ \le\ 2e^{-\#_0\log\frac2\varepsilon}\ \le\ 2e^{-\frac{(1-p)T_2}{2}\log\frac2\varepsilon}.$$

Similarly, considering the samples where $\omega^{(t)} = 1$, we get

$$\Pr\Big[\Big|\frac{1}{\#_1}\sum_{t:\omega^{(t)}=1}X_1(t)-E[X_1\mid\omega=1]\Big|\ge a\Big]\ \le\ 2e^{-\#_1\log\frac2\varepsilon}\ \le\ 2e^{-\frac{pT_2}{2}\log\frac2\varepsilon}.$$

Therefore, if we require

$$T_2\ \ge\ \frac{2\log(2/\delta)}{\min\{p,1-p\}\log(2/\varepsilon)}, \qquad (60)$$

then with probability at least $1-2\delta$, both

$$\Big|\frac{1}{\#_0}\sum_{t:\omega^{(t)}=0}X_0(t)-E[X_0\mid\omega=0]\Big| < a\qquad\text{and}\qquad\Big|\frac{1}{\#_1}\sum_{t:\omega^{(t)}=1}X_1(t)-E[X_1\mid\omega=1]\Big| < a$$

hold. Namely, $\frac{1}{\#_0}\sum_{t:\omega^{(t)}=0}X_0(t)$ and $\frac{1}{\#_1}\sum_{t:\omega^{(t)}=1}X_1(t)$ are $a$-accurate estimates of $E[X_0\mid\omega=0]$ and $E[X_1\mid\omega=1]$. Equations (59) and (60) together imply that $T_2 = O\big(\frac{1}{\min\{p,1-p\}}\log\frac2\delta\big) = O\big(\frac1\varepsilon\log\frac1\delta\big)$ samples suffice.

Then, we identify an index $u\in\{0,1\}$ such that $E[X_u\mid\omega=u]\ge\frac n4$. By Claim G.5, there exists a $v\in\{0,1\}$ with $E[X_v\mid\omega=v]\ge\frac n2$. Since $\frac{1}{\#_v}\sum_{t:\omega^{(t)}=v}X_v(t)$ is an $a$-accurate estimate of $E[X_v\mid\omega=v]$, we must have

$$\frac{1}{\#_v}\sum_{t:\omega^{(t)}=v}X_v(t)\ \ge\ E[X_v\mid\omega=v]-a\ \ge\ \frac n2-a.$$

So, at least one $u\in\{0,1\}$ must satisfy $\frac{1}{\#_u}\sum_{t:\omega^{(t)}=u}X_u(t)\ge\frac n2-a$. By picking any such $u$, we are guaranteed that

$$E[X_u\mid\omega=u]\ \ge\ \frac{1}{\#_u}\sum_{t:\omega^{(t)}=u}X_u(t)-a\ \ge\ \frac n2-2a.$$

Given the assumption $n\ge32\log\frac2\varepsilon$ in the statement of the theorem, we have $a = \sqrt{\frac n2\log\frac2\varepsilon}\le\sqrt{\frac n2\cdot\frac n{32}} = \frac n8$. Hence, $E[X_u\mid\omega=u]\ge\frac n2-2\cdot\frac n8 = \frac n4$. Finally, as argued above, an $a$-accurate estimate of $E[X_u\mid\omega=u]$ is given by $\frac{1}{\#_u}\sum_{t:\omega^{(t)}=u}X_u(t)$.

Constructing $\hat f$. After accomplishing Tasks 1 and 2 using $T_1+T_2 = O\big(\frac{1}{\varepsilon n(\gamma/(1+\gamma))^2}\log\frac1\delta+\frac1\varepsilon\log\frac1\delta\big)$ samples, we construct an $\hat f$ that implements the same functionality as $f_{hypo}$. Let

$$M = \frac{1}{\#_u}\sum_{t:\omega^{(t)}=u}X_u(t)-2a,$$

where $\frac{1}{\#_u}\sum_{t:\omega^{(t)}=u}X_u(t)$ is our estimate of $E[X_u\mid\omega=u]$ from Task 2 and $a = \sqrt{\frac n2\log\frac2\varepsilon}$.

Claim G.7. $E[X_u\mid\omega=u]-a\ >\ M\ >\ E[X_u\mid\omega=1-u]+a$.

Proof. Because $\frac{1}{\#_u}\sum_{t:\omega^{(t)}=u}X_u(t)$ is an $a$-accurate estimate of $E[X_u\mid\omega=u]$, we have

$$E[X_u\mid\omega=u]\ >\ \frac{1}{\#_u}\sum_{t:\omega^{(t)}=u}X_u(t)-a = M+a.$$

Recall from Equation (58) that $E[X_u\mid\omega=1-u]\le E[X_u\mid\omega=u]-4a$. So,

$$E[X_u\mid\omega=1-u]\ <\ \frac{1}{\#_u}\sum_{t:\omega^{(t)}=u}X_u(t)+a-4a = M-a.$$

The above two inequalities prove the claim.

Given reports $r = (r_1,\ldots,r_n)$ as input, we let $\hat f$ do the following: (1) If $u = 1$, count how many reports $r_i$ satisfy $\frac{r_i}{1-r_i} > \hat\rho$; if $u = 0$, count how many reports $r_i$ satisfy $\frac{r_i}{1-r_i} < \hat\rho$. Let this number be $X$. (2) Then, check whether $X > M$ or $X\le M$. If $X > M$, output $\hat f(r) = u$; otherwise, output $\hat f(r) = 1-u$.

We argue that $\hat f$ implements the same functionality as $f_{hypo}$: (1) In Task 1 we got $\frac{1}{1+\gamma}\rho < \hat\rho < (1+\gamma)\rho$. So, by checking whether $\frac{r_i}{1-r_i} > \hat\rho$ or $< \hat\rho$ we can exactly tell whether $\frac{r_i}{1-r_i}\ge(1+\gamma)\rho$ or $\le\frac{1}{1+\gamma}\rho$. Hence, we have $X = X_u$, the number of signals that belong to the $S^u_i$ sets. (2) Recall from Equation (57) that with probability at least $1-\varepsilon$, $X$ is $a$-close to its expectation $E[X_u\mid\omega]$. Then, according to Claim G.7, $X > M$ implies that $X$ is closer to $E[X_u\mid\omega=u]$, and $X\le M$ implies that $X$ is closer to $E[X_u\mid\omega=1-u]$. So, $\hat f$ implements both of the two steps in $f_{hypo}$. Hence, according to Claim G.6, $\hat f$ is $\varepsilon$-optimal.

G.2 Proof of Theorem B.3

According to Lemma 5.1, the optimal aggregator is

$$f^*(r) = \frac{1}{1+\rho^{n-1}\prod_{i=1}^n\frac{1-r_i}{r_i}},\qquad\text{where } \rho = \frac{p}{1-p}.$$

We claim that an approximately optimal aggregator can be obtained by first estimating $\rho$ from samples and then using the aggregator with the estimate $\hat\rho$ plugged in:

$$\hat f(r) = \frac{1}{1+\hat\rho^{n-1}\prod_{i=1}^n\frac{1-r_i}{r_i}}.$$
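The plug-in aggregator just defined is a one-liner in Python; reports are assumed to lie strictly in $(0,1)$, and the function name is ours.

```python
import numpy as np

def f_hat(reports: np.ndarray, rho_hat: float) -> float:
    """Lemma 5.1's optimal aggregator with rho replaced by an estimate rho_hat."""
    n = len(reports)
    prod = np.prod((1 - reports) / reports)   # prod_i (1 - r_i) / r_i
    return 1.0 / (1.0 + rho_hat ** (n - 1) * prod)

print(f_hat(np.array([0.6, 0.7, 0.55]), rho_hat=1.0))
```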
Claim G.8. If $\frac{|\hat\rho-\rho|}{\rho}\le\frac{2\sqrt\varepsilon}{n}$, then the aggregator $\hat f$ defined above is $\varepsilon$-optimal.

Proof. Consider the function $g(\hat\rho) = \frac{1}{1+\hat\rho^{n-1}\prod_{i=1}^n\frac{1-r_i}{r_i}}$ (where $\hat\rho$ is the variable and the $r_i$'s are constants). We claim that

$$|g'(\hat\rho)|\ \le\ \frac{n-1}{4\hat\rho}. \qquad (63)$$

To see this, we note that if $r_i = 0$ for some $i$ then $g(\hat\rho) = 0$ and $g'(\hat\rho) = 0$. Otherwise, we let $y = \prod_{i=1}^n\frac{1-r_i}{r_i} < +\infty$ and take the derivative with respect to $\hat\rho$:

$$g'(\hat\rho) = -\frac{(n-1)\hat\rho^{n-2}y}{\big(1+\hat\rho^{n-1}y\big)^2} = -(n-1)\,\frac{1}{\Big(\frac{1}{\hat\rho^{(n-2)/2}\sqrt y}+\hat\rho^{n/2}\sqrt y\Big)^2}.$$

By the AM-GM inequality $a+b\ge2\sqrt{ab}$, applied with $a = \frac{1}{\hat\rho^{(n-2)/2}\sqrt y}$ and $b = \hat\rho^{n/2}\sqrt y$ (whose product is $ab = \hat\rho$),

$$|g'(\hat\rho)|\ \le\ (n-1)\,\frac{1}{\big(2\sqrt{\hat\rho}\big)^2} = \frac{n-1}{4\hat\rho},$$

as claimed. Using (63), we have

$$|\hat f(r)-f^*(r)| = |g(\hat\rho)-g(\rho)|\ \le\ \frac{n-1}{4\min\{\hat\rho,\rho\}}\,|\hat\rho-\rho|.$$

So, to obtain the $\varepsilon$-approximation $E\big[|\hat f(r)-f^*(r)|^2\big]\le\varepsilon$, it suffices to have $|\hat f(r)-f^*(r)|\le\sqrt\varepsilon$ pointwise. This is satisfied if the relative error in estimating $\rho$ is at most $\frac{|\hat\rho-\rho|}{\rho}\le\frac{2\sqrt\varepsilon}{n}$, since then $\frac{n-1}{4\min\{\hat\rho,\rho\}}|\hat\rho-\rho|\le\frac{n-1}{4(1-\frac{2\sqrt\varepsilon}{n})}\cdot\frac{2\sqrt\varepsilon}{n}\le\sqrt\varepsilon$.

We then show how to use $O\big(\frac{\gamma n}{\varepsilon}\log\frac1\delta\big)$ samples to estimate the value of $\rho$ with $O\big(\frac{\sqrt\varepsilon}{n}\big)$ relative accuracy, which will give us an $\varepsilon$-optimal aggregator according to Claim G.8. Recall from (14) that when signals are $\gamma$-weakly informative, the reports always satisfy

$$\frac{1}{1+\gamma}\rho\ \le\ \frac{r_i}{1-r_i}\ \le\ (1+\gamma)\rho. \qquad (65)$$

The following observation is the key:

Lemma G.9. For each expert $i$, we have

$$E\Big[\frac{r_i}{1-r_i}\ \Big|\ \omega=0\Big] = \rho,\qquad E\Big[\frac{1-r_i}{r_i}\ \Big|\ \omega=1\Big] = \frac1\rho.$$

As corollaries, for $k$ conditionally independent reports $r_1,\ldots,r_k$, we have $E\big[\prod_{i=1}^k\frac{r_i}{1-r_i}\mid\omega=0\big] = \rho^k$ and $E\big[\prod_{i=1}^k\frac{1-r_i}{r_i}\mid\omega=1\big] = \frac{1}{\rho^k}$.

Proof. Because $\frac{r_i}{1-r_i} = \frac{P(s_i\mid\omega=1)}{P(s_i\mid\omega=0)}\rho$ (from (13)), we have

$$E\Big[\frac{r_i}{1-r_i}\ \Big|\ \omega=0\Big] = \sum_{s_i\in S_i}P(s_i\mid\omega=0)\,\frac{P(s_i\mid\omega=1)}{P(s_i\mid\omega=0)}\,\rho = \sum_{s_i\in S_i}P(s_i\mid\omega=1)\,\rho = \rho.$$

For conditionally independent $r_1,\ldots,r_k$, we have $E\big[\prod_{i=1}^k\frac{r_i}{1-r_i}\mid\omega=0\big] = \prod_{i=1}^kE\big[\frac{r_i}{1-r_i}\mid\omega=0\big] = \rho^k$. Similarly, we can prove $E\big[\frac{1-r_i}{r_i}\mid\omega=1\big] = \frac1\rho$ and $E\big[\prod_{i=1}^k\frac{1-r_i}{r_i}\mid\omega=1\big] = \frac{1}{\rho^k}$.

Let $\Delta = \frac{\sqrt\varepsilon}{3\gamma n}$ and suppose we are given $T = \frac{6e}{\gamma n\Delta^2}\log\frac2\delta$ samples. Suppose when drawing the samples we first draw the events $\omega^{(t)}$, and then draw the reports $r^{(t)}_i$ conditioning on $\omega^{(t)}$ being 0 or 1. After the first step, the numbers of samples with $\omega^{(t)} = 0$ and $\omega^{(t)} = 1$ are determined; denote them by $\#_0$ and $\#_1$. Since $\#_0+\#_1 = T$, one of them must be at least $T/2$. We argue that whether $\#_0\ge\frac T2$ or $\#_1\ge\frac T2$, we can estimate $\rho^{1/\gamma}$ with relative accuracy $3\Delta$. (For simplicity, we assume that $1/\gamma$ is an integer.)

If $\#_0\ge T/2$, then we consider the $\#_0n$ reports $r^{(t)}_i$ in the samples with $\omega^{(t)} = 0$. We divide these $\#_0n$ reports evenly into $\#_0n\gamma$ groups, each of size $1/\gamma$, denoted by $G_1,\ldots,G_{\#_0n\gamma}$. Consider the product of the odds $\frac{r^{(t)}_i}{1-r^{(t)}_i}$ in a group $G_j$: because the $r^{(t)}_i$'s are independent given $\omega = 0$, by Lemma G.9 we have

$$E\Big[\prod_{r^{(t)}_i\in G_j}\frac{r^{(t)}_i}{1-r^{(t)}_i}\ \Big|\ \omega=0\Big] = \rho^{1/\gamma}.$$

Using (65) and the inequality $(1+\gamma)^{1/\gamma}\le e$, we have

$$\frac{\rho^{1/\gamma}}{e}\ \le\ \frac{\rho^{1/\gamma}}{(1+\gamma)^{1/\gamma}}\ \le\ \prod_{r^{(t)}_i\in G_j}\frac{r^{(t)}_i}{1-r^{(t)}_i}\ \le\ (1+\gamma)^{1/\gamma}\rho^{1/\gamma}\ \le\ e\rho^{1/\gamma}.$$

Let $X_j$ be the random variable $\frac{1}{e\rho^{1/\gamma}}\prod_{r^{(t)}_i\in G_j}\frac{r^{(t)}_i}{1-r^{(t)}_i}$. From the above equation and inequality we have $E[X_j] = \frac1e$ and $X_j\in[\frac{1}{e^2},1]\subseteq[0,1]$. So, by the Chernoff bound,

$$\Pr\Big[\frac{1}{\#_0n\gamma}\sum_{j=1}^{\#_0n\gamma}X_j\in(1\pm\Delta)\frac1e\ \Big|\ \omega=0\Big]\ \ge\ 1-2e^{-\frac{\#_0n\gamma\Delta^2}{3e}}\ \ge\ 1-2e^{-\frac{Tn\gamma\Delta^2}{6e}}\ \ge\ 1-\delta,$$

given our choice of $T$. Multiplying $\frac{1}{\#_0n\gamma}\sum_{j=1}^{\#_0n\gamma}X_j$ by $e\rho^{1/\gamma}$, we obtain the following estimate of $\rho^{1/\gamma}$:

$$\hat\rho_0^{1/\gamma} := \frac{1}{\#_0n\gamma}\sum_{j=1}^{\#_0n\gamma}\prod_{r^{(t)}_i\in G_j}\frac{r^{(t)}_i}{1-r^{(t)}_i}\ \in\ (1\pm\Delta)\,\rho^{1/\gamma}.$$

Dividing by $\rho^{1/\gamma}$, we get $\big(\frac{\hat\rho_0}{\rho}\big)^{1/\gamma}\in[1-\Delta,\ 1+\Delta]$.

If $\#_1\ge T/2$, then by considering the $\#_1n$ reports in the samples with $\omega^{(t)} = 1$, dividing them into $\#_1n\gamma$ groups of size $1/\gamma$, $H_1,\ldots,H_{\#_1n\gamma}$, and similarly defining the random variable $Y_j = \frac{\rho^{1/\gamma}}{e}\prod_{r^{(t)}_i\in H_j}\frac{1-r^{(t)}_i}{r^{(t)}_i}$, we obtain the following estimate of $\big(\frac1\rho\big)^{1/\gamma}$:

$$\Big(\frac{1}{\hat\rho_1}\Big)^{1/\gamma} := \frac{1}{\#_1n\gamma}\sum_j\prod_{r^{(t)}_i\in H_j}\frac{1-r^{(t)}_i}{r^{(t)}_i}\ \in\ (1\pm\Delta)\Big(\frac1\rho\Big)^{1/\gamma}.$$

Multiplying by $\rho^{1/\gamma}$, we get $\big(\frac{\rho}{\hat\rho_1}\big)^{1/\gamma}\in[1-\Delta,\ 1+\Delta]$. Taking the reciprocal and noticing that $\frac{1}{1\mp\Delta}\in[1-3\Delta,\ 1+3\Delta]$ when $\Delta\le\frac13$, we obtain $\big(\frac{\hat\rho_1}{\rho}\big)^{1/\gamma}\in[1-3\Delta,\ 1+3\Delta]$.

From the discussion above, we obtain an estimate $\hat\rho\in\{\hat\rho_0,\hat\rho_1\}$ of $\rho$ such that $\big(\frac{\hat\rho}{\rho}\big)^{1/\gamma}\in[1-3\Delta,\ 1+3\Delta]$.
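The grouped-product estimate just described can be sketched as follows (the $\omega = 1$ case is symmetric). It assumes $1/\gamma$ is an integer and that reports lie strictly in $(0,1)$; names are ours.

```python
import numpy as np

def estimate_rho_inv_gamma(reports_omega0: np.ndarray, gamma: float) -> float:
    """Estimate rho^{1/gamma}: pool all reports from samples with omega = 0,
    split them into groups of size 1/gamma, and average the per-group
    products of the odds r/(1-r)."""
    k = int(round(1 / gamma))                 # group size 1/gamma
    odds = reports_omega0 / (1 - reports_omega0)
    usable = (len(odds) // k) * k
    groups = odds[:usable].reshape(-1, k)     # #0 * n * gamma groups
    return groups.prod(axis=1).mean()         # concentrates around rho^{1/gamma}
```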
Raising to the power of $\gamma$, and using the inequalities $(1-x)^\gamma\ge1-x\gamma$ and $(1+x)^\gamma\le e^{x\gamma}\le1+2x\gamma$ for $x\gamma\le1$, we get

$$\frac{\hat\rho}{\rho}\in\big[(1-3\Delta)^\gamma,\ (1+3\Delta)^\gamma\big]\subseteq\big[1-3\Delta\gamma,\ e^{3\Delta\gamma}\big]\subseteq\big[1-3\Delta\gamma,\ 1+6\Delta\gamma\big].$$

In particular, this implies $\frac{|\hat\rho-\rho|}{\rho}\le6\Delta\gamma = \frac{2\sqrt\varepsilon}{n}$. Then, according to Claim G.8, the aggregator $\hat f$ defined by $\hat f(r) = \frac{1}{1+\hat\rho^{n-1}\prod_{i=1}^n\frac{1-r_i}{r_i}}$ is $\varepsilon$-optimal. We have hence obtained an $\varepsilon$-optimal aggregator.

H Missing Proofs from Section 7

H.1 Proof of Theorem 7.1

Regard $P$ as a joint distribution over reports $r = (r_1,\ldots,r_n)$ and the state $\omega$, where $r_i$ is sampled by first sampling $s_i\in S_i$ and then letting $r_i = P(\omega=1\mid s_i)$. Since $|S_i| = m$, there are at most $m$ different values of $r_i$ that can be sampled, so there are at most $2m^n$ different tuples $(r,\omega)$ in the support of $P$. For each such tuple $(r,\omega)$, consider the empirical probability of this tuple:

$$\hat P(r,\omega) = \frac1T\sum_{t=1}^T\mathbb 1\big\{(r^{(t)},\omega^{(t)}) = (r,\omega)\big\}.$$

By the Chernoff bound, we have

$$\Pr\Big[\big|\hat P(r,\omega)-P(r,\omega)\big|>\Delta\,P(r,\omega)\Big]\ \le\ 2e^{-\frac{T\,P(r,\omega)\,\Delta^2}{3}}.$$

Using a union bound over all the $2m^n$ tuples and the fact that $P(r,\omega)\ge P(s,\omega) > \frac{c}{m^n}$ (where $s\in S$ are some signals that generate $r$), we have

$$\big|\hat P(r,\omega)-P(r,\omega)\big|\ \le\ \Delta\,P(r,\omega) \qquad (66)$$

for all tuples $(r,\omega)$, except with probability at most

$$2m^n\cdot2e^{-\frac{T\,P(r,\omega)\,\Delta^2}{3}}\ \le\ 4m^ne^{-\frac{cT\Delta^2}{3m^n}}\ \le\ \delta\qquad\text{if}\qquad T\ \ge\ \frac{3m^n}{c\,\Delta^2}\log\frac{4m^n}{\delta}. \qquad (67)$$

Assuming (66) holds, we consider the empirical Bayesian aggregator:

$$\hat f(r) = \frac{\hat P(r,\omega=1)}{\hat P(r)}.$$

Since (66) implies $(1-\Delta)P(r,\omega)\le\hat P(r,\omega)\le(1+\Delta)P(r,\omega)$ and hence $(1-\Delta)P(r)\le\hat P(r)\le(1+\Delta)P(r)$, we have (for $\Delta\le\frac12$)

$$\hat f(r)\ \ge\ \frac{1-\Delta}{1+\Delta}\cdot\frac{P(r,\omega=1)}{P(r)}\ \ge\ (1-2\Delta)\,f^*(r),\qquad \hat f(r)\ \le\ \frac{1+\Delta}{1-\Delta}\cdot\frac{P(r,\omega=1)}{P(r)}\ \le\ (1+4\Delta)\,f^*(r).$$

Putting these two inequalities together: $|\hat f(r)-f^*(r)|\le4\Delta\,f^*(r)$. We note that this holds for all possible $r$ in the support of $P$. So, the expected approximation error of $\hat f$ is at most:

$$E\big[|\hat f(r)-f^*(r)|^2\big] = \sum_rP(r)\,|\hat f(r)-f^*(r)|^2\ \le\ \sum_rP(r)\,16\Delta^2f^*(r)^2\ \le\ 16\Delta^2\sum_rP(r) = 16\Delta^2.$$

Letting $16\Delta^2 = \varepsilon$, namely $\Delta^2 = \frac{\varepsilon}{16}$, we have $E\big[|\hat f(r)-f^*(r)|^2\big]\le\varepsilon$, so $\hat f$ is an $\varepsilon$-optimal aggregator. Plugging $\Delta^2 = \frac{\varepsilon}{16}$ into (67), we obtain the sample complexity

$$T = O\Big(\frac{m^n}{c\,\varepsilon}\Big(n\log m+\log\frac1\delta\Big)\Big).$$
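The empirical Bayesian aggregator from this proof is straightforward to sketch in Python; reports are used as dictionary keys, so they are tuples, and the default value for an unseen report profile is our own choice.

```python
from collections import Counter

def empirical_bayes_aggregator(samples):
    """samples: list of (r, omega) with r a tuple of reports and omega in {0, 1}.
    Returns hat_f(r) = hat_P(r, omega=1) / hat_P(r), estimated by frequencies."""
    joint = Counter(samples)                    # counts of (r, omega)
    marginal = Counter(r for r, _ in samples)   # counts of r
    def f_hat(r):
        return joint[(r, 1)] / marginal[r] if marginal[r] else 0.5
    return f_hat

f_hat = empirical_bayes_aggregator([((0.5, 0.8), 1), ((0.5, 0.8), 0), ((0.5, 0.8), 1)])
print(f_hat((0.5, 0.8)))   # 2/3
```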