Consistency of weighted majority votes

Daniel Berend
Computer Science Department and Mathematics Department
Ben Gurion University, Beer Sheva, Israel
berend@cs.bgu.ac.il

Aryeh Kontorovich
Computer Science Department
Ben Gurion University, Beer Sheva, Israel
karyeh@cs.bgu.ac.il

Abstract

We revisit, from a statistical learning perspective, the classical decision-theoretic problem of weighted expert voting. In particular, we examine the consistency (both asymptotic and finitary) of the optimal Nitzan-Paroush weighted majority and related rules. In the case of known expert competence levels, we give sharp error estimates for the optimal rule. When the competence levels are unknown, they must be empirically estimated. We provide frequentist and Bayesian analyses for this situation. Some of our proof techniques are non-standard and may be of independent interest. The bounds we derive are nearly optimal, and several challenging open problems are posed.

1 Introduction

Imagine independently consulting a small set of medical experts for the purpose of reaching a binary decision (e.g., whether to perform some operation). Each doctor has some "reputation", which can be modeled as his probability of giving the right advice. The problem of weighting the input of several experts arises in many situations and is of considerable theoretical and practical importance. The rigorous study of majority vote has its roots in the work of Condorcet [1]. By the 1970s, the field of decision theory was actively exploring various voting rules (see [2] and the references therein). A typical setting is as follows. An agent is tasked with predicting some random variable $Y \in \{\pm 1\}$ based on input $X_i \in \{\pm 1\}$ from each of $n$ experts. Each expert $X_i$ has a competence level $p_i \in (0, 1)$, which is the probability of making a correct prediction: $P(X_i = Y) = p_i$.
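To make the setting concrete, the expert model just described can be simulated in a few lines. This is a minimal Python sketch, not part of the paper; the competence values are chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_committee(p, y, rng):
    """Each expert i independently reports the truth y with probability p[i], else -y."""
    correct = rng.random(len(p)) < p
    return np.where(correct, y, -y)

# Three experts with arbitrary competences, and an unbiased truth Y.
p = np.array([0.9, 0.7, 0.6])
y = rng.choice([-1, 1])
x = sample_committee(p, y, rng)

# Sanity check: expert 0 should agree with the truth about 90% of the time.
n_trials = 20_000
hits = sum(sample_committee(p, 1, rng)[0] == 1 for _ in range(n_trials))
print(hits / n_trials)
```

Conditioned on $Y$, the expert reports are mutually independent by construction, which is exactly Assumption (i) below.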
Two simplifying assumptions are commonly made:

(i) Independence: the random variables $\{X_i : i \in [n]\}$ are mutually independent conditioned on the truth $Y$.
(ii) Unbiased truth: $P(Y = +1) = P(Y = -1) = 1/2$.

We will discuss these assumptions below in greater detail; for now, let us just take them as given. (Since the bias of $Y$ can easily be estimated from data, only the independence assumption is truly restrictive.) A decision rule is a mapping $f : \{\pm 1\}^n \to \{\pm 1\}$ from the $n$ expert inputs to the agent's final decision. Our quantity of interest throughout the paper will be the agent's probability of error,

$P(f(X) \neq Y)$. (1)

A decision rule $f$ is optimal if it minimizes the quantity in (1) over all possible decision rules. It was shown in [2] that, when Assumptions (i)-(ii) hold and the true competences $p_i$ are known, the optimal decision rule is obtained by an appropriately weighted majority vote:

$f^{\mathrm{OPT}}(x) = \mathrm{sign}\left(\sum_{i=1}^n w_i x_i\right)$, (2)

where the weights $w_i$ are given by

$w_i = \log\frac{p_i}{1 - p_i}$, $i \in [n]$. (3)

Thus, $w_i$ is the log-odds of expert $i$ being correct, and the voting rule in (2), also known as naive Bayes [3], may be seen as a simple consequence of the Neyman-Pearson lemma [4].

Main results. The formula in (2) raises immediate questions, which apparently have not previously been addressed. The first one has to do with the consistency of the Nitzan-Paroush optimal rule: under what conditions does the probability of error decay to zero, and at what rate? In Section 3, we show that the probability of error is controlled by the committee potential $\Phi$, defined by

$\Phi = \sum_{i=1}^n \left(p_i - \tfrac{1}{2}\right) \log\frac{p_i}{1 - p_i}$. (4)

More precisely, we prove in Theorem 1 that $-\log P(f^{\mathrm{OPT}}(X) \neq Y) \asymp \Phi$, where $\asymp$ denotes equivalence up to universal multiplicative constants. Another issue not addressed by the Nitzan-Paroush result is how to handle the case where the competences $p_i$ are not known exactly but rather estimated empirically by $\hat p_i$. We present two solutions to this problem: a frequentist one and a Bayesian one.
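The optimal rule (2), its log-odds weights (3), and the committee potential (4) are all one-liners to compute. The following sketch is illustrative only, with arbitrarily chosen competences; the tie-breaking choice is ours, not the paper's.

```python
import numpy as np

def np_weights(p):
    """Log-odds weights of eq. (3): w_i = log(p_i / (1 - p_i))."""
    return np.log(p / (1 - p))

def f_opt(x, p):
    """Nitzan-Paroush weighted majority vote, eq. (2).
    A tie is (arbitrarily, here) resolved to -1."""
    return 1 if np.dot(np_weights(p), x) > 0 else -1

def committee_potential(p):
    """Committee potential of eq. (4): Phi = sum_i (p_i - 1/2) log(p_i / (1 - p_i))."""
    return float(np.dot(p - 0.5, np.log(p / (1 - p))))

p = np.array([0.9, 0.6, 0.6])
# The single strong expert outweighs the two weak dissenters:
print(f_opt(np.array([1, -1, -1]), p))   # -> 1
print(committee_potential(p))            # roughly 0.96
```

Note how the weighting departs from a simple majority: expert 0's weight $\log 9 \approx 2.20$ exceeds the combined weight $2 \log 1.5 \approx 0.81$ of the other two.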
As we show in Section 4, the frequentist approach does not admit an optimal empirical decision rule. Instead, we analyze empirical decision rules in various settings: high-confidence (i.e., $|\hat p_i - p_i| \ll 1$) vs. low-confidence, adaptive vs. nonadaptive. The low-confidence regime requires no additional assumptions but gives weaker guarantees (Theorem 5). In the high-confidence regime, the adaptive approach produces error estimates in terms of the empirical $\hat p_i$'s (Theorem 7), while the nonadaptive approach yields a bound in terms of the unknown $p_i$'s, which still leads to useful asymptotics (Theorem 6). The Bayesian solution sidesteps the various cases above, as it admits a simple, provably optimal empirical decision rule (Section 5). Unfortunately, we are unable to compute (or even nontrivially estimate) the probability of error induced by this rule; this is posed as a challenging open problem.

2 Related work

Machine learning theory typically clusters weighted majority [5, 6] within the framework of online algorithms; see [7] for a modern treatment. Since the online setting is considerably more adversarial than ours, we obtain very different weighted majority rules and consistency guarantees. The weights $w_i$ in (2) bear a striking similarity to the AdaBoost update rule [8, 9]. However, the latter assumes weak learners with access to labeled examples, while in our setting the experts are "static". Still, we do not rule out a possible deeper connection between the Nitzan-Paroush decision rule and boosting.

In what began as the influential Dawid-Skene model [10] and is now known as crowdsourcing, one attempts to extract accurate predictions by pooling a large number of experts, typically without the benefit of being able to test any given expert's competence level. Still, under mild assumptions it is possible to efficiently recover the expert competences to high accuracy and to aggregate them effectively [11].
Error bounds for the oracle MAP rule were obtained in this model by [12], and minimax rates were given in [13]. A recent line of work [14, 15, 16] has developed a PAC-Bayesian theory for the majority vote of simple classifiers. This approach facilitates data-dependent bounds and is even flexible enough to capture some simple dependencies among the classifiers, though, again, the latter are learners as opposed to our experts. Even more recently, experts with adversarial noise have been considered [17], and efficient algorithms for computing optimal expert weights (without error analysis) were given [18]. More directly related to the present work are the paper of [19], which characterizes the consistency of the simple majority rule, and [20, 21, 22], which analyze various models of dependence among the experts.

3 Known competences

In this section we assume that the expert competences $p_i$ are known and analyze the consistency of the Nitzan-Paroush optimal decision rule (2). Our main result here is that the probability of error $P(f^{\mathrm{OPT}}(X) \neq Y)$ is small if and only if the committee potential $\Phi$ is large.

Theorem 1. Suppose that the experts $X = (X_1, \ldots, X_n)$ satisfy Assumptions (i)-(ii) and $f^{\mathrm{OPT}} : \{\pm 1\}^n \to \{\pm 1\}$ is the Nitzan-Paroush optimal decision rule. Then

(i) $P(f^{\mathrm{OPT}}(X) \neq Y) \le \exp\left(-\tfrac{1}{2}\Phi\right)$;

(ii) $P(f^{\mathrm{OPT}}(X) \neq Y) \ge \dfrac{3}{8\left[1 + \exp\left(2\Phi + 4\sqrt{\Phi}\right)\right]}$.

As we show in the full paper [27], the upper and lower bounds are both asymptotically tight. The remainder of this section is devoted to proving Theorem 1.

3.1 Proof of Theorem 1(i)

Define the $\{0, 1\}$-indicator variables

$\xi_i = \mathbb{1}_{\{X_i = Y\}}$, (5)

corresponding to the event that the $i$th expert is correct. A mistake $f^{\mathrm{OPT}}(X) \neq Y$ occurs precisely when¹ the sum of the correct experts' weights fails to exceed half the total mass:

$P(f^{\mathrm{OPT}}(X) \neq Y) = P\left(\sum_{i=1}^n w_i \xi_i \le \tfrac{1}{2}\sum_{i=1}^n w_i\right)$. (6)

Since $\mathbb{E}\xi_i = p_i$, we may rewrite the probability in (6) as

$P\left(\sum_{i=1}^n w_i(\xi_i - p_i) \le -\Phi\right)$. (7)

A standard tool for estimating such sum deviation probabilities is Hoeffding's inequality.
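Before turning to analytic bounds, the error event (6) and the guarantee of Theorem 1(i) can be checked by simulation. A minimal sketch, not from the paper, with arbitrarily chosen competences:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.8, 0.7, 0.65, 0.6, 0.55])   # arbitrary illustrative competences
w = np.log(p / (1 - p))                     # Nitzan-Paroush weights, eq. (3)
phi = float(np.dot(p - 0.5, w))             # committee potential, eq. (4)

# A mistake occurs iff the weight of the correct experts fails to exceed
# half the total weight (ties counted as errors), as in eq. (6).
trials = 200_000
xi = (rng.random((trials, len(p))) < p).astype(float)   # correctness indicators, eq. (5)
err_rate = float(((xi @ w) <= 0.5 * w.sum()).mean())

print(err_rate <= np.exp(-phi / 2))   # Theorem 1(i) upper bound holds here
```

For this committee, $\Phi \approx 0.73$, so the bound $e^{-\Phi/2} \approx 0.70$ is loose but valid; the empirical error rate is well below it.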
Applied to (7), Hoeffding's inequality yields the bound

$P(f^{\mathrm{OPT}}(X) \neq Y) \le \exp\left(-\dfrac{2\Phi^2}{\sum_{i=1}^n w_i^2}\right)$, (8)

which is far too crude for our purposes. Indeed, consider a finite committee of highly competent experts with $p_i$'s arbitrarily close to 1, and $X_1$ the most competent of all. Raising $X_1$'s competence sufficiently far above his peers' will cause both the numerator and the denominator in the exponent to be dominated by $w_1^2$, making the right-hand side of (8) bounded away from zero. The inability of Hoeffding's inequality to guarantee consistency even in such a felicitous setting is an instance of its generally poor applicability to highly heterogeneous sums, a phenomenon explored in some depth in [23]. Bernstein's and Bennett's inequalities suffer from a similar weakness (see ibid.).

Fortunately, an inequality of Kearns and Saul [24] is sufficiently sharp to yield the desired estimate: for all $p \in [0, 1]$ and all $t \in \mathbb{R}$,

$(1 - p)e^{-tp} + pe^{t(1-p)} \le \exp\left(\dfrac{1 - 2p}{4\log((1 - p)/p)}\, t^2\right)$. (9)

Remark. The Kearns-Saul inequality (9) may be seen as a distribution-dependent refinement of Hoeffding's (which bounds the left-hand side of (9) by $e^{t^2/8}$), and is not nearly as straightforward to prove. An elementary rigorous proof is given in [25]. Following up, [26] gave a "soft" proof based on transportation and information-theoretic techniques.

¹ Without loss of generality, ties are considered to be errors.

Put $\theta_i = \xi_i - p_i$, substitute into (6), and apply Markov's inequality: for every $t > 0$,

$P(f^{\mathrm{OPT}}(X) \neq Y) = P\left(\sum_{i=1}^n w_i\theta_i \le -\Phi\right) \le e^{-t\Phi}\, \mathbb{E}\, e^{-t\sum_i w_i\theta_i}$. (10)

Each factor satisfies

$\mathbb{E}\, e^{-t w_i\theta_i} = p_i e^{-(1-p_i)w_i t} + (1 - p_i)e^{p_i w_i t} \le \exp\left(\dfrac{2p_i - 1}{4\log(p_i/(1 - p_i))}\, w_i^2 t^2\right) = \exp\left(\tfrac{1}{2}\left(p_i - \tfrac{1}{2}\right)w_i t^2\right)$, (11)

where the inequality follows from (9). By independence,

$\mathbb{E}\, e^{-t\sum_i w_i\theta_i} = \prod_{i=1}^n \mathbb{E}\, e^{-t w_i\theta_i} \le \exp\left(\tfrac{1}{2}\Phi t^2\right)$,

and hence

$P(f^{\mathrm{OPT}}(X) \neq Y) \le \exp\left(\tfrac{1}{2}\Phi t^2 - \Phi t\right)$.

Choosing $t = 1$ yields the bound in Theorem 1(i).

3.2 Proof of Theorem 1(ii)

Define the $\{\pm 1\}$-indicator variables

$\eta_i = 2 \cdot \mathbb{1}_{\{X_i = Y\}} - 1$, (12)

corresponding to the event that the $i$th expert is correct, and put $q_i = 1 - p_i$. The shorthand $w \cdot \eta = \sum_{i=1}^n w_i \eta_i$ will be convenient. We will need some simple lemmata, whose proofs are deferred to the journal version [27].
Lemma 2. We have

$P(f^{\mathrm{OPT}}(X) = Y) = \tfrac{1}{2}\sum_{\eta \in \{\pm 1\}^n} \max\{P(\eta), P(-\eta)\}$,
$P(f^{\mathrm{OPT}}(X) \neq Y) = \tfrac{1}{2}\sum_{\eta \in \{\pm 1\}^n} \min\{P(\eta), P(-\eta)\}$,

where $P(\eta) = \prod_{i : \eta_i = 1} p_i \prod_{i : \eta_i = -1} q_i$.

Lemma 3. Suppose that $s, s' \in (0, \infty)^m$ satisfy $\sum_{i=1}^m (s_i + s'_i) \ge a$ and $R^{-1} \le s_i/s'_i \le R$, $i \in [m]$, for some $R < \infty$. Then $\sum_{i=1}^m \min\{s_i, s'_i\} \ge a/(1 + R)$.

Lemma 4. Define the function $F : (0, 1) \to \mathbb{R}$ by $F(x) = x(1 - x)\log(x/(1 - x))$. Then $\sup_{0 < x < 1} F(x) < \infty$.

5 Bayesian approach

Throughout this section, the prior parameters satisfy $\alpha_i, \beta_i > 0$, we write $q_i = 1 - p_i$, and $B(x, y) = \Gamma(x)\Gamma(y)/\Gamma(x + y)$ denotes the Beta function. Our full probabilistic model is as follows. Each of the $n$ expert competences $p_i$ is drawn independently from a Beta distribution with known parameters $\alpha_i, \beta_i$. Then the $i$th expert, $i \in [n]$, is queried independently $m_i$ times, with $k_i$ correct predictions and $m_i - k_i$ incorrect ones. As before, $K = (k_1, \ldots, k_n)$ is the (random) committee profile. Absent direct knowledge of the $p_i$'s, the agent relies on an empirical decision rule $\hat f : (x, k) \mapsto \{\pm 1\}$ to produce a final decision based on the expert inputs $x$ together with the committee profile $k$. A decision rule $\hat f^{\mathrm{Ba}}$ is Bayes-optimal if it minimizes $P(\hat f(X, K) \neq Y)$, which is formally identical to (18), but semantically there is a difference: the former probability is over the $p_i$ in addition to $(X, Y, K)$. Unlike the frequentist approach, where no optimal empirical decision rule was possible, the Bayesian approach readily admits one:

$\hat f^{\mathrm{Ba}}(x, k) = \mathrm{sign}\left(\sum_{i=1}^n \hat w^{\mathrm{Ba}}_i x_i\right)$, where $\hat w^{\mathrm{Ba}}_i = \log\dfrac{\alpha_i + k_i}{\beta_i + m_i - k_i}$. (29)

Notice that for $0 < p_i < 1$, we have $\hat w^{\mathrm{Ba}}_i \to w_i$ as $m_i \to \infty$, almost surely, both in the frequentist and the Bayesian interpretations. Unfortunately, although $P(\hat f^{\mathrm{Ba}}(X, K) \neq Y) = P(\hat w^{\mathrm{Ba}} \cdot \eta \le 0)$ is a deterministic function of $\{\alpha_i, \beta_i, m_i\}$, we are unable to compute it at this point, or even give a non-trivial bound. The main source of difficulty is the coupling between $\hat w^{\mathrm{Ba}}$ and $\eta$.

Open problem. Give a non-trivial estimate for $P(\hat f^{\mathrm{Ba}}(X, K) \neq Y)$.

6 Discussion

The classic and seemingly well-understood problem of the consistency of weighted majority votes continues to reveal untapped depth and suggest challenging unresolved questions.
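As a closing illustration, the Bayes-optimal plug-in rule (29) is straightforward to implement once the posterior parameters are at hand. A minimal sketch, not from the paper; the Beta parameters and query counts below are arbitrary illustrative values, and the tie-breaking choice is ours.

```python
import math

def bayes_weight(alpha, beta, k, m):
    """Posterior log-odds weight of eq. (29): log((alpha + k) / (beta + m - k))."""
    return math.log((alpha + k) / (beta + m - k))

def f_bayes(x, alphas, betas, ks, ms):
    """Weighted vote with the plug-in weights; a tie is (arbitrarily) resolved to -1."""
    s = sum(bayes_weight(a, b, k, m) * xi
            for a, b, k, m, xi in zip(alphas, betas, ks, ms, x))
    return 1 if s > 0 else -1

# Three experts with uniform Beta(1, 1) priors; expert 0 answered 9 of 10 queries correctly.
alphas, betas = [1, 1, 1], [1, 1, 1]
ks, ms = [9, 6, 5], [10, 10, 10]
print(f_bayes([1, -1, 1], alphas, betas, ks, ms))   # -> 1
```

Consistent with the convergence remark above, `bayes_weight(1, 1, k, m)` approaches the oracle weight $w_i = \log(p_i/(1 - p_i))$ as $m \to \infty$ with $k/m \to p_i$.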
We hope that the results and open problems presented here will stimulate future research.

References

[1] J.-A.-N. de Caritat, marquis de Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. AMS Chelsea Publishing Series, Chelsea Publishing Company, 1785.
[2] S. Nitzan and J. Paroush. Optimal decision rules in uncertain dichotomous choice situations. International Economic Review, 23(2):289-297, 1982.
[3] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
[4] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. Royal Soc. A: Math., Phys. Eng. Sci., 231(694-706):289-337, 1933.
[5] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In FOCS, 1989.
[6] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212-261, 1994.
[7] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119-139, 1997.
[9] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[10] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20-28, 1979.
[11] F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and combining multiple predictors without labeled data. Proc. Nat. Acad. Sci., 2014.
[12] H. Li, B. Yu, and D. Zhou. Error rate bounds in crowdsourcing models. CoRR, abs/1307.2674, 2013.
[13] C. Gao and D. Zhou. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels. arXiv:1310.5764, 2014.
[14] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, 2006.
[15] F. Laviolette and M. Marchand. PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. JMLR, 8:1461-1487, 2007.
[16] J.-F. Roy, F. Laviolette, and M. Marchand. From PAC-Bayes bounds to quadratic programs for majority votes. In ICML, 2011.
[17] Y. Mansour, A. Rubinstein, and M. Tennenholtz. Robust aggregation of experts' signals. Preprint, 2013.
[18] E. Eban, E. Mezuman, and A. Globerson. Discrete Chebyshev classifiers. In ICML, 2014.
[19] D. Berend and J. Paroush. When is Condorcet's jury theorem valid? Soc. Choice Welfare, 15(4):481-488, 1998.
[20] P. J. Boland, F. Proschan, and Y. L. Tong. Modelling dependence in simple and indirect majority systems. J. Appl. Probab., 26(1):81-88, 1989.
[21] D. Berend and L. Sapir. Monotonicity in Condorcet's jury theorem with dependent voters. Social Choice and Welfare, 28(3):507-528, 2007.
[22] D. P. Helmbold and P. M. Long. On the necessity of irrelevant variables. JMLR, 13:2145-2170, 2012.
[23] D. A. McAllester and L. E. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. JMLR, 4:895-911, 2003.
[24] M. J. Kearns and L. K. Saul. Large deviation methods for approximate probabilistic inference. In UAI, 1998.
[25] D. Berend and A. Kontorovich. On the concentration of the missing mass. Electron. Commun. Probab., 18(3):1-7, 2013.
[26] M. Raginsky and I. Sason. Concentration of measure inequalities in information theory, communications and coding. Foundations and Trends in Communications and Information Theory, 10(1-2):1-247, 2013.
[27] D. Berend and A. Kontorovich. A finite-sample analysis of the naive Bayes classifier. Preprint, 2014.
[28] E. Baharad, J. Goldberger, M. Koppel, and S. Nitzan. Distilling the wisdom of crowds: weighted aggregation of decisions on multiple issues. Autonomous Agents and Multi-Agent Systems, 22(1):31-42, 2011.
[29] E. Baharad, J. Goldberger, M. Koppel, and S. Nitzan. Beyond Condorcet: optimal aggregation rules using voting records. Theory and Decision, 72(1):113-130, 2012.
[30] J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In ALT, 2007.
[31] V. Mnih, C. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In ICML, 2008.
[32] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.
[33] A. Kontorovich. Obtaining measure concentration from Markov contraction. Markov Proc. Rel. Fields, 4:613-638, 2012.
[34] D. Berend and A. Kontorovich. A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254-1259, 2013.