# No-Regret and Incentive-Compatible Online Learning

Rupert Freeman (Darden School of Business, University of Virginia), David M. Pennock (DIMACS, Rutgers University), Chara Podimata (Harvard University), Jennifer Wortman Vaughan (Microsoft Research NYC). Correspondence to: Rupert Freeman, David M. Pennock, Chara Podimata, Jennifer Wortman Vaughan.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We study online learning settings in which experts act strategically to maximize their influence on the learning algorithm's predictions by potentially misreporting their beliefs about a sequence of binary events. Our goal is twofold. First, we want the learning algorithm to be no-regret with respect to the best fixed expert in hindsight. Second, we want incentive compatibility, a guarantee that each expert's best strategy is to report his true beliefs about the realization of each event. To achieve this goal, we build on the literature on wagering mechanisms, a type of multi-agent scoring rule. We provide algorithms that achieve no regret and incentive compatibility for myopic experts for both the full and partial information settings. In experiments on datasets from FiveThirtyEight, our algorithms have regret comparable to classic no-regret algorithms, which are not incentive-compatible. Finally, we identify an incentive-compatible algorithm for forward-looking strategic agents that exhibits diminishing regret in practice.

1. Introduction

We study an online learning setting in which a learner makes predictions about a sequence of T binary events (Vovk, 1990; Littlestone & Warmuth, 1994; Cesa-Bianchi et al., 1997; Freund & Schapire, 1997; Vovk, 1998; Auer et al., 2002). The learner has access to a pool of K experts, each with beliefs about the likelihood of each event occurring. The standard goal of the learner is to output a sequence of predictions almost as accurate as those of the best fixed expert in hindsight. Such a learner is said to have no regret.

But what if the experts that the learner consults are strategic agents, capable of reporting predictions that do not represent their true beliefs? As pointed out by Roughgarden & Schrijvers (2017), when the learner is not only making predictions but also (implicitly or explicitly) evaluating the experts, experts might have an incentive to misreport. The Good Judgment Project (https://goodjudgment.com), a competitor in IARPA's Aggregative Contingent Estimation geopolitical forecasting contest, scored individual forecasters and rewarded the top 2%, dubbed Superforecasters (Tetlock & Gardner, 2015), with perks such as paid conference travel; some are now employed by a spinoff company. Similarly, the website FiveThirtyEight (https://fivethirtyeight.com/) not only predicts election results by aggregating different pollsters, but also publicly scores the pollsters, in a way that correlates with the amount of influence that the pollsters have over the FiveThirtyEight aggregate. It is natural to expect that forecasters might respond to the competitive incentive structure in these settings by seeking to maximize the influence that they exert on the learner's prediction.

When an online learning algorithm is designed in such a way that experts are motivated to report their true beliefs, we say it is incentive-compatible. Incentive compatibility is desirable for several reasons. First, when experts do not report truthfully, the learner's prediction may be harmed.
Second, learning algorithms that fail incentive compatibility place an additional layer of cognitive burden on the experts, who must now reason about the details of the algorithm and other experts' reports and beliefs in order to decide how to act optimally. To our knowledge, the standard multiplicative-weights-type algorithms fail incentive compatibility, in the sense that experts can sometimes achieve a greater influence on the algorithm's prediction by misreporting their beliefs; we illustrate this explicitly through manipulation examples.

Our goal in this work is to design incentive-compatible online learning algorithms without compromising on the quality of the algorithm's predictions. That is, we seek algorithms that are both incentive-compatible and no-regret, for both the full and partial (bandit) information settings. Towards this goal, we show a novel connection between online learning and wagering mechanisms (Lambert et al., 2008; 2015), a type of multi-agent scoring rule that allows a principal to elicit the beliefs of a group of agents without taking on financial risk. Using this connection, we construct online learning algorithms that are incentive-compatible and incur sublinear regret. For the full information setting, we introduce Weighted-Score Update (WSU), which yields regret $O(\sqrt{T \ln K})$, matching the optimal regret achievable for general loss functions, even without incentive guarantees. For the partial information setting, we introduce Weighted-Score Update with Uniform Exploration (WSU-UX), which achieves regret $O(T^{2/3}(K \ln K)^{1/3})$.

We focus primarily on experts that strategize only about their influence at the next timestep. However, we obtain a partial extension for forward-looking experts. Building on a mechanism that was proposed for forecasting competitions (Witkowski et al., 2018), we identify an algorithm, ELF-X, for the full information setting that is incentive-compatible and achieves diminishing regret in simulations.

Our theoretical results are supported by experiments on data gathered from an online prediction contest on FiveThirtyEight. Our algorithms achieve regret almost identical to the classic (and not incentive-compatible) Multiplicative Weights Update (MWU) (Freund & Schapire, 1997) and EXP3 (Auer et al., 2002) algorithms in the full and partial information settings respectively, though WSU falls short of the optimal regret achieved by Hedge for quadratic loss.

Related Work. Other work has drawn connections between online learning and incentive-compatible forecasting, particularly in the context of prediction markets (Abernethy et al., 2013; Abernethy & Frongillo, 2011; Frongillo et al., 2012; Hu & Storkey, 2014). Our work is most closely related to that of Roughgarden & Schrijvers (2017), but differs from theirs in several important ways. Most crucially, Roughgarden and Schrijvers consider algorithms that maintain unnormalized weights over the experts, and they assume that an expert's incentives are only affected by these weights. In our work, incentives are tied to the expert's normalized weight, that is, his probability of being selected by the learning algorithm.
We argue that normalized weights better reflect experts' incentives in reality, since reputation tends to be relative more than absolute; put another way, doubling the unnormalized weight of every expert should not increase an expert's utility, since his influence over the learner's prediction remains the same. Under Roughgarden and Schrijvers' model, the design problem is fairly simple when the loss function is a proper loss (Reid & Williamson, 2009), that is, one that can be elicited by a proper scoring rule (Savage, 1971; Gneiting & Raftery, 2007), such as the quadratic loss function, and can be solved with a multiplicative weights algorithm. Because of this, they focus primarily on the absolute loss function, which is not a proper loss. In contrast, in our model, the design problem is nontrivial even for these easier proper loss functions.

Conceptually, our work builds on work by Witkowski et al. (2018), who use competitive scoring rules, a subclass of wagering mechanisms, to design incentive-compatible forecasting competitions. We discuss their work further in Section 5. Our work also has connections with the work of Orabona & Pál (2016), who introduce a class of coin-betting algorithms for online learning. Although Orabona & Pál (2016) do not address incentives and do not make a connection with the wagering mechanisms literature, our WSU algorithm can be interpreted as a coin-betting algorithm. [Footnote 3: In the language of coin betting, in WSU experts wager an $\eta$ fraction of their wealth on the positive realization of the event, and the actual outcome of each coin flip is the expert's loss minus the weighted average loss of all other experts.]

2. Model and Preliminaries

We consider a setting in which a learner interacts with a set of K experts, each making probabilistic predictions about a sequence of T binary outcomes. [Footnote 4: We focus on binary outcomes to simplify the presentation of our results, but our techniques could be applied more broadly.] At each round $t \in [T]$, each expert $i \in [K]$ has a private belief $b_{i,t} \in [0, 1]$, unknown to the learner, about the outcome for that round. Both the experts' beliefs and the sequence of outcomes may be chosen arbitrarily, and potentially adversarially.

In the full information setting, each expert reports his prediction $p_{i,t} \in [0, 1]$ to the learner. The learner then chooses her own prediction $\hat{p}_t \in [0, 1]$ and observes the outcome realization $r_t \in \{0, 1\}$. Finally, the learner and the experts incur losses $\hat{\ell}_t = \ell(\hat{p}_t, r_t)$ and $\ell_{i,t} = \ell(p_{i,t}, r_t)$ for all $i \in [K]$, where $\ell : [0, 1] \times \{0, 1\} \to [0, 1]$ is a bounded loss function. [Footnote 5: The loss function taking values in [0, 1] is without loss of generality, since any bounded loss function could be rescaled.]

As is common in the literature, we restrict our attention to algorithms in which the learner maintains a timestep-specific probability distribution $\pi_t = (\pi_{1,t}, \ldots, \pi_{K,t})$ over the experts, and chooses her prediction $\hat{p}_t$ according to this distribution. Unless otherwise specified, this means that the learner predicts $\hat{p}_t = p_{i,t}$ with probability $\pi_{i,t}$; some of our results additionally apply when $\hat{p}_t = \sum_{i \in [K]} \pi_{i,t} p_{i,t}$. Under partial information, the protocol remains the same except that the learner is explicitly restricted to choosing a single expert $I_t$ on each round t (according to distribution $\pi_t$) and does not observe the predictions of other experts.

The goal of the learner is twofold. First, she wishes to incur a total loss that is not too much worse than the loss of the best fixed expert in hindsight. This is captured using the classic notion of regret, given by
$$R = \mathbb{E}\Big[\sum_{t \in [T]} \hat{\ell}_t\Big] - \min_{i \in [K]} \sum_{t \in [T]} \ell_{i,t},$$
where the expectation is taken with respect to randomness in the learner's choice of $\hat{p}_t$.

No-regret algorithms have been proposed in both the full and partial information settings. Many, such as Hedge (Freund & Schapire, 1997) and MWU (Arora et al., 2012), achieve regret of $O(\sqrt{T \ln K})$ for general loss functions by maintaining unnormalized weights $w_{i,t}$ for each expert i that are updated multiplicatively at each timestep. Hedge uses the update rule $w_{i,t+1} = w_{i,t}\exp(-\eta \ell_{i,t})$, while MWU uses $w_{i,t+1} = w_{i,t}(1 - \eta \ell_{i,t})$, for appropriately chosen values of $\eta$. These weights are then normalized to arrive at the distribution $\pi_t$. For the case of exp-concave loss functions, such as the quadratic loss, Kivinen & Warmuth (1999) showed that by aggregating experts' predictions and tuning $\eta$ appropriately, Hedge can achieve regret $O(\ln K)$.

For the partial information setting, the EXP3 algorithm of Auer et al. (2002) achieves a regret of $O(\sqrt{TK \ln K})$. EXP3 maintains a set of expert weights similar to those of Hedge. However, since the learner can only observe the prediction of the chosen expert, she uses an unbiased estimator $\hat{\ell}_{i,t}$ of each expert i's loss in her updates in place of $\ell_{i,t}$. The update rule then becomes $w_{i,t+1} = w_{i,t}\exp(-\eta \hat{\ell}_{i,t})$.

The second goal of the learner is to incentivize experts to truthfully report their private beliefs. In our model, at each timestep t, each expert i chooses his report $p_{i,t}$ strategically to maximize the probability $\pi_{i,t+1}$ that he is chosen at timestep t + 1. An algorithm is incentive-compatible if experts maximize this probability by reporting $p_{i,t} = b_{i,t}$, irrespective of the reports of the other experts.

Definition 2.1 (Incentive Compatibility). An online learning algorithm is incentive-compatible if for every timestep $t \in [T]$, every expert i with belief $b_{i,t}$, every report $p_{i,t}$, every vector of reports of the other experts $p_{-i,t}$, and every history of reports $(p_{t'})_{t' < t}$,
$$\mathbb{E}_{r_t \sim \mathrm{Bern}(b_{i,t})}\big[\pi_{i,t+1}\big((b_{i,t}, p_{-i,t}), r_t\big)\big] \;\ge\; \mathbb{E}_{r_t \sim \mathrm{Bern}(b_{i,t})}\big[\pi_{i,t+1}\big((p_{i,t}, p_{-i,t}), r_t\big)\big].$$

Even for proper loss functions such as the quadratic loss, standard multiplicative-weights-type algorithms fail this definition; we illustrate this with explicit manipulation examples. Thus, unlike in the setting of Roughgarden & Schrijvers (2017), using a proper loss function with a standard algorithm is not enough, and new algorithmic ideas are needed.

To derive our algorithms, we draw a connection between online learning and wagering mechanisms, one-shot elicitation mechanisms that allow experts to bet on the quality of their predictions relative to others. In the one-shot wagering setting introduced by Lambert et al. (2008), each agent $i \in [K]$ holds a belief $b_i \in [0, 1]$ about the likelihood of an event. Agent i reports a probability $p_i$ and a wager $w_i \ge 0$. A wagering mechanism, $\Gamma$, maps the reports $p = (p_1, \ldots, p_K)$, wagers $w = (w_1, \ldots, w_K)$, and the realization r of the binary event to payments $\Gamma_i(p, w, r)$ for each agent i. The purpose of the wager is to allow each agent to set a maximum allowable loss, which is captured by imposing the constraint that $\Gamma_i(p, w, r) \ge 0$ for all $i \in [K]$. We restrict our attention to budget-balanced wagering mechanisms, for which $\sum_{i \in [K]} \Gamma_i(p, w, r) = \sum_{i \in [K]} w_i$.

A wagering mechanism $\Gamma$ is said to be incentive-compatible if for every agent $i \in [K]$ with belief $b_i \in [0, 1]$, every report $p_i \in [0, 1]$, every vector of reports of the other agents $p_{-i}$, and every vector of wagers w,
$$\mathbb{E}_{r \sim \mathrm{Bern}(b_i)}\big[\Gamma_i\big((b_i, p_{-i}), w, r\big)\big] \;\ge\; \mathbb{E}_{r \sim \mathrm{Bern}(b_i)}\big[\Gamma_i\big((p_i, p_{-i}), w, r\big)\big].$$
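Both incentive-compatibility notions above have the same structure: reporting one's true belief should maximize the agent's expected payoff (probability mass in Definition 2.1, payment in the wagering setting) under that belief. As an illustration only (not part of the formal development), the following Python sketch numerically checks this property for a generic one-round update rule under the quadratic loss; the helper names are ours, and the final call is a usage example rather than a claim about its outcome.

```python
import numpy as np

def quadratic_loss(p, r):
    # Quadratic (Brier) loss: a proper loss bounded in [0, 1] for p in [0, 1], r in {0, 1}.
    return (p - r) ** 2

def expected_next_weight(update_rule, i, belief, reports, pi):
    # E_{r ~ Bern(belief)}[pi_{i,t+1}] for a one-round update rule pi_{t+1} = f(reports, pi_t, r).
    return belief * update_rule(reports, pi, 1)[i] + (1.0 - belief) * update_rule(reports, pi, 0)[i]

def best_report(update_rule, i, belief, other_reports, pi, grid=np.linspace(0, 1, 201)):
    # Sweep expert i's report and return the one maximizing his expected next-round
    # probability. Incentive compatibility requires the maximizer to be the true belief,
    # for every belief, every profile of other reports, and every current distribution pi.
    def value(p_i):
        reports = np.insert(np.asarray(other_reports, dtype=float), i, p_i)
        return expected_next_weight(update_rule, i, belief, reports, pi)
    return grid[np.argmax([value(p) for p in grid])]

# Candidate rule: MWU weights renormalized into a distribution. As discussed above,
# rules of this form need not pass the check, even though the quadratic loss is proper.
def mwu_normalized(reports, pi, r, eta=0.1):
    w = pi * (1.0 - eta * quadratic_loss(np.asarray(reports, dtype=float), r))
    return w / w.sum()

pi0 = np.ones(3) / 3.0
print(best_report(mwu_normalized, i=0, belief=0.7, other_reports=[0.4, 0.9], pi=pi0))
```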
One class of budget-balanced, incentive-compatible wagering mechanisms is the class of Weighted Score Wagering Mechanisms (WSWMs) of Lambert et al. (2008; 2015). Fixing any proper loss function $\ell$ bounded in [0, 1], agent i receives payment
$$\Gamma^{\mathrm{WSWM}}_i(p, w, r) = w_i\Big(1 - \ell(p_i, r) + \sum_{j \in [K]} w_j\,\ell(p_j, r)\Big).$$
WSWMs are incentive-compatible because the payment an agent receives is a linear function of his loss, measured by a proper loss function. An agent makes a profit (i.e., receives payment greater than his wager) whenever his loss is smaller than the wager-weighted average agent loss, so accurate agents are more likely to increase their wealth.

3. The Full Information Setting

In this section, we present and analyze an online prediction algorithm, Weighted-Score Update (WSU), for the full information setting. We show that WSU is incentive-compatible and achieves regret $O(\sqrt{T \ln K})$.

Our key observation is that we can define a black-box reduction that transforms any budget-balanced wagering mechanism $\Gamma$ into an online learning algorithm by setting $\pi_{t+1} = \Gamma(p_t, \pi_t, r_t)$. Here we can interpret an expert's weight according to distribution $\pi_t$ as his currency. Each expert i wagers $\pi_{i,t}$ at time t and receives a payoff $\pi_{i,t+1}$, which depends on the reports $p_t$ of the experts and the realization $r_t$. It is easy to see that any online prediction algorithm that is derived from an incentive-compatible wagering mechanism will in turn be incentive-compatible, because any misreport that increases the weight $\pi_{i,t+1}$ would also be a successful misreport in the wagering setting.

One might hope that applying this reduction to the WSWM would directly yield a no-regret online learning algorithm. But this is not the case, due to the fact that an expert who makes an inaccurate prediction can lose too much of his wealth (probability) if all other experts have low loss, and it can take a long time to recover from this. To handle this, we allow experts to wager only an $\eta$ fraction of their current probability at each timestep, for some $\eta \in (0, 0.5]$. This guarantees that no expert can obtain a probability $\pi_{i,t}$ close to zero without having made a long series of inaccurate predictions. Formally, the update rule of our algorithm, the Weighted-Score Update (WSU), is defined by
$$\pi_{i,t+1} = \eta\,\Gamma^{\mathrm{WSWM}}_i(p_t, \pi_t, r_t) + (1 - \eta)\,\pi_{i,t}, \quad (1)$$
with weights $\pi_{i,1}$ initialized to $\pi_{i,1} = 1/K$ for all i. We must show that $\pi_t$ is a valid probability distribution over experts at each t. This follows from the WSWM being budget-balanced; the proof is in the appendix (Lemma B.1).

By rewriting the WSU update rule in terms of the relative loss $L_{i,t} = \ell_{i,t} - \sum_{j \in [K]} \pi_{j,t}\ell_{j,t}$, we can see that the form of the update is quite familiar. In particular, from Equation (1),
$$\pi_{i,t+1} = \eta\,\pi_{i,t}\Big(1 - \ell_{i,t} + \sum_{j \in [K]} \pi_{j,t}\ell_{j,t}\Big) + (1 - \eta)\,\pi_{i,t} = \pi_{i,t}\,(1 - \eta L_{i,t}). \quad (2)$$
This resembles the update rule for the (unnormalized) weights maintained by MWU, but with the relative loss $L_{i,t}$ in place of $\ell_{i,t}$. The D-Prod algorithm of Even-Dar et al. (2008) involves a similar update, but using loss relative to a single fixed distribution over experts instead of $\pi_t$.
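As an illustration (not part of the formal analysis), the following minimal Python sketch implements one round of WSU via WSWM payments (Equation (1)) and checks, on one example, that it matches the multiplicative form in Equation (2); the helper names are ours.

```python
import numpy as np

def quadratic_loss(p, r):
    # Proper loss bounded in [0, 1].
    return (p - r) ** 2

def wswm_payment(reports, wagers, r, loss=quadratic_loss):
    """Weighted Score Wagering Mechanism payments:
    Gamma_i = w_i * (1 - loss(p_i, r) + sum_j w_j * loss(p_j, r))."""
    losses = loss(np.asarray(reports, dtype=float), r)
    return wagers * (1.0 - losses + np.dot(wagers, losses))

def wsu_update(pi, reports, r, eta, loss=quadratic_loss):
    """One round of WSU (Equation (1)): wager an eta fraction of each expert's
    probability mass in the WSWM and keep the remaining (1 - eta) fraction."""
    return eta * wswm_payment(reports, pi, r, loss) + (1.0 - eta) * pi

# Quick check of the multiplicative form (Equation (2)):
# pi_{i,t+1} = pi_{i,t} * (1 - eta * (loss_i - pi-weighted average loss)).
K, eta = 4, 0.1
pi = np.ones(K) / K
reports, r = np.array([0.2, 0.6, 0.8, 0.5]), 1
losses = quadratic_loss(reports, r)
relative = losses - np.dot(pi, losses)
new_pi = wsu_update(pi, reports, r, eta)
assert np.allclose(new_pi, pi * (1.0 - eta * relative))
assert np.isclose(new_pi.sum(), 1.0)  # budget balance keeps pi a distribution
```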
We are now ready to prove our guarantees. The proof of Theorem 3.1 proceeds in a similar manner to the standard proof that MWU satisfies no regret. However, our proof is slightly simpler because we do not need to make a distinction between (unnormalized) weights and (normalized) probabilities. We can therefore avoid introducing the standard potential function used in proofs of no regret.

Theorem 3.1. WSU is incentive-compatible and, for step size $\eta = \sqrt{\ln(K)/T}$, yields regret $R \le 2\sqrt{T \ln K}$.

Proof. For incentive compatibility, note that from Equation (1), $\pi_{i,t+1}$ is a convex combination of a WSWM payment and $\pi_{i,t}$, which cannot be influenced by i's report at time t. Since truthful reporting (at least weakly) maximizes each of these components, it also maximizes the sum.

For the regret, denoting by $i^*$ the best expert in hindsight,
$$1 \ge \pi_{i^*,T+1} = \pi_{i^*,T}(1 - \eta L_{i^*,T}) = \cdots = \pi_{i^*,1}\prod_{t \in [T]}(1 - \eta L_{i^*,t}) = \frac{1}{K}\prod_{t \in [T]}(1 - \eta L_{i^*,t}).$$
Taking the logarithm of both sides of this inequality, we get
$$0 \ge -\ln K + \sum_{t \in [T]}\ln(1 - \eta L_{i^*,t}) \ge -\ln K + \sum_{t \in [T]}\big({-\eta L_{i^*,t}} - \eta^2 L_{i^*,t}^2\big),$$
where the last inequality comes from the fact that for $x \le 1/2$, $\ln(1 - x) \ge -x - x^2$ (see Lemma B.2). Rearranging and dividing both sides by $\eta$ yields
$$\sum_{t \in [T]} L_{i^*,t} \ge -\frac{\ln K}{\eta} - \eta \sum_{t \in [T]} L_{i^*,t}^2.$$
Since we have $\sum_{t \in [T]} L_{i^*,t} = \sum_{t \in [T]}\big(\ell_{i^*,t} - \sum_{j \in [K]}\pi_{j,t}\ell_{j,t}\big) = -R$, this becomes
$$R \le \frac{\ln K}{\eta} + \eta \sum_{t \in [T]} L_{i^*,t}^2 \le \frac{\ln K}{\eta} + \eta T,$$
using $L_{i^*,t}^2 \le 1$. Finally, tuning $\eta = \sqrt{\ln(K)/T}$ gives us the result.

If T is not known in advance, a standard doubling trick (Auer et al., 2002) can be applied with only a constant factor increase in regret; see Appendix B.3 for details.

The regret and incentive-compatibility guarantees of WSU presented in Theorem 3.1 hold for all [0, 1]-bounded proper loss functions $\ell$. If $\ell$ is additionally convex, then these guarantees carry over to a (possibly more practical) variant of WSU, termed WSU-Aggr, that uses the same update rule but sets $\hat{p}_t = \sum_{i \in [K]} \pi_{i,t} p_{i,t}$ rather than choosing a single expert. Incentive compatibility is immediate. The regret bound follows from the fact that, by Jensen's inequality,
$$\ell\Big(\sum_{i \in [K]} \pi_{i,t} p_{i,t},\, r_t\Big) \le \sum_{i \in [K]} \pi_{i,t}\,\ell(p_{i,t}, r_t).$$

4. The Partial Information Setting

The encouraging results of the previous section apply only when the learner has access to the reports of all experts. But what if the learner has only partial information regarding these reports and still wants to incentivize all experts to report their predictions truthfully? In this section, we provide and analyze a novel algorithm, Weighted-Score Update with Uniform Exploration (WSU-UX), that is simultaneously no-regret and incentive-compatible in the bandit setting, in which the learner chooses a single expert $I_t$ at each round and observes only that expert's prediction. We show this algorithm has regret $O(T^{2/3}(K \ln K)^{1/3})$. This guarantee is weaker than that of EXP3, but we see in Section 6 that WSU-UX can perform similarly to EXP3 in practice, with the additional advantage of incentive compatibility.

One might think that the standard trick of replacing the loss $\ell_{i,t}$ with an unbiased estimator $\hat{\ell}_{i,t}$ in the WSU update rule would suffice in order to guarantee both incentive compatibility and a regret rate of $O(\sqrt{T \ln K})$. Specifically, following Auer et al. (2002), we might consider setting $\hat{\ell}_{i,t} = 0$ for all experts $i \ne I_t$ whose predictions we do not observe, and $\hat{\ell}_{I_t,t} = \ell_{I_t,t}/\pi_{I_t,t}$ for the chosen expert. However, since these estimated losses are unbounded, this could lead to weights $\pi_{i,t}$ moving outside of [0, 1], and we would no longer have a valid algorithm.

To solve this, we mix a distribution generated via WSU-style updates with a small amount of the uniform distribution. This does not affect incentives, since the experts have no way of altering the uniform distribution, and has the convenient property that the estimated loss function is now bounded. By carefully tuning parameters, we are able to guarantee a valid probability distribution over experts. The resulting updates are given in Algorithm 1.

Algorithm 1 WSU-UX with parameters $\eta$ and $\gamma$ such that $0 < \eta, \gamma < 1/2$ and $\eta K/\gamma \le 1/2$.
1: Set $\pi_{i,1} = 1/K$ for all $i \in [K]$
2: for $t \in [T]$ do
3:   Choose expert $I_t \sim \tilde{\pi}_t$, where $\tilde{\pi}_{i,t} = (1 - \gamma)\pi_{i,t} + \gamma/K$
4:   Compute $\hat{\ell}_{I_t,t} = \ell_{I_t,t}/\tilde{\pi}_{I_t,t}$ and $\hat{\ell}_{i,t} = 0$ for all $i \ne I_t$
5:   Update $\pi_{i,t+1} = \pi_{i,t}\big(1 - \eta\big(\hat{\ell}_{i,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\big)\big)$ for all $i \in [K]$
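As a companion to the pseudocode, the following minimal Python sketch implements one WSU-UX round; it is our own illustration under the setup above (helper names ours), with the parameter values in the usage example taken from the scaling in Theorem 4.1 (stated below).

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic_loss(p, r):
    # Proper loss bounded in [0, 1].
    return (p - r) ** 2

def wsu_ux_round(pi, reports, r, eta, gamma):
    """One round of WSU-UX (Algorithm 1): sample a single expert from the exploration-mixed
    distribution, build an importance-weighted loss estimate for that expert only, and apply
    the WSU-style multiplicative update. Requires eta * K / gamma <= 1/2."""
    K = len(pi)
    pi_tilde = (1.0 - gamma) * pi + gamma / K      # mix in uniform exploration
    I = rng.choice(K, p=pi_tilde)                  # only this expert's report/loss is observed
    loss_hat = np.zeros(K)
    loss_hat[I] = quadratic_loss(reports[I], r) / pi_tilde[I]
    pi_next = pi * (1.0 - eta * (loss_hat - np.dot(pi, loss_hat)))
    return pi_next, I

# Usage, with eta and gamma set as in Theorem 4.1 (which assumes T >= K ln K).
K, T = 5, 10_000
eta = (np.log(K) / (4 * np.sqrt(K) * T)) ** (2 / 3)
gamma = (K * np.log(K) / (4 * T)) ** (1 / 3)
assert eta * K / gamma <= 0.5
pi = np.ones(K) / K
reports, r = np.array([0.1, 0.4, 0.55, 0.7, 0.9]), 1
pi, I = wsu_ux_round(pi, reports, r, eta, gamma)
assert np.isclose(pi.sum(), 1.0)                   # Lemma 4.1: pi stays a distribution
```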
We first prove that this is a valid algorithm, that is, that the distributions $\tilde{\pi}_t$ from which an expert is selected are valid, under appropriate settings of $\eta$ and $\gamma$.

Lemma 4.1. If $\eta K/\gamma \le 1/2$, the WSU-UX weights $\pi_t$ and $\tilde{\pi}_t$ are valid probability distributions for all $t \in [T + 1]$.

Proof. We prove this inductively for $\pi_t$ and $\tilde{\pi}_t$ simultaneously. The base case is trivial since at time t = 1, $\pi_{i,1} = \tilde{\pi}_{i,1} = 1/K$ for all $i \in [K]$. Now assume that for some t both $\pi_t$ and $\tilde{\pi}_t$ are valid probability distributions. We distinguish two cases. First, suppose $i \ne I_t$. Then, since $\hat{\ell}_{i,t} = 0$, the WSU-UX update rule becomes
$$\pi_{i,t+1} = \pi_{i,t}\Big(1 - \eta\Big(0 - \pi_{I_t,t}\,\frac{\ell_{I_t,t}}{\tilde{\pi}_{I_t,t}}\Big)\Big) \ge \pi_{i,t} \ge 0.$$
Second, suppose $i = I_t$. Then
$$\pi_{i,t+1} = \pi_{i,t}\Big(1 - \eta\,\frac{\ell_{i,t}}{\tilde{\pi}_{i,t}}\,(1 - \pi_{i,t})\Big) \ge \pi_{i,t}\Big(1 - \frac{\eta}{\tilde{\pi}_{i,t}}\Big) \ge \pi_{i,t}\Big(1 - \frac{\eta K}{\gamma}\Big) \ge 0,$$
where the penultimate inequality comes from the fact that $\tilde{\pi}_{i,t} \ge \gamma/K$, since by the inductive assumption $\pi_{i,t} \ge 0$. The last follows from the assumption that $\eta K/\gamma \le 1/2$. Moreover, for the sum of probabilities we get
$$\sum_{i \in [K]} \pi_{i,t+1} = \sum_{i \in [K]} \pi_{i,t}\Big(1 - \eta\Big(\hat{\ell}_{i,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big)\Big) = \sum_{i \in [K]} \pi_{i,t} - \eta\Big(\sum_{i \in [K]} \pi_{i,t}\hat{\ell}_{i,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big) = 1.$$
Thus $\pi_{t+1}$ is valid. Since $\tilde{\pi}_{t+1}$ is a convex combination of two probability distributions, it is also a probability distribution, completing the inductive argument.

We are now ready to state the main theorem. The requirement that $T \ge K \ln K$ ensures that the precondition of Lemma 4.1 is satisfied for the settings of $\eta$ and $\gamma$ used.

Theorem 4.1. For $T \ge K \ln K$ and parameters $\eta = \big(\frac{\ln K}{4\sqrt{K}\,T}\big)^{2/3}$ and $\gamma = \big(\frac{K \ln K}{4T}\big)^{1/3}$, WSU-UX is incentive-compatible and yields regret $R \le 2(4T)^{2/3}(K \ln K)^{1/3}$.

The proof of the theorem will follow from a series of claims and lemmas. We first examine the moments of $\hat{\ell}_{i,t}$ and verify that it is an unbiased estimator of $\ell_{i,t}$; the proof is direct and in Appendix C.

Lemma 4.2 (Moments). Taking expectation with respect to the choice of expert at round t and keeping all else fixed, for all $i \in [K]$ and $t \in [T]$, $\mathbb{E}_{I_t \sim \tilde{\pi}_t}\big[\hat{\ell}_{i,t}\big] = \ell_{i,t}$. Furthermore,
$$\mathbb{E}_{I_t \sim \tilde{\pi}_t}\big[\hat{\ell}^2_{i,t}\big] = \frac{\ell^2_{i,t}}{\tilde{\pi}_{i,t}} \le \frac{1}{\tilde{\pi}_{i,t}}. \quad (3)$$

We next provide a second-order regret bound. It differs from the standard second-order regret bounds presented for bandit algorithms (see, e.g., Bubeck et al. (2012, Chapter 3)) because it relates the estimated regret of the learner to the second moment of the estimated loss of the best fixed expert in hindsight. The proof can be found in Appendix C.

Lemma 4.3 (Second-Order Bound). For WSU-UX, the probability vectors $\pi_1, \ldots, \pi_T$ and the estimated losses $\hat{\ell}_{i,t}$ for $i \in [K]$, $t \in [T]$ induce the following second-order bound:
$$\sum_{t \in [T]}\sum_{i \in [K]} \pi_{i,t}\hat{\ell}_{i,t} - \sum_{t \in [T]} \hat{\ell}_{i^*,t} \le \frac{\ln K}{\eta} + \eta \sum_{t \in [T]} \hat{\ell}^2_{i^*,t} + \eta \sum_{t \in [T]}\sum_{i \in [K]} \pi_{i,t}\hat{\ell}^2_{i,t},$$
where $i^* = \arg\min_{i \in [K]} \sum_{t \in [T]} \ell_{i,t}$.

Proof. Since $\pi_{T+1}$ is a valid probability distribution (Lemma 4.1), we have
$$1 \ge \pi_{i^*,T+1} = \pi_{i^*,T}\Big(1 - \eta\Big(\hat{\ell}_{i^*,T} - \sum_{j \in [K]} \pi_{j,T}\hat{\ell}_{j,T}\Big)\Big) = \frac{1}{K}\prod_{t \in [T]}\Big(1 - \eta\Big(\hat{\ell}_{i^*,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big)\Big).$$
Taking the logarithm of both sides, and using the fact that $\pi_{i,1} = 1/K$ for all $i \in [K]$, we get
$$0 \ge -\ln K + \sum_{t \in [T]} \ln\Big(1 - \eta\Big(\hat{\ell}_{i^*,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big)\Big). \quad (4)$$
We next show that for all $t \in [T]$ and any $i \in [K]$,
$$\eta\Big(\hat{\ell}_{i,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big) \le 1/2.$$
We distinguish two cases. First, if $i \ne I_t$, then the inequality holds since $\hat{\ell}_{i,t} = 0$ and as a result the expression becomes $-\eta\,\pi_{I_t,t}\hat{\ell}_{I_t,t} \le 0$. Second, if $i = I_t$, then the expression becomes
$$\eta\,\hat{\ell}_{I_t,t}\,(1 - \pi_{I_t,t}) = \eta\,\frac{\ell_{I_t,t}}{\tilde{\pi}_{I_t,t}}\,(1 - \pi_{I_t,t}) \le \frac{\eta}{\tilde{\pi}_{I_t,t}} \quad (\pi_{i,t} \ge 0,\ \ell_{i,t} \le 1)$$
$$\le \frac{\eta K}{\gamma} \quad (\tilde{\pi}_{i,t} \ge \gamma/K,\ \text{since } \pi_{i,t} \ge 0) \qquad \le \frac{1}{2} \quad (\text{by definition}).$$
We can now lower bound Equation (4) using the fact that for $z \le 1/2$ it holds that $\ln(1 - z) \ge -z - z^2$ (Lemma B.2):
$$0 \ge -\ln K - \eta \sum_{t \in [T]}\Big(\hat{\ell}_{i^*,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big) - \eta^2 \sum_{t \in [T]}\Big(\hat{\ell}_{i^*,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big)^2$$
$$\ge -\ln K - \eta \sum_{t \in [T]}\Big(\hat{\ell}_{i^*,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big) - \eta^2 \sum_{t \in [T]}\Big(\hat{\ell}^2_{i^*,t} + \Big(\sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big)^2\Big)$$
$$\ge -\ln K - \eta \sum_{t \in [T]}\Big(\hat{\ell}_{i^*,t} - \sum_{j \in [K]} \pi_{j,t}\hat{\ell}_{j,t}\Big) - \eta^2 \sum_{t \in [T]}\Big(\hat{\ell}^2_{i^*,t} + \sum_{j \in [K]} \pi_{j,t}\hat{\ell}^2_{j,t}\Big),$$
where the second inequality uses the fact that for a, b nonnegative, $(a - b)^2 \le a^2 + b^2$, and the last inequality uses Jensen's inequality for the function $f(x) = x^2$. Rearranging the latter and dividing both sides by $\eta$ gives the result.

With that we can complete the proof of Theorem 4.1.

Proof of Theorem 4.1. It follows from incentive compatibility of WSU that an expert maximizes the expected value of $\pi_{i,t+1}$ by minimizing the expected value of $\hat{\ell}_{i,t}$. From the definition of $\hat{\ell}_{i,t}$, it is easy to see that minimizing the expected value of $\hat{\ell}_{i,t}$ is equivalent to minimizing the expected value of $\ell_{i,t}$. By properness of $\ell$, this is achieved by truthfully reporting $p_{i,t} = b_{i,t}$.

We now show the regret bound. Taking expectations with respect to the choice of expert at round t for both sides of the inequality in Lemma 4.3, we get
$$\sum_{t \in [T]}\sum_{i \in [K]} \pi_{i,t}\,\mathbb{E}_{I_t \sim \tilde{\pi}_t}\big[\hat{\ell}_{i,t}\big] - \sum_{t \in [T]} \mathbb{E}_{I_t \sim \tilde{\pi}_t}\big[\hat{\ell}_{i^*,t}\big] \le \frac{\ln K}{\eta} + \eta \sum_{t \in [T]} \mathbb{E}_{I_t \sim \tilde{\pi}_t}\big[\hat{\ell}^2_{i^*,t}\big] + \eta \sum_{t \in [T]}\sum_{i \in [K]} \pi_{i,t}\,\mathbb{E}_{I_t \sim \tilde{\pi}_t}\big[\hat{\ell}^2_{i,t}\big].$$
Using Lemma 4.2, this gives us
$$\sum_{t \in [T]}\sum_{i \in [K]} \pi_{i,t}\ell_{i,t} - \sum_{t \in [T]} \ell_{i^*,t} \le \frac{\ln K}{\eta} + \eta \sum_{t \in [T]} \ell_{i^*,t}\,\frac{1}{\tilde{\pi}_{i^*,t}} + \eta \sum_{t \in [T]}\sum_{i \in [K]} \pi_{i,t}\,\frac{1}{\tilde{\pi}_{i,t}} \le \frac{\ln K}{\eta} + \frac{\eta K T}{\gamma} + 2\eta K T,$$
where the second inequality uses $\ell_{i^*,t} \le 1$, $\tilde{\pi}_{i,t} \ge \gamma/K$, and the fact that $\pi_{i,t} \le 2\tilde{\pi}_{i,t}$ for all $i \in [K]$, $t \in [T]$ (since $\gamma/K \ge 0$ and $\gamma \le 1/2$). Next, we rewrite $\pi_{i,t} = \frac{\tilde{\pi}_{i,t} - \gamma/K}{1 - \gamma}$, yielding
$$\frac{1}{1 - \gamma}\sum_{t \in [T]}\sum_{i \in [K]}\Big(\tilde{\pi}_{i,t} - \frac{\gamma}{K}\Big)\ell_{i,t} - \sum_{t \in [T]} \ell_{i^*,t} \le \frac{\ln K}{\eta} + \frac{\eta K T}{\gamma} + 2\eta K T.$$
Since $1 - \gamma < 1$ and $\ell(p_{i,t}, r_t) \le 1$, this can be relaxed to
$$\sum_{t \in [T]}\sum_{i \in [K]} \tilde{\pi}_{i,t}\,\ell(p_{i,t}, r_t) - \sum_{t \in [T]} \ell(p_{i^*,t}, r_t) \le \gamma T + \frac{\ln K}{\eta} + \frac{\eta K T}{\gamma} + 2\eta K T,$$
the left-hand side of which is the expected regret R, since the learner selects $I_t \sim \tilde{\pi}_t$. Making $\gamma T = \eta K T/\gamma$ by setting $\gamma = \sqrt{\eta K}$, and setting $\eta = \big(\frac{\ln K}{4\sqrt{K}\,T}\big)^{2/3}$, we get the regret result. [Footnote 6: The last derivation requires that $\eta \le 1/K$, which is true for large enough horizons $T \ge K \ln K$.]

As in the full information setting, a doubling trick can be applied if T is unknown (Appendix C.2). We note that, unlike the full information setting in which WSU achieves the optimal regret bound for general loss functions, our regret bound in Theorem 4.1 is not as good as what can be achieved without incentive compatibility. Examining our analysis, one can see that if the loss of the best fixed expert in hindsight were zero at each round, then the regret guarantee achieved by WSU-UX would be the same as EXP3, i.e., $O(\sqrt{T \ln K})$. Closing this gap via a tighter analysis of WSU-UX or via a new incentive-compatible algorithm is a compelling question for future work.

5. Forward-Looking Experts

So far we have assumed that the experts are myopic, aiming at time t to optimize their influence on the algorithm only at time t + 1 with no regard for future rounds. It is natural to ask whether it is possible to design learning algorithms that satisfy no regret while incentivizing truthful reports from forward-looking experts who care about their influence $\pi_{i,t'}$ at all $t' > t$. Neither WSU nor WSU-UX achieves this goal; see the appendix for examples that illustrate why. [Footnote 7: It is worth noting that in these examples, an expert can gain only a negligible amount from misreporting; it is an open question whether WSU satisfies some notion of $\epsilon$-incentive compatibility.]

In order to derive an online learning algorithm that is incentive-compatible for forward-looking experts, we build on work by Witkowski et al. (2018), who studied a forecasting competition setting in which agents make predictions about a series of independent events, competing for a single prize. Unlike in our setting, their goal was to derive an incentive-compatible mechanism for choosing the winning agent; they are agnostic to how the elicited forecasts are aggregated.

They defined a mechanism, the Event-Lotteries Forecaster Selection Mechanism (ELF), in which, for every predicted event $\tau$, every agent i is assigned a probability of being the event winner based on the quality of their prediction. The winner of the competition is the agent who wins the most events. We build on this idea to define an online learning algorithm, ELF-X, for the full information setting. Like WSU, ELF-X incorporates WSWM payments, but in a different way. The distribution $\pi_t$ at time t is defined as the distribution over experts output by the following randomized process:

1. At each round $\tau \in [t]$, pick agent i as the winner $x_\tau$ with probability $\frac{1}{K}\big(1 - \ell_{i,\tau} + \frac{1}{K}\sum_{j \in [K]} \ell_{j,\tau}\big)$.

2. Select $\arg\max_{i \in [K]} \sum_{\tau \in [t]} \mathbb{1}(x_\tau = i)$, the expert who won the most events, breaking ties uniformly.

It can be shown by a similar argument to that of Witkowski et al. (2018) that ELF-X is incentive-compatible. The proof, along with a formal definition of incentive compatibility for forward-looking experts, is in the appendix.

Theorem 5.1. ELF-X is incentive-compatible for forward-looking experts.

While proving that ELF-X is no-regret remains an open problem, in the following section we present experimental results suggesting that its regret is sublinear in T in practice.
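As an illustration of the two-step process above (our own sketch, with helper names ours), the following Python code draws samples from the ELF-X distribution; since $\pi_t$ has no convenient closed form, it is approximated by repeated sampling, as we also do for ELF-X-Aggr in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic_loss(p, r):
    # Proper loss bounded in [0, 1].
    return (p - r) ** 2

def elf_x_sample_winner(loss_matrix):
    """One draw from the ELF-X distribution pi_t.
    loss_matrix has shape (t, K): loss_matrix[tau, i] = loss of expert i on event tau.
    Step 1: for every past event, pick an event winner with probability
            (1/K) * (1 - loss_i + average loss over experts).
    Step 2: return the expert who won the most events, breaking ties uniformly."""
    t, K = loss_matrix.shape
    wins = np.zeros(K, dtype=int)
    for tau in range(t):
        probs = (1.0 - loss_matrix[tau] + loss_matrix[tau].mean()) / K
        # probs is a valid distribution: entries are nonnegative and sum to 1.
        wins[rng.choice(K, p=probs)] += 1
    best = np.flatnonzero(wins == wins.max())
    return rng.choice(best)

def elf_x_distribution(loss_matrix, n_samples=2000):
    """Monte Carlo estimate of pi_t."""
    K = loss_matrix.shape[1]
    counts = np.bincount([elf_x_sample_winner(loss_matrix) for _ in range(n_samples)],
                         minlength=K)
    return counts / n_samples

# Usage: three experts, two past events with outcomes r = 1 and r = 0.
reports = np.array([[0.9, 0.6, 0.2],
                    [0.3, 0.5, 0.8]])
outcomes = np.array([1, 0])
losses = quadratic_loss(reports, outcomes[:, None])
print(elf_x_distribution(losses))
```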
6. Experiments

In this section, we empirically evaluate the performance of our proposed incentive-compatible algorithms, WSU and WSU-UX, compared with standard no-regret algorithms. We also evaluate the performance of ELF-X, which is incentive-compatible for non-myopic experts. Our code and the datasets we use are publicly available online (code: https://github.com/charapod/noregr-and-ic; datasets: https://github.com/fivethirtyeight/nfl-elo-game).

We ran each algorithm on publicly available datasets from a forecasting competition run by FiveThirtyEight (https://projects.fivethirtyeight.com/2019-nfl-forecasting-game/) in which users (henceforth called forecasters) make predictions about the outcomes of National Football League (NFL) games. Before each game, FiveThirtyEight releases information on the past performance of the two opposing teams, and forecasters provide probabilistic predictions about which team will win the game. FiveThirtyEight maintains a public leaderboard with the most accurate forecasters, updated after each game. The datasets for the 2018–2019 and 2019–2020 seasons each include all forecasters' predictions, labeled with the forecaster's unique id, information about the corresponding game, and the game's outcome. Each NFL season has a total of 267 games, so in our setting, T = 267. For 2018–2019 (respectively, 2019–2020), while 15,702 (15,140) forecasters participated, only 302 (375) made predictions for every game. In order to reduce variance, for each value of K, we sampled 10 groups of K forecasters from the 302 (respectively, 375), and for each such group, ran each algorithm 50 times.

We evaluate performance using quadratic loss. We compare the cumulative loss of each algorithm against the cumulative loss of the best fixed forecaster in hindsight. For the full information setting, we compare WSU and ELF-X against Hedge, which achieves optimal regret guarantees since the quadratic loss is exp-concave, and MWU, which is more similar in form to WSU, in order to evaluate whether anything is lost in terms of regret when incentive compatibility is achieved. For the partial information setting, we compare WSU-UX against EXP3.
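For concreteness, the quantity we report can be computed as in the following self-contained sketch (our own illustration, with synthetic inputs standing in for the FiveThirtyEight forecasts and helper names ours): the learner's cumulative expected loss under $\pi_t$ minus the cumulative loss of the best fixed forecaster in hindsight.

```python
import numpy as np

def quadratic_loss(p, r):
    return (p - r) ** 2

def empirical_regret(update_rule, reports, outcomes, **params):
    """Run a full-information weight-update algorithm over a season of reports
    (shape (T, K)) and outcomes (shape (T,)), and return the regret of the
    learner's expected loss against the best fixed forecaster in hindsight."""
    T, K = reports.shape
    pi = np.ones(K) / K
    learner_loss, expert_loss = 0.0, np.zeros(K)
    for t in range(T):
        losses = quadratic_loss(reports[t], outcomes[t])
        learner_loss += np.dot(pi, losses)   # expected loss when sampling an expert from pi
        expert_loss += losses
        pi = update_rule(pi, losses, **params)
    return learner_loss - expert_loss.min()

# Example update rule: WSU in its multiplicative form (Equation (2)).
def wsu(pi, losses, eta):
    return pi * (1.0 - eta * (losses - np.dot(pi, losses)))

# Toy usage with random data in place of the FiveThirtyEight forecasts.
rng = np.random.default_rng(0)
T, K = 267, 10
reports = rng.uniform(size=(T, K))
outcomes = rng.integers(0, 2, size=T)
print(empirical_regret(wsu, reports, outcomes, eta=np.sqrt(np.log(K) / T)))
```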
For each full information algorithm, we run both the variant in which a single expert is selected at each timestep and the variant in which the learner outputs a weighted combination of expert reports (labeled *-Aggr). For ELF-X-Aggr, since $\pi_t$ cannot be computed in closed form, we approximate it via sampling.

We present the results of our experiments on the 2018–2019 dataset in Figure 1; the results on the 2019–2020 dataset are in Appendix E.1, and exhibit similar trends. We note that lines correspond to average regret (across all samples of experts and all repetitions), while the error bands correspond to the 20th and 80th percentiles; this leads to much smaller error bands for larger values of K, since the specific sampling of experts has less influence on regret for large K.

Validating our theoretical results, WSU performs almost identically (in terms of the dependence on both K and T) to MWU when fed the same set of reports; this, of course, does not take into account that MWU is not incentive-compatible and may lead to misreports in practice, potentially degrading predictions. Interestingly, we also see that WSU-Aggr performs almost identically to MWU-Aggr. This suggests that the performance of WSU-Aggr is considerably better than the bound in Section 3 implies. It is an interesting open question to see whether better regret guarantees can be proved for WSU-Aggr, perhaps with respect to the best fixed distribution of experts. As expected, both WSU and MWU are outperformed by Hedge, which achieves optimal regret bounds for squared loss but no incentive guarantees.

ELF-X appears to exhibit diminishing regret on this dataset, particularly for K = 5. However, ELF-X and ELF-X-Aggr perform worse than WSU and WSU-Aggr respectively when fed the same input, particularly when the number of experts is large. Although ELF-X obtains a stronger incentive guarantee, the violations of incentive compatibility for forward-looking experts exhibited by WSU are very small in our examples. In practice, we expect that WSU is a superior choice to ELF-X when balancing regret and incentive properties, even for forward-looking experts.

For the bandit setting, quite encouragingly, we see that the performance of WSU-UX is only slightly worse than that of EXP3, and appears significantly better than the $O(T^{2/3})$ regret bound in Section 4 would suggest. This could be a byproduct of our analysis not being tight, and it remains an open question whether this bound can be improved.

The experiments presented in this section focus on settings with relatively small horizons T, since an NFL season has only 267 matches. In Appendix E.2, we present our results (which also validate our theoretical analysis) for Monte Carlo simulations for larger horizons.

Figure 1. Comparisons on the 2018–2019 FiveThirtyEight NFL dataset. Top: the full-information setting with $\hat{p}_t$ the prediction of a single expert chosen according to $\pi_t$. Middle: the full-information setting with $\hat{p}_t = \sum_{i \in [K]} \pi_{i,t} p_{i,t}$. Bottom: the partial information setting.

7. Conclusion and Open Questions

We studied the problem of online learning with strategic experts. We introduced algorithms that are simultaneously no-regret and incentive-compatible, and assessed their performance experimentally on data from FiveThirtyEight. Several open questions arise.
In the full-information setting, there is the question of whether an incentive-compatible algorithm exists with better regret bounds for the special case of exp-concave bounded proper loss functions. For the bandit setting, there is the question of whether there exist incentive-compatible algorithms that bridge the gap between the regret of WSU-UX and that of EXP3, and whether a better regret guarantee could be proved for WSU-UX via a tighter analysis. There is also the question of whether ELF-X is indeed no-regret, as our experimental results might suggest. More broadly, the most important research question that we believe needs to be addressed in online learning from strategic agents is the quantification of the tradeoff between incentive-incompatibility and standard learning guarantees, and how to balance these in practice.

Acknowledgments

The authors are grateful to Yiling Chen, Ariel Procaccia, Rob Schapire, Vasilis Syrgkanis, and Jens Witkowski for helpful discussions at different stages of this work, and to the anonymous reviewers for their comments and suggestions. Part of the research was conducted while Rupert Freeman, David M. Pennock, and Chara Podimata were at Microsoft Research NYC as a postdoc, researcher, and intern respectively. Chara Podimata is also supported in part under grant No. CCF-1718549 of the National Science Foundation and the Harvard Data Science Initiative.

References

Abernethy, J. and Frongillo, R. M. A collaborative mechanism for crowdsourcing prediction problems. In Advances in Neural Information Processing Systems, 2011.

Abernethy, J., Chen, Y., and Vaughan, J. W. Efficient market making via convex optimization, and a connection to online learning. ACM Transactions on Economics and Computation, 1(2):12:1–12:38, 2013.

Arora, S., Hazan, E., and Kale, S. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.

Even-Dar, E., Kearns, M., Mansour, Y., and Wortman, J. Regret to the best vs. regret to the average. Machine Learning Journal, 72:21–37, 2008.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Frongillo, R., Della Penna, N., and Reid, M. D. Interpreting prediction markets: A stochastic approach. In Advances in Neural Information Processing Systems, 2012.

Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Hu, J. and Storkey, A. Multi-period trading prediction markets with connections to machine learning. In International Conference on Machine Learning, pp. 1773–1781, 2014.

Kivinen, J. and Warmuth, M. K. Averaging expert predictions. In European Conference on Computational Learning Theory, pp. 153–167. Springer, 1999.
Lambert, N. S., Langford, J., Wortman, J., Chen, Y., Reeves, D., Shoham, Y., et al. Self-financed wagering mechanisms for forecasting. In Proceedings of the 9th ACM Conference on Electronic Commerce, pp. 170–179, 2008.

Lambert, N. S., Langford, J., Vaughan, J. W., Chen, Y., Reeves, D. M., Shoham, Y., and Pennock, D. M. An axiomatic characterization of wagering mechanisms. Journal of Economic Theory, 156:389–416, 2015.

Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

McCarthy, J. Measures of the value of information. Proceedings of the National Academy of Sciences of the United States of America, 42(9):654, 1956.

Orabona, F. and Pál, D. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, pp. 577–585, 2016.

Reid, M. D. and Williamson, R. C. Surrogate regret bounds for proper losses. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 897–904. ACM, 2009.

Roughgarden, T. and Schrijvers, O. Online prediction with selfish experts. In Advances in Neural Information Processing Systems, pp. 1300–1310, 2017.

Savage, L. J. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.

Tetlock, P. E. and Gardner, D. Superforecasting: The Art and Science of Prediction. Crown, 2015.

Vovk, V. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.

Vovk, V. G. Aggregating strategies. In Proceedings of Computational Learning Theory, 1990.

Witkowski, J., Freeman, R., Vaughan, J. W., Pennock, D. M., and Krause, A. Incentive-compatible forecasting competitions. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.