# Peer Prediction for Learning Agents

Shi Feng, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. fengs19@mails.tsinghua.edu.cn

Fang-Yi Yu, Department of Computer Science, George Mason University, Fairfax, VA, USA. fangyiyu@gmu.edu

Yiling Chen, John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. yiling@seas.harvard.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Peer prediction refers to a collection of mechanisms for eliciting information from human agents when direct verification of the obtained information is unavailable. They are designed to have a game-theoretic equilibrium where everyone reveals their private information truthfully. This result holds under the assumption that agents are Bayesian and that each of them adopts a fixed strategy across all tasks. Human agents, however, are observed in many domains to exhibit learning behavior in sequential settings. In this paper, we explore the dynamics of sequential peer prediction mechanisms when participants are learning agents. We first show that the notion of no regret alone for the agents' learning algorithms cannot guarantee convergence to the truthful strategy. We then focus on a family of learning algorithms where strategy updates only depend on agents' cumulative rewards and prove that agents' strategies in the popular Correlated Agreement (CA) mechanism converge to truthful reporting when they use algorithms from this family. This family of algorithms is not necessarily no-regret, but it includes several familiar no-regret learning algorithms (e.g., multiplicative weight update and Follow the Perturbed Leader) as special cases. Simulation of several algorithms in this family, as well as the ϵ-greedy algorithm, which is outside of this family, shows convergence to the truthful strategy in the CA mechanism.

1 Introduction

A fundamental challenge in many domains is to elicit high-quality information from people when directly verifying the acquired information is not feasible, either because the ground truth is not available or because it is too costly to obtain. Notable settings include asking people to label data for machine learning, having students perform peer grading in education, and soliciting customer feedback for products and services. The peer prediction literature has made impressive progress on this challenge in the past two decades, with many mechanisms that have desirable incentive properties developed for this problem [20, 19, 27, 5, 22, 8, 14, 25, 17, 16, 21, 18, 23, 15]. The term peer prediction refers to a collection of reward mechanisms that solicit information from human agents and reward each agent solely based on how the agent's reported information compares with that of the other agents, without having access to the ground truth. Under some assumptions, many peer prediction mechanisms [20, 19, 27, 22, 8, 16, 21] guarantee that every agent truthfully reporting their information is a game-theoretic equilibrium, and the more recent multi-task peer prediction mechanisms [5, 14, 25, 17, 18, 23, 15] further ensure that agents receive the highest expected payoff at the truthful equilibrium, compared with other strategy profiles. While achieving truthful reporting as a highest-payoff equilibrium is a victory to declare for this challenging without-verification setting, there are caveats associated with adopting the notion of equilibrium as a solution concept.
The equilibrium results rely on the assumption that participants are fully rational Bayesian agents. Equilibrium is a static notion and does not address how agents, who act independently, come to play their equilibrium strategies. Moreover, the equilibrium results of multi-task peer prediction mechanisms heavily depend on a consistent strategy assumption, that is, each agent is assumed to adopt a fixed strategy across all tasks in which she participates. Altogether, these assumptions exclude the possibility that agents may explore and learn from previous experience, a behavior that is not only commonly observed in practice but has also been modeled in studying other strategic settings [3, 6, 4].

This paper is the first theoretical study of the dynamics of sequential peer prediction mechanisms when participants are learning agents. The main question that we explore is whether and when, in sequential peer prediction, learning agents will converge to all playing the truthful reporting strategy. We first consider agents adopting no-regret learning algorithms and prove that the notion of no regret alone cannot guarantee convergence to truthful reporting. We then define a natural family of reward-based learning algorithms where strategy updates only depend on agents' cumulative rewards. While algorithms in this family are not necessarily no-regret (e.g., the Follow the Leader algorithm), this family includes some familiar no-regret learning algorithms, including the Multiplicative Weight Update and the Follow the Perturbed Leader algorithms. Our main result shows that, for the binary-signal setting, agents' strategies in the popular Correlated Agreement (CA) mechanism [5] converge to truthful reporting when agents use any algorithm from this family. To prove the result, we show the process has a self-fulfilling property: once Alice and Bob have large accumulated rewards for truth-telling, they are more likely to play truth-telling, which in turn results in even larger accumulated rewards. Theoretically, we carefully partition the process into three stages, bad events, intermediate states, and good events, illustrated in fig. 1, and use tools from martingale theory to argue about the progress of the process. Finally, we simulate the strategy dynamics in the CA mechanism for several algorithms in this family as well as for the ϵ-greedy algorithm, which does not belong to this family. We observe convergence to truthful reporting for all algorithms considered in our simulation, suggesting an interesting future direction of characterizing all learning algorithms that converge to truthful reporting.

Related Works

This paper relates to two lines of work, information elicitation and mechanisms for learning agents.

Information Elicitation Mechanisms. The literature on information elicitation without verification focuses on capturing the strategic aspect of human agents. In multi-task settings, Dasgupta and Ghosh [5] proposed a seminal informed truthful mechanism, the Correlated Agreement (CA) mechanism, for binary positively correlated signals. A series of works then relaxed the binary-signal and positive-correlation assumptions [25, 17, 23, 15]. Additionally, Zheng et al. [28] study the limitations of information elicitation in the multi-task setting. However, all of the above works assume agents use consistent strategies that are identical across all tasks. The consistent strategy assumption excludes the possibility of agent learning. Our work removes the consistent strategy assumption and explicitly considers learning agents.
We theoretically prove truthful convergence of the CA mechanism when agents use algorithms from a family of reward-based online learning algorithms. Our work of considering learning agents can be viewed as a way of testing the robustness of information elicitation mechanisms with respect to deviations from the rational Bayesian agent model. From this perspective, Shnayder et al. [26] is closely related to ours. They consider sequential information elicitation and empirically study whether agents using replicator dynamics converge to truth-telling in the CA mechanism and several other mechanisms. Our work theoretically proves that, beyond replicator dynamics, learning agents converge to truth-telling in the CA mechanism when they use a general family of learning algorithms. Additionally, Schoenebeck et al. [24] designed an information elicitation mechanism that is robust against a small fraction of adversarial agents.

Mechanisms for Learning Agents. Several works in economics and computer science try to design mechanisms for learning agents, rather than for rational, Bayesian ones. Braverman et al. [3] studied pricing mechanisms for learning agents with no external regret. Their work was generalized by Deng et al. [6] to repeated Stackelberg games in full-information settings. Camara et al. [4] further proposed counterfactual internal regret (CIR) together with a no-CIR assumption, which was proved to be a sufficient behavioral assumption for no-regret principal mechanism design in repeated stage games. However, all of these works focus on a single agent or full-information games, while peer prediction is an incomplete-information game with multiple agents. Finally, our goal is slightly different from that of most sequential mechanism design. Instead of maximizing the mechanism designer's utility, our goal is to incentivize truthful reporting from agents.

2 Peer Prediction Settings

For simplicity we consider two agents¹, Alice and Bob, who work on a sequence of tasks indexed by $t \ge 1$. In round $t$, both agents work on task $t$: Alice receives a signal $X_t = x_t \in \{0,1\}$ and Bob a signal $Y_t = y_t \in \{0,1\}$, where $X_t$ and $Y_t$ denote random variables and $x_t$ and $y_t$ are their realizations. Then Alice and Bob report $\hat{X}_t = \hat{x}_t$ and $\hat{Y}_t = \hat{y}_t$ in $\{0,1\}$. We define $X^t = (X_s : 1 \le s \le t) \in \{0,1\}^t$ and $\hat{X}^t = (\hat{X}_s : 1 \le s \le t) \in \{0,1\}^t$ to denote Alice's signal profile and report profile up to round $t$ respectively, and define $Y^t$ and $\hat{Y}^t$ for Bob similarly. We use $X$, $\hat{X}$, $Y$, and $\hat{Y}$ for the complete signal and report profiles. Additionally, we assume the signals are generated from a distribution $P$ that satisfies the following assumptions:

Assumption 2.1 (A priori similar tasks [5]). Each pair of signals is generated independently and identically (i.i.d.): there exists a distribution $P_{X,Y}$ over $\{0,1\}^2$ such that $(X_t, Y_t) \sim P_{X,Y}$ for any $t \in \mathbb{N}^+$. Moreover, we assume the distribution has full support, $P_{X,Y}(x, y) > 0$ for all $x, y \in \{0,1\}$.

Assumption 2.2 (Positively correlated signals). The distribution $P_{X,Y}$ is positively correlated: $\min\{P_{X,Y}(1,1), P_{X,Y}(0,0)\} > \max\{P_{X,Y}(1,0), P_{X,Y}(0,1)\}$.
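To make the signal model concrete, the following minimal Python sketch samples i.i.d. signal pairs from a joint distribution that satisfies Assumptions 2.1 and 2.2. The particular probabilities are the ones used later in the simulations of Section 5; the function name `sample_signals` is illustrative and not part of the mechanism.

```python
import numpy as np

# Joint distribution P_{X,Y} over {0,1}^2, indexed as P_XY[x, y].
# Positive correlation (Assumption 2.2): min{P(0,0), P(1,1)} > max{P(0,1), P(1,0)}.
# These values match the distribution used in the simulations of Section 5.
P_XY = np.array([[0.4, 0.2],
                 [0.2, 0.4]])

def sample_signals(num_rounds, rng):
    """Draw i.i.d. signal pairs (x_t, y_t) ~ P_{X,Y} for t = 1..num_rounds (Assumption 2.1)."""
    flat = P_XY.flatten()                        # probabilities of (0,0), (0,1), (1,0), (1,1)
    idx = rng.choice(4, size=num_rounds, p=flat)
    return idx // 2, idx % 2                     # Alice's signals, Bob's signals

rng = np.random.default_rng(0)
xs, ys = sample_signals(10, rng)
```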
Now we introduce multi-task peer prediction mechanisms and sequential peer prediction mechanisms, and their relation. We will focus on the sequential setting. Multi-task peer prediction mechanisms work on a fixed number of tasks. Formally, a multi-task peer prediction mechanism on $k$ tasks is a pair of payment functions $M : \{0,1\}^k \times \{0,1\}^k \to [-1,1]^2$. For instance, the (multi-task) correlated agreement mechanism (CA mechanism)² [5, 25, 26] is
$$M^{CA}(\hat{x}, \hat{y}) = \big(\mathbb{I}[\hat{x}_2 = \hat{y}_2] - \mathbb{I}[\hat{x}_2 = \hat{y}_1],\; \mathbb{I}[\hat{y}_2 = \hat{x}_2] - \mathbb{I}[\hat{y}_2 = \hat{x}_1]\big) \quad \text{for all } \hat{x}, \hat{y} \in \{0,1\}^2.$$
Intuitively, the CA mechanism rewards agreement on the same task and punishes agreement on uncorrelated tasks.

A sequential information elicitation mechanism is a sequence of payment functions $M = \{M_t : t \ge 1\}$ where $M_t : \{0,1\}^t \times \{0,1\}^t \to [-1,1]^2$ for all $t$. After Alice and Bob report in round $t$, the mechanism computes $(r_t, s_t) := M_t(\hat{x}^t, \hat{y}^t)$ and pays $r_t$ to Alice and $s_t$ to Bob. Here we assume $M_t$ can only depend on a constant number $k$ of the most recent rounds of reports, so that $M_t(\hat{x}^t, \hat{y}^t) = M_t(\hat{x}_{t-k+1}, \ldots, \hat{x}_t, \hat{y}_{t-k+1}, \ldots, \hat{y}_t)$ for all $t$, $\hat{x}^t$, and $\hat{y}^t$, and we call such an $M$ a rank-$k$ mechanism. For instance, the (multi-task) CA mechanism can be adopted as a sequential rank-2 information elicitation mechanism: at round $t$, the payment is
$$M^{CA}_t(\hat{x}^t, \hat{y}^t) = \big(\mathbb{I}[\hat{x}_t = \hat{y}_t] - \mathbb{I}[\hat{x}_t = \hat{y}_{t-1}],\; \mathbb{I}[\hat{y}_t = \hat{x}_t] - \mathbb{I}[\hat{y}_t = \hat{x}_{t-1}]\big) \tag{1}$$
where $\hat{x}_0$ and $\hat{y}_0$ are set to $0$. Similarly, we say a sequential information elicitation mechanism $M = (M_t)_{t \ge 1}$ is a sequential version of a multi-task information elicitation mechanism $\bar{M}$ if $M_t(\hat{x}^t, \hat{y}^t) = \bar{M}(\hat{x}_{t-k+1}, \ldots, \hat{x}_t, \hat{y}_{t-k+1}, \ldots, \hat{y}_t)$ for all $\hat{x}^t$, $\hat{y}^t$ and $t \ge k$; that is, the payment at round $t \ge k$ is $\bar{M}$ applied to the latest $k$ reports. Conversely, a sequential information elicitation mechanism can be seen as a sequence of multi-task information elicitation mechanisms.

¹For more than two agents, we can partition the agents into groups of two and run our mechanisms within each group when the number of agents is even; all our results still hold. When the number of agents is odd, we can pair the unpaired agent with a reference agent whose payment is not affected by the unpaired one.

²While the CA mechanism can be defined for non-binary settings and does not require positive correlation [25], under Assumption 2.2 the CA mechanism reduces to eq. (1), which was first proposed in [5]. Finally, note that when the number of tasks is greater than two, we can compute the payment based on the last two tasks or on two random tasks, since agents use consistent strategies and Assumption 2.1 holds.

Now we formally define agents' strategies. Due to symmetry, we introduce notation for Alice and omit Bob's. Given an information elicitation mechanism $M$, at round $t$, Alice observes her signal $x_t$ and decides on her report $\hat{x}_t$. Thus, Alice has four options (pure strategies): 1) opt1: report the private signal truthfully, 2) opt2: flip the private signal, 3) opt3: report 1 regardless of the signal, and 4) opt4: report 0 regardless of the signal. We call opt3 and opt4 uninformative strategies. We use $\mathrm{opt}^X_t$ to denote Alice's pure strategy, and $r_t$ for her payoff at round $t$. At each round $t$, Alice knows her previous signals $x^t \in \{0,1\}^t$, her pure strategies $\mathrm{opt}^X_1, \ldots, \mathrm{opt}^X_{t-1}$, and Bob's reports $\hat{y}^{t-1}$, so we use $F_t = (x^t, \mathrm{opt}^X_1, \ldots, \mathrm{opt}^X_{t-1}, \hat{y}^{t-1})$ to denote Alice's knowledge at round $t$. Thus, Alice's mixed strategy at round $t$ is a stochastic mapping $\sigma^X_t$ from $F_t$ to $\{\mathrm{opt}_1, \mathrm{opt}_2, \mathrm{opt}_3, \mathrm{opt}_4\}$. We will abuse notation and also use $\sigma^X_t = \mathrm{opt}^X_t$ to represent the realized pure strategy. Finally, a learning algorithm of Alice is a mapping from an information elicitation mechanism $M$ to her strategies.
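As a concrete illustration of eq. (1) and the four pure strategies, here is a minimal Python sketch; it is not the paper's code, and the names `STRATEGIES` and `ca_payment` are illustrative.

```python
# Sketch of the sequential CA payments in eq. (1) and the four pure strategies opt1..opt4.

# Pure strategies: map a private signal in {0,1} to a report in {0,1}.
STRATEGIES = {
    1: lambda x: x,        # opt1: report truthfully
    2: lambda x: 1 - x,    # opt2: flip the signal
    3: lambda x: 1,        # opt3: always report 1
    4: lambda x: 0,        # opt4: always report 0
}

def ca_payment(x_hat_t, y_hat_t, x_hat_prev, y_hat_prev):
    """Round-t payments (r_t, s_t) of the sequential CA mechanism, eq. (1).

    Rewards agreement with the peer on the current task and punishes agreement
    with the peer's report on the previous task; the round-0 reports are set to 0.
    """
    r_t = int(x_hat_t == y_hat_t) - int(x_hat_t == y_hat_prev)  # Alice
    s_t = int(y_hat_t == x_hat_t) - int(y_hat_t == x_hat_prev)  # Bob
    return r_t, s_t

# Example: both agents agree on the current task but not on the previous one.
print(ca_payment(x_hat_t=1, y_hat_t=1, x_hat_prev=0, y_hat_prev=0))  # (1, 1)
```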
Strong Truthfulness for Rational and Bayesian Agents

Previous works on information elicitation aim to ensure that truth-telling opt1 is the best strategy for rational and strategic agents. In particular, a mechanism is strongly truthful if both Alice and Bob reporting truthfully is a Bayesian Nash Equilibrium (BNE) and they receive strictly higher payments at this BNE than at any other non-permutation BNE. We present the formal definitions in the appendix. Informally, in a permutation BNE, every agent's strategy in each round is a permutation (bijection) from her signals to her reports. However, the equilibrium results of previous mechanisms not only require Assumption 2.1 but further assume that agents use consistent strategies. Specifically, Alice uses a consistent strategy if there is a fixed distribution on $\{\mathrm{opt}_1, \mathrm{opt}_2, \mathrm{opt}_3, \mathrm{opt}_4\}$ such that $\mathrm{opt}^X_t$ is drawn from this distribution in every round, independently of her private signals on other tasks and of the round number. For instance, when Alice and Bob are Bayesian and use consistent strategies, under Assumptions 2.1 and 2.2, Dasgupta and Ghosh [5] show that the CA mechanism in eq. (1) is strongly truthful. Intuitively, the positive correlation in Assumption 2.2 guarantees that truthful reporting maximizes the chance of agreeing with the peer on the same task while avoiding agreement with the peer's reports on other tasks. Furthermore, in Appendix B we show that when agents use consistent strategies and Assumptions 2.1 and 2.2 hold, the CA mechanism has only three types of Bayesian Nash equilibria, at which both agents 1) play truth-telling opt1, 2) flip the signal opt2, or 3) generate uninformative reports (a mixture of opt3 and opt4).

Truthful Convergence for Learning Agents

However, as we consider agents who use a family of online learning algorithms to decide their strategies, standard solution concepts like Bayesian Nash equilibrium no longer apply. Additionally, online learning algorithms often involve exploration, so we cannot hope that agents will always use the truth-telling strategy. For learning agents, our goal is to test whether existing mechanisms can ensure that agents converge to truthful reporting when they deploy learning behavior that goes beyond obliviously consistent strategies. We now formalize the convergence of algorithms to truthful reporting. Because we want to elicit information without verification, it is information-theoretically impossible to separate the permutation equilibrium, where all agents play opt2, from the truthful equilibrium, where all agents play opt1, without any additional information [17]. However, with one additional bit of information on whether the prior of 0 is larger than that of 1, we can tell these two equilibria apart. We hence define convergence to truthful reporting as the limits of both $\mathrm{opt}^X_t$ and $\mathrm{opt}^Y_t$ being truth-telling (opt1) or both being flipping (opt2). Note that Definition 2.3 below requires almost sure convergence, which is a very strong notion of convergence.

Definition 2.3. An information elicitation mechanism $M$ achieves truthful convergence for agents using algorithms $A_1$ and $A_2$ respectively if and only if, almost surely, both sequences of pure strategies converge to truth-telling or both converge to flipping the reports:
$$\Pr\left[\left(\lim_{t\to+\infty} \mathrm{opt}^X_t = \lim_{t\to+\infty} \mathrm{opt}^Y_t = \mathrm{opt}_1\right) \,\vee\, \left(\lim_{t\to+\infty} \mathrm{opt}^X_t = \lim_{t\to+\infty} \mathrm{opt}^Y_t = \mathrm{opt}_2\right)\right] = 1.$$

3 Online Learning Algorithms

In this section, we explore candidates to model agents' learning behavior, replacing the Bayesian agents with consistent strategies assumed in the literature. We first show that the conventional no-regret assumption is a necessary but not a sufficient condition for truthful convergence in Section 3.1.
Then in Section 3.2, we introduce a family of reward-based online learning algorithms to model agents' learning behavior, and show that this family contains several common no-regret algorithms as special cases.

3.1 No-regret Online Learning Algorithms

We now investigate the relationship between no regret and truthful convergence. First, we show that general no-regret algorithms may not ensure truthful convergence (Theorem 3.1). However, we show the converse is almost true (Theorem 3.2): if truthful convergence happens, the agents have no regret when the sequential mechanism is a sequential version of a strongly truthful mechanism.

Given a sequential information elicitation mechanism $M = (M^X_t, M^Y_t)_{t \ge 1}$, signals $x, y$, and reports $\hat{x}, \hat{y}$, we define $r_{i,t} = M^X_t(\mathrm{opt}_i(x_1), \ldots, \mathrm{opt}_i(x_t), \hat{y}^t)$ to be the payoff Alice would receive at round $t$ if she used strategy $\mathrm{opt}_i$ in every round while Bob's choices were unchanged. Then Alice's regret is $\mathrm{Reg}^X(T) = \max_i \sum_{t \le T} r_{i,t} - \sum_{t \le T} r_t$. Finally, we say that Alice's and Bob's online learning algorithms are no regret (on $M$) if $\mathbb{E}[\mathrm{Reg}^X(T)] = \mathbb{E}[\mathrm{Reg}^Y(T)] = o(T)$ over the randomness of the signals and the algorithms, and we say Alice and Bob are no regret for short.

One may hope that no regret as a behavioral assumption on agents is sufficient for achieving a desirable outcome in a mechanism. However, the following theorem shows that we cannot have an information elicitation mechanism that achieves truthful convergence for all no-regret agents.

Theorem 3.1. For any sequential information elicitation mechanism $M$ of rank $k \in \mathbb{N}$, there exist no-regret algorithms for Alice and Bob such that $M$ cannot achieve truthful convergence.

The main idea of the proof is that the no-regret assumption cannot prevent Alice and Bob from colluding. In our counterexample, Alice and Bob decide on a no-regret sequence of reports regardless of their signals once the mechanism is announced. Technically, we use a probabilistic method to show the existence of a deterministic and no-regret sequence of strategies $(\mathrm{opt}^X_t, \mathrm{opt}^Y_t)_{t \ge 1}$ that consists of reporting 1 or 0 regardless of private signals, i.e., opt3 or opt4. The formal proof is in Appendix C.1.

The notion of truthful convergence in Definition 2.3 provides an ideal truthfulness guarantee to the mechanism designer. Here we show that truthful convergence also ensures no regret for the agents when the sequential mechanism is a sequential version of a strongly truthful one-shot multi-task mechanism, that is, when the one-shot mechanism admits truthful reporting as a highest-payoff BNE. For instance, if a pair of algorithms exhibits truthful convergence on the sequential CA mechanism (eq. (1)), they are also no regret (on the game). Intuitively, if Bob converges to truth-telling, $\lim_{t\to\infty} \mathrm{opt}^Y_t = \mathrm{opt}_1$, the average expected gain of Alice deviating to $\mathrm{opt}_i$ is equal to the expected gain of deviating to $\mathrm{opt}_i$ when Bob always tells the truth. The gain is non-positive because the CA mechanism is strictly truthful by Lemma B.4. Therefore, the expected regrets $\mathbb{E}[\mathrm{Reg}^X(T)]$ and $\mathbb{E}[\mathrm{Reg}^Y(T)]$ are small. Theorem 3.2 formalizes and extends the above idea to any strongly truthful multi-task information elicitation mechanism.

Theorem 3.2. Let $\bar{M}$ be a strongly truthful multi-task information elicitation mechanism, and let $M$ be a sequential version of $\bar{M}$. If $M$ achieves truthful convergence for Alice and Bob using algorithms $A_1$ and $A_2$ respectively, then Alice and Bob are no regret.
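To make the regret notion concrete for the sequential CA mechanism, the following sketch computes $\mathrm{Reg}^X(T)$ from Alice's signals, Bob's reports, and Alice's realized payoffs. Because the CA mechanism has rank 2 and Alice's round-$t$ payment depends only on her own round-$t$ report, the counterfactual payoff of always playing $\mathrm{opt}_i$ simplifies to the per-round expression in the docstring; the function name is illustrative.

```python
def alice_regret_ca(xs, y_hats, r_actual):
    """Alice's regret Reg^X(T) under the sequential CA mechanism, eq. (1).

    xs:       Alice's private signals x_1..x_T
    y_hats:   Bob's reports ŷ_1..ŷ_T (held fixed, per the regret definition)
    r_actual: Alice's realized payoffs r_1..r_T
    Since Alice's round-t payment depends only on her round-t report, the
    counterfactual payoff of always playing opt_i is
    r_{i,t} = I[opt_i(x_t) = ŷ_t] - I[opt_i(x_t) = ŷ_{t-1}].
    """
    strategies = [lambda x: x, lambda x: 1 - x, lambda x: 1, lambda x: 0]
    best = float("-inf")
    for opt in strategies:
        total, y_prev = 0, 0                    # ŷ_0 = 0 by convention
        for t in range(len(xs)):
            report = opt(xs[t])
            total += int(report == y_hats[t]) - int(report == y_prev)
            y_prev = y_hats[t]
        best = max(best, total)
    return best - sum(r_actual)
```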
3.2 Reward-based Online Learning Algorithms

As shown in Theorem 3.1, the no-regret assumption alone does not guarantee truthful convergence. In this section, we introduce a general family of online learning algorithms, reward-based online learning algorithms, in the general full-feedback setting, and we will apply these algorithms in sequential information elicitation mechanisms later. Informally, a reward-based online learning algorithm decides each round's strategy using a fixed update function that depends only on the accumulated rewards. For simplicity, we only consider algorithms over the four strategies $\{\mathrm{opt}_1, \mathrm{opt}_2, \mathrm{opt}_3, \mathrm{opt}_4\}$; the definition extends easily.

Recall that the payoff of choosing option $\mathrm{opt}_i$, $i \in [4]$, at round $t$ is $r_{i,t}$ when the other agent's choices are unchanged. We denote the accumulated payoffs of these four options by $R_{i,t} = \sum_{j=1}^t r_{i,j}$ for $i \in [4]$. Symmetrically, we use $S_{i,t}$ and $s_{i,t}$ to represent Bob's accumulated payoff and his payoff from choosing option $\mathrm{opt}_i$ in the $t$-th round. For example, in our peer prediction game with the CA mechanism in eq. (1), $r_{1,t} = \mathbb{I}[x_t = \hat{y}_t] - \mathbb{I}[x_t = \hat{y}_{t-1}]$ and therefore $R_{1,t} = \sum_{j=1}^t (\mathbb{I}[x_j = \hat{y}_j] - \mathbb{I}[x_j = \hat{y}_{j-1}])$ for Alice.

We consider a family of reward-based online learning algorithms $\mathcal{A}$ that use an update function $f : \mathbb{R}^4 \to \Delta_3$ (the probability simplex over the four options) and choose $\mathrm{opt}_i$ with probability $f_i(R_{1,t-1}, R_{2,t-1}, R_{3,t-1}, R_{4,t-1})$ for $i \in [4]$ in the $t$-th round, where $f_i$ is the $i$-th coordinate of $f$. The algorithm based on accumulated payoffs with update function $f$ is denoted by $A_f$. We make three assumptions on the update function $f$, all of which are very natural. The first two require that $f$ is exchangeable and preserves ordering.

Assumption 3.3 (Exchangeability of $f$). For any $R_1, R_2, R_3, R_4 \in \mathbb{R}$ and any permutation $R_{i_1}, R_{i_2}, R_{i_3}, R_{i_4}$ of them, $f_{i_j}(R_1, R_2, R_3, R_4) = f_j(R_{i_1}, R_{i_2}, R_{i_3}, R_{i_4})$ for all $j \in [4]$.

Assumption 3.4 (Order preservation of $f$). For any $R_1, R_2, R_3, R_4 \in \mathbb{R}$, if $R_{i_1} \ge R_{i_2} \ge R_{i_3} \ge R_{i_4}$ is a non-increasing ordering of them, then $f_{i_1}(R_1, R_2, R_3, R_4) \ge f_{i_2}(R_1, R_2, R_3, R_4) \ge f_{i_3}(R_1, R_2, R_3, R_4) \ge f_{i_4}(R_1, R_2, R_3, R_4)$.

Finally, we constrain the strategy chosen by the update function $f$ when the accumulated payoff of one strategy is much higher than those of the other strategies (Assumption 3.5). Appendix D.2 shows that this assumption is necessary for reward-based online learning algorithms to achieve no regret on every online decision problem.

Assumption 3.5 (Full exploitation of $f$). $\lim_{R_1 - \max\{R_2, R_3, R_4\} \to +\infty} f_1(R_1, R_2, R_3, R_4) = 1$.

Now we show that the family of reward-based online learning algorithms $\mathcal{A}$ satisfying Assumptions 3.3 to 3.5 contains several classic no-regret online learning algorithms [10, 13].

Theorem 3.6. $\mathcal{A}$ contains the Follow the Perturbed Leader (FPL) algorithm and the Multiplicative Weights algorithm as special cases. The corresponding update functions are listed below:
- Multiplicative Weights algorithm: $f^{\mathrm{hedge}}_i(R_1, R_2, R_3, R_4) = \frac{e^{\beta R_i}}{\sum_{j \in [4]} e^{\beta R_j}}$ for $i \in [4]$.
- FPL algorithm: given a noise distribution $N$ on scalars, $f^{\mathrm{FPL}}_i(R_1, R_2, R_3, R_4) = \Pr\left[R_i + p_i = \max_{j \in [4]} \{R_j + p_j\} \,\middle|\, p_j \overset{\mathrm{iid}}{\sim} N,\ j \in [4]\right]$ for $i \in [4]$.

In the binary peer prediction problem with the CA mechanism, we can further show that replicator dynamics and the linear-update multiplicative weights algorithm [2] are both in $\mathcal{A}$; see Appendix E.
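The two update functions in Theorem 3.6 can be sketched in a few lines of Python. `f_hedge` returns the full distribution over the four options; for FPL, sampling fresh perturbations each round realizes the distribution $f^{\mathrm{FPL}}$ without computing it in closed form. The parameter names (`beta`, `noise_scale`) and the choice of uniform noise are illustrative assumptions (uniform noise matches the FPL variants simulated in Section 5).

```python
import numpy as np

def f_hedge(R, beta=1.0):
    """Multiplicative-weights update: choose opt_i with probability proportional to exp(beta * R_i)."""
    R = np.asarray(R, dtype=float)
    w = np.exp(beta * (R - R.max()))   # subtract the max for numerical stability
    return w / w.sum()

def sample_fpl(R, rng, noise_scale=4.0):
    """Follow the Perturbed Leader: add i.i.d. noise p_j to each R_j and play the perturbed leader.

    Drawing fresh perturbations each round realizes the distribution
    f^FPL_i(R) = Pr[R_i + p_i = max_j (R_j + p_j)].
    """
    R = np.asarray(R, dtype=float)
    perturbed = R + rng.uniform(0.0, noise_scale, size=R.shape)
    return int(np.argmax(perturbed)) + 1   # option index in {1, 2, 3, 4}

# Both updates depend only on the accumulated rewards, respect their ordering,
# and put probability approaching 1 on an option whose reward runs away,
# matching Assumptions 3.3-3.5.
```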
Remark 3.7. Our reward-based online learning algorithms are very similar to the mean-based algorithms in [3]; both use the accumulated rewards to characterize the algorithm's choice. Additionally, like the mean-based algorithms, our reward-based online learning algorithms may include algorithms that have regret in general games, e.g., Follow the Leader. The family of reward-based online algorithms uses an identical update function across all rounds. Thus some no-regret algorithms, e.g., ϵ-greedy with a time-decreasing ϵ, do not belong to this family. Finally, if a mean-based algorithm is also a reward-based algorithm that uses the same update function in each round, then its update function satisfies Assumption 3.5.

4 Truthful Convergence of the CA Mechanism with Reward-based Algorithms

Now we present our main result. We show that the sequential CA mechanism achieves truthful convergence when both agents use reward-based online learning algorithms from $\mathcal{A}$. This convergence result suggests that the classic CA mechanism is robust even when agents deviate from Bayesian rational behavior and use a general family of online learning algorithms.

Theorem 4.1. Under Assumptions 2.1 and 2.2, the binary-signal, sequential CA mechanism defined in eq. (1) achieves truthful convergence when agents use reward-based algorithms $A_f$ and $A_g$, where the update functions $f$ and $g$ satisfy Assumptions 3.3 to 3.5.

Note that, given the agents' online learning algorithms and the payment function, the sequence of accumulated reward vectors $(R_{1,t}, R_{2,t}, R_{3,t}, R_{4,t}, S_{1,t}, S_{2,t}, S_{3,t}, S_{4,t})_{t \ge 1}$ forms a stochastic process, and we define $H_t$ as the game history of the rewards and private signals in the first $t$ rounds. Additionally, if the accumulated rewards of truth-telling, $R_{1,t}$ and $S_{1,t}$, are much larger than the others', Alice and Bob converge to truth-telling by Assumption 3.5. Thus, it is sufficient for us to track the evolution of the accumulated reward vector. Though the process of accumulated reward vectors is not a Markov chain, because the payment function eq. (1) depends on the reports of two rounds, we can still use ideas from semi-martingale theory to track the process.

Before proving our main result, we first present two properties of the CA mechanism in our binary-signal peer prediction problem. Lemma 4.2 shows that the accumulated payoffs $R_{3,t}$, $R_{4,t}$, $S_{3,t}$, and $S_{4,t}$ of the uninformative strategies opt3 and opt4 are always bounded.

Lemma 4.2. Given the game defined in Theorem 4.1, for any round $t$, the accumulated payoffs $R_{3,t}$, $R_{4,t}$, $S_{3,t}$, and $S_{4,t}$ of the two agents lie in $[-1, 1]$.

The second property, Lemma 4.3, tells us that for each agent the sum of the accumulated payoffs of opt1 and opt2 is always fixed and equals the sum of the accumulated payoffs of the uninformative strategies opt3 and opt4. The proofs of these two lemmas are in Appendix F.1.

Lemma 4.3. Given the game defined in Theorem 4.1, for any round $t$, Alice has $R_{1,t} + R_{2,t} = R_{3,t} + R_{4,t} = 0$, and Bob has $S_{1,t} + S_{2,t} = S_{3,t} + S_{4,t} = 0$.
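Both lemmas follow from per-round identities of the CA payments: for any peer report $\hat{y}$, exactly one of $\mathrm{opt}_1(x)$ and $\mathrm{opt}_2(x)$ agrees with it (and likewise for $\mathrm{opt}_3$ and $\mathrm{opt}_4$), so the per-round counterfactual payoffs cancel, and $R_{3,t}, R_{4,t}$ telescope. The following brute-force check verifies both lemmas for all short signal and report sequences; it is a sanity check, not the proof in Appendix F.1.

```python
import itertools

def check_lemmas_4_2_and_4_3(T=6):
    """Brute-force check of Lemmas 4.2 and 4.3 for the sequential CA mechanism, eq. (1).

    For every Alice signal sequence and every Bob report sequence of length T, verify that
    R_{1,t} + R_{2,t} = R_{3,t} + R_{4,t} = 0 and that R_{3,t}, R_{4,t} stay in [-1, 1],
    where r_{i,t} = I[opt_i(x_t) = ŷ_t] - I[opt_i(x_t) = ŷ_{t-1}].
    """
    strategies = [lambda x: x, lambda x: 1 - x, lambda x: 1, lambda x: 0]
    for xs in itertools.product([0, 1], repeat=T):
        for y_hats in itertools.product([0, 1], repeat=T):
            R = [0, 0, 0, 0]
            y_prev = 0                                        # ŷ_0 = 0
            for t in range(T):
                for i, opt in enumerate(strategies):
                    report = opt(xs[t])
                    R[i] += int(report == y_hats[t]) - int(report == y_prev)
                y_prev = y_hats[t]
                assert R[0] + R[1] == 0 and R[2] + R[3] == 0   # Lemma 4.3
                assert -1 <= R[2] <= 1 and -1 <= R[3] <= 1     # Lemma 4.2

check_lemmas_4_2_and_4_3()
```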
We now sketch the proof, which consists of four steps. Informally, using Lemmas 4.2 and 4.3, the first step says that the uninformative strategies opt3 and opt4 cannot completely dominate the other strategies: neither agent places probability larger than 0.75 on opt3 and opt4 combined. With the first step, Lemma 4.2, and Lemma 4.3, we only need to focus on the agents' rewards for truth-telling opt1, as shown in fig. 1. We partition the space into three types of events. Good events happen when $R_1$ and $S_1$ are both very large or both very small, bad events happen when one of $R_1, S_1$ is very large and the other is very small, and intermediate states are the states in between. The second step removes the possibility that Alice and Bob keep using the mismatched strategies opt1 and opt2; therefore, the process always escapes the "bad events" and enters the "intermediate states". The third step further shows that whenever the game is in an intermediate state, there is a constant probability that the game enters a good event within a constant number of rounds. This leaves the final step: showing that after the game enters a "good event", the agents' strategies converge to both playing truth-telling opt1 or both playing flipping opt2, i.e., truthful convergence. We now discuss each step in more detail but defer all proofs to Appendix F.

Figure 1: A schematic diagram of the behaviors of $R_{1,t}$ and $S_{1,t}$.

Step 1: Choosing opt1, opt2 with Nonzero Probability

Combining Lemmas 4.2 and 4.3, we have the following Lemma 4.4, which completes step 1.

Lemma 4.4. Given the game defined in Theorem 4.1, for any round $t \ge 0$ and $i \in \{1, 2\}$: if $\sum_{j=1}^t r_{i,j} \ge 0$, the probability that Alice chooses $\mathrm{opt}_i$ is larger than $\frac{1}{4}$; if $\sum_{j=1}^t s_{i,j} \ge 0$, the probability that Bob chooses $\mathrm{opt}_i$ is larger than $\frac{1}{4}$.

Step 2: Escaping Bad Events

Before introducing the formal statement of step 2, we define two "bad events". Given $c_0$, for all $t \ge 1$,
$$E^{1,2}_t := \{R_{1,t} > c_0 \text{ and } S_{2,t} > c_0\} \quad \text{and} \quad E^{2,1}_t := \{R_{2,t} > c_0 \text{ and } S_{1,t} > c_0\}.$$
We will specify $c_0$ later. Intuitively, when $c_0$ is sufficiently large, $E^{1,2}_t$ implies that Alice and Bob will choose opt1 and opt2 respectively with probability close to 1 in the following rounds. For simplicity, we treat each event $E$ as an indicator, i.e., $E$ happens if and only if $E = 1$.

In order to prove that these two bad events cannot go on forever, we want to show that when $E^{1,2}_t$ happens, $R_{1,t} - R_{2,t}$ tends to decrease at a rapid rate; therefore, Alice eventually deviates from opt1 and chooses opt2 with a relatively high probability. By Assumption 3.5 and Lemma 4.2, given $\delta > 0$ there exists a constant $c_1$ such that when $R_{1,t} > c_1$, Alice chooses opt1 with probability larger than $1 - \delta$, and when $S_{2,t} > c_1$, Bob chooses opt2 with probability larger than $1 - \delta$. Let
$$\gamma_1 = P_{X,Y}(1,1) + P_{X,Y}(0,0) - P_{X,Y}(1,0) - P_{X,Y}(0,1), \qquad \gamma_2 = \big(P_{X,Y}(1,1) - P_{X,Y}(0,0)\big)^2 - \big(P_{X,Y}(1,0) - P_{X,Y}(0,1)\big)^2.$$

Lemma 4.5. Given the game defined in Theorem 4.1, there exist a $\delta > 0$ and a corresponding $c_1$ such that for any round $t$, the conditional expectation $\mathbb{E}[r_{1,t+1} - r_{2,t+1} \mid S_{2,t-1} > c_1 + 1]$ is at most a negative constant determined by $\gamma_1$ and $\gamma_2$.

Given such $\delta$ and $c_1$ in Lemma 4.5, we set $c_0 = c_1 + \frac{1000}{\gamma_1 - \gamma_2} + 1$. Because in each round $R_{1,t}, R_{2,t}$ and $S_{1,t}, S_{2,t}$ vary by at most 1, if $E^{1,2}_t$ happens then Alice chooses opt1 and Bob chooses opt2, each with probability larger than $1 - \delta$ independently, for the next $\frac{1000}{\gamma_1 - \gamma_2} + 1$ rounds. A similar argument holds for $E^{2,1}_t$. We use the above observation to show Lemma 4.6.

Lemma 4.6. Given the game defined in Theorem 4.1, for all $t$ and histories $H_t \in E^{1,2}_t$, the expected block increment $\mathbb{E}\left[\sum_{j=1}^{\frac{1000}{\gamma_1 - \gamma_2} + 1} (r_{1,t+j} - r_{2,t+j}) \,\middle|\, H_t\right]$ is bounded above by a negative constant.

This lemma formalizes the blue arrows in Fig. 1. With this lemma, we get the main result of step two.

Lemma 4.7. Given the game defined in Theorem 4.1, $\Pr\left[\limsup_{t\to\infty} \left(\neg E^{1,2}_t \wedge \neg E^{2,1}_t\right) = 1\right] = 1$.

If we think of $R_{1,t} - R_{2,t}$ as Alice's money, Lemma 4.7 is similar to the gambler's ruin problem [7]. More specifically, $R_{1,t} - R_{2,t}$ has negative expected growth every $\frac{1000}{\gamma_1 - \gamma_2} + 1$ rounds by Lemma 4.6, so $R_{1,t} - R_{2,t}$ always becomes small enough to escape $E^{1,2}$.

Step 3: From Intermediate States to Good Events

We first define a series of "good events". For all $u \in \mathbb{N}^+$ and $t \ge 1$,
$$E^{1,1}_t(u) := \{R_{1,t} \ge u \text{ and } S_{1,t} \ge u\} \quad \text{and} \quad E^{2,2}_t(u) := \{R_{2,t} \ge u \text{ and } S_{2,t} \ge u\}.$$
To this end, we want to show that for any $u \in \mathbb{N}^+$, the event $\bigcup_{t \in \mathbb{N}^+} E^{1,1}_t(u) \cup \bigcup_{t \in \mathbb{N}^+} E^{2,2}_t(u)$ happens with probability 1. Formally, we claim Lemma 4.8.

Lemma 4.8. Given the game defined in Theorem 4.1, for all $u$ there exists $\lambda_u$ such that for any $T$ with history $H_T \notin E^{1,2}_T \cup E^{2,1}_T$, we have
$$\Pr\left[\bigcup_{i=T}^{T+4(u+c_0)+100} E^{1,1}_i(u) \,\cup\, \bigcup_{i=T}^{T+4(u+c_0)+100} E^{2,2}_i(u) = 1 \,\middle|\, H_T\right] \ge \lambda_u.$$

This lemma says that when the agents are in an intermediate state, i.e., neither $E^{1,2}_T$ nor $E^{2,1}_T$ holds, they have a constant probability $\lambda_u$ of getting into a good event within the next $4(u + c_0) + 100$ rounds. In order to prove this lemma, we define a nice event $P_T$ such that $P_T$ happens with probability at least $\lambda_u$ and $P_T$ implies that a good event happens within $4(u + c_0) + 100$ rounds. Formally, the event $P_T$ is defined as follows. First, for $j = T, \ldots, T_1$, until some round $T_1 \ge T$ such that $S_{1,T_1} \ge 0$: the signals satisfy $x_{j+1} = 1 - \hat{y}_j$ and $y_{j+1} = 1 - \hat{x}_j$, Alice uses strategy opt1, and Bob uses strategy opt2. Then, Alice and Bob both use strategy opt1 for $4u + 50$ rounds, and the signals are generated as $x_{j+1} = y_{j+1} = x_j$ for $j \ge T_1$. We can use Lemma 4.4 to prove that $\lambda_u = \left(\min_{i,j \in \{0,1\}} P_{X,Y}(i,j) \cdot 0.25^2\right)^{100 + 4(u + c_0)}$ is a feasible lower bound.

Step 4: Good Events Lead to Truthful Convergence

Similar to step 2, we have Lemma 4.9, which shows that $E^{1,1}_t(u)$ leads to an increase of $R_{1,t} - R_{2,t}$ at first. The idea is completely analogous to Lemma 4.6, and it formalizes the red arrows in Fig. 1.

Lemma 4.9. Given the game defined in Theorem 4.1, for all $u > 2c_0 + 1$ and all $t$, if $H_t \in E^{1,1}_t(u)$, the expected block increment $\mathbb{E}\left[\sum_{j=1}^{\frac{1000}{\gamma_1 - \gamma_2} + 1} (r_{1,t+j} - r_{2,t+j}) \,\middle|\, H_t\right]$ is bounded below by a positive constant.

Using Lemma 4.9, we are able to prove that for any $\epsilon > 0$ there exists a $u$ such that when $E^{1,1}_t(u)$ or $E^{2,2}_t(u)$ happens, Alice and Bob will tend to choose $(\mathrm{opt}_1, \mathrm{opt}_1)$ or $(\mathrm{opt}_2, \mathrm{opt}_2)$ with increasingly higher probability. Formally, we propose Lemma 4.10.

Lemma 4.10. Given the game defined in Theorem 4.1, for all $\epsilon > 0$ there exists $u \in \mathbb{N}^+$ such that, given a history $H_T \in E^{1,1}_T(u) \cup E^{2,2}_T(u)$, we have
$$\Pr\left[\forall i \in \mathbb{N},\; E^{1,1}_{T + \left(\frac{1000}{\gamma_1 - \gamma_2} + 1\right) i}\!\left(\tfrac{u}{2} + i\right) \,\vee\, E^{2,2}_{T + \left(\frac{1000}{\gamma_1 - \gamma_2} + 1\right) i}\!\left(\tfrac{u}{2} + i\right) = 1 \,\middle|\, H_T\right] \ge 1 - \epsilon.$$

We design a sub-martingale $\{D_i\}_{i \in \mathbb{N}}$ that is proportional to $R_{1,\, T + i\left(\frac{1000}{\gamma_1 - \gamma_2} + 1\right)} - R_{2,\, T + i\left(\frac{1000}{\gamma_1 - \gamma_2} + 1\right)}$ and use the Azuma-Hoeffding inequality to prove Lemma 4.10.

5 Simulations

We simulate the CA mechanism with various learning algorithms: the Hedge algorithms, Follow the Perturbed Leader, Follow the Leader, and ϵ-greedy, and we repeat the process 400 times with 800 rounds for each algorithm. We define the convergence proportion at round $t$ as the fraction of simulations in which both agents report truthfully (opt1), or both use opt2, in all subsequent rounds. In our simulations, we use the following private signal distribution, which satisfies Assumption 2.2: $P_{X,Y}(0,0) = P_{X,Y}(1,1) = 0.4$ and $P_{X,Y}(1,0) = P_{X,Y}(0,1) = 0.2$. Moreover, Alice and Bob use the same learning algorithm in our simulations; the algorithms are listed below. First, the Follow the Leader algorithm (FTL) chooses $\mathrm{opt}_i$ with probability proportional to $\mathbb{I}[R_i = \max\{R_1, R_2, R_3, R_4\}]$. Follow the Perturbed Leader (FPL*, where * can be 1, 4, or 8) adds uniform random noise between 0 and * and chooses strategy $\mathrm{opt}_i$ with probability $\Pr\left[R_i + p_i = \max_{j \in [4]}\{R_j + p_j\} \,\middle|\, p_j \overset{\mathrm{iid}}{\sim} U[0, *]\right]$; we consider FPL1, FPL4, and FPL8. Hedge algorithm 1 chooses $i$ with probability proportional to $(3/2)^{R_i}$, which is an implementation of the multiplicative weights algorithm of Arora et al. [2] with $\epsilon = 0.5$. Hedge algorithm 2 chooses $i$ with probability proportional to $e^{R_i}$, which is an implementation of the exponentially weighted average forecaster introduced by Freund and Schapire [10] with $\beta = 1$. Finally, the ϵ-greedy algorithm uses a time-varying $\epsilon = \frac{1}{(t+1)^2}$ at round $t$. Note that ϵ-greedy is not in $\mathcal{A}$ but still achieves truthful convergence.
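As a compact illustration of the experimental setup (not the exact code used for Figure 2), the following sketch runs one simulation of the sequential CA mechanism with two reward-based learners; the `update` argument can be any of the update functions above, and the helper names are illustrative.

```python
import numpy as np

def simulate_ca(update, T=800, seed=0):
    """One run of the sequential CA mechanism with two reward-based learners.

    `update` maps a length-4 accumulated-reward vector to a distribution over
    the options opt1..opt4 (e.g., the hedge update sketched earlier).
    """
    rng = np.random.default_rng(seed)
    P = np.array([0.4, 0.2, 0.2, 0.4])          # P(0,0), P(0,1), P(1,0), P(1,1)
    strategies = [lambda z: z, lambda z: 1 - z, lambda z: 1, lambda z: 0]
    R, S = np.zeros(4), np.zeros(4)             # accumulated rewards of Alice / Bob
    x_prev_hat = y_prev_hat = 0                 # reports at round 0 are set to 0
    choices = []
    for _ in range(T):
        idx = rng.choice(4, p=P)
        x, y = divmod(idx, 2)                   # signals (x_t, y_t) ~ P_{X,Y}
        a = rng.choice(4, p=update(R))          # Alice's option index in {0,..,3}
        b = rng.choice(4, p=update(S))          # Bob's option index in {0,..,3}
        x_hat, y_hat = strategies[a](x), strategies[b](y)
        # Counterfactual rewards r_{i,t}, s_{i,t} for every option (full feedback).
        for i, opt in enumerate(strategies):
            R[i] += int(opt(x) == y_hat) - int(opt(x) == y_prev_hat)
            S[i] += int(opt(y) == x_hat) - int(opt(y) == x_prev_hat)
        x_prev_hat, y_prev_hat = x_hat, y_hat
        choices.append((a + 1, b + 1))          # record chosen options as 1..4
    return choices

# Example: both agents run a hedge update; by Theorem 4.1 the play converges to
# both truth-telling (1, 1) or both flipping (2, 2).
hedge = lambda R: np.exp(R - R.max()) / np.exp(R - R.max()).sum()
print(simulate_ca(hedge)[-10:])
```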
In Figure 2, all of the algorithms converge to truth-telling. First, all reward-based online learning algorithms (the Hedge algorithms, Follow the Perturbed Leader, and Follow the Leader) exhibit truthful convergence, which aligns with our theoretical result, Theorem 4.1. Moreover, although FTL is in general not no-regret, the CA mechanism still works well with it. Additionally, we observe that when an algorithm explores less (e.g., FPL4 vs. FPL8), it converges faster, but very little exploration does not further improve the convergence rate (e.g., FTL vs. FPL2). Finally, we also find that ϵ-greedy with time-decreasing ϵ, which is not a reward-based online learning algorithm, also shows truthful convergence. This suggests that the CA mechanism may achieve truthful convergence beyond reward-based online learning algorithms.

Figure 2: Convergence rates of learning algorithms in the CA mechanism.

6 Conclusions

In this paper, we study sequential peer prediction with learning agents and prove that the notion of no regret alone is not sufficient for truthful convergence. We then define a family of reward-based learning algorithms and show that the CA mechanism achieves truthful convergence when agents use algorithms in this family. Finally, we discuss the convergence rates of different learning agents based on simulations. This is the first theoretical study of peer prediction with learning agents. There are many open problems and future directions extending this work. We believe similar proof techniques can be used to extend our results to settings where agents' private signals are generated by a Markov chain with some assumptions on the transition matrix. Moreover, this work is restricted to binary signals, and it is still an open problem whether there exists a mechanism for non-binary settings that can promise truthful convergence. For the learning agents, one could consider a more general family of learning algorithms, such as algorithms whose update function $f$ is time-varying.

Acknowledgments and Disclosure of Funding

The authors would like to thank the anonymous reviewers for their valuable comments and constructive feedback. This work is partially supported by the National Science Foundation under Grant No. IIS 2007887 and by the National Science Foundation and Amazon under Grant No. FAI 2147187.

References

[1] N. Alon and J. H. Spencer. The probabilistic method. John Wiley & Sons, 2016.
[2] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.
[3] M. Braverman, J. Mao, J. Schneider, and S. M. Weinberg. Selling to a no-regret buyer. CoRR, abs/1711.09176, 2017. URL http://arxiv.org/abs/1711.09176.
[4] M. K. Camara, J. D. Hartline, and A. Johnsen. Mechanisms for a no-regret agent: Beyond the common prior. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 259-270. IEEE, 2020.
[5] A. Dasgupta and A. Ghosh. Crowdsourced judgement elicitation with endogenous proficiency. In Proceedings of the 22nd International Conference on World Wide Web, pages 319-330, 2013.
[6] Y. Deng, J. Schneider, and B. Sivan. Strategizing against no-regret learners. Advances in Neural Information Processing Systems, 32, 2019.
[7] R. A. Epstein. The theory of gambling and statistical logic. Academic Press, 2012.
[8] B. Faltings, R. Jurca, P. Pu, and B. D. Tran. Incentives to counter bias in human computation. In Second AAAI Conference on Human Computation and Crowdsourcing, 2014.
[9] W. Feller. An introduction to probability theory and its applications, vol. 2. John Wiley & Sons, 2008.
[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[11] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3(2):97-139, 1957.
[12] A. Kajii and S. Morris. The robustness of equilibria to incomplete information. Econometrica: Journal of the Econometric Society, pages 1283-1309, 1997.
[13] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.
[14] V. Kamble, N. Shah, D. Marn, A. Parekh, and K. Ramachandran. Truth serums for massively crowdsourced evaluation tasks. arXiv preprint arXiv:1507.07045, page 96, 2015.
[15] Y. Kong. Dominantly truthful multi-task peer prediction with a constant number of tasks. CoRR, abs/1911.00272, 2019. URL http://arxiv.org/abs/1911.00272.
[16] Y. Kong and G. Schoenebeck. Equilibrium selection in information elicitation without verification via information monotonicity. arXiv preprint arXiv:1603.07751, 2016.
[17] Y. Kong and G. Schoenebeck. A framework for designing information elicitation mechanisms that reward truth-telling. CoRR, abs/1605.01021, 2016. URL http://arxiv.org/abs/1605.01021.
[18] Y. Liu and Y. Chen. Surrogate scoring rules and a dominant truth serum for information elicitation. arXiv preprint arXiv:1802.09158, 2018.
[19] N. Miller, P. Resnick, and R. Zeckhauser. Eliciting informative feedback: The peer-prediction method. Management Science, 51(9):1359-1373, 2005.
[20] D. Prelec. A Bayesian truth serum for subjective data. Science, 306(5695):462-466, 2004.
[21] D. Prelec, H. S. Seung, and J. McCoy. A solution to the single-question crowd wisdom problem. Nature, 541(7638):532-535, 2017.
[22] G. Radanovic and B. Faltings. Incentives for truthful information elicitation of continuous signals. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 770-776, 2014.
[23] G. Schoenebeck and F. Yu. Learning and strongly truthful multi-task peer prediction: A variational approach. CoRR, abs/2009.14730, 2020. URL https://arxiv.org/abs/2009.14730.
[24] G. Schoenebeck, F.-Y. Yu, and Y. Zhang. Information elicitation from rowdy crowds. In Proceedings of the Web Conference 2021, pages 3974-3986, 2021.
[25] V. Shnayder, A. Agarwal, R. M. Frongillo, and D. C. Parkes. Informed truthfulness in multi-task peer prediction. CoRR, abs/1603.03151, 2016. URL http://arxiv.org/abs/1603.03151.
[26] V. Shnayder, R. Frongillo, and D. C. Parkes. Measuring performance of peer prediction mechanisms using replicator dynamics. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016.
[27] J. Witkowski and D. C. Parkes. A robust Bayesian truth serum for small populations. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[28] S. Zheng, F. Yu, and Y. Chen. The limits of multi-task peer prediction. CoRR, abs/2106.03176, 2021. URL https://arxiv.org/abs/2106.03176.