# Bandit Social Learning under Myopic Behavior

Kiarash Banihashem (University of Maryland, College Park; kiarash@umd.edu), Mohammad Taghi Hajiaghayi (University of Maryland, College Park; hajiagha@umd.edu), Suho Shin (University of Maryland, College Park; suhoshin@umd.edu), Aleksandrs Slivkins (Microsoft Research NYC; slivkins@microsoft.com)

Abstract: We study social learning dynamics motivated by reviews on online platforms. The agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals for the arms' expected rewards. We derive stark exploration failures for any such behavior, and provide matching positive results. As a special case, we obtain the first general results on the failure of the greedy algorithm in bandits, thus providing a theoretical foundation for why bandit algorithms should explore.^1

1 Introduction

Reviews and ratings are pervasive on many online platforms. A customer consults reviews/ratings, then chooses a product, and then (often) leaves feedback, which is aggregated by the platform and served to future customers. Collectively, customers face a tradeoff between exploration and exploitation, i.e., between acquiring new information while making potentially suboptimal decisions, and making optimal decisions using the information currently available. However, individual customers tend to act myopically and favor exploitation, without regard to exploration for the sake of others. At a high level, we ask whether and how this myopic behavior interferes with efficient exploration. We are particularly interested in learning failures in which only a few agents choose an optimal action.

We distill this issue down to its purest form. We posit that the customers make one decision each and do not observe any personalized payoff-relevant information prior to their decision, whether public or private. In particular, the customers believe they are similar to one another. They have only two alternative products/experiences to choose from, a.k.a. arms, and no way to infer anything about one arm from the other. The platform provides each customer with the full history of the previous agents.^2

Concretely, we posit Bandit Social Learning (BSL): a variant of social learning in which the customers (henceforth, agents) arrive sequentially and follow a simple multi-armed bandit protocol. Each agent observes the full history, chooses an arm, and receives a reward: a Bernoulli random draw whose mean is arm-specific and unknown. Initial knowledge (a dataset with some samples of each arm) may be available to all agents. When all agents are governed by a centralized algorithm, this setting is known as stochastic bandits, a standard and well-understood variant of multi-armed bandits.

Footnote 1: Early versions of our results on the greedy algorithm (Corollary 3.6 and Theorem 6.1) have been available in a book chapter by A. Slivkins [54, Ch. 11]. The authors acknowledge Mark Sellke for proving Theorem 6.1 and suggesting a proof plan for a version of Corollary 3.6. The authors are grateful to Mark Sellke and Chara Podimata for brief collaborations (with A. Slivkins) in the initial stages of this project.

Footnote 2: In practice, online platforms provide summaries such as the average score and the number of samples.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
| Mean rewards | Beliefs | Behavior | Result |
|---|---|---|---|
| fixed | "frequentist" confidence intervals | η-confident | Thm. 3.1 (main), Thm. 3.9 (small N0) |
| fixed | "frequentist" confidence intervals | unbiased/Greedy | Cor. 3.6 |
| fixed | "frequentist" confidence intervals | η_t-pessimistic | Thm. 3.10 |
| fixed | Bayesian (independent) | Bayesian-unbiased | Thm. 5.1(a) |
| fixed | Bayesian (independent) | η-Bayesian-confident | Thm. 5.1(b) |
| Bayesian (correlated) | Bayesian (and correct) | Bayesian-unbiased | Thm. 6.1 |

Table 1: Our negative results: learning failures.

We allow a wide range of myopic behaviors that are consistent with observations. Given an arm, consider the confidence interval for its mean reward, parameterized by η ≥ 0: the sample average plus/minus the confidence term √(η / #samples).^3 An η-confident agent evaluates each arm via an index: some number within this arm's confidence interval (and otherwise arbitrary), and chooses an arm with a largest index. (The computational implementation of this process is irrelevant to our model.) Crucially, η and the agents' behavior are given and cannot be influenced by the platform. This model subsumes the unbiased behavior, when the index equals the sample average, as well as various behavioral biases (see related work for citations). Most notably: optimism and pessimism, when the index is, resp., larger or smaller than the sample average. (These can also be interpreted as, resp., risk seeking and risk aversion.) The model also allows for probabilistic decisions (via randomized indices), correlated behaviors (when samples from one arm affect the behavioral bias on another), and recency bias (when one favors more recent observations). Further, an agent may treat each arm differently, and different agents may exhibit different biases.

We target the regime in which the parameter η is constant w.r.t. the number of agents T; i.e., the agent population is characterized by a constant η. We are interested in the asymptotic behavior as T increases. An extreme version of our model, with η ∼ log(T), is only considered for intuition and sanity checks. Interestingly, this version subsumes two well-known bandit algorithms: UCB1 [6] and Thompson Sampling [57, 52], which achieve optimal regret bounds. These algorithms can be seen as behaviors: resp., extreme optimism and probability matching [49, 59], a well-known randomized behavior. More moderate versions of these behaviors are consistent with η-confidence as defined above, and are subject to the learning failures described below.

Our results. We are interested in learning failures in which all but a few agents choose the bad arm, and in how the failure probability scales with the parameter η. Our main result is that with η-confident agents, the failure probability is at least e^{−O(η)} (see Section 3). Consequently, regret is at least Ω(T · e^{−O(η)}) for any given problem instance, in contrast with the O(log T) regret rate obtained by optimal bandit algorithms. Further, the e^{−O(η)} scaling is the best possible: indeed, regret for optimistic agents is at most O(T · e^{−Ω(η)} + η/Δ) for a given problem instance (Theorem 4.1), where Δ is the gap between the two arms' mean rewards. Note that the negative result deteriorates as η increases, and becomes vacuous when η ∼ log T; the upper bound then essentially matches the optimal O(log T) regret of the UCB1 algorithm [6].

We refine these results in several directions. First, if all agents are unbiased, the failure probability scales as the gap Δ, the difference in expected reward between the two arms (Corollary 3.6). Second, if all agents are pessimistic, then any level of pessimism, whether small or large or different across agents, leads to a similar failure probability as in the unbiased case (Theorem 3.10).
Third, a small fraction of optimists goes a long way! That is, if all agents are η-confident and even a q-fraction of them are η-optimistic, then we obtain regret O(T · e^{−Ω(η)} + η/(qΔ)), regardless of the behavior of the other agents.^4

Our results extend to Bayesian agents who have independent priors across the arms and act according to their posteriors. Such agents are consistent with our main model of η-confident agents, and therefore are subject to the same negative results (Section 5). Further, we focus on Bayesian-unbiased agents and allow arbitrary correlated Bayesian beliefs (when the agents can make inferences about one arm from the observations on the other). We derive a general result on learning failures, assuming that the mean rewards are actually drawn according to the beliefs (Section 6).

Footnote 3: Then the confidence interval contains the (true) mean reward with probability at least 1 − 2e^{−2η}.

Footnote 4: A similar result holds even when the agents hold different levels of optimism, see Theorem 4.5.

| Mean rewards | Beliefs | Behavior | Result |
|---|---|---|---|
| fixed | "frequentist" confidence intervals | η-optimistic | Thm. 4.1 |
| fixed | "frequentist" confidence intervals | η_t-optimistic, η_t ∈ [η, η_max] | Thm. 4.4 |
| fixed | "frequentist" confidence intervals | small fraction of optimists | Thm. 4.5 |

Table 2: Our positive results: upper bounds on regret.

Our results are summarized in Tables 1 and 2.

Implications for multi-armed bandits. The negative results for unbiased agents can be seen as general results on the failure of the greedy algorithm: a bandit algorithm that always exploits. This is a theoretical foundation for why bandit algorithms should explore, and indeed why one should design them at all. We are not aware of any general results of this nature, whether published or known previously as "folklore", which is quite surprising given the enormous literature on multi-armed bandits. Therefore, we believe our results fill an important gap in the literature.

How surprising are these results? It has been folklore knowledge for several decades that the greedy algorithm is inefficient in some simple special cases, and folklore belief that this should hold much more generally. However, recent results reveal a more complex picture: the greedy algorithm fails under some strong assumptions, but works well under some other strong assumptions (see Related Work). Thus, it has arguably become less clear which assumptions would be needed for negative results, and what the "shape" and probability of learning failures would be. Further, our results on η-confident agents explain why the UCB1 algorithm requires extreme optimism, why any algorithm based on narrow (constant-η) confidence intervals is doomed to fail, and also why "pessimism under uncertainty" is not a productive approach to exploration.

Novelty and significance. BSL was not well-understood previously even with unbiased agents, as discussed above, let alone for more permissive behavioral models. It was very unclear a priori how to analyze learning failures and how strong the guarantees would be, in terms of the generality of agents' behaviors, the failure events/probabilities, and the technical assumptions. On a technical level, our proofs have very little to do with the standard lower-bound analyses in bandits stemming from [43, 7]. These analyses apply to any algorithm and prove sublinear lower bounds on regret, such as Ω(log T) for a given problem instance and Ω(√T) in the worst case. Their main technical tool is a KL-divergence analysis showing that no algorithm can distinguish between a given tuple of similar problem instances.
In contrast, we prove linear lower bounds on regret, our results apply to a particular family of behaviors/algorithms, and we never consider a tuple of similar problem instances. Instead, we use anti-concentration and martingale tools to argue that the best arm is never played (or played only a few times), with some probability. The result on correlated beliefs in Section 6 has a rather short but conceptual proof which we believe is well-suited for a textbook. While our positive results in Section 4 are restricted to optimistic agents, we do not assert that such agents are necessarily typical. The primary point here is that our results on learning failures are essentially tight. That said, optimism is a well-documented behavioral bias (e.g., see [50] and references therein). So, a small fraction of optimists (leveraged in Theorem 4.5) is not unrealistic. Our proofs are more involved compared to the standard analysis of the UCB1 algorithm. This is because we cannot make the η parameter as large as needed to ensure that the complements of certain clean events can be ignored. Instead, we need to define and analyze these clean events in a more careful way. These difficulties are compounded in Theorem 4.5, our most general result. As far as the statements are concerned, the basic result in Theorem 4.1 is perhaps what one would expect to hold, whereas the extensions in Theorem 4.4 and Theorem 4.5 are more surprising. Framing. We target the scenario in social learning when both actions and rewards are observable in the future, and the agents do not receive any other payoff-relevant signals. As in much of algorithmic game theory, we discuss the influence of self-interested behavior on the overall welfare of the system. We consider how such behaviour can cause learning failures , which is a typical framing in the literature on social learning. From the perspective of multi-armed bandits, we investigate the failures of the greedy algorithm, and more generally any algorithm that operates on narrow confidence intervals. We do not attempt to design new algorithms, as a version of UCB1 is proved optimal. Map of the paper. Section 2 defines our model. Section 3 derives the learning failures. Section 4 provides positive results for optimistic agents. Section 5 and Section 6 handle agents with Bayesian beliefs. Most proofs are moved to the supplement. Related work. A vast literature on social learning studies agents that learn over time in a shared environment. A prominent topic is learning failures such as ours. Models vary across several dimensions: e.g., the information acquired/transmitted, the communication network, agents life-span and decision rules, etc. All models from prior work are very different from ours. In the supplement, we separate our model from several most relevant ones: sequential social learning [27], strategic experimentation [33], networked myopic learners [8, 46], and misspecified beliefs [31, 15, 25, 44]. Positive results for the greedy bandit algorithm [37, 11, 51] focus on contextual bandits, an extension of stochastic bandits where a payoff-relevant signal (context) is available before each round. Equivalently, each agent in BSL receives such signal along with the history (incl. all previous signals). Very strong assumptions are needed: linearity of rewards and diversity of contexts. A similar result holds for BSL with private signals, under different (and also very strong) assumptions on structure and diversity [1]. 
In all this work, the agents' diversity substitutes for exploration, and structural assumptions allow aggregation across agents. Moreover, the greedy algorithm obtains o(T) regret in various scenarios with a very large number of near-optimal arms [13, 35], e.g., in Bayesian bandits with K ≫ T arms and independent uniform priors. We focus on a more basic model, with only two arms and no contexts, where all these channels are ruled out.

Learning failures for the greedy algorithm are derived for bandit problems with 1-dimensional action spaces under (strong) structural assumptions: e.g., dynamic pricing with linear demands [30, 22] and dynamic control in a (generalized) linear model [42, 40]. In all these results, the failure probability is only proved to be positive, but not otherwise characterized.

Incentivized exploration takes a mechanism-design perspective on BSL, whereby the platform strives to incentivize individual agents to explore for the sake of the common good. In most of this work, starting from [41, 19], the platform controls the information flow, e.g., it can withhold history and instead issue recommendations, and uses this information asymmetry to create incentives; see [55] and [54, Ch. 11] for surveys.^5 In particular, [48, 34, 53] target stochastic bandits as the underlying learning problem, same as we do. In [34], the platform constructs a (very) particular communication network for the agents, and then the agents engage in BSL on this network.

Non-Bayesian models of behavior are prominent in social learning, starting from De Groot [21]: agents use variants of statistical inference and/or naive rules-of-thumb to infer the state of the world. In particular, our model of η-confident agents is essentially a special case of case-based decision theory [26]. Well-documented behavioral biases allowed by our model include: optimism and pessimism (e.g., [50] and [18, 12], resp., and references therein), risk aversion/risk seeking [36, 10], recency bias (e.g., [24] and references therein), randomized decisions (with theory tracing back to Luce [47]), and probability matching more specifically [49, 59].

Our multi-armed bandit perspective is very standard in machine learning theory: we consider asymptotic regret rates without time-discounting. The vast literature on regret-minimizing bandits is summarized in books [17, 54, 45]. Stochastic bandits is a standard, basic version with i.i.d. rewards and no auxiliary structure. Most relevant are the UCB1 algorithm [6], Thompson Sampling [57, 52] (particularly the frequentist analyses thereof [2, 4, 38]), and the lower-bound results [43, 7].

2 Our model and preliminaries

Our model, called Bandit Social Learning, is defined as follows. There are T rounds, where T ∈ ℕ is the time horizon, and two arms (i.e., alternative actions). We use [T] and [2] to denote the set of rounds and the set of arms, respectively.^6 In each round t ∈ [T], a new agent arrives, observes history hist_t (defined below), chooses an arm a_t ∈ [2], receives reward r_t ∈ [0, 1] for this arm, and leaves forever. When a given arm a ∈ [2] is chosen, its reward is drawn independently from a Bernoulli distribution with mean μ_a ∈ [0, 1].^7 The mean reward is fixed over time, but not known to the agents. Some initial data is available to all agents, namely N0 ≥ 1 samples of each arm a ∈ [2]. We denote them r^0_{a,i} ∈ [0, 1], for i ∈ [N0].

Footnote 5: Alternatively, the agents observe the full history, but the platform uses payments to create incentives [23, 29, 20].

Footnote 6: Throughout, we denote [n] = {1, 2, ..., n}, for any n ∈ ℕ.
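To fix ideas before the formal protocol, here is a minimal simulation sketch of the environment just described (our own illustration; the helper names make_instance and draw_initial_data, and all numeric values, are hypothetical, not from the paper). It draws a two-arm Bernoulli instance and the N0 initial samples, stored as the (arm, reward) pairs that every agent later observes.

```python
import numpy as np

def make_instance(mu1=0.6, mu2=0.5):
    """A two-arm Bernoulli instance; arm 0 plays the role of arm 1 (the good arm)."""
    return np.array([mu1, mu2])

def draw_initial_data(mu, n0, rng):
    """N0 initial samples of each arm, available to all agents.

    Returns the data as a list of (arm, reward) pairs, mirroring hist_t."""
    hist = []
    for arm in (0, 1):                              # arms are 0-indexed in code
        rewards = rng.binomial(1, mu[arm], size=n0)
        hist += [(arm, int(r)) for r in rewards]
    return hist

rng = np.random.default_rng(0)
mu = make_instance()
hist = draw_initial_data(mu, n0=5, rng=rng)
print(hist[:4])   # e.g., [(0, 1), (0, 0), (0, 1), (0, 1)]
```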
The history in round t consists of both the initial data and the data generated by the previous agents. Formally, it is a tuple of arm-reward pairs,

hist_t := ( (a, r^0_{a,i}) : a ∈ [2], i ∈ [N0];  (a_s, r_s) : s ∈ [t − 1] ).

We summarize the protocol for Bandit Social Learning as Protocol 1.

Protocol 1: Bandit Social Learning
  Problem instance: two arms a ∈ [2] with (fixed, but unknown) mean rewards μ1, μ2 ∈ [0, 1];
  Initialization: hist ← { N0 samples of each arm };
  for each round t = 1, 2, ..., T:
    agent t arrives, observes hist, and chooses an arm a_t ∈ [2];
    reward r_t ∈ [0, 1] is drawn from a Bernoulli distribution with mean μ_{a_t};
    the new datapoint (a_t, r_t) is added to hist.

Remark 2.1. The initial data points represent reports created outside our model, e.g., by ghost shoppers or influencers, and available before the products enter the market. One could interpret them as a simple frequentist representation of the agents' initial beliefs, with N0 as the beliefs' "strength". We posit N0 ≥ 1 to ensure that the arms' average rewards are always well-defined.

If the agents were controlled by an algorithm, this protocol would correspond to stochastic bandits with two arms, the most basic version of multi-armed bandits. A standard performance measure in multi-armed bandits (and online machine learning more generally) is regret, defined as

Regret(T) := μ* · T − E[ Σ_{t∈[T]} μ_{a_t} ],   (2.1)

where μ* = max(μ1, μ2) is the maximal expected reward of an arm.

Each agent t chooses its arm a_t myopically. Each agent is endowed with some (possibly randomized) mapping from histories to arms, and chooses an arm accordingly. This mapping, called the behavioral type, encapsulates how the agent resolves uncertainty about the rewards. More concretely, each agent maps the observed history hist_t to an index Ind_{a,t} ∈ ℝ for each arm a ∈ [2], and chooses an arm with a largest index. Ties are broken independently and uniformly at random.

We allow a range of myopic behaviors, whereby each index can take an arbitrary value in the (parameterized) confidence interval for the corresponding arm. Formally, fix arm a ∈ [2] and round t ∈ [T]. Let n_{a,t} denote the number of times this arm has been chosen in the history hist_t (including the initial data), and let μ̂_{a,t} denote the corresponding average reward. Given these samples, standard (frequentist, truncated) upper and lower confidence bounds for the arm's mean reward μ_a (UCB and LCB, for short) are defined as follows:

UCB^η_{a,t} := min{ 1, μ̂_{a,t} + √(η / n_{a,t}) }   and   LCB^η_{a,t} := max{ 0, μ̂_{a,t} − √(η / n_{a,t}) },   (2.2)

where η ≥ 0 is a parameter. The interval [LCB^η_{a,t}, UCB^η_{a,t}] will be referred to as the η-confidence interval. Standard concentration inequalities imply that μ_a is contained in this interval with probability at least 1 − 2e^{−2η} (where the probability is over the random rewards, for any fixed value of μ_a). We allow the index to take an arbitrary value in this interval:

Ind_{a,t} ∈ [ LCB^η_{a,t}, UCB^η_{a,t} ], for each arm a ∈ [2].   (2.3)

We refer to such agents as η-confident; η > 0 will be a crucial parameter throughout.

On special cases. Our model accommodates a number of behavioral biases. Most notably: unbiased agents, who set Ind_{a,t} = μ̂_{a,t}; η-optimistic agents, who set Ind_{a,t} = UCB^η_{a,t}; and η-pessimistic agents, who set Ind_{a,t} = LCB^η_{a,t}. Unbiased agents formally correspond to the greedy algorithm, whereas extreme optimism, i.e., η-optimism with η ∼ log(T), corresponds to the UCB1 algorithm [6].

Footnote 7: Our results on upper bounds (Section 4) and Bayesian learning failures (Section 6) allow each arm to have an arbitrary reward distribution on [0, 1]. We omit further mention of this to simplify presentation.
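For concreteness, here is a minimal simulation sketch of Protocol 1 together with the index rules of Eqs. (2.2)-(2.3), covering the special cases just named (our own illustration, not code from the paper; the function names run_bsl and index, and all numeric values, are ours). Any behavior value other than the three named ones yields an arbitrary η-confident index drawn from the confidence interval.

```python
import numpy as np

def index(n, mean, eta, behavior, rng):
    """An eta-confident index for one arm: some value in its eta-confidence interval."""
    radius = np.sqrt(eta / n)
    ucb = min(1.0, mean + radius)
    lcb = max(0.0, mean - radius)
    if behavior == "unbiased":        # greedy: index = sample average
        return mean
    if behavior == "optimistic":      # eta-optimistic: index = UCB
        return ucb
    if behavior == "pessimistic":     # eta-pessimistic: index = LCB
        return lcb
    return rng.uniform(lcb, ucb)      # any other eta-confident behavior

def run_bsl(mu, T, n0, eta, behavior, rng):
    """Protocol 1: each myopic agent chooses an arm with the largest index."""
    counts = np.full(2, n0, dtype=float)                                # samples per arm, incl. initial data
    sums = np.array([rng.binomial(n0, m) for m in mu], dtype=float)     # rewards from the initial samples
    pulls = np.zeros(2, dtype=int)                                      # choices made by the agents
    for _ in range(T):
        idx = [index(counts[a], sums[a] / counts[a], eta, behavior, rng) for a in (0, 1)]
        a = int(rng.integers(2)) if idx[0] == idx[1] else int(np.argmax(idx))   # break ties uniformly
        r = rng.binomial(1, mu[a])                                      # Bernoulli reward
        counts[a] += 1.0
        sums[a] += r
        pulls[a] += 1
    return pulls

rng = np.random.default_rng(1)
print(run_bsl(mu=[0.6, 0.5], T=2000, n0=5, eta=0.5, behavior="unbiased", rng=rng))
```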
Our model also allows a version of Thompson Sampling in which the posterior samples are truncated to the η-confidence interval.^8 More generally, we allow Bayesian agents that preprocess observations into a Bayesian posterior, and use the latter to define their indices. See the supplement for more details.

Preliminaries. When μ1, μ2 are fixed (not drawn from a prior), we posit μ1 > μ2, i.e., arm 1 is the good arm and arm 2 is the bad arm. Our guarantees depend on the quantity Δ := μ1 − μ2, called the gap (between the two arms). It is a very standard quantity for regret bounds in multi-armed bandits.

We use the big-O notation to hide constant factors. Specifically, O(X) and Ω(X) mean, resp., at most c0·X and at least c0·X for some absolute constant c0 > 0 that is not specified in the paper. When and if c0 depends on some other absolute constant c that we specify explicitly, we point this out in words and/or by writing, resp., O_c(X) and Ω_c(X). As usual, Θ(X) is a shorthand for "both O(X) and Ω(X)", and writing Θ_c(X) emphasizes the dependence on c.

Algorithms UCB1 and Thompson Sampling achieve regret

Regret(T) ≤ O( min(1/Δ, √T) · log T ).   (2.4)

This regret rate is essentially optimal among all bandit algorithms: it is optimal up to constant factors for fixed Δ > 0, and up to O(log T) factors for fixed T (see related work for citations). A key property of a reasonable bandit algorithm is that Regret(T)/T → 0; this property is also called no-regret. Conversely, algorithms with Regret(T) ≥ Ω(T) are considered very inefficient.

3 Learning failures

We are interested in learning failures when all but a few agents choose the bad arm. More precisely, we define the n-sampling failure as the event that all but at most n agents choose the bad arm. We make two technical assumptions:

mean rewards satisfy c < μ2 < μ1 < 1 − c for some absolute constant c ∈ (0, 1/2),   (3.1)

the number of initial samples satisfies N0 ≥ 64η/c² + 1/c.   (3.2)

The meaning of (3.1) is that it rules out degenerate behaviors when the mean rewards are close to the known upper/lower bounds. The big-O notation hides the dependence on the absolute constant c, when and if explicitly stated so. Assumption (3.2) ensures that the η-confidence interval is a proper subset of [0, 1] for all agents; we sidestep this assumption later in Theorem 3.9.

Our main result allows arbitrary η-confident agents and asserts that the 0-sampling failure happens with probability p_fail ≥ e^{−O(η)}. This is a stark failure when η is a constant relative to T.

Theorem 3.1 (η-confident agents). Suppose all agents are η-confident, for some fixed η ≥ 0. Make assumptions (3.1) and (3.2). Then the 0-sampling failure occurs with probability at least^9

p_fail = Ω_c( Δ + √(η/N0) ) · e^{−O_c( η + N0 Δ² )}, where Δ = μ1 − μ2.   (3.3)

Consequently, Regret(T) ≥ p_fail · Δ · T.

We emphasize generality: the agents can exhibit any behaviors consistent with η-confidence, possibly different for different agents and different arms. From the multi-armed bandit perspective, the theorem implies that bandit algorithms consistent with η-confidence cannot have regret sublinear in T. The guarantee in Theorem 3.1 deteriorates as the parameter η increases, and becomes vacuous when η ∼ log(T). This makes sense, as this regime of η is used in the UCB1 algorithm.

Discussion 3.2. Assumption (3.2) is innocuous from the social learning perspective: essentially, the agents hold initial beliefs grounded in data and these beliefs are not completely uninformed. From the bandit perspective, this assumption is more substantive, as an algorithm can always choose to discard data. In any case, we remove this assumption in Theorem 3.9 below.

Footnote 8: For η ∼ log T, this coincides with Thompson Sampling with very high probability.

Footnote 9: Throughout the paper, we use the notation O_c to hide the dependence on the absolute constant c.
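As a quick numerical sanity check of this scaling (an illustration of ours, not an experiment from the paper), one can estimate the 0-sampling-failure frequency by Monte Carlo simulation. The sketch below assumes the hypothetical run_bsl simulator from the Section 2 sketch is in scope; it uses η-optimistic agents, which are in particular η-confident, and all instance parameters are arbitrary choices.

```python
import numpy as np

def failure_rate(eta, trials=2000, T=300, n0=10, mu=(0.6, 0.4), seed=0):
    """Fraction of simulated runs in which no agent ever chooses the good arm (arm 0)."""
    rng = np.random.default_rng(seed)
    fails = 0
    for _ in range(trials):
        pulls = run_bsl(mu=list(mu), T=T, n0=n0, eta=eta,
                        behavior="optimistic", rng=rng)     # run_bsl: see the Section 2 sketch
        fails += int(pulls[0] == 0)                         # 0-sampling failure
    return fails / trials

for eta in (0.0, 0.25, 0.5, 1.0):
    # The empirical failure frequency should shrink as eta grows, roughly like e^{-O(eta)}.
    print(f"eta={eta}: failure rate ~ {failure_rate(eta):.4f}")
```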
Remark 3.3. A weaker version of (3.2), namely N0 ≥ η, is necessary to guarantee an n-sampling failure for arbitrary η-confident agents. Indeed, suppose all agents are η-optimistic for arm 1 (the good arm), and η-pessimistic for arm 2 (the bad arm). If N0 < η, then the index of arm 2 is 0 after the initial samples, whereas the index of arm 1 is always positive. Then all agents choose arm 1.

Next, we spell out two corollaries which help elucidate the main result.

Corollary 3.4. If the gap is sufficiently small, Δ < O(1/√N0), then Theorem 3.1 holds with

p_fail = Ω_c( Δ + √(η/N0) ) · e^{−O_c(η)}.   (3.4)

Remark 3.5. The assumption in Corollary 3.4 is quite mild in light of the fact that when Δ > Ω(√(log(T)/N0)), the initial samples suffice to determine the best arm with high probability.

Corollary 3.6. If all agents are unbiased, then Theorem 3.1 holds with η = 0 and

p_fail = Ω_c(Δ) · e^{−O_c(N0 Δ²)}   (3.5)
       = Ω_c(Δ)   if Δ < O(1/√N0).

Remark 3.7. A trivial failure result for unbiased agents relies on the event E that all initial samples of arm 1 (i.e., the good arm) are realized as 0. This would indeed imply a 0-sampling failure (as long as at least one initial sample of arm 2 is realized as 1), but the event E happens with probability exponentially small in N0, the number of initial samples. In contrast, in our result p_fail depends on N0 only through the assumption that Δ < O(1/√N0).

Discussion 3.8. Corollary 3.6 can be seen as a general result on the failure of the greedy algorithm. This is the first such result with a non-trivial dependence on N0, to the best of our knowledge.

We can remove assumption (3.2) and allow a small N0 if the behavioral type of each agent t also satisfies natural (and very mild) properties of symmetry and monotonicity:

(P1) (symmetry) if all rewards in hist_t are 0, the two arms are treated symmetrically;^10

(P2) (monotonicity) Fix any arm a ∈ [2], any t-round history H in which all rewards are 0 for both arms, and any other t-round history H' that contains the same number of samples of arm a such that all these samples have reward 1. Then

Pr[ a_t = a | hist_t = H ] ≤ Pr[ a_t = a | hist_t = H' ].   (3.6)

Note that both properties would still be natural and mild even without the "all rewards are zero" clause. The resulting guarantee on the failure probability is somewhat cleaner.

Theorem 3.9 (small N0). Fix η ≥ 0, assume Eq. (3.1), and let N0 ∈ [1, N], where N := ⌈64η/c² + 1/c⌉. Suppose each agent t is η-confident and satisfies properties (P1) and (P2). Then an n-sampling failure, with n = N − N0, occurs with probability at least

p_fail = Ω_c( Δ · c^{2N} ) = Ω_c( Δ · e^{−O_c(η)} ).   (3.7)

Consequently, Regret(T) ≥ p_fail · Δ · (T − n).

If all agents are pessimistic, we find that any level of pessimism, whether small or large or different across agents, leads to a 0-sampling failure with probability Ω_c(Δ), matching Corollary 3.6 for the unbiased behavior. This happens in the (very reasonable) regime when

Ω_c(η) < N0 < O(1/Δ²).   (3.8)

Theorem 3.10 (pessimistic agents). Suppose each agent t ∈ [T] is η_t-pessimistic, for some η_t ≥ 0. Suppose assumptions (3.1) and (3.2) hold for η = max_{t∈[T]} η_t. Then the 0-sampling failure occurs with probability lower-bounded by Eq. (3.5). Consequently, Regret(T) ≥ Ω_c(Δ² T) · e^{−O_c(N0 Δ²)}.

We allow extremely pessimistic agents (η_t ∼ log T), and the pessimism levels η_t can vary across agents t. While the relevant parameter is η = max_{t∈[T]} η_t, the failure probability in (3.5) does not contain the e^{−η} term. In particular, p_fail = Ω(Δ) when N0 < O(1/Δ²). However, the dependence on η creeps in through assumption (3.2), i.e., that N0 > Ω_c(η).

Footnote 10: That is, the behavioral type stays the same if the arms' labels are switched.
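To make these bounds concrete, consider a hypothetical instance with μ1 = 0.6 and μ2 = 0.5 (the numbers are ours, and the constants hidden in Ω_c(·), O_c(·) are suppressed). Assumption (3.1) holds with c = 0.3 and the gap is Δ = 0.1. If N0 ≤ 1/Δ² = 100, Corollary 3.6 for unbiased agents, and Theorem 3.10 for pessimistic agents with any η_t (provided N0 also satisfies (3.2) for η = max_t η_t), give

p_fail = Ω_c(Δ) · e^{−O_c(N0 Δ²)} = Ω_c(0.1),   hence   Regret(T) ≥ p_fail · Δ · T = Ω_c(T/100),

i.e., a constant fraction of the worst possible regret Δ·T, whereas UCB1 would achieve O(log(T)/Δ) ≈ O(10 · log T) on the same instance.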
Proof overview. We first show that the average reward of arm 1 (the good arm) is upper-bounded by some threshold q1. This is only guaranteed with some probability, and only when this arm is sampled exactly N times, for a particular N ≥ N0. Next, we lower-bound the average reward of arm 2 (the bad arm): we show that with some probability it is always above some threshold q2 ∈ (q1, μ2). Focus on the round t when the good arm is sampled for the N-th time (if this ever happens). If both of these events hold, then from round t onwards the bad arm has a larger average reward, by a constant margin q2 − q1. Consequently, as we prove, the bad arm has a larger index, and therefore gets chosen.

The details of this argument differ from one theorem to another. For Theorem 3.1, it suffices to set the thresholds q1, q2 such that q2 − q1 = Θ(√(η/N0)). For Theorem 3.10, we use a more involved argument: since the LCB of an arm increases when it is played, playing this arm only strengthens the preference of pessimistic agents for this arm. We are therefore less constrained in the choice of q1, q2, and we can prove that a learning failure occurs whenever q1 < q2.^11 In both proofs, we also require q1 and q2 to be close to μ1 and μ2, resp., so as to lower-bound the probability of the two desirable events.

For Theorem 3.9, the case of small N0, our analysis becomes more subtle. We can (in some sense) simplify the two events defined above, but we need to introduce a third event: if arm 1 is chosen by at least n agents (for a suitably defined n), then arm 2 is chosen by n agents before arm 1 is. The crux is a "deterministic" argument which derives a failure when all three events hold jointly.

To formalize, we represent the realized rewards of each arm a as if written out in advance on a "tape", where each entry is an independent Bernoulli draw with mean μ_a.^12 The i-th entry is returned as the reward when and if arm a is chosen for the i-th time. (We start counting from the initial samples, which comprise entries i ∈ [N0].) We analyze each arm separately (and then invoke independence). We use two tools from probability: a sharp anti-concentration inequality for arm 1 and a martingale argument for arm 2. Let (X_i)_{i∈ℕ} be a sequence of i.i.d. Bernoulli random variables with mean p ∈ [c, 1 − c], for some absolute constant c ∈ (0, 1/2). The anti-concentration inequality is as follows:

∀ n ≥ 1/c, q ∈ (c/8, p):   Pr[ (1/n) Σ_{i=1}^{n} X_i ≤ q ] ≥ Ω( e^{−O( n (p − q)² )} ).   (3.9)

The martingale argument leads to the following:

∀ q ∈ [0, p):   Pr[ ∀ n ≥ 1: (1/n) Σ_{i=1}^{n} X_i ≥ q ] ≥ Ω_c(p − q).   (3.10)

We apply each tool to the tape of the respective arm, and lower-bound the probability of the corresponding desirable event. While the novelty is mainly in how we use these tools, the tools themselves are not very standard. Eq. (3.9) follows from the anti-concentration inequality in [61] and a reverse Pinsker inequality in [28]. More standard anti-concentration results via Stirling's approximation lead to an additional factor of 1/√n on the right-hand side of (3.9). For Eq. (3.10), we introduce an exponential martingale and relate the event in Eq. (3.10) to a deviation of this martingale. We then use Ville's inequality (a version of Doob's martingale inequality) to bound the probability that this deviation occurs.

Footnote 11: We also require q1 > √(η/N0) to ensure that the confidence lower bounds are not truncated to zero.

Footnote 12: This is an equivalent (and well-known) representation of rewards in stochastic bandits.
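Both failure events are easy to probe numerically via the tape representation. The sketch below (our own illustration; the thresholds q1, q2 and all other parameters are arbitrary choices with q1 < q2 < μ2) estimates the probability that both events hold, which, by the argument above, lower-bounds the 0-sampling-failure probability for unbiased agents.

```python
import numpy as np

def tape_failure_prob(mu1=0.6, mu2=0.5, n0=30, T=5000, q1=0.45, q2=0.48,
                      trials=2000, seed=0):
    """Estimate Pr[Fail1 and Fail2] under the reward-tape representation.

    Fail1: the average of arm 1's first n0 tape entries is at most q1.
    Fail2: every prefix average of arm 2's tape stays at least q2.
    The events are independent, as they concern different tapes; their joint
    probability lower-bounds the 0-sampling-failure probability for unbiased agents."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        tape1 = rng.binomial(1, mu1, size=n0)          # only the initial samples matter for Fail1
        tape2 = rng.binomial(1, mu2, size=n0 + T)      # arm 2 may be sampled in every round
        fail1 = tape1.mean() <= q1
        prefix_avg = np.cumsum(tape2) / np.arange(1, n0 + T + 1)
        fail2 = bool(np.all(prefix_avg >= q2))
        hits += int(fail1 and fail2)
    return hits / trials

print(tape_failure_prob())
```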
4 Upper bounds for optimistic agents

We upper-bound regret for optimistic agents: we match the exponential-in-η scaling from Corollary 3.4 and then extend this result to different behavioral types. On a technical level, we prove three regret bounds of the same shape (4.1), but with a different Φ term. (The unified presentation emphasizes this similarity.) Throughout, Δ = μ1 − μ2 denotes the gap between the two arms.

Theorem 4.1. Suppose all agents are η-optimistic, for some fixed η > 0. Then, letting Φ = η,

Regret(T) ≤ O( T · e^{−Ω(η)} · (1 + log(1/Δ)) + Φ/Δ ).   (4.1)

Discussion 4.2. The main take-away is that the exponential-in-η scaling from Corollary 3.4 is tight for η-optimistic agents, and is therefore the best possible lower bound for η-confident agents. This result holds for any given N0, the number of initial samples.^13 Our guarantee remains optimal in the extreme-optimism regime η ∼ log(T), matching the optimal regret rate O(log(T)/Δ).

What if different agents can hold different behavioral types? First, let us allow agents to have varying amounts of optimism, possibly different across arms and possibly randomized.

Definition 4.3. Fix η_max ≥ η > 0. An agent t ∈ [T] is called [η, η_max]-optimistic if its index Ind_{a,t} lies in the interval [ UCB^η_{a,t}, UCB^{η_max}_{a,t} ], for each arm a ∈ [2].

We show that the guarantee in Theorem 4.1 is robust to varying the optimism level "upwards".

Theorem 4.4 (robustness). Fix η_max ≥ η > 0. Suppose all agents are [η, η_max]-optimistic. Then regret bound (4.1) holds with Φ = η_max.

Note that the upper bound η_max has only a mild influence on the regret bound in Theorem 4.4. Our most general result only requires a small fraction of agents to be optimistic, whereas all agents are only required to be η_max-confident (allowing all behaviors consistent with that).

Theorem 4.5 (recurring optimism). Fix η_max ≥ η > 0. Suppose all agents are η_max-confident. Further, suppose each agent's behavioral type is chosen independently at random so that the agent is [η, η_max]-optimistic with probability at least q > 0. Then regret bound (4.1) holds with Φ = η_max/q.

Thus, with even a small fraction of optimists, q > 1/o(T), the behavioral type of the less optimistic agents does not have much impact on regret. In particular, it does not hurt much if they become very pessimistic. A small fraction of optimists goes a long way! Further, a small-but-constant fraction of extreme optimists, i.e., η, η_max ∼ log(T) in Theorem 4.5, yields the optimal regret rate, O(log(T)/Δ).

Footnote 13: For ease of exposition, we do not track the improvements in regret when N0 becomes larger.

5 Learning failures for Bayesian agents

In this section, we posit that agents are endowed with Bayesian beliefs. The basic version is that all agents believe that the mean reward of each arm is initially drawn from a uniform distribution on [0, 1]. (We emphasize that the mean rewards are fixed and not actually drawn according to these beliefs.) Each agent t computes a posterior P_{a,t} for μ_a given the history hist_t, for each arm a ∈ [2], and maps this posterior to the index Ind_{a,t} for this arm.^14 The basic behavior is that Ind_{a,t} is the posterior mean reward, E[P_{a,t}]. We call such agents Bayes-unbiased.
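As an illustration of the Bayes-unbiased index (ours, not the authors' code), the posterior for a Bernoulli arm under a Beta(α, β) prior is again a Beta distribution, so the index has a closed form. The sketch below consumes the same (arm, reward) history representation used in the earlier sketches.

```python
import numpy as np

def bayes_unbiased_index(successes, n, alpha=1.0, beta=1.0):
    """Posterior mean of a Bernoulli arm under a Beta(alpha, beta) prior.

    After s successes in n samples the posterior is Beta(alpha + s, beta + n - s);
    with the uniform prior Beta(1, 1) the posterior mean is (s + 1) / (n + 2)."""
    return (alpha + successes) / (alpha + beta + n)

def choose_arm(hist, rng, alpha=1.0, beta=1.0):
    """A Bayes-unbiased agent: pick an arm with the largest posterior mean.

    `hist` is a list of (arm, reward) pairs, as in Protocol 1."""
    idx = []
    for arm in (0, 1):
        rewards = [r for (a, r) in hist if a == arm]
        idx.append(bayes_unbiased_index(sum(rewards), len(rewards), alpha, beta))
    if idx[0] == idx[1]:
        return int(rng.integers(2))     # break ties uniformly at random
    return int(np.argmax(idx))

rng = np.random.default_rng(0)
print(choose_arm([(0, 1), (0, 0), (1, 1)], rng))  # posterior means 0.5 vs 0.667, so arm 1 is chosen
```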
Further, we consider a Bayesian version of η-confident agents, defined by Inda,t [ Qa,t(ζ), Qa,t(1 ζ) ] for each arm a [2], (5.1) where Qa,t( ) denotes the quantile function of the posterior Pa,t and ζ (0, 1/2) is a fixed parameter (analogous to η elsewhere). The interval in Eq. (5.1) is a Bayesian version of η-confidence intervals. Agents t that satisfy Eq. (5.1) are called ζ-Bayes-confident. We allow more general beliefs given by independent Beta distributions. For each arm a [2], all agents believe that the mean reward µa is initially drawn as an independent sample from Beta distribution with parameters αa, βa N. Our results are driven by parameter M = maxa [2] αa+βa. We refer to such beliefs as Beta-beliefs with strength M. The intuition is that the prior on each arm a can be interpreted as being based on αa + βa 2 samples from this arm.15 Our technical contribution here is that Bayes-unbiased (resp., ζ-Bayes-confident) agents are ηconfident for a suitably large η, and therefore subject to the learning failure in Theorem 3.1. Theorem 5.1. Consider a Bayesian agent that holds Beta-beliefs with strength M 1. (a) If the agent is Bayes-unbiased, then it is η-confident for some η = O(M/ N0). (b) If the agent is ζ-Bayes-confident, then it is η-confident for η = O M/ N0 + ln(1/ζ) . Discussion 5.2. Beta-beliefs may be completely unrelated to the actual mean rewards. If ζ and M are constants relative to T, the resulting η is constant, too. Our guarantee is stronger if the beliefs are weak (i.e., M is small) or are dominated by the initial samples, in the sense that N0 > Ω(M 2). Discussion 5.3. ζ-Bayes-confident agents subsume Bayesian version of optimism and pessimism, where the index Inda,t is defined as, resp., Qa,t(1 ζ) and Qa,t(ζ), as well as all the Bayesian versions of all other behaviorial biases discussed previously as special cases of η-confidence. 6 Bayesian model with arbitrary priors We consider Bayesian-unbiased agents in a fully Bayesian model such that the mean rewards are actually drawn from a prior. We are interested in Bayesian probability and Bayesian regret, i.e., resp., 14Note that the Bayesian update for agent t does not depend on the beliefs of the previous agents. 15More precisely, any Beta distribution with integer parameters (α, β) can be seen as a Bayesian posterior obtained by updating a uniform prior on [0, 1] with α + β 2 data points. probability and regret in expectation over the prior. We focus on learning failures when the agents never choose an arm with the largest prior mean reward (as opposed to an arm with the largest realized mean reward, which is not necessarily the same arm). We do not explicitly allow initial samples (i.e., we posit N0 = 0 here), because they are implicitly included in the prior. Compared to Section 5, the benefit is that we allow arbitrary priors, possibly correlated across the two arms. Further, our guarantee does not depend on the prior, other than through the prior gap E[µ1 µ2], and does not contain any hidden constants. On the other hand, the guarantees here are only in expectation over the prior, whereas the ones in Section 5 hold for fixed µ1, µ2. Also, our result here is restricted to Bayesian-unbiased agents. Theorem 6.1. Suppose the pair (µ1, µ2) is initially drawn from some Bayesian prior P such that E[µ1] > E[µ2]. Assume that all agents are Bayesian-unbiased, with beliefs given by P. Then with Bayesian probability at least E[µ1 µ2], the agents never choose arm 2. Proof. 
W.l.o.g., assume that agents break ties in favor of arm 2. In each round t, the key quantity is Z_t = E[μ1 − μ2 | hist_t]. Indeed, arm 2 is chosen if and only if Z_t ≤ 0. Let τ be the first round when arm 2 is chosen, or T + 1 if this never happens. We use martingale techniques to prove that

E[Z_τ] = E[μ1 − μ2].   (6.1)

We use the optional stopping theorem (OST). We observe that τ is a stopping time relative to H = (hist_t : t ∈ [T + 1]), and (Z_t : t ∈ [T + 1]) is a martingale relative to H.^16 OST asserts that E[Z_τ] = E[Z_1] for any martingale Z_t and any bounded stopping time τ. Eq. (6.1) follows because E[Z_1] = E[μ1 − μ2]. On the other hand, by the law of total expectation,

E[Z_τ] = Pr[τ ≤ T] · E[Z_τ | τ ≤ T] + Pr[τ > T] · E[Z_τ | τ > T].   (6.2)

Recall that τ ≤ T implies that arm 2 is chosen in round τ, which in turn implies that Z_τ ≤ 0. It follows that E[Z_τ | τ ≤ T] ≤ 0. Plugging this into Eq. (6.2), and using E[Z_τ | τ > T] ≤ 1 (since the mean rewards lie in [0, 1]), we find that

E[μ1 − μ2] = E[Z_τ] ≤ Pr[τ > T] = Pr[arm 2 never chosen].

As a corollary, we derive a 0-sampling failure, leading to Ω(T) Bayesian regret. Specifically, the agents start out playing arm 1 (because it has a higher prior mean reward), and never try arm 2 when it is in fact the best arm. This happens whenever the prior is independent across arms and has a positive density on the entire [0, 1] interval (see the supplement for the exact statement). Note that this is a (much) more general family of priors than the independent Beta-priors allowed in Section 5.

7 Conclusions and open questions

We examine the dynamics of social learning in a multi-armed bandit scenario, where agents sequentially choose arms, receive rewards, and observe the full history of the previous agents. For a range of myopic agent behaviors, we investigate the impact on exploration, and provide tight upper and lower bounds on the learning-failure probabilities and regret rates. In particular, we obtain the first general results on the failure of the greedy algorithm in bandits.

With our results as a departure point, one could study BSL in more complex bandit models with many arms and/or some known structure of rewards that the agents' myopic behaviour would account for.^17 The greedy algorithm fails for some structures (e.g., our current model) and works well for some others (e.g., for linear contextual bandits with smoothed contexts [37, 11, 51], or when all arms have the same rewards). A whole world lies in between these two extremes. It is not at all clear which structures would cause learning failures and which would enable learning, and which structures would be amenable to analysis, one way or another.

8 Acknowledgements

This work is partially supported by DARPA QuICC and by NSF AF:Small grants #2218678 and #2114269.

Footnote 16: The latter follows from a general fact: the sequence E[X | hist_t], t ∈ [T + 1], is a martingale w.r.t. H for any random variable X with E[|X|] < ∞. It is known as the Doob martingale for X.

Footnote 17: E.g., linear, convex, Lipschitz, and combinatorial structures are well-studied; see books [17, 54, 45].

References

[1] D. Acemoglu, A. Makhdoumi, A. Malekian, and A. Ozdaglar. Learning From Reviews: The Selection Effect and the Speed of Learning. Econometrica, 2022. Working paper available since 2017. [2] S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. In 25th Conf. on Learning Theory (COLT), 2012. [3] S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. In Conference on Learning Theory, JMLR Workshop and Conference Proceedings, 2012. [4] S. Agrawal and N. Goyal.
Near-optimal regret bounds for thompson sampling. J. of the ACM, 64(5):30:1 30:24, 2017. Preliminary version in AISTATS 2013. [5] J. Audibert and S. Bubeck. Regret Bounds and Minimax Policies under Partial Monitoring. J. of Machine Learning Research (JMLR), 11:2785 2836, 2010. Preliminary version in COLT 2009. [6] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235 256, 2002. [7] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48 77, 2002. Preliminary version in 36th IEEE FOCS, 1995. [8] V. Bala and S. Goyal. Learning from neighbours. Review of Economic Studies, 65(3):595 621, 1998. [9] A. V. Banerjee. A simple model of herd behavior. Quarterly Journal of Economics, 107: 797 817, 1992. [10] N. Barberis and R. Thaler. A survey of behavioral finance. In Handbook of the Economics of Finance, volume 1, pages 1053 1128. Elsevier, 2003. [11] H. Bastani, M. Bayati, and K. Khosravi. Mostly exploration-free algorithms for contextual bandits. Management Science, 67(3):1329 1349, 2021. Working paper available on arxiv.org since 2017. [12] M. Bateson. Optimistic and pessimistic biases: a primer for behavioural ecologists. Current Opinion in Behavioral Sciences, 12:115 121, 2016. [13] M. Bayati, N. Hamidi, R. Johari, and K. Khosravi. Unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. In 33rd Advances in Neural Information Processing Systems (Neur IPS), 2020. [14] S. Bikhchandani, D. Hirshleifer, and I. Welch. A Theory of Fads, Fashion, Custom, and Cultural Change as Informational Cascades. Journal of Political Economy, 100(5):992 1026, 1992. [15] A. Bohren and D. N. Hauser. Learning with Heterogeneous Misspecified Models: Characterization and Robustness. Econometrica, 89(6):3025 3077, Nov 2021. [16] P. Bolton and C. Harris. Strategic Experimentation. Econometrica, 67(2):349 374, 1999. [17] S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. Published with Now Publishers (Boston, MA, USA). Also available at https://arxiv.org/abs/1204.5721. [18] E. C. Chang, editor. Optimism and pessimism: Implications for theory, research, and practice. American Psychological Association, 2000. See https://www.apa.org/pubs/books/431754A for the table of contents. [19] Y.-K. Che and J. Hörner. Recommender systems as mechanisms for social learning. Quarterly Journal of Economics, 133(2):871 925, 2018. Working paper since 2013, titled Optimal design for social learning . [20] B. Chen, P. I. Frazier, and D. Kempe. Incentivizing exploration by heterogeneous users. In Conf. on Learning Theory (COLT), pages 798 818, 2018. [21] M. H. De Groot. Reaching a Consensus. Journal of the American Statistical Association, 69 (345):118 121, 1974. [22] A. V. den Boer and B. Zwart. Simultaneously learning and optimizing using controlled variance pricing. Management Science, 60(3):770 783, 2014. [23] P. Frazier, D. Kempe, J. M. Kleinberg, and R. Kleinberg. Incentivizing exploration. In ACM Conf. on Economics and Computation (ACM-EC), pages 5 22, 2014. [24] D. Fudenberg and D. K. Levine. Learning with recency bias. Proceedings of the National Academy of Sciences, 111:10826 10829, 2014. [25] D. Fudenberg, G. Lanzani, and P. Strack. Limit Points of Endogenous Misspecified Learning. Econometrica, 89(3):1065 1098, May 2021. [26] I. Gilboa and D. 
Schmeidler. Case-Based Decision Theory. The Quarterly Journal of Economics, 110(3):605 639, 08 1995. [27] B. Golub and E. D. Sadler. Learning in social networks. In Y. Bramoullé, A. Galeotti, and B. Rogers, editors, The Oxford Handbook of the Economics of Networks. Oxford University Press, 2016. [28] F. Götze, H. Sambale, and A. Sinulis. Higher order concentration for functions of weakly dependent random variables. Electronic Journal of Probability, 24:1 19, 2019. [29] L. Han, D. Kempe, and R. Qiang. Incentivizing exploration with heterogeneous value of money. In 11th Intl. Conf. on Web and Internet Economics (WINE), pages 370 383, 2015. [30] J. M. Harrison, N. B. Keskin, and A. Zeevi. Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution. Management Science, 58(3):570 586, 2012. [31] P. Heidhues, B. Koszegi, and P. Strack. Unrealistic expectations and misguided learning. Econometrica, 86(4):1159 1214, 2018. [32] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13 30, 1963. [33] J. Hörner and A. Skrzypacz. Learning, experimentation, and information design. In B. Honoré, A. Pakes, M. Piazzesi, and L. Samuelson, editors, Advances in Economics and Econometrics: 11th World Congress, volume 1, page 63 98. Cambridge University Press, 2017. [34] N. Immorlica, J. Mao, A. Slivkins, and S. Wu. Incentivizing exploration with selective data disclosure. In ACM Conf. on Economics and Computation (ACM-EC), pages 647 648, 2020. Working paper available at https://arxiv.org/abs/1811.06026. [35] M. Jedor, J. Louëdec, and V. Perchet. Be greedy in multi-armed bandits, 2021. Working paper, available on https://arxiv.org/abs/2101.01086. [36] D. Kahneman and A. Tversky. Judgment under uncertainty: Heuristics and biases. Cambridge University Press, 1982. [37] S. Kannan, J. Morgenstern, A. Roth, B. Waggoner, and Z. S. Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems (NIPS), pages 2231 2241, 2018. [38] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In 23rd Intl. Conf. on Algorithmic Learning Theory (ALT), pages 199 213, 2012. [39] G. Keller, S. Rady, and M. Cripps. Strategic Experimentation with Exponential Bandits. Econometrica, 73(1):39 68, 2005. [40] N. B. Keskin and A. Zeevi. On incomplete learning and certainty-equivalence control. Oper. Res., 66(4):1136 1167, 2018. [41] I. Kremer, Y. Mansour, and M. Perry. Implementing the wisdom of the crowd". J. of Political Economy, 122(5):988 1012, 2014. Preliminary version in ACM EC 2013. [42] T. Lai and H. Robbins. Iterated least squares in multiperiod control. Advances in Applied Mathematics, 3(1):50 73, 1982. [43] T. L. Lai and H. Robbins. Asymptotically efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6:4 22, 1985. [44] G. Lanzani. Dynamic Concern for Misspecification, 2023. Working paper. [45] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020. [46] D. Lazer and A. Friedman. The network structure of exploration and exploitation. Administrative Science Quarterly, 52:667 694, 2007. [47] R. D. Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, New York, NY, USA, 1959. [48] Y. Mansour, A. Slivkins, and V. Syrgkanis. Bayesian incentive-compatible bandit exploration. Operations Research, 68(4):1132 1161, 2020. 
Preliminary version in ACM EC 2015. [49] J. L. Myers. Probability learning and sequence learning. In W. Estes, editor, Handbook of Learning and Cognitive Processes: Approaches to Human Learning and Motivation, pages 171 205. Erlbaum, Hillsdale, NJ, 1976. [50] M. Puri and D. T. Robinson. Optimism and economic choice. Journal of Financial Economics, 86(1):71 99, 2007. [51] M. Raghavan, A. Slivkins, J. W. Vaughan, and Z. S. Wu. Greedy algorithm almost dominates in smoothed contextual bandits. SIAM J. on Computing (SICOMP), 52(2):487 524, 2023. Preliminary version at COLT 2018. Working paper available at arxiv.org/abs/2005.10624. [52] D. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on thompson sampling. Foundations and Trends in Machine Learning, 11(1):1 96, 2018. Published with Now Publishers (Boston, MA, USA). Also available at https://arxiv.org/abs/1707.02038. [53] M. Sellke and A. Slivkins. The price of incentivizing exploration: A characterization via thompson sampling and sample complexity. Operations Research, 71(5), 2022. Preliminary version in ACM EC 2021. [54] A. Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1 286, Nov. 2019. Published with Now Publishers (Boston, MA, USA). Also available at https://arxiv.org/abs/1904.07272. Latest online revision: Jan 2022. [55] A. Slivkins. Exploration and persuasion. In F. Echenique, N. Immorlica, and V. Vazirani, editors, Online and Matching-Based Market Design. Cambridge University Press, 2023. Also available at http://slivkins.com/work/Expl Pers.pdf. [56] L. Smith and P. Sørensen. Pathological outcomes of observational learning. Econometrica, 68: 371 398, 2000. [57] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. [58] J. Ville. Etude critique de la notion de collectif. Bull. Amer. Math. Soc, 45(11):824, 1939. [59] N. Vulkan. An economist s perspective on probability matching. Journal of Economic Surveys, 14:101 118, 2000. [60] I. Welch. Sequential sales, learning, and cascades. The Journal of finance, 47:695 732, 1992. [61] A. R. Zhang and Y. Zhou. On the non-asymptotic and sharp lower tail bounds of random variables. Stat, 9(1):e314, 2020. SUPPLEMENTARY MATERIALS FOR BAYESIAN BANDIT SOCIAL LEARNING UNDER MYOPIC BEHAVIOR The supplement contains the proofs, as well as detailed discussion of related work on social learning. A Related Work on Social Learning 15 B Preliminaries: Reward-Tape 15 C Proofs from Section 3: Learning Failures 16 D Proofs from Section 4: Upper Bounds for Optimistic Agents 22 E Proofs from Section 5: Learning Failures for Bayesian Agents 25 F The corollary from Section 6 26 A Related Work on Social Learning A vast literature on social learning studies agents that learn over time in a shared environment. A prominent topic is the presence or absence of learning failures such as ours. Models vary across several dimensions, such as: which information is acquired or transmitted, what is the communication network, whether agents are long-lived or only act once, how they choose their actions, etc. Below we discuss several lines of work that are most relevant. In sequential social learning , starting from [9, 60, 14, 56], agents observe private signals, but only the chosen actions are observable in the future; see Golub and Sadler [27] for a survey. 
The social planner (who chooses agents actions given access to the knowledge of all previous agents) only needs to exploit, i.e., choose the best action given the previous agents signals, whereas in our model it also needs to explore. Learning failures are (also) of primary interest, but they occur for an entirely different reason: restricted information flow, i.e., the fact that the private signals are not observable in the future. Strategic experimentation , starting from Bolton and Harris [16] and Keller et al. [39], studies long-lived learning agents that observe both actions and rewards of one another; see Hörner and Skrzypacz [33] for a survey. Here, the social planner also solves a version of multi-armed bandits, albeit a very different one (with time-discounting, safe arm that is completely known, and risky arm that follows a stochastic process). The main difference is that the agents engage in a complex repeated game where they explore but prefer to free-ride on exploration by others. Bala and Goyal [8] and Lazer and Friedman [46] consider a network of myopic learners, all faced with the same bandit problem and observing each other s actions and rewards. The interaction protocol is very different from ours: agents are long-lived, act all at once, and only observe their neighbors on the network. Other specifics are different, too. Bala and Goyal [8] makes strong assumptions on learners beliefs, which would essentially cause the greedy algorithm to work well in BSL. In Lazer and Friedman [46], each learner only retains the best observed action, rather than the full history. Both papers study social learning under different network topologies. Prominent recent work, e.g., [31, 15, 25, 44], targets agents with misspecified beliefs, i.e., beliefs whose support does not include the correct model. The framing is similar to BSL with Bayesianunbiased agents: agents arrive one by one and face the same decision problem, whereby each agent makes a rational decision after observing the outcomes of the previous agents.18 Rational decisions under misspecified beliefs make a big difference compared to BSL, and structural assumptions about rewards/observations and the state space tend to be very different from ours. The technical questions being asked tend to be different, too. E.g., convergence of beliefs is of primary interest, whereas the chosen arms and agents beliefs/estimates trivially converge in our setting. 19 B Preliminaries: Reward-Tape It is convenient for our analyses to interpret the realized rewards of each arm as if they are written out in advance on a tape . We posit a matrix Tapea,i [0, 1] : a [2], i [T] , called reward-tape, such that each entry Tapea,i is an independent Bernoulli draw with mean µa. This entry is returned as reward when and if arm a is chosen for the i-th time. (We start counting from the initial samples, which comprise entries i [N0].) This is an equivalent (and well-known) representation of rewards in stochastic bandits. We will use the notation for the UCBs/LCBs defined by the reward-tape. Fix arm a [2] and n [T]. Let bµtape a,n = 1 i [n] Tapea,i be the average over the first n entries for arm a. Now, given η 0, define the appropriate confidence bounds: UCBtape, η a,n := min n 1, bµtape a,n + p η/n o and LCBtape, η a,n := max n 0, bµtape a,n p η/n o . (B.1) 18The original framing in this work posits a single learner that makes (possibly) myopic decisions over time and observes their outcomes. 
An alternative interpretation is that each decision is made by a new myopic agent who observes the history. 19Essentially, if an arm is chosen infinitely often then the agents beliefs/estimates converge on its true mean reward; else, the agents eventually stop receiving any new information about this arm. C Proofs from Section 3: Learning Failures Our proofs rely on two tools from Probability (proved in Section C.4 and C.5): a sharp anticoncentration inequality for Binomial distribution and a lemma that encapsulates a martingale argument. Lemma C.1 (anti-concentration). Let (Xi)i N be a sequence of independent Bernoulli random variables with mean p [c, 1 c], for some c (0, 1/2) interpreted as an absolute constant. Then ( n 1/c, q (c/8, p) ) Pr 1 n Pn i=1 Xi q Ω( e O( n(p q)2 ) ), (C.1) where Ω( ) and O( ) hide the dependence on c. Lemma C.2 (martingale argument). In the setting of Lemma C.1, q [0, p) Pr n 1 : 1 n Pn i=1 Xi q Ωc(p q). (C.2) The overall argument will be as follows. We will use Lemma C.1 to upper-bound the average reward of arm 1, i.e., the good arm, by some threshold q1. This upper bound will only be guaranteed to hold when this arm is sampled exactly N times, for a particular N N0. Lemma C.2 will allow us to uniformly lower-bound the average reward of arm 2, i.e., the bad arm, by some threshold q2 (q1, µ2). Focus on the round t when the good arm is sampled for the N-th time (if this ever happens). If the events in both lemmas hold, from round t onwards the bad arm will have a larger average reward by a constant margin q2 q1. We will prove that this implies that the bad arm has a larger index, and therefore gets chosen by the agents. The details of this argument differ from one theorem to another. Lemma C.1 is a somewhat non-standard statement which follows from the anti-concentration inequality in [61] and a reverse Pinsker inequality in [28]. More standard anti-concentration results via Stirling s approximation lead to an additional factor of 1/ n on the right-hand side of (C.1). For Lemma C.2, we introduce an exponential martingale and relate the event in (C.2) to a deviation of this martingale. We then use Ville s inequality (a version of Doob s martingale inequality) to bound the probability that this deviation occurs. C.1 Proof of Theorem 3.1: η-confident agents Fix thresholds q1 < q2 to be specified later. Define two failure events : Fail1: the average reward of arm 1 after the N0 initial samples is below q1; Fail2: the average reward of arm 2 is never below q2. In a formula, using the reward-tape notation from Appendix B, these events are Fail1 := n bµtape 1, N0 q1 o and Fail2 := n [T] : bµtape 2,n q2 . (C.3) We show that event Fail := Fail1 Fail2 implies the 0-sampling failure, as long as the margin q2 q1 is sufficiently large. Claim C.3. Assume q2 q1 > 2 p η/N0 and event Fail. Then arm 1 is never chosen by the agents. Proof. Assume, for the sake of contradiction, that some agent chooses arm 1. Let t be the first round when this happens. Note that Ind1,t Ind2,t. We will show that this is not possible by upper-bounding Ind1,t and lower-bounding Ind2,t. By definition of round t, arm 1 has been previously sampled exactly N0 times. Therefore, Ind1,t bµtape 1, N0 + p η/N0 (by definition of index) η/N0 (by Fail1) η/N0 (by assumption). Let n be the number of times arm 2 has been sampled before round t. This includes the initial samples, so n N0. It follows that Ind2,t bµtape 2,n p η/n (by definition of index) η/N0 (by Fail2 and n N0). 
In what follows, let $c$ be the absolute constant from assumption (3.1). Let us lower-bound $\Pr[\mathtt{Fail}]$ by applying Lemmas C.1 and C.2 to the reward-tape.

Claim C.4. Assume $c/4 < q_1 < q_2 < \mu_2$. Then
$$\Pr[\mathtt{Fail}] \;\geq\; q_{\mathtt{fail}} := \Omega_c(\mu_2 - q_2)\cdot e^{-O_c\left(N_0(\mu_1 - q_1)^2\right)}. \tag{C.4}$$

Proof. To handle $\mathtt{Fail}_1$, apply Lemma C.1 to the reward-tape for arm 1, i.e., to the random sequence $(\mathtt{Tape}_{1,i})_{i\in[T]}$, with $n = N_0$ and $q = q_1$. Recalling that $N_0 \geq 1/c$ by assumption (3.2),
$$\Pr[\mathtt{Fail}_1] \;\geq\; \Omega_c\!\left(e^{-O_c\left(N_0(\mu_1 - q_1)^2\right)}\right).$$
To handle $\mathtt{Fail}_2$, apply Lemma C.2 to the reward-tape for arm 2, i.e., to the random sequence $(\mathtt{Tape}_{2,i})_{i\in[T]}$, with threshold $q = q_2$. Then $\Pr[\mathtt{Fail}_2] \geq \Omega_c(\mu_2 - q_2)$. Events $\mathtt{Fail}_1$ and $\mathtt{Fail}_2$ are independent, because they are determined by, resp., the realized rewards of arm 1 and the realized rewards of arm 2. The claim follows.

Finally, let us specify suitable thresholds that satisfy the preconditions in Claims C.3 and C.4:
$$q_1 := \mu_2 - 4\sqrt{\eta/N_0} - c\Delta/4 \quad\text{and}\quad q_2 := \mu_2 - \sqrt{\eta/N_0} - c\Delta/8.$$
Plugging in $\mu_2 \geq c$ and $N_0 \geq 64\,\eta/c^2$, it is easy to check that $q_1 \geq c/4$, as needed for Claim C.4. Thus, the preconditions in Claims C.3 and C.4 are satisfied. It follows that the 0-failure happens with probability at least $q_{\mathtt{fail}}$, as defined in Claim C.4. We obtain the final expression in Eq. (3.3) because $\mu_a - q_a = \Theta_c\!\left(\Delta + \sqrt{\eta/N_0}\right)$ for both arms $a \in [2]$.

C.2 Proof of Theorem 3.10: pessimistic agents

We reuse the machinery from Section C.1: we define the event $\mathtt{Fail} := \mathtt{Fail}_1 \cap \mathtt{Fail}_2$ as per Eq. (C.3), for some thresholds $q_1 < q_2$ to be specified later, and use Claim C.4 to bound $\Pr[\mathtt{Fail}]$. However, we need a different argument to prove that $\mathtt{Fail}$ implies the 0-sampling failure, and a different way to set the thresholds.

Claim C.5. Assume $q_1 > \sqrt{\eta/N_0}$ and event $\mathtt{Fail}$. Then arm 1 is never chosen by the agents.

Proof. Assume, for the sake of contradiction, that some agent chooses arm 1. Let $t$ be the first round when this happens. Note that $\mathtt{Ind}_{1,t} \geq \mathtt{Ind}_{2,t}$. We show that this is not possible by upper-bounding $\mathtt{Ind}_{1,t}$ and lower-bounding $\mathtt{Ind}_{2,t}$.

By definition of round $t$, arm 1 has been previously sampled exactly $N_0$ times. Therefore,
$$\mathtt{Ind}_{1,t} = \max\left\{0,\ \hat\mu^{\mathtt{tape}}_{1,N_0} - \sqrt{\eta/N_0}\right\} \ \text{ (by definition of the index)} \;\leq\; \max\left\{0,\ q_1 - \sqrt{\eta/N_0}\right\} \ \text{ (by $\mathtt{Fail}_1$)} \;=\; q_1 - \sqrt{\eta/N_0} \ \text{ (by assumption)}.$$
Let $n$ be the number of times arm 2 has been sampled before round $t$. This includes the initial samples, so $n \geq N_0$. It follows that
$$\mathtt{Ind}_{2,t} \;\geq\; \hat\mu^{\mathtt{tape}}_{2,n} - \sqrt{\eta/n} \ \text{ (by definition of the index)} \;\geq\; q_2 - \sqrt{\eta/N_0} \ \text{ (by $\mathtt{Fail}_2$ and $n \geq N_0$)}.$$
Consequently, $\mathtt{Ind}_{2,t} > \mathtt{Ind}_{1,t}$, contradiction.

Now, set the thresholds $q_1, q_2$ as follows:
$$q_1 := \mu_2 - c\Delta/4 \quad\text{and}\quad q_2 := \mu_2 - c\Delta/8.$$
Plugging in $\mu_2 \geq c$ and $N_0 \geq 64\,\eta/c^2$, it is easy to check that the preconditions in Claims C.4 and C.5 are satisfied. So, the 0-failure happens with probability at least $q_{\mathtt{fail}}$ from Claim C.4. The final expression in Eq. (3.3) follows because $\mu_a - q_a = \Theta_c(\Delta)$ for both arms $a \in [2]$.

C.3 Proof of Theorem 3.9: small N0

We focus on the case when $N_0 \leq N := 64\,\eta/c^2 + 1/c$. We can now afford to handle the initial samples in a very crude way: our failure events posit that all initial samples of the good arm return reward 0, and all initial samples of the bad arm return reward 1:
$$\mathtt{Fail}_1 := \left\{\forall i \in [1, N]:\ \mathtt{Tape}_{1,i} = 0\right\}, \qquad \mathtt{Fail}_2 := \left\{\forall i \in [1, N]:\ \mathtt{Tape}_{2,i} = 1\right\} \cap \left\{\forall i \in [T]:\ \hat\mu^{\mathtt{tape}}_{2,i} \geq q_2\right\}.$$
Here, $q_2 > 0$ is a threshold to be defined later. On the other hand, our analysis given these events becomes more subtle. In particular, we introduce another "failure event" $\mathtt{Fail}_3$, with a more subtle definition: if arm 1 is chosen by at least $n := N - N_0$ agents, then arm 2 is chosen by $n$ agents before arm 1 is.
We first show that $\mathtt{Fail} := \mathtt{Fail}_1 \cap \mathtt{Fail}_2 \cap \mathtt{Fail}_3$ implies the $n$-sampling failure.

Claim C.6. Assume that $q_2 > c/4$ and $\mathtt{Fail}$ holds. Then at most $n = N - N_0$ agents choose arm 1.

Proof. For the sake of contradiction, suppose arm 1 is chosen by more than $n$ agents. Let agent $t$ be the $(n+1)$-th agent that chooses arm 1. In particular, $\mathtt{Ind}_{1,t} \geq \mathtt{Ind}_{2,t}$. By definition of $t$, arm 1 has been previously sampled exactly $N$ times (counting the $N_0$ initial samples). Therefore,
$$\mathtt{Ind}_{1,t} \;\leq\; \hat\mu^{\mathtt{tape}}_{1,N} + \sqrt{\eta/N} \ \text{ (by $\eta$-confidence)} \;=\; \sqrt{\eta/N} \ \text{ (by event $\mathtt{Fail}_1$)} \;\leq\; c/8 \ \text{ (by definition of $N$)}.$$
Let $m$ be the number of times arm 2 has been sampled before round $t$. Then
$$\mathtt{Ind}_{2,t} \;\geq\; \hat\mu^{\mathtt{tape}}_{2,m} - \sqrt{\eta/m} \ \text{ (by $\eta$-confidence)} \;\geq\; q_2 - \sqrt{\eta/m} \ \text{ (by event $\mathtt{Fail}_2$)} \;\geq\; q_2 - \sqrt{\eta/N} \ \text{ (since $m \geq N$ by event $\mathtt{Fail}_3$)} \;\geq\; q_2 - c/8 \ \text{ (by definition of $N$)} \;>\; c/8 \ \text{ (since $q_2 > c/4$)}.$$
Therefore, $\mathtt{Ind}_{2,t} > \mathtt{Ind}_{1,t}$, contradiction.

Next, we lower-bound the probability of $\mathtt{Fail}_1 \cap \mathtt{Fail}_2$ using Lemma C.2.

Claim C.7. If $q_2 < \mu_2$, then $\Pr[\mathtt{Fail}_1 \cap \mathtt{Fail}_2] \geq \Omega_c(\mu_2 - q_2)\cdot c^{2N}$.

Proof. Instead of analyzing $\mathtt{Fail}_2$ directly, consider the events
$$E := \left\{\forall i \in [1, N]:\ \mathtt{Tape}_{2,i} = 1\right\} \quad\text{and}\quad E' := \left\{\forall m \in [N+1, T]:\ \tfrac{1}{m - N}\textstyle\sum_{i=N+1}^{m} \mathtt{Tape}_{2,i} \geq q_2\right\}.$$
Note that $E \cap E'$ implies $\mathtt{Fail}_2$. Now, $\Pr[\mathtt{Fail}_1] = (1 - \mu_1)^N \geq c^N$ and $\Pr[E] = \mu_2^N \geq c^N$. Further, $\Pr[E'] \geq \Omega_c(\mu_2 - q_2)$ by Lemma C.2. The claim follows since these three events are mutually independent.

To bound $\Pr[\mathtt{Fail}]$, we argue indirectly, assuming $\mathtt{Fail}_1 \cap \mathtt{Fail}_2$ and proving that the conditional probability of $\mathtt{Fail}_3$ is at least $1/2$. While this statement feels natural given that $\mathtt{Fail}_1 \cap \mathtt{Fail}_2$ favors arm 2, the proof requires a somewhat subtle inductive argument. This is where we use the symmetry and monotonicity properties from the theorem statement.

Claim C.8. $\Pr[\mathtt{Fail}_3 \mid \mathtt{Fail}_1 \cap \mathtt{Fail}_2] \geq 1/2$.

Now, we can lower-bound $\Pr[\mathtt{Fail}]$ by $\Omega_c(\mu_2 - q_2)\cdot c^{2N}$. Finally, we set the threshold to $q_2 = c/2$, and the theorem follows.

Proof of Claim C.8. Note that event $\mathtt{Fail}_3$ is determined by the first $N$ entries of the reward-tape for both arms, in the sense that it does not depend on the rest of the reward-tape. For each arm $a$ and $i \in [T]$, let agent $\tau_{a,i}$ be the $i$-th agent that chooses arm $a$, if such an agent exists, and $\tau_{a,i} = T+1$ otherwise. Then
$$\mathtt{Fail}_3 = \left\{\tau_{2,n} \leq \tau_{1,n}\right\} = \left\{\tau_{1,n} \geq 2n\right\}. \tag{C.5}$$
Let $E$ be the event that the first $N$ entries of the reward-tape are 0 for both arms. By symmetry between the two arms (property (P1) in the theorem statement), we have $\Pr[\tau_{2,n} < \tau_{1,n} \mid E] = \Pr[\tau_{2,n} > \tau_{1,n} \mid E]$, and therefore
$$\Pr[\mathtt{Fail}_3 \mid E] = \Pr[\tau_{2,n} \leq \tau_{1,n} \mid E] \geq 1/2. \tag{C.6}$$
Next, for two distributions $F, G$, write $F \succeq_{\mathrm{fosd}} G$ if $F$ first-order stochastically dominates $G$. The conditional distribution of a random variable $X$ given an event $\mathcal{E}$ is denoted $(X \mid \mathcal{E})$. For each $i \in [T]$, we consider two conditional distributions for $\tau_{1,i}$: one given $\mathtt{Fail}_1 \cap \mathtt{Fail}_2$ and another given $E$, and prove that the former dominates:
$$\left(\tau_{1,i} \mid \mathtt{Fail}_1 \cap \mathtt{Fail}_2\right) \;\succeq_{\mathrm{fosd}}\; \left(\tau_{1,i} \mid E\right) \qquad \forall i \in [T]. \tag{C.7}$$
Applying (C.7) with $i = n$, it follows that
$$\Pr[\mathtt{Fail}_3 \mid \mathtt{Fail}_1 \cap \mathtt{Fail}_2] = \Pr[\tau_{1,n} \geq 2n \mid \mathtt{Fail}_1 \cap \mathtt{Fail}_2] \geq \Pr[\tau_{1,n} \geq 2n \mid E] \geq 1/2.$$
(The last inequality follows from Eq. (C.5) and Eq. (C.6).) Thus, it remains to prove (C.7).

Let us consider a fixed realization of each agent's behavioral type, i.e., a fixed, deterministic mapping from histories to arms. W.l.o.g., interpret the behavioral type of each agent $t$ as first deterministically mapping the history $\mathtt{hist}_t$ to a number $p_t \in [0,1]$, then drawing a threshold $\theta_t \in [0,1]$ independently and uniformly at random, and then choosing arm 1 if and only if $p_t \geq \theta_t$. Note that $p_t = \Pr[a_t = 1 \mid \mathtt{hist}_t]$. So, we pre-select the thresholds $\theta_t$ for each agent $t$.
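The threshold coupling just described is straightforward to make concrete. The sketch below is our own illustration (names are ours): the thresholds $\theta_t$ are drawn once up front, after which each agent's randomized choice is a deterministic function of $p_t$, so the same thresholds can be reused under different conditioning events, which is the coupling behind Eq. (C.7).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
thetas = rng.random(T)  # pre-drawn thresholds theta_t ~ Uniform[0, 1], one per agent

def choose_arm(p_t, t):
    """Randomized behavior via the threshold coupling: p_t = Pr[a_t = 1 | hist_t]
    is a deterministic function of the history, and agent t picks arm 1 iff
    p_t >= theta_t, so Pr[arm 1] = p_t while the randomness is fixed in advance."""
    return 1 if p_t >= thetas[t] else 2

print([choose_arm(0.5, t) for t in range(T)])
```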
Note that the agents retain the monotonicity property (P2) from the theorem statement. (For this property, the probabilities on both sides of Eq. (3.6) are now either 0 or 1.)

Let us prove (C.7) for this fixed realization of the types, using induction on $i$. Both sides of (C.7) are now deterministic; let $A_i$ and $B_i$ denote, resp., the left-hand side and the right-hand side. So, we need to prove that $A_i \geq B_i$ for all $i \in [n]$. For the base case, take $i = 0$ and define $A_0 = B_0 = 0$. For the inductive step, assume $A_i \geq B_i$ for some $i \geq 0$. We would like to prove that $A_{i+1} \geq B_{i+1}$. Suppose, for the sake of contradiction, that this is not the case, i.e., $A_{i+1} < B_{i+1}$. Since $A_i < A_{i+1}$ by definition of the sequence $(\tau_{1,i}: i \in [T])$, we must have
$$B_i \leq A_i < A_{i+1} < B_{i+1}.$$
Focus on round $t = A_{i+1}$. Note that the history $\mathtt{hist}_t$ contains exactly $i$ agents that chose arm 1, both under event $\mathtt{Fail}_1 \cap \mathtt{Fail}_2$ and under event $E$. Yet, arm 2 is chosen under $E$, while arm 1 is chosen under $\mathtt{Fail}_1 \cap \mathtt{Fail}_2$. This violates the monotonicity property (P2) from the theorem statement. Thus, we have proved (C.7) for any fixed realization of the types. Consequently, (C.7) holds in general.

C.4 Proof of Lemma C.1

We use the following sharp lower bound on the tail probability of the Binomial distribution.

Theorem C.9 (Theorem 9 in [61]). Let $n \in \mathbb{N}$ be a positive integer and let $(X_i)_{i\in[n]}$ be a sequence of i.i.d. Bernoulli random variables with parameter $p$. For any $\beta > 1$ there exist constants $c_\beta$ and $C_\beta$ that depend only on $\beta$, such that for all $x$ satisfying $x \in [0, np/\beta]$ and $x + n(1-p) \geq 1$, we have
$$\Pr\left[\textstyle\sum_{i=1}^{n} X_i \leq np - x\right] \;\geq\; c_\beta\, e^{-C_\beta\, n\, D\left(p - \frac{x}{n}\,\big\|\,p\right)},$$
where $D(x\|y)$ denotes the KL divergence between two Bernoulli random variables with parameters $x$ and $y$.

We use the above result with $x = n(p-q)$ and $\beta = \frac{1-c}{1-\frac{9}{8}c}$. Note that $\beta > 1$ since $c < \frac{1}{2}$. We first verify that $x$ and $\beta$ satisfy the conditions of the theorem. The condition $x + n(1-p) \geq 1$ holds by the assumption $n \geq 1/c$:
$$x + n(1-p) \geq n(1-p) \geq nc \geq 1.$$
As for the condition $x \leq np/\beta$, by the definition of $x$,
$$\frac{np}{x} = \frac{np}{n(p-q)} = \frac{p}{p-q}.$$
Since $p \leq 1-c$ and $\frac{p}{p-q}$ is decreasing in $p$ for $p > q$, we can further bound
$$\frac{p}{p-q} \;\geq\; \frac{1-c}{1-c-q} \;\geq\; \frac{1-c}{1-\frac{9}{8}c} \;=\; \beta,$$
where the second inequality follows from $q > c/8$ and $q < p \leq 1-c$. We obtain $x \leq np/\beta$ by rearranging. Invoking Theorem C.9 with these values, we obtain
$$\Pr\left[\textstyle\sum_{i=1}^{n} X_i \leq nq\right] \;\geq\; c_\beta\, e^{-C_\beta\, n\, D(q\|p)} \;=\; \Omega\!\left(e^{-O(n\, D(q\|p))}\right). \tag{C.8}$$
Next, we use the following reverse Pinsker inequality to upper-bound $D(q\|p)$.

Theorem C.10 ([28]). For any two probability measures $P$ and $Q$ on a finite support $\mathcal{X}$, if $Q$ is absolutely continuous with respect to $P$, then the KL divergence $D(Q\|P)$ is upper-bounded by $\frac{2}{\alpha_P}\,\delta(Q,P)^2$, where $\alpha_P = \min_{x\in\mathcal{X}} P(x)$ and $\delta(Q,P)$ denotes the total variation distance between $P$ and $Q$.

Setting $P = \mathrm{Bernoulli}(p)$ and $Q = \mathrm{Bernoulli}(q)$, we have $\alpha_P = \min(p, 1-p)$ and $\delta(Q,P) = |p - q|$. Therefore, since $\min(p, 1-p) \geq c$ by assumption, we conclude that $D(q\|p) \leq O\!\left((p-q)^2\right)$. Plugging this back into Equation (C.8) finishes the proof.
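As a quick numerical illustration (our own, with arbitrary parameter values), one can compare the exact Binomial lower tail with the KL-based rate from Theorem C.9 and with the relaxation obtained via Theorem C.10, namely $D(q\|p) \leq 2(p-q)^2/c$. The sketch assumes scipy is available.

```python
import numpy as np
from scipy.stats import binom

def kl_bernoulli(q, p):
    """KL divergence D(q || p) between Bernoulli(q) and Bernoulli(p)."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

n, p, q, c = 200, 0.6, 0.45, 0.4          # p in [c, 1-c], q in (c/8, p), n >= 1/c
tail = binom.cdf(np.floor(n * q), n, p)   # exact Pr[ sum_i X_i <= n q ]
print(tail,
      np.exp(-n * kl_bernoulli(q, p)),    # e^{-n D(q||p)}, the rate in (C.8)
      np.exp(-2 * n * (p - q) ** 2 / c))  # e^{-O(n (p-q)^2)} via reverse Pinsker
```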
C.5 Proof of Lemma C.2

Our proof relies on the following Doob-style inequality for (super)martingales.

Lemma C.11 (Ville's inequality [58]). Let $(Z_n)_{n\geq 0}$ be a non-negative supermartingale with respect to a filtration $(\mathcal{F}_n)_{n\geq 0}$, i.e., $Z_n \geq \mathbb{E}[Z_{n+1}\mid\mathcal{F}_n]$ for all $n \geq 0$. Then, for any $x > 0$,
$$\Pr\left[\max_{n\geq 0} Z_n \geq x\right] \;\leq\; \mathbb{E}[Z_0]/x.$$

In order to use this result, we define the martingale $Z_n := u^{\sum_{i=1}^{n}(X_{i+1} - q)}$ for a suitable choice of $u$, as specified by the following lemma.

Lemma C.12. Let $c$ be an absolute constant. For any $p \in (c, 1-c)$ and $q \in (0, p)$, there exists a value $u \in (0,1)$ such that
$$p\,u^{1-q} + (1-p)\,u^{-q} = 1. \tag{C.9}$$
In addition, $u$ satisfies
$$p\left(1 - u^{1-q}\right) \;\geq\; \Omega(p - q). \tag{C.10}$$

Proof. To see why such a $u$ exists, define $f(x) := p\,x^{1-q} + (1-p)\,x^{-q}$. It is clear that $f(1) = 1$ and $\lim_{x\to 0^+} f(x) = \infty$, as $\lim_{x\to 0^+}(1-p)x^{-q} = \infty$. Furthermore,
$$f'(x) = p\,(1-q)\,x^{-q} - (1-p)\,q\,x^{-q-1},$$
which implies $f'(1) = p(1-q) - (1-p)q = p - q > 0$. Therefore, $f(x) < 1$ for $x$ slightly below 1. Since $\lim_{x\to 0^+} f(x) > 1$, the intermediate value theorem yields some $u \in (0,1)$ with $f(u) = f(1) = 1$, proving Equation (C.9).

We now prove Equation (C.10). Define
$$x_0 := \frac{(1-p)\,q}{p\,(1-q)}.$$
Note that $x_0 < 1$ since $p > q$. We claim that $u \leq x_0$. To see why, first note that $f'(x)$ can be rewritten as
$$f'(x) = x^{-q-1}\left(x\,p\,(1-q) - (1-p)\,q\right).$$
It is clear that $f'(x_0) = 0$. Since $x\,p(1-q) - (1-p)q$ is increasing in $x$, this further implies that $f'(x) > 0$ for $x > x_0$. Now, if $u > x_0$, then since $f'(x) > 0$ for $x > x_0$, we would conclude that $f(u) < f(1)$, which is not possible since $f(u) = f(1) = 1$. Therefore, $u \leq x_0$, as claimed.

We now claim that $x_0^{1-q} \leq 1 - p + q$. This finishes the proof since, together with $u \leq x_0$, it implies
$$p\left(1 - u^{1-q}\right) \;\geq\; p\left(1 - x_0^{1-q}\right) \;\geq\; p\,(p - q) \;=\; \Omega(p-q),$$
where the last equation uses the assumption $p \in (c, 1-c)$.

To prove the claim, define $\varepsilon := p - q$. We need to show that $x_0^{1-q} \leq 1 - \varepsilon$, or equivalently $\ln(x_0) \leq \frac{\ln(1-\varepsilon)}{1-q}$. By the definition of $x_0$, this is equivalent to
$$\ln\!\left(\frac{(1-p)(p-\varepsilon)}{(1-p+\varepsilon)\,p}\right) \;\leq\; \frac{1}{1-p+\varepsilon}\,\ln(1-\varepsilon). \tag{C.11}$$
Fix $p$ and consider both sides as functions of $\varepsilon$. At $\varepsilon = 0$, both sides coincide, as they both equal 0. To prove Equation (C.11), it therefore suffices to show that the left-hand side decreases at least as fast as the right-hand side; equivalently, that the derivative of the LHS with respect to $\varepsilon$ is at most the derivative of the RHS for $\varepsilon \in [0, p)$. Taking the derivative of the LHS with respect to $\varepsilon$, we obtain
$$\frac{d}{d\varepsilon}\Big(\ln(1-p) + \ln(p-\varepsilon) - \ln(1-p+\varepsilon) - \ln(p)\Big) = -\frac{1}{p-\varepsilon} - \frac{1}{1-p+\varepsilon}.$$
Similarly, taking the derivative of the RHS, we obtain
$$\frac{d}{d\varepsilon}\left(\frac{\ln(1-\varepsilon)}{1-p+\varepsilon}\right) = -\frac{1}{(1-\varepsilon)(1-p+\varepsilon)} - \frac{\ln(1-\varepsilon)}{(1-p+\varepsilon)^2}.$$
We therefore need to show that
$$\frac{1}{1-p+\varepsilon} + \frac{1}{p-\varepsilon} \;\geq\; \frac{1}{(1-p+\varepsilon)(1-\varepsilon)} + \frac{\ln(1-\varepsilon)}{(1-p+\varepsilon)^2}. \tag{C.12}$$
We note, however, that
$$\frac{1}{1-p+\varepsilon} + \frac{1}{p-\varepsilon} = \frac{1}{(1-p+\varepsilon)(p-\varepsilon)} \;\geq\; \frac{1}{(1-p+\varepsilon)(1-\varepsilon)},$$
since $p - \varepsilon \leq 1 - \varepsilon$. Therefore, Equation (C.12) follows from $\frac{\ln(1-\varepsilon)}{(1-p+\varepsilon)^2} \leq 0$, which is true since $\varepsilon \in [0, p]$. This proves the claim $x_0^{1-q} \leq 1 - \varepsilon$, finishing the proof.

We now prove Lemma C.2 using Lemmas C.11 and C.12.

Proof of Lemma C.2. Define the random variables $Y_i := X_{i+1} - q$. Note that $Y_i$ takes value $1-q$ with probability $p$, and value $-q$ with probability $1-p$. Set $u$ to be the value specified in Lemma C.12. For $n \geq 0$, define $Z_n := u^{\sum_{i=1}^{n} Y_i}$. We first observe that $Z_n$ is a martingale with respect to $Y_1, \ldots, Y_n$:
$$\mathbb{E}\left[Z_{n+1}\mid Y_1,\ldots,Y_n\right] = \mathbb{E}\left[u^{\sum_{i=1}^{n+1} Y_i}\,\middle|\, Y_1,\ldots,Y_n\right] = u^{\sum_{i=1}^{n} Y_i}\left(p\,u^{1-q} + (1-p)\,u^{-q}\right) = u^{\sum_{i=1}^{n} Y_i} = Z_n.$$
Since $0 < u < 1$, this further implies
$$\Pr\left[\exists n \geq 1:\ \textstyle\sum_{i=1}^{n} Y_i < q - 1\right] \;\leq\; \Pr\left[\max_{n\geq 1}\, u^{\sum_{i=1}^{n} Y_i} \geq u^{q-1}\right] \;\leq\; u^{1-q},$$
where the first inequality holds because $x \mapsto u^x$ is decreasing, and the second follows from Lemma C.11 (applied with $x = u^{q-1}$) together with $\mathbb{E}[Z_1] = \mathbb{E}[Z_0] = \mathbb{E}\left[u^0\right] = 1$.

Since each $Y_i$ is a function of $X_{i+1}$, the above event is independent of $X_1$, and $X_1 = 1$ with probability $p$. Therefore, with probability at least $p\left(1 - u^{1-q}\right)$, we have
$$X_1 = 1 \quad\text{and}\quad \forall n \geq 1:\ \textstyle\sum_{i=2}^{n}(X_i - q) \geq q - 1,$$
which further implies $\sum_{i=1}^{n}(X_i - q) \geq 0$ for all $n \geq 1$. Therefore,
$$\Pr\left[\forall n \geq 1:\ \tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i \geq q\right] \;\geq\; p\left(1 - u^{1-q}\right) \;\geq\; \Omega(p-q),$$
where the last inequality follows from Equation (C.10).
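The quantity $u$ in Lemma C.12 has no closed form, but it is easy to compute numerically. The sketch below (ours; arbitrary parameter values) finds the root of Eq. (C.9) on $(0,1)$ by bisection and checks the lower bound (C.10).

```python
import numpy as np

def find_u(p, q, iters=200):
    """Bisection for u in (0, 1) solving Eq. (C.9): p*u**(1-q) + (1-p)*u**(-q) = 1.
    The function below is positive near 0 (the u**(-q) term blows up) and negative
    just below 1 (since f'(1) = p - q > 0), so the sign change brackets the root."""
    f = lambda u: p * u ** (1 - q) + (1 - p) * u ** (-q) - 1.0
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

p, q = 0.6, 0.3                            # arbitrary values with 0 < q < p < 1
u = find_u(p, q)
print(u, p * (1 - u ** (1 - q)), p - q)    # check Eq. (C.10): middle value vs. p - q
```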
D Proofs from Section 4: Upper Bounds for Optimistic Agents

D.1 Proof of Theorem 4.1 and Theorem 4.4

We define certain "clean events" to capture desirable realizations of the random rewards, and decompose our regret bounds based on whether or not these events hold. The clean events ensure that the index of each arm is not too far from its true mean reward; more specifically, that the index is large enough for the good arm and small enough for the bad arm. We have two clean events, one for each arm, defined in terms of the reward-tape as follows:
$$\mathtt{Clean}^{\eta}_1 := \left\{\forall i \in [T]:\ \mathtt{UCB}^{\mathtt{tape},\eta}_{1,i} \geq \mu_1 - \Delta/2\right\}, \tag{D.1}$$
$$\mathtt{Clean}^{\eta}_2 := \left\{\forall i \geq 64\,\eta/\Delta^2:\ \mathtt{UCB}^{\mathtt{tape},\eta}_{2,i} \leq \mu_2 + \Delta/4\right\}. \tag{D.2}$$
Our analysis is more involved compared to the standard analysis of the UCB1 algorithm [6], essentially because we cannot make $\eta$ as large as needed to ensure that the clean events hold with very high probability. For example, we cannot upper-bound the deviation probability separately for each round and naively take a union bound over all rounds.20 Instead, we apply a more careful "peeling technique", used, e.g., in Audibert and Bubeck [5], so as to avoid any dependence on $T$ in the lemma below.

20 Indeed, this would only guarantee that the clean events hold with probability at least $1 - O(T\,e^{-\Omega(\eta)})$, which in turn would lead to a regret bound like $O(T^2\,e^{-\Omega(\eta)})$.

Lemma D.1. The clean events hold with probability
$$\Pr[\mathtt{Clean}^{\eta}_1] \geq 1 - O\!\left((1 + \log(1/\Delta))\,e^{-\Omega(\eta)}\right), \tag{D.3}$$
$$\Pr[\mathtt{Clean}^{\eta}_2] \geq 1 - O\!\left(e^{-\Omega(\eta)}\right). \tag{D.4}$$

We show that under the appropriate clean events, $\eta$-optimistic agents cannot play the bad arm too often. In fact, this claim extends to $[\eta, \eta_{\max}]$-optimistic agents.

Claim D.2. Assume that events $\mathtt{Clean}^{\eta}_1$ and $\mathtt{Clean}^{\eta_{\max}}_2$ hold. Then $[\eta, \eta_{\max}]$-optimistic agents cannot choose the bad arm more than $64\,\eta_{\max}/\Delta^2$ times.

Proof. For the sake of contradiction, suppose $[\eta, \eta_{\max}]$-optimistic agents choose the bad arm more than $n = 64\,\eta_{\max}/\Delta^2$ times, and let $t$ be the round when the bad arm is chosen for the $(n+1)$-th time. By event $\mathtt{Clean}^{\eta}_1$, the index of arm 1 is at least $\mu_1 - \Delta/2$. By event $\mathtt{Clean}^{\eta_{\max}}_2$, the index of arm 2 is at most $\mathtt{UCB}^{\mathtt{tape},\eta_{\max}}_{2,n} \leq \mu_2 + \Delta/4$, which is less than the index of arm 1, contradiction.

For the joint clean event $\mathtt{Clean} := \mathtt{Clean}^{\eta}_1 \cap \mathtt{Clean}^{\eta_{\max}}_2$, Lemma D.1 implies
$$\Pr[\mathtt{Clean}] \geq 1 - O\!\left((1 + \log(1/\Delta))\,e^{-\Omega(\eta)}\right). \tag{D.5}$$
When the clean events fail, we upper-bound regret by $T\Delta$, which is the largest possible. Thus, Claim D.2 and Eq. (D.5) imply Theorem 4.4, which in turn implies Theorem 4.1 as a special case.

D.2 Proof of Theorem 4.5

We reuse the machinery from Section D.1, but we need some extra work. Recall that all agents are assumed to be $\eta_{\max}$-confident, whereas only a fraction are optimistic. Essentially, we rely on the optimistic agents to sample the good arm sufficiently many times (via Claim D.2). Once this happens, all other agents "fall in line" and cannot choose the bad arm too many times. In what follows, let $m = 1 + 64\,\eta_{\max}/\Delta^2$.

Claim D.3. Assume $\mathtt{Clean}$. Suppose the good arm is sampled at least $m$ times by some round $t_0$. Then, after round $t_0$, agents cannot choose the bad arm more than $m$ times.

Proof. For the sake of contradiction, suppose some agent $t \geq t_0$ has at least $m$ samples of the bad arm (i.e., $n_{2,t} \geq m$) and chooses the bad arm once more. Then the index of the good arm satisfies
$$\mathtt{Ind}_{1,t} \;\geq\; \mathtt{LCB}^{\eta_{\max}}_{1,t} \ \text{ ($\eta_{\max}$-confident agents)} \;\geq\; \mathtt{LCB}^{\mathtt{tape},\eta_{\max}}_{1,m} \ \text{ (by definition of $t_0$)} \;\geq\; \mathtt{UCB}^{\mathtt{tape},\eta_{\max}}_{1,m} - 2\sqrt{\eta_{\max}/m} \ \text{ (by definition of the UCBs/LCBs)}$$
$$\;\geq\; \mathtt{UCB}^{\mathtt{tape},\eta}_{1,m} - 2\sqrt{\eta_{\max}/m} \ \text{ (since $\eta_{\max} \geq \eta$)} \;>\; \mu_1 - \Delta/2 - \Delta/4 = \mu_2 + \Delta/4 \ \text{ (by $\mathtt{Clean}^{\eta}_1$ and the definition of $m$)}.$$
The index of the bad arm satisfies
$$\mathtt{Ind}_{2,t} \;\leq\; \mathtt{UCB}^{\eta_{\max}}_{2,t} \ \text{ ($\eta_{\max}$-confident agents)} \;\leq\; \mu_2 + \Delta/4 \ \text{ (by $\mathtt{Clean}^{\eta_{\max}}_2$ and the definition of $m$)},$$
which is strictly smaller than $\mathtt{Ind}_{1,t}$, contradiction.
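To see the upper-bound mechanism concretely, the sketch below (our own, with arbitrary parameters) simulates $\eta$-optimistic agents whose index is exactly the UCB of Eq. (B.1) and counts how often the bad arm is chosen; in the spirit of Claim D.2, this count stays far below $T$ once $64\,\eta/\Delta^2 \ll T$.

```python
import numpy as np

def bad_arm_pulls(mu=(0.7, 0.3), eta=1.0, N0=2, T=5000, seed=0):
    """eta-optimistic agents: index = sample mean + sqrt(eta / #samples).
    Returns how many of the T agents choose the bad arm (the one with mean mu[1])."""
    rng = np.random.default_rng(seed)
    s = np.array([rng.binomial(N0, m) for m in mu], dtype=float)  # successes so far
    n = np.array([N0, N0], dtype=float)                           # samples so far
    bad = 0
    for _ in range(T):
        idx = s / n + np.sqrt(eta / n)       # optimistic index for each arm
        a = int(np.argmax(idx))
        bad += int(a == 1)
        s[a] += rng.random() < mu[a]
        n[a] += 1
    return bad

print(bad_arm_pulls())
```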
For Claim D.3 to "kick in", we need sufficiently many optimistic agents to arrive by time $t_0$. Formally, let $E_t$ be the event that at least $2m$ agents are $[\eta, \eta_{\max}]$-optimistic in the first $t$ rounds.

Corollary D.4. Assume $\mathtt{Clean}$. Further, assume event $E_{t_0}$ for some round $t_0$. Then (by Claim D.2) the good arm is sampled at least $m$ times before round $t_0$. Consequently (by Claim D.3), agents cannot choose the bad arm more than $m + t_0$ times.

Finally, it is easy to see by Chernoff bounds that $\Pr[E_{t_0}] \geq 1 - e^{-\Omega(\eta)}$ for some $t_0 = O(m/q)$, where $q$ is the probability from the theorem statement. So, $\Pr[\mathtt{Clean} \cap E_{t_0}]$ is lower-bounded as in Eq. (D.5). Again, when $\mathtt{Clean} \cap E_{t_0}$ fails, we upper-bound regret by $T\Delta$. So, Corollary D.4 and the lower bound on $\Pr[\mathtt{Clean} \cap E_{t_0}]$ imply the theorem.

D.3 Proof of Lemma D.1

We assume without loss of generality that $\eta > 2$: if $\eta \leq 2$, the lemma's statement can be made vacuous by using large enough constants in the $O(\cdot)$ notation. In addition, for mathematical convenience, we assume that the tape for each arm is infinite, even though the entries after $T$ will never actually be seen by any of the agents.

For each arm $a$, we first separately consider each interval of the form $[n, 2n]$ and bound the probability that $\mathtt{UCB}^{\mathtt{tape},\eta}_{a,i}$ deviates too much from $\mu_a$ for $i \in [n, 2n]$. While this can be done crudely by applying a union bound over all $i$, we instead use the following maximal inequality.

Lemma D.5 (Eq. (2.17) in [32]). Given a sequence of i.i.d. random variables $(X_i)_{i\in[n]}$ taking values in $[0,1]$ with $\mathbb{E}[X_i] = \mu$, for any $x > 0$,
$$\Pr\left[\max_{j\in[n]}\ \textstyle\sum_{i=1}^{j}(X_i - \mu) \;\geq\; x\right] \;\leq\; e^{-2x^2/n}.$$

Focusing on an interval of the form $[n, 2n]$ for $n \in \mathbb{N}$, and applying this inequality to the reward-tape of arm $a$ (in both directions), we conclude that
$$\Pr\left[\exists i \in [n, 2n]:\ \left|\hat\mu^{\mathtt{tape}}_{a,i} - \mu_a\right| \geq x\right] \;\leq\; O\!\left(e^{-\Omega(n x^2)}\right). \tag{D.6}$$

Define $f := 64\,\eta/\Delta^2$. We note that $f = \Theta(\eta/\Delta^2)$, given the assumption $\eta > 2$. In order to bound $\Pr[\mathtt{Clean}^{\eta}_2]$, we apply this inequality to each interval $[f\,2^r, f\,2^{r+1}]$ and take a union bound. Formally,
$$1 - \Pr[\mathtt{Clean}^{\eta}_2] \;\leq\; \Pr\left[\exists i \geq f:\ \hat\mu^{\mathtt{tape}}_{2,i} > \mu_2 + \Delta/8\right] \quad\text{(since $\sqrt{\eta/i} \leq \Delta/8$ for $i \geq f$)}$$
$$\leq\; \sum_{r=0}^{\infty} \Pr\left[\exists i \in [f\,2^r, f\,2^{r+1}]:\ \hat\mu^{\mathtt{tape}}_{2,i} > \mu_2 + \Delta/8\right] \quad\text{(union bound)}$$
$$\leq\; \sum_{r=0}^{\infty} e^{-\Omega(\eta\,2^r)} \quad\text{(by Eq. (D.6))}$$
$$\leq\; \sum_{r=0}^{\infty} e^{-\Omega(\eta\,(r+1))} \quad\text{(since $2^r \geq r+1$ for $r \in \mathbb{N}$)}$$
$$=\; O\!\left(\frac{1}{e^{\Omega(\eta)} - 1}\right) \quad\text{(sum of a geometric series)}$$
$$\leq\; O\!\left(e^{-\Omega(\eta)}\right) \quad\text{(by $\eta > 2$)}.$$

In order to bound $\Pr[\mathtt{Clean}^{\eta}_1]$, we separately handle the intervals with $i \geq f$ and $i < f$. For $i \geq f$, repeating the same argument as above for arm 1 implies
$$\Pr\left[\exists i \geq f:\ \hat\mu^{\mathtt{tape}}_{1,i} < \mu_1 - \Delta/8\right] \;\leq\; O\!\left(e^{-\Omega(\eta)}\right).$$
For $i < f$, we use a modified argument that utilizes the extra $\sqrt{\eta/i}$ term in $\mathtt{UCB}^{\mathtt{tape},\eta}_{1,i}$. Instead of bounding the probability that $\hat\mu^{\mathtt{tape}}_{1,i}$ has deviation $\Delta/8$, we bound the probability that it deviates by $\sqrt{\eta/i}$. This results in a marked improvement, because $\sqrt{\eta/i}$ increases as we decrease $i$. Formally,
$$\Pr\left[\exists i \in [1, f]:\ \hat\mu^{\mathtt{tape}}_{1,i} < \mu_1 - \sqrt{\eta/i}\right] \;\leq\; \sum_{r=0}^{O(\log f)} \Pr\left[\exists i \in [2^r, 2^{r+1}]:\ \hat\mu^{\mathtt{tape}}_{1,i} < \mu_1 - \sqrt{\eta/i}\right] \quad\text{(union bound)}$$
$$\leq\; \sum_{r=0}^{O(\log f)} \Pr\left[\exists i \in [2^r, 2^{r+1}]:\ \hat\mu^{\mathtt{tape}}_{1,i} < \mu_1 - \sqrt{\eta/2^{r+1}}\right] \quad\text{(since $i \leq 2^{r+1}$)}$$
$$\leq\; \sum_{r=0}^{O(\log f)} O\!\left(e^{-\Omega(\eta)}\right) \quad\text{(by Eq. (D.6))}$$
$$=\; O\!\left(\log(f)\,e^{-\Omega(\eta)}\right).$$
Finally, we note that, since $\eta > 2$, $\log(f) \leq O(1 + \log(\eta) + \log(1/\Delta))$. This implies Eq. (D.3), because the term $O(\log(\eta)\,e^{-\Omega(\eta)})$ can be rewritten as $O(e^{-\Omega(\eta)})$ by changing the constant behind $\Omega(\cdot)$.

E Proofs from Section 5: Learning Failures for Bayesian Agents

In this section, we prove Theorem 5.1. We first briefly review some properties of the Beta distribution. Throughout the section, we consider a Beta distribution with parameters $\alpha, \beta$.

Lemma E.1 (Fact 1 in [3]).
Let $F^{B}_{n,p}$ denote the CDF of the Binomial distribution with parameters $n, p$, and let $F^{\mathrm{beta}}_{\alpha,\beta}$ denote the CDF of the Beta distribution. Then
$$F^{\mathrm{beta}}_{\alpha,\beta}(y) = 1 - F^{B}_{\alpha+\beta-1,\,y}(\alpha - 1)$$
for $\alpha, \beta$ that are positive integers.

Using Hoeffding's inequality for concentration of the Binomial distribution, we immediately obtain the following corollary.

Corollary E.2. Define $\rho_{\alpha,\beta} := \frac{\alpha-1}{\alpha+\beta-1}$. If $X$ is sampled from the Beta distribution with parameters $(\alpha, \beta)$, then
$$\Pr\left[\,\left|X - \rho_{\alpha,\beta}\right| \geq y\,\right] \;\leq\; 2\,e^{-(\alpha+\beta-1)\,y^2}.$$
In addition, letting $Q(\cdot)$ denote the quantile function of this distribution,
$$\left[\,Q(\zeta),\ Q(1-\zeta)\,\right] \;\subseteq\; \left[\ \rho_{\alpha,\beta} - \sqrt{\frac{\ln(2/\zeta)}{\alpha+\beta-1}},\ \ \rho_{\alpha,\beta} + \sqrt{\frac{\ln(2/\zeta)}{\alpha+\beta-1}}\ \right].$$

Let $(\alpha_{a,n}, \beta_{a,n})$ denote the parameters of the posterior distribution after observing $n$ entries of the tape for arm $a$. Note that, since we are assuming independent priors, the posterior for each arm is independent of the observed rewards of the other arm. Define $M_{a,n} := \alpha_{a,n} + \beta_{a,n}$. We note that, by definition, $(\alpha_{a,0}, \beta_{a,0})$ coincides with the prior $(\alpha_a, \beta_a)$; we analogously define $M_a := \alpha_a + \beta_a$. Define
$$\rho_{a,n} := \frac{\alpha_{a,n} - 1}{M_{a,n} - 1} \quad\text{and}\quad \xi_{a,n} := \frac{\alpha_{a,n}}{M_{a,n}}.$$
We note that $\xi_{a,n}$ is the mean of the posterior distribution after observing $n$ entries of arm $a$.

Lemma E.3. For all $n \geq 0$, $\ \left|\hat\mu^{\mathtt{tape}}_{a,n} - \xi_{a,n}\right| \leq O\!\left(\frac{M_{a,0}}{n + M_{a,0}}\right)$.

Proof. After observing $n$ entries, the posterior parameters satisfy
$$\alpha_{a,n} = \alpha_{a,0} + \sum_{i\leq n} \mathtt{Tape}_{a,i}, \qquad \beta_{a,n} = \beta_{a,0} + \sum_{i\leq n}\left(1 - \mathtt{Tape}_{a,i}\right).$$
It follows that
$$\xi_{a,n} = \frac{\alpha_{a,0} + \sum_{i\leq n}\mathtt{Tape}_{a,i}}{\alpha_{a,0} + \beta_{a,0} + n}.$$
Defining $X := \sum_{i\leq n}\mathtt{Tape}_{a,i}$, we can bound the difference between $\xi_{a,n}$ and $\hat\mu^{\mathtt{tape}}_{a,n}$ as
$$\left|\frac{\alpha_{a,0} + X}{M_{a,0} + n} - \frac{X}{n}\right| = \frac{\left|n\,\alpha_{a,0} - X\,M_{a,0}\right|}{n\,(n + M_{a,0})} \;\leq\; \frac{\alpha_{a,0}}{n + M_{a,0}} + \frac{M_{a,0}}{n + M_{a,0}} \ \text{ (since $X \leq n$)} \;\leq\; O\!\left(\frac{M_{a,0}}{n + M_{a,0}}\right).$$

Lemma E.4. For all $n \geq 0$, $\ \left|\xi_{a,n} - \rho_{a,n}\right| \leq O\!\left(\frac{1}{n + M_{a,0}}\right)$.

Proof.
$$\left|\frac{\alpha_{a,n} - 1}{M_{a,n} - 1} - \frac{\alpha_{a,n}}{M_{a,n}}\right| = \frac{M_{a,n} - \alpha_{a,n}}{M_{a,n}\,(M_{a,n} - 1)} \;\leq\; \frac{M_{a,n}}{M_{a,n}\,(M_{a,n} - 1)} \;=\; O\!\left(\frac{1}{n + M_{a,0}}\right) \ \text{ (since $M_{a,n} = M_{a,0} + n$ and $M_{a,0} \geq 1$)}.$$

We can now prove Theorem 5.1.

Proof of Theorem 5.1. We start with part (a). Set $\eta$ to be large enough such that
$$\left|\hat\mu^{\mathtt{tape}}_{a,n} - \xi_{a,n}\right| \;\leq\; \sqrt{\eta/n} \quad\text{for all } n \geq N_0.$$
Since $\frac{M_a}{n + M_a} \leq \frac{M_a}{n}$, by Lemma E.3 this can be achieved with $\eta \leq O\!\left(M_a^2/N_0\right)$, which proves part (a). For part (b), set $\eta$ to be large enough such that
$$\left|\hat\mu^{\mathtt{tape}}_{a,n} - \rho_{a,n}\right| \;\leq\; \tfrac{1}{2}\sqrt{\eta/n} \quad\text{for all } n \geq N_0.$$
Given Lemmas E.3 and E.4, this can be achieved with $\eta \leq O\!\left(M_a^2/N_0\right)$. Since $M_{a,n} - 1 \geq n$, we can further guarantee $\frac{\ln(2/\zeta)}{M_{a,n}-1} \leq \frac{\eta}{4n}$ by setting $\eta \geq O(\ln(1/\zeta))$, which finishes the proof together with Corollary E.2.

F The corollary from Section 6

As a corollary of Theorem 6.1, we derive a 0-sampling failure, leading to $\Omega(T)$ Bayesian regret. Specifically, the agents never try arm 2 when it is in fact the best arm. This happens whenever the prior is independent across arms and has a positive density on the entire $[0,1]$ interval.

Corollary F.1. In the setting of Theorem 6.1, suppose the prior $\mathcal{P}$ is independent across arms and has a positive density for each arm (i.e., has a probability density function that is strictly positive on $[0,1]$). Then $\mathbb{E}[\mathrm{Regret}(T)] \geq c_{\mathcal{P}}\,T$, where the constant $c_{\mathcal{P}} > 0$ depends only on the prior $\mathcal{P}$.

This follows from a more explicit, but more cumbersome, corollary.

Corollary F.2. Denote $\mu^0_1 = \mathbb{E}[\mu_1]$ and $\mu^0_2 = \mathbb{E}[\mu_2]$. Consider independent priors such that $\Pr[\mu_1 = 1] < (\mu^0_1 - \mu^0_2)/2$. Pick any $\alpha > 0$ such that $\Pr[\mu_1 \geq 1 - 2\alpha] \leq (\mu^0_1 - \mu^0_2)/2$. Then Bayesian regret is at least
$$\frac{T\,\alpha}{2}\,\left(\mu^0_1 - \mu^0_2\right)\,\Pr[\mu_2 > 1 - \alpha].$$

Proof. Let $E_1$ be the event that $\mu_1 < 1 - 2\alpha$ and arm 2 is never chosen. By Theorem 6.1 and the definition of $\alpha$, we have $\Pr[E_1] \geq (\mu^0_1 - \mu^0_2)/2$. Let $E_2$ be the event that $\mu_2 > 1 - \alpha$. Under event $E_1 \cap E_2$, each round contributes $\mu_2 - \mu_1 \geq \alpha$ to regret, so $\mathbb{E}[R(T) \mid E_1 \cap E_2] \geq \alpha\,T$. Since event $E_1$ is determined by the prior draw for arm 1 and the realized rewards of arm 1, it is independent of $E_2$.
It follows that
$$\mathbb{E}[R(T)] \;\geq\; \mathbb{E}[R(T) \mid E_1 \cap E_2]\cdot\Pr[E_1 \cap E_2] \;\geq\; \alpha\,T\cdot\frac{\mu^0_1 - \mu^0_2}{2}\cdot\Pr[E_2].$$
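To illustrate the 0-sampling failure behind Corollary F.1 (our own sketch, not the construction from Theorem 6.1), one can simulate Bayesian-unbiased agents with independent uniform priors, i.e., agents that greedily pick the arm with the larger posterior mean, and estimate how often the arm with the higher realized mean reward is never tried. All parameter values below are arbitrary.

```python
import numpy as np

def never_tries_best_arm(T=1000, runs=1000, seed=0):
    """Independent Uniform[0,1] (= Beta(1,1)) priors on both arms; each agent
    chooses the arm with the larger posterior mean alpha / (alpha + beta).
    Returns the fraction of runs in which the better arm is never chosen."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(runs):
        mu = rng.random(2)                  # mean rewards drawn from the prior
        ab = np.ones((2, 2))                # rows = arms, columns = (alpha, beta)
        pulled = np.zeros(2, dtype=bool)
        for _ in range(T):
            a = int(np.argmax(ab[:, 0] / ab.sum(axis=1)))  # greedy on posterior mean
            pulled[a] = True
            r = rng.random() < mu[a]
            ab[a, 0] += r                   # alpha += reward
            ab[a, 1] += 1 - r               # beta += 1 - reward
        count += not pulled[int(np.argmax(mu))]
    return count / runs

print(never_tries_best_arm())
```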