# Bandit Learning with Positive Externalities

Virag Shah, Management Science and Engineering, Stanford University, California, USA 94305. virag@stanford.edu
Jose Blanchet, Management Science and Engineering, Stanford University, California, USA 94305. jblanche@stanford.edu
Ramesh Johari, Management Science and Engineering, Stanford University, California, USA 94305. rjohari@stanford.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

**Abstract.** In many platforms, user arrivals exhibit a self-reinforcing behavior: future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit positive externalities. We study multiarmed bandit (MAB) problems with positive externalities. We show that the self-reinforcing preferences may lead standard benchmark algorithms such as UCB to exhibit linear regret. We develop a new algorithm, Balanced Exploration (BE), which explores arms carefully to avoid suboptimal convergence of arrivals before sufficient evidence is gathered. We also introduce an adaptive variant of BE which successively eliminates suboptimal arms. We analyze their asymptotic regret, and establish optimality by showing that no algorithm can perform better.

## 1 Introduction

A number of different platforms use multiarmed bandit (MAB) algorithms today to optimize their service: e.g., search engines and information retrieval platforms, e-commerce platforms, and news sites. Many such platforms exhibit a natural self-reinforcement in the arrival process of users: future arrivals may be biased towards users who expect to have positive experiences based on the past outcomes of the platform. For example, if a news site generates articles that are liberal (resp., conservative), then it is most likely to attract additional users who are liberal (resp., conservative) [2]. In this paper, we study the optimal design of MAB algorithms when user arrivals exhibit such positive self-reinforcement.

We consider a setting in which a platform faces many types of users that can arrive. Each user type is distinguished by preferring a subset of the item types above all others. The platform is not aware of either the type of the user or the item-user payoffs. Following the discussion above, arrivals exhibit positive externalities (also called positive network effects) among the users [13]: in particular, if one type of item generates positive rewards, users who prefer that type of item become more likely to arrive in the future. Our paper quantifies the consequences of positive externalities for bandit learning in a benchmark model where the platform is unable to observe the user's type on arrival.

In the model we consider, introduced in Section 3, there is a set of m arms. A given arriving user prefers a subset of these arms over the others; in particular, all arms other than the preferred arms generate zero reward. A preferred arm a generates a Bernoulli reward with mean μ_a.

|  | α = 0 | 0 < α < 1 | α = 1 | α > 1 |
|---|---|---|---|---|
| Lower bound | Ω(ln T) | Ω(T^{1−α} ln T) | Ω(ln² T) | Ω(ln T) |
| UCB | O(ln T) | Θ(T) | Θ(T) | Θ(T) |
| Random-explore-then-commit | O(ln T) | Ω̃(T^{1−α}) | Ω̃(T^{μ_b/(μ_b+μ_{a*})}) | Ω(T) |
| Balanced Exploration (BE) | O(ln T) | Õ(T^{1−α} ln T) | Õ(ln² T) | Õ(ln T) |
| BE with Arm Elimination (BE-AE) | O(ln T) | O(T^{1−α} ln T) | O(ln² T) | O(ln T) |

Table 1: Total regret under different settings. Here a* = arg max_a μ_a, and b = arg max_{a≠a*} μ_a.
For the Random-explore-then-commit algorithm, we assume that the initial bias θ_a of each arm a is a positive integer (cf. Section 3). The notation f(T) = Õ(g(T)) means that there exists k > 0 such that f(T) = O(g(T) ln^k g(T)); the notation Ω̃ is defined analogously.

To capture positive externalities, the probability that a user preferring arm a arrives at time t is proportional to (S_a(t−1) + θ_a)^α, where S_a(t−1) is the total reward observed from arm a in the past and θ_a captures the initial conditions. The positive constant α captures the strength of the externality: when α is large the positive externality is strong. The platform aims to maximize the cumulative reward up to a time horizon T. We evaluate our performance by measuring regret against an offline oracle that always chooses the arm a* = arg max_a μ_a. Because of the positive externality, this choice causes the user population to shift entirely to users preferring arm a* over time; in particular, the oracle achieves asymptotically optimal performance to leading order in T. We study the asymptotic scaling of the cumulative regret against the oracle at time T, as T → ∞.

At the heart of this learning problem is a central tradeoff. On one hand, because of the positive externality, the platform operator is able to move the user population towards the profit-maximizing population. On the other hand, due to self-reinforcing preferences, the impact of mistakes is amplified: if rewards are generated on suboptimal arms, the positive externality causes more users who prefer those arms to arrive in the future. We are able to explicitly quantify the impact of this tradeoff in our model. Our main results are as follows.

**Lower bound.** In Section 4, we provide an explicit lower bound on the best achievable regret for each α. Strikingly, the optimal regret is structurally quite different from classical lower bounds for MAB problems; see Table 1. Its development sheds light on the key differences between MABs with positive externalities and those without.

**Suboptimality of classical approaches.** In Section 5, we show that the UCB algorithm is not only suboptimal, but in fact has positive probability of never obtaining a reward on the best arm a*, and thus obtains linear regret. This is because UCB does not explore sufficiently to find the best arm. However, we show that just exploring more aggressively is also insufficient: a random-explore-then-commit policy which explores in an unstructured fashion remains suboptimal. This demonstrates the need to develop a new approach to exploration.

**Optimal algorithm.** In Section 6, we develop a new algorithmic approach to optimizing the exploration-exploitation tradeoff. Interestingly, this algorithm is cautious in the face of uncertainty, to avoid making long-lasting mistakes. Our algorithm, Balanced Exploration (BE), keeps the user population balanced during the exploration phase; by doing so, it exploits an arm only when there is sufficient certainty regarding its optimality. Its adaptive variant, Balanced Exploration with Arm Elimination (BE-AE), intelligently eliminates suboptimal arms while balancing exploration among the remainder. BE has the benefit of not depending on system parameters, while BE-AE uses such information (e.g., α). We establish their optimality by developing an upper bound on their regret for each α; this nearly matches the lower bound (for BE), and exactly matches the lower bound (for BE-AE). Further, in Section 7 we provide simulation results to obtain quantitative insights into the relative performance of different algorithms.
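For concreteness, the following minimal Python sketch (an illustration only, not an implementation used in the paper; the function name is chosen for exposition) computes the preference probabilities induced by the externality rule described above, i.e., probabilities proportional to (S_a + θ_a)^α.

```python
import numpy as np

def arrival_probabilities(S, theta, alpha):
    """Probability that the next arriving user prefers each arm.

    S[a]     : cumulative reward observed on arm a so far
    theta[a] : initial bias of arm a
    alpha    : strength of the positive externality
    """
    weights = (np.asarray(S, dtype=float) + np.asarray(theta, dtype=float)) ** alpha
    return weights / weights.sum()

# Two arms with equal initial biases: before any rewards, and after
# arm 0 has accrued 20 rewards, with a linear externality (alpha = 1).
print(arrival_probabilities([0, 0], [1.0, 1.0], alpha=1.0))   # [0.5, 0.5]
print(arrival_probabilities([20, 0], [1.0, 1.0], alpha=1.0))  # [0.954..., 0.045...]
```

Even a modest head start on one arm quickly tilts future arrivals toward it, which is exactly the self-reinforcement that makes early mistakes costly.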
We conclude the paper by summarizing the main qualitative insights obtained from our work.

## 2 Related work

As noted above, our work incorporates positive externalities in user arrivals. Positive externalities are also referred to as positive network effects or positive network externalities. (Note that the phrase "network" is often used here, even when the effects do not involve explicit network connections between the users.) See [13], as well as [21, 20], for background. Positive externalities are extensively discussed in most standard textbooks on microeconomic theory; see, e.g., Chapter 11 of [17].

It is well accepted that online search and recommendation engines produce feedback loops that can lead to self-reinforcement of popular items [3, 6, 19, 9]. Our model captures this phenomenon by employing a self-reinforcing arrival process, inspired by classical urn processes [4, 12].

We note that the kind of self-reinforcing behavior observed in our model may be reminiscent of herding behavior in Bayesian social learning [7, 23, 1]. In these models, arriving Bayesian rational users take actions based on their own private information and the outcomes experienced by past users. The central question in that literature is the following: do individuals base their actions on their own private information, or do they follow the crowd? By contrast, in our model it is the platform that takes actions, without directly observing the preferences of the users.

If user preferences are known, then a platform might choose to personalize its services to satisfy each user individually. This is the theme of much recent work on contextual bandits; see, e.g., [16, 22, 18], and [8] for a survey of early work. In such a model, it is important that either (1) enough observable covariates are available to group different users together as decisions are made; or (2) users are long-lived, so that the platform has time to learn about them. In contrast to contextual bandits, in our model the users' types are not known, and users are short-lived (one interaction per user). Of course, the reality of many platforms is somewhere in between: some user information may be available, though imperfect. We view our setting as a natural benchmark model for analysis of the impact of self-reinforcing arrivals. Through this lens, our work suggests that there are significant consequences for learning when the user population itself can change over time, an insight that we expect to be robust across a wide range of settings.

## 3 Preliminaries

In this section we describe the key features of the model we study. We first describe the model, including a precise description of the arrival process that captures positive externalities. Next, we describe our objective: minimization of regret relative to the expected reward of a natural oracle policy.

**Arms and rewards.** Let A = {1, ..., m} be the set of available arms. At each time t ∈ {1, 2, ...} a new user arrives and an arm is pulled by the platform; we denote the arm pulled at time t by I_t. We view pulling an arm as presenting the corresponding option to the newly arrived user. Each arriving user prefers a subset of the arms, denoted by J_t; we describe below how J_t is determined. If arm a is pulled at time t and the user at time t prefers arm a (i.e., a ∈ J_t), then the reward obtained at time t is an independent Bernoulli random variable with mean μ_a. We assume μ_a > 0 for all arms. If the user at time t does not prefer the arm pulled, then the reward obtained at time t is zero.
We let X_t denote the reward obtained at time t. For t ≥ 1, let T_a(t) denote the number of times arm a is pulled up to and including time t, and let S_a(t) denote the total reward accrued by pulling arm a up to and including time t. Thus T_a(t) = |{1 ≤ s ≤ t : I_s = a}| and S_a(t) = |{1 ≤ s ≤ t : I_s = a, X_s = 1}|. We define T_a(0) = S_a(0) = 0.

**Unique best arm.** We assume there exists a unique a* ∈ A such that a* = arg max_a μ_a. This assumption is standard and made for technical convenience; all our results continue to hold without it.

**Arrivals with positive externalities.** We now define the arrival process {J_t}_{t ≥ 1} that determines users' preferences over arms; this arrival process is the novel feature of our model. We assume there are fixed constants θ_a > 0 for a ∈ A (independent of T), denoting the initial popularity of arm a. For t ≥ 0, define:

N_a(t) = S_a(t) + θ_a, a ∈ A.

Observe that by definition N_a(0) = θ_a. In our arrival process, arms with higher values of N_a(t) are more likely to be preferred. Formally, we assume that the t-th user prefers arm a (i.e., a ∈ J_t) with probability λ_a(t), independently of other arms, where

λ_a(t) = f(N_a(t−1)) / Σ_{a'=1}^{m} f(N_{a'}(t−1)),

and f(·) is a positive, increasing function. We refer to f as the externality function. In our analysis we primarily focus on the parametric family f(x) = x^α, where α ∈ (0, ∞).

Intuitively, the idea is that agents who prefer arm a are more likely to arrive if arm a has been successful in the past. This is a positive externality: users who prefer arm a are more likely to generate rewards when arm a is pulled, and this will in turn increase the likelihood that an arrival preferring arm a comes in the future. The parameter α controls the strength of this externality: the positive externality is stronger when α is larger.

If f is linear (α = 1), then we can interpret our model in terms of an urn process. In this view, θ_a resembles the initial number of balls of color a in the urn at time t = 1, and N_a(t) resembles the total number of balls of color a in the urn after t draws. Thus, the probability that the t-th draw is of color a is proportional to N_a(t). In contrast to the standard urn model, in our model we have additional control: namely, we can pull an arm, and thus govern the probability with which a new ball of the same color is added into the urn.

### 3.2 The oracle and regret

**Maximizing expected reward.** Throughout our presentation, we use T to denote the time horizon over which performance is being optimized. (The remainder of our paper characterizes upper and lower bounds on performance as the time horizon T grows large.) We let Γ_T denote the total reward accrued up to time T:

Γ_T = Σ_{t=1}^{T} X_t.

The goal of the platform is to choose a sequence {I_t} to maximize E[Γ_T]. As usual, I_t must be a function only of the past history (i.e., of observations prior to time t).

**The oracle policy.** As is usual in multiarmed bandit problems, we measure our performance against a benchmark policy that we refer to as the Oracle.

Definition 1 (Oracle). The Oracle algorithm knows the optimal arm a*, and pulls it at all times t = 1, 2, .... Let Γ*_T denote the reward of the Oracle.

Note that the Oracle may not be optimal for finite fixed T; in particular, unlike in the standard stochastic MAB problem, the expected cumulative reward E[Γ*_T] is not μ_{a*} T, as several arrivals may not prefer the optimal arm. The next proposition provides tight bounds on E[Γ*_T]. For the proof, see the Appendix.

Proposition 1. Suppose α > 0. Let θ = Σ_a θ_a. The expected cumulative reward E[Γ*_T] of the Oracle satisfies

μ_{a*} T − E[Γ*_T] = Θ( Σ_{k=1}^{T} (k + θ_{a*} − 1)^{−α} ).

In particular, we have:
- E[Γ*_T] = μ_{a*} T − Θ(T^{1−α}) for 0 < α < 1;
- E[Γ*_T] = μ_{a*} T − Θ(ln T) for α = 1;
- E[Γ*_T] = μ_{a*} T − Θ(1) for α > 1.

The discontinuity at α = 1 in the asymptotic bound above arises because Σ_{k=1}^{T} k^{−α} diverges as T → ∞ for α ≤ 1 but converges for α > 1. Further, the divergence is logarithmic for α = 1 but polynomial for each α < 1. Note that in all cases, the reward is asymptotically of order μ_{a*} T. This is the best achievable performance to leading order in T, showing that the Oracle is asymptotically optimal.

**Our goal: Regret minimization.** Given any policy, define the regret against the Oracle as

R_T = Γ*_T − Γ_T.

Our goal in the remainder of the paper is to minimize the expected regret E[R_T]. In particular, we focus on characterizing regret performance asymptotically to leading order in T (both lower bounds and achievable performance), for different values of the externality exponent α.

## 4 Lower bounds

In this section, we develop lower bounds on the achievable regret of any feasible policy. As we will find, these lower bounds are quite distinct from the usual Ω(ln T) lower bound (see [15, 8]) on regret for the standard stochastic MAB problem. This fundamentally different structure arises because of the positive externalities in the arrival process.

To understand our construction of the lower bound, consider the case where the externality function is linear (α = 1); the other cases follow similar logic. Our basic idea is that in order to determine the best arm, any optimal algorithm will need to explore all arms at least ln T times. However, this means that after t_0 = Θ(ln T) time, the total reward on the suboptimal arms will be of order Σ_{b≠a*} N_b(t_0) = Θ(ln T). Because of the effect of the positive externality, any algorithm will then need to recover from having accumulated rewards on these suboptimal arms. We show that even if the optimal arm a* is pulled from time t_0 onwards, a regret of Ω(ln² T) is incurred, simply because arrivals who do not prefer arm a* continue to arrive in sufficient numbers.

The next theorem provides regret lower bounds for all values of α. The proof can be found in the Appendix.

Theorem 1.
1. For α < 1, there exists no policy with E[R_T] = o(T^{1−α} ln T) on all sets of Bernoulli reward distributions.
2. For α = 1, there exists no policy with E[R_T] = o(ln² T) on all sets of Bernoulli reward distributions.
3. For α > 1, there exists no policy with E[R_T] = o(ln T) on all sets of Bernoulli reward distributions.

The remainder of the paper is devoted to studying the regret performance of classical algorithms (such as UCB), and to developing an algorithm that achieves the lower bounds above.

## 5 Suboptimality of classical approaches

We devote this section to developing structural insight into the model by characterizing the performance of two classical approaches for the standard stochastic MAB problem: the UCB algorithm [5, 8] and a random-explore-then-commit algorithm.

We first show that the standard upper confidence bound (UCB) algorithm, which does not account for the positive externality, performs poorly. (Recall that in the standard MAB setting, UCB achieves the asymptotically optimal O(ln T) regret bound [15, 8].) Formally, the UCB algorithm is defined as follows.

Definition 2 (UCB(γ)). Fix γ > 0. For each a ∈ A, let μ̂_a(0) = 0 and for each t > 0 let μ̂_a(t) := S_a(t−1)/T_a(t−1), under the convention that μ̂_a(t) = 0 if T_a(t−1) = 0. For each a ∈ A let u_a(0) = 0 and for each t > 0 let

u_a(t) := μ̂_a(t) + √( γ ln t / T_a(t−1) ).

At each time t, pull arm I_t ∈ arg max_a u_a(t), with ties broken uniformly at random.
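To make the dynamics concrete, the following Python sketch (an illustration, not the implementation used for the experiments in Section 7; the names `ExternalityBandit` and `run_ucb` are ours) simulates the arrival model of Section 3 with f(x) = x^α and runs the UCB(γ) index of Definition 2 on it. It can be used to reproduce qualitatively the failure mode analyzed next.

```python
import numpy as np

rng = np.random.default_rng(0)

class ExternalityBandit:
    """Arrival model of Section 3: user t prefers arm a independently
    with probability lambda_a(t) = N_a^alpha / sum_b N_b^alpha,
    where N_a = S_a + theta_a."""

    def __init__(self, mu, theta, alpha):
        self.mu = np.asarray(mu, dtype=float)        # Bernoulli reward means
        self.theta = np.asarray(theta, dtype=float)  # initial biases theta_a
        self.alpha = alpha                           # externality strength
        self.S = np.zeros(len(mu))                   # cumulative reward per arm

    def pull(self, arm):
        weights = (self.S + self.theta) ** self.alpha
        lam = weights / weights.sum()
        prefers = rng.random(len(self.mu)) < lam     # independent preferences J_t
        reward = rng.binomial(1, self.mu[arm]) if prefers[arm] else 0
        self.S[arm] += reward
        return reward

def run_ucb(env, T, gamma=3.0):
    """UCB(gamma): empirical mean plus sqrt(gamma ln t / T_a) exploration bonus."""
    m = len(env.mu)
    pulls = np.zeros(m)
    rewards = np.zeros(m)
    total = 0
    for t in range(1, T + 1):
        index = np.where(
            pulls > 0,
            rewards / np.maximum(pulls, 1)
            + np.sqrt(gamma * np.log(t) / np.maximum(pulls, 1)),
            np.inf,                                  # unpulled arms are tried first
        )
        arm = int(np.argmax(index))                  # ties broken by lowest index here
        r = env.pull(arm)
        pulls[arm] += 1
        rewards[arm] += r
        total += r
    return total, rewards

env = ExternalityBandit(mu=[0.5, 0.3], theta=[1.0, 1.0], alpha=1.0)
total, per_arm = run_ucb(env, T=30_000)
print("total reward:", total, "reward accrued on each arm:", per_arm)
```

A missed reward on a* both lowers its UCB index and shifts future arrivals away from a*, which is the self-reinforcing event discussed below.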
Under our model, consider an event where a* ∉ J_t but I_t = a*: i.e., arm a* is pulled but the arriving user does not prefer it. Under UCB, such events are self-reinforcing: they not only lower the upper confidence bound of arm a*, resulting in fewer future pulls of arm a*, but they also reduce the preference of future users towards arm a*.

It is perhaps not surprising, then, that UCB performs poorly. However, the impact of this self-reinforcement under UCB is so severe that we obtain a striking result: there is a strictly positive probability that the optimal arm a* will never see a positive reward, as shown by the following theorem. An immediate consequence of this result is that the regret of UCB is linear in the horizon length. The proof can be found in the Appendix.

Theorem 2. Suppose γ > 0, and suppose that f(x) = x^α for some α > 0. Under the UCB(γ) algorithm, there exists an ε_0 > 0 such that

P( lim_{T→∞} S_{a*}(T) = 0 ) ≥ ε_0.

In particular, the regret of UCB(γ) is Ω(T).

### 5.2 Random-explore-then-commit

UCB fails because it does not explore sufficiently. In this section, we show that more aggressive but unstructured exploration is not sufficient to achieve optimal regret. In particular, we consider a policy that chooses arms independently and uniformly at random for some period of time, and then commits to the empirically best arm for the rest of the horizon.

Definition 3 (REC(τ)). Fix τ ∈ Z_+. For each 1 ≤ t ≤ τ, choose I_t uniformly at random from the set A. Let â ∈ arg max_a S_a(τ), with ties broken at random. For τ < t ≤ T, pull I_t = â.

The following proposition provides performance bounds for the REC(τ) policy in our model. The proof of this result takes advantage of multitype continuous-time Markov branching processes [4, 12]; it is given in the Appendix.

Proposition 2. Suppose that θ_a is a positive integer for each a ∈ A. Let b = arg max_{a≠a*} μ_a. The following statements hold for the REC(τ) policy for any τ:
1. If 0 < α < 1 then E[R_T] = Ω̃(T^{1−α}).
2. If α = 1 then E[R_T] = Ω̃(T^{μ_b/(μ_b+μ_{a*})}).
3. If α > 1 then E[R_T] = Ω(T).

Thus, for α ≤ 1, the REC(τ) policy may improve on the performance of UCB by delivering sublinear regret. Nevertheless, this regret scaling remains suboptimal for each α. In the next section, we demonstrate that carefully structured exploration can deliver an optimal regret scaling (matching the lower bounds in Theorem 1).

## 6 Optimal algorithms

In this section, we present an algorithm that achieves the lower bounds presented in Theorem 1. The main idea of our algorithm is to structure exploration by balancing exploration across arms; this ensures that the algorithm is not left to correct a potentially insurmountable imbalance in the user population once the optimal arm has been identified. We first present a baseline algorithm called Balanced Exploration (BE) that nearly achieves the lower bound and illustrates the key benefit of balancing; this algorithm has the advantage that it needs no knowledge of system parameters. We then present a natural modification of this algorithm, called Balanced Exploration with Arm Elimination (BE-AE), that achieves the lower bound in Theorem 1, though it uses some knowledge of system parameters in doing so.

### 6.1 Balanced exploration

The BE policy is cautious during the exploration phase in the following sense: it pulls the arm with the least accrued reward, giving that arm a further opportunity to ramp up its score in case its poor performance so far was bad luck. At the end of the exploration phase, it exploits the empirically best arm for the rest of the horizon.
To define BE, we require an auxiliary sequence w_k, k = 1, 2, ..., used to set the exploration time. The only requirement on this sequence is that w_k → ∞ as k → ∞; e.g., w_k could be ln ln k for each positive integer k. The BE algorithm is defined as follows.

Definition 4 (Balanced Exploration (BE)). Given T, let n = w_T ln T.
1. Exploration phase: Explore until the (random) time τ_n = min{t : S_b(t) ≥ n for all b ∈ A} ∧ T, i.e., explore until each arm has accrued at least n rewards, and set τ_n = T if some arm accrues fewer than n rewards by time T. Formally, for 1 ≤ t ≤ τ_n, pull arm I_t ∈ arg min_{a∈A} S_a(t−1), with ties broken at random.
2. Exploitation phase: Let â* ∈ arg min_{a∈A} T_a(τ_n), with ties broken at random. For τ_n + 1 ≤ t ≤ T, pull arm â*.

Note that this algorithm only uses prior knowledge of the time horizon T, but no other system parameters; in particular, we do not need information on the strength of the positive externality, captured by α. Our main result is the following. The proof can be found in the Appendix.

Theorem 3. Suppose w_k, k = 1, 2, ..., is any sequence such that w_k → ∞ as k → ∞. Then the regret of the BE algorithm is as follows:
1. If 0 < α < 1 then E[R_T] = O(w_T T^{1−α} ln T).
2. If α = 1 then E[R_T] = O(w_T ln² T).
3. If α > 1 then E[R_T] = O(w_T ln T).

In particular, observe that if w_k = ln ln k, then we conclude E[R_T] = Õ(T^{1−α} ln T) (if 0 < α < 1); E[R_T] = Õ(ln² T) (if α = 1); and E[R_T] = Õ(ln T) (if α > 1). Recall that the notation f(T) = Õ(g(T)) means there exists k > 0 such that f(T) = O(g(T) ln^k g(T)).

### 6.2 Balanced exploration with arm elimination

The BE algorithm very nearly achieves the lower bounds in Theorem 1. The additional inflation (captured by the extra factor w_T) arises in order to ensure that the algorithm achieves low regret despite not having information on system parameters. We now present an algorithm that eliminates this inflation in regret by intelligently removing arms that perform poorly during the exploration phase, using upper and lower confidence bounds. The algorithm assumes that T, m, α, and θ_a for each a are known to the platform (we discuss relaxing the assumption on knowledge of θ_a further below). With these informational assumptions, λ_a(t) can be computed by the platform at each time t. Below, μ̂_a(t) is an unbiased estimate of μ_a given observations up to time t, while u_a(t) and l_a(t) are its upper and lower confidence bounds.

Definition 5 (Balanced Exploration with Arm Elimination (BE-AE)). Given T, m, and α, as well as θ_a for each a ∈ A, for each time t and each arm a define

μ̂_a(t) = (T_a(t))^{−1} Σ_{s ≤ t : I_s = a} X_s / λ_a(s).

Further, let c = min_{a,b∈A} θ_a / (m(1 + θ_b)). Define

u_a(t) = μ̂_a(t) + √(5 ln T / (c T_a(t))),  and  l_a(t) = μ̂_a(t) − √(5 ln T / (c T_a(t))).

Let A(t) be the set of active arms at time t. At time t = 1 all arms are active, i.e., A(1) = A. At each time t, pull arm I_t ∈ arg min_{a∈A(t)} S_a(t−1), with ties broken lexicographically. Eliminate arm a from the active set if there exists an active arm b ∈ A(t) such that u_a(t) < l_b(t).

The following theorem shows that the BE-AE algorithm achieves optimal regret, i.e., it meets the lower bounds in Theorem 1. The proof can be found in the Appendix.

Theorem 4. For fixed m and α, the regret under the BE-AE algorithm satisfies the following:
1. If 0 < α < 1 then E[R_T] = O(T^{1−α} ln T).
2. If α = 1 then E[R_T] = O(ln² T).
3. If α > 1 then E[R_T] = O(ln T).

As noted above, BE-AE requires some knowledge of system parameters.
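The following Python sketch (illustrative only, not the authors' implementation) implements the two phases of BE from Definition 4 and the arm-elimination loop of BE-AE as reconstructed in Definition 5, including the inverse-propensity estimate of μ_a. It assumes an environment object exposing a `pull(arm)` method and `mu`, `theta`, `alpha` attributes, such as the `ExternalityBandit` sketch in Section 5.

```python
import numpy as np

def run_be(env, T, w_T):
    """Balanced Exploration (Definition 4): explore until every arm has
    accrued n = w_T * ln(T) rewards, always pulling the arm with the least
    accrued reward, then commit to the arm that needed the fewest pulls."""
    m = len(env.mu)
    n = w_T * np.log(T)
    S = np.zeros(m)        # rewards observed per arm by the algorithm
    pulls = np.zeros(m)
    committed = None
    total = 0.0
    for t in range(1, T + 1):
        arm = committed if committed is not None else int(np.argmin(S))
        r = env.pull(arm)
        S[arm] += r
        pulls[arm] += 1
        total += r
        if committed is None and S.min() >= n:
            # Fewest pulls to reach n rewards == highest empirical success rate.
            committed = int(np.argmin(pulls))
    return total

def run_be_ae(env, T):
    """BE with Arm Elimination (Definition 5): balance rewards across active
    arms; drop arm a once its UCB falls below the LCB of some active arm."""
    m = len(env.mu)
    theta, alpha = env.theta, env.alpha
    c = min(theta[a] / (m * (1.0 + theta[b])) for a in range(m) for b in range(m))
    radius_const = np.sqrt(5.0 * np.log(T) / c)
    S = np.zeros(m)
    pulls = np.zeros(m)
    ipw_sum = np.zeros(m)  # sum of X_s / lambda_a(s) over pulls of arm a
    active = list(range(m))
    total = 0.0
    for t in range(1, T + 1):
        weights = (S + theta) ** alpha
        lam = weights / weights.sum()            # computable: theta, alpha known
        arm = min(active, key=lambda a: S[a])    # ties broken lexicographically
        r = env.pull(arm)
        S[arm] += r
        pulls[arm] += 1
        ipw_sum[arm] += r / lam[arm]             # unbiased estimate of mu_a
        total += r
        if len(active) > 1:
            mu_hat = np.where(pulls > 0, ipw_sum / np.maximum(pulls, 1), 0.0)
            rad = np.where(pulls > 0, radius_const / np.sqrt(np.maximum(pulls, 1)), np.inf)
            ucb, lcb = mu_hat + rad, mu_hat - rad
            best_lcb = max(lcb[a] for a in active)
            active = [a for a in active if ucb[a] >= best_lcb]
    return total

# Example usage with the ExternalityBandit sketch from Section 5:
#   env = ExternalityBandit(mu=[0.5, 0.3], theta=[1.0, 1.0], alpha=1.0)
#   run_be(env, T=30_000, w_T=2 * np.log(np.log(30_000)))
```

The key design point in both routines is that pulls are driven by accrued rewards rather than pull counts, which keeps the arrival population balanced until the algorithm is ready to commit.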
We briefly describe an approach that we conjecture delivers the same performance as BE-AE, but without knowledge of θ_a for a ∈ A. Given a small ε > 0, first run the exploration phase of the BE algorithm with n = ε ln T, without eliminating any arm. Once at least ε ln T rewards have been obtained on each arm, we have N_a(t) ≥ ε ln T + θ_a, so the effect of θ_a on λ_a(t) becomes negligible, and one can approximate λ_a(t) by setting N_b(t) = S_b(t) for each arm b. We then continue with the BE-AE algorithm as defined above (after completion of this exploration phase). We conjecture that the regret performance of this algorithm matches that of BE-AE as defined above. Proving this result, and more generally removing the dependence on T, m, and α, remain interesting open directions.

## 7 Simulations

Below, we summarize our simulation setup and then describe our main findings.

**Simulation setup.** We simulate our model with m = 2 arms, externality strength α = 1, arm reward parameters μ_1 = 0.5 and μ_2 = 0.3, and initial biases θ_1 = θ_2 = 1. For Fig. 1a, we simulate each algorithm one hundred times for each set of parameters. We plot the pseudo-regret realization from each simulation, i.e., E[Γ*_T] − Γ_T, where E[Γ*_T] is the expected reward of the Oracle, computed via Monte Carlo simulation, and Γ_T is the total reward achieved by the algorithm. Thus, a lower pseudo-regret realization implies better performance. For Fig. 1b, each point is obtained by simulating the corresponding algorithm one thousand times. The time horizon T is as mentioned in the figures.

**Parameters for each algorithm.** We simulate UCB(γ) with γ = 3. For Random-explore-then-commit, we set the exploration time to τ = √T (empirically, this performs significantly better than ln T). For BE, we set w_T = β ln ln T with β = 2 (see Definition 4). For BE-AE (cf. Definition 5), recall that the upper and lower confidence bounds are set as u_a(t) = μ̂_a(t) + p √(ln T / T_a(t)) and l_a(t) = μ̂_a(t) − p √(ln T / T_a(t)), with p = (5/c)^{1/2}, where c = min_{a,b∈A} θ_a / (m(1 + θ_b)). This choice of p was made for technical reasons in the analysis, but unfortunately it is suboptimal for finite T. The choice p = 1/2 achieves significantly better performance in this experimental setup. The performance is sensitive to small changes in p, as the plots illustrate for the choice p = 5/2. In contrast, in our experiments we found that the performance of BE is relatively robust to the choice of β.

Figure 1: Performance comparison of algorithms in different parameter regimes. (a) Realized pseudo-regret for T = 3 × 10⁴. (b) Expected regret as a function of the time horizon T. All simulations have m = 2 arms, externality strength α = 1, arm reward parameters μ_1 = 0.5 and μ_2 = 0.3, and initial arm biases θ_1 = θ_2 = 1.

**Main findings.** The following are our main findings from the above simulations. First, even for α = 1, REC appears to perform as poorly as UCB. Recall that in Section 5 we show theoretically that the regret of UCB is linear for each α, and that of REC is linear for α > 1; for α = 1, we are only able to show that REC exhibits polynomial regret. Second, for finite T, the performance of the (asymptotically optimal) BE-AE algorithm is quite sensitive to the choice of algorithm parameters, and thus it may perform poorly in certain regimes. By contrast, the (nearly asymptotically optimal) BE algorithm appears to exhibit more robust performance.
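For a concrete sense of why the theoretical confidence radius is conservative, the snippet below evaluates p = (5/c)^{1/2} for the simulation parameters above, using the expression for c from Definition 5 as reconstructed there; the exact constant in the analysis may differ, so the number should be read as indicative only.

```python
import math

m = 2
theta = [1.0, 1.0]
# c = min over arm pairs (a, b) of theta_a / (m * (1 + theta_b)), per Definition 5.
c = min(theta[a] / (m * (1.0 + theta[b])) for a in range(m) for b in range(m))
p_theory = math.sqrt(5.0 / c)
print(c, p_theory)  # c = 0.25, p ~ 4.47: far wider than the empirically better p = 1/2
```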
## 8 Discussion and conclusions

It is common that platforms make online decisions under uncertainty, and that these decisions impact future user arrivals. However, most MAB models in the past have decoupled the evolution of arrivals from the learning process. Our model, though stylized by design, provides several non-standard yet interesting insights which we believe are relevant to many platforms. In particular:

1. In the presence of self-reinforcing preferences, there is a cost to being optimistic in the face of uncertainty, as mistakes are amplified.
2. It is possible to mitigate the impact of transients arising from positive externalities by structuring the exploration procedure carefully.
3. Once enough evidence is obtained regarding the optimality of a strategy, one may even use the externalities to one's advantage, by purposefully shifting the arrivals towards a profit-maximizing population.

Of course, real-world scenarios are complex and involve other types of externalities which may reverse some of these gains. For example, the presence of negative externalities may preclude the ability to have all arrivals prefer the chosen option. Alternatively, arrivals may have limited "memory", so that future arrivals might eventually forget the effect of the externality. Overall, we believe that this is an interesting yet under-explored space of research, and that positive externalities of the kind we study may play a pivotal role in the effectiveness of learning algorithms.

## References

[1] Daron Acemoglu, Munther A. Dahleh, Ilan Lobel, and Asuman Ozdaglar. Bayesian learning in social networks. The Review of Economic Studies, 78(4):1201–1236, 2011.

[2] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43. ACM, 2005.

[3] Chris Anderson. The long tail. Wired Magazine, 12(10):170–177, 2004.

[4] Krishna B. Athreya and Samuel Karlin. Embedding of urn schemes into continuous time Markov branching processes and related limit theorems. Ann. Math. Statist., 39(6):1801–1817, 1968.

[5] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[6] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[7] Sushil Bikhchandani, David Hirshleifer, and Ivo Welch. A theory of fads, fashion, custom, and cultural change as informational cascades. Journal of Political Economy, 100(5):992–1026, 1992.

[8] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1), 2012.

[9] Soumen Chakrabarti, Alan Frieze, and Juan Vera. The influence of search engines on preferential attachment. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2005.

[10] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Springer.

[11] David A. Freedman. On tail probabilities for martingales. Ann. Probab., 3(1):100–118, 1975.

[12] Svante Janson. Functional limit theorems for multitype branching processes and generalized Pólya urns. Stochastic Processes and their Applications, 110(2):177–245, 2004.

[13] Michael L. Katz and Carl Shapiro. Systems competition and network effects. Journal of Economic Perspectives, 8(2):93–115, 1994.

[14] Petra Küster. Generalized Markov branching processes with state-dependent offspring distributions. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 64(4):475–503, 1983.
[15] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22, 1985.

[16] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.

[17] Andreu Mas-Colell, Michael Dennis Whinston, Jerry R. Green, et al. Microeconomic Theory, volume 1. Oxford University Press, 1995.

[18] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, pages 693–721, 2013.

[19] Jacob Ratkiewicz, Santo Fortunato, Alessandro Flammini, Filippo Menczer, and Alessandro Vespignani. Characterizing and modeling the dynamics of online popularity. Physical Review Letters, 105(15):158701, 2010.

[20] Carl Shapiro and Hal R. Varian. Information Rules: A Strategic Guide to the Network Economy. Harvard Business Press, 1998.

[21] Oz Shy. A short survey of network economics. Review of Industrial Organization, 38(2):119–.

[22] Aleksandrs Slivkins. Contextual bandits with similarity information. In Proceedings of the 24th Annual Conference On Learning Theory, 2011.

[23] Lones Smith and Peter Sørensen. Pathological outcomes of observational learning. Econometrica, 68(2):371–398, 2000.

[24] David Williams. Probability with Martingales. Cambridge University Press, 1991.

[25] G. Udny Yule. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 213(402–410):21–87, 1925.