# Firing Bandits: Optimizing Crowdfunding

Lalit Jain¹, Kevin Jamieson¹

¹Computer Science and Engineering, University of Washington, Seattle, USA. Correspondence to: Lalit Jain, Kevin Jamieson.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract. In this paper, we model the problem of optimizing crowdfunding platforms, such as the non-profit Kiva or the for-profit Kickstarter, as a variant of the multi-armed bandit problem. In our setting, a Bernoulli arm emits no reward until its cumulative number of successes over any number of trials exceeds a fixed threshold, and it provides no additional reward for any further trials; the process is reminiscent of a neuron firing once it reaches its action potential and then saturating. In the spirit of an infinite-armed bandit problem, the player can add new arms whose expected probability of success is drawn i.i.d. from an unknown distribution; this endless supply of projects models the harsh reality that the number of projects seeking funding greatly exceeds the total capital available from lenders. Crowdfunding platforms naturally fall under this setting: the arms are potential projects, and an arm's probability of success is the probability that a potential funder decides to fund the project after reviewing it. The goal is to play arms (prioritize the display of projects on a webpage) to maximize the number of arms that reach the firing threshold (meet their goal amount) using as few total trials (impressions) as possible over all the played arms. We provide an algorithm for this setting and prove sublinear regret bounds.

1. Introduction

In the crowdfunding paradigm, a crowd of many users pledge to contribute a small amount of funding toward a project, but the project is only funded if the total amount of funding pledged exceeds a known reserve price. This is reminiscent of the firing of a neuron once it reaches its action potential, where the number of pledges from the crowd controls the firing. Recent years have seen a huge proliferation of crowdfunding sites, with over 700 platforms in 2012 and an estimated 2000 in 2016. These platforms account for a significant amount of investment; the World Bank estimates that $90 billion will be raised through crowdfunding ventures alone in 2020 (Barnett, 2015). This includes non-profit micro-lending sites such as Kiva, charity organizations such as DonorsChoose and GoFundMe, and venture capital efforts such as Kickstarter and Indiegogo.

Crowdfunding platforms seeking to maximize the number of projects funded need to decide when and how to show projects to users visiting their sites. The administrators of a platform can influence which projects are more likely to be funded by showing them more or less often in display results. Naive strategies that do not prioritize the projects most likely to reach the reserve price within a reasonable amount of time are doomed to only partially fund many projects (which fail to fire), versus the preferred outcome of a moderate number of projects fully funded to the reserve price (which fire). An intuitive strategy to maximize the number of projects that hit their reserve price is to first estimate the popularity of a project by the proportion of users who pledged funding relative to the total number of users who reviewed the project, and then prioritize the projects on the webpage by popularity.
This is a challenge because the popularity of a project cannot be accurately estimated without prioritizing it enough to gather a sufficient number of reviews; simultaneously, we want to avoid prioritizing unpopular projects. Through personal correspondence with some of these popular platforms, we have learned that these companies rely on poorly motivated heuristics. Indeed, we became aware of this problem from one such platform seeking assistance. Despite the enormous interest, capital investment, and potential for small improvements to make a large impact on the number of projects funded, we are unaware of any rigorous mathematical study of, or approaches to, optimizing these mechanisms. Our paper seeks to bridge this gap and is organized as follows. In the next section we specify our problem statement precisely. We then present our algorithm in Section 2 and prove a sub-linear regret bound. Finally, we present both synthetic experiments and real-world scenarios derived from data on the Kiva platform.

1.1. The Problem Statement

Two key features define recommendation for the crowdfunding problem. Firstly, since the goal is to fund as many projects as possible, rather than just to collect as many pledges as possible, we do not see a reward for an arm until the project has hit its reserve price. In addition, once funded, the project effectively exits the recommendation system. Secondly, in most crowdfunding models, the supply of funding seekers vastly exceeds the demand from lenders. Thus, at any given time there is an exceedingly large pool of loans to choose from when making a recommendation to a user, and this pool is constantly growing.

We also make the simplifying assumption that each pledge is the same amount, and that each loan needs the same number of pledges. Thus, instead of worrying about how much funding a project raises, we focus only on the number of pledges. This is actually a realistic assumption, since on many platforms the majority of pledges are a single small amount such as $25 (see (Barnett, 2015)).

With the above considerations in mind, we model each project as a biased coin with mean µ: showing the project to a lender is a flip of the coin, and µ is the expected proportion of times the project receives a pledge when shown. We assume that at any time we can obtain a new project/coin whose mean is drawn i.i.d. from a reservoir distribution F defined on [0, 1]. For a project to reach its reserve price, i.e. for the coin to fire, the number of successes must reach a threshold of $\tau \in \mathbb{N}$ in a sequence of flips of the coin. Throughout the rest of the paper, we adopt the vocabulary of coins and flips. Unlike models normally used for search, we assume that a single project is chosen for evaluation at each time. We believe it should be possible to generalize to settings in which multiple loans are displayed simultaneously in search results (e.g. cascading bandits (Kveton et al., 2015; Szepesvári et al., 2015) or best-of-K bandits (Simchowitz et al., 2016)).

The protocol for any algorithm for firing bandits is shown below. The algorithm maintains a set of coins, and at every time step it chooses between two actions: either flip an existing coin, or draw a new coin from F and flip it. Our goal is to find a policy that maximizes the number of coins that reach τ successes and fire.

Firing Bandits Protocol
  Input: reserve price τ
  S: set of coins being maintained by the algorithm
  for t = 1, 2, . . . the algorithm does one of the following:
    - Draw a new coin from F, flip it, and add it to S.
    - Choose a coin from S and flip it.
  If a coin has had a total of τ successes, it fires to yield a reward of 1 and is then removed from S.
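To make the protocol concrete, the following is a minimal simulation sketch, not part of the paper. The `FiringBanditsEnv` class name, the Beta reservoir, and the naive keep-100-coins policy in the usage snippet are illustrative assumptions rather than the authors' implementation.

```python
import random

class FiringBanditsEnv:
    """Minimal sketch of the firing-bandits protocol (illustrative only).

    A coin's mean is drawn i.i.d. from a reservoir distribution F; a coin fires
    (yielding a reward of 1) once it accumulates tau successes, and then exits.
    """

    def __init__(self, reservoir, tau):
        self.reservoir = reservoir   # callable returning a mean mu in [0, 1]
        self.tau = tau               # number of successes needed to fire
        self.coins = []              # active coins as [mu, successes] pairs
        self.total_flips = 0
        self.total_fired = 0

    def draw_new_coin(self):
        """Draw a fresh coin from F, add it to S, and flip it once."""
        self.coins.append([self.reservoir(), 0])
        return self.flip(len(self.coins) - 1)

    def flip(self, i):
        """Flip coin i; returns True if this flip made it fire (and removed it)."""
        mu, successes = self.coins[i]
        self.total_flips += 1
        if random.random() < mu:
            successes += 1
            if successes == self.tau:   # coin fires: reward 1, coin leaves the system
                self.coins.pop(i)
                self.total_fired += 1
                return True
        self.coins[i][1] = successes
        return False

# Usage sketch: a naive policy that keeps a pool of 100 coins and flips one uniformly at random.
env = FiringBanditsEnv(reservoir=lambda: random.betavariate(1 / 20, 1), tau=10)
for _ in range(100_000):
    if len(env.coins) < 100:
        env.draw_new_coin()
    else:
        env.flip(random.randrange(len(env.coins)))
print(env.total_fired, "coins fired using", env.total_flips, "flips")
```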
1.2. Fixed Budget Policies

Without loss of generality, we can assume an optimal policy following the protocol has full knowledge of F (indeed, there will be different optimal policies for different reservoir distributions). Since the means of the coins are drawn i.i.d. from F, the flips of different coins are independent and, in particular, provide no information about each other. This immediately implies that there exists an optimal policy that treats each coin individually and identically. The implication of this is crucial: there exists an optimal policy that looks at coins one at a time, choosing to keep flipping or to discard a coin based only on its history of successes. However, since we do not have access to F, our approach is to optimize over a policy class that we believe contains a near-optimal policy. Such a policy search naturally appears in multi-armed bandit problems. For example, in the standard multi-armed bandit each arm gives rise to the policy that only pulls that arm in each round, and the optimal policy corresponds to playing the arm with the highest mean. A more interesting example is contextual bandits, where policies map context feature vectors to actions. Algorithms for multi-armed bandits optimize over this set of policies to find the optimal arm.

For a given coin with mean $\mu \sim F$ (where we refer to µ simultaneously as the coin and the mean of the coin), let $\{X_i\}_{i=1}^T \in \{0, 1\}^T$ be an empirical sequence of flips and let $X(t) = \sum_{i=1}^t X_i$. The sequence of flips gives rise to a random walk $(t, X(t))$ that we refer to as the coin's trajectory. A coin fires at time t only if $X(t) = \tau$. The line with slope 1 corresponds to the trajectory of a coin with probability 1 of success, and similarly the horizontal axis corresponds to the trajectory of a coin with probability 0 of heads; each random walk $X(t)$ lies between these two lines. Furthermore, the average slope of a random walk after t steps, $X(t)/t$, is an empirical estimate of the value of µ. By the theory of random walks we expect $|X(t) - \mu t| \le \sqrt{2\mu(1-\mu)t}$ with constant probability. The minimum t such that $X(t) = \tau$ is a negative binomial random variable with mean $\tau/\mu$.

A policy is given by a non-decreasing function $\pi(t)$, where µ is rejected at time t if $X(t) \le \pi(t)$. Intuitively, if $X(t) > \pi(t)$ there is some hope that $\mu t > \pi(t)$ and that the positive trend will continue until the coin reaches τ successes and fires. Because $X(t)$ is non-decreasing, it only makes sense to consider policies $\pi(t)$ that are also non-decreasing; indeed, if π were ever decreasing, it would behave just as if it remained constant. The difficulty of finding the optimal policy is now apparent: it requires a search over the space of all non-decreasing functions. Even if we knew F, it is not clear how we could carry out such a search. There exist several natural families of policies arising in the sequential testing literature (see (Siegmund, 1985)). Notable classes include the SPRT, which corresponds to $\pi(t)$ being an affine function (Figure 1). In this work we choose to study fixed budget policies, which correspond to vertical thresholds.
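As an illustration of viewing a policy as a non-decreasing rejection boundary $\pi(t)$, the sketch below (not from the paper; the boundary parameters are made up) flips a single coin until it either fires or its trajectory $X(t)$ falls to or below $\pi(t)$. An affine boundary plays the role of an SPRT-style test, while a boundary that only bites at t = B recovers the fixed budget policy studied in the remainder of the paper.

```python
import random

def evaluate_coin(mu, tau, boundary, max_flips=100_000):
    """Flip a coin with mean mu until it fires (tau successes) or its trajectory
    X(t) drops to or below the non-decreasing boundary pi(t).
    Returns (fired, flips_used).  Illustrative sketch only."""
    x = 0                                 # X(t): number of successes so far
    for t in range(1, max_flips + 1):
        x += random.random() < mu
        if x >= tau:
            return True, t                # the coin fires
        if x <= boundary(t):
            return False, t               # trajectory hit the boundary: reject the coin
    return False, max_flips

tau = 10

# SPRT-style affine boundary pi(t) = a*t - b (a, b chosen arbitrarily here).
sprt = lambda t: 0.05 * t - 2.0

# Fixed budget policy pi_B as a "vertical threshold": never reject before t = B,
# reject at t = B if the coin has not yet fired.
B = 200
fixed_budget = lambda t: -1 if t < B else tau

print(evaluate_coin(0.2, tau, sprt))
print(evaluate_coin(0.2, tau, fixed_budget))
```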
It remains to define a policy class that is rich enough for our algorithm to search over while only considering a single coin at a time. A fixed budget policy $\pi_B$ evaluates a coin by repeatedly flipping it and rejecting it if τ successes have not been reached within B flips (to avoid additional notation, we may use B to refer simultaneously to the number of flips and to the policy $\pi_B$). Note that we expect coins that do not fire to have means less than $\tau/B$. In practice, $\pi_B$ corresponds to showing a project on a crowdfunding site at most B times, and then retiring it if it fails to get funded. This type of policy is appealing for a number of reasons: 1) it is intuitively fair to loan-seeking applicants and simple to implement, 2) executing $\pi_B$ also simultaneously evaluates every $\pi_{B'}$ for $B' \le B$ (we will expand on this later), and 3) it leads to interpretable regret bounds.

Figure 1. Example trajectory. The red dashed line corresponds to the trajectory of a coin with success probability 1. The purple line is an example of a linear SPRT boundary, and the blue dashed line is a fixed budget policy $\pi_B$.

The problem of finding the best B will occupy the rest of this paper. If we had knowledge of F, we could use calculus techniques to find the best value of B. Faced with access only to the empirical flips of coins that expire, we will have to consider all possible values of B.

1.3. Related Work

The firing bandit problem involves several salient features which have not been previously addressed in the literature. The first is the aspect of only receiving a reward once enough successes have been observed. Past work on delayed bandits (Vernade et al., 2017; Joulani et al., 2013) has considered the case when rewards may not be received until a later round; however, this is very different from receiving only a single reward at some later time. There are also several (very) different bandit models referred to as threshold bandits (Abernethy et al., 2016; Locatelli et al., 2016). While these papers share some of the same vocabulary, they are variants of the standard multi-armed bandit and there is no notion of a threshold being passed once enough successes are accumulated.

Secondly, once an arm has fired (reached τ heads), it can no longer be played, but we can optionally add as many arms as we want with means drawn i.i.d. from F. Many bandit models have considered situations where the arms change over time. This includes sleeping bandits (Kleinberg et al., 2010), where the set of arms varies over time. In mortal bandits (Chakrabarti et al., 2009), each arm has a different budget of pulls it receives before going away and being replaced by a new arm whose underlying mean is drawn from a distribution F. While related, the major difference is that in mortal bandits rewards are accumulated as they are observed instead of only after the arm expires, which is the core difficulty of our problem.

In some sense, our problem is most closely modeled by an infinite armed bandit problem. In this setting we have infinitely many arms to choose from, with the goal of maximizing cumulative reward. The mean rewards of the arms are distributed according to an underlying F. In the setting where F is semi-parametric or known, extensive work has been done in both the pure exploration case, where the goal is to find an arm close to the optimal (see (Carpentier & Valko, 2015; Jamieson et al., 2016)), and in the cumulative expected regret setting (Berry et al., 1997; Wang et al., 2009; Bonald & Proutiere, 2013).
Recent work has provided algorithms for the stochastic and non-stochastic pure exploration settings when the distribution is unknown (Li et al., 2016). Most algorithms for infinite armed bandits follow a similar mold. A large number of arms are sampled from the distribution (sometimes requiring partial knowledge of the underlying distribution to ensure a sufficient number, but not too many for the budget) and then an algorithm like UCB (Auer et al., 2002) is executed. Such algorithms exploit the fact that the arms are drawn i.i.d. from a distribution: if they have seen a good arm, there is likely another just as good still out there. Hence, an arm is typically sampled as long as it cannot be ruled out as potentially superior to those found so far. In our setting, however, our arms expire and it is not at all clear how such a technique could be adapted. Firstly, since the goal is to have as many coins fire as possible, it is unclear that dropping arms is a good idea: there is the possibility that no arms would fire even after a very large number of samples using such a strategy. Secondly, as we will show in the experiments section, naive UCB-style algorithms do not work well in an anytime sense, since they fail to reject coins that take extremely long times to convert. There is the possibility of using a UCB-style algorithm that keeps increasing the number of arms it considers as time progresses, but we believe such a policy is very difficult to analyze in our setting. See (Wang et al., 2009) for an example of such a strategy.

2. Our Approach

Given a coin drawn i.i.d. from F, let the random variable $C_B$ be the indicator that it fires under policy $\pi_B$. Moreover, let $N_B$ be the random variable representing the number of times it is flipped by policy $\pi_B$; clearly, $\tau \le N_B \le B$. Of course, we will be applying $\pi_B$ repeatedly to many coins and using the results to estimate $\mathbb{E}[C_B]$ and $\mathbb{E}[N_B]$. If we draw m coins from F and evaluate each one using policy $\pi_B$, let $\hat{C}(B, m)$ be the empirical proportion of the m coins that fired, and $\hat{N}(B, m)$ the empirical mean of the number of times each of the m coins was flipped.

Intuitively, the efficacy of $\pi_B$ should be measured by the expected number of coins that fire within a budget of V total flips over all coins. Let $m(V)$ be the random variable representing how many coins end up being evaluated by $\pi_B$ using V total flips. The last coin we look at may be stopped prematurely, and we let $0 \le \kappa \le B$ denote the number of flips it was given. Then the number of coins that fire is $m(V)\,\hat{C}(B, m(V))$ and $V = m(V)\,\hat{N}(B, m(V)) + \kappa$. The rate at which coins fire is given by
$$\frac{m(V)\,\hat{C}(B, m(V))}{V} = \frac{\hat{C}(B, m(V))}{\hat{N}(B, m(V)) + \kappa/m(V)} \;\longrightarrow\; \frac{P(C_B = 1)}{\mathbb{E}[N_B]} \quad \text{as } V \to \infty.$$
The last step follows by the law of large numbers since $m(V) \ge (V - B)/B \to \infty$. This motivates defining, for any B,
$$\rho(B) := \frac{P(C_B = 1)}{\mathbb{E}[N_B]}, \qquad \rho^* := \max_B \rho(B).$$
As the total budget of flips V gets very large, the expected number of coins that fire under $\pi_B$ is $V\rho(B)$. Analogous to the empirical estimators given above, we can also define $\hat\rho(B, m) = \hat{C}(B, m)/\hat{N}(B, m)$; as m grows, $\hat\rho(B, m)$ provides an asymptotically consistent estimator of $\rho(B)$. If our algorithm allocates a budget of V total flips to B as above, versus using an optimal fixed budget policy, asymptotically our expected loss in the number of coins that fire is $V\rho^* - V\rho(B)$. We can generalize this to any algorithm that allocates a budget of V among policies.
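As a sanity check on these definitions, $\rho(B)$ can be estimated by Monte Carlo exactly as the empirical estimator $\hat\rho(B, m) = \hat{C}(B, m)/\hat{N}(B, m)$ suggests. The sketch below is illustrative only; the Beta(1/20, 1) reservoir and the particular budgets B are assumptions made for the example.

```python
import random

def run_fixed_budget_policy(mu, tau, B):
    """Apply pi_B to one coin with mean mu: flip at most B times, stopping early
    if the coin fires.  Returns (C_B, N_B)."""
    successes = 0
    for n in range(1, B + 1):
        successes += random.random() < mu
        if successes == tau:
            return 1, n              # fired after n <= B flips
    return 0, B                      # did not fire within the budget B

def estimate_rho(reservoir, tau, B, m):
    """rho_hat(B, m) = C_hat(B, m) / N_hat(B, m): empirical rate of coins fired per flip."""
    fired = 0
    flips = 0
    for _ in range(m):
        c, n = run_fixed_budget_policy(reservoir(), tau, B)
        fired += c
        flips += n
    return fired / flips

# Compare a few budgets B under an assumed reservoir F = Beta(1/20, 1) with tau = 10.
reservoir = lambda: random.betavariate(1 / 20, 1)
for B in (50, 200, 1000, 5000):
    print(B, estimate_rho(reservoir, tau=10, B=B, m=2000))
```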
Definition. Consider an algorithm that is allocated a budget of V total flips and considers a random number $m(V)$ of coins. In each round i, the algorithm draws a coin from F and flips it according to policy $B_i$ using $v_i$ flips (so $V = \sum_{i=1}^{m(V)} v_i$). Then the regret of the algorithm is
$$V\rho^* - \mathbb{E}\Big[\sum_{i=1}^{m(V)} v_i\,\rho(B_i)\Big].$$

The regret expresses how many fewer coins the algorithm fires relative to a policy B that satisfies $\rho(B) = \rho^*$. It is important to mention that our notion of regret is not based on the optimal policy over all policies but rather on the weaker notion of the optimal fixed budget policy. In general, the problem of identifying B with $\rho(B) = \rho^*$ is not well defined, as there can be several such optimal policies; we let $B^* := \min\{B : \rho(B) = \rho^*\}$. Finding the optimal value of ρ can be difficult due to a lack of good optimization properties: in general $\rho(B)$ need not be concave, much less unimodal.

Low Probability of Conversion. For most crowdfunding platforms such as Kiva or Kickstarter, the demand from funding seekers is far greater than the total capital available from all lenders, so very few projects have a chance of being funded. For example, only 3.6% of technology projects reach their funding goal on Indiegogo, and 80% of all projects do not reach a quarter of their reserve price (Jeffries, 2013). This suggests that only the best loans get funded, and that there are very few good loans. Any optimal policy will only target good loans drawn from F, i.e. loans with large values of µ, and even if the policy is funding loans at the maximum rate, the probability of any single loan being funded will be very low. Hence, throughout the rest of the paper we will assume that the probability of a coin drawn from F obtaining τ successes (i.e., firing) under an optimal fixed budget policy is bounded above; equivalently, the probability that a coin does not fire, even under the optimal policy, is bounded below. Specifically, we assume there exists α > 0 such that $P(C_B = 1) < 1 - \alpha$, or equivalently $P(C_B = 0) > \alpha$, for any $B \in \{B : \rho(B) = \rho^*\}$. In practice we expect the probability of any particular loan firing to be extremely small even for the best policy, so we can safely assume α > 1/2. Finally, since $P(C_B = 1)$ is increasing in B, we note that $P(C_B = 1) < 1 - \alpha$ for any $B \le B^*$.

2.1. Algorithm Overview

The problem of finding the $\pi_B$ with the maximum $\rho(B)$ will occupy the rest of this paper. If we had knowledge of F, we could compute $\rho(B)$ analytically. Having access only to the empirical flips of coins that expire leads to a classical exploration/exploitation tradeoff: our algorithm for finding the optimal policy will have to trade off exploring a large range of B with exploiting values of B that have empirically high values of $\hat\rho(B)$.

Our algorithm FIRINGUCB is presented below. Note that we assume FIRINGUCB and PULL both have access to a global history over all flips. We now describe it at a high level. Throughout we assume we have a pre-defined sequence $b_k$; for concreteness we can take $b_k = 2^k$. We begin with a guess of a value K such that $b_K \ge B^*$ (one can always take K = 1). The algorithm partitions the set of policies $(0, b_K]$ (it does not make sense to evaluate policies B < τ since we need at least τ successes to fire, but we ignore this for brevity) into a series of brackets $(0, b_K] = \bigcup_{k=1}^{K}(b_{k-1}, b_k]$.
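The following small sketch (illustrative; not the authors' code) shows one way to materialize the bracket partition for the concrete choice $b_k = 2^k$ and to locate the bracket that a given budget B falls into. Note that with this choice $b_0 = 1$ rather than 0, which is immaterial since policies B < τ are never used.

```python
def make_brackets(K, b=lambda k: 2 ** k):
    """Brackets (b_{k-1}, b_k] for k = 1..K with the concrete choice b_k = 2^k."""
    return [(b(k - 1), b(k)) for k in range(1, K + 1)]

def bracket_of(B, brackets):
    """Index k (1-based) of the bracket (lo, hi] containing budget B, or None."""
    for k, (lo, hi) in enumerate(brackets, start=1):
        if lo < B <= hi:
            return k
    return None

brackets = make_brackets(K=6)
print(brackets)                  # [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64)]
print(bracket_of(24, brackets))  # 5, i.e. the budget 24 lies in (16, 32]
```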
For each $B \le b_K$ we maintain an empirical estimate $\hat\rho(B, m)$ along with confidence bounds $\mathrm{UCB}(B, m)$, $\mathrm{LCB}(B, m)$ such that $\mathrm{LCB}(B, m) < \rho(B) < \mathrm{UCB}(B, m)$ at all times (see Lemma 2.2), where m tracks the total number of coins evaluated by policy $\pi_B$. Ignoring INITIALIZE for a moment, at each round of FIRINGUCB the subroutine PULL is run on the bracket containing the policy with the largest upper confidence bound. Pulling the k-th bracket corresponds to drawing a coin from F and then executing $\pi_B$ for the largest B in the bracket that has not yet been eliminated. These flips are then used to evaluate the other policies $\pi_{B'}$ for $B' \le B$ by looking at the result of the first $B'$ flips. The estimates of ρ and the corresponding confidence bounds are then updated for each $B'$ in the bracket and, in addition, suboptimal policies are culled: any policy whose UCB is less than the maximum LCB in that round is removed. Since $|\mathrm{UCB}(B, m_k) - \mathrm{LCB}(B, m_k)| \to 0$ as $m_k \to \infty$ and each active policy in a bracket has been evaluated on the same number of coins, the set of active policies shrinks and eventually all suboptimal policies are eliminated. A new bracket is added when the highest upper confidence bound drops below a pre-determined value, and the UCB game continues with the addition of this new bracket. As we will see in Theorem 2.3, using multiple brackets as opposed to just a single bracket yields higher statistical power in our estimates, translating into improved sample complexities. Also, note that our initialization step is necessary to ensure that our confidence bounds hold; see the discussion after Lemma 2.2. Our main theorem is the following (recall that typically α > 1/2):

Theorem 2.1. If FIRINGUCB is run with a budget of V samples, and the brackets defined by $b_k$ satisfy $b_k/b_{k-1}^2 \le 1$ for all k, then with probability at least $1 - \delta$ the regret incurred is at most
$$\frac{cK}{\alpha^4\rho^{*2}}\log\Big(\frac{1}{\alpha\rho^*\delta}\Big) + \frac{c}{\alpha^4}\sqrt{KV\log(V)\log\log(V/\alpha)},$$
where $K = \max\{k : b_{k-1} \le \tfrac{2}{\alpha\rho^*}\}$. If $b_{k+1} = b_k^2 = 2^{2^k}$ then $K \le \log_2(\log_2(\tfrac{2}{\alpha\rho^*}))$.

We now discuss each component of the algorithm in greater detail and explain our choice of brackets.

FIRINGUCB
  Input: τ, α, K
  Create brackets $\{(b_{k-1}, b_k]\}_{k=1}^K$
  Global variables:
    $m_k$: number of coins allocated to bracket k
    $A_k$: the set of active policies in $(b_{k-1}, b_k]$
  Subroutine INITIALIZE(k):
    Run PULL(k) $U^{-1}(\alpha/4, \delta)$ times (see Equation (1) for the definition of U).
    Remove any B with $\hat{C}(B, U^{-1}(\alpha/4, \delta)) > 1 - 3\alpha/4$.
  UCB algorithm:
    for $k \le K$: INITIALIZE(k)
    while True:
      choose $k \in \arg\max_{k' \le K} \max_{B \in A_{k'}} \mathrm{UCB}(B, m_{k'})$
      call PULL(k)
      if $\max_{B \le b_K} \mathrm{UCB}(B, m_k) < \tfrac{2}{\alpha b_K}$:
        $K \leftarrow K + 1$, allocate a new bracket $(b_{K-1}, b_K]$
        call INITIALIZE(K)

PULL
  Input: bracket k
  Draw a new coin with mean $\mu \sim F$.
  $B := \max\{A_k\}$
  Execute $\pi_B$: let $\{X_i\}_{i=1}^{m}$ be the sequence of $m \le B$ flips, where flipping stops early if there are τ successes and the coin fires.
  $m_k \leftarrow m_k + 1$
  Update: for each active policy $B' \in A_k$:
    Evaluate policy $B'$ using $\{X_i\}_{i=1}^{B'}$.
    Update $\hat{C}(B', m_k)$, $\hat{N}(B', m_k)$, $\hat\rho(B', m_k)$.
  Eliminate: let $\mathrm{LCB}^* = \max_{B' \in A_k} \mathrm{LCB}(B', m_k, \delta)$.
    for $B' \in A_k$: if $\mathrm{UCB}(B', m_k, \delta) < \mathrm{LCB}^*$ then $A_k \leftarrow A_k \setminus \{B'\}$.

2.2. Analysis: A Single Bracket

Since every call of PULL gives additional budget to a fixed bracket, we can analyze the regret a specific bracket contributes to the total regret independently of the other brackets. First we explain the confidence bounds used in PULL. Much of our analysis relies on an anytime confidence bound. Specifically, we assume the existence of a constant $c_U$ such that if $U(m, \delta) := \sqrt{\frac{c_U \log(\log_2(2m)/\delta)}{m}}$ and $X_1, X_2, \ldots$ are i.i.d. random variables bounded in [0, 1], then
$$P\Big(\exists m \in \mathbb{N} : \Big|\frac{1}{m}\sum_{i=1}^{m} X_i - \mathbb{E}[X_i]\Big| \ge U(m, \delta)\Big) \le \delta \tag{1}$$
One can show that c U = 4 suffices, but there exist substantially tighter expressions, albeit not as succinctly stated (c.f. Theorem 8 of (Kaufmann et al., 2016)). For the same c U it is known that U 1(ϵ, δ) = min{m : U(m, δ) ϵ} c Uϵ 2 log(2 log( ec Uϵ 2 δ )/δ) (c.f. (Jamieson, 2015)). Both b C(B, m) and b N(B, m) are empirical means of bounded i.i.d. random variables (in [0, 1] and [0, B], respectively), thus we can apply Equation 1. Recall that bρ(B, m) = b C(B,m) b N(B,m). Define LCB(B, m, δ) := b C(B, m) U(m, δ/4B2) b N(B, m) + BU(m, δ/4B2) , UCB(B, m, δ) := b C(B, m) + U(m, δ/4B2) b N(B, m) BU(m, δ/4B2) . and φ(B, m, δ) := 16 α2B U(m, δ 4B2 ). Using the bound on U 1(ϵ, δ) given above, we see that φ 1(B, ϵ, δ) := min{m : φ(B, m, δ) ϵ} α4B2 log 8B2 log 16ec Uϵ 2 Define the events n | b C(B, m) c B| U(m, δ 4B2 ) o n 1 B | b N(B, m) n B| U(m, δ 4B2 ) o Lemma 2.2. For the events defined above, P(Ec c Ec n) < δ. On the event, Ec En the following happen. After INITIALIZE is called on bracket k, each B (bk 1, b K] satisfies P(CB = 1) > α/2, and U(m, δ) < α/4. In addition, for any m > 0 and any B N satisfying these conditions we have that with probability greater than 1 δ ρ(B) UCB(B, m, δ) ˆρ(B, m) + φ(B, m, δ) ρ(B) LCB(B, m, δ) bρ(B, m) φ(B, m, δ). In general UCB(B, m, δ) could be negative so the call to INITIALIZE prevents this and ensures the confidence bounds given hold. In practice α > 1/2 so the number of calls to Pull needed by INITIALIZE is O(1) for each bracket, so the initialization step will not contribute to regret. Throughout the following, we assume the event Ec En. Note that the confidence interval given by φ implies that it takes fewer pulls to reach the same level of statistical confidence for large values of B compared to smaller values. We are now ready to state our main regret theorem for a single bracket. Theorem 2.3. Suppose PULL is called m times on bracket k and uses V total flips. Assume on the i-th pull, the maximum active policy is Bi and takes vi Bi flips. Then, i=1 viρ(Bi) V k c V bk α4b2 k 1 log (V ) log bk 1 log(V/α) where c is an absolute constant ρk, := maxbk 1 0 is allocated at most φ 1(bk 1, k/2, δ) coins. Proof. For any B b K that satisfies ρ(B ) = ρ we have for all m N, bρ(B , m) + φ(B , m, δ) ρ(B ) = ρ . On the other hand, for any B (bk 1, bk] such that B (bk 1, bk] we have bρ(B, t) + φ(B, t, δ) ρ(B) + 2φ(B, t, δ) ρk, + 2φ(bk 1, t, δ) Thus if B BK, for all t φ 1(bk 1, k/2, δ) we have bρ(B, t) + φ(B, t, δ) ρ which implies that the suboptimal bracket will not have the largest confidence bound. Firing Bandits: Optimizing Crowdfunding Theorem 2.5. Suppose FIRINGUCB is run on K brackets with a total of V flips across all brackets, and such that max B b K ρ(B) = ρ . The total regret incurred is bounded c α4 KV log (V ) log log(V/α) Proof. Suppose bracket k has used Vk total flips (e.g., the sum total flips over all coins it had been allocated in calls to PULL ), then by Theorem 2.3 the regret incurred by bracket k is bounded by c α4 Vk log (Vk) log log(Vk/α) using the fact that bk b2 k 1 1. The total regret is given by summing over all k K. Now let s analyze each term separately - first focusing on the summed second term. By Jensen s inequality, c α4 Vk log(Vk) log log(Vk/α) c α4 K log(V ) log log(V/α) using the fact that Vk V, bk 1 b K and P k K Vk = V . To bound the summed first term we use a peeling argument (analogous to that in Theorem 2.3). 
By Lemma 2.4, for any bracket k with $\Delta_k > 0$ we have
$$V_k \le b_k\,\varphi^{-1}(b_{k-1}, \Delta_k/2, \delta) \le b_k\,\frac{64\,c_U\,\Delta_k^{-2}}{\alpha^4 b_{k-1}^2}\log\Big(\frac{8 b_{k-1}^2}{\delta}\log\big(\tfrac{64\,e\,c_U\Delta_k^{-2}}{\alpha^4 b_{k-1}^2\delta}\big)\Big).$$
For any η > 0,
$$\sum_{k \le K}\Delta_k V_k = \sum_{k : \Delta_k < \eta}\Delta_k V_k + \sum_{k : \Delta_k \ge \eta}\Delta_k V_k \le \eta V + \sum_{k : \Delta_k \ge \eta}\frac{1}{\Delta_k}\,\frac{64\,c_U}{\alpha^4}\,\frac{b_k}{b_{k-1}^2}\log\Big(\frac{8 b_{k-1}^2}{\delta}\log\big(\tfrac{64\,e\,c_U\Delta_k^{-2}}{\alpha^4\delta}\big)\Big) \le \eta V + \frac{64\,c_U K}{\alpha^4\eta}\log\Big(\frac{8 b_{K-1}^2}{\delta}\log\big(\tfrac{64\,e\,c_U\eta^{-2}}{\alpha^4\delta}\big)\Big).$$
Optimizing over η, the right-hand side can be bounded by $\frac{c}{\alpha^4}\sqrt{KV\log\big(\frac{\log(V/\alpha)}{\delta}\big)}$, which completes the proof.

2.4. Expanding Brackets

Returning to the definition of ρ(B), recall that by the initialization step, for $B^*$ and any B under consideration,
$$n_B \ge B\,P(C_B = 0) \ge B\alpha/2 \qquad \text{so that} \qquad \rho(B) = \frac{c_B}{n_B} \le \frac{c_B}{B\alpha/2} \le \frac{2}{\alpha B}.$$
In particular, this implies that $\rho^* \le \frac{2}{\alpha B^*}$. This motivates the bracket-expansion step of FIRINGUCB in Section 2.1: we add brackets opportunistically whenever there is a possibility that $B^*$ is not yet being considered. The proof of Theorem 2.1 is given in the supplementary materials.

3. Experiments

We compare FIRINGUCB with a number of alternatives. The first is a natural modification of FIRINGUCB that we call FIRINGBANDITSTHRESHOLD: instead of waiting for a full budget of B flips before rejecting a coin that does not convert, we monitor the empirical mean $\hat\mu_k$ after k flips and stop sampling at flip $\min\{k \le B : k\,d(\hat\mu_k, \tau/B) \ge \log(1/\delta)\}$ (marking the coin as not converting within B flips), where $d(\cdot,\cdot)$ is the KL divergence between two Bernoulli distributions and δ = 0.05. This variant attempts to predict, before the full B flips are taken, whether a coin will fail to convert within B flips; it is motivated by the guarantees of a Chernoff bound but has no rigorous theoretical justification. With the exception of this change, we maintained estimates of $\hat\rho(B, m)$ for each B identically to FIRINGUCB and eliminated brackets as needed.

We also compare to a variant of the standard UCB algorithm. For a fixed value of k, NAIVEUCB-k maintains a pool of k coins. Identically to UCB (we used the anytime formulation given in (Abbasi-Yadkori et al., 2011)), at each round the coin with the highest UCB is flipped and its empirical mean and number of flips are updated. Once a coin reaches τ heads, it is removed from the pool, a reward of 1 is received, and a new coin is drawn to replace it.

Finally, as discussed in Section 1.2, the optimal policy for evaluating coins with respect to a given coin reservoir distribution will consist of an increasing boundary. Though a search over this space is difficult in general, we considered the set of all linear boundaries, parametrized as $at - b$ with $a, b \ge 0$. Recall that the traditional sequential probability ratio test (SPRT) corresponds to one of these boundaries (Siegmund, 1985). For each of the reservoir distributions considered below, we performed an exhaustive search to discover the optimal values of a, b (as always, in terms of the number of coins converted within a fixed budget). We refer to the algorithm that repeatedly applies the policy $at - b$ as LINEARBOUNDARY-a,b.

We considered two examples of reservoir distributions. Firstly, a synthetic example using $F_1 = \mathrm{Beta}(1/20, 1)$ as the reservoir distribution, with $\tau_1 = 10$. Secondly, we considered a distribution $F_2$ derived from actual page click data arising from the search page of the microloan crowdfunding site Kiva (www.kiva.org). During a period of a week earlier this year, roughly 10000 microloans were listed on Kiva, and on average each loan was shown about 500 times.

Figure 2. $F_2$: Empirical distribution of means for Kiva loans shown during a fixed week.
The average requested loan amount on Kiva is $500, and the amount pledged per user funding a loan was equal to $25 (the default amount) more than half the time. The underlying distribution of empirical µ's (i.e. #times funded / #times shown to potential funders) is shown in Figure 2; the mean µ was 0.03. We used this empirical distribution as our reservoir distribution and took τ = 25.

Figure 3 displays the results for $F_1$ and Figure 4 the results for the Kiva distribution $F_2$ shown in Figure 2. In both cases, we supplied each algorithm with a budget of $5 \times 10^6$ total flips over all coins considered, and we ran 50 repetitions of each algorithm with 95% confidence intervals plotted using a normal approximation. The optimal linear boundaries in both cases are indicated in the legend; note that b = 0 in both cases.

Figure 3. Coins fired when the reservoir distribution is $F_1$.

The advantage of FIRINGUCB and FIRINGBANDITSTHRESHOLD compared to NAIVEUCB-k is immediately apparent in this plot: though NAIVEUCB-k can achieve higher reward over a small time horizon for large values of k, it fails to provide an anytime regret guarantee. Indeed, NAIVEUCB-k will quickly convert any good coin it samples, and then be burdened by many subpar arms that it is forced to convert before receiving a good arm. FIRINGUCB successfully avoids this problem by searching over policy classes that quickly disregard subpar coins.

| Algorithm | $F_1$ | $F_2$ |
| --- | --- | --- |
| FIRINGUCB | 135190 ± 2712 | 6439 ± 98 |
| FIRINGBANDITSTHRESHOLD | 387038 ± 6242 | 11269 ± 109 |
| LINEARBOUNDARY-a,b | 2546582 ± 1044 | 1242166 ± 1479 |
| NAIVEUCB-1000 | - | 2105 ± 266 |
| NAIVEUCB-10000 | 11205 ± 18 | 17270 ± 332 |
| NAIVEUCB-100000 | 110627 ± 42 | 102335 ± 29 |
| NAIVEUCB-500000 | 508438 ± 37 | - |

Table 1. Number of coins considered.

Figure 4. Coins fired when the reservoir distribution is $F_2$.

In both cases the LINEARBOUNDARY-a,b policy greatly outperforms the FIRINGUCB algorithms (by almost a factor of 4). Recall that in both cases LINEARBOUNDARY-a,b is a line of the form $at$, so any coin that does not get a head on the first flip is immediately rejected. This aggressiveness is most apparent when considering the number of coins that each policy considers: for instance, LINEARBOUNDARY-a,b considers almost 200 times more coins than FIRINGUCB (see Table 1). In a crowdfunding environment, such a policy, which rejects a loan after displaying it just a single time, would certainly be considered unfair and very impractical.

4. Acknowledgments

Finally, we would like to thank Kiva, especially Melissa Fabros and Kevin O'Brien, for their invaluable help and for providing us with data.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312-2320, 2011.

Abernethy, J. D., Amin, K., and Zhu, R. Threshold bandits, with and without censored feedback. In Advances in Neural Information Processing Systems, pp. 4889-4897, 2016.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Barnett, C. Trends show crowdfunding to surpass VC in 2016, 2015. URL https://www.forbes.com/sites/chancebarnett/2015/06/09/trends-show-crowdfunding-to-surpass-vc-in-2016/2/#3d3d1aef666f. [Online; accessed 9-February-2018].

Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A. Bandit problems with infinitely many arms. The Annals of Statistics, pp. 2103-2116, 1997.

Bonald, T. and Proutiere, A. Two-target algorithms for infinite-armed bandits with Bernoulli rewards. In Advances in Neural Information Processing Systems, pp. 2184-2192, 2013.
Carpentier, A. and Valko, M. Simple regret for infinitely many armed bandits. In ICML, pp. 1133-1141, 2015.

Chakrabarti, D., Kumar, R., Radlinski, F., and Upfal, E. Mortal multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 273-280, 2009.

Jamieson, K. The Analysis of Adaptive Data Collection Methods for Machine Learning. PhD thesis, University of Wisconsin-Madison, 2015.

Jamieson, K. G., Haas, D., and Recht, B. The power of adaptivity in identifying statistical alternatives. In Advances in Neural Information Processing Systems, pp. 775-783, 2016.

Jeffries, A. Indie no-go: only one in ten projects gets fully funded on Kickstarter's biggest rival, 2013. URL https://www.theverge.com/2013/8/7/4594824/less-than-10-percent-of-projects-on-indiegogo-get-fully-funded. [Online; accessed 9-February-2018].

Joulani, P., György, A., and Szepesvári, C. Online learning under delayed feedback. In International Conference on Machine Learning, pp. 1453-1461, 2013.

Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1-42, 2016.

Kleinberg, R., Niculescu-Mizil, A., and Sharma, Y. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245-272, 2010.

Kveton, B., Szepesvári, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767-776, 2015.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.

Locatelli, A., Gutzeit, M., and Carpentier, A. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning, pp. 1690-1698, 2016.

Siegmund, D. Sequential Analysis: Tests and Confidence Intervals. Springer Science & Business Media, 1985.

Simchowitz, M., Jamieson, K., and Recht, B. Best-of-K bandits. In Conference on Learning Theory, pp. 1440-1489, 2016.

Szepesvári, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. 2015.

Vernade, C., Cappé, O., and Perchet, V. Stochastic bandit models for delayed conversions. In Conference on Uncertainty in Artificial Intelligence, 2017.

Wang, Y., Audibert, J.-Y., and Munos, R. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems, pp. 1729-1736, 2009.