# The Multi-fidelity Multi-armed Bandit

Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, Barnabás Póczos
Carnegie Mellon University, Rice University
{kandasamy, schneide, bapoczos}@cs.cmu.edu, gautamd@rice.edu

We study a variant of the classical stochastic $K$-armed bandit where observing the outcome of each arm is expensive, but cheap approximations to this outcome are available. For example, in online advertising the performance of an ad can be approximated by displaying it for shorter time periods or to narrower audiences. We formalise this task as a *multi-fidelity* bandit, where, at each time step, the forecaster may choose to play an arm at any one of $M$ fidelities. The highest fidelity (desired outcome) expends cost $\lambda^{(M)}$. The $m$th fidelity (an approximation) expends $\lambda^{(m)} < \lambda^{(M)}$ and returns a biased estimate of the highest fidelity. We develop MF-UCB, a novel upper confidence bound procedure for this setting, and prove that it naturally adapts to the sequence of available approximations and costs, thus attaining better regret than naive strategies which ignore the approximations. For instance, in the above online advertising example, MF-UCB would use the lower fidelities to quickly eliminate suboptimal ads and reserve the larger, expensive experiments for a small set of promising candidates. We complement this result with a lower bound and show that MF-UCB is nearly optimal under certain conditions.

## 1 Introduction

Since the seminal work of Robbins [11], the multi-armed bandit has become an attractive framework for studying exploration-exploitation trade-offs inherent to tasks arising in online advertising, finance and other fields. In the most basic form of the $K$-armed bandit [9, 12], we have a set $\mathcal{K} = \{1, \dots, K\}$ of $K$ arms (e.g. $K$ ads in online advertising). At each time step $t = 1, 2, \dots$, an arm is played and a corresponding reward is realised. The goal is to design a strategy of plays that minimises the regret after $n$ plays. The regret is the comparison, in expectation, of the realised reward against an oracle that always plays the best arm. The well known Upper Confidence Bound (UCB) algorithm [3] achieves regret $O(K\log(n))$ after $n$ plays (ignoring mean rewards) and is minimax optimal [9].

In this paper, we propose a new take on this important problem. In many practical scenarios of interest, one can associate a cost to playing each arm. Furthermore, in many of these scenarios, one might have access to cheaper approximations to the outcome of the arms. For instance, in online advertising the goal is to maximise the cumulative number of clicks over a given time period. Conventionally, an arm pull may be thought of as the display of an ad for a specific time, say one hour. However, we may approximate its hourly performance by displaying the ad for shorter periods. This estimate is biased (and possibly noisy), as displaying an ad for longer intervals changes user behaviour. It can nonetheless be useful in gauging the long run click through rate. We can also obtain biased estimates of an ad by displaying it only to certain geographic regions or age groups. Similarly, one might consider algorithm selection for machine learning problems [4], where the goal is to be competitive with the best among a set of learning algorithms for a task. Here, one might obtain cheaper approximate estimates of the performance of an algorithm by running cheaper versions of it that use less data or computation.
In this paper, we will refer to such approximations as *fidelities*. Consider a 2-fidelity problem where the cost at the low fidelity is $\lambda^{(1)}$ and the cost at the high fidelity is $\lambda^{(2)}$. We will present a cost weighted notion of regret for this setting for a strategy that expends a capital of $\Lambda$ units. A classical $K$-armed bandit strategy such as UCB, which only uses the highest fidelity, can obtain at best $O\big(\lambda^{(2)} K \log(\Lambda/\lambda^{(2)})\big)$ regret [9]. In contrast, this paper will present multi-fidelity strategies that achieve $O\big((\lambda^{(1)} K + \lambda^{(2)}|\mathcal{K}_g|)\log(\Lambda/\lambda^{(2)})\big)$ regret. Here $\mathcal{K}_g$ is a (typically) small subset of arms with high expected reward that can be identified using plays at the (cheaper) low fidelity. When $|\mathcal{K}_g| < K$ and $\lambda^{(1)} < \lambda^{(2)}$, such a strategy will outperform the more standard UCB algorithms. Intuitively, this is achieved by using the lower fidelities to eliminate several of the bad arms and reserving expensive higher fidelity plays for a small subset of the most promising arms. We formalise the above intuitions in the sequel. Our main contributions are:

1. A novel formalism for studying bandit tasks when one has access to multiple fidelities for each arm, with each successive fidelity providing a better approximation to the most expensive one.
2. A new algorithm that we call Multi-Fidelity Upper Confidence Bound (MF-UCB), which adapts the classical Upper Confidence Bound (UCB) strategies to our multi-fidelity setting. Empirically, we demonstrate that our algorithm outperforms naive UCB on simulations.
3. A theoretical characterisation of the performance of MF-UCB which shows that the algorithm (a) uses the lower fidelities to explore all arms and eliminate arms with low expected reward, and (b) reserves the higher fidelity plays for arms with rewards close to the optimal value. We derive a lower bound on the regret and demonstrate that MF-UCB is near-optimal on this problem.

Related Work. The $K$-armed bandit has been studied extensively in the past [1, 9, 11]. There has been a flurry of work on upper confidence bound (UCB) methods [2, 3], which adopt the optimism in the face of uncertainty principle for bandits. For readers unfamiliar with UCB methods, we recommend Chapter 2 of Bubeck and Cesa-Bianchi [5]. Our work in this paper builds on UCB ideas, but the multi-fidelity framework poses significantly new algorithmic and theoretical challenges. There has been some interest in multi-fidelity methods for optimisation in many applied domains of research [7, 10]. However, these works do not formalise or analyse notions of regret in the multi-fidelity setting. Multi-fidelity methods have been used in the robotics community for reinforcement learning tasks by modeling each fidelity as a Markov decision process [6]. Zhang and Chaudhuri [16] study active learning with a cheap weak labeler and an expensive strong labeler. The objective of these papers, however, is not to handle the exploration-exploitation trade-off inherent to the bandit setting. A line of work on budgeted multi-armed bandits [13, 15] studies a variant of the $K$-armed bandit where each arm has a random reward and cost and the goal is to play the arm with the highest reward/cost ratio as much as possible. This is different from our setting, where each arm has multiple fidelities which serve as approximations. Recently, in Kandasamy et al. [8] we extended ideas in this work to analyse multi-fidelity bandits with Gaussian process payoffs.
## 2 The Stochastic K-armed Multi-fidelity Bandit

In the classical $K$-armed bandit, each arm $k \in \mathcal{K} = \{1, \dots, K\}$ is associated with a real valued distribution $\theta_k$ with mean $\mu_k$. Let $\mathcal{K}_\star = \operatorname{argmax}_{k \in \mathcal{K}} \mu_k$ be the set of optimal arms, $k_\star \in \mathcal{K}_\star$ be an optimal arm and $\mu_\star = \mu_{k_\star}$ denote the optimal mean value. A bandit strategy would play an arm $I_t \in \mathcal{K}$ at each time step $t$ and observe a sample from $\theta_{I_t}$. Its goal is to maximise the sum of expected rewards after $n$ time steps, $\sum_{t=1}^n \mu_{I_t}$, or equivalently minimise the cumulative pseudo-regret $\sum_{t=1}^n (\mu_\star - \mu_{I_t})$, for all values of $n$. In other words, the objective is to be competitive, in expectation, against an oracle that plays an optimal arm all the time.

In this work we differ from the usual bandit setting in the following aspect. For each arm $k$, we have access to $M - 1$ successive approximations $\theta^{(1)}_k, \theta^{(2)}_k, \dots, \theta^{(M-1)}_k$ to the desired distribution $\theta^{(M)}_k = \theta_k$. We will refer to these approximations as *fidelities*. Clearly, these approximations are meaningful only if they give us some information about $\theta^{(M)}_k$. In what follows, we will assume that the $m$th fidelity mean of an arm is within $\zeta^{(m)}$, a known quantity, of its highest fidelity mean, where the $\zeta^{(m)}$'s, decreasing with $m$, characterise the successive approximations. That is, $|\mu^{(M)}_k - \mu^{(m)}_k| \le \zeta^{(m)}$ for all $k \in \mathcal{K}$ and $m = 1, \dots, M$, where $\zeta^{(1)} > \zeta^{(2)} > \dots > \zeta^{(M)} = 0$ and the $\zeta^{(m)}$'s are known. It is possible for the lower fidelities to be misleading under this assumption: there could exist an arm $k$ with $\mu^{(M)}_k < \mu_\star = \mu^{(M)}_{k_\star}$ but with $\mu^{(m)}_k > \mu_\star$ and/or $\mu^{(m)}_k > \mu^{(m)}_{k_\star}$ for some $m < M$. In other words, we wish to explicitly account for the biases introduced by the lower fidelities, and not treat them as just higher variance observations of an expensive experiment. This problem of course becomes interesting only when lower fidelities are more attractive than higher fidelities in terms of some notion of cost. Towards this end, we will assign a cost $\lambda^{(m)}$ (such as advertising time, money etc.) to playing an arm at fidelity $m$, where $\lambda^{(1)} < \lambda^{(2)} < \dots < \lambda^{(M)}$.

Notation: $T^{(m)}_{k,t}$ denotes the number of plays of arm $k$ at fidelity $m$ until $t$ time steps. $T^{(>m)}_{k,t}$ is the number of plays at fidelities greater than $m$. $Q^{(m)}_t = \sum_{k \in \mathcal{K}} T^{(m)}_{k,t}$ is the number of fidelity $m$ plays at all arms until time $t$. $\overline{X}^{(m)}_{k,s}$ denotes the mean of $s$ samples drawn from $\theta^{(m)}_k$. Denote $\Delta^{(m)}_k = \mu_\star - \mu^{(m)}_k - \zeta^{(m)}$. When $s$ refers to the number of plays of an arm, we will take $1/s = \infty$ if $s = 0$. For a set $A \subset \mathcal{K}$, $\overline{A}$ denotes its complement. While discussing the intuitions in our proofs and theorems we will use $\asymp$, $\lesssim$, $\gtrsim$ to denote equality and inequalities ignoring constants.

Regret in the multi-fidelity setting: A strategy for a multi-fidelity bandit problem, at time $t$, produces an arm-fidelity pair $(I_t, m_t)$, where $I_t \in \mathcal{K}$ and $m_t \in \{1, \dots, M\}$, and observes a sample $X_t$ drawn (independently of everything else) from the distribution $\theta^{(m_t)}_{I_t}$. The choice of $(I_t, m_t)$ could depend on the previous arm-observation-fidelity tuples $\{(I_i, X_i, m_i)\}_{i=1}^{t-1}$. The multi-fidelity setting calls for a new notion of regret. For any strategy $\mathcal{A}$ that expends $\Lambda$ units of the resource, we will define the pseudo-regret $R(\Lambda, \mathcal{A})$ as follows. Let $q_t$ denote the instantaneous pseudo-reward at time $t$ and $r_t = \mu_\star - q_t$ denote the instantaneous pseudo-regret. We will discuss choices for $q_t$ shortly. Any notion of regret in the multi-fidelity setting needs to account for this instantaneous regret along with the cost of the fidelity at which we played at time $t$, i.e. $\lambda^{(m_t)}$.
Moreover, we should receive no reward (maximum regret) for any unused capital. These observations lead to the following definition,

$$R(\Lambda, \mathcal{A}) \;=\; \Lambda\mu_\star - \sum_{t=1}^{N}\lambda^{(m_t)} q_t \;=\; \underbrace{\Big(\Lambda - \sum_{t=1}^{N}\lambda^{(m_t)}\Big)\mu_\star}_{\tilde{r}(\Lambda,\mathcal{A})} \;+\; \underbrace{\sum_{t=1}^{N}\lambda^{(m_t)} r_t}_{\tilde{R}(\Lambda,\mathcal{A})}. \qquad (1)$$

Above, $N$ is the (random) number of plays within capital $\Lambda$ by $\mathcal{A}$, i.e. the largest $n$ such that $\sum_{t=1}^{n}\lambda^{(m_t)} \le \Lambda$. To motivate our choice of $q_t$ we consider an online advertising example where $\lambda^{(m)}$ is the advertising time at fidelity $m$ and $\mu^{(m)}_k$ is the expected number of clicks per unit time. While we observe from $\theta^{(m_t)}_{I_t}$ at time $t$, we wish to reward the strategy according to its highest fidelity distribution $\theta^{(M)}_{I_t}$. Therefore, regardless of which fidelity we play, we set $q_t = \mu^{(M)}_{I_t}$. Here, we are competing against an oracle which plays an optimal arm, at any fidelity, all the time. Note that we might have chosen $q_t$ to be $\mu^{(m_t)}_{I_t}$. However, this does not reflect the motivating applications for the multi-fidelity setting that we consider. For instance, a clickbait ad might receive a high number of clicks in the short run, but its long term performance might be poor. Furthermore, for such a choice, we may as well ignore the rich structure inherent to the multi-fidelity setting and simply play the arm $\operatorname{argmax}_{m,k}\mu^{(m)}_k$ at each time. There are of course other choices for $q_t$ that result in very different notions of regret; we discuss this briefly at the end of Section 7.

The distributions $\theta^{(m)}_k$ need to be well behaved for the problem to be tractable. We will assume that they satisfy concentration inequalities of the following form. For all $\epsilon > 0$,

$$\mathbb{P}\big(\overline{X}^{(m)}_{k,s} - \mu^{(m)}_k > \epsilon\big) < \nu e^{-s\psi(\epsilon)}, \qquad \mathbb{P}\big(\overline{X}^{(m)}_{k,s} - \mu^{(m)}_k < -\epsilon\big) < \nu e^{-s\psi(\epsilon)}. \qquad (2)$$

Here $\nu > 0$ and $\psi$ is an increasing function with $\psi(0) = 0$ that grows at least linearly, $\psi(x) \in \Omega(x)$. For example, if the distributions are sub-Gaussian, then $\psi(x) \in \Theta(x^2)$.

The performance of a multi-fidelity strategy which switches from low to high fidelities can be worsened by artificially inserting fidelities. Consider a scenario where $\lambda^{(m+1)}$ is only slightly larger than $\lambda^{(m)}$ and $\zeta^{(m+1)}$ is only slightly smaller than $\zeta^{(m)}$. This situation is unfavourable since there isn't much that can be inferred from the $(m+1)$th fidelity that cannot already be inferred from the $m$th by expending the same cost. We impose the following regularity condition to avoid such situations.

**Assumption 1.** The $\zeta^{(m)}$'s decay fast enough such that $\sum_{i=1}^{m}\frac{1}{\psi(\zeta^{(i)})} \le \frac{1}{\psi(\zeta^{(m+1)})}$ for all $m < M$.

Assumption 1 is not necessary to analyse our algorithm; however, the performance of MF-UCB when compared to UCB is most appealing when it holds. In cases where $M$ is small enough to be treated as a constant, the assumption is not necessary. For sub-Gaussian distributions, the condition is satisfied by an exponentially decaying sequence $(\zeta^{(1)}, \zeta^{(2)}, \dots)$ such as $(1/\sqrt{2},\, 1/2,\, 1/(2\sqrt{2}),\, \dots)$.

Our goal is to design a strategy $\mathcal{A}_0$ that has low expected pseudo-regret $\mathbb{E}[R(\Lambda, \mathcal{A}_0)]$ for all values of (sufficiently large) $\Lambda$, i.e. the equivalent of an anytime strategy, as opposed to a fixed time horizon strategy, in the usual bandit setting. The expectation is over the observed rewards, which also dictate the number of plays $N$. From now on, for simplicity, we will write $R(\Lambda)$ when $\mathcal{A}$ is clear from context and refer to it just as regret.
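To make the bookkeeping in (1) concrete, the short Python sketch below computes the cost-weighted pseudo-regret of an arbitrary sequence of (arm, fidelity) plays. It is only an illustration of the definition; the function name, argument names and the 0-indexed fidelities are our own choices and not part of the paper.

```python
import numpy as np

def multifidelity_regret(mu_star, mu_M, costs, plays, Lambda):
    """Cost-weighted pseudo-regret R(Lambda) from equation (1).

    mu_star : optimal highest-fidelity mean
    mu_M    : array (or dict) mapping arm -> highest-fidelity mean mu_k^(M)
    costs   : per-fidelity costs [lambda^(1), ..., lambda^(M)] (0-indexed here)
    plays   : list of (arm, fidelity) pairs chosen by the strategy
    Lambda  : total capital
    """
    spent, weighted_reward = 0.0, 0.0
    for arm, m in plays:
        if spent + costs[m] > Lambda:             # N = largest n with total cost <= Lambda
            break
        spent += costs[m]
        weighted_reward += costs[m] * mu_M[arm]   # q_t = mu_{I_t}^(M), regardless of fidelity played
    return Lambda * mu_star - weighted_reward     # = r~(Lambda) + R~(Lambda) in equation (1)
```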
## 3 The Multi-Fidelity Upper Confidence Bound (MF-UCB) Algorithm

As the name suggests, the MF-UCB algorithm maintains an upper confidence bound corresponding to $\mu^{(m)}_k$ for each $m \in \{1, \dots, M\}$ and $k \in \mathcal{K}$ based on its previous plays. Following UCB strategies [2, 3], we define the following set of upper confidence bounds,

$$\mathcal{B}^{(m)}_{k,t}(s) = \overline{X}^{(m)}_{k,s} + \psi^{-1}\Big(\frac{\rho\log t}{s}\Big) + \zeta^{(m)} \ \ \text{for all } m \in \{1, \dots, M\},\, k \in \mathcal{K}, \qquad \mathcal{B}_{k,t} = \min_{m=1,\dots,M} \mathcal{B}^{(m)}_{k,t}\big(T^{(m)}_{k,t-1}\big). \qquad (3)$$

Here $\rho$ is a parameter in our algorithm and $\psi$ is from (2). Each $\mathcal{B}^{(m)}_{k,t}(T^{(m)}_{k,t-1})$ provides a high probability upper bound on $\mu^{(M)}_k$, with their minimum $\mathcal{B}_{k,t}$ giving the tightest bound (see Appendix A). Similar to UCB, at time $t$ we play the arm $I_t$ with the highest upper bound, $I_t = \operatorname{argmax}_{k \in \mathcal{K}} \mathcal{B}_{k,t}$.

Since our setup has multiple fidelities associated with each arm, the algorithm needs to determine at each time $t$ which fidelity $m_t$ to play the chosen arm $I_t$ at. For this, consider an arbitrary fidelity $m < M$. The $\zeta^{(m)}$ conditions on $\mu^{(m)}_k$ imply a constraint on the value of $\mu^{(M)}_k$. If, at fidelity $m$, the uncertainty interval $\psi^{-1}\big(\rho\log(t)/T^{(m)}_{I_t,t-1}\big)$ is large, then we have not constrained $\mu^{(M)}_{I_t}$ sufficiently well yet. There is more information to be gleaned about $\mu^{(M)}_{I_t}$ from playing the arm $I_t$ at fidelity $m$. On the other hand, playing at fidelity $m$ indefinitely will not help us much, since the $\zeta^{(m)}$ elongation of the confidence band caps off how much we can learn about $\mu^{(M)}_{I_t}$ from fidelity $m$; i.e. even if we knew $\mu^{(m)}_{I_t}$, we would have only constrained $\mu^{(M)}_{I_t}$ to within a $\zeta^{(m)}$ interval. Our algorithm captures this natural intuition. Having selected $I_t$, we begin checking at the first fidelity. If $\psi^{-1}\big(\rho\log(t)/T^{(1)}_{I_t,t-1}\big)$ is smaller than a threshold $\gamma^{(1)}$ we proceed to check the second fidelity, continuing in a similar fashion. If at any point $\psi^{-1}\big(\rho\log(t)/T^{(m)}_{I_t,t-1}\big) \ge \gamma^{(m)}$, we play $I_t$ at fidelity $m_t = m$. If we go all the way to fidelity $M$, we play at $m_t = M$. The resulting procedure is summarised below in Algorithm 1.

**Algorithm 1** MF-UCB
For $t = 1, 2, \dots$
1. Choose $I_t \in \operatorname{argmax}_{k \in \mathcal{K}} \mathcal{B}_{k,t}$ (see equation (3)).
2. Set $m_t = \min\big\{\, m \;:\; \psi^{-1}\big(\rho\log t / T^{(m)}_{I_t,t-1}\big) \ge \gamma^{(m)} \ \text{or}\ m = M \,\big\}$ (see equation (4)).
3. Play $X \sim \theta^{(m_t)}_{I_t}$.

Choice of $\gamma^{(m)}$: In our algorithm, we choose

$$\gamma^{(m)} = \psi^{-1}\Big(\frac{\lambda^{(m)}}{\lambda^{(m+1)}}\,\psi\big(\zeta^{(m)}\big)\Big). \qquad (4)$$

To motivate this choice, note that if $\Delta^{(m)}_k = \mu_\star - \mu^{(m)}_k - \zeta^{(m)} > 0$, then we can conclude that arm $k$ is not optimal. Step 2 of the algorithm attempts to eliminate arms for which $\Delta^{(m)}_k \gtrsim \gamma^{(m)}$ from plays above the $m$th fidelity. If $\gamma^{(m)}$ is too large, then we would not eliminate a sufficient number of arms, whereas if it were too small we could end up playing a suboptimal arm $k$ (for which $\mu^{(m)}_k > \mu_\star$) too many times at fidelity $m$. As will be revealed by our analysis, the given choice represents an optimal tradeoff under the given assumptions.
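Below is a minimal Python sketch of Algorithm 1 for the sub-Gaussian case, taking $\psi(x) = x^2/(2\sigma^2)$ so that $\psi^{-1}(y) = \sigma\sqrt{2y}$. The class name, its `choose`/`update` interface and the default $\rho = 5$ (any $\rho > 4$ suffices for Theorem 2 below) are our own choices; this is a sketch of the upper confidence bounds (3), the thresholds (4) and the fidelity-switching rule, not the authors' implementation.

```python
import numpy as np

class MFUCB:
    """Sketch of MF-UCB (Algorithm 1) for sub-Gaussian rewards with
    psi(x) = x^2 / (2 sigma^2), hence psi^{-1}(y) = sigma * sqrt(2 y)."""

    def __init__(self, K, costs, zetas, sigma=1.0, rho=5.0):
        self.K, self.M = K, len(costs)
        self.costs = np.asarray(costs, dtype=float)   # lambda^(1..M)
        self.zetas = np.asarray(zetas, dtype=float)   # zeta^(1..M), zeta^(M) = 0
        self.sigma, self.rho = sigma, rho
        # Eq. (4) specialised to this psi: gamma^(m) = zeta^(m) * sqrt(lambda^(m)/lambda^(m+1)).
        self.gammas = np.sqrt(self.costs[:-1] / self.costs[1:]) * self.zetas[:-1]
        self.T = np.zeros((K, self.M))        # T^{(m)}_{k,t-1}: play counts
        self.sums = np.zeros((K, self.M))     # running sums of observed rewards
        self.t = 0

    def _psi_inv(self, y):
        return self.sigma * np.sqrt(2.0 * y)

    def choose(self):
        """Return (arm, fidelity) to play at the current time step."""
        self.t += 1
        logt = np.log(max(self.t, 2))
        with np.errstate(divide="ignore", invalid="ignore"):
            means = np.where(self.T > 0, self.sums / self.T, 0.0)   # value irrelevant when T = 0
            conf = self._psi_inv(self.rho * logt / self.T)          # = inf for unplayed (k, m)
        B = means + conf + self.zetas                               # B^{(m)}_{k,t} from eq. (3)
        arm = int(np.argmax(np.min(B, axis=1)))                     # argmax_k min_m B^{(m)}_{k,t}
        # Step 2: lowest fidelity whose confidence width is still >= gamma^(m); else play M.
        fid = self.M - 1
        for m in range(self.M - 1):
            if conf[arm, m] >= self.gammas[m]:
                fid = m
                break
        return arm, fid

    def update(self, arm, fid, reward):
        self.T[arm, fid] += 1
        self.sums[arm, fid] += reward
```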
We will be primarily concerned with the term $\tilde{R}(\Lambda, \mathcal{A}) = \tilde{R}(\Lambda)$ from (1). $\tilde{r}(\Lambda, \mathcal{A})$ is a residual term; it is an artefact of the fact that after the $(N+1)$th play, the spent capital would have exceeded $\Lambda$. For any algorithm that operates oblivious to a fixed capital, it can be bounded by $\lambda^{(M)}\mu_\star$, which is negligible compared to $\tilde{R}(\Lambda)$. According to the above, we have the following expression for $\tilde{R}(\Lambda)$:

$$\tilde{R}(\Lambda) = \sum_{k \in \mathcal{K}} \Delta^{(M)}_k \sum_{m=1}^{M} \lambda^{(m)} T^{(m)}_{k,N}.$$

Central to our analysis will be the following partitioning of $\mathcal{K}$. First denote the set of arms whose fidelity $m$ mean is within $\eta$ of $\mu_\star$ to be $\mathcal{J}^{(m)}_\eta = \{k \in \mathcal{K} :\, \mu_\star - \mu^{(m)}_k \le \eta\}$. Define $\mathcal{K}^{(1)} \triangleq \overline{\mathcal{J}^{(1)}_{\zeta^{(1)}+2\gamma^{(1)}}} = \{k \in \mathcal{K} :\, \Delta^{(1)}_k > 2\gamma^{(1)}\}$ to be the arms whose first fidelity mean $\mu^{(1)}_k$ is at least $\zeta^{(1)} + 2\gamma^{(1)}$ below the optimum $\mu_\star$. Then we recursively define,

$$\mathcal{K}^{(m)} \;\triangleq\; \overline{\mathcal{J}^{(m)}_{\zeta^{(m)}+2\gamma^{(m)}}} \;\cap\; \bigcap_{\ell=1}^{m-1} \mathcal{J}^{(\ell)}_{\zeta^{(\ell)}+2\gamma^{(\ell)}}, \quad m \le M-1, \qquad\qquad \mathcal{K}^{(M)} \;\triangleq\; \mathcal{K} \;\cap\; \bigcap_{\ell=1}^{M-1} \mathcal{J}^{(\ell)}_{\zeta^{(\ell)}+2\gamma^{(\ell)}}.$$

Observe that for all $k \in \mathcal{K}^{(m)}$, $\Delta^{(m)}_k > 2\gamma^{(m)}$ and $\Delta^{(\ell)}_k \le 2\gamma^{(\ell)}$ for all $\ell < m$. For what follows, for any $k \in \mathcal{K}$, $\llbracket k\rrbracket$ will denote the partition $k$ belongs to, i.e. $\llbracket k\rrbracket = m$ such that $k \in \mathcal{K}^{(m)}$. We will see that $\mathcal{K}^{(m)}$ are the arms that will be played at the $m$th fidelity but can be excluded from fidelities higher than $m$ using information at fidelity $m$. See Fig. 1 for an illustration of these partitions.

[Figure 1: Illustration of the partition $\mathcal{K}^{(m)}$'s for an $M = 4$ fidelity problem. The sets $\mathcal{J}^{(m)}_{\zeta^{(m)}+2\gamma^{(m)}}$ are indicated next to their boundaries. $\mathcal{K}^{(1)}, \mathcal{K}^{(2)}, \mathcal{K}^{(3)}, \mathcal{K}^{(4)}$ are shown in yellow, green, red and purple respectively. The optimal arms $\mathcal{K}_\star$ are shown as a black circle.]

### 4.1 Regret Bound for MF-UCB

Recall that $N = \sum_{m=1}^{M} Q^{(m)}_N$ is the total (random) number of plays by a multi-fidelity strategy within capital $\Lambda$. Let $n_\Lambda = \Lambda/\lambda^{(M)}$ be the (non-random) number of plays by any strategy that operates only on the highest fidelity. Since $\lambda^{(m)} < \lambda^{(M)}$ for all $m < M$, $N$ could be large for an arbitrary multi-fidelity method. However, our analysis reveals that for MF-UCB, $N \lesssim n_\Lambda$ with high probability. The following theorem bounds $R(\Lambda)$ for MF-UCB. The proof is given in Appendix A. For clarity, we ignore the constants, but they are fleshed out in the proofs.

**Theorem 2** (Regret Bound for MF-UCB). Let $\rho > 4$. There exists $\Lambda_0$ depending on the $\lambda^{(m)}$'s such that for all $\Lambda > \Lambda_0$, MF-UCB satisfies,

$$\mathbb{E}[R(\Lambda)] \;\lesssim\; \log(n_\Lambda) \sum_{k \notin \mathcal{K}_\star} \Delta^{(M)}_k \Bigg[\frac{\lambda^{(\llbracket k\rrbracket)}}{\psi\big(\Delta^{(\llbracket k\rrbracket)}_k\big)} \;+\; \sum_{m < \llbracket k\rrbracket} \frac{\lambda^{(m)}}{\psi\big(\gamma^{(m)}\big)}\Bigg].$$

Let us compare the above bound to UCB, whose regret is $\mathbb{E}[R(\Lambda)] \asymp \log(n_\Lambda)\sum_{k \notin \mathcal{K}_\star} \Delta^{(M)}_k\, \frac{\lambda^{(M)}}{\psi(\Delta^{(M)}_k)}$. We will first argue that MF-UCB does not do significantly worse than UCB in the worst case. Modulo the $\Delta^{(M)}_k\log(n_\Lambda)$ terms, the regret for MF-UCB due to arm $k$ is $R_{k,\mathrm{MF\text{-}UCB}} \asymp \lambda^{(\llbracket k\rrbracket)}/\psi(\Delta^{(\llbracket k\rrbracket)}_k)$. Consider any $k \in \mathcal{K}^{(m)}$, $m < M$, for which $\Delta^{(m)}_k > 2\gamma^{(m)}$. Since $\Delta^{(M)}_k \le \Delta^{(\llbracket k\rrbracket)}_k + 2\zeta^{(\llbracket k\rrbracket)} \lesssim \psi^{-1}\Big(\frac{\lambda^{(\llbracket k\rrbracket+1)}}{\lambda^{(\llbracket k\rrbracket)}}\psi\big(\Delta^{(\llbracket k\rrbracket)}_k\big)\Big)$, a (loose) lower bound for UCB on the same quantity is $R_{k,\mathrm{UCB}} \asymp \lambda^{(M)}/\psi(\Delta^{(M)}_k) \gtrsim \frac{\lambda^{(M)}}{\lambda^{(\llbracket k\rrbracket+1)}}\, R_{k,\mathrm{MF\text{-}UCB}}$. Therefore, for any $k \in \mathcal{K}^{(m)}$, $m < M$, MF-UCB is at most a constant times worse than UCB. However, whenever $\Delta^{(\llbracket k\rrbracket)}_k$ is comparable to or larger than $\Delta^{(M)}_k$, MF-UCB outperforms UCB by a factor of $\lambda^{(\llbracket k\rrbracket)}/\lambda^{(M)}$ on arm $k$.

As can be inferred from the theorem, most of the cost invested by MF-UCB on arm $k$ is at the $\llbracket k\rrbracket$th fidelity. For example, in Fig. 1, MF-UCB would not play the yellow arms $\mathcal{K}^{(1)}$ beyond the first fidelity (more than a constant number of times). Similarly, all green and red arms are played mostly at the second and third fidelities respectively. Only the purple arms are played at the fourth (most expensive) fidelity. On the other hand, UCB plays all arms at the fourth fidelity. Since lower fidelities are cheaper, MF-UCB achieves better regret than UCB. It is essential to note here that $\Delta^{(M)}_k$ is small for arms in $\mathcal{K}^{(M)}$. These arms are close to the optimum and require more effort to distinguish than arms that are far away. MF-UCB, like UCB, invests $\log(n_\Lambda)\lambda^{(M)}/\psi(\Delta^{(M)}_k)$ capital in those arms. That is, the multi-fidelity setting does not help us significantly with the hard-to-distinguish arms. That said, in cases where $K$ is very large and the set $\mathcal{K}^{(M)}$ is small, the bound for MF-UCB can be appreciably better than that for UCB.
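The partition $\{\mathcal{K}^{(m)}\}$ and the index $\llbracket k\rrbracket$ are determined by the fidelity means, the $\zeta^{(m)}$'s and the thresholds $\gamma^{(m)}$ from (4). The sketch below (function name and array layout are our own) assigns each arm its partition index by scanning for the first fidelity at which $\Delta^{(m)}_k > 2\gamma^{(m)}$; arms that are never eliminated at a lower fidelity, including the optimal arms, receive $\llbracket k\rrbracket = M$.

```python
import numpy as np

def partition_arms(mu, zetas, gammas):
    """Assign each arm its partition index [[k]] according to the sets K^(m).

    mu     : (K, M) array of fidelity means mu_k^(m); last column is fidelity M
    zetas  : length-M array of zeta^(m) (zeta^(M) = 0)
    gammas : length-(M-1) array of thresholds gamma^(m) from eq. (4)
    """
    K, M = mu.shape
    mu_star = mu[:, -1].max()                      # optimal highest-fidelity mean
    bracket = np.full(K, M)                        # default: [[k]] = M
    for k in range(K):
        for m in range(M - 1):
            delta = mu_star - mu[k, m] - zetas[m]  # Delta_k^(m)
            if delta > 2 * gammas[m]:              # k in K^(m): eliminable at fidelity m
                bracket[k] = m + 1                 # store as a 1-indexed fidelity
                break
    return bracket
```

For the sub-Gaussian $\psi$ used in the earlier sketch, the thresholds would be `gammas = np.sqrt(costs[:-1] / costs[1:]) * zetas[:-1]`.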
### 4.2 Lower Bound

Since $N \lesssim n_\Lambda = \Lambda/\lambda^{(M)}$, any multi-fidelity strategy which plays a suboptimal arm a polynomial number of times at any fidelity after $n$ time steps will have worse regret than MF-UCB (and UCB). Therefore, in our lower bound we will only consider strategies which satisfy the following condition.

**Assumption 3.** Consider the strategy after $n$ plays at any fidelity. For any arm with $\Delta^{(M)}_k > 0$, we have $\mathbb{E}\big[\sum_{m=1}^{M} T^{(m)}_{k,n}\big] \in o(n^a)$ for any $a > 0$.

For our lower bound we will consider a set of Bernoulli distributions $\theta^{(m)}_k$ for each fidelity $m$ and each arm $k$ with mean $\mu^{(m)}_k$. It is known that for Bernoulli distributions $\psi(\epsilon) \in \Theta(\epsilon^2)$ [14]. To state our lower bound we will further partition each set $\mathcal{K}^{(m)}$ into two sets $\mathcal{K}^{(m)}_{-}, \mathcal{K}^{(m)}_{+}$ as follows,

$$\mathcal{K}^{(m)}_{-} = \{k \in \mathcal{K}^{(m)} :\, \Delta^{(\ell)}_k \le 0 \ \ \forall\, \ell < m\}, \qquad \mathcal{K}^{(m)}_{+} = \{k \in \mathcal{K}^{(m)} :\, \exists\, \ell < m \text{ s.t. } \Delta^{(\ell)}_k > 0\}.$$

For any $k \in \mathcal{K}^{(m)}$, our lower bound, given below, is different depending on which of these sets $k$ belongs to.

**Theorem 4** (Lower bound for $R(\Lambda)$). Consider any set of Bernoulli reward distributions with $\mu_\star \in (1/2, 1)$ and $\zeta^{(1)} < 1/2$. Then, for any strategy satisfying Assumption 3, the following holds.

$$\liminf_{\Lambda \to \infty} \frac{\mathbb{E}[R(\Lambda)]}{\log(n_\Lambda)} \;\ge\; c \sum_{m=1}^{M}\, \sum_{k \in \mathcal{K}^{(m)}} \Delta^{(M)}_k \,\min_{\ell \in \mathcal{L}_m(k)} \frac{\lambda^{(\ell)}}{\big(\Delta^{(\ell)}_k\big)^2}.$$

Here $c$ is a problem dependent constant. $\mathcal{L}_m(k) = \{\ell < m :\, \Delta^{(\ell)}_k > 0\} \cup \{m\}$ is the union of the $m$th fidelity and all fidelities smaller than $m$ for which $\Delta^{(\ell)}_k > 0$.

Comparing this with Theorem 2, we find that MF-UCB meets the lower bound on all arms $k \in \mathcal{K}^{(m)}_{-}$, for all $m$. However, it may be loose on arms $k \in \mathcal{K}^{(m)}_{+}$. The gap can be explained as follows. For $k \in \mathcal{K}^{(m)}_{+}$, there exists some $\ell < m$ such that $0 < \Delta^{(\ell)}_k < 2\gamma^{(\ell)}$. As explained previously, the switching criterion of MF-UCB ensures that we do not invest too much effort trying to distinguish whether $\Delta^{(\ell)}_k < 0$, since $\Delta^{(\ell)}_k$ could be very small. That is, we proceed to the next fidelity only if we cannot conclude $\Delta^{(\ell)}_k \gtrsim \gamma^{(\ell)}$. However, since $\lambda^{(m)} > \lambda^{(\ell)}$, it might be the case that $\lambda^{(\ell)}/(\Delta^{(\ell)}_k)^2 \ll \lambda^{(m)}/(\Delta^{(m)}_k)^2$ even though $\Delta^{(m)}_k > 2\gamma^{(m)}$. Consider for example a two fidelity problem where $\Delta = \Delta^{(1)}_k = \Delta^{(2)}_k < 2\sqrt{\lambda^{(1)}/\lambda^{(2)}}\,\zeta^{(1)}$. Here it makes sense to distinguish the arm as being suboptimal at the first fidelity with $\lambda^{(1)}\log(n_\Lambda)/\Delta^2$ capital, instead of $\lambda^{(2)}\log(n_\Lambda)/\Delta^2$ at the second fidelity. However, MF-UCB distinguishes this arm at the higher fidelity, since $\Delta < 2\gamma^{(1)}$, and therefore does not meet the lower bound on this arm. While it might seem tempting to switch based on estimates of $\Delta^{(1)}_k, \Delta^{(2)}_k$, this idea is not desirable, as estimating $\Delta^{(2)}_k$ for an arm requires $\log(n_\Lambda)/\psi(\Delta^{(2)}_k)$ samples at the second fidelity; this is exactly what we are trying to avoid for the majority of the arms via the multi-fidelity setting. We leave it as an open problem to resolve this gap.

| | $k \in \mathcal{K}^{(1)}$ | $k \in \mathcal{K}^{(2)}$ | $\cdots$ | $k \in \mathcal{K}^{(m)}$ | $\cdots$ | $k \in \mathcal{K}^{(M)}$ | $k \in \mathcal{K}_\star$ |
|---|---|---|---|---|---|---|---|
| $\mathbb{E}[T^{(1)}_{k,n}]$ | $\frac{\log(n)}{\psi(\Delta^{(1)}_k)}$ | $\frac{\log(n)}{\psi(\gamma^{(1)})}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(1)})}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(1)})}$ | $\frac{\log(n)}{\psi(\gamma^{(1)})}$ |
| $\mathbb{E}[T^{(2)}_{k,n}]$ | $O(1)$ | $\frac{\log(n)}{\psi(\Delta^{(2)}_k)}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(2)})}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(2)})}$ | $\frac{\log(n)}{\psi(\gamma^{(2)})}$ |
| $\vdots$ | | | | | | | |
| $\mathbb{E}[T^{(m)}_{k,n}]$ | $O(1)$ | $O(1)$ | $\cdots$ | $\frac{\log(n)}{\psi(\Delta^{(m)}_k)}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(m)})}$ | $\frac{\log(n)}{\psi(\gamma^{(m)})}$ |
| $\vdots$ | | | | | | | |
| $\mathbb{E}[T^{(M)}_{k,n}]$ | $O(1)$ | $O(1)$ | $\cdots$ | $O(1)$ | $\cdots$ | $\frac{\log(n)}{\psi(\Delta^{(M)}_k)}$ | $\Omega(n)$ |

Table 1: Bounds on the expected number of plays for each $k \in \mathcal{K}^{(m)}$ (columns) at each fidelity (rows) after $n$ time steps (i.e. $n$ plays at any fidelity) in MF-UCB.
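To make Theorem 4 concrete, the small helper below (our own, hypothetical) evaluates the per-arm term $\Delta^{(M)}_k\min_{\ell\in\mathcal{L}_m(k)}\lambda^{(\ell)}/(\Delta^{(\ell)}_k)^2$, ignoring the constant $c$ and the $\log(n_\Lambda)$ factor.

```python
import numpy as np

def lower_bound_term(deltas, costs, m):
    """Per-arm term from Theorem 4 for an arm in K^(m) (up to c and log(n_Lambda)).

    deltas : length-M array of Delta_k^(1..M)
    costs  : length-M array of lambda^(1..M)
    m      : 1-indexed partition index [[k]] of the arm
    """
    L = [l for l in range(m - 1) if deltas[l] > 0] + [m - 1]     # L_m(k), 0-indexed fidelities
    return deltas[-1] * min(costs[l] / deltas[l] ** 2 for l in L)
```

For the two-fidelity example above, with $\Delta^{(1)}_k = \Delta^{(2)}_k = \Delta > 0$, this evaluates to $\Delta^{(M)}_k\,\lambda^{(1)}/\Delta^2$, whereas MF-UCB spends order $\lambda^{(2)}\log(n_\Lambda)/\Delta^2$ on that arm, which is precisely the gap discussed above.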
## 5 Proof Sketches

### 5.1 Theorem 2

First we analyse MF-UCB after $n$ plays (at any fidelity) and control the number of plays of an arm at various fidelities depending on which $\mathcal{K}^{(m)}$ it belongs to. To that end we prove the following.

**Lemma 5** (Bounding $\mathbb{E}[T^{(m)}_{k,n}]$; informal). After $n$ time steps of MF-UCB, for any $k \in \mathcal{K}$,

$$\mathbb{E}\big[T^{(\ell)}_{k,n}\big] \lesssim \frac{\log(n)}{\psi(\gamma^{(\ell)})} \ \ \text{for } \ell < \llbracket k\rrbracket, \qquad \mathbb{E}\big[T^{(\llbracket k\rrbracket)}_{k,n}\big] \lesssim \frac{\log(n)}{\psi\big(\Delta^{(\llbracket k\rrbracket)}_k/2\big)}, \qquad \mathbb{E}\big[T^{(>\llbracket k\rrbracket)}_{k,n}\big] \in O(1).$$

The bounds above are illustrated in Table 1. Let $R_k(\Lambda) = \Delta^{(M)}_k\sum_{m=1}^{M} \lambda^{(m)} T^{(m)}_{k,N}$ be the regret incurred due to arm $k$ and $R_{kn} = \mathbb{E}[R_k(\Lambda)\,|\,N = n]$. Using Lemma 5 we have,

$$R_{kn} \;\lesssim\; \Delta^{(M)}_k\Bigg[\sum_{\ell < \llbracket k\rrbracket}\frac{\lambda^{(\ell)}\log(n)}{\psi(\gamma^{(\ell)})} \;+\; \frac{\lambda^{(\llbracket k\rrbracket)}\log(n)}{\psi\big(\Delta^{(\llbracket k\rrbracket)}_k/2\big)}\Bigg] + o(1). \qquad (7)$$

The next step will be to control the number of plays $N$ within capital $\Lambda$, which will bound $\mathbb{E}[\log(N)]$. While $\Lambda/\lambda^{(1)}$ is an easy bound, we will see that for MF-UCB, $N$ will be on the order of $n_\Lambda = \Lambda/\lambda^{(M)}$. For this we will use the following high probability bounds on $T^{(m)}_{k,n}$.

**Lemma 6** (Bounding $\mathbb{P}(T^{(m)}_{k,n} > \cdot\,)$; informal). After $n$ time steps of MF-UCB, for any $k \in \mathcal{K}$,

$$\mathbb{P}\bigg(T^{(\llbracket k\rrbracket)}_{k,n} > \frac{x\log(n)}{\psi\big(\Delta^{(\llbracket k\rrbracket)}_k/2\big)}\bigg) \lesssim \frac{1}{n^{x\rho - 1}}, \qquad \mathbb{P}\Big(T^{(>\llbracket k\rrbracket)}_{k,n} > x\Big) \lesssim \frac{1}{x^{\rho - 2}}.$$

We bound the number of plays at fidelities less than $M$ via Lemma 6 and obtain $n/2 > \sum_{m=1}^{M-1} Q^{(m)}_n$ with probability greater than, say, $1 - \delta$, for all $n \ge n_0$. By setting $\delta = 1/\log(\Lambda/\lambda^{(1)})$, we get $\mathbb{E}[\log(N)] \lesssim \log(n_\Lambda)$. The actual argument is somewhat delicate since $\delta$ depends on $\Lambda$. This gives us an expression for the regret due to arm $k$ of the form (7), with $n$ replaced by $n_\Lambda$. Then we argue that the regret incurred by an arm $k$ at fidelities less than $\llbracket k\rrbracket$ (first term on the RHS of (7)) is dominated by $\lambda^{(\llbracket k\rrbracket)}/\psi(\Delta^{(\llbracket k\rrbracket)}_k)$ (second term). This is possible due to the design of the sets $\mathcal{K}^{(m)}$ and Assumption 1. While Lemmas 5 and 6 require only $\rho > 2$, we need $\rho > 4$ to ensure that $\sum_{m=1}^{M-1} Q^{(m)}_n$ remains sublinear when we plug in the probabilities from Lemma 6. $\rho > 2$ is attainable with a more careful design of the sets $\mathcal{K}^{(m)}$. The $\Lambda > \Lambda_0$ condition is needed because initially MF-UCB is playing at lower fidelities and, for small $\Lambda$, $N$ could be much larger than $n_\Lambda$.

### 5.2 Theorem 4

First we show that for an arm $k$ with $\Delta^{(p)}_k > 0$ and $\Delta^{(\ell)}_k \le 0$ for all $\ell < p$, any strategy satisfying Assumption 3 must satisfy

$$\mathbb{E}[R_k(\Lambda)] \;\gtrsim\; \log(n_\Lambda)\,\Delta^{(M)}_k \min_{\ell \le p,\ \Delta^{(\ell)}_k > 0} \frac{\lambda^{(\ell)}}{\big(\Delta^{(\ell)}_k\big)^2},$$

where $R_k$ is the regret incurred due to arm $k$. The proof uses a change of measure argument. The modification has Bernoulli distributions with means $\tilde{\mu}^{(\ell)}_k$, $\ell = 1, \dots, M$, where $\tilde{\mu}^{(\ell)}_k = \mu^{(\ell)}_k$ for all $\ell < m$. Then we push $\tilde{\mu}^{(\ell)}_k$ slightly above $\mu_\star - \zeta^{(\ell)}$ from $\ell = m$ all the way to $M$, where $\tilde{\mu}^{(M)}_k > \mu_\star$. To control the probabilities after changing to $\tilde{\mu}^{(\ell)}_k$ we use the conditions in Assumption 3. Then for $k \in \mathcal{K}^{(m)}_{-}$ we argue that $\lambda^{(\ell)}/(\Delta^{(\ell)}_k)^2 \gtrsim \lambda^{(m)}/(\Delta^{(m)}_k)^2$, using, once again, the design of the sets $\mathcal{K}^{(m)}$. This yields the separate results for $k \in \mathcal{K}^{(m)}_{-}$ and $k \in \mathcal{K}^{(m)}_{+}$.

## 6 Some Simulations on Synthetic Problems

We compare UCB against MF-UCB on a series of synthetic problems. The results are given in Figure 2. Due to space constraints, the details of these experiments are given in Appendix C. Note that MF-UCB outperforms UCB on all these problems. Critically, note that the gradient of its curve is also smaller than that of UCB, corroborating our theoretical insights. We have also illustrated the number of plays by MF-UCB and UCB at each fidelity for one of these problems. The arms are arranged in increasing order of their $\mu^{(M)}_k$ values. As predicted by our analysis, most of the very suboptimal arms are only played at the lower fidelities. As lower fidelities are cheaper, MF-UCB is able to invest more higher fidelity plays at arms close to the optimum than UCB.

[Figure 2: Simulation results on the synthetic problems. The first four panels compare UCB against MF-UCB, plotted against the expended capital $\Lambda$, on four synthetic problems; each panel title states $K$, $M$ and the costs $\lambda^{(1)}, \dots, \lambda^{(M)}$: ($K = 500$, $M = 3$, costs $[1; 10; 100]$), ($K = 500$, $M = 4$, costs $[1; 5; 20; 50]$), ($K = 200$, $M = 2$, costs $[1; 10]$), and ($K = 1000$, $M = 5$, costs $[1; 3; 10; 30; 100]$). The first two used Gaussian rewards and the last two used Bernoulli rewards. The last two panels show the number of plays at each fidelity by UCB and MF-UCB on the $K = 500$, $M = 3$ problem with Gaussian observations (corresponding to the first panel).]
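The details of the experimental setup are in Appendix C; the driver below is only a hedged sketch of how such a comparison could be set up, assuming the `MFUCB` class sketched in Section 3 is in scope and using Gaussian rewards whose fidelity means respect the $\zeta^{(m)}$ bounds. The problem size, costs, $\zeta^{(m)}$'s and function names are illustrative choices of ours, not the configurations used in the paper's experiments; plain UCB is obtained as the single-fidelity special case of the same class.

```python
import numpy as np

def make_problem(K, zetas, rng):
    """Random instance: highest-fidelity means in [0, 1]; each lower-fidelity mean is
    perturbed by at most zeta^(m), so |mu^(M)_k - mu^(m)_k| <= zeta^(m)."""
    mu_M = rng.uniform(0.0, 1.0, size=K)
    return np.column_stack([mu_M + rng.uniform(-z, z, size=K) for z in zetas])

def run(policy, mu, costs, Lambda, sigma, rng):
    """Play until the capital Lambda is exhausted; return the pseudo-regret R~(Lambda)."""
    mu_star, regret, spent = mu[:, -1].max(), 0.0, 0.0
    while True:
        arm, fid = policy.choose()
        if spent + costs[fid] > Lambda:
            return regret
        spent += costs[fid]
        regret += costs[fid] * (mu_star - mu[arm, -1])            # lambda^(m_t) * r_t
        policy.update(arm, fid, rng.normal(mu[arm, fid], sigma))  # sample from theta^(m_t) of the arm

rng = np.random.default_rng(0)
K, costs, zetas, sigma = 100, [1.0, 10.0, 100.0], [0.2, 0.1, 0.0], 0.5
mu = make_problem(K, zetas, rng)
mf_ucb = MFUCB(K, costs, zetas, sigma=sigma)         # the MF-UCB sketch from Section 3
ucb = MFUCB(K, costs[-1:], zetas[-1:], sigma=sigma)  # plain UCB: only the top fidelity
print("MF-UCB regret:", run(mf_ucb, mu, costs, Lambda=5e4, sigma=sigma, rng=rng))
print("UCB regret   :", run(ucb, mu[:, -1:], costs[-1:], Lambda=5e4, sigma=sigma, rng=rng))
```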
## 7 Conclusion

We studied a novel framework for exploration-exploitation trade-offs when cheaper approximations to a desired experiment are available. We proposed an algorithm for this setting, MF-UCB, based on upper confidence bound techniques. It uses the cheap lower fidelity plays to eliminate several bad arms and reserves the expensive high fidelity queries for a small set of arms with high expected reward, hence achieving better regret than strategies which ignore multi-fidelity information. We complement this result with a lower bound which demonstrates that MF-UCB is near optimal.

Other settings for bandit problems with multi-fidelity evaluations might warrant different definitions of the regret. For example, consider a gold mining robot where each high fidelity play is a real world experiment with the robot and incurs cost $\lambda^{(2)}$, while a vastly cheaper computer simulation, which incurs cost $\lambda^{(1)}$, approximates the robot's real world behaviour. In applications like this, $\lambda^{(1)} \ll \lambda^{(2)}$. However, unlike in our setting, lower fidelity plays may not yield any rewards (as simulations do not yield actual gold). Similarly, in clinical trials the regret due to a bad treatment at the high fidelity would be, say, a dead patient, whereas a bad treatment at a lower fidelity may not warrant as large a penalty. These settings are quite challenging and we wish to work on them going forward.

## References

[1] Rajeev Agrawal. Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem. Advances in Applied Probability, 1995.
[2] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits. Theoretical Computer Science, 2009.
[3] Peter Auer. Using Confidence Bounds for Exploitation-exploration Trade-offs. Journal of Machine Learning Research, 2003.
[4] Yoram Baram, Ran El-Yaniv, and Kobi Luz. Online Choice of Active Learning Algorithms. Journal of Machine Learning Research, 5:255–291, 2004.
[5] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 2012.
[6] Mark Cutler, Thomas J. Walsh, and Jonathan P. How. Reinforcement Learning with Multi-Fidelity Simulators. In IEEE International Conference on Robotics and Automation (ICRA), 2014.
[7] D. Huang, T. T. Allen, W. I. Notz, and R. A. Miller. Sequential Kriging Optimization Using Multiple-fidelity Evaluations. Structural and Multidisciplinary Optimization, 2006.
[8] Kirthevasan Kandasamy, Gautam Dasarathy, Junier Oliva, Jeff Schneider, and Barnabás Póczos. Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations. In Advances in Neural Information Processing Systems, 2016.
[9] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 1985.
[10] Dev Rajnarayan, Alex Haas, and Ilan Kroo. A Multifidelity Gradient-free Optimization Method and Application to Aerodynamic Design. In AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2008.
[11] Herbert Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, 1952.
[12] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.
[13] Long Tran-Thanh, Lampros C.
Stavrogiannis, Victor Naroditskiy, Valentin Robu, Nicholas R. Jennings, and Peter Key. Efficient Regret Bounds for Online Bid Optimisation in Budget-Limited Sponsored Search Auctions. In UAI, 2014.
[14] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2010.
[15] Yingce Xia, Haifang Li, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Thompson Sampling for Budgeted Multi-Armed Bandits. In IJCAI, 2015.
[16] Chicheng Zhang and Kamalika Chaudhuri. Active Learning from Weak and Strong Labelers. In Advances in Neural Information Processing Systems, 2015.