# The Multi-fidelity Multi-armed Bandit

Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, Barnabás Póczos
Carnegie Mellon University, Rice University
{kandasamy, schneide, bapoczos}@cs.cmu.edu, gautamd@rice.edu

We study a variant of the classical stochastic $K$-armed bandit where observing the outcome of each arm is expensive, but cheap approximations to this outcome are available. For example, in online advertising the performance of an ad can be approximated by displaying it for shorter time periods or to narrower audiences. We formalise this task as a *multi-fidelity* bandit, where, at each time step, the forecaster may choose to play an arm at any one of $M$ fidelities. The highest fidelity (desired outcome) expends cost $\lambda^{(M)}$. The $m$th fidelity (an approximation) expends $\lambda^{(m)} < \lambda^{(M)}$ and returns a biased estimate of the highest fidelity. We develop MF-UCB, a novel upper confidence bound procedure for this setting, and prove that it naturally adapts to the sequence of available approximations and costs, thus attaining better regret than naive strategies which ignore the approximations. For instance, in the above online advertising example, MF-UCB would use the lower fidelities to quickly eliminate suboptimal ads and reserve the larger, expensive experiments for a small set of promising candidates. We complement this result with a lower bound and show that MF-UCB is nearly optimal under certain conditions.

## 1 Introduction

Since the seminal work of Robbins [11], the multi-armed bandit has become an attractive framework for studying exploration-exploitation trade-offs inherent to tasks arising in online advertising, finance and other fields. In the most basic form of the $K$-armed bandit [9, 12], we have a set $\mathcal{K} = \{1, \dots, K\}$ of $K$ arms (e.g. $K$ ads in online advertising). At each time step $t = 1, 2, \dots$, an arm is played and a corresponding reward is realised. The goal is to design a strategy of plays that minimises the regret after $n$ plays. The regret is the comparison, in expectation, of the realised reward against an oracle that always plays the best arm. The well known Upper Confidence Bound (UCB) algorithm [3] achieves regret $O(K\log(n))$ after $n$ plays (ignoring mean rewards) and is minimax optimal [9].

In this paper, we propose a new take on this important problem. In many practical scenarios of interest, one can associate a cost to playing each arm. Furthermore, in many of these scenarios, one might have access to cheaper approximations to the outcome of the arms. For instance, in online advertising the goal is to maximise the cumulative number of clicks over a given time period. Conventionally, an arm pull may be thought of as the display of an ad for a specific time, say one hour. However, we may approximate its hourly performance by displaying the ad for shorter periods. This estimate is biased (and possibly noisy), as displaying an ad for longer intervals changes user behaviour. It can nonetheless be useful in gauging the long run click through rate. We can also obtain biased estimates of an ad by displaying it only to certain geographic regions or age groups. Similarly, one might consider algorithm selection for machine learning problems [4], where the goal is to be competitive with the best among a set of learning algorithms for a task. Here, one might obtain cheaper approximate estimates of the performance of an algorithm by running cheaper versions of it that use less data or computation.
In this paper, we will refer to such approximations as *fidelities*. Consider a 2-fidelity problem where the cost at the low fidelity is $\lambda^{(1)}$ and the cost at the high fidelity is $\lambda^{(2)}$. We will present a cost weighted notion of regret for this setting for a strategy that expends a capital of $\Lambda$ units. A classical $K$-armed bandit strategy such as UCB, which only uses the highest fidelity, can obtain at best $O\big(\lambda^{(2)} K \log(\Lambda/\lambda^{(2)})\big)$ regret [9]. In contrast, this paper will present multi-fidelity strategies that achieve $O\big((\lambda^{(1)} K + \lambda^{(2)}|\mathcal{K}_g|)\log(\Lambda/\lambda^{(2)})\big)$ regret. Here $\mathcal{K}_g$ is a (typically) small subset of arms with high expected reward that can be identified using plays at the (cheaper) low fidelity. When $|\mathcal{K}_g| < K$ and $\lambda^{(1)} < \lambda^{(2)}$, such a strategy will outperform the more standard UCB algorithms. Intuitively, this is achieved by using the lower fidelities to eliminate several of the bad arms and reserving expensive higher fidelity plays for a small subset of the most promising arms. We formalise the above intuitions in the sequel. Our main contributions are:

1. A novel formalism for studying bandit tasks when one has access to multiple fidelities for each arm, with each successive fidelity providing a better approximation to the most expensive one.
2. A new algorithm that we call Multi-Fidelity Upper Confidence Bound (MF-UCB), which adapts the classical Upper Confidence Bound (UCB) strategies to our multi-fidelity setting. Empirically, we demonstrate that our algorithm outperforms naive UCB on simulations.
3. A theoretical characterisation of the performance of MF-UCB which shows that the algorithm (a) uses the lower fidelities to explore all arms and eliminate arms with low expected reward, and (b) reserves the higher fidelity plays for arms with rewards close to the optimal value. We derive a lower bound on the regret and demonstrate that MF-UCB is near-optimal on this problem.

Related Work. The $K$-armed bandit has been studied extensively in the past [1, 9, 11]. There has been a flurry of work on upper confidence bound (UCB) methods [2, 3], which adopt the optimism in the face of uncertainty principle for bandits. For readers unfamiliar with UCB methods, we recommend Chapter 2 of Bubeck and Cesa-Bianchi [5]. Our work in this paper builds on UCB ideas, but the multi-fidelity framework poses significantly new algorithmic and theoretical challenges. There has been some interest in multi-fidelity methods for optimisation in many applied domains of research [7, 10]. However, these works do not formalise or analyse notions of regret in the multi-fidelity setting. Multi-fidelity methods have been used in the robotics community for reinforcement learning tasks by modeling each fidelity as a Markov decision process [6]. Zhang and Chaudhuri [16] study active learning with a cheap weak labeler and an expensive strong labeler. The objective of these papers, however, is not to handle the exploration-exploitation trade-off inherent to the bandit setting. A line of work on budgeted multi-armed bandits [13, 15] studies a variant of the $K$-armed bandit where each arm has a random reward and cost and the goal is to play the arm with the highest reward/cost ratio as much as possible. This is different from our setting, where each arm has multiple fidelities which serve as approximations. Recently, in Kandasamy et al. [8] we extended ideas in this work to analyse multi-fidelity bandits with Gaussian process payoffs.
## 2 The Stochastic K-armed Multi-fidelity Bandit

In the classical $K$-armed bandit, each arm $k \in \mathcal{K} = \{1, \dots, K\}$ is associated with a real valued distribution $\theta_k$ with mean $\mu_k$. Let $\mathcal{K}_\star = \operatorname{argmax}_{k \in \mathcal{K}} \mu_k$ be the set of optimal arms, $k_\star \in \mathcal{K}_\star$ be an optimal arm and $\mu_\star = \mu_{k_\star}$ denote the optimal mean value. A bandit strategy would play an arm $I_t \in \mathcal{K}$ at each time step $t$ and observe a sample from $\theta_{I_t}$. Its goal is to maximise the sum of expected rewards after $n$ time steps, $\sum_{t=1}^n \mu_{I_t}$, or equivalently minimise the cumulative pseudo-regret $\sum_{t=1}^n (\mu_\star - \mu_{I_t})$, for all values of $n$. In other words, the objective is to be competitive, in expectation, against an oracle that plays an optimal arm all the time.

In this work we differ from the usual bandit setting in the following aspect. For each arm $k$, we have access to $M - 1$ successive approximations $\theta^{(1)}_k, \theta^{(2)}_k, \dots, \theta^{(M-1)}_k$ to the desired distribution $\theta^{(M)}_k = \theta_k$. We will refer to these approximations as *fidelities*. Clearly, these approximations are meaningful only if they give us some information about $\theta^{(M)}_k$. In what follows, we will assume that the $m$th fidelity mean of an arm is within $\zeta^{(m)}$, a known quantity, of its highest fidelity mean, where the $\zeta^{(m)}$'s, decreasing with $m$, characterise the successive approximations. That is, $|\mu^{(M)}_k - \mu^{(m)}_k| \le \zeta^{(m)}$ for all $k \in \mathcal{K}$ and $m = 1, \dots, M$, where $\zeta^{(1)} > \zeta^{(2)} > \dots > \zeta^{(M)} = 0$ and the $\zeta^{(m)}$'s are known. It is possible for the lower fidelities to be misleading under this assumption: there could exist an arm $k$ with $\mu^{(M)}_k < \mu_\star = \mu^{(M)}_{k_\star}$ but with $\mu^{(m)}_k > \mu_\star$ and/or $\mu^{(m)}_k > \mu^{(m)}_{k_\star}$ for some $m < M$. In other words, we wish to explicitly account for the biases introduced by the lower fidelities, and not treat them as just higher variance observations of an expensive experiment. This problem of course becomes interesting only when lower fidelities are more attractive than higher fidelities in terms of some notion of cost. Towards this end, we will assign a cost $\lambda^{(m)}$ (such as advertising time, money etc.) to playing an arm at fidelity $m$, where $\lambda^{(1)} < \lambda^{(2)} < \dots < \lambda^{(M)}$.

Notation: $T^{(m)}_{k,t}$ denotes the number of plays of arm $k$ at fidelity $m$ until $t$ time steps. $T^{(>m)}_{k,t}$ is the number of plays at fidelities greater than $m$. $Q^{(m)}_t = \sum_{k \in \mathcal{K}} T^{(m)}_{k,t}$ is the number of fidelity $m$ plays at all arms until time $t$. $\overline{X}^{(m)}_{k,s}$ denotes the mean of $s$ samples drawn from $\theta^{(m)}_k$. Denote $\Delta^{(m)}_k = \mu_\star - \mu^{(m)}_k - \zeta^{(m)}$. When $s$ refers to the number of plays of an arm, we will take $1/s = \infty$ if $s = 0$. For a set $A \subset \mathcal{K}$, $\overline{A}$ denotes its complement. While discussing the intuitions in our proofs and theorems we will use $\asymp$, $\lesssim$, $\gtrsim$ to denote equality and inequalities ignoring constants.

Regret in the multi-fidelity setting: A strategy for a multi-fidelity bandit problem, at time $t$, produces an arm-fidelity pair $(I_t, m_t)$, where $I_t \in \mathcal{K}$ and $m_t \in \{1, \dots, M\}$, and observes a sample $X_t$ drawn (independently of everything else) from the distribution $\theta^{(m_t)}_{I_t}$. The choice of $(I_t, m_t)$ could depend on the previous arm-observation-fidelity tuples $\{(I_i, X_i, m_i)\}_{i=1}^{t-1}$. The multi-fidelity setting calls for a new notion of regret. For any strategy $\mathcal{A}$ that expends $\Lambda$ units of the resource, we will define the pseudo-regret $R(\Lambda, \mathcal{A})$ as follows. Let $q_t$ denote the instantaneous pseudo-reward at time $t$ and $r_t = \mu_\star - q_t$ denote the instantaneous pseudo-regret. We will discuss choices for $q_t$ shortly. Any notion of regret in the multi-fidelity setting needs to account for this instantaneous regret along with the cost of the fidelity at which we played at time $t$, i.e. $\lambda^{(m_t)}$.
Moreover, we should receive no reward (maximum regret) for any unused capital. These observations lead to the following definition,

$$R(\Lambda, \mathcal{A}) \;=\; \Lambda\mu_\star - \sum_{t=1}^{N}\lambda^{(m_t)} q_t \;=\; \underbrace{\Big(\Lambda - \sum_{t=1}^{N}\lambda^{(m_t)}\Big)\mu_\star}_{\tilde{r}(\Lambda,\mathcal{A})} \;+\; \underbrace{\sum_{t=1}^{N}\lambda^{(m_t)} r_t}_{\tilde{R}(\Lambda,\mathcal{A})}. \qquad (1)$$

Above, $N$ is the (random) number of plays within capital $\Lambda$ by $\mathcal{A}$, i.e. the largest $n$ such that $\sum_{t=1}^{n}\lambda^{(m_t)} \le \Lambda$. To motivate our choice of $q_t$ we consider an online advertising example where $\lambda^{(m)}$ is the advertising time at fidelity $m$ and $\mu^{(m)}_k$ is the expected number of clicks per unit time. While we observe from $\theta^{(m_t)}_{I_t}$ at time $t$, we wish to reward the strategy according to its highest fidelity distribution $\theta^{(M)}_{I_t}$. Therefore, regardless of which fidelity we play, we set $q_t = \mu^{(M)}_{I_t}$. Here, we are competing against an oracle which plays an optimal arm, at any fidelity, all the time. Note that we might have chosen $q_t$ to be $\mu^{(m_t)}_{I_t}$. However, this does not reflect the motivating applications for the multi-fidelity setting that we consider. For instance, a clickbait ad might receive a high number of clicks in the short run, but its long term performance might be poor. Furthermore, for such a choice, we may as well ignore the rich structure inherent to the multi-fidelity setting and simply play the arm $\operatorname{argmax}_{m,k}\mu^{(m)}_k$ at each time. There are of course other choices for $q_t$ that result in very different notions of regret; we discuss this briefly at the end of Section 7.

The distributions $\theta^{(m)}_k$ need to be well behaved for the problem to be tractable. We will assume that they satisfy concentration inequalities of the following form. For all $\epsilon > 0$,

$$\mathbb{P}\big(\overline{X}^{(m)}_{k,s} - \mu^{(m)}_k > \epsilon\big) < \nu e^{-s\psi(\epsilon)}, \qquad \mathbb{P}\big(\overline{X}^{(m)}_{k,s} - \mu^{(m)}_k < -\epsilon\big) < \nu e^{-s\psi(\epsilon)}. \qquad (2)$$

Here $\nu > 0$ and $\psi$ is an increasing function with $\psi(0) = 0$ that grows at least linearly, $\psi(x) \in \Omega(x)$. For example, if the distributions are sub-Gaussian, then $\psi(x) \in \Theta(x^2)$.

The performance of a multi-fidelity strategy which switches from low to high fidelities can be worsened by artificially inserting fidelities. Consider a scenario where $\lambda^{(m+1)}$ is only slightly larger than $\lambda^{(m)}$ and $\zeta^{(m+1)}$ is only slightly smaller than $\zeta^{(m)}$. This situation is unfavourable since there isn't much that can be inferred from the $(m+1)$th fidelity that cannot already be inferred from the $m$th by expending the same cost. We impose the following regularity condition to avoid such situations.

**Assumption 1.** The $\zeta^{(m)}$'s decay fast enough such that $\sum_{i=1}^{m}\frac{1}{\psi(\zeta^{(i)})} \le \frac{1}{\psi(\zeta^{(m+1)})}$ for all $m < M$.

Assumption 1 is not necessary to analyse our algorithm; however, the performance of MF-UCB when compared to UCB is most appealing when it holds. In cases where $M$ is small enough to be treated as a constant, the assumption is not necessary. For sub-Gaussian distributions, the condition is satisfied by an exponentially decaying sequence $(\zeta^{(1)}, \zeta^{(2)}, \dots)$ such as $(1/\sqrt{2},\, 1/2,\, 1/(2\sqrt{2}),\, \dots)$.

Our goal is to design a strategy $\mathcal{A}_0$ that has low expected pseudo-regret $\mathbb{E}[R(\Lambda, \mathcal{A}_0)]$ for all values of (sufficiently large) $\Lambda$, i.e. the equivalent of an anytime strategy, as opposed to a fixed time horizon strategy, in the usual bandit setting. The expectation is over the observed rewards, which also dictate the number of plays $N$. From now on, for simplicity, we will write $R(\Lambda)$ when $\mathcal{A}$ is clear from context and refer to it just as regret.
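To make the bookkeeping in (1) concrete, the short Python sketch below computes the cost-weighted pseudo-regret of an arbitrary sequence of (arm, fidelity) plays. It is only an illustration of the definition; the function name, argument names and the 0-indexed fidelities are our own choices and not part of the paper.

```python
import numpy as np

def multifidelity_regret(mu_star, mu_M, costs, plays, Lambda):
    """Cost-weighted pseudo-regret R(Lambda) from equation (1).

    mu_star : optimal highest-fidelity mean
    mu_M    : array (or dict) mapping arm -> highest-fidelity mean mu_k^(M)
    costs   : per-fidelity costs [lambda^(1), ..., lambda^(M)] (0-indexed here)
    plays   : list of (arm, fidelity) pairs chosen by the strategy
    Lambda  : total capital
    """
    spent, weighted_reward = 0.0, 0.0
    for arm, m in plays:
        if spent + costs[m] > Lambda:             # N = largest n with total cost <= Lambda
            break
        spent += costs[m]
        weighted_reward += costs[m] * mu_M[arm]   # q_t = mu_{I_t}^(M), regardless of fidelity played
    return Lambda * mu_star - weighted_reward     # = r~(Lambda) + R~(Lambda) in equation (1)
```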
## 3 The Multi-Fidelity Upper Confidence Bound (MF-UCB) Algorithm

As the name suggests, the MF-UCB algorithm maintains an upper confidence bound corresponding to $\mu^{(m)}_k$ for each $m \in \{1, \dots, M\}$ and $k \in \mathcal{K}$ based on its previous plays. Following UCB strategies [2, 3], we define the following set of upper confidence bounds,

$$\mathcal{B}^{(m)}_{k,t}(s) = \overline{X}^{(m)}_{k,s} + \psi^{-1}\Big(\frac{\rho\log t}{s}\Big) + \zeta^{(m)} \ \ \text{for all } m \in \{1, \dots, M\},\, k \in \mathcal{K}, \qquad \mathcal{B}_{k,t} = \min_{m=1,\dots,M} \mathcal{B}^{(m)}_{k,t}\big(T^{(m)}_{k,t-1}\big). \qquad (3)$$

Here $\rho$ is a parameter in our algorithm and $\psi$ is from (2). Each $\mathcal{B}^{(m)}_{k,t}(T^{(m)}_{k,t-1})$ provides a high probability upper bound on $\mu^{(M)}_k$, with their minimum $\mathcal{B}_{k,t}$ giving the tightest bound (see Appendix A). Similar to UCB, at time $t$ we play the arm $I_t$ with the highest upper bound, $I_t = \operatorname{argmax}_{k \in \mathcal{K}} \mathcal{B}_{k,t}$.

Since our setup has multiple fidelities associated with each arm, the algorithm needs to determine at each time $t$ which fidelity $m_t$ to play the chosen arm $I_t$ at. For this, consider an arbitrary fidelity $m < M$. The $\zeta^{(m)}$ conditions on $\mu^{(m)}_k$ imply a constraint on the value of $\mu^{(M)}_k$. If, at fidelity $m$, the uncertainty interval $\psi^{-1}\big(\rho\log(t)/T^{(m)}_{I_t,t-1}\big)$ is large, then we have not constrained $\mu^{(M)}_{I_t}$ sufficiently well yet. There is more information to be gleaned about $\mu^{(M)}_{I_t}$ from playing the arm $I_t$ at fidelity $m$. On the other hand, playing at fidelity $m$ indefinitely will not help us much, since the $\zeta^{(m)}$ elongation of the confidence band caps off how much we can learn about $\mu^{(M)}_{I_t}$ from fidelity $m$; i.e. even if we knew $\mu^{(m)}_{I_t}$, we would have only constrained $\mu^{(M)}_{I_t}$ to within a $\zeta^{(m)}$ interval. Our algorithm captures this natural intuition. Having selected $I_t$, we begin checking at the first fidelity. If $\psi^{-1}\big(\rho\log(t)/T^{(1)}_{I_t,t-1}\big)$ is smaller than a threshold $\gamma^{(1)}$ we proceed to check the second fidelity, continuing in a similar fashion. If at any point $\psi^{-1}\big(\rho\log(t)/T^{(m)}_{I_t,t-1}\big) \ge \gamma^{(m)}$, we play $I_t$ at fidelity $m_t = m$. If we go all the way to fidelity $M$, we play at $m_t = M$. The resulting procedure is summarised below in Algorithm 1.

**Algorithm 1** MF-UCB
For $t = 1, 2, \dots$
1. Choose $I_t \in \operatorname{argmax}_{k \in \mathcal{K}} \mathcal{B}_{k,t}$ (see equation (3)).
2. Set $m_t = \min\big\{\, m \;:\; \psi^{-1}\big(\rho\log t / T^{(m)}_{I_t,t-1}\big) \ge \gamma^{(m)} \ \text{or}\ m = M \,\big\}$ (see equation (4)).
3. Play $X \sim \theta^{(m_t)}_{I_t}$.

Choice of $\gamma^{(m)}$: In our algorithm, we choose

$$\gamma^{(m)} = \psi^{-1}\Big(\frac{\lambda^{(m)}}{\lambda^{(m+1)}}\,\psi\big(\zeta^{(m)}\big)\Big). \qquad (4)$$

To motivate this choice, note that if $\Delta^{(m)}_k = \mu_\star - \mu^{(m)}_k - \zeta^{(m)} > 0$, then we can conclude that arm $k$ is not optimal. Step 2 of the algorithm attempts to eliminate arms for which $\Delta^{(m)}_k \gtrsim \gamma^{(m)}$ from plays above the $m$th fidelity. If $\gamma^{(m)}$ is too large, then we would not eliminate a sufficient number of arms, whereas if it were too small we could end up playing a suboptimal arm $k$ (for which $\mu^{(m)}_k > \mu_\star$) too many times at fidelity $m$. As will be revealed by our analysis, the given choice represents an optimal tradeoff under the given assumptions.
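Below is a minimal Python sketch of Algorithm 1 for the sub-Gaussian case, taking $\psi(x) = x^2/(2\sigma^2)$ so that $\psi^{-1}(y) = \sigma\sqrt{2y}$. The class name, its `choose`/`update` interface and the default $\rho = 5$ (any $\rho > 4$ suffices for Theorem 2 below) are our own choices; this is a sketch of the upper confidence bounds (3), the thresholds (4) and the fidelity-switching rule, not the authors' implementation.

```python
import numpy as np

class MFUCB:
    """Sketch of MF-UCB (Algorithm 1) for sub-Gaussian rewards with
    psi(x) = x^2 / (2 sigma^2), hence psi^{-1}(y) = sigma * sqrt(2 y)."""

    def __init__(self, K, costs, zetas, sigma=1.0, rho=5.0):
        self.K, self.M = K, len(costs)
        self.costs = np.asarray(costs, dtype=float)   # lambda^(1..M)
        self.zetas = np.asarray(zetas, dtype=float)   # zeta^(1..M), zeta^(M) = 0
        self.sigma, self.rho = sigma, rho
        # Eq. (4) specialised to this psi: gamma^(m) = zeta^(m) * sqrt(lambda^(m)/lambda^(m+1)).
        self.gammas = np.sqrt(self.costs[:-1] / self.costs[1:]) * self.zetas[:-1]
        self.T = np.zeros((K, self.M))        # T^{(m)}_{k,t-1}: play counts
        self.sums = np.zeros((K, self.M))     # running sums of observed rewards
        self.t = 0

    def _psi_inv(self, y):
        return self.sigma * np.sqrt(2.0 * y)

    def choose(self):
        """Return (arm, fidelity) to play at the current time step."""
        self.t += 1
        logt = np.log(max(self.t, 2))
        with np.errstate(divide="ignore", invalid="ignore"):
            means = np.where(self.T > 0, self.sums / self.T, 0.0)   # value irrelevant when T = 0
            conf = self._psi_inv(self.rho * logt / self.T)          # = inf for unplayed (k, m)
        B = means + conf + self.zetas                               # B^{(m)}_{k,t} from eq. (3)
        arm = int(np.argmax(np.min(B, axis=1)))                     # argmax_k min_m B^{(m)}_{k,t}
        # Step 2: lowest fidelity whose confidence width is still >= gamma^(m); else play M.
        fid = self.M - 1
        for m in range(self.M - 1):
            if conf[arm, m] >= self.gammas[m]:
                fid = m
                break
        return arm, fid

    def update(self, arm, fid, reward):
        self.T[arm, fid] += 1
        self.sums[arm, fid] += reward
```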
We will be primarily concerned with the term $\tilde{R}(\Lambda, \mathcal{A}) = \tilde{R}(\Lambda)$ from (1). $\tilde{r}(\Lambda, \mathcal{A})$ is a residual term; it is an artefact of the fact that after the $(N+1)$th play, the spent capital would have exceeded $\Lambda$. For any algorithm that operates oblivious to a fixed capital, it can be bounded by $\lambda^{(M)}\mu_\star$, which is negligible compared to $\tilde{R}(\Lambda)$. According to the above, we have the following expression for $\tilde{R}(\Lambda)$:

$$\tilde{R}(\Lambda) = \sum_{k \in \mathcal{K}} \Delta^{(M)}_k \sum_{m=1}^{M} \lambda^{(m)} T^{(m)}_{k,N}.$$

Central to our analysis will be the following partitioning of $\mathcal{K}$. First denote the set of arms whose fidelity $m$ mean is within $\eta$ of $\mu_\star$ to be $\mathcal{J}^{(m)}_\eta = \{k \in \mathcal{K} :\, \mu_\star - \mu^{(m)}_k \le \eta\}$. Define $\mathcal{K}^{(1)} \triangleq \overline{\mathcal{J}^{(1)}_{\zeta^{(1)}+2\gamma^{(1)}}} = \{k \in \mathcal{K} :\, \Delta^{(1)}_k > 2\gamma^{(1)}\}$ to be the arms whose first fidelity mean $\mu^{(1)}_k$ is at least $\zeta^{(1)} + 2\gamma^{(1)}$ below the optimum $\mu_\star$. Then we recursively define,

$$\mathcal{K}^{(m)} \;\triangleq\; \overline{\mathcal{J}^{(m)}_{\zeta^{(m)}+2\gamma^{(m)}}} \;\cap\; \bigcap_{\ell=1}^{m-1} \mathcal{J}^{(\ell)}_{\zeta^{(\ell)}+2\gamma^{(\ell)}}, \quad m \le M-1, \qquad\qquad \mathcal{K}^{(M)} \;\triangleq\; \mathcal{K} \;\cap\; \bigcap_{\ell=1}^{M-1} \mathcal{J}^{(\ell)}_{\zeta^{(\ell)}+2\gamma^{(\ell)}}.$$

Observe that for all $k \in \mathcal{K}^{(m)}$, $\Delta^{(m)}_k > 2\gamma^{(m)}$ and $\Delta^{(\ell)}_k \le 2\gamma^{(\ell)}$ for all $\ell < m$. For what follows, for any $k \in \mathcal{K}$, $\llbracket k\rrbracket$ will denote the partition $k$ belongs to, i.e. $\llbracket k\rrbracket = m$ such that $k \in \mathcal{K}^{(m)}$. We will see that $\mathcal{K}^{(m)}$ are the arms that will be played at the $m$th fidelity but can be excluded from fidelities higher than $m$ using information at fidelity $m$. See Fig. 1 for an illustration of these partitions.

[Figure 1: Illustration of the partition $\mathcal{K}^{(m)}$'s for an $M = 4$ fidelity problem. The sets $\mathcal{J}^{(m)}_{\zeta^{(m)}+2\gamma^{(m)}}$ are indicated next to their boundaries. $\mathcal{K}^{(1)}, \mathcal{K}^{(2)}, \mathcal{K}^{(3)}, \mathcal{K}^{(4)}$ are shown in yellow, green, red and purple respectively. The optimal arms $\mathcal{K}_\star$ are shown as a black circle.]

### 4.1 Regret Bound for MF-UCB

Recall that $N = \sum_{m=1}^{M} Q^{(m)}_N$ is the total (random) number of plays by a multi-fidelity strategy within capital $\Lambda$. Let $n_\Lambda = \Lambda/\lambda^{(M)}$ be the (non-random) number of plays by any strategy that operates only on the highest fidelity. Since $\lambda^{(m)} < \lambda^{(M)}$ for all $m < M$, $N$ could be large for an arbitrary multi-fidelity method. However, our analysis reveals that for MF-UCB, $N \lesssim n_\Lambda$ with high probability. The following theorem bounds $R(\Lambda)$ for MF-UCB. The proof is given in Appendix A. For clarity, we ignore the constants, but they are fleshed out in the proofs.

**Theorem 2** (Regret Bound for MF-UCB). Let $\rho > 4$. There exists $\Lambda_0$ depending on the $\lambda^{(m)}$'s such that for all $\Lambda > \Lambda_0$, MF-UCB satisfies,

$$\mathbb{E}[R(\Lambda)] \;\lesssim\; \log(n_\Lambda) \sum_{k \notin \mathcal{K}_\star} \Delta^{(M)}_k \Bigg[\frac{\lambda^{(\llbracket k\rrbracket)}}{\psi\big(\Delta^{(\llbracket k\rrbracket)}_k\big)} \;+\; \sum_{m < \llbracket k\rrbracket} \frac{\lambda^{(m)}}{\psi\big(\gamma^{(m)}\big)}\Bigg].$$

Let us compare the above bound to UCB, whose regret is $\mathbb{E}[R(\Lambda)] \asymp \log(n_\Lambda)\sum_{k \notin \mathcal{K}_\star} \Delta^{(M)}_k\, \frac{\lambda^{(M)}}{\psi(\Delta^{(M)}_k)}$. We will first argue that MF-UCB does not do significantly worse than UCB in the worst case. Modulo the $\Delta^{(M)}_k\log(n_\Lambda)$ terms, the regret for MF-UCB due to arm $k$ is $R_{k,\mathrm{MF\text{-}UCB}} \asymp \lambda^{(\llbracket k\rrbracket)}/\psi(\Delta^{(\llbracket k\rrbracket)}_k)$. Consider any $k \in \mathcal{K}^{(m)}$, $m < M$, for which $\Delta^{(m)}_k > 2\gamma^{(m)}$. Since $\Delta^{(M)}_k \le \Delta^{(\llbracket k\rrbracket)}_k + 2\zeta^{(\llbracket k\rrbracket)} \lesssim \psi^{-1}\Big(\frac{\lambda^{(\llbracket k\rrbracket+1)}}{\lambda^{(\llbracket k\rrbracket)}}\psi\big(\Delta^{(\llbracket k\rrbracket)}_k\big)\Big)$, a (loose) lower bound for UCB on the same quantity is $R_{k,\mathrm{UCB}} \asymp \lambda^{(M)}/\psi(\Delta^{(M)}_k) \gtrsim \frac{\lambda^{(M)}}{\lambda^{(\llbracket k\rrbracket+1)}}\, R_{k,\mathrm{MF\text{-}UCB}}$. Therefore, for any $k \in \mathcal{K}^{(m)}$, $m < M$, MF-UCB is at most a constant times worse than UCB. However, whenever $\Delta^{(\llbracket k\rrbracket)}_k$ is comparable to or larger than $\Delta^{(M)}_k$, MF-UCB outperforms UCB by a factor of $\lambda^{(\llbracket k\rrbracket)}/\lambda^{(M)}$ on arm $k$.

As can be inferred from the theorem, most of the cost invested by MF-UCB on arm $k$ is at the $\llbracket k\rrbracket$th fidelity. For example, in Fig. 1, MF-UCB would not play the yellow arms $\mathcal{K}^{(1)}$ beyond the first fidelity (more than a constant number of times). Similarly, all green and red arms are played mostly at the second and third fidelities respectively. Only the purple arms are played at the fourth (most expensive) fidelity. On the other hand, UCB plays all arms at the fourth fidelity. Since lower fidelities are cheaper, MF-UCB achieves better regret than UCB. It is essential to note here that $\Delta^{(M)}_k$ is small for arms in $\mathcal{K}^{(M)}$. These arms are close to the optimum and require more effort to distinguish than arms that are far away. MF-UCB, like UCB, invests $\log(n_\Lambda)\lambda^{(M)}/\psi(\Delta^{(M)}_k)$ capital in those arms. That is, the multi-fidelity setting does not help us significantly with the hard-to-distinguish arms. That said, in cases where $K$ is very large and the set $\mathcal{K}^{(M)}$ is small, the bound for MF-UCB can be appreciably better than that for UCB.
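The partition $\{\mathcal{K}^{(m)}\}$ and the index $\llbracket k\rrbracket$ are determined by the fidelity means, the $\zeta^{(m)}$'s and the thresholds $\gamma^{(m)}$ from (4). The sketch below (function name and array layout are our own) assigns each arm its partition index by scanning for the first fidelity at which $\Delta^{(m)}_k > 2\gamma^{(m)}$; arms that are never eliminated at a lower fidelity, including the optimal arms, receive $\llbracket k\rrbracket = M$.

```python
import numpy as np

def partition_arms(mu, zetas, gammas):
    """Assign each arm its partition index [[k]] according to the sets K^(m).

    mu     : (K, M) array of fidelity means mu_k^(m); last column is fidelity M
    zetas  : length-M array of zeta^(m) (zeta^(M) = 0)
    gammas : length-(M-1) array of thresholds gamma^(m) from eq. (4)
    """
    K, M = mu.shape
    mu_star = mu[:, -1].max()                      # optimal highest-fidelity mean
    bracket = np.full(K, M)                        # default: [[k]] = M
    for k in range(K):
        for m in range(M - 1):
            delta = mu_star - mu[k, m] - zetas[m]  # Delta_k^(m)
            if delta > 2 * gammas[m]:              # k in K^(m): eliminable at fidelity m
                bracket[k] = m + 1                 # store as a 1-indexed fidelity
                break
    return bracket
```

For the sub-Gaussian $\psi$ used in the earlier sketch, the thresholds would be `gammas = np.sqrt(costs[:-1] / costs[1:]) * zetas[:-1]`.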
### 4.2 Lower Bound

Since $N \lesssim n_\Lambda = \Lambda/\lambda^{(M)}$, any multi-fidelity strategy which plays a suboptimal arm a polynomial number of times at any fidelity after $n$ time steps will have worse regret than MF-UCB (and UCB). Therefore, in our lower bound we will only consider strategies which satisfy the following condition.

**Assumption 3.** Consider the strategy after $n$ plays at any fidelity. For any arm with $\Delta^{(M)}_k > 0$, we have $\mathbb{E}\big[\sum_{m=1}^{M} T^{(m)}_{k,n}\big] \in o(n^a)$ for any $a > 0$.

For our lower bound we will consider a set of Bernoulli distributions $\theta^{(m)}_k$ for each fidelity $m$ and each arm $k$ with mean $\mu^{(m)}_k$. It is known that for Bernoulli distributions $\psi(\epsilon) \in \Theta(\epsilon^2)$ [14]. To state our lower bound we will further partition each set $\mathcal{K}^{(m)}$ into two sets $\mathcal{K}^{(m)}_{-}, \mathcal{K}^{(m)}_{+}$ as follows,

$$\mathcal{K}^{(m)}_{-} = \{k \in \mathcal{K}^{(m)} :\, \Delta^{(\ell)}_k \le 0 \ \ \forall\, \ell < m\}, \qquad \mathcal{K}^{(m)}_{+} = \{k \in \mathcal{K}^{(m)} :\, \exists\, \ell < m \text{ s.t. } \Delta^{(\ell)}_k > 0\}.$$

For any $k \in \mathcal{K}^{(m)}$, our lower bound, given below, is different depending on which of these sets $k$ belongs to.

**Theorem 4** (Lower bound for $R(\Lambda)$). Consider any set of Bernoulli reward distributions with $\mu_\star \in (1/2, 1)$ and $\zeta^{(1)} < 1/2$. Then, for any strategy satisfying Assumption 3, the following holds.

$$\liminf_{\Lambda \to \infty} \frac{\mathbb{E}[R(\Lambda)]}{\log(n_\Lambda)} \;\ge\; c \sum_{m=1}^{M}\, \sum_{k \in \mathcal{K}^{(m)}} \Delta^{(M)}_k \,\min_{\ell \in \mathcal{L}_m(k)} \frac{\lambda^{(\ell)}}{\big(\Delta^{(\ell)}_k\big)^2}.$$

Here $c$ is a problem dependent constant. $\mathcal{L}_m(k) = \{\ell < m :\, \Delta^{(\ell)}_k > 0\} \cup \{m\}$ is the union of the $m$th fidelity and all fidelities smaller than $m$ for which $\Delta^{(\ell)}_k > 0$.

Comparing this with Theorem 2, we find that MF-UCB meets the lower bound on all arms $k \in \mathcal{K}^{(m)}_{-}$, for all $m$. However, it may be loose on arms $k \in \mathcal{K}^{(m)}_{+}$. The gap can be explained as follows. For $k \in \mathcal{K}^{(m)}_{+}$, there exists some $\ell < m$ such that $0 < \Delta^{(\ell)}_k < 2\gamma^{(\ell)}$. As explained previously, the switching criterion of MF-UCB ensures that we do not invest too much effort trying to distinguish whether $\Delta^{(\ell)}_k < 0$, since $\Delta^{(\ell)}_k$ could be very small. That is, we proceed to the next fidelity only if we cannot conclude $\Delta^{(\ell)}_k \gtrsim \gamma^{(\ell)}$. However, since $\lambda^{(m)} > \lambda^{(\ell)}$, it might be the case that $\lambda^{(\ell)}/(\Delta^{(\ell)}_k)^2 \ll \lambda^{(m)}/(\Delta^{(m)}_k)^2$ even though $\Delta^{(m)}_k > 2\gamma^{(m)}$. Consider for example a two fidelity problem where $\Delta = \Delta^{(1)}_k = \Delta^{(2)}_k < 2\sqrt{\lambda^{(1)}/\lambda^{(2)}}\,\zeta^{(1)}$. Here it makes sense to distinguish the arm as being suboptimal at the first fidelity with $\lambda^{(1)}\log(n_\Lambda)/\Delta^2$ capital, instead of $\lambda^{(2)}\log(n_\Lambda)/\Delta^2$ at the second fidelity. However, MF-UCB distinguishes this arm at the higher fidelity, since $\Delta < 2\gamma^{(1)}$, and therefore does not meet the lower bound on this arm. While it might seem tempting to switch based on estimates of $\Delta^{(1)}_k, \Delta^{(2)}_k$, this idea is not desirable, as estimating $\Delta^{(2)}_k$ for an arm requires $\log(n_\Lambda)/\psi(\Delta^{(2)}_k)$ samples at the second fidelity; this is exactly what we are trying to avoid for the majority of the arms via the multi-fidelity setting. We leave it as an open problem to resolve this gap.

| | $k \in \mathcal{K}^{(1)}$ | $k \in \mathcal{K}^{(2)}$ | $\cdots$ | $k \in \mathcal{K}^{(m)}$ | $\cdots$ | $k \in \mathcal{K}^{(M)}$ | $k \in \mathcal{K}_\star$ |
|---|---|---|---|---|---|---|---|
| $\mathbb{E}[T^{(1)}_{k,n}]$ | $\frac{\log(n)}{\psi(\Delta^{(1)}_k)}$ | $\frac{\log(n)}{\psi(\gamma^{(1)})}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(1)})}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(1)})}$ | $\frac{\log(n)}{\psi(\gamma^{(1)})}$ |
| $\mathbb{E}[T^{(2)}_{k,n}]$ | $O(1)$ | $\frac{\log(n)}{\psi(\Delta^{(2)}_k)}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(2)})}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(2)})}$ | $\frac{\log(n)}{\psi(\gamma^{(2)})}$ |
| $\vdots$ | | | | | | | |
| $\mathbb{E}[T^{(m)}_{k,n}]$ | $O(1)$ | $O(1)$ | $\cdots$ | $\frac{\log(n)}{\psi(\Delta^{(m)}_k)}$ | $\cdots$ | $\frac{\log(n)}{\psi(\gamma^{(m)})}$ | $\frac{\log(n)}{\psi(\gamma^{(m)})}$ |
| $\vdots$ | | | | | | | |
| $\mathbb{E}[T^{(M)}_{k,n}]$ | $O(1)$ | $O(1)$ | $\cdots$ | $O(1)$ | $\cdots$ | $\frac{\log(n)}{\psi(\Delta^{(M)}_k)}$ | $\Omega(n)$ |

Table 1: Bounds on the expected number of plays for each $k \in \mathcal{K}^{(m)}$ (columns) at each fidelity (rows) after $n$ time steps (i.e. $n$ plays at any fidelity) in MF-UCB.
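To make Theorem 4 concrete, the small helper below (our own, hypothetical) evaluates the per-arm term $\Delta^{(M)}_k\min_{\ell\in\mathcal{L}_m(k)}\lambda^{(\ell)}/(\Delta^{(\ell)}_k)^2$, ignoring the constant $c$ and the $\log(n_\Lambda)$ factor.

```python
import numpy as np

def lower_bound_term(deltas, costs, m):
    """Per-arm term from Theorem 4 for an arm in K^(m) (up to c and log(n_Lambda)).

    deltas : length-M array of Delta_k^(1..M)
    costs  : length-M array of lambda^(1..M)
    m      : 1-indexed partition index [[k]] of the arm
    """
    L = [l for l in range(m - 1) if deltas[l] > 0] + [m - 1]     # L_m(k), 0-indexed fidelities
    return deltas[-1] * min(costs[l] / deltas[l] ** 2 for l in L)
```

For the two-fidelity example above, with $\Delta^{(1)}_k = \Delta^{(2)}_k = \Delta > 0$, this evaluates to $\Delta^{(M)}_k\,\lambda^{(1)}/\Delta^2$, whereas MF-UCB spends order $\lambda^{(2)}\log(n_\Lambda)/\Delta^2$ on that arm, which is precisely the gap discussed above.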
## 5 Proof Sketches

### 5.1 Theorem 2

First we analyse MF-UCB after $n$ plays (at any fidelity) and control the number of plays of an arm at various fidelities depending on which $\mathcal{K}^{(m)}$ it belongs to. To that end we prove the following.

**Lemma 5** (Bounding $\mathbb{E}[T^{(m)}_{k,n}]$; informal). After $n$ time steps of MF-UCB, for any $k \in \mathcal{K}$,

$$\mathbb{E}\big[T^{(\ell)}_{k,n}\big] \lesssim \frac{\log(n)}{\psi(\gamma^{(\ell)})} \ \ \text{for } \ell < \llbracket k\rrbracket, \qquad \mathbb{E}\big[T^{(\llbracket k\rrbracket)}_{k,n}\big] \lesssim \frac{\log(n)}{\psi\big(\Delta^{(\llbracket k\rrbracket)}_k/2\big)}, \qquad \mathbb{E}\big[T^{(>\llbracket k\rrbracket)}_{k,n}\big] \in O(1).$$

The bounds above are illustrated in Table 1. Let $R_k(\Lambda) = \Delta^{(M)}_k\sum_{m=1}^{M} \lambda^{(m)} T^{(m)}_{k,N}$ be the regret incurred due to arm $k$ and $R_{kn} = \mathbb{E}[R_k(\Lambda)\,|\,N = n]$. Using Lemma 5 we have,

$$R_{kn} \;\lesssim\; \Delta^{(M)}_k\Bigg[\sum_{\ell < \llbracket k\rrbracket}\frac{\lambda^{(\ell)}\log(n)}{\psi(\gamma^{(\ell)})} \;+\; \frac{\lambda^{(\llbracket k\rrbracket)}\log(n)}{\psi\big(\Delta^{(\llbracket k\rrbracket)}_k/2\big)}\Bigg] + o(1). \qquad (7)$$

The next step will be to control the number of plays $N$ within capital $\Lambda$, which will bound $\mathbb{E}[\log(N)]$. While $\Lambda/\lambda^{(1)}$ is an easy bound, we will see that for MF-UCB, $N$ will be on the order of $n_\Lambda = \Lambda/\lambda^{(M)}$. For this we will use the following high probability bounds on $T^{(m)}_{k,n}$.

**Lemma 6** (Bounding $\mathbb{P}(T^{(m)}_{k,n} > \cdot\,)$; informal). After $n$ time steps of MF-UCB, for any $k \in \mathcal{K}$,

$$\mathbb{P}\bigg(T^{(\llbracket k\rrbracket)}_{k,n} > \frac{x\log(n)}{\psi\big(\Delta^{(\llbracket k\rrbracket)}_k/2\big)}\bigg) \lesssim \frac{1}{n^{x\rho - 1}}, \qquad \mathbb{P}\Big(T^{(>\llbracket k\rrbracket)}_{k,n} > x\Big) \lesssim \frac{1}{x^{\rho - 2}}.$$

We bound the number of plays at fidelities less than $M$ via Lemma 6 and obtain $n/2 > \sum_{m=1}^{M-1} Q^{(m)}_n$ with probability greater than, say, $1 - \delta$, for all $n \ge n_0$. By setting $\delta = 1/\log(\Lambda/\lambda^{(1)})$, we get $\mathbb{E}[\log(N)] \lesssim \log(n_\Lambda)$. The actual argument is somewhat delicate since $\delta$ depends on $\Lambda$. This gives us an expression for the regret due to arm $k$ of the form (7), with $n$ replaced by $n_\Lambda$. Then we argue that the regret incurred by an arm $k$ at fidelities less than $\llbracket k\rrbracket$ (first term on the RHS of (7)) is dominated by $\lambda^{(\llbracket k\rrbracket)}/\psi(\Delta^{(\llbracket k\rrbracket)}_k)$ (second term). This is possible due to the design of the sets $\mathcal{K}^{(m)}$ and Assumption 1. While Lemmas 5 and 6 require only $\rho > 2$, we need $\rho > 4$ to ensure that $\sum_{m=1}^{M-1} Q^{(m)}_n$ remains sublinear when we plug in the probabilities from Lemma 6. $\rho > 2$ is attainable with a more careful design of the sets $\mathcal{K}^{(m)}$. The $\Lambda > \Lambda_0$ condition is needed because initially MF-UCB is playing at lower fidelities and, for small $\Lambda$, $N$ could be much larger than $n_\Lambda$.

### 5.2 Theorem 4

First we show that for an arm $k$ with $\Delta^{(p)}_k > 0$ and $\Delta^{(\ell)}_k \le 0$ for all $\ell < p$, any strategy satisfying Assumption 3 must satisfy

$$\mathbb{E}[R_k(\Lambda)] \;\gtrsim\; \log(n_\Lambda)\,\Delta^{(M)}_k \min_{\ell \le p,\ \Delta^{(\ell)}_k > 0} \frac{\lambda^{(\ell)}}{\big(\Delta^{(\ell)}_k\big)^2},$$

where $R_k$ is the regret incurred due to arm $k$. The proof uses a change of measure argument. The modification has Bernoulli distributions with means $\tilde{\mu}^{(\ell)}_k$, $\ell = 1, \dots, M$, where $\tilde{\mu}^{(\ell)}_k = \mu^{(\ell)}_k$ for all $\ell < m$. Then we push $\tilde{\mu}^{(\ell)}_k$ slightly above $\mu_\star - \zeta^{(\ell)}$ from $\ell = m$ all the way to $M$, where $\tilde{\mu}^{(M)}_k > \mu_\star$. To control the probabilities after changing to $\tilde{\mu}^{(\ell)}_k$ we use the conditions in Assumption 3. Then for $k \in \mathcal{K}^{(m)}_{-}$ we argue that $\lambda^{(\ell)}/(\Delta^{(\ell)}_k)^2 \gtrsim \lambda^{(m)}/(\Delta^{(m)}_k)^2$, using, once again, the design of the sets $\mathcal{K}^{(m)}$. This yields the separate results for $k \in \mathcal{K}^{(m)}_{-}$ and $k \in \mathcal{K}^{(m)}_{+}$.

## 6 Some Simulations on Synthetic Problems

We compare UCB against MF-UCB on a series of synthetic problems. The results are given in Figure 2. Due to space constraints, the details of these experiments are given in Appendix C. Note that MF-UCB outperforms UCB on all these problems. Critically, note that the gradient of its curve is also smaller than that of UCB, corroborating our theoretical insights. We have also illustrated the number of plays by MF-UCB and UCB at each fidelity for one of these problems. The arms are arranged in increasing order of their $\mu^{(M)}_k$ values. As predicted by our analysis, most of the very suboptimal arms are only played at the lower fidelities. As lower fidelities are cheaper, MF-UCB is able to invest more higher fidelity plays at arms close to the optimum than UCB.

[Figure 2: Simulation results on the synthetic problems. The first four panels compare UCB against MF-UCB, plotted against the expended capital $\Lambda$, on four synthetic problems; each panel title states $K$, $M$ and the costs $\lambda^{(1)}, \dots, \lambda^{(M)}$: ($K = 500$, $M = 3$, costs $[1; 10; 100]$), ($K = 500$, $M = 4$, costs $[1; 5; 20; 50]$), ($K = 200$, $M = 2$, costs $[1; 10]$), and ($K = 1000$, $M = 5$, costs $[1; 3; 10; 30; 100]$). The first two used Gaussian rewards and the last two used Bernoulli rewards. The last two panels show the number of plays at each fidelity by UCB and MF-UCB on the $K = 500$, $M = 3$ problem with Gaussian observations (corresponding to the first panel).]
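The details of the experimental setup are in Appendix C; the driver below is only a hedged sketch of how such a comparison could be set up, assuming the `MFUCB` class sketched in Section 3 is in scope and using Gaussian rewards whose fidelity means respect the $\zeta^{(m)}$ bounds. The problem size, costs, $\zeta^{(m)}$'s and function names are illustrative choices of ours, not the configurations used in the paper's experiments; plain UCB is obtained as the single-fidelity special case of the same class.

```python
import numpy as np

def make_problem(K, zetas, rng):
    """Random instance: highest-fidelity means in [0, 1]; each lower-fidelity mean is
    perturbed by at most zeta^(m), so |mu^(M)_k - mu^(m)_k| <= zeta^(m)."""
    mu_M = rng.uniform(0.0, 1.0, size=K)
    return np.column_stack([mu_M + rng.uniform(-z, z, size=K) for z in zetas])

def run(policy, mu, costs, Lambda, sigma, rng):
    """Play until the capital Lambda is exhausted; return the pseudo-regret R~(Lambda)."""
    mu_star, regret, spent = mu[:, -1].max(), 0.0, 0.0
    while True:
        arm, fid = policy.choose()
        if spent + costs[fid] > Lambda:
            return regret
        spent += costs[fid]
        regret += costs[fid] * (mu_star - mu[arm, -1])            # lambda^(m_t) * r_t
        policy.update(arm, fid, rng.normal(mu[arm, fid], sigma))  # sample from theta^(m_t) of the arm

rng = np.random.default_rng(0)
K, costs, zetas, sigma = 100, [1.0, 10.0, 100.0], [0.2, 0.1, 0.0], 0.5
mu = make_problem(K, zetas, rng)
mf_ucb = MFUCB(K, costs, zetas, sigma=sigma)         # the MF-UCB sketch from Section 3
ucb = MFUCB(K, costs[-1:], zetas[-1:], sigma=sigma)  # plain UCB: only the top fidelity
print("MF-UCB regret:", run(mf_ucb, mu, costs, Lambda=5e4, sigma=sigma, rng=rng))
print("UCB regret   :", run(ucb, mu[:, -1:], costs[-1:], Lambda=5e4, sigma=sigma, rng=rng))
```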
## 7 Conclusion

We studied a novel framework for exploration-exploitation trade-offs when cheaper approximations to a desired experiment are available. We proposed an algorithm for this setting, MF-UCB, based on upper confidence bound techniques. It uses the cheap lower fidelity plays to eliminate several bad arms and reserves the expensive high fidelity queries for a small set of arms with high expected reward, hence achieving better regret than strategies which ignore multi-fidelity information. We complement this result with a lower bound which demonstrates that MF-UCB is near optimal.

Other settings for bandit problems with multi-fidelity evaluations might warrant different definitions of the regret. For example, consider a gold mining robot where each high fidelity play is a real world experiment with the robot and incurs cost $\lambda^{(2)}$, while a vastly cheaper computer simulation, which incurs cost $\lambda^{(1)}$, approximates the robot's real world behaviour. In applications like this, $\lambda^{(1)} \ll \lambda^{(2)}$. However, unlike in our setting, lower fidelity plays may not yield any rewards (as simulations do not yield actual gold). Similarly, in clinical trials the regret due to a bad treatment at the high fidelity would be, say, a dead patient, whereas a bad treatment at a lower fidelity may not warrant as large a penalty. These settings are quite challenging and we wish to work on them going forward.

## References

[1] Rajeev Agrawal. Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem. Advances in Applied Probability, 1995.
[2] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits. Theoretical Computer Science, 2009.
[3] Peter Auer. Using Confidence Bounds for Exploitation-exploration Trade-offs. Journal of Machine Learning Research, 2003.
[4] Yoram Baram, Ran El-Yaniv, and Kobi Luz. Online Choice of Active Learning Algorithms. Journal of Machine Learning Research, 5:255–291, 2004.
[5] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 2012.
[6] Mark Cutler, Thomas J. Walsh, and Jonathan P. How. Reinforcement Learning with Multi-Fidelity Simulators. In IEEE International Conference on Robotics and Automation (ICRA), 2014.
[7] D. Huang, T. T. Allen, W. I. Notz, and R. A. Miller. Sequential Kriging Optimization Using Multiple-fidelity Evaluations. Structural and Multidisciplinary Optimization, 2006.
[8] Kirthevasan Kandasamy, Gautam Dasarathy, Junier Oliva, Jeff Schneider, and Barnabás Póczos. Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations. In Advances in Neural Information Processing Systems, 2016.
[9] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 1985.
[10] Dev Rajnarayan, Alex Haas, and Ilan Kroo. A Multifidelity Gradient-free Optimization Method and Application to Aerodynamic Design. In AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2008.
[11] Herbert Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, 1952.
[12] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.
[13] Long Tran-Thanh, Lampros C.
Stavrogiannis, Victor Naroditskiy, Valentin Robu, Nicholas R. Jennings, and Peter Key. Efficient Regret Bounds for Online Bid Optimisation in Budget-Limited Sponsored Search Auctions. In UAI, 2014.
[14] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2010.
[15] Yingce Xia, Haifang Li, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Thompson Sampling for Budgeted Multi-Armed Bandits. In IJCAI, 2015.
[16] Chicheng Zhang and Kamalika Chaudhuri. Active Learning from Weak and Strong Labelers. In Advances in Neural Information Processing Systems, 2015.