# Multi-Fidelity Multi-Armed Bandits Revisited

Xuchuang Wang (Chinese University of Hong Kong) xcwang@cse.cuhk.edu.hk; Qingyun Wu (Pennsylvania State University) qingyun.wu@psu.edu; Wei Chen (Microsoft Research) weic@microsoft.com; John C.S. Lui (Chinese University of Hong Kong) cslui@cse.cuhk.edu.hk

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

We study the multi-fidelity multi-armed bandit (MF-MAB), an extension of the canonical multi-armed bandit (MAB) problem. MF-MAB allows each arm to be pulled with different costs (fidelities) and observation accuracy. We study both the best arm identification with fixed confidence (BAI) and the regret minimization objectives. For BAI, we present (a) a cost complexity lower bound, (b) an algorithmic framework with two alternative fidelity selection procedures, and (c) both procedures' cost complexity upper bounds. From both cost complexity bounds of MF-MAB, one can recover the standard sample complexity bounds of the classic (single-fidelity) MAB. For regret minimization of MF-MAB, we propose a new regret definition, prove its problem-independent regret lower bound Ω(K^{1/3}Λ^{2/3}) and problem-dependent lower bound Ω(K log Λ), where K is the number of arms and Λ is the decision budget in terms of cost, and devise an elimination-based algorithm whose worst-case regret upper bound matches its corresponding lower bound up to some logarithmic terms and whose problem-dependent bound matches its corresponding lower bound in terms of Λ.

1 Introduction

The multi-armed bandit (MAB) problem was first introduced in the seminal work by Lai and Robbins [22] and has since been extensively studied (see Bubeck and Cesa-Bianchi [4], Lattimore and Szepesvári [24]). In stochastic MAB, the decision maker repeatedly pulls arms among a set of K ∈ ℕ+ arms and observes rewards drawn from the unknown distributions of the pulled arms. The initial objective of MAB is regret minimization [22, 1], where the regret is the cumulative difference between the rewards from always pulling the optimal arm and those obtained by the concerned algorithm's arm-pulling strategy. This is a fundamental sequential decision-making framework for studying the exploration-exploitation trade-off, where one needs to balance between optimistically exploring arms with high uncertainty in reward (exploration) and myopically exploiting arms with high empirical average reward (exploitation). Another task of MAB is best arm identification (BAI, a.k.a. pure exploration), later introduced by Even-Dar et al. [7, 8], Mannor and Tsitsiklis [27]. Best arm identification aims to find the best arm without considering the cumulative regret (reward) incurred during the learning process. In this paper, we focus on the fixed confidence case, with the goal of identifying the best arm at a given confidence level using as few decision rounds as possible.

In this paper, we investigate a generalized MAB model with a wide range of real-world applications: the multi-fidelity multi-armed bandit (MF-MAB) introduced by Kandasamy et al. [19]. We study both the regret minimization and best arm identification objectives under the MF-MAB model. The MF-MAB model introduces flexibility for exploring arms: instead of simply pulling an arm to observe its reward sample as in MAB, MF-MAB offers M ∈ ℕ+ different accesses to explore an arm, called M fidelities. Pulling an arm k ∈ K := {1, . . . , K} at fidelity m ∈ M := {1, . . .
, M}, the decision maker pays a cost of λ(m) and observes a reward X(m) k whose mean µ(m) k is not too far away from the arm s true reward mean µk. More formally, |µ(m) k µk| ζ(m) where ζ(m) 0 is the error upper bound at fidelity m. The formal definition of MF-MAB is presented in Section 2. 1.1 Contributions In Section 3, we first look into the best arm identification with fixed confidence (BAI) objective in MF-MAB, which has wide applications in hyperparameter optimization [15, 1.4] and neural architecture search (NAS) [6]. In the context of NAS, each arm of MF-MAB corresponds to one configuration of neural network architecture, and the different fidelities of pulling this arm correspond to training the neural network of this configuration with different sample sizes of training data (see details in Section 2.1). As in MF-MAB pulling an arm at different fidelities has different costs, we consider the total cost needed by an algorithm and refer to it as cost complexity (a generalization of sample complexity). We first derive a cost complexity lower bound Ω(P k K minm: (m) k >0(λ(m)/( (m) k )2) log δ 1), where the (m) k is the reward gap of arm k at fidelity m (defined in Eq.(1)) and δ is the confidence parameter. We then devise a Lower-Upper Confidence Bound (LUCB) algorithmic framework with two different fidelity selection procedures and prove both procedures cost complexity upper bounds. The first procedure focuses on finding the optimal fidelity suggested by the minimization in lower bound via an upper confidence bound (UCB) algorithm, which may pay high additional costs during searching the optimal fidelities. The second procedure, instead of identifying the optimal fidelity, seeks for good fidelities that are at least half as good as the optimal ones so as to reduce the cost for identifying the optimal fidelity while still enjoying fair theoretical performance. In Section 4, we next study the regret minimization objective in MF-MAB. We introduce a novel definition of regret, wherein the rewards obtained are dependent on the pulled arms, while the selected fidelities solely affect the accuracy of the observed rewards. The new regret definition covers applications like product management and advertisement distribution [13, 10] where the distributed advertisement (ad) determines the actual (but unknown) reward while the cost (fidelity) spent by the company on evaluating the impact of this ad decides the accuracy of the observed rewards (see Remark 4.1). The difference of our regret definition and the one studied by Kandasamy et al. [19] is discussed in Remark 4.2. We first propose both a problem-independent (a.k.a., worst-case) regret lower bound Ω(K1/3Λ2/3) and a problem-dependent regret lower bound Ω(K log Λ) where the Λ (> 0) is the decision budget. We then devise an elimination-based algorithm, which explores (and eliminates) arms in the highest fidelity M and exploits the remaining arms in the lowest fidelity 1. The algorithm enjoys a O(K1/3Λ2/3) problem-independent regret upper bound, which matches the corresponding lower bound up to some logarithmic factor, and also a problem-dependent bound matching its corresponding lower bound tightly in a class of MF-MAB instances. 1.2 Related Works Multi-fidelity multi-armed bandits (MF-MAB) was first proposed by Kandasamy et al. [19]. They studied a cumulative regret minimization setting whose regret definition is different from ours (see Remark 4.2 for more details). Later, Kandasamy et al. 
[18] extended MF-MAB to bandits optimization, i.e., in continuous action space, with the objective of minimizing simple regret, and Kandasamy et al. [20] further extended Kandasamy et al. [18] to the continuous fidelity space with the same objective of minimizing simple regret. We note that the simple regret minimization in bandits optimization corresponds to the best arm identification with fixed budget in multi-armed bandits, and, therefore, is different from the cumulative regret minimization and best arm identification with fixed confidence objectives studied in this paper. During the submission of this work, the authors were aware that the BAI task under MF-MAB model was also studied by Poiani et al. [30], where they proposed a similar cost complexity lower bound to our Theorem 3.1 and their proposed algorithm is similar to our EXPLORE-C procedure in Appendix D. Except for these two results, all other results proposed in this paper, including two BAI algorithms in Section 3.2 and the regret minimization results in Section 4, are novel. The multi-fidelity optimization was first introduced in simulating expensive environments via cheap surrogate models [14, 11], and, later on, was extended to many real-world applications, e.g., shape design optimization for ship [3] or wing [32], and wind farm optimization [31]. Recently, the multifidelity idea was employed in hyperparameter optimization (HPO) for automated machine learning (auto ML) [16, 26, 9, 25]. The fidelity may correspond to training time, data set subsampling, and feature subsampling, etc. These multi-fidelity settings help agents to discard some hyperparameter configurations with low cost (a.k.a., early discarding [28, 3.1.3]). Jamieson and Talwalkar [16] modeled hyperparameter optimization as non-stochastic best arm identification and applied the successive halving algorithm (SHA) to address it. Li et al. [26] introduced Hyperband, a hyperparameter configuring and resources allocation method which employed SHA as its subroutine to solve real HPO tasks. Falkner et al. [9] proposed BOHB, which use Bayesian optimization method to select and update hyperparameters and employ SHA to allocate resources. Li et al. [25] extended SHA to asynchronous case for parallel hyperparameter optimization. This line of works was based on the non-stochastic best arm identification problem [16] and, therefore, is very different from our stochastic modelling. We consider a K ( N+)-armed bandit. Each arm can be pulled with M ( N+) different fidelities. When an arm k K := {1, . . . , K} is pulled at fidelity m M := {1, . . . , M}, one pays a cost λ(m) (> 0) and observes a feedback value drawn from a [0, 1]-supported probability distribution with mean µ(m) k . We assume the non-trivial case that |µ(m) k µ(M) k | ζ(m), where ζ(m) 0 is the observation error upper bound at fidelity m. Without loss of generality, we assume λ(1) . . . λ(M); otherwise, we can relabel the fidelity so that the λ(m) is non-decreasing with respect to m. We call the reward mean of an arm at the highest fidelity M as this arm s true reward mean and, without loss of generality, assume that these true reward means are in a descending order µ(M) 1 > µ(M) 2 µ(M) 3 . . . µ(M) K , and arm 1 is the unique optimal arm. 
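To make the model concrete, the following is a minimal simulator sketch of an MF-MAB instance with Bernoulli observations; the class name, the array layout, and the Bernoulli choice are our own illustrative assumptions and not part of the formal model.

```python
import numpy as np

class MFMABInstance:
    """Minimal MF-MAB simulator sketch (illustrative, not the paper's code).

    mu[k, m] : reward mean of arm k observed at fidelity m; mu[k, M-1] is the true mean.
    cost[m]  : cost lambda^(m) of one pull at fidelity m, non-decreasing in m.
    zeta[m]  : error bound with |mu[k, m] - mu[k, M-1]| <= zeta[m] and zeta[M-1] = 0.
    """

    def __init__(self, mu, cost, zeta, seed=0):
        self.mu = np.asarray(mu, dtype=float)        # shape (K, M)
        self.cost = np.asarray(cost, dtype=float)    # shape (M,)
        self.zeta = np.asarray(zeta, dtype=float)    # shape (M,)
        self.K, self.M = self.mu.shape
        self.rng = np.random.default_rng(seed)
        # sanity-check the modelling assumptions of Section 2
        assert np.all(np.abs(self.mu - self.mu[:, [-1]]) <= self.zeta + 1e-12)
        assert np.all(np.diff(self.cost) >= 0), "costs must be non-decreasing in m"

    def pull(self, k, m):
        """Pull arm k at fidelity m: pay cost[m] and observe a Bernoulli(mu[k, m]) sample."""
        reward = float(self.rng.random() < self.mu[k, m])
        return reward, self.cost[m]
```

For instance, the toy instances listed in Appendix B (Tables 1 and 2) can be fed directly into such a simulator as the mu, cost, and zeta arrays.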
For the online learning problem, we assume that the reward distributions and the means µ_k^(m) of each arm k at each fidelity m are unknown, while the costs λ^(m) and the error upper bounds ζ^(m) are known to the learning agent.¹ To summarize, an MF-MAB instance I is parameterized by a tuple (K, M, (λ^(m))_{m∈M}, (ζ^(m))_{m∈M}, (µ_k^(m))_{(k,m)∈K×M}), with the elements described above. In this paper, we consider two tasks on the above multi-fidelity bandit model: best arm identification and regret minimization. We define these two tasks in their respective sections below.

¹We provide application examples where λ^(m) and ζ^(m) are known in Section 2.1.

2.1 Application

One typical application of the best arm identification problem is hyperparameter optimization [15, 6] (including neural architecture search) for machine learning, in which the goal is to identify the best hyperparameter configuration, i.e., the training set-up for a machine learning model attaining the best predictive performance, with as few resources as possible. A mapping between concepts in this application and MF-MAB is as follows.

Arm: Hyperparameter configurations of machine learning models, e.g., neural network architectures.

Reward: Predictive performance of the resulting machine learning model trained with the selected configuration (arm).

Fidelity dimension: For a particular hyperparameter configuration (arm), one typically has the choice to determine a certain level of resources to allocate for training the model. This training resource can be considered the fidelity dimension. More concretely, commonly used training resources include the number of epochs and the number of training data samples, both of which satisfy our cost assumption: the larger the number of epochs or training data samples, the more expensive it is to train the model.

Remark 2.1 (On the assumptions of multi-fidelity feedback in the hyperparameter optimization application). The observation error upper bound ζ^(m) is the maximum distance between the validation loss attained with the resources allocated at fidelity m and the terminal validation loss. According to benchmarked results in two recent benchmarks for multi-fidelity hyperparameter optimization, HPOBench [5] and YAHPO Gym [29], under the typically used fidelity dimensions (the number of epochs, the number of training data samples, etc.), this maximum distance from the terminal validation loss often decreases as more resources are allocated, i.e., ζ^(m) decreases with the increase of m. Thanks to these benchmarks, it is also convenient to look up ζ^(m) under different fidelities m for commonly used types of fidelity dimension.

3 Best Arm Identification with Fixed Confidence

In this section, we study the best arm identification with fixed confidence (BAI) task in the multi-fidelity multi-armed bandit (MF-MAB) model. The objective of BAI is to minimize the total budget spent on identifying the best arm with a confidence of at least 1 − δ. As the cost of pulling an arm in MF-MAB depends on the chosen fidelity, we use the total cost, instead of the total number of pulls (sample complexity), as our criterion, and refer to it as cost complexity. We first present a cost complexity lower bound in Section 3.1, then propose a LUCB algorithmic framework with two alternative procedures in Section 3.2, and analyze their cost complexity upper bounds in Section 3.3. Concrete applications of the BAI problem were introduced in Section 2.1.

3.1 Cost Complexity Lower Bound

We present the cost complexity lower bound in Theorem 3.1. Its proof is deferred to Appendix E.1.
During the submission of this paper, we notice that a similar cost complexity lower bound was also proposed by Poiani et al. [30, Theorem 1]. Theorem 3.1 (Cost complexity lower bound). For any algorithm addressing the best arm identification with fixed confidence 1 δ for any parameter δ > 0, any number of arms K, any number of fidelities M with any observation error upper bound sequence (ζ(1), ζ(2), . . . , ζ(M)) (ζ(M) = 0) and any cost sequence (λ(1), . . . , λ(M)), and any K fidelity subsets {M1, M2, . . . , MK} where the Mk is a subset of full fidelity set M containing the highest fidelity M, i.e., M Mk 2M, for all k K, there exists a MF-MAB instance such that min m M1 λ(m) KL(ν(m) 1 , ν(M) 2 + ζ(m)) + X k =1 min m Mk λ(m) KL(ν(m) k , ν(M) 1 ζ(m)) log 1 2.4δ , where KL is the KL-divergence between two probability distributions, ν(m) k is the probability distribution associated with arm k when pulled at fidelity m, and ν ζ means to positively/negatively shift the distribution ν by an offset ζ. According to the KL-divergences two inputs in the above lower bound, we define the reward gap (m) k of arm k at fidelity m as follows, ( µ(M) 1 (µ(m) k + ζ(m)) k = 1 (µ(m) 1 ζ(m)) µ(M) 2 k = 1 . (1) For any suboptimal arm k = 1, the gap (m) k quantifies the distance between the optimal arm s reward mean µ(M) 1 and the suboptimal arm s reward mean upper bound at fidelity m, i.e., µ(m) k +ζ(m); and for the optimal arm 1, the gap (m) 1 represents the distance between the optimal arm s reward mean lower bound at fidelity m, i.e., µ(m) 1 ζ(m), and the second-best arm s reward mean µ(M) 2 . Remark 3.2. If all reward distributions are assumed to be Bernoulli or Gaussian and let Mk = {m : (m) k > 0}, the regret bound can be simplified as k K min m: (m) k >0 ( (m) k )2 log 1 where C > 0 is a constant depending on the reward distributions assumed. Especially, we denote m k := arg minm: (m) k >0 λ(m) ( (m) k )2 , and with some algebraic transformations, it can be expressed as m k = arg max m M Algorithm 1 LUCB Framework for Multi-Fidelity BAI 1: Input: violation probability δ 2: Initialization: ˆµ(m) k,t = 0, N (m) k,t = 0, UCBk,t = 1, LCBk,t = 0 for all arm k and fidelity m, and ℓt = 1, ut = 2, t = 1, µ(M) 1 , µ(M) 2 . 3: while LCBℓt,t UCBut,t do 4: ℓt arg max k K UCBk,t, ut arg max k K\{ℓt} UCBk,t Select top two UCB indices 5: EXPLORE(ut) and EXPLORE(ℓt) Any procedure of Algorithm 2 6: Output: arm ℓt With m k, the lower bound in Eq.(2)can be rewritten as E[Λ] C P k K(λ(m k)/( (m k) k )2) log(1/δ), and we define the coefficient as H := P k K λ(m k)/( (m k) k )2. The m k can be interpreted as the optimal (most efficient) fidelity for exploring arm k. We note that the current lower bound in Eq.(2) does not contain the cost of finding this m k. This cost can be observed in our algorithm s cost complexity upper bound stated in Section 3.3. 3.2 Algorithm Design In this subsection, we propose a Lower-Upper Confidence Bound (LUCB) algorithmic framework with two alternative procedures which employ different mechanisms to select suitable fidelities for arm exploration. Generalized from the original (single-fidelity) LUCB algorithm [17], the LUCB algorithmic framework in 3.2.1 determines two critical arms (the empirical optimal arm ℓt and second-best arm ut) for exploration in each time slot t. However, in MF-MAB, with the critical arms suggested by the LUCB framework, one still needs to decide the fidelities for exploring the critical arms. 
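Since this fidelity choice is driven by the gap and cost structure of the instance, the following sketch computes the reward gaps Δ_k^(m) of Eq. (1), the per-arm optimal fidelity m*_k, and the hardness quantity H of Remark 3.2 for a known instance; it assumes the numpy array layout of the simulator sketch in Section 2 and that arm index 0 is the optimal arm (arm 1 in the paper's notation).

```python
import numpy as np

def reward_gaps(mu, zeta):
    """Gaps Delta_k^(m) of Eq. (1). mu has shape (K, M); arm 0 is the optimal arm
    and mu[:, -1] are the true means mu_k^(M)."""
    mu, zeta = np.asarray(mu, dtype=float), np.asarray(zeta, dtype=float)
    true_means = mu[:, -1]
    best, second = np.sort(true_means)[::-1][:2]
    gaps = np.empty_like(mu)
    gaps[0] = (mu[0] - zeta) - second        # optimal arm: (mu_1^(m) - zeta^(m)) - mu_2^(M)
    gaps[1:] = best - (mu[1:] + zeta)        # suboptimal arms: mu_1^(M) - (mu_k^(m) + zeta^(m))
    return gaps

def optimal_fidelity_and_hardness(mu, cost, zeta):
    """m*_k = argmin_{m: Delta_k^(m) > 0} lambda^(m) / (Delta_k^(m))^2 and
    H = sum_k lambda^(m*_k) / (Delta_k^(m*_k))^2, as in Remark 3.2."""
    cost = np.asarray(cost, dtype=float)
    gaps = reward_gaps(mu, zeta)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(gaps > 0, cost / gaps**2, np.inf)   # lambda^(m) / (Delta_k^(m))^2
    m_star = ratio.argmin(axis=1)
    H = ratio[np.arange(gaps.shape[0]), m_star].sum()
    return m_star, H
```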
This fidelity selection faces an accuracy-cost trade-off, i.e., higher fidelity (accuracy) but suffering higher cost, or lower fidelity (accuracy) but enjoying lower cost. This trade-off is different from the common exploration-exploitation trade-off in classic bandits. Because the accuracy and costs are two orthogonal metrics, while in exploration-exploitation trade-off, there is only a single regret metric. We address the accuracy-cost trade-off in 3.2.2 with two alternative procedures. 3.2.1 LUCB Algorithmic Framework The main idea of LUCB [17] is to repeatedly select and explore two critical arms, that is, the empirical optimal arm ℓt and the empirical second-best arm ut. When both critical arms confidence intervals are separated ℓt s LCB (lower confidence bound) is greater than ut s UCB (upper confidence bound), the algorithm terminates and outputs the estimated optimal arm ℓt. The LUCB framework depends on a set of meaningful confidence intervals of the arms rewards. Usually, the confidence interval of an empirical mean estimate ˆµ(m) k can be expressed as (ˆµ(m) k β(N (m) k,t , t, δ), ˆµ(m) k + β(N (m) k,t , t, δ)), where β(N (m) k,t , t, δ) is the confidence radius and N (m) k,t is the number of times of pulling arm k at fidelity m up to time t (include t). As MF-MAB assumes that |µ(m) k µ(M) k | ζ(m), based on observations of fidelity m, the upper and lower confidence bounds for arm k s true reward mean at the highest fidelity µ(M) k can be expressed as follows, UCB(m) k,t := ˆµ(m) k,t + ζ(m) + β(N (m) k,t , t, δ), LCB(m) k,t := ˆµ(m) k,t ζ(m) β(N (m) k,t , t, δ), (4) where we set the confidence radius β(n, t, δ) = p log(Lt4/δ)/n and L (> 0) is a factor. Since the multi-fidelity feedback allows one to estimate an arm s true reward mean with observations of every fidelity, we pick the tightest one as arm k s final confidence bounds as follows, UCBk,t = min m M UCB(m) k,t , LCBk,t = max m M LCB(m) k,t . (5) We use the above UCBk,t and LCBk,t formulas to select the two critical arms in each round and decide when to terminate the LUCB (Line 3). We present the LUCB framework in Algorithm 1. The next step is to decide fidelities for exploring both critical arms in each round ( Line 5). 3.2.2 Exploration Procedures To address the accuracy-cost trade-off in fidelity selections, we devise a UCB-type policy which finds the optimal fidelity m k in Eq.(3) for each arm k (EXPLORE-A) and an explore-then-commit policy stopping at a good fidelity that is at least half as good as the optimal one (EXPLORE-B). Notations. The lower bound in Eq.(2) implies that there exists an optimal fidelity m k for exploring arm k. As Eq.(3) shows, the optimal fidelity m k maximizes the (m) k / λ(m), where the (m) k defined in Eq.(1) consists of two unknown reward means: µ(m) k and µ(M) 1 (or µ(M) 2 ). That is, calculating (m) k for all k needs the top two arms reward means µ(M) 1 and µ(M) 2 which are unknown a priori. To address the issue, we assume the knowledge of an upper bound of the optimal arm s reward mean µ(M) 1 and a lower bound of the second-best arm s reward mean µ(M) 2 (see Remark 3.3 for how to obtain µ(M) 1 and µ(M) 2 in real-world applications). 
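Before defining the ancillary quantities, the following sketch spells out how the confidence bounds of Eqs. (4)-(5) and the critical-arm selection of Algorithm 1 are computed from the running statistics; it reuses the array conventions of the earlier sketches and is illustrative rather than the paper's implementation.

```python
import numpy as np

def confidence_bounds(mu_hat, pulls, zeta, t, delta, L):
    """Eqs. (4)-(5): per-fidelity UCB/LCB on the true mean mu_k^(M), then the tightest
    bounds across fidelities. mu_hat and pulls are (K, M) arrays of empirical means
    and pull counts; unpulled (k, m) pairs get vacuous bounds."""
    pulls = np.asarray(pulls, dtype=float)
    with np.errstate(divide="ignore"):
        beta = np.sqrt(np.log(L * t**4 / delta) / pulls)   # confidence radius beta(N, t, delta)
    ucb = (mu_hat + zeta + beta).min(axis=1)               # UCB_{k,t} = min_m UCB^(m)_{k,t}
    lcb = (mu_hat - zeta - beta).max(axis=1)               # LCB_{k,t} = max_m LCB^(m)_{k,t}
    return ucb, lcb

def critical_arms(ucb):
    """Algorithm 1, Line 4: the arms with the largest and second-largest UCB indices."""
    order = np.argsort(ucb)[::-1]
    return order[0], order[1]

def lucb_terminates(ucb, lcb, l_t, u_t):
    """Algorithm 1, Line 3: stop (and output l_t) once LCB_{l_t,t} > UCB_{u_t,t}."""
    return lcb[l_t] > ucb[u_t]
```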
With the µ(M) 1 and µ(M) 2 replaced by µ(M) 1 and µ(M) 2 , we define the ancillary reward gaps (m) k and the ancillary optimal fidelity m k as follows, ( µ(M) 1 (µ(m) k + ζ(m)) k = 1 (µ(m) k ζ(m)) µ(M) 2 k = 1 , and m k := arg max m M Especially, we have m k m k because the cost λ(m) is non-decreasing and the replacement enlarges the numerator of the arg max item in Eq.(3), and, when the bounds µ(M) 1 and µ(M) 2 are close the the true reward means, we have m k = m k; hence, as (m k) k > 0, we assume ( m k) k > 0 for all arms k as well. To present the next two procedures, we define the estimate of (m) k as follows, ˆ (m) k,t := ( µ(M) 1 (ˆµ(m) k,t + ζ(m)) k = ℓt (ˆµ(m) k,t ζ(m)) µ(M) 2 k = ℓt , where the ℓt is the estimated optimal arm by LUCB. For simplify, we omit the input ℓt for ˆ (m) k,t (ℓt) in the LHS of this definition and thereafter. EXPLORE-A We devise the (m) k / λ(m) s upper confidence bounds (f-UCB) as follows, for any fidelity m M and arms k K, f-UCB(m) k,t := ˆ (m) k,t / 2 log Nk,t/(λ(m)N (m) k,t ), where the Nk,t := P m M N (m) k,t is the total number of times of pulling arm k up to time t. Whenever the arm k is selected by LUCB, we pick the fidelity m that maximizes its f-UCB(m) k,t to explore it (see Line 2 in Algorithm 2). For any arm k, this policy guarantees that most of the arm s pulling are on its estimated optimal fidelity m k, or formally, N (m) k,t = O(log(N ( m k) k,t )) for any fidelity m = m k as Lemma E.2 in Appendix shows. Therefore, EXPLORE-A spends most cost on the fidelity m k. EXPLORE-B The cost of finding the estimated optimal fidelity m k in EXPLORE-A can be large. To avoid this cost, we devise another approach that stops exploration when finding a good fidelity ˆm k. That is, instead of finding the m k that maximizes (m) k / λ(m), we stop at a fidelity ˆm k whose (m) k / λ(m) is at least half as large as that of m k, i.e., λ( m k) . (6) We prove that the above inequality holds for ˆm k = arg maxm M ˆ (m) k,t / λ(m) when the condition in Line 10 of Algorithm 2 holds (see Lemma E.3). Hence, for each arm k, EXPLORE-B explores it at all fidelities uniformly, and, when the condition in Line 10 holds, EXPLORE-B finds a good fidelity ˆm k and keeps choosing ˆm k for exploring arm k since then. Remark 3.3 (On the bounds µ(M) 1 and µ(M) 2 utilized in Algorithm 2 in the hyperparameter optimization application). Although the exact reward means are typically not accessible, it is easy to get a good approximation of them satisfying our requirements based on domain knowledge. For example, Algorithm 2 EXPLORE Procedures 1: procedure EXPLORE-A(k) 2: mk,t arg maxm M f-UCB(m) k,t 3: Pull (k, mk,t), observe reward, and update corresponding statistics 4: procedure EXPLORE-B(k) 5: if is Fixedk then is Fixedk is initialized as False 6: Pull (k, ˆm k), observe reward, and update corresponding statistics 7: else 8: for each fidelity m do 9: Pull (k, m), observe reward, and update corresponding statistics 10: if maxm M ˆ (m) k,t log(L/δ) λ(1)N (m) k,t then 11: ˆm k arg maxm M λ(m) and is Fixedk True in an image classification task, an easy approximation of the best arm s reward means is to use the reward from a perfect classification, i.e., µ(M) 1 = 1.0, which clearly satisfies µ(M) 1 µ(M) 1 . For the reward of the second-best arm, we can use the performance of a commonly used model that has a fairly good performance based on benchmarked results as a good approximation. 
One can easily find the benchmarked performance of commonly used models on a wide range of well-defined machine learning tasks on the Papers with Code website.² For novel tasks without well-benchmarked results, one pragmatic way to get µ̲_2^(M) is to use the result of a particular default machine learning model without any tuning.

²https://paperswithcode.com/sota

3.3 Cost Complexity Upper Bound Analysis

In the following, we present the cost complexity upper bounds of Algorithm 1 with the above two procedures, respectively, in Theorem 3.4.

Theorem 3.4 (Cost complexity upper bounds for Algorithm 1 with the procedures in Algorithm 2). Given L ≥ 4KM, Algorithm 1 outputs the optimal arm with probability at least 1 − δ. The cost complexities of Algorithm 1 with the different fidelity selection procedures in Algorithm 2 are upper bounded as follows,

(EXPLORE-A)  E[Λ] = O( H̃ log( L(H̃ + G̃) / (λ^(1)δ) ) + G̃ log log( L(H̃ + G̃) / (λ^(1)δ) ) ),   (7)

(EXPLORE-B)  E[Λ] = O( ( Σ_{m∈M} λ^(m)/λ^(1) ) · H̃ log( H̃L / (λ^(1)δ) ) ),   (8)

where H̃ := Σ_{k∈K} λ^(m̃*_k) / (Δ̃_k^(m̃*_k))², and G̃ := Σ_{k∈K} Σ_{m≠m̃*_k} ( Δ̃_k^(m̃*_k)/√λ^(m̃*_k) − Δ̃_k^(m)/√λ^(m) )^{−2}.

Figure 1: EXPLORE-A vs. EXPLORE-B. Panel (a) reports an instance where EXPLORE-A is better; panel (b) reports an instance where EXPLORE-B is better.

EXPLORE-A vs. EXPLORE-B. When G̃ = O(MH̃), the cost complexity upper bound of EXPLORE-A is less than that of EXPLORE-B (see Figure 1a). However, when G̃ is far larger than MH̃, EXPLORE-B is better (see Figure 1b). For example, when there is a fidelity m′ (≠ m̃*_k) whose Δ_k^(m′)/√λ^(m′) is very close to that of fidelity m̃*_k (the case in Figure 1b), this G̃ would be very large because EXPLORE-A needs to pay a high cost to distinguish fidelity m′ from m̃*_k; in this scenario, EXPLORE-B stops at either m′ or m̃*_k, since their Δ_k^(m)/√λ^(m) are similar, and therefore enjoys a smaller cost complexity upper bound. We report the numerical comparisons between both procedures in Figure 1. The detailed setup of the simulations is given in Appendix B.

Remark 3.5 (Tightness of cost complexity bounds). The first term of the cost complexity upper bound for EXPLORE-A in Eq.(7) matches the cost complexity lower bound in Eq.(2) up to a constant when m̃*_k = m*_k and H̃ = H (i.e., when the bounds µ̄_1^(M) and µ̲_2^(M) are close to their ground-truth values). The cost complexity upper bound of EXPLORE-B in Eq.(8) matches the lower bound with an additional Σ_{m∈M} λ^(m)/λ^(1) coefficient when m̃*_k = m*_k and H̃ = H.

Remark 3.6 (Comparison to the classic MAB's sample complexity). If we reduce our cost complexity upper bound results in MF-MAB to the classic (single-fidelity) MAB, i.e., letting M = 1 and λ^(m) = 1, then both cost complexity upper bounds reduce to O( Σ_k (1/Δ_k²) log(1/δ) ), where Δ_k := µ_1 − µ_k, which is exactly the classic sample complexity upper bound for (single-fidelity) BAI [27, 21].

4 Regret Minimization

In this section, we study the regret minimization objective: given a budget Λ ∈ ℝ+, minimize the regret, i.e., the cumulative difference between the optimal policy's rewards and an algorithm's. We define the reward obtained in each time slot as the pulled arm's true reward mean (realized at the highest fidelity, but unrevealed to the learner), no matter at which fidelity the arm is pulled, while the learner's observation depends on the pulled fidelity as Section 2 shows. Under this reward definition, the optimal policy is to constantly pull the optimal arm 1 at the lowest fidelity m = 1. Consequently, the expected regret can be expressed as follows,

E[R(Λ)] := (Λ/λ^(1)) µ_1^(M) − E[ Σ_{t=1}^{N} µ_{I_t}^(M) ],   (9)

where N := max{ n : Σ_{t=1}^{n} λ^(m_t) ≤ Λ } is the total number of time slots, and I_t is the arm pulled by the concerned algorithm at time slot t. Next, we illustrate the regret definition's real-world applications in Remark 4.1.
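Before turning to the applications, here is a small helper that makes Eq. (9) operational for a simulated run; the function and its argument names are illustrative only, and arm index 0 is assumed to be the optimal arm.

```python
def empirical_regret(pull_sequence, mu_true, cost, budget):
    """Regret of Eq. (9): (budget / cost[0]) * mu_true[0] minus the sum of the true
    means of the pulled arms, evaluated on the budget-limited prefix of the pulls.

    pull_sequence : iterable of (arm, fidelity) pairs chosen by an algorithm
    mu_true       : true means mu_k^(M); mu_true[0] is the optimal arm's mean
    cost          : per-fidelity costs lambda^(m); cost[0] = lambda^(1)
    """
    spent, collected = 0.0, 0.0
    for arm, fid in pull_sequence:
        if spent + cost[fid] > budget:      # N = max{n : sum_{t<=n} lambda^(m_t) <= budget}
            break
        spent += cost[fid]
        collected += mu_true[arm]           # reward is the arm's true mean, whatever the fidelity
    return (budget / cost[0]) * mu_true[0] - collected
```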
Remark 4.1 (Applications of the new regret definition). One typical application of the regret minimization problem under MF-MAB is a variant of the advertisement distribution problem [13, 10]. In this problem, the objective is to maximize the total return from all the distributed ads within a fixed marketing budget (e.g., in terms of money). We have the following mapping between the application-specific concepts and concepts in MF-MAB. Arm: The ads to distribute are the arms. Reward: The return from each of the ads, once distributed, is the corresponding ground-truth reward µ(M) k . Low fidelity: A minimum cost is needed every time any ad is distributed. For example, the minimum cost may include the necessary resource needed to ensure that the ad satisfies legal and regulatory requirements and to distribute the ad on the designated platform. This minimum cost can be considered the lowest fidelity cost, which can never be waived. High fidelity: since the expected return from different ads can be vastly different, one needs a good estimation on the expected return so as to select the profitable ads to distribute. The cost needed to get a reliable estimate of the expected return can be considered the highest fidelity cost. For example, doing a large-scale user study/survey, and/or consulting experts can give a good estimate of the expected return from the ads, which, however, is resource-consuming. This type of application is also common in production management with uncertainty where without knowing the expected return of the concerned products, the decision maker faces the two options of (1) spending the minimum resource needed to directly produce certain products; and (2) spending more resource to first get a good estimates of the expected returns from the different options and then put the ones with the highest expected returns into production. Remark 4.2 (Comparison to regret definition of Kandasamy et al. [19]). Kandasamy et al. [19] defined the per time slot reward as the pulled arms true reward mean multiplied by the cost, i.e., λ(mt)µ(M) It , and defined their regret as, E[R (Λ)] := Λµ(M) 1 E[PN t=1 λ(mt)µ(M) It ]. We note that multiplying the reward mean with the fidelity-level cost, λ(mt)µ(M) It , does not fit into the applications in Remark 4.1, and thus we provide an alternative definition in Eq.(9) to fit our needs. Comparing the formula of both regret definitions, we have E[R (Λ)] λ(1)E[R(Λ)]. Note that both regret definitions are very different, so as their bound analysis and algorithm design. We first present both the problem-independent (worst-case) and problem-dependent regret lower bounds in Section 4.1 and then devise an elimination algorithm whose worst-case upper bounds match the worst-case lower bound up to some logarithmic factors and whose problem-dependent upper bound matches the problem-dependent lower bound in a class of MF-MAB in Section 4.2. 4.1 Regret Lower Bound We present the problem-independent regret lower bound in Theorem 4.3 and the problem-dependent regret lower bound in Theorem 4.4. Both proofs are deferred to Appendix F.1 and F.2 respectively. Theorem 4.3 (Problem-independent regret lower bound). Given budget Λ, the regret of MF-MAB is lower bounded as follows, inf Algo sup I E [R(Λ)] Ω K1/3Λ2/3 , where the inf is over any algorithms, the sup is over any possible MF-MAB instances I. Theorem 4.4 (Problem-dependent lower bound). 
For any consistent policy that, after spending Λ budgets, fulfills that for any suboptimal arm k (with (M) k > 0) and any a > 0, E[N ( m) k (Λ)] = o(Λa), its regret is lower bounded by the following inequality, lim inf Λ E [R(Λ)] k K min m: (m) k >0 λ(1) µ(M) 1 µ(M) k ( (m) k )2 . 4.2 An Elimination Algorithm and Its Regret Upper Bound In this section, we propose an elimination algorithm for MF-MAB based on Auer and Ortner [1]. This algorithm proceeds in phases p = 0, 1, . . . and maintains a candidate arm set Cp. The set Cp is initialized as the full arm set K and the algorithm gradually eliminates arms from the set until there is only one arm remaining. When the candidate arm set contains more than one arms, the algorithm explores arms with the highest fidelity M, and when the set |Cp| = 1, the algorithm exploits the singleton in the set with the lowest fidelity m = 1. We present the detail in Algorithm 3. Algorithm 3 Elimination for MF-MAB 1: Input: full arm set K, budget Λ, and parameter ε 2: Initialization: phase p 0, candidate set Cp K 3: while p < log2 2 ε and |Cp| > 1 do 4: pull each arm k Cp in highest fidelity M such that T (M) k = 22p log Λ 22pλ(M) 5: Update reward means ˆµ(M) k,p for all arms k Cp 6: Cp+1 {k Cp:ˆµ(M) k,p + 2 p+1> maxk Cp ˆµ(M) k ,p } Elimination 7: p p + 1 8: Pull the remaining arms of Cp in turn in fidelity m = 1 until the budget runs up Remark 4.5 (Real-world implication of the 2-stage algorithm design). There are real-world applications, e.g., the advertisement distribution problem in Remark 4.1, where the explorations are conducted at the high fidelity (e.g., a large-scale user study) and the exploitations are conducted at the low fidelity (e.g., advertisement distribution via some platforms). This corroborates our algorithm design which also explores at high fidelity and exploits at low fidelity. On the other hand, the fact that our algorithm enjoys the tight regret performance comparing to regret lower bound (both problem-independent and -dependent, see Remarks 4.9 and 4.7) also implies that the approach of first conducting large-scale user study and then massive distributing good ads used in real-world advertisement distribution is reasonable. 4.2.1 Analysis Results We first present the problem-dependent regret upper bound of Algorithm 3 in Theorem 4.6. Its full proof is deferred to Appendix F.3. Theorem 4.6 (Problem-Dependent Regret Upper Bound for Algorithm 3). For any ε > 0. Algorithm 3 s regret is upper bounded as follows, E[R(Λ)] max k: (M) k ε Λ λ(1) (M) k k: (M) k >ε λ(1) µ(M) 1 µ(M) k ( (M) k )2 log Λ( (M) k )2 16λ(M) + 48 ( (M) k )2 + 1 λ(1) µ(M) 1 µ(M) k 3ε2 + 1 + 64 Especially, if letting ε go to zero and budget Λ go to infinity, the above upper bound becomes λ(1) µ(M) 1 µ(M) k ( (M) k )2 . (11) Remark 4.7 (Tightness of problem-dependent regret bounds). The problem-dependent regret bounds are tight for a class of MF-MAB instances. For instances fulfilling the condition that {m : (m) k > 0} = {M} for all arm k K, then the problem-dependent regret lower bound matches the upper bound. This kind of instances covers a vast number of real world scenarios. Because in practice, the highest fidelity M is often defined as the only fidelity where the optimal arm can be distinguished from the suboptimal arms. For example, in neural architecture search, the process of increasing the training sample size (fidelity) stops when one architecture performs much better than others. 
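To complement the pseudocode, here is a compact sketch of Algorithm 3's explore-at-the-highest-fidelity / exploit-at-the-lowest-fidelity structure, written against the simulator sketched in Section 2; the phase lengths and the elimination constant follow the pseudocode only loosely, so treat the constants as illustrative rather than the tuned values used in the analysis.

```python
import math
import numpy as np

def eliminate_then_exploit(env, budget, eps):
    """Sketch of Algorithm 3: explore and eliminate at the highest fidelity, then spend
    the remaining budget on the surviving arm(s) at the lowest fidelity. Returns the
    surviving candidates and the pull history [(arm, fidelity), ...], which can be
    scored with a helper like the empirical_regret sketch given after Eq. (9)."""
    M, K = env.M, env.K
    candidates, history, spent, p = list(range(K)), [], 0.0, 0

    while p < math.log2(2.0 / eps) and len(candidates) > 1 and spent < budget:
        width = 2.0 ** (-p)                                   # target accuracy of phase p
        n_p = max(1, math.ceil(math.log(max(budget * width**2 / env.cost[M - 1], math.e))
                               / width**2))                   # ~ 2^{2p} log(Lambda 2^{-2p} / lambda^(M))
        means = {}
        for k in candidates:
            samples = []
            for _ in range(n_p):
                r, _c = env.pull(k, M - 1)                    # explore at the highest fidelity
                samples.append(r)
                history.append((k, M - 1))
                spent += env.cost[M - 1]
            means[k] = float(np.mean(samples))
        best = max(means.values())
        candidates = [k for k in candidates if means[k] + 2 * width > best]   # elimination step
        p += 1

    while spent + env.cost[0] <= budget:                      # exploit at the lowest fidelity
        for k in candidates:
            if spent + env.cost[0] > budget:
                break
            env.pull(k, 0)
            history.append((k, 0))
            spent += env.cost[0]
    return candidates, history
```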
Letting ε = (K log Λ/Λ)1/3 in Eq.(10) of Theorem 4.6, one can obtain a problem-independent regret upper bound of Algorithm 3 in Theorem 4.8 as follows. Theorem 4.8 (Problem-Independent Regret Upper Bound for Algorithm 3). Letting ε = (K log Λ/Λ)1/3, Algorithm 3 s regret is upper bounded as follows, E[R(Λ)] O K1/3Λ2/3(log Λ)1/3 . Remark 4.9 (Tightness of problem-independent regret bounds). The problem-independent regret upper bound matches the problem-independent lower bound in terms of the number of arms K and up to some logarithmic factor in terms of the budget Λ. 5 Future Directions We note that the BAI s exploration procedures in Algorithm 2 require some prior knowledge of the top two arms reward mean estimates µ(M) 1 and µ(M) 2 as input. Although such prior knowledge is easy to access in many real world applications (see Remark 3.3), this is not a common assumption in bandits literature. Without relying on this prior knowledge, we devise a third procedure EXPLORE-C in Appendix D which starts from the lower fidelity and gradually increase fidelity when necessary. However, its cost complexity upper bound is incomparable to the lower bound in Eq.(2) and can be very large. Therefore, one interesting future direction is to devise BAI algorithms without this prior knowledge but still enjoying good theoretical performance. Another interesting future direction is to quantify the cost of identifying optimal fidelities and improve the current cost complexity lower bound in Eq.(2). Note that the second term of cost complexity upper bound for EXPLORE-A in Eq.(7) has no correspondence in the lower bound, so does the additional factor P m M λ(m)/λ(1) of the bound for EXPLORE-B in Eq.(8). These two additional terms may correspond to the cost of finding the optimal fidelity m k which is not accounted in the lower bound in Theorem 3.1. Acknowledgements We thank Quanlu Zhang for his insightful discussion on the potential applications of MF-MAB, especially, neural architecture search. The work of John C.S. Lui and Xuchuang Wang was supported in part by the RGC s GRF 14207721 and SRFS2122-4S02. [1] Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55 65, 2010. [2] Gábor Bartók, Navid Zolghadr, and Csaba Szepesvári. An adaptive algorithm for finite stochastic partial monitoring. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1779 1786, 2012. [3] L Bonfiglio, P Perdikaris, S Brizzolara, and GE Karniadakis. Multi-fidelity optimization of super-cavitating hydrofoils. Computer Methods in Applied Mechanics and Engineering, 332: 63 85, 2018. [4] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. [5] Katharina Eggensperger, Philipp Müller, Neeratyoy Mallik, Matthias Feurer, Rene Sass, Aaron Klein, Noor Awad, Marius Lindauer, and Frank Hutter. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=1k4r JYEwda-. [6] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997 2017, 2019. [7] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. 
Pac bounds for multi-armed bandit and markov decision processes. In COLT, volume 2, pages 255 270. Springer, 2002. [8] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006. [9] Stefan Falkner, Aaron Klein, and Frank Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pages 1437 1446. PMLR, 2018. [10] Paul W Farris, Dominique M Hanssens, James D Lenskold, and David J Reibstein. Marketing return on investment: Seeking clarity for concept and measurement. Applied Marketing Analytics, 1(3):267 282, 2015. [11] Alexander IJ Forrester, András Sóbester, and Andy J Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the royal society a: mathematical, physical and engineering sciences, 463(2088):3251 3269, 2007. [12] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377 399, 2019. [13] Benjamin Han and Jared Gabor. Contextual bandits for advertising budget allocation. Proceedings of the ADKDD, 17, 2020. [14] Deng Huang, Theodore T Allen, William I Notz, and R Allen Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369 382, 2006. [15] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019. [16] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial intelligence and statistics, pages 240 248. PMLR, 2016. [17] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655 662, 2012. [18] Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnabás Póczos. Gaussian process bandit optimisation with multi-fidelity evaluations. Advances in neural information processing systems, 29, 2016. [19] Kirthevasan Kandasamy, Gautam Dasarathy, Barnabas Poczos, and Jeff Schneider. The multifidelity multi-armed bandit. Advances in neural information processing systems, 29, 2016. [20] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. Multi-fidelity bayesian optimisation with continuous approximations. In International Conference on Machine Learning, pages 1799 1808. PMLR, 2017. [21] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1): 1 42, 2016. [22] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4 22, 1985. [23] Tor Lattimore and Csaba Szepesvári. Cleaning up the neighborhood: A full classification for adversarial partial monitoring. In Algorithmic Learning Theory, pages 529 556. PMLR, 2019. [24] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. [25] Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-Tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A system for massively parallel hyperparameter tuning. Proceedings of Machine Learning and Systems, 2:230 246, 2020. 
[26] Lisha Li, Kevin Jamieson, Giulia De Salvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765 6816, 2017. [27] Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623 648, 2004. [28] Felix Mohr and Jan N van Rijn. Learning curves for decision making in supervised machine learning a survey. ar Xiv preprint ar Xiv:2201.12150, 2022. [29] Florian Pfisterer, Lennart Schneider, Julia Moosbauer, Martin Binder, and Bernd Bischl. Yahpo gym-an efficient multi-objective multi-fidelity benchmark for hyperparameter optimization. In First Conference on Automated Machine Learning (Main Track), 2022. [30] Riccardo Poiani, Alberto Maria Metelli, and Marcello Restelli. Multi-fidelity best-arm identification. Advances in Neural Information Processing Systems, 35:17857 17870, 2022. [31] Pierre-Elouan Réthoré, Peter Fuglsang, Gunner C Larsen, Thomas Buhl, Torben J Larsen, and Helge A Madsen. Topfarm: Multi-fidelity optimization of wind farms. Wind Energy, 17(12): 1797 1816, 2014. [32] Lingxiao Zheng, Tyson L Hedrick, and Rajat Mittal. A multi-fidelity modelling approach for evaluation and optimization of wing stroke aerodynamics in flapping flight. Journal of Fluid Mechanics, 721:118 154, 2013. Supplementary Materials A Additional Remarks on Analysis A.1 Remarks on Best Arm Identification Remark A.1 (On choice among EXPLORE-A and EXPLORE-B). For when do we prefer EXPLORE-A than EXPLORE-B, one can informally say if G is small ( G = O(M H)), EXPLORE-A is preferred, and, otherwise if G is large, EXPLORE-B is preferred. Recall that λ( m k) (m) k To give an intuition of when G is large or small, let us neglect the tilde for simplicity and fix a suboptimal arm k. The term (m) k λ(m) corresponds to the effectiveness of pulling arm k at fidelity m for distinguishing suboptimal arm k. As EXPLORE-A aims to find out the optimal (most effective) fidelity m k for distinguishing suboptimal arm k, it needs to compare the effectiveness of different fidelities, that is, their effectiveness gap (m k) k λ(m k) (m) k λ(m) , which is similar to bandit s reward gap (reward mean difference) µ µk. Therefore, for this suboptimal arm k, if all its effectiveness gaps are large, that is, the effectiveness of its optimal fidelity is much larger than other suboptimal fidelities, then the contribution of this arm k to the quantity G would be small, and vice versa. If for all arms k K, all of their effectiveness gaps are large, then the G would be small; otherwise, G can be large. For a more sensible example, one can check the simulation set up (deferred to Appendix A) of the empirical comparison of EXPLORE-A and EXPLORE-B in Figure 1. In Figure 1a s set up, the effectiveness gaps are large and thus G is small, hence EXPLORE-A outperforms EXPLORE-B, while in Figure 1b s set up, the effectiveness gaps are small and thus G is large, then EXPLORE-B outperforms EXPLORE-A. Practically, one can just run both EXPLORE-A and -B in parallel for each arm k. If EXPLORE-A starts to converge on a fidelity, i.e., a large proportion of pull times on this arm is at this fidelity, we stop EXPLORE-B and continue EXPLORE-A. Otherwise, if EXPLORE-B commits to a fidelity at first, we stop EXPLORE-A and follow EXPLORE-B to commit this fidelity. 
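A minimal decision rule for this run-both heuristic might look as follows; the 0.5 concentration threshold and the per-arm bookkeeping are our own illustrative choices, not prescribed by the analysis.

```python
def which_procedure_to_keep(pulls_per_fidelity_A, committed_B, concentration=0.5):
    """Run-both heuristic of Remark A.1 for a single arm (illustrative sketch).

    pulls_per_fidelity_A : list of pull counts per fidelity accumulated by EXPLORE-A
    committed_B          : True once EXPLORE-B has committed to a fidelity hat{m}_k
    Returns "A", "B", or None (keep running both).
    """
    total = sum(pulls_per_fidelity_A)
    if total > 0 and max(pulls_per_fidelity_A) / total >= concentration:
        return "A"      # EXPLORE-A is concentrating its pulls on one fidelity: keep it
    if committed_B:
        return "B"      # EXPLORE-B committed first: follow its fixed fidelity
    return None
```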
With this heuristic approach, we can expect the empirical cost is close to the smaller one of running either EXPLORE-A or EXPLORE-B alone. Remark A.2 (On the choice of half in EXPLORE-B as good fidelity). The selection of half is just for the simplicity of presentation, and any constant between (0, 1) would work. If a constant C (0, 1) is used in Eq.(6), then the factor 3 of right hand side in the condition of Line 10 of EXPLORE-B becomes 1+C 1 C , and there would be an additional (C 2 + (1 C) 2) multiplier appearing in the cost complexity upper bound of EXPLORE-B in Eq.(8). In the (C 2 + (1 C) 2) multiplier, the first term C 2 factor is because a small C implies that EXPLORE-B may commit to a very inefficient suboptimal fidelity and thus suffer a high cost, while the second term (1 C) 2 corresponds to that if C is close to 1, then it becomes very difficult for EXPLORE-B to find a good fidelity to commit, and thus EXPLORE-B has to spend a large cost in exploration. A.2 Remarks on Regret Minimization Remark A.3 (Relation to partial monitoring). Recall that, under our regret definition in Eq.(9), the optimal policy is to pull the optimal arm 1 at the lowest fidelity, and any exploration at the highest fidelity results in nonzero regret. This clear separation of exploration and exploitation is similar to the hard case of partial monitoring [2, 23] where the decision maker needs to choose the globally observable actions with nonzero regret cost to do exploration and then, when exploiting, choose locally observable actions. The regret of the hard case of partial monitoring is Θ(T 2/3) where T is the decision round, which is also similar to our O(Λ2/3) regret bound. B Simulation Details of Figure 1 In Figure 1, we report the cost complexities of Algorithm 1 with EXPLORE-A and Algorithm 1 with EXPLORE-B (let µ(M) 1 = 0.95 and µ(M) 2 = 0.75). We set the confidence parameter δ as 0.05, 0.1, 0.15, 0.2, 0.25 respectively in comparing the performance of both procedures. For each simulation, we run 100 trials, plot their cost complexities mean as markers and their deviation as shaded regions. We present the parameters of MF-MAB instances of Figures 1a and 1b in Tables 1 and 2 respectively. We note that it is more difficult to find the optimal fidelity m k in the MF-MAB instances for Figure 1b because (1) there are more fidelities choices in this instance than that of Figure 1a; (2) the value of (m) k / λ(m) are closer in the second instance than that of Figure 1a. Table 1: Figure 1a s MF-MAB with K = 5 arms and M = 3 fidelities Parameters µ(m) 1 µ(m) 2 µ(m) 3 µ(m) 4 µ(m) 5 ζ(m) λ(m) m = 1 0.70 0.75 0.50 0.50 0.30 0.30 1 m = 2 0.80 0.775 0.60 0.55 0.45 0.15 1.1 m = 3 0.90 0.80 0.70 0.60 0.50 0 1.2 Table 2: Figure 1b s MF-MAB with K = 5 arms and M = 5 fidelities Parameters µ(m) 1 µ(m) 2 µ(m) 3 µ(m) 4 µ(m) 5 ζ(m) λ(m) m = 1 0.83 0.82 0.76 0.82 0.70 0.10 1 m = 2 0.84 0.83 0.80 0.80 0.72 0.08 1.1 m = 3 0.85 0.85 0.80 0.82 0.74 0.06 1.2 m = 4 0.85 0.86 0.80 0.80 0.76 0.04 1.3 m = 5 0.90 0.88 0.86 0.84 0.80 0 1.4 C Additional Experiments In this section, we present an additional numerical simulation to compare our two exploration policies, Explore-A and Explore-B, with a naive baseline that is always selecting the highest fidelity to explore arms. All three policies are employed under the LUCB algorithmic framework (Algorithm 1). This simulation aims to illustrate when our exploration policies Explore-A and Explore-B are better than the naive baseline. Experiment set up. 
We consider a MF-MAB model consisting of K = 5 arms and M = 5 fidelities. The reward means are set as µ(m) k = 0.1(k + 4) + 0.01(m 5), where, for example, µ(5) 5 = 0.9 is the true reward mean of optimal arm and its low-fidelity reward means are {0.89, 0.88, 0.87, 0.86} accordingly, etc. The error upper bound in different fidelities ζ(m) are {0.04, 0.03, 0.02, 0.01, 0}. We set the cost of fidelity m {1, 2, 3, 4} as λ(m) = m respectively, and consider four different cases I {1, 2, 3, 4} of the cost of highest fidelity λ(5) = 5I. That is, from Case 1 to Case 4, the only difference is that the cost of highest fidelity increases. We set the confidence parameter δ = 0.1 and run each experiment for 100 trials. Averaged over these 100 trials of each experiment, we present the empirical cost complexity in the table as follows, Table 3: Cost complexity under four different cases Cost Complexity ( 104) Case 1 Case 2 Case 3 Case 4 Explore-A 20.5 20.0 17.4 17.4 Explore-B 36.1 47.9 59.5 69.4 Baseline 21.4 42.0 64.6 86.2 Discussion of additional simulation result. In all four cases, the cost complexity of Explore-A does not change too much. Especially, when the cost of highest fidelity becomes larger, it becomes easier for Explore-A to find out the optimal fidelity (e.g., the highest fidelity is obviously not the optimal one) and hence enjoy a lower cost complexity. The cost complexity of both Explore-B and Baseline increases over these four cases. When the cost of highest fidelity is expensive, e.g., Cases 3 and 4, both Explore-A and Explore-B outperform Baseline. This highlights the advantage of utilizing low fidelity to explore arms when the cost of high fidelity is large. When the cost of highest fidelity is cheap, e.g., Case 1, the cost complexity of Explore-A is similar to Baseline, while Explore-B is worse than Baseline. This suggests that when the cost of highest fidelity is similar to that of lower fidelity, the Baseline may be preferred. In this comparison, Explore-A always outperforms Explore-B. For another scenario where Explore-B is better than Explore-B, please refer to Figure 1b. D A Third Fidelity Selection Procedure: EXPLORE-C Besides the EXPLORE-A and -B procedures, here we consider a third naïve and conservative idea for fidelity selection that one should start from low risk (cost), gradually increase the risk (cost) as the learning task needs, and stop when finding the optimal arm. To decide when to increase the fidelity for exploring an arm k, we use the arm s confidence radius β(N (m) k,t , t, δ) at fidelity m as a measure of the amount of information left in this fidelity, and when the fidelity m s confidence radius is less than the error upper bound ζ(m) at this fidelity, we increase the fidelity by 1 for higher accuracy, or formally, the fidelity is selected as follows, mk,t min n m β(N (m) k,t , t, δ) ζ(m) o . Algorithm 4 EXPLORE-C Procedures procedure EXPLORE-C(k) mk,t min n m β(N (m) k,t , t, δ) ζ(m) o Pull (k, mk,t), observe reward, and update corresponding statistics Theorem D.1 (Cost complexity upper bounds for Algorithm 1 with EXPLORE-C). Given L 4KM, Algorithm 1 outputs the optimal arm with a probability at least 1 δ. The cost complexity of Algorithm 1 with EXPLORE-C are upper bounded as follows, E[Λ] = O H log L(H + Q) + Q log L(H + Q) where, letting m k denote the smallest fidelity for arm k such that (m) k > 2ζ(m), or formally, m k := min{m : (m) k > 2ζ(m)}, and we denote ( (m k) k )2 , Q := X Remark D.2 (EXPLORE-C vs. EXPLORE-A and -B). 
The EXPLORE-C procedure does not require additional knowledge as the other two. It is a one-size-fits-all option. If with some addition information of a specific scenario, e.g., the exact or approximated reward means of top two arms, one can use the EXPLORE-A or B. E Proofs for Best Arm Identification with Fixed Confidence E.1 Proof of Theorem 3.1 Lemma E.1 (Kaufmann et al. [21], Lemma 1). Let ν and ν be two bandit models with K arms such that for all k, the distributions νk and ν k are mutually absolutely continuous. For any almost-surely finite stopping time σ with respect to the filtration {Ft}t N where Ft = σ(I1, X1, . . . , It, Xt), k=1 Eν[Nk(σ)] KL(νk, ν k) sup E Fσ kl(Pν(E), Pν (E)), where kl(x, y) is the binary relative entropy. In MF-MAB model, regarding each arm-fidelity (k, m)-pair as an individual arm, we can extend Lemma E.1 to multi-fidelity case as follows, m=1 Eν[N (m) k (σ)] KL(ν(m) k , ν (m) k ) sup E Fσ kl(Pν(E), Pν (E)). (13) Next, we construct instances ν and ν . We set the reward distributions ν = (ν(m) k )(k,m) K M as Bernoulli and the reward means fulfill µ(M) 1 > µ(M) 2 µ(M) 3 . . . µ(M) K , where µ(m) k = EX ν(m) k [X]. We let ν (m) k be the same to ν(m) k for all k and m, except for that an arm ℓ = 1. We set arm ℓ s reward means on fidelities m Mk to be ν (m) ℓ = ν(M) 1 ζ(m) + ϵ. So, in instance ν , the optimal arm is ℓand its true reward mean µ (M) ℓ is slightly greater than µ(M) 1 . This implies for the event E = {output arm 1} and any algorithm π that can find the optimal arm with a confidence 1 δ, Pν,π(E) 1 δ and Pν ,π(E) δ. Then, from Eq.(13), we have X m Mk Eν[N (m) ℓ (σ)] KL(ν(m) ℓ , ν (m) ℓ ) sup E Fσ kl(Pν,π(E), Pν ,π(E)) log 1 2.4δ . We rewrite the above inequality as follows, m Mk λ(m)Eν[N (m) ℓ (σ)] KL(ν(m) ℓ , ν (m) ℓ ) λ(m) log 1 2.4δ . Therefore, for the arm ℓ, our aim is to minimize its cost complexity with a constraint as follows, min E[N(m) ℓ ], m m=1 λ(m)Eν[N (m) ℓ (σ)] such that X m Mk λ(m)Eν[N (m) ℓ (σ)] KL(ν(m) ℓ , ν (m) ℓ ) λ(m) log 1 2.4δ . Note that the above is a linear programming (LP) and its optimum is reached at one of its polyhedron constraint s vertex only one E[N (m) ℓ ] is positive and all others are equal to zero. min E[N(m) ℓ ], m m=1 λ(m)Eν[N (m) ℓ (σ)] (a) min m Mk λ(m) KL(ν(m) ℓ , ν (m) ℓ ) log 1 2.4δ = min m Mk λ(m) KL(ν(m) ℓ , ν(M) 1 ζ(m) + ϵ) log 1 2.4δ (b) min m Mk λ(m) (1 + ε) KL(ν(m) ℓ , ν(M) 1 ζ(m)) log 1 2.4δ where the inequality (a) is due to the property of LP we mentioned above, and the inequality (b) is because of the continuity of KL-divergence. To bound the optimal arm 1 s cost complexity, we use the same ν as above and construct another instance ν . The instance ν s reward means are the same to ν except for arm 1 whose reward means for fidelity m M1 are set as µ (m) 1 = µ(m) 2 + ζ(m) ϵ. Then, with similar procedure as the above, we obtain min E[N(m) 1 ], m m=1 λ(m)Eν[N (m) 1 (σ)] min m M1 λ(m) (1 + ε) KL(ν(m) 1 , ν(M) 2 + ζ(m)) log 1 2.4δ . Summing up the above costs leads to the lower bound as follows, and letting the ϵ goes to zeros concludes the proof. min m M1 λ(m) (1 + ε) KL(ν(m) 1 , ν(M) 2 + ζ(m)) + X k =1 min m Mk λ(m) (1 + ε) KL(ν(m) k , ν(M) 1 ζ(m)) log 1 2.4δ . E.2 Proof of Theorem 3.4 Notation. Denote the threshold c = µ(M) 1 +maxk(µ (m k) k +ζ(m k)) 2 as the average between the optimal arm s reward mean and the highest reward mean upper bound of other arms k at fidelity m k. 
Denote At := {k K : LCBk,t > c} and Bt := {k K : UCBk,t < c} as the above and below sets which respectively contain arms whose rewards are clearly higher or lower than the threshold with high probability, and let Ct := K \ (At Bt) as the complement of both sets union. Then, we define two events as follows TERMt := {LCBℓt,t > UCBut,t}, CROSt := { k = 1 : k At} {1 Bt}. The TERMt event corresponds to the complement of the main while loop condition in the LUCB algorithm. When the TERMt event happens, the LUCB algorithm terminates. The CROSt event means there exists a suboptimal arm whose LCBk,t is greater than c or that the optimal arm 1 s UCB1,t is less than c, both of which means that at least one arm s reward mean confidence interval incorrectly crosses the threshold c. Step 1. Prove TERMt CROSt = (ℓt Ct) (ut Ct). We show this statement by contradiction case by case. That is, the negation of (ℓt Ct) (ut Ct) cannot happen when TERMt CROSt. Case 1: (ℓt At) (ut At) TERMt = (ℓt At) (ut At) = |At| 2 = k = 1 : k At = CROSt, Case 2: (ℓt Bt) (ut At) TERMt = UCBℓt,t < c < LCBut,t < UCBut,t = (contradicts the selection of ℓt and ut), Case 3: (ℓt At) (ut Bt) TERMt = {LCBℓt,t > c > UCBut,t} TERMt = , Case 4: (ℓt Bt) (ut Bt) TERMt = (ℓt Bt) (ut Bt) = |Bt| = K = 1 Bt = CROSt. Step 2. Prove P (CROSt) KMδ Lt3 . For any suboptimal arm k = 1, we bound the probability that the arm k is in At as follows, P(k At) = P(LCBk,t > c) = P max m M LCB(m) k,t > c X m M P LCB(m) k,t > c m M P ˆµ(m) k,t ζ(m) β(N (m) k,t , t) > c m M P ˆµ(m) k,t µ(m) k + (µ(m) k ζ(m) c) > β(N (m) k,t , t) m M P ˆµ(m) k,t µ(m) k > β(N (m) k,t , t) X n=1 P(ˆµ(m) k,t µ(m) k > β(n, t)) n=1 exp( n(β(n, t))2) = X where the inequality (a) is due to that µ(m) k ζ(m) µ(M) k < c. With similar derivation, we have P(1 Bt) Mδ Lt3 . Noticing that P(CROSt) P k =1 P(k At) + P(1 Bt), we have P(CROSt) KMδ Step 3. Prove P k K : (N (m k) k,t > 16N k,t) (k Midt) 16δ P k K 2 k Lt4 , where N k,t := k )2 . For any fixed suboptimal arm k = 1 (with µ(M) k < c), we have P (N (m k) k,t > 16N k,t) (k Midt) = P (N (m k) k,t > 16N k,t) (k At Bt) P (N (m k) k,t > 16N k,t) (UCBk,t > c) = P (N (m k) k,t > 16N k,t) min m M ˆµ(m) k,t + ζ(m) + β(N (m) k,t , t) > c P (N (m k) k,t > 16N k,t) (ˆµ(m k) k,t + ζ(m k) + β(N (m k) k,t , t) > c) P (N (m k) k,t > 16N k,t) (ˆµ(m k) k,t µ(m k) k > (c µ(m k) k ζ(m k)) β(N (m k) k,t , t)) (N (m k) k,t > 16N k,t) ˆµ(m k) k,t µ(m k) k > ( m k) k 2 β(N ( m k) k,t , t) τ>16N k,t P ˆµ(m k) k,t(τ) µ(m k) k > ( m k) k τ>16N k,t exp τ( ( m k) k )2 τ>16N k,t exp τ( ( m k) k )2 ( ( m k) k )2Lt4 , where inequality (a) is due to the definition of c, and inequality (b), denoting ˆµ(m k) k,t(τ) as the empirical mean of τ observations, is due to β(τ, t) < k 4 for τ > 16N k,t. From Step 3, we obtain that the following equation holds with high probability, N ( m k) k,t 16 ( ( m k) k )2 log Lt4 ( ( m k) k )2 log Lt Next, we respectively present the cost complexity upper bounds for different fidelity selection procedures in Algorithm 2. E.2.1 Proof for EXPLORE-A s Upper Bound Step 4 for EXPLORE-A: prove that if the small probability events of Steps 2 and 3 do not happen, then the algorithm terminates with a high probability when Λ is large. Lemma E.2. Give reward means µ(M) 1 and µ(M) 2 . For a fixed arm k, there exist Nk,t and αk > 0 such that when Nk,t > Nk,t, Nk,t < 2N ( m k) k,t , the number of times of pulling this arm k at fidelities m ( = m k) is O(log(log Nk,t)), or formally, N (m) k,t 8 λ(m) λ( m k) (m) k log Nk,t, m = m k. 
Combining Lemma E.2 with Eq.(14) in Step 3, we have, for any arm $k$ and fidelity $m\neq\bar m_k$,
$$N^{(m)}_{k,t}\ \le\ 8\,\frac{\lambda^{(m)}}{\lambda^{(\bar m_k)}\,\Delta^{(m)}_k}\,\log\Big(\frac{128}{(\Delta^{(\bar m_k)}_k)^2}\log\frac{Lt}{\delta}\Big). \qquad (16)$$
Next, we upper bound the total cost of the LUCB algorithm (before it terminates) via Eq.(14) and Eq.(16). Specifically, we show by contradiction that the algorithm cannot still be running once $\Lambda\ge C\big(H\log\frac{L(G+H)}{\lambda^{(1)}\delta}+G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\big)$. Suppose $\Lambda=C\big(H\log\frac{L(G+H)}{\lambda^{(1)}\delta}+G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\big)$ and the algorithm has not terminated; then
$$\sum_{k\in\mathcal K}\sum_{m\in\mathcal M}\lambda^{(m)}N^{(m)}_{k,t}=\sum_{k\in\mathcal K}\Big(\lambda^{(\bar m_k)}N^{(\bar m_k)}_{k,t}+\sum_{m\neq\bar m_k}\lambda^{(m)}N^{(m)}_{k,t}\Big)\overset{(a)}{\le}64H\log\frac{Lt}{\delta}+8G\log\log\frac{Lt}{\delta}+G\log(128H)\overset{(b)}{\le}64H\log\frac{L\Lambda}{\lambda^{(1)}\delta}+8G\log\log\frac{L\Lambda}{\lambda^{(1)}\delta}+G\log(128H)$$
$$\overset{(c)}{=}64H\log\Big(\frac{L}{\lambda^{(1)}\delta}\,C\Big(H\log\frac{L(G+H)}{\lambda^{(1)}\delta}+G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)\Big)+8G\log\log\Big(\frac{L}{\lambda^{(1)}\delta}\,C\Big(H\log\frac{L(G+H)}{\lambda^{(1)}\delta}+G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)\Big)+G\log(128H)$$
$$\overset{(d)}{\le}128(2+\log C)\Big(H\log\frac{L(G+H)}{\lambda^{(1)}\delta}+G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)\overset{(e)}{<}C\Big(H\log\frac{L(G+H)}{\lambda^{(1)}\delta}+G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big),$$
where inequality (a) is due to Eq.(14) and Eq.(16), inequality (b) is because $t\le\Lambda/\lambda^{(1)}$, equality (c) is by the supposition, inequality (d) follows by bounding the first two terms separately via Eq.(17) and Eq.(18) below, and inequality (e) holds for $C\ge 1200$. This contradicts the supposition, and therefore concludes the cost complexity upper bound proof for EXPLORE-A. A similar proof also holds for EXPLORE-B by replacing $\bar m_k$ with $\hat m_k$.

Next, we provide the two upper bounds used in inequality (d) above:
$$H\log\Big(\frac{L}{\lambda^{(1)}\delta}\,C\Big(H\log\frac{L(G+H)}{\lambda^{(1)}\delta}+G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)\Big)\le H\log\Big(\frac{L}{\lambda^{(1)}\delta}\,CH\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)+H\log\Big(\frac{L}{\lambda^{(1)}\delta}\,CG\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)$$
$$\le H\log C+H\log\Big(\frac{LH}{\lambda^{(1)}\delta}\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)+H\log C+H\log\Big(\frac{LG}{\lambda^{(1)}\delta}\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)\le 2H\log C+4H\log\frac{L(G+H)}{\lambda^{(1)}\delta}\le(4+2\log C)\,H\log\frac{L(G+H)}{\lambda^{(1)}\delta}, \qquad (17)$$
$$G\log\log\Big(\frac{L}{\lambda^{(1)}\delta}\,C\Big(H\log\frac{L(G+H)}{\lambda^{(1)}\delta}+G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)\Big)\le G\log\log\Big(\frac{L}{\lambda^{(1)}\delta}\,CH\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)+G\log\log\Big(\frac{L}{\lambda^{(1)}\delta}\,CG\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)$$
$$\le G\log\log C+G\log\log\Big(\frac{LH}{\lambda^{(1)}\delta}\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)+G\log\log C+G\log\log\Big(\frac{LG}{\lambda^{(1)}\delta}\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\Big)\le 2G\log\log C+4G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}\le(4+2\log\log C)\,G\log\log\frac{L(G+H)}{\lambda^{(1)}\delta}. \qquad (18)$$

Proof of Lemma E.2. Claim 1. For any fixed $m\neq\bar m_k$, if
$$N^{(m)}_{k,t}\ >\ 8\,\frac{\lambda^{(m)}}{\lambda^{(\bar m_k)}\,\Delta^{(m)}_k}\,\log N_{k,t}$$
holds, then the algorithm will not pull arm $k$ at fidelity $m$ with high probability. Indeed, in that case,
$$\text{f-UCB}^{(m)}_{u_t,t}\big(\mu^{(M)}_1\big)=\frac{1}{\mu^{(M)}_1-\hat\mu^{(m)}_{u_t,t}-\zeta^{(m)}+\sqrt{\frac{2\log N_{u_t,t}}{N^{(m)}_{u_t,t}}}}\ \overset{(a)}{\le}\ \frac{1}{\mu^{(M)}_1-\mu^{(m)}_{u_t}-\zeta^{(m)}+2\sqrt{\frac{2\log N_{u_t,t}}{N^{(m)}_{u_t,t}}}}\ \overset{(b)}{\le}\ \frac{1}{\mu^{(M)}_1-\mu^{(\bar m_{u_t})}_{u_t}-\zeta^{(\bar m_{u_t})}}\ \overset{(c)}{\le}\ \frac{1}{\mu^{(M)}_1-\hat\mu^{(\bar m_{u_t})}_{u_t,t}-\zeta^{(\bar m_{u_t})}+\sqrt{\frac{2\log N_{u_t,t}}{N^{(\bar m_{u_t})}_{u_t,t}}}}=\text{f-UCB}^{(\bar m_{u_t})}_{u_t,t}\big(\mu^{(M)}_1\big),$$
where inequalities (a) and (c) each hold with probability at least $1-\frac{1}{(N_{k,t})^2}$ (by Hoeffding's inequality), and inequality (b) holds due to the condition in the claim.
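The contradiction argument above rests on the fact that any budget satisfying $\Lambda\le 64H\log\frac{L\Lambda}{\lambda^{(1)}\delta}+8G\log\log\frac{L\Lambda}{\lambda^{(1)}\delta}+G\log(128H)$ must be of order $H\log(\cdot)+G\log\log(\cdot)$. The snippet below is a numerical illustration only (the values of H, G, L, lambda^(1), and delta are arbitrary examples): it finds the largest budget satisfying the displayed self-referential inequality by bisection and compares it with the C-scaled bound from the proof.

import math

def rhs(lam_budget, H, G, L, lam1, delta):
    # Right-hand side of the self-referential budget inequality from the proof.
    x = math.log(L * lam_budget / (lam1 * delta))
    return 64 * H * x + 8 * G * math.log(max(x, math.e)) + G * math.log(128 * H)

def largest_fixed_point(H, G, L, lam1, delta, hi=1e12):
    # Bisection for the largest Lambda with Lambda <= rhs(Lambda); beyond this
    # value the while-loop of the algorithm cannot still be running.
    lo = 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid <= rhs(mid, H, G, L, lam1, delta):
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    H, G, L, lam1, delta, C = 50.0, 10.0, 4.0, 1.0, 0.05, 1200.0
    y = math.log(L * (G + H) / (lam1 * delta))
    print("largest self-consistent budget:", largest_fixed_point(H, G, L, lam1, delta))
    print("C-scaled bound from the proof :", C * (H * y + G * math.log(y)))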
E.2.2 Proof for EXPLORE-B's Upper Bound

As EXPLORE-B of Algorithm 2 also employs the LUCB framework, it shares the first three steps of the proof of Theorem 3.4 in Appendix E.2. Hence, in this part, we focus on the proof of the final cost complexity upper bound.

Lemma E.3. If the condition in Line 10 holds, then the committed fidelity $\hat m_k$ fulfills the following inequality:
$$2\,\frac{\Delta^{(\hat m_k)}_k}{\sqrt{\lambda^{(\hat m_k)}}}\ \ge\ \frac{\Delta^{(\bar m_k)}_k}{\sqrt{\lambda^{(\bar m_k)}}}. \qquad (19)$$

Proof of Lemma E.3. Eq.(19) is proved as follows,
$$\frac{\Delta^{(\bar m_k)}_k\big/\sqrt{\lambda^{(\bar m_k)}}}{\Delta^{(\hat m_k)}_k\big/\sqrt{\lambda^{(\hat m_k)}}}\ \le\ \frac{\big(\hat\mu^{(M)}_k-(\hat\mu^{(\bar m_k)}_k+\zeta^{(\bar m_k)})\big)\big/\sqrt{\lambda^{(\bar m_k)}}+\sqrt{\log(2KM/\delta)\big/(\lambda^{(1)}N^{(m)}_{k,t})}}{\big(\hat\mu^{(M)}_k-(\hat\mu^{(\hat m_k)}_k+\zeta^{(\hat m_k)})\big)\big/\sqrt{\lambda^{(\hat m_k)}}-\sqrt{\log(2KM/\delta)\big/(\lambda^{(1)}N^{(m)}_{k,t})}}\ \overset{(a)}{\le}\ \frac{\big(\hat\mu^{(M)}_k-(\hat\mu^{(\hat m_k)}_k+\zeta^{(\hat m_k)})\big)\big/\sqrt{\lambda^{(\hat m_k)}}+\sqrt{\log(2KM/\delta)\big/(\lambda^{(1)}N^{(m)}_{k,t})}}{\big(\hat\mu^{(M)}_k-(\hat\mu^{(\hat m_k)}_k+\zeta^{(\hat m_k)})\big)\big/\sqrt{\lambda^{(\hat m_k)}}-\sqrt{\log(2KM/\delta)\big/(\lambda^{(1)}N^{(m)}_{k,t})}}$$
$$=\ 1+\frac{2\sqrt{\log(2KM/\delta)\big/(\lambda^{(1)}N^{(m)}_{k,t})}}{\big(\hat\mu^{(M)}_k-(\hat\mu^{(\hat m_k)}_k+\zeta^{(\hat m_k)})\big)\big/\sqrt{\lambda^{(\hat m_k)}}-\sqrt{\log(2KM/\delta)\big/(\lambda^{(1)}N^{(m)}_{k,t})}}\ \overset{(b)}{\le}\ 1+\frac{2\sqrt{\log(2KM/\delta)\big/(\lambda^{(1)}N^{(m)}_{k,t})}}{2\sqrt{\log(2KM/\delta)\big/(\lambda^{(1)}N^{(m)}_{k,t})}}\ =\ 2,$$
where inequality (a) is due to the definition of $\hat m_k$, and inequality (b) is due to the condition in Line 10.

We next upper bound the number of pulls $N^{(m)}_{k,t}$ that guarantees that the condition in Line 10 holds. Let us consider the case of exploring arm $u_t$:
$$\frac{1}{\sqrt{\lambda^{(m)}}}\Big(\hat\mu^{(M)}_k-(\hat\mu^{(\bar m_k)}_k+\zeta^{(\bar m_k)})\Big)\ \overset{(a)}{\ge}\ \frac{1}{\sqrt{\lambda^{(m)}}}\Big(\hat\mu^{(M)}_k-(\mu^{(\bar m_k)}_k+\zeta^{(\bar m_k)})\Big)-\sqrt{\frac{\log(2KM/\delta)}{\lambda^{(1)}N^{(m)}_{k,t}}}\ \overset{(b)}{\ge}\ \frac{\Delta^{(\bar m_k)}_k}{\sqrt{\lambda^{(m)}}}-\sqrt{\frac{\log(2KM/\delta)}{\lambda^{(1)}N^{(m)}_{k,t}}},$$
where inequality (a) is because $\hat\mu^{(\bar m_k)}_k\le\mu^{(\bar m_k)}_k+\sqrt{\log(2KM/\delta)/(\lambda^{(1)}N^{(m)}_{k,t})}$ with probability at least $1-\delta/(2KM)$ (therefore, with the union bound over all arm-fidelity pairs, the total failure probability of EXPLORE is upper bounded by $\delta/2$), and inequality (b) is because $\hat\mu^{(M)}_k-(\mu^{(\bar m_k)}_k+\zeta^{(\bar m_k)})\ge\mu^{(M)}_k-(\mu^{(\bar m_k)}_k+\zeta^{(\bar m_k)})=\Delta^{(\bar m_k)}_k$. To make the condition in Line 10 hold, with the above inequality, it suffices that
$$\frac{\Delta^{(\bar m_k)}_k}{\sqrt{\lambda^{(m)}}}-\sqrt{\frac{\log(2KM/\delta)}{\lambda^{(1)}N^{(m)}_{k,t}}}\ \ge\ 3\sqrt{\frac{\log(2KM/\delta)}{\lambda^{(1)}N^{(m)}_{k,t}}},$$
which, after rearrangement, becomes $N^{(m)}_{k,t}>\frac{16\lambda^{(\bar m_k)}}{(\Delta^{(\bar m_k)}_k)^2}\cdot\frac{\log(2KM/\delta)}{\lambda^{(1)}}$. That is, if the above inequality holds, then the condition in Line 10 must hold. Hence, except for the committed fidelity $\hat m_k$, we have
$$N^{(m)}_{k,t}\ \le\ \frac{16\lambda^{(\bar m_k)}}{(\Delta^{(\bar m_k)}_k)^2}\cdot\frac{\log(2KM/\delta)}{\lambda^{(1)}}, \qquad\text{for any other fidelity } m\neq\hat m_k.$$
On the other hand, Eq.(14) of LUCB's proof guarantees that, for the committed fidelity $\hat m_k$, the number of pulls is upper bounded as
$$N^{(\hat m_k)}_{k,t}\ \le\ \frac{64}{(\Delta^{(\hat m_k)}_k)^2}\log\frac{Lt}{\delta}.$$
Then, we upper bound the total budget of the algorithm as follows,
$$\sum_{k\in\mathcal K}\sum_{m\in\mathcal M}\lambda^{(m)}N^{(m)}_{k,t}\ \le\ \sum_{k\in\mathcal K}\Big(\frac{64\lambda^{(\hat m_k)}}{(\Delta^{(\hat m_k)}_k)^2}\log\frac{Lt}{\delta}+\sum_{m\neq\hat m_k}\frac{16\lambda^{(m)}\lambda^{(\bar m_k)}}{(\Delta^{(\bar m_k)}_k)^2\lambda^{(1)}}\log\frac{2KM}{\delta}\Big)\ \overset{(a)}{\le}\ \sum_{k\in\mathcal K}\Big(\frac{256\lambda^{(\bar m_k)}}{(\Delta^{(\bar m_k)}_k)^2}\log\frac{Lt}{\delta}+\sum_{m\neq\hat m_k}\frac{16\lambda^{(m)}\lambda^{(\bar m_k)}}{(\Delta^{(\bar m_k)}_k)^2\lambda^{(1)}}\log\frac{2KM}{\delta}\Big).$$
Since the algorithm is still running, the left-hand side is at least $\Lambda$ and $t\le\Lambda/\lambda^{(1)}$, so
$$\Lambda\ \overset{(b)}{\le}\ \sum_{k\in\mathcal K}\frac{1024\lambda^{(\bar m_k)}}{(\Delta^{(\bar m_k)}_k)^2}\log\frac{L}{(\Delta^{(\bar m_k)}_k)^2\lambda^{(1)}\delta}\ =\ O\Big(H\log\frac{LH}{\lambda^{(1)}\delta}\Big),$$
where inequality (a) is due to Eq.(19), and inequality (b) uses that $\Lambda\le A\log(B\Lambda)$ implies $\Lambda\le 4A\log(AB)$.
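The factor-2 guarantee of Lemma E.3 says that EXPLORE-B does not need the exact best fidelity: it only needs an estimated "gap per square-root cost" criterion whose confidence width is small relative to the estimate itself. The sketch below illustrates such a commit rule. The score, the confidence radius, and the commit threshold are assumptions made for illustration; this is not the paper's Algorithm 2.

import math

def commit_fidelity(mu_hat_M, mu_hat, counts, zeta, cost, K, M, delta, lam1):
    """Return the committed fidelity for one arm, or None if more exploration is needed.

    mu_hat_M : empirical mean at the highest fidelity M
    mu_hat[m], counts[m], zeta[m], cost[m] : per-fidelity statistics of this arm
    The score of fidelity m mimics (mu_hat_M - (mu_hat[m] + zeta[m])) / sqrt(cost[m]);
    we commit once the best score dominates its own confidence width (the analogue
    of the condition in Line 10), after which a Lemma E.3-style argument gives a
    factor-2 guarantee relative to the best criterion value.
    """
    n = min(counts)                                    # uniform exploration assumed
    width = math.sqrt(math.log(2 * K * M / delta) / (lam1 * max(n, 1)))
    scores = [(mu_hat_M - (mu_hat[m] + zeta[m])) / math.sqrt(cost[m]) for m in range(M)]
    best = max(range(M), key=lambda m: scores[m])
    return best if scores[best] >= 4 * width else None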
E.2.3 Proof for EXPLORE-C's Upper Bound in Theorem D.1

Step 4 for EXPLORE-C: prove that if the events of Steps 2 and 3 do not happen, then for $\Lambda>O\big(Q\log\frac{KMQ}{\delta(\lambda^{(1)})^2}\big)$ the algorithm terminates with probability at least $1-O(\delta/\Lambda^2)$.

Denote $\bar T:=\frac{\Lambda}{2\lambda^{(M)}}$ and define two events
$$E_1:=\{\exists t\ge\bar T:\ \mathrm{CROS}_t\},\qquad E_2:=\{\exists t\ge\bar T,\ k\in\mathcal K:\ (N^{(\bar m_k)}_{k,t}>16N^*_{k,t})\wedge(k\in\mathrm{Mid}_t)\}.$$
We first upper bound the cost spent after round $\bar T$ as follows,
$$\sum_{t\ge\bar T}\lambda^{(m_t)}\mathbb 1\{\neg\mathrm{TERM}_t\}\overset{(a)}{=}\sum_{t\ge\bar T}\lambda^{(m_t)}\mathbb 1\{\neg\mathrm{TERM}_t\wedge\neg\mathrm{CROS}_t\}\overset{(b)}{\le}\sum_{t\ge\bar T}\lambda^{(m_t)}\mathbb 1\{(\ell_t\in\mathrm{Mid}_t)\vee(u_t\in\mathrm{Mid}_t)\}\le\sum_{t\ge\bar T}\sum_{k\in\mathcal K}\lambda^{(m_t)}\mathbb 1\{((k=\ell_t)\vee(k=u_t))\wedge(k\in\mathrm{Mid}_t)\}$$
$$\overset{(c)}{\le}\sum_{t\ge\bar T}\sum_{k\in\mathcal K}\lambda^{(m_t)}\mathbb 1\{((k=\ell_t)\vee(k=u_t))\wedge(N^{(\bar m_k)}_{k,t}\le 16N^*_{k,t})\}\le\sum_{k\in\mathcal K}\Big(\lambda^{(\ell)}\frac{\log(Lt^4/\delta)}{(\zeta^{(\ell)})^2}+16\lambda^{(\bar m_k)}N^*_{k,t}\Big), \qquad (20)$$
where equality (a) is due to $\neg E_1$, inequality (b) is due to Step 1, and inequality (c) is due to $\neg E_2$. Also notice that, if the algorithm is still running after spending the budget $\Lambda$,
$$\sum_{t<\bar T}\lambda^{(m_t)}+\sum_{t\ge\bar T}\lambda^{(m_t)}\mathbb 1\{\neg\mathrm{TERM}_t\}\ \ge\ \Lambda\quad\Longrightarrow\quad\sum_{t\ge\bar T}\lambda^{(m_t)}\mathbb 1\{\neg\mathrm{TERM}_t\}\ \ge\ \Lambda-\bar T\lambda^{(M)}\ =\ \frac{\Lambda}{2},$$
and, combining with Eq.(20) (with $t\le\Lambda/\lambda^{(1)}$), we have $\frac{\Lambda}{2}\le\sum_{k\in\mathcal K}\big(\lambda^{(\ell)}\frac{\log(Lt^4/\delta)}{(\zeta^{(\ell)})^2}+16\lambda^{(\bar m_k)}N^*_{k,t}\big)$. Solving this inequality for $\Lambda$ concludes the cost complexity upper bound for EXPLORE-C.

At the end of Step 4, we show that the LUCB algorithm fulfills the fixed-confidence requirement. The probability that the algorithm does not terminate after spending the budget $\Lambda$ is upper bounded by $\mathbb P(E_1\cup E_2)$. Based on Steps 2 and 3, it can be upper bounded as follows,
$$\mathbb P(E_1\cup E_2)\ \le\ \sum_{t\ge\bar T}\frac{KM\delta}{Lt^3}+\sum_{t\ge\bar T}\frac{16\delta\sum_{k\in\mathcal K}\Delta_k^{-2}}{Lt^4}\ =\ O\Big(\frac{\delta\,(\lambda^{(1)})^2}{\Lambda^2}\Big).$$

F Proofs for Regret Minimization Results

The two lower bound proofs utilize two different regret decompositions, as follows,
$$R(\Lambda)\ \overset{(a)}{\ge}\ \sum_{t}\Big(\frac{\lambda^{(m_t)}-\lambda^{(1)}}{\lambda^{(1)}}\,\mu^{(M)}_1+\Delta^{(M)}_{I_t}\Big)\ =\ \sum_{k\in\mathcal K}\sum_{m=2}^{M}N^{(m)}_k(\Lambda)\,\frac{\lambda^{(m)}-\lambda^{(1)}}{\lambda^{(1)}}\,\mu^{(M)}_1\ +\ \sum_{k\in\mathcal K}N^{(-)}_k(\Lambda)\,\Delta^{(M)}_k; \qquad (21)$$
$$R(\Lambda)\ \overset{(a)}{\ge}\ \sum_{k\in\mathcal K}\sum_{m=1}^{M}N^{(m)}_k(\Lambda)\Big(\frac{\lambda^{(m)}}{\lambda^{(1)}}\,\mu^{(M)}_1-\mu^{(M)}_k\Big), \qquad (22)$$
where inequality (a) is due to $\Lambda\ge\sum_{t=1}^{N}\lambda^{(m_t)}$, and $N^{(m)}_k(\Lambda)$ is the number of times that arm $k$ is pulled at fidelity $m$ after paying the budget $\Lambda$. A superscript $(-)$ (resp. subscript $-$) on $N$ means the pulling times summed over all fidelities (resp. all arms). Due to the multi-fidelity feedback, both decompositions differ from the classic bandit regret decomposition [24, Lemma 4.5], and, therefore, we need to non-trivially extend the known approaches to our scenario.

To prove the problem-independent lower bound $\Omega(K^{1/3}\Lambda^{2/3})$ in Theorem 4.3, we utilize the decomposition in Eq.(21): on the RHS, the first term increases whenever one chooses a fidelity other than the lowest; this is a novel term. The second term of the RHS corresponds to the cost of pulling suboptimal arms, which also appears in the classic MAB. With this observation, one can construct instance pairs such that exploration at higher fidelities is unavoidable and, therefore, the first term is not negligible. Lastly, one needs to balance the magnitudes of the above two terms case by case, which together bounds the regret as $\Omega(K^{1/3}\Lambda^{2/3})$. To prove the problem-dependent lower bound in Theorem 4.4, we utilize Eq.(22) to decompose the regret over arms and bound each arm's contribution separately.

F.1 Proof of Theorem 4.3

Step 1. Construct instances and upper bound the KL-divergence.

Fix a policy $\pi$. We construct two MF-MAB instances, each with $K$ arms and $M$ fidelities. For the pulling costs of different fidelities, we set $\lambda^{(M)}\le 2\lambda^{(1)}$. For the reward feedback, we assume all arms' reward distributions at any fidelity are Bernoulli, and denote $\mu^{(m)}_k(1),\mu^{(m)}_k(2)$ as the reward means of the two instances. Let $\mathbb P_1=\mathbb P_{\mu(1),\pi},\ \mathbb E_1=\mathbb E_{\mu(1),\pi}$ and $\mathbb P_2=\mathbb P_{\mu(2),\pi},\ \mathbb E_2=\mathbb E_{\mu(2),\pi}$ be the probability measures and expectations on the canonical MF-MAB model induced by the $\Lambda$-budget interconnection of $\pi$ with $\mu(1)$ (resp. $\mu(2)$). Denote $k^*=\arg\min_{k\in\mathcal K}\mathbb E_1[T^{(m>1)}_k(\Lambda)]$. The only difference between the two instances is in arm $k^*$'s reward means for fidelities $m>1$, so that instance 1's optimal arm is arm 1 and instance 2's optimal arm is arm $k^*$. The detailed reward means are listed as follows:

                           m = 1                      m > 1
(mu^(m)_k(1))_{k in K}:   1/2, 1/2, ..., 1/2        1/2 + Delta, 1/2, ..., 1/2
(mu^(m)_k(2))_{k in K}:   1/2, 1/2, ..., 1/2        1/2 + Delta, ..., 1/2 + 2*Delta (arm k*), ..., 1/2

that is, the two instances agree everywhere except that arm $k^*$'s means at fidelities $m>1$ are raised from $\frac12$ to $\frac12+2\Delta$ in instance 2. Denote the tuple $(M_t,I_t,X_t)$ as the pulled fidelity, the pulled arm, and the observed reward at time $t$, and $H:=(M_1,I_1,X_1;M_2,I_2,X_2;\dots;M_N,I_N,X_N)$ as a trajectory of applying policy $\pi$. We note that $N$ is also a random variable, depending on the sequence of fidelities used in pulling arms.
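The two decompositions (21) and (22) can be checked numerically on any pull log. The sketch below evaluates both right-hand sides and the regret itself for a toy pull sequence, assuming the regret definition R(Lambda) = (Lambda/lambda^(1)) mu^(M)_1 - sum_t mu^(M)_{I_t} from Section 4 (restated here as an assumption, since Section 4 is not reproduced in this appendix); the instance values are arbitrary illustrative numbers.

def regret_and_decompositions(pulls, mu_M, costs, budget):
    """pulls: list of (arm, fidelity) pairs actually played within the budget.
    mu_M[k]: highest-fidelity mean of arm k; costs[m]: cost lambda^(m) of fidelity m.
    Returns (regret, rhs_eq21, rhs_eq22); regret >= rhs holds for both decompositions."""
    mu1, lam1 = max(mu_M), costs[0]
    gaps = [mu1 - mu for mu in mu_M]                       # Delta^(M)_k
    regret = (budget / lam1) * mu1 - sum(mu_M[k] for k, _ in pulls)
    # Eq.(21): cost overhead of non-lowest fidelities + suboptimality term.
    rhs21 = sum((costs[m] - lam1) / lam1 * mu1 + gaps[k] for k, m in pulls)
    # Eq.(22): per-pull term (lambda^(m)/lambda^(1)) mu_1 - mu_k.
    rhs22 = sum(costs[m] / lam1 * mu1 - mu_M[k] for k, m in pulls)
    return regret, rhs21, rhs22

if __name__ == "__main__":
    mu_M = [0.9, 0.7, 0.5]            # illustrative means
    costs = [1.0, 2.0, 4.0]           # lambda^(1) <= ... <= lambda^(M)
    pulls = [(0, 0), (1, 2), (2, 1), (0, 0)]
    budget = sum(costs[m] for _, m in pulls) + 0.5   # leftover budget makes (a) an inequality
    print(regret_and_decompositions(pulls, mu_M, costs, budget))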
Next, we upper bound the KL-divergence between the two instances along this trajectory:
$$\mathrm{KL}(\mathbb P_1,\mathbb P_2)\ \overset{(a)}{=}\ \mathbb E_1\Big[\log\frac{\mathrm d\mathbb P_1}{\mathrm d\mathbb P_2}(M_1,I_1,X_1;\dots;M_N,I_N,X_N)\Big]\ =\ \mathbb E_1\Big[\sum_{t=1}^{N}\log\frac{p_1(X_t\mid M_t,I_t)}{p_2(X_t\mid M_t,I_t)}\Big]\ \overset{(b)}{=}\ \mathbb E_1\Big[\sum_{t=1}^{N}\mathbb E_1\big[\mathrm{KL}\big(P^{(M_t,I_t)}_1,P^{(M_t,I_t)}_2\big)\,\big|\,N\big]\Big]$$
$$\overset{(c)}{=}\ \sum_{(m,k)\in\mathcal M\times\mathcal K}\mathrm{KL}\big(P^{(m,k)}_1,P^{(m,k)}_2\big)\,\mathbb E_1\Big[\sum_{t=1}^{N}\mathbb 1\{M_t=m,I_t=k\}\Big]\ =\ \sum_{(m,k)\in\mathcal M\times\mathcal K}\mathrm{KL}\big(P^{(m,k)}_1,P^{(m,k)}_2\big)\,\mathbb E_1\big[T^{(m)}_k(\Lambda)\big]$$
$$=\ \sum_{m>1}\mathrm{kl}\big(\mu^{(m)}_{k^*}(1),\mu^{(m)}_{k^*}(2)\big)\,\mathbb E_1\big[T^{(m)}_{k^*}(\Lambda)\big]\ \overset{(d)}{\le}\ \frac{1}{K}\,\mathrm{kl}\Big(B\big(\tfrac12\big),B\big(\tfrac12+2\Delta\big)\Big)\,\mathbb E_1\big[T^{(m>1)}_{-}(\Lambda)\big]\ \overset{(e)}{\le}\ \frac{18\Delta^2}{K}\,\mathbb E_1\big[T^{(m>1)}_{-}(\Lambda)\big], \qquad (23)$$
where equality (a) is due to the density factorization
$$\frac{\mathrm d\mathbb P_1}{\mathrm d(\rho\otimes\rho\otimes\lambda)^{N}}(m_1,k_1,x_1;\dots;m_N,k_N,x_N)\ =\ p_{\mu(1),\pi}(m_1,k_1,x_1;\dots;m_N,k_N,x_N)\ =\ \prod_{t=1}^{N}\pi_t(m_t,k_t\mid m_1,k_1,x_1;\dots;m_{t-1},k_{t-1},x_{t-1})\,p_{\mu(1)}(x_t\mid m_t,k_t)$$
(the policy terms cancel in the ratio); equality (b) is due to
$$\mathbb E_1\Big[\log\frac{p_1(X_t\mid M_t,I_t)}{p_2(X_t\mid M_t,I_t)}\,\Big|\,M_t,I_t,N\Big]\ =\ \mathbb E_1\big[\mathrm{KL}\big(P^{(M_t,I_t)}_1,P^{(M_t,I_t)}_2\big)\,\big|\,N\big];$$
equality (c) is due to the tower property as well; inequality (d) is because arm $k^*$ is the arm with the smallest expected number of pulls at fidelities $m>1$, so $\mathbb E_1[T^{(m>1)}_{k^*}(\Lambda)]\le\frac1K\mathbb E_1[T^{(m>1)}_{-}(\Lambda)]$; and inequality (e) is by computing the KL-divergence between the two Bernoulli distributions (for $\Delta$ small enough).

Step 2. Lower bound the regret.

We first note that the regret defined in Eq.(9) for instance 1 can be decomposed, as in Eq.(21), as
$$\mathbb E_1[R(\Lambda)]\ \overset{(a)}{\ge}\ \sum_{k\in\mathcal K}\sum_{m=2}^{M}T^{(m)}_k(\Lambda)\,\frac{\lambda^{(m)}-\lambda^{(1)}}{\lambda^{(1)}}\,\mu^{(M)}_1\ +\ \sum_{k\in\mathcal K}T^{(-)}_k(\Lambda)\,\Delta^{(M)}_k,$$
where inequality (a) is due to $\Lambda\ge\sum_{t=1}^{N}\lambda^{(m_t)}$, and $T^{(m)}_k(\Lambda)$ is the number of times that arm $k$ is pulled at fidelity $m$ (given total budget $\Lambda$). Applying this decomposition to both instances and summing their regrets, we obtain
$$\mathbb E_1[R(\Lambda)]+\mathbb E_2[R(\Lambda)]\ \overset{(a)}{\ge}\ \mathbb E_1\big[T^{(m>1)}_{-}(\Lambda)\big]\,\frac{\lambda^{(2)}-\lambda^{(1)}}{\lambda^{(1)}}\,\mu^{(M)}_1\ +\ \Delta\,\min\Big\{\frac{\Lambda}{\lambda^{(M)}}-\frac{\Lambda}{2\lambda^{(1)}},\ \frac{\Lambda}{2\lambda^{(1)}}\Big\}\Big(\mathbb P_1\Big(T^{(-)}_1(\Lambda)\le\frac{\Lambda}{2\lambda^{(1)}}\Big)+\mathbb P_2\Big(T^{(-)}_1(\Lambda)>\frac{\Lambda}{2\lambda^{(1)}}\Big)\Big)$$
$$\overset{(b)}{\ge}\ \mathbb E_1\big[T^{(m>1)}_{-}(\Lambda)\big]\,\frac{\lambda^{(2)}-\lambda^{(1)}}{\lambda^{(1)}}\,\mu^{(M)}_1\ +\ \frac{\Lambda\Delta}{2}\,\min\Big\{\frac{1}{\lambda^{(M)}}-\frac{1}{2\lambda^{(1)}},\ \frac{1}{2\lambda^{(1)}}\Big\}\exp\big(-\mathrm{KL}(\mathbb P_1,\mathbb P_2)\big)$$
$$\overset{(c)}{\ge}\ \mathbb E_1\big[T^{(m>1)}_{-}(\Lambda)\big]\,\frac{\lambda^{(2)}-\lambda^{(1)}}{\lambda^{(1)}}\,\mu^{(M)}_1\ +\ \frac{\Lambda\Delta}{2}\,\min\Big\{\frac{1}{\lambda^{(M)}}-\frac{1}{2\lambda^{(1)}},\ \frac{1}{2\lambda^{(1)}}\Big\}\exp\Big(-\frac{18\Delta^2}{K}\,\mathbb E_1\big[T^{(m>1)}_{-}(\Lambda)\big]\Big),$$
where inequality (a) uses the $\lambda^{(M)}\le 2\lambda^{(1)}$ condition, inequality (b) is by the Bretagnolle-Huber inequality [24, Theorem 14.2], and inequality (c) is by Eq.(23).

Step 3. Obtain the $\Omega(K^{1/3}\Lambda^{2/3})$ lower bound.

We show that $\mathbb E_1[R(\Lambda)]+\mathbb E_2[R(\Lambda)]\ge\Omega(K^{1/3}\Lambda^{2/3})$ for any possible value of $\mathbb E_1[T^{(m>1)}_{-}(\Lambda)]$ via the following case analysis.
Case 1: If $\mathbb E_1[T^{(m>1)}_{-}(\Lambda)]=0$, then the last expression becomes $c\,\Lambda\Delta$ for a constant $c>0$. Letting $\Delta$ be a constant (via the sup over instances), we obtain an $\Omega(\Lambda)$ lower bound, which implies the $\Omega(K^{1/3}\Lambda^{2/3})$ bound as well.
Case 2: If $\mathbb E_1[T^{(m>1)}_{-}(\Lambda)]\ge K^{1/3}\Lambda^{2/3}$, then we obtain the $\Omega(K^{1/3}\Lambda^{2/3})$ lower bound from the first term.
Case 3: If $0<\mathbb E_1[T^{(m>1)}_{-}(\Lambda)]<K^{1/3}\Lambda^{2/3}$, we choose $\Delta=K^{1/3}\Lambda^{-1/3}$; the exponent is then bounded by a constant, and the second term again yields the $\Omega(K^{1/3}\Lambda^{2/3})$ lower bound.

F.2 Proof of Theorem 4.4

In this proof, we show that, for any arm $k\in\mathcal K$, the total regret attributed to this arm is lower bounded as
$$\liminf_{\Lambda\to\infty}\frac{\mathbb E[R_k(\Lambda)]}{\log\Lambda}\ \ge\ \min_{m:\,\Delta^{(m)}_k>0}\ \frac{\frac{\lambda^{(m)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_k}{(\Delta^{(m)}_k)^2}.$$
Then, noticing that $R(\Lambda)=\sum_{k\in\mathcal K}R_k(\Lambda)$ concludes the proof.

We construct instances $\nu$ and $\nu'$. We set the reward distributions $\nu=(\nu^{(m)}_k)_{(k,m)\in\mathcal K\times\mathcal M}$ as Bernoulli, and the reward means fulfill $\mu^{(M)}_1>\mu^{(M)}_2\ge\mu^{(M)}_3\ge\dots\ge\mu^{(M)}_K$, where $\mu^{(m)}_k=\mathbb E_{X\sim\nu^{(m)}_k}[X]$. We let $\nu'^{(m)}_k$ be the same as $\nu^{(m)}_k$ for all $k$ and $m$, except for one arm $\ell\neq 1$: we set arm $\ell$'s reward means on fidelities $m\in\mathcal M_\ell$ to $\mu'^{(m)}_\ell=\mu^{(M)}_1-\zeta^{(m)}+\epsilon$.
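Two ingredients of the proof above are elementary and easy to check numerically: the Bernoulli KL-divergence entering Eq.(23), and the Bretagnolle-Huber inequality P1(A) + P2(A^c) >= (1/2) exp(-KL(P1, P2)) used in inequality (b). The snippet below is a self-contained numerical check on two Bernoulli distributions; the event A and the parameter values are arbitrary choices for illustration.

import math

def kl_bernoulli(p, q):
    # Binary relative entropy kl(p, q).
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bretagnolle_huber_check(p, q):
    # P1 = Bernoulli(p), P2 = Bernoulli(q), event A = {X = 1}.
    lhs = p + (1 - q)                         # P1(A) + P2(A^c)
    rhs = 0.5 * math.exp(-kl_bernoulli(p, q))
    return lhs, rhs, lhs >= rhs

if __name__ == "__main__":
    delta = 0.05
    print(kl_bernoulli(0.5, 0.5 + 2 * delta))   # the per-pull divergence in Eq.(23)
    print(bretagnolle_huber_check(0.5, 0.5 + 2 * delta))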
One can verify that the reward means of arm $\ell$ under instance $\nu'$ fulfill the condition $|\mu'^{(m)}_\ell-\mu'^{(M)}_\ell|\le\zeta^{(m)}$ for every fidelity $m$. Notice that $\mu'^{(M)}_\ell>\mu^{(M)}_1$; hence, under instance $\nu'$, the optimal arm is $\ell$. We denote $\mathbb P,\mathbb E$ and $\mathbb P',\mathbb E'$ as the probability measures and expectations for instances $\nu$ and $\nu'$, respectively. Next, we employ the (extended) key inequality from Garivier et al. [12]:
$$\sum_{m\in\mathcal M}\mathbb E[N^{(m)}_\ell(\Lambda)]\,\mathrm{KL}\big(\nu^{(m)}_\ell,\nu'^{(m)}_\ell\big)\ \ge\ \mathrm{kl}\big(\mathbb E[Z],\mathbb E'[Z]\big).$$
Taking $Z=\lambda^{(M)}N^{(-)}_\ell(\Lambda)/\Lambda$ in the above inequality, we have
$$\sum_{m:\,\Delta^{(m)}_\ell>0}\mathbb E[N^{(m)}_\ell(\Lambda)]\,\mathrm{KL}\big(\nu^{(m)}_\ell,\nu'^{(m)}_\ell\big)\ \ge\ \mathrm{kl}\Big(\frac{\lambda^{(M)}\mathbb E[N^{(-)}_\ell(\Lambda)]}{\Lambda},\ \frac{\lambda^{(M)}\mathbb E'[N^{(-)}_\ell(\Lambda)]}{\Lambda}\Big)\ \overset{(a)}{\ge}\ \Big(1-\frac{\lambda^{(M)}\mathbb E[N^{(-)}_\ell(\Lambda)]}{\Lambda}\Big)\log\frac{\Lambda}{\Lambda-\lambda^{(M)}\mathbb E'[N^{(-)}_\ell(\Lambda)]}-\log 2, \qquad (24)$$
where inequality (a) is due to the fact that, for all $(p,q)\in[0,1]^2$, $\mathrm{kl}(p,q)\ge(1-p)\log\frac{1}{1-q}-\log 2$. Notice that the regret attributed to arm $\ell$ can be decomposed and lower bounded as
$$\mathbb E[R_\ell(\Lambda)]\ \ge\ \sum_{m=1}^{M}\mathbb E[N^{(m)}_\ell(\Lambda)]\Big(\frac{\lambda^{(m)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_\ell\Big)$$
subject to the constraint in Eq.(24). Since this is a linear program, its optimum is attained at a vertex of the constraint polyhedron. Therefore, we lower bound the regret as
$$\mathbb E[R_\ell(\Lambda)]\ \ge\ \min_{m:\,\Delta^{(m)}_\ell>0}\frac{\frac{\lambda^{(m)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_\ell}{\mathrm{KL}\big(\nu^{(m)}_\ell,\nu'^{(m)}_\ell\big)}\Big[\Big(1-\frac{\lambda^{(M)}\mathbb E[N^{(-)}_\ell(\Lambda)]}{\Lambda}\Big)\log\frac{\Lambda}{\Lambda-\lambda^{(M)}\mathbb E'[N^{(-)}_\ell(\Lambda)]}-\log 2\Big]. \qquad (25)$$
Notice that the policy is consistent, that is, $\mathbb E[N^{(-)}_\ell(\Lambda)]=o(\Lambda^a)$ and $\mathbb E'[N^{(-)}_k(\Lambda)]=o(\Lambda^a)$ for any $a\in(0,1]$ and any arm $k\neq\ell$ (suboptimal under $\nu'$). We thus have $\frac{\lambda^{(M)}\mathbb E[N^{(-)}_\ell(\Lambda)]}{\Lambda}=o(1)$ and
$$\Lambda-\lambda^{(M)}\mathbb E'[N^{(-)}_\ell(\Lambda)]\ \le\ \lambda^{(M)}\sum_{k\neq\ell}\mathbb E'[N^{(-)}_k(\Lambda)]\ =\ o(\Lambda^a).$$
Dividing both sides of Eq.(25) by $\log\Lambda$, letting $\Lambda$ go to infinity and $a$ go to 1, we have
$$\liminf_{\Lambda\to\infty}\frac{\mathbb E[R_\ell(\Lambda)]}{\log\Lambda}\ \ge\ \min_{m:\,\Delta^{(m)}_\ell>0}\frac{\frac{\lambda^{(m)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_\ell}{\mathrm{KL}\big(\nu^{(m)}_\ell,\nu'^{(m)}_\ell\big)}\ =\ \min_{m:\,\Delta^{(m)}_\ell>0}\frac{\frac{\lambda^{(m)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_\ell}{\mathrm{KL}\big(\nu^{(m)}_\ell,\ \mu^{(M)}_1-\zeta^{(m)}+\epsilon\big)}.$$
To bound the optimal arm 1's contribution, we use the same $\nu$ as above and construct another instance $\nu''$, whose reward means are the same as $\nu$ except for arm 1, whose reward means for fidelities $m\in\mathcal M_1$ are set as $\mu''^{(m)}_1=\mu^{(M)}_2+\zeta^{(m)}-\epsilon$. Then, with a similar procedure as above, we obtain
$$\liminf_{\Lambda\to\infty}\frac{\mathbb E[R_1(\Lambda)]}{\log\Lambda}\ \ge\ \min_{m:\,\Delta^{(m)}_1>0}\frac{\frac{\lambda^{(m)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_1}{\mathrm{KL}\big(\nu^{(m)}_1,\ \mu^{(M)}_2+\zeta^{(m)}-\epsilon\big)}.$$

F.3 Proof of Theorem 4.8

We first prove a problem-dependent regret upper bound and then convert this bound into the problem-independent regret upper bound presented in Theorem 4.8. Denote $\Delta^{(M)}_k=\mu^{(M)}_1-\mu^{(M)}_k$, and, in particular, $\Delta^{(M)}_1=0$.

Theorem F.1 (Problem-Dependent Regret Upper Bound). For any $\varepsilon>0$, Algorithm 3's regret is upper bounded as follows,
$$\mathbb E[R(\Lambda)]\ \le\ \sum_{k:\Delta^{(M)}_k>\varepsilon}\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_k\Big)\Big(\frac{16}{(\Delta^{(M)}_k)^2}\log\frac{\Lambda(\Delta^{(M)}_k)^2}{16\lambda^{(M)}}+\frac{48}{(\Delta^{(M)}_k)^2}+1\Big)\ +\ \sum_{k:\Delta^{(M)}_k\le\varepsilon}\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_k\Big)\Big(\frac{16}{3\varepsilon^2}+1+64\Big)\ +\ \max_{k:\Delta^{(M)}_k\le\varepsilon}\frac{\Lambda}{\lambda^{(1)}}\Delta^{(M)}_k. \qquad (26)$$

Proof of Theorem F.1. By Hoeffding's inequality, we have
$$\mathbb P\big(\hat\mu^{(M)}_{k,p}>\mu^{(M)}_k+2^{-p}\big)\ \le\ \exp\Big(-2\cdot\tfrac12 2^{2p}\log\frac{\Lambda}{2^{2p}\lambda^{(M)}}\cdot 2^{-2p}\Big)\ =\ \frac{2^{2p}\lambda^{(M)}}{\Lambda},\qquad \mathbb P\big(\hat\mu^{(M)}_{k,p}<\mu^{(M)}_k-2^{-p}\big)\ \le\ \frac{2^{2p}\lambda^{(M)}}{\Lambda}.$$
That is, the empirical mean $\hat\mu^{(M)}_{k,p}$ lies in the confidence interval $(\mu^{(M)}_k-2^{-p},\ \mu^{(M)}_k+2^{-p})$ with high probability. Choose any $\varepsilon>e^{-\Lambda}$. Let $\mathcal K':=\{k\in\mathcal K\mid\Delta^{(M)}_k>\varepsilon\}$ and denote $p_k:=\min\{p:2^{-p}<\Delta^{(M)}_k/2\}$. From $p_k$'s definition, we have the following inequality,
$$2^{-p_k+1}\ <\ \Delta^{(M)}_k\ \le\ 2^{-p_k+2}. \qquad (27)$$
We also note that the regret incurred by each pull of a suboptimal arm $k\in\mathcal K$ at the highest fidelity $M$ is at most $\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_k$, where the factor $\frac{\lambda^{(M)}}{\lambda^{(1)}}$ appears because the budget paid for one pull at fidelity $M$ could instead be used to pull an arm at fidelity 1 for $\frac{\lambda^{(M)}}{\lambda^{(1)}}$ (fractional) times.
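The quantities driving the analysis of Algorithm 3 -- the phase index p_k, the per-phase confidence radius 2^{-p}, and the fidelity-M pull budget of Eq.(28) below -- are easy to tabulate. The helper below does so as a reading aid for the proof, with illustrative gap and cost values; it is not a restatement of Algorithm 3 itself.

import math

def phase_index(gap):
    # p_k = min{p : 2^{-p} < gap / 2}, so that 2^{-p_k+1} < gap <= 2^{-p_k+2} (Eq. (27)).
    p = 0
    while 2.0 ** (-p) >= gap / 2.0:
        p += 1
    return p

def fidelity_M_pulls(gap, budget, cost_M):
    # Upper bound of Eq.(28) on the number of fidelity-M pulls before elimination.
    return 16.0 / gap**2 * math.log(budget * gap**2 / (16.0 * cost_M))

if __name__ == "__main__":
    budget, cost_M = 1e6, 4.0
    for gap in (0.4, 0.2, 0.1, 0.05):
        pk = phase_index(gap)
        print(f"gap={gap:.2f}  p_k={pk}  radius=2^-{pk}={2.0**(-pk):.3f}  "
              f"fidelity-M pulls <= {fidelity_M_pulls(gap, budget, cost_M):.0f}")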
The rest of this proof consists of two steps. In the first step, we assume that all empirical means lie in their corresponding confidence intervals in every phase, and show that the algorithm properly eliminates every suboptimal arm in $\mathcal K'$ -- each such arm is eliminated in or before phase $p_k$. In the second step, we upper bound the regret incurred when some empirical estimates lie outside their corresponding confidence intervals.

Step 1. All arms' empirical means lie in their confidence intervals. In this case, any suboptimal arm $k\in\mathcal K'$ is eliminated in or before phase $p_k$, because, if $k\in C_{p_k}$, we have
$$\hat\mu^{(M)}_k\ \le\ \mu^{(M)}_k+2^{-p_k}\ =\ \mu^{(M)}_1-\Delta^{(M)}_k+2^{-p_k}\ \overset{(a)}{\le}\ \mu^{(M)}_1-4\cdot 2^{-p_k-1}+2^{-p_k}\ =\ \mu^{(M)}_1-2^{-p_k}\ <\ \hat\mu^{(M)}_1\ \le\ \max_{k'\in C_{p_k}}\hat\mu^{(M)}_{k'},$$
where (a) is due to Eq.(27). That is, if arm $k$ has not been eliminated before phase $p_k$, it must be eliminated in this phase. Therefore, the total number of pulls of this arm $k$ at the highest fidelity $M$ is upper bounded as
$$T^{(M)}_k\ \le\ 2^{2p_k}\log\frac{\Lambda}{2^{2p_k}\lambda^{(M)}}\ \overset{(a)}{\le}\ \frac{16}{(\Delta^{(M)}_k)^2}\log\frac{(\Delta^{(M)}_k)^2\Lambda}{16\lambda^{(M)}}, \qquad (28)$$
where (a) is due to Eq.(27). We then handle the number of fidelity-$M$ pulls of the arms $k$ with $\Delta^{(M)}_k\le\varepsilon$. Although these arms' total pulls before elimination are also bounded by Eq.(28), their corresponding phases $p_k$ are greater than $\log_2\frac{2}{\varepsilon}$ and therefore cannot be reached. So, these arms' (including the optimal arm 1's) total number of pulls at fidelity $M$ is upper bounded by $\frac{16}{\varepsilon^2}\log\frac{\Lambda\varepsilon^2}{16\lambda^{(M)}}$. Therefore, the cost due to pulling arms at fidelity $M$ is upper bounded as
$$\sum_{k\in\mathcal K}\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_k\Big)\,\frac{16}{\max\{\varepsilon,\Delta^{(M)}_k\}^2}\log\frac{\max\{\varepsilon,\Delta^{(M)}_k\}^2\,\Lambda}{16\lambda^{(M)}}. \qquad (29)$$
After the elimination process, arms with $\Delta^{(M)}_k\le\varepsilon$ may remain in the candidate arm set $C_p$ and are exploited in turn at fidelity $m=1$. As some of them are not the optimal arm, this additional cost is upper bounded by
$$\max_{k:\Delta^{(M)}_k\le\varepsilon}\ \frac{\Lambda}{\lambda^{(1)}}\,\Delta^{(M)}_k. \qquad (30)$$

Step 2. There are two cases in which some arms are eliminated improperly: for a suboptimal arm $k$, either
2.1 the suboptimal arm $k$ is not eliminated in (or before) phase $p_k$, while the optimal arm 1 is in $C_{p_k}$ at phase $p_k$; or
2.2 the suboptimal arm $k$ is eliminated in (or before) phase $p_k$, but the optimal arm 1 is not in $C_{p_k}$ at phase $p_k$.
Case 2.1 happening means that arm $k$ is not eliminated in or before phase $p_k$, which can only happen when arm $k$'s empirical mean $\hat\mu^{(M)}_k$ lies outside its confidence interval. This event (in or before phase $p_k$) has probability no greater than $\frac{2\cdot 2^{2p_k}\lambda^{(M)}}{\Lambda}$. Since this event may happen to any suboptimal arm $k\in\mathcal K'$, the regret of this case is upper bounded by
$$\sum_{k\in\mathcal K'}\frac{2\cdot 2^{2p_k}\lambda^{(M)}}{\Lambda}\cdot\frac{\Lambda}{\lambda^{(1)}}\big(\mu^{(M)}_1-\mu^{(M)}_k\big)\ =\ \sum_{k\in\mathcal K'}2^{2p_k+1}\,\frac{\lambda^{(M)}}{\lambda^{(1)}}\big(\mu^{(M)}_1-\mu^{(M)}_k\big)\ \overset{(a)}{\le}\ \sum_{k\in\mathcal K'}\frac{32\lambda^{(M)}}{(\Delta^{(M)}_k)^2\,\lambda^{(1)}}\big(\mu^{(M)}_1-\mu^{(M)}_k\big), \qquad (31)$$
where (a) is due to Eq.(27). If Case 2.2 happens, the optimal arm 1 is not in the candidate arm set $C_{p_k}$ at phase $p_k$. We denote $p_1$ as the phase in which the optimal arm 1 is eliminated; it is at this phase that some arm's empirical mean lies outside its confidence interval so that this mis-elimination happens. The probability of this event is upper bounded by $\frac{2\cdot 2^{2p_1}\lambda^{(M)}}{\Lambda}$. We assume that arms $k$ with $p_k<p_1$ are eliminated in or before phase $p_k$ properly; otherwise, the regret is counted in Case 2.1. Therefore, the optimal arm 1, eliminated in phase $p_1$, must be eliminated by an arm $k$ with $p_k\ge p_1$. Consequently, the maximal per-time-slot regret in Case 2.2 is among arms with $p_k\ge p_1$. Denote $p_\varepsilon:=\min\{p\mid 2^{-p}<\varepsilon/2\}$.
We bound the cost of Case 2.2 as follows,
$$\sum_{p_1=0}^{\max_{i\in\mathcal K'}p_i}\frac{2\cdot 2^{2p_1}\lambda^{(M)}}{\Lambda}\cdot\frac{\Lambda}{\lambda^{(M)}}\max_{k>1:\,p_k\ge p_1}\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_k\Big)\ \le\ \sum_{p_1=0}^{\max_{i\in\mathcal K'}p_i}\sum_{k>1:\,p_k\ge p_1}2^{2p_1+1}\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_1+\max_{k'>1:\,p_{k'}\ge p_1}\Delta^{(M)}_{k'}\Big)$$
$$\overset{(a)}{\le}\ \sum_{p_1=0}^{\max_{i\in\mathcal K'}p_i}\sum_{k>1:\,p_k\ge p_1}2^{2p_1+1}\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_1+4\cdot 2^{-p_1}\Big)\ \overset{(b)}{\le}\ \sum_{k\in\mathcal K'}\sum_{p_1=0}^{\min\{p_k,p_\varepsilon\}}2^{2p_1+1}\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_1+4\cdot 2^{-p_1}\Big)$$
$$\le\ \sum_{k\in\mathcal K'}\Big[\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_1\Big)\frac{2^{2\min\{p_k,p_\varepsilon\}+2}}{3}+2^{\min\{p_k,p_\varepsilon\}+4}\Big]\ \overset{(c)}{\le}\ \sum_{k\in\mathcal K'}\Big[\Big(\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_k\Big)\frac{64}{3\max\{\Delta^{(M)}_k,\varepsilon\}^2}+\frac{64}{\max\{\Delta^{(M)}_k,\varepsilon\}}\Big], \qquad (32)$$
where (a) and (c) are due to Eq.(27) (together with the definitions of $p_k$ and $p_\varepsilon$), and (b) is due to swapping the two summations. Summing up the costs in Eq.(29), Eq.(30), Eq.(31), and Eq.(32) concludes the proof.

Next, we derive the problem-independent regret bound from Eq.(26). When $\Lambda$ is large, the $O(\log\Lambda)$ and $O(\Lambda)$ terms dominate the other terms in Eq.(26). For any given $\varepsilon$, if $\Lambda$ is large enough (so that the map $x\mapsto\frac{16}{x^2}\log\frac{\Lambda x^2}{16\lambda^{(M)}}$ is decreasing on $[\varepsilon,1]$), we have
$$\frac{16}{(\Delta^{(M)}_k)^2}\log\frac{\Lambda(\Delta^{(M)}_k)^2}{16\lambda^{(M)}}\ <\ \frac{16}{\varepsilon^2}\log\frac{\Lambda\varepsilon^2}{16\lambda^{(M)}}\qquad\text{for all }\Delta^{(M)}_k>\varepsilon.$$
Therefore, we can scale all logarithmic pull counts to $\frac{16}{\varepsilon^2}\log\frac{\Lambda\varepsilon^2}{16\lambda^{(M)}}$ and upper bound the per-pull regret $\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1-\mu^{(M)}_k$ by $\frac{\lambda^{(M)}}{\lambda^{(1)}}\mu^{(M)}_1$. We derive the problem-independent regret upper bound as follows,
$$\mathbb E[R(\Lambda)]\ \le\ K\mu^{(M)}_1\frac{\lambda^{(M)}}{\lambda^{(1)}}\cdot\frac{16}{\varepsilon^2}\log\frac{\Lambda\varepsilon^2}{16\lambda^{(M)}}+\frac{\Lambda}{\lambda^{(1)}}\varepsilon\ \le\ K\mu^{(M)}_1\frac{\lambda^{(M)}}{\lambda^{(1)}}\cdot\frac{16}{\varepsilon^2}\log\frac{\Lambda}{16\lambda^{(M)}}+\frac{\Lambda}{\lambda^{(1)}}\varepsilon\ \overset{(a)}{=}\ \frac{2\Lambda^{2/3}}{\lambda^{(1)}}\Big(16K\mu^{(M)}_1\lambda^{(M)}\log\frac{\Lambda}{16\lambda^{(M)}}\Big)^{1/3},$$
where equality (a) holds when $\varepsilon=\Big(\frac{16K\mu^{(M)}_1\lambda^{(M)}\log(\Lambda/(16\lambda^{(M)}))}{\Lambda}\Big)^{1/3}$.
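For completeness, the choice of $\varepsilon$ in equality (a) is exactly the value that balances the two surviving terms. Spelling out the balancing step with the constants kept as above: setting
$$\frac{16K\mu^{(M)}_1\lambda^{(M)}}{\lambda^{(1)}\varepsilon^2}\log\frac{\Lambda}{16\lambda^{(M)}}\ =\ \frac{\Lambda}{\lambda^{(1)}}\varepsilon\quad\Longleftrightarrow\quad\varepsilon^3=\frac{16K\mu^{(M)}_1\lambda^{(M)}\log(\Lambda/(16\lambda^{(M)}))}{\Lambda},$$
and substituting back gives
$$\mathbb E[R(\Lambda)]\ \le\ \frac{2\Lambda}{\lambda^{(1)}}\,\varepsilon\ =\ \frac{2\Lambda^{2/3}}{\lambda^{(1)}}\Big(16K\mu^{(M)}_1\lambda^{(M)}\log\frac{\Lambda}{16\lambda^{(M)}}\Big)^{1/3}\ =\ O\big(K^{1/3}\Lambda^{2/3}\big),$$
which matches the problem-independent bound of Theorem 4.8 up to the logarithmic factor.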