# Optimal Multi-Fidelity Best-Arm Identification

Riccardo Poiani, DEIB, Politecnico di Milano, Milan, Italy. riccardo.poiani@polimi.it
Rémy Degenne, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France. remy.degenne@inria.fr
Emilie Kaufmann, Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France. emilie.kaufmann@univ-lille.fr
Alberto Maria Metelli, DEIB, Politecnico di Milano, Milan, Italy. albertomaria.metelli@polimi.it
Marcello Restelli, DEIB, Politecnico di Milano, Milan, Italy. marcello.restelli@polimi.it

In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible. We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost. Several methods have been proposed for tackling this problem, but their optimality remains elusive, notably due to loose lower bounds on the total cost needed to identify the best arm. Our first contribution is a tight, instance-dependent lower bound on the cost complexity. The study of the optimization problem featured in the lower bound provides new insights to devise computationally efficient algorithms, and leads us to propose a gradient-based approach with asymptotically optimal cost complexity. We demonstrate the benefits of the new algorithm compared to existing methods in experiments. Our theoretical and empirical findings also shed light on an intriguing concept of optimal fidelity for each arm.

1 Introduction

In multi-armed bandits [20], an algorithm chooses at each step one arm among K > 1 possibilities. It then observes a reward, sampled from a probability distribution on R corresponding to the arm. Several goals are possible for the algorithm, and we focus on the best arm identification task (BAI), in which we aim to identify the arm with the largest mean, using as few samples as possible. This is a well-studied problem [6, 1, 12, 10, 8] with potential applications to, e.g., A/B/n testing [27] or hyper-parameter optimization [11]. In some applications, like physics, parameter studies, or hyper-parameter optimization, getting a sample from the arm distribution might be expensive since it requires evaluating or training a complex model and is computationally demanding. However, it is often the case that cheaper, less accurate sampling methods are available, for instance, by using a coarser model in the physics study example.

Work done while at Inria, Lille, France.
38th Conference on Neural Information Processing Systems (NeurIPS 2024).

The multi-fidelity bandit framework takes such scenarios into account. When choosing an arm, the algorithm also chooses a fidelity, with a trade-off: a higher fidelity gives a more precise observation but has a higher cost. We assume that the algorithm knows both the cost and the maximal bias of the observations from each fidelity. This is also how the knowledge about the fidelity was modeled in prior work [see, e.g., 16, 15, 25, 31]. The goal is then to find the best arm (i.e., the arm with the highest mean at the highest fidelity) with high probability and minimal cost. Specifically, the bandit algorithm interacts with the multi-fidelity environment and gathers information to find which arm has the highest mean when pulled at the highest fidelity.
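To make this interaction protocol concrete, the snippet below is a minimal sketch, not taken from the paper: all names, means, costs, and precisions are illustrative. At each round the learner picks an arm and a fidelity, receives a (here Gaussian) observation whose mean is within ξ_m of the highest-fidelity mean, and pays the corresponding cost λ_m.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-arm, 2-fidelity instance: mu[a, m] is the mean of arm a at fidelity m,
# lam[m] the sampling cost, xi[m] the bias bound |mu[a, m] - mu[a, M]| <= xi[m].
mu = np.array([[0.45, 0.50],
               [0.55, 0.60]])   # arm 1 has the largest mean at the highest fidelity
lam = np.array([0.5, 5.0])       # costs increase with the fidelity
xi = np.array([0.1, 0.0])        # the highest fidelity is unbiased (xi_M = 0)

def pull(arm, fidelity):
    """Observe `arm` at `fidelity` (Gaussian noise) and return (sample, cost paid)."""
    return rng.normal(mu[arm, fidelity], 1.0), lam[fidelity]

total_cost = 0.0
for arm in range(mu.shape[0]):
    for fidelity in range(mu.shape[1]):
        _, cost = pull(arm, fidelity)
        total_cost += cost
print(f"cost paid after one pull of every (arm, fidelity) pair: {total_cost}")
```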
In the fixed confidence setting, we want to ensure that the algorithm returns a correct answer with probability at least 1 - δ for a given parameter δ ∈ (0, 1). A good algorithm should do that at a minimum cost, and thus, the appropriate quality metric for evaluating an algorithm's performance is the sum of costs paid until it stops, i.e., the cost complexity. Previous work on the multi-fidelity BAI problem [25, 31] provided lower bounds on the cost complexity as well as algorithms with cost upper bounds. Those lower and upper bounds do not match, and the proposed methods require additional prior information [31], or their guarantees are restricted to problems satisfying additional hypotheses [25]. We lift all those requirements and provide an improved lower bound and an algorithm with a matching upper bound.

Contributions and organization of the paper  After presenting additional related works, in Section 2 we define fixed-confidence best arm identification in multi-fidelity bandits in more mathematical detail and introduce the notation used throughout the paper. Then, Section 3 contains our first contribution: a tight instance-dependent lower bound on the cost complexity of any algorithm, expressed as the maximum of a complex function over all possible cost allocations. We also highlight features of that lower bound, like the existence of an optimal fidelity for each arm, which should be chosen exclusively. In Section 4, we propose a computationally efficient procedure for computing gradients of the function featured in the lower bound and describe a gradient-based algorithm whose cost complexity asymptotically matches the lower bound. Finally, in Section 5, we present the results of numerical experiments which demonstrate the good empirical performance of our new algorithm compared to prior work.

Additional related works  The multi-fidelity setting has mostly been studied in the context of Bayesian optimization [9, 24, 17, 26, 14, 21] and black-box function optimization with different structural assumptions [28, 29, 7, 23]. The goal there is to find the minimum of a function by successive queries of that function or of cheaper approximations. The metric for success in these works is most often the simple regret, that is, the difference between the best value found and the true minimum, although other goals were considered, like the cumulative regret [16, 15]. Furthermore, we notice that best arm identification with costs has recently been studied in [13] for BAI with only one fidelity. The authors introduce a variant of the Track-and-Stop algorithm [8] and prove its asymptotic optimality. However, we will not be able to adapt this study to the multi-fidelity case because, as we shall see, it requires solving a complex optimization problem for which we have no efficient solution. Finally, our work is related to the vast strand of BAI studies that propose tight lower bounds with asymptotically optimal algorithms [e.g., 8, 4, 22]; nevertheless, as we discuss throughout the text, these studies cannot be directly applied to the multi-fidelity BAI problem.

2 Background

In this section, we provide essential background and notation that is used throughout the rest of the paper. A table that summarizes the notation is available in Appendix A. A multi-fidelity bandit model with K arms and M fidelities is a set of K × M probability distributions ν = (ν_{a,m})_{a∈[K], m∈[M]}, where ν_{a,m} has mean µ_{a,m}.
For each arm a ∈ [K], µ_{a,m} represents the mean value of an observation of arm a using fidelity m, and we let µ = (µ_{a,m})_{a∈[K], m∈[M]}. An observation at fidelity m is assigned a (known) cost λ_m ≥ 0, with λ_1 < λ_2 < ... < λ_M. The goal is to identify the arm that has the largest mean at the highest fidelity M, a⋆(µ) := argmax_{a∈[K]} µ_{a,M} (sometimes denoted by ⋆ in the sequel to ease notation), with a small total sampling cost, by exploring the arms at different fidelities and using some prior knowledge about their precision. Specifically, we assume that there are some (known) values ξ_1 > ξ_2 > ... > ξ_M = 0 such that, for every arm a ∈ [K], the vector µ_a := (µ_{a,m})_{m∈[M]} satisfies

∀ m ∈ [M],  |µ_{a,m} - µ_{a,M}| ≤ ξ_m .

We write µ_a ∈ MF to indicate that arm a satisfies these multi-fidelity constraints, with these particular parameters ξ_m (although they are not shown in the notation). In this paper, we consider arms that belong to a canonical exponential family [2]. This includes, e.g., arms that have Bernoulli distributions or Gaussian distributions with known variances. Such models are known to be characterized by their means, and we refer to such an exponential multi-fidelity bandit model ν using the means of its arms µ, which belongs to the set M_MF := {µ ∈ Θ^{K×M} : ∀ a ∈ [K], µ_a ∈ MF}, where Θ ⊆ R is the interval of possible means.

At each interaction round t = 1, 2, ..., the agent selects an arm A_t and a fidelity M_t, observes a sample X_t ~ ν_{A_t, M_t} and pays a cost λ_{M_t}. Letting F_t = σ(A_1, M_1, X_1, ..., A_t, M_t, X_t) be the sigma-field generated by the observations up to time t, a fixed-confidence identification algorithm takes as input a risk parameter δ ∈ (0, 1) and is defined by the following ingredients: (i) a sampling rule (A_t, M_t)_t, where (A_t, M_t) is F_{t-1}-measurable, (ii) a stopping rule τ_δ, which is a stopping time w.r.t. F_t, and (iii) a decision rule â_{τ_δ} ∈ [K], which is F_{τ_δ}-measurable. We want to build strategies that ensure P_µ(â_{τ_δ} ≠ a⋆(µ)) ≤ δ for all µ ∈ M_MF with a unique optimal arm. Such a strategy is called δ-correct. Among δ-correct strategies, we are looking for strategies that minimize the expected identification cost (i.e., cost complexity), defined as

E_µ[c_{τ_δ}] := Σ_{a∈[K]} Σ_{m∈[M]} λ_m E_µ[N_{a,m}(τ_δ)] = Σ_{a∈[K]} Σ_{m∈[M]} E_µ[C_{a,m}(τ_δ)],

where N_{a,m}(t) denotes the number of pulls of arm a at fidelity m up to time t and C_{a,m}(t) = λ_m N_{a,m}(t) denotes the cost associated to these pulls. In the sequel, we will provide cost complexity guarantees for multi-fidelity instances µ that belong to the set M⋆_MF of multi-fidelity instances with a unique optimal arm, i.e., for which |a⋆(µ)| = 1. We remark that for M = 1 and λ_m = 1 we recover the best arm identification problem in a classical bandit model, for which the cost complexity coincides with the sample complexity E_µ[τ_δ].

Additional notation  Given an integer n ∈ N, we denote by Δ_n the n-dimensional simplex. Furthermore, given x, y ∈ (0, 1), we define kl(x, y) = x log(x/y) + (1 - x) log((1 - x)/(1 - y)). Given (p, q) ∈ Θ², we denote by d(p, q) the Kullback-Leibler (KL) divergence between the distribution in the exponential family with mean p and that with mean q. We also write d-(x, y) = d(x, y) 1{x ≥ y} and d+(x, y) = d(x, y) 1{x ≤ y}. Finally, we denote by v(p) the variance of the distribution with mean p.

3 On the cost complexity of multi-fidelity best-arm identification

In this section, we discuss the statistical complexity of identifying the best arm in MF-BAI problems. Formal proofs of the claims of this section are presented in Appendix B.
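The divergences just introduced are all that the algorithms in this paper need from the exponential family. As a concrete instance, the sketch below (an illustration, not the paper's code) implements kl, d, d+, d- and v for Gaussian arms with a known variance; for Bernoulli arms one would instead use the kl function for d and set v(p) = p(1 - p).

```python
import numpy as np

SIGMA2 = 1.0  # known variance of the Gaussian arms (assumption of this sketch)

def d(p, q):
    """KL divergence between Gaussians with means p, q and variance SIGMA2."""
    return (p - q) ** 2 / (2 * SIGMA2)

def d_plus(p, q):
    """d(p, q) restricted to the direction p <= q, i.e. d(p, q) * 1{p <= q}."""
    return d(p, q) if p <= q else 0.0

def d_minus(p, q):
    """d(p, q) restricted to the direction p >= q, i.e. d(p, q) * 1{p >= q}."""
    return d(p, q) if p >= q else 0.0

def v(p):
    """Variance function of the exponential family (constant for Gaussians)."""
    return SIGMA2

def kl(x, y):
    """Bernoulli KL divergence kl(x, y), used in the delta-correctness arguments."""
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
```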
3.1 Lower bound on the cost complexity

We present an instance-dependent lower bound on the expected cost complexity. The lower bound uses the solution to an optimization problem, where the functions optimized quantify the trade-off between the information gained by pulling an arm at some fidelity and the cost of that fidelity. Since those functions also appear in our algorithm, we will now introduce notation for them. For all ω ∈ Δ_{K×M} and µ ∈ Θ^{K×M}, we define

f_{i,j}(ω, µ) := inf_{θ_i ∈ MF, θ_j ∈ MF : θ_{j,M} ≥ θ_{i,M}}  Σ_{a∈{i,j}} Σ_{m∈[M]} (ω_{a,m}/λ_m) d(µ_{a,m}, θ_{a,m}) ,   (1)
F(ω, µ) := max_{i∈[K]} min_{j≠i} f_{i,j}(ω, µ) .   (2)

The quantity f_{i,j}(ω, µ) is the dissimilarity, according to a KL weighted by the costs, between µ and the closest θ ∈ Θ^{K×M} such that arms i and j satisfy the multi-fidelity constraints and θ_k = µ_k for k ∉ {i, j}, with arm j better than arm i. If µ ∈ M_MF then that closest θ is also in M_MF, but otherwise it might not be the case: if an arm k ∉ {i, j} is not in MF for µ, then it is equally not in MF for θ. For µ ∈ M_MF the maximum in the definition of F is realized at the best arm ⋆, as min_{a≠i} f_{i,a}(ω, µ) is zero for i ≠ ⋆. That is, F(ω, µ) = min_{j≠⋆} f_{⋆,j}(ω, µ). We define F with a maximum over i, and not with that last expression, because we want to define it for all points in Θ^{K×M}, even the points which are not in M_MF. For those points, we could imagine different notions of best arm, for example argmax_k µ_{k,M}, but the right one for our algorithm is the arm for which we have the most evidence (weighted by cost) to say that all other arms are not better. That arm is the argmax in our definition of F. Given these definitions, we now introduce our new lower bound.

Theorem 3.1. Let δ ∈ (0, 1). For any δ-correct strategy and any multi-fidelity bandit model µ ∈ M⋆_MF, it holds that

E_µ[c_{τ_δ}] ≥ C*(µ) log(1/(2.4 δ)) ,   (3)

where C*(µ)^{-1} := sup_{ω∈Δ_{K×M}} F(ω, µ) = sup_{ω∈Δ_{K×M}} min_{a≠⋆} f_{⋆,a}(ω, µ).

The quantity C*(µ) describes the statistical complexity of an MF problem µ as the typical max-min game that appears in lower bounds for BAI problems [see, e.g., 8, 3]. Specifically, first the max-player chooses a vector ω ∈ Δ_{K×M}, and then the min-player chooses a bandit model θ ∈ M_MF in which the optimal arm is different, with the goal of minimizing the function F(ω, µ). Following the methods from previous work, the objective value for ω and θ should be Σ_{a∈[K]} Σ_{m∈[M]} ω_{a,m} d(µ_{a,m}, θ_{a,m})/λ_m, featuring a sum over all arms and fidelities. However, in the definition of f_{i,j}(ω, µ) we restrict θ to be different from µ on only two arms. We can prove that if µ ∈ M_MF, this gives the same objective value at the minimizing θ as the full sum. The difference will be important in our algorithm, which will compute that minimizer for points µ̂ that do not belong to M_MF.

A difference with standard BAI settings is that in Equation (1) each ω ∈ Δ_{K×M} should be interpreted as a vector of cost proportions that the max-player is investing (in expectation) in each arm-fidelity pair to identify the optimal arm µ_{⋆,M}.² We can interpret the oracle weights ω* ∈ argmax_{Δ_{K×M}} F(ω, µ) as the optimal cost proportions that the agent should follow in order to identify µ_{⋆,M} while minimizing the identification cost. To clarify the difference and the relationship between cost and pull proportions, we notice that, given a cost proportion ω, it is always possible to compute the pull proportions π(ω) ∈ Δ_{K×M} that the agent should play in order to incur the cost proportions specified by ω, and vice versa. More specifically, these relationships are described for each arm-fidelity pair by the following equations, for every a ∈ [K] and m ∈ [M]:

π_{a,m}(ω) = (ω_{a,m}/λ_m) / ( Σ_{i∈[K]} Σ_{j∈[M]} ω_{i,j}/λ_j ) ,    ω_{a,m}(π) = λ_m π_{a,m} / ( Σ_{i∈[K]} Σ_{j∈[M]} λ_j π_{i,j} ) .   (4)
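As a quick illustration of the conversion in Equation (4), the following sketch (with made-up numbers for K = 2 arms and M = 2 fidelities) maps a cost-proportion vector to the induced pull proportions and verifies that the two maps are inverses of each other.

```python
import numpy as np

lam = np.array([0.5, 5.0])       # fidelity costs (illustrative)
omega = np.array([[0.1, 0.3],    # cost proportions, one row per arm
                  [0.2, 0.4]])   # entries sum to 1

def cost_to_pull(omega, lam):
    """Equation (4): pi_{a,m} proportional to omega_{a,m} / lam_m."""
    unnorm = omega / lam          # broadcasting divides each column m by lam[m]
    return unnorm / unnorm.sum()

def pull_to_cost(pi, lam):
    """Inverse map: omega_{a,m} proportional to lam_m * pi_{a,m}."""
    unnorm = pi * lam
    return unnorm / unnorm.sum()

pi = cost_to_pull(omega, lam)
print(np.round(pi, 3), pi.sum())                  # pull proportions, summing to 1
print(np.allclose(pull_to_cost(pi, lam), omega))  # True: the two maps are inverses
```

For the same cost share, a cheaper fidelity thus receives proportionally more pulls than an expensive one.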
As a direct consequence, it is possible to rewrite C*(µ)^{-1} as a function of π, the pull proportions. Doing so reveals that the minimizer θ in f_{⋆,j} does not depend on the costs: it is also the minimizer of Σ_{a∈{i,j}} Σ_{m∈[M]} π_{a,m} d(µ_{a,m}, θ_{a,m}). While the agent optimizes the cost proportions ω to get the best possible information/cost ratio, the min-player minimizes only the information available to the algorithm to tell µ and θ apart. Finally, we notice that F(ω, µ) is concave in ω,³ but F(ω(π), µ) is not concave in π. As we shall see in Section 4, this difference will play a crucial role in constructing an asymptotically optimal algorithm. The formulation of the lower bound as a game where one player maximizes an information/cost ratio while the other player minimizes information makes our result close to lower bounds for regret minimization like the one of [5], where the (unknown) gap of an arm plays the role of the cost.

Comparison to previous work  The only known lower bound for the multi-fidelity BAI problem is the one presented in [25]. That same bound was then shown in [31]. The bound from those previous works is looser than Theorem 3.1. For example, in a two-arm bandit with a single fidelity (denoted by M) and Gaussian rewards with variance 1, the bound from previous work is λ_M (µ_{1,M} - µ_{2,M})^{-2} log(1/(2.4δ)), while our lower bound is 8 λ_M (µ_{1,M} - µ_{2,M})^{-2} log(1/(2.4δ)). Furthermore, on particular instances with 2 arms and 2 fidelities, we can prove that our lower bound improves by a factor λ_M/λ_1, which can be arbitrarily large (see Appendix B.2). More generally, the proof of the previous lower bounds exhibits a particular point in the alternative, which makes it always looser than our bound, which features an infimum over all points. Theorem 3.1 is also optimal in the regime δ → 0 since it is matched by the algorithm we introduce in the next section.

²This claim is evident when looking at the proof of Theorem 3.1.
³This is a consequence of the fact that F(ω, µ) is an infimum over linear functions of ω.

3.2 Sparsity of the oracle weights: a tight concept of optimal fidelity

We conclude our study of the lower bound by further analyzing the optimal allocation ω*. Unlike in the standard best arm identification problem, we did not find an efficient algorithm to compute it, which prevents us from using a Track-and-Stop-like approach [8]. Nevertheless, we will explain in the next section how to efficiently compute the f_{i,j} functions and their gradient. These computations are crucial for our algorithm but also allow us to prove our next result about the possible sparsity of ω*. For each arm a ∈ [K], it is not difficult to show that there must exist some fidelity m ∈ [M] for which ω*_{a,m} > 0 (Lemma B.2). However, as the following result highlights, in most cases only one fidelity per arm has non-zero weight.

Theorem 3.2. Let Δ*_{K×M}(µ) := argmax_{ω∈Δ_{K×M}} F(ω, µ) and

M̃_MF := { µ ∈ M⋆_MF : ∃ i ∈ [K], ∃ (m_1, m_2) ∈ [M]² with m_1 ≠ m_2, ∃ ω* ∈ Δ*_{K×M}(µ) : ω*_{i,m_1} > 0, ω*_{i,m_2} > 0 } .

The set M̃_MF is a subset of R^{K×M} whose Lebesgue measure is zero.

Theorem 3.2 implies that in almost all multi-fidelity bandits, for any ω* ∈ Δ*_{K×M}(µ) and each arm a ∈ [K], there exists a single fidelity m*_a ∈ [M] for which ω*_{a, m*_a} > 0 holds.
However, we note that this result does not offer an easy way to compute these optimal arm-dependent fidelities.⁴ Nevertheless, as we shall see in the next section, our algorithm does not actually require identifying these optimal fidelity levels to enjoy optimality guarantees. Finally, we remark that existing MF-BAI works [25, 31] already proposed notions of optimal, arm-dependent fidelity that the agent should employ to identify the optimal arm ⋆. Nevertheless, as we verify in Appendix B.5, these concepts do not comply with the concept of optimal fidelity that arises from the tight lower bound of Theorem 3.1. In other words, there exist bandit models µ in which following these alternative concepts of optimal fidelity leads to sub-optimal performance.

4 The multi-fidelity sub-gradient ascent algorithm

We present our solution for solving MF-BAI problems, an algorithm called Multi-Fidelity Sub-Gradient Ascent (MF-GRAD). Its pseudocode can be found in Algorithm 1. All proofs for this section are presented in Appendix C.

A reader familiar with the literature on BAI algorithms inspired from lower bounds like Theorem 3.1 may have the natural idea of simply using the Track-and-Stop algorithm [8] or the related game-based algorithm of [4]. Those algorithms can't be directly applied here, first because of the costs: we want to bound the cost complexity, not the stopping time, and adapting those methods to costs is not trivial. Furthermore, Track-and-Stop (even in the cost-aware variant of [13]) would require the computation of the optimal cost proportions at µ̂(t), which is a max-min problem for which we don't have an efficient algorithm. Our solution is inspired by the gradient ascent algorithm of [22], which requires computing gradients of F (hence only a minimization problem and not a max-min). The same innovations required to extend this method to the multi-fidelity case could likely allow us to adapt the algorithm of [4], or the exploration part of the regret-minimizing algorithm of [5].

Let us introduce some auxiliary notation. Let ω̄ ∈ Δ_{K×M} be the uniform vector (1/(KM), ..., 1/(KM)). For all t ∈ N, we define Clip_t(x) = (min{x_{a,m}, G√t})_{a∈[K], m∈[M]} for an arbitrary constant G > 0. We also define α_t = 1/√t and γ_t = 1/(4√t). Finally, for all t ∈ N, we denote by C(t) ∈ R^{KM} the vector whose (a, m)-th dimension is given by C_{a,m}(t). We now present Algorithm 1.

Sampling rule  After a first initialization phase in which the algorithm pulls each arm at each fidelity once (Line 1), the agent starts its sub-gradient ascent routine. More specifically, at each iteration t ∈ N, the agent first computes the vector ω̃(t + 1) using the Exponential Weights algorithm on the sequence of gain functions {g_s}_{s=1}^{t} := {Clip_s( (Σ_{a,m} λ_m π̃_{a,m}(s)) ∇F(ω̃(s), µ̂(s)) )}_{s=1}^{t}, where π̃(t) := π(ω̃(t)) represents the pull proportions induced by ω̃(t) and ∇F(ω̃(s), µ̂(s)) denotes a sub-gradient of F(ω, µ) w.r.t. ω (Line 3). Neglecting for a moment the clipping function and the term Σ_{a,m} λ_m π̃_{a,m}(s) (these terms are present mainly for technical reasons), this step can be interpreted, from an intuitive perspective, as finding a sequence of weights {ω̃(t)}_t that minimizes the regret on the sequence of empirical losses F(ω*, µ̂(s)) - F(ω̃(s), µ̂(s)).⁵

⁴We provide insights on cases in which it is possible to compute the optimal fidelity in Appendix B.4.
⁵Whenever µ̂(s) is sufficiently close to µ, this implicitly generates a sequence of weights that provide values of F(·, µ) "close" to the one of the oracle weights ω*.
Algorithm 1 Multi-Fidelity Sub-Gradient Ascent
1: Initialization. Pull each arm at each fidelity once, and set ω̃(t) = ω̄ for all t ∈ {1, ..., KM}
2: Sampling Rule (for t ≥ KM)
3:   Sub-gradient Ascent: compute ω̃(t + 1) with Exponential Weights (learning rate α_{t+1}) on the gains g_s = Clip_s( (Σ_{a,m} λ_m π̃_{a,m}(s)) ∇F(ω̃(s), µ̂(s)) ), for s = KM, ..., t
4:   From Costs to Pulls: π̃_{a,m}(t + 1) = (ω̃_{a,m}(t + 1)/λ_m) / ( Σ_{i∈[K]} Σ_{j∈[M]} ω̃_{i,j}(t + 1)/λ_j ), for all a ∈ [K], m ∈ [M]
5:   Forced Exploration: π̃^γ(t + 1) = (1 - γ_t) π̃(t + 1) + γ_t ω̄
6:   Cumulative Tracking: (A_{t+1}, M_{t+1}) ∈ argmax_{(a,m)∈[K]×[M]} Σ_{s=1}^{t} π̃^γ_{a,m}(s) - N_{a,m}(t)
7: Stopping Rule: τ_δ = inf{ t ≥ KM : max_{i∈[K]} min_{j≠i} f_{i,j}(C(t), µ̂(t)) > β_{t,δ} }
8: Decision Rule: â_{τ_δ} ∈ argmax_{i∈[K]} min_{j≠i} f_{i,j}(C(τ_δ), µ̂(τ_δ))

At this point, once ω̃(t + 1) is computed, Algorithm 1 converts these cost proportions into pull proportions while adding some forced exploration (Lines 4-5), and then it applies a standard cumulative tracking procedure [8] in the pull-proportion space, so as to ensure that N_{a,m}(t) tracks Σ_{s=1}^{t} π̃^γ_{a,m}(s) (Line 6).

Stopping and decision rule  Finally, the algorithm applies a generalized likelihood ratio (GLR) test to decide when to stop (Line 7). For i, j ∈ [K], f_{i,j}(C(t), µ̂(t)) can be interpreted as a GLR statistic for comparing two classes: Θ^{K×M} versus {θ : θ_i ∈ MF, θ_j ∈ MF, θ_{j,M} ≥ θ_{i,M}}. If that GLR is large enough (if it exceeds a threshold β_{t,δ}), we can reject the hypothesis that µ belongs to the second class. If there is an arm i for which we can reject the alternative class for all j ≠ i, we have rejected all θ ∈ M_MF where i is not the best arm, and we can safely stop and return the answer â_{τ_δ} = i. Since each f_{i,j} is expressed as a sum over only two arms and M fidelities, it is possible to show that choosing β_{t,δ} ≈ log(K/δ) + 2M log(log(t) + 1) (see its exact expression in (31)) guarantees the correctness of the test, namely that P_µ(â_{τ_δ} ≠ ⋆) ≤ δ holds (Proposition C.13).

4.1 Theoretical guarantees

At this point, we are ready to state the main theoretical result on the performance of our algorithm.

Theorem 4.1. For any multi-fidelity bandit model µ ∈ M_MF, Algorithm 1 using the threshold β_{t,δ} given in (31) is δ-correct and satisfies

lim sup_{δ→0}  E_µ[c_{τ_δ}] / log(1/δ)  ≤  C*(µ).   (5)

As we can see from Theorem 4.1, Algorithm 1 is asymptotically optimal, meaning that it matches the lower bound we presented in Theorem 3.1 in the asymptotic regime δ → 0.

Comparison with existing MF-BAI algorithms  We conclude this section by comparing our results with the literature [25, 31]. First, [25] and [31] rely on additional assumptions that play a crucial role both for the algorithm design and the resulting theoretical guarantees. In [25], the authors enforce an additional and intricate structural assumption on the relationships between the λ's and the ξ's (see Assumption 1 in [25]). In [31], instead, the authors assume additional knowledge expressed as an upper bound on µ_{⋆,M} and a lower bound on max_{i≠⋆} µ_{i,M}. For both works, whenever these assumptions are not satisfied (i.e., the λ's and ξ's do not respect Assumption 1 in [25], and the knowledge on µ_{⋆,M} and max_{i≠⋆} µ_{i,M} is imprecise or not available), the theoretical guarantees offered by existing algorithms are arbitrarily sub-optimal. On the other hand, our algorithm requires no additional assumptions and is the only one that matches exactly the cost complexity lower bound. Indeed, neither the cost upper bound of [25] nor the one of [31] matches the lower bound of Theorem 3.1, even when their additional hypotheses are satisfied.
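Before turning to how the sub-gradient itself is computed, the following schematic sketch makes Lines 3-6 of Algorithm 1 concrete. It is an illustration of the structure of the update, not a faithful implementation of the paper's algorithm: the sub-gradient `grad` is assumed to be provided (Theorem 4.3 below explains how to compute it), the exponential-weights step is written in its multiplicative form, and the step-size and clipping schedules are those assumed above.

```python
import numpy as np

def mf_grad_step(t, gain_sum, pi_sum, N, omega_tilde, grad, lam, K, M, Gclip=10.0):
    """Schematic version of Lines 3-6 of Algorithm 1 (one sampling step).

    gain_sum:    running sum of past clipped gain vectors g_s (K x M), updated here
    pi_sum:      running sum of exploration-mixed pull proportions (K x M)
    N:           current pull counts N_{a,m}(t) (K x M)
    omega_tilde: current cost-proportion iterate (K x M)
    grad:        a sub-gradient of F(., mu_hat(t)) at omega_tilde (K x M)
    lam:         fidelity costs (length M)
    """
    alpha = 1.0 / np.sqrt(t + 1)         # learning rate (assumed schedule)
    gamma = 1.0 / (4.0 * np.sqrt(t))     # forced exploration rate (assumed schedule)
    uniform = np.full((K, M), 1.0 / (K * M))

    # pull proportions induced by the current cost proportions, Equation (4)
    pi_tilde = (omega_tilde / lam) / np.sum(omega_tilde / lam)

    # Line 3: exponential-weights update on the clipped gains
    gain = np.minimum(np.sum(lam * pi_tilde) * grad, Gclip * np.sqrt(t))
    gain_sum = gain_sum + gain
    logits = alpha * gain_sum
    omega_new = np.exp(logits - logits.max())
    omega_new /= omega_new.sum()

    # Line 4: convert the new cost proportions into pull proportions
    pi_new = (omega_new / lam) / np.sum(omega_new / lam)

    # Line 5: forced exploration towards the uniform vector
    pi_new = (1.0 - gamma) * pi_new + gamma * uniform

    # Line 6: cumulative tracking, pull the most under-sampled (arm, fidelity) pair
    pi_sum = pi_sum + pi_new
    a, m = np.unravel_index(np.argmax(pi_sum - N), (K, M))
    return (a, m), omega_new, gain_sum, pi_sum

# Minimal usage with dummy state and a dummy sub-gradient (illustrative only).
K, M = 2, 2
lam = np.array([0.5, 5.0])
choice, *_ = mf_grad_step(t=K * M, gain_sum=np.zeros((K, M)), pi_sum=np.zeros((K, M)),
                          N=np.ones((K, M)), omega_tilde=np.full((K, M), 1 / (K * M)),
                          grad=np.ones((K, M)), lam=lam, K=K, M=M)
print("next (arm, fidelity) to pull:", choice)
```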
4.2 Computing the gradient of F(ω, µ)

Algorithm 1 requires computing a sub-gradient of F(ω, µ). Notably, we remark that this is needed for a generic µ ∈ Θ^{K×M}, as µ̂(t) might violate the fidelity constraints due to inaccurate estimations or degenerate cases in which the multi-fidelity constraints are attained with equality. In this section, we provide an efficient algorithm for the computation of the sub-gradient, which arises from a more in-depth study of the function F(ω, µ). To this end, we begin by presenting an intermediate characterization of the functions f_{i,j}(ω, µ) that define F(ω, µ).

Lemma 4.2. Consider µ ∈ Θ^{K×M} and ω ∈ Δ_{K×M}. Define, for k ∈ [K],

ψ*_k := argmin_{ψ∈R} Σ_{m=1}^{M} (ω_{k,m}/λ_m) [ d-(µ_{k,m}, ψ + ξ_m) + d+(µ_{k,m}, ψ - ξ_m) ] .

Then, the following holds:

f_{i,j}(ω, µ) = Σ_{k∈{i,j}} Σ_{m=1}^{M} (ω_{k,m}/λ_m) [ d-(µ_{k,m}, ψ*_k + ξ_m) + d+(µ_{k,m}, ψ*_k - ξ_m) ]   if ψ*_j > ψ*_i ,   (6)
f_{i,j}(ω, µ) = inf_{η∈R} Σ_{k∈{i,j}} Σ_{m=1}^{M} (ω_{k,m}/λ_m) [ d-(µ_{k,m}, η + ξ_m) + d+(µ_{k,m}, η - ξ_m) ]   otherwise.   (7)

We further introduce η*_{i,j} as the minimizer in the expression in (7).⁶ When µ ∈ M_MF, we can show that ψ*_k = µ_{k,M} for all k and, due to the multi-fidelity constraints, the expression in (6) is always equal to zero. Hence, in both cases f_{i,j}(ω, µ) is equal to the expression in (7), which can be rewritten as

f_{i,j}(ω, µ) = 1{µ_{i,M} ≥ µ_{j,M}} Σ_{k∈{i,j}} Σ_{m=1}^{M} (ω_{k,m}/λ_m) [ d-(µ_{k,m}, η*_{i,j} + ξ_m) + d+(µ_{k,m}, η*_{i,j} - ξ_m) ] .

This quantity can be interpreted as the transportation cost for making µ_{j,M} larger than µ_{i,M}. When µ_i ∉ MF or µ_j ∉ MF, if ψ*_j > ψ*_i, then f_{i,j}(ω, µ) is equal to the expression in (6), which can be interpreted as a transportation cost towards an alternative in which i and j satisfy the multi-fidelity constraints. Using this preliminary result, we provide a precise expression for the sub-gradient of F(ω, µ).

Theorem 4.3. Consider µ ∈ Θ^{K×M} and ω ∈ Δ_{K×M} such that F(ω, µ) > 0 holds. Let (i, a) ∈ [K]² be a pair of arms that attains the max-min value in Equation (2). Then a sub-gradient ∇F(ω, µ) of F(ω, µ) w.r.t. ω is given by one of the two following expressions: for j ∈ {a, i} and m ∈ [M],

∇F(ω, µ)_{j,m} = [ d+(µ_{j,m}, η*_{i,a} - ξ_m) + d-(µ_{j,m}, η*_{i,a} + ξ_m) ] / λ_m   if ψ*_i ≥ ψ*_a ,   (8)
∇F(ω, µ)_{j,m} = [ d+(µ_{j,m}, ψ*_j - ξ_m) + d-(µ_{j,m}, ψ*_j + ξ_m) ] / λ_m   otherwise.   (9)

That sub-gradient ∇F(ω, µ) is 0 in all the remaining KM - 2M dimensions.

Theorem 4.3 shows how to compute a sub-gradient of F(ω, µ) under the mild assumption that F(ω, µ) > 0.⁷ More specifically, it is sufficient to consider the pair (i, a) that attains the max-min value in Equation (2), and then test whether ψ*_i ≥ ψ*_a holds to choose which expression to use among Equations (8) and (9). An interesting interpretation of the sub-gradient expression is that, whenever ψ*_i ≥ ψ*_a, the sub-gradient points toward the direction of the space that aims at increasing the information to discriminate the eventual optimality of arm a against i. On the other hand, whenever ψ*_a > ψ*_i holds, the sub-gradient points towards the direction of minimizing errors in the multi-fidelity constraints for arm i and arm a (if any).

⁶To ease the notation, we omit (most of the time) the dependence of ψ*_k and η*_{i,j} on ω and µ.
⁷As we discuss in Remark C.8, F(ω̃(t), µ̂(t)) = 0 is a rare condition, and, whenever it happens, it is possible to alter Algorithm 1 slightly without affecting its theoretical guarantees.
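As an illustration of Lemma 4.2 for Gaussian arms, ψ*_k can be obtained by a generic one-dimensional convex minimization; the sketch below (with illustrative names and numbers, and a generic bounded scalar solver rather than the closed-form characterization given in Appendix C) keeps the structure visible. The sub-gradient coordinates of Theorem 4.3 are then read off by evaluating d+ and d- at ψ*_j or η*_{i,a} and dividing by λ_m.

```python
import numpy as np
from scipy.optimize import minimize_scalar

SIGMA2 = 1.0  # known Gaussian variance (assumption of this sketch)

def d(p, q):
    return (p - q) ** 2 / (2 * SIGMA2)

def d_plus(p, q):
    return d(p, q) if p <= q else 0.0

def d_minus(p, q):
    return d(p, q) if p >= q else 0.0

def transport_cost(psi, mu_k, omega_k, lam, xi):
    """Objective of Lemma 4.2 for one arm k: sum_m (w/lam)[d^-(mu, psi+xi) + d^+(mu, psi-xi)]."""
    return sum(w / l * (d_minus(m_, psi + x) + d_plus(m_, psi - x))
               for m_, w, l, x in zip(mu_k, omega_k, lam, xi))

def psi_star(mu_k, omega_k, lam, xi):
    """One-dimensional convex minimization defining psi*_k (generic bounded solver)."""
    lo, hi = np.min(mu_k - xi), np.max(mu_k + xi)
    res = minimize_scalar(transport_cost, bounds=(lo, hi), method="bounded",
                          args=(mu_k, omega_k, lam, xi))
    return res.x

# Tiny illustrative instance: one arm observed at two fidelities.
mu_k = np.array([0.65, 0.55])
xi = np.array([0.1, 0.0])
lam = np.array([0.5, 5.0])
omega_k = np.array([0.3, 0.2])
print(psi_star(mu_k, omega_k, lam, xi))  # close to 0.55: the MF constraints can be met exactly
```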
Computing the sub-gradient efficiently  To conclude, we notice that to compute a sub-gradient, it is required to compute ψ*_k for every arm k and η*_{i,j} for all pairs of arms such that ψ*_i ≥ ψ*_j. Using their definitions, this requires solving O(K²) one-dimensional optimization problems of functions that involve O(M) variables, which leads to a computational complexity of roughly O(K²Mn), where n is the number of iterations of the convex solver. In the following, we show that it is possible to exploit the structure of the f_{i,j}'s to obtain an algorithm whose total complexity is O(K²M²) and that does not suffer from any approximation error due to the optimization procedure. Specifically, we now present a result that shows how to compute η*_{i,j}. A similar result holds also for ψ*_k and is deferred to Appendix C.

Lemma 4.4. Consider µ ∈ Θ^{K×M} and ω ∈ Δ_{K×M} such that f_{i,j}(ω, µ) > 0. Suppose that ψ*_i ≥ ψ*_j holds. Then, there exists a unique minimizer η*_{i,j}(ω) of Equation (7), which is the unique solution of the following equation in η:

η = [ Σ_{a∈{i,j}} Σ_{m∈[M]} (ω_{a,m}/λ_m) ( k̄_{a,m}(η) (µ_{a,m} + ξ_m)/v(η - ξ_m) + k_{a,m}(η) (µ_{a,m} - ξ_m)/v(η + ξ_m) ) ] / [ Σ_{a∈{i,j}} Σ_{m∈[M]} (ω_{a,m}/λ_m) ( k̄_{a,m}(η)/v(η - ξ_m) + k_{a,m}(η)/v(η + ξ_m) ) ] ,   (10)

where k̄_{a,m}(x) = 1{x ≥ µ_{a,m} + ξ_m} and k_{a,m}(x) = 1{x ≤ µ_{a,m} - ξ_m}.

From Lemma 4.4, to compute η*_{i,j} it is sufficient to find the unique solution to the fixed-point equation given in (10). To do this efficiently, we observe that the right-hand side of Equation (10) depends on η only through the indicator functions k̄_{a,m}(η) and k_{a,m}(η), which can only take a finite number of values. Hence, it is sufficient to evaluate the right-hand side at an arbitrary point within a given interval where the values of the indicator functions do not change. If the resulting value is within the considered interval, then this value is our fixed point. Since there are at most O(M) candidate fixed points, this procedure takes at most O(M²) steps.

Computational complexity remark  It follows that the per-iteration computational complexity of Algorithm 1 is O(K²M²). The computationally efficient technique explained above indeed applies not only to the sampling rule but also to the stopping and the decision rules.⁸

⁸Given the definition of f_{i,j}, it is sufficient to replace ω_{a,m}/λ_m in Equation (10) with N_{a,m}(t).
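The candidate-interval search described above is easy to implement in the Gaussian case, where v(·) is constant and the right-hand side of Equation (10) is therefore piecewise constant between the breakpoints µ_{a,m} ± ξ_m. The sketch below is an illustration with an assumed API (the two arms are passed as 2 × M arrays); it enumerates the intervals and returns the candidate that falls inside its own interval.

```python
import numpy as np

SIGMA2 = 1.0  # Gaussian case: the variance function v(.) is the constant SIGMA2

def eta_star(mu_pair, omega_pair, lam, xi):
    """Candidate-interval search for the fixed point eta*_{i,j} of Equation (10)."""
    M = len(lam)
    terms = [(mu_pair[a, m], omega_pair[a, m] / lam[m], xi[m])
             for a in range(2) for m in range(M)]
    # Breakpoints where some indicator kbar or kunder changes value.
    breaks = sorted({mu + s * x for mu, _, x in terms for s in (-1.0, 1.0)})
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        mid = 0.5 * (lo + hi)          # indicators are constant on (lo, hi)
        num = den = 0.0
        for mu, w, x in terms:
            if mid >= mu + x:          # indicator kbar_{a,m}
                num += w * (mu + x) / SIGMA2
                den += w / SIGMA2
            if mid <= mu - x:          # indicator kunder_{a,m}
                num += w * (mu - x) / SIGMA2
                den += w / SIGMA2
        if den > 0 and lo <= num / den <= hi:
            return num / den           # candidate lies in its own interval: fixed point
    return None                         # degenerate case (e.g., f_{i,j} = 0)

# Illustrative pair of arms (2 x 2 arrays): arm i on the first row, arm j on the second.
mu_pair = np.array([[0.65, 0.60],
                    [0.45, 0.50]])
omega_pair = np.array([[0.2, 0.3],
                       [0.1, 0.4]])
print(eta_star(mu_pair, omega_pair, lam=np.array([0.5, 5.0]), xi=np.array([0.1, 0.0])))
```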
5 Numerical experiments

We conclude this work by presenting numerical simulations whose goal is to show the empirical benefits of our approach. We compare MF-GRAD against IISE [25] and the gradient approach of [22] that simply does BAI using samples collected at fidelity M. We will refer to this additional baseline as GRAD. In the following, we avoided the comparison with the multi-fidelity algorithms in [31] as we ran into issues when doing experiments. We elaborate more on this point in Appendix D.6, where we provide numerical evidence of the fact that those algorithms might fail to stop, together with an argument that shows a mistake in the proofs of [31]. Given this setup, first, we test all methods on a 4 × 5 multi-fidelity bandit with Gaussian arms that have been randomly generated, using a risk parameter δ = 0.01. Due to space constraints and for the sake of exposition, we refer the reader to Appendix D.1 for the values of µ, the ξ's and the λ's, and for details on the stopping rule calibration. We report the empirical distribution of the resulting cost complexities in Figure 1. As one can verify, MF-GRAD obtains the most competitive performance. Experiments on additional 4 × 5 bandits that are reported in Appendix D.3 provide a similar conclusion. Furthermore, to illustrate the sub-optimality of IISE and GRAD from an intuitive perspective, we test our algorithm on a simple 5 × 2 instance that allows to easily understand why existing methods underperform MF-GRAD.

Figure 1: Empirical cost complexity for 1000 runs with δ = 0.01 on the 4 × 5 multi-fidelity bandit.
Figure 2: Empirical cost complexity for 1000 runs with δ = 0.01 on the 5 × 2 multi-fidelity bandit.
Figure 3: Empirical cost proportions of MF-GRAD for 100000 iterations on the 5 × 2 bandit model. Results are averaged over 100 runs and shaded areas report 95% confidence intervals. Empirical cost proportions of a given arm are plotted with the same color. Cost proportions at fidelity 1 and 2 are visualized with a circle and a square, respectively.

Specifically, we consider µ_i = [0.4, 0.5] for all i ∈ [4], µ_5 = [0.5, 0.6], λ = [0.5, 5], ξ = [0.1, 0], and we report the cost complexity of the three algorithms in Figure 2. In this case, we can prove that the optimal allocation is sparse: it is supported on fidelity m = 1 for arms i ∈ [4], and on fidelity m = 2 for arm 5. Furthermore, thanks to the symmetry of the problem, it is possible to show that ω*_i = [0.09621, 0] for all i ∈ [4], and ω*_5 = [0, 0.61516] (see Appendix D.1). As one can see, IISE obtains the worst performance in this domain. The reason is that the concept of optimal fidelity on which IISE relies is sub-optimal (i.e., according to the design principle of IISE, the optimal fidelity is m = 2 for all arms), and the algorithm, in practice, will discard sub-optimal arms using samples that have been collected only at fidelity m = 2. Nevertheless, this will only happen after a first period in which IISE tries to exploit (unsuccessfully) data at fidelity m = 1. GRAD, on the other hand, obtains sub-optimal performance since, although most of the budget should be spent on fidelity 2 (as ω*_{5,2} = 0.61516), it never pulls the cheapest (and optimal) fidelity for arms i ∈ [4]. Finally, MF-GRAD obtains the most competitive performance since, as learning progresses, its empirical cost proportions eventually approach the ones prescribed by ω*. To verify this behavior, we removed the stopping rule from MF-GRAD and let the algorithm run for 10⁵ iterations. In Figure 3, we report the entire evolution of the cost proportions during learning. As one can appreciate, at the end of this process, the empirical cost proportions of MF-GRAD approach the ones described by ω*.⁹ Finally, we also refer the reader to Appendix D for additional results (e.g., additional domains, smaller regimes of δ) and further insights.

⁹Furthermore, at the end of this period, we measured the distance between ω* and the empirical cost proportions ω(10⁵); it holds that ∥ω* - ω(10⁵)∥₂ ≈ 0.031 ± 0.0006. The error has been estimated with 100 independent runs, and 0.0006 reports the 95% confidence interval.

6 Conclusions

For fixed-confidence best arm identification in multi-fidelity bandits, we presented a lower bound on the cost complexity and an algorithm with a matching upper bound in the regime of high confidence. The algorithm uses features of the lower bound optimization problem in order to compute its updates efficiently. Unlike prior work, it does not require any assumption or prior knowledge on the bandit instance. Our work also confirmed the existence, in most cases, of an optimal fidelity to explore each arm in the asymptotic regime, and revealed that the intuitive such notions proposed in prior work were inaccurate. Yet, our algorithm does not need to identify these optimal fidelities in order to be asymptotically optimal.
This raises the following question: could the performance of the algorithm be enhanced by exploiting the sparsity pattern? We conjecture that estimating the optimal fidelities accurately may actually be harder than identifying the best arm. However, leveraging some sufficient conditions for w a,m = 0 (such as the ones given in Proposition B.6) to eliminate some fidelities and reduce the support of the forced exploration component of the algorithm seems a promising idea. A limitation of our current analysis is that it only provides asymptotic guarantees in the high confidence regime, although our experiments reveal good performance for moderate values of δ. In future work, we will seek a better understanding of the moderate confidence regime [30]. To this end, we may leverage some proof techniques from other works using online optimization that obtain finite-time bounds [4, 5]. On the lower bound side, while C (µ) essentially scales with K due to the sparsity pattern, an interesting open question is whether there is a worse case O(KM) scaling in the moderate confidence regime, indicating that all fidelities do need to be explored at least a constant amount of times. Acknowledgments and Disclosure of Funding This work was done while Riccardo Poiani was visiting the Scool team of Inria Lille. He acknowledges the funding of a MOB-LIL-EX grant from the University of Lille. Rémy Degenne and Emilie Kaufmann acknowledge the funding of the French National Research Agency under the project FATE (ANR22-CE23-0016-01) and the PEPR IA FOUNDRY project (ANR-23-PEIA-0003). Alberto Maria Metelli and Marcello Restelli acknowledge the funding of the European Union Next Generation EU within the project NRPP M4C2, Investment 1.,3 DD. 341 - 15 march 2022 FAIR Future Artificial Intelligence Research Spoke 4 - PE00000013 - D53C22002380006. [1] J-Y. Audibert, S. Bubeck, and R. Munos. Best Arm Identification in Multi-armed Bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010. [2] Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, pages 1516 1541, 2013. [3] Rémy Degenne and Wouter M Koolen. Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems, 32, 2019. [4] Rémy Degenne, Wouter M Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. Advances in Neural Information Processing Systems, 32, 2019. [5] Rémy Degenne, Han Shao, and Wouter Koolen. Structure adaptive algorithms for stochastic bandits. In International Conference on Machine Learning, pages 2443 2452. PMLR, 2020. [6] E. Even-Dar, S. Mannor, and Y. Mansour. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7:1079 1105, 2006. [7] Côme Fiegel, Victor Gabillon, and Michal Valko. Adaptive multi-fidelity optimization with fast learning rates. In International Conference on Artificial Intelligence and Statistics, pages 3493 3502. PMLR, 2020. [8] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998 1027. PMLR, 2016. [9] Deng Huang, Theodore T Allen, William I Notz, and R Allen Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32:369 382, 2006. [10] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. 
lil UCB: an Optimal Exploration Algorithm for Multi-Armed Bandits. In Proceedings of the 27th Conference on Learning Theory, 2014. [11] Kevin G. Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In AISTATS, 2016. [12] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655 662, 2012. [13] Kellen Kanarios, Qining Zhang, and Lei Ying. Cost aware best arm identification. ar Xiv preprint ar Xiv:2402.16710, 2024. [14] Kirthevasan Kandasamy, Gautam Dasarathy, Junier Oliva, Jeff Schneider, and Barnabas Poczos. Multi-fidelity gaussian process bandit optimisation. Journal of Artificial Intelligence Research, 66:151 196, 2019. [15] Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnabás Póczos. Gaussian process bandit optimisation with multi-fidelity evaluations. Advances in neural information processing systems, 29, 2016. [16] Kirthevasan Kandasamy, Gautam Dasarathy, Barnabas Poczos, and Jeff Schneider. The multifidelity multi-armed bandit. Advances in neural information processing systems, 29, 2016. [17] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. Multi-fidelity bayesian optimisation with continuous approximations. In International conference on machine learning, pages 1799 1808. PMLR, 2017. [18] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1 42, 2016. [19] Emilie Kaufmann and Wouter M Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22(246):1 44, 2021. [20] Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2019. [21] Shibo Li, Wei Xing, Robert Kirby, and Shandian Zhe. Multi-fidelity bayesian optimization via deep neural networks. Advances in Neural Information Processing Systems, 33:8521 8531, 2020. [22] Pierre Ménard. Gradient ascent for active exploration in bandit problems. ar Xiv preprint ar Xiv:1905.08165, 2019. [23] Quan Nguyen, Arghavan Modiri, and Roman Garnett. Nonmyopic multifidelity acitve search. In International Conference on Machine Learning, pages 8109 8118. PMLR, 2021. [24] Victor Picheny, David Ginsbourger, Yann Richet, and Gregory Caplin. Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2 13, 2013. [25] Riccardo Poiani, Alberto Maria Metelli, and Marcello Restelli. Multi-fidelity best-arm identification. Advances in Neural Information Processing Systems, 35:17857 17870, 2022. [26] Matthias Poloczek, Jialei Wang, and Peter Frazier. Multi-information source optimization. Advances in neural information processing systems, 30, 2017. [27] Yoan Russac, Christina Katsimerou, Dennis Bohle, Olivier Cappé, Aurélien Garivier, and Wouter M Koolen. A/b/n testing with control in the presence of subpopulations. Advances in Neural Information Processing Systems, 34, 2021. [28] Rajat Sen, Kirthevasan Kandasamy, and Sanjay Shakkottai. Multi-fidelity black-box optimization with hierarchical partitions. In International conference on machine learning, pages 4538 4547. PMLR, 2018. [29] Rajat Sen, Kirthevasan Kandasamy, and Sanjay Shakkottai. Noisy blackbox optimization using multi-fidelity queries: A tree search approach. 
In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2096–2105. PMLR, 2019.
[30] Max Simchowitz, Kevin G. Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In International Conference On Learning Theory (COLT), 2017.
[31] Xuchuang Wang, Qingyun Wu, Wei Chen, and John Lui. Multi-fidelity multi-armed bandits revisited. Advances in Neural Information Processing Systems, 36, 2024.

A Table of Symbols

Table 1 reports a summary on the main symbols and the notation used throughout the paper.

Table 1: Notation

Symbol : Meaning
K, M : Number of arms and number of fidelities
δ ∈ (0, 1) : Maximum risk parameter
τ_δ : Stopping time of an algorithm
c_{τ_δ} : Cost incurred at the stopping time τ_δ
a⋆(µ) : argmax_{a∈[K]} µ_{a,M}; often denoted simply by ⋆
â_{τ_δ} : Arm recommended by the algorithm when it stops
µ : Bandit model
µ_{i,m} : Mean of the i-th arm at fidelity m within bandit model µ
ξ_m : Precision of fidelity m, i.e., max_{i∈[K]} |µ_{i,m} - µ_{i,M}| ≤ ξ_m
λ_m : Cost incurred for gathering samples at fidelity m
Θ : Set of possible means in the exponential family
µ_a ∈ MF : Arm a satisfies the multi-fidelity constraints
µ̂(t) : Empirical bandit model at time t
M_MF : Set of multi-fidelity bandit models
M⋆_MF : Set of multi-fidelity bandit models with a unique optimal arm
d(p, q) : KL divergence between two distributions with means p, q in the exponential family
d+(p, q), d-(p, q) : d+(p, q) = d(p, q) 1{p ≤ q}, d-(p, q) = d(p, q) 1{p ≥ q}
v(p) : Variance of the distribution in the exponential family with mean parameter p
C*(µ)^{-1} : Expression that characterizes the lower bound on the cost complexity
ω, π : Vectors of cost and pull proportions, respectively
ω* : ω* ∈ argmax_{ω∈Δ_{K×M}} F(ω, µ)
f_{i,j}(ω, µ) : Dissimilarity between arms i and j defined in Equation (1)
F(ω, µ) : max_{i∈[K]} min_{j≠i} f_{i,j}(ω, µ)
Δ*_{K×M}(µ) : Set of optimal oracle weights ω* for the multi-fidelity bandit model µ
M̃_MF : Subset of multi-fidelity bandit models for which there exists a non-sparse optimal allocation ω*
ω̄ : Uniform KM-dimensional vector ((KM)^{-1}, ..., (KM)^{-1})
G > 0 : Clipping constant in Algorithm 1
α_t, γ_t : Learning rate and forced exploration rate, respectively
C(t) : Vector whose (a, m)-th dimension is C_{a,m}(t)
ω(t) : Vector of empirical cost proportions, namely ω_{a,m}(t) = C_{a,m}(t) (Σ_{i∈[K]} Σ_{j∈[M]} C_{i,j}(t))^{-1}
ψ*_i : Minimizer of Equation (6)
η*_{i,j} : Minimizer of Equation (7)
k̄_{i,m}(η) : 1{η ≥ µ_{i,m} + ξ_m}
k_{i,m}(η) : 1{η ≤ µ_{i,m} - ξ_m}

B Cost complexity lower bound: proofs and derivations

B.1 Proof of Theorem 3.1

Theorem 3.1. Let δ ∈ (0, 1). For any δ-correct strategy and any multi-fidelity bandit model µ ∈ M⋆_MF, it holds that

E_µ[c_{τ_δ}] ≥ C*(µ) log(1/(2.4 δ)) ,   (3)

where C*(µ)^{-1} := sup_{ω∈Δ_{K×M}} F(ω, µ) = sup_{ω∈Δ_{K×M}} min_{a≠⋆} f_{⋆,a}(ω, µ).

Proof. Consider δ ∈ (0, 1), a multi-fidelity bandit model µ and an alternative instance θ ∈ Alt(µ), where Alt(µ) = ∪_{a≠⋆} {θ ∈ M_MF : θ_{a,M} > θ_{⋆,M}}. Then, by applying Lemma 1 in [18], we can directly connect the expected number of draws of each arm to the KL divergence of the two multi-fidelity bandit models. More specifically, we have that

Σ_{a∈[K]} Σ_{m∈[M]} E_µ[N_{a,m}(τ_δ)] d(µ_{a,m}, θ_{a,m}) ≥ kl(δ, 1 - δ).   (11)

Then, similarly to Theorem 1 in [8], we now proceed by applying Equation (11) with all the alternative models θ ∈ Alt(µ).
Specifically, we have that: kl(δ, 1 δ) inf θ Alt(µ) m [M] Eµ[Na,m(τδ)]d(µa,m, θa,m) = Eµ[cτδ] inf θ Alt(µ) Eµ[λm Na,m(τδ)] Eµ[cτδ] d(µa,m, θa,m) Eµ[cτδ] sup ω K M inf θ Alt(µ) m [M] ωa,m d(µa,m, θa,m) = Eµ[cτδ] sup ω K M min a = inf θ MMF: θa,M>θ ,M i,m ωi,m d(µi,m, θi,m) (a) = Eµ[cτδ] sup ω K M min a = inf θa MF,θ MF: θa,M>θ ,M i { ,a},m [M] ωi,m d(µi,m, θi,m) = Eµ[cτδ] sup ω K M min a = f ,a(ω, µ) = Eµ[cτδ]C (µ) 1, where in (a) we use that as µ MMF, the minimum in θ does not change any arm i / { , a}. Finally, we lower bound kl(δ, 1 δ) with log 1 2.4 δ , thus concluding the proof. B.2 Comparison with existing lower bound In this section, we provide a comparison with the existing lower bound for the MF-BAI setting. In the following, we restrict our attention to bandits with Gaussian arms with variance 1/2. We assume for simplicity of the exposition that = 1 and that µ1,M > µ2,M µK,M. Given this setup, we begin by recalling Theorem 1 in [25]. Theorem B.1 (Theorem 1 in [25]). Consider any multi-fidelity bandit model µ with Gaussian arms with variance 1/2. Then, for any δ-correct algorithm and δ 0.15 it holds that Eµ[cτδ]/ log (2.4δ) 1 is lower bounded by: min m [M]: µ1,m>µ2,M+ξm λm (µ1,m (µ2,M + ξm))2 + i=1 min m [M]: µ1,M ξm>µi,m λm (µi,m (µ1,M ξm))2 . At this point, focus, for simplicity on the following 2 2 bandit model (but a trivial generalization holds for K M bandits). We consider µ1,m = µ1,M + ξm 2 and µ2,m = µ2,M ξm 2 . Furthermore, suppose that µ1,M = µ2,M. Then, let := µ1,M µ2,M. Suppose that = ξm, which yields µ1,M = ξm 2 and µ2,M = ξm 2 , and that Since = ξm, this condition actually simplifies to λM 4λm. Under these conditions it is possible to verify that the lower bound of [25] is given by Eµ[cτδ] 2λm ξm At this point, consider, instead, the result that we presented in Theorem 3.1 and consider a generic weight proportion ω. From Corollary C.2, we know that: F(ω, µ) = f1,2(ω, µ) = inf η [µ2,M,µ1,M] m [M] ω1,m d (µ1,m, η + ξm) λm + ω2,m d+(µ2,m, η ξm) Then, let ω ,M be the optimal weights restricted on the portion of the simplex in which ω1,m = ω2,m = 0. Then, let η ,M be the optimal solution of Equation (12) when considering ω ,M. Using the symmetry of the KL divergence for Gaussian distributions, it holds that η ,M = 0 and ω ,M 1,2 = ω ,M 2,2 = 0.5. Then, for any ω it holds that: m [M] ω1,m d (µ1,m, η ,M + ξm) λm + ω2,m d+(µ2,m, η ,M ξm) = ω1,M d(µ1,M, η ,M) λM + ω2,M d(µ2,M, η ,M) λM F(ω ,M, µ), where in the second step, we have used the fact that d(µ2,m, η ,M ξm) = d(µ2,M ξm 2 , ξm) = 0 and d(µ1,m, η ,M + ξm) = d(µ1,M + ξm 2 , ξm) = 0. In other words, we have shown that in this example the optimal allocation is sparse and on fidelity M. To conclude, we have that: F(ω ,M, µ) = 0.5 d( ξm 2 , 0) λM + 0.5 d( ξm 2 , 0) λM = d( ξm 2 , 0) λM = ( /2)2 which leads to: Eµ[cτδ] 4λM 2 log (2.4δ) 1 . Under the assumptions on the problem, 4λM 2 is always larger than 8λm 2 . This result says that the ratio among the lower bounds can be of order λM/λm, which is arbitrarily large. B.3 Proof of Theorem 3.2 In this section, we provide a formal proof on the sparsity of the optimal oracle allocation ω . The proofs given in this section rely on results that are explained in Appendix C.1. At this point, in order to prove Theorem 3.2, we first introduce some intermediate results that will be used in the proving the theorem. Specifically, we begin by showing that, for each arm a, there always exists a fidelity m such that ω a,m > 0 holds. Lemma B.2. Consider µ MMF and ω K M(µ). 
Then, for all a [K], there exists m [M] such that ω a,m > 0. Proof. We split the proof into two cases. First we consider a = , and proceed by contradiction. Consider ω K M(µ), and suppose there exists a = such that ω ,m = 0 for all m [M]. In this case, however, we have that: F(ω , µ) f ,a(ω , µ) = inf θa MF,θ MF: θa,M θ ,M m=1 ω ,m d(µ ,m, θ ,m) λm = 0, (13) where, in the first step we have used the definition of F(ω , µ), in the second one the fact that ω a,m = 0 for all m [M], and in the last one we selected θ ,m = µ ,m for all m [M]. Nevertheless, from Lemma C.6, we know that, whenever ωi,M > 0 holds for all i [K], then F(ω, µ) > 0 holds as well. Therefore, ω / K M. The proof for the case in which i = follows identical reasoning. We then continue by proving that, at any optimal allocation ω , all the transportation costs fa(ω , µ) are equal. Lemma B.3. Consider µ MMF and ω K M(µ). Then, for all a, b such that a = and b = the following holds: f ,a(ω , µ) = f ,b(ω , µ). Proof. We introduce the following notation: a [K] \ { } : a argmin b = fb(ω , µ) B = ([K] \ { }) \ A. At this point, we proceed by contradiction. Suppose that B = . Then, for some sufficiently small ϵ > 0, we define ω K M in the following way. For all a A: ωa,M = ω a,M + ϵ/|A| ωa,m = ω a,m m < M. For all b B, instead: ωb,mb = ω b,mb ϵ/|B| ωb,m = ω b,m m = mb, where mb [M] is any fidelity such that ω b,mb > 0 (which exists by Lemma B.2). Given this definition of ω, it is easy to see that f ,a( ω, µ) > f ,a(ω , µ) for all a A. This is a direct consequence of the fact that f ,a( , µ) is a strictly increasing function of ωa,M, which is apparent from its expression from its expression for µ MMF given in Corollary C.2 and the computation of its gradient (Lemma C.3). Moreover, due to similar arguments, it also holds that f ,b( ω, µ) f ,b(ω , µ) for all b B. Using the continuity of the functions f, for ε small enough we further have f ,a( ω, µ) < f ,b( ω, µ) for all a A and b B. This leads to mina = f ,a(ω , µ) < mina = f ,a( ω, µ) which contradicts the optimality of ω . We now continue by providing necessary conditions that characterize some key properties of the oracle weights ω, which follows from the expression of the gradient of f ,a(ω, µ) with respect to ω for µ MMF (Lemma C.3). Lemma B.4. Consider µ M and ω K M(µ). Then, for all a = the following conditions holds: d+(µa,m1, η ,a(ω ) ξm1) λm1 = d+(µa,m2, η ,a(ω ) ξm2) λm2 m1, m2 : ω a,m1, ω a,m2 > 0 d (µ ,m1, η ,a(ω ) + ξm1) λm1 = d (µ ,m2, η ,a(ω ) + ξm2) λm2 m1, m2 : ω ,m1, ω ,m2 > 0 d+(µa,m1, η ,a(ω ) ξm1) λm1 = d (µ ,m2, η ,a(ω ) + ξm2) λm2 m1, m2 : ω a,m1, ω ,m2 > 0 Proof. We begin by recalling the definition of C (µ) 1: C (µ) 1 = sup ω K M min a = f ,a(ω, µ), which, is a concave optimization problem with a non-empty feasible region. Therefore, we can apply the KKT conditions to study the properties of each local optimal point ω for which the sub-derivatives exist, i.e., from Theorem 4.3 and Corollary C.2, the ones for which the following condition hold:10 min a = f ,a(ω) > 0. (14) 10We notice that a global optimum point clearly satisfies this condition. At this point, fix any arm a that attains the minimum in Equation (14). Then, from the KKT conditions, we obtain the following system of inequalities: ωa,m f ,a(ω, µ) + c ba,m = 0 m [M] ω ,m f ,a(ω, µ) = 0 m [M] bi,mω i,m = 0 i { , a}, m [M] bi,m 0 i [K], m [M] ω i,m 0 i [K], m [M] P i,m ω i,m = 1 At this point, suppose that ωi1,m1 > 0, ωi2,m2 > 0 for some i1, i2 { , a} and some m1, m2 [M], then bi1,m1 = 0, bi2,m2 = 0. 
As a consequence, by applying Lemma C.3 and the fact that µ MMF (i.e., see Corollary C.2), the following equations holds: d+(µa,m1, η ,a(ω ) ξm1) λm1 = d+(µa,m2, η ,a(ω ) ξm2) λm2 m1, m2 : ω a,m1, ω a,m2 > 0 d (µ ,m1, η ,a(ω ) + ξm1) λm1 = d (µ ,m2, η ,a(ω ) + ξm2) λm2 m1, m2 : ω ,m1, ω ,m2 > 0 d+(µa,m1, η ,a(ω ) ξm1) λm1 = d (µ ,m2, η ,a(ω ) + ξm2) λm2 m1, m2 : ω a,m1, ω ,m2 > 0 Finally, to conclude the proof, it is sufficient to iterate these arguments for all a = . Indeed, from Lemma B.3, we know that all sub-optimal arms will attain the minimum in Equation (14) at a global optimum ω . At this point we are ready to prove our main result. Theorem 3.2. Let K M(µ) := argmaxω K M F(ω, µ) and f MMF := µ M MF : i [K], m1, m2 [M]2, ω K M(µ) : ω i,m1 > 0, ω i,m2 > 0 . The set f MMF is a subset of RK M whose Lebesgue measure is zero. Proof. Let us introduce some additional notation. Consider a subset of arm-fidelity pairs X [K] [M], and define G(X) M as the subset of multi-fidelity bandit models µ for which there exists ω K M(µ) such that, for all (i, m) X, ω i,m > 0 holds. Then, fix an arm i = , and any three fidelity m1, m2, m3 [M], and consider µ G({(i, m1), (i, m2), ( , m3)}).11 Then, from Lemma B.4, we know that the following condition holds: d+(µi,m1, η ,i(ω ) ξm1) λm1 = d (µ ,m3, η ,i(ω ) + ξm3) λm3 . (15) This, in turn, implies that η ,i(ω ) is uniquely identified as a function of µi,m1, µ ,m3, ξm1, ξm3, λm1 and λm3. Indeed, d+(µi,m1, η ,i(ω ) ξm1) is a strictly increasing function of η ,i(ω ), while d (µ ,m3, η ,i(ω ) + ξm3) is a strictly decreasing function of η ,i(ω ). Let c1 = η ,i(ω ), and let c2 = d+(µi,m1,η ,i(ω ) ξm1) λm1 . At this point, since ω i,m2 > 0 holds by definition, we also know, from Lemma B.4, that the following condition has to be satisfied: d+(µi,m2, c1 ξm2) Therefore, the value of µi,m2 is uniquely identified as a function of µi,m1, µ ,m3,ξm1, ξm3, λm1 and λm3. That function is measurable (it s a combination of d+, d and their inverses), hence µ lies on the graph of a measurable function, and such a graph has Lebesgue measure 0. Therefore, 11Similar arguments hold also for i = . G({(i, m1), (i, m2), ( , m3)}) has measure 0. At this point, thanks to Lemma B.2, we know that, for arm , there always exists at least a fidelity m3 such that ω ,m3 > 0. We thus have that G((i, m1), (i, m2)) [ m3 [M] G((i, m1), (i, m2), ( , m3)) . Since G((i, m1), (i, m2)) is contained in a set which is a countable union of null measure sets, it has null measure. To conclude the proof, we notice that: m1,m2 [M]2 G({(i, m1), (i, m2)}) := Y. The proof follows from the fact that (i) Y is a countable union of set of null measure (and, consequently, has null measure), and (ii) f MMF Y. B.4 Additional results on the sparsity of the oracle weights In this section, we present additional results on the sparsity of the oracle weights. Specifically: (i) We identify a specific class of multi-fidelity bandit models in which the optimal allocation is sparse. In particular, within this class of MF-bandit models, the optimal allocation have non-zero values only at the cheapest fidelity. (ii) We then provide sufficient conditions to determine whether some fidelity have zero weights at any optimal weight vector ω We now proceed by constructing the class of multi-fidelity bandits that we mentioned in point (i) above. In this construction, we will consider Gaussian multi-fidelity bandits with variance 1 2. 
Then, for any number of arms K and fidelity M, we will denote with AKM, the set of Gaussian multifidelity bandits that satisfy the following construction. We start by building the means of the arms at the highest fidelity M. Specifically, we consider a generic µ ,m > 0, and let µa,M = µ ,m for all a = . Then, for each fidelity m < M, and any values of λm and ξm, we let µi,m = µi,M ξm for all i = , and µ ,m = µ ,M + ξm. Finally, to simplify some computations, we set σ2 of each Gaussian distribution to 1 Proposition B.5. For all µ AKM, and any ω K M, it holds that, for all a [K] and all m > 1, ω a,m = 0. Proof. To prove the result, starting from Corollary C.2, it is sufficient to notice that, for all µ AKM and all a = , f ,a(ω, µ) can be rewritten as: f ,a(ω, µ) = inf η [µa,M,µ ,m] m=1 ωi,m d(µi,M, η) Specifically, Equation (16) follows directly from the symmetric property of KL divergence for Gaussian distributions, and by the construction of µ. The proof then continue by contradiction. Suppose there exists ω such that there exists (i, m) (with m > 1) such that ω i,m > 0. By defining ω as the vector which is equal to ω except in the components (i, m) and (a, 1) for all a [K]. More specifically, for a sufficiently small ϵ > 0, we define ωi,m = ω i,m ϵ and ωa,1 = ω a,1 + ϵ/(K) for all a [K]. Then, it is easy to see that f ,a( ω, µ) > f ,a(ω , µ) holds for all a = , thus contradicting the optimality of ω . Finally, we now provide sufficient conditions to determine whether some fidelity have zero weights at any optimal weight vector ω Proposition B.6. Fix a = . Then, if µa,m + ξm µ ,m, then it holds that ω a,m = 0. Furthermore, if µ ,m ξm µj,M for all j = , then it holds that ω ,m = 0. Proof. Consider a = , and let us analyze f ,a(ω, µ) for any ω K M. More specifically, we recall from Corollary C.2, that the only term in which ωa,m plays a role is the following one: ωa,m d(µa,m, η ,a(ω) ξm) λm 1 η ,a(ω) µa,m + ξm . (17) Nevertheless, since η ,a(ω) µ ,m µa,m + ξm, we have that Equation (17) is always equal to 0 for all ω K M. To prove the result we now proceed by contradiction. Suppose that ω is such that ω a,m > 0. Then, consider ω as a vector which is equal to ω except in the components (a, m) and (i, M) for all i = . More specifically, for a sufficiently small ϵ > 0, we define ωa,m = ω a,m ϵ and ωi,M = ω i,M + ϵ/(K 1) for all i = . At this point, by noticing that f ,i(ω, µ) is strictly increasing in ωi,M (i.e., due to Theorem 4.3 and the fact that µ MMF), and since f ,a(ω, µ) is not affected by the value of ωa,m (i.e., Equation (17)), we have that f ,i( ω, µ) > f ,i(ω , µ) for all i = , thus contradicting the optimality of ω . To show that if µ ,m ξm µj,M for all j = , then it holds that ω ,m = 0, it is possible to follow identifical reasonings. The only difference is that the term ω ,m plays a role in each of the (K 1)-equations defining F(ω, µ), namely: ω ,m d(µ ,m, η ,a(ω) + ξm) λm 1 η ,a(ω) µ1,m ξm a = . (18) Nevertheless, Equation (18) is equal to 0 for all a = since η ,a(ω) µi,m ξm µa,M holds for all ω and all a = . The proof then follows by an identical construction of an alternative weight vector ω which increases the objective function. B.5 Sub-optimality of "optimal" fidelity of previous works In this section, we discuss how the concept of "optimal" fidelity of previous works (i.e., [25] and [31]) fails to satisfy the notion of optimal fidelity that arises from the tighter lower bound that we presented in Section 3. 
In this section, we consider as example 2 2 multi-fidelity bandit models with Gaussian distributions. To ease the notation, we will consider µ1,M > µ2,M. B.5.1 Case 1 We notice that [25] provided the two concepts of optimal fidelity. The first one is from their Theorem 1. This same concept was then considered later in [31]. A fidelity m is optimal for a certain arm a [K] if it satisfies the following condition: m a argmax m [M] µ1,M (µa,m + ξm) λm if a = 1 (19) m a argmax m [M] (µa,m ξm) µ2,M λm if a = 1 (20) Then, consider the following 2 2 example of multi-fidelity BAI problem. Let ξ1 = 0.1, µ1,M = 0.6, µ1,m = 0.65, µ2,M = 0.5, µ2,m = 0.45 (where we use the notation M = 2 for the maximal fidelity and m = 1). Suppose, furthermore, that all distributions are Gaussian. In this case, from Equation (19)-(20), we have that m 1 = 1 and m 2 = 1 whenever the following conditions are satisfied: µ1,M (µ2,m + ξm) λm > µ1,M µ2,M λM µ1,m ξm µ2,M λm > µ1,M µ2,M λM . Plugging in the numerical values, we obtain in both cases 0.05 λm > 0.1 λM , thus showing that, according to [31], the optimal fidelity for both arms is m = 1 whenever q At this point, consider the expression of F(ω, µ) = f1,2(ω, µ) in this particular example. Then, it is possible to show that, for any ω 2 2 such that ω1,M = ω2,M = 0, then, f1,2(ω, µ) = 0. Specifically, we have that F(ω, µ) is given by: inf η [µ2,M,µ1,M] ω1,m d(µ1,m, η + ξm)1{η µ1,m ξm} λm + ωa,m d(µ2,m, η ξm)1{η µa,m + ξm} In turn, this is equal to: inf η [0.5,0.6] ω1,m d(0.55, η)1{η 0.55} λm + ωa,m d(0.55, η)1{η 0.55} which is always 0 for η = 0.55. On the other hand, Lemma C.6, shows that any strategy that gives positive value to weights at fidelity M = 2 obtains F(ω, µ) > 0. B.5.2 Case 2 Furthermore, [25] provided also the following concept of optimal fidelity which only holds for sub-optimal arms (see Definition 1 in [25]). A fidelity m such that µ1,M µ2,M > 4ξm holds is said to be optimal for arm a = 1 if the following holds: λm (µ1,M µa,M 4ξm)2 min m>m λ m (µ1,M µa,M 4ξm)2 . (21) At this point, consider the following classes of multi-fidelity bandit models: µ2,m = µ2,M ξm, µ1,m = µ1,M +ξm, µ1,M µ2,M 4ξm. In this case, from Equation (20) it follows that the optimal fidelity for arm 2 is always M. Nevertheless, since µ2,m = µ2,M ξm, µ1,m = µ1,M + ξm, we know from Proposition B.5 ω1,M = ω2,M = 0. C Algorithm analysis C.1 Gradient computation We start by analyzing a salient feature of fi,j(ω, µ) that holds for any µ ΘKM. Lemma C.1. Consider µ ΘKM. Fix any ω K M and i, j [K]. Let θ be the solution of the following optimization problem: θ argmin θi MF,θj MF: θj,M θi,M m [M] ωj,m d(µj,m, θj,m) m [M] ωi,m d(µi,m, θi,m) Furthermore, define for k {i, j}: M k(ω, µ, θ ) := m [M 1] : θ k,M > µk,m + ξm M k(ω, µ, θ ) := m [M 1] : θ k,M < µk,m ξm . Then, for k {i, j} we have that θ k,m = µk,m m [M] \ M k(ω, µ, θ ) M k(ω, µ, θ ) (22) θ k,m = θ k,M ξm m M k(ω, µ, θ ) (23) θ k,m = θ k,M + ξm m M k(ω, µ, θ ) (24) In particular, fi,j(w, µ) = X m [M] ωk,m d (µk,m, θ k,M + ξm) + d+(µk,m, θ k,M ξm) λm = min θj,M θi,M m [M] ωk,m d (µk,m, θk,M + ξm) + d+(µk,m, θk,M ξm) Proof. We begin by proving Equation (22). To this end, it is sufficient to notice that, given a fixed θ k,M, it is possible to set θ k,m := µk,m, whenever the following condition is satisfied: |θ k,M µk,m| ξm. (25) The condition is Equation (25) is equivalent to requiring m [M] \ Mk(ω, µ, θ ) Mk(ω, µ, θ ) , which concludes the first part of the proof. We continue by proving Equation (23). 
Consider m M k(ω, µ, θ ), that is θ k,m > µk,m + ξm. From this condition, it directly follows that θ k,m > µk,m; therefore, since d(µk,m, x) is increasing in x, it follows that, in order to attain the argmin, we need to pick the smallest value of θk,m that satisfies the multi-fidelity constraint |θ k,M θk,m|, that is θ k,M ξm. The proof of Equation (23) follows is almost identical to the one of Equation (24); it is sufficient to replace the definition of M k(ω, µ, θ ) with M k(ω, µ, θ ). At this point, we continue by analyzing in more detail the function fi,j(ω, µ). Lemma 4.2. Consider µ ΘKM and ω K M. Define for k [K], ψ k := argmin ψ R m=1 ωk,m d (µk,m, ψ + ξm) + d+(µk,m, ψ ξm) Then, the following holds: fi,j(ω, µ) = X m=1 ωk,m d (µk,m, ψ k + ξm) + d+(µk,m, ψ k ξm) λm if ψ j > ψ i (6) fi,j(ω, µ) = inf η R m=1 ωk,m d (µk,m, η + ξm) + d+(µk,m, η ξm) λm otherwise. (7) Proof. The proof follows by analyzing the definition of fi,j(ω, µ). Consider θ i , θ j that attaines the minimum in Equation (1). Then, there are two possibilities: either θ i,M = θ j,M or θ j,M > θ i,M. Suppose that θ j,M > θ i,M, then we notice that the optimization problem in fi,j(ω, µ) is a 2D-convex optimization problem in the variables θj,M, θi,M (thanks to Lemma C.1). Therefore, since the minimum of the constrained problem is such that θj,M > θi,M, than, by the convexity of the problem, this is also a minimum for the unconstrained problem, thus leading to: fi,j(ω, µ) = inf θi MF,θj MF m [M] ωk,m d(µk,m, θk,m) At this point, we notice that the constraints in the previous optimization problem are only intra-arm. Therefore, we can rewrite fi,j(ω, µ) as: fi,j(ω, µ) = X k {i,j} inf θk MF m [M] ωk,m d(µk,m, θk,m) Furthermore, applying the same reasoning as in the proof of Lemma C.1, we can further rewrite fi,j(ω, µ) as follows: fi,j(ω, µ) = X k {i,j} inf ψ R m [M] ωk,m d+(µk,m, ψ ξm) + d (µk,m, ψ + ξm) m [M] ωk,m d (µk,m, ψ k + ξm) + d+(µk,m, ψ k ξm) λm . At this point, we notice that due to Lemma C.1 we know that θ j,M = ψ j and θ i,M = ψ i , thus concluding the first part of the proof. Consider now the case in which θ j,M = θ i,M holds. Then, applying Lemma C.1, and using θ i,M = θ j,M, we can rewrite fi,j(ω, µ) as follows: fi,j(ω, µ) = inf η R d+(µk,m, η ξm) + d (µk,m, η + ξm) , thus concluding the proof. Given this result, we recall that the definitions of ψ i = argmin ψ R m [M] ωi,m d (µi,m, ψ + ξm) + d+(µi,m, ψ ξm) η i,j = argmin η R m [M] ωk,m d (µk,m, η + ξm) + d+(µk,m, η ξm) Corollary C.2. Consider µ MMF, and a [K] such that a = . Then it holds that: f ,a(ω, µ) = inf η [µa,M,µ ,M] m=1 ωi,m d (µi,m, η + ξm) + d+(µi,m, η ξm) = inf η [µa,M,µ ,M] m=1 ω ,m d (µ ,m, η + ξm) λm + ωa,m d+(µa,m, η ξm) Proof. At this point, we notice that whenever µ MMF it holds that f ,a can always be expressed as Equation (7). This is direct by the condition on ψ s in Lemma 4.2. Furthermore, it also holds at η ,a that d (µa,m, η ,a + ξm) = 0, and d+(µ ,m, η ,a ξm) = 0. This is a consequence of the fact that η ,a [µa,M, µ ,M] for all weights ω. Indeed, η ,a [µa,M, µ ,M] holds due to monotonicity property of the KL divergence. We now analyze in more detail Equations (26) and (27). In particular, we begin by focusing on Equation (26). Taking the gradient in Equation (27) w.r.t. 
the optimization variable η, and setting it equal to 0, we obtain that any optimal point η i,j(ω) needs to satisfy the following equation: ka,m 1 v(η ξm) + ka,m 1 v(η + ξm) ka,m µa,m + ξm v(η ξm) + ka,m µa,m ξm where we recall that ka,m(η) and ka,m(η) are given by: ka,m(η) = 1 {η µa,m + ξm} ka,m(η) = 1 {η µa,m ξm} . Given this intermediate result, we now investigate in more depth the solution of Equation (28). Lemma 4.4. Consider µ ΘKM and ω K M such that fi,j(ω, µ) > 0 . Suppose that ψ i, ψ j holds. Then, there exists a unique minimizer η i,j(ω) of Equation (7) which is the unique solution of the following equation of η: ka,m(η) µa,m+ξm v(η ξm) + ka,m(η) µa,m ξm ka,m(η) 1 v(η ξm) + ka,m(η) 1 v(η+ξm) , (10) where ka,m(x) = 1{x µa,m + ξm} and ka,m(x) = 1{x µi,m ξm}. Proof. Let us analyze: fi,j(ω, µ) = inf η R m M ωa,m d (µa,m, η + ξm) λm + ωa,m d+(µa,m, η ξm) := inf η R gi,j(ω, µ, η). At this point, we proceed by contradiction. Suppose that there exists x1, x2 argminη R gi,j(ω, µ, η) such that x1 = x2. From the convexity of gi,j(ω, µ, η) w.r.t. η, we know that any x [x1, x2] belongs to the argmin set as well. Furthermore, for all x [x1, x2], since fi,j(ω, µ) > 0, at least one of the following condition is satisfied: λM d(µa,M, x) > 0 holds for some a {i, j} λm d(µa,M, x + ξm)ka,m(x) > 0 holds for some a {i, j} and some fidelity m < M λm d(µa,M, x ξm)ka,m(x) > 0 holds for some a {i, j} and some fidelity m < M Therefore, from Equation (28), we obtain that all x [x1, x2] are fixed points of the following Equation: a {i,j} PM m=1 ωa,m ka,m(x) µa,m+ξm v(x ξm) + ka,m(x) µa,m ξm a {i,j} PM m=1 ωa,m ka,m(x) 1 v(x ξm) + ka,m(x) 1 v(x+ξm) , (29) At this point, we notice that for any couple of different x1, x2 that satisfies Equation (29), there exists at least one arm a {i, j} and one fidelity m < M such that at least one of the following two conditions hold: ka,m( x1) = ka,m( x2) ka,m( x2) = ka,m( x2) This however, is possible only for a finite number of points, while the interval [x1, x2] contains infinitely many optimal points. Therefore, there exists a unique solution η i,j(ω) argminη R gi,j(ω, µ, η), and, furthermore, it is a solution of Equation (29), thus concluding the proof. Given this result, we continue by providing a result on how to compute the derivative of fi,j(ω, µ) whenever fi,j(ω, µ) is given by Equation (7). Lemma C.3. Consider µ ΘKM and ω K M such that fi,j(ω, µ) > 0. Furthermore, suppose that fi,j(ω, µ) is given by Equation (7). Then, for all a {i, j} and all m [M]: ωa,m = d+(µa,m, η i,j + ξm) + d (µi,m, η i,j ξm) λm . Proof. First of all, we notice that, since fi,j(ω, µ) > 0 holds, and since fi,j(ω, µ) is expressed as in Equation (7), then, thanks to Lemma 4.4, we know that η i,j(ω) is the unique optimum of the Equation (7). In the rest of this proof, we will explicit the relationship between fi,j and η i,j(ω) by writing fi,j(ω, µ, η i,j(ω)). At this point, fix a {i, j} and m [M]. Then, it is easy to verify from Equation (10) that both the right and left derivative of η i,j(ω) w.r.t. ωa,m exists. Suppose for a moment that they are equal, then we have that η i,j(ω) ωa,m exists and it continuous. Therefore, we obtain ωa,m fi,j(ω, µ, η i,j(ω)) = fi,j ωa,m (ω, µ, η i,j(ω)) + fi,j η i,j(ω)(ω, µ, η i,j(ω)) η i,j(ω) ωa,m ωa,m (ω, µ, η i,j(ω)) = d+(µa,m, η i,j + ξm) + d (µa,m, η i,j ξm) λm , where in the second step we have used that fi,j η i,j(ω)(ω, µ, η i,j(ω)) = 0 since η i,j(ω) is a minimizer of Equation (7). 
Similarly, whenever, the right and the left derivatives of η i,j(ω) are different12, we can follow similar arguments, but analyzing left and right derivatives, and we will obtain an identical result. Indeed, this does not introduce discontinuity issue in the derivatives of fi,j thanks to the fact that η i,j(ω) is a minimizer of Equation (7). At this point, it remains to analyze in more detail the case in which we have that fi,j(ω) is expressed as in Equation (6). Lemma C.4. Consider µ ΘKM and ω K M such that fi,j(ω, µ) > 0 holds. Furthermore, suppose that ψ j > ψ i . Then, for each a {i, j}, there exists a unique minimizer ψ a of Equation (6) which is the unique solution of the following equation of ψ: PM m=1 ωa,m ka,m(ψ) µa,m+ξm v(ψ ξm) + ka,m(ψ) µa,m ξm PM m=1 ωa,m ka,m(ψ) 1 v(ψ ξm) + ka,m(ψ) 1 v(ψ+ξm) . (30) Proof. The proof follows by noticing that, for each a {i, j}, the optimization problem in Equation (6) is an unconstrained convex optimization problem in ψ. Taking the derivative and setting it equal to 0 yields the desired result. At this point, we proceed by showing how to compute the partial derivatives of fi,j(ω) whenever it is expressed as in Equation (6). Lemma C.5. Consider µ ΘKM and ω K M such that fi,j(ω, µ) > 0. Furthermore, suppose that ψ j > ψ i . Then, for all m [M] it holds that: ωa,m = d+(µa,m, ψ a + ξm) + d (µa,m, ψ a ξm) λm . Proof. The proof is a straightforward adaptation of the proof of Lemma C.3. Finally, we are now ready to prove our result on the sub-gradient of F(ω, µ). Theorem 4.3. Consider µ ΘKM and ω K M such that F(ω, µ) > 0 holds. Let (i, a) [K]2 be a pair of arms that attains the max-min value in Equation (2). Then a sub-gradient F(ω, µ) of F(ω, µ) w.r.t. to ω is given by one of the two following expressions: for j {a, i} and m [M] , F(ω, µ)j,m = d+(µj,m, η i,a ξm) + d (µj,m, η i,a + ξm) λm if ψ i ψ a , (8) F(ω, µ)j,m = d+(µj,m, ψ j ξm) + d (µj,m, ψ j + ξm) λm otherwise. (9) That sub-gradient F(ω, µ) is 0 in all the remaining KM 2M dimensions. Proof. The proof follows by the definition of F(ω, µ) together with Lemma 4.2, Lemma C.3, and Lemma C.5. 12This can happen, for instance, whenever η i,j(ω) = µa,m ξm. We now show a sufficient condition for F(ω, µ) > 0 to hold when µ ΘKM. Lemma C.6. Consider µ ΘKM such that there exists for which µ ,M > maxa = µa,M. Furthermore, consider ω K M such that ωi,M > 0 holds for all i [K]. Then, we have that F(ω, µ) > 0. Proof. From the definition of F, and the definition of , we have that: F(ω, µ) min a = inf θa MF, θ MF θa,M θ ,M m [M] ωj,m d(µj,m, θj,m) min a = inf y x ω ,M d(µ ,M, x) λM + ωa,M d(µa,M, y) = min a = inf η [µa,M,µ ,M] ω ,M d(µ ,M, η) λM + ωa,M d(µa,M, η) where in the last step we have used the fact that µ ,M > µi,M for all i = , together with ωi,M > 0 for all i [K]. Furthermore, we show that the sequence of weights generated by Algorithm 1 satisfy ωi,M > 0 for all i [K] Lemma C.7. The sequence of weights { ω(t)}t satisfy ωi,M(t) > 0 for all i [K] and for all t. Proof. We begin by recalling the definition of ω(t): ω(t + 1) argmax ω K M αt+1 s=KM ω Clips ( F( ω(s), ˆµ(s)) kl(ω, ω) From this definition, thanks to the property of kl, we have that ωa,m(t) > 0 for all a [K], and all m [M]. Remark C.8. It follows by combining Lemma C.6 and Lemma C.7, that F(ω(t), ˆµ(t)) = 0 might happen only when there are multiple best arms at fidelity M. 
Whenever this condition is encountered, it is possible to project the bandit model ˆµ(t) to have a unique optimal arm (e.g., by adding a small ϵ > 0 to one of the optimal arms). When looking at the proof of Theorem 4.1, we can see that this does not impact its theoretical guarantees as (i) on the good event ET this does not happen, and, Lemma C.17 holds unchanged. C.2 Smoothness of F(ω, µ) Lemma C.9. For any set S ΘKM and any subset of arms A [K], the function (ω, µ) 7 infθ S P a A,m [M] ωa,m d(µa,m,θa,m) λm is jointly continuous on K M ΘKM . Proof. We apply (a trivial generalization of) Lemma 27 of [3]. The lemma in that paper is stated for a set of alternative models, but the proof actually works for any set S. Likewise, it is stated for the case of λm = 1, but since it works for an arbitrary Bregman divergence d it applies to a rescaled version as well. To deal with the restriction to a subset of arms A instead of all arms, we can view the function as a function of (ωA [M], µA), where we restrict the vectors to the arms in A, and continuity of the original function is equivalent to continuity of the restricted version. Lemma C.10. The function (ω, µ) 7 F(ω, µ) is jointly continuous on K M ΘKM . Proof. By definition, F(ω, µ) = maxi mina =i fi,a(ω, µ) with fi,a(ω, µ) = infθ Si,a P a Ai,a,m [M] ωa,m d(µa,m,θa,m) λm for Si,a = {θ MMF | θa,M θi,M} and Ai,a = {i, a}. Since a minimum of finitely many continuous functions is continuous and likewise for the maximum, it suffices to show that each fi,a is jointly continuous. This is true by Lemma C.9. Corollary C.11. Let C ΘKM be a compact set. Then F is uniformly continuous on K M C. Proof. The set K M C is compact and F is continuous, hence it is uniformly continuous on that set. Lemma C.12. Let µ ΘKM. For all ε > 0, there exists κϵ > 0 such that for all ω K M and all µ ΘKM , µ µ κϵ = F(ω, µ ) F(ω, µ) ϵ. Proof. Take any compact ball B(µ, κ) for the norm with κ > 0 centered at µ. Then F is uniformly continuous on K M B(µ, κ) by Corollary C.11. This means that for any ε > 0, there exists κ ε > 0 such that for all (ω , µ ) K M B(µ, κ), ω ω κ ε µ µ κ ε = |F(ω , µ ) F(ω, µ)| ε . We can take κε = min{κ, κ ε} to remove the condition µ B(µ, κ). The result of the Lemma is this for the special case ω = ω. C.3 Correctness In the following, we propose an analysis on the correctness which is based on the concentration results provided in [22]. We notice that these results are based on Gaussian distributions. Nevertheless, at the cost of a more involved notation, it is possible to extend all the results of this work for canonical exponential families using, e.g., Theorem 7 in [19]. At this point, let us consider the following value of βt,δ: βt,δ = log K + 2M log 4 log K + 1 + 12M log (log(t) + 3) + 2M C, (31) where C is a universal constant (see Proposition 1 in [22]). Then, we can show the following result. Proposition C.13. Let δ > 0, then it holds that Pµ(ˆaτδ = ) δ. Proof. With probabilistic arguments we have that: Pµ(ˆaτδ = ) Pµ t KM, i = , min j =i fi,j(C(t), ˆµ(t)) βt,δ t KM, min j =i fi,j(C(t), ˆµ(t)) βt,δ i = Pµ ( t KM, fi, (C(t), ˆµ(t)) βt,δ) m [M] Nk,md(ˆµa,m(t), µa,m) βt,δ where in fourth step we have used the definition of fi, , and in the last one Proposition 1 in [22] together with a union bound on K. C.4 Auxiliary lemmas This section contains auxiliary lemmas that will be used in the analysis of Algorithm 1. Lemma C.14. For all a [K], m [M], and for all t 1, it holds that Na,m(t) t 4KM ln(KM). 
Furthermore, it holds that: s=0 π(s) N(t) 2 ln(KM) Proof. This lemma is a simple combination of Lemma 3 in [22] with the tracking result of [5]. Using algebraic manipulations, we have that: s=1 π a,m(s) Na,m(t) s=1 π a,m(s) s=1 π a,m(s) ln(KM) γs KM ln(KM) 1 4 s ln(KM) t 4KM ln(KM), where, in the second step we have used Theorem 6 in [5], together with the fact that ln(KM) ln(4) 1. For the second part of the proof, we have that: s=1 πa,m(s) Na,m(t) s=1 π a,m(s) Na,m(t) + 2 Lemma C.15. Consider ϵ > 0 and B R such that C (µ) 1 B ϵ > 0. Then, there exists a constant Cϵ such that, for a,m Ca,m(T) max λMCϵ, log K δ + 2M log 4 log K C (µ) 1 B ϵ := C0(ϵ, δ), (33) it holds that: C (µ) 1 B βT,δ P a,m Ca,m(T). (34) Proof. Let Cϵ be a constant that depends on ϵ such that, for T Cϵ it holds that: 12M log(log(T) + 3) + 2M C Then, for P a,m Ca,m(T) C0(ϵ, δ), we have that: a,m Ca,m(T) = log K δ + 2M log 4 log K δ + 1 + 12M log(log(T) + 3) + 2M C P a,m Ca,m(T) δ + 2M log 4 log K a,m Ca,m(T) + ϵ which concludes the proof. C.5 Proof of Theorem 4.1 Before diving into the proof of Theorem 4.1, we introduce some additional notation. We denote with B (x, κ) the ball of radius κ centered at x. Then, for all T, and ϵ > 0, we introduce the following event: t h(T ) {ˆµ(t) B (µ, κϵ)} , where h(T) T 1/4. At this point, we present our result. Furthermore, we denote with ω(t) the vector of empirical cost proportions, namely, for all (a, m), ωa,m(t) = Ca,m(t) P j [M] Ci,j(t). First of all, we introduce an initial result that controls the expectation of the stopping cost. Lemma C.16. Consider B such that C (µ) 1 B ϵ > 0, and suppose that there exists a constant Tϵ such that, for all T Tϵ, it holds that F(ω(T), ˆµ(T)) F(ω (µ)) B on the good event Eϵ(T). Then, it holds that: Eµ[cτδ] λMTϵ + C0(ϵ, δ) + 1 + t=0 Pµ(E(T)c). (35) Proof. Using probabilistic arguments, we have that: Eµ[cτδ] = Z + 0 Pµ(cτδ > x)dx λMTϵ + C0(ϵ, δ) + Z λMTϵ+C0(ϵ,δ) Pµ(cτδ > x, cτδ C0(ϵ, δ))dx λMTϵ + C0(ϵ, δ) + Z λMTϵ+C0(ϵ,δ) Pµ τδ > x λM , cτδ C0(ϵ, δ) dx λMTϵ + C0(ϵ, δ) + 1 + T = Tϵ+ C0(ϵ,δ) λM Pµ(τδ > T, cτδ C0(ϵ, δ)) λMTϵ + C0(ϵ, δ) + 1 + T =0 Pµ(E(T)c), where (i) in the second inequality, we have upper bounded cτδ λMτδ, (ii) in the third inequality, we have used the fact that τδ is an integer variable, and (iii) in the last inequality we have used that for all T Tϵ such that P a,m Ca,m(T) C0(ϵ, δ), then we have that E(T) {τδ < T}. Indeed, combining Lemma C.15 with the definition Tϵ, we obtain that for all T Tϵ such that P a,m Ca,m(T) C0(ϵ, δ), then we have that E(T) {τδ < T}. This last step is direct by noticing that F(ω(T), ˆµ(T)) βT,δ P a,m Ca,m(T ) implies stopping. Lemma C.16 shows how to upper-bound the expected cost complexity. Notice that this result requires different arguments w.r.t. the usual ones that appears while controlling the expected sample complexity (see, e.g., [8]). Then, we report a basic property of the sub-gradient ascent routing that is employed in our algorithm. Before doing that, we recall that, on the good-event ET , it holds that there exists a constant L, that depends on µ, such that the empirical sub-gradients are uniformly-bounded by L. Lemma C.17. Let c(t) = P a,m λm πa,m(s). Define Cr := log(KM) + KMG + 4(LλM)2 + 2G2, and consider the sequence of weights { ω(t)}t generated by Algorithm 1. Then, on the good event Eϵ(T) it holds that: t=h(t) c(t) F(ω , ˆµ(t)) (ω ω(t)) Cr Proof. The proof is identical to the one of Proposition 3 in [22]. 
The only difference is that, in our case, we need to multiply the scale of the sub-gradient L by λM, to the additional presence of c(t) in the sequence of gains that we use in our sub-gradient ascent algorithm. Finally, we show that there exists an additional problem dependent constant Cµ that will be useful in performing some upper-bound reasoning in the proof of the final result. Lemma C.18. Consider the following quantity: mina = infθa,θ MF PT s=1 P m πi,m(s)d(µi,m, θi,m) P a,m Ca,m(T) . There exists a problem dependent constant Cµ such that the previous equation can be upper bounded by: F(ω(T), µ) + 4 ln(KM)MλMCµ Proof. Let us begin by analyzing F(ω(T), µ). Fix a such that F(ω(T), µ) = f , a(ω(T), µ). Moreover, consider θ , θ a argmin P m ωa,m(T) d(µa,m,θa,m) λm . Then, consider the following difference: H := mina = infθa,θ MF PT s=1 P m πi,m(s)d(µi,m, θi,m) P a,m Ca,m(T) F(ω(T), µ). Then, the previous Equation can be upper bounded by: m PT s=1 πi,m(s) Ni,m(T) d(µi,m, θ i,m) P a,m Ca,m(T) m d(µi,m, θ i,m) P a,m Ca,m(T) a,m Ca,m(T) , where in the first step, we have used the definition of θ a, θ and the definition of ω(T), in the second one, we have used Lemma C.14, and in the last one the facts that, thanks to definition θ a, θ , there exists some problem dependent constant Cµ such that d(µi,m, θ i,m) is bounded. Theorem 4.1. For any multi-fidelity bandit model µ MMF, Algorithm 1 using the threshold βt,δ given in (31) is δ-correct and satisfies lim sup δ 0 Eµ[cτδ] log(1/δ) C (µ). (5) Proof. The proof of the δ-correctness is from Proposition C.13. To prove the optimality, we first proceed by upper bounding the following quantity on the good event Eϵ(T): F(ω , µ) F(ω(T), ˆµ(T)). (36) Define, for brevity, e T := T h(T) + 1 and c(s) := P a,m λm πa,m(s). Then, we start by analyzing F(ω , µ). On Eϵ(T) we have that: a,m Ca,m(T) P a,m Ca,m(T)F(ω , µ) PT s=1 c(s) P a,m Ca,m(T)F(ω , µ) + PT s=1 c(s) P a,m Ca,m(T)F(ω , µ) PT s=1 c(s) P a,m Ca,m(T)F(ω , µ) + F(ω , µ) P a,m Ca,m(T)2 ln(KM) PT s=1 c(s) P a,m Ca,m(T)F(ω , µ) + 2 ln(KM)F(ω , µ) PT s=h(T ) c(s) P a,m Ca,m(T)F(ω , µ) + h(T) λmin T F(ω , µ) + 2 ln(KM)F(ω , µ) PT s=h(T ) c(s)F(ω , ˆµ(t))) a,m Ca,m(T) + λM e Tϵ λmin T + h(T) λmin T F(ω , µ) + 2 ln(KM)F(ω , µ) where in the first inequality we have used Lemma C.14, while in the last step we have used Lemma C.12 together with the event Eϵ(T). At this point, we focus our analysis on PT s=h(T ) c(s)F (ω ,ˆµ(t)) P a,m Ca,m(T ) . Define, for brevity, gs = c(s) F( ω(s), ˆµ(s)); then, we have that: PT s=h(T ) c(s)F(ω , ˆµ(s))) a,m Ca,m(T) = PT s=h(T ) c(s) (F(ω , ˆµ)) F( ω(s), ˆµ(s)))) a,m Ca,m(T) PT s=h(T ) c(s)F( ω(s), ˆµ(s))) a,m Ca,m(T) + PT s=h(T ) gs (ω ω(s)) a,m Ca,m(T) PT s=h(T ) c(s)F( ω(s), ˆµ(s))) a,m Ca,m(T) + Cr λmin PT s=h(T ) c(s)F( ω(s), µ) a,m Ca,m(T) + Cr λmin T + λM e Tϵ where in the first inequality we have used the concavity of F, in the second one we have used Lemma C.17, and in the last one Lemma C.12 and the definition of Eϵ(T). Finally, we have that: PT s=1 c(s)F( ω(s), µ) P a,m Ca,m(T) = PT s=1 mina = infθa,θ MF P m πi,m(s)d(µi,m, θi,m) P a,m Ca,m(T) mina = infθa,θ MF PT s=1 P m πi,m(s)d(µi,m, θi,m) P a,m Ca,m(T) F(ω(T), µ) + 4 ln(KM)MCµ F(ω(T), ˆµ(T)) + 4 ln(KM)MCµ where (i) in the first equality we used the definition of c(s), ω(s) and F, in the second one we used Lemma C.18, and in the last one Lemma C.12. 
Given this analysis, let us define: BT := 2λM e Tϵ λmin T + h(T) λmin T F(ω , µ) + 2 ln(KM)F(ω , µ) T + 4 ln(KM)MCµ Consider T such that: ( 2 ln(KM)F(ω , µ) 2 , 4 ln(KM)MCµ Then, it holds that: λmin + 5ϵ = Bϵ, and, consequently, we have that F(ω(T), ˆµ(T)) F(ω , µ) Bϵ. Combining this result with Lemma C.15 and C.16, we obtain: E[cτδ] λMTϵ + C0(δ, ϵ) + 1 + t=0 Pµ(Eϵ(t)c). By Lemma C.14 and Lemma 19 in [8], we obtain that: lim sup δ 0 E[cτδ] log(1/δ) 1 C (µ) 1 Bϵ . Letting ϵ 0 concludes the proof. D Experiment details and additional results In this section, we provide experimental details and additional results. For the experiments we relied on a server with 100 Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz cpus and 256GB of RAM. The time to obtain all the empirical results is less than a day. This section is structured as follows. First, we provide an additional details and results on the experiment presented in Section 5 (Section D.1 and Section D.2). Secondly, we provide results on additional 4 5 multi-fidelity bandits (Section D.3). Then, we analyze a typical trick that is used to improve the performance of gradient-based methods, that is using a constant rate against using the learning rate that the theory prescribes. (Section D.4). We then present results using very small value of δ w.r.t. to the one that has been considered in the main text (Section D.5). In particular, we verify that the performance difference amplifies. Finally, we discuss the approaches of [31]. D.1 Further details on the experiments presented in Section 5 First instance We begin by providing further details on the 4 5 multi-fidelity bandit model that we used in Figure 1 and Figure 2. First, Table 2 reports the 4 5 bandit model of Figure 1. Table 2: Multi-fidelity bandit model presented in Figure 1. µ1 µ2 µ3 µ4 ξ λ m = 1 0.9465 0.8526 0.8162 0.9099 0.1 0.05 m = 2 0.7727 0.8708 0.9050 1.0594 0.08 0.1 m = 3 0.8812 0.8515 0.8209 1.0083 0.05 0.2 m = 4 0.8284 0.8374 0.8353 0.9745 0.025 0.4 m = 5 0.8494 0.8401 0.8495 0.9856 0.0 5 All arms, both for the bandit model of Figure 1 and 2 are Gaussian distributions with variance σ2 = 0.1. The bandit model in Figure 1 has been generated according to a procedure that has been used to generate MF instances in [25] (see their Appendix D.1). Specifically, first, two M-dimensional vectors are specified, which we refer to as a and b. Specifically, a and b are such that am am+1 and bm bm+1 for all m [M 1]. Then, we first sample the means of the arm at fidelity M 13, and once this is done we sample µi,m µi,M am b 2, µi,M + am + b 2 . Then, ξ is computed as ξm = am + bm 2 . In this sampling procedure, we have used a = [0.075, 0.06, 0.04, 0.02, 0] and b = [0.05, 0.04, 0.02, 0.01, 0]. Second instance We now recall the 5 2 example of Section 5: Table 3: Multi-fidelity bandit model presented in Figure 2. µ1 µ2 µ3 µ4 µ5 ξ λ m = 1 0.4 0.4 0.4 0.4 0.5 0.1 0.5 m = 2 0.5 0.5 0.5 0.5 0.6 0 5 We prove that, in this instance, the oracle weights are given by ω i = [0.09621, 0] for all i [4], and ω 5 = [0, 0.61516] (this number have been rounded to the fourth decimal precision). In order to prove this, we first notice that in the considered domain the optimal fidelity for i [4] is m = 1. This is direct from the fact that µi,m = µi,M ξm (see, e.g., Proposition B.5). 
Furthermore, we recall the expression of f_{5,i}, for any i ∈ [4]:

f_{5,i}(ω, µ) = inf_{η ∈ [µ_{i,M}, µ_{5,M}]} Σ_{m ∈ [M]} [ ω_{5,m} d-(µ_{5,m}, η + ξ_m) / λ_m + ω_{i,m} d+(µ_{i,m}, η − ξ_m) / λ_m ].

Then, since η*_{5,i} ∈ [µ_{i,M}, µ_{5,M}] and µ_{i,M} = µ_{5,1} = µ_{5,M} − ξ_1, the term d-(µ_{5,1}, η + ξ_1) vanishes for every feasible η, and hence the optimal fidelity for arm 5 is m = 2. At this point, consider the oracle weights ω*. We notice that, due to the symmetry of the problem, ω*_{i,1} is equal for all i ∈ [4]. Then, we can rewrite f_{5,i}(ω, µ) as a function of a single variable, that is:

f_{5,i}(ω, µ) = inf_{η ∈ [µ_{i,M}, µ_{5,M}]} [ (1 − 4ω_{i,1}) d-(µ_{5,2}, η) / λ_M + ω_{i,1} d+(µ_{i,1}, η − ξ_1) / λ_1 ],

and, consequently, C*(µ)^{-1} can be expressed as a convex optimization problem in the single variable ω_{i,1}. Taking the derivative of F(ω, µ) w.r.t. ω_{i,1} and setting it to zero, we obtain that the following equality has to be satisfied at the optimum:

4 d(µ_{5,M}, η*_{5,i}) / λ_M = d(µ_{i,1}, η*_{5,i} − ξ_1) / λ_1.

Solving for η*_{5,i} gives a unique solution in the range [0.5, 0.6], which is approximately 0.539. Then, using Lemma C.3 and solving for ω_{i,1}, we obtain ω_{i,1} = 0.09621 and, consequently, ω_{5,2} = 0.61516.
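This one-dimensional reduction is easy to verify numerically. The sketch below (our own helper names; Gaussian divergences with σ² = 0.1 as in the model above; scipy's bounded scalar minimizer for both the inner and the outer search) maximizes the single-variable objective and recovers, up to rounding, the oracle weights and the value of η*_{5,i} reported above. It is an illustrative check, not the code used for the experiments.

# Numerical check of the single-variable reduction for the 5 x 2 instance of Table 3.
import numpy as np
from scipy.optimize import minimize_scalar

sigma2 = 0.1
mu_5M, mu_i1, xi_1, lam_1, lam_M = 0.6, 0.4, 0.1, 0.5, 5.0

def d(p, q):
    # KL divergence between Gaussians with common (known) variance sigma2
    return (p - q) ** 2 / (2 * sigma2)

def f_single(w):
    # w = omega_{i,1}, shared by the four sub-optimal arms; arm 5 gets mass 1 - 4w at fidelity M
    def g(eta):
        best = (1 - 4 * w) * d(mu_5M, eta) / lam_M
        sub = w * d(mu_i1, eta - xi_1) / lam_1 if eta - xi_1 >= mu_i1 else 0.0
        return best + sub
    return minimize_scalar(g, bounds=(0.5, 0.6), method="bounded")

outer = minimize_scalar(lambda w: -f_single(w).fun, bounds=(0.0, 0.25), method="bounded")
w_star = outer.x
inner = f_single(w_star)
# Approximately recovers omega_{i,1}, omega_{5,2} = 1 - 4 * omega_{i,1} and eta*_{5,i} above.
print(round(w_star, 4), round(1 - 4 * w_star, 4), round(inner.x, 4))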
Thresholds. To conclude, we comment on the thresholds β_{t,δ} used by the algorithms. For the stopping rule of MF-GRAD we used β_{t,δ} = log(K/δ) + M log(log(t) + 1), which is a simplification of its theoretical value (31) that retains the same scaling in K and M (up to constants). In GRAD, we used β_{t,δ} = log(K/δ) + log(log(t) + 1), which is a similar simplification of the usual threshold for BAI: instead of concentrating a sum of 2M KL terms (see the proof of Proposition C.13), it only requires concentrating a sum of 2 KL terms. Finally, in the confidence intervals used by IISE we adopted the confidence bonuses √(2σ²(log(KM/δ) + log(log(t))) / N_{a,m}(t)),¹⁴ which, compared to their original form, replace a crude union bound over t with a stylized version of the threshold that would follow from tight time-uniform concentration. These choices were adopted consistently in all the experiments presented in this appendix, and they ensured the δ-correctness requirement in all cases.

¹³ For this step, we constrained the minimum gap between arms at fidelity M to be at least 0.1.
¹⁴ Notice, indeed, that IISE requires a union bound on both K and M.

D.2 Empirical cost proportions of MF-GRAD

In this section, we analyze the behavior of MF-GRAD by looking at the evolution of its empirical cost proportions. Specifically, we repeat, on the 4 × 5 bandit model of Table 2, the same experiment that we presented in Section 5 for the 5 × 2 bandit model. Figure 4 reports the result. Interestingly, we highlight how the sparsity pattern emerges in this domain as well.

D.3 Results on additional domains

In this section, we present results on two additional 4 × 5 multi-fidelity bandit models. Specifically, we generate another random instance according to the procedure of [25], reported in Table 4, and we create an additional bandit model in which the means of some arms are slightly increasing or roughly stationary across fidelities, reported in Table 5. In both cases, we consider Gaussian distributions with σ² = 0.1. Empirical results of MF-GRAD, IISE and GRAD for δ = 0.01 can be found in Figures 5 and 6, respectively. As we can appreciate, MF-GRAD maintains the most competitive performance across both domains.

Table 4: Additional random multi-fidelity bandit model.
        µ1      µ2      µ3      µ4      ξ      λ
m = 1   0.6944  0.5080  0.4153  0.3564  0.1    0.05
m = 2   0.5634  0.3723  0.4132  0.4570  0.08   0.1
m = 3   0.6178  0.4322  0.3817  0.4065  0.05   0.2
m = 4   0.6323  0.4225  0.3838  0.3582  0.025  0.4
m = 5   0.6171  0.4216  0.3831  0.3783  0.0    1

Table 5: Additional multi-fidelity bandit model.
        µ1    µ2    µ3    µ4    ξ     λ
m = 1   0.41  0.35  0.51  0.41  0.1   0.1
m = 2   0.45  0.37  0.56  0.39  0.08  0.125
m = 3   0.47  0.38  0.64  0.40  0.04  0.25
m = 4   0.48  0.36  0.62  0.42  0.02  0.5
m = 5   0.5   0.35  0.61  0.42  0.0   1
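For reference, the random instances of Tables 2 and 4 follow the generation recipe of [25] recalled in Appendix D.1. The sketch below is our reading of that recipe (uniform sampling inside the allowed band around each fidelity-M mean; the fidelity-M means used here are illustrative placeholders), provided only to make the construction easy to reproduce in spirit.

# Sketch of the instance-generation recipe of [25] used for Tables 2 and 4 (our reading).
import numpy as np

def sample_mf_instance(mu_high, a, b, rng):
    # mu_high: means at the highest fidelity (one per arm); a, b: non-increasing vectors
    # with a[M-1] = b[M-1] = 0, so that the highest fidelity is unbiased.
    M = len(a)
    xi = np.array([a[m] + b[m] / 2 for m in range(M)])      # maximal bias per fidelity
    mu = np.tile(mu_high[:, None], (1, M)).astype(float)
    for m in range(M - 1):
        lo = mu_high - a[m] - b[m] / 2                       # per-arm band of half-width xi_m
        hi = mu_high + a[m] + b[m] / 2
        mu[:, m] = rng.uniform(lo, hi)                       # |mu_{i,m} - mu_{i,M}| <= xi_m
    return mu, xi

rng = np.random.default_rng(0)
a = np.array([0.075, 0.06, 0.04, 0.02, 0.0])
b = np.array([0.05, 0.04, 0.02, 0.01, 0.0])
mu_high = np.array([0.85, 0.84, 0.83, 0.98])                 # illustrative, gap >= 0.1 at fidelity M
mu, xi = sample_mf_instance(mu_high, a, b, rng)
print(np.round(mu, 4))
print(xi)                                                    # [0.1, 0.08, 0.05, 0.025, 0.0] as in Table 2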
D.4 Improving performance with a constant learning rate

As reported in [22], using a constant learning rate can improve the identification performance in standard BAI settings. In the following, we analyze the performance difference between MF-GRAD run with the theoretical learning rate and MF-GRAD run with a constant learning rate α = 0.25; we refer to this second version as MF-GRAD-CONST. Figures 7 and 8 report the performance of the two algorithms on the two bandit models of Section 5. As we can see, in both cases MF-GRAD-CONST outperforms MF-GRAD. We further investigate this behavior by showing the evolution of the empirical cost proportions of MF-GRAD-CONST during the learning process: Figures 9 and 10 report the evolution of the empirical costs over 100000 iterations. Comparing these with the MF-GRAD proportions in Figure 4, we can appreciate that MF-GRAD-CONST moves away from the initial cost proportions much sooner than MF-GRAD, which explains its superior performance in this moderate regime of δ.

D.5 Smaller value of δ

Finally, we repeat the experiments presented in the previous sections using a smaller value of δ. Specifically, we consider δ = 10⁻¹⁰. Figure 11 reports the performance on the 4 × 5 bandit model of Section 5, Figure 12 on the 5 × 2 bandit model of Section 5, and Figures 13 and 14 on the additional bandit models presented in Appendix D.3. As one can notice, the performance gap between MF-GRAD and the considered baselines increases.

Figure 4: Empirical cost proportions of MF-GRAD for 100000 iterations on the 4 × 5 bandit model of Table 2. Results are averaged over 100 runs and shaded areas report 95% confidence intervals. Empirical cost proportions of each arm are plotted with the same color. Cost proportions at fidelities 1, 2, 3, 4 and 5 are visualized with circles, squares, crosses, triangles, and diamonds, respectively.
Figure 5: Empirical cost complexity over 1000 runs with δ = 0.01 on the multi-fidelity bandit of Table 4.
Figure 6: Empirical cost complexity over 1000 runs with δ = 0.01 on the multi-fidelity bandit of Table 5.
Figure 7: Empirical cost complexity over 1000 runs with δ = 0.01 on the 4 × 5 multi-fidelity bandit of Section 5.
Figure 8: Empirical cost complexity over 1000 runs with δ = 0.01 on the 5 × 2 multi-fidelity bandit of Section 5.
Figure 9: Empirical cost proportions of MF-GRAD-CONST for 100000 iterations on the 4 × 5 bandit model of Section 5. Results are averaged over 100 runs and shaded areas report 95% confidence intervals. Empirical cost proportions of each arm are plotted with the same color. Cost proportions at fidelities 1, 2, 3, 4 and 5 are visualized with circles, squares, crosses, triangles, and diamonds, respectively.
Figure 10: Empirical cost proportions of MF-GRAD-CONST for 100000 iterations on the 5 × 2 bandit model of Section 5. Results are averaged over 100 runs and shaded areas report 95% confidence intervals. Empirical cost proportions of each arm are plotted with the same color. Cost proportions at fidelities 1 and 2 are visualized with circles and squares, respectively.
Figure 11: Empirical cost complexity over 1000 runs with δ = 10⁻¹⁰ on the 4 × 5 multi-fidelity bandit of Table 2.
Figure 12: Empirical cost complexity over 1000 runs with δ = 10⁻¹⁰ on the 5 × 2 multi-fidelity bandit of Table 3.
Figure 13: Empirical cost complexity over 1000 runs with δ = 10⁻¹⁰ on the multi-fidelity bandit of Table 4.
Figure 14: Empirical cost complexity over 1000 runs with δ = 10⁻¹⁰ on the multi-fidelity bandit of Table 5.

D.6 On LUCB-Explore A and LUCB-Explore B

In this section, we present in detail the main issue behind the algorithms proposed in [31], i.e., LUCBExplore A and LUCBExplore B: both algorithms may fail to stop on some specific multi-fidelity bandit models. First, we provide numerical evidence of this phenomenon by running both methods on a specific instance (Section D.6.1). Then, in Section D.6.2, we point out an error in the analysis of [31] that explains why both algorithms fail to stop on instances such as the one considered in Section D.6.1.

D.6.1 Experimental issues

When experimenting with the algorithms proposed in [31], namely LUCBExplore A and LUCBExplore B, we faced stopping issues: both algorithms did not terminate in any reasonable number of steps on some specific instances. We now report an illustrative example of such a scenario. Consider the following Gaussian multi-fidelity bandit model: µ1 = [0.64, 0.6], µ2 = [0.46, 0.5], λ = [0.1, 5], ξ = [0.1, 0] and σ² = 1. In this scenario, the well-known LUCB algorithm [12], which only uses samples at fidelity M, stops quickly (around iteration 100k), paying a total cost of roughly 500k. When running LUCBExplore A and LUCBExplore B, instead, we faced termination issues. We let both algorithms run for a maximum of 10⁸ samples (reaching a total cost of approximately 10⁷), and the stopping criterion was never met for LUCBExplore A, while 70% of the LUCBExplore B runs did not stop. LUCBExplore B explores more fidelities at the beginning, and that initial exploration can be enough to trigger the stopping test on some runs, but many continue until we artificially stop the experiment. Figure 15 reports the results of this experiment. As a final remark, we notice that both LUCBExplore A and LUCBExplore B require additional knowledge in order to run, namely an upper bound on µ_{1,M} and a lower bound on µ_{2,M} (assuming arms to be ordered so that µ_{1,M} > µ_{2,M} ≥ · · · ≥ µ_{K,M}). The results presented in this section were obtained by running their algorithms in the most favorable scenario, that is, the situation in which the agent has perfect knowledge of the values µ_{1,M} and µ_{2,M}.

Figure 15: Visualization of the non-stopping behavior of LUCBExplore A and LUCBExplore B.

D.6.2 Theoretical issues

The general idea of the LUCBExplore algorithms of [31] is to identify for each arm the optimal fidelity and pull the arm at that fidelity. In Appendix B.5, we described how that optimal fidelity can differ from the fidelity selected by our lower bound. Since our lower bound can be matched by an algorithm, and thus describes the actual cost complexity of the problem, it better represents the notion of optimal fidelity. We will thus call the fidelity used by the LUCBExplore algorithms the target fidelity instead. The two variants, Explore A and Explore B, differ in the mechanism used to look for the target fidelity.
We first show that even if their algorithm used an oracle for the fidelity exploration mechanism, returning the target fidelity for all arms, it would still not be able to stop on some examples. We then highlight an issue with the proof of [31].

Failure to stop with an oracle. Consider the bandit instance from Appendix B.5. Recall that this is a 2 × 2 example of a multi-fidelity BAI problem with ξ1 = 0.1, ξ2 = 0.0, µ_{1,M} = 0.6, µ_{1,m} = 0.65, µ_{2,M} = 0.5, µ_{2,m} = 0.45 (we write M = 2 and m = 1). All distributions are Gaussian with variance 1. We choose λ_M > 4λ_m, which means that the target fidelities for that problem are m*_1 = 1 and m*_2 = 1 (see details in Appendix B.5). LUCBExplore with an oracle that always selects that fidelity is the following algorithm:

Initialization: µ̂_{k,m}(t) = 0, N_{k,m}(t) = 0, UCB_k(t) = 1, LCB_k(t) = 0 for all arms k and fidelities m; ℓ_t = 1, u_t = 2.
While LCB_{ℓ_t}(t) ≤ UCB_{u_t}(t):
  ℓ_t = argmax_k UCB_k(t), u_t = argmax_{k ∈ [K] \ {ℓ_t}} UCB_k(t);
  pull arms ℓ_t and u_t at their target fidelity.

The indices are

LCB_k(t) = max_m ( µ̂_{k,m}(t) − ξ_m − β(N_{k,m}(t), t, δ) ),
UCB_k(t) = min_m ( µ̂_{k,m}(t) + ξ_m + β(N_{k,m}(t), t, δ) ),

where β(n, t, δ) = √(log(L t⁴ / δ) / n) for some constant L > 0. In the two-arm example considered here, the algorithm simplifies greatly: it always pulls both arms alternately, always at fidelity m = 1, and it stops when the LCB of one arm surpasses the UCB of the other. We show that it cannot stop and return the best arm 1, unless a confidence interval is not valid, which happens with small probability. If µ̂_{1,1}(t) ≤ µ_{1,1} + β(t/2, t, δ), then

LCB_1(t) = max{0, µ̂_{1,1}(t) − ξ_1 − β(t/2, t, δ)} ≤ µ_{1,1} − ξ_1 = 0.55.

On the other hand, if µ̂_{2,1}(t) ≥ µ_{2,1} − β(t/2, t, δ), then

UCB_2(t) = min{1, µ̂_{2,1}(t) + ξ_1 + β(t/2, t, δ)} ≥ µ_{2,1} + ξ_1 = 0.55.

We get that we always have LCB_1(t) ≤ UCB_2(t), unless one of the two concentration inequalities on the empirical means does not hold. The confidence width β is designed to make those inequalities true for all t ∈ N with probability close to 1. We can similarly get that LCB_2(t) ≤ UCB_1(t) (which is expected since 2 is a worse arm) unless some concentration inequality is false. We obtain that this algorithm with an oracle selection for the target fidelity cannot stop fast: the only way it can stop is if unlikely deviations occur.
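The non-stopping behavior can also be checked by direct simulation. The sketch below is our own implementation of the indices above (with the arbitrary choice L = 1); it runs the oracle-fidelity variant on this instance and verifies that the stopping condition is essentially never met over a large horizon.

# Simulation sketch of LUCBExplore with oracle target fidelities on the 2 x 2 instance above.
import numpy as np

rng = np.random.default_rng(0)
mu = {1: 0.65, 2: 0.45}                 # means at the target fidelity m = 1
xi1, delta, L, sigma = 0.1, 0.01, 1.0, 1.0
T = 200_000                             # number of rounds (two pulls per round)

def beta(n, t):
    return np.sqrt(np.log(L * t ** 4 / delta) / n) if n > 0 else np.inf

sums = {1: 0.0, 2: 0.0}
n = {1: 0, 2: 0}
stopped = False
for t in range(1, T + 1):
    for k in (1, 2):                    # both arms are pulled at fidelity 1 every round
        sums[k] += rng.normal(mu[k], sigma)
        n[k] += 1
    # Fidelity 2 is never pulled, so it only contributes the trivial bounds 0 and 1.
    lcb1 = max(0.0, sums[1] / n[1] - xi1 - beta(n[1], 2 * t))
    ucb2 = min(1.0, sums[2] / n[2] + xi1 + beta(n[2], 2 * t))
    if lcb1 > ucb2:                     # stopping test of the algorithm
        stopped = True
        break
print("stopped:", stopped)              # expected: False (LCB_1 <= 0.55 <= UCB_2 w.h.p.)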
Issue with the proof. There is an issue with the proof of the cost complexity upper bound of [31]. The issue lies in the first three steps of their Appendix E.2, pages 17 and 18. They identify a threshold c (with value 0.55 in the example of the last paragraph) and prove the following. Step 1: if the algorithm does not terminate and the confidence intervals hold, then either both LCB_{ℓ_t}(t) ≤ c and UCB_{ℓ_t}(t) ≥ c, or both LCB_{u_t}(t) ≤ c and UCB_{u_t}(t) ≥ c. Step 2: the confidence intervals are likely to hold. Step 3: if a sub-optimal arm k satisfies LCB_k(t) ≤ c and UCB_k(t) ≥ c, then its target fidelity cannot be pulled much. They conclude that, for all arms, the number of pulls at the target fidelity is upper bounded, with large probability.

Let us see the issue with that proof, on the same example as in the last paragraph. In the example above with the oracle choice for the target fidelity, we saw that if the confidence intervals hold and ℓ_t = 1 (which is the most likely outcome), then LCB_{ℓ_t}(t) ≤ c and UCB_{ℓ_t}(t) ≥ c. That is, Step 1 gives a condition on arm 1 only (and nothing on arm 2). But then we get nothing from Step 3, since arm 1 is not a sub-optimal arm. We only get an upper bound on the number of pulls of a sub-optimal arm if we can say that it satisfies LCB_k(t) ≤ c and UCB_k(t) ≥ c at some point, but this might not be the case. Indeed, when the algorithm does not terminate, Steps 1 and 2 together give that, with large probability, either both LCB_{ℓ_t}(t) ≤ c and UCB_{ℓ_t}(t) ≥ c, or both LCB_{u_t}(t) ≤ c and UCB_{u_t}(t) ≥ c. It is possible that we always have this property for ℓ_t = 1 (the optimal arm), and that we can never apply Step 3.