# Optimal Multi-Fidelity Best-Arm Identification

Riccardo Poiani, DEIB, Politecnico di Milano, Milan, Italy. riccardo.poiani@polimi.it
Rémy Degenne, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France. remy.degenne@inria.fr
Emilie Kaufmann, Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France. emilie.kaufmann@univ-lille.fr
Alberto Maria Metelli, DEIB, Politecnico di Milano, Milan, Italy. albertomaria.metelli@polimi.it
Marcello Restelli, DEIB, Politecnico di Milano, Milan, Italy. marcello.restelli@polimi.it

In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible. We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost. Several methods have been proposed for tackling this problem, but their optimality remains elusive, notably due to loose lower bounds on the total cost needed to identify the best arm. Our first contribution is a tight, instance-dependent lower bound on the cost complexity. The study of the optimization problem featured in the lower bound provides new insights to devise computationally efficient algorithms, and leads us to propose a gradient-based approach with asymptotically optimal cost complexity. We demonstrate the benefits of the new algorithm compared to existing methods in experiments. Our theoretical and empirical findings also shed light on an intriguing concept of optimal fidelity for each arm.

1 Introduction

In multi-armed bandits [20], an algorithm chooses at each step one arm among K > 1 possibilities. It then observes a reward, sampled from a probability distribution on R corresponding to the arm. Several goals are possible for the algorithm, and we focus on the best arm identification task (BAI), in which we aim to identify the arm with the largest mean, using as few samples as possible. This is a well-studied problem [6, 1, 12, 10, 8] with potential applications to, e.g., A/B/n testing [27] or hyper-parameter optimization [11]. In some applications, like physics, parameter studies, or hyper-parameter optimization, getting a sample from the arm distribution might be expensive since it requires evaluating or training a complex model and is computationally demanding. However, it is often the case that cheaper, less accurate sampling methods are available, for instance, by using a coarser model in the physics study example.

Work done while at Inria, Lille, France.
38th Conference on Neural Information Processing Systems (NeurIPS 2024).

The multi-fidelity bandit framework takes such scenarios into account. When choosing an arm, the algorithm also chooses a fidelity, with a trade-off: a higher fidelity gives a more precise observation but has a higher cost. We assume that the algorithm knows both the cost and the maximal bias of the observations from each fidelity. This is also how the knowledge about the fidelity was modeled in prior work [see, e.g., 16, 15, 25, 31]. The goal is then to find the best arm (i.e., the arm with the highest mean at the highest fidelity) with high probability and minimal cost. Specifically, the bandit algorithm interacts with the multi-fidelity environment and gathers information to find which arm has the highest mean when pulled at the highest fidelity.
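To make this interaction protocol concrete, the snippet below is a minimal sketch, not taken from the paper: all names, means, costs, and precisions are illustrative. At each round the learner picks an arm and a fidelity, receives a (here Gaussian) observation whose mean is within ξ_m of the highest-fidelity mean, and pays the corresponding cost λ_m.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-arm, 2-fidelity instance: mu[a, m] is the mean of arm a at fidelity m,
# lam[m] the sampling cost, xi[m] the bias bound |mu[a, m] - mu[a, M]| <= xi[m].
mu = np.array([[0.45, 0.50],
               [0.55, 0.60]])   # arm 1 has the largest mean at the highest fidelity
lam = np.array([0.5, 5.0])       # costs increase with the fidelity
xi = np.array([0.1, 0.0])        # the highest fidelity is unbiased (xi_M = 0)

def pull(arm, fidelity):
    """Observe `arm` at `fidelity` (Gaussian noise) and return (sample, cost paid)."""
    return rng.normal(mu[arm, fidelity], 1.0), lam[fidelity]

total_cost = 0.0
for arm in range(mu.shape[0]):
    for fidelity in range(mu.shape[1]):
        _, cost = pull(arm, fidelity)
        total_cost += cost
print(f"cost paid after one pull of every (arm, fidelity) pair: {total_cost}")
```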
In the fixed confidence setting, we want to ensure that the algorithm returns a correct answer with probability at least 1 - δ for a given parameter δ ∈ (0, 1). A good algorithm should do that at a minimum cost, and thus, the appropriate quality metric for evaluating an algorithm's performance is the sum of costs paid until it stops, i.e., the cost complexity. Previous work on the multi-fidelity BAI problem [25, 31] provided lower bounds on the cost complexity as well as algorithms with cost upper bounds. Those lower and upper bounds do not match, and the proposed methods require additional prior information [31], or their guarantees are restricted to problems satisfying additional hypotheses [25]. We lift all those requirements and provide an improved lower bound and an algorithm with a matching upper bound.

Contributions and organization of the paper  After presenting additional related works, in Section 2 we define fixed-confidence best arm identification in multi-fidelity bandits in more mathematical detail and introduce the notation used throughout the paper. Then, Section 3 contains our first contribution: a tight instance-dependent lower bound on the cost complexity of any algorithm, expressed as the maximum of a complex function over all possible cost allocations. We also highlight features of that lower bound, like the existence of an optimal fidelity for each arm, which should be chosen exclusively. In Section 4, we propose a computationally efficient procedure for computing gradients of the function featured in the lower bound and describe a gradient-based algorithm whose cost complexity asymptotically matches the lower bound. Finally, in Section 5, we present the results of numerical experiments which demonstrate the good empirical performance of our new algorithm compared to prior work.

Additional related works  The multi-fidelity setting has mostly been studied in the context of Bayesian optimization [9, 24, 17, 26, 14, 21] and black-box function optimization with different structural assumptions [28, 29, 7, 23]. The goal there is to find the minimum of a function by successive queries of that function or of cheaper approximations. The metric for success in these works is most often the simple regret, that is, the difference between the best value found and the true minimum, although other goals were considered, like the cumulative regret [16, 15]. Furthermore, we notice that best arm identification with costs has recently been studied in [13] for BAI with only one fidelity. The authors introduce a variant of the Track-and-Stop algorithm [8] and prove its asymptotic optimality. However, we will not be able to adapt this study to the multi-fidelity case because, as we shall see, it requires solving a complex optimization problem for which we have no efficient solution. Finally, our work is related to the vast strand of BAI studies that propose tight lower bounds with asymptotically optimal algorithms [e.g., 8, 4, 22]; nevertheless, as we discuss throughout the text, these studies cannot be directly applied to the multi-fidelity BAI problem.

2 Background

In this section, we provide essential background and notation that is used throughout the rest of the paper. A table that summarizes the notation is available in Appendix A. A multi-fidelity bandit model with K arms and M fidelities is a set of K × M probability distributions ν = (ν_{a,m})_{a∈[K], m∈[M]}, where ν_{a,m} has mean µ_{a,m}.
For each arm a ∈ [K], µ_{a,m} represents the mean value of an observation of arm a using fidelity m, and we let µ = (µ_{a,m})_{a∈[K], m∈[M]}. An observation at fidelity m is assigned a (known) cost λ_m ≥ 0, with λ_1 < λ_2 < ... < λ_M. The goal is to identify the arm that has the largest mean at the highest fidelity M, a⋆(µ) := argmax_{a∈[K]} µ_{a,M} (sometimes denoted by ⋆ in the sequel to ease notation), with a small total sampling cost, by exploring the arms at different fidelities and using some prior knowledge about their precision. Specifically, we assume that there are some (known) values ξ_1 > ξ_2 > ... > ξ_M = 0 such that, for every arm a ∈ [K], the vector µ_a := (µ_{a,m})_{m∈[M]} satisfies

∀ m ∈ [M],  |µ_{a,m} - µ_{a,M}| ≤ ξ_m .

We write µ_a ∈ MF to indicate that arm a satisfies these multi-fidelity constraints, with these particular parameters ξ_m (although they are not shown in the notation). In this paper, we consider arms that belong to a canonical exponential family [2]. This includes, e.g., arms that have Bernoulli distributions or Gaussian distributions with known variances. Such models are known to be characterized by their means, and we refer to such an exponential multi-fidelity bandit model ν using the means of its arms µ, which belongs to the set M_MF := {µ ∈ Θ^{K×M} : ∀ a ∈ [K], µ_a ∈ MF}, where Θ ⊆ R is the interval of possible means.

At each interaction round t = 1, 2, ..., the agent selects an arm A_t and a fidelity M_t, observes a sample X_t ~ ν_{A_t, M_t} and pays a cost λ_{M_t}. Letting F_t = σ(A_1, M_1, X_1, ..., A_t, M_t, X_t) be the sigma-field generated by the observations up to time t, a fixed-confidence identification algorithm takes as input a risk parameter δ ∈ (0, 1) and is defined by the following ingredients: (i) a sampling rule (A_t, M_t)_t, where (A_t, M_t) is F_{t-1}-measurable, (ii) a stopping rule τ_δ, which is a stopping time w.r.t. F_t, and (iii) a decision rule â_{τ_δ} ∈ [K], which is F_{τ_δ}-measurable. We want to build strategies that ensure P_µ(â_{τ_δ} ≠ a⋆(µ)) ≤ δ for all µ ∈ M_MF with a unique optimal arm. Such a strategy is called δ-correct. Among δ-correct strategies, we are looking for strategies that minimize the expected identification cost (i.e., cost complexity), defined as

E_µ[c_{τ_δ}] := Σ_{a∈[K]} Σ_{m∈[M]} λ_m E_µ[N_{a,m}(τ_δ)] = Σ_{a∈[K]} Σ_{m∈[M]} E_µ[C_{a,m}(τ_δ)],

where N_{a,m}(t) denotes the number of pulls of arm a at fidelity m up to time t and C_{a,m}(t) = λ_m N_{a,m}(t) denotes the cost associated to these pulls. In the sequel, we will provide cost complexity guarantees for multi-fidelity instances µ that belong to the set M⋆_MF of multi-fidelity instances with a unique optimal arm, i.e., for which |a⋆(µ)| = 1. We remark that for M = 1 and λ_m = 1 we recover the best arm identification problem in a classical bandit model, for which the cost complexity coincides with the sample complexity E_µ[τ_δ].

Additional notation  Given an integer n ∈ N, we denote by Δ_n the n-dimensional simplex. Furthermore, given x, y ∈ (0, 1), we define kl(x, y) = x log(x/y) + (1 - x) log((1 - x)/(1 - y)). Given (p, q) ∈ Θ², we denote by d(p, q) the Kullback-Leibler (KL) divergence between the distribution in the exponential family with mean p and that with mean q. We also write d-(x, y) = d(x, y) 1{x ≥ y} and d+(x, y) = d(x, y) 1{x ≤ y}. Finally, we denote by v(p) the variance of the distribution with mean p.

3 On the cost complexity of multi-fidelity best-arm identification

In this section, we discuss the statistical complexity of identifying the best arm in MF-BAI problems. Formal proofs of the claims of this section are presented in Appendix B.
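The divergences just introduced are all that the algorithms in this paper need from the exponential family. As a concrete instance, the sketch below (an illustration, not the paper's code) implements kl, d, d+, d- and v for Gaussian arms with a known variance; for Bernoulli arms one would instead use the kl function for d and set v(p) = p(1 - p).

```python
import numpy as np

SIGMA2 = 1.0  # known variance of the Gaussian arms (assumption of this sketch)

def d(p, q):
    """KL divergence between Gaussians with means p, q and variance SIGMA2."""
    return (p - q) ** 2 / (2 * SIGMA2)

def d_plus(p, q):
    """d(p, q) restricted to the direction p <= q, i.e. d(p, q) * 1{p <= q}."""
    return d(p, q) if p <= q else 0.0

def d_minus(p, q):
    """d(p, q) restricted to the direction p >= q, i.e. d(p, q) * 1{p >= q}."""
    return d(p, q) if p >= q else 0.0

def v(p):
    """Variance function of the exponential family (constant for Gaussians)."""
    return SIGMA2

def kl(x, y):
    """Bernoulli KL divergence kl(x, y), used in the delta-correctness arguments."""
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
```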
3.1 Lower bound on the cost complexity

We present an instance-dependent lower bound on the expected cost complexity. The lower bound uses the solution to an optimization problem, where the functions optimized quantify the trade-off between the information gained by pulling an arm at some fidelity and the cost of that fidelity. Since those functions also appear in our algorithm, we will now introduce notation for them. For all ω ∈ Δ_{K×M} and µ ∈ Θ^{K×M}, we define

f_{i,j}(ω, µ) := inf_{θ_i ∈ MF, θ_j ∈ MF : θ_{j,M} ≥ θ_{i,M}}  Σ_{a∈{i,j}} Σ_{m∈[M]} (ω_{a,m}/λ_m) d(µ_{a,m}, θ_{a,m}) ,   (1)
F(ω, µ) := max_{i∈[K]} min_{j≠i} f_{i,j}(ω, µ) .   (2)

The quantity f_{i,j}(ω, µ) is the dissimilarity, according to a KL weighted by the costs, between µ and the closest θ ∈ Θ^{K×M} such that arms i and j satisfy the multi-fidelity constraints and θ_k = µ_k for k ∉ {i, j}, with arm j better than arm i. If µ ∈ M_MF then that closest θ is also in M_MF, but otherwise it might not be the case: if an arm k ∉ {i, j} is not in MF for µ, then it is equally not in MF for θ. For µ ∈ M_MF the maximum in the definition of F is realized at the best arm ⋆, as min_{a≠i} f_{i,a}(ω, µ) is zero for i ≠ ⋆. That is, F(ω, µ) = min_{j≠⋆} f_{⋆,j}(ω, µ). We define F with a maximum over i, and not with that last expression, because we want to define it for all points in Θ^{K×M}, even the points which are not in M_MF. For those points, we could imagine different notions of best arm, for example argmax_k µ_{k,M}, but the right one for our algorithm is the arm for which we have the most evidence (weighted by cost) to say that all other arms are not better. That arm is the argmax in our definition of F. Given these definitions, we now introduce our new lower bound.

Theorem 3.1. Let δ ∈ (0, 1). For any δ-correct strategy and any multi-fidelity bandit model µ ∈ M⋆_MF, it holds that

E_µ[c_{τ_δ}] ≥ C*(µ) log(1/(2.4 δ)) ,   (3)

where C*(µ)^{-1} := sup_{ω∈Δ_{K×M}} F(ω, µ) = sup_{ω∈Δ_{K×M}} min_{a≠⋆} f_{⋆,a}(ω, µ).

The quantity C*(µ) describes the statistical complexity of an MF problem µ as the typical max-min game that appears in lower bounds for BAI problems [see, e.g., 8, 3]. Specifically, first the max-player chooses a vector ω ∈ Δ_{K×M}, and then the min-player chooses a bandit model θ ∈ M_MF in which the optimal arm is different, with the goal of minimizing the function F(ω, µ). Following the methods from previous work, the objective value for ω and θ should be Σ_{a∈[K]} Σ_{m∈[M]} ω_{a,m} d(µ_{a,m}, θ_{a,m})/λ_m, featuring a sum over all arms and fidelities. However, in the definition of f_{i,j}(ω, µ) we restrict θ to be different from µ on only two arms. We can prove that if µ ∈ M_MF, this gives the same objective value at the minimizing θ as the full sum. The difference will be important in our algorithm, which will compute that minimizer for points µ̂ that do not belong to M_MF.

A difference with standard BAI settings is that in Equation (1) each ω ∈ Δ_{K×M} should be interpreted as a vector of cost proportions that the max-player is investing (in expectation) in each arm-fidelity pair to identify the optimal arm µ_{⋆,M}.² We can interpret the oracle weights ω* ∈ argmax_{Δ_{K×M}} F(ω, µ) as the optimal cost proportions that the agent should follow in order to identify µ_{⋆,M} while minimizing the identification cost. To clarify the difference and the relationship between cost and pull proportions, we notice that, given a cost proportion ω, it is always possible to compute the pull proportions π(ω) ∈ Δ_{K×M} that the agent should play in order to incur the cost proportions specified by ω, and vice versa. More specifically, these relationships are described for each arm-fidelity pair by the following equations, for every a ∈ [K] and m ∈ [M]:

π_{a,m}(ω) = (ω_{a,m}/λ_m) / ( Σ_{i∈[K]} Σ_{j∈[M]} ω_{i,j}/λ_j ) ,    ω_{a,m}(π) = λ_m π_{a,m} / ( Σ_{i∈[K]} Σ_{j∈[M]} λ_j π_{i,j} ) .   (4)
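As a quick illustration of the conversion in Equation (4), the following sketch (with made-up numbers for K = 2 arms and M = 2 fidelities) maps a cost-proportion vector to the induced pull proportions and verifies that the two maps are inverses of each other.

```python
import numpy as np

lam = np.array([0.5, 5.0])       # fidelity costs (illustrative)
omega = np.array([[0.1, 0.3],    # cost proportions, one row per arm
                  [0.2, 0.4]])   # entries sum to 1

def cost_to_pull(omega, lam):
    """Equation (4): pi_{a,m} proportional to omega_{a,m} / lam_m."""
    unnorm = omega / lam          # broadcasting divides each column m by lam[m]
    return unnorm / unnorm.sum()

def pull_to_cost(pi, lam):
    """Inverse map: omega_{a,m} proportional to lam_m * pi_{a,m}."""
    unnorm = pi * lam
    return unnorm / unnorm.sum()

pi = cost_to_pull(omega, lam)
print(np.round(pi, 3), pi.sum())                  # pull proportions, summing to 1
print(np.allclose(pull_to_cost(pi, lam), omega))  # True: the two maps are inverses
```

For the same cost share, a cheaper fidelity thus receives proportionally more pulls than an expensive one.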
As a direct consequence, it is possible to rewrite C*(µ)^{-1} as a function of π, the pull proportions. Doing so reveals that the minimizer θ in f_{⋆,j} does not depend on the costs: it is also the minimizer of Σ_{a∈{i,j}} Σ_{m∈[M]} π_{a,m} d(µ_{a,m}, θ_{a,m}). While the agent optimizes the cost proportions ω to get the best possible information/cost ratio, the min-player minimizes only the information available to the algorithm to tell µ and θ apart. Finally, we notice that F(ω, µ) is concave in ω,³ but F(ω(π), µ) is not concave in π. As we shall see in Section 4, this difference will play a crucial role in constructing an asymptotically optimal algorithm. The formulation of the lower bound as a game where one player maximizes an information/cost ratio while the other player minimizes information makes our result close to lower bounds for regret minimization like the one of [5], where the (unknown) gap of an arm plays the role of the cost.

Comparison to previous work  The only known lower bound for the multi-fidelity BAI problem is the one presented in [25]. That same bound was then shown in [31]. The bound from those previous works is looser than Theorem 3.1. For example, in a two-arm bandit with a single fidelity (denoted by M) and Gaussian rewards with variance 1, the bound from previous work is λ_M (µ_{1,M} - µ_{2,M})^{-2} log(1/(2.4δ)), while our lower bound is 8 λ_M (µ_{1,M} - µ_{2,M})^{-2} log(1/(2.4δ)). Furthermore, on particular instances with 2 arms and 2 fidelities, we can prove that our lower bound improves by a factor λ_M/λ_1, which can be arbitrarily large (see Appendix B.2). More generally, the proof of the previous lower bounds exhibits a particular point in the alternative, which makes it always looser than our bound, which features an infimum over all points. Theorem 3.1 is also optimal in the regime δ → 0 since it is matched by the algorithm we introduce in the next section.

²This claim is evident when looking at the proof of Theorem 3.1.
³This is a consequence of the fact that F(ω, µ) is an infimum over linear functions of ω.

3.2 Sparsity of the oracle weights: a tight concept of optimal fidelity

We conclude our study of the lower bound by further analyzing the optimal allocation ω*. Unlike in the standard best arm identification problem, we did not find an efficient algorithm to compute it, which prevents us from using a Track-and-Stop-like approach [8]. Nevertheless, we will explain in the next section how to efficiently compute the f_{i,j} functions and their gradient. These computations are crucial for our algorithm but also allow us to prove our next result about the possible sparsity of ω*. For each arm a ∈ [K], it is not difficult to show that there must exist some fidelity m ∈ [M] for which ω*_{a,m} > 0 (Lemma B.2). However, as the following result highlights, in most cases only one fidelity per arm has non-zero weight.

Theorem 3.2. Let Δ*_{K×M}(µ) := argmax_{ω∈Δ_{K×M}} F(ω, µ) and

M̃_MF := { µ ∈ M⋆_MF : ∃ i ∈ [K], ∃ (m_1, m_2) ∈ [M]² with m_1 ≠ m_2, ∃ ω* ∈ Δ*_{K×M}(µ) : ω*_{i,m_1} > 0, ω*_{i,m_2} > 0 } .

The set M̃_MF is a subset of R^{K×M} whose Lebesgue measure is zero.

Theorem 3.2 implies that in almost all multi-fidelity bandits, for any ω* ∈ Δ*_{K×M}(µ) and each arm a ∈ [K], there exists a single fidelity m*_a ∈ [M] for which ω*_{a, m*_a} > 0 holds.
However, we note that this result does not offer an easy way to compute these optimal arm-dependent fidelities.⁴ Nevertheless, as we shall see in the next section, our algorithm does not actually require identifying these optimal fidelity levels to enjoy optimality guarantees. Finally, we remark that existing MF-BAI works [25, 31] already proposed notions of optimal, arm-dependent fidelity that the agent should employ to identify the optimal arm ⋆. Nevertheless, as we verify in Appendix B.5, these concepts do not comply with the concept of optimal fidelity that arises from the tight lower bound of Theorem 3.1. In other words, there exist bandit models µ in which following these alternative concepts of optimal fidelity leads to sub-optimal performance.

4 The multi-fidelity sub-gradient ascent algorithm

We present our solution for solving MF-BAI problems, an algorithm called Multi-Fidelity Sub-Gradient Ascent (MF-GRAD). Its pseudocode can be found in Algorithm 1. All proofs for this section are presented in Appendix C.

A reader familiar with the literature on BAI algorithms inspired from lower bounds like Theorem 3.1 may have the natural idea of simply using the Track-and-Stop algorithm [8] or the related game-based algorithm of [4]. Those algorithms can't be directly applied here, first because of the costs: we want to bound the cost complexity, not the stopping time, and adapting those methods to costs is not trivial. Furthermore, Track-and-Stop (even in the cost-aware variant of [13]) would require the computation of the optimal cost proportions at µ̂(t), which is a max-min problem for which we don't have an efficient algorithm. Our solution is inspired by the gradient ascent algorithm of [22], which requires computing gradients of F (hence only a minimization problem and not a max-min). The same innovations required to extend this method to the multi-fidelity case could likely allow us to adapt the algorithm of [4], or the exploration part of the regret-minimizing algorithm of [5].

Let us introduce some auxiliary notation. Let ω̄ ∈ Δ_{K×M} be the uniform vector (1/(KM), ..., 1/(KM)). For all t ∈ N, we define Clip_t(x) = (min{x_{a,m}, G√t})_{a∈[K], m∈[M]} for an arbitrary constant G > 0. We also define α_t = 1/√t and γ_t = 1/(4√t). Finally, for all t ∈ N, we denote by C(t) ∈ R^{KM} the vector whose (a, m)-th dimension is given by C_{a,m}(t). We now present Algorithm 1.

Sampling rule  After a first initialization phase in which the algorithm pulls each arm at each fidelity once (Line 1), the agent starts its sub-gradient ascent routine. More specifically, at each iteration t ∈ N, the agent first computes the vector ω̃(t + 1) using the Exponential Weights algorithm on the sequence of gain functions {g_s}_{s=1}^{t} := {Clip_s( (Σ_{a,m} λ_m π̃_{a,m}(s)) ∇F(ω̃(s), µ̂(s)) )}_{s=1}^{t}, where π̃(t) := π(ω̃(t)) represents the pull proportions induced by ω̃(t) and ∇F(ω̃(s), µ̂(s)) denotes a sub-gradient of F(ω, µ) w.r.t. ω (Line 3). Neglecting for a moment the clipping function and the term Σ_{a,m} λ_m π̃_{a,m}(s) (these terms are present mainly for technical reasons), this step can be interpreted, from an intuitive perspective, as finding a sequence of weights {ω̃(t)}_t that minimizes the regret on the sequence of empirical losses F(ω*, µ̂(s)) - F(ω̃(s), µ̂(s)).⁵

⁴We provide insights on cases in which it is possible to compute the optimal fidelity in Appendix B.4.
⁵Whenever µ̂(s) is sufficiently close to µ, this implicitly generates a sequence of weights that provide values of F(·, µ) "close" to the one of the oracle weights ω*.
Algorithm 1 Multi-Fidelity Sub-Gradient Ascent
1: Initialization. Pull each arm at each fidelity once, and set ω̃(t) = ω̄ for all t ∈ {1, ..., KM}
2: Sampling Rule (for t ≥ KM)
3:   Sub-gradient Ascent: compute ω̃(t + 1) with Exponential Weights (learning rate α_{t+1}) on the gains g_s = Clip_s( (Σ_{a,m} λ_m π̃_{a,m}(s)) ∇F(ω̃(s), µ̂(s)) ), for s = KM, ..., t
4:   From Costs to Pulls: π̃_{a,m}(t + 1) = (ω̃_{a,m}(t + 1)/λ_m) / ( Σ_{i∈[K]} Σ_{j∈[M]} ω̃_{i,j}(t + 1)/λ_j ), for all a ∈ [K], m ∈ [M]
5:   Forced Exploration: π̃^γ(t + 1) = (1 - γ_t) π̃(t + 1) + γ_t ω̄
6:   Cumulative Tracking: (A_{t+1}, M_{t+1}) ∈ argmax_{(a,m)∈[K]×[M]} Σ_{s=1}^{t} π̃^γ_{a,m}(s) - N_{a,m}(t)
7: Stopping Rule: τ_δ = inf{ t ≥ KM : max_{i∈[K]} min_{j≠i} f_{i,j}(C(t), µ̂(t)) > β_{t,δ} }
8: Decision Rule: â_{τ_δ} ∈ argmax_{i∈[K]} min_{j≠i} f_{i,j}(C(τ_δ), µ̂(τ_δ))

At this point, once ω̃(t + 1) is computed, Algorithm 1 converts these cost proportions into pull proportions while adding some forced exploration (Lines 4-5), and then it applies a standard cumulative tracking procedure [8] in the pull-proportion space, so as to ensure that N_{a,m}(t) tracks Σ_{s=1}^{t} π̃^γ_{a,m}(s) (Line 6).

Stopping and decision rule  Finally, the algorithm applies a generalized likelihood ratio (GLR) test to decide when to stop (Line 7). For i, j ∈ [K], f_{i,j}(C(t), µ̂(t)) can be interpreted as a GLR statistic for comparing two classes: Θ^{K×M} versus {θ : θ_i ∈ MF, θ_j ∈ MF, θ_{j,M} ≥ θ_{i,M}}. If that GLR is large enough (if it exceeds a threshold β_{t,δ}), we can reject the hypothesis that µ belongs to the second class. If there is an arm i for which we can reject the alternative class for all j ≠ i, we have rejected all θ ∈ M_MF where i is not the best arm, and we can safely stop and return the answer â_{τ_δ} = i. Since each f_{i,j} is expressed as a sum over only two arms and M fidelities, it is possible to show that choosing β_{t,δ} ≈ log(K/δ) + 2M log(log(t) + 1) (see its exact expression in (31)) guarantees the correctness of the test, namely that P_µ(â_{τ_δ} ≠ ⋆) ≤ δ holds (Proposition C.13).

4.1 Theoretical guarantees

At this point, we are ready to state the main theoretical result on the performance of our algorithm.

Theorem 4.1. For any multi-fidelity bandit model µ ∈ M_MF, Algorithm 1 using the threshold β_{t,δ} given in (31) is δ-correct and satisfies

lim sup_{δ→0}  E_µ[c_{τ_δ}] / log(1/δ)  ≤  C*(µ).   (5)

As we can see from Theorem 4.1, Algorithm 1 is asymptotically optimal, meaning that it matches the lower bound we presented in Theorem 3.1 in the asymptotic regime δ → 0.

Comparison with existing MF-BAI algorithms  We conclude this section by comparing our results with the literature [25, 31]. First, [25] and [31] rely on additional assumptions that play a crucial role both for the algorithm design and the resulting theoretical guarantees. In [25], the authors enforce an additional and intricate structural assumption on the relationships between the λ's and the ξ's (see Assumption 1 in [25]). In [31], instead, the authors assume additional knowledge expressed as an upper bound on µ_{⋆,M} and a lower bound on max_{i≠⋆} µ_{i,M}. For both works, whenever these assumptions are not satisfied (i.e., the λ's and ξ's do not respect Assumption 1 in [25], and the knowledge on µ_{⋆,M} and max_{i≠⋆} µ_{i,M} is imprecise or not available), the theoretical guarantees offered by existing algorithms are arbitrarily sub-optimal. On the other hand, our algorithm requires no additional assumptions and is the only one that matches exactly the cost complexity lower bound. Indeed, neither the cost upper bound of [25] nor the one of [31] matches the lower bound of Theorem 3.1, even when their additional hypotheses are satisfied.
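Before turning to how the sub-gradient itself is computed, the following schematic sketch makes Lines 3-6 of Algorithm 1 concrete. It is an illustration of the structure of the update, not a faithful implementation of the paper's algorithm: the sub-gradient `grad` is assumed to be provided (Theorem 4.3 below explains how to compute it), the exponential-weights step is written in its multiplicative form, and the step-size and clipping schedules are those assumed above.

```python
import numpy as np

def mf_grad_step(t, gain_sum, pi_sum, N, omega_tilde, grad, lam, K, M, Gclip=10.0):
    """Schematic version of Lines 3-6 of Algorithm 1 (one sampling step).

    gain_sum:    running sum of past clipped gain vectors g_s (K x M), updated here
    pi_sum:      running sum of exploration-mixed pull proportions (K x M)
    N:           current pull counts N_{a,m}(t) (K x M)
    omega_tilde: current cost-proportion iterate (K x M)
    grad:        a sub-gradient of F(., mu_hat(t)) at omega_tilde (K x M)
    lam:         fidelity costs (length M)
    """
    alpha = 1.0 / np.sqrt(t + 1)         # learning rate (assumed schedule)
    gamma = 1.0 / (4.0 * np.sqrt(t))     # forced exploration rate (assumed schedule)
    uniform = np.full((K, M), 1.0 / (K * M))

    # pull proportions induced by the current cost proportions, Equation (4)
    pi_tilde = (omega_tilde / lam) / np.sum(omega_tilde / lam)

    # Line 3: exponential-weights update on the clipped gains
    gain = np.minimum(np.sum(lam * pi_tilde) * grad, Gclip * np.sqrt(t))
    gain_sum = gain_sum + gain
    logits = alpha * gain_sum
    omega_new = np.exp(logits - logits.max())
    omega_new /= omega_new.sum()

    # Line 4: convert the new cost proportions into pull proportions
    pi_new = (omega_new / lam) / np.sum(omega_new / lam)

    # Line 5: forced exploration towards the uniform vector
    pi_new = (1.0 - gamma) * pi_new + gamma * uniform

    # Line 6: cumulative tracking, pull the most under-sampled (arm, fidelity) pair
    pi_sum = pi_sum + pi_new
    a, m = np.unravel_index(np.argmax(pi_sum - N), (K, M))
    return (a, m), omega_new, gain_sum, pi_sum

# Minimal usage with dummy state and a dummy sub-gradient (illustrative only).
K, M = 2, 2
lam = np.array([0.5, 5.0])
choice, *_ = mf_grad_step(t=K * M, gain_sum=np.zeros((K, M)), pi_sum=np.zeros((K, M)),
                          N=np.ones((K, M)), omega_tilde=np.full((K, M), 1 / (K * M)),
                          grad=np.ones((K, M)), lam=lam, K=K, M=M)
print("next (arm, fidelity) to pull:", choice)
```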
4.2 Computing the gradient of F(ω, µ)

Algorithm 1 requires computing a sub-gradient of F(ω, µ). Notably, we remark that this is needed for a generic µ ∈ Θ^{K×M}, as µ̂(t) might violate the fidelity constraints due to inaccurate estimations or degenerate cases in which the multi-fidelity constraints are attained with equality. In this section, we provide an efficient algorithm for the computation of the sub-gradient, which arises from a more in-depth study of the function F(ω, µ). To this end, we begin by presenting an intermediate characterization of the functions f_{i,j}(ω, µ) that define F(ω, µ).

Lemma 4.2. Consider µ ∈ Θ^{K×M} and ω ∈ Δ_{K×M}. Define, for k ∈ [K],

ψ*_k := argmin_{ψ∈R} Σ_{m=1}^{M} (ω_{k,m}/λ_m) [ d-(µ_{k,m}, ψ + ξ_m) + d+(µ_{k,m}, ψ - ξ_m) ] .

Then, the following holds:

f_{i,j}(ω, µ) = Σ_{k∈{i,j}} Σ_{m=1}^{M} (ω_{k,m}/λ_m) [ d-(µ_{k,m}, ψ*_k + ξ_m) + d+(µ_{k,m}, ψ*_k - ξ_m) ]   if ψ*_j > ψ*_i ,   (6)
f_{i,j}(ω, µ) = inf_{η∈R} Σ_{k∈{i,j}} Σ_{m=1}^{M} (ω_{k,m}/λ_m) [ d-(µ_{k,m}, η + ξ_m) + d+(µ_{k,m}, η - ξ_m) ]   otherwise.   (7)

We further introduce η*_{i,j} as the minimizer in the expression in (7).⁶ When µ ∈ M_MF, we can show that ψ*_k = µ_{k,M} for all k and, due to the multi-fidelity constraints, the expression in (6) is always equal to zero. Hence, in both cases f_{i,j}(ω, µ) is equal to the expression in (7), which can be rewritten as

f_{i,j}(ω, µ) = 1{µ_{i,M} ≥ µ_{j,M}} Σ_{k∈{i,j}} Σ_{m=1}^{M} (ω_{k,m}/λ_m) [ d-(µ_{k,m}, η*_{i,j} + ξ_m) + d+(µ_{k,m}, η*_{i,j} - ξ_m) ] .

This quantity can be interpreted as the transportation cost for making µ_{j,M} larger than µ_{i,M}. When µ_i ∉ MF or µ_j ∉ MF, if ψ*_j > ψ*_i, then f_{i,j}(ω, µ) is equal to the expression in (6), which can be interpreted as a transportation cost towards an alternative in which i and j satisfy the multi-fidelity constraints. Using this preliminary result, we provide a precise expression for the sub-gradient of F(ω, µ).

Theorem 4.3. Consider µ ∈ Θ^{K×M} and ω ∈ Δ_{K×M} such that F(ω, µ) > 0 holds. Let (i, a) ∈ [K]² be a pair of arms that attains the max-min value in Equation (2). Then a sub-gradient ∇F(ω, µ) of F(ω, µ) w.r.t. ω is given by one of the two following expressions: for j ∈ {a, i} and m ∈ [M],

∇F(ω, µ)_{j,m} = [ d+(µ_{j,m}, η*_{i,a} - ξ_m) + d-(µ_{j,m}, η*_{i,a} + ξ_m) ] / λ_m   if ψ*_i ≥ ψ*_a ,   (8)
∇F(ω, µ)_{j,m} = [ d+(µ_{j,m}, ψ*_j - ξ_m) + d-(µ_{j,m}, ψ*_j + ξ_m) ] / λ_m   otherwise.   (9)

That sub-gradient ∇F(ω, µ) is 0 in all the remaining KM - 2M dimensions.

Theorem 4.3 shows how to compute a sub-gradient of F(ω, µ) under the mild assumption that F(ω, µ) > 0.⁷ More specifically, it is sufficient to consider the pair (i, a) that attains the max-min value in Equation (2), and then test whether ψ*_i ≥ ψ*_a holds to choose which expression to use among Equations (8) and (9). An interesting interpretation of the sub-gradient expression is that, whenever ψ*_i ≥ ψ*_a, the sub-gradient points toward the direction of the space that aims at increasing the information to discriminate the eventual optimality of arm a against i. On the other hand, whenever ψ*_a > ψ*_i holds, the sub-gradient points towards the direction of minimizing errors in the multi-fidelity constraints for arm i and arm a (if any).

⁶To ease the notation, we omit (most of the time) the dependence of ψ*_k and η*_{i,j} on ω and µ.
⁷As we discuss in Remark C.8, F(ω̃(t), µ̂(t)) = 0 is a rare condition, and, whenever it happens, it is possible to alter Algorithm 1 slightly without affecting its theoretical guarantees.
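As an illustration of Lemma 4.2 for Gaussian arms, ψ*_k can be obtained by a generic one-dimensional convex minimization; the sketch below (with illustrative names and numbers, and a generic bounded scalar solver rather than the closed-form characterization given in Appendix C) keeps the structure visible. The sub-gradient coordinates of Theorem 4.3 are then read off by evaluating d+ and d- at ψ*_j or η*_{i,a} and dividing by λ_m.

```python
import numpy as np
from scipy.optimize import minimize_scalar

SIGMA2 = 1.0  # known Gaussian variance (assumption of this sketch)

def d(p, q):
    return (p - q) ** 2 / (2 * SIGMA2)

def d_plus(p, q):
    return d(p, q) if p <= q else 0.0

def d_minus(p, q):
    return d(p, q) if p >= q else 0.0

def transport_cost(psi, mu_k, omega_k, lam, xi):
    """Objective of Lemma 4.2 for one arm k: sum_m (w/lam)[d^-(mu, psi+xi) + d^+(mu, psi-xi)]."""
    return sum(w / l * (d_minus(m_, psi + x) + d_plus(m_, psi - x))
               for m_, w, l, x in zip(mu_k, omega_k, lam, xi))

def psi_star(mu_k, omega_k, lam, xi):
    """One-dimensional convex minimization defining psi*_k (generic bounded solver)."""
    lo, hi = np.min(mu_k - xi), np.max(mu_k + xi)
    res = minimize_scalar(transport_cost, bounds=(lo, hi), method="bounded",
                          args=(mu_k, omega_k, lam, xi))
    return res.x

# Tiny illustrative instance: one arm observed at two fidelities.
mu_k = np.array([0.65, 0.55])
xi = np.array([0.1, 0.0])
lam = np.array([0.5, 5.0])
omega_k = np.array([0.3, 0.2])
print(psi_star(mu_k, omega_k, lam, xi))  # close to 0.55: the MF constraints can be met exactly
```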
Computing the sub-gradient efficiently  To conclude, we notice that to compute a sub-gradient, it is required to compute ψ*_k for every arm k and η*_{i,j} for all pairs of arms such that ψ*_i ≥ ψ*_j. Using their definitions, this requires solving O(K²) one-dimensional optimization problems of functions that involve O(M) variables, which leads to a computational complexity of roughly O(K²Mn), where n is the number of iterations of the convex solver. In the following, we show that it is possible to exploit the structure of the f_{i,j}'s to obtain an algorithm whose total complexity is O(K²M²) and that does not suffer from any approximation error due to the optimization procedure. Specifically, we now present a result that shows how to compute η*_{i,j}. A similar result holds also for ψ*_k and is deferred to Appendix C.

Lemma 4.4. Consider µ ∈ Θ^{K×M} and ω ∈ Δ_{K×M} such that f_{i,j}(ω, µ) > 0. Suppose that ψ*_i ≥ ψ*_j holds. Then, there exists a unique minimizer η*_{i,j}(ω) of Equation (7), which is the unique solution of the following equation in η:

η = [ Σ_{a∈{i,j}} Σ_{m∈[M]} (ω_{a,m}/λ_m) ( k̄_{a,m}(η) (µ_{a,m} + ξ_m)/v(η - ξ_m) + k_{a,m}(η) (µ_{a,m} - ξ_m)/v(η + ξ_m) ) ] / [ Σ_{a∈{i,j}} Σ_{m∈[M]} (ω_{a,m}/λ_m) ( k̄_{a,m}(η)/v(η - ξ_m) + k_{a,m}(η)/v(η + ξ_m) ) ] ,   (10)

where k̄_{a,m}(x) = 1{x ≥ µ_{a,m} + ξ_m} and k_{a,m}(x) = 1{x ≤ µ_{a,m} - ξ_m}.

From Lemma 4.4, to compute η*_{i,j} it is sufficient to find the unique solution to the fixed-point equation given in (10). To do this efficiently, we observe that the right-hand side of Equation (10) depends on η only through the indicator functions k̄_{a,m}(η) and k_{a,m}(η), which can only take a finite number of values. Hence, it is sufficient to evaluate the right-hand side at an arbitrary point within a given interval where the values of the indicator functions do not change. If the resulting value is within the considered interval, then this value is our fixed point. Since there are at most O(M) candidate fixed points, this procedure takes at most O(M²) steps.

Computational complexity remark  It follows that the per-iteration computational complexity of Algorithm 1 is O(K²M²). The computationally efficient technique explained above indeed applies not only to the sampling rule but also to the stopping and the decision rules.⁸

⁸Given the definition of f_{i,j}, it is sufficient to replace ω_{a,m}/λ_m in Equation (10) with N_{a,m}(t).
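The candidate-interval search described above is easy to implement in the Gaussian case, where v(·) is constant and the right-hand side of Equation (10) is therefore piecewise constant between the breakpoints µ_{a,m} ± ξ_m. The sketch below is an illustration with an assumed API (the two arms are passed as 2 × M arrays); it enumerates the intervals and returns the candidate that falls inside its own interval.

```python
import numpy as np

SIGMA2 = 1.0  # Gaussian case: the variance function v(.) is the constant SIGMA2

def eta_star(mu_pair, omega_pair, lam, xi):
    """Candidate-interval search for the fixed point eta*_{i,j} of Equation (10)."""
    M = len(lam)
    terms = [(mu_pair[a, m], omega_pair[a, m] / lam[m], xi[m])
             for a in range(2) for m in range(M)]
    # Breakpoints where some indicator kbar or kunder changes value.
    breaks = sorted({mu + s * x for mu, _, x in terms for s in (-1.0, 1.0)})
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        mid = 0.5 * (lo + hi)          # indicators are constant on (lo, hi)
        num = den = 0.0
        for mu, w, x in terms:
            if mid >= mu + x:          # indicator kbar_{a,m}
                num += w * (mu + x) / SIGMA2
                den += w / SIGMA2
            if mid <= mu - x:          # indicator kunder_{a,m}
                num += w * (mu - x) / SIGMA2
                den += w / SIGMA2
        if den > 0 and lo <= num / den <= hi:
            return num / den           # candidate lies in its own interval: fixed point
    return None                         # degenerate case (e.g., f_{i,j} = 0)

# Illustrative pair of arms (2 x 2 arrays): arm i on the first row, arm j on the second.
mu_pair = np.array([[0.65, 0.60],
                    [0.45, 0.50]])
omega_pair = np.array([[0.2, 0.3],
                       [0.1, 0.4]])
print(eta_star(mu_pair, omega_pair, lam=np.array([0.5, 5.0]), xi=np.array([0.1, 0.0])))
```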
5 Numerical experiments

We conclude this work by presenting numerical simulations whose goal is to show the empirical benefits of our approach. We compare MF-GRAD against IISE [25] and the gradient approach of [22] that simply does BAI using samples collected at fidelity M. We will refer to this additional baseline as GRAD. In the following, we avoided the comparison with the multi-fidelity algorithms in [31] as we ran into issues when doing experiments. We elaborate more on this point in Appendix D.6, where we provide numerical evidence of the fact that those algorithms might fail to stop, together with an argument that shows a mistake in the proofs of [31]. Given this setup, first, we test all methods on a 4 × 5 multi-fidelity bandit with Gaussian arms that have been randomly generated, using a risk parameter δ = 0.01. Due to space constraints and for the sake of exposition, we refer the reader to Appendix D.1 for the values of µ, the ξ's and the λ's, and for details on the stopping rule calibration. We report the empirical distribution of the resulting cost complexities in Figure 1. As one can verify, MF-GRAD obtains the most competitive performance. Experiments on additional 4 × 5 bandits that are reported in Appendix D.3 provide a similar conclusion. Furthermore, to illustrate the sub-optimality of IISE and GRAD from an intuitive perspective, we test our algorithm on a simple 5 × 2 instance that allows to easily understand why existing methods underperform MF-GRAD.

Figure 1: Empirical cost complexity for 1000 runs with δ = 0.01 on the 4 × 5 multi-fidelity bandit.
Figure 2: Empirical cost complexity for 1000 runs with δ = 0.01 on the 5 × 2 multi-fidelity bandit.
Figure 3: Empirical cost proportions of MF-GRAD for 100000 iterations on the 5 × 2 bandit model. Results are averaged over 100 runs and shaded areas report 95% confidence intervals. Empirical cost proportions of a given arm are plotted with the same color. Cost proportions at fidelity 1 and 2 are visualized with a circle and a square, respectively.

Specifically, we consider µ_i = [0.4, 0.5] for all i ∈ [4], µ_5 = [0.5, 0.6], λ = [0.5, 5], ξ = [0.1, 0], and we report the cost complexity of the three algorithms in Figure 2. In this case, we can prove that the optimal allocation is sparse: it is supported on fidelity m = 1 for arms i ∈ [4], and on fidelity m = 2 for arm 5. Furthermore, thanks to the symmetry of the problem, it is possible to show that ω*_i = [0.09621, 0] for all i ∈ [4], and ω*_5 = [0, 0.61516] (see Appendix D.1). As one can see, IISE obtains the worst performance in this domain. The reason is that the concept of optimal fidelity on which IISE relies is sub-optimal (i.e., according to the design principle of IISE, the optimal fidelity is m = 2 for all arms), and the algorithm, in practice, will discard sub-optimal arms using samples that have been collected only at fidelity m = 2. Nevertheless, this will only happen after a first period in which IISE tries to exploit (unsuccessfully) data at fidelity m = 1. GRAD, on the other hand, obtains sub-optimal performance since, although most of the budget should be spent on fidelity 2 (as ω*_{5,2} = 0.61516), it never pulls the cheapest (and optimal) fidelity for arms i ∈ [4]. Finally, MF-GRAD obtains the most competitive performance since, as learning progresses, its empirical cost proportions eventually approach the ones prescribed by ω*. To verify this behavior, we removed the stopping rule from MF-GRAD and let the algorithm run for 10⁵ iterations. In Figure 3, we report the entire evolution of the cost proportions during learning. As one can appreciate, at the end of this process, the empirical cost proportions of MF-GRAD approach the ones described by ω*.⁹ Finally, we also refer the reader to Appendix D for additional results (e.g., additional domains, smaller regimes of δ) and further insights.

⁹Furthermore, at the end of this period, we measured the distance between ω* and the empirical cost proportions ω(10⁵); it holds that ∥ω* - ω(10⁵)∥₂ ≈ 0.031 ± 0.0006. The error has been estimated with 100 independent runs, and 0.0006 reports the 95% confidence interval.

6 Conclusions

For fixed-confidence best arm identification in multi-fidelity bandits, we presented a lower bound on the cost complexity and an algorithm with a matching upper bound in the regime of high confidence. The algorithm uses features of the lower bound optimization problem in order to compute its updates efficiently. Unlike prior work, it does not require any assumption or prior knowledge on the bandit instance. Our work also confirmed the existence, in most cases, of an optimal fidelity to explore each arm in the asymptotic regime, and revealed that the intuitive such notions proposed in prior work were inaccurate. Yet, our algorithm does not need to identify these optimal fidelities in order to be asymptotically optimal.
This raises the following question: could the performance of the algorithm be enhanced by exploiting the sparsity pattern? We conjecture that estimating the optimal fidelities accurately may actually be harder than identifying the best arm. However, leveraging some sufficient conditions for w a,m = 0 (such as the ones given in Proposition B.6) to eliminate some fidelities and reduce the support of the forced exploration component of the algorithm seems a promising idea. A limitation of our current analysis is that it only provides asymptotic guarantees in the high confidence regime, although our experiments reveal good performance for moderate values of δ. In future work, we will seek a better understanding of the moderate confidence regime [30]. To this end, we may leverage some proof techniques from other works using online optimization that obtain finite-time bounds [4, 5]. On the lower bound side, while C (µ) essentially scales with K due to the sparsity pattern, an interesting open question is whether there is a worse case O(KM) scaling in the moderate confidence regime, indicating that all fidelities do need to be explored at least a constant amount of times. Acknowledgments and Disclosure of Funding This work was done while Riccardo Poiani was visiting the Scool team of Inria Lille. He acknowledges the funding of a MOB-LIL-EX grant from the University of Lille. Rémy Degenne and Emilie Kaufmann acknowledge the funding of the French National Research Agency under the project FATE (ANR22-CE23-0016-01) and the PEPR IA FOUNDRY project (ANR-23-PEIA-0003). Alberto Maria Metelli and Marcello Restelli acknowledge the funding of the European Union Next Generation EU within the project NRPP M4C2, Investment 1.,3 DD. 341 - 15 march 2022 FAIR Future Artificial Intelligence Research Spoke 4 - PE00000013 - D53C22002380006. [1] J-Y. Audibert, S. Bubeck, and R. Munos. Best Arm Identification in Multi-armed Bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010. [2] Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, pages 1516 1541, 2013. [3] Rémy Degenne and Wouter M Koolen. Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems, 32, 2019. [4] Rémy Degenne, Wouter M Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. Advances in Neural Information Processing Systems, 32, 2019. [5] Rémy Degenne, Han Shao, and Wouter Koolen. Structure adaptive algorithms for stochastic bandits. In International Conference on Machine Learning, pages 2443 2452. PMLR, 2020. [6] E. Even-Dar, S. Mannor, and Y. Mansour. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7:1079 1105, 2006. [7] Côme Fiegel, Victor Gabillon, and Michal Valko. Adaptive multi-fidelity optimization with fast learning rates. In International Conference on Artificial Intelligence and Statistics, pages 3493 3502. PMLR, 2020. [8] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998 1027. PMLR, 2016. [9] Deng Huang, Theodore T Allen, William I Notz, and R Allen Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32:369 382, 2006. [10] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. 
lil UCB: an Optimal Exploration Algorithm for Multi-Armed Bandits. In Proceedings of the 27th Conference on Learning Theory, 2014. [11] Kevin G. Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In AISTATS, 2016. [12] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655 662, 2012. [13] Kellen Kanarios, Qining Zhang, and Lei Ying. Cost aware best arm identification. ar Xiv preprint ar Xiv:2402.16710, 2024. [14] Kirthevasan Kandasamy, Gautam Dasarathy, Junier Oliva, Jeff Schneider, and Barnabas Poczos. Multi-fidelity gaussian process bandit optimisation. Journal of Artificial Intelligence Research, 66:151 196, 2019. [15] Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnabás Póczos. Gaussian process bandit optimisation with multi-fidelity evaluations. Advances in neural information processing systems, 29, 2016. [16] Kirthevasan Kandasamy, Gautam Dasarathy, Barnabas Poczos, and Jeff Schneider. The multifidelity multi-armed bandit. Advances in neural information processing systems, 29, 2016. [17] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. Multi-fidelity bayesian optimisation with continuous approximations. In International conference on machine learning, pages 1799 1808. PMLR, 2017. [18] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1 42, 2016. [19] Emilie Kaufmann and Wouter M Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22(246):1 44, 2021. [20] Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2019. [21] Shibo Li, Wei Xing, Robert Kirby, and Shandian Zhe. Multi-fidelity bayesian optimization via deep neural networks. Advances in Neural Information Processing Systems, 33:8521 8531, 2020. [22] Pierre Ménard. Gradient ascent for active exploration in bandit problems. ar Xiv preprint ar Xiv:1905.08165, 2019. [23] Quan Nguyen, Arghavan Modiri, and Roman Garnett. Nonmyopic multifidelity acitve search. In International Conference on Machine Learning, pages 8109 8118. PMLR, 2021. [24] Victor Picheny, David Ginsbourger, Yann Richet, and Gregory Caplin. Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2 13, 2013. [25] Riccardo Poiani, Alberto Maria Metelli, and Marcello Restelli. Multi-fidelity best-arm identification. Advances in Neural Information Processing Systems, 35:17857 17870, 2022. [26] Matthias Poloczek, Jialei Wang, and Peter Frazier. Multi-information source optimization. Advances in neural information processing systems, 30, 2017. [27] Yoan Russac, Christina Katsimerou, Dennis Bohle, Olivier Cappé, Aurélien Garivier, and Wouter M Koolen. A/b/n testing with control in the presence of subpopulations. Advances in Neural Information Processing Systems, 34, 2021. [28] Rajat Sen, Kirthevasan Kandasamy, and Sanjay Shakkottai. Multi-fidelity black-box optimization with hierarchical partitions. In International conference on machine learning, pages 4538 4547. PMLR, 2018. [29] Rajat Sen, Kirthevasan Kandasamy, and Sanjay Shakkottai. Noisy blackbox optimization using multi-fidelity queries: A tree search approach. 
In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2096–2105. PMLR, 2019.
[30] Max Simchowitz, Kevin G. Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In International Conference On Learning Theory (COLT), 2017.
[31] Xuchuang Wang, Qingyun Wu, Wei Chen, and John Lui. Multi-fidelity multi-armed bandits revisited. Advances in Neural Information Processing Systems, 36, 2024.

A Table of Symbols

Table 1 reports a summary on the main symbols and the notation used throughout the paper.

Table 1: Notation

Symbol : Meaning
K, M : Number of arms and number of fidelities
δ ∈ (0, 1) : Maximum risk parameter
τ_δ : Stopping time of an algorithm
c_{τ_δ} : Cost incurred at the stopping time τ_δ
a⋆(µ) : argmax_{a∈[K]} µ_{a,M}; often denoted simply by ⋆
â_{τ_δ} : Arm recommended by the algorithm when it stops
µ : Bandit model
µ_{i,m} : Mean of the i-th arm at fidelity m within bandit model µ
ξ_m : Precision of fidelity m, i.e., max_{i∈[K]} |µ_{i,m} - µ_{i,M}| ≤ ξ_m
λ_m : Cost incurred for gathering samples at fidelity m
Θ : Set of possible means in the exponential family
µ_a ∈ MF : Arm a satisfies the multi-fidelity constraints
µ̂(t) : Empirical bandit model at time t
M_MF : Set of multi-fidelity bandit models
M⋆_MF : Set of multi-fidelity bandit models with a unique optimal arm
d(p, q) : KL divergence between two distributions with means p, q in the exponential family
d+(p, q), d-(p, q) : d+(p, q) = d(p, q) 1{p ≤ q}, d-(p, q) = d(p, q) 1{p ≥ q}
v(p) : Variance of the distribution in the exponential family with mean parameter p
C*(µ)^{-1} : Expression that characterizes the lower bound on the cost complexity
ω, π : Vectors of cost and pull proportions, respectively
ω* : ω* ∈ argmax_{ω∈Δ_{K×M}} F(ω, µ)
f_{i,j}(ω, µ) : Dissimilarity between arms i and j defined in Equation (1)
F(ω, µ) : max_{i∈[K]} min_{j≠i} f_{i,j}(ω, µ)
Δ*_{K×M}(µ) : Set of optimal oracle weights ω* for the multi-fidelity bandit model µ
M̃_MF : Subset of multi-fidelity bandit models for which there exists a non-sparse optimal allocation ω*
ω̄ : Uniform KM-dimensional vector ((KM)^{-1}, ..., (KM)^{-1})
G > 0 : Clipping constant in Algorithm 1
α_t, γ_t : Learning rate and forced exploration rate, respectively
C(t) : Vector whose (a, m)-th dimension is C_{a,m}(t)
ω(t) : Vector of empirical cost proportions, namely ω_{a,m}(t) = C_{a,m}(t) (Σ_{i∈[K]} Σ_{j∈[M]} C_{i,j}(t))^{-1}
ψ*_i : Minimizer of Equation (6)
η*_{i,j} : Minimizer of Equation (7)
k̄_{i,m}(η) : 1{η ≥ µ_{i,m} + ξ_m}
k_{i,m}(η) : 1{η ≤ µ_{i,m} - ξ_m}

B Cost complexity lower bound: proofs and derivations

B.1 Proof of Theorem 3.1

Theorem 3.1. Let δ ∈ (0, 1). For any δ-correct strategy and any multi-fidelity bandit model µ ∈ M⋆_MF, it holds that

E_µ[c_{τ_δ}] ≥ C*(µ) log(1/(2.4 δ)) ,   (3)

where C*(µ)^{-1} := sup_{ω∈Δ_{K×M}} F(ω, µ) = sup_{ω∈Δ_{K×M}} min_{a≠⋆} f_{⋆,a}(ω, µ).

Proof. Consider δ ∈ (0, 1), a multi-fidelity bandit model µ and an alternative instance θ ∈ Alt(µ), where Alt(µ) = ∪_{a≠⋆} {θ ∈ M_MF : θ_{a,M} > θ_{⋆,M}}. Then, by applying Lemma 1 in [18], we can directly connect the expected number of draws of each arm to the KL divergence of the two multi-fidelity bandit models. More specifically, we have that

Σ_{a∈[K]} Σ_{m∈[M]} E_µ[N_{a,m}(τ_δ)] d(µ_{a,m}, θ_{a,m}) ≥ kl(δ, 1 - δ).   (11)

Then, similarly to Theorem 1 in [8], we now proceed by applying Equation (11) with all the alternative models θ ∈ Alt(µ).
Specifically, we have that: kl(δ, 1 δ) inf θ Alt(µ) m [M] Eµ[Na,m(τδ)]d(µa,m, θa,m) = Eµ[cτδ] inf θ Alt(µ) Eµ[λm Na,m(τδ)] Eµ[cτδ] d(µa,m, θa,m) Eµ[cτδ] sup ω K M inf θ Alt(µ) m [M] ωa,m d(µa,m, θa,m) = Eµ[cτδ] sup ω K M min a = inf θ MMF: θa,M>θ ,M i,m ωi,m d(µi,m, θi,m) (a) = Eµ[cτδ] sup ω K M min a = inf θa MF,θ MF: θa,M>θ ,M i { ,a},m [M] ωi,m d(µi,m, θi,m) = Eµ[cτδ] sup ω K M min a = f ,a(ω, µ) = Eµ[cτδ]C (µ) 1, where in (a) we use that as µ MMF, the minimum in θ does not change any arm i / { , a}. Finally, we lower bound kl(δ, 1 δ) with log 1 2.4 δ , thus concluding the proof. B.2 Comparison with existing lower bound In this section, we provide a comparison with the existing lower bound for the MF-BAI setting. In the following, we restrict our attention to bandits with Gaussian arms with variance 1/2. We assume for simplicity of the exposition that = 1 and that µ1,M > µ2,M µK,M. Given this setup, we begin by recalling Theorem 1 in [25]. Theorem B.1 (Theorem 1 in [25]). Consider any multi-fidelity bandit model µ with Gaussian arms with variance 1/2. Then, for any δ-correct algorithm and δ 0.15 it holds that Eµ[cτδ]/ log (2.4δ) 1 is lower bounded by: min m [M]: µ1,m>µ2,M+ξm λm (µ1,m (µ2,M + ξm))2 + i=1 min m [M]: µ1,M ξm>µi,m λm (µi,m (µ1,M ξm))2 . At this point, focus, for simplicity on the following 2 2 bandit model (but a trivial generalization holds for K M bandits). We consider µ1,m = µ1,M + ξm 2 and µ2,m = µ2,M ξm 2 . Furthermore, suppose that µ1,M = µ2,M. Then, let := µ1,M µ2,M. Suppose that = ξm, which yields µ1,M = ξm 2 and µ2,M = ξm 2 , and that Since = ξm, this condition actually simplifies to λM 4λm. Under these conditions it is possible to verify that the lower bound of [25] is given by Eµ[cτδ] 2λm ξm At this point, consider, instead, the result that we presented in Theorem 3.1 and consider a generic weight proportion ω. From Corollary C.2, we know that: F(ω, µ) = f1,2(ω, µ) = inf η [µ2,M,µ1,M] m [M] ω1,m d (µ1,m, η + ξm) λm + ω2,m d+(µ2,m, η ξm) Then, let ω ,M be the optimal weights restricted on the portion of the simplex in which ω1,m = ω2,m = 0. Then, let η ,M be the optimal solution of Equation (12) when considering ω ,M. Using the symmetry of the KL divergence for Gaussian distributions, it holds that η ,M = 0 and ω ,M 1,2 = ω ,M 2,2 = 0.5. Then, for any ω it holds that: m [M] ω1,m d (µ1,m, η ,M + ξm) λm + ω2,m d+(µ2,m, η ,M ξm) = ω1,M d(µ1,M, η ,M) λM + ω2,M d(µ2,M, η ,M) λM F(ω ,M, µ), where in the second step, we have used the fact that d(µ2,m, η ,M ξm) = d(µ2,M ξm 2 , ξm) = 0 and d(µ1,m, η ,M + ξm) = d(µ1,M + ξm 2 , ξm) = 0. In other words, we have shown that in this example the optimal allocation is sparse and on fidelity M. To conclude, we have that: F(ω ,M, µ) = 0.5 d( ξm 2 , 0) λM + 0.5 d( ξm 2 , 0) λM = d( ξm 2 , 0) λM = ( /2)2 which leads to: Eµ[cτδ] 4λM 2 log (2.4δ) 1 . Under the assumptions on the problem, 4λM 2 is always larger than 8λm 2 . This result says that the ratio among the lower bounds can be of order λM/λm, which is arbitrarily large. B.3 Proof of Theorem 3.2 In this section, we provide a formal proof on the sparsity of the optimal oracle allocation ω . The proofs given in this section rely on results that are explained in Appendix C.1. At this point, in order to prove Theorem 3.2, we first introduce some intermediate results that will be used in the proving the theorem. Specifically, we begin by showing that, for each arm a, there always exists a fidelity m such that ω a,m > 0 holds. Lemma B.2. Consider µ MMF and ω K M(µ). 
Then, for all a [K], there exists m [M] such that ω a,m > 0. Proof. We split the proof into two cases. First we consider a = , and proceed by contradiction. Consider ω K M(µ), and suppose there exists a = such that ω ,m = 0 for all m [M]. In this case, however, we have that: F(ω , µ) f ,a(ω , µ) = inf θa MF,θ MF: θa,M θ ,M m=1 ω ,m d(µ ,m, θ ,m) λm = 0, (13) where, in the first step we have used the definition of F(ω , µ), in the second one the fact that ω a,m = 0 for all m [M], and in the last one we selected θ ,m = µ ,m for all m [M]. Nevertheless, from Lemma C.6, we know that, whenever ωi,M > 0 holds for all i [K], then F(ω, µ) > 0 holds as well. Therefore, ω / K M. The proof for the case in which i = follows identical reasoning. We then continue by proving that, at any optimal allocation ω , all the transportation costs fa(ω , µ) are equal. Lemma B.3. Consider µ MMF and ω K M(µ). Then, for all a, b such that a = and b = the following holds: f ,a(ω , µ) = f ,b(ω , µ). Proof. We introduce the following notation: a [K] \ { } : a argmin b = fb(ω , µ) B = ([K] \ { }) \ A. At this point, we proceed by contradiction. Suppose that B = . Then, for some sufficiently small ϵ > 0, we define ω K M in the following way. For all a A: ωa,M = ω a,M + ϵ/|A| ωa,m = ω a,m m < M. For all b B, instead: ωb,mb = ω b,mb ϵ/|B| ωb,m = ω b,m m = mb, where mb [M] is any fidelity such that ω b,mb > 0 (which exists by Lemma B.2). Given this definition of ω, it is easy to see that f ,a( ω, µ) > f ,a(ω , µ) for all a A. This is a direct consequence of the fact that f ,a( , µ) is a strictly increasing function of ωa,M, which is apparent from its expression from its expression for µ MMF given in Corollary C.2 and the computation of its gradient (Lemma C.3). Moreover, due to similar arguments, it also holds that f ,b( ω, µ) f ,b(ω , µ) for all b B. Using the continuity of the functions f, for ε small enough we further have f ,a( ω, µ) < f ,b( ω, µ) for all a A and b B. This leads to mina = f ,a(ω , µ) < mina = f ,a( ω, µ) which contradicts the optimality of ω . We now continue by providing necessary conditions that characterize some key properties of the oracle weights ω, which follows from the expression of the gradient of f ,a(ω, µ) with respect to ω for µ MMF (Lemma C.3). Lemma B.4. Consider µ M and ω K M(µ). Then, for all a = the following conditions holds: d+(µa,m1, η ,a(ω ) ξm1) λm1 = d+(µa,m2, η ,a(ω ) ξm2) λm2 m1, m2 : ω a,m1, ω a,m2 > 0 d (µ ,m1, η ,a(ω ) + ξm1) λm1 = d (µ ,m2, η ,a(ω ) + ξm2) λm2 m1, m2 : ω ,m1, ω ,m2 > 0 d+(µa,m1, η ,a(ω ) ξm1) λm1 = d (µ ,m2, η ,a(ω ) + ξm2) λm2 m1, m2 : ω a,m1, ω ,m2 > 0 Proof. We begin by recalling the definition of C (µ) 1: C (µ) 1 = sup ω K M min a = f ,a(ω, µ), which, is a concave optimization problem with a non-empty feasible region. Therefore, we can apply the KKT conditions to study the properties of each local optimal point ω for which the sub-derivatives exist, i.e., from Theorem 4.3 and Corollary C.2, the ones for which the following condition hold:10 min a = f ,a(ω) > 0. (14) 10We notice that a global optimum point clearly satisfies this condition. At this point, fix any arm a that attains the minimum in Equation (14). Then, from the KKT conditions, we obtain the following system of inequalities: ωa,m f ,a(ω, µ) + c ba,m = 0 m [M] ω ,m f ,a(ω, µ) = 0 m [M] bi,mω i,m = 0 i { , a}, m [M] bi,m 0 i [K], m [M] ω i,m 0 i [K], m [M] P i,m ω i,m = 1 At this point, suppose that ωi1,m1 > 0, ωi2,m2 > 0 for some i1, i2 { , a} and some m1, m2 [M], then bi1,m1 = 0, bi2,m2 = 0. 
As a consequence, by applying Lemma C.3 and the fact that µ MMF (i.e., see Corollary C.2), the following equations holds: d+(µa,m1, η ,a(ω ) ξm1) λm1 = d+(µa,m2, η ,a(ω ) ξm2) λm2 m1, m2 : ω a,m1, ω a,m2 > 0 d (µ ,m1, η ,a(ω ) + ξm1) λm1 = d (µ ,m2, η ,a(ω ) + ξm2) λm2 m1, m2 : ω ,m1, ω ,m2 > 0 d+(µa,m1, η ,a(ω ) ξm1) λm1 = d (µ ,m2, η ,a(ω ) + ξm2) λm2 m1, m2 : ω a,m1, ω ,m2 > 0 Finally, to conclude the proof, it is sufficient to iterate these arguments for all a = . Indeed, from Lemma B.3, we know that all sub-optimal arms will attain the minimum in Equation (14) at a global optimum ω . At this point we are ready to prove our main result. Theorem 3.2. Let K M(µ) := argmaxω K M F(ω, µ) and f MMF := µ M MF : i [K], m1, m2 [M]2, ω K M(µ) : ω i,m1 > 0, ω i,m2 > 0 . The set f MMF is a subset of RK M whose Lebesgue measure is zero. Proof. Let us introduce some additional notation. Consider a subset of arm-fidelity pairs X [K] [M], and define G(X) M as the subset of multi-fidelity bandit models µ for which there exists ω K M(µ) such that, for all (i, m) X, ω i,m > 0 holds. Then, fix an arm i = , and any three fidelity m1, m2, m3 [M], and consider µ G({(i, m1), (i, m2), ( , m3)}).11 Then, from Lemma B.4, we know that the following condition holds: d+(µi,m1, η ,i(ω ) ξm1) λm1 = d (µ ,m3, η ,i(ω ) + ξm3) λm3 . (15) This, in turn, implies that η ,i(ω ) is uniquely identified as a function of µi,m1, µ ,m3, ξm1, ξm3, λm1 and λm3. Indeed, d+(µi,m1, η ,i(ω ) ξm1) is a strictly increasing function of η ,i(ω ), while d (µ ,m3, η ,i(ω ) + ξm3) is a strictly decreasing function of η ,i(ω ). Let c1 = η ,i(ω ), and let c2 = d+(µi,m1,η ,i(ω ) ξm1) λm1 . At this point, since ω i,m2 > 0 holds by definition, we also know, from Lemma B.4, that the following condition has to be satisfied: d+(µi,m2, c1 ξm2) Therefore, the value of µi,m2 is uniquely identified as a function of µi,m1, µ ,m3,ξm1, ξm3, λm1 and λm3. That function is measurable (it s a combination of d+, d and their inverses), hence µ lies on the graph of a measurable function, and such a graph has Lebesgue measure 0. Therefore, 11Similar arguments hold also for i = . G({(i, m1), (i, m2), ( , m3)}) has measure 0. At this point, thanks to Lemma B.2, we know that, for arm , there always exists at least a fidelity m3 such that ω ,m3 > 0. We thus have that G((i, m1), (i, m2)) [ m3 [M] G((i, m1), (i, m2), ( , m3)) . Since G((i, m1), (i, m2)) is contained in a set which is a countable union of null measure sets, it has null measure. To conclude the proof, we notice that: m1,m2 [M]2 G({(i, m1), (i, m2)}) := Y. The proof follows from the fact that (i) Y is a countable union of set of null measure (and, consequently, has null measure), and (ii) f MMF Y. B.4 Additional results on the sparsity of the oracle weights In this section, we present additional results on the sparsity of the oracle weights. Specifically: (i) We identify a specific class of multi-fidelity bandit models in which the optimal allocation is sparse. In particular, within this class of MF-bandit models, the optimal allocation have non-zero values only at the cheapest fidelity. (ii) We then provide sufficient conditions to determine whether some fidelity have zero weights at any optimal weight vector ω We now proceed by constructing the class of multi-fidelity bandits that we mentioned in point (i) above. In this construction, we will consider Gaussian multi-fidelity bandits with variance 1 2. 
Then, for any number of arms K and fidelity M, we will denote with AKM, the set of Gaussian multifidelity bandits that satisfy the following construction. We start by building the means of the arms at the highest fidelity M. Specifically, we consider a generic µ ,m > 0, and let µa,M = µ ,m for all a = . Then, for each fidelity m < M, and any values of λm and ξm, we let µi,m = µi,M ξm for all i = , and µ ,m = µ ,M + ξm. Finally, to simplify some computations, we set σ2 of each Gaussian distribution to 1 Proposition B.5. For all µ AKM, and any ω K M, it holds that, for all a [K] and all m > 1, ω a,m = 0. Proof. To prove the result, starting from Corollary C.2, it is sufficient to notice that, for all µ AKM and all a = , f ,a(ω, µ) can be rewritten as: f ,a(ω, µ) = inf η [µa,M,µ ,m] m=1 ωi,m d(µi,M, η) Specifically, Equation (16) follows directly from the symmetric property of KL divergence for Gaussian distributions, and by the construction of µ. The proof then continue by contradiction. Suppose there exists ω such that there exists (i, m) (with m > 1) such that ω i,m > 0. By defining ω as the vector which is equal to ω except in the components (i, m) and (a, 1) for all a [K]. More specifically, for a sufficiently small ϵ > 0, we define ωi,m = ω i,m ϵ and ωa,1 = ω a,1 + ϵ/(K) for all a [K]. Then, it is easy to see that f ,a( ω, µ) > f ,a(ω , µ) holds for all a = , thus contradicting the optimality of ω . Finally, we now provide sufficient conditions to determine whether some fidelity have zero weights at any optimal weight vector ω Proposition B.6. Fix a = . Then, if µa,m + ξm µ ,m, then it holds that ω a,m = 0. Furthermore, if µ ,m ξm µj,M for all j = , then it holds that ω ,m = 0. Proof. Consider a = , and let us analyze f ,a(ω, µ) for any ω K M. More specifically, we recall from Corollary C.2, that the only term in which ωa,m plays a role is the following one: ωa,m d(µa,m, η ,a(ω) ξm) λm 1 η ,a(ω) µa,m + ξm . (17) Nevertheless, since η ,a(ω) µ ,m µa,m + ξm, we have that Equation (17) is always equal to 0 for all ω K M. To prove the result we now proceed by contradiction. Suppose that ω is such that ω a,m > 0. Then, consider ω as a vector which is equal to ω except in the components (a, m) and (i, M) for all i = . More specifically, for a sufficiently small ϵ > 0, we define ωa,m = ω a,m ϵ and ωi,M = ω i,M + ϵ/(K 1) for all i = . At this point, by noticing that f ,i(ω, µ) is strictly increasing in ωi,M (i.e., due to Theorem 4.3 and the fact that µ MMF), and since f ,a(ω, µ) is not affected by the value of ωa,m (i.e., Equation (17)), we have that f ,i( ω, µ) > f ,i(ω , µ) for all i = , thus contradicting the optimality of ω . To show that if µ ,m ξm µj,M for all j = , then it holds that ω ,m = 0, it is possible to follow identifical reasonings. The only difference is that the term ω ,m plays a role in each of the (K 1)-equations defining F(ω, µ), namely: ω ,m d(µ ,m, η ,a(ω) + ξm) λm 1 η ,a(ω) µ1,m ξm a = . (18) Nevertheless, Equation (18) is equal to 0 for all a = since η ,a(ω) µi,m ξm µa,M holds for all ω and all a = . The proof then follows by an identical construction of an alternative weight vector ω which increases the objective function. B.5 Sub-optimality of "optimal" fidelity of previous works In this section, we discuss how the concept of "optimal" fidelity of previous works (i.e., [25] and [31]) fails to satisfy the notion of optimal fidelity that arises from the tighter lower bound that we presented in Section 3. 
In this section, we consider as example 2 2 multi-fidelity bandit models with Gaussian distributions. To ease the notation, we will consider µ1,M > µ2,M. B.5.1 Case 1 We notice that [25] provided the two concepts of optimal fidelity. The first one is from their Theorem 1. This same concept was then considered later in [31]. A fidelity m is optimal for a certain arm a [K] if it satisfies the following condition: m a argmax m [M] µ1,M (µa,m + ξm) λm if a = 1 (19) m a argmax m [M] (µa,m ξm) µ2,M λm if a = 1 (20) Then, consider the following 2 2 example of multi-fidelity BAI problem. Let ξ1 = 0.1, µ1,M = 0.6, µ1,m = 0.65, µ2,M = 0.5, µ2,m = 0.45 (where we use the notation M = 2 for the maximal fidelity and m = 1). Suppose, furthermore, that all distributions are Gaussian. In this case, from Equation (19)-(20), we have that m 1 = 1 and m 2 = 1 whenever the following conditions are satisfied: µ1,M (µ2,m + ξm) λm > µ1,M µ2,M λM µ1,m ξm µ2,M λm > µ1,M µ2,M λM . Plugging in the numerical values, we obtain in both cases 0.05 λm > 0.1 λM , thus showing that, according to [31], the optimal fidelity for both arms is m = 1 whenever q At this point, consider the expression of F(ω, µ) = f1,2(ω, µ) in this particular example. Then, it is possible to show that, for any ω 2 2 such that ω1,M = ω2,M = 0, then, f1,2(ω, µ) = 0. Specifically, we have that F(ω, µ) is given by: inf η [µ2,M,µ1,M] ω1,m d(µ1,m, η + ξm)1{η µ1,m ξm} λm + ωa,m d(µ2,m, η ξm)1{η µa,m + ξm} In turn, this is equal to: inf η [0.5,0.6] ω1,m d(0.55, η)1{η 0.55} λm + ωa,m d(0.55, η)1{η 0.55} which is always 0 for η = 0.55. On the other hand, Lemma C.6, shows that any strategy that gives positive value to weights at fidelity M = 2 obtains F(ω, µ) > 0. B.5.2 Case 2 Furthermore, [25] provided also the following concept of optimal fidelity which only holds for sub-optimal arms (see Definition 1 in [25]). A fidelity m such that µ1,M µ2,M > 4ξm holds is said to be optimal for arm a = 1 if the following holds: λm (µ1,M µa,M 4ξm)2 min m>m λ m (µ1,M µa,M 4ξm)2 . (21) At this point, consider the following classes of multi-fidelity bandit models: µ2,m = µ2,M ξm, µ1,m = µ1,M +ξm, µ1,M µ2,M 4ξm. In this case, from Equation (20) it follows that the optimal fidelity for arm 2 is always M. Nevertheless, since µ2,m = µ2,M ξm, µ1,m = µ1,M + ξm, we know from Proposition B.5 ω1,M = ω2,M = 0. C Algorithm analysis C.1 Gradient computation We start by analyzing a salient feature of fi,j(ω, µ) that holds for any µ ΘKM. Lemma C.1. Consider µ ΘKM. Fix any ω K M and i, j [K]. Let θ be the solution of the following optimization problem: θ argmin θi MF,θj MF: θj,M θi,M m [M] ωj,m d(µj,m, θj,m) m [M] ωi,m d(µi,m, θi,m) Furthermore, define for k {i, j}: M k(ω, µ, θ ) := m [M 1] : θ k,M > µk,m + ξm M k(ω, µ, θ ) := m [M 1] : θ k,M < µk,m ξm . Then, for k {i, j} we have that θ k,m = µk,m m [M] \ M k(ω, µ, θ ) M k(ω, µ, θ ) (22) θ k,m = θ k,M ξm m M k(ω, µ, θ ) (23) θ k,m = θ k,M + ξm m M k(ω, µ, θ ) (24) In particular, fi,j(w, µ) = X m [M] ωk,m d (µk,m, θ k,M + ξm) + d+(µk,m, θ k,M ξm) λm = min θj,M θi,M m [M] ωk,m d (µk,m, θk,M + ξm) + d+(µk,m, θk,M ξm) Proof. We begin by proving Equation (22). To this end, it is sufficient to notice that, given a fixed θ k,M, it is possible to set θ k,m := µk,m, whenever the following condition is satisfied: |θ k,M µk,m| ξm. (25) The condition is Equation (25) is equivalent to requiring m [M] \ Mk(ω, µ, θ ) Mk(ω, µ, θ ) , which concludes the first part of the proof. We continue by proving Equation (23). 
Consider m M k(ω, µ, θ ), that is θ k,m > µk,m + ξm. From this condition, it directly follows that θ k,m > µk,m; therefore, since d(µk,m, x) is increasing in x, it follows that, in order to attain the argmin, we need to pick the smallest value of θk,m that satisfies the multi-fidelity constraint |θ k,M θk,m|, that is θ k,M ξm. The proof of Equation (23) follows is almost identical to the one of Equation (24); it is sufficient to replace the definition of M k(ω, µ, θ ) with M k(ω, µ, θ ). At this point, we continue by analyzing in more detail the function fi,j(ω, µ). Lemma 4.2. Consider µ ΘKM and ω K M. Define for k [K], ψ k := argmin ψ R m=1 ωk,m d (µk,m, ψ + ξm) + d+(µk,m, ψ ξm) Then, the following holds: fi,j(ω, µ) = X m=1 ωk,m d (µk,m, ψ k + ξm) + d+(µk,m, ψ k ξm) λm if ψ j > ψ i (6) fi,j(ω, µ) = inf η R m=1 ωk,m d (µk,m, η + ξm) + d+(µk,m, η ξm) λm otherwise. (7) Proof. The proof follows by analyzing the definition of fi,j(ω, µ). Consider θ i , θ j that attaines the minimum in Equation (1). Then, there are two possibilities: either θ i,M = θ j,M or θ j,M > θ i,M. Suppose that θ j,M > θ i,M, then we notice that the optimization problem in fi,j(ω, µ) is a 2D-convex optimization problem in the variables θj,M, θi,M (thanks to Lemma C.1). Therefore, since the minimum of the constrained problem is such that θj,M > θi,M, than, by the convexity of the problem, this is also a minimum for the unconstrained problem, thus leading to: fi,j(ω, µ) = inf θi MF,θj MF m [M] ωk,m d(µk,m, θk,m) At this point, we notice that the constraints in the previous optimization problem are only intra-arm. Therefore, we can rewrite fi,j(ω, µ) as: fi,j(ω, µ) = X k {i,j} inf θk MF m [M] ωk,m d(µk,m, θk,m) Furthermore, applying the same reasoning as in the proof of Lemma C.1, we can further rewrite fi,j(ω, µ) as follows: fi,j(ω, µ) = X k {i,j} inf ψ R m [M] ωk,m d+(µk,m, ψ ξm) + d (µk,m, ψ + ξm) m [M] ωk,m d (µk,m, ψ k + ξm) + d+(µk,m, ψ k ξm) λm . At this point, we notice that due to Lemma C.1 we know that θ j,M = ψ j and θ i,M = ψ i , thus concluding the first part of the proof. Consider now the case in which θ j,M = θ i,M holds. Then, applying Lemma C.1, and using θ i,M = θ j,M, we can rewrite fi,j(ω, µ) as follows: fi,j(ω, µ) = inf η R d+(µk,m, η ξm) + d (µk,m, η + ξm) , thus concluding the proof. Given this result, we recall that the definitions of ψ i = argmin ψ R m [M] ωi,m d (µi,m, ψ + ξm) + d+(µi,m, ψ ξm) η i,j = argmin η R m [M] ωk,m d (µk,m, η + ξm) + d+(µk,m, η ξm) Corollary C.2. Consider µ MMF, and a [K] such that a = . Then it holds that: f ,a(ω, µ) = inf η [µa,M,µ ,M] m=1 ωi,m d (µi,m, η + ξm) + d+(µi,m, η ξm) = inf η [µa,M,µ ,M] m=1 ω ,m d (µ ,m, η + ξm) λm + ωa,m d+(µa,m, η ξm) Proof. At this point, we notice that whenever µ MMF it holds that f ,a can always be expressed as Equation (7). This is direct by the condition on ψ s in Lemma 4.2. Furthermore, it also holds at η ,a that d (µa,m, η ,a + ξm) = 0, and d+(µ ,m, η ,a ξm) = 0. This is a consequence of the fact that η ,a [µa,M, µ ,M] for all weights ω. Indeed, η ,a [µa,M, µ ,M] holds due to monotonicity property of the KL divergence. We now analyze in more detail Equations (26) and (27). In particular, we begin by focusing on Equation (26). Taking the gradient in Equation (27) w.r.t. 
the optimization variable η, and setting it equal to 0, we obtain that any optimal point η i,j(ω) needs to satisfy the following equation: ka,m 1 v(η ξm) + ka,m 1 v(η + ξm) ka,m µa,m + ξm v(η ξm) + ka,m µa,m ξm where we recall that ka,m(η) and ka,m(η) are given by: ka,m(η) = 1 {η µa,m + ξm} ka,m(η) = 1 {η µa,m ξm} . Given this intermediate result, we now investigate in more depth the solution of Equation (28). Lemma 4.4. Consider µ ΘKM and ω K M such that fi,j(ω, µ) > 0 . Suppose that ψ i, ψ j holds. Then, there exists a unique minimizer η i,j(ω) of Equation (7) which is the unique solution of the following equation of η: ka,m(η) µa,m+ξm v(η ξm) + ka,m(η) µa,m ξm ka,m(η) 1 v(η ξm) + ka,m(η) 1 v(η+ξm) , (10) where ka,m(x) = 1{x µa,m + ξm} and ka,m(x) = 1{x µi,m ξm}. Proof. Let us analyze: fi,j(ω, µ) = inf η R m M ωa,m d (µa,m, η + ξm) λm + ωa,m d+(µa,m, η ξm) := inf η R gi,j(ω, µ, η). At this point, we proceed by contradiction. Suppose that there exists x1, x2 argminη R gi,j(ω, µ, η) such that x1 = x2. From the convexity of gi,j(ω, µ, η) w.r.t. η, we know that any x [x1, x2] belongs to the argmin set as well. Furthermore, for all x [x1, x2], since fi,j(ω, µ) > 0, at least one of the following condition is satisfied: λM d(µa,M, x) > 0 holds for some a {i, j} λm d(µa,M, x + ξm)ka,m(x) > 0 holds for some a {i, j} and some fidelity m < M λm d(µa,M, x ξm)ka,m(x) > 0 holds for some a {i, j} and some fidelity m < M Therefore, from Equation (28), we obtain that all x [x1, x2] are fixed points of the following Equation: a {i,j} PM m=1 ωa,m ka,m(x) µa,m+ξm v(x ξm) + ka,m(x) µa,m ξm a {i,j} PM m=1 ωa,m ka,m(x) 1 v(x ξm) + ka,m(x) 1 v(x+ξm) , (29) At this point, we notice that for any couple of different x1, x2 that satisfies Equation (29), there exists at least one arm a {i, j} and one fidelity m < M such that at least one of the following two conditions hold: ka,m( x1) = ka,m( x2) ka,m( x2) = ka,m( x2) This however, is possible only for a finite number of points, while the interval [x1, x2] contains infinitely many optimal points. Therefore, there exists a unique solution η i,j(ω) argminη R gi,j(ω, µ, η), and, furthermore, it is a solution of Equation (29), thus concluding the proof. Given this result, we continue by providing a result on how to compute the derivative of fi,j(ω, µ) whenever fi,j(ω, µ) is given by Equation (7). Lemma C.3. Consider µ ΘKM and ω K M such that fi,j(ω, µ) > 0. Furthermore, suppose that fi,j(ω, µ) is given by Equation (7). Then, for all a {i, j} and all m [M]: ωa,m = d+(µa,m, η i,j + ξm) + d (µi,m, η i,j ξm) λm . Proof. First of all, we notice that, since fi,j(ω, µ) > 0 holds, and since fi,j(ω, µ) is expressed as in Equation (7), then, thanks to Lemma 4.4, we know that η i,j(ω) is the unique optimum of the Equation (7). In the rest of this proof, we will explicit the relationship between fi,j and η i,j(ω) by writing fi,j(ω, µ, η i,j(ω)). At this point, fix a {i, j} and m [M]. Then, it is easy to verify from Equation (10) that both the right and left derivative of η i,j(ω) w.r.t. ωa,m exists. Suppose for a moment that they are equal, then we have that η i,j(ω) ωa,m exists and it continuous. Therefore, we obtain ωa,m fi,j(ω, µ, η i,j(ω)) = fi,j ωa,m (ω, µ, η i,j(ω)) + fi,j η i,j(ω)(ω, µ, η i,j(ω)) η i,j(ω) ωa,m ωa,m (ω, µ, η i,j(ω)) = d+(µa,m, η i,j + ξm) + d (µa,m, η i,j ξm) λm , where in the second step we have used that fi,j η i,j(ω)(ω, µ, η i,j(ω)) = 0 since η i,j(ω) is a minimizer of Equation (7). 
Similarly, whenever, the right and the left derivatives of η i,j(ω) are different12, we can follow similar arguments, but analyzing left and right derivatives, and we will obtain an identical result. Indeed, this does not introduce discontinuity issue in the derivatives of fi,j thanks to the fact that η i,j(ω) is a minimizer of Equation (7). At this point, it remains to analyze in more detail the case in which we have that fi,j(ω) is expressed as in Equation (6). Lemma C.4. Consider µ ΘKM and ω K M such that fi,j(ω, µ) > 0 holds. Furthermore, suppose that ψ j > ψ i . Then, for each a {i, j}, there exists a unique minimizer ψ a of Equation (6) which is the unique solution of the following equation of ψ: PM m=1 ωa,m ka,m(ψ) µa,m+ξm v(ψ ξm) + ka,m(ψ) µa,m ξm PM m=1 ωa,m ka,m(ψ) 1 v(ψ ξm) + ka,m(ψ) 1 v(ψ+ξm) . (30) Proof. The proof follows by noticing that, for each a {i, j}, the optimization problem in Equation (6) is an unconstrained convex optimization problem in ψ. Taking the derivative and setting it equal to 0 yields the desired result. At this point, we proceed by showing how to compute the partial derivatives of fi,j(ω) whenever it is expressed as in Equation (6). Lemma C.5. Consider µ ΘKM and ω K M such that fi,j(ω, µ) > 0. Furthermore, suppose that ψ j > ψ i . Then, for all m [M] it holds that: ωa,m = d+(µa,m, ψ a + ξm) + d (µa,m, ψ a ξm) λm . Proof. The proof is a straightforward adaptation of the proof of Lemma C.3. Finally, we are now ready to prove our result on the sub-gradient of F(ω, µ). Theorem 4.3. Consider µ ΘKM and ω K M such that F(ω, µ) > 0 holds. Let (i, a) [K]2 be a pair of arms that attains the max-min value in Equation (2). Then a sub-gradient F(ω, µ) of F(ω, µ) w.r.t. to ω is given by one of the two following expressions: for j {a, i} and m [M] , F(ω, µ)j,m = d+(µj,m, η i,a ξm) + d (µj,m, η i,a + ξm) λm if ψ i ψ a , (8) F(ω, µ)j,m = d+(µj,m, ψ j ξm) + d (µj,m, ψ j + ξm) λm otherwise. (9) That sub-gradient F(ω, µ) is 0 in all the remaining KM 2M dimensions. Proof. The proof follows by the definition of F(ω, µ) together with Lemma 4.2, Lemma C.3, and Lemma C.5. 12This can happen, for instance, whenever η i,j(ω) = µa,m ξm. We now show a sufficient condition for F(ω, µ) > 0 to hold when µ ΘKM. Lemma C.6. Consider µ ΘKM such that there exists for which µ ,M > maxa = µa,M. Furthermore, consider ω K M such that ωi,M > 0 holds for all i [K]. Then, we have that F(ω, µ) > 0. Proof. From the definition of F, and the definition of , we have that: F(ω, µ) min a = inf θa MF, θ MF θa,M θ ,M m [M] ωj,m d(µj,m, θj,m) min a = inf y x ω ,M d(µ ,M, x) λM + ωa,M d(µa,M, y) = min a = inf η [µa,M,µ ,M] ω ,M d(µ ,M, η) λM + ωa,M d(µa,M, η) where in the last step we have used the fact that µ ,M > µi,M for all i = , together with ωi,M > 0 for all i [K]. Furthermore, we show that the sequence of weights generated by Algorithm 1 satisfy ωi,M > 0 for all i [K] Lemma C.7. The sequence of weights { ω(t)}t satisfy ωi,M(t) > 0 for all i [K] and for all t. Proof. We begin by recalling the definition of ω(t): ω(t + 1) argmax ω K M αt+1 s=KM ω Clips ( F( ω(s), ˆµ(s)) kl(ω, ω) From this definition, thanks to the property of kl, we have that ωa,m(t) > 0 for all a [K], and all m [M]. Remark C.8. It follows by combining Lemma C.6 and Lemma C.7, that F(ω(t), ˆµ(t)) = 0 might happen only when there are multiple best arms at fidelity M. 
Whenever this condition is encountered, it is possible to project the bandit model ˆµ(t) to have a unique optimal arm (e.g., by adding a small ϵ > 0 to one of the optimal arms). When looking at the proof of Theorem 4.1, we can see that this does not impact its theoretical guarantees as (i) on the good event ET this does not happen, and, Lemma C.17 holds unchanged. C.2 Smoothness of F(ω, µ) Lemma C.9. For any set S ΘKM and any subset of arms A [K], the function (ω, µ) 7 infθ S P a A,m [M] ωa,m d(µa,m,θa,m) λm is jointly continuous on K M ΘKM . Proof. We apply (a trivial generalization of) Lemma 27 of [3]. The lemma in that paper is stated for a set of alternative models, but the proof actually works for any set S. Likewise, it is stated for the case of λm = 1, but since it works for an arbitrary Bregman divergence d it applies to a rescaled version as well. To deal with the restriction to a subset of arms A instead of all arms, we can view the function as a function of (ωA [M], µA), where we restrict the vectors to the arms in A, and continuity of the original function is equivalent to continuity of the restricted version. Lemma C.10. The function (ω, µ) 7 F(ω, µ) is jointly continuous on K M ΘKM . Proof. By definition, F(ω, µ) = maxi mina =i fi,a(ω, µ) with fi,a(ω, µ) = infθ Si,a P a Ai,a,m [M] ωa,m d(µa,m,θa,m) λm for Si,a = {θ MMF | θa,M θi,M} and Ai,a = {i, a}. Since a minimum of finitely many continuous functions is continuous and likewise for the maximum, it suffices to show that each fi,a is jointly continuous. This is true by Lemma C.9. Corollary C.11. Let C ΘKM be a compact set. Then F is uniformly continuous on K M C. Proof. The set K M C is compact and F is continuous, hence it is uniformly continuous on that set. Lemma C.12. Let µ ΘKM. For all ε > 0, there exists κϵ > 0 such that for all ω K M and all µ ΘKM , µ µ κϵ = F(ω, µ ) F(ω, µ) ϵ. Proof. Take any compact ball B(µ, κ) for the norm with κ > 0 centered at µ. Then F is uniformly continuous on K M B(µ, κ) by Corollary C.11. This means that for any ε > 0, there exists κ ε > 0 such that for all (ω , µ ) K M B(µ, κ), ω ω κ ε µ µ κ ε = |F(ω , µ ) F(ω, µ)| ε . We can take κε = min{κ, κ ε} to remove the condition µ B(µ, κ). The result of the Lemma is this for the special case ω = ω. C.3 Correctness In the following, we propose an analysis on the correctness which is based on the concentration results provided in [22]. We notice that these results are based on Gaussian distributions. Nevertheless, at the cost of a more involved notation, it is possible to extend all the results of this work for canonical exponential families using, e.g., Theorem 7 in [19]. At this point, let us consider the following value of βt,δ: βt,δ = log K + 2M log 4 log K + 1 + 12M log (log(t) + 3) + 2M C, (31) where C is a universal constant (see Proposition 1 in [22]). Then, we can show the following result. Proposition C.13. Let δ > 0, then it holds that Pµ(ˆaτδ = ) δ. Proof. With probabilistic arguments we have that: Pµ(ˆaτδ = ) Pµ t KM, i = , min j =i fi,j(C(t), ˆµ(t)) βt,δ t KM, min j =i fi,j(C(t), ˆµ(t)) βt,δ i = Pµ ( t KM, fi, (C(t), ˆµ(t)) βt,δ) m [M] Nk,md(ˆµa,m(t), µa,m) βt,δ where in fourth step we have used the definition of fi, , and in the last one Proposition 1 in [22] together with a union bound on K. C.4 Auxiliary lemmas This section contains auxiliary lemmas that will be used in the analysis of Algorithm 1. Lemma C.14. For all a [K], m [M], and for all t 1, it holds that Na,m(t) t 4KM ln(KM). 
Furthermore, it holds that: s=0 π(s) N(t) 2 ln(KM) Proof. This lemma is a simple combination of Lemma 3 in [22] with the tracking result of [5]. Using algebraic manipulations, we have that: s=1 π a,m(s) Na,m(t) s=1 π a,m(s) s=1 π a,m(s) ln(KM) γs KM ln(KM) 1 4 s ln(KM) t 4KM ln(KM), where, in the second step we have used Theorem 6 in [5], together with the fact that ln(KM) ln(4) 1. For the second part of the proof, we have that: s=1 πa,m(s) Na,m(t) s=1 π a,m(s) Na,m(t) + 2 Lemma C.15. Consider ϵ > 0 and B R such that C (µ) 1 B ϵ > 0. Then, there exists a constant Cϵ such that, for a,m Ca,m(T) max λMCϵ, log K δ + 2M log 4 log K C (µ) 1 B ϵ := C0(ϵ, δ), (33) it holds that: C (µ) 1 B βT,δ P a,m Ca,m(T). (34) Proof. Let Cϵ be a constant that depends on ϵ such that, for T Cϵ it holds that: 12M log(log(T) + 3) + 2M C Then, for P a,m Ca,m(T) C0(ϵ, δ), we have that: a,m Ca,m(T) = log K δ + 2M log 4 log K δ + 1 + 12M log(log(T) + 3) + 2M C P a,m Ca,m(T) δ + 2M log 4 log K a,m Ca,m(T) + ϵ which concludes the proof. C.5 Proof of Theorem 4.1 Before diving into the proof of Theorem 4.1, we introduce some additional notation. We denote with B (x, κ) the ball of radius κ centered at x. Then, for all T, and ϵ > 0, we introduce the following event: t h(T ) {ˆµ(t) B (µ, κϵ)} , where h(T) T 1/4. At this point, we present our result. Furthermore, we denote with ω(t) the vector of empirical cost proportions, namely, for all (a, m), ωa,m(t) = Ca,m(t) P j [M] Ci,j(t). First of all, we introduce an initial result that controls the expectation of the stopping cost. Lemma C.16. Consider B such that C (µ) 1 B ϵ > 0, and suppose that there exists a constant Tϵ such that, for all T Tϵ, it holds that F(ω(T), ˆµ(T)) F(ω (µ)) B on the good event Eϵ(T). Then, it holds that: Eµ[cτδ] λMTϵ + C0(ϵ, δ) + 1 + t=0 Pµ(E(T)c). (35) Proof. Using probabilistic arguments, we have that: Eµ[cτδ] = Z + 0 Pµ(cτδ > x)dx λMTϵ + C0(ϵ, δ) + Z λMTϵ+C0(ϵ,δ) Pµ(cτδ > x, cτδ C0(ϵ, δ))dx λMTϵ + C0(ϵ, δ) + Z λMTϵ+C0(ϵ,δ) Pµ τδ > x λM , cτδ C0(ϵ, δ) dx λMTϵ + C0(ϵ, δ) + 1 + T = Tϵ+ C0(ϵ,δ) λM Pµ(τδ > T, cτδ C0(ϵ, δ)) λMTϵ + C0(ϵ, δ) + 1 + T =0 Pµ(E(T)c), where (i) in the second inequality, we have upper bounded cτδ λMτδ, (ii) in the third inequality, we have used the fact that τδ is an integer variable, and (iii) in the last inequality we have used that for all T Tϵ such that P a,m Ca,m(T) C0(ϵ, δ), then we have that E(T) {τδ < T}. Indeed, combining Lemma C.15 with the definition Tϵ, we obtain that for all T Tϵ such that P a,m Ca,m(T) C0(ϵ, δ), then we have that E(T) {τδ < T}. This last step is direct by noticing that F(ω(T), ˆµ(T)) βT,δ P a,m Ca,m(T ) implies stopping. Lemma C.16 shows how to upper-bound the expected cost complexity. Notice that this result requires different arguments w.r.t. the usual ones that appears while controlling the expected sample complexity (see, e.g., [8]). Then, we report a basic property of the sub-gradient ascent routing that is employed in our algorithm. Before doing that, we recall that, on the good-event ET , it holds that there exists a constant L, that depends on µ, such that the empirical sub-gradients are uniformly-bounded by L. Lemma C.17. Let c(t) = P a,m λm πa,m(s). Define Cr := log(KM) + KMG + 4(LλM)2 + 2G2, and consider the sequence of weights { ω(t)}t generated by Algorithm 1. Then, on the good event Eϵ(T) it holds that: t=h(t) c(t) F(ω , ˆµ(t)) (ω ω(t)) Cr Proof. The proof is identical to the one of Proposition 3 in [22]. 
The only difference is that, in our case, we need to multiply the scale of the sub-gradient L by λM, to the additional presence of c(t) in the sequence of gains that we use in our sub-gradient ascent algorithm. Finally, we show that there exists an additional problem dependent constant Cµ that will be useful in performing some upper-bound reasoning in the proof of the final result. Lemma C.18. Consider the following quantity: mina = infθa,θ MF PT s=1 P m πi,m(s)d(µi,m, θi,m) P a,m Ca,m(T) . There exists a problem dependent constant Cµ such that the previous equation can be upper bounded by: F(ω(T), µ) + 4 ln(KM)MλMCµ Proof. Let us begin by analyzing F(ω(T), µ). Fix a such that F(ω(T), µ) = f , a(ω(T), µ). Moreover, consider θ , θ a argmin P m ωa,m(T) d(µa,m,θa,m) λm . Then, consider the following difference: H := mina = infθa,θ MF PT s=1 P m πi,m(s)d(µi,m, θi,m) P a,m Ca,m(T) F(ω(T), µ). Then, the previous Equation can be upper bounded by: m PT s=1 πi,m(s) Ni,m(T) d(µi,m, θ i,m) P a,m Ca,m(T) m d(µi,m, θ i,m) P a,m Ca,m(T) a,m Ca,m(T) , where in the first step, we have used the definition of θ a, θ and the definition of ω(T), in the second one, we have used Lemma C.14, and in the last one the facts that, thanks to definition θ a, θ , there exists some problem dependent constant Cµ such that d(µi,m, θ i,m) is bounded. Theorem 4.1. For any multi-fidelity bandit model µ MMF, Algorithm 1 using the threshold βt,δ given in (31) is δ-correct and satisfies lim sup δ 0 Eµ[cτδ] log(1/δ) C (µ). (5) Proof. The proof of the δ-correctness is from Proposition C.13. To prove the optimality, we first proceed by upper bounding the following quantity on the good event Eϵ(T): F(ω , µ) F(ω(T), ˆµ(T)). (36) Define, for brevity, e T := T h(T) + 1 and c(s) := P a,m λm πa,m(s). Then, we start by analyzing F(ω , µ). On Eϵ(T) we have that: a,m Ca,m(T) P a,m Ca,m(T)F(ω , µ) PT s=1 c(s) P a,m Ca,m(T)F(ω , µ) + PT s=1 c(s) P a,m Ca,m(T)F(ω , µ) PT s=1 c(s) P a,m Ca,m(T)F(ω , µ) + F(ω , µ) P a,m Ca,m(T)2 ln(KM) PT s=1 c(s) P a,m Ca,m(T)F(ω , µ) + 2 ln(KM)F(ω , µ) PT s=h(T ) c(s) P a,m Ca,m(T)F(ω , µ) + h(T) λmin T F(ω , µ) + 2 ln(KM)F(ω , µ) PT s=h(T ) c(s)F(ω , ˆµ(t))) a,m Ca,m(T) + λM e Tϵ λmin T + h(T) λmin T F(ω , µ) + 2 ln(KM)F(ω , µ) where in the first inequality we have used Lemma C.14, while in the last step we have used Lemma C.12 together with the event Eϵ(T). At this point, we focus our analysis on PT s=h(T ) c(s)F (ω ,ˆµ(t)) P a,m Ca,m(T ) . Define, for brevity, gs = c(s) F( ω(s), ˆµ(s)); then, we have that: PT s=h(T ) c(s)F(ω , ˆµ(s))) a,m Ca,m(T) = PT s=h(T ) c(s) (F(ω , ˆµ)) F( ω(s), ˆµ(s)))) a,m Ca,m(T) PT s=h(T ) c(s)F( ω(s), ˆµ(s))) a,m Ca,m(T) + PT s=h(T ) gs (ω ω(s)) a,m Ca,m(T) PT s=h(T ) c(s)F( ω(s), ˆµ(s))) a,m Ca,m(T) + Cr λmin PT s=h(T ) c(s)F( ω(s), µ) a,m Ca,m(T) + Cr λmin T + λM e Tϵ where in the first inequality we have used the concavity of F, in the second one we have used Lemma C.17, and in the last one Lemma C.12 and the definition of Eϵ(T). Finally, we have that: PT s=1 c(s)F( ω(s), µ) P a,m Ca,m(T) = PT s=1 mina = infθa,θ MF P m πi,m(s)d(µi,m, θi,m) P a,m Ca,m(T) mina = infθa,θ MF PT s=1 P m πi,m(s)d(µi,m, θi,m) P a,m Ca,m(T) F(ω(T), µ) + 4 ln(KM)MCµ F(ω(T), ˆµ(T)) + 4 ln(KM)MCµ where (i) in the first equality we used the definition of c(s), ω(s) and F, in the second one we used Lemma C.18, and in the last one Lemma C.12. 
Given this analysis, let us define: BT := 2λM e Tϵ λmin T + h(T) λmin T F(ω , µ) + 2 ln(KM)F(ω , µ) T + 4 ln(KM)MCµ Consider T such that: ( 2 ln(KM)F(ω , µ) 2 , 4 ln(KM)MCµ Then, it holds that: λmin + 5ϵ = Bϵ, and, consequently, we have that F(ω(T), ˆµ(T)) F(ω , µ) Bϵ. Combining this result with Lemma C.15 and C.16, we obtain: E[cτδ] λMTϵ + C0(δ, ϵ) + 1 + t=0 Pµ(Eϵ(t)c). By Lemma C.14 and Lemma 19 in [8], we obtain that: lim sup δ 0 E[cτδ] log(1/δ) 1 C (µ) 1 Bϵ . Letting ϵ 0 concludes the proof. D Experiment details and additional results In this section, we provide experimental details and additional results. For the experiments we relied on a server with 100 Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz cpus and 256GB of RAM. The time to obtain all the empirical results is less than a day. This section is structured as follows. First, we provide an additional details and results on the experiment presented in Section 5 (Section D.1 and Section D.2). Secondly, we provide results on additional 4 5 multi-fidelity bandits (Section D.3). Then, we analyze a typical trick that is used to improve the performance of gradient-based methods, that is using a constant rate against using the learning rate that the theory prescribes. (Section D.4). We then present results using very small value of δ w.r.t. to the one that has been considered in the main text (Section D.5). In particular, we verify that the performance difference amplifies. Finally, we discuss the approaches of [31]. D.1 Further details on the experiments presented in Section 5 First instance We begin by providing further details on the 4 5 multi-fidelity bandit model that we used in Figure 1 and Figure 2. First, Table 2 reports the 4 5 bandit model of Figure 1. Table 2: Multi-fidelity bandit model presented in Figure 1. µ1 µ2 µ3 µ4 ξ λ m = 1 0.9465 0.8526 0.8162 0.9099 0.1 0.05 m = 2 0.7727 0.8708 0.9050 1.0594 0.08 0.1 m = 3 0.8812 0.8515 0.8209 1.0083 0.05 0.2 m = 4 0.8284 0.8374 0.8353 0.9745 0.025 0.4 m = 5 0.8494 0.8401 0.8495 0.9856 0.0 5 All arms, both for the bandit model of Figure 1 and 2 are Gaussian distributions with variance σ2 = 0.1. The bandit model in Figure 1 has been generated according to a procedure that has been used to generate MF instances in [25] (see their Appendix D.1). Specifically, first, two M-dimensional vectors are specified, which we refer to as a and b. Specifically, a and b are such that am am+1 and bm bm+1 for all m [M 1]. Then, we first sample the means of the arm at fidelity M 13, and once this is done we sample µi,m µi,M am b 2, µi,M + am + b 2 . Then, ξ is computed as ξm = am + bm 2 . In this sampling procedure, we have used a = [0.075, 0.06, 0.04, 0.02, 0] and b = [0.05, 0.04, 0.02, 0.01, 0]. Second instance We now recall the 5 2 example of Section 5: Table 3: Multi-fidelity bandit model presented in Figure 2. µ1 µ2 µ3 µ4 µ5 ξ λ m = 1 0.4 0.4 0.4 0.4 0.5 0.1 0.5 m = 2 0.5 0.5 0.5 0.5 0.6 0 5 We prove that, in this instance, the oracle weights are given by ω i = [0.09621, 0] for all i [4], and ω 5 = [0, 0.61516] (this number have been rounded to the fourth decimal precision). In order to prove this, we first notice that in the considered domain the optimal fidelity for i [4] is m = 1. This is direct from the fact that µi,m = µi,M ξm (see, e.g., Proposition B.5). 
Furthermore, we recall the expression of f_{5,i}, for any i ∈ [4]:

f_{5,i}(ω, µ) = inf_{η ∈ [µ_{i,M}, µ_{5,M}]} Σ_{m ∈ [M]} [ ω_{5,m} d-(µ_{5,m}, η + ξ_m) / λ_m + ω_{i,m} d+(µ_{i,m}, η − ξ_m) / λ_m ].

Then, since η*_{5,i} ∈ [µ_{i,M}, µ_{5,M}] and µ_{i,M} = µ_{5,1} = µ_{5,M} − ξ_1, the term d-(µ_{5,1}, η + ξ_1) vanishes for every feasible η, and hence the optimal fidelity for arm 5 is m = 2. At this point, consider the oracle weights ω*. We notice that, due to the symmetry of the problem, ω*_{i,1} is equal for all i ∈ [4]. Then, we can rewrite f_{5,i}(ω, µ) as a function of a single variable, that is:

f_{5,i}(ω, µ) = inf_{η ∈ [µ_{i,M}, µ_{5,M}]} [ (1 − 4ω_{i,1}) d-(µ_{5,2}, η) / λ_M + ω_{i,1} d+(µ_{i,1}, η − ξ_1) / λ_1 ],

and, consequently, C*(µ)^{-1} can be expressed as a convex optimization problem in the single variable ω_{i,1}. Taking the derivative of F(ω, µ) w.r.t. ω_{i,1} and setting it to zero, we obtain that the following equality has to be satisfied at the optimum:

4 d(µ_{5,M}, η*_{5,i}) / λ_M = d(µ_{i,1}, η*_{5,i} − ξ_1) / λ_1.

Solving for η*_{5,i} gives a unique solution in the range [0.5, 0.6], which is approximately 0.539. Then, using Lemma C.3 and solving for ω_{i,1}, we obtain ω_{i,1} = 0.09621 and, consequently, ω_{5,2} = 0.61516.
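This one-dimensional reduction is easy to verify numerically. The sketch below (our own helper names; Gaussian divergences with σ² = 0.1 as in the model above; scipy's bounded scalar minimizer for both the inner and the outer search) maximizes the single-variable objective and recovers, up to rounding, the oracle weights and the value of η*_{5,i} reported above. It is an illustrative check, not the code used for the experiments.

# Numerical check of the single-variable reduction for the 5 x 2 instance of Table 3.
import numpy as np
from scipy.optimize import minimize_scalar

sigma2 = 0.1
mu_5M, mu_i1, xi_1, lam_1, lam_M = 0.6, 0.4, 0.1, 0.5, 5.0

def d(p, q):
    # KL divergence between Gaussians with common (known) variance sigma2
    return (p - q) ** 2 / (2 * sigma2)

def f_single(w):
    # w = omega_{i,1}, shared by the four sub-optimal arms; arm 5 gets mass 1 - 4w at fidelity M
    def g(eta):
        best = (1 - 4 * w) * d(mu_5M, eta) / lam_M
        sub = w * d(mu_i1, eta - xi_1) / lam_1 if eta - xi_1 >= mu_i1 else 0.0
        return best + sub
    return minimize_scalar(g, bounds=(0.5, 0.6), method="bounded")

outer = minimize_scalar(lambda w: -f_single(w).fun, bounds=(0.0, 0.25), method="bounded")
w_star = outer.x
inner = f_single(w_star)
# Approximately recovers omega_{i,1}, omega_{5,2} = 1 - 4 * omega_{i,1} and eta*_{5,i} above.
print(round(w_star, 4), round(1 - 4 * w_star, 4), round(inner.x, 4))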
Thresholds. To conclude, we comment on the thresholds β_{t,δ} used by the algorithms. For the stopping rule of MF-GRAD we used β_{t,δ} = log(K/δ) + M log(log(t) + 1), which is a simplification of its theoretical value (31) that retains the same scaling in K and M (up to constants). In GRAD, we used β_{t,δ} = log(K/δ) + log(log(t) + 1), which is a similar simplification of the usual threshold for BAI: instead of concentrating a sum of 2M KL terms (see the proof of Proposition C.13), it only requires concentrating a sum of 2 KL terms. Finally, in the confidence intervals used by IISE we adopted the confidence bonuses √(2σ²(log(KM/δ) + log(log(t))) / N_{a,m}(t)),¹⁴ which, compared to their original form, replace a crude union bound over t with a stylized version of the threshold that would follow from tight time-uniform concentration. These choices were adopted consistently in all the experiments presented in this appendix, and they ensured the δ-correctness requirement in all cases.

¹³ For this step, we constrained the minimum gap between arms at fidelity M to be at least 0.1.
¹⁴ Notice, indeed, that IISE requires a union bound on both K and M.

D.2 Empirical cost proportions of MF-GRAD

In this section, we analyze the behavior of MF-GRAD by looking at the evolution of its empirical cost proportions. Specifically, we repeat, on the 4 × 5 bandit model of Table 2, the same experiment that we presented in Section 5 for the 5 × 2 bandit model. Figure 4 reports the result. Interestingly, we highlight how the sparsity pattern emerges in this domain as well.

D.3 Results on additional domains

In this section, we present results on two additional 4 × 5 multi-fidelity bandit models. Specifically, we generate another random instance according to the procedure of [25], reported in Table 4, and we create an additional bandit model in which the means of some arms are slightly increasing or roughly stationary across fidelities, reported in Table 5. In both cases, we consider Gaussian distributions with σ² = 0.1. Empirical results of MF-GRAD, IISE and GRAD for δ = 0.01 can be found in Figures 5 and 6, respectively. As we can appreciate, MF-GRAD maintains the most competitive performance across both domains.

Table 4: Additional random multi-fidelity bandit model.
        µ1      µ2      µ3      µ4      ξ      λ
m = 1   0.6944  0.5080  0.4153  0.3564  0.1    0.05
m = 2   0.5634  0.3723  0.4132  0.4570  0.08   0.1
m = 3   0.6178  0.4322  0.3817  0.4065  0.05   0.2
m = 4   0.6323  0.4225  0.3838  0.3582  0.025  0.4
m = 5   0.6171  0.4216  0.3831  0.3783  0.0    1

Table 5: Additional multi-fidelity bandit model.
        µ1    µ2    µ3    µ4    ξ     λ
m = 1   0.41  0.35  0.51  0.41  0.1   0.1
m = 2   0.45  0.37  0.56  0.39  0.08  0.125
m = 3   0.47  0.38  0.64  0.40  0.04  0.25
m = 4   0.48  0.36  0.62  0.42  0.02  0.5
m = 5   0.5   0.35  0.61  0.42  0.0   1
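For reference, the random instances of Tables 2 and 4 follow the generation recipe of [25] recalled in Appendix D.1. The sketch below is our reading of that recipe (uniform sampling inside the allowed band around each fidelity-M mean; the fidelity-M means used here are illustrative placeholders), provided only to make the construction easy to reproduce in spirit.

# Sketch of the instance-generation recipe of [25] used for Tables 2 and 4 (our reading).
import numpy as np

def sample_mf_instance(mu_high, a, b, rng):
    # mu_high: means at the highest fidelity (one per arm); a, b: non-increasing vectors
    # with a[M-1] = b[M-1] = 0, so that the highest fidelity is unbiased.
    M = len(a)
    xi = np.array([a[m] + b[m] / 2 for m in range(M)])      # maximal bias per fidelity
    mu = np.tile(mu_high[:, None], (1, M)).astype(float)
    for m in range(M - 1):
        lo = mu_high - a[m] - b[m] / 2                       # per-arm band of half-width xi_m
        hi = mu_high + a[m] + b[m] / 2
        mu[:, m] = rng.uniform(lo, hi)                       # |mu_{i,m} - mu_{i,M}| <= xi_m
    return mu, xi

rng = np.random.default_rng(0)
a = np.array([0.075, 0.06, 0.04, 0.02, 0.0])
b = np.array([0.05, 0.04, 0.02, 0.01, 0.0])
mu_high = np.array([0.85, 0.84, 0.83, 0.98])                 # illustrative, gap >= 0.1 at fidelity M
mu, xi = sample_mf_instance(mu_high, a, b, rng)
print(np.round(mu, 4))
print(xi)                                                    # [0.1, 0.08, 0.05, 0.025, 0.0] as in Table 2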
D.4 Improving performance with a constant learning rate

As reported in [22], using a constant learning rate can improve the identification performance in standard BAI settings. In the following, we analyze the performance difference between MF-GRAD run with the theoretical learning rate and MF-GRAD run with a constant learning rate α = 0.25; we refer to this second version as MF-GRAD-CONST. Figures 7 and 8 report the performance of the two algorithms on the two bandit models of Section 5. As we can see, in both cases MF-GRAD-CONST outperforms MF-GRAD. We further investigate this behavior by showing the evolution of the empirical cost proportions of MF-GRAD-CONST during the learning process: Figures 9 and 10 report the evolution of the empirical costs over 100000 iterations. Comparing these with the MF-GRAD proportions in Figure 4, we can appreciate that MF-GRAD-CONST moves away from the initial cost proportions much sooner than MF-GRAD, which explains its superior performance in this moderate regime of δ.

D.5 Smaller value of δ

Finally, we repeat the experiments presented in the previous sections using a smaller value of δ. Specifically, we consider δ = 10⁻¹⁰. Figure 11 reports the performance on the 4 × 5 bandit model of Section 5, Figure 12 on the 5 × 2 bandit model of Section 5, and Figures 13 and 14 on the additional bandit models presented in Appendix D.3. As one can notice, the performance gap between MF-GRAD and the considered baselines increases.

Figure 4: Empirical cost proportions of MF-GRAD for 100000 iterations on the 4 × 5 bandit model of Table 2. Results are averaged over 100 runs and shaded areas report 95% confidence intervals. Empirical cost proportions of each arm are plotted with the same color. Cost proportions at fidelities 1, 2, 3, 4 and 5 are visualized with circles, squares, crosses, triangles, and diamonds, respectively.
Figure 5: Empirical cost complexity over 1000 runs with δ = 0.01 on the multi-fidelity bandit of Table 4.
Figure 6: Empirical cost complexity over 1000 runs with δ = 0.01 on the multi-fidelity bandit of Table 5.
Figure 7: Empirical cost complexity over 1000 runs with δ = 0.01 on the 4 × 5 multi-fidelity bandit of Section 5.
Figure 8: Empirical cost complexity over 1000 runs with δ = 0.01 on the 5 × 2 multi-fidelity bandit of Section 5.
Figure 9: Empirical cost proportions of MF-GRAD-CONST for 100000 iterations on the 4 × 5 bandit model of Section 5. Results are averaged over 100 runs and shaded areas report 95% confidence intervals. Empirical cost proportions of each arm are plotted with the same color. Cost proportions at fidelities 1, 2, 3, 4 and 5 are visualized with circles, squares, crosses, triangles, and diamonds, respectively.
Figure 10: Empirical cost proportions of MF-GRAD-CONST for 100000 iterations on the 5 × 2 bandit model of Section 5. Results are averaged over 100 runs and shaded areas report 95% confidence intervals. Empirical cost proportions of each arm are plotted with the same color. Cost proportions at fidelities 1 and 2 are visualized with circles and squares, respectively.
Figure 11: Empirical cost complexity over 1000 runs with δ = 10⁻¹⁰ on the 4 × 5 multi-fidelity bandit of Table 2.
Figure 12: Empirical cost complexity over 1000 runs with δ = 10⁻¹⁰ on the 5 × 2 multi-fidelity bandit of Table 3.
Figure 13: Empirical cost complexity over 1000 runs with δ = 10⁻¹⁰ on the multi-fidelity bandit of Table 4.
Figure 14: Empirical cost complexity over 1000 runs with δ = 10⁻¹⁰ on the multi-fidelity bandit of Table 5.

D.6 On LUCB-Explore A and LUCB-Explore B

In this section, we present in detail the main issue behind the algorithms proposed in [31], i.e., LUCBExplore A and LUCBExplore B: both algorithms may fail to stop on some specific multi-fidelity bandit models. First, we provide numerical evidence of this phenomenon by running both methods on a specific instance (Section D.6.1). Then, in Section D.6.2, we point out an error in the analysis of [31] that explains why both algorithms fail to stop on instances such as the one considered in Section D.6.1.

D.6.1 Experimental issues

When experimenting with the algorithms proposed in [31], namely LUCBExplore A and LUCBExplore B, we faced stopping issues: both algorithms did not terminate in any reasonable number of steps on some specific instances. We now report an illustrative example of such a scenario. Consider the following Gaussian multi-fidelity bandit model: µ1 = [0.64, 0.6], µ2 = [0.46, 0.5], λ = [0.1, 5], ξ = [0.1, 0] and σ² = 1. In this scenario, the well-known LUCB algorithm [12], which only uses samples at fidelity M, stops quickly (around iteration 100k), paying a total cost of roughly 500k. When running LUCBExplore A and LUCBExplore B, instead, we faced termination issues. We let both algorithms run for a maximum of 10⁸ samples (reaching a total cost of approximately 10⁷), and the stopping criterion was never met for LUCBExplore A, while 70% of the LUCBExplore B runs did not stop. LUCBExplore B explores more fidelities at the beginning, and that initial exploration can be enough to trigger the stopping test on some runs, but many continue until we artificially stop the experiment. Figure 15 reports the results of this experiment. As a final remark, we notice that both LUCBExplore A and LUCBExplore B require additional knowledge in order to run, namely an upper bound on µ_{1,M} and a lower bound on µ_{2,M} (assuming arms to be ordered so that µ_{1,M} > µ_{2,M} ≥ · · · ≥ µ_{K,M}). The results presented in this section were obtained by running their algorithms in the most favorable scenario, that is, the situation in which the agent has perfect knowledge of the values µ_{1,M} and µ_{2,M}.

Figure 15: Visualization of the non-stopping behavior of LUCBExplore A and LUCBExplore B.

D.6.2 Theoretical issues

The general idea of the LUCBExplore algorithms of [31] is to identify for each arm the optimal fidelity and pull the arm at that fidelity. In Appendix B.5, we described how that optimal fidelity can differ from the fidelity selected by our lower bound. Since our lower bound can be matched by an algorithm, and thus describes the actual cost complexity of the problem, it better represents the notion of optimal fidelity. We will thus call the fidelity used by the LUCBExplore algorithms the target fidelity instead. The two variants, Explore A and Explore B, differ in the mechanism used to look for the target fidelity.
We first show that even if their algorithm used an oracle for the fidelity exploration mechanism, returning the target fidelity for all arms, it would still not be able to stop on some examples. We then highlight an issue with the proof of [31].

Failure to stop with an oracle. Consider the bandit instance from Appendix B.5. Recall that this is a 2 × 2 example of a multi-fidelity BAI problem with ξ1 = 0.1, ξ2 = 0.0, µ_{1,M} = 0.6, µ_{1,m} = 0.65, µ_{2,M} = 0.5, µ_{2,m} = 0.45 (we write M = 2 and m = 1). All distributions are Gaussian with variance 1. We choose λ_M > 4λ_m, which means that the target fidelities for that problem are m*_1 = 1 and m*_2 = 1 (see details in Appendix B.5). LUCBExplore with an oracle that always selects that fidelity is the following algorithm:

Initialization: µ̂_{k,m}(t) = 0, N_{k,m}(t) = 0, UCB_k(t) = 1, LCB_k(t) = 0 for all arms k and fidelities m; ℓ_t = 1, u_t = 2.
While LCB_{ℓ_t}(t) ≤ UCB_{u_t}(t):
  ℓ_t = argmax_k UCB_k(t), u_t = argmax_{k ∈ [K] \ {ℓ_t}} UCB_k(t);
  pull arms ℓ_t and u_t at their target fidelity.

The indices are

LCB_k(t) = max_m ( µ̂_{k,m}(t) − ξ_m − β(N_{k,m}(t), t, δ) ),
UCB_k(t) = min_m ( µ̂_{k,m}(t) + ξ_m + β(N_{k,m}(t), t, δ) ),

where β(n, t, δ) = √(log(L t⁴ / δ) / n) for some constant L > 0. In the two-arm example considered here, the algorithm simplifies greatly: it always pulls both arms alternately, always at fidelity m = 1, and it stops when the LCB of one arm surpasses the UCB of the other. We show that it cannot stop and return the best arm 1, unless a confidence interval is not valid, which happens with small probability. If µ̂_{1,1}(t) ≤ µ_{1,1} + β(t/2, t, δ), then

LCB_1(t) = max{0, µ̂_{1,1}(t) − ξ_1 − β(t/2, t, δ)} ≤ µ_{1,1} − ξ_1 = 0.55.

On the other hand, if µ̂_{2,1}(t) ≥ µ_{2,1} − β(t/2, t, δ), then

UCB_2(t) = min{1, µ̂_{2,1}(t) + ξ_1 + β(t/2, t, δ)} ≥ µ_{2,1} + ξ_1 = 0.55.

We get that we always have LCB_1(t) ≤ UCB_2(t), unless one of the two concentration inequalities on the empirical means does not hold. The confidence width β is designed to make those inequalities true for all t ∈ N with probability close to 1. We can similarly get that LCB_2(t) ≤ UCB_1(t) (which is expected since 2 is a worse arm) unless some concentration inequality is false. We obtain that this algorithm with an oracle selection for the target fidelity cannot stop fast: the only way it can stop is if unlikely deviations occur.
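The non-stopping behavior can also be checked by direct simulation. The sketch below is our own implementation of the indices above (with the arbitrary choice L = 1); it runs the oracle-fidelity variant on this instance and verifies that the stopping condition is essentially never met over a large horizon.

# Simulation sketch of LUCBExplore with oracle target fidelities on the 2 x 2 instance above.
import numpy as np

rng = np.random.default_rng(0)
mu = {1: 0.65, 2: 0.45}                 # means at the target fidelity m = 1
xi1, delta, L, sigma = 0.1, 0.01, 1.0, 1.0
T = 200_000                             # number of rounds (two pulls per round)

def beta(n, t):
    return np.sqrt(np.log(L * t ** 4 / delta) / n) if n > 0 else np.inf

sums = {1: 0.0, 2: 0.0}
n = {1: 0, 2: 0}
stopped = False
for t in range(1, T + 1):
    for k in (1, 2):                    # both arms are pulled at fidelity 1 every round
        sums[k] += rng.normal(mu[k], sigma)
        n[k] += 1
    # Fidelity 2 is never pulled, so it only contributes the trivial bounds 0 and 1.
    lcb1 = max(0.0, sums[1] / n[1] - xi1 - beta(n[1], 2 * t))
    ucb2 = min(1.0, sums[2] / n[2] + xi1 + beta(n[2], 2 * t))
    if lcb1 > ucb2:                     # stopping test of the algorithm
        stopped = True
        break
print("stopped:", stopped)              # expected: False (LCB_1 <= 0.55 <= UCB_2 w.h.p.)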
Issue with the proof. There is an issue with the proof of the cost complexity upper bound of [31]. The issue lies in the first three steps of their Appendix E.2, pages 17 and 18. They identify a threshold c (with value 0.55 in the example of the last paragraph) and prove the following. Step 1: if the algorithm does not terminate and the confidence intervals hold, then either both LCB_{ℓ_t}(t) ≤ c and UCB_{ℓ_t}(t) ≥ c, or both LCB_{u_t}(t) ≤ c and UCB_{u_t}(t) ≥ c. Step 2: the confidence intervals are likely to hold. Step 3: if a sub-optimal arm k satisfies LCB_k(t) ≤ c and UCB_k(t) ≥ c, then its target fidelity cannot be pulled much. They conclude that, for all arms, the number of pulls at the target fidelity is upper bounded, with large probability.

Let us see the issue with that proof, on the same example as in the last paragraph. In the example above with the oracle choice for the target fidelity, we saw that if the confidence intervals hold and ℓ_t = 1 (which is the most likely outcome), then LCB_{ℓ_t}(t) ≤ c and UCB_{ℓ_t}(t) ≥ c. That is, Step 1 gives a condition on arm 1 only (and nothing on arm 2). But then we get nothing from Step 3, since arm 1 is not a sub-optimal arm. We only get an upper bound on the number of pulls of a sub-optimal arm if we can say that it satisfies LCB_k(t) ≤ c and UCB_k(t) ≥ c at some point, but this might not be the case. Indeed, when the algorithm does not terminate, Steps 1 and 2 together give that, with large probability, either both LCB_{ℓ_t}(t) ≤ c and UCB_{ℓ_t}(t) ≥ c, or both LCB_{u_t}(t) ≤ c and UCB_{u_t}(t) ≥ c. It is possible that we always have this property for ℓ_t = 1 (the optimal arm), and that we can never apply Step 3.