Preference-based Pure Exploration

Apurv Shukla
Department of ECE, Texas A&M University, College Station, TX 77840
apurv.shukla@umich.edu

Debabrota Basu
Équipe Scool, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, France
debabrota.basu@inria.fr

Abstract

We study the preference-based pure exploration problem for bandits with vector-valued rewards. The rewards are ordered using a (given) preference cone $\mathcal{C}$, and our goal is to identify the set of Pareto optimal arms. First, to quantify the impact of preferences, we derive a novel lower bound on the sample complexity for identifying the most preferred policy with confidence level $1-\delta$. Our lower bound elicits the role played by the geometry of the preference cone and punctuates the difference in hardness compared to existing best-arm identification variants of the problem. We further explicate this geometry when the rewards follow Gaussian distributions. We then provide a convex relaxation of the lower bound and leverage it to design the Preference-based Track and Stop (PreTS) algorithm that identifies the most preferred policy. Finally, we show that the sample complexity of PreTS is asymptotically tight by deriving a new concentration inequality for vector-valued rewards.

1 Introduction

Following COVID-19, the importance of reliable clinical trials and the corresponding data acquisition to design effective drugs has gained wider recognition. However, conducting large-scale clinical trials is cost- and time-intensive, as it requires working with a large number of patients and following up on their medical conditions over time. In the past two decades, this has led to a doubling of the cost to bring a drug to the market, i.e., to $2.6 billion with a 12-year drug development horizon and a 90% failure rate during the clinical trial (Mullard, 2014; Sun et al., 2022). However, due to the rise of systematic data acquisition about biological systems, pharmaceutical firms are interested in harvesting the collected data for drug discovery (Gaulton et al., 2012; Reker and Schneider, 2015). Thus, machine learning-based methods are increasingly studied and deployed as a promising avenue for identifying potentially successful drugs with less patient involvement, increasing the hit rate, and speeding up the development process (Jayatunga et al., 2022; Smer-Barreto et al., 2023; Sadybekov and Katritch, 2023; Hasselgren and Oprea, 2024). But deciding whether a drug is successful depends on multiple and often conflicting objectives regarding safety, efficacy, and pharmacokinetic constraints (Lizotte and Laber, 2016). For example, COV-BOOST (Munro et al., 2021) is a phase II vaccine clinical trial conducted on 2883 participants to measure the immunogenicity indicators (e.g., cellular response, anti-spike IgG, and NT50) of different COVID-19 vaccines as a booster (third dose). Experts decide how different indicators are preferred over one another, and above which thresholds (Jayatunga et al., 2022). This motivates us to study a sequential decision-making problem, where we aim to conduct a minimum number of experiments to acquire informative data, and to reliably validate a hypothesis with multiple objectives by imposing preferences over them.

This work was done when the author was at Texas A&M University. The author is currently at the University of Michigan, Ann Arbor.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).
Problems of such nature can be modeled as a multi-armed bandit (in brief, bandit), which is an established framework for sequential decision-making under uncertainty (Lattimore and Szepesvári, 2020). In bandits, a learner has access to an instance of $K$ decisions (or arms). Each arm $k \in \{1, \ldots, K\}$ corresponds to a probability distribution $P_k$ of feedback (rewards) with unknown mean $\mu_k$. At each step $t \in \mathbb{N}$, the learner interacts with the instance by taking a decision $k_t$ (analogously, pulling an arm), and observes a noisy reward $R_t$ from the corresponding distribution of rewards $P_{k_t}$. The goal of the learner is to identify the arm with the highest expected reward at a certain confidence level through a minimum number of interactions with the instance. This is popularly known as fixed-confidence Best Arm Identification (BAI) in the bandit literature (Jamieson and Nowak, 2014; Garivier and Kaufmann, 2016; Soare et al., 2014), which is a special case of pure exploration problems (Even-Dar et al., 2006; Bubeck et al., 2009; Auer et al., 2016). The bandit literature, spanning over a century, mostly focuses on a scalar reward, i.e., a single objective. In our problem, each reward $R_t$ is a real-valued vector of $L \in \mathbb{N}$ objectives, and thus the unknown mean vector of each arm is $M_k \in \mathbb{R}^L$. Since the objectives can often be conflicting, there might not exist a single best arm. Rather, there exists a Pareto Optimal Set of arms (Drugan and Nowé, 2013; Auer et al., 2016). Given a set of preferences over the objectives, the Pareto Optimal Set consists of the arms whose mean vectors are not dominated by the mean vector of any arm outside the set. To keep generality, we assume that preferences are defined by a cone of vectors $\mathcal{C} \subseteq \mathbb{R}^L$. Every $\mathcal{C}$ induces a set of partial or incomplete orders over the $L$ objectives (Jahn et al., 2009; Löhne, 2011). Given the preference cone $\mathcal{C}$, we aim to exactly identify the complete Pareto Optimal Set with a confidence level $1-\delta$ for $\delta \in [0, 1)$, using as few interactions as possible. We refer to this problem as Preference-based Pure Exploration (PrePEx).
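To make the interaction protocol concrete, here is a minimal simulation sketch. This is ours, for illustration only: the Gaussian noise, the uniform sampling rule, and the names `run_protocol` and the toy instance are placeholder assumptions, not the algorithm studied in this paper.

```python
import numpy as np

def run_protocol(M, T, rng):
    """Pull arms for T rounds and maintain the column-wise empirical mean matrix.
    M is L x K; each pull of arm k returns the vector M[:, k] plus Gaussian noise."""
    L, K = M.shape
    sums, counts = np.zeros((L, K)), np.zeros(K)
    for _ in range(T):
        k = rng.integers(K)                     # placeholder sampling rule
        reward = M[:, k] + rng.normal(size=L)   # vector-valued noisy feedback
        sums[:, k] += reward
        counts[k] += 1
    return sums / np.maximum(counts, 1)         # empirical estimate of M

rng = np.random.default_rng(1)
M = np.array([[1.0, 0.2, 0.8], [0.1, 1.0, 0.7]])
print(run_protocol(M, 3000, rng).round(2))      # converges to M as T grows
```

The only change from scalar bandits is that every pull returns an $L$-dimensional reward, so the estimator is a full $L \times K$ matrix rather than a $K$-vector.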
Recently, Auer et al. (2016); Kone et al. (2023a,b); Crépon et al. (2024) consider a special case of PrePEx, where the preference is known. To the best of our knowledge, Ararat and Tekin (2023) and Korkmaz et al. (2023) are the only studies of PrePEx from the frequentist and Bayesian angles, respectively. Here, we consider a frequentist approach, as in (Ararat and Tekin, 2023). However, their goal is to identify points that are in the Pareto Optimal Set or very close to it. In contrast, we focus on exactly identifying the Pareto Optimal Set. Additionally, Ararat and Tekin (2023) propose a gap-based elimination algorithm that generalises the algorithm of Even-Dar et al. (2006). But in BAI, there is another paradigm for designing efficient algorithms, which solves and tracks the exact lower bound on the expected time to identify the best arm $(1-\delta)$-correctly (Garivier and Kaufmann, 2016; Degenne and Koolen, 2019). We explore this paradigm for PrePEx and ask two questions:

What is the exact lower bound of PrePEx for identifying the Pareto Optimal Set, and how can we design a computationally tractable algorithm matching this bound?

We address them affirmatively in our contributions:

1. Lower Bound for PrePEx. In Theorem 3.1, we study the hardness of PrePEx problems by deriving a novel lower bound on the expected sample complexity of any algorithm that yields the exact Pareto Optimal Set with confidence $(1-\delta)$. The challenge here is to extend the classical BAI lower bound (Garivier and Kaufmann, 2016) to a set of confusing instances given $\mathcal{C}$. We observe that, unlike BAI, the distinguishability of two arms in PrePEx depends on their projections on the cone polar to $\mathcal{C}$. We also show that our lower bound generalises the lower bound for pure exploration under known constraints (Carlsson et al., 2024). Additionally, we provide an exact characterization of the lower bound for Gaussian reward distributions in Theorem 3.2. It shows that the hardness depends on the bilinear projection of the mean matrix of arms onto the boundary of a normal cone of policies and the preferences. This is novel w.r.t. the existing gap-dependent lower bounds, which hold either for a narrow range of $\mu_a$'s (Ararat and Tekin, 2023) or for a fixed preference (Kone et al., 2023a).

2. Algorithm Design. First, we observe that the optimisation problem in our lower bound involves minimisation over a non-convex set. We provide a convex relaxation of the problem based on ideas from disjunctive programming (Theorems 4.1 and 4.2). We then leverage this lower bound to propose a novel Track-and-Stop (Garivier and Kaufmann, 2016) style algorithm, called PreTS (Preference-based Track-and-Stop). In Theorem 4.3, we devise a new stopping rule that can handle the preference-aligned suboptimality gaps between the arms.

3. Sample Complexity Analysis. Finally, we provide an upper bound on the sample complexity of PreTS. This requires us to define a distance metric between two Pareto sets of arms, and to prove a concentration bound with respect to this metric (Theorem 5.1). In Theorem 5.2, we prove that the sample complexity of PreTS matches the convexified lower bound up to constants.

1.1 Related Works

In the past decade, work on multi-armed bandits has also focused on pure exploration in addition to regret minimization. The two differ in that, in pure exploration, arms are immediately discarded upon being deemed sub-optimal, whereas in regret minimization sub-optimal arms may still be played, since they provide additional information about the other arms. Pure exploration problems have been considered in two settings: fixed-budget and fixed-confidence. The fixed-budget setting aims at bounding the probability of misidentifying the best arm given a budget of samples. Audibert and Bubeck (2010) propose the first algorithm for the fixed-budget setting. Here, the budget is divided into $K-1$ rounds, and at the end of every round, the arm with the lowest empirical mean is discarded. On the other hand, best-arm identification is a version of the pure exploration problem with scalar rewards (Even-Dar et al., 2006). In this setting, we are given a $\delta \in (0, 1)$, and the goal is to identify the best arm with probability at least $1-\delta$. Several strategies, such as those based on elimination, adaptivity, racing, and upper-confidence bounds, have been proposed to minimize the expected number of arm pulls in the fixed-confidence setting (Kalyanakrishnan et al., 2012; Gabillon et al., 2012; Jamieson et al., 2014; Garivier and Kaufmann, 2016; Jedra and Proutiere, 2020). Arm rewards can be modeled as vectors with Gaussian process rewards (Zuluaga et al., 2016), linear rewards (Drugan and Nowé, 2013; Lu et al., 2019), and non-parametric rewards (Turgay et al., 2018), which can include contextual bandit formulations (Tekin and Turgay, 2017; Shukla, 2022).
In the recent past, pure exploration techniques have been successfully applied to hyperparameter tuning (Li et al., 2018) and black-box optimization problems (Contal et al., 2013; Wang et al., 2021, 2022), demonstrating considerable performance gains. In a marked deviation, given an instance of the bandit problem, the goal of this paper is to identify the entire Pareto front. A key observation in this regard is that there might be arms which are sub-optimal for almost every objective but still lie on the Pareto front. Further, since sampling an arm returns a vector of rewards, determining an arm-sampling strategy that reduces the uncertainty in the estimate of every reward function is challenging. An immediate consequence of these differences is the fact that the complexity of identifying the Pareto front is different from that of best-arm identification. Auer et al. (2016) consider the Pareto front identification problem in the multi-armed bandit model and establish sample complexity bounds for the problem in terms of relevant problem parameters in the fixed-confidence setting. The multi-armed bandit problem is further studied under cone-based preferences by Ararat and Tekin (2023). The main contributions of (Ararat and Tekin, 2023) are bounds on the sample complexity of the problem in terms of gap-based notions that depend on the cone. Karagözlü et al. (2024) build upon this work to introduce adaptive elimination-based algorithms for learning the Pareto front under incomplete preferences. When the reward vectors are Gaussian processes, Korkmaz et al. (2023) propose an elimination-based algorithm for identifying the Pareto front. The goal in these works is to identify the set of arms that are $\epsilon$-close to the Pareto front, as the sample complexity to identify the exact Pareto set can be very large. Kone et al. (2023a) consider the problem of identifying a relevant subset of the Pareto set using a single sampling strategy, Adaptive Pareto Exploration, along with different stopping rules to handle variations of the Pareto Set Identification problem. Crépon et al. (2024) consider the exact Pareto front identification problem in the multi-armed bandit setting, but with a fixed and known preference cone. They propose a lower bound and a computationally efficient gradient-based algorithm to implement a track-and-stop strategy. To the best of our knowledge, ours is the first work to consider the exact Pareto front identification problem from a pure-exploration perspective. Therefore, our proposed framework can be used for identifying the Pareto front given a preference cone for several variants of the bandit problem, including the standard multi-armed bandit problem, linear bandits, etc.

2 Preference-based Pure Exploration Problem

In this section, we formalise the fixed-confidence setting of preference-based pure exploration and introduce the notation.

Notations. For $n \in \mathbb{N}$, let $[n]$ denote the set $\{1, 2, \ldots, n\}$. We use $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_\infty$ to denote the $\ell_1$-norm, $\ell_2$-norm, and $\ell_\infty$-norm, respectively. For a vector $z$, $z^{(\ell)}$ denotes its $\ell$-th component. Let $e_\ell$ denote the vector with 1 in the $\ell$-th position and zeros otherwise. $\triangle_K$ denotes the simplex on $[K]$. $d_{\mathrm{KL}}(P, Q)$ measures the KL-divergence between distributions $P$ and $Q$. $\mathrm{vect}(A)$ is the vectorized version of matrix $A$. $\mathbf{1}$ is the vector of all 1's. Further details of the notation are deferred to Appendix A.

PrePEx: Problem Formulation. In PrePEx, a learner can access a bandit instance with $K$ arms.
Each arm $k \in [K]$ corresponds to a reward distribution $P_k$ over $\mathbb{R}^L$ with unknown mean $M_k \in \mathbb{R}^L$ and known covariance $\Sigma = \mathrm{Diag}(\sigma_1^2, \ldots, \sigma_L^2)$. Here, $L$ denotes the number of objectives corresponding to each arm. Thus, a bandit instance can be specified by the vector of mean rewards $\{M_k\}_{k=1}^K$. For brevity, we represent them with a matrix $M \in \mathbb{R}^{L \times K}$ whose $k$-th column is $M_k$. At each time $t \in \mathbb{N}$, the learner pulls an arm $k_t \in [K]$ and observes the corresponding reward vector $R_t$ sampled from $P_{k_t}$. In BAI, the learner typically focuses on finding the best arm, i.e., the arm with the highest mean (Garivier and Kaufmann, 2016). In pure exploration, a more general setting than BAI, the learner aims to find a policy $\pi \in \triangle_K$ that dictates the arm proportions to choose in order to maximize the expected reward from the instance. Following the vector optimization literature (Jahn et al., 2009; Ararat and Tekin, 2023), we assume that the learner additionally has access to an ordering cone $\mathcal{C}$.

Definition 2.1 (Ordering Cone). A set $\mathcal{C} \subseteq \mathbb{R}^L$ is called a cone if $v \in \mathcal{C}$ implies that $\alpha v \in \mathcal{C}$ for all $\alpha \geq 0$. A solid cone has a non-empty interior, i.e., $\mathrm{int}(\mathcal{C}) \neq \emptyset$. A pointed cone contains no line through the origin, i.e., $\mathcal{C} \cap (-\mathcal{C}) = \{0\}$. A closed, convex, pointed, and solid cone is called an ordering cone.

An ordering cone can be either polyhedral or non-polyhedral. Following the literature (Ararat and Tekin, 2023; Karagözlü et al., 2024), we consider access to a polyhedral ordering cone.

Definition 2.2 (Polyhedral Ordering Cone). An ordering cone $\mathcal{C}$ is a polyhedral cone if $\mathcal{C} \triangleq \{x \in \mathbb{R}^L \mid A x \geq 0\}$, where $A \in \mathbb{R}^{K \times L}$ with rows $a_i^\top$. $A$ is called the half-space representation of $\mathcal{C}$.

Each polyhedral ordering cone induces a set of partial orders on the reward vectors in $\mathbb{R}^L$. To remove redundancies and to focus on the bandit problem, we further assume that $A$ is full row-rank and $\|a_i\|_2 = 1$ (Ararat and Tekin, 2023). Hereafter, we call them preference cones, and the vectors in the cone preferences. We refer to (Jahn et al., 2009; Löhne, 2011) for further details on cones.

Example 2.1 (Preference cones). The positive orthant $\mathbb{R}^L_+$ is a polyhedral ordering cone. This is the cone used in the Pareto-set identification literature (Auer et al., 2016; Kone et al., 2023b; Crépon et al., 2024). Cones with all non-negative entries are called solvency cones and are used in finance (Kabanov, 2009). Another simple example is $\mathcal{C}_{\pi/3} \triangleq \{(r\cos\theta, r\sin\theta) \in \mathbb{R}^2 \mid r \geq 0, \theta \in [0, \pi/3]\}$, i.e., all the two-dimensional vectors that make an angle of at most $\pi/3$ with the $x$-axis.

Definition 2.3 (Partial Order). For every $\mu, \mu' \in \mathbb{R}^L$, $\mu' \preceq_{\mathcal{C}} \mu$ if $\mu \in \mu' + \mathcal{C}$, and $\mu' \prec_{\mathcal{C}} \mu$ if $\mu \in \mu' + \mathrm{int}(\mathcal{C})$. Alternatively, $\mu' \preceq_{\mathcal{C}} \mu$ is equivalent to $z^\top(\mu - \mu') \geq 0$ for all $z \in \mathcal{C}$.

[Figure 1: Effect of cone selection on the size of the Pareto Optimal Set.]

The partial order induced by $\mathcal{C}$ induces a further order over the set of arms $[K]$.

Definition 2.4 (Order over arms). Consider two arms $i, j \in [K]$. (i) Arm $i$ is weakly dominated by arm $j$ iff $M_i \preceq_{\mathcal{C}} M_j$. (ii) Arm $i$ dominates arm $j$ iff $M_j \preceq_{\mathcal{C}\setminus\{0\}} M_i$, i.e., $M_i - M_j \in \mathcal{C}\setminus\{0\}$. (iii) Arm $i$ strongly dominates arm $j$ iff $M_j \prec_{\mathcal{C}} M_i$.

Definition 2.5 (Pareto Optimal Set). An arm $i \in [K]$ is Pareto optimal if it is not dominated by any other arm w.r.t. the cone $\mathcal{C}$. The Pareto Optimal Set $\mathcal{P}^*$ is defined as the set of all Pareto optimal arms.

Given a preference cone, a learner aims to identify exactly the set of arms in $[K]$ whose mean rewards are Pareto optimal w.r.t. $\mathcal{C}$. Alternatively, this vector optimization problem can be represented in the policy space as finding a policy $\pi \in \triangle_K$ supported on the Pareto Optimal Set of arms.
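To make Definitions 2.4 and 2.5 concrete, the following is a minimal NumPy sketch, ours for illustration: the helpers `dominates` and `pareto_set` and the toy means are assumptions, not the paper's implementation. It computes the Pareto Optimal Set for a polyhedral cone given its half-space representation $A$.

```python
import numpy as np

def dominates(m_i, m_j, A, tol=1e-9):
    """Arm i dominates arm j (Definition 2.4(ii)) iff M_i - M_j lies in C \\ {0},
    where C = {x : A x >= 0} is given by its half-space representation A."""
    d = m_i - m_j
    return bool(np.all(A @ d >= -tol) and np.linalg.norm(d) > tol)

def pareto_set(M, A):
    """Pareto Optimal Set (Definition 2.5): arms not dominated by any other arm.
    M is an L x K matrix whose k-th column is the mean vector of arm k."""
    K = M.shape[1]
    return [k for k in range(K)
            if not any(dominates(M[:, j], M[:, k], A) for j in range(K) if j != k)]

# Toy instance with L = 2 objectives and K = 4 arms.
M = np.array([[1.0, 0.2, 0.8, 0.7],
              [0.1, 1.0, 0.7, 0.2]])
A_orthant = np.eye(2)  # C = R^2_+, the componentwise (Pareto) order
# Half-space representation of C_{pi/3} from Example 2.1:
A_pi3 = np.array([[0.0, 1.0],                             # theta >= 0
                  [np.sin(np.pi/3), -np.cos(np.pi/3)]])   # theta <= pi/3
print(pareto_set(M, A_orthant))  # [0, 1, 2]: arm 3 is dominated by arm 2
print(pareto_set(M, A_pi3))      # [0, 1, 2, 3]: the narrower cone enlarges the set
```

Mirroring Figure 1, shrinking the cone weakens the induced order, so fewer arms get dominated and the Pareto Optimal Set grows.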
The following vector optimization problem yields this:
$$V(M) \triangleq \max_{\pi \in \triangle_K} M\pi \quad \text{w.r.t. the order induced by } \mathcal{C}. \tag{1}$$
In this context, we denote the set of Pareto optimal policies as $\Pi^*(M) \triangleq \arg\max_{\pi \in \triangle_K} M\pi$ over $\mathcal{C}$. We assume that $\Pi^*(M)$ is non-empty.

Example 2.2 (Pareto Optimal Sets for different cones). Figure 1 illustrates the Pareto Optimal Sets among the 2-dimensional mean vectors of 200 randomly selected arms under preference cones $\mathcal{C}_{\pi/2}$ and $\mathcal{C}_{\pi/3}$. We observe that the Pareto Optimal Sets for them (in pink and blue, respectively) are completely different for the same set of arms. Thus, we have to adapt to the available preferences to solve the aforementioned problem. As noted later, the geometry of this cone plays a crucial role in determining the Pareto front.

In PrePEx, we consider the problem of Equation (1) when the mean matrix $M$ is unknown a priori but bounded, i.e., the entries of $M$ satisfy $M_{ij} \in [M_{\min}, M_{\max}]$. We denote the set of all such mean matrices by $\mathcal{M}$. Identifying the optimal policy leads us to identify the true Pareto front $\mathcal{P}^*$. In the noisy feedback setting, the reward at time $t$ is $R_t = M_{k_t} + \eta_t$, where $\eta_t \in \mathbb{R}^L$ is a noise vector. We assume that the noise vectors $\eta_t$ are independent of $M_{k_t}$ and also across time. Further, they are sub-Gaussian with parameter $\sigma$ and adapted to the filtration $\mathcal{F}_t$, which is a standard assumption in the literature. A policy $\pi \in \Pi$ is a randomized mapping from the history $\mathcal{H}_t$ to the probability simplex $\triangle_K$ over the set of arms $[K]$. In the Preference-based Pure Exploration (PrePEx) problem, the goal of the learner is to identify a Pareto optimal policy in $\Pi^*$ (Equation (1)), given an instance $M$ and a preference cone $\mathcal{C}$, while observing only noisy rewards from the arms, and using as few observations as possible.

Definition 2.6 ($(1-\delta)$-correct PrePEx). An algorithm for Preference-based Pure Exploration (PrePEx) is said to be $(1-\delta)$-correct if, with probability at least $1-\delta$, it recommends a Pareto optimal policy $\pi^* \in \Pi^*$.

For example, a Pareto optimal policy for $\mathcal{C}_{\pi/2}$ would be a distribution in $\triangle_K$ supported on the arms corresponding to the pink reward vectors (Figure 1). For $\mathcal{C}_{\pi/3}$, it would be one supported on the blue points. Finally, we make a standard assumption on the mean rewards.

Assumption 2.1 (Single-Parameter Exponential Family). Let $X = (X_1, \ldots, X_d)$ be a $d$-dimensional random vector with a distribution $P_\theta$, $\theta \in \Theta$. Suppose $X_1, \ldots, X_d$ are jointly continuous. The family of distributions $\{P_\theta, \theta \in \Theta\}$ is said to belong to the single-parameter exponential family if the density of $X$ may be represented in the form $f(x \mid \theta) \triangleq h(x)\exp(\eta(\theta) T(x) - \psi(\theta))$. We assume that the reward of each arm belongs to a single-parameter exponential family with variance bounded by 1.

3 Lower Bound on Sample Complexity

We begin by deriving a KL-divergence-based lower bound for PrePEx using techniques from (Garivier and Kaufmann, 2016). Our lower bound is based on a change-of-measure argument in the spirit of (Graves and Lai, 1997; Kaufmann et al., 2016). The lower bound is derived by first defining a set of alternating instances $\Lambda$ for a given bandit instance, and then computing an optimal allocation $w \in \triangle_K$ that maximises the sum of allocation-weighted KL-divergences, minimised over instances in $\Lambda$, between any instance in $\Lambda$ and the bandit instance under interaction. The key insight of our work is to formulate the Pareto set identification problem in the policy space, rather than in the arm space as done in the antecedent literature.
This policy-space formulation helps us derive the KL-based lower bound, which is more general than the existing suboptimality-gap-based lower bounds (Auer et al., 2016; Ararat and Tekin, 2023; Crépon et al., 2024).

The Alternating Instances with respect to Pareto Fronts. The learner needs to distinguish $M$ from all instances $\tilde M \in \mathcal{M} \setminus \{M\}$ whose associated Pareto front differs from the one associated with $M$. At first, given an optimal policy of $M$, say $\pi^*$, it would appear that the set of confusing instances is
$$\Lambda_{\pi^*}(M)^{\mathrm{naive}} \triangleq \Big\{\tilde M \in \mathcal{M} : \tilde M \pi^* \prec_{\mathcal{C}} \max_{\pi \in \Pi} \tilde M \pi\Big\}.$$
However, this is fallacious, since instances whose rewards dominate $M$ can also confuse a policy $\pi$. Given $\pi^*$, the correct alternating set is the set of instances in $\mathcal{M}$ whose Pareto optimal value is not dominated by that of $\pi^*$:
$$\Lambda_{\pi^*}(M) \triangleq \Big\{\tilde M \in \mathcal{M}\setminus\{M\} : \max_{\pi \in \Pi} \tilde M\pi \not\preceq_{\mathcal{C}} \tilde M\pi^*\Big\} = \Big\{\tilde M \in \mathcal{M}\setminus\{M\} : \exists z \in \mathcal{C} \text{ s.t. } \max_{\pi \in \Pi} z^\top \tilde M\pi > z^\top \tilde M\pi^*\Big\}.$$
With this new alternating set defined, we now establish lower bounds on the performance of any PrePEx algorithm.

Theorem 3.1 (Lower Bound). Given a bandit model $M \in \mathcal{M}$, a preference cone $\mathcal{C}$, and a confidence level $\delta \in [0, 1)$, the expected stopping time $\tau$ of any $(1-\delta)$-correct PrePEx algorithm to identify the Pareto Optimal Set satisfies
$$\mathbb{E}[\tau] \geq T_{M,\mathcal{C}}\, \log\Big(\frac{1}{2.4\,\delta}\Big), \tag{2}$$
where the expectation is taken over the stochasticity of both the algorithm and the bandit instance. Here, $T_{M,\mathcal{C}}$ is called the characteristic time of the PrePEx instance $(M, \mathcal{C})$ and is expressed as
$$(T_{M,\mathcal{C}})^{-1} \triangleq \sup_{w \in \triangle_K}\ \inf_{\substack{\pi \in \Pi\setminus\{\pi^*\} \\ \pi^* \in \Pi^*(M)}}\ \inf_{\tilde M \in \bar\Lambda_{\pi^*}(M)}\ \inf_{z \in \mathcal{C}}\ \sum_{k=1}^K w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big), \tag{3}$$
such that $\bar\Lambda_{\pi^*}(M) \triangleq \bigcup_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \exists z \in \mathcal{C},\ \langle \mathrm{vect}(z(\pi - \pi^*)^\top), \mathrm{vect}(\tilde M)\rangle = 0\big\}$.

Proof Intuition. First, we observe that an instance $\tilde M$ is in the alternating set if there exist $\pi \in \Pi\setminus\{\pi^*\}$ and $z \in \mathcal{C}$ such that $z^\top \tilde M(\pi - \pi^*) > 0$. If $\pi$ and $\pi^*$ were pure strategies playing arms $a$ and $a^*$, this would read exactly $\inf_{z \in \mathcal{C}\setminus\{0\}} z^\top(\tilde M_a - \tilde M_{a^*}) > 0$. Let us denote the $z$ achieving the infimum by $z_{\inf}$, i.e., the preference for which $\tilde M_a$ and $\tilde M_{a^*}$ are least distinguishable. Thus, we observe that $z_{\inf}^\top \tilde M_a$ functions exactly as the mean of arm $a$ in an instance $\tilde M$, whereas $z_{\inf}^\top(\tilde M_a - \tilde M_{a^*})$ acts as the suboptimality gap. Now, we extend this idea in the classical lower bound scheme to get a nested optimization problem with infima over $z \in \mathcal{C}$ and over $\tilde M$ in the alternating set, and a supremum over allocations $w \in \triangle_K$. We further show that the infimum over $\tilde M$ is attained at the boundary of the alternating set, which defines $\bar\Lambda_{\pi^*}(M)$.

Discussions. (i) Novelty: To the best of our knowledge, this is the first lower bound for PrePEx with fixed confidence with an explicit KL-based dependence. All existing lower bounds are gap-dependent, and valid either for a narrow range of mean vectors or for a known preference cone, i.e., the positive orthant. Our proof does not need such assumptions. The gap-dependent bounds are a special case of ours (cf. Theorem 3.2 for the case of Gaussian rewards). (ii) Geometric Insights: Theorem 3.1 provides multiple geometric insights into the effect of the ordering cone $\mathcal{C}$ on the characteristic time. First, the alternating set $\bar\Lambda_{\pi^*}(M)$ is piece-wise polyhedral and non-convex. We address the consequent issues in Section 4.1. Second, there is an additional minimization over the vectors lying in the cone $\mathcal{C}$. We interpret the minimization over vectors in the cone as an instance- and preference-dependent scalarization of the distance between the given instance $M$ and the corresponding most confusing instance in $\bar\Lambda_{\pi^*}(M)$. Third, in the proof, we show that the reward gap between the best policy $\pi^*$ and a given policy $\pi$ for the most confusing instance belongs to the polar cone $\mathcal{C}^\circ$ of the preference cone $\mathcal{C}$. The most confusing instance lies on the boundary of this polar cone, onto which the policy gap $(\pi^* - \pi)$ is projected. Further insights can be obtained by imagining the polar cone as orthogonal to the cone $\mathcal{C}$: the vector of reward gaps of the most confusing instance, for every objective, is orthogonal to the generating rays of $\mathcal{C}$. These novel geometric insights are complementary to the existing algebraic and statistical insights available in the lower bound literature (Kone et al., 2023a; Ararat and Tekin, 2023).
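To make the nested structure of (3) concrete, here is a small numerical sketch of the innermost objective, using the Gaussian simplification that also appears in the proof of Theorem 3.2 (Equation (12) in Appendix B.2). The sketch is ours: the grid over the cone and the two-generator representation are simplifying assumptions, not the paper's method.

```python
import numpy as np

def scalarized_kl(w, M, M_tilde, z, sigma2):
    """sum_k w_k sum_l z_l^2 (M_{lk} - M~_{lk})^2 / (2 sigma_l^2):
    the Gaussian form of the inner KL objective (cf. Eq. (12), Appendix B.2)."""
    per_obj = (z ** 2 / (2.0 * sigma2))[:, None] * (M - M_tilde) ** 2  # L x K
    return float(np.sum(w * per_obj.sum(axis=0)))

def inf_over_cone(w, M, M_tilde, V, sigma2, n_grid=201):
    """Approximate inf_{z in C \\ {0}} by gridding convex combinations of the
    cone's two generating rays (columns of V); z is normalized for stability."""
    vals = []
    for a in np.linspace(0.0, 1.0, n_grid):
        z = a * V[:, 0] + (1.0 - a) * V[:, 1]
        vals.append(scalarized_kl(w, M, M_tilde, z / np.linalg.norm(z), sigma2))
    return min(vals)

# Toy check: 2 objectives, 3 arms, cone = positive orthant.
M = np.array([[1.0, 0.5, 0.2], [0.2, 0.6, 1.0]])
M_tilde = M + np.array([[0.3], [-0.1]])   # a candidate confusing instance
w = np.array([0.4, 0.3, 0.3]); sigma2 = np.ones(2); V = np.eye(2)
print(inf_over_cone(w, M, M_tilde, V, sigma2))
```

The outer layers of (3), the infima over policies and instances and the supremum over allocations, would wrap around this inner evaluation.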
3.1 Characterization of Lower Bounds for Gaussians

To understand our lower bound better and to compare it with the literature, we present a reduction for Gaussian bandits. In Gaussian bandits, we assume that the reward vectors of arm $a \in [K]$ are generated from a multivariate Gaussian distribution $\mathcal{N}(\mu_a, \Sigma)$, where the covariance is a diagonal matrix: $\Sigma \triangleq \mathrm{Diag}(\sigma_1^2, \ldots, \sigma_L^2)$.

Theorem 3.2 (Lower Bound for Gaussian Bandits). 1. Given any $\pi^* \in \Pi^*(M)$ and $N(\pi^*)$ the set of neighbouring policies of $\pi^*$, the most confusing instance for $M$ belongs to the set
$$\Big\{\tilde M \in \mathcal{M}\setminus\{M\} : \tilde M_{k,\ell} = M_{k,\ell} - \beta\,\sigma_\ell^2\, \frac{(\pi^* - \pi)_k}{z_\ell\, w_k},\ \pi \in N(\pi^*),\ z \in \mathcal{C}\setminus\{0\},\ k \in [K],\ \ell \in [L]\Big\},$$
where $\beta \triangleq \dfrac{z^\top M (\pi^* - \pi)}{\mathrm{Tr}(\Sigma)\, \|\pi^* - \pi\|^2_{\mathrm{Diag}(1/w)}}$.

2. The inverse of the characteristic time, i.e., $(T^{\mathrm{Gauss}}_{M,\mathcal{C}})^{-1}$, for an instance $(M, \mathcal{C})$ is
$$\inf_{\substack{\pi \in N(\pi^*) \\ \pi^* \in \Pi^*(M)}}\ \min_{z \in \mathcal{C}\setminus\{0\}}\ \frac{\big(z^\top M(\pi^* - \pi)\big)^2}{2\,\mathrm{Tr}(\Sigma)\, \|\pi^* - \pi\|_2^2}.$$

Consequences. First, we observe an interesting phenomenon: a bilinear projection of the mean matrix $M$ on the preferences and the policy gaps operates as an extension of the suboptimality gap in classical BAI. This is reminiscent of the lower bound for pure exploration under known linear constraints of Carlsson et al. (2024), who show that the hardness of the problem depends only on the projection of the mean vector on the policy gap. In addition to a similar projection structure, preferences introduce a novel bilinearity here. Second, we show how the lower bound inflates with the covariance of each objective. This shows the richness of our KL-divergence-based lower bound, as opposed to gap-based bounds, which have difficulty accommodating variance-related terms directly.

Connection to existing results. Our result generalizes several existing lower bounds for BAI.

1. BAI lower bound. Our lower bound recovers that of Kaufmann et al. (2016) for the standard BAI problem with fixed confidence. In the case of standard BAI, the ordering cone is given by $\mathcal{C} \triangleq \mathbb{R}_+$, and therefore the minimization over $\mathcal{C}$ in (3) becomes redundant. The alternating set is then the set of instances that have a different optimal arm than $\mu$, which is exactly the set considered in (Kaufmann et al., 2016).

2. Pure exploration under known constraints. Our lower bound also recovers the lower bound of Carlsson et al. (2023) for the BAI problem with fixed confidence and linear constraints. This is the case $L = 1$ with the ordering cone $\mathcal{C} \triangleq \mathbb{R}_+$, making the minimization over $z \in \mathcal{C}$ in (3) redundant. $\bar\Lambda_{\pi^*}(M)$ becomes $\bar\Lambda_{\pi^*}(\mu) = \{\tilde\mu : \max_{\pi \in \Pi} \tilde\mu^\top \pi \geq \tilde\mu^\top \pi^*\}$, and we retrieve exactly their Corollary 1.
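Before moving to the algorithm, a small numerical sketch of the Gaussian characteristic-time inverse of Theorem 3.2(2). This is ours for illustration: the finite candidate sets standing in for $N(\pi^*)$ and $\mathcal{C}\setminus\{0\}$, and the toy instance, are simplifying assumptions.

```python
import numpy as np

def gauss_char_time_inv(M, pi_star, neighbours, Z, Sigma):
    """(T^Gauss)^{-1} from Theorem 3.2(2): minimize the squared bilinear gap
    (z' M (pi* - pi))^2 / (2 Tr(Sigma) ||pi* - pi||_2^2) over candidate
    neighbouring policies and cone vectors."""
    tr = np.trace(Sigma)
    best = np.inf
    for pi in neighbours:
        delta = pi_star - pi
        denom = 2.0 * tr * float(delta @ delta)
        for z in Z:
            best = min(best, float(z @ M @ delta) ** 2 / denom)
    return best

# Toy instance: 2 objectives, 3 arms; pi* plays arm 0, neighbours play arms 1, 2.
M = np.array([[1.0, 0.6, 0.2], [0.3, 0.5, 0.9]])
pi_star = np.array([1.0, 0.0, 0.0])
neighbours = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
Z = [np.array([1.0, 0.0]), np.array([0.7, 0.7]), np.array([0.0, 1.0])]  # C = R^2_+
Sigma = np.eye(2)
print(gauss_char_time_inv(M, pi_star, neighbours, Z, Sigma))
```

A smaller value of this quantity means a larger characteristic time, i.e., a harder instance: some preference direction makes the best and neighbouring policies nearly indistinguishable.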
4 Algorithm Design: PreTS

In this section, we propose an algorithm that tracks the lower bound. However, this is not straightforward, since the alternating set is non-convex. We first propose a convex relaxation of this set and then design a Track-and-Stop-style algorithm, called PreTS.

4.1 Convex Relaxation of the Lower Bound

One of the major differences in the structure of our lower bound, compared to a standard BAI problem, is that $\bar\Lambda_{\pi^*}(M)$ is a piece-wise polyhedron, i.e., a union of hyperplanes. Each hyperplane corresponds to a policy $\pi \in \Pi \setminus \{\pi^*\}$. To make the optimization problem tractable and obtain a convex program, we relax $\bar\Lambda_{\pi^*}(M)$ to its convex closure, denoted by $\mathrm{ch}(\bar\Lambda_{\pi^*}(M))$. We note that such a convex relaxation for Track-and-Stop (when the lower-bound problem is non-convex) has been constructed in the MDP setting by Al Marjani and Proutiere (2021). We define $\mathrm{ch}(\bar\Lambda_{\pi^*}(M))$ in Theorem 4.1 by formulating it as a disjunctive program, which we can reformulate further as a linear program (Balas, 1985).

Theorem 4.1. Let $F \triangleq \bigcup_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \exists z \in \mathcal{C},\ \langle \mathrm{vect}(z(\pi - \pi^*)^\top), \mathrm{vect}(\tilde M)\rangle = 0\big\}$. Fix $z \in \mathcal{C}$ such that $z = \sum_i \alpha_i v_i$, where the $v_i$ are the generating rays of $\mathcal{C}$. Then, we have $\mathrm{ch}(F) = I$, where $I$ is defined as
$$I \triangleq \Big\{\tilde M \in \mathcal{M} : \gamma^\top \mathrm{vect}(\tilde M) \leq \gamma_0,\ \gamma = \sum_i u_i \alpha_i\, \mathrm{vect}\big(v_i(\pi - \pi^*)^\top\big),\ \gamma_0 \leq u_i \sum_i \alpha_i v_i^\top \tilde M \pi^*\Big\}. \tag{4}$$

Using the convex hull (Eq. (4)), we quantify the optimal value for a given allocation $w$ as
$$V_{\mathcal{C}}(w, M) \triangleq \min_{\tilde M \in \mathrm{ch}(\bar\Lambda_{\pi^*}(M))}\ \inf_{z \in \mathcal{C}}\ \sum_{k=1}^K w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big).$$
The corresponding optimal allocation is
$$w^*(M) = \arg\max_{w \in \triangle_K}\ \inf_{\pi \in \Pi\setminus\{\pi^*\}}\ \min_{\tilde M \in \mathrm{ch}(\bar\Lambda_{\pi^*}(M))}\ \inf_{z \in \mathcal{C}}\ \sum_{k=1}^K w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big). \tag{5}$$
Hereafter, we consider Equation (5) as the optimization problem to be tracked. To compute $V_{\mathcal{C}}(w, M)$, we need access to the true instance $M$, which is not available to us. Our Track-and-Stop strategy is based on repeatedly sampling arms to construct an estimate $\hat M_t$ of $M$, and on exploiting continuity properties of $V_{\mathcal{C}}(w, M)$ to show that $V_{\mathcal{C}}(w, \hat M_t) \to V_{\mathcal{C}}(w, M)$ and that the cumulative number of arm plays tracks the optimal allocation, $N_{k,t}/t \to w_k$ for $w \in w^*(M)$. These properties ensure that it makes sense to design a Track-and-Stop-style algorithm for this problem.

Algorithm 1 Preference-based Track-and-Stop (PreTS)
1: Input: Confidence parameter $\delta$, preference cone $\mathcal{C}$
2: while $\sup_{\pi^*_t} \min_{\tilde M \in \mathrm{ch}(\bar\Lambda_{\pi^*_t}(\hat M_t))} \min_{z \in \mathcal{C}} \sum_k N_{k,t}\, d_{\mathrm{KL}}\big(z^\top \hat M_t^{(k)}, z^\top \tilde M^{(k)}\big) \leq \beta(t, \delta)$ do
3: Compute $w_t \leftarrow \arg\max_{w \in \triangle_K} V_{\mathcal{C}}(w, \hat M_t)$
4: Play $k_t \leftarrow \arg\min_{k \in [K]} \big(N_{k,t} - \sum_{s=1}^t w_{k,s}\big)$
5: Observe reward $R_t$
6: Update the empirical estimator $\hat M_t$
7: end while
8: Construct a Pareto front $\hat{\mathcal{P}}_\tau$ from the empirical means $\hat M_\tau$
9: Return: $\hat{\mathcal{P}}_\tau$

Theorem 4.2 (Analytical Properties). For all $M \in \mathbb{R}^{L \times K}$ and all preference cones $\mathcal{C}$, we have:
1. The mapping $(w, M) \mapsto V_{\mathcal{C}}(w, M)$ is continuous.
2. The characteristic time mapping $M \mapsto T_{M,\mathcal{C}}$ is continuous.
3. The set-valued function $M \mapsto w^*(M)$ is upper hemicontinuous.
4. The set $w^*(M)$ is convex.

Discussion: Cost of Convexification. For Gaussian bandits, as we can get the analytical form of the most confusing instance $\tilde M$ (Theorem 3.2), we do not pay any extra cost for convexification. In non-Gaussian settings, where we cannot find such analytical forms for the most confusing instances, the minimum value of the inner minimisation problem over the convex hull (Equation (5)) can be lower than the minimum value over the original non-convex set of instances (Equation (3)). Thus, the characteristic time attained by solving the convex relaxation might be higher than that of the original lower bound. Hence, an algorithm solving the convex relaxation has a higher stopping time. However, convexification is essential for the computational feasibility of a lower-bound-tracking algorithm for PrePEx. This computational-statistical trade-off will be interesting to study in the future.
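Since the discussion above relies on the analytical form of the most confusing instance for Gaussians, here is a short sketch, ours for illustration (the toy instance is arbitrary), of the closed form of Theorem 3.2(1), together with a check that the resulting instance sits on the boundary hyperplane $z^\top \tilde M(\pi^* - \pi) = 0$.

```python
import numpy as np

def most_confusing_gaussian(M, w, pi_star, pi, z, sigma2):
    """Closed form from Theorem 3.2(1):
    M~_{k,l} = M_{k,l} - beta * sigma_l^2 * (pi*-pi)_k / (w_k z_l),
    beta = z'M(pi*-pi) / (Tr(Sigma) ||pi*-pi||^2_{Diag(1/w)}); requires z_l != 0."""
    delta = pi_star - pi
    beta = float(z @ M @ delta) / (np.sum(sigma2) * np.sum(delta ** 2 / w))
    return M - beta * np.outer(sigma2 / z, delta / w)   # L x K perturbation of M

rng = np.random.default_rng(0)
M = rng.normal(size=(2, 3))                  # 2 objectives, 3 arms
w = np.array([0.5, 0.3, 0.2])
pi_star = np.array([1.0, 0.0, 0.0]); pi = np.array([0.0, 1.0, 0.0])
z = np.array([0.6, 0.8]); sigma2 = np.array([1.0, 1.0])
M_tilde = most_confusing_gaussian(M, w, pi_star, pi, z, sigma2)
print(z @ M_tilde @ (pi_star - pi))          # ~0: M~ lies on the boundary hyperplane
```

For Gaussian instances, this closed form replaces the inner minimisation over $\mathrm{ch}(\bar\Lambda_{\pi^*}(M))$, which is why no convexification cost is paid in that case.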
4.2 Algorithm: Preference-based Track-and-Stop (PreTS)

We now construct a general recipe to design a PrePEx algorithm when we do not have access to the true instance $M$. The fundamental element of any such recipe is constructing an estimate of $M$. For a given set of observed rewards $\{R_t\}_{t=1}^T$, we take the column-wise empirical average of the observed rewards and use it as our estimator of $M$. Now, we elaborate the three key components of our PrePEx algorithm, Preference-based Track-and-Stop (PreTS, Algorithm 1).

1. Sampling Rule: For the sampling rule, we consider a Track-and-Stop strategy (Garivier and Kaufmann, 2016). It tracks the optimal proportion of arm sampling by plugging the empirical estimates of the means and the empirical counts $N_{k,t}$ into the convexified lower bound. This leads to an allocation policy with improved information acquisition.

2. Stopping Rule: Our ultimate goal upon stopping is to identify the arms that are on the Pareto front. Based on this, we define the confidence set
$$c(t, \delta) \triangleq \Big\{\tilde M \in \mathcal{M} : \min_{z \in \mathcal{C}} \sum_k N_{k,t}\, d_{\mathrm{KL}}\big(z^\top \hat M_t^{(k)}, z^\top \tilde M^{(k)}\big) \leq \beta(t, \delta)\Big\}, \tag{6}$$
where $\beta(t, \delta) \triangleq \sum_{k \in [K]} 3\ln\big(1 + \ln(N_{k,t})\big) + K\,\overline{T}\big(\tfrac{\ln(1/\delta)}{K}\big)$ and $\overline{T}$ is defined in Equation (18). Our first claim shows that the true instance belongs to this confidence set with high probability.

Lemma 4.1 (Confidence Ball). For any $t \in \mathbb{N}$, with $c(t, \delta)$ defined in Equation (6), we have $\mathbb{P}\big(M \notin c(t, \delta)\big) \leq \delta$.

Thus, we can now formalise the corresponding Chernoff-type stopping rule as
$$\min_{\tilde M \in \mathrm{ch}(\bar\Lambda_{\pi^*}(\hat M_t))}\ \min_{z \in \mathcal{C}}\ \sum_k N_{k,t}\, d_{\mathrm{KL}}\big(z^\top \hat M_t^{(k)}, z^\top \tilde M^{(k)}\big) \geq \beta(t, \delta). \tag{7}$$
Given the estimates $\hat M_t$, the problem in Equation (7) can be solved efficiently. Next, we show that, upon stopping with Equation (7), PreTS returns the true Pareto front $\mathcal{P}^*$ with probability at least $1-\delta$. Let $\hat{\mathcal{P}}_t$ denote the estimated Pareto front at time $t$, which is constructed using the estimates $\hat M_t$. Then, at the stopping time $\tau$, we have
$$\mathbb{P}\big(\mathcal{P}^* \neq \hat{\mathcal{P}}_\tau\big) \leq \mathbb{P}\Big(\sum_{k} N_{k,\tau}\, d_{\mathrm{KL}}\big(z^\top \hat M_\tau^{(k)}, z^\top M^{(k)}\big) \geq \beta(\tau, \delta)\Big) \leq \delta,$$
where the last inequality holds due to Theorem 4.3, a concentration result on the KL-divergence with preference-projected mean rewards.

Theorem 4.3. For all $\rho \geq K+1$, $z \in \mathcal{C}$, and $t \in \mathbb{N}$, we have
$$\mathbb{P}\Big(\sum_{k \in [K]} N_{k,t}\, d_{\mathrm{KL}}\big(z^\top \hat M_t^{(k)}, z^\top M^{(k)}\big) \geq \rho\Big) \leq \exp(-\rho)\,\Big(\frac{\rho\,\log(t)}{K}\Big)^K \exp(K+1).$$

3. Recommendation Rule: At the stopping time $\tau$, the algorithm returns the estimate $\hat{\mathcal{P}}_\tau$ of the Pareto front.

5 Upper Bound on Sample Complexity

Now, we prove an upper bound on the expected sample complexity of PreTS. This requires us to extend the Track-and-Stop proof technique. The challenge is to show concentration of the Pareto fronts under a suitable metric.

Concentrating to the Pareto Front. To show that the algorithm returns the true Pareto front upon stopping, we need to establish a valid metric under which such convergence holds. Usually, the distance between sets is measured using the Hausdorff metric (Costantini and Vitolo, 1995), i.e.,
$$d_H(\hat{\mathcal{P}}_\tau, \mathcal{P}) \triangleq \max\Big\{\sup_{k \in \hat{\mathcal{P}}_\tau} \inf_{k' \in \mathcal{P}} \|M_k - M_{k'}\|,\ \sup_{k' \in \mathcal{P}} \inf_{k \in \hat{\mathcal{P}}_\tau} \|M_k - M_{k'}\|\Big\}.$$
But the Hausdorff distance only defines a pseudo-distance between sets, and the space of Pareto fronts $\mathcal{Z}$ may not be closed under this metric. To circumvent this issue, we build upon the notion of a gap-based metric considered in the antecedent literature (Auer et al., 2016) to measure the distance between the mean reward of an arm and a given Pareto front. We extend it to a distance metric between elements in the space of Pareto fronts $\mathcal{Z}$.

Definition 5.1 (Distance from Pareto Front). The distance of the mean of arm $k$ from the Pareto front $\mathcal{P}^*$ is $d(k, \mathcal{P}^*) \triangleq \inf\{\varepsilon \geq 0 : M_k + \varepsilon \mathbf{1} \not\prec_{\mathcal{C}} M_{k'},\ \forall k' \in \mathcal{P}^*\}$. Equivalently,
$$d(k, \mathcal{P}^*) \triangleq \inf_{k' \in \mathcal{P}^*} \max\Big\{0,\ \sup_{z \in \mathcal{C} \cap B(1)} z^\top (M_{k'} - M_k)\Big\},$$
where $B(1)$ denotes the unit ball.

Definition 5.2 (Distance between Pareto Fronts). We define the metric between Pareto fronts $d_P(\cdot, \cdot) : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}_{\geq 0}$ as $d_P(\hat{\mathcal{P}}, \mathcal{P}^*) \triangleq \max\big\{\sup_{k \in \hat{\mathcal{P}}} d(k, \mathcal{P}^*),\ \sup_{k \in \mathcal{P}^*} d(k, \hat{\mathcal{P}})\big\}$.
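A direct transcription of Definitions 5.1 and 5.2, ours for illustration: the supremum over $\mathcal{C} \cap B(1)$ is approximated by a finite set `Z` of unit-norm cone vectors, which is an assumption for the sketch.

```python
import numpy as np

def arm_to_front(k, front, M, Z):
    """Definition 5.1: d(k, P) = inf_{k' in P} max{0, sup_{z in C, ||z||<=1} z'(M_k' - M_k)}."""
    return min(max(0.0, max(float(z @ (M[:, kp] - M[:, k])) for z in Z))
               for kp in front)

def front_distance(front_hat, front_star, M, Z):
    """Definition 5.2: d_P = max of the two one-sided arm-to-front distances."""
    return max(max(arm_to_front(k, front_star, M, Z) for k in front_hat),
               max(arm_to_front(k, front_hat, M, Z) for k in front_star))

# Example: true front {0, 1} vs. an estimated front {0, 2}, cone = positive orthant.
M = np.array([[1.0, 0.4, 0.35], [0.2, 0.9, 0.85]])
Z = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
     np.array([1.0, 1.0]) / np.sqrt(2.0)]
print(front_distance([0, 2], [0, 1], M, Z))
```

Unlike the raw Hausdorff distance between mean vectors, this gap-based distance is clipped at zero along directions in which an arm already dominates, which is what makes the space of Pareto fronts well-behaved under it.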
In the appendix, we establish that (i) $d_P(\cdot, \cdot)$ is a valid metric on $\mathcal{Z}$, and (ii) $\mathcal{Z}$ is compact and complete under $d_P(\cdot, \cdot)$. Now, we leverage this metric to show that the Pareto front defined by the arm-wise constructed estimator $\hat M_t$ concentrates towards the true Pareto front.

Theorem 5.1 (Concentration of mean estimates). For any pair $(i, j) \in [K] \times [K]$ and $z \in \mathcal{C}$, we have
$$\big|z^\top(M_i - M_j) - z^\top(\hat M_{i,t} - \hat M_{j,t})\big| \leq \beta_{ij}(t),$$
where $\beta^2_{ij}(t) \triangleq 4\|z\|_1^2 \Big(\sum_{a \in \{i,j\}} \log\big(4 + \log(N_a(t))\big) + h\big(\log\tfrac{2K_1}{\delta}\big)\Big) \sum_{a \in \{i,j\}} \tfrac{1}{N_a(t)}$, $K_1 \triangleq \tfrac{K(K-1)}{2}$, and $h(x) \triangleq x + \log(x)$.

Proof Sketch. This is a consequence of jointly applying a vectorial concentration result for the multiple objectives of each arm (Kaufmann and Koolen, 2021) and pairwise time-uniform concentration bounds (Kone et al., 2023a). A key observation here is that the confidence radius depends on the magnitude of the preference vector $z$ and scales with the different objectives accordingly.

Sample Complexity of PreTS. Using this new concentration result for the Pareto front and the stopping rule in Equation (7), we derive an upper bound on the expected stopping time of PreTS.

Theorem 5.2 (Upper Bound on Sample Complexity). For any $\alpha > 0$ and $c(t, \delta)$ defined in (6), the stopping time $\tau$ satisfies
$$\lim_{\delta \to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \leq \alpha\, T_{M,\mathcal{C}}, \quad \forall M \in \mathbb{R}^{L \times K}.$$
The basic outline of the proof follows the general strategy for proving Track-and-Stop results. However, the new arguments lie in establishing that the Pareto fronts converge sufficiently fast under a suitable metric. Our proof implies that PreTS matches the convex relaxation of the lower bound asymptotically at the corresponding risk level $\delta$. Strictly speaking, this is not asymptotically optimal, since we do not track the exact lower bound.

6 Conclusion and Future Works

We study the fixed-confidence version of the preference-based pure exploration problem under stochastic bandit feedback with vector-valued rewards, where each arm corresponds to a reward vector ordered according to a preference cone. We derive a novel lower bound for this problem. We leverage the lower bound further to derive a track-and-stop-based algorithm for the PrePEx problem. As future work, it would be interesting to validate our results on real-world datasets. Additionally, it would be interesting and challenging to study how other asymptotically optimal pure exploration strategies, e.g., gamified explorers (Degenne and Koolen, 2019) and top-two algorithms (Jourdan et al., 2022), can be adapted to this setting. In general, improving the computational efficiency and studying the optimality gap with respect to the non-convex lower-bound problem would be of fundamental interest.

References

Al Marjani, A. and Proutiere, A. (2021). Adaptive sampling for best policy identification in Markov decision processes. In International Conference on Machine Learning, pages 7459-7468. PMLR.

Ararat, Ç. and Tekin, C. (2023). Vector optimization with stochastic bandit feedback. In International Conference on Artificial Intelligence and Statistics, pages 2165-2190. PMLR.

Audibert, J.-Y. and Bubeck, S. (2010). Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT).

Auer, P., Chiang, C.-K., Ortner, R., and Drugan, M. (2016). Pareto front identification from stochastic bandit feedback. In Artificial Intelligence and Statistics, pages 939-947. PMLR.

Balas, E. (1985). Disjunctive programming and a hierarchy of relaxations for discrete optimization problems. SIAM Journal on Algebraic Discrete Methods, 6(3):466-486.

Berge, C. (1963). Topological Spaces: Including a Treatment of Multi-valued Functions, Vector Spaces and Convexity. Oliver & Boyd.

Bubeck, S., Munos, R., and Stoltz, G. (2009).
Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory: 20th International Conference, ALT 2009, Porto, Portugal, October 3-5, 2009, Proceedings, pages 23-37. Springer.

Carlsson, E., Basu, D., Johansson, F., and Dubhashi, D. (2024). Pure exploration in bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pages 334-342. PMLR.

Carlsson, E., Basu, D., Johansson, F. D., and Dubhashi, D. (2023). Pure exploration in bandits with linear constraints. arXiv preprint arXiv:2306.12774.

Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. (2013). Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225-240. Springer.

Costantini, C. and Vitolo, P. (1995). On the infimum of the Hausdorff metric topologies. Proceedings of the London Mathematical Society, 3(2):441-480.

Crépon, É., Garivier, A., and Koolen, W. M. (2024). Sequential learning of the Pareto front for multi-objective bandits. In Dasgupta, S., Mandt, S., and Li, Y., editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 3583-3591. PMLR.

Degenne, R. and Koolen, W. M. (2019). Pure exploration with multiple correct answers. In Neural Information Processing Systems.

Donsker, M. D. and Varadhan, S. S. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1-47.

Drugan, M. M. and Nowé, A. (2013). Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE.

Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079-1105.

Gabillon, V., Ghavamzadeh, M., and Lazaric, A. (2012). Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212-3220.

Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998-1027.

Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M., Hersey, A., Light, Y., McGlinchey, S., Michalovich, D., Al-Lazikani, B., et al. (2012). ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1):D1100-D1107.

Graves, T. L. and Lai, T. L. (1997). Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715-743.

Hasselgren, C. and Oprea, T. I. (2024). Artificial intelligence for drug discovery: Are we there yet? Annual Review of Pharmacology and Toxicology, 64:527-550.

Jahn, J. et al. (2009). Vector Optimization. Springer.

Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. (2014). lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423-439.

Jamieson, K. and Nowak, R. (2014). Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1-6. IEEE.

Jayatunga, M. K., Xie, W., Ruder, L., Schulze, U., and Meier, C. (2022).
AI in small-molecule drug discovery: a coming wave. Nature Reviews Drug Discovery, 21(3):175-176.

Jedra, Y. and Proutiere, A. (2020). Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33:10007-10017.

Jourdan, M., Degenne, R., Baudry, D., de Heide, R., and Kaufmann, E. (2022). Top two algorithms revisited. Advances in Neural Information Processing Systems, 35:26791-26803.

Kabanov, Y. (2009). Markets with Transaction Costs: Mathematical Theory. Springer.

Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. (2012). PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655-662.

Karagözlü, E. M., Yıldırım, Y. C., Ararat, Ç., and Tekin, C. (2024). Learning the Pareto set under incomplete preferences: Pure exploration in vector bandits. In International Conference on Artificial Intelligence and Statistics, pages 3070-3078. PMLR.

Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1-42.

Kaufmann, E. and Koolen, W. M. (2021). Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22(246):1-44.

Kone, C., Kaufmann, E., and Richert, L. (2023a). Adaptive algorithms for relaxed Pareto set identification. Advances in Neural Information Processing Systems, 36:35190-35201.

Kone, C., Kaufmann, E., and Richert, L. (2023b). Bandit Pareto set identification: the fixed budget setting. arXiv preprint arXiv:2311.03992.

Korkmaz, İ. O., Ararat, Ç., and Tekin, C. (2023). Bayesian vector optimization with Gaussian processes.

Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1-52.

Lizotte, D. J. and Laber, E. B. (2016). Multi-objective Markov decision processes for data-driven decision support. Journal of Machine Learning Research, 17(210):1-28.

Löhne, A. (2011). Vector Optimization with Infimum and Supremum. Springer Science & Business Media.

Lu, S., Wang, G., Hu, Y., and Zhang, L. (2019). Multi-objective generalized linear bandits. arXiv preprint arXiv:1905.12879.

Magureanu, S., Combes, R., and Proutiere, A. (2014). Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975-999. PMLR.

Mullard, A. (2014). New drugs cost US $2.6 billion to develop. Nature Reviews Drug Discovery, 13(12).

Munro, A. P., Janani, L., Cornelius, V., Aley, P. K., Babbage, G., Baxter, D., Bula, M., Cathie, K., Chatterjee, K., Dodd, K., et al. (2021). Safety and immunogenicity of seven COVID-19 vaccines as a third dose (booster) following two doses of ChAdOx1 nCoV-19 or BNT162b2 in the UK (COV-BOOST): a blinded, multicentre, randomised, controlled, phase 2 trial. The Lancet, 398(10318):2258-2276.

Peskun, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika, 60(3):607-612.

Reker, D. and Schneider, G. (2015). Active-learning strategies in computer-assisted drug discovery. Drug Discovery Today, 20(4):458-465.

Sadybekov, A. V. and Katritch, V. (2023). Computational approaches streamlining drug discovery. Nature, 616(7958):673-685.

Shukla, A. (2022). Optimization and Learning in Dynamic Environments: Models and Algorithms. PhD thesis, Columbia University.
Smer-Barreto, V., Quintanilla, A., Elliott, R. J., Dawson, J. C., Sun, J., Campa, V. M., Lorente-Macías, Á., Unciti-Broceta, A., Carragher, N. O., Acosta, J. C., et al. (2023). Discovery of senolytics using machine learning. Nature Communications, 14(1):3445.

Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828-836.

Sun, D., Gao, W., Hu, H., and Zhou, S. (2022). Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 12(7):3049-3062.

Tekin, C. and Turgay, E. (2017). Multi-objective contextual bandits with a dominant objective. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1-6. IEEE.

Turgay, E., Oner, D., and Tekin, C. (2018). Multi-objective contextual bandit problem with similarity information. In International Conference on Artificial Intelligence and Statistics, pages 1673-1681. PMLR.

Wang, J., Basu, D., and Trummer, I. (2022). Procrastinated tree search: Black-box optimization with delayed, noisy, and multi-fidelity feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10381-10390.

Wang, J., Trummer, I., and Basu, D. (2021). Demonstrating UDO: A unified approach for optimizing transaction code, physical design, and system parameters via reinforcement learning. In Proceedings of the 2021 International Conference on Management of Data, pages 2794-2797.

Zuluaga, M., Krause, A., et al. (2016). e-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(104):1-32.

Table of Contents

A Notations
B Proofs of Lower Bounds
  B.1 Generic Lower Bound: Proof of Theorem 3.1
  B.2 Lower Bound for Gaussians: Proof of Theorem 3.2
  B.3 Proof of Theorem 4.1
  B.4 Proof of Theorem 4.2
C Proof of the Stopping Time
  C.1 Proof of Lemma 4.1
  C.2 Proof of Theorem 4.3
D Proofs for Sample Complexity Upper Bound
  D.1 Pairwise Concentration Bounds
  D.2 Concentration to the Pareto Front
  D.3 Proof of Theorem 5.2
E Reduction to Best-arm Identification
F Useful Existing Results

A Notations

- $\mathcal{C}$, $\mathcal{C}\setminus\{0\}$: given convex cone and the induced partial order
- $K$, $L$: number of arms and objectives
- $\mathcal{P}^*$, $\hat{\mathcal{P}}_t$: ground-truth Pareto set and estimated Pareto set
- $\mathcal{Z}$: space of all Pareto fronts on $[0, 1]^K$
- $M \in \mathbb{R}^{L \times K}$: matrix with the mean rewards of the $K$ arms
- $R_{k_t}$, $M_{k_t}$, $\eta_t$: observed reward, mean reward, and noise
- $c(t, \delta)$: confidence ball at time $t$ with confidence $\delta$
- $d_H(X, Y)$: Hausdorff distance between sets $X$ and $Y$
- $d_P(\mathcal{P}, \hat{\mathcal{P}})$: distance metric between Pareto fronts $\mathcal{P}$ and $\hat{\mathcal{P}}$
- $w$: allocation vector
- $\Pi$: family of policies
- $\hat M_{\ell,k,t}$, $M_{\ell,k}$: estimated and true mean rewards
- $\mathrm{ch}(S)$: convex hull of set $S$
- $\Lambda_{\pi^*}(M)$: set of alternating instances associated with $M$

Table 1: Table of Notations

B Proofs of Lower Bounds

B.1 Generic Lower Bound: Proof of Theorem 3.1

Proof. The proof follows the basic structure of constructing a lower bound as in Kaufmann et al. (2016).
Recall that their inverse of the characteristic time is given by
$$(T^*)^{-1} = \sup_{w \in \triangle_K}\ \inf_{\tilde M \in \Lambda_{\pi^*}(M)}\ \sum_k w_k\, d_{\mathrm{KL}}\big(M_k, \tilde M_k\big). \tag{9}$$
The main challenges in our setting are describing the alternating set $\Lambda_{\pi^*}(M)$ and the scalarization of a given instance $\tilde M \in \mathcal{M}$. Given a matrix of arm-objective mean rewards $M$, an ordering cone $\mathcal{C}$, and a family of policies $\Pi$, the Pareto front is the set of optimal values (ordered w.r.t. $\mathcal{C}$) of
$$\max_{\pi \in \Pi} M\pi. \tag{10}$$

Step 1: Constructing the set of alternative instances. Let $\pi^* \in \arg\max_{\pi \in \Pi} M\pi$. The set of confusing instances, given $\Pi$ and $\mathcal{C}$, is the set of all matrices $\tilde M$ that have a different Pareto front than $M$ when using the policy $\pi^*$. Therefore, the optimal values of (10) under instance $\tilde M$ are not dominated by those of instance $M$. Hence, the set of alternative instances $\Lambda_{\pi^*}(M)$ is given by
$$\Lambda_{\pi^*}(M) := \Big\{\tilde M \in \mathcal{M} : \max_{\pi \in \Pi} \tilde M\pi \not\preceq_{\mathcal{C}} \tilde M\pi^*\Big\}.$$
Let $\tilde\pi \in \arg\max_{\pi \in \Pi} \tilde M\pi$ over $\mathcal{C}$, which implies $\tilde M\tilde\pi \not\preceq_{\mathcal{C}} \tilde M\pi^*$, or equivalently:
$$\exists z \in \mathcal{C},\ \pi \in \Pi\setminus\{\pi^*\}\ \text{s.t.}\ z^\top \tilde M\pi > z^\top \tilde M\pi^*.$$
Therefore, the alternative set can be written as
$$\Lambda_{\pi^*}(M) \triangleq \bigcup_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \exists z \in \mathcal{C} : z^\top \tilde M\pi > z^\top \tilde M\pi^*\big\} = \bigcup_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \exists z \in \mathcal{C} : z^\top \tilde M(\pi - \pi^*) > 0\big\},$$
where $z^\top \tilde M(\pi - \pi^*)$ is a bilinear product, and its complement is given by
$$\Lambda_{\pi^*}(M)^\complement \triangleq \bigcap_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \forall z \in \mathcal{C} : z^\top \tilde M(\pi - \pi^*) \leq 0\big\} = \bigcap_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \tilde M(\pi - \pi^*) \in \mathcal{C}^\circ\big\},$$
where $\mathcal{C}^\circ$ denotes the polar cone of $\mathcal{C}$ and $\mathrm{ri}(\mathcal{C}^\circ)$ its relative interior. Since $\mathcal{C}$ is a polyhedral cone, it is closed and convex, and therefore its polar cone is non-empty, closed, and convex. Therefore, $\Lambda_{\pi^*}(M)^\complement$ is non-empty.

Step 2: The hardest instance lies on the boundary. We now show that, given $\pi^*$ and $\pi$, the hardest instances $\tilde M \in \mathcal{M}$ are such that $\tilde M(\pi - \pi^*) \in \mathrm{bd}(\mathcal{C}^\circ)$. Fix $\pi \in \Pi \setminus \{\pi^*\}$ and let $\tilde M \in \{M' \in \mathcal{M} : M'(\pi - \pi^*) \in \mathrm{ri}(\mathcal{C}^\circ)\}$. Then, by convexity of $\{M' \in \mathcal{M} : \exists z \in \mathcal{C} : z^\top M'(\pi - \pi^*) > 0\}$, there exists $\bar M \in \{M' \in \mathcal{M} : M'(\pi - \pi^*) \in \mathrm{bd}(\mathcal{C}^\circ)\}$ such that
$$\big|z^\top M_k - z^\top \bar M_k\big| \leq \big|z^\top M_k - z^\top \tilde M_k\big|, \quad \forall z \in \mathcal{C}.$$
Since $d_{\mathrm{KL}}(z^\top M_k, \cdot)$ decreases as its second argument approaches $z^\top M_k$, we have
$$d_{\mathrm{KL}}\big(z^\top M_k, z^\top \bar M_k\big) \leq d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big).$$
Using the above arguments, we see that the minimizer of $\sum_k w_k\, d_{\mathrm{KL}}(z^\top M_k, z^\top \tilde M_k)$ is such that $\tilde M \in \{M' \in \mathcal{M} : M'(\pi - \pi^*) \in \mathrm{bd}(\mathcal{C}^\circ)\}$. The optimization problem now becomes
$$\sup_{w \in \triangle_K}\ \inf_{\tilde M \in \Lambda_{\pi^*}(M)}\ \inf_{z \in \mathcal{C}}\ \sum_k w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big) = \sup_{w \in \triangle_K}\ \inf_{\pi \in \Pi\setminus\{\pi^*\}}\ \min_{\tilde M \in \bar\Lambda_{\pi^*}(M)}\ \inf_{z \in \mathcal{C}}\ \sum_{k=1}^K w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big),$$
where
$$\bar\Lambda_{\pi^*}(M) \triangleq \bigcup_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \exists z \in \mathcal{C},\ z^\top \tilde M(\pi - \pi^*) = 0\big\} = \bigcup_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \exists z \in \mathcal{C},\ \langle \mathrm{vect}(z(\pi - \pi^*)^\top), \mathrm{vect}(\tilde M)\rangle = 0\big\}. \tag{11}$$
Finally, to accommodate multiple optimal policies $\pi^*$, the inner minimisation problem becomes
$$\sup_{w \in \triangle_K}\ \inf_{\substack{\pi \in \Pi\setminus\{\pi^*\} \\ \pi^* \in \Pi^*(M)}}\ \min_{\tilde M \in \bar\Lambda_{\pi^*}(M)}\ \inf_{z \in \mathcal{C}}\ \sum_{k=1}^K w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big).$$
This concludes the proof.
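As a worked instance of Step 2, ours for illustration only: take $L = 2$, $\mathcal{C} = \mathbb{R}^2_+$, and pure policies $\pi^* = e_1$ and $\pi = e_2$, so that $\tilde M(\pi - \pi^*) = \tilde M_2 - \tilde M_1$ is simply the vector of per-objective gaps between the two arms.

```latex
% C = R^2_+ has polar cone C° = {y : y^T x <= 0 for all x in C} = -R^2_+.
% With pi* = e_1 and pi = e_2, the complement condition of Step 1 reads
\tilde{M}(\pi - \pi^*) \;=\; \tilde{M}_2 - \tilde{M}_1 \;\in\; \mathcal{C}^\circ = -\mathbb{R}^2_+,
% i.e. arm 2 stays (weakly) dominated by arm 1 in both objectives.
% The boundary case singled out in Step 2 is
\tilde{M}_2 - \tilde{M}_1 \in \mathrm{bd}(\mathcal{C}^\circ)
\iff \tilde{M}_2 - \tilde{M}_1 \le 0 \text{ componentwise, with }
\tilde{M}_2^{(\ell)} = \tilde{M}_1^{(\ell)} \text{ for some } \ell.
```

So the hardest alternative raises the suboptimal arm until it ties the optimal arm in one objective while remaining dominated, which is exactly where the KL cost of confusion is smallest.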
B.2 Lower Bound for Gaussians: Proof of Theorem 3.2

Proof. Step 1. Simplifying the KL-divergence for a Gaussian bandit instance yields
$$V_{\mathcal{C}}(w, M) = \min_{\tilde M \in \bar\Lambda_{\pi^*}(M)}\ \min_{z \in \mathcal{C}}\ \sum_{k=1}^K w_k \sum_{\ell=1}^L \frac{(z^{(\ell)})^2 \big(\mu_k^{(\ell)} - \tilde M_k^{(\ell)}\big)^2}{2\sigma_\ell^2}. \tag{12}$$
Recalling that, due to the projection lemma, the $\tilde M$ achieving the minimum satisfies
$$z^\top \sum_{k=1}^K \tilde M_k\, (\pi^*_k - \pi_k) = 0, \quad z \in \mathcal{C}, \tag{13}$$
we formulate the Lagrangian of (12) with dual variable $\beta$ for (13) as
$$\mathcal{L}\big(w, \tilde M, \beta\big) \triangleq \sum_{k=1}^K w_k \sum_{\ell=1}^L \frac{(z^{(\ell)})^2 \big(\mu_k^{(\ell)} - \tilde M_k^{(\ell)}\big)^2}{2\sigma_\ell^2} + \beta\, z^\top \tilde M (\pi^* - \pi). \tag{14}$$

Step 2. Taking the derivative w.r.t. $\tilde M_k^{(\ell)}$, we have
$$\frac{\partial \mathcal{L}}{\partial \tilde M_k^{(\ell)}} = -\frac{w_k (z^{(\ell)})^2 \big(\mu_k^{(\ell)} - \tilde M_k^{(\ell)}\big)}{\sigma_\ell^2} + \beta z^{(\ell)} (\pi^* - \pi)_k,$$
and setting it to zero, we get
$$\frac{w_k (z^{(\ell)})^2 \big(\mu_k^{(\ell)} - \tilde M_k^{(\ell)}\big)}{\sigma_\ell^2} = \beta z^{(\ell)} (\pi^* - \pi)_k \implies \tilde M_k^{(\ell)} = \mu_k^{(\ell)} - \frac{\sigma_\ell^2\, \beta\, (\pi^* - \pi)_k}{w_k\, z^{(\ell)}}. \tag{15}$$
The last equality holds for any $z^{(\ell)} \neq 0$. Therefore, the Lagrangian now becomes
$$\mathcal{L} = \frac{\beta^2}{2} \sum_{\ell=1}^L \sigma_\ell^2 \sum_{k=1}^K \frac{(\pi^* - \pi)_k^2}{w_k} + \beta \sum_{\ell=1}^L z^{(\ell)} \sum_{k=1}^K \Big(\mu_k^{(\ell)} - \frac{\sigma_\ell^2 \beta (\pi^* - \pi)_k}{w_k z^{(\ell)}}\Big)(\pi^* - \pi)_k = -\frac{\beta^2}{2} \sum_{\ell=1}^L \sigma_\ell^2 \sum_{k=1}^K \frac{(\pi^* - \pi)_k^2}{w_k} + \beta \sum_{\ell=1}^L \sum_{k=1}^K z^{(\ell)} (\pi^* - \pi)_k\, \mu_k^{(\ell)}.$$

Step 3. By taking the derivative with respect to $\beta$, we have
$$\frac{\partial \mathcal{L}}{\partial \beta} = -\beta \sum_{\ell=1}^L \sigma_\ell^2 \sum_{k=1}^K \frac{(\pi^* - \pi)_k^2}{w_k} + \sum_{\ell=1}^L \sum_{k=1}^K z^{(\ell)} (\pi^* - \pi)_k\, \mu_k^{(\ell)},$$
and setting it to zero leads to
$$\beta = \frac{\sum_{\ell=1}^L \sum_{k=1}^K z^{(\ell)} (\pi^* - \pi)_k\, \mu_k^{(\ell)}}{\sum_{k=1}^K \sum_{\ell=1}^L \sigma_\ell^2 \frac{(\pi^* - \pi)_k^2}{w_k}} = \frac{z^\top M \Delta}{\sigma_o^2\, \|\Delta\|^2_{\mathrm{Diag}(1/w)}}, \tag{16}$$
where $\sigma_o^2 \triangleq \sum_{\ell=1}^L \sigma_\ell^2$ and $\Delta \triangleq \pi^* - \pi$. From (15), excluding the degenerate case $z = 0$, we get
$$\tilde M_k^{(\ell)} = \mu_k^{(\ell)} - \frac{\sigma_\ell^2\, \beta\, \Delta_k}{w_k\, z^{(\ell)}} = \mu_k^{(\ell)} - \frac{\sigma_\ell^2\, \Delta_k}{w_k\, z^{(\ell)}}\, \frac{z^\top M \Delta}{\sigma_o^2\, \|\Delta\|^2_{\mathrm{Diag}(1/w)}}.$$
Finally, plugging $\beta$ back into the Lagrangian (14) leads to
$$\mathcal{L} = -\frac{1}{2}\, \frac{(z^\top M \Delta)^2}{\sigma_o^2\, \|\Delta\|^2_{\mathrm{Diag}(1/w)}} + \frac{(z^\top M \Delta)^2}{\sigma_o^2\, \|\Delta\|^2_{\mathrm{Diag}(1/w)}} = \frac{(z^\top M \Delta)^2}{2\, \sigma_o^2\, \|\Delta\|^2_{\mathrm{Diag}(1/w)}}.$$
Thus, the inverse of the characteristic time is given by
$$\max_{w \in \triangle_K}\ \inf_{\substack{\pi \in N(\pi^*) \\ \pi^* \in \Pi^*(M)}}\ \min_{z \in \mathcal{C}\setminus\{0\}}\ \frac{(z^\top M \Delta)^2}{2\, \sigma_o^2\, \|\Delta\|^2_{\mathrm{Diag}(1/w)}} = \inf_{\substack{\pi \in N(\pi^*) \\ \pi^* \in \Pi^*(M)}}\ \min_{z \in \mathcal{C}\setminus\{0\}}\ \frac{(z^\top M \Delta)^2}{2\, \sigma_o^2\, \|\Delta\|_2^2}.$$

B.3 Proof of Theorem 4.1

This proof follows directly from Theorem 3.1 in Balas (1985). Recall that the set $F$ is given by
$$F \triangleq \bigcup_{\pi \in \Pi\setminus\{\pi^*\}} \big\{\tilde M \in \mathcal{M} : \langle \mathrm{vect}(z(\pi - \pi^*)^\top), \mathrm{vect}(\tilde M)\rangle = 0\big\}.$$
For any $z \in \mathcal{C}$, we have $z = \sum_i \alpha_i v_i$. Rewriting every hyperplane in $F$ as
$$P_\pi = \Big\{\tilde M \in \mathcal{M} : \big\langle \mathrm{vect}\big(\textstyle\sum_i \alpha_i v_i (\pi - \pi^*)^\top\big), \mathrm{vect}(\tilde M)\big\rangle = \textstyle\sum_i \alpha_i v_i^\top \tilde M \pi^*\Big\},$$
and applying Theorem 3.1 in Balas (1985), $\mathrm{ch}(F)$ is given by
$$\mathrm{ch}(F) = \Big\{\tilde M \in \mathcal{M} : \gamma^\top \mathrm{vect}(\tilde M) \leq \gamma_0,\ \gamma = \sum_i u_i \alpha_i\, \mathrm{vect}\big(v_i(\pi - \pi^*)^\top\big),\ \gamma_0 \leq u_i \sum_i \alpha_i v_i^\top \tilde M \pi^*\Big\}.$$

B.4 Proof of Theorem 4.2

Recall that
$$V_{\mathcal{C}}(w, M) \triangleq \min_{\tilde M \in \mathrm{ch}(\bar\Lambda_{\pi^*}(M))}\ \inf_{z \in \mathcal{C}}\ \sum_{k=1}^K w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big), \qquad w^*(M) \triangleq \arg\max_{w \in \triangle_K}\ \inf_{\pi \in \Pi\setminus\{\pi^*\}}\ \min_{\tilde M \in \mathrm{ch}(\bar\Lambda_{\pi^*}(M))}\ \min_{z \in \mathcal{C}}\ \sum_{k=1}^K w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big).$$
For (1) and (2), observe that $(z, M) \mapsto z^\top M$ and $(z, M, \tilde M) \mapsto d_{\mathrm{KL}}(z^\top M, z^\top \tilde M)$ are continuous maps for all $(z, M) \in \mathcal{C} \times \mathcal{M}$ and $(z, M, \tilde M) \in \mathcal{C} \times \mathcal{M} \times \mathrm{ch}(\bar\Lambda_{\pi^*}(M))$. Further, $\sum_k w_k\, d_{\mathrm{KL}}(z^\top M_k, z^\top \tilde M_k)$ is continuous in all its arguments. Fix a sequence $(w_t, z_t, M_t) \in \triangle_K \times \mathcal{C} \times \mathcal{M}$ such that $(w_t, z_t, M_t) \to (w, z, M)$. For any $\epsilon$, there exists $t' \geq 1$ such that $\|(w_t, z_t, M_t) - (w, z, M)\| \leq \epsilon$ for all $t \geq t'$. Further, $\mathrm{ch}(\Lambda_{\pi^*}(M_t)) \to \mathrm{ch}(\Lambda_{\pi^*}(M))$. Therefore, for every $\epsilon'$, there exists $t'' \geq 1$ such that for all $t \geq t''$ and all $\tilde M \in \mathcal{M}$,
$$\Big|\sum_k w_{k,t}\, d_{\mathrm{KL}}\big(z_t^\top M_{t,k}, z_t^\top \tilde M_k\big) - \sum_k w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big)\Big| \leq \epsilon'.$$
Taking $t \geq \max\{t', t''\}$, we have
$$\Big|\inf_{\tilde M \in \bar\Lambda_{\pi^*}(M_t)}\inf_{z \in \mathcal{C}} \sum_k w_{k,t}\, d_{\mathrm{KL}}\big(z^\top M_{t,k}, z^\top \tilde M_k\big) - \inf_{\tilde M \in \bar\Lambda_{\pi^*}(M)}\inf_{z \in \mathcal{C}} \sum_k w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big)\Big| \leq \epsilon',$$
which proves the continuity of $V_{\mathcal{C}}$ and, in turn, of $M \mapsto T_{M,\mathcal{C}}$. For (3), we define $f(w, M) = \inf_{\tilde M \in \mathrm{ch}(\bar\Lambda_{\pi^*}(M))}\inf_{z \in \mathcal{C}} \sum_k w_k\, d_{\mathrm{KL}}(z^\top M_k, z^\top \tilde M_k)$ and the constant constraint correspondence $C(w) = \triangle_K$. Then, from Berge's theorem (Theorem F.1 in the Appendix), we get that $w^*(M)$ is upper hemicontinuous. For (4), the convexity of $w^*(M)$ follows since the objective of
$$\max_{w \in \triangle_K}\ \inf_{\tilde M \in \mathrm{ch}(\bar\Lambda_{\pi^*}(M))}\ \inf_{z \in \mathcal{C}}\ \sum_k w_k\, d_{\mathrm{KL}}\big(z^\top M_k, z^\top \tilde M_k\big)$$
is concave in $w$ for any given $\pi$ and $\pi^*$.

C Proof of the Stopping Time

C.1 Proof of Lemma 4.1

The proof follows by showing that $\inf_{z \in \mathcal{C}} \sum_k N_{k,t}\, d_{\mathrm{KL}}(z^\top \hat M_{k,t}, z^\top M_k)$ is an appropriate stochastic process that can be bounded using the mixture-of-martingales technique of Kaufmann and Koolen (2021). To this end, given a $z \in \mathcal{C}$, we define the random variable
$$X_k(t) \triangleq \max\big\{0,\ N_{k,t}\, d_{\mathrm{KL}}\big(z_{\inf}^\top \hat M_{k,t}, z_{\inf}^\top M_k\big) - 3\ln(1 + \ln N_{k,t})\big\},$$
where $z_{\inf} \triangleq \arg\inf_{z \in \mathcal{C}} \sum_k N_{k,t}\, d_{\mathrm{KL}}(z^\top \hat M_{k,t}, z^\top M_k)$. Since we assume that each $M_{k,\ell}$ belongs to an exponential family of distributions, for any given $z_{\inf} \neq 0$, $z_{\inf}^\top M_k$, being a non-degenerate linear transform, also belongs to an exponential family. Now, plugging $X_k(t)$ into Theorem 7 of (Kaufmann and Koolen, 2021) yields
$$\mathbb{P}\Big(\sum_{k \in [K]} N_{k,t}\, d_{\mathrm{KL}}\big(z_{\inf}^\top \hat M_{k,t}, z_{\inf}^\top M_k\big) \geq \sum_{k \in [K]} 3\ln\big(1 + \ln(N_{k,t})\big) + K\,\overline{T}\Big(\frac{\ln(1/\delta)}{K}\Big)\Big) \leq \delta. \tag{17}$$
Here, $\overline{T} : \mathbb{R}_+ \to \mathbb{R}_+$ is such that
$$\overline{T}(x) \triangleq 2\,\tilde h_{3/2}\Big(\frac{h^{-1}(1 + x) + \ln(2\zeta(2))}{2}\Big), \qquad h(u) \triangleq u - \ln(u)\ \text{for } u \geq 1, \tag{18}$$
and, for $z \in [1, e]$ and $x \geq 0$,
$$\tilde h_z(x) \triangleq \begin{cases} \exp\big(1/h^{-1}(x)\big)\, h^{-1}(x) & \text{if } x \geq h^{-1}\big(1/\ln(z)\big), \\ z\,\big(x - \ln(\ln(z))\big) & \text{otherwise}, \end{cases} \tag{19}$$
where $\zeta(2) = \sum_{n=1}^\infty n^{-2}$.

C.2 Proof of Theorem 4.3

First, we prove the following lemma, required to proceed with Theorem 4.3.

Lemma C.1. For any $k \in [K]$, let $1 \leq t_k \leq t$. Let $\eta > 0$ and define the event
$$C \triangleq \bigcap_{k \in [K]} C_k \triangleq \bigcap_{k \in [K]} \big\{t_k \leq N_{k,t} \leq (1+\eta)t_k\big\},$$
and let $\mathbb{1}_C$ indicate whether the event holds.
For $\rho \geq (1+\eta)K$ and all $z \in \mathcal{C}$, we have
$$\mathbb{P}\Big(\mathbb{1}_C \sum_{k \in [K]} N_{k,t}\, d_{\mathrm{KL}}\big(z^\top \hat M_{k,t}, z^\top M_k\big) \geq \rho\Big) \leq e^K \exp\Big(-\frac{\rho}{1+\eta}\Big).$$

Proof. Fix $\zeta \in \mathbb{R}^K_+$ and $t \geq 0$. Define $m_{k,t}$ such that
$$m_{k,t} = \begin{cases} m & \text{if } \exists\, 0 \leq m \leq z^\top M_k\ \text{s.t.}\ t\, d_{\mathrm{KL}}\big(m, z^\top M_k\big) = \zeta_k, \\ 0 & \text{otherwise}. \end{cases}$$
By the monotonicity of $t\, d_{\mathrm{KL}}(\cdot, \cdot)$, $t \mapsto m_{k,t}$ is increasing. With $t = N_{k,t}$, we have that $N_{k,t}\, d_{\mathrm{KL}}(m_{k,N_{k,t}}, z^\top M_k) = \zeta_k \leq N_{k,t}\, d_{\mathrm{KL}}(z^\top \hat M_{k,t}, z^\top M_k)$ implies
$$z^\top \hat M_{k,t} \overset{(a)}{\leq} m_{k,N_{k,t}} \overset{(b)}{\leq} m_{k,(1+\eta)t_k},$$
where (a) follows from the monotonicity of $d_{\mathrm{KL}}(\cdot, \cdot)$ and (b) follows from the monotonicity of $m_{k,t}$. With $t_k\, d_{\mathrm{KL}}\big(m_{k,(1+\eta)t_k}, z^\top M_k\big) = \frac{\zeta_k}{1+\eta}$ and the non-negativity of $d_{\mathrm{KL}}(\cdot, \cdot)$, we have
$$\mathbb{P}\Big(\bigcap_{k \in [K]} \big\{\mathbb{1}_{C_k} N_{k,t}\, d_{\mathrm{KL}}\big(z^\top \hat M_{k,t}, z^\top M_k\big) \geq \zeta_k\big\}\Big) \leq \mathbb{P}\Big(\bigcap_{k \in [K]} \big\{z^\top \hat M_{k,t} \leq m_{k,N_{k,t}},\ C_k\big\}\Big) \leq \mathbb{P}\Big(\bigcap_{k \in [K]} \big\{z^\top \hat M_{k,t} \leq m_{k,(1+\eta)t_k},\ C_k\big\}\Big) \overset{(a)}{\leq} \exp\Big(-\sum_{k \in [K]} t_k\, d_{\mathrm{KL}}\big(m_{k,(1+\eta)t_k}, z^\top M_k\big)\Big),$$
where (a) follows from Lemma F.2. Using Lemma F.3 with $Z_k = N_{k,t}\, d_{\mathrm{KL}}(z^\top \hat M_{k,t}, z^\top M_k)$ and $a = \frac{1}{1+\eta}$, we have that
$$\mathbb{P}\Big(\mathbb{1}_C \sum_{k \in [K]} N_{k,t}\, d_{\mathrm{KL}}\big(z^\top \hat M_{k,t}, z^\top M_k\big) \geq \rho\Big) \leq e^K \exp\Big(-\frac{\rho}{1+\eta}\Big).$$
This concludes the proof.

Now, we provide the proof of Theorem 4.3.

Proof. Let us define $D = \lceil \frac{\log(t)}{\log(1+\eta)} \rceil$ and set $\mathcal{D} = \{1, 2, \ldots, D\}^K$. Let us define
$$A \triangleq \Big\{\sum_{k \in [K]} N_{k,t}\, d_{\mathrm{KL}}\big(z^\top \hat M_{k,t}, z^\top M_k\big) \geq \rho\Big\}, \qquad B_d \triangleq \bigcap_{k=1}^K \big\{(1+\eta)^{d_k - 1} \leq N_{k,t} \leq (1+\eta)^{d_k}\big\}.$$
We have $A = \bigcup_{d \in \mathcal{D}} (A \cap B_d)$, hence $\mathbb{P}(A) \leq \sum_{d \in \mathcal{D}} \mathbb{P}(A \cap B_d)$. For $\eta = \frac{1}{\rho - 1}$ and $\rho \geq (1+\eta)K$, we get $\rho \geq K + 1$. Now, we use Lemma C.1 with $\eta = \frac{1}{\rho - 1}$ and $t_k = (1+\eta)^{d_k - 1}$ to obtain, for all $d \in \mathcal{D}$,
$$\mathbb{P}(A \cap B_d) \leq e^K \exp\Big(-\frac{\rho}{1+\eta}\Big).$$
By a union bound on $\mathcal{D}$, we have $\mathbb{P}(A) \leq D^K e^K \exp\big(-\frac{\rho}{1+\eta}\big)$. Noting that $D = \lceil \frac{\log(t)}{\log(1+\eta)} \rceil$ and $\eta = \frac{1}{\rho - 1}$, we get
$$\mathbb{P}(A) \leq \exp(-\rho)\Big(\frac{\rho\,\log(t)}{K}\Big)^K \exp(K+1).$$

D Proofs for Sample Complexity Upper Bound

D.1 Pairwise Concentration Bounds

Lemma D.1 (Pairwise concentration; proof of Theorem 5.1). Consider the event
$$\mathcal{E}_t \triangleq \bigcap_{i \in [K]} \bigcap_{j \neq i} \big\{L_{i,j}(t) \leq z^\top(\mu_i - \mu_j) \leq U_{i,j}(t)\big\}, \tag{20}$$
where $L_{ij}(t) = z^\top(\hat\mu_{i,t} - \hat\mu_{j,t}) - \beta_{ij}(t)$ and $U_{ij}(t) = z^\top(\hat\mu_{i,t} - \hat\mu_{j,t}) + \beta_{ij}(t)$, and $\beta_{ij}(t)$ is defined as
$$\beta^2_{ij}(t) \triangleq 4\|z\|_1^2\Big(\sum_{a \in \{i,j\}} \log\big(4 + \log(N_a(t))\big) + h\big(\log\tfrac{2K_1}{\delta}\big)\Big)\sum_{a \in \{i,j\}}\frac{1}{N_a(t)}.$$
Here, $K_1 \triangleq \frac{K(K-1)}{2}$ and $h(x) \triangleq x + \log(x)$. Then, we get $\mathbb{P}\big(\cap_{t=1}^\infty \mathcal{E}_t\big) \geq 1 - \delta$.

Proof. We have
$$\mathcal{E}_t = \bigcap_{(i,j)} \big\{L_{i,j} \leq z^\top(\mu_i - \mu_j) \leq U_{i,j}\big\} = \bigcap_{(i,j)} \big\{\big|z^\top(\hat\mu_{i,t} - \hat\mu_{j,t}) - z^\top(\mu_i - \mu_j)\big| \leq \beta_{ij}(t)\big\},$$
where $(i, j) \in [K] \times [K]$ ranges over the set of arm pairs. By a union bound, we have
$$\mathbb{P}\big(\exists t \geq 1 : \bar{\mathcal{E}}_t\big) = \mathbb{P}\big(\exists t \geq 1 : \big|z^\top(\hat\mu_{i,t} - \hat\mu_{j,t}) - z^\top(\mu_i - \mu_j)\big| \geq \beta_{ij}(t)\big) \leq \sum_{(i,j)} \mathbb{P}\big(\exists t \geq 1 : \big|z^\top(\hat\mu_{i,t} - \mu_i) - z^\top(\hat\mu_{j,t} - \mu_j)\big| \geq \beta_{ij}(t)\big) \overset{(a)}{\leq} \delta.$$
Here, (a) follows from: 1. $z^\top(\hat\mu_{i,t} - \mu_i)$ is $\|z\|_1$-sub-Gaussian, as $\hat\mu_{i,t} - \mu_i$ is 1-sub-Gaussian; and 2. Lemma 7 of (Kone et al., 2023a), which provides the corresponding time-uniform deviation inequality for two centred 1-sub-Gaussian random variables, holding with probability at least $1-\delta$, given $h(x) \triangleq x + \ln x$.

We now show that the Pareto fronts are well-behaved under the metric $d_P$.

Lemma D.2. $(\mathcal{Z}, d_P)$ is a complete metric space.

Proof. From Definition 5.2, for two Pareto fronts $\mathcal{P}_1, \mathcal{P}_2 \in \mathcal{Z}$, we have
$$d_P(\mathcal{P}_1, \mathcal{P}_2) \triangleq \max\Big\{\sup_{k \in \mathcal{P}_1} d(k, \mathcal{P}_2),\ \sup_{k \in \mathcal{P}_2} d(k, \mathcal{P}_1)\Big\}, \qquad d(k, \mathcal{P}) = \inf_{k' \in \mathcal{P}} \max\Big\{0,\ \sup_{z \in \mathcal{C} \cap B(1)} z^\top(\mu_{k'} - \mu_k)\Big\}.$$
1. We first show that $d_P(\mathcal{P}_1, \mathcal{P}_2)$ is a metric. Let $\mathcal{P}_1, \mathcal{P}_2 \in \mathcal{Z}$.
(a) Symmetry: $d_P(\mathcal{P}_1, \mathcal{P}_2)$ is symmetric by definition.
(b) Triangle inequality: We show that $d_P(\mathcal{P}_1, \mathcal{P}_3) \leq d_P(\mathcal{P}_1, \mathcal{P}_2) + d_P(\mathcal{P}_2, \mathcal{P}_3)$.
We now show that the space $\mathcal{Z}$ of Pareto fronts is well-behaved under the metric $d_p$.

Lemma D.2. $(\mathcal{Z}, d_p)$ is a complete metric space.

Proof. From Definition 5.2, for two Pareto fronts $P_1, P_2 \in \mathcal{Z}$, we have:
\[
d_p(P_1, P_2) \triangleq \max\Big\{ \sup_{k \in P_1} d(k, P_2), \; \sup_{k \in P_2} d(k, P_1) \Big\}, \qquad d(k, P) \triangleq \inf_{k' \in P} \max\Big\{ 0, \sup_{z \in C \cap B(1)} z^\top (\mu_{k'} - \mu_k) \Big\}.
\]
1. We first show that $d_p(P_1, P_2)$ is a metric. Let $P_1, P_2 \in \mathcal{Z}$.
(a) Symmetry: $d_p(P_1, P_2)$ is symmetric by definition.
(b) Triangle inequality: we show that $d_p(P_1, P_3) \leq d_p(P_1, P_2) + d_p(P_2, P_3)$. Writing $P_i = P(X_i)$,
\[
d_p(P_1, P_3) = \max\Big\{ \max_{k \in P_1} \min_{k' \in P_3} \big( \mu_{k'}(X_3) - \mu_k(X_1) \big), \; \max_{k' \in P_3} \min_{k \in P_1} \big( \mu_k(X_1) - \mu_{k'}(X_3) \big) \Big\}.
\]
We have that:
\[
\max_{k \in P_1} \min_{k' \in P_3} \big( \mu_{k'}(X_3) - \mu_k(X_1) \big)
\leq \max_{k \in P_1} \min_{k' \in P_3} \Big( \mu_{k'}(X_3) + \min_{k'' \in P_2} \mu_{k''}(X_2) - \max_{k'' \in P_2} \mu_{k''}(X_2) - \mu_k(X_1) \Big)
\leq \max_{k'' \in P(X_2)} \min_{k' \in P_3} \big( \mu_{k'}(X_3) - \mu_{k''}(X_2) \big) + \max_{k \in P_1} \min_{k'' \in P(X_2)} \big( \mu_{k''}(X_2) - \mu_k(X_1) \big).
\]
Using a similar argument:
\[
\max_{k' \in P_3} \min_{k \in P_1} \big( \mu_k(X_1) - \mu_{k'}(X_3) \big)
\leq \max_{k'' \in P(X_2)} \min_{k \in P_1} \big( \mu_k(X_1) - \mu_{k''}(X_2) \big) + \max_{k' \in P_3} \min_{k'' \in P(X_2)} \big( \mu_{k''}(X_2) - \mu_{k'}(X_3) \big).
\]
Noting that for any positive numbers $a, b, c, d$, $\max\{a + b, c + d\} \leq \max\{a, c\} + \max\{b, d\}$, we have:
\[
d_p(P_1, P_3) \leq \max\Big\{ \max_{k'' \in P(X_2)} \min_{k \in P_1} \big( \mu_k(X_1) - \mu_{k''}(X_2) \big), \; \max_{k \in P_1} \min_{k'' \in P(X_2)} \big( \mu_{k''}(X_2) - \mu_k(X_1) \big) \Big\}
+ \max\Big\{ \max_{k'' \in P(X_2)} \min_{k' \in P_3} \big( \mu_{k'}(X_3) - \mu_{k''}(X_2) \big), \; \max_{k' \in P_3} \min_{k'' \in P(X_2)} \big( \mu_{k''}(X_2) - \mu_{k'}(X_3) \big) \Big\}
= d_p(P_1, P_2) + d_p(P_2, P_3).
\]
(c) We now show that $d_p(P_1, P_2) = 0 \iff P_1 = P_2$. The implication $P_1 = P_2 \implies d_p(P_1, P_2) = 0$ is immediate. For the other direction, note that by Definition 5.2, we have:
\[
d_p(P_1, P_2) = 0 \implies \sup_{k \in P_1} d(k, P_2) = 0 \;\text{ and }\; \sup_{k \in P_2} d(k, P_1) = 0.
\]
Further, $\sup_{k \in P_1} d(k, P_2) = 0$ implies that for every $k \in P_1$ there exists $k' \in P_2$ with $k \preceq_C k'$, and hence $P_1 \subseteq P_2$. A similar argument using $\sup_{k \in P_2} d(k, P_1) = 0$ implies that $P_2 \subseteq P_1$.

2. We now show that $\mathcal{Z}$ is complete under the metric $d_p$. Consider a convergent sequence of Pareto fronts $P_1, P_2, \ldots, P_n \in \mathcal{Z}$ and let $P$ be the candidate limiting Pareto front. Boundedness of $P$ is immediate. Since $P_n \to P$, for every $\epsilon > 0$ there exists $N(\epsilon)$ such that $d_p(P_n, P) < \epsilon$ for all $n > N(\epsilon)$. Let $\mu_k$ be a limit point of $P$, i.e., there is a sequence $\mu_{k,n} \in P$ such that $\mu_{k,n} \to \mu_k$. Since $d_p(P_n, P) \to 0$, for each $\mu_{k,n} \in P$ there exists $\mu_{k,n,m} \in P_n$ such that $\mu_{k,n,m} \to \mu_{k,n}$. Using a diagonalization argument, we can extract a subsequence $\mu_{k,n,m} \to \mu_k$. Since each $P_n$ is compact, $\mu_k$ must lie in $P$, and therefore $P$ is closed.
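The metric of Definition 5.2 is directly computable on finite fronts. The sketch below approximates $\sup_{z \in C \cap B(1)}$ with a finite set of unit directions; all names are our illustrative assumptions, and cone_dirs must span the preference cone $C$.

```python
import numpy as np

def point_to_front(mu_k, front, cone_dirs):
    # d(k, P) = inf_{k' in P} max{0, sup_z z^T (mu_{k'} - mu_k)}
    return min(
        max(0.0, max(float(z @ (mu_kp - mu_k)) for z in cone_dirs))
        for mu_kp in front
    )

def d_p(front_1, front_2, cone_dirs):
    # Symmetrised (Hausdorff-style) maximum of the two one-sided distances.
    forward = max(point_to_front(mu, front_2, cone_dirs) for mu in front_1)
    backward = max(point_to_front(mu, front_1, cone_dirs) for mu in front_2)
    return max(forward, backward)

# Example with the component-wise ordering in L = 2 (canonical rays of C):
cone_dirs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
P1 = [np.array([1.0, 0.2]), np.array([0.3, 0.9])]
P2 = [np.array([1.0, 0.2]), np.array([0.4, 1.0])]
print(d_p(P1, P2, cone_dirs))  # positive, since the two fronts differ
```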
D.2 Concentration to the Pareto Front

Lemma D.3. There exists a constant $C > 0$ such that:
\[
\mathbb{P}\big( G_T^c \big) \leq 2K^2 \frac{1}{1 - \exp(-C)} \, T \exp\big( -C T^{1/8} \big), \qquad \text{where } G_T = \bigcap_{t=h(T)}^{T} \big\{ d_P(\widehat{P}_t, P^*) \leq \epsilon \big\}.
\]

Proof. We have that:
\[
\mathbb{P}\big( d_P(\widehat{P}_t, P^*) \geq \epsilon \big) = \mathbb{P}\big( \max\big\{ d(\widehat{P}_t, P^*), \, d(P^*, \widehat{P}_t) \big\} \geq \epsilon \big) \leq \mathbb{P}\big( d(\widehat{P}_t, P^*) \geq \epsilon \big) + \mathbb{P}\big( d(P^*, \widehat{P}_t) \geq \epsilon \big).
\]
Focusing on the first term, we have:
\[
d(\widehat{P}_t, P^*) = \inf_{k \in \widehat{P}_t} \sup_{k' \in P^*} \max\Big\{ 0, \max_{z \in C \cap B(1)} z^\top \big( M_{k'} - \widehat{M}_{k,t} \big) \Big\}
\overset{(a)}{=} \min_{k \in \widehat{P}_t} \max_{k' \in P^*} \max\Big\{ 0, \max_{z \in C \cap B(1)} z^\top \big( M_{k'} - \widehat{M}_{k',t} + \widehat{M}_{k',t} - \widehat{M}_{k,t} \big) \Big\}
\overset{(b)}{\leq} \max_{k' \in P^*} \max\Big\{ 0, \max_{z \in C \cap B(1)} z^\top \big( M_{k'} - \widehat{M}_{k',t} \big) \Big\} + \min_{k \in \widehat{P}_t} \max_{k' \in P^*} \max\Big\{ 0, \max_{z \in C \cap B(1)} z^\top \big( \widehat{M}_{k',t} - \widehat{M}_{k,t} \big) \Big\}.
\]
Recall that $G_T^c = \bigcup_{t=h(T)}^{T} \big\{ d_P(\widehat{P}_t, P^*) \geq \epsilon \big\}$; then, using a union bound, we have:
\[
\mathbb{P}(G_T^c) \leq \sum_{t=h(T)}^{T} \sum_{(k,k')} \Big[ \mathbb{P}\Big( z^\top \big( \widehat{M}_{k,t} - \widehat{M}_{k',t} \big) \leq z^\top (M_k - M_{k'}) - \xi \Big) + \mathbb{P}\Big( z^\top \big( \widehat{M}_{k,t} - \widehat{M}_{k',t} \big) \geq z^\top (M_k - M_{k'}) + \xi \Big) \Big].
\]
Let $T$ be such that $h(T) \geq K^2$. Since $t \geq h(T)$, one has that $N_{k,t} \geq \sqrt{t/K}$. Then, using a union bound on each pair and the number of arms, as well as a Chernoff bound, each of the terms in the above inequality is bounded as:
\[
\mathbb{P}\Big( z^\top \big( \widehat{M}_{k,t} - \widehat{M}_{k',t} \big) \leq z^\top (M_k - M_{k'}) - \xi \Big)
= \mathbb{P}\Big( z^\top \big( \widehat{M}_{k,t} - \widehat{M}_{k',t} \big) \leq z^\top (M_k - M_{k'}) - \xi, \; N_{k,t} \geq \sqrt{t/K} \Big)
\leq \sum_{s \geq \sqrt{t/K}} \mathbb{P}\Big( z^\top \big( \widehat{M}_{k,s} - \widehat{M}_{k',s} \big) \leq z^\top (M_k - M_{k'}) - \xi \Big)
\leq \sum_{s \geq \sqrt{t/K}} \Big[ \mathbb{P}\big( z^\top (\widehat{M}_{k,s} - M_k) \leq -\xi/2 \big) + \mathbb{P}\big( z^\top (\widehat{M}_{k',s} - M_{k'}) \geq \xi/2 \big) \Big]
\leq \sum_{s \geq \sqrt{t/K}} \Big[ \exp\big( -s \, d_{\mathrm{KL}}(z^\top M_k - \xi/2, z^\top M_k) \big) + \exp\big( -s \, d_{\mathrm{KL}}(z^\top M_{k'} + \xi/2, z^\top M_{k'}) \big) \Big]
\leq \frac{\exp\big( -\sqrt{t/K} \, d_{\mathrm{KL}}(z^\top M_k - \xi/2, z^\top M_k) \big)}{1 - \exp\big( -d_{\mathrm{KL}}(z^\top M_k - \xi/2, z^\top M_k) \big)} + \frac{\exp\big( -\sqrt{t/K} \, d_{\mathrm{KL}}(z^\top M_{k'} + \xi/2, z^\top M_{k'}) \big)}{1 - \exp\big( -d_{\mathrm{KL}}(z^\top M_{k'} + \xi/2, z^\top M_{k'}) \big)}.
\]
Similarly, we have:
\[
\mathbb{P}\Big( z^\top \big( \widehat{M}_{k,t} - \widehat{M}_{k',t} \big) \geq z^\top (M_k - M_{k'}) + \xi \Big)
\leq \frac{\exp\big( -\sqrt{t/K} \, d_{\mathrm{KL}}(z^\top M_k + \xi/2, z^\top M_k) \big)}{1 - \exp\big( -d_{\mathrm{KL}}(z^\top M_k + \xi/2, z^\top M_k) \big)} + \frac{\exp\big( -\sqrt{t/K} \, d_{\mathrm{KL}}(z^\top M_{k'} - \xi/2, z^\top M_{k'}) \big)}{1 - \exp\big( -d_{\mathrm{KL}}(z^\top M_{k'} - \xi/2, z^\top M_{k'}) \big)}.
\]
Now, we set
\[
C \triangleq \min_{k, k'} \Big\{ d_{\mathrm{KL}}\big(z^\top M_k + \xi/2, z^\top M_k\big), \; d_{\mathrm{KL}}\big(z^\top M_k - \xi/2, z^\top M_k\big), \; d_{\mathrm{KL}}\big(z^\top M_{k'} + \xi/2, z^\top M_{k'}\big), \; d_{\mathrm{KL}}\big(z^\top M_{k'} - \xi/2, z^\top M_{k'}\big) \Big\},
\]
and let $B$ denote the sum of the four geometric prefactors above, so that $B \leq 2K^2 \frac{1}{1 - \exp(-C)}$. Now, we get:
\[
\sum_{t=h(T)}^{T} B \exp\big( -C \sqrt{t/K} \big) \leq 2K^2 \frac{1}{1 - \exp(-C)} \, T \exp\big( -C T^{1/8} \big).
\]

D.3 Proof of Theorem 5.2

Theorem D.1 (Restating Theorem 5.2). For any $\alpha > 0$ and $c(t, \delta)$ defined in (7), the stopping time satisfies:
\[
\limsup_{\delta \to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \leq \alpha \, T_F(M) \qquad \text{for all } M \in \mathbb{R}^{K \times L}.
\]

Proof. Step 1: Good event. Let $T \in \mathbb{N}$ and $h(T) = \sqrt{T}$, and define the good event:
\[
G_T = \bigcap_{t=h(T)}^{T} \big\{ d_p(\widehat{P}_t, P^*) \leq f(\epsilon) \big\}, \tag{21}
\]
where $f(\epsilon)$ is such that:
\[
d_p(\widehat{P}_t, P^*) \leq f(\epsilon) \implies \sup_{w \in w^*(\widehat{P}_\tau)} \sup_{w' \in w^*(P^*)} \|w - w'\| \leq \epsilon.
\]
Step 2: Concentration of the good event. In Lemma D.3, we show that:
\[
\mathbb{P}\big( G_T^c \big) \leq 2K^2 \frac{1}{1 - \exp(-C)} \, T \exp\big( -C T^{1/8} \big).
\]
Step 3: Tracking lemma. We have:
\[
\Big| \frac{N_{k,t}}{t} - w^*_k(M) \Big| \leq \Big| \frac{N_{k,t}}{t} - \frac{1}{t} \sum_{s=1}^{t} w^*_k(\widehat{M}_s) \Big| + \frac{1}{t} \sum_{s=1}^{h(T)} \big| w^*_k(\widehat{M}_s) - w^*_k(M) \big| + \frac{1}{t} \sum_{s=h(T)+1}^{t} \big| w^*_k(\widehat{M}_s) - w^*_k(M) \big|,
\]
where the first term is bounded using the tracking guarantee of (Garivier and Kaufmann, 2016): $\big| N_{k,t} - \sum_{s=1}^{t} w^*_k(\widehat{M}_s) \big| \leq K - 1$.

Step 4: Complexity on the good event. Assume $t \geq T_\epsilon$, and let:
\[
C_\epsilon(M) \triangleq \inf_{(w', M')} V_C(w', M'), \qquad (w', M') \text{ s.t. } \|w' - w^*\| \leq 3\epsilon, \; d_P(\widehat{P}_t, P^*) \leq f(\epsilon).
\]
Then, we have that $V_C(N_t, \widehat{M}_t) \geq t \, C_\epsilon(M)$.

Step 5: Bounding the stopping time on the good and bad events. Let $\tau_\delta$ be the stopping time. From the stopping rule (Equation (7)), we get that:
\[
\tau_\delta \wedge T \leq T_\epsilon + \sum_{t=T_\epsilon}^{T} \mathbb{1}\big\{ V_C(N_t, \widehat{M}_t) \leq \beta(t, \delta) \big\} \leq T_\epsilon + \sum_{t=T_\epsilon}^{T} \mathbb{1}\big\{ t \, C_\epsilon(M) \leq \beta(t, \delta) \big\},
\]
where $\beta(t, \delta)$ is defined in Equation (6). Define $T_\delta = \inf\big\{ T \in \mathbb{N} : T_\epsilon + \frac{c(T, \delta)}{C_\epsilon(M)} \leq T \big\}$. Hence, we have:
\[
\mathbb{E}[\tau_\delta] \leq T_\epsilon + T_\delta + \sum_{T=1}^{\infty} 2K^2 \frac{1}{1 - \exp(-C)} \, T \exp\big( -C T^{1/8} \big).
\]
Let $C(\eta) = \inf\{ T : T \geq T_\epsilon (1 + \eta) \}$. Then:
\[
T_\delta \leq C(\eta) + \inf\big\{ T \in \mathbb{N} : T \, C_\epsilon(M) \geq (1 + \eta) \, \beta(T, \delta) \big\}.
\]
Step 6: Obtaining the asymptotic bound. Taking limits, we get:
\[
\limsup_{\delta \to 0} \frac{\mathbb{E}[\tau_\delta]}{\log(1/\delta)} \leq \alpha \, T^*(M) \qquad \text{for all } \alpha \geq 1,
\]
which concludes the proof.
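A minimal sketch of the tracking step used in Step 3, in the C-tracking style of Garivier and Kaufmann (2016): pull the arm whose empirical allocation lags most behind the running sum of plug-in oracle weights, with forced exploration. Here oracle_weights stands in for the solver of the lower-bound program and is a stub, not the paper's implementation.

```python
import numpy as np

def track(N, cum_w):
    # N: pull counts (K,); cum_w: sum over s <= t of w*(M_hat_s) (K,)
    return int(np.argmax(cum_w - N))   # keeps |N_k - sum_s w_k| <= K - 1

def run(T, K, oracle_weights, pull):
    N, cum_w = np.zeros(K), np.zeros(K)
    for t in range(1, T + 1):
        under = np.where(N < np.sqrt(t) - K / 2)[0]   # forced exploration
        k = int(under[np.argmin(N[under])]) if under.size else track(N, cum_w)
        pull(k)
        N[k] += 1
        cum_w += oracle_weights()      # w*(M_hat_t) from the convex program
    return N
```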
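The stopping rule of Step 5 compares the empirical statistic $V_C(N_t, \widehat{M}_t)$ with the threshold $\beta(t, \delta)$. The sketch below specialises $V_C$ to Gaussian scalarised rewards, where $d_{\mathrm{KL}}(a, b) = (a - b)^2 / (2\sigma^2)$, and enumerates a finite cover of the alternative set; the threshold shape and all names are illustrative assumptions, not the paper's exact $\beta(t, \delta)$ of Equation (6).

```python
import numpy as np

def V_C(N, M_hat, alt_means, cone_rays, sigma2=1.0):
    # min over alternatives and cone directions of the weighted KL sum,
    # with Gaussian d_KL(a, b) = (a - b)^2 / (2 sigma^2).
    best = np.inf
    for M_alt in alt_means:            # finite cover of ch(Lambda_pi(M))
        for z in cone_rays:            # finite subset of the cone C
            stat = np.sum(N * (M_hat @ z - M_alt @ z) ** 2 / (2.0 * sigma2))
            best = min(best, stat)
    return best

def should_stop(N, M_hat, alt_means, cone_rays, t, delta, K):
    beta = np.log((1.0 + np.log(t)) / delta) + K   # illustrative threshold shape
    return V_C(N, M_hat, alt_means, cone_rays) >= beta
```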
E Reduction to Best-Arm Identification

We briefly discuss how the metric $d_P(\cdot, \cdot)$ extends existing notions of gap in the best-arm and Pareto-front identification literature. Specifically, we proceed with the three following observations.

1. Observe that $d_P(P^*, P^*) = 0$.

2. Further, if $C$ represents the component-wise ordering as in Pareto-front identification (Auer et al., 2016; Kone et al., 2023a), then $z^{(\ell)} = 1$ for all $\ell \in [L]$, and:
\[
d_P\big( [K] \setminus P^*, P^* \big) = \max\Big\{ \sup_{k \in [K] \setminus P^*} d(k, P^*), \; \sup_{k' \in P^*} d\big( [K] \setminus P^*, k' \big) \Big\}
= \max\Big\{ \sup_{k \in [K] \setminus P^*} \inf_{k' \in P^*} \max\big\{ 0, \min_{\ell \in [L]} (M_{\ell k'} - M_{\ell k}) \big\}, \; \sup_{k' \in P^*} \inf_{k \in [K] \setminus P^*} \max\big\{ 0, \min_{\ell \in [L]} (M_{\ell k} - M_{\ell k'}) \big\} \Big\}
= \max\Big\{ \sup_{k \in [K] \setminus P^*} \inf_{k' \in P^*} \max\{ 0, m(k', k) \}, \; \sup_{k' \in P^*} \inf_{k \in [K] \setminus P^*} \max\{ 0, M(k, k') \} \Big\}
= \sup_{k' \in P^*} \inf_{k \in [K] \setminus P^*} \max\{ 0, M(k, k') \},
\]
recovering the gap quantities $m(\cdot, \cdot)$ and $M(\cdot, \cdot)$ of the Pareto-front identification literature.

3. Finally, when there is only a single objective, i.e., $L = 1$, and assuming a unique optimal arm $k^*$ (a fairly common assumption in the BAI literature), we have:
\[
d_P\big( [K] \setminus P^*, P^* \big) = \max\Big\{ \sup_{k \in [K] \setminus \{k^*\}} d(k, \{k^*\}), \; d\big( [K] \setminus \{k^*\}, k^* \big) \Big\} = \min_{k \neq k^*} \max\{ 0, \mu_{k^*} - \mu_k \} = \min_{k \neq k^*} \Delta_k,
\]
which is exactly the gap in one-dimensional bandits.

F Useful Existing Results

Theorem F.1 (Berge's Maximum Theorem (Berge, 1963)). Let $U$ and $V$ be topological spaces, $f : U \times V \to \mathbb{R}$ a continuous function, and $C : U \rightrightarrows V$ a correspondence such that $C(u)$ is non-empty and compact for all $u \in U$. Then, if $C$ is continuous at $u$, $f^*(u) = \max_{v \in C(u)} f(u, v)$ is continuous and $C^*(u) = \{ v \in C(u) : f(u, v) = f^*(u) \}$ is upper-hemicontinuous.

Theorem F.2 (Donsker-Varadhan Variational Formula (Donsker and Varadhan, 1975)). For the KL divergence $d_{\mathrm{KL}}(P, Q)$, we have that:
\[
d_{\mathrm{KL}}(P, Q) = \sup_{f} \big\{ \mathbb{E}_P[f] - \ln \mathbb{E}_Q[\exp(f)] \big\},
\]
where the supremum ranges over measurable functions $f$ with $\mathbb{E}_Q[\exp(f)] < \infty$.

Lemma F.1 (Peskun Ordering (Peskun, 1973)). For any two random variables $X, Y$ on $\mathbb{R}^K$, the following are equivalent:
1. For all $x \in \mathbb{R}^K$, $\mathbb{P}[X \geq x] \leq \mathbb{P}[Y \geq x]$.
2. For all non-negative increasing functions $f_1, f_2, \ldots, f_K$, we have that $\mathbb{E}\big[ \prod_{i=1}^{K} f_i(X_i) \big] \leq \mathbb{E}\big[ \prod_{i=1}^{K} f_i(Y_i) \big]$.

Lemma F.2 (Magureanu et al. (2014)). For any $k = 1, 2, \ldots, K$, let us define $1 \leq t_k \leq t$. Then, for all $0 \leq C_k \leq M_k$, we have:
\[
\mathbb{P}\Big( \bigcap_{k \in [K]} \big\{ \widehat{M}_{k,t} \leq C_k, \; t_k \leq N_{k,t} \big\} \Big) \leq \exp\Big( - \sum_{k \in [K]} t_k \, d_{\mathrm{KL}}(C_k, M_k) \Big).
\]

Lemma F.3 (Magureanu et al. (2014)). Let $a > 0$, $K \geq 2$, and $Z \in \mathbb{R}_+^K$ be such that for all $\zeta \in \mathbb{R}_+^K$ we have $\mathbb{P}(Z \geq \zeta) \leq \exp\big( -a \sum_k \zeta_k \big)$. Then, for all $\rho \geq K/a$:
\[
\mathbb{P}\Big( \sum_{k \in [K]} Z_k \geq \rho \Big) \leq \Big( \frac{a \rho e}{K} \Big)^{K} \exp(-a\rho).
\]

Lemma F.4 (Single-arm concentration, Kaufmann and Koolen (2021)). The following is a $\delta$-uniformly valid confidence interval on $z^\top \mu_k$:
\[
\big| z^\top \hat\mu_{k,t} - z^\top \mu_k \big| \leq \sqrt{ \frac{2}{N_{k,t}} \Big( L \, C^g \ln(1/\delta) + c \sum_{k \in [K]} \ln\big( d + \ln \ln (N_{k,t}) \big) \Big) },
\]
for appropriate constants $c, d$ and the function $C^g$ of Kaufmann and Koolen (2021).
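As a quick numerical check of Theorem F.2, the Monte-Carlo sketch below verifies the Donsker-Varadhan formula for Gaussians: with $P = \mathcal{N}(m_1, 1)$ and $Q = \mathcal{N}(m_2, 1)$, the supremum is attained at $f = \log \frac{dP}{dQ}$, where $\mathbb{E}_P[f] - \log \mathbb{E}_Q[e^f] = d_{\mathrm{KL}}(P, Q) = (m_1 - m_2)^2 / 2$.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2, n = 1.0, 0.0, 1_000_000
f = lambda x: (m1 - m2) * x + (m2**2 - m1**2) / 2.0   # log dP/dQ for unit variances
x_p = rng.normal(m1, 1.0, n)
x_q = rng.normal(m2, 1.0, n)
dv = f(x_p).mean() - np.log(np.exp(f(x_q)).mean())     # Donsker-Varadhan value
print(dv, (m1 - m2)**2 / 2.0)                          # both approximately 0.5
```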
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]