# transductive_active_learning_theory_and_applications__258514ca.pdf

Transductive Active Learning: Theory and Applications

Jonas Hübotter Department of Computer Science ETH Zürich, Switzerland

Bhavya Sukhija Department of Computer Science ETH Zürich, Switzerland

Lenart Treven Department of Computer Science ETH Zürich, Switzerland

Yarden As Department of Computer Science ETH Zürich, Switzerland

Andreas Krause Department of Computer Science ETH Zürich, Switzerland

We study a generalization of classical active learning to real-world settings with concrete prediction targets where sampling is restricted to an accessible region of the domain, while prediction targets may lie outside this region. We analyze a family of decision rules that sample adaptively to minimize uncertainty about prediction targets. We are the first to show, under general regularity assumptions, that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data. We demonstrate their strong sample efficiency in two key applications: active fine-tuning of large neural networks and safe Bayesian optimization, where they achieve state-of-the-art performance.

1 Introduction

Machine learning, at its core, is about designing systems that can extract knowledge or patterns from data. One part of this challenge is determining not just how to learn given observed data but deciding what data to obtain next, given the information already available. More formally, given an unknown and sufficiently regular function f over a domain X: How can we learn f sample-efficiently from (noisy) observations? This problem is widely studied in active learning and experimental design (Chaloner & Verdinelli, 1995; Settles, 2009).

Active learning methods commonly aim to learn f globally, i.e., across the entire domain X. However, in many real-world problems, (i) the domain is so large that learning f globally is hopeless or (ii) agents have limited information and cannot access the entire domain (e.g., due to restricted access or to act safely). Thus, global learning is often not desirable or even possible. Instead, intelligent systems are typically required to act in a more directed manner and extrapolate beyond their limited information. This work formalizes the above two aspects of active learning, which have remained largely unaddressed by prior work. We provide a comprehensive overview of related work in Section 6.

Directed transductive active learning We consider the generalized problem of transductive active learning, where given two arbitrary subsets of the domain X; a target space A X, and a sample space S X, we study the question:

How can we learn f within A by actively sampling observations within S?

Correspondence to jonas.huebotter@inf.ethz.ch

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

Figure 1: Instances of transductive active learning with target space A shown in blue and sample space S shown in gray. The points denote plausible observations within S to learn A. In (A), the target space contains everything within S as well as points outside S. In (B, C, D), one makes observations directed towards learning about a particular target. Prior work on inductive active learning has focused on the instance A = S.

This problem is ubiquitous in real-world applications such as safe Bayesian optimization, where S is a set of safe parameters and A might represent parameters outside S whose safety we want to infer. Active fine-tuning of neural networks is another example, where the target space A represents the test set over which we want to minimize risk, and the sample space S represents the dataset from which we can retrieve data points to fine-tune our model to A. Figure 1 visualizes some instances of transductive active learning.

Whereas most prior work has focused on the global inductive instance X = A = S, Mac Kay (1992) was the first to consider specific target spaces A and proposed the principle of selecting points in S to minimize the posterior uncertainty about points in A. Since then, several works have studied this principle empirically (e.g., Seo et al., 2000; Yu et al., 2006; Bogunovic et al., 2016; Wang et al., 2021; Kothawade et al., 2021; Bickford Smith et al., 2023). In this work, we model f as a Gaussian process or (equivalently) as a function in a reproducing kernel Hilbert space, for which the above principle is analytically and computationally tractable. Our contributions are:

Theory (Section 3): We are the first to give rates for the uniform convergence of uncertainty over the target space A to the smallest attainable value, given samples from the sample space S (Theorems 3.2 and 3.3), Our results provide a theoretical justification for the principle of minimizing posterior uncertainty in transductive active learning, and indicate that transductive active learning can be more sample efficient than inductive active learning.

Applications: We show that transductive active learning improves upon the state-of-the-art in the batch-wise active fine-tuning of neural networks for image classification (Section 4) and in safe Bayesian optimization (Section 5).

2 Problem Setting

We assume for now that the target space A and sample space S are finite, and relax these assumptions in the appendices. We model f as a stochastic process and denote the marginal random variables f(x) by fx, and joint random vectors {fx}x X for some X X, |X| < by f X. Let y X denote the noisy observations of f X, {yx = fx + εx}x X, where εx is independent noise.2 We study the adaptive setting, where in round n the agent selects a point xn S and observes yn = yxn. The agent s choice of xn may depend on the outcome of prior observations Dn 1 def = {(xi, yi)}i<n.

Background on information theory We briefly recap several important concepts from information theory of which we provide formal definitions in Appendix B. The (differential) entropy H[f] is one possible measure of uncertainty about f and the conditional entropy H[f | y] is the (expected) posterior uncertainty about f after observing y. The information gain I(f; y) = H[f] H[f | y] measures the (expected) reduction in uncertainty about f due to y. We denote the information gain about A from observing X by I(f A; y X). The maximum information gain about A from n observations within S is γA,S(n) def = max X S |X| n I(f A; y X).

This information capacity measures the information about f A that is accessible from within S, and has been used previously (e.g., by Srinivas et al., 2009; Chowdhury & Gopalan, 2017; Vakili et al., 2021) in the setting where X = A = S, taking the form of γn def = γX (n) def = γX,X (n). We remark that γA,S(n) γS(n) holds uniformly for all A, S, and n due to the data processing inequality. Generally, γA,S(n) can be substantially smaller if the target space is a sparse subset of the sample space.

2X may be a multiset in which case repeated occurrence of x corresponds to independent observations of yx.

3 Main Results

We analyze the following principle for transductive active learning:

Select samples to minimize the posterior uncertainty about f within A. ( )

This principle yields a family of simple and natural decision rules which depend on the chosen measure of uncertainty . Two natural measures of uncertainty are (1) the entropy of prediction targets, H[f A], and (2) their total variance, P

x A Var[fx ]. The corresponding decision rules are

(1) xn = arg min x S H[f A | Dn 1, yx] = arg max x S I(f A; yx | Dn 1), (ITL)

(2) xn = arg min x S tr Var[f A | Dn 1, yx] (VTL)

with an implicit expectation over the feedback yx. That is, ITL (short for Information-based Transductive Learning) and VTL (Variance-based TL) select xn so as to minimize the uncertainty about the prediction targets f A (in expectation) after having received the feedback yn. Unlike VTL, ITL takes into account the mutual dependence between points in A. These decision rules were suggested previously (Mac Kay, 1992; Seo et al., 2000; Yu et al., 2006) without deriving theoretical guarantees; and they generalize several widely used algorithms which we discuss in more detail in Section 6. Most prominently, in the inductive setting where S A, ITL reduces to xn = arg maxx S I(fx; yx | Dn 1), i.e., is undirected and reduces to standard uncertainty-based active learning strategies (cf. Appendix C.1). The convergence properties for the special instance of ITL with S = A have been studied extensively. To the best of our knowledge, we are the first to extend these guarantees to the more general setting of transductive active learning.

In our presented results, we make the following assumption. Assumption 3.1. In the case of ITL, the information gain ψA(X) = I(f A; y X) is submodular. In the case of VTL, the variance reduction ψA(X) = tr Var[f A] tr Var[f A | y X] is submodular. Under this assumption, ψA(x1:n) is a constant factor approximation of max X S,|X| n ψA(X) due to the seminal result on submodular function maximization by Nemhauser et al. (1978). Similar assumptions have been made, e.g., by Bogunovic et al. (2016) and Kothawade et al. (2021). Assumption 3.1 is satisfied exactly for ITL when S A and f is a Gaussian process (cf. Lemma C.9), and we provide an extensive discussion of our results in Appendix C.4 for instances where Assumption 3.1 is satisfied approximately, relying on the notion of weak submodularity (Das & Kempe, 2018).

3.1 Gaussian Process Setting

When f GP(µ, k) is a Gaussian process (GP, Williams & Rasmussen, 2006) with known mean function µ and kernel k, and the noise εx is mutually independent and zero-mean Gaussian with known variance, the ITL and VTL objectives have a closed form expression (cf. Appendix F) and can be optimized efficiently. Further, the information capacity γn is sublinear in n for a rich class of GPs (Srinivas et al., 2009; Vakili et al., 2021), with rates summarized in Table 3 of the appendix.

Convergence to irreducible uncertainty So far, our discussion was centered around the role of the target space A in facilitating directed learning. An orthogonal contribution of this work is to study extrapolation from the sample space S to points x A \ S. To this end, we derive bounds on the marginal posterior variance σ2 n(x) def = Var[f(x) | Dn] for points in A. These bounds depend on the instance of transductive active learning (i.e., A and S) and might be of independent interest for active learning. For ITL and VTL, they imply uniform convergence of the variance for a rich class of GPs. To the best of our knowledge, this work is the first to present such bounds.

We define the irreducible uncertainty as the variance of f(x) provided complete knowledge of f in S:

η2 S(x) def = Var[fx | f S].

As the name suggests, η2 S(x) represents the smallest uncertainty one can hope to achieve from observing only within S. For all x S, it is easy to see that η2 S(x) = 0. However, the irreducible uncertainty of x S may be (and typically is!) strictly positive. Theorem 3.2 (Bound on marginal variance for ITL and VTL). Let Assumption 3.1 hold and the data be selected by either ITL or VTL.. Assume that f GP(µ, k) with known mean function µ

and kernel k, the noise εx is mutually independent and zero-mean Gaussian with known variance, and γn is sublinear in n. Then there exists a constant C such that for any n 1 and x A,

σ2 n(x) η2 S(x) | {z } irreducible

+ C γA,S(n) n | {z } reducible

Moreover, if x A S, there exists a constant C such that

σ2 n(x) C γA,S(n)

Intuitively, Equation (1) of Theorem 3.2 can be understood as bounding an epistemic generalization gap (Wainwright, 2019) of the learner. The reducible uncertainty converges to zero at all prediction targets x A, e.g., for linear, Gaussian, and smooth Matérn kernels. As to be expected, a smaller target space (i.e., more targeted sampling) leads to faster convergence due to a smaller information capacity γA,S(n) γn. Equation (2) matches prior results for the setting S = A. We provide a formal proof of Theorem 3.2 in Appendix C.6.

3.2 Agnostic Setting

The result from the GP setting translates also to the agnostic setting, where the ground truth f may be any sufficiently regular fixed function on X.3 In this case, we use the model f from Section 3.1 as a (misspecified) model of f , with some kernel k and zero mean function µ( ) = 0. We denote by µn(x) = E[f(x) | Dn] the posterior mean of f. W.l.o.g. we assume in the following result that the prior variance is bounded, i.e., Var[f(x)] 1. Theorem 3.3 (Bound on approximation error for ITL and VTL, following Abbasi-Yadkori (2013); Chowdhury & Gopalan (2017)). Let Assumption 3.1 hold and the data be selected by either ITL or VTL. Pick any δ (0, 1). Assume that f lies in the reproducing kernel Hilbert space Hk(X) of the kernel k with norm f k < , the noise εn is conditionally ρ-sub-Gaussian, and γn is sublinear in n. Let βn(δ) = f k + ρ p

2(γn + 1 + log(1/δ)). Then for any n 1 and x A, jointly with probability at least 1 δ,

|f (x) µn(x)| βn(δ) h ηS(x) | {z } irreducible

+ νA,S(n) | {z } reducible

where ν2 A,S(n) denotes the reducible part of Equation (1).

We provide a formal proof of Theorem 3.3 in Appendix C.7. Theorem 3.3 generalizes approximation error bounds of prior works to the extrapolation setting, where some prediction targets x A lie outside the sample space S. For prediction targets x A S, the irreducible uncertainty vanishes, and we recover previous results from the setting S = A.

Theorems 3.2 and 3.3 show that ITL and VTL efficiently learn f at the prediction targets A for large classes of sufficiently regular functions f. In the following, we validate these results experimentally by showing that ITL and VTL exhibit strong empirical performance in a broad range of applications.

3.3 Experiments in the Gaussian Process Setting

Before demonstrating ITL and VTL on GPs to develop more intuition, we introduce a natural correlation-based baseline, which will later uncover connections to existing approaches:

xn = arg max x S

x A Cor[fx, fx | Dn 1]. (CTL)

How does the smoothness of f affect ITL? We contrast two extreme kernels: the Gaussian kernel k(x, x ) = exp( x x 2 2 /2) and the Laplace kernel k(x, x ) = exp( x x 1). In the mean-squared sense, the Gaussian kernel yields a smooth process f whereas the Laplace kernel yields a continuous but non-differentiable f (Williams & Rasmussen, 2006). Figure 2 shows

3Here f (x) denotes the mean observation yx = f (x) + ϵx

Figure 2: Initial 25 samples of ITL under a Gaussian kernel with lengthscale 1 (left) and a Laplace kernel with lengthscale 10 (right). Shown in gray is the sample space S and shown in blue is the target space A. In three of the four examples, points outside the target space provide useful information.

0 500 Number of Iterations

Entropy of f A

0 500 Number of Iterations

Std Dev in A

LOCAL UNSA TRUVAR

Figure 3: Entropy of f A ranging from 3850 to 3725 and the mean marginal standard deviations of f A ranging from 0 to 0.15. Experiment is using the Gaussian kernel of the left instance (A S) from Figure 2. It can be seen that ITL and VTL outperform UNSA and RANDOM. Uncertainty bands correspond to one standard error over 10 random seeds.

how ITL adapts to the smoothness of f: Under the smooth Gaussian kernel, points outside A provide higher-order information. In contrast, under the rough Laplace kernel and if A S, points outside A do not provide any additional information, and therefore are not sampled by ITL. If, however, A S, information leaks A even under a Laplace kernel prior. That is, even for non-smooth functions, the point with most information need not be in A.

Does ITL outperform uncertainty sampling? Uncertainty sampling (UNSA, Lewis & Catlett, 1994) is one of the most popular active learning methods. UNSA selects points x with high prior uncertainty: xn = arg maxx S σ2 n 1(x). This is in stark contrast to ITL and VTL which select points x that minimize posterior (epistemic) uncertainty about A. It can be seen that UNSA is the special undirected case of ITL when S A and observation noise is homoscedastic (cf. Appendix C.1).

We compare UNSA to ITL, VTL, and CTL in Figure 3. We observe that ITL and VTL outperform UNSA which also samples points that are not informative about A. Further, ITL and VTL outperform local UNSA (i.e., UNSA constrained to A S) which neglects all information provided by points outside A.4 As one would expect, VTL has an advantage with respect to reducing the total variance of f A, whereas ITL reduces the entropy of f A faster. We include ablations in Appendix H where we, in particular, observe that the advantage of ITL and VTL over UNSA increases as the volume of prediction targets shrinks in comparison to the size of domain.

4 Active Fine-Tuning of Neural Networks

Fine-tuning a large pre-trained model is a costand computation-effective approach to improve performance on a given target domain (Lee et al., 2022). While previous work has studied the effectiveness of various training procedures for fine-tuning (e.g., Eustratiadis et al., 2024), we ask: How can we select the right data for fine-tuning to a specific task? This active fine-tuning problem is an instance of the introduced directed transductive learning problem: Concretely, consider a supervised setting, where the function f maps inputs x X to outputs y Y. We have access

4If A S then local UNSA does not even converge to the irreducible uncertainty.

to noisy samples from a training set S on X, and we would like to learn f such that our estimate minimizes a given risk measure, such as classification error, with respect to a test distribution PA on X. The goal is to actively and efficiently sample from S to minimize risk with respect to PA.5 We show in this section that ITL and VTL can learn f from only few examples from S.

How can we leverage the latent structure learned by the pre-trained model? As common in related work, we approximate the (pre-trained) neural network (NN) f( ; θ) as a linear function in a latent embedding space, f(x; θ) β ϕθ(x), with weights β Rp and embeddings ϕθ : X Rp. Common choices of embeddings include last-layer embeddings (Devlin et al., 2019; Holzmüller et al., 2023), neural tangent embeddings arising from neural tangent kernels (Jacot et al., 2018) which are motivated by their relationship to the training and fine-tuning of ultra-wide NNs (Arora et al., 2019; Lee et al., 2019; Khan et al., 2019; He et al., 2020; Malladi et al., 2023), and loss gradient embeddings (Ash et al., 2020). We provide a comprehensive overview of embeddings in Appendix J.2. Now, supposing the prior β N(0, Σ), often with Σ = I (Khan et al., 2019; He et al., 2020; Antorán et al., 2022; Wei et al., 2022), this approximation of f is a Gaussian process with kernel k(x, x ) = ϕθ(x) Σϕθ(x ) which quantifies the similarity between points in terms of their alignment in the learned latent space. Note that the correlation k(x, x )/ p

k(x, x)k(x , x ) between two points x, x is equal to the cosine similarity of their embeddings.

In this context, Theorem 3.2 bounds the epistemic posterior uncertainty about a prediction using the approximation β ϕθ(x), given that the model is trained using data selected by ITL or VTL. Theorem 3.3 bounds the generalization error when using the posterior mean of β for prediction. This extends recent work which has studied estimators of this generalization error (Wei et al., 2022).

Batch selection: Diversity via conditional embeddings Efficient labeling and training necessitates a batch-wise selection of inputs. The selection of a batch of size b > 1 can be seen as an individual non-adaptive active learning problem, and significant recent work has shown that batch diversity is crucial in this setting (Ash et al., 2020; Zanette et al., 2021; Holzmüller et al., 2023; Pacchiano et al., 2024). An information-based batch-wise selection strategy is formalized by the following nonadaptive transductive active learning problem (Chen & Krause, 2013) and the greedy approximation of Bn by ITL which selects elements xn,i of the n-th batch iteratively based on xn,1:i 1:

Bn = arg max B S,|B|=b I(f A; y B | Dn 1); xn,i = arg max x S I(f A; yx | Dn 1, yxn,1:i 1). (3)

The batch Bn is diverse and informative by design. We show that under Assumption 3.1, B n = xn,1:b yields a constant-factor approximation of Bn (cf. Appendix C.3).

4.1 Experiments on Active Fine-Tuning

Our empirical evaluation is motivated by the following practical example: We deploy a pre-trained image classifier to user s phones who use it within their local environment. We would like to locally fine-tune a user s model to their environment. Since the users images A are unlabeled, this requires selecting a small number of relevant and diverse images from the set of labeled images S. As such, we will focus here on the setting where the points in our test set do not lie in our training set (i.e., A S = ), and discuss alternative instances such as active domain adaptation in Appendix I.

Testbeds & architectures We use the MNIST (Le Cun et al., 1998) and CIFAR-100 (Krizhevsky et al., 2009) datasets as testbeds. In both cases, we take S to be the training set, and we consider the task of learning the digits 3, 6, and 9 (MNIST) or the first 10 categories of CIFAR-100.6 For MNIST, we train a simple convolutional neural network with Re LU activations, three convolutional layers with max-pooling, and two fully-connected layers. For CIFAR-100, we fine-tune an Efficient Net-B0 (Tan & Le, 2019) pre-trained on Image Net (Deng et al., 2009), augmented by a final fully-connected layer. We train the NNs using the cross-entropy loss and the ADAM optimizer (Kingma & Ba, 2014).

Results In Figure 4, We compare against (i) active learning methods which largely aim for sample diversity but which are not directed towards the target distribution PA (e.g., BADGE; Ash et al., 2020), and (ii) search methods that aim to retrieve the most relevant samples from S with respect to the targets PA (e.g., maximizing cosine similarity to target embeddings as is common in vector databases;

5The setting with target distributions PA can be reduced to considering target sets A (cf. Appendix E). 6That is, we restrict PA to the support of points with labels {3, 6, 9} (MNIST) or labels {0, . . . , 9} (CIFAR100) and train a neural network using few examples drawn from the training set S.

100 CIFAR-100

0 20 40 60 80 100 Number of Samples

# Samples from PA

0 20 40 60 80 100 Number of Batches of Size 10

CTL COSINESIMILARITY

Figure 4: Active fine-tuning on MNIST (left) and CIFAR-100 (right). RANDOM selects each observation uniformly at random from S. The batch size is 1 for MNIST and 10 for CIFAR-100. Uncertainty bands correspond to one standard error over 10 random seeds. We see that transductive active learning with ITL and VTL significantly outperforms competing methods, and in particular, retrieves substantially more samples from the support of PA. See Appendix J for details and ablations.

Settles & Craven, 2008; Johnson et al., 2019). INFORMATIONDENSITY (ID, Settles & Craven, 2008) is a heuristic approach aiming to combine (i) diversity and (ii) relevance. In Appendix J.5, we also compare against a wide range of additional baselines (e.g., CORESET (Sener & Savarese, 2017), TYPICLUST (Hacohen et al., 2022), PROBCOVER (Yehuda et al., 2022), etc.) that fall into one of the categories (i) and (ii), and which perform similar to the baselines listed here.

We observe that ITL, VTL, and CTL consistently and significantly outperform random sampling from S as well as all baselines. We see that relevance-based methods such as COSINESIMILARITY have an initial advantage over RANDOM but for batch sizes larger than 1 they quickly fall behind due to diminishing informativeness of the selected data. In contrast, diversity-based methods such as BADGE are more competitive with RANDOM but do not explicitly aim to retrieve relevant samples.

Remarkably, transductive active learning outperforms random data selection even in the MNIST experiment where the model is randomly initialized. This suggests that the learned embeddings can be informative for data selection even in the early stages of training, bootstrapping the learning progress.

Balancing sample relevance and diversity Our proposed methods unify approaches to coverage (promoting diverse samples) and search (aiming for relevant samples with respect to a given query A) which leads to the significant improvement upon the state-of-the-art in Figure 4. Notably, for a batch size and query size of 1 and if correlations are non-negative, ITL, VTL, CTL, and the canonical cosine similarity are equivalent. CTL can be seen as a direct generalization of cosine similarity-based retrieval to batch and query sizes larger than one. In contrast to CTL, ITL and VTL may also sample points which exhibit a strong negative correlation (which is also informative).

We observe empirically that ITL obtains samples from PA at more than twice the rate of COSINESIMILARITY, which translates to a significant improvement in accuracy in more difficult learning tasks, while requiring fewer (labeled) samples from S. This phenomenon manifests for both MNIST and CIFAR-100, as well as imbalanced datasets S or imbalanced reference samples from PA (cf. Appendix J.6). The improvement in accuracy appears to increase in the large-data regime, where the learning tasks become more difficult. Akin to a previously identified scaling trend with size of the pretraining dataset (Tamkin et al., 2022), this suggests a potential scaling trend where the improvement of ITL over random batch selection grows as models are fine-tuned on a larger pool of data.

Towards task-driven few-shot learning Being able to efficiently and automatically select data may allow dynamic few-shot fine-tuning to individual tasks (Vinyals et al., 2016; Hardt & Sun, 2024), e.g., fine-tuning the model to each test point / query / prompt. Such task-driven few-shot learning can be seen as a form of memory recall akin to associative memory (Hopfield, 1982). Our results are a first indication that task-driven learning can lead to substantial performance gains, and we believe that this is a promising direction for future studies.

5 Safe Bayesian Optimization

Another practical problem that can be cast as directed learning is safe Bayesian optimization (Safe BO, Sui et al., 2015; Berkenkamp et al., 2021) which has applications in natural science (Cooper & Netoff, 2022) and robotics (Wischnewski et al., 2019; Sukhija et al., 2023; Widmer et al., 2023). Safe BO solves the following optimization problem max x S f (x) where S = {x X | g (x) 0} (4)

which can be generalized to multiple constraints. The functions f and g , and hence also the safe set S , are unknown and have to be actively learned from data. However, it is crucial that the data collection does not violate the constraint, i.e., xn S , n 1.

Safe Bayesian optimization as Transductive Active Learning In the agnostic setting from Section 3.2, GPs f and g can be used as well-calibrated models of the ground truths f and g , and we denote lowerand upper-confidence bounds by lf n(x), lg n(x) and uf n(x), ug n(x), respectively. These confidence bounds induce a pessimistic safe set Sn def = {x | lg n(x) 0} and an optimistic safe set b Sn def = {x | ug n(x) 0} which satisfy Sn S b Sn with high probability at all times. Similarly, the set of potential maximizers

An def = {x b Sn | uf n(x) max x Sn lf n(x )} (5)

contains the solution to Equation (4) at all times with high probability.

The (simple) regret rn(S) def = maxx S f (x) f (bxn) with bxn def = arg maxx Sn lf n(x) measures the worst-case performance of a decision rule. To achieve small regret, one faces an explorationexpansion dilemma wherein one needs to explore points that are known-to-be-safe, i.e., lie in the estimated safe set Sn, and might be optimal, while at the same time discovering new safe points by expanding Sn. Accordingly, a natural choice for the target space of Safe BO is An since it captures both exploration and expansion simultaneously.7 To prevent constraint violation, the sample space is restricted to the pessimistic safe set Sn. In Safe BO, both the target space and sample space change with each round n, and we generalize our theoretical results from Section 3 in Appendix C to this setting. Theorem 5.1 (Convergence to safe optimum). Pick any ϵ > 0, δ (0, 1). Assume that f , g lie in the reproducing kernel Hilbert space Hk(X) of the kernel k, and that the noise εn is conditionally ρ-sub-Gaussian. Then, we have with probability at least 1 δ,

Safety: for all n 1, xn S .

Moreover, assume S0 = and denote with R the largest reachable safe set starting from S0. Then, the convergence of reducible uncertainty implies that there exists n > 0 such that with probability at least 1 δ,

Optimality: for all n n , rn(R) ϵ.

We provide a formal proof in Appendix C.8. Central to the proof is the application of Theorem 3.3 to show that the safety of parameters outside the safe set Sn can be inferred efficiently. In Section 3, we outline settings where the reducible uncertainty converges which is the case for a very general class of functions, and for such instances Theorem 5.1 guarantees optimality in the largest reachable safe set R. R represents the largest set any safe learning algorithm can explore without violating the safety constraints (with high probability) during learning (cf. Definition C.29). Our guarantees are similar to those of other Safe BO algorithms (Berkenkamp et al., 2021) but require fewer assumptions and generalize to continuous domains. We obtain Theorem 5.1 from a more general result (Theorem C.34) which can be specialized to yield free novel convergence guarantees for problems other than Bayesian optimization, such as level set estimation, by choosing an appropriate target space.

7An alternative possibility is to weigh each point in An according to how likely it is to be the safe optimum. Which approach performs better is task-dependent, and we include a detailed discussion in Appendix K.1.

0 1000 Number of Iterations 0.0

0 100 Number of Iterations 0.0

10.0 2d Task

0 50 Number of Iterations 0.0

Quadcopter Controller Tuning

ISE SAFEOPT

ORACLE SAFEOPT HEURISTIC SAFEOPT

Figure 5: We compare ITL and VTL to ORACLE SAFEOPT, which has oracle knowledge of the Lipschitz constants, SAFEOPT, where the Lipschitz constants are estimated from the GP, as well as HEURISTIC SAFEOPT and ISE, and observe that ITL and VTL systematically perform well. We compare against additional baselines in Appendix K.1. The regret is evaluated with respect to the ground truth objective f and constraint g , and averaged over 10 (in synthetic experiments) and 25 (in the quadcopter experiment) random seeds. Additional details can be found in Appendix K.4.

5.1 Experiments on Safe Bayesian Optimization

We evaluate two synthetic experiments for a 1d and 2d parameter space, respectively (cf. Appendix K.4 for details), which demonstrate the various shortcomings of existing Safe BO baselines. Additionally, as third experiment, we safely tune the controller of a quadcopter.

Safe controller tuning for a quadcopter We consider a quadcopter with unknown dynamics; st+1 = T (st, ut) where ut Rdu is the control signal and st Rds is the state at time t. The inputs ut are calculated through a deterministic function of the state π : S U which we call the policy. The policy is parameterized via parameters x X, e.g., PID controller gains, such that ut = πx(st). The goal is to find the optimal parameters with respect to an unknown objective f while satisfying some unknown constraint(s) g (x) 0, e.g., the quadcopter does not fall on the ground. This is a typical Safe BO problem which is widely applied for safe controller learning in robotics (Berkenkamp et al., 2021; Baumann et al., 2021; Widmer et al., 2023).

Results We compare ITL and VTL to SAFEOPT (Berkenkamp et al., 2021), which is undirected, i.e., expands in all directions including ones that are known-to-be suboptimal, and ISE (Bottero et al., 2022), which is solely expansionist does not trade-off expansion-exploration. We provide a detailed discussion of baselines in Appendix K.2. In all our experiments, summarized in Figure 5, we observe that ITL and VTL systematically perform well, i.e., better or on par with the state-of-the-art. We attribute this to its directed exploration and less conservative expansion over SAFEOPT (cf. 1d task and quadcopter experiment), and natural trade-off between expansion and exploration as opposed to ISE (see 2d task). Generally, VTL has a slight advantage over ITL, which is because VTL minimizes marginal variances (as opposed to entropy), which are decisive for expanding the safe set. While ITL and VTL do not violate constraints, we observe that other methods that do not explicitly enforce safety such as EIC (Gardner et al., 2014) lead to constraint violation (cf. Appendix K.4.2).

6 Related Work

(Inductive) active learning The special case of transductive active learning where A = S = X has been widely studied. We refer to this special instance as inductive active learning, since the goal is to extract as much information as possible as opposed to making predictions on a specific target set.

Several works have previously found entropy-based decision rules to be useful for inductive active learning (Krause & Guestrin, 2007; Guo & Greiner, 2007; Krause et al., 2008) and semi-supervised learning (Grandvalet & Bengio, 2004). The variance-based VTL has previously been proposed by Cohn (1993) in the special case of inductive active learning without proving theoretical guarantees. VTL was then recently re-derived by Shoham & Avron (2023) along other experimental design

criteria under the lens of minimizing risk for inductive one-shot learning in overparameterized models. Substantial work on active learning has studied entropy-based criteria in parameter-space, most notably BALD (Mac Kay, 1992; Houlsby et al., 2011; Gal et al., 2017; Kirsch et al., 2019), which selects xn = arg maxx X I(θ; yx | Dn 1), where θ is the random parameter vector of a parametric model (e.g., obtained via Bayesian deep learning). Such methods are inherently inductive in the sense that they do not facilitate learning on specific prediction targets.

Transductive active learning In contrast, ITL operates in output-space where it is straightforward to specify prediction targets, and which is computationally easier. Special cases of ITL when S = X and |A| = 1 have been proposed in the foundational work of Mac Kay (1992) on directed output-space active learning. As generalization to larger target spaces, Mac Kay (1992) proposed mean-marginal ITL,

xn = arg max x S

x A I(fx ; yx | Dn 1) , (MM-ITL)

for which we derive analogous versions of Theorems 3.2 and 3.3 in Appendix D.3. We note that similarly to VTL, MM-ITL disregards the mutual dependence of points in the target space A and differs from VTL only in a different weighting of the posterior marginal variances of the prediction targets (cf. Appendix D.3). Recently, Bickford Smith et al. (2023) generalized MM-ITL by treating the prediction target as a random variable, and Kothawade et al. (2021) and Bickford Smith et al. (2024) demonstrated the use of output-space decision rules for image classification tasks in a pre-training context.

Influence functions measure the change in a model s prediction when a single data point is removed from the training data (Cook, 1977; Koh & Liang, 2017; Pruthi et al., 2019). Influence functions have been used for data selection in settings closely related to the transductive active fine-tuning of neural networks proposed in this work (Xia et al., 2024). They select data that reduces a first-order Taylor approximation to the test loss after fine-tuning a neural network, which corresponds to maximizing cosine similarity to the prediction targets in a loss-gradient embedding space. We show in our experiments that transductive active learning can substantially outperform COSINESIMILARITY. We attribute this primarily to influence functions implicitly assuming that the influence of selected data adds linearly (i.e., two equally scored data points are expected to doubly improve the model performance, Xu & Kazantsev, 2019, Section 3.2). This assumption does not hold in practice as seen, e.g., by simply duplicating data. The same limitation applies to the related approach of datamodels (Ilyas et al., 2022).

Other work on directed active learning Directed active learning methods have been proposed for the problem of determining the optimum of an unknown function, also known as best-arm identification (Audibert et al., 2010) or pure exploration bandits (Bubeck et al., 2009). Entropy search methods (Hennig & Schuler, 2012; Hernández-Lobato et al., 2014) are widely used and select xn = arg maxx X I(x ; yx | Dn 1) in input-space where x = arg maxx fx. Similarly to ITL, output-space entropy search methods (Hoffman & Ghahramani, 2015; Wang & Jegelka, 2017), which select xn = arg maxx X I(f ; yx | Dn 1) with f = maxx fx, are more computationally tractable. In fact, output-space entropy search is a special case of ITL with a stochastic target space (cf. Equation (47) in Appendix K.1). Bogunovic et al. (2016) analyze TRUVAR in the context of Bayesian optimization and level set estimation. TRUVAR is akin to VTL with a similar notion of target space , but their algorithm and analysis rely on a threshold scheme which requires that A S. Fiez et al. (2019) introduce the transductive linear bandit problem, which is a special case of transductive active learning limited to a linear function class and with the objective of determining the maximum within an initial candidate set.8 We mention additional more loosely related works in Appendix A.

7 Conclusion

We investigated the generalization of active learning to settings with concrete prediction targets and/or with limited information due to constrained sample spaces. This provides a flexible framework, applicable also to other domains than were discussed (such as recommender systems, molecular design, robotics, etc.) by varying the choice of target space and sample space. Further, we proved novel generalization bounds which may be of independent interest for active learning. Finally, we demonstrated across broad applications that sampling relevant and diverse points (as opposed to only one of the two) leads to a substantial improvement upon the state-of-the-art.

8The transductive bandit problem can be solved analogously to Safe BO, by maintaining the set An.

Acknowledgements

Many thanks to Armin Lederer, Johannes Kirschner, Jonas Rothfuss, Lars Lorch, Manish Prajapat, Nicolas Emmenegger, Parnian Kassraie, and Scott Sussex for their insightful feedback on different versions of this manuscript, as well as Anton Baumann for helpful discussions. We further thank Freddie Bickford Smith for a constructive discussion regarding the relationship between our work and prior work.

This project was supported in part by the European Research Council (ERC) under the European Union s Horizon 2020 research and Innovation Program Grant agreement no. 815943, the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545, and by a grant of the Hasler foundation (grant no. 21039).

Abbasi-Yadkori, Y. Online learning for linearly parametrized control problems. Ph D thesis, University of Alberta, 2013.

Antorán, J., Janz, D., Allingham, J. U., Daxberger, E., Barbano, R. R., Nalisnick, E., and Hernández Lobato, J. M. Adapting the linearised laplace model evidence for modern deep learning. In ICML, 2022.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. Neur IPS, 32, 2019.

Arthur, D., Vassilvitskii, S., et al. k-means++: The advantages of careful seeding. In SODA, volume 7, 2007.

Ash, J., Goel, S., Krishnamurthy, A., and Kakade, S. Gone fishing: Neural active learning with fisher embeddings. Neur IPS, 34, 2021.

Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., and Agarwal, A. Deep batch active learning by diverse, uncertain gradient lower bounds. ICLR, 2020.

Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, 2010.

Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., Bordes, F., Bardes, A., Mialon, G., Tian, Y., et al. A cookbook of self-supervised learning. ar Xiv preprint ar Xiv:2304.12210, 2023.

Barrett, A. B. Exploration of synergistic and redundant information sharing in static and dynamical gaussian systems. Physical Review E, 91(5), 2015.

Baumann, D., Marco, A., Turchetta, M., and Trimpe, S. Gosafe: Globally optimal safe robot learning. In ICRA, 2021.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In ICML, volume 26, 2009.

Beraha, M., Metelli, A. M., Papini, M., Tirinzoni, A., and Restelli, M. Feature selection via mutual information: New theoretical insights. In IJCNN, 2019.

Berkenkamp, F., Schoellig, A. P., and Krause, A. Safe controller optimization for quadrotors with gaussian processes. In ICRA, 2016.

Berkenkamp, F., Krause, A., and Schoellig, A. P. Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. Machine Learning, 2021.

Berlind, C. and Urner, R. Active nearest neighbors in changing environments. In ICML, 2015.

Bickford Smith, F., Kirsch, A., Farquhar, S., Gal, Y., Foster, A., and Rainforth, T. Prediction-oriented bayesian active learning. In AISTATS, 2023.

Bickford Smith, F., Foster, A., and Rainforth, T. Making better use of unlabelled data in bayesian active learning. In AISTATS, 2024.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural network. In ICML, 2015.

Bogunovic, I., Scarlett, J., Krause, A., and Cevher, V. Truncated variance reduction: A unified approach to bayesian optimization and level-set estimation. Neur IPS, 29, 2016.

Bottero, A., Luis, C., Vinogradska, J., Berkenkamp, F., and Peters, J. R. Information-theoretic safe exploration with gaussian processes. Neur IPS, 35, 2022.

Bottero, A. G., Luis, C. E., Vinogradska, J., Berkenkamp, F., and Peters, J. Information-theoretic safe bayesian optimization. ar Xiv preprint ar Xiv:2402.15347, 2024.

Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in multi-armed bandits problems. In ALT, volume 20, 2009.

Chaloner, K. and Verdinelli, I. Bayesian experimental design: A review. Statistical Science, 1995.

Chandra, B. Quadrotor simulation, 2023. URL https://github.com/Bharath2/ Quadrotor-Simulation.

Chen, Y. and Krause, A. Near-optimal batch mode active learning and adaptive submodular optimization. In ICML, 2013.

Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In ICML, 2017.

Cohn, D. Neural network exploration using optimal experiment design. Neur IPS, 6, 1993.

Coleman, C., Chou, E., Katz-Samuels, J., Culatana, S., Bailis, P., Berg, A. C., Nowak, R., Sumbaly, R., Zaharia, M., and Yalniz, I. Z. Similarity search for efficient active learning and search of rare concepts. In AAAI, volume 36, 2022.

Cook, R. D. Detection of influential observation in linear regression. Technometrics, 19(1), 1977.

Cooper, S. E. and Netoff, T. I. Multidimensional bayesian estimation for deep brain stimulation using the safeopt algorithm. med Rxiv, 2022.

Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.

Das, A. and Kempe, D. Algorithms for subset selection in linear regression. In STOC, volume 40, 2008.

Das, A. and Kempe, D. Approximate submodularity and its applications: Subset selection, sparse approximation and dictionary selection. JMLR, 19(1), 2018.

Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P. Laplace redux-effortless bayesian deep learning. Neur IPS, 34, 2021.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

Emmenegger, N., Mutn y, M., and Krause, A. Likelihood ratio confidence sets for sequential decision making. Neur IPS, 37, 2023.

Esfandiari, H., Karbasi, A., and Mirrokni, V. Adaptivity in adaptive submodularity. In COLT, 2021.

Eustratiadis, P., Dudziak, Ł., Li, D., and Hospedales, T. Neural fine-tuning search for few-shot learning. ICLR, 2024.

Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. Neur IPS, 32, 2019.

Fu, B., Cao, Z., Wang, J., and Long, M. Transferable query selection for active domain adaptation. In CVPR, 2021.

Gal, Y., Islam, R., and Ghahramani, Z. Deep bayesian active learning with image data. In ICML, 2017.

Gao, M., Zhang, Z., Yu, G., Arık, S. Ö., Davis, L. S., and Pfister, T. Consistency-based semisupervised active learning: Towards minimizing labeling cost. In ECCV, 2020.

Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. Bayesian optimization with inequality constraints. In ICML, volume 2014, 2014.

Geifman, Y. and El-Yaniv, R. Deep active learning over the long tail. ar Xiv preprint ar Xiv:1711.00941, 2017.

Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. Neur IPS, 17, 2004.

Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. Automated curriculum learning for neural networks. In ICML, 2017.

Graybill, F. A. An introduction to linear statistical models. Literary Licensing, LLC, 1961.

Guo, Y. and Greiner, R. Optimistic active-learning using mutual information. In IJCAI, volume 7, 2007.

Hacohen, G., Dekel, A., and Weinshall, D. Active learning on a budget: Opposite strategies suit high and low budgets. ICML, 2022.

Hardt, M. and Sun, Y. Test-time training on nearest neighbors for large language models. ICLR, 2024.

He, B., Lakshminarayanan, B., and Teh, Y. W. Bayesian deep ensembles via the neural tangent kernel. Neur IPS, 33, 2020.

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR, 2017.

Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. JMLR, 13 (6), 2012.

Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. Neur IPS, 27, 2014.

Hoffman, M. W. and Ghahramani, Z. Output-space predictive entropy search for flexible global optimization. In Neur IPS workshop on Bayesian Optimization, 2015.

Holzmüller, D., Zaverkin, V., Kästner, J., and Steinwart, I. A framework and benchmark for deep batch active learning for regression. JMLR, 24(164), 2023.

Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8), 1982.

Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. Bayesian active learning for classification and preference learning. Co RR, 2011.

Hübotter, J., Sukhija, B., Treven, L., As, Y., and Krause, A. Active few-shot fine-tuning. ICLR workshop on Bridging the Gap Between Practice and Theory in Deep Learning, 2024.

Ilyas, A., Park, S. M., Engstrom, L., Leclerc, G., and Madry, A. Datamodels: Predicting predictions from training data. ar Xiv preprint ar Xiv:2202.00622, 2022.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Neur IPS, 31, 2018.

Johnson, J., Douze, M., and Jégou, H. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3), 2019.

Kaddour, J., Sæmundsson, S., et al. Probabilistic active meta-learning. Neur IPS, 33, 2020.

Kassraie, P. and Krause, A. Neural contextual bandits without regret. In AISTATS, 2022.

Khan, M. E. E., Immer, A., Abedi, E., and Korzepa, M. Approximate inference turns deep networks into gaussian processes. Neur IPS, 32, 2019.

Khanna, R., Elenberg, E., Dimakis, A., Negahban, S., and Ghosh, J. Scalable greedy feature selection via weak submodularity. In AISTATS, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2014.

Kirsch, A. Black-box batch active learning for regression. ar Xiv preprint ar Xiv:2302.08981, 2023.

Kirsch, A., Van Amersfoort, J., and Gal, Y. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. Neur IPS, 32, 2019.

Kirschner, J., Mutny, M., Hiller, N., Ischebeck, R., and Krause, A. Adaptive and safe bayesian optimization in high dimensions via one-dimensional subspaces. In ICML, 2019.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In ICML, 2017.

Kothawade, S., Beck, N., Killamsetty, K., and Iyer, R. Similar: Submodular information measures based active learning in realistic scenarios. Neur IPS, 34, 2021.

Krause, A. and Golovin, D. Submodular function maximization. Tractability, 3, 2014.

Krause, A. and Guestrin, C. Nonmyopic active learning of gaussian processes: an explorationexploitation approach. In ICML, volume 24, 2007.

Krause, A., Singh, A., and Guestrin, C. Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9(2), 2008.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Kumari, L., Wang, S., Das, A., Zhou, T., and Bilmes, J. An end-to-end submodular framework for data-efficient in-context learning. In NAACL, 2024.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Neur IPS, 30, 2017.

Le Cun, Y., Cortes, C., and Burges, C. J. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as gaussian processes. ICLR, 2018.

Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. Neur IPS, 32, 2019.

Lee, Y., Chen, A. S., Tajwar, F., Kumar, A., Yao, H., Liang, P., and Finn, C. Surgical fine-tuning improves adaptation to distribution shifts. Neur IPS workshop on Distribution Shifts, 2022.

Lewis, D. and Gale, W. A sequential algorithm for training text classifiers. In SIGIR, 1994.

Lewis, D. D. and Catlett, J. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings. 1994.

Mac Kay, D. J. Information-based objective functions for active data selection. Neural computation, 4(4), 1992.

Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for bayesian uncertainty in deep learning. Neur IPS, 32, 2019.

Malladi, S., Wettig, A., Yu, D., Chen, D., and Arora, S. A kernel-based view of language model fine-tuning. In ICML, 2023.

Martens, J. and Grosse, R. Optimizing neural networks with kronecker-factored approximate curvature. In ICML, 2015.

Mehta, R., Shui, C., Nichyporuk, B., and Arbel, T. Information gain sampling for active learning in medical image classification. In UNSURE, 2022.

Murphy, K. P. Probabilistic machine learning: Advanced topics. MIT Press, 2023.

Mutny, M. and Krause, A. Experimental design for linear functionals in reproducing kernel hilbert spaces. Neur IPS, 35, 2022.

Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An analysis of approximations for maximizing submodular set functions i. Mathematical programming, 14, 1978.

Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of lloyd-type methods for the k-means problem. JACM, 2013.

Pacchiano, A., Lee, J. N., and Brunskill, E. Experiment planning with function approximation. Neur IPS, 37, 2024.

Peng, H., Long, F., and Ding, C. Feature selection based on mutual information criteria of maxdependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence, 27(8), 2005.

Prabhu, V., Chandrasekaran, A., Saenko, K., and Hoffman, J. Active domain adaptation via clustering uncertainty-weighted embeddings. In ICCV, 2021.

Pruthi, G., Liu, F., Kale, S., and Sundararajan, M. Estimating training data influence by tracing gradient descent. In Neur IPS, 2019.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Neur IPS, 20, 2007.

Rai, P., Saha, A., Daumé III, H., and Venkatasubramanian, S. Domain adaptation meets active learning. In NAACL HLT workshop on Active Learning for Natural Language Processing, 2010.

Rothfuss, J., Koenig, C., Rupenyan, A., and Krause, A. Meta-learning priors for safe bayesian optimization. In COLT, 2023.

Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. A tutorial on thompson sampling. Foundations and Trends in Machine Learning, 11(1), 2018.

Saha, A., Rai, P., Daumé, H., Venkatasubramanian, S., and Du Vall, S. L. Active supervised domain adaptation. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD, 2011.

Scheffer, T., Decomain, C., and Wrobel, S. Active hidden markov models for information extraction. In IDA, 2001.

Schreiter, J., Nguyen-Tuong, D., Eberts, M., Bischoff, B., Markert, H., and Toussaint, M. Safe exploration for active learning with gaussian processes. In ECML PKDD, 2015.

Sener, O. and Savarese, S. Active learning for convolutional neural networks: A core-set approach. ICLR, 2017.

Seo, S., Wallat, M., Graepel, T., and Obermayer, K. Gaussian process regression: Active data selection and test point rejection. In Mustererkennung 2000. Springer, 2000.

Settles, B. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.

Settles, B. and Craven, M. An analysis of active learning strategies for sequence labeling tasks. In EMNLP, 2008.

Shoham, N. and Avron, H. Experimental design for overparameterized learning with application to single shot deep active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Shwartz-Ziv, R. and Le Cun, Y. To compress or not to compress self-supervised learning and information theory: A review. ar Xiv preprint ar Xiv:2304.09355, 2023.

Soviany, P., Ionescu, R. T., Rota, P., and Sebe, N. Curriculum learning: A survey. IJCV, 2022.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, volume 27, 2009.

Strang, G. Introduction to linear algebra. SIAM, 5 edition, 2016.

Su, J.-C., Tsai, Y.-H., Sohn, K., Liu, B., Maji, S., and Chandraker, M. Active adversarial domain adaptation. In WACV, 2020.

Sui, Y., Gotovos, A., Burdick, J., and Krause, A. Safe exploration for optimization with gaussian processes. In ICML, 2015.

Sukhija, B., Turchetta, M., Lindner, D., Krause, A., Trimpe, S., and Baumann, D. Gosafeopt: Scalable safe exploration for global optimization of dynamical systems. Artificial Intelligence, 2023.

Tamkin, A., Nguyen, D., Deshpande, S., Mu, J., and Goodman, N. Active learning helps pretrained models learn the intended task. Neur IPS, 35, 2022.

Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.

Tu, S., Frostig, R., Singh, S., and Sindhwani, V. JAX: A python library for differentiable optimal control on accelerators, 2023. URL http://github.com/google/trajax.

Turchetta, M., Berkenkamp, F., and Krause, A. Safe exploration for interactive machine learning. Neur IPS, 32, 2019.

Vakili, S., Khezeli, K., and Picheny, V. On information gain and regret bounds in gaussian process bandits. In AISTATS, 2021.

Vapnik, V. Estimation of dependences based on empirical data. Springer Science & Business Media, 1982.

Vergara, J. R. and Estévez, P. A. A review of feature selection methods based on mutual information. Neural computing and applications, 24, 2014.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. Neur IPS, 29, 2016.

Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.

Wang, C., Sun, S., and Grosse, R. Beyond marginal uncertainty: How accurately can bayesian regression models estimate posterior predictive correlations? In AISTATS, 2021.

Wang, Z. and Jegelka, S. Max-value entropy search for efficient bayesian optimization. In ICML, 2017.

Wei, A., Hu, W., and Steinhardt, J. More than a toy: Random matrix models predict how real-world neural representations generalize. In ICML, 2022.

Widmer, D., Kang, D., Sukhija, B., Hübotter, J., Krause, A., and Coros, S. Tuning legged locomotion controllers via safe bayesian optimization. CORL, 2023.

Wilks, S. S. Certain generalizations in the analysis of variance. Biometrika, 1932.

Williams, C. K. and Rasmussen, C. E. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006.

Wischnewski, A., Betz, J., and Lohmann, B. A model-free algorithm to safely approach the handling limit of an autonomous racecar. In ICCVE, 2019.

Xia, M., Malladi, S., Gururangan, S., Arora, S., and Chen, D. Less: Selecting influential data for targeted instruction tuning. In ICML, 2024.

Xu, M. and Kazantsev, G. Understanding goal-oriented active learning via influence functions. In Neur IPS Workshop on Machine Learning with Guarantees, 2019.

Ye, J., Wu, Z., Feng, J., Yu, T., and Kong, L. Compositional exemplars for in-context learning. In ICML, 2023.

Yehuda, O., Dekel, A., Hacohen, G., and Weinshall, D. Active learning through a covering lens. Neur IPS, 35, 2022.

Yu, H. and Kim, S. Passive sampling for regression. In ICDM, 2010.

Yu, K., Bi, J., and Tresp, V. Active learning via transductive experimental design. In ICML, volume 23, 2006.

Zanette, A., Dong, K., Lee, J. N., and Brunskill, E. Design of experiments for stochastic contextual linear bandits. Neur IPS, 34, 2021.

Zheng, H., Liu, R., Lai, F., and Prakash, A. Coverage-centric coreset selection for high pruning rates. ICLR, 2023.

A general principle of transductive learning was already formulated by the famous computer scientist Vladimir Vapnik in the 20th century. Vapnik proposes the following imperative for a complex world :

When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need but not a more general one.

Vapnik (1982)

These appendices provide additional background, proofs, experiment details, and ablation studies.

A Additional Related Work 20

B Background 20

B.1 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

B.2 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

C Proofs 21

C.1 Undirected Case of ITL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

C.2 Non-adaptive Data Selection & Submodularity . . . . . . . . . . . . . . . . . . . 21

C.3 Batch Diversity: Batch Selection as Non-adaptive Data Selection . . . . . . . . . . 22

C.4 Measures of Synergies & Approximate Submodularity . . . . . . . . . . . . . . . 23

C.5 Convergence of Marginal Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

C.6 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

C.7 Proof of Theorem 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

C.8 Proof of Theorem 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

C.9 Useful Facts and Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

D Interpretations & Approximations of Principle ( ) 37

D.1 Interpretations of ITL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

D.2 Interpretations of VTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

D.3 Mean Marginal ITL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

D.4 Correlation-based Transductive Learning . . . . . . . . . . . . . . . . . . . . . . . 40

D.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

E Stochastic Target Spaces 41

F Closed-form Decision Rules 41

G Computational Complexity 42

H Additional GP Experiments & Details 42

I Alternative Settings for Active Fine-Tuning 43

I.1 Prediction Targets are Contained in Sample Space: A S . . . . . . . . . . . . . 43

I.2 Active Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

J Additional NN Experiments & Details 44

J.1 Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

J.2 Embeddings and Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

J.3 Towards Uncertainty Quantification in Latent Space . . . . . . . . . . . . . . . . . 47

J.4 Batch Selection via Conditional Embeddings . . . . . . . . . . . . . . . . . . . . 47

J.5 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

J.6 Additional experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

J.7 Ablation study of noise standard deviation ρ . . . . . . . . . . . . . . . . . . . . . 52

K Additional Safe BO Experiments & Details 54

K.1 A More Exploitative Stochastic Target Space . . . . . . . . . . . . . . . . . . . . 54

K.2 Detailed Comparison with Prior Works . . . . . . . . . . . . . . . . . . . . . . . . 55

K.3 Jumping Past Local Barriers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

K.4 Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

A Additional Related Work

The general principle of non-active transductive learning was introduced by Vapnik (1982). The notion of target from transductive active learning is akin to the notion of task in curriculum learning (Bengio et al., 2009; Graves et al., 2017; Soviany et al., 2022). The study of settings where the irreducible uncertainty is zero is related to the study of estimability in experimental design (Graybill, 1961; Mutny & Krause, 2022). In feature selection, selecting features that maximize information gain with respect to a to-be-predicted label is a standard approach (Peng et al., 2005; Vergara & Estévez, 2014; Beraha et al., 2019) which is akin to ITL (cf. Appendix D). The themes of relevance and diversity are also important for efficient in-context learning (e.g., Ye et al., 2023; Kumari et al., 2024) and data pruning (Zheng et al., 2023). Transductive active learning is complimentary to other learning methodologies, such as semi-supervised learning (Gao et al., 2020), self-supervised learning (Shwartz-Ziv & Le Cun, 2023; Balestriero et al., 2023), and meta-learning (Kaddour et al., 2020; Rothfuss et al., 2023).

B Background

B.1 Information Theory

Throughout this work, log denotes the natural logarithm. Given random vectors x and y, we denote by

H[x] def = Ep(x)[ log p(x)],

H[x | y] def = Ep(x,y)[ log p(x | y)], and

I(x; y) def = H[x] H[x | y]

the (differential) entropy, conditional entropy, and information gain, respectively (Cover, 1999).9

The multivariate information gain (Murphy, 2023) between random vectors x, y, z is given by

I(x; y; z) def = I(x; y) I(x; y | z) (6) = I(x; y) + I(x; z) I(x; y, z). (7)

When I(x; y; z) = 0 it is said that y and z interact regarding their information about x. If the interaction is positive, it is said that the information of z about x is redundant given y. Conversely, if the interaction is negative, it is said that the information of z about x is synergistic with y. The notion of synergy is akin to the frequentist notion of suppressor variables in linear regression (Das & Kempe, 2008).

B.2 Gaussian Processes

The stochastic process f is a Gaussian process (GP, Williams & Rasmussen (2006)), denoted f GP(µ, k), with mean function µ and kernel k if for any finite subset X = {x1, . . . , xn} X, f X N(µX, KXX) is jointly Gaussian with mean vector µX(i) = µ(xi) and covariance matrix KXX(i, j) = k(xi, xj).

In the following, we formalize the assumptions from the GP setting (cf. Section 3.1). Assumption B.1 (Gaussian prior). We assume that f GP(µ, k) with known mean function µ and kernel k. Assumption B.2 (Gaussian noise). We assume that the noise εx is mutually independent and zero-mean Gaussian with known variance ρ2(x) > 0. We write PX = diag ρ2(x1), . . . , ρ2(xn).

Under Assumptions B.1 and B.2, the posterior distribution of f after observing points X is GP(µn, kn) with

µn(x) = µ(x) + Kx X(KXX + PX) 1(y X µX),

kn(x, x ) = k(x, x ) Kx X(KXX + PX) 1KXx ,

σ2 n(x) = kn(x, x).

9One has to be careful to ensure that I(x; y) exists, i.e., |I(x; y)| < . We will assume that this is the case throughout this work. When x and y are jointly Gaussian, this is satisfied when the noise variance ρ2 is positive.

For Gaussian random vectors f and y, the entropy is H[f] = n

2 log(2πe) + 1

2 log |Var[f]|, the information gain is I(f; y) = 1

2(log |Var[y]| log |Var[y | f]|), and

γn = max X X |X| n

1 2 log I + P 1 X KXX .

We will write

σ2 def = maxx X σ2 0(x), and σ2 def = maxx X σ2 0(x) + ρ2(x).

The following is a brief overview of the structure of this section:

1. Appendix C.1 relates ITL in the inductive learning setting (S A) to prior work. 2. Appendix C.2 relates the designs selected by ITL and VTL to the optimal designs for corresponding non-adaptive objectives. 3. Appendix C.3 shows that batch selection via ITL or VTL leads to informative and diverse batches, utilizing the results from Appendix C.2. 4. Appendix C.4 introduces measures of synergies that generalize the submodularity assumption (cf. Assumption 3.1). 5. Appendix C.5 proves key results on the convergence of the ITL and VTL objectives. 6. Appendix C.6 proves Theorem 3.2 (convergence in GP setting). 7. Appendix C.7 proves Theorem 3.3 (convergence in agnostic setting). 8. Appendix C.8 proves Theorem 5.1 (convergence in safe BO application). 9. Appendix C.9 includes useful facts.

C.1 Undirected Case of ITL

We briefly examine the important special case of ITL where S A. In this setting, for all x S, the decision rule of ITL simplifies to

I(f A; yx | Dn) (i) = I(f A\{x}; yx | fx, Dn) + I(fx; yx | Dn)

(ii) = I(fx; yx | Dn) = H[yx | Dn] H[εx]

where (i) follows from the chain rule of information gain and x S A; and (ii) follows from the conditional independence f A yx | fx.

If additionally f is a GP then

H[yx | Dn] H[εx] = 1

2 log 1 + Var[fx | Dn]

This decision rule has also been termed total information gain (Mac Kay, 1992). When S A and observation noise is homoscedastic, this decision rule is equivalent to uncertainty sampling.

C.2 Non-adaptive Data Selection & Submodularity

Recall the non-myopic information gain ψA(X) = I(f A; y X) (ITL) and variance reduction ψA(X) = tr Var[f A] tr Var[f A | y X] (VTL) objective functions from Assumption 3.1. In this section, we will relate the designs selected by ITL and VTL to the optimal designs for these objectives. To this end, consider the non-adaptive optimization problem

X = arg max X S |X|=k

Lemma C.1. For both ITL and VTL, ψA is non-negative and monotone.

Proof. For ITL, ψA(X) 0 follows from the non-negativity of mutual information. To conclude monotonicity, note that for any X X S,

I(f A; y X ) = H[f A] H[f A | y X ] H[f A] H[f A | y X] = I(f A; y X)

due to monotonicity of conditional entropy (which is also called the information never hurts principle).

For VTL, recall that tr Var[f A | y X] tr Var[f A | y X ] for any X X S (with an implicit expectation over y X, y X ). Non-negativity and monotonicity of ψA then follow analogously to ITL.

Lemma C.2. The marginal gain A(x | X) def = ψA(X {x}) ψA(X) of x S given X S is the ITL and VTL objective, respectively.

Proof. For ITL,

A(x | X) = I(f A; y X, yx) I(f A; y X) = H[f A | y X] H[f A | y X, yx] = I(f A; yx | y X)

which is precisely the ITL objective.

A(x | X) = tr Var[f A | y X] tr Var[f A | y X, yx] = tr Var[f A | y X, yx] + const

which is precisely the VTL objective.

Definition C.3 (Submodularity). ψA is submodular if and only if for all x S and X X S,

A(x | X ) A(x | X).

Theorem C.4 (Nemhauser et al. (1978)). Let Assumption 3.1 hold. For any n 1, if ITL or VTL selected x1:n, respectively, then ψA(x1:n) (1 1/e) max X S |X| n ψA(X).

Proof. This is a special case of a canonical result from non-negative monotone submodular function maximization (Nemhauser et al., 1978; Krause & Golovin, 2014).

C.3 Batch Diversity: Batch Selection as Non-adaptive Data Selection

Recall the non-adaptive optimization problem

Bn,k = arg max B S |B|=k

I(f A; y B | Dn 1)

from Equation (3) with batch size k > 0, and denote by B n,k = xn,1:k the greedy approximation from Equation (3). The selection of an individual batch can be seen as a single non-adaptive optimization problem with marginal gain

n(x | B) = I(f A; y B, yx | Dn 1) I(f A; y B | Dn 1) = H[f A | Dn 1, y B] H[f A | Dn 1, y B, yx] = I(f A; yx | Dn 1, y B)

and which is precisely the objective function of ITL from Equation (3). Hence, the approximation guarantees from Theorems C.4 and C.11 apply. The derivation is analogous for VTL.

Prior work has shown that the greedy solution B n is also competitive with a fully sequential batchless decision rule (Chen & Krause, 2013; Esfandiari et al., 2021).

C.4 Measures of Synergies & Approximate Submodularity

We will now show that downstream synergies , if present, can be seen as a source of learning complexity, which is orthogonal to the information capacity γn.

Example C.5. Consider the example where f is a stochastic process of three random variables X, Y, Z where X and Y are Bernoulli (p = 1 2), and Z is the XOR of X and Y . Suppose that observations are exact (i.e., εn = 0), that the target space A comprises the output variable Z while the sample space S comprises the input variables X and Y . Observing any single X or Y yields no information about Z: I(Z; X) = I(Z; Y ) = 0, however, observing both inputs jointly perfectly determines Z: I(Z; X, Y ) = 1. Thus, γn(A; S) = 1 if n 2 and γn(A; S) = 0 else.

Learning about Z in examples of this kind is difficult for agents that make decisions greedily, since the next action (observing X or Y ) yields no signal about its long-term usefulness. We call a sequence of observations, such as {X, Y }, synergistic since its combined information value is larger than the individual values. The prevalence of synergies is not captured by the information capacity γn(A; S) since it measures only the joint information gain of n samples within S. Instead, the prevalence of synergies is captured by the sequence Γn def = maxx S A(x | x1:n), which measures the maximum information gain of yn+1. If Γn > Γn 1 at any round n, this indicates a synergy. The following key object measures the additional complexity due to synergies.

Definition C.6 (Task complexity). For n 1, assuming Γi > 0 for all 1 i n, we define the task complexity as

αA,S(n) def = max i {0,...,n 1} Γn 1

Note that αA,S(n) is large only if the information gain of yn is larger than that of a previous observation yi. Intuitively, if αA,S(n) is large, the agent had to discover the implicit intermediate observations y1, . . . , yn 1 that lead to downstream synergies. We will subsequently formalize the intimate connections of the task complexity to synergies and submodularity. Note that in the GP setting, αA,S(n) can be computed online by keeping track of the smallest Γi during previous rounds i. Further, note that αA,S(n) 1 if ψA is submodular.

C.4.1 The Information Ratio

Another object will prove useful in our analysis of synergies.

Consider an alternative multiplicative interpretation of the multivariate information gain (cf. Equation (7)), which we call the information ratio of X S given D S, |X|, |D| < :

κ(X | D) def = P

x X A(x | D)

A(X | D) [0, ). (8)

Observe that κ(X | D) measures the synergy properties of y X with respect to f A given y D in a multiplicative sense. That is, if κ(X | D) > 1 then information in y X is redundant, whereas if κ(X | D) < 1 then information in y X is synergistic, and if κ(X | D) = 1 then y X do not mutually interact with respect to f A (all given y D). In the degenerate case where A(X | D) = 0 (which implies P

x X A(x | D) = 0) we therefore let κ(X | D) = 1.

The information ratio of ITL is strictly positive in the Gaussian case We prove the following straightforward lower bound to the information ratio of ITL.

Lemma C.7. Let X, D S, |X|, |D| < . If f A and y X D are jointly Gaussian then κ(X | D) > 0.

Proof. W.l.o.g. assume D = . We let X = {x1, . . . , xk} and prove lower and upper bound separately. We assume w.l.o.g. that I(f A; y X) > 0 which implies |Var[f A | y X]| < |Var[f A]|. Thus, there exists some i such that f A and yxi are dependent, so |Var[f A | yxi]| < |Var[f A]| which implies I(f A; yxi) > 0. We therefore conclude that κ(X) > 0.

The following example shows that this lower bound is tight.

Example C.8 (Synergies of Gaussian random variables, inspired by Section 3 of Barrett (2015)). Consider the three random variables X, Y , and Z (think A = {X} and S = {Y, Z}) which are jointly Gaussian with mean vector 0 and covariance matrix

"1 a a a 1 0 a 0 1

, for 2a2 < 1

where the constraint on a is to ensure that Σ is positive definite. Computing the mutual information, we have

I(X; Y ) = I(X; Z) = 1

2 log(1 a2)

and I(X; Y, Z) = 1

2 log(1 2a2). Therefore, I(X; Y ) + I(X; Z)

I(X; Y, Z) = log(1 2a2 + a4)

log(1 2a2) < 1.

log(1 2a2 + a4)

log(1 2a2) = 0,

and hence perhaps unintuitively even if Y and Z are uncorrelated, their information about X may be arbitrarily synergistic.

C.4.2 The Submodularity of the Special Undirected Case of ITL

In the inductive active learning problem considered in most prior works, where S A and f is a Gaussian process, it holds for ITL that αA,S(n) = 1 since all learning targets appear explicitly in S: Lemma C.9. Let S A. Then ψA of ITL is submodular.

Proof. Fix any x S and X X S. Let X def = X \ X . By the definition of conditional information gain, we have A(x | X) = I(yx; f A | y X) = I(yx; f A, y X | y X) I(yx; y X | y X). Since for any x S and X S, yx y X | f A, this simplifies to I(yx; f A | y X) = I(yx; f A | y X) I(yx; y X | y X). It then follows from I(yx; y X | y X) 0 that A(x | X) = I(yx; f A | y X) I(yx; f A | y X) = A(x | X ).

This implies that αA,S(n) 1 for any n and κ(X | D) 1 for any X, D S when S A.

C.4.3 The Submodularity Ratio

Building upon the theory of maximizing non-negative monotone submodular functions (Nemhauser et al., 1978; Krause & Golovin, 2014), Das & Kempe (2018) define the following notion of approximate submodularity:

Definition C.10 (Submodularity ratio). The submodularity ratio of ψA up to cardinality n 1 is κA(n) def = min D x1:n X S:|X| n D X=

κ(X | D), (9)

where they define 0

0 1. ψA is said to be κ-weakly submodular for some κ > 0 if infn N κA(n) κ.

As a special case of Theorem 6 from Das & Kempe (2018), applying that ψA is non-negative and monotone, we obtain the following result. Theorem C.11 (Das & Kempe (2018)). For any n 1, if ITL or VTL selected x1:n, respectively, then ψA(x1:n) (1 e κA(n)) max X S |X| n ψA(X).

If ψA is submodular, it is implied that κA(n) 1 for all n 1 in which case Theorem C.11 recovers Theorem C.4.

C.5 Convergence of Marginal Gain

Our following analysis allows for changing target spaces An and sample spaces Sn (cf. Section 5), and to this end, we redefine Γn def = maxx Sn An(x | x1:n). The following theorems show that the marginal gains of ITL and VTL converge to zero, and will serve as the main tool for establishing Theorems 3.2 and 3.3. We will abbreviate αA,S(n) by αn. Theorem C.12 (Convergence of Marginal Gain for ITL). Assume that Assumptions B.1 and B.2 are satisfied. Fix any integers n1 > n0 0, = n1 n0 + 1 such that for all i {n0, . . . , n1 1}, Ai+1 Ai and S def = Si+1 = Si. Further, assume |An0| < . Then, if the sequence {xi+1}n1 i=n0 was generated by ITL,

Moreover, if n0 = 0,

Γn1 αn1 γA0,S( )

Proof. We have

i=n0 max x S I(f Ai; yx | y1:i)

i=n0 I(f Ai; yxi+1 | Di)

i=n0 I(f An0; yxi+1 | Di)

i=n0 I(f An0; yxi+1 | yxn0+1:i, Dn0)

I(f An0; yxn0+1:n1+1 | Dn0)

max X S |X|= I(f An0; y X | Dn0)

max X S |X|= I(f X; y X | Dn0)

max X S |X|= I(f X; y X)

where (i) follows from the definition of the task complexity αn1 (cf. Definition C.6); (ii) uses the objective of ITL and that the posterior variance of Gaussians is independent of the realization and only depends on the location of observations; (iii) uses Ai+1 Ai and monotonicity of information gain; (iv) uses that the posterior variance of Gaussians is independent of the realization and only depends on the location of observations; (v) uses the chain rule of information gain; (vi) uses y X f An0 | f X and the data processing inequality. The conditional independence follows from the assumption that the observation noise is independent. Similarly, y X Dn0 | f X which implies (vii).

If n0 = 0, then the bound before line (vi) simplifies to αn1γA0,S( )/ .

The result for VTL is stated, for simplicity, only for the case where the target space and sample space are fixed. Theorem C.13 (Convergence of Marginal Gain for VTL). Assume that Assumptions B.1 and B.2 are satisfied. Then for any n 1, if the sequence {xi}n i=1 is generated by VTL,

x A γ{x },S(n). (12)

We remark that P

x A γ{x },S(n) |A|γA,S(n).

Proof. We have

h tr Var[f A | y1:i] min x S tr Var[f A | y1:i, yx] i

h tr Var[f A | Di] tr Var[f A | Di+1] i

i=0 log Var[fx | Dn]

Var[fx | Dn+1]

i=0 I fx ; yxn+1 Dn

(iv) = 2σ2αn

i=0 I fx ; yxn+1 yx1:n

(v) = 2σ2αn

x A I fx ; yx1:n

x A max X S |X|=n I(fx ; y X)

x A γ{x },S(n)

where (i) follows from the definition of the task complexity αn1 (cf. Definition C.6); (ii) follows from the VTL decision rule and that the posterior variance of Gaussians is independent of the realization and only depends on the location of observations; (iii) follows from Lemma C.38 and monotonicity of variance; (iv) uses that the posterior variance of Gaussians is independent of the realization and only depends on the location of observations; and (v) uses the chain rule of mutual information. The remainder of the proof is analogous to the proof of Theorem C.12 (cf. Appendix C.5).

Keeping track of the task complexity online In general, the task complexity αn may be larger than one in the directed setting (i.e., when S A). However, note that αn can easily be evaluated online by keeping track of the smallest Γi during previous rounds i.

C.6 Proof of Theorem 3.2

We will now prove Theorem 3.2. We first prove the convergence of marginal variance within S for ITL, before proving the convergence outside S in Appendix C.6.1.

Lemma C.14 (Uniform convergence of marginal variance within S for ITL). Assume that Assumptions B.1 and B.2 are satisfied. For any n 0 and x A S,

σ2 n(x) 2 σ2 Γn. (13)

Proof. We have

σ2 n(x) = Var[fx | Dn] Var[fx | fx, Dn] | {z } 0

(i) = Var[yx | Dn] ρ2(x) (Var[yx | fx, Dn] ρ2(x)) = Var[yx | Dn] Var[yx | fx, Dn]

(ii) σ2 log Var[yx | Dn] Var[yx | fx, Dn]

= 2 σ2 I(fx; yx | Dn)

(iii) 2 σ2 I(f A; yx | Dn)

(iv) 2 σ2 max x S I(f A; yx | Dn)

(v) = 2 σ2 max x S I(f A; yx | y1:n)

= 2 σ2 Γn where (i) follows from the noise assumption (cf. Assumption B.2); (ii) follows from Lemma C.38 and using monotonicity of variance; (iii) follows from x A and monotonicity of information gain; (iv) follows from x S; and (v) uses that the posterior variance of Gaussians is independent of the realization and only depends on the location of observations.

C.6.1 Convergence outside S for ITL

We will now show convergence of marginal variance to the irreducible uncertainty for points outside the sample space.

Our proof roughly proceeds as follows: We construct an approximate Markov boundary of x in S, and show (1) that the size of this Markov boundary is independent of n, and (2) that a small uncertainty reduction within the Markov boundary implies that the marginal variances at the Markov boundary and(!) x are small. Definition C.15 (Approximate Markov boundary). For any ϵ > 0, n 0, and x X, we denote by Bn,ϵ(x) the smallest (multi-)subset of S such that

Var[fx | Dn, y Bn,ϵ(x)] η2 S(x) + ϵ. (14)

We call Bn,ϵ(x) an ϵ-approximate Markov boundary of x in S.

Equation (14) is akin to the notion of the smallest Markov blanket in S of some x X (called a Markov boundary) which is the smallest set B S such that fx f S | f B. Lemma C.16 (Existence of an approximate Markov boundary). For any ϵ > 0, let k be the smallest integer satisfying γk

k ϵλmin(KSS)

2|S|σ2 σ2 . (15)

Then, for any n 0 and x X, there exists an ϵ-approximate Markov boundary Bn,ϵ(x) of x in S with size at most k.

Lemma C.16 shows that for any ϵ > 0 there exists a universal constant bϵ (with respect to n and x) such that

|Bn,ϵ(x)| bϵ n 0, x X. (16)

We defer the proof of Lemma C.16 to Appendix C.6.3 where we also provide an algorithm to compute Bn,ϵ(x).

Lemma C.17. For any ϵ > 0, n 0, and x X,

σ2 n(x) 2σ2 I(fx; y Bn,ϵ(x) | Dn) + η2 S(x) + ϵ (17)

where Bn,ϵ(x) is an ϵ-approximate Markov boundary of x in S.

Proof. We have

σ2 n(x) = Var[fx | Dn] η2 S(x) + η2 S(x)

(i) Var[fx | Dn] Var[fx | y Bn,ϵ(x), Dn] + η2 S(x) + ϵ

(ii) σ2 log

Var[fx | Dn] Var[fx | y Bn,ϵ(x), Dn]

+ η2 S(x) + ϵ

= 2σ2 I(fx; y Bn,ϵ(x) | Dn) + η2 S(x) + ϵ

where (i) follows from the defining property of an ϵ-approximate Markov boundary (cf. Equation (14)); and (ii) follows from Lemma C.38 and using monotonicity of variance.

Lemma C.18. For any ϵ > 0, n 0, and x A,

I(fx; y Bn,ϵ(x) | Dn) bϵ κn(Bn,ϵ(x))Γn (18)

where Bn,ϵ(x) is an ϵ-approximate Markov boundary of x in S, |Bn,ϵ(x)| bϵ, and where κn( ) def = κ( | x1:n) denotes the information ratio from Equation (8).

We remark that κn( ) > 0 as is shown in Lemma C.7, and hence, the right-hand side of the inequality is well-defined.

Proof. We use the abbreviated notation B = Bn,ϵ(x). We have

I(fx; y B | Dn) (i) I(f A; y B | Dn)

(ii) 1 κn,bϵ

x B I(f A; y x | Dn)

(iii) bϵ κn,bϵ max x B I(f A; y x | Dn)

(iv) bϵ κn,bϵ max x S I(f A; y x | Dn)

(v) = bϵ κn,bϵ max x S I(f A; y x | y1:n)

= bϵ κn,bϵ Γn

where (i) follows from monotonicity of mutual information; (ii) follows from the definition of the information ratio κn,bϵ (cf. Equation (8)); (iii) follows from b bϵ; (iv) follows from B S; and (v) uses that the posterior variance of Gaussians is independent of the realization and only depends on the location of observations.

Proof of Theorem 3.2 for ITL . The case where x A S is shown by Lemma C.14 with C = 2 σ2.

To prove the more general result, fix any x A and ϵ > 0. By Lemma C.16, there exists an ϵ-approximate Markov boundary Bn,ϵ(x) of x in S such that |Bn,ϵ(x)| bϵ. We have

σ2 n(x) (i) 2σ2 I(fx; y Bn,ϵ(x) | Dn) + η2 S(x) + ϵ

(ii) 2σ2bϵ κn(Bn,ϵ(x))Γn + η2 S(x) + ϵ

where (i) follows from Lemma C.17; and (ii) follows from Lemma C.18.

Let ϵ = c γ n n with c = 2|S|σ2 σ2/λmin(KSS). Then, by Equation (15), bϵ can be bounded for instance by n. Together with Theorem C.12 this implies for ITL that

σ2 n(x) η2 S(x) + 2σ2 n Γn + cγ n/ n

η2 S(x) + c γn/ n

for a constant c , e.g., c = 2σ2 + c.

C.6.2 Convergence outside S for VTL

Proof of Theorem 3.2 for VTL . Analogously to Lemma C.17, we have

σ2 n(x) = Var[fx | Dn] η2 S(x) + η2 S(x)

(i) Var[fx | Dn] Var[fx | y Bn,ϵ(x), Dn] + η2 S(x) + ϵ

where (i) follows from the defining property of an ϵ-approximate Markov boundary (cf. Equation (14)). Further, we have

Var[fx | Dn] Var[fx | y Bn,ϵ(x), Dn]

x Bn,ϵ(x) (Var[fx | Dn] Var[fx | y x, Dn])

x Bn,ϵ(x) (tr Var[f A | y1:n] tr Var[f A | y x, y1:n])

(iii) bϵΓn where (i) follows from the submodularity of ψA; (ii) uses that the posterior variance of Gaussians is independent of the realization and only depends on the location of observations; and (iii) follows from the definition of Γn and Lemma C.16.

The remainder of the proof is analogous to the result for ITL, using Theorem C.13 to bound Γn.

C.6.3 Existence of an Approximate Markov Boundary

We now derive Lemma C.16 which shows the existence of an approximate Markov boundary of x in S. Lemma C.19. For any S S and k 0, there exists B S with |B| = k such that for all x S,

Var[fx | y B] 2 σ2 γk

Proof. We choose B S greedily using the acquisition function

xk def = arg max x S I(f S; y x | y Bk 1)

where Bk = x1:k. Note that this is the undirected special case of ITL, and hence, we have

Var fx | y Bk (i) 2 σ2Γk

(ii) 2 σ2 γk

k where (i) is due to Lemma C.14; and (ii) is due to Theorem C.12 and αS,S(k) 1.

Lemma C.20. Given any ϵ > 0 and B S S with |S| < , such that for any x S,

Var[fx | y B] ϵλmin(KSS)

|S|σ2 . (20)

Then for any x X,

Var[fx | y B] Var[fx | f S] + ϵ. (21)

Proof. We will denote the right-hand side of Equation (20) by ϵ . We have

Var[fx | y B]

(i) = Ef S[Varfx[fx | f S, y B] | y B] + Varf S[Efx[fx | f S, y B] | y B]

(ii) = Varfx[fx | f S, y B] + Varf S[Efx[fx | f S, y B] | y B]

(iii) = Varfx[fx | f S] | {z } irreducible uncertainty

+ Varf S[Efx[fx | f S] | y B] | {z } reducible (epistemic) uncertainty

where (i) follows from the law of total variance; (ii) uses that the conditional variance of a Gaussian depends only on the location of observations and not on their value; and (iii) follows from fx y B | f S since B S. It remains to bound the reducible uncertainty.

Let hx : Rd R, f S 7 E[fx | f S] where we write d def = |S|. Using the formula for the GP posterior mean, we have

hx(f S) = E[fx] + z (f S E[f S])

where z def = K 1 SS KSx. Because h is a linear function in f S we have for the reducible uncertainty that

Varf S[hx(f S) | y B] = z Var[f S | y B]z

(i) d z diag Var[f S | y B]z

(ii) ϵ d z z

= ϵ d Kx SK 1 SS K 1 SS KSx

ϵ d λmin(KSS)Kx SK 1 SS KSx

(iii) ϵ dσ2

where (i) follows from Lemma C.37; (ii) follows from the assumption that Var[fx | y B] ϵ for all x S; and (iii) follows from

Kx SK 1 SS KSx Kxx = σ2

since Kxx Kx SK 1 SS KSx 0.

Proof of Lemma C.16. Let B S be the set of size k generated by Lemma C.19 to satisfy Var[fx | y B] 2 σ2γk/k for all x S. We have for any x X,

Var[fx | Dn, y B] (i) Var[fx | y B]

(ii) Var[fx | f S] + ϵ

where (i) follows from monotonicity of variance; and (ii) follows from Lemma C.20; using |S| < and the condition on k.

We remark that Lemma C.19 provides an algorithm (just undirected ITL!) to compute an approximate Markov boundary, and the set B returned by this algorithm is a valid approximate Markov boundary for all x X. One can simply swap-in ITL with target space {x} for undirected ITL to obtain tighter (but instance-dependent) bounds on the size of the approximate Markov boundary.

C.6.4 Generalization to Continuous S for Finite Dimensional RKHSs

In this subsection we generalize Theorem 3.2 to continuous sample spaces S. We will make the following assumption:

Assumption C.21. The RKHS of the kernel k is finite dimensional. In other words, the kernel k can be expressed as k(x, x ) = ϕ(x) ϕ(x ) for some feature map ϕ : X Rd with d < .

In the following, we will denote the design matrix of the sample space S by Φ def = [ϕ(x) : x S] R|S| d, and we denote by ΠΦ its orthogonal projection onto the orthogonal complement of the span of Φ. In particular, it holds that

1. ΠΦv = 0 for all v span Φ, and 2. ΠΦv = v for all v (span Φ) .

Especially, v ker ΠΦ if and only if v span Φ. This projection can be computed as follows: Lemma C.22. It holds that ΠΦ = I Φ (ΦΦ ) 1Φ. (22)

Proof. Φ (ΦΦ ) 1Φ is the orthogonal projection onto the span of Φ (see, e.g., Strang, 2016, page 211).

Lemma C.23. Under Assumption C.21, the irreducible uncertainty η2 S(x) of x X is

η2 S(x) = ϕ(x) 2 ΠΦ (23)

where v A =

v Av denotes the Mahalanobis distance.

Proof. This is an immediate consequence of the formula for the conditional variance of multivariate Gaussians (cf. Appendix B.2), applied to the linear kernel.

Lemmas C.22 and C.23 imply that η2 S(x ) = 0 for all x X with ϕ(x ) span Φ. That is, the irreducible uncertainty is zero for points in the span of the sample space. In contrast, for points x with ϕ(x ) (span Φ) , the irreducible uncertainty equals the initial uncertainty: η2 S(x ) = σ2 0(x ). The irreducible uncertainty of any other point x can be computed by simple decomposition of ϕ(x) into parallel and orthogonal components.

Assuming that Assumption C.21 holds and given any (non-finite) S X, there exists a basis ΩS X in the space of embeddings ϕ( ) such that span S = span ΩS and |ΩS| d. The generalized existence of an approximate Markov boundary for continuous domains can then be shown analogously to Lemma C.16: Lemma C.24 (Existence of an approximate Markov boundary for a continuous domain). Let S be any (continuous) subset of X and let Assumption C.21 hold with d < . Further, for any ϵ > 0, let k be the smallest integer satisfying γk

k ϵλmin(KΩSΩS)

2dσ2 σ2 . (24)

Then, for any n 0 and x X, there exists an ϵ-approximate Markov boundary Bn,ϵ(x) of x in S with size at most k.

Proof sketch. The proof follows analogously to Lemma C.16 by conditioning on the finite set ΩS as opposed to S.

C.7 Proof of Theorem 3.3

We first formalize the assumptions of Theorem 3.3: Assumption C.25 (Regularity of f ). We assume that f is in a reproducing kernel Hilbert space Hk(X) associated with a kernel k and has bounded norm, that is, f k B for some finite B R. Assumption C.26 (Sub-Gaussian noise). We further assume that each εn from the noise sequence {εn} n=1 is conditionally zero-mean ρ(xn)-sub-Gaussian with known constants ρ(x) > 0 for all x X. Concretely,

n 1, λ R : E eλϵn Dn 1 exp λ2ρ2(xn)

where Dn 1 corresponds to the σ-algebra generated by the random variables {xi, ϵi}n 1 i=1 and xn.

We make use of the following foundational result, showing that under the above two assumptions the (misspecified) Gaussian process model from Section 3.1 is an all-time well-calibrated model of f : Lemma C.27 (Well-calibrated confidence intervals; Abbasi-Yadkori (2013); Chowdhury & Gopalan (2017)). Pick δ (0, 1) and let Assumptions C.25 and C.26 hold. Let

βn(δ) = f k + ρ p

2(γn + 1 + log(1/δ))

where ρ = maxx X ρ(x).10 Then, for all x X and n 0 jointly with probability at least 1 δ,

|f (x) µn(x)| βn(δ) σn(x)

where µn(x) and σ2 n(x) are mean and variance (as defined in Appendix B.2) of the GP posterior of f(x) conditional on the observations Dn, pretending that εi is Gaussian with variance ρ2(xi).

The proof of Theorem 3.3 is a straightforward application of Lemma C.27 and Theorem 3.2:

Proof of Theorem 3.3. By Theorem 3.2, we have that for all x A,

η2 S(x) + ν2 A,S(n) ηS(x) + νA,S(n).

The result then follows by application of Lemma C.27.

C.8 Proof of Theorem 5.1

In this section, we derive our main result on Safe BO. In Appendix C.8.1, we give the definition of the reachable safe set R and derive the conditions under which convergence to the reachable safe set is guaranteed. Then, in Appendix C.8.2, we prove Theorem 5.1.

Notation In the agnostic setting from Section 3.2 (i.e., under Assumptions C.25 and C.26), Lemma C.27 provides us with the following (1 δ)-confidence intervals (CIs)

Cn(x) def = Cn 1(x) [µn(x) βn(δ) σn(x)] (25)

where C 1(x) = R. We write un(x) def = max Cn(x), ln(x) def = min Cn(x), and wn(x) def = un(x) ln(x) for its upper bound, lower bound, and width, respectively.

We learn separate statistical models f and {g1, . . . , gq} for the ground truth objective f and ground truth constraints {g 1, . . . , g q}. We write I def = {f, 1, . . . , q} and collect the constraints in Is def = {1, . . . , q}. Without loss of generality, we assume that the confidence intervals include the ground truths with probability at least 1 δ jointly for all i I.11 For i I, denote by un,i, ln,i, wn,i, ηi, βn,i the respective quantities. In the following, we do not explicitly denote the dependence of βn on δ.

To improve clarity, we will refer to the set of potential maximizers defined in Equation (5) as Mn and denote by An an arbitrary target space.

We point out the following corollary: Corollary C.28 (Safety). With high probability, jointly for any n 0 and any i Is,

x Sn : g i (x) 0. (26)

C.8.1 Convergence to Reachable Safe Set

Definition C.29 (Reachable safe set). Given any pessimistic safe set S X and any ϵ 0 and β 0, we define the reachable safe set up to (ϵ, β)-slack and its closure as

Rϵ,β(S) def = S {x X \ S | g i (x) β(ηi(x; S) + ϵ) 0 for all i Is} Rϵ,β(S) def = lim n (Rϵ,β)n(S)

where (Rϵ,β)n denotes the n-th composition of Rϵ,β with itself.

10βn(δ) can be tightened adaptively (Emmenegger et al., 2023). 11This can be achieved by taking a union bound and rescaling δ.

Remark C.30. Convergence of the safe set to the closure of the reachability operator can only be guaranteed for finite safe sets (|S | < ). The following proofs readily generalize to continuous domains by considering convergence within the k-th composition of the reachability operator with itself for some k < . In this case the sample complexity grows with k rather than |S |. The only required modification is to lift the assumption of Theorem C.12 that information is gained only while safe sets remain constant (i.e., Si+1 = Si for all i). This assumption is straightforward to lift since for any n 0 and T 1,

max x Sn A(x | x1:n+T ) 1

t=1 max x Sn A(x | x1:n+t) 1

t=1 max x Sn+t A(x | x1:n+t) γT

using submodularity for the first inequality and the monotonicity of the safe set for the second inequality. In particular, this shows that one continues learning about points in the original safe set even as the safe set grows.

We denote by S0 the initial pessimistic safe set induced by the (prior) statistical model g (cf. Section 5) and write Rϵ,β def = Rϵ,β(S0). Lemma C.31 (Properties of the reachable safe set). For all S, S X, ϵ 0, and β 0:

(i) S S = Rϵ,β(S ) Rϵ,β(S),

(ii) Rϵ,β(S) S = Rϵ,β(S) S, and

(iii) R0,0( ) = R0,0 = S .

Proof (adapted from lemma 7.1 of Berkenkamp et al. (2021)).

1. Let x Rϵ,β(S ). If x S then x Rϵ,β(S), so let x S. Then, by definition, for all i Is, f i (x) βηi(x; S ) ϵ 0. By the monotonicity of variance, ηi(x; S ) ηi(x; S) for all i I, and hence f i (x) βηi(x; S) ϵ 0 for all i Is. It follows that x Rϵ,β(S).

2. By the monotonicity of variance, ηi(x; Rϵ,β(S)) ηi(x; S) for all x X and i I. Thus, by definition of the safe region, we have that Rϵ,β(Rϵ,β(S)) S. The result follows by taking the limit.

3. The result follows directly from the definition of the true safe set S (cf. Equation (4)).

Clearly, we cannot expand the safe set beyond R0,0. The following is our main intermediate result, showing that either we expand the safe set at some point or the uncertainty converges to the irreducible uncertainty. Lemma C.32. Given any n0 0, ϵ > 0, let n be the smallest integer such that νn , ϵ2 ϵ where ϵ = ϵ/2. Let βn0+n = maxi Is βn0+n ,i. Assume that the sequence of target spaces is monotonically decreasing, i.e., An+1 An. Then, we have with high probability (at least) one of x An0+n , i I :

wn0+n ,i(x) βn0+n [ηi(x; Sn0+n ) + ϵ]

and An0+n Rϵ,βn0+n (Sn0+n ) Sn0+n

or |Sn0+n +1| > |Sn0|.

Proof. Suppose that |Sn0+n +1| = |Sn0|. Then, by Theorem 3.3 (using that the sequence of target spaces is monotonically decreasing), for any x An0+n and i I,

wn0+n ,i(x) βn0+n [ηi(x; Sn0+n ) + ϵ].

As Sn0+n +1 = Sn0+n we have for all x An0+n \ Sn0+n and i Is, with high probability that

0 > ln0+n ,i(x) g i (x) wn0+n ,i(x) g i (x) βn0+n [ηi(x; Sn0+n ) + ϵ].

It follows that An0+n Rϵ,βn0+n (Sn0+n ) Sn0+n .

To gather more intuition about the above lemma, consider the target space

En def = b Sn \ Sn. (27)

We call En the potential expanders since it contains all points which might be safe, but are not yet known to be safe. Under this target space, the above lemma simplifies slightly:

Lemma C.33. For any n 0 and ϵ, β 0, if En An then with high probability,

Sn (An Rϵ,β(Sn)) = Rϵ,β(Sn).

Proof. With high probability, Rϵ,β(Sn) b Sn = Sn En. The lemma is a direct consequence.

The above lemmas can be combined to yield our main result of this subsection, establishing the convergence of ITL to the reachable safe set.

Theorem C.34 (Convergence to reachable safe set). For any ϵ > 0, let n be the smallest integer satisfying the condition of Lemma C.32, and define n def = (|S | + 1)n . Let βn βn,i for all n n , i Is. Assume that the sequence of target spaces is monotonically decreasing, i.e., An+1 An. Then, the following inequalities hold jointly with probability at least 1 δ:

(i) n 0, i Is : g i (xn) 0,

(ii) An Rϵ, βn Sn R0,0 = S ,

convergence to safe region

(iii) x An , i I : wn ,i(x) βn ηi(x; Rϵ, βn ) + ϵ,

convergence of width

(iv) x Rϵ, βn , i I : ηi(x; Rϵ, βn ) = 0.

convergence of width within safe region

Proof. (i) is a direct consequence of Corollary C.28. Sn S follows directly from the pessimistic safe set Sn from (ii) being a subset of the true safe set S . (iv) follows directly from the definition of irreducible uncertainty. Thus, it remains to establish An Rϵ, βn Sn and (iii).

Recall that with high probability |Sn| [0, |S |] for all n 0. Thus, the size of the pessimistic safe set can increase at most |S | many times. By Lemma C.32, using the assumption on n , the size of the pessimistic safe set increases at least once every n iterations, or else:

x An0+n , i I : wn0+n ,i(x) βn0+n [ηi(x; Sn0+n ) + ϵ] and An0+n Rϵ,βn0+n (Sn0+n ) Sn0+n . (28)

Because the safe set can expand at most |S | many times, Equation (28) occurs eventually for some n0 |S |n . In this case, since βn βn0+n and An An0+n (as n0 + n n ) we have that

An Rϵ, βn (Sn0+n ) An0+n Rϵ,βn0+n (Sn0+n )

By Lemma C.31 (ii), this implies

An Rϵ, βn Sn0+n Sn .

We emphasize that Theorem C.34 holds for arbitrary target spaces An. If additionally, En An for all n 0 then by Lemma C.33, Theorem C.34 (ii) strengthens to Rϵ, βn Sn . Intuitively, En An ensures that one aims to expand the safe set in all directions. Conversely, if En An then one aims only to expand the safe set in the direction of An (or not at all if An Sn).

Free convergence guarantees in many applications Theorem C.34 can be specialized to yield convergence guarantees in various settings by choosing an appropriate target space An. Straightforward application of Theorem C.34 (informally) requires that the sequence of target spaces is monotonically decreasing (i.e., An+1 An), and that each target space An is an over-approximation of the actual set of targeted points (such as the set of optimas in the Bayesian optimization setting). We discuss two such applications in the following.

1. Pure expansion: For example, for the target space En, Theorem C.34 bounds the convergence of the safe set to the reachable safe set. In this case, the transductive active learning problem corresponds to the pure expansion setting, also addressed by the ISE baseline discussed in Section 5. The ISE baseline, however, does not establish convergence guarantees of the kind of Theorem C.34. Note that En satisfies the (informal) requirements laid out previously, since it is monotonically decreasing by definition, and with high probability, any point x S that is not in Sn is contained within En. 2. Level set estimation: Given any τ R, we denote the (safe) τ-level set of f by Lτ def = {x S | f (x) = τ}. We define the potential level set as

Lτ n def = {x b Sn | lf n(x) τ uf n(x)}. (29)

That is, Lτ n is the subset of the optimistic safe set b Sn where the τ-level set of f may be located. Analogously to the potential expanders, it is straightforward to show that Lτ n over-approximates the true τ-level set and is monotonically decreasing.

We remark that our guarantees from this section also apply to the standard ( unsafe ) setting where S = S0 = X.

C.8.2 Convergence to Safe Optimum

In this section, we specialize Theorem C.34 for the case that the target space contains the potential maximizers Mn (cf. Equation (5)). It is straightforward to see that the sequence Mn is monotonically decreasing (i.e., Mn+1 Mn). The following lemma shows that the potential maximizers overapproximate the set of safe maxima X def = arg maxx S f (x). Lemma C.35 (Potential maximizers over-approximate safe maxima). For all n 0 and with probability at least 1 δ,

(i) x X implies x Mn and

(ii) x Mn implies x X .

Proof. If x Mn then

un,f(x) < max x Sn ln,f(x ) max x S ln,f(x )

where we used Sn S with high probability, which directly implies with high probability that x X .

For the other direction, if x X then

un,f(x) max x S ln,f(x ) max x Sn ln,f(x )

with high probability.

We denote the set of optimal actions which are safe up to (ϵ, β)-slack by

X ϵ,β def = arg max x Rϵ,β f (x),

and by f ϵ,β the maximum value attained by f at any of the points in X ϵ,β. The regret can be expressed as

rn( Rϵ,β) = f ϵ,β f (bxn)

The following theorem formalizes Theorem 5.1 and establishes convergence to the safe optimum.

Theorem C.36 (Convergence to safe optimum). For any ϵ > 0, let n be the smallest integer satisfying the condition of Lemma C.32, and define n def = (|S | + 1)n . Let βn βn,i for all n n , i Is. Then, the following inequalities hold jointly with probability at least 1 δ:

(i) n 0, i Is : g i (xn) 0,

(ii) n n : rn( Rϵ, βn ) ϵ.

convergence to safe optimum

Proof. Fix any x X ϵ, βn Rϵ, βn . Assume w.l.o.g. that x Mn .12 Then, with high probability,

f ϵ, βn = f (x ) un ,f(x )

= ln ,f(x ) + wn ,f(x )

(i) ln ,f(bxn ) + wn ,f(x ) f (bxn ) + wn ,f(x )

(ii) f (bxn ) + ϵ

where (i) follows from the definition of bxn; and (ii) follows from Theorem C.34 and noting that x Mn Rϵ, βn .

We have shown that f (bxn ) f ϵ, βn ϵ, which implies rn ( Rϵ, βn ) ϵ. Since the upperand lower-confidence bounds are monotonically decreasing / increasing, respectively, we have that for all n n , rn( Rϵ, βn ) ϵ.

C.9 Useful Facts and Inequalities

We denote by the Loewner partial ordering of symmetric matrices. Lemma C.37. Let A Rn n be a positive definite matrix with diagonal D. Then, A n D.

Proof. Equivalently, one can show n D A 0. We write A def = D 1/2QD 1/2, and thus, Q = D 1/2AD 1/2 is a positive definite symmetric matrix with all diagonal elements equal to 1. It remains to show that

n D A = D 1/2(n I Q)D 1/2 0.

Note that Pn i=1 λi(Q) = tr Q = n, and hence, all eigenvalues of Q belong to (0, n).

Lemma C.38. If a, b (0, M] for some M > 0 and b a then

b a M log b

If additionally, a M for some M > 0 then

b a M log b

Proof. Let f(x) def = log x. By the mean value theorem, there exists c (a, b) such that

1 c = f (c) = f(b) f(a)

b a = log b log a

b a = log( b

b a = c log b

12Otherwise, with high probability, f (bxn ) > f ϵ, βn .

Under the additional condition that a M , we obtain

b a = c log b

D Interpretations & Approximations of Principle ( )

We give a brief overview of interpretations and approximations of ITL, as well as alternative decision rules adhering to the fundamental principle ( ).

The discussed interpretations of ( ) differ mainly in how they quantify the uncertainty about A. In the GP setting, this uncertainty is captured by the covariance matrix Σ of f A, and we consider two main ways of scalarizing Σ:

1. the total (marginal) variance tr Σ, and 2. the generalized variance |Σ|.

The generalized variance which was originally suggested by Wilks (1932) as a generalization of variance to multiple dimensions takes into account correlations. In contrast, the total variance discards all correlations between points in A.

All discussed decision rules following principle ( ) (i.e., ITL, VTL, MM-ITL) differ only in their weighting of the points in A, and they coincide when |A| = 1.

D.1 Interpretations of ITL

We briefly discuss three interpretations of ITL.

Minimizing generalized variance In the GP setting, ITL can be equivalently characterized as minimizing generalized posterior variance:

xn = arg max x S I(f A; yx | Dn)

= arg max x S

1 2 log |Var[f A | Dn 1]| |Var[f A | Dn 1, yx]|

= arg min x S |Var[f A | Dn 1, yx]| . (32)

Maximizing relevance and minimizing redundancy An alternative interpretation of ITL is

I(f A; yx | Dn) = I(f A; yx) | {z } relevance

I(f A; yx; Dn) | {z } redundancy

where I(f A; yx; Dn) = I(f A; yx) I(f A; yx | Dn) denotes the multivariate information gain (cf. Appendix B). In this way, ITL can be seen as maximizing observation relevance while minimizing observation redundancy. This interpretation is common in the literature on feature selection (Peng et al., 2005; Vergara & Estévez, 2014; Beraha et al., 2019).

Steepest descent in measure spaces ITL can be seen as performing steepest descent in the space of probability measures over f A, with the KL divergence as metric:

I(f A; yx | Dn) = Eyx[KL(p(f A | Dn, yx) p(f A | Dn))].

That is, ITL finds the observation yielding the largest update to the current density.

D.2 Interpretations of VTL

Quantifying the uncertainty about f A by the marginal variance of points in A rather than entropy (or generalized variance), the principle ( ) leads to VTL. Note that if |A| = 1, then VTL is equivalent to ITL. Unlike the similar, but more technical, TRUVAR algorithm proposed by Bogunovic et al. (2016), VTL does not require truncated variances, and hence, VTL can be applied to constrained settings (where A S) as well.

Relationship to ITL Note that the ITL criterion in the GP setting can be expressed as

xn = arg min x S tr log Var[f A | Dn 1, yx] (34)

where for a positive semi-definite matrix A with spectral decomposition A = V ΛV we write log A = V log ΛV for the logarithmic matrix function. To derive Equation (34) we use that log |A| = P

i log λi(A) = tr log A. Hence, ITL and VTL are identical up to a different weighting of the eigenvalues of the posterior covariance matrix.

Minimizing a bound to the approximation error Chowdhury & Gopalan (2017) (page 19) bound the approximation error |f (x) µn(x)| by

|kt(x) (Kt + P t) 1ε1:t| | {z } variance

+ |f (x) kt(x) (Kt + P t) 1f 1:t| | {z } bias

where kt(x) def = Kxx1:t, Kt def = Kx1:tx1:t, and P t def = Px1:t. Similar to a standard bias-variance decomposition, the first term measures variance and the second term measures bias. Following Lemma C.27, VTL can be seen as greedily minimizing this bound to the approximation error (i.e., both bias and variance).

Maximizing correlation to prediction targets weighted by their variance It can be shown (see the proof below) that the VTL decision rule is equivalent to

xn = arg max x S

x A Var[fx | Dn 1] Cor[fx , yx | Dn 1]2. (35)

That is, VTL maximizes the squared correlation between the next observation and the prediction targets, weighted by their variance. Intuitively, prediction targets are weighted by their variance since more can be learned about a prediction target with higher variance. This is precisely what leads to the diverse sample selection, and is akin to uncertainty sampling among the prediction targets and then selecting the observation which is most correlated with the selected prediction target.

Proof. Starting with the VTL objective, we have

arg min x S

x A Var[fx | Dn, yx] = arg min x S

Var[fx | Dn] Cov[fx , yx | Dn]2

Var[yx | Dn]

= arg max x S

Var[fx | Dn] Cov[fx , yx | Dn]2

Var[fx | Dn] Var[yx | Dn] + const

= arg max x S

x A Var[fx | Dn] Cor[fx , yx | Dn]2 + const.

D.3 Mean Marginal ITL

Mac Kay (1992) previously proposed mean-marginal ITL (MM-ITL) in the setting where S = X, which selects

xn = arg max x S

x A I(fx ; yx | Dn 1) (36)

and which simplifies in the GP setting to

xn = arg max x S

x A log Var[fx | Dn 1] Var[fx | Dn 1, yx]

= arg min x S

x A log Var[fx | Dn 1, yx]

= arg min x S tr log diag Var[f A | Dn 1, yx]. (37)

Analogously to the derivation of Equation (34), this can also be expressed as

xn = arg min x S |diag Var[f A | Dn 1, yx]| . (38)

Effectively, MM-ITL ignores the mutual interaction between points in A. As can be seen from Equation (37) and as is also mentioned by Mac Kay (1992), MM-ITL is equivalent to VTL up to a different weighting of the points in A: instead of minimizing the average posterior variance (as in VTL), MM-ITL minimizes the average posterior log-variance. Under the lens of principle ( ), this can be seen as minimizing the average marginal entropy of predictions within the target space:

xn = arg min x S

x A H[fx | Dn 1, yx] .

We remark that MM-ITL is a special case of EPIG (Bickford Smith et al., 2023, Appendix E.2).

Not a generalization of uncertainty sampling Unlike ITL, MM-ITL is not a generalization of uncertainty sampling. The reason is precisely that MM-ITL ignores the mutual interaction between points in A. Consider the example where X = S = A = {1, . . . , 10} where f1:9 are highly correlated while f10 is mostly independent of the other points. Visually, imagine a smooth function (i.e., under a Gaussian kernel) with points 1 through 9 close to each other and point 10 far away. Further, suppose that point 10 has a slightly larger marginal variance than the others. Then, MM-ITL would select one of the points 1 : 9 since this leads to the largest reduction in the marginal (log-)variances (i.e., to a small posterior uncertainty ).13 In contrast, ITL selects the point with the largest prior marginal variance (cf. Appendix C.1), point 10, since this leads to the largest reduction in entropy.14

Similarity to VTL Observe that the following decision rule is equivalent to VTL:

xn = arg max x S tr Var[f A | Dn 1] tr Var[f A | Dn 1, yx].

By Lemma C.38, for any x S, this objective value can be tightly lowerand upper-bounded (up to constant-factors) by X

x A log Var[fx | Dn 1] Var[fx | Dn 1, yx]

x A I(fx ; yx | Dn 1) (see MM-ITL)

x A log 1 Cor[fx , yx | Dn 1]2 (39)

where (i) is detailed in example 8.5.1 of Cover (1999). Thus, VTL and MM-ITL are closely related.

Experiments In our experiments with Gaussian processes from Figures 2 and 6, we observe that MM-ITL performs similarly to VTL and CTL.

Convergence of uncertainty We derive a convergence guarantee for MM-ITL which is analogous to the guarantees for ITL from Theorem C.12 and for VTL from Theorem C.13. We will assume for simplicity that Γn is monotonically decreasing in n (i.e., αn 1). Theorem D.1 (Convergence of uncertainty reduction of MM-ITL). Assume that Assumptions B.1 and B.2 are satisfied. Then for any n 1, if Γ0 Γn 1 and the sequence {xi}n i=1 is generated by MM-ITL, then

x A γn({x }; S). (40)

Proof. We have

13This is because the observation reduces uncertainty not just about the observed point itself. 14Because points f1:9 are highly correlated, H[f1:9] is already small .

i=0 max x S

x A I(fx ; yx | Dn)

x A I(fx ; yxn+1 | Dn)

i=0 I(fx ; yxn+1 | yx1:n)

x A I fx ; yx1:n

x A max X S |X|=n I(fx ; y X)

x A γn({x }; S)

where (i) follows by assumption; (ii) follows from the MM-ITL decision rule; (iii) uses that the posterior variance of Gaussians is independent of the realization and only depends on the location of observations; and (iv) uses the chain rule of mutual information. The remainder of the proof is analogous to the proof of Theorem C.12 (cf. Appendix C.5).

Noting that

I(fx ; yx | Dn 1) X

x A I(fx ; yx | Dn 1)

for any n 1, x X, and x A, Theorem 3.2 can be readily rederived for MM-ITL (cf. Lemmas C.14 and C.18). Hence, the posterior marginal variances of MM-ITL can be bounded uniformly in terms of Γn analogously to ITL.

D.4 Correlation-based Transductive Learning

We will briefly look at the CTL (Correlation-based TL) decision rule

xn = arg max x S

x A Cor[fx, fx | Dn 1] (41)

which permits no interpretation under principle ( ). However, if all correlations are non-negative (such as for the standard Gaussian and Matérn kernels), CTL is closely related to ITL, VTL, and MM-ITL (cf. Equations (35) and (39)). In this case, if |A| = 1, then all decision rules coincide.

If, on the other hand, correlations may be negative then there is a crucial difference between CTL and the decision rules motivated from principle ( ). Namely, decision rules following ( ) exhibit a preference for points with high absolute correlation to prediction targets as opposed to CTL which prefers points with high positive correlation. This stems from the intuitive fact that points with a strong negative correlation are equally informative as points with a strong positive correlation. Nevertheless, we observe in our experiments that (even for a linear kernel which does not ensure non-negative correlations) points selected by ITL and VTL are typically positively correlated with prediction targets.

D.5 Summary

We have seen that ITL, VTL, and MM-ITL can be seen as different interpretations of the same fundamental principle ( ), with the approximations CTL. If |A| = 1 and correlations are non-negative, then all four decision rules are equivalent. CTL prefers points with high positive

correlation whereas the other decision rules prefer points with high absolute correlation. ITL is the only decision rule that takes into account the mutual dependence between points in A, and VTL and MM-ITL differ only in their weighting of the posterior marginal variances of points in A.

E Stochastic Target Spaces

When the target space A is large, it may be computationally infeasible to compute the exact objective. A natural approach to address this issue is to approximate the target space by a smaller set of size K.

Discretizing the target space One possibility is to discretize the target space A. Compact target spaces can be addressed, e.g., via discretization arguments which are common in the Bayesian optimization literature (see, e.g., appendix C.1 of Srinivas et al. (2009)). That is, if the target space can be covered approximately using a finite (possibly large) set of points, the guarantees of Theorem 3.2 extend directly. This, however, can be impractical when the required size of discretization for sufficiently small approximation error is large. In the following, we briefly discuss a natural alternative approach based on sampling points from A.

Target distributions Let A X be a (possibly continuous) target space, and let PA be a probability distribution supported on A. In iteration n, a subset An of K points is sampled independently from A according to the distribution PA and the objective is computed on this subset. Formally, this amounts to a single-sample Monte Carlo approximation of

xn arg max x S EA iid PA[I(f A; yx | Dn 1)]. (42)

The convergence guarantees from Appendix C can be generalized to the setting of stochastic target spaces by estimating how often points near a specified prediction target x A are sampled. Definition E.1 (γ-ball at x). Given x A and any γ 0, we call the set

Bγ(x) def = {x X | x x γ}

the γ-ball at x. Further, we call PA(Bγ(x)) the weight of that ball. Proposition E.2 (sketch). Given any n 1, K 1, γ > 0, and x A, suppose that Bγ(x) has weight p > 0. Assume that the ITL objective is LI-Lipschitz continuous. Then, with probability at least 1 exp( (1 p)n/(8K)),

σ2 n(x) η2 S(x) + CLIγ γk(n) p

where k(n) def = Kpn/2.

Proof sketch. Let Yi Binom(K, p) denote the random variable counting the number of occurrences of a point from Bγ(x) in Ai. Moreover, we write Xi def = 1{Bγ(x) Ai = }. Note that

ν def = EXi = P(Bγ(x) Ai = ) = 1 P(Yi = 0) = 1 (1 p)K Kp

where the approximation stems from a first-order truncation of the Bernoulli series. Let X def = Pn i=1 Xi with EX = nν Kpn.

Using the assumed Lipschitz-continuity of the objective, we know that I(f A ; yx | Dn 1) LIγI(f A; yx | Dn 1) where A def = (A \ {xγ}) {x} and xγ is the point from the γ-ball at x. The bound then follows analogously to Theorem 3.2.

Finally, by Chernoff s bound, at least Kpn/2 iterations contain a point from Bγ(x) with probability at least 1 exp( Kpn/8).

This strategy can also be used to generalize the VTL, CTL, and MM-ITL objectives to stochastic target spaces.

F Closed-form Decision Rules

Below, we list the closed-form expressions for the ITL and VTL objectives. In the following, kn denotes the kernel conditional on Dn.

I(f A; yx | Dn 1) = 1

2 log Var[yx | Dn 1] Var[yx | f A, Dn 1]

kn 1(x, x) + ρ2

ˆkn 1(x, x) + ρ2

where ˆkn(x, x) = kn(x, x) kn(x, A)Kn(A, A) 1kn(A, x).

tr Var[f A | Dn 1, yx] = X

kn 1(x , x ) kn 1(x, x )2

kn 1(x, x) + ρ2

G Computational Complexity

Evaluating the acquisition function of ITL in round n requires computing for each x S,

I(f A; yx | Dn)

2 log |Var[f A | Dn]| |Var[f A | yx, Dn]|

2 log Var[yx | Dn] Var[yx | f A, Dn]

(backward).

Let |S| = m and |A| = k. Then, the forward method has complexity O m k3 . For the backward method, observe that the variances are scalar and the covariance matrix Var[f A | Dn] only has to be inverted once for all points x. Thus, the backward method has complexity O k3 + m .

When the size m of S is relatively small (and hence, all points in S can be considered during each iteration of the algorithm), GP inference corresponds simply to computing conditional distributions of a multivariate Gaussian. The performance can therefore be improved by keeping track of the full posterior distribution over f S of size O m2 and conditioning on the latest observation during each iteration of the algorithm. In this case, after each observation the posterior can be updated at a cost of O m2 which does not grow with the time n, unlike classical GP inference.

Overall, when m is small, the computational complexity of ITL is O k3 + m2 . When m is large (or possibly infinite) and a subset of m points is considered in a given iteration, the computational complexity of ITL is O k3 + m n3 , neglecting the complexity of selecting the m candidate points. In the latter case, the computational cost of ITL is dominated by the cost of GP inference.

Khanna et al. (2017) discuss distributed and stochastic approximations of greedy algorithms to (weakly) submodular problems that are also applicable to ITL.

H Additional GP Experiments & Details

We use homoscedastic Gaussian noise with standard deviation ρ = 0.1 and a discretization of X = [ 3, 3]2 of size 2 500. Uncertainty bands correspond to one standard error over 10 random seeds.

Additional experiments Figure 6 includes the following additional experiments:

1. Extrapolation Setting (A S = ): Right experiment from Figure 2 under the Gaussian kernel. ITL has a similar advantage as in the setting shown in Figure 3. 2. Heteroscedastic Noise: Left experiment from Figure 2 under the Gaussian kernel with heteroscedastic Gaussian noise

ρ(x) = 1 if x [ 1

0.1 otherwise .

If observation noise is heteroscedastic, in considering posterior rather than prior uncertainty, ITL avoids points with high aleatoric uncertainty, which accelerates learning.

0 100 200 300 400 500 Number of Iterations

Entropy of f A

Extrapolation Setting (A S = )

0 100 200 300 400 500 Number of Iterations

Entropy of f A

Heteroscedastic Noise

0 100 200 300 400 500 Number of Iterations

Entropy of f A

Effect of Smoothness: Laplace Kernel

0 100 200 300 400 500 Number of Iterations

Entropy of f A

Sparser Target

ITL VTL CTL UNSA LOCAL UNSA TRUVAR RANDOM

Figure 6: Additional GP experiments

3. Effect of Smoothness: Experiment from Figure 3 under the Laplace kernel. All algorithms except for US and RANDOM perform equally well. This validates our claims from Section 3.3: in the extreme non-smooth case of a Laplace kernel and A S, points outside A do not provide any additional information, and ITL and local UNSA coincide.

4. Sparser Target: Experiment from Figure 3 under the Gaussian kernel, but with domain extended to X = [ 10, 10]2.

Hyperparameters of TRUVAR As suggested by Bogunovic et al. (2016), we use η2 (1) = 1, r = 0.1, and δ = 0 (even though the theory only holds for δ > 0). The TRUVAR baseline only applies when A S (cf. Section 6).

Smoothing to reduce numerical noise Applied running average with window 5 to entropy curves of Figures 2 and 6 to smoothen out numerical noise.

I Alternative Settings for Active Fine-Tuning

In our main experiments, we consider the setting A S = , i.e., the prediction targets cannot be used for fine-tuning since their labels are not known. This setting is particularly relevant for practical applications where the model is fine-tuned dynamically at test time to each prediction target (or a small set of prediction target). Put differently, in this transductive setting, extrapolation to new prediction targets happens at test-time with knowledge of the prediction target(s). This is in contrast to a more traditional inductive setting, where extrapolation happens at train-time without knowledge of the concrete prediction targets, but under the assumption of samples from (or knowledge of) the target distribution. In the following, we briefly survey two settings motivated from an inductive perspective.

I.1 Prediction Targets are Contained in Sample Space: A S

If labels can be obtained cheaply, one can also fine-tune on the prediction targets directly, i.e., A S. Note, however, that the set A is still assumed to be small (e.g., |A| = 100 in the CIFAR-100 experiment). We perform an experiment in this setting and report the results in Figure 7. The experiment shows that similarly to the GP experiment from Figure 2 there can be additional value in fine-tuning the model on relevant data selected from S beyond simply fine-tuning the model on A.

0 20 40 60 80 100 Number of Batches of Size 10

0 20 40 60 80 100 Number of Batches of Size 10

# Samples from Support of PA

ITL VTL EIG TYPICLUST PROBCOVER MAXDIST RANDOM

Figure 7: Evaluation of CIFAR-100 experiment in the setting A S, i.e., one can also sample from the 100 prediction targets A. The solid black line denotes the performance of the model fine-tuned on all of A. This experiment shows that there is additional value in fine-tuning the model on relevant data from S beyond simply fine-tuning the model on A. The baselines are summarized in Appendix J.5

I.2 Active Domain Adaptation

Active DA (Rai et al., 2010; Saha et al., 2011; Berlind & Urner, 2015) studies the problem of selecting the most informative samples from a (large) target domain A, given a model trained on a source domain S. This problem can be cast as an instance of transductive active learning with target space A and sample space S = S A where the model is already conditioned on all of S. This is slightly different from the setting considered in Section 4 where A is small and not necessarily part of the sample space. We hypothesize that ITL behaves similarly to recent work on active DA (Su et al., 2020; Prabhu et al., 2021; Fu et al., 2021): querying informative and diverse samples from A that are dissimilar to S. Evaluating ITL and VTL empirically in this setting is a promising direction for future work.

J Additional NN Experiments & Details

We outline the active fine-tuning of NNs in Algorithm 1.

Algorithm 1 Active Fine-Tuning of NNs

Given: initialized or pre-trained model f, small sample A PA initialize dataset D = repeat

sample S PS subsample target space A u.a.r. A initialize batch B = compute kernel matrix K over domain [S, A ] repeat b times

compute acquisition function w.r.t. A , based on K add maximizer x S of acquisition function to B update conditional kernel matrix K obtain labels for B and add to dataset D update f using data D

In Appendix J.1, we detail metrics and hyperparameters. We describe in Appendices J.2 and J.3 how to compute the (initial) conditional kernel matrix K, and in Appendix J.4 how to update this matrix K to obtain conditional embeddings for batch selection.

In Appendix J.5, we show that ITL and CTL significantly outperform a wide selection of commonly used heuristics. In Appendices J.6 and J.7, we conduct additional experiments and ablations.

Table 1: Hyperparameter summary of NN experiments. (*) we train until convergence on oracle validation accuracy.

MNIST CIFAR-100

ρ 0.01 1 M 30 100 m 3 10 k 1 000 1 000 batch size b 1 10 # of epochs (*) 5 learning rate 0.001 0.001

Hübotter et al. (2024) discusses additional motivation and related work that has previously studied active fine-tuning, but which has largely focused on the training algorithm rather than data selection.

J.1 Experiment Details

We evaluate the accuracy with respect to PA using a Monte Carlo approximation with out-of-sample data:

accuracy(bθ) E(x,y) PA1{y = arg max i fi(x; bθ)}.

We provide an overview of the hyperparameters used in our NN experiments in Table 1. The effect of noise standard deviation ρ is small for all tested ρ [1, 100] (cf. ablation study in Table 2).15 M denotes the size of the sample A PA. In each iteration, we select the target space A A as a random subset of m points from A.16 We provide an ablation over m in Appendix J.6.

During each iteration, we select the batch B according to the decision rule from a random sample from PS of size k.17

Since we train the MNIST model from scratch, we train from random initialization until convergence on oracle validation accuracy.18 We do this to stabilize the learning curves, and provide the least biased (due to the training algorithm) results. For CIFAR-100, we train for 5 epochs (starting from the previous iterations model) which we found to be sufficient to obtain good performance.

We use the ADAM optimizer (Kingma & Ba, 2014). In our CIFAR-100 experiments, we use a pre-trained Efficient Net-B0 (Tan & Le, 2019), and fine-tune the final and penultimate layers. We freeze earlier layers to prevent overfitting to the few-shot training data.

To prevent numerical inaccuracies when computing the ITL objective, we optimize

I(y A; yx | Dn 1) = 1

2 log Var[yx | Dn 1] Var[yx | y A, Dn 1]

instead of Equation (43), which amounts to adding ρ2 to the diagonal of the covariance matrix before inversion. This appears to improve numerical stability, especially when using gradient embeddings.19

15We use a larger noise standard deviation ρ in CIFAR-100 to stabilize the numerics of batch selection via conditional embeddings (cf. Table 2). 16This appears to improve the training, likely because it prevents overfitting to peculiarities in the finite sample A (cf. Figure 16). 17In large-scale problems, the work of Coleman et al. (2022) suggests to use an (approximate) nearest neighbor search to select the (large) candidate set rather than sampling u.a.r. from PS. This can be a viable alternative to simply increasing k and suggests future work. 18That is, to stop training as soon as accuracy on a validation set from PA decreases in an epoch. 19In our experiments, we observe that the effect of various choices of ρ on this slight adaptation of the ITL decision rule has negligible impact on performance. The more prominent effect of ρ appears to arise from the batch selection via conditional embeddings (cf. Table 2).

In our experiments, we use last-layer neural tangent embeddings20 and Σ = I to evaluate ITL and VTL, and select inputs for labeling and training f. Notably, we use this linear Gaussian approximation of f only to guide the active data selection and not for inference.

J.2 Embeddings and Kernels

Using a neural network to parameterize f, we evaluate the canonical approximations of f by a stochastic process in the following.

An embedding ϕ(x) is a latent representation of an input x. Collecting the embeddings as rows in the design matrix Φ of a set of inputs X, one can approximate the network by the linear function f X = Φβ with weights β. Approximating the weights by β N(µ, Σ) implies that f X N(Φµ, ΦΣΦ ). The covariance matrix KXX = ΦΣΦ can be succinctly represented in terms of its associated kernel k(x, x ) = ϕ(x) Σϕ(x ). Here,

ϕ(x) is the latent representation of x, and Σ captures the dependencies in the latent space.

While any choice of embedding ϕ is possible, the following are common choices:

1. Last-Layer: A common choice for ϕ(x) is the representation of x from the penultimate layer of the neural network (Holzmüller et al., 2023). Interpreting the early layers as a feature encoder, this uses the low-dimensional feature map akin to random feature methods (Rahimi & Recht, 2007). 2. Output Gradients (e NTK): Another common choice is ϕ(x) = θ f(x; θ) where θ are the network parameters (Holzmüller et al., 2023). Its associated kernel is known as the empirical neural tangent kernel (e NTK) and the posterior mean of this GP approximates ultra-wide NNs trained with gradient descent (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019; Khan et al., 2019; He et al., 2020; Malladi et al., 2023). Kassraie & Krause (2022) derive bounds of γn under this kernel. If θ is restricted to the weights of the final linear layer, then this embedding is simply the last-layer embedding. 3. Loss Gradients: Another possible choice is ϕ(x) = θ ℓ(f(x; θ), ˆy(x))|θ=bθ where ℓis a loss function, ˆy(x) is the predicted label, and bθ are the current parameter estimates (Ash et al., 2020). 4. Outputs (e NNGP): Another possible choice is ϕ(x) = f(x), i.e., the output of the network. Its associated kernel is known as the empirical neural network Gaussian process (e NNGP) kernel (Lee et al., 2018). 5. Predictive (Kirsch, 2023): Given a Bayesian neural network (Blundell et al., 2015) or probabilistic (deep) ensemble (Lakshminarayanan et al., 2017), which induce samples θ1, . . . , θK p(θ) from the distribution over network parameters, one can approximate the predictive covariance k(x, x ) = Covθ[f(x; θ), f(x ; θ)]. This kernel measures proximity in the prediction space rather than parameter space and as such does not require gradient information. The corresponding feature map is ϕ(x) = 1

K [ f(x; θ1) f(x; θK)] where f(x; θk) def = f(x; θk) 1

K PK l=1 f(x; θl).

In the additional experiments from this appendix we use last-layer embeddings unless noted otherwise. We compare the performance of last-layer and the loss gradient embedding ϕ(x) = θ ℓCE(f(x; θ), ˆy(x))|θ=bθ (45)

where θ are the parameters of the final output layer, bθ are the current parameter estimates, ˆy(x) = arg maxi fi(x; bθ) are the associated predicted labels, and ℓCE denotes the cross-entropy loss. This gradient embedding captures the potential update direction upon observing a new point (Ash et al., 2020). Moreover, Ash et al. (2020) show that for most neural networks, the norm of these gradient embeddings are a conservative lower bound to the norm assumed by taking any other proxy label ˆy(x). In Figure 8, we observe only negligible differences in performance between this and the last-layer embedding.

20We observe essentially the same performance with loss gradient embeddings, cf. Appendix J.2.

0 20 40 60 80 100 50

0 20 40 60 80 100 20

100 CIFAR-100

0 20 40 60 80 100 Number of Samples

# Samples from Support of PA

0 20 40 60 80 100 Number of Batches of Size 10

G-ITL L-ITL G-CTL L-CTL RANDOM

Figure 8: Comparison of loss gradient ( G- ) and last-layer embeddings ( L- ).

0 20 40 60 80 100 Number of Samples

0 20 40 60 80 100 Number of Samples

# Samples from Support of PA

G-ITL G-ITL (LA) L-ITL L-ITL (LA) RANDOM

Figure 9: Uncertainty quantification (i.e., estimation of Σ) via a Laplace approximation (LA, Daxberger et al. (2021)) over last-layer weights using a Kronecker factored log-likelihood Hessian approximation (Martens & Grosse, 2015) and the loss gradient embeddings from Equation (45). The results are shown for the MNIST experiment. We do not observe a performance improvement beyond the trivial approximation Σ = I.

J.3 Towards Uncertainty Quantification in Latent Space

A straightforward and common approximation of the uncertainty about NN weights is given by Σ = I, and we use this approximation throughout our experiments.

The poor performance of UNSA (cf. Appendix J.5) with this approximation suggests that with more sophisticated approximations, the performance of ITL, VTL, and CTL can be further improved. Further research is needed to study the effect of more sophisticated approximations of uncertainty in the latent space. For example, with parameter gradient embeddings, the latent space is the network parameter space where various approximations of Σ based on Laplace approximation (Daxberger et al., 2021; Antorán et al., 2022), variational inference (Blundell et al., 2015), or Markov chain Monte Carlo (Maddox et al., 2019) have been studied. We also evaluate Laplace approximation (LA, Daxberger et al. (2021)) for estimating Σ but see no improvement (cf. Figure 9). Nevertheless, we believe that uncertainty quantification is a promising direction for future work, with the potential to improve performance of ITL and its variations substantially.

J.4 Batch Selection via Conditional Embeddings

We will refer to the greedy decision rule from Equation (3) as BACE, short for Batch selection via Conditional Embeddings. BACE can be implemented efficiently using the Gaussian approximation of f X from Appendix J.2 by iteratively conditioning on the previously selected points xn,1:i 1, and updating the kernel matrix KXX using the closed-form formula for the variance of conditional

0 20 40 60 80 100 Number of Batches of Size 10

0 20 40 60 80 100 Number of Batches of Size 10

# Samples from Support of PA

ITL ITL (top-b) CTL CTL (top-b) RANDOM

Figure 10: Advantage of batch selection via conditional embeddings over top-b selection in the CIFAR-100 experiment.

KXX KXX 1 Kxjxj + ρ2 KXxj Kxj X (46)

where j denotes the index of the selected xn,i within X and ρ2 is the noise variance. Note that Kxjxj is a scalar and KXxj is a row vector, and hence, this iterative update can be implemented efficiently.

We remark that Equation (3) is a natural extension of previous non-adaptive active learning methods, which typically maximize some notion of distance between points in the batch, to the directed setting (Ash et al., 2020; Zanette et al., 2021; Holzmüller et al., 2023; Pacchiano et al., 2024). BACE simultaneously maximizes distance between points in a batch and minimizes distance to points in A.

The sample efficiency of BACE Bn, and therefore also the greedily constructed B n (which gives a constant-factor approximation with respect to the objective), yields diverse batches by design. In Figure 10, we compare BACE to selecting the top-b points according to the decision rule (which does not yield diverse batches). We observe a significant improvement in accuracy and data retrieval when using BACE. We expect the gap between both approaches to widen further with larger batch sizes.

Computational complexity of BACE As derived in Appendix G, a single batch selection step of BACE has complexity O b(k3 + m2) where b is the size of the batch, k = |A| is the size of the target space, and m = |S| is the size of the candidate set. In the case of large m, an alternative implementation whose runtime does not depend on m is described in Appendix G.

J.5 Baselines

In Figure 11, we compare against additional baselines:

Both TYPICLUST (Hacohen et al., 2022) and PROBCOVER (Yehuda et al., 2022) are recent methods to select points that cover the data distribution well. To maintain comparability between algorithms, we use the same embeddings as for ITL which are re-computed before every new batch selection. ITL significantly outperforms TYPICLUST & PROBCOVER, which only attempt to cover S well without taking A into account (i.e., are undirected ).

Mehta et al. (2022) introduced EIG for training neural classification models, which uses the same decision rule as ITL, but approximates the conditional entropy based on the networks softmax output rather than using a GP approximation. We approximate the conditional entropy using a single gradient step of the hallucinated updates on the parameters of the final layer, as mentioned by Mehta et al. (2022). We observe that EIG is not competitive for batch-wise selection (CIFAR-100) since it does not encourage batch diversity. Moreover, we observe that EIG is orders of magnitude slower than ITL (since it has to compute |S| C individual gradient steps where C is the number of classes). We note that since our datasets are balanced, the AEIG algorithm from Mehta et al. (2022) coincides with EIG.

100 CIFAR-100

0 20 40 60 80 100 Number of Samples

# Samples from PA

0 20 40 60 80 100 Number of Batches of Size 10

ITL EIG TYPICLUST PROBCOVER RANDOM

Figure 11: Comparison to baselines for the experiment of Figure 4.

Since, EIG does not have an open-source implementation, we implemented it ourselves following Mehta et al. (2022). For TYPICLUST & PROBCOVER, we use the author s implementation. In the figure, we show that ITL & VTL substantially outperform all baselines.

In the following, we briefly describe other commonly used undirected decision rules.

Denote the softmax distribution over labels i at inputs x by

pi(x; bθ) exp(fi(x; bθ)).

The following heuristics computed based on the softmax distribution aim to quantify the uncertainty about a particular input x:

MAXENTROPY (Settles & Craven, 2008):

xn = arg max x S H[p(x; bθn 1)].

MAXMARGIN (Scheffer et al., 2001; Settles & Craven, 2008):

xn = arg min x S p1(x; bθn 1) p2(x; bθn 1)

where p1 and p2 are the two largest class probabilities.

LEASTCONFIDENCE (Lewis & Gale, 1994; Settles & Craven, 2008; Hendrycks & Gimpel,

2017; Tamkin et al., 2022):

xn = arg min x S p1(x; bθn 1)

where p1 is the largest class probability.

An alternative class of decision rules aims to select diverse batches by maximizing the distances between points. Embeddings ϕ(x) induce the (Euclidean) embedding distance

dϕ(x, x ) def = ϕ(x) ϕ(x ) 2 .

Similarly, a kernel k induces the kernel distance

dk(x, x ) def = q

k(x, x) + k(x , x ) 2k(x, x ).

It is straightforward to see that if k(x, x ) = ϕ(x) ϕ(x ), then embedding and kernel distances coincide, i.e., dϕ(x, x ) = dk(x, x ).

MAXDIST (Holzmüller et al., 2023; Yu & Kim, 2010; Sener & Savarese, 2017; Geifman & El-Yaniv, 2017) constructs the batch by choosing the point with the maximum distance to the nearest previously selected point:

xn = arg max x S min i<n d(x, xi)

Similarly, K-MEANS++ (Holzmüller et al., 2023) selects the batch via K-MEANS++ seeding (Arthur et al., 2007; Ostrovsky et al., 2013). That is, the first centroid x1 is chosen uniformly at random and the subsequent centroids are chosen with a probability proportional to the square of the distance to the nearest previously selected centroid:

P(xn = x) min i<n d(x, xi)2.

When using the loss gradient embeddings from Equation (45), this decision rule is known as BADGE (Ash et al., 2020).

Finally, we summarize common kernel-based decision rules.

UNDIRECTED ITL chooses

xn = arg max x S I(f S; yx | Dn 1)

= arg max x S I(fx; yx | Dn 1) .

This can be shown to be equivalent to MAXDET (Holzmüller et al., 2023) which selects

xn = arg max x S

where Kx denotes the kernel matrix over x1:n 1 {x}, conditioned on the prior observations Dn 1. UNSA (Lewis & Catlett, 1994) which with embeddings ϕn 1 after round n 1 corresponds to:

xn = arg max x S σ2 n 1(x) = arg max x S

ϕn 1(x) 2 2 .

With batch size b = 1, UNSA coincides with UNDIRECTED ITL. When evaluated with gradient embeddings, this acquisition function is similar to previously used embedding length or gradient length heuristics (Settles & Craven, 2008). UNDIRECTED VTL (Cohn, 1993) is the special case of VTL without specified prediction targets (i.e., A = S). In the literature, this decision rule is also known as BAIT (Holzmüller et al., 2023; Ash et al., 2021).

We compare to the abovementioned decision rules and summarize the results in Figure 12. We observe that most undirected decision rules perform worse (and often significantly so) than RANDOM. This is likely due to frequently selecting points from the support of PS which are not in the support of PA since the points are adversarial examples that the model bθ is not trained to perform well on. In the case of MNIST, the poor performance can also partially be attributed to the well-known cold-start problem (Gao et al., 2020).

In Figure 4, we also compare to the following directed decision rules:

COSINESIMILARITY (Settles & Craven, 2008) selects xn = arg maxx S ϕn 1(x, A) where

ϕ(x, A) def = 1 |A|

ϕ(x) ϕ(x ) ϕ(x) 2 ϕ(x ) 2 .

INFORMATIONDENSITY (Settles & Craven, 2008) is defined as the multiplicative combination of MAXENTROPY and COSINESIMILARITY:

xn = arg max x S H[p(x; bθn 1)] ϕn 1(x, A) β

where β > 0 controls the relative importance of both terms. We set β = 1 in our experiments.

100 CIFAR-100

0 20 40 60 80 Number of Samples

# Samples from Support of PA

0 20 40 60 80 Number of Batches of Size 10

MAXENTROPY MINMARGIN

LEASTCONFIDENCE MAXDIST

K-MEANS++ UNDIRECTED ITL

UNDIRECTED VTL UNSA

Figure 12: Comparison of undirected baselines for the experiment of Figure 4. In the MNIST experiment, UNSA and UNDIRECTED ITL coincide, and we therefore only plot the latter.

0 20 40 60 80 100

0 10 20 30 40 50 60 70 80 0

100 CIFAR-100

0 20 40 60 80 100 Number of Samples

# Samples from Support of PA

0 10 20 30 40 50 60 70 80 Number of Batches of Size 10

ITL CTL COSINESIMILARITY INFORMATIONDENSITY BADGE RANDOM

Figure 13: Imbalanced PS experiment.

0 20 40 60 80 100 50

0 20 40 60 80 100 20

100 CIFAR-100

0 20 40 60 80 100 Number of Samples

# Samples from Support of PA

0 20 40 60 80 100 Number of Batches of Size 10

ITL CTL COSINESIMILARITY INFORMATIONDENSITY BADGE RANDOM

Figure 14: Imbalanced A PA experiment.

0 20 40 60 80 Number of Batches of Size 10

0 20 40 60 80 Number of Batches of Size 10

# Samples from Support of PA

ITL (k = 50 000)

ITL (k = 1 000)

VTL (k = 50 000)

VTL (k = 1 000)

CTL (k = 50 000)

CTL (k = 1 000)

Figure 15: Performance of VTL & choice of k in the CIFAR-100 experiment.

J.6 Additional experiments

We conduct the following additional experiments:

1. Imbalanced PS (Figure 13): We artificially remove 80% of the support of PA from PS. For example, in case of MNIST, we remove 80% of the images with labels 3, 6, and 9 from PS. This makes the learning task more difficult, as PA is less represented in PS, meaning that the targets are more sparse. The trend of ITL outperforming CTL which outperforms RANDOM is even more pronounced in this setting. 2. Imbalanced A PA (Figure 14): We artificially remove 50% of part of the support of PA while generating A PA to evaluate the robustness of ITL and CTL in presence of an imbalanced target space A. Concretely, in case of MNIST, we remove 50% of the images with labels 3 and 6 from A. In case of CIFAR-100, we remove 50% of the images with labels {0, . . . , 4} from A. We still observe the same trends as in the other experiments. 3. VTL & choice of k (Figure 15): We observe that VTL performs almost as well as ITL. Additionally, we evaluate the effect of the number of points k at which the decision rule is evaluated. Not surprisingly, we observe that the performance of ITL, VTL, and CTL improves with larger k. 4. Choice of m (Figure 16): Next, we evaluate the choice of m, i.e., the size of the target space A relative to the number M of candidate points A PA. We write p = m/M. We generally observe that a larger p leads to better performance (with p = 1 being the best choice). However, it appears that a smaller p can be beneficial with respect to accuracy when a large number of batches are selected. We believe that this may be because a smaller p improves the diversity between selected batches. 5. Choice of M (Figure 17): Finally, we evaluate the choice of M, i.e., the size of A PA. Not surprisingly, we observe that the performance of ITL improves with larger M.

J.7 Ablation study of noise standard deviation ρ

In Table 2, we evaluate the CIFAR-100 experiment with different noise standard deviations ρ. We observe that the performance of batch selection via conditional embeddings drops (mostly for the less numerically stable gradient embeddings) if ρ is too small, since this leads to numerical inaccuracies when computing the conditional embeddings. Apart from this, the effect of ρ is negligible.

100 CIFAR-100

0 20 40 60 80 Number of Samples

# Samples from Support of PA

0 20 40 60 80 Number of Batches of Size 10

ITL (p = 0.05) ITL (p = 0.1) ITL (p = 0.5) ITL (p = 1)

Figure 16: Evaluation of the choice of m relative to the size M of A PA. Here, p = m/M.

0 20 40 60 80 Number of Batches of Size 10

0 20 40 60 80 Number of Batches of Size 10

# Samples from Support of PA

ITL (M = 10) ITL (M = 50) ITL (M = 100) ITL (M = 500)

Figure 17: Evaluation of the choice of M, i.e., the size of A PA, in the CIFAR-100 experiment.

Table 2: Ablation study of noise standard deviation ρ in the CIFAR-100 experiment. We list the accuracy after 100 rounds per decision rule, with its standard error over 10 random seeds. (top-b) denotes variants where batches are selected by taking the top-b points according to the decision rule rather than using batch selection via conditional embeddings. Shown in bold are the best performing decision rules, and shown in italics are results due to numerical instability.

ρ 0.0001 0.01 1 100

G-ITL 78.26 1.40 79.12 1.19 87.16 0.29 87.18 0.28 L-ITL 87.52 0.48 87.52 0.41 87.53 0.35 86.47 0.22 G-CTL 58.68 2.11 81.44 1.04 86.52 0.44 86.92 0.56 L-CTL 86.40 0.71 86.38 0.75 86.00 0.69 84.78 0.39 G-ITL (top-b) 85.84 0.54 85.92 0.52 85.84 0.54 85.55 0.46 L-ITL (top-b) 85.44 0.58 85.46 0.54 85.44 0.59 85.29 0.36 G-CTL (top-b) 82.27 0.67 82.27 0.67 82.27 0.67 82.27 0.67 L-CTL (top-b) 80.73 0.68 80.73 0.68 80.73 0.68 80.73 0.68 BADGE 83.24 0.60 83.24 0.60 83.24 0.60 83.24 0.60 INFORMATIONDENSITY 79.24 0.51 79.24 0.51 79.24 0.51 79.24 0.51 RANDOM 82.49 0.66 82.49 0.66 82.49 0.66 82.49 0.66

0 1000 Number of Iterations

0 100 Number of Iterations

0 50 Number of Iterations

Controller Tuning for Quadcopter

ITL-An ITL-PAn VTL-An VTL-PAn ISE-BO

Figure 18: We perform the tasks of Figure 5 using Thompson sampling to evaluate the stochastic target space PAn. We additionally compare to GOOSE (cf. Appendix K.2.3) and ISE-BO (cf. Appendix K.2.4).

K Additional Safe BO Experiments & Details

In Appendix K.1, we discuss the use of stochastic target spaces in the safe BO setting. We provide a comprehensive overview of prior works in Appendix K.2 and an additional experiment highlighting that ITL, unlike SAFEOPT, is able to jump past local barriers in Appendix K.3. In Appendix K.4, we provide details on the experiments from Figure 5.

K.1 A More Exploitative Stochastic Target Space

Alternatively to the target space An which comprises all potentially optimal points, we evaluate the stochastic target space

PAn( ) = P( arg max x X:g(x) 0 f(x) = | Dn) (47)

which effectively weights points in An according to how likely they are to be the safe optimum, and is therefore more exploitative than the uniformly-weighted target space discussed so far. Samples from PAn can be obtained efficiently via Thompson sampling (Thompson, 1933; Russo et al., 2018). Observe that PAn is supported precisely on the set of potential maximizers An. We provide a formal analysis of stochastic target spaces in Appendix E. Whether transductive active learning with An or PAn performs better is task-dependent, as we will see in the following.

Note that performing ITL with this target space is analogous to output-space entropy search (Wang & Jegelka, 2017). Samples from PAn can be obtained via Thompson sampling (Thompson, 1933; Russo et al., 2018). That is, in iteration n + 1, we sample K N independent functions f (j) f | Dn from the posterior distribution and select K points x(1), . . . , x(K) which are a safe maximum of f (1), . . . , f (K), respectively.

Experiments In Figure 18, we contrast the performance of ITL with PAn to the performance of ITL with the exact target space An. We observe that their relative performance is instance dependent: in tasks that require more difficult expansion, ITL with An converges faster, whereas in simpler tasks (such as the 2d experiment), ITL with PAn converges faster. We compare against the GOOSE algorithm (Turchetta et al., 2019) which is a heuristic extension of SAFEOPT that explores more greedily in directions of (assumed) high reward (cf. Appendix K.2.3). GOOSE suffers from the same limitations as SAFEOPT, which were highlighted in Section 5, and additionally is limited by its heuristic approach to expansion which fails in the 1d task and safe controller tuning task. Analogously to our experiments with SAFEOPT, we also compare against ORACLE GOOSE which has oracle knowledge of the true Lipschitz constants.

The different behaviors of ITL with An and PAn, respectively, as well as SAFEOPT and GOOSE are illustrated in Figure 19. We observe that ITL with An and SAFEOPT expand the safe set more uniformly since the set of potential maximizers encircles the true safe set.21 Intuitively, this is because the set of potential maximizers conservatively captures migh points might be safe and

21This is because typically, there will always remain points in b Sn \ Sn of which the safety cannot be fully determined, and since, they cannot be observed, it can also not be ruled out that they have high objective value.

optimal. In contrast, ITL with PAn and GOOSE focus exploration and expansion in those regions where the objective is likely to be high.

Figure 19: The first 100 samples of (A) ITL with An, (B) SAFEOPT, (C) ORACLE SAFEOPT, (D) ITL with PAn, (E) GOOSE, (F) ORACLE GOOSE. The white region denotes the pessimistic safe set S100, the light gray region denotes the true safe set S (i.e., the island ), and the darker gray regions denotes unsafe points (i.e., the ocean ).

K.2 Detailed Comparison with Prior Works

The most widely used method for Safe BO is SAFEOPT (Sui et al., 2015; Berkenkamp et al., 2021) which keeps track of separate candidate sets for expansion and exploration and uses UNSA to pick one of the candidates in each round. Treating expansion and exploration separately, sampling is directed towards expansion in all directions even those that are known to be suboptimal. The safe set is expanded based on a Lipschitz constant of g , which is assumed to be known. In most real-world settings, this constant is unknown and has to be estimated using the GP. This estimate is generally conservative and results in suboptimal performance. To this end, Berkenkamp et al. (2016) proposed HEURISTIC SAFEOPT which relies solely on the confidence intervals of g to expand the safe set, but lacks convergence guarantees. More recently, Bottero et al. (2022) proposed ISE which queries parameters from Sn that yield the most information about the safety of another parameter in X. Hence, ISE focuses solely on the expansion of the safe set Sn and does not take into account the objective f. In practice, this can lead to significantly worse performance on the simplest of problems (cf. Figure 5). In contrast, ITL balances expansion of and exploration within the safe set. Furthermore, ISE does not have known convergence guarantees of the kind of Theorem 5.1. In parallel independent work, Bottero et al. (2024) proposed a combination of ISE and max-value entropy search (Wang & Jegelka, 2017) for which they derive a similar guarantee to Theorem 5.1.22 Similar to SAFEOPT, their method aims to expand the safe set in all directions including those that are known to be suboptimal. In contrast, ITL directs expansion only towards potentially optimal regions.

In the 1d task and quadcopter experiment (cf. Figure 5), we observe that SAFEOPT and even ORACLE SAFEOPT converge significantly slower than ITL to the safe optima. We believe this is due to their conservative Lipschitz-continuity/global smoothness-based expansion, as opposed to ITL s expansion,

22We provide an empirical evaluation in Appendix K.2.4.

which adapts to the local smoothness of the constraints. HEURISTIC SAFEOPT, which does not rely on the Lipschitz constant for expansion, does not efficiently expand the safe set due to its heuristic that only considers single-step expansion. This is especially the case for the 1d task. Furthermore, in the 2d task, we notice the suboptimality of ISE since it does not take into account the objective, and purely aims to expand the safe set. ITL, on the other hand, balances expansion and exploration.

K.2.1 SAFEOPT

SAFEOPT (Sui et al., 2015; Berkenkamp et al., 2021) is a well-known algorithm for Safe BO.

Lipschitz-based expansion SAFEOPT expands the set of known-to-be safe points by assuming knowledge of an upper bound Li to the Lipschitz constant of the unknown constraints g i .23 In each iteration, the (pessimistic) safe set Sn is updated to include all points which can be reached safely (with respect to the Lipschitz continuity) from a known-to-be-safe point x Sn. Formally,

SSAFEOPT n def = [

x SSAFEOPT n 1

ln,i(x) Li x x 2 0 for all i Is}.

The expansion of the safe set is illustrated in Figure 20.

We remark two main limitations of this approach. First, the Lipschitz constant is an additional safety critical hyperparameter of the algorithm, which is typically not known. The RKHS assumption (cf. Assumption C.25) induces an assumption on the Lipschitz continuity, however, the worst-case a-priori Lipschitz constant is typically very large, and prohibitive for expansion. Second, the Lipschitz constant is global property of the unknown function, meaning that it does not adapt to the local smoothness. For example, a constraint may be flat in one direction (permitting straightforward expansion) and steep in another direction (requiring slow expansion). Furthermore, the Lipschitz constant is constant over time, whereas ITL is able to adapt to the local smoothness and reduce the (induced) Lipschitz constant over time.

Undirected expansion SAFEOPT addresses the trade-off between expansion and exploration by focusing learning on two different sets. First, the set of maximizers

MSAFEOPT n def = {x SSAFEOPT n | un,f(x) max x SSAFEOPT n ln,f(x)}

which contains all known-to-be-safe points which are potentially optimal. Note that if SSAFEOPT n = Sn then MSAFEOPT n An since An contains points which are potentially optimal and potentially safe but possibly unsafe.

To facilitate expansion, for each point x Sn, the algorithm considers a set of expanding points

F SAFEOPT n (x) def = {x X \ SSAFEOPT n | un,i(x) Li x x 2 0 for all i Is}

A point is expanding if it is unsafe initially and can be (optimistically) deduced as safe by observing x. The set of expanders corresponds to all known-to-be-safe points which optimistically lead to expansion of the safe set:

GSAFEOPT n def = {x Sn | |Fn(x)| > 0}.

That is, an expander is a safe point x which is close to at least one expanding point x . Observe that here, we start with a safe x and then find a close and potentially safe x using the Lipschitz-property of the constraint function. Thus, the set of expanding points is inherently limited by the assumed Lipschitzness (cf. Figure 20), and generally a subset of the potential expanders En (cf. Equation (27)):

Lemma K.1. For any n 0, if SSAFEOPT n = Sn then [

x Sn F SAFEOPT n (x) En.

23Recall that due to the assumption that g i k < , g i is indeed Lipschitz continuous.

Figure 20: Illustration of the expansion of the safe set à la SAFEOPT. Here, the blue region denotes the pessimistic safe set S, the red region denotes the true safe set S , and the orange region denotes the optimistic safe set b S. Whereas ITL learns about the point x directly, SAFEOPT expands the safe set using the reduction of uncertainty at x, and then extrapolating using the Lipschitz constant (cf. Equation (48)). The dashed orange line denotes the expanding points of SAFEOPT which under-approximate the optimistic safe set of ITL (cf. Lemma K.1). Thus, ITL may even learn about points in b S which are out of reach for SAFEOPT.

Proof. Without loss of generality, we consider the case where Is = {i}. We have

En = b Sn \ Sn = {x X \ Sn | un,i(x) 0}.

The result follows directly by observing that Li x x 2 0.

SAFEOPT then selects xn+1 according to uncertainty sampling within the maximizers and expanders: MSAFEOPT n GSAFEOPT n . We remark that due to the separate handling of expansion and exploration, SAFEOPT expands the safe set in all directions even those that are known to be suboptimal. In contrast, ITL only expands the safe set in directions that are potentially optimal by balancing expansion and exploration through the single set of potential maximizers An.

Based on uncertainty sampling As mentioned in the previous paragraph, SAFEOPT selects as next point the maximizer/expander with the largest prior uncertainty.24 In contrast, ITL selects the point within Sn which minimizes the posterior uncertainty within An. Note that the two approaches are not identical as typically MSAFEOPT n GSAFEOPT n SSAFEOPT n and An Sn.

We show empirically in Section 3.3 that depending on the kernel choice (i.e., the smoothness assumptions), uncertainty sampling within a given target space neglects higher-order information that can be attained by sampling outside the set. This can be seen even more clearly when considering linear functions, in which case points outside the maximizers and expanders can be equally informative as points inside.

Finally, note that the set of expanders is constructed greedily , i.e., only considering single-step expansion. This is necessitated as the inference of safety is based on single reference points. Instead, ITL directly quantifies the information gained towards the points of interest without considering intermediate reference points.

Requires homoscedastic noise SAFEOPT imposes a homoscedasticity assumption on the noise which is an artifact of the analysis of uncertainty sampling. It is well known that in the presence of heteroscedastic noise, one has to distinguish epistemic and aleatoric uncertainty. Uncertainty sampling fails because it may continuously sample a high variance point where the variance is dominated by aleatoric uncertainty, potentially missing out on reducing epistemic uncertainty at points with small aleatoric uncertainty. In contrast, maximizing mutual information naturally takes

24The use of uncertainty sampling for safe sequential decision-making goes back to Schreiter et al. (2015) and Sui et al. (2015).

into account the two sources of uncertainty, preferring those points where epistemic uncertainty is large and aleatoric uncertainty is small (cf. Appendix C.1).

Suboptimal reachable safe set Sui et al. (2015) and Berkenkamp et al. (2021) show that SAFEOPT converges to the optimum within the closure RSAFEOPT ϵ (S0) of

RSAFEOPT ϵ (S) def = S {x X | x S such that f i (x ) (Li x x 2 + ϵ) 0 for all i Is}.

Note that analogously to the expansion of the safe set, the expansion of the reachable safe set is based on inferring safety through a reference point in S and using Lipschitz continuity. This is opposed to the reachable safe set of ITL (cf. Definition C.29).

We remark that under the additional assumption that a Lipschitz constant is known, ITL can easily be extended to expand its safe set based on the kernel and the Lipschitz constant, resulting in a strictly larger reachable safe set than SAFEOPT. We leave the concrete formalization of this extension to future work. Moreover, we do not evaluate this extension in our experiments, as we observe that even without the additional assumption of a Lipschitz constant, ITL outperforms SAFEOPT in practice.

K.2.2 HEURISTIC SAFEOPT

Berkenkamp et al. (2016) also implement a heuristic variant of SAFEOPT which does not assume a known Lipschitz constant. This heuristic variant uses the same (pessimistic) safe sets Sn as ITL. The set of maximizers is identical to SAFEOPT. As expanders, the heuristic variant considers all safe points x Sn that if x were to be observed next with value un(x) lead to |Sn+1| > |Sn|. We refer to this set as GH-SAFEOPT n . The next point is then selected by uncertainty sampling within MSAFEOPT n GH-SAFEOPT n .

The heuristic variant shares some properties with SAFEOPT, such that it is based on uncertainty sampling, not adapting to heteroscedastic noise, and separate notions of maximizers and expanders (leading to an undirected expansion of the safe set). Note that there are no known convergence guarantees for heuristic SAFEOPT. Importantly, note that similar to SAFEOPT the set of expanders is constructed greedily , and in particular, does only take into account single-step expansion. In contrast, an objective such as ITL which quantifies the information gained towards expansion also actively seeks out multi-step expansion.

K.2.3 GOOSE

To address the undirected expansion of SAFEOPT discussed in the previous section, Turchetta et al. (2019) proposed goal-oriented safe exploration (GOOSE). GOOSE extends any unsafe BO algorithm (which we subsequently call an oracle) to the safe setting. In our experiments, we evaluate GOOSE-UCB which uses UCB as oracle and which is also the variant studied by Turchetta et al. (2019). In the following, we assume for ease of notation that Is = {c}.

Given the oracle proposal x , GOOSE first determines whether x is safe. If x is safe, x is queried next. Otherwise, GOOSE first learns about the safety of x by querying expansionist points until the oracle s proposal is determined to be either safe or unsafe.

GOOSE expands the safe set identically to SAFEOPT according to Equation (48). In the context of GOOSE, SSAFEOPT n is called the pessimistic safe set. To determine that a point cannot be deduced as safe, GOOSE also keeps track of a Lipschitz-based optimistic safe set:

b SGOOSE n,ϵ def = [

x SSAFEOPT n 1

un,c(x) Lc x x 2 ϵ 0}.

We summarize the algorithm in Algorithm 2 where we denote by O(X) the oracle proposal over the domain X.

It remains to discuss the heuristic used to select the expansionist points. GOOSE considers all points x SSAFEOPT n with confidence bands of size larger than the accuracy ϵ, i.e.,

WGOOSE n,ϵ def = {x SSAFEOPT n | un,c(x) ln,c(x) > ϵ}.

Algorithm 2 GOOSE

Given: Lipschitz constant Lc, prior model {f, gc}, oracle O, and precision ϵ Set initial safe set SSAFEOPT 0 based on prior b SGOOSE n,ϵ X n 0 for k from 1 to do

x k O( b SGOOSE n,ϵ ) while x k SSAFEOPT n do Observe expansionist point xn+1, set n n + 1, and update model and safe sets end while Observe x k, set n n + 1, and update model and safe sets end for

Which of the points in this set is evaluated depends on a set of learning targets AGOOSE n,ϵ def = b SGOOSE n,ϵ \ SSAFEOPT n akin to the potential expanders En (cf. Equation (27)), to each of which we assign a priority h(x). When h(x) is large, this indicates that the algorithm is prioritizing to determine whether x is safe. We use as heuristic the negative ℓ1-distance between x and x . GOOSE then considers the set of potential immediate expanders GGOOSE n,ϵ (α) def = {x WGOOSE n,ϵ | x AGOOSE n,ϵ with

priority α such that un,c(x) Lc x x 2 0}.

The expansionist point selected by GOOSE is then any point in GGOOSE n,ϵ (α ) where α denotes the largest priority such that |GGOOSE n,ϵ (α )| > 0.

We observe empirically that the sample complexity of GOOSE is not always better than that of SAFEOPT. Notably, the expansion of the safe set is based on a greedy heuristic. Moreover, determining whether a single oracle proposal x is safe may take significant time. Consider the (realistic) example where the prior is uniform, and UCB proposes a point which is far away from the safe set and suboptimal. GOOSE will typically attempt to derive the safety of the proposed point until the uncertainty at all points within SSAFEOPT 0 is reduced to ϵ.25 Thus, GOOSE can waste a significant number of samples, aiming to expand the safe set towards a known-to-be suboptimal point. In larger state spaces, due to the greedy nature of the expansion strategy, this can lead to GOOSE being effectively stuck at a suboptimal point for a significant number of rounds.

K.2.4 ISE and ISE-BO

Recently, Bottero et al. (2022) proposed an information-theoretic approach to efficiently expand the safe set which they call information-theoretic safe exploration (ISE). Specifically, they choose the next action xn by approximating arg max x Sn 1 max x X I(1{gx 0}; yx | Dn 1) | {z } αISE(x)

In a parallel independent work, Bottero et al. (2024) extended ISE to the Safe BO problem where they propose to choose xn according to

arg max x Sn 1 max{αISE(x), αMES(x)} (ISE-BO)

where αMES denotes the acquisition function of max-value entropy search (Wang & Jegelka, 2017). Similarly to SAFEOPT, ISE-BO treats expansion and exploration separately, which leads to undirected expansion of the safe set. That is, the safe set is expanded in all directions, even those that are known to be suboptimal. In contrast, ITL balances expansion and exploration through the single set of potential maximizers An. With a stochastic target space, ITL generalizes max-value entropy search (cf. Appendix K.1).

We evaluate ISE-BO in Figure 18 and observe that it does not outperform ITL and VTL in any of the tasks, while performing poorly in the 1d task and suboptimally in the 2d task.

25This is because the proposed point typically remains in the optimistic safe set when it is sufficiently far away from the pessimistic safe set.

Figure 21: The ground truth f is shown as the dashed black line. The solid black line denotes the constraint boundary. The GP prior is given by a linear kernel with sin-transform and mean 0.1x. The light gray region denotes the initial optimistic safe set b S0 and the dark gray region denotes the initial pessimistic safe set S0.

Figure 22: First 100 samples of ITL using the potential expanders En (cf. Equation (27)) as target space (left) and SAFEOPT sampling only from the set of expanders GSAFEOPT n (right).

K.3 Jumping Past Local Barriers

In this additional experiment we demonstrate that ITL is able to extrapolate safety beyond local unsafe barriers , which is a fundamental limitation of Lipschitz-based methods such as SAFEOPT. We consider the ground truth function and prior statistical model shown in Figure 21. Note that initially, there are three disjoint safe regions known to the algorithm corresponding to two of the three safe bumps of the ground truth function. In this experiment, the main challenge is to jump past the local barrier separating the leftmost and initially unknown safe bump .

Figure 22 shows the sampled points during the first 100 iterations of SAFEOPT and ITL. Clearly, SAFEOPT does not discover the third safe bump while ITL does. Indeed, it is a fundamental limitation of Lipschitz-based methods that they can never jump past local barriers , even if the oracle Lipschitz constant were to be known and tight (i.e., locally accurate) around the barrier. This is because Lipschitz-based methods expand to the point x based on a reference point x , and by definition, if x is added to the safe set so are all points on the line segment between x and x . Hence, if there is a single point on this line segment which is unsafe (i.e., a barrier ), the algorithm will never expand past it. This limitation does not exist for kernel-based algorithms as expansion occurs in function space.

Moreover, note that for a non-stationary kernel such as in this example, ITL samples the closest points in function space rather than Euclidean space. We observe that SAFEOPT still samples locally at the boundary whereas ITL samples the most informative point which in this case is

Figure 23: Ground truth and prior well-calibrated model in 1d synthetic experiment. The function serves simultaneously as objective and as constraint. The light gray region denotes the initial safe set S0.

0 200 400 600 800 1000 Number of Iterations

ISE SAFEOPT

ORACLE SAFEOPT HEURISTIC SAFEOPT

Figure 24: Size of Sn in 1d synthetic experiment. The dashed black line denotes the size of S . In this task, discovering the optimum is closely linked to expansion of the safe set, and HEURISTIC SAFEOPT fails since it does not expand the safe set sufficiently.

the local maximum of the sinusoidal function. In other words, ITL adapts to the geometry of the function. This generally leads us to believe that ITL is more capable to exploit (non-stationary) prior knowledge than distance-based methods such as SAFEOPT.

K.4 Experiment Details

K.4.1 Synthetic Experiments

1d task Figure 23 shows the objective and constraint function, as well as the prior. We discretize using 500 points. The main difficulty in this experiment lies in sufficiently expanding the safe set to discover the global maximum. Figure 24 plots the size of the safe set Sn for the compared algorithms, which in this experiment matches the achieved regret closely.

2d task We model our constraint in the form of a spherical island where the goal is to get a good view of the coral reef located to the north-east of the island while staying in the interior of the island during exploration (cf. Figure 25). The precise objective and constraint functions are unknown to the agent. Hence, the agent has to gradually and safely update its belief about boundaries of the island and the location of the coral reef. The prior is obtained by a single observation within the center of the island [ 0.5, 0.5]2. We discretize using 2 500 points.

Figure 25: Ground truth in 2d synthetic experiment.

K.4.2 Safe Controller Tuning for Quadcopter

Modeling the real-world dynamics We learn a feedback policy (i.e., control gains ) to compensate for inaccuracies in the initial controller. In our experiment, we model the real world dynamics and the adjusted model using the PD control feedback (Widmer et al., 2023),

δt(x) def = (x x)[(s st) ( s st)], (49)

where x are the unknown ground truth disturbance parameters, and s and s are the desired state and state derivative, respectively. This yields the following ground truth dynamics:

st+1(x) = T (st, ut + δt(x)). (50)

The feedback parameters x = [xp xd] can be split into xp tuning the state difference which are called proportional parameters and xd tuning the state derivative difference which are called derivative parameters. We use the critical damping heuristic to relate the proportional and derivative parameters: xd = 2 xp. We thus consider the restricted domain X = [0, 20]4 where each dimension corresponds to the proportional feedback to one of the four rotors.

Ground truth disturbance parameters are sampled from a chi-squared distribution with one degree of freedom (i.e., the square of a standard normal distribution), x p χ2 1, and x d is determined according to the critical damping heuristic.

The learning problem The goal of our learning problem is to move the quadcopter from its initial position s(0) = [1 1 1] (in Euclidean space with meter as unit) to position s = [0 0 2] . Moreover, we aim to stabilize the quadcopter at the goal position, and therefore regularize the control signal towards an action u which results in hovering (approximately) without any disturbances. We formalize these goals with the following objective function:

f (x) def = σ

t=0 s st(x) 2 Q + u ut(x) 2 R

where σ(v) def = tanh((v 100)/100) is used to smoothen the objective function and ensure that its range is [ 1, 1]. The non-smoothed control objective in Equation (51) is known as a linear-quadratic regulator (LQR) which we solve exactly for the undisturbed system using ILQR (Tu et al., 2023). Finally, we want to ensure at all times that the quadcopter is at least 0.5 meter above the ground, that is,

g (x) def = min t [T ] sz t (x) 0.5 (52)

where we denote by sz t the z-coordinate of state st.

We use a time horizon of T = 3 seconds which we discretize using 100 steps. The objective is modeled by a zero-mean GP with a Matérn(ν = 5/2) kernel with lengthscale 0.1, and the constraint is modeled by a GP with mean 0.5 and a Matérn(ν = 5/2) kernel with lengthscale 0.1. The prior is obtained by a single observation of the safe seed [0 0 0 10] .

Adaptive discretization We discretize the domain X adaptively using coordinate LINEBO (Kirschner et al., 2019). That is, in each iteration, one of the four control dimensions is selected uniformly at random, and the active learning oracle is executed on the corresponding one-dimensional subspace.

Safety Using the (unsafe) constrained BO algorithm EIC (Gardner et al., 2014) leads constraint violation,26 while ITL and VTL do not violate the constraints during learning for any of the random seeds.

Hyperparameters The observation noise is Gaussian with standard deviation ρ = 0.1. We let β = 10. The control target is u = [1.766 0 0 0] .

The state space is 12-dimensional where the first three states correspond to the velocity of the quadcopter, the next three states correspond to its acceleration, the following three states correspond to its angular velocity, and the last three states correspond to its angular velocity in local frame. The LQR parameters are given by

Q = diag {1, 1, 1, 1, 1, 1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1} and R = 0.01 diag {5, 0.8, 0.8, 0.3}.

The quadcopter simulation was adapted from Chandra (2023).

Each one-dimensional subspace is discretized using 2 000 points.

Random seeds We repeat the experiment for 25 different seeds where the randomness is over the ground truth disturbance, observation noise, and the randomness in the algorithm.

26On average, 1.6 iterations of the first 50 violate the constraints.

Table 3: Magnitudes of γn for common kernels. The magnitudes hold under the assumption that X is compact. Here, Bν is the modified Bessel function. We take the magnitudes from Theorem 5 of Srinivas et al. (2009) and Remark 2 of Vakili et al. (2021). The notation e O( ) subsumes log-factors. For ν = 1/2, the Matérn kernel is equivalent to the Laplace kernel. For ν , the Matérn kernel is equivalent to the Gaussian kernel. The functions sampled from a Matérn kernel are ν 1 mean square differentiable. The kernel-agnostic bound follows by simple reduction to a linear kernel in |X| dimensions.

Kernel k(x, x ) γn

Linear x x O(d log(n))

Gaussian exp x x 2

e O logd+1(n)

Laplace exp x x 1

e O n d 1+d log 1 1+d (n)

Matérn 21 ν

e O n d 2ν+d log 2ν 2ν+d (n)

any O(|X| log(n))

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope?

Answer: [Yes]

Justification:

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification:

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: The code is publicly available. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes] Justification: Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: See the extensive discussions in the appendices. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean.

It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: All individual experiments require only small compute resources.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification:

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification:

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification:

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.