# PFNs4BO: In-Context Learning for Bayesian Optimization

Samuel Müller 1 2, Matthias Feurer 1, Noah Hollmann 1 2 3, Frank Hutter 1 2

1 University of Freiburg, Germany; 2 Prior Labs; 3 Charité Berlin University of Medicine, Germany. Correspondence to: Samuel Müller.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

In this paper, we use Prior-data Fitted Networks (PFNs) as a flexible surrogate for Bayesian Optimization (BO). PFNs are neural processes that are trained to approximate the posterior predictive distribution (PPD) through in-context learning on any prior distribution that can be efficiently sampled from. We describe how this flexibility can be exploited for surrogate modeling in BO. We use PFNs to mimic a naive Gaussian process (GP), an advanced GP, and a Bayesian neural network (BNN). In addition, we show how to incorporate further information into the prior, such as allowing hints about the position of optima (user priors), ignoring irrelevant dimensions, and performing non-myopic BO by learning the acquisition function. The flexibility underlying these extensions opens up vast possibilities for using PFNs for BO. We demonstrate the usefulness of PFNs for BO in a large-scale evaluation on artificial GP samples and three different hyperparameter optimization testbeds: HPO-B, Bayesmark, and PD1. We publish code alongside trained models at github.com/automl/PFNs4BO.

1. Introduction

Gaussian processes (GPs) are today's de facto standard surrogate model in Bayesian Optimization (BO; Frazier, 2018; Garnett, 2022). This dominance can be attributed to both their strong performance and their mathematical convenience. However, a GP can only model priors that can be encoded as a valid kernel function and that are jointly normally distributed. Moreover, while kernel hyperparameters could be treated in a Bayesian manner with Markov chain Monte Carlo, this is typically not done due to the high computational cost, even though the fully Bayesian treatment was already shown to yield stronger results a decade ago (Benassi et al., 2011; Snoek et al., 2012; Eriksson & Jankowiak, 2021). Instead, GPs are usually fitted using maximum likelihood, often called Empirical Bayes. This makes Bayesian optimization less principled from a Bayesian perspective.

Figure 1: Our proposed Prior-data Fitted Network almost exactly approximates a Gaussian process posterior with fixed hyperparameters. We plot the exact and approximated GP prediction for (i) the mean and (ii) expected improvement. For the simple GP model approximated here, a ground truth can be exactly calculated, which is generally not the case; see Section 4.1. PFNs, however, can be extended to approximate any prior one can sample from.

The recently proposed Prior-data Fitted Networks (PFNs; Müller et al., 2022) show that fast approximate Bayesian inference is possible by training a neural network, more specifically, a specific type of neural process (Garnelo et al., 2018), to mimic the posterior predictive distribution (PPD) in a single forward pass using in-context learning. This is a powerful approach, as it makes approximate Bayesian inference readily usable in novel applications and allows using any prior that we can sample from, e.g., a GP kernel and its hyperparameters, or also a Bayesian neural network.
PFNs can be a robust and generalizable PPD approximation method, e.g., for tabular data using a prior over different neural architectures and their weights (Hollmann et al., 2023). In this work, we demonstrate the flexibility and effectiveness of PFNs as a Bayesian drop-in replacement for Gaussian processes. We perform Bayesian optimization with three different kinds of priors, which we describe in Section 4: a GP-based and a BNN-based prior, as well as a prior that mimics the state-of-the-art BO implementation HEBO (Cowen-Rivers et al., 2022). In Section 3, we show that standard acquisition functions, such as probability of improvement (PI), expected improvement (EI), and upper confidence bound (UCB), are easy to implement analytically in the PFN framework, and that PFNs can produce sensible results for BO, as illustrated in Figure 1 for a PFN trained on a simple GP prior, whose results closely follow the ground-truth GP posterior. We also show that PFNs can readily be combined with gradient-based optimization techniques to optimize input warping (Snoek et al., 2014) and acquisition functions. In Section 5, we show how our approach can be extended to allow the user to specify prior beliefs about where optima lie, which in our experiments significantly improves performance when accurate prior beliefs exist, as well as an extension towards non-myopic optimization.

In our experiments, we focus on BO for hyperparameter optimization (HPO; Feurer & Hutter, 2019). HPO is a crucial task for achieving top performance with machine learning algorithms and a key application of BO (Brochu et al., 2010; Snoek et al., 2012; Garnett, 2022). We show that PFNs perform strongly across three benchmarks, including tuning hyperparameters of large neural networks.

2. Bayesian Optimization

BO is a popular technique to find maxima of (noisy) black-box functions in as few evaluations as possible (Garnett, 2022). BO is an iterative procedure that alternates between modeling outcomes based on all observations $D_k$ up to a time-step $k$ with a probabilistic model $\hat{f}(y|x, D_k)$, and using an acquisition function $\alpha(\hat{f}(y|x, D_k))$ to decide which point to query next. The acquisition function accepts a distribution over possible outcomes, typically the posterior predictive distribution, to trade off exploitation (trying to improve solutions in known good areas) and exploration (reducing the posterior uncertainty in unknown areas). In this work, we restrict optimization to real- and integer-valued inputs $x$, but BO has been extended to more complex design spaces (Hutter et al., 2011; Swersky et al., 2013; Korovina et al., 2020; Ru et al., 2021; Daulton et al., 2022). We show pseudocode for the BO loop in Algorithm 1; the PFN-based components, introduced in Sections 3-5, replace the standard GP-based ones. We refer to Brochu et al. (2010), Frazier (2018), and Garnett (2022) for thorough introductions to Bayesian optimization. We will now focus on GPs for BO and briefly review other models in Appendix K.

```
Algorithm 1: Bayesian optimization with GPs or PFNs
Input: GP hyperparameter prior settings, or a PFN q_θ trained on a prior
       distribution over datasets p(D); initial observations
       D = {(x_1, y_1), ..., (x_k, y_k)}; search space X; number of BO
       iterations K; black-box function f to optimize; acquisition function α
Output: best observed input x and response y
for i = k + 1 to K do
    fit GP model f̂ to data D                 # not needed when using a PFN
    suggest x ← argmax_{x̂ ∈ X} α(x̂, D, f̂ or q_θ(· | D))
    update history with the response: D ← D ∪ {(x, f(x))}
end for
return best configuration: argmax_{(x_i, y_i) ∈ D} y_i
```
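To make Algorithm 1 concrete, the following is a minimal sketch of the loop with a PFN surrogate; the helper `expected_improvement`, which scores candidates under the PFN's predictive distribution, is a hypothetical placeholder (Sections 3.4 and 3.5 and Appendix C describe the actual acquisition computation and optimization).

```python
# Minimal sketch of Algorithm 1 with a PFN surrogate; `expected_improvement`
# is a hypothetical helper returning per-candidate acquisition values.
import numpy as np

def bo_loop(f, qtheta, X_init, y_init, dim, n_iters=50, n_candidates=10_000):
    X, y = list(X_init), list(y_init)
    for _ in range(n_iters):
        # Random candidates in the unit hypercube; gradient-based
        # refinement could follow (Section 3.5).
        cands = np.random.rand(n_candidates, dim)
        # The PFN conditions on (X, y) in-context: no surrogate refitting.
        acq = expected_improvement(qtheta, np.array(X), np.array(y), cands)
        x_next = cands[np.argmax(acq)]
        X.append(x_next)
        y.append(f(x_next))                   # evaluate the black box
    best = int(np.argmax(y))
    return X[best], y[best]
```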
Gaussian Processes. GPs have been widely adopted as probabilistic surrogates in BO due to their flexibility and analytic tractability (Rasmussen & Williams, 2006). However, traditional GPs exhibit cubic scaling in the number of observations, which limits their application in large-data settings. They also assume a jointly Gaussian distribution of the data, introducing a model mismatch for long-tailed data. Furthermore, the predominant use of stationary kernels renders the optimization of non-stationary and heteroscedastic functions problematic. Two further problems come from the kernel that defines the GP's covariance matrix. First, the model prior must be encoded as a valid kernel function, which complicates representing categorical or hierarchical concepts. Second, the kernel hyperparameters need to be tuned to the data, which suffers from the curse of dimensionality when using Empirical Bayes. The current state-of-the-art BO method using GPs for hyperparameter optimization is HEBO (Cowen-Rivers et al., 2022). We describe HEBO and extend on it in Section 4.2.

3. Bayesian Optimization with PFNs

We will now show how to train and use PFNs, combine common acquisition functions with them, and incorporate gradient-based optimization at suggestion time for acquisition function optimization and input warping (Snoek et al., 2014).

3.1. Background on Prior-Data Fitted Networks

Prior-data Fitted Networks (PFNs; Müller et al., 2022) are neural networks trained to approximate the posterior predictive distribution $p(y|x, D)$ for supervised learning tasks. We visualize their use in Figure 2a.

Figure 2: (a) The PFN learns to approximate the PPD of a given prior offline, by training once on synthetic datasets $D_i \sim p(D)$, to then yield predictions $q_\theta(y_\text{test}|x_\text{test}, D_\text{real})$ on new real-world observations in a single forward pass. (b) The positions representing the training samples $(x_i, y_i)$ can attend only to each other; test positions (for $x_4$ and $x_5$) attend only to the training positions. Plots based on Müller et al. (2022), with permission.

Prior-fitting. Prior-fitting is the training of a PFN to approximate the PPD and thus perform Bayesian prediction for a particular, chosen prior. We assume that there is a sampling scheme for the prior such that we can sample datasets of inputs and outputs from it: $D \sim p(D)$. This requirement is easy to satisfy for most priors. For a GP, for example, this can be achieved by sampling outcomes from the GP prior. We describe our priors in more detail in Section 4. We repeatedly sample synthetic datasets $D = \{(x_i, y_i)\}_{i \in \{1,...,n\}}$ and optimize the PFN's parameters $\theta$ to make predictions for $(x_\text{test}, y_\text{test}) \in D$, conditioned on the rest of the dataset $D_\text{train} = D \setminus \{(x_\text{test}, y_\text{test})\}$. Our PFN $q_\theta$ is an approximation to the PPD and thus accepts a training set and a test input, and returns a distribution over outcomes for the test input. The loss of the PFN training is the cross-entropy on the held-out examples,

$$\mathbb{E}_{(\{(x_\text{test}, y_\text{test})\} \cup D_\text{train}) \sim p(D)}\left[-\log q_\theta(y_\text{test} \mid x_\text{test}, D_\text{train})\right], \qquad (1)$$

and minimizing this loss approximates Bayesian prediction (Müller et al., 2022).
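The following is a minimal sketch of one prior-fitting step optimizing the loss in Equation (1). The helpers `sample_dataset_from_prior`, `y_to_bin` (which discretizes targets into the bins of the Riemann head, see Appendix E.1), and the exact `pfn` call signature are hypothetical placeholders.

```python
# Minimal prior-fitting sketch for Eq. (1). Hypothetical helpers:
# `sample_dataset_from_prior(n)` draws a synthetic dataset D ~ p(D),
# `pfn(x_tr, y_tr, x_te)` returns per-bin logits of the Riemann head,
# and `y_to_bin(y)` maps continuous targets to bin indices.
import torch
import torch.nn.functional as F

def prior_fitting_step(pfn, optimizer, batch_size=64, n=110, n_test=10):
    loss = 0.0
    for _ in range(batch_size):
        x, y = sample_dataset_from_prior(n)   # synthetic dataset D ~ p(D)
        x_tr, y_tr = x[:-n_test], y[:-n_test] # context D_train
        x_te, y_te = x[-n_test:], y[-n_test:] # held-out examples
        logits = pfn(x_tr, y_tr, x_te)        # shape: (n_test, n_bins)
        loss = loss + F.cross_entropy(logits, y_to_bin(y_te))
    loss = loss / batch_size
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```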
Crucially, this synthetic prior-fitting phase is performed only once for a given prior $p(D)$, as part of algorithm development. For technical details of our training procedure, we refer to Appendix E.

Real-World Inference. During inference, the trained model is applied to unseen real-world observations (see Figure 2a). For a novel dataset with observations $D$ and test features $x$, feeding $(D, x)$ as input to the model trained above yields an approximation to the PPD, $q_\theta(y|x, D)$, through in-context learning.

Architecture. PFNs rely on a Transformer (Vaswani et al., 2017) that encodes each feature vector and label as a vector representation, allowing these representations to attend to each other, as depicted in Figure 2b. They accept a variable-length training set $D$ of feature and label vectors (treated as a set-valued input to exploit permutation invariance), as well as a variable-length query set of feature vectors $\{x_1, ..., x_m\}$, and return estimates of the PPD for each query. This architecture also fulfills all requirements for a neural process architecture (Nguyen & Grover, 2022b).

Regression Head. PFNs do not use a standard regression head, but instead make predictions with a discretized distribution, the Riemann distribution. Consequently, the approximated PPD $q$ becomes a discrete distribution. This allows PFNs to treat regression as a classification problem, which was shown to be clearly advantageous for PFNs over a classical regression head (Müller et al., 2022). Please see Appendix E.1 for details.

3.2. PFNs as Models in Bayesian Optimization

We contrast Bayesian optimization with surrogates based on a PFN and on a GP with Empirical Bayes in Algorithm 1; the standard GP components are replaced by the new PFN-based ones. The trained PFN is passed to the BO algorithm and used in the acquisition function as a replacement of the standard GP model. While PFNs have an up-front cost of fitting the prior, Empirical Bayes incurs the online cost of fitting the hyperparameters in each iteration. Crucially, PFNs incur the training cost exactly once per prior, and a single trained PFN can be used for BO on different tasks with varying dimensions, since the data $D$ is passed to the PFN for in-context learning (in a single forward pass). The offline training of the PFN can be compared to developing the code base for GP regression and tuning its hyperparameter spaces: a one-time investment that is then generally applicable. We list important details for BO with PFNs in Appendix D.

3.3. Prior Work using Transformers as Surrogates

Three recent works propose Transformers as surrogates. Maraval et al. (2022) investigate the feasibility of using PFNs in BO on toy functions in their early work. They demonstrate that using GP priors with a PFN can match or outperform GPs while being an order of magnitude faster. However, their work only evaluates a small set of toy functions and does not leverage the flexibility of PFNs to model extensions to a simple GP model. Finally, the authors train one PFN per search-space dimensionality, while we share one network across all search spaces. Their paper reports that (i) models trained on uniformly distributed inputs impair predictive accuracy on non-uniform data, and (ii) their PFNs require many novel observations to adapt posteriors. However, they used extremely few epochs for training, and we observe that both of these issues vanish when increasing the number of epochs for prior-fitting to the magnitude used by Müller et al. (2022).

Similarly, Nguyen & Grover (2022a) used a Transformer to do Bayesian optimization on toy functions. Interestingly, they did not use a Riemann distribution to predict the output, but instead parameterized a normal distribution with the outputs of the neural network. The only application to real-world data was done with the recent OptFormer (Chen et al., 2022). The OptFormer is a Transformer that utilizes transfer learning on previously recorded BO runs performed by other optimizers. As such, it is not a Bayesian method; instead, the OptFormer models full BO trajectories, including observations, rather than acting as a surrogate model. Additionally, the OptFormer incorporates additional user-provided information, such as the names of the optimization dimensions or of the metric to optimize.

3.4. Acquisition Functions

When we condition a PFN on new observations, we obtain an approximation of the PPD in a forward pass, $p(y|x, D) \approx q_\theta(y|x, D)$, in the form of a Riemann distribution, that is, a piecewise-constant distribution over bins spanning a reasonable output range (Müller et al., 2022). The Riemann distribution allows us to calculate the utility for different acquisition functions exactly, e.g., EI, PI, or UCB, as described by Chen et al. (2022). To exemplify the general approach, we outline how to compute $\mathrm{PI} = \int \mathbb{1}[y > f^*]\, p(y)\, dy$, where $f^*$ is the best observed value, for the unbounded Riemann distribution in Appendix F. We experimented with EI, PI, and UCB and found simple EI to work robustly. We show a comparison in Appendix Table 4.
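For the bounded bins of a Riemann distribution, PI and EI reduce to closed-form sums over bins. The following sketch computes both from bin edges and bin probabilities; the half-open tail bins used for unbounded support need the extra treatment of Appendix F.

```python
# Exact PI and EI for the bounded bins of a piecewise-constant (Riemann)
# predictive distribution; `edges` has length B+1, `probs` holds the B
# bin masses, and `f_best` is the incumbent. The unbounded tail bins of
# the actual PFN head are omitted here (see Appendix F).
import numpy as np

def pi_riemann(edges, probs, f_best):
    l, u = edges[:-1], edges[1:]
    frac_above = np.clip((u - f_best) / (u - l), 0.0, 1.0)
    return float(np.sum(probs * frac_above))  # mass above the incumbent

def ei_riemann(edges, probs, f_best):
    l, u = edges[:-1], edges[1:]
    lo = np.maximum(l, f_best)                # integrate (y - f_best) on [lo, u]
    gain = np.clip(u - lo, 0.0, None)
    # Within a bin, the density is probs / (u - l); the integral of
    # (y - f_best) over [lo, u] is gain * ((lo + u)/2 - f_best) times it.
    return float(np.sum(probs / (u - l) * gain * ((lo + u) / 2.0 - f_best)))
```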
3.5. Gradient-based Optimization at Inference Time

The PFN, being a standard Transformer under the hood, propagates gradients from its outputs to its inputs. We use gradient ascent to find the acquisition function's maximum and to tune our input warping, which we outline below.

Acquisition Function Optimization. Our acquisition functions are optimized with an extensive initial random search, followed by a gradient-based optimization of the strongest candidates, similar to previous work (Snoek et al., 2012). We refer to Appendix C for more details.

Input Warping. To improve performance on search spaces with misspecified scales, e.g., a log scale that is not declared as such, we warp features before passing them to the PFN. This can be necessary, as missing log scaling can result in almost all values lying in a range that is difficult for the PFN to handle numerically (e.g., in the lowest percentile of the search space). We follow Cowen-Rivers et al. (2022) and use the CDF of a Kumaraswamy distribution, $w(\cdot\,; a, b)$, where $a$ and $b$ are tuned per feature, to warp features:

$$w(x; a, b) = 1 - (1 - x^a)^b. \qquad (2)$$

Traditionally, the parameters $a$ and $b$ are tuned to maximize the likelihood of the observed data, i.e., to maximize $p(D)$. While the data likelihood can be computed with a PFN, this is expensive, as one needs to compute the factorization $p(y_1|x_1)\, p(y_2|x_2, \{(x_1, y_1)\}) \cdots$. Instead, we use the likelihood $\prod_{(x,y) \in D} p(y|x, D)$ of observing the same $y$'s again, which is cheaper to compute. We view this as a measure of the amount of noise a PFN uses to explain the data $D$: if, for example, a prediction $p(y_1|x_1, \{(x_i, y_i)\}_{i \in \{1,...,n\}})$ has a high probability, the network does not assign a lot of probability to noise changing the value of $y_1$ upon re-evaluation. While this approximation works well for our prior, it might fail in other setups. We ablate the impact of input warping in Appendix L.
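The following sketch implements the warping of Equation (2) and tunes $a$ and $b$ per feature by gradient descent on the negative re-evaluation likelihood described above; `pfn_log_prob`, assumed to return $\log q_\theta(y_i \mid x_i, D)$ for every observation in $D$, is a hypothetical placeholder.

```python
# Sketch of the Kumaraswamy input warping of Eq. (2), tuned by gradient
# descent on the negative re-evaluation likelihood. The hypothetical
# helper `pfn_log_prob(X, y)` returns log q_theta(y_i | x_i, D) per point.
import torch

def warp(x, log_a, log_b):
    a, b = log_a.exp(), log_b.exp()           # keep a, b > 0
    x = x.clamp(1e-6, 1 - 1e-6)               # numerical safety at 0 and 1
    return 1.0 - (1.0 - x ** a) ** b

def tune_warping(X, y, pfn_log_prob, steps=50, lr=0.1):
    d = X.shape[1]
    log_a = torch.zeros(d, requires_grad=True)  # a = b = 1: identity warp
    log_b = torch.zeros(d, requires_grad=True)
    opt = torch.optim.Adam([log_a, log_b], lr=lr)
    for _ in range(steps):
        loss = -pfn_log_prob(warp(X, log_a, log_b), y).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return warp(X, log_a, log_b).detach()
```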
4. Priors for Bayesian Optimization

In this work, we use a set of three different priors to show the flexibility of PFNs as a surrogate.

4.1. A Simple Prior Based on a Simple GP (GP)

We use the prior of a simple GP with fixed hyperparameters to showcase our method (see Figure 1 and Figures 11 and 12), as it allows us to compare against the ground-truth posterior. Here, we use an RBF kernel with zero mean. During prior-fitting, we sample $N$ inputs $x_i$ uniformly at random from the unit hypercube. Then we sample all outputs $y$ from the GP prior: we can simply sample $y \sim \mathcal{N}(0, K)$, where $K_{i,j} = k_\text{RBF}(x_i, x_j)$. A minimal sampling sketch is shown at the end of this section.

4.2. A HEBO-inspired Prior (HEBO and HEBO+)

HEBO (Cowen-Rivers et al., 2022) is a state-of-the-art BO method that won the NeurIPS black-box optimization competition (Turner et al., 2021) and demonstrated excellent performance in a recent empirical evaluation (Eggensperger et al., 2021). It performs non-linear input and output warping for robust surrogate modeling and combines it with a well-engineered GP prior. We use it as a starting point for our own HEBO-inspired prior for PFN training. Our method considers a set of parameters $\phi$: the lengthscale per dimension, the outputscale, and the noise. These are modeled as independent random variables subject to some distribution $p(\phi; \psi)$ that depends on hyperparameters $\psi$. To sample functions, we perform the following three steps per dataset: i) sample all $x$ from a uniform distribution, ii) sample the parameters $\phi \sim p(\phi; \psi)$, and iii) draw the outputs for our dataset $y \sim \mathcal{N}(0, K(x; \phi))$, where $K$ is the covariance matrix defined by the Matérn-3/2 kernel using the sampled outputscale, lengthscale, and noise. For additional details on $\psi$, we refer to Appendix B. We use this prior both in a form that is as close to the original HEBO as possible and in a variant we dub HEBO+ that extends on HEBO (see Section 5.2) and is tuned to work well with PFNs (see Section 7.1).

4.3. Bayesian Neural Network Prior (BNN Prior)

We follow previous work (Müller et al., 2022; Hollmann et al., 2023) in building our Bayesian neural network (BNN) prior. To sample datasets from this prior, we (i) first sample one network architecture, i.e., the number of layers, the number of hidden nodes, the sparseness, the amount of Gaussian noise added per unit, and the standard deviation with which to sample weights, and (ii) sample the network's weights from a normal distribution. For each dataset, we first sample a BNN, then sample inputs $x$ to the BNN uniformly at random from the unit hypercube, feed these through the network, and use the output values as the target $y$. Additionally, we employed input warping as described in Section 5.1. For more details, see Appendix B.
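As referenced in Section 4.1, the following is a minimal sketch of sampling one synthetic dataset from the simple GP prior (zero mean, RBF kernel); the specific hyperparameter values shown are illustrative, not the paper's.

```python
# Minimal sketch of sampling a synthetic dataset from the simple GP prior
# of Section 4.1: uniform inputs, zero mean, RBF kernel. The lengthscale,
# outputscale, and noise values are illustrative placeholders.
import numpy as np

def sample_gp_dataset(n=100, dim=3, lengthscale=0.3, outputscale=1.0,
                      noise=1e-4, rng=None):
    rng = rng or np.random.default_rng()
    x = rng.random((n, dim))                  # inputs uniform in [0, 1]^dim
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    K = outputscale * np.exp(-0.5 * sq_dists / lengthscale ** 2)
    y = rng.multivariate_normal(np.zeros(n), K + noise * np.eye(n))
    return x, y
```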
5. Prior Extensions

PFNs offer versatility, making it simple to explore new priors for BO, such as the above BNN prior, which incorporates a distribution over architectures. In this section, we discuss modifications that can easily be combined with priors, but would be harder to incorporate in traditional GPs.

5.1. Input Warping

In addition to using input warping after prior-fitting (see Section 3.5), we can include a Bayesian formulation of input warping directly in the prior and thus in prior-fitting. That is, (i) we sample warping hyperparameters $h_\text{warp}$ randomly from a predefined distribution and then (ii) warp the inputs of a synthetic prior dataset with an exponential transform and hyperparameters $h_\text{warp}$. While we found this to be beneficial, it was not powerful enough to completely remove the need for feature warping after prior-fitting.

5.2. Spurious Dimensions

Many real-world tuning tasks contain irrelevant features, which do not influence the output (Bergstra & Bengio, 2012). While it would be hard with traditional GPs and Empirical Bayes to encode a specific chance of a feature being spurious in the prior, for PFNs we can do this easily by adding irrelevant features at random during prior-fitting. These features are simply not fed to the GP/NN which generates the outcomes for the dataset. In our final HEBO+ prior, we use 30% irrelevant features. We found that this improves performance, especially for large search spaces, and show a steep improvement on the three largest search spaces in HPO-B in Table 3 of the Appendix.

5.3. User Priors

The ability to incorporate a practitioner's knowledge is pivotal to improving the usefulness of automated HPO approaches for machine learning, as well as in other BO applications. Previous work has shown that user knowledge can improve BO performance tremendously (Ramachandran et al., 2020; Li et al., 2020; Souza et al., 2021; Hvarfner et al., 2022). One way to incorporate explicit user knowledge is through the user's belief about the location of the optimum in the search space. While previous work relies on warping the space (Ramachandran et al., 2020), reweighting the posterior (Souza et al., 2021), or reweighting the acquisition function (Hvarfner et al., 2022), PFNs can integrate these priors in a more direct and sound way. In this work, we let the user define an interval $I \in \mathbb{I}$, in which they believe the optimum lies, and a weighting $\rho \in [0, 1]$, indicating their confidence. This yields the prior

$$p(D \mid \rho, I) = \rho \cdot p(D \mid m \in I) + (1 - \rho) \cdot p(D), \qquad (3)$$

where $m$ is the maximum of the function $f$ underlying the dataset $D$, i.e., $y \sim f(x)$ for all $(x, y) \in D$. Figure 3 shows an exemplary change to the density of the optimum in the prior.

Figure 3: An example of the impact of the user prior on the prior belief about the position of the optimum.

We train a single PFN that can accept any interval $I \in \mathbb{I}$ and confidence $\rho \in [0, 1]$ and adapt its prior on the fly to be $p(D \mid \rho, I)$. The extra inputs, $\rho$ and $I$, are fed to the PFN using an extra position with its own linear encoder, similar to style embeddings for language models (Dai et al., 2019). In Appendix G, we detail how we build a PFN prior $p(D, \rho, I)$ that is cheap to sample and allows the neural network to adapt to any interval $I$ out of a set of $|\mathbb{I}| = 15$ different intervals per search-space dimension. A minimal sampling sketch for the mixture in Equation (3) follows below.
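One simple way to realize Equation (3) during prior-fitting is to flip a $\rho$-weighted coin per dataset and, for the conditioned component, rejection-sample datasets whose maximizer lies in $I$. The sketch below reuses the hypothetical `sample_gp_dataset` from above and uses the argmax over the sampled points as an empirical proxy for $m$; the paper's actual, cheaper construction is described in Appendix G.

```python
# Sketch of sampling from the user-prior mixture of Eq. (3) via rejection
# sampling; reuses `sample_gp_dataset` from the sketch above. The argmax
# over sampled points is an empirical proxy for the maximizer m; the
# actual, cheaper construction is in Appendix G.
import numpy as np

def sample_with_user_prior(rho, interval, dim=1, rng=None):
    rng = rng or np.random.default_rng()
    lo, hi = interval
    if rng.random() >= rho:                   # the (1 - rho) * p(D) component
        return sample_gp_dataset(dim=dim, rng=rng)
    while True:                               # rho * p(D | m in I), by rejection
        x, y = sample_gp_dataset(dim=dim, rng=rng)
        x_opt = x[np.argmax(y)]               # empirical proxy for m
        if np.all((lo <= x_opt) & (x_opt <= hi)):
            return x, y
```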
5.4. Non-Myopic Acquisition Function Approximation

Non-myopic acquisition functions provide an optimal exploration and exploitation strategy, optimizing sampling policies over a rolling horizon. Though promising, they are computationally expensive and thus rarely used in practice. We demonstrate how to use PFNs to enable an effective approximation. The knowledge gradient (Frazier, 2018) measures the predicted improvement in the maximum value of the black-box function from obtaining an additional observation at the candidate point $x$:

$$\alpha_\text{KG}(x; D) = \mathbb{E}_{p(y|x,D)}\left[\tau(D \cup \{(x, y)\}) - \tau(D)\right], \qquad (4)$$

where $\tau(D) = \max_{x \in X} \mathbb{E}[y \mid x, D]$. The knowledge gradient is the optimal acquisition function choice when optimizing for the mean in the following step; it is thus an approximation for the non-myopic setting with a one-step look-ahead. Traditionally, one can approximate $\alpha_\text{KG}$ with a Monte Carlo estimate. For this, one needs to sample a set of $N$ outcomes $y$ at the current position and an additional set of $M$ positions $x$ at which to evaluate the mean per $y$, incurring a cost of $M \cdot N$ surrogate evaluations. Alternatively, this can be solved by a two-level optimization of random batches (Wu & Frazier, 2016; Wu et al., 2017). With a PFN, however, we can get away without any prediction-time optimization: the PFN can directly learn to approximate $\alpha_\text{KG}$. In Appendix H, we detail this method, and in Appendix H.1 we show results, with this non-myopic method outperforming standard EI on four search spaces with few dimensions.
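For reference, the following is a minimal sketch of the $M \cdot N$-cost Monte Carlo estimate of Equation (4) that the PFN learns to replace; the helpers `posterior_mean` and `sample_y` are hypothetical placeholders for a generic probabilistic surrogate.

```python
# Sketch of the M*N-cost Monte Carlo estimate of the knowledge gradient
# in Eq. (4), which the PFN learns to approximate directly. Hypothetical
# helpers: `posterior_mean(X, y, Xq)` returns E[y | x, D] at the query
# points Xq, and `sample_y(X, y, x, n)` draws n outcomes from p(y | x, D).
import numpy as np

def kg_monte_carlo(X, y, x_cand, dim, n_y=16, n_x=256, rng=None):
    rng = rng or np.random.default_rng()
    Xq = rng.random((n_x, dim))               # M positions to evaluate the mean
    tau_now = posterior_mean(X, y, Xq).max()  # tau(D)
    gains = []
    for y_sim in sample_y(X, y, x_cand, n_y): # N fantasy outcomes at x_cand
        X_aug = np.vstack([X, x_cand])
        y_aug = np.append(y, y_sim)
        gains.append(posterior_mean(X_aug, y_aug, Xq).max() - tau_now)
    return float(np.mean(gains))
```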
6. Bayesian Optimization Experiments

In this section, we analyze the performance of PFNs on prior samples with as few confounding factors as possible. First, we verify that the PFN predictions match the true GP posterior without Empirical Bayes, as shown in Figures 11 and 12 in the appendix. This is not the case when training the PFN on very few (about 100K) prior samples, but the approximation becomes tighter with enough samples (about 20M). Next, we consider prediction performance with the original HEBO prior, following their hyper-prior setup as closely as possible. We evaluate HEBO with Empirical Bayes using our reimplementation of the HEBO model in GPyTorch (Gardner et al., 2018), which has some slight adaptations that we detail in Appendix I. Table 8 shows that both PFNs and Empirical Bayes can approximate the posterior similarly well in terms of likelihood on prior samples. Finally, we compare BO performance on samples from the HEBO prior. We create a discrete benchmark with 1,000 different datasets per number of dimensions, each containing 1,000 evaluations sampled uniformly at random in [0.0, 1.0]. Table 1 shows that across dimensionalities, the GP with Empirical Bayes and our PFN approximation perform on par; most of the time, they find the same maxima. We refer to Appendix I for a more detailed comparison.

Table 1: BO performance with minimal confounding factors after 50 evaluations. The majority of runs yielded ties, showing PFNs are a strong alternative to Empirical Bayes.

| # Features             | 1   | 2   | 10  |
|------------------------|-----|-----|-----|
| # Wins Empirical Bayes | 206 | 144 | 169 |
| # Wins PFN             | 239 | 154 | 171 |
| # Ties                 | 555 | 702 | 660 |

7. HPO Experiments

We conduct a large-scale evaluation of PFNs as surrogates for hyperparameter optimization (HPO). We present results on three diverse benchmarks: HPO-B (Pineda Arango et al., 2021), a discrete benchmark with a focus on tree-based ML models including XGBoost (Chen & Guestrin, 2016); Bayesmark, a continuous HPO benchmark that was used in the evaluation of the baseline HEBO (Cowen-Rivers et al., 2022); and PD1 (Wang et al., 2021), a discrete benchmark for tuning large neural networks. We run every optimizer for 50 function evaluations per problem. HPO-B provides baseline results for 5 repetitions, and we therefore also conduct 5 repetitions, except for Bayesmark, where we conduct 10 repetitions because it is considerably more noisy. In total, we evaluated 105 tasks from a total of 19 search spaces. We describe a set of ablation studies in Appendix L, in which we analyze our HEBO-inspired model against a set of HEBO variations to find the changes that improve performance. Moreover, we study the handling of spurious features and different acquisition functions.

Model selection. Our HEBO and BNN priors contain meta-hyperparameters, which we chose once upfront. This is similar to the GP meta-hyperparameters, which are chosen once and fixed across tasks, e.g., the lengthscale prior for HEBO. To determine our global prior hyperparameters, we split off a set of 7 validation search spaces from our largest benchmark, HPO-B. On these, we performed one random search for each prior and picked the strongest hyperparameters based on average rank. We list our search-space split in Appendix A.

7.1. HPO-B Benchmark

We perform experiments on the 9 test search spaces of the HPO-B benchmark (Pineda Arango et al., 2021), which contain a total of 51 different tasks spread across the search spaces. HPO-B is a discrete benchmark in which the BO tool can only query recorded function evaluations; it contains search spaces for decision trees, SVMs, linear models, random forests, and XGBoost. We compare our method to the following baselines (using the results provided by HPO-B for 1-4): 1) Random Search (RS; Bergstra & Bengio, 2012), 2) Gaussian processes (GP; Jones et al., 1998; Snoek et al., 2012), 3) DNGO (Snoek et al., 2015), 4) deep-kernel Gaussian processes (DGP; Wistuba & Grabocka, 2021), and 5) HEBO (Cowen-Rivers et al., 2022). We evaluate both PFN priors in Figure 4 (top). We can see that the BNN prior performs similarly to the best baseline, while the HEBO+ prior outperforms the baselines. Additionally, in Figure 9 of the Appendix, we study the impact of input warping on the PFN model and also compare against a BNN baseline and two OptFormers (Chen et al., 2022).

Figure 4: Aggregated average rank and average regret over time on the HPO-B (top) and PD1 (bottom) benchmarks. Shading indicates 95% confidence intervals. (Top) Aggregate over all HPO-B test search spaces; both priors seem to yield advantages for the PFN, yielding top performance. We provide per-search-space results in Figure 18 in the Appendix. (Bottom) Aggregate over all PD1 tasks, including user priors; user priors clearly improve performance.

Early on, we found that HPO-B has many instances of wrongly scaled hyperparameters, where a log transformation is missing. Using a heuristic, we found that approximately 10% of the hyperparameters in HPO-B are missing log scaling. We show one example of such a parameter in Figure 15 in the appendix. While this is likely a bug in the original development of the benchmark, we still include it, as it represents a very important real-world setting: users, like the creators of HPO-B, may well fail to specify a log scaling. However, it is very distinct from the standard BO setting, where user knowledge is used to apply logarithms where needed. Thus, we used this benchmark to showcase our inference-time input warping, as described in Section 3.5, which can find the right transformation for extremely misspecified parameters.
7.2. Continuous HPO on Bayesmark

We use Bayesmark (https://github.com/uber/bayesmark) to test the PFN's ability to optimize hyperparameters in continuous search spaces, i.e., giving the PFN the ability to propose any hyperparameter configuration, not only previously queried ones; results are shown in Figure 5. We evaluated 10 seeds on 9 methods from scikit-learn (Pedregosa et al., 2011), i.e., 9 search spaces, training on 4 different datasets with accuracy as the evaluation metric. This yields a total of 36 different tasks. We compare our PFNs to the following baselines: 1) HEBO's competition version for Bayesmark (Turner et al., 2021), 2) HyperOpt (Bergstra et al., 2013), 3) pySOT (Eriksson et al., 2019), and 4) Random Search.

For this benchmark, the question of the initial design arises. First, we used $d$ standard Sobol samples, where $d$ is the dimensionality of the search space. In this setting, we can see that the PFN with a BNN prior is outperformed by HEBO while outperforming all other baselines, and the HEBO+ prior helps the PFN perform comparably to or slightly worse than HEBO. Since our PFN is fully Bayesian, there should be no need for an initial design, as the model will never overfit the data. We thus experimented with two more initial designs of size one (the PPD cannot be used without a single training point): we initialize i) at the middle of the search space and ii) in the lower corner of the search space. The first turned out to be slightly better than standard Sobol sampling, while the second was clearly superior to all other approaches. These results are surprising, especially considering the bad performance these initializations seem to have at step 0.

Figure 5: Average rank and average regret over time on the Bayesmark benchmark, where tasks were optimized for accuracy. Shading indicates 95% confidence intervals. Using a single initial data point works well for the PFN.

7.3. Neural Network HPO with PD1 and User Priors

While the focus of the previous benchmarks was HPO for traditional machine learning methods, the PD1 benchmark (Wang et al., 2021) focuses on real-world large neural network setups. PD1 is a discrete benchmark considering different relevant tuning tasks, e.g., of ResNets (He et al., 2016) or Transformer language models (Vaswani et al., 2017). All tasks in PD1 share the same five-dimensional search space, which is given in Table 2 of the Appendix. For each task, we ran 5 seeds, each with a different single given initial evaluation (shared across methods). We do not perform any further initial design for our PFNs. Figure 4 (bottom) shows that this time the PFN with the BNN prior is clearly outperformed by the HEBO baseline, but the PFN with the HEBO+ prior performs better.

User Prior. Since PD1 only has a single search space, we used it to evaluate the performance of a simple user prior. As we lacked experience with the optimizer setup used in PD1, we defined a single generic user prior for the benchmark based on the location of optima across all 18 tasks.
We did not choose to make it a very strong prior, though: in the most extreme case, it places half of the prior weight on optima in one-fourth of the transformed search space. This generic user prior does not reveal the exact location of optima, but rather provides hints towards them. The definition can be found in Table 2 of the Appendix. In Figure 4 (bottom), we can see that the user prior yields strong improvements over our HEBO+ prior. We contrast our method with a quasi-random search that samples randomly from the user prior. Although this approach improves upon random search, it is not competitive with any BO method; thus, the prior clearly does not make the BO problem trivial.

8. Conclusion & Limitations

We showed that PFNs can be trained in a way that makes them powerful BO surrogates that incorporate function priors which would be complicated to model otherwise. Still, we believe that there is a lot of room to improve upon our work, which has several limitations. First, PFNs tend to work worse for data that has a very low likelihood under the prior distribution, which is less of a problem for GPs. Second, we could not find a setting that recovers mislabeled log dimensions but still performs very strongly on correctly specified search spaces; this is why we use input warping only for HPO-B, but not for the other benchmarks. Third, compared to GPs, we cannot draw joint samples for multiple data points, which prohibits the straightforward adaptation of some acquisition functions, such as noisy EI (Letham et al., 2018). Fourth, we have a strong focus on accuracy as the evaluation metric in HPO tasks, with only PD1 having non-accuracy tasks; our method seems to work worse for non-accuracy-based tasks, as they typically have very different dynamics, even when using a power transformation. Fifth, PFNs were so far only shown to perform well up to around 1,000 data points (Hollmann et al., 2023), but we are hopeful for future improvements.

Furthermore, we plan to extend our work to contain more elaborate priors that can handle non-stationary functions (Assael et al., 2014; Martinez-Cantin, 2015; Wabersich & Toussaint, 2016), heteroscedastic noise (Griffiths et al., 2022), discrete and categorical variables (Daulton et al., 2022), discontinuous functions (Jenatton et al., 2017; Lévesque et al., 2017), and hundreds or thousands of irrelevant features (Wang et al., 2016). In addition, we expect that it will be easier to use other random processes, such as Student-t processes (Shah et al., 2014), or other non-Gaussian output distributions, like log-normals (Eggensperger et al., 2018). We have demonstrated how user priors can be added to the model, and we plan to extend this to allow more flexible user priors that permit specifying details about the behavior of the black-box function. Following up on this, a great addition will be learning user priors via transfer learning (Wistuba et al., 2016; van Rijn & Hutter, 2018; Vanschoren, 2019; Feurer et al., 2022) or allowing simpler user priors based on just a few good starting points. We would like to end this paper by once again highlighting the potential of PFNs as surrogates in BO, as they can easily be adapted to model any prior that is efficient to sample from.

Acknowledgements

Robert Bosch GmbH is acknowledged for financial support.
This research was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant number 417962828 and by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no. INST 39/963-1 FUGG. The authors acknowledge funding through the research network Responsive and Scalable Learning for Robots Assisting Humans (ReScaLe) of the University of Freiburg. The ReScaLe project is funded by the Carl Zeiss Foundation. We acknowledge funding through the European Research Council (ERC) Consolidator Grant Deep Learning 2.0 (grant no. 101045765), funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the ERC. Neither the European Union nor the ERC can be held responsible for them.

References

Assael, J., Wang, Z., and de Freitas, N. Heteroscedastic treed Bayesian optimisation. arXiv:1410.7172 [cs.LG], 2014.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449-458. PMLR, 2017.

Benassi, R., Bect, J., and Vazquez, E. Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion. In Coello, C. (ed.), Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11), volume 6683 of Lecture Notes in Computer Science, pp. 176-190. Springer, 2011.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, 2012.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. (eds.), Proceedings of the 24th International Conference on Advances in Neural Information Processing Systems (NeurIPS'11), pp. 2546-2554. Curran Associates, 2011.

Bergstra, J., Yamins, D., and Cox, D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning (ICML'13), pp. 115-123. Omnipress, 2013.

Brochu, E., Cora, V., and de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599v1 [cs.LG], 2010.

Calandra, R., Peters, J., Rasmussen, C., and Deisenroth, M. Manifold Gaussian processes for regression. In 2016 International Joint Conference on Neural Networks (IJCNN'16), pp. 3338-3345. International Neural Network Society and IEEE Computational Intelligence Society, IEEE, 2016.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Krishnapuram, B., Shah, M., Smola, A., Aggarwal, C., Shen, D., and Rastogi, R. (eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), pp. 785-794. ACM Press, 2016.

Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In Xing, E. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning (ICML'14). Omnipress, 2014.

Chen, Y., Song, X., Lee, C., Wang, Z., Zhang, Q., Dohan, D., Kawakami, K., Kochanski, G., Doucet, A., Ranzato, M., Perel, S., and de Freitas, N. Towards learning universal hyperparameter optimizers with transformers. In Proceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS'22), 2022.
Cowen-Rivers, A., Lyu, W., Tutunov, R., Wang, Z., Grosnit, A., Griffiths, R., Maraval, A., Jianye, H., Wang, J., Peters, J., and Ammar, H. HEBO: Pushing the limits of sample-efficient hyper-parameter optimisation. Journal of Artificial Intelligence Research, 74:1269-1349, 2022.

Dai, N., Liang, J., Qiu, X., and Huang, X. Style transformer: Unpaired text style transfer without disentangled latent representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5997-6007. Association for Computational Linguistics, 2019.

Daulton, S., Wan, X., Eriksson, D., Balandat, M., Osborne, M., and Bakshy, E. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. In Proceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS'22), 2022.

Eggensperger, K., Lindauer, M., and Hutter, F. Neural networks for predicting algorithm runtime distributions. In Lang, J. (ed.), Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18), pp. 1442-1448, 2018.

Eggensperger, K., Haase, K., Müller, P., Lindauer, M., and Hutter, F. Neural model-based optimization with right-censored observations. arXiv:2009.13828 [cs.AI], 2020.

Eggensperger, K., Müller, P., Mallik, N., Feurer, M., Sass, R., Klein, A., Awad, N., Lindauer, M., and Hutter, F. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Vanschoren, J. and Yeung, S. (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. Curran Associates, 2021.

Eriksson, D. and Jankowiak, M. High-dimensional Bayesian optimization with sparse axis-aligned subspaces. In Uncertainty in Artificial Intelligence, pp. 493-503. PMLR, 2021.

Eriksson, D., Bindel, D., and Shoemaker, C. pySOT and POAP: An event-driven asynchronous framework for surrogate optimization. arXiv:1908.00420 [math.OC], 2019.

Feurer, M. and Hutter, F. Hyperparameter optimization. In Hutter, F., Kotthoff, L., and Vanschoren, J. (eds.), Automated Machine Learning: Methods, Systems, Challenges, chapter 1, pp. 3-38. Springer, 2019. Available for free at http://automl.org/book.

Feurer, M., Letham, B., Hutter, F., and Bakshy, E. Practical transfer learning for Bayesian optimization. arXiv:1802.02219v4 [stat.ML], 2022.

Frazier, P. A tutorial on Bayesian optimization. arXiv:1807.02811 [stat.ML], 2018.

Gardner, J., Pleiss, G., Weinberger, Q., Bindel, D., and Wilson, A. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NeurIPS'18), pp. 7576-7586. Curran Associates, 2018.

Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y., Rezende, D., and Eslami, S. Conditional neural processes. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning (ICML'18), volume 80, pp. 1704-1713. Proceedings of Machine Learning Research, 2018.

Garnett, R. Bayesian Optimization. Cambridge University Press, 2022. In preparation.

Griffiths, R.-R., Aldrick, A., Garcia-Ortegon, M., Lalchand, V., and Lee, A. Achieving robustness to aleatoric uncertainty with heteroscedastic Bayesian optimisation. Machine Learning: Science and Technology, 3, 2022.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR'16), pp. 770-778. Computer Vision Foundation and IEEE Computer Society, IEEE, 2016.

Hebbal, A., Brevault, L., Balesdent, M., Taibi, E.-G., and Melab, N. Efficient global optimization using deep Gaussian processes. In 2018 IEEE Congress on Evolutionary Computation (CEC), 2018.

Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. TabPFN: A transformer that solves small tabular classification problems in a second. In International Conference on Learning Representations (ICLR'23), 2023. Published online: iclr.cc.

Hutter, F., Hoos, H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Coello, C. (ed.), Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11), volume 6683 of Lecture Notes in Computer Science, pp. 507-523. Springer, 2011.

Hvarfner, C., Stoll, D., Souza, A., Nardi, L., Lindauer, M., and Hutter, F. πBO: Augmenting acquisition functions with user beliefs for Bayesian optimization. In Proceedings of the International Conference on Learning Representations (ICLR'22), 2022. Published online: iclr.cc.

Jenatton, R., Archambeau, C., González, J., and Seeger, M. Bayesian optimization with tree-structured dependencies. In Precup, D. and Teh, Y. (eds.), Proceedings of the 34th International Conference on Machine Learning (ICML'17), volume 70, pp. 1655-1664. Proceedings of Machine Learning Research, 2017.

Jones, D., Schonlau, M., and Welch, W. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455-492, 1998.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR'15), 2015. Published online: iclr.cc.

Korovina, K., Xu, S., Kandasamy, K., Neiswanger, W., Poczos, B., Schneider, J., and Xing, E. ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS'20), pp. 3393-3403. Proceedings of Machine Learning Research, 2020.

Letham, B., Karrer, B., Ottoni, G., and Bakshy, E. Constrained Bayesian optimization with noisy experiments. Bayesian Analysis, 2018.

Li, J., Liu, Y., Liu, J., and Wang, W. Neural architecture optimization with graph VAE. arXiv:2006.10310 [cs.LG], 2020.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations (ICLR'17), 2017. Published online: iclr.cc.

Lévesque, J., Durand, A., Gagné, C., and Sabourin, R. Bayesian optimization for conditional hyperparameter spaces. In Howell, B. (ed.), 2017 International Joint Conference on Neural Networks (IJCNN'17), pp. 286-293. International Neural Network Society and IEEE Computational Intelligence Society, IEEE, 2017.

Maraval, A., Zimmer, M., Grosnit, A., Tutunov, R., Wang, J., and Ammar, H. Sample-efficient optimisation with probabilistic transformer surrogates. arXiv:2205.13902 [cs.LG], 2022.

Martinez-Cantin, R. Locally-biased Bayesian optimization using nonstationary Gaussian processes. In de Freitas, N., Adams, R., Shahriari, B., Calandra, R., and Shah, A. (eds.), NeurIPS Workshop on Bayesian Optimization in Theory and Practice (BayesOpt'15), 2015.
Müller, S., Hollmann, N., Arango, S., Grabocka, J., and Hutter, F. Transformers can do Bayesian inference. In Proceedings of the International Conference on Learning Representations (ICLR'22), 2022. Published online: iclr.cc.

Nguyen, T. and Grover, A. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning (ICML'22), volume 162 of Proceedings of Machine Learning Research, pp. 16569-16594. PMLR, 2022a.

Nguyen, T. and Grover, A. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning (ICML'22), volume 162 of Proceedings of Machine Learning Research, pp. 16569-16594. PMLR, 2022b.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Perrone, V., Jenatton, R., Seeger, M., and Archambeau, C. Scalable hyperparameter transfer learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NeurIPS'18), pp. 6845-6855. Curran Associates, 2018.

Pineda Arango, S., Jomaa, H., Wistuba, M., and Grabocka, J. HPO-B: A large-scale reproducible benchmark for black-box HPO based on OpenML. In Vanschoren, J. and Yeung, S. (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. Curran Associates, 2021.

Ramachandran, A., Gupta, S., Rana, S., Li, C., and Venkatesh, S. Incorporating expert prior in Bayesian optimisation via space warping. Knowledge-Based Systems, 195, 2020.

Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Ru, B., Wan, X., Dong, X., and Osborne, M. Interpretable neural architecture search via Bayesian optimisation with Weisfeiler-Lehman kernels. In Proceedings of the International Conference on Learning Representations (ICLR'21), 2021. Published online: iclr.cc.

Schilling, N., Wistuba, M., Drumond, L., and Schmidt-Thieme, L. Joint model choice and hyperparameter optimization with factorized multilayer perceptrons. In Proceedings of the 27th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'15), pp. 72-79. IEEE Computer Society, IEEE, 2015.

Scikit-Optimize. Scikit-Optimize. https://github.com/scikit-optimize/scikit-optimize, 2018.

Shah, A., Wilson, A., and Ghahramani, Z. Student-t processes as alternatives to Gaussian processes. In Kaski, S. and Corander, J. (eds.), Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS'14), volume 33, pp. 877-885. Proceedings of Machine Learning Research, 2014.

Snoek, J., Larochelle, H., and Adams, R. Practical Bayesian optimization of machine learning algorithms. In Bartlett, P., Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (eds.), Proceedings of the 25th International Conference on Advances in Neural Information Processing Systems (NeurIPS'12), pp. 2960-2968. Curran Associates, 2012.
Snoek, J., Swersky, K., Zemel, R., and Adams, R. Input warping for Bayesian optimization of non-stationary functions. In Xing, E. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning (ICML'14), pp. 1674-1682. Omnipress, 2014.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, and Adams, R. Scalable Bayesian optimization using deep neural networks. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning (ICML'15), volume 37, pp. 2171-2180. Omnipress, 2015.

Souza, A., Nardi, L., Oliveira, L., Olukotun, K., Lindauer, M., and Hutter, F. Bayesian optimization with a prior for the optimum. In Machine Learning and Knowledge Discovery in Databases. Research Track, Lecture Notes in Artificial Intelligence, pp. 265-296. Springer-Verlag, 2021.

Springenberg, J., Klein, A., Falkner, S., and Hutter, F. Bayesian optimization with robust Bayesian neural networks. In Lee, D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Proceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NeurIPS'16). Curran Associates, 2016.

Swersky, K., Duvenaud, D., Snoek, J., Hutter, F., and Osborne, M. Raiders of the lost architecture: Kernels for Bayesian optimization in conditional parameter spaces. In Hoffman, M., Snoek, J., de Freitas, N., and Osborne, M. (eds.), NeurIPS Workshop on Bayesian Optimization in Theory and Practice (BayesOpt'13), 2013.

Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., and Guyon, I. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the Black-Box Optimization Challenge 2020. In Escalante, H. and Hofmann, K. (eds.), Proceedings of the Neural Information Processing Systems Track Competition and Demonstration, pp. 3-26. Curran Associates, 2021.

van Rijn, J. and Hutter, F. Hyperparameter importance across datasets. In Guo, Y. and Farooq, F. (eds.), Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18), pp. 2367-2376. ACM Press, 2018.

Vanschoren, J. Meta-learning. In Hutter, F., Kotthoff, L., and Vanschoren, J. (eds.), Automated Machine Learning: Methods, Systems, Challenges, chapter 2, pp. 35-61. Springer, 2019. Available for free at http://automl.org/book.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Proceedings of the 30th International Conference on Advances in Neural Information Processing Systems (NeurIPS'17). Curran Associates, Inc., 2017.
Virtanen, P., Gommers, R., Oliphant, T., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S., Brett, M., Wilson, J., Millman, K., Mayorov, N., Nelson, A., Jones, E., Kern, R., Larson, E., Carey, C., Polat, İ., Feng, Y., Moore, E., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E., Harris, C., Archibald, A., Ribeiro, A., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261-272, 2020.

Wabersich, K. and Toussaint, M. Advancing Bayesian optimization: The mixed-global-local (MGL) kernel and length-scale cool down. In Calandra, R., Shahriari, B., Gonzalez, J., Hutter, F., and Adams, R. (eds.), NeurIPS Workshop on Bayesian Optimization: Black-box Optimization and Beyond, 2016.

Wang, Z., Hutter, F., Zoghi, M., Matheson, D., and de Freitas, N. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361-387, 2016.

Wang, Z., Dahl, G., Swersky, K., Lee, C., Mariet, Z., Nado, Z., Gilmer, J., Snoek, J., and Ghahramani, Z. Pre-trained Gaussian processes for Bayesian optimization. arXiv:2207.03084v4 [cs.LG], 2021.

Wistuba, M. and Grabocka, J. Few-shot Bayesian optimization with deep kernel surrogates. In Proceedings of the International Conference on Learning Representations (ICLR'21), 2021. Published online: iclr.cc.

Wistuba, M., Schilling, N., and Schmidt-Thieme, L. Two-stage transfer surrogate model for automatic hyperparameter optimization. In Paolo, F., Landwehr, N., Manco, G., and Vreeken, J. (eds.), Machine Learning and Knowledge Discovery in Databases (ECML/PKDD'16), Lecture Notes in Computer Science, pp. 199-214. Springer, 2016.

Wu, J. and Frazier, P. The parallel knowledge gradient method for batch Bayesian optimization. In Lee, D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems. Curran Associates, 2016.

Wu, J., Poloczek, M., Wilson, A., and Frazier, P. Bayesian optimization with gradients. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Proceedings of the 30th International Conference on Advances in Neural Information Processing Systems (NeurIPS'17). Curran Associates, 2017.

Zhu, C., Byrd, R., and Nocedal, J. Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization. ACM Transactions on Mathematical Software, 23(4):550-560, 1997.
A. Search Spaces Split

For the HPO-B benchmark, we split search spaces into test, validation, and training search spaces. We make sure to group similar algorithms into the same split. As validation search spaces, we use HPO-B IDs 5527, 5891, 5906, 5971, 6767, 6766, and 5860. As test search spaces, we use HPO-B IDs 5965, 7609, 5889, 6794, 5859, 4796, 7607, 5636, and 5970.

B. Details on Our Priors

Below we explain implementation details and the hyperparameters used in our priors. We found all hyperparameters using random search on the distinct validation search spaces of HPO-B, which ensures an unbiased performance evaluation on the test set.

B.1. Our HEBO-Inspired Prior

For our final model (not the one used for the direct comparison with HEBO on artificial data shown in Figure 10), we made some adaptations to the HEBO prior. We do not use a linear kernel and use the following priors, found by BO on the validation search spaces of HPO-B, as described in Section 7.1. We use Gamma(concentration, rate) distributions for the length- and outputscale hyperparameters: for the outputscale we use Gamma(0.8452, 0.3993) and for the lengthscale we use Gamma(1.2107, 1.5212). For the noise, we follow the original kernel exactly with log(ε) ∼ N(−4.63, 0.5).

B.2. Our BNN Prior

We generate the inputs for the BNN via uniform random sampling from the unit hypercube, akin to the inputs to our GP priors. We employ a multi-layer perceptron (MLP) with a tanh activation function for the BNN. To account for model complexity, the number of layers in the MLP is sampled uniformly from the range 8 to 15. Additionally, the number of hidden units per layer is sampled uniformly between 36 and 150. The weights of the network are sampled from a Normal distribution with zero mean and a standard deviation sampled uniformly at random from [0.089, 0.193]. Every weight is zeroed with a probability of 14.5%, and the remaining weights are rescaled by a factor of (1/(1 − 0.145))^(1/2) to counteract the changed distribution. Before every activation, we add zero-mean Gaussian noise with a standard deviation sampled uniformly between 0.0003 and 0.0014. We additionally add noise sampled uniformly at random between 0.0004 and 0.0013. We use a zero-mean Gaussian with a standard deviation of 0.976 (0.8003) for sampling the input warping parameter C1 (C2).

To be more efficient, we sampled the hyperparameters (the standard deviations of the distributions and the architecture) 16 times per batch, while using a batch size of 128. This is more efficient than sampling them per example, but adds correlation to the gradients inside a batch.
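To make the recipe above concrete, the following is a minimal sketch of how one could sample a single dataset from such a BNN prior. It is an illustration, not the released code: the function name is hypothetical, biases and the input warping parameters (C1, C2) are omitted, and the second noise term is read here as an observation-noise scale.

```python
import torch

def sample_bnn_prior_dataset(n_points=60, n_features=18):
    # Sample the architecture: depth in [8, 15], width in [36, 150].
    n_layers = int(torch.randint(8, 16, ()))
    width = int(torch.randint(36, 151, ()))
    # Sample per-dataset hyperparameters of the prior.
    weight_std = float(torch.empty(()).uniform_(0.089, 0.193))
    act_noise_std = float(torch.empty(()).uniform_(0.0003, 0.0014))
    obs_noise_std = float(torch.empty(()).uniform_(0.0004, 0.0013))

    x = torch.rand(n_points, n_features)  # inputs from the unit hypercube
    h, in_dim = x, n_features
    for layer in range(n_layers):
        out_dim = 1 if layer == n_layers - 1 else width
        w = torch.randn(in_dim, out_dim) * weight_std
        # Zero each weight with probability 14.5% and rescale the rest
        # by (1/(1 - 0.145))**0.5 to keep the weight variance unchanged.
        mask = (torch.rand_like(w) > 0.145).float()
        w = w * mask * (1 / (1 - 0.145)) ** 0.5
        h = h @ w
        h = h + torch.randn_like(h) * act_noise_std  # pre-activation noise
        if layer < n_layers - 1:
            h = torch.tanh(h)
        in_dim = out_dim
    y = h.squeeze(-1) + torch.randn(n_points) * obs_noise_std
    return x, y
```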
C. Acquisition Function Optimization

Acquisition functions are used in Bayesian optimization to determine the next point to sample. They take into account the current model of the objective function and its uncertainty to determine the most promising points to evaluate. As PFNs are standard neural networks, they are fully differentiable; thus, we can compute derivatives of the acquisition function with respect to the input x via backpropagation, and we can then apply gradient-based optimization techniques to find the candidate x that maximizes the acquisition function given a PFN. Here, we use standard techniques proposed by Snoek et al. (2012). We first create a set of candidates for x by sampling random candidate positions (N = 100,000) and combining them with all observations {x_1, . . . , x_k}. Then, we use the 100 candidates with the highest acquisition function values as starting points for a gradient-based search with scipy's (Virtanen et al., 2020) L-BFGS-B (Zhu et al., 1997). Finally, we use the candidate with maximal EI after optimization as our proposal. In the case of integer or boolean hyperparameters, we treat the fractional position of a value between the two neighboring legal values as the probability of a coin flip and round probabilistically. Furthermore, we skip candidates that we have already evaluated, to ensure that we do not evaluate the same candidate twice.
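The following is a minimal sketch of this candidate-generation-plus-refinement loop, assuming an `acq` callable that maps a point in [0, 1]^d to a scalar acquisition value (e.g., EI computed from the PFN's predictive distribution). Unlike the actual setup, which backpropagates through the PFN for exact gradients, this sketch lets L-BFGS-B approximate gradients by finite differences and evaluates candidates one by one; all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def propose_next(acq, observed_x, n_random=100_000, n_refine=100):
    """Random candidates + L-BFGS-B refinement of an acquisition function.

    observed_x: array of shape (k, d) with all previous observations.
    """
    d = observed_x.shape[1]
    # 1) Random candidates in the unit hypercube, plus all observations.
    candidates = np.vstack([np.random.rand(n_random, d), observed_x])
    values = np.array([acq(c) for c in candidates])  # batched in practice
    # 2) Refine the top candidates with gradient-based search.
    best_x, best_val = None, -np.inf
    for x0 in candidates[np.argsort(values)[-n_refine:]]:
        res = minimize(lambda x: -acq(x), x0, method="L-BFGS-B",
                       bounds=[(0.0, 1.0)] * d)
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x

def probabilistic_round(x, is_integer_dim):
    """Treat the fractional part of integer dimensions as the probability
    of rounding up, i.e., a coin flip between the two legal neighbors."""
    x = x.copy()
    frac = x[is_integer_dim] - np.floor(x[is_integer_dim])
    x[is_integer_dim] = np.floor(x[is_integer_dim]) + (np.random.rand(frac.size) < frac)
    return x
```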
D. BO Tricks

We found that the following two tricks improve the performance of our PFNs considerably and use them in all our experiments:

- We use a power transform to map the observed outputs to a distribution closer to a standard normal, as proposed by Cowen-Rivers et al. (2022). They introduced this to handle heteroscedasticity, which might be how it helps the model. We additionally saw that it flattens outliers on the non-optimal end of the spectrum while magnifying the differences between points close to the optimum, leading to a different exploration/exploitation trade-off.

- We found that training on datasets with inputs from the unit hypercube, i.e., X = [0, 1]^d, and transforming all observations into the unit hypercube based on their per-dimension min/max leads to improved consistency.
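A minimal sketch of these two transformations follows. It uses scikit-learn's PowerTransformer as a stand-in for HEBO's power transform; the exact variant (Yeo-Johnson here, since it handles non-positive values) and the function names are assumptions, not taken from the released code.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def transform_targets(y_observed):
    """Map observed outputs toward a standard normal before conditioning."""
    pt = PowerTransformer(method="yeo-johnson", standardize=True)
    y = np.asarray(y_observed, dtype=float).reshape(-1, 1)
    return pt.fit_transform(y).ravel()

def transform_inputs(x_observed):
    """Min/max-scale each input dimension into [0, 1], matching the
    unit-hypercube inputs the PFN saw during training."""
    x = np.asarray(x_observed, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / np.where(hi > lo, hi - lo, 1.0)
```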
E. Training Details

We train a single PFN that is shared across search spaces for one particular prior. This PFN is flexible across datasets with a varying number of features. To achieve this, during PFN training we sample datasets whose number of dimensions is drawn uniformly at random from {1, . . . , 18}. Following Hollmann et al. (2023), we zero-pad the features when the number of features k is smaller than the maximum number of features K. We also linearly scale the features by K/k to make sure the magnitude of the inputs is similar across different numbers of features.

Our PFNs were trained with the standard PFN settings used by Müller et al. (2022): an embedding size of 512 and six layers in the transformer. Our models were trained with Adam (Kingma & Ba, 2015) and cosine annealing (Loshchilov & Hutter, 2017) without any special tricks. The learning rate was chosen based on simple grid searches for minimal training loss over {1e-3, 3e-4, 1e-4, 5e-5}. We found that other hyperparameters did not have a large impact on final performance. Our final models, apart from studies on smaller budgets like in Figure 11, trained for less than 24 hours on a cluster node with eight RTX 2080 Ti GPUs.

E.1. Regression Heads

Figure 6: A visualisation of the Riemann distribution, with unbounded support. Plot based on Müller et al. (2022).

Müller et al. (2022) employ a Riemann distribution to model the output of the PFN architecture for regression tasks. Modeling continuous distributions with neural networks is challenging; to achieve robust performance in modeling PPDs, they utilize a distribution that is particularly compatible with neural networks. Drawing on the understanding that neural networks excel at classification, and taking inspiration from discretizations in distributional reinforcement learning (Bellemare et al., 2017), they employ a discretized continuous distribution named the Riemann distribution. It discretizes the space into buckets B, which are selected such that each bucket has equal probability under the prior data: p(y ∈ b) = 1/|B| for all b ∈ B. A Riemann distribution with unbounded support is utilized, as suggested by Müller et al. (2022), which replaces the final bar on each side with a suitably scaled half-normal distribution, as shown in Figure 6. For a more precise definition, we direct the reader to Müller et al. (2022).

F. Riemann Distribution and Acquisition Function

We outline how to compute PI(f*) = ∫ [y > f*] p(y) dy for the unbounded Riemann distribution:

PI(f*) = ∫ [y > f*] p(y) dy                                                                      (5)
       = ∫_{-∞}^{y_1} [y > f*] p(y) dy + Σ_{i=1}^{M} ∫_{y_i}^{y_{i+1}} [y > f*] p(y) dy + ∫_{y_{M+1}}^{∞} [y > f*] p(y) dy    (6)
       = (1 - F_l(y_1 - f*))
         + Σ_{i=1}^{M} { (y_{i+1} - f*) · p(b_i) / (y_{i+1} - y_i)   if y_i < f* < y_{i+1}
                       { [y_i ≥ f*] · p(b_i)                          else
         + F_r(f* - y_{M+1})                                                                     (7)
       = (1 - F_l(y_1 - f*)) + Σ_{i=1}^{M} (y_{i+1} - min(y_{i+1}, max(f*, y_i))) · p(b_i) / (y_{i+1} - y_i) + F_r(f* - y_{M+1}),    (8)

where F_l (F_r) is the CDF of the half-normal distribution used for the left (right) side. The acquisition function is divided into three terms: (1) a term that governs the probability mass for values lower than interval b_1, going up to y_1; (2) a term that sums over all intervals b_1, . . . , b_M; and (3) a term that governs the probability mass for values larger than interval b_M, starting from y_{M+1}.
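To make the bucket term concrete, here is a small sketch, written purely for illustration, that evaluates the middle sum of Eq. (8) for a bounded Riemann distribution; the unbounded variant additionally adds the two half-normal tail terms.

```python
import numpy as np

def pi_riemann_buckets(bucket_edges, bucket_probs, f_star):
    """P(y > f*) contributed by the buckets of a Riemann distribution.

    bucket_edges: shape (M+1,), boundaries y_1 < ... < y_{M+1}.
    bucket_probs: shape (M,), probability mass p(b_i) of each bucket
                  (the density is uniform inside each bucket).
    """
    lo, hi = bucket_edges[:-1], bucket_edges[1:]
    # Length of the part of each bucket that lies above f*.
    covered = hi - np.minimum(hi, np.maximum(f_star, lo))
    return float(np.sum(covered * bucket_probs / (hi - lo)))
```

For example, with edges [0, 1, 2] and masses [0.5, 0.5], f* = 1.5 yields 0.25, the mass of the upper half of the second bucket.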
G. User Prior Distribution

We describe our method for a one-dimensional search space with a fixed discretization I into intervals I ∈ I; the method can be extended to more dimensions and to intervals of varying size. As an approximation, we use the position m of the estimated maximum, (m, m_y) = arg max_{(x,y) ∈ D} y. We thus from now on assume that we can sample from a joint distribution over the dataset D and its maximum position m: p(D, m). An example of the effect of the form of the prior over the optimum is given in Figure 3.

We train a single PFN that can be conditioned on arbitrary ρ and I on a prior of the form p(D, ρ, I). This yields the training objective

E_{{(x,y)} ∪ D, ρ, I ∼ p(D, ρ, I)} [-log q_θ(y | x, D, ρ, I)].    (9)

The extra inputs ρ and I are fed to the PFN at an extra position with its own linear encoder, similar to style embeddings for language models (Dai et al., 2019). During training, p(D, ρ, I) needs to cover many ρ and I, preserve Equation 3, and be fast to sample from. To achieve this, we actually sample the dataset D first and then a possible interval I; that is, we condition the interval I on our maximum m, and not the other way around. We first sample D, m ∼ p(D, m) and ρ ∼ U(0, 1) independently at random. Next, we sample I using p(I | ρ, m) = ρ [m ∈ I] + (1 - ρ) p(I), where p(I) = E_{D,m ∼ p(D,m)}[[m ∈ I]]. It is easy to see that this distribution has good coverage of both I and ρ. We can now show, rather easily, that this sampling scheme indeed models our definition of p(D, m | I):

p(D, m | I) = p(I | D, m) p(D, m) / p(I)
            = p(I | m) p(D, m) / p(I)
            = (ρ [m ∈ I] + (1 - ρ) p(I)) p(D, m) / p(I)
            = ρ [m ∈ I] p(D, m) / p(I) + (1 - ρ) p(D, m)
            = ρ p(D, m | m ∈ I) + (1 - ρ) p(D, m),    (14)

where we assume a dependence on the independently distributed confidence ρ everywhere. We used p(D, m | m ∈ I) = [m ∈ I] p(D, m) / p(I), which is trivial to derive, as well as p(I) = E_m[[m ∈ I]], which was introduced in Section 5.3.

In our experiments, we approximate p(I) with a simple uniform distribution, as we saw E_m[[m ∈ I]] to be close to uniform, and we use the example with maximal value in the training set as an approximation to the true maximum of the underlying function generating the dataset. We choose the set of intervals to be I = {[i/k, (i + 1)/k] | k ∈ {1, . . . , 5}, i ∈ {0, . . . , k - 1}}. Figure 16 in the Appendix shows the impact of our user prior on the optimization behaviour.
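A small sketch of this sampling scheme follows. It is an illustration under two stated assumptions: `sample_dataset_and_max` is a hypothetical stand-in for drawing (D, m) from the base prior, and, when the informed branch is taken, the interval is picked uniformly among those containing m, which is one plausible reading of the ρ [m ∈ I] component.

```python
import random

# The interval set used in the paper: I = {[i/k, (i+1)/k] | k in 1..5}.
INTERVALS = [(i / k, (i + 1) / k) for k in range(1, 6) for i in range(k)]

def sample_user_prior_example(sample_dataset_and_max):
    """Draw one training example (D, rho, I) for the user-prior PFN.

    sample_dataset_and_max() -> (D, m), with m in [0, 1] the input position
    of the dataset's maximum (approximated by its best observed point).
    """
    D, m = sample_dataset_and_max()
    rho = random.random()  # confidence, rho ~ U(0, 1)
    if random.random() < rho:
        # Informed branch: pick an interval that contains the maximum.
        interval = random.choice([I for I in INTERVALS if I[0] <= m <= I[1]])
    else:
        # Uninformed branch: p(I), approximated as uniform over intervals.
        interval = random.choice(INTERVALS)
    return D, rho, interval  # rho and I enter the PFN via a style position
```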
Figure 7: Performance (average rank and average regret) over the number of trials for the four search spaces of HPO-B with three or fewer dimensions, comparing Random, HEBO, GP, DNGO, DGP, PFN (HEBO+, EI), and PFN (HEBO+, EI+KG). Figure 18 shows per-search-space results.

H. Non-Myopic Acquisition Function Approximation

In the following, we explain how PFNs can learn to directly approximate α_KG, as defined in Section 5.4, in one forward pass. To do this, we first learn an approximation q_θ(y | x, D) to the PPD. Based on it, we can learn an approximation to the distribution of means with the loss

E_{{(x,y)} ∪ D ∼ p(D)} [-log q^{(μ)}_θ(E_{q(y|x,D)}[y] | D)],    (15)

which yields a q^{(μ)}_θ that approximates the distribution of the random variable E_{p(y|x,D)}[y] with x ∼ p(x | D). We approximate the maximal mean τ(D) with the upper 1-per-mille quantile of q^{(μ)}_θ(· | D). Based on this, in turn, we can finally approximate α(x; D) by training a second PFN to approximate the new target y* = icdf(q^{(μ)}_θ(· | {(x, y)} ∪ D), 0.999) in the usual manner:

E_{{(x,y)} ∪ D ∼ p(D)} [-log q_{level1,θ}(y* | x, D)].    (16)

In practice, we use the same PFN for both steps and add a style embedding, as for the user priors in Section 5.3. This style embedding indicates in which mode the PFN should operate: the standard, myopic setting or the non-myopic setting. It would be very interesting for future work to generalize this to larger search spaces and multi-step lookahead.

H.1. A Study on Our Knowledge Gradient Approximation

We found that the PFN with the strong GP prior does not perform very well on very small (three or fewer dimensions) search spaces. Additionally, we saw that we could train our Knowledge Gradient (KG) models more successfully to approximate KG for few dimensions. Thus, we decided to show the impact of Knowledge Gradient for these kinds of search spaces. In Figure 7, we show the impact of KG for the four search spaces with three or fewer dimensions in HPO-B. We ablated how exactly to use KG for these search spaces on all other search spaces and found that a mixture of KG and EI worked best: at every step, a random coin flip decides whether to use KG or EI. We can see that this improves performance for small search spaces considerably over plain EI.

I. Approximation Quality on Prior Data

Our Adapted HEBO Prior. We stayed as close as possible to HEBO, but had to make some adaptations, as one cannot sample from the HEBO model as-is. Specifically, we had to introduce simple priors for the hyperparameters of both kernels used in HEBO (the Matérn and the linear kernel), as they did not have a prior attached. We introduce uniform priors (U(0, 1)) for the lengthscales of both the Matérn and the linear kernel and for the variance of the linear kernel. Next, we trained a PFN using the HEBO prior with probabilistic distributions over the hyperparameters (see Section 4.2), enabling PFNs to approximate hyperpriors.

[Figure 8, left panel: heatmap of the fraction of cases in which the PFN wins, over the number of training examples (10-40) and the number of features.]

# Features                1       2       10
Approx. Mean Regret
  Emp. Bay.             0.057   0.059   0.110
  PFN                   0.054   0.052   0.124
# Wins
  Emp. Bay.             206     144     169
  PFN                   239     154     171
Mean Rank
  Emp. Bay.             1.517   1.506   1.501
  PFN                   1.483   1.494   1.499

Figure 8: Left: For the HEBO prior, fraction of cases in which the PFN gives a higher likelihood to unseen examples than the original Empirical Bayes approximation by Cowen-Rivers et al. (2022). We aggregate across 1,000 datasets drawn from the prior. The optimization of the original Empirical Bayes approximation failed in 4.5% of cases; we ignored these cases to be as fair as possible. Right: BO performance after 50 evaluations on the prior with EI over 1,000 sampled datasets. The majority of runs yielded ties.

Figure 9: Average regret and rank on the HPO-B benchmark for Random, HEBO, Optformer-B (EI), Optformer-R (EI), BOHAMIANN, PFN (HEBO+), and PFN (BNN). We plot a different set of baselines that includes the meta-learned Optformer baseline, which was trained on real-world data. While the Optformer baselines perform strongly for few samples (likely due to prior knowledge of the names of the optimized parameters, which is only revealed to the Optformer), this advantage drops quickly.

Comparison of the Likelihood of Prior Data. For the HEBO prior, we compare the resulting PFN's fit with the Empirical Bayes approximation used in the original HEBO work. Figure 8 shows that the likelihood assigned to held-out outputs is higher for the PFN in many cases and across a varying number of optimized features; we hypothesize that this is because HEBO's Empirical Bayes approximation becomes too greedy in high dimensions.

BO Comparison. To assess our method qualitatively, we provide optimization trajectories of the HEBO+ model on 1d samples from the prior in Figure 10, and on a 2d Branin function without initial design in Figures 13 (initialization at the lower corner) and 14 (initialization at the middle point). Additionally, we show comparisons of a PFN approximating a simple RBF-kernel GP, for which we thus have the ground-truth posterior: Figure 11 shows an optimization of a 1d Ackley function, and Figure 12 shows an optimization of a non-continuous function that is chosen to probe out-of-distribution performance, as RBF-kernel functions are smooth.

J. User Prior Details

We list the user prior used for PD1 in Table 2. We use this one user prior for all the different tasks in PD1. Our PFN, though, supports specifying any user prior at prediction time, i.e., you can use the weights we share with your own specific user prior.

Figure 10: Examples of optimization trajectories on functions sampled from our HEBO prior. We compare the PFN to our re-implementation of the HEBO model.

hyperparameter     ρ     encoded min   encoded max
lr decay factor    .5    0.74          0.99
lr initial         .25   .01           .31
lr power           .1    1.5           2.
epoch              .5    235           299
activation fn      .5    1             1

Table 2: ρ defines what percentage of datasets in the prior are sampled to explicitly have their maximum between the specified minimum and maximum; the rest are sampled as usual.
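Purely as an illustration of how the prior in Table 2 could be written down in code, here is one possible structure; the dictionary layout and key names are invented for this sketch and do not reflect the released interface.

```python
# Hypothetical encoding of Table 2: per hyperparameter, the confidence rho
# and the (encoded) interval believed to contain the optimum.
pd1_user_prior = {
    "lr decay factor": {"rho": 0.50, "min": 0.74, "max": 0.99},
    "lr initial":      {"rho": 0.25, "min": 0.01, "max": 0.31},
    "lr power":        {"rho": 0.10, "min": 1.50, "max": 2.00},
    "epoch":           {"rho": 0.50, "min": 235,  "max": 299},
    "activation fn":   {"rho": 0.50, "min": 1,    "max": 1},
}
```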
K. Other BO Surrogates

Another popular model class is Bayesian neural networks (BNNs) or approximations thereof (Snoek et al., 2015; Schilling et al., 2015; Springenberg et al., 2016; Perrone et al., 2018; Eggensperger et al., 2020), because they promise better scaling to large numbers of observations. Two popular approaches that we also compare against in our experiments are DNGO (Snoek et al., 2015) and BOHAMIANN (Springenberg et al., 2016). DNGO trains a standard neural network and then replaces the output layer with a Bayesian linear regression, while BOHAMIANN adapts a mini-batch Monte Carlo method (Chen et al., 2014) for a fully Bayesian treatment of the network weights. Other models that have been investigated aim to strike a middle ground between GPs and BNNs, for example, Deep GPs (Hebbal et al., 2018), Deep Kernel GPs (Wistuba & Grabocka, 2021), or Manifold GPs (Calandra et al., 2016). However, these models suffer from the need to adapt hyperparameters online and, similar to the GP, make a parametric assumption about the distribution of the targets. Last but not least, trees and ensembles of trees have been used, which, despite relying on frequentist uncertainty predictions, yielded strong results and overcame many of the drawbacks of GPs (Hutter et al., 2011; Bergstra et al., 2011; Scikit-Optimize, 2018).

L. Ablation Studies

In this section, we describe additional ablation studies we performed to further the understanding of our method.

HEBO Variants. In Section 6, we already showed that two variants of HEBO, one with a PFN as posterior approximation and one using MLE-II and a traditional GP, perform very similarly on data from the prior. The original HEBO prior is slightly different, though, compared to the prior used there; e.g., there is no prior weighting on the lengthscale, and it is unbounded. To further the comparison with the full HEBO method, we provide an ablation on HPO-B in Figure 17. Here, we additionally show the impact of the acquisition function of HEBO, which is a mixture of EI, UCB, and PI with added noise, and of the initial design. The default initial design on HPO-B is five steps, but for search spaces with more than five dimensions, HEBO draws more random samples.

Figure 11: In this figure, we compare models trained on a simple GP prior (with fixed hyperparameters), so we can compare to the exact posterior of the GP. Panels show the posterior mean and EI at successive BO steps (steps 1-8, 21-24, and 29-32) for PFNs trained on 100K, 1M, 20M, 40M, and 80M training datasets, illustrating how PFNs behave differently depending on how much they were trained. Vertical lines mark the maximum of the acquisition function.

Figure 12: This figure showcases the approximation quality of a PFN on out-of-distribution functions: the to-be-maximized function is discontinuous, while the RBF kernel produces continuous functions only. Panels show the mean and EI at steps 1-28 for an RBF-kernel GP and its PFN approximation, along with the suggested next evaluation point (horizontal line). This is a special case where we can actually compute the exact posterior with the GP. We can see that the mean and EI of the PFN are good approximations. The EI diverges in some of the later steps, though, but not the suggested next evaluation point.

Figure 13: This figure showcases the optimization trajectory of a PFN with the HEBO+ prior minimizing a Branin function, starting from the lower corner of the search space. The newly queried point is white in all plots. Figure 14 shows the same experiment starting from the center. We find that, for a deterministic, smooth test function, both trajectories quickly converge to similar queries.

Figure 14: This figure showcases the optimization trajectory of a PFN with the HEBO+ prior minimizing a Branin function, starting from the center of the search space. The newly queried point is white in all plots. Figure 13 shows the same experiment starting from the lower corner. We find that, for a deterministic, smooth test function, both trajectories quickly converge to similar queries.

Figure 15: In this figure, we show a typical search space with a missing log encoding in HPO-B. This is search space 6766, which has two input dimensions. The first dimension has almost no impact on the outcome (y), so we plot only the second dimension (x). One can see that this hyperparameter clearly lives on a log scale, with all interesting curvature happening in the lowest 1% of the interval, a minimum (worst outcome) close to 10^-6 and a maximum (best outcome) around 10^-7.
Figure 16: Impact of a user prior on the mean prediction and acquisition function, comparing a PFN (GP prior) with a user prior specifying a 50% chance of the optimum lying in (0.75, 1.0) against the same PFN without a user prior. We can see that adding the user prior for a maximum in the right-most quarter increases the EI and mean in that region. The important point is that the user prior can be overwritten by the data: the algorithm samples much more outside the specified interval than inside, because the actual curve is higher there.

Figure 17: Ablation on the HPO-B test spaces of different features of the HEBO baseline, comparing Random, GP, DNGO, DGP, HEBO, HEBO (EI), HEBO (EI, no initial design), HEBO (EI, no initial design, no acquisition noise), HEBO (BoTorch, EI, no initial design, no acquisition noise), PFN (HEBO+), and PFN (HEBO+, no input warping). Additionally, we show the performance of PFN (HEBO+) without input warping on HPO-B; it performs worse without input warping. We can also see that most ablations of HEBO have little impact on performance, but removing the initial design (except for the default 5 seeds of HPO-B), in addition to using EI without any noise, makes the standard HEBO implementation more competitive with PFN (HEBO+).

Method                             Rank Mean   Regret Mean
Random                             2.765       0.076
HEBO                               2.406       0.070
PFN (HEBO+, no ignored feat's)     2.639       0.103
PFN (HEBO+, 30% ignored feat's)    2.189       0.091

Table 3: Mean rank and mean regret over time on the three search spaces of HPO-B with the most hyperparameters, all of which have XGBoost as the tuned method. We can see that for large search spaces it helps considerably to allow the PFN to ignore certain features.

While we can see that most of the ablations of HEBO have little impact on performance, we can improve its performance by removing all initial design beyond the standard HPO-B design and replacing the acquisition function with simple EI only. In Figure 17, we also show the impact of disabling input warping for our PFN (HEBO+). While this does make performance worse, the only non-HEBO baseline that can beat this PFN after 50 trials is DGP, even though the search spaces of HPO-B contain many ill-conditioned dimensions with missing log transforms.

Ignoring Features. In Table 3, we show the impact of adding meaningless features to the prior, as described in Section 5.2, for search spaces with many dimensions. We can see that performance improves on average. At the same time, we found no impact on smaller search spaces.

Ablation of the Acquisition Function. Table 4 shows a simple ablation of acquisition functions on the test search spaces of HPO-B. We can see that all EI variants and UCB perform similarly well, while PI seems to be the worse choice.

Figure 18: Average ranks over the number of trials across search spaces. We can see that the PFN generally works well, but also has clear failure modes for some search spaces, which need further investigation.
Method                   Rank Mean   Regret Mean
Random                   7.658       0.125
HEBO                     5.467       0.092
GP                       5.539       0.075
DNGO                     5.332       0.057
DGP                      5.604       0.059
EI                       4.969       0.053
EI (predicted mean)      4.883       0.059
PI                       5.227       0.071
PI (predicted mean)      5.385       0.071
UCB (0.95 percentile)    4.932       0.053

Table 4: Ablation of different acquisition functions on HPO-B. We see that we actually did not make the best choice for the test set (but it is the best on the validation search spaces, so we stuck with it).

M. Software Versions

In this section, we list the versions of the optimizers that we benchmarked:

HEBO: We use the archived submission to the NeurIPS 2020 Black-Box Optimization Challenge that is available at https://github.com/huawei-noah/HEBO/tree/master/HEBO/archived_submissions/hebo for Bayesmark (HEBO 0.0.8 a376313), as it is optimized for the challenge, and adapted the public release to support discrete benchmarks as well as our ablations. We publish our adapted HEBO implementation with the supplementary material. It is branched off from https://github.com/huawei-noah/HEBO/tree/405dc4ceb93a79f0d1f0eaa24f5458dd26de1d05.

GP: We used the results from Pineda Arango et al. (2021) that are available at https://github.com/releaunifreiburg/HPO-B/tree/main/results.

DNGO: We used the results from Pineda Arango et al. (2021) that are available at https://github.com/releaunifreiburg/HPO-B/tree/main/results.

DGP: We used the results from Pineda Arango et al. (2021) that are available at https://github.com/releaunifreiburg/HPO-B/tree/main/results.

HyperOpt: We used the implementation coming with Bayesmark (https://github.com/uber/bayesmark/tree/8c420e935718f0d6867153b781e58943ecaf2338), which is HyperOpt 0.2.7 a376313.

pySOT: We used the implementation coming with Bayesmark (https://github.com/uber/bayesmark/tree/8c420e935718f0d6867153b781e58943ecaf2338), which is pySOT 0.3.3 a376313.

Moreover, we used the following versions of the benchmark platforms:

HPO-B: https://github.com/releaunifreiburg/HPO-B/tree/f5d415e45012544c61b0e334a42aa69f6aae5d7f with mode = v3

Bayesmark: https://github.com/uber/bayesmark/tree/8c420e935718f0d6867153b781e58943ecaf2338