# Bayesian Optimization over Hybrid Spaces

Aryan Deshwal, Syrine Belakaria, Janardhan Rao Doppa
School of EECS, Washington State University, Pullman, USA. Correspondence to: Aryan Deshwal.
Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139, 2021.

Abstract

We consider the problem of optimizing hybrid structures (mixtures of discrete and continuous input variables) via expensive black-box function evaluations. This problem arises in many real-world applications. For example, in materials design optimization via lab experiments, discrete and continuous variables correspond to the presence/absence of primitive elements and to their relative concentrations, respectively. The key challenge is to accurately model the complex interactions between discrete and continuous variables. In this paper, we propose a novel approach referred to as Hybrid Bayesian Optimization (HyBO) that utilizes diffusion kernels, which are naturally defined over continuous and discrete variables. We develop a principled approach for constructing diffusion kernels over hybrid spaces by utilizing the additive kernel formulation, which allows additive interactions of all orders in a tractable manner. We theoretically analyze the modeling strength of additive hybrid kernels and prove that they have the universal approximation property. Our experiments on synthetic and six diverse real-world benchmarks show that HyBO significantly outperforms state-of-the-art methods.

1. Introduction

A large number of science and engineering applications involve optimizing hybrid spaces (mixtures of discrete and continuous input variables) guided by expensive black-box function evaluations. For example, in materials design optimization, discrete variables correspond to the presence/absence of primitive elements, continuous variables correspond to their relative concentrations, and evaluating each design involves performing an expensive physical lab experiment. A popular and effective framework for optimizing expensive black-box functions is Bayesian optimization (BO) (Shahriari et al., 2016; Frazier, 2018; Greenhill et al., 2020; Belakaria et al., 2019; Zhou et al., 2020; Belakaria et al., 2020e;c;a;f). The key idea behind BO is to learn a surrogate statistical model and intelligently select the sequence of inputs for evaluation so as to approximately optimize the unknown objective. The Gaussian process (GP) (Rasmussen & Williams, 2006) is the most popular choice for learning statistical models. GPs allow us to incorporate domain knowledge about the problem in the form of a kernel over the input space and provide good uncertainty quantification. GPs have been successfully applied to both continuous (Shahriari et al., 2016; Belakaria et al., 2020b;d) and discrete spaces (Oh et al., 2019; Deshwal et al., 2021; Roustant et al., 2020). However, as we discuss in the related work section, there is very limited work on BO methods for optimizing hybrid spaces (Hutter et al., 2010; 2011; Bergstra et al., 2011; Daxberger et al., 2020; Ru et al., 2020). Most of them employ non-GP surrogate models, as it is challenging to define a generic kernel over hybrid spaces that can account for complex interactions between variables. To precisely fill this gap in our knowledge, we propose a novel approach referred to as Hybrid Bayesian Optimization (HyBO).
HyBO builds GP-based surrogate models using diffusion kernels, which are naturally defined over continuous (Kondor & Vert, 2004) and discrete (Kondor & Lafferty, 2002) spaces. We develop a principled approach to construct diffusion kernels over hybrid spaces. This approach employs the general formulation of additive Gaussian process kernels (Duvenaud et al., 2011) to define additive hybrid diffusion kernels. The key idea is to assign a base kernel to each discrete/continuous variable and construct an overall kernel by summing over all possible orders of interaction between these base kernels. This construction procedure has two advantages: 1) it allows us to leverage existing kernels for continuous and discrete spaces; and 2) it can automatically identify the strength of different orders of interaction in a data-driven manner for a given application. A key question about the modeling strength of this hybrid diffusion kernel is whether, given sufficient data, it can approximate any black-box function defined over hybrid spaces. This question has been studied in the past in terms of a property called universality of a kernel (Steinwart, 2001; Micchelli et al., 2006; Sriperumbudur et al., 2011; Mania et al., 2018). We prove that the proposed hybrid diffusion kernel has the universal approximation property by composing a known result for continuous diffusion kernels with a novel result for discrete diffusion kernels. Our theoretical results have broader significance going beyond the BO literature. Our experiments on diverse synthetic benchmarks and real-world applications show that HyBO performs significantly better than state-of-the-art methods. We also empirically demonstrate that the superiority of HyBO's performance is due to the better surrogate model resulting from the proposed additive hybrid diffusion kernel.

Contributions. The key contribution of this paper is the development and evaluation of the HyBO approach for performing BO over hybrid spaces. Specific contributions include:

- Development of a principled approach to construct additive diffusion kernels over hybrid spaces for building GP-based surrogate statistical models.
- Theoretical analysis proving that the additive hybrid diffusion kernel has the universal approximation property.
- Experiments on synthetic and real-world benchmarks showing that HyBO significantly improves over state-of-the-art methods. The code and data are available on the GitHub repository https://github.com/aryandeshwal/HyBO.

2. Problem Setup and Hybrid Bayesian Optimization Approach

Problem Setup. Let X be a hybrid space to be optimized over, where each element x ∈ X is a hybrid structure. Without loss of generality, let each hybrid structure x = (x_d ∈ X_d, x_c ∈ X_c) ∈ X be represented using m discrete variables and n continuous variables, where X_d and X_c stand for the discrete and continuous sub-spaces of X. Let each discrete variable v_d from x_d take candidate values from a set C(v_d), and let each continuous variable v_c from x_c take values from a compact subset of R. In parts of the ML literature, a distinction is made between categorical and discrete variables based on their values: categorical refers to an unordered set (e.g., different types of optimizers for neural network training) and discrete refers to an ordered set (e.g., number of layers in a neural network). We do not make such a distinction because our HyBO approach works for both cases. Concretely, by our definition, a categorical variable is also a discrete variable, i.e., C(v_d) simply gives the candidate values of the categorical variable v_d.
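To make the setup concrete, the following minimal Python sketch shows one way a hybrid space and a hybrid structure from this problem setup could be represented. All names here are illustrative and are not taken from the HyBO codebase:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HybridSpace:
    """Hybrid space X = X_d x X_c with m discrete and n continuous variables."""
    categories: List[List[str]]        # C(v_d): candidate values per discrete variable
    bounds: List[Tuple[float, float]]  # compact subset of R per continuous variable

@dataclass
class HybridStructure:
    """One element x = (x_d, x_c) of the hybrid space."""
    xd: List[int]    # index into categories[i] for each discrete variable
    xc: List[float]  # value within bounds[j] for each continuous variable

# Example: materials design with 3 candidate elements (present/absent)
# and their relative concentrations.
space = HybridSpace(
    categories=[["absent", "present"]] * 3,
    bounds=[(0.0, 1.0)] * 3,
)
x = HybridStructure(xd=[1, 0, 1], xc=[0.6, 0.0, 0.4])
```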
We are given a space of hybrid structures X. We assume an unknown, expensive real-valued objective function F : X → R, which can evaluate each hybrid structure x (also called an experiment) and produce an output y = F(x). For example, in the high-entropy alloy optimization application, x_d corresponds to the presence/absence of metals, x_c corresponds to their relative concentrations, and F(x) corresponds to running a physical lab experiment using additive manufacturing techniques. The main goal is to find a hybrid structure x ∈ X that approximately optimizes F by conducting a limited number of evaluations and observing their outcomes.

Bayesian Optimization Framework. BO is a very efficient framework for solving global optimization problems using black-box evaluations of expensive objective functions (Shahriari et al., 2016). BO algorithms intelligently select the next input for evaluation, guided by a learned statistical model, to quickly direct the search towards optimal inputs. The three key elements of the BO framework are: 1) A statistical model of the true function F(x). The Gaussian process (GP) (Rasmussen & Williams, 2006) is the most popular choice of statistical model. GPs allow us to incorporate domain knowledge by defining an appropriate kernel over the input space and have good uncertainty quantification ability. A GP over a space X is a random process from X to R, characterized by a mean function µ : X → R and a covariance or kernel function k : X × X → R. 2) An acquisition function (AF) to score the utility of evaluating a candidate input x ∈ X based on the statistical model M. Expected improvement (EI) (Mockus et al., 1978) is a prototypical acquisition function. 3) An optimization procedure to select the best-scoring candidate input for evaluation according to the AF.

```
Algorithm 1: HyBO Approach
Input: X (hybrid input space); K(x, x') (kernel over hybrid structures);
       AF(M, x) (acquisition function parameterized by model M and input x);
       F(x) (expensive objective function)
Output: x_best, the best structure found

 1: Initialize D_0 with initial training data; t <- 0
 2: repeat
 3:   Learn statistical model: M_t <- GP-LEARN(D_t, K)
 4:   Compute the next structure to evaluate: x_{t+1} <- argmax_{x in X} AF(M_t, x)
 5:     x_c <- optimize continuous subspace conditioned on assignment to discrete variables x_d
 6:     x_d <- optimize discrete subspace conditioned on assignment to continuous variables x_c
 7:   Evaluate the objective function F at x_{t+1} to get y_{t+1}
 8:   Aggregate the data: D_{t+1} <- D_t U {(x_{t+1}, y_{t+1})}
 9:   t <- t + 1
10: until convergence or maximum iterations
11: x_best <- argmax_{(x_t, y_t) in D} y_t
12: return the best uncovered hybrid structure x_best
```

Hybrid Bayesian Optimization Approach. Our HyBO approach is an instantiation of the generic BO framework, obtained by instantiating the statistical model and the acquisition function optimization procedure for hybrid spaces (see Algorithm 1).

Statistical model over hybrid structures. We employ GPs to build statistical models. To accurately model the complex interactions between discrete and continuous variables, we invoke a principled approach to automatically construct additive diffusion kernels over hybrid structures by leveraging diffusion kernels over continuous and discrete spaces.
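The main loop of Algorithm 1 is compact enough to sketch directly. The following Python schematic assumes the surrogate fitting routine, the acquisition function, and the two subspace optimizers are supplied as callables; all names are illustrative placeholders, not the authors' released implementation:

```python
def hybo_loop(space, f, gp_learn, acq, optimize_discrete, optimize_continuous,
              init_data, max_iters=100):
    """Schematic of Algorithm 1 over a hybrid space (sketch, not the HyBO code).

    gp_learn(D)                     -> GP surrogate fit on data D with the hybrid kernel
    acq(model, xd, xc)              -> acquisition value AF(M, x) for x = (xd, xc)
    optimize_continuous(model, ...) -> best xc given fixed xd (e.g., CMA-ES)
    optimize_discrete(model, ...)   -> best xd given fixed xc (e.g., local search)
    """
    D = list(init_data)  # list of ((xd, xc), y) pairs
    for t in range(max_iters):
        model = gp_learn(D)                                  # line 3
        xd, xc = max(D, key=lambda p: p[1])[0]               # warm start from incumbent
        xc = optimize_continuous(model, acq, xd, space)      # line 5
        xd = optimize_discrete(model, acq, xc, space)        # line 6
        y = f(xd, xc)                                        # line 7: expensive evaluation
        D.append(((xd, xc), y))                              # line 8
    (xd_best, xc_best), y_best = max(D, key=lambda p: p[1])  # line 11
    return (xd_best, xc_best), y_best
```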
Acquisition function optimization. Suppose M_t is the statistical model at iteration t, and let AF(M_t, x) be the acquisition function that needs to be optimized to select the next hybrid structure x_{t+1} for function evaluation. We solve this problem using an iterative procedure that alternately searches over the continuous sub-space (x_c) and the discrete sub-space (x_d). For searching the continuous and discrete sub-spaces, we employ CMA-ES (Hansen, 2016) and hill climbing with restarts, respectively. We observed that one iteration of optimizing the continuous and discrete subspaces gave good results, and the results were not sensitive to additional iterations. All results of HyBO are reported with one iteration.
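A sketch of this alternating acquisition optimization is given below, using the pycma package for the continuous subspace and simple hill climbing with restarts (moves of Hamming distance one) for the discrete subspace. This is an illustrative sketch, not the released implementation; `neg_acq`, `acq`, and `arities` are assumed helper names, and the clipping to [-1, 1] mirrors the continuous-variable normalization described in Section 5.2:

```python
import cma
import numpy as np

def optimize_continuous(neg_acq, xd, xc0, popsize=50, sigma0=0.1):
    """CMA-ES over the continuous subspace with xd held fixed.
    neg_acq(xd, xc) returns the negated acquisition value (CMA-ES minimizes)."""
    es = cma.CMAEvolutionStrategy(xc0, sigma0, {"popsize": popsize, "verbose": -9})
    while not es.stop():
        candidates = es.ask()
        es.tell(candidates, [neg_acq(xd, np.clip(c, -1.0, 1.0)) for c in candidates])
    return np.clip(es.result.xbest, -1.0, 1.0)

def optimize_discrete(acq, xc, arities, rng, n_restarts=20):
    """Hill climbing with random restarts over the discrete subspace, xc fixed.
    arities[i] = number of candidate values C(v_i) for discrete variable i."""
    best_xd, best_val = None, -np.inf
    for _ in range(n_restarts):
        xd = [int(rng.integers(a)) for a in arities]
        improved = True
        while improved:
            improved = False
            for i, a in enumerate(arities):      # try all 1-Hamming-distance moves
                for v in range(a):
                    if v == xd[i]:
                        continue
                    cand = xd.copy()
                    cand[i] = v
                    if acq(cand, xc) > acq(xd, xc):
                        xd, improved = cand, True
        val = acq(xd, xc)
        if val > best_val:
            best_xd, best_val = xd, val
    return best_xd
```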
3. Related Work

The effectiveness of any BO approach over hybrid spaces depends critically on the choice of surrogate model. Prior work explored a variety of surrogate models. SMAC (Hutter et al., 2010) employs a random forest, which may suffer from inaccurate uncertainty quantification due to its frequentist estimation. TPE (Bergstra et al., 2011) models each input dimension independently with a kernel density estimator, which can be restrictive given the large number of input dimensions and the lack of inter-dependency among the models of different input dimensions. MiVaBO (Daxberger et al., 2020) employs a Bayesian linear regressor, defining features that capture the discrete part using the BOCS model (Baptista & Poloczek, 2018; Deshwal et al., 2020a), the continuous part using random Fourier features (Rahimi & Recht, 2007), and pairwise interactions between continuous and discrete features. As the number of parameters increases, it needs many training examples to learn an accurate statistical model. GP-based models overcome the drawbacks of all the above methods. Garrido-Merchán & Hernández-Lobato (2020) provided a solution for BO over discrete spaces using an input-transformed kernel. A recent method referred to as CoCaBO (Ru et al., 2020) employs a sum kernel (summing a Hamming kernel over the discrete subspace and an RBF kernel over the continuous subspace) to learn GP models and showed good results over SMAC and TPE. Unfortunately, the sum kernel captures only limited interactions between discrete and continuous variables. In contrast, our additive hybrid diffusion kernel captures higher-order interactions among hybrid variables, and our data-driven approach can automatically learn the strengths of these interactions from training data. HyperBand (HB) (Li et al., 2017) and its model-based variant BOHB (Falkner et al., 2018) are efficient multi-fidelity methods for hyper-parameter optimization that build on existing methods to optimize hybrid spaces. Our HyBO approach is complementary to this line of work.

Prior methods perform search over discrete and continuous subspaces (e.g., gradient descent) to solve the acquisition function optimization problem. SMAC employs a hand-designed local search procedure. MiVaBO uses integer program solvers to search the discrete subspace. Learning methods to improve the accuracy of search (Deshwal et al., 2020b) are complementary to SMAC, MiVaBO, and HyBO. CoCaBO maintains a separate multi-armed bandit for each discrete variable and employs the EXP3 algorithm (Auer et al., 2002) to select their values independently. This method does not exploit dependencies among variables, which can be detrimental to accuracy. TPE samples from the learned density estimator to pick the best input for evaluation.

4. Diffusion Kernels over Hybrid Structures

We first provide the details of the key mathematical and computational tools needed to construct hybrid diffusion kernels. Next, we describe the algorithm to automatically construct additive diffusion kernels over hybrid structures. Finally, we present theoretical analysis showing that hybrid diffusion kernels satisfy the universal approximation property.

4.1. Key Mathematical and Computational Tools

Diffusion kernels (Kondor & Vert, 2004; Lafferty & Lebanon, 2005) are inspired by the diffusion processes occurring in physical systems such as heat and gases. The mathematical formulation of these processes naturally lends itself to kernels over both continuous and discrete spaces (e.g., sequences, trees, and graphs).

Diffusion kernel over continuous spaces. The popular radial basis function (RBF) kernel (also known as the Gaussian kernel) (Kondor & Vert, 2004) is defined as follows:

$$k(x, x') = \frac{1}{2\pi\sigma^2} \, e^{-\|x - x'\|^2 / 2\sigma^2} \tag{4.1}$$

where σ is the length-scale hyper-parameter. This kernel is the solution of the continuous diffusion (heat) equation below:

$$\frac{\partial}{\partial t} k_{x_0}(x, t) = \Delta \, k_{x_0}(x, t) \tag{4.2}$$

where $\Delta = \sum_{i} \partial^2/\partial x_i^2$ is the second-order differential operator known as the Laplacian operator, and $k_{x_0}(x, t) = k(x, x')$ with $x' = x_0$ and $t = \sigma^2/2$.

4.2. Diffusion Kernel over Discrete Spaces

The idea of diffusion kernels for continuous spaces is extended to discrete structures (e.g., sequences, graphs) (Kondor & Lafferty, 2002) by utilizing the spectral properties of a graph representation of the discrete space. A discrete analogue of Equation 4.2 can be constructed by employing the matrix exponential of a graph and the graph Laplacian operator L as given below:

$$\frac{\partial}{\partial \beta} e^{\beta L} = L \, e^{\beta L} \tag{4.3}$$

where L is the graph Laplacian of a suitable graph representation of the discrete input space and β is a hyper-parameter of the resulting diffusion kernel, analogous to the length-scale parameter σ of the RBF kernel. The solution of Equation 4.3 defines a positive-definite kernel for discrete spaces known as the discrete diffusion kernel.

According to Equation 4.3, one important ingredient for defining diffusion kernels on discrete spaces is a suitable graph representation of the discrete space. One such representation was proposed in recent work (Oh et al., 2019). In this case, the entire discrete space is represented by a combinatorial graph G. Each node in the vertex set V of the graph corresponds to one candidate assignment of all the discrete variables. Two nodes are connected by an edge if the Hamming distance between the corresponding assignments of all discrete variables is exactly one. The diffusion kernel over this representation is defined as follows:

$$k(V, V) = \exp(-\beta L(G)) \tag{4.4}$$

$$k(V, V) = \Phi \, \exp(-\beta \Pi) \, \Phi^T \tag{4.5}$$

where $\Phi = [\phi_1, \ldots, \phi_{|V|}]$ is the eigenvector matrix and $\Pi = \mathrm{diag}(\pi_1, \ldots, \pi_{|V|})$ is the eigenvalue matrix, with $\phi_i$ and $\pi_i$ the eigenvectors and eigenvalues of the graph Laplacian L(G), respectively. Although this graph representation contains an exponential number of nodes, Oh et al. (2019) compute the graph Laplacian L(G) by decomposing it over the Cartesian product (□) of m sub-graphs (G_1, G_2, ..., G_m), where m is the number of discrete variables and each sub-graph G_i represents one variable individually. This algorithmic approach has time complexity $O(\sum_{i=1}^{m} C(v_i)^3)$, where $C(v_i)$ is the number of candidate values (arity) of the i-th discrete variable. However, this method is computationally expensive, especially for problems with large arities. To avoid this computational challenge, we leverage a prior observation of Kondor & Lafferty (2002), which provides a closed form of the discrete diffusion kernel by exploiting the structure of the above combinatorial graph representation. We explain this observation for binary variables {0, 1}. From the definition in Equation 4.4, the discrete diffusion kernel over a single-dimensional binary input is:

$$k(x_d, x'_d) = \begin{cases} 1 + e^{-2\beta} & \text{if } x_d = x'_d \\ 1 - e^{-2\beta} & \text{if } x_d \neq x'_d \end{cases} \tag{4.6}$$

Since the kernel over m > 1 dimensions is defined via the Kronecker product over the m dimensions, the above expression extends to the multi-dimensional setting as:

$$k(x_d, x'_d) = \prod_{i=1}^{m} \left( \frac{1 - e^{-2\beta_i}}{1 + e^{-2\beta_i}} \right)^{\delta(x_d^i,\, x_d'^i)} \tag{4.7}$$

where $\delta(x_d^i, x_d'^i) = 0$ if $x_d^i$ is equal to $x_d'^i$ and 1 otherwise; the subscript d denotes that the variables are discrete, and the superscript i refers to the i-th dimension of the discrete subspace. For general discrete spaces (with arbitrary numbers of categories), we follow the same observation (Kondor & Lafferty, 2002) and use the following constant-time expression for the discrete diffusion kernel in our method:

$$k(x_d, x'_d) = \prod_{i=1}^{m} \left( \frac{1 - e^{-C(v_i)\beta_i}}{1 + (C(v_i) - 1)\, e^{-C(v_i)\beta_i}} \right)^{\delta(x_d^i,\, x_d'^i)} \tag{4.8}$$
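The closed form in Equation 4.8 is a product of per-dimension factors and costs O(m) per kernel evaluation. A minimal NumPy sketch follows (illustrative, not the released code); the binary case of Equation 4.7 is recovered by setting C(v_i) = 2:

```python
import numpy as np

def discrete_diffusion_kernel(xd, xd_prime, betas, arities):
    """Closed-form discrete diffusion kernel of Equation 4.8.

    xd, xd_prime : integer arrays of shape (m,), categorical assignments
    betas        : per-dimension diffusion hyper-parameters beta_i > 0
    arities      : C(v_i), the number of candidate values per dimension
    """
    xd, xd_prime = np.asarray(xd), np.asarray(xd_prime)
    betas, arities = np.asarray(betas, float), np.asarray(arities, float)
    ratio = (1.0 - np.exp(-arities * betas)) / \
            (1.0 + (arities - 1.0) * np.exp(-arities * betas))
    delta = (xd != xd_prime).astype(float)  # delta = 1 where dimensions differ
    return float(np.prod(ratio ** delta))   # equal dimensions contribute a factor of 1

# Binary example (arities of 2 reduce to the form of Equation 4.7):
k = discrete_diffusion_kernel([0, 1, 1], [0, 0, 1],
                              betas=[0.5, 0.5, 0.5], arities=[2, 2, 2])
```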
4.3. Diffusion Kernels over Hybrid Spaces

Unifying view of diffusion kernels. Our choice of diffusion kernels is motivated by the fact that they are naturally defined for both discrete and continuous spaces. In fact, there is a nice transition of the diffusion kernel from discrete to continuous space, achieved by a continuous-space limit operation. More generally, both the discrete and the continuous diffusion kernel can be seen as continuous limit operations on two parameters of a random walk: time and space. For illustration, consider a random walk on an evenly spaced grid where the mean time between jumps is t and the mean gap between two points is s. As t → 0, the resulting continuous-time, discrete-space random walk generates the diffusion kernel on discrete spaces. Additionally, in the limit of the grid spacing s going to zero, the kernel approaches the continuous diffusion kernel.

Algorithm to construct hybrid diffusion kernels. We exploit the general formulation of additive Gaussian process kernels (Duvenaud et al., 2011) to define an additive hybrid diffusion kernel over hybrid spaces. The key idea is to assign a base kernel to each input dimension i ∈ {1, 2, ..., m + n}, where m and n stand for the number of discrete and continuous variables in the hybrid space X, and to construct an overall kernel by summing all possible orders of interaction (up to m + n) between these base kernels. In our case, the RBF kernel and the discrete diffusion kernel act as the base kernels for the continuous and discrete input dimensions, respectively. The p-th order of interaction (called the p-th additive kernel) is defined as given below:

$$K_p(x, x') = \theta_p^2 \sum_{1 \le i_1 < i_2 < \cdots < i_p \le m+n} \; \prod_{d=1}^{p} k_{i_d}(x_{i_d}, x'_{i_d}) \tag{4.9}$$

where $\theta_p^2$ is a hyper-parameter capturing the strength of the p-th order of interaction, and each $k_{i_d}$ is the base kernel for dimension $i_d$. The overall additive hybrid diffusion kernel sums over all orders of interaction:

$$K_{\mathrm{HYB}}(x, x') = \sum_{p=1}^{m+n} K_p(x, x') \tag{4.10}$$
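Duvenaud et al. (2011) show that the sums over all orders of interaction in Equations 4.9 and 4.10 need not be computed by enumerating subsets: the p-th order term is the p-th elementary symmetric polynomial of the per-dimension base kernel values, computable via the Newton-Girard identities. A NumPy sketch of this computation follows (illustrative, not the released HyBO code):

```python
import numpy as np

def additive_hybrid_kernel(z, thetas):
    """Additive kernel of Equations 4.9-4.10 via the Newton-Girard identities.

    z      : array of shape (D,), the D = m + n per-dimension base kernel values
             k_i(x_i, x'_i) (discrete diffusion or RBF, as appropriate)
    thetas : array of shape (D,), the order weights theta_p for p = 1..D
    Returns K_HYB(x, x') = sum_p theta_p^2 * e_p(z), where e_p is the p-th
    elementary symmetric polynomial (the sum over all order-p interactions).
    """
    z = np.asarray(z, float)
    D = len(z)
    s = np.array([np.sum(z ** k) for k in range(1, D + 1)])  # power sums s_k
    e = np.zeros(D + 1)
    e[0] = 1.0                                               # e_0 = 1 by convention
    for p in range(1, D + 1):                                # Newton-Girard recursion
        e[p] = sum((-1) ** (k - 1) * e[p - k] * s[k - 1]
                   for k in range(1, p + 1)) / p
    return float(np.sum(np.asarray(thetas, float) ** 2 * e[1:]))
```

This reduces the cost of evaluating all m + n interaction orders from exponential (subset enumeration) to O((m + n)^2) arithmetic operations per kernel evaluation.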
4.4. Theoretical Analysis

For the combinatorial graph over the binary discrete space $\{0,1\}^m$, the eigenvectors of the graph Laplacian are the Walsh functions $\phi_w(x_d) = (-1)^{w^T x_d}$, so the eigendecomposition in Equation 4.5 expands the discrete diffusion kernel as

$$k(x_d, x'_d) = \sum_{i} (-1)^{w_i^T x_d} \, \exp(-\beta \pi_i) \, (-1)^{w_i^T x'_d}$$

where the inner product on the left-hand side follows from the reproducing property (Steinwart & Christmann, 2008) of a kernel k. Therefore, the functions $k(x_d, \cdot)$ in the RKHS $\mathcal{H}_k$ of the discrete diffusion kernel are of the form $\{(-1)^{w_j^T x_d};\; w_j \in [0, 2^m - 1]\}$, which is the well-known Walsh basis (Verel et al., 2018) for pseudo-Boolean functions. Therefore, any pseudo-Boolean function f can be represented by a linear combination of functions in $\mathcal{H}_k$, since they form a basis. This gives the universality of the discrete diffusion kernel; combined with the known universality of the RBF kernel (Micchelli et al., 2006), these two facts (Propositions 1 and 2) are used to prove the following theorem.

Theorem 4.1. Let X_c be a compact and non-empty subset of R^n and κ_c be the RBF kernel on X_c. Let X_d be the discrete space {0, 1}^m and κ_d be the discrete diffusion kernel on X_d. The additive hybrid diffusion kernel defined in Equation 4.10, instantiated with κ_c and κ_d for the continuous and discrete spaces respectively, is a universal kernel for the hybrid space X_c × X_d.

Proof. According to Equation 4.9, any p-th order interaction term in the additive hybrid diffusion kernel is defined as $\prod_{d=1}^{p} k_{i_d}(x_{i_d}, x'_{i_d})$. Therefore, since each $k_{i_d}$ is universal over its corresponding dimension $X_{i_d}$ (which is true from Propositions 1 and 2), we need to show that the product $\prod_{d=1}^{p} k_{i_d}(x_{i_d}, x'_{i_d})$ is universal over the product of dimensions $X_{i_1} \times X_{i_2} \times \cdots \times X_{i_p}$. This was proven by Lemma A.5 in (Steinwart et al., 2016). We provide the lemma here for completeness.

Lemma 4.2 (from Steinwart et al., 2016). Let X ⊆ R^m be a compact and non-empty subset, let I, J ⊆ {1, ..., m} be non-empty, and let k_I and k_J be universal kernels on X_I and X_J, respectively. Then k_I ⊗ k_J, defined by k_I ⊗ k_J(x, x') := k_I(x_I, x'_I) · k_J(x_J, x'_J) for all x, x' ∈ X_{I∪J}, is a universal kernel on X_{I∪J}.

Since both the continuous and discrete spaces are compact and Lemma 4.2 holds for arbitrary compact spaces, each order of interaction is universal with respect to its corresponding ambient dimension $X_{i_1} \times X_{i_2} \times \cdots \times X_{i_p}$. In particular, this holds for the (m + n)-th order of interaction, which is defined over the entire hybrid space X_c × X_d, which proves the theorem.
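The composition behind Theorem 4.1 can be summarized in a single chain. The final step uses the standard fact that adding positive semi-definite kernels to a universal kernel preserves universality (the RKHS of a sum of kernels contains the RKHS of each summand); this step is implicit in the proof above and is restated here only for clarity, not as an additional result:

```latex
\begin{align*}
  &\kappa_c \text{ universal on } X_c,\;\; \kappa_d \text{ universal on } X_d
      && \text{(Propositions 1 and 2)}\\
  \Rightarrow\;& \textstyle\prod_{i=1}^{m+n} k_i(x_i, x_i') \text{ universal on } X_c \times X_d
      && \text{(Lemma 4.2, applied } m+n-1 \text{ times)}\\
  \Rightarrow\;& K_{\mathrm{HYB}} = \textstyle\sum_{p=1}^{m+n} K_p \text{ universal on } X_c \times X_d
      && (K_{m+n} \text{ universal; each other } K_p \text{ is PSD})
\end{align*}
```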
5. Experiments and Results

We first describe our experimental setup. Next, we discuss experimental results along different dimensions.

5.1. Benchmark Domains

Synthetic benchmark suite. bbob-mixint is a challenging mixed-integer black-box optimization benchmark suite (Tušar et al., 2019) that contains problems of varying difficulty; it is available via the COCO platform (https://github.com/numbbo/coco). We ran experiments with multiple problems from this suite, but for brevity we present canonical results on four benchmarks (shown in Table 1), noting that all the results show similar trends.

| Name | Name in the suite | Dimension |
| --- | --- | --- |
| Function 1 | f001_i01_d10 | 10 (8d, 2c) |
| Function 2 | f001_i02_d10 | 10 (8d, 2c) |
| Function 3 | f001_i01_d20 | 20 (16d, 4c) |
| Function 4 | f001_i02_d20 | 20 (16d, 4c) |

Table 1. Benchmark problems from the bbob-mixint suite.

Real-world benchmarks. We employ six diverse real-world domains. The complete details (function definitions, bounds for input variables, etc.) are in the Appendix.

1) Pressure vessel design optimization. This mechanical design problem (Kannan & Kramer, 1994; Tanabe & Ishibuchi, 2020) involves minimizing the total cost of a cylindrical pressure vessel. There are two discrete variables (thickness of the shell and of the head of the pressure vessel) and two continuous variables (inner radius and length of the cylindrical section).

2) Welded beam design optimization. The goal in this material engineering domain (Deb & Goyal, 1996; Reklaitis et al., 1983) is to design a welded beam while minimizing the overall cost of fabrication. There are six variables: two discrete (type of welding configuration and bulk material of the beam) and four continuous (weld thickness, welded joint length, beam width, and beam thickness).

3) Speed reducer design optimization. In this domain from NASA (Cagnina et al., 2008), the goal is to minimize the weight of a speed reducer defined over seven input variables: one discrete (number of teeth on the pinion) and six continuous (face width, teeth module, lengths of the shafts between bearings, and diameters of the shafts).

4) Optimizing control for robot pushing. This is a 14-dimensional control parameter tuning problem in which a robot tries to push objects toward a goal location (Wang et al., 2018). We consider a hybrid version of this problem by discretizing the ten input variables corresponding to the location of the robot and the number of simulation steps. The remaining four parameters, corresponding to rotation, are kept continuous.

5) Calibration of an environmental model. The problem of calibration and uncertainty analysis of expensive environmental models is very important in scientific domains (Bliznyuk et al., 2008; Astudillo & Frazier, 2019). There are four input variables (one discrete and three continuous).

6) Hyper-parameter optimization. We consider hyper-parameter tuning of a neural network model on a diverse set of benchmarks (Gijsbers et al., 2019): five discrete hyper-parameters (hidden layer size, activation type, batch size, type of learning rate, and whether or not to use early stopping) and three continuous ones (learning rate initialization, momentum parameter, and regularization coefficient).

5.2. Experimental Setup

Baseline methods. We compare HyBO with the following strong baselines: 1) CoCaBO, a state-of-the-art method (Ru et al., 2020); 2) SMAC (Hutter et al., 2010); 3) TPE (Bergstra et al., 2011); 4) HyBO w/o Marg, a special case of HyBO in which we do not perform marginalization over the hyper-parameters of the hybrid diffusion kernel; and 5) Cont-BO, which treats discrete variables as continuous and performs standard BO over continuous spaces (for both modeling and acquisition function optimization). We did not include MiVaBO (Daxberger et al., 2020) as there was no publicly available implementation (Daxberger, personal communication with the lead author).

Configuration of algorithms and baselines. We configure HyBO as follows. We employ a uniform prior for the length-scale hyper-parameter σ of the RBF kernel. A horseshoe prior is used for the β hyper-parameters of the discrete diffusion kernel (Equation 4.8) and the hyper-parameters θ of the additive diffusion kernel (Equation 4.9). We employ expected improvement (Mockus et al., 1978) as the acquisition function. For acquisition function optimization, we perform iterative search over the continuous and discrete sub-spaces as shown in Algorithm 1. For optimizing the discrete subspace, we run local search with 20 restarts. We normalize each continuous variable to the range [-1, 1] and employ the CMA-ES algorithm (https://github.com/CMA-ES/pycma) for optimizing the continuous subspace. We found that the results obtained by CMA-ES were not sensitive to its hyper-parameters; specifically, we fixed the population size to 50 and the initial standard deviation to 0.1 in all our experiments. We employed the open-source Python implementations of CoCaBO (https://github.com/rubinxin/CoCaBO_code), SMAC (https://github.com/automl/SMAC3), and TPE (https://github.com/hyperopt/hyperopt). All methods are initialized with the same random hybrid structures. We replicated all experiments for 25 different random seeds and report the mean and two times the standard error in all our figures.

Evaluation metric. We use the best function value achieved after a given number of iterations (function evaluations) as the metric to evaluate all methods. The method that uncovers high-performing hybrid structures with fewer function evaluations is considered better.
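As configured above, HyBO scores candidates with expected improvement. For reference, the standard EI computation for a GP posterior at a candidate input is sketched below (this is the generic textbook formula, not HyBO-specific code):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.0):
    """Standard EI for maximization (Mockus et al., 1978).

    mu, sigma : GP posterior mean and standard deviation at a candidate x
    y_best    : best function value observed so far
    xi        : optional exploration offset (0 recovers the textbook formula)
    """
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive variance
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```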
5.3. Results and Discussion

Results on the mixed-integer benchmark suite. Figure 1 shows canonical results on the four benchmarks from bbob-mixint listed in Table 1, noting that all results show similar trends. HyBO and its variant HyBO-Round perform significantly better and converge much faster than all the other baselines. One key reason for this behavior is that the hybrid diffusion kernel accounts for higher-order interactions between variables. Cont-BO performs the worst among all the methods. This shows that simply treating discrete variables as continuous is sub-optimal and emphasizes the importance of modeling the structure in the discrete variables.

[Figure 1. Results of HyBO and state-of-the-art baselines on the bbob-mixint benchmark suite for the functions shown in Table 1. Each panel plots the best function value vs. the number of iterations (0 to 200) for Cont-BO, TPE, SMAC, CoCaBO, HyBO w/o Marg, and HyBO.]

Ablation results for statistical models. To understand the reasons for the better performance of HyBO, we compare the performance of its surrogate model based on hybrid diffusion kernels with those of CoCaBO and SMAC. We perform the following experiment. We constructed a testing dataset (pairs of hybrid structures and their function evaluations) of size 200 via uniform random sampling. We compute the mean absolute error (MAE) of the three surrogate models as a function of the training set size. The results are shown in Figure 2, which depicts the mean and two times the standard error of the MAE over 25 random testing datasets. HyBO clearly has very low error compared to CoCaBO and SMAC on Functions 1 and 2. Although HyBO has an MAE similar to CoCaBO's at the start on Functions 3 and 4, its MAE rapidly decreases as the training set size increases, which is not the case for the other two methods. This experiment provides strong empirical evidence that the proposed surrogate model in HyBO models hybrid spaces more accurately than CoCaBO and SMAC.

[Figure 2. Mean absolute test error with increasing training set size (100 to 200) on the bbob-mixint synthetic benchmarks, for SMAC, CoCaBO, and HyBO.]

Ablation results for marginalization in HyBO. Bayesian treatment of hyper-parameters (marginalization) is one key component of our proposed HyBO method. However, to decouple the efficacy of the additive diffusion kernel from the use of marginalization, we performed experiments using HyBO without marginalization (HyBO w/o Marg in the figures). As evident from Figure 1, HyBO w/o Marg finds better solutions than all the baselines, albeit with slower convergence, which is improved by adding marginalization.

Results for real-world domains. Figure 3 shows a comparison of the HyBO approach with the baseline methods on all real-world domains except hyper-parameter optimization.

[Figure 3. Results comparing the proposed HyBO approach with state-of-the-art baselines (TPE, SMAC, CoCaBO, HyBO) on the real-world benchmarks: pressure vessel design, welded beam design, environment model, speed reducer design, and robot pushing. Each panel plots the best (log) function value vs. the number of iterations (0 to 100; speed reducer values scaled by 1e3).]

We make the following observations.
1) HyBO consistently performs better than all the baselines on all these benchmarks. 2) Even on benchmarks such as speed reducer design and welded beam design, where HyBO finds a similar solution to CoCaBO, it does so with much faster convergence. 3) CoCaBO performs reasonably well on these benchmarks, but its performance is worse than HyBO's, demonstrating that its sum kernel (with a Hamming kernel for discrete spaces) is less powerful than the hybrid diffusion kernel of HyBO. 4) TPE has the worst performance on most benchmarks, possibly a direct result of its drawback of not modeling the interactions between input dimensions. 5) SMAC performs poorly on all the benchmarks, potentially due to poor uncertainty estimates from its random forest surrogate model.

Table 2 shows the final accuracy (mean and standard error) obtained by all methods, including HyBO, on the task of tuning neural network models for six different datasets (BO curves are similar for all methods). HyBO produces comparable or better results than the baseline methods.

| Dataset | Cont-BO | TPE | SMAC | CoCaBO | HyBO |
| --- | --- | --- | --- | --- | --- |
| blood transfusion | 76.089 (0.325) | 76.711 (0.432) | 76.658 (0.418) | 76.978 (0.455) | 77.819 (0.463) |
| kc1 | 85.185 (0.129) | 85.637 (0.069) | 85.453 (0.087) | 85.415 (0.099) | 85.466 (0.116) |
| vehicle | 80.501 (1.120) | 80.913 (1.051) | 83.669 (1.013) | 82.882 (1.222) | 86.104 (0.894) |
| segment | 87.253 (0.995) | 87.792 (0.537) | 89.986 (0.692) | 89.639 (0.727) | 91.433 (0.277) |
| cnae | 95.370 (0.103) | 95.691 (0.082) | 95.605 (0.063) | 95.679 (0.108) | 95.644 (0.135) |
| jasmine | 77.317 (0.216) | 77.893 (0.071) | 77.460 (0.189) | 77.513 (0.202) | 77.121 (0.172) |

Table 2. Results on the task of hyper-parameter tuning of neural network models. Bold numbers signify statistical significance.

Computational cost analysis. We compare the runtime of the different algorithms, including HyBO. All experiments were run on an AMD EPYC 7451 24-core machine. Table 3 shows the average wall-clock time (in seconds) per BO iteration. HyBO is relatively expensive compared to the baseline methods. However, for real-world science and engineering applications, minimizing the cost of the physical resources needed to perform each evaluation (e.g., conducting an additive manufacturing experiment for designing materials such as alloys) is the most important metric; the computational cost of selecting inputs for evaluation is a secondary concern. HyBO spends more time selecting inputs for evaluation in order to minimize the number of function evaluations needed to uncover better structures.

| Benchmark | TPE | SMAC | CoCaBO | HyBO |
| --- | --- | --- | --- | --- |
| Synthetic Function 1 | 0.012 | 2.34 | 2.30 | 50 |
| Synthetic Function 2 | 0.012 | 0.98 | 1.31 | 50 |
| Synthetic Function 3 | 0.026 | 2.99 | 3.18 | 180 |
| Synthetic Function 4 | 0.026 | 1.98 | 2.96 | 180 |
| Pressure Vessel Design | 0.003 | 0.34 | 0.85 | 20 |
| Welded Beam Design | 0.004 | 0.64 | 1.02 | 40 |
| Speed Reducer Design | 0.006 | 1.38 | 0.94 | 40 |
| Push Robot | 0.017 | 1.94 | 1.70 | 90 |
| Environment Model | 0.005 | 0.31 | 0.50 | 40 |

Table 3. Computational cost in average wall-clock time (seconds) per BO iteration.

We provide a finer-grained analysis of the HyBO runtime in Table 4. Each kernel evaluation, even with all orders of interaction, is very fast. The overall runtime is spent on two major components: a) sampling from the posterior distributions of the hyper-parameters using slice sampling; and b) acquisition function optimization (AFO) using CMA-ES plus local search. We can reduce the sampling time by using HyBO without marginalization, which shows slightly worse performance but takes only 10 percent of the sampling time of full HyBO.
| Orders of interaction | HyBO iteration | AFO | Sampling | Kernel eval. |
| --- | --- | --- | --- | --- |
| 2 | 62 | 46 | 16 | 0.005 |
| 5 | 68 | 50 | 18 | 0.006 |
| 10 | 102 | 68 | 34 | 0.010 |
| 20 (HyBO) | 180 | 114 | 66 | 0.020 |

Table 4. Average runtime (seconds) for different orders of interaction within the hybrid kernel for synthetic Function 3.

6. Conclusions

We studied a novel Bayesian optimization approach referred to as HyBO for optimizing hybrid spaces using Gaussian process based surrogate models. We presented a principled approach to construct hybrid diffusion kernels by combining diffusion kernels defined over continuous and discrete sub-spaces in a tractable and flexible manner to capture the interactions between discrete and continuous variables. We proved that additive hybrid kernels have the universal approximation property. Our experimental results on diverse synthetic and real-world benchmarks show that HyBO performs significantly better than state-of-the-art methods.

Acknowledgements. This research is supported by NSF grants IIS-1845922, OAC-1910213, and CNS-1955353.

References

Astudillo, R. and Frazier, P. Bayesian optimization of composite functions. Volume 97 of Proceedings of Machine Learning Research, pp. 354-363. PMLR, 2019.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

Baptista, R. and Poloczek, M. Bayesian optimization of combinatorial structures. In Proceedings of the 35th International Conference on Machine Learning, pp. 462-471, 2018.

Belakaria, S., Deshwal, A., and Doppa, J. R. Max-value entropy search for multi-objective Bayesian optimization. In NeurIPS, 2019.

Belakaria, S., Deshwal, A., and Doppa, J. R. Max-value entropy search for multi-objective Bayesian optimization with constraints. CoRR, abs/2009.01721, 2020a. URL https://arxiv.org/abs/2009.01721.

Belakaria, S., Deshwal, A., and Doppa, J. R. Multi-fidelity multi-objective Bayesian optimization: An output space entropy search approach. In AAAI Conference on Artificial Intelligence (AAAI), 2020b.

Belakaria, S., Deshwal, A., and Doppa, J. R. Uncertainty-aware search framework for multi-objective Bayesian optimization with constraints. CoRR, abs/2008.07029, 2020c. URL https://arxiv.org/abs/2008.07029.

Belakaria, S., Deshwal, A., and Doppa, J. R. Information-theoretic multi-objective Bayesian optimization with continuous approximations. CoRR, abs/2009.05700, 2020d. URL https://arxiv.org/abs/2009.05700.

Belakaria, S., Deshwal, A., Jayakodi, N. K., and Doppa, J. R. Uncertainty-aware search framework for multi-objective Bayesian optimization. In AAAI, 2020e.

Belakaria, S., Jackson, D., Cao, Y., Doppa, J. R., and Lu, X. Machine learning enabled fast multi-objective optimization for electrified aviation power system design. In IEEE Energy Conversion Congress and Exposition (ECCE), 2020f.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, pp. 2546-2554, 2011.

Bliznyuk, N., Ruppert, D., Shoemaker, C., Regis, R., Wild, S., and Mugunthan, P. Bayesian calibration and uncertainty analysis for computationally expensive models using optimization and radial basis function approximation. Journal of Computational and Graphical Statistics, 17(2):270-294, 2008.

Cagnina, L., Esquivel, S., and Coello, C. Solving engineering optimization problems with the simple constrained particle swarm optimizer. Informatica, 32:319-326, 2008.
Chung, F. R. and Graham, F. C. Spectral Graph Theory. Number 92. American Mathematical Society, 1997.

Daxberger, E. Personal communication about the MiVaBO implementation and code.

Daxberger, E., Makarova, A., Turchetta, M., and Krause, A. Mixed-variable Bayesian optimization. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), pp. 2633-2639, 2020.

Deb, K. and Goyal, M. A combined genetic adaptive search (GeneAS) for engineering design. Computer Science and Informatics, 26:30-45, 1996.

Deshwal, A., Belakaria, S., and Doppa, J. R. Scalable combinatorial Bayesian optimization with tractable statistical models. CoRR, abs/2008.08177, 2020a. URL https://arxiv.org/abs/2008.08177.

Deshwal, A., Belakaria, S., Doppa, J. R., and Fern, A. Optimizing discrete spaces via expensive evaluations: A learning-to-search framework. In AAAI Conference on Artificial Intelligence (AAAI), 2020b.

Deshwal, A., Belakaria, S., and Doppa, J. R. Mercer features for efficient combinatorial Bayesian optimization. In AAAI, 2021.

Duvenaud, D. K., Nickisch, H., and Rasmussen, C. E. Additive Gaussian processes. In Advances in Neural Information Processing Systems 24, pp. 226-234, 2011.

Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and efficient hyper-parameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1436-1445, 2018.

Frazier, P. I. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

Garrido-Merchán, E. C. and Hernández-Lobato, D. Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes. Neurocomputing, 380:20-35, 2020.

Gijsbers, P., LeDell, E., Poirier, S., Thomas, J., Bischl, B., and Vanschoren, J. An open source AutoML benchmark. arXiv preprint arXiv:1907.00909, 2019. Accepted at the AutoML Workshop at ICML 2019.

Greenhill, S., Rana, S., Gupta, S., Vellanki, P., and Venkatesh, S. Bayesian optimization for adaptive experimental design: A review. IEEE Access, 8:13937-13948, 2020.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723-773, 2012.

Hansen, N. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration (extended version). Technical Report TR-2010-10, University of British Columbia, Department of Computer Science, 2010. Available online: http://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507-523, 2011.

Kannan, B. K. and Kramer, S. N. An augmented Lagrange multiplier based method for mixed integer discrete continuous optimization and its applications to mechanical design. Journal of Mechanical Design, 116(2):405-411, 1994.

Kondor, R. and Lafferty, J. Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th International Conference on Machine Learning, pp. 315-322, 2002.

Kondor, R. and Vert, J.-P. Diffusion kernels. In Kernel Methods in Computational Biology, pp. 171-192, 2004.

Lafferty, J. and Lebanon, G. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research, 6(1):129-163, 2005.
Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyper-parameter optimization. Journal of Machine Learning Research (JMLR), 18:185:1-185:52, 2017.

Mania, H., Ramdas, A., Wainwright, M. J., Jordan, M. I., and Recht, B. On kernel methods for covariates that are rankings. Electronic Journal of Statistics, 12(2):2537-2577, 2018.

Micchelli, C. A., Xu, Y., and Zhang, H. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651-2667, 2006.

Mockus, J., Tiesis, V., and Zilinskas, A. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117-129, 1978.

Neal, R. M. Slice sampling. Annals of Statistics, 31(3):705-741, 2003.

Oh, C., Tomczak, J., Gavves, E., and Welling, M. Combinatorial Bayesian optimization using the graph Cartesian product. In Advances in Neural Information Processing Systems, pp. 2910-2920, 2019.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pp. 1177-1184, 2007.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

Reklaitis, G. V., Ravindran, A., and Ragsdell, K. M. Engineering Optimization: Methods and Applications. Wiley, New York, 1983.

Roustant, O., Padonou, E., Deville, Y., Clément, A., Perrin, G., Giorla, J., and Wynn, H. Group kernels for Gaussian process metamodels with categorical inputs. SIAM/ASA Journal on Uncertainty Quantification, 8(2):775-806, 2020.

Ru, B., Alvi, A. S., Nguyen, V., Osborne, M. A., and Roberts, S. J. Bayesian optimisation over multiple continuous and categorical inputs. In International Conference on Machine Learning (ICML), 2020.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(7), 2011.

Steinwart, I. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2(Nov):67-93, 2001.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, 2008.

Steinwart, I., Thomann, P., and Schmid, N. Learning with hierarchical Gaussian kernels. arXiv preprint arXiv:1612.00824, 2016.

Tanabe, R. and Ishibuchi, H. An easy-to-use real-world multi-objective optimization problem suite. Applied Soft Computing, 89:106078, 2020.

Tušar, T., Brockhoff, D., and Hansen, N. Mixed-integer benchmark problems for single- and bi-objective optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 718-726, 2019.

Verel, S., Derbel, B., Liefooghe, A., Aguirre, H., and Tanaka, K. A surrogate model based on Walsh decomposition for pseudo-Boolean functions. In International Conference on Parallel Problem Solving from Nature, pp. 181-193. Springer, 2018.
Wang, Z., Gehring, C., Kohli, P., and Jegelka, S. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pp. 745-754, 2018.

Zhou, Z., Belakaria, S., Deshwal, A., Hong, W., Doppa, J. R., Pande, P. P., and Heo, D. Design of multi-output switched-capacitor voltage regulator via machine learning. In DATE, 2020.