# Mixed-Variable Bayesian Optimization

Erik Daxberger*,1,2, Anastasia Makarova*,3, Matteo Turchetta2,3 and Andreas Krause3
1Department of Engineering, University of Cambridge
2Max Planck Institute for Intelligent Systems, Tübingen
3Department of Computer Science, ETH Zurich
ead54@cam.ac.uk, {anmakaro,matteo.turchetta,krausea}@inf.ethz.ch

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

Abstract

The optimization of expensive-to-evaluate, black-box, mixed-variable functions, i.e., functions that have continuous and discrete inputs, is a difficult yet pervasive problem in science and engineering. In Bayesian optimization (BO), special cases of this problem that consider fully continuous or fully discrete domains have been widely studied. However, few methods exist for mixed-variable domains, and none of them can handle the discrete constraints that arise in many real-world applications. In this paper, we introduce MIVABO, a novel BO algorithm for the efficient optimization of mixed-variable functions that combines a linear surrogate model based on expressive feature representations with Thompson sampling. We propose an effective method to optimize its acquisition function, a challenging problem for mixed-variable domains, making MIVABO the first BO method that can handle complex constraints over the discrete variables. Moreover, we provide the first convergence analysis of a mixed-variable BO algorithm. Finally, we show that MIVABO is significantly more sample-efficient than state-of-the-art mixed-variable BO algorithms on several hyperparameter tuning tasks, including the tuning of deep generative models.

1 Introduction

Bayesian optimization (BO) [Močkus, 1975] is a well-established paradigm to optimize costly-to-evaluate, complex, black-box objectives that has been successfully applied to many scientific domains.
*Equal contribution. Work done while at ETH Zurich.

Most of the existing BO literature focuses on objectives with purely continuous domains, such as those arising in the tuning of continuous hyperparameters of machine learning algorithms, recommender systems, and preference learning [Shahriari et al., 2016]. More recently, problems with purely discrete domains, such as food safety control and model sparsification in multi-component systems [Baptista and Poloczek, 2018], have been considered. However, many real-world optimization problems in science and engineering are of mixed-variable nature, involving both continuous and discrete input variables, and exhibit complex constraints. For example, tuning the hyperparameters of a convolutional neural network involves both continuous variables, e.g., learning rate and momentum, and discrete ones, e.g., kernel size, stride, and padding. Moreover, these hyperparameters impose validity constraints, as some combinations of kernel size, stride, and padding define invalid networks. Further examples of mixed-variable, potentially constrained, optimization problems include sensor placement [Krause et al., 2008], drug discovery [Negoescu et al., 2011], optimizer configuration [Hutter et al., 2011], and many others. Nonetheless, only a few BO methods can address the unconstrained version of this problem, and no existing method can handle the constrained one. This work introduces the first algorithm that can efficiently optimize mixed-variable functions subject to known constraints, with provable convergence guarantees.

Related Work. Extending continuous BO methods [Shahriari et al., 2016] to mixed inputs requires ad-hoc relaxation methods to map the problem to a fully continuous one and rounding methods to map the solution back. This ignores the original domain structure, makes the solution quality dependent on the relaxation and rounding methods, and makes it hard to handle discrete constraints.
Extending discrete BO methods [Baptista and Poloczek, 2018; Oh et al., 2019] to mixed inputs requires a discretization of the continuous part of the domain, the granularity of which is crucial: if it is too fine, the domain becomes prohibitively large; if it is too coarse, the domain may only contain poorly performing values of the continuous inputs. Few BO methods address the mixed-variable setting. SMAC [Hutter et al., 2011] uses a random forest surrogate model; however, its frequentist uncertainty estimates may be too inaccurate to steer the sampling. TPE [Bergstra et al., 2011] uses kernel density estimation to find inputs that are likely to improve upon, and unlikely to perform worse than, the incumbent solution. While SMAC and TPE can handle hierarchical constraints, they cannot handle more general constraints over the discrete variables, e.g., cardinality constraints. They also lack convergence guarantees. Hyperband (HB) [Li et al., 2018] uses cheap but less accurate approximations of the objective to dynamically allocate resources for function evaluations. BOHB [Falkner et al., 2018] is the model-based counterpart of HB, based on TPE. These methods thus extend existing mixed-variable methods to the multi-fidelity setting rather than proposing new ones, and are therefore complementary to our approach rather than in competition with it. [Garrido-Merchán and Hernández-Lobato, 2018] propose a Gaussian process kernel to model discrete inputs without rounding bias. Their method lacks guarantees and cannot handle discrete constraints. We instead use discrete optimizers for the acquisition function, which avoid bias by only making integer evaluations. Finally, while [Hernández-Lobato et al., 2015; Gardner et al., 2014; Sui et al., 2015] extend continuous BO methods to handle unknown constraints, no method can handle known discrete constraints in a mixed-variable domain.

Contributions.
We introduce MIVABO, the first BO algorithm for efficiently optimizing mixed-variable functions subject to known linear and quadratic integer constraints, encompassing many of the constraints present in real-world domains (e.g., cardinality, budget, and hierarchical constraints). It relies on a linear surrogate model that decouples the continuous, discrete, and mixed components of the function using an expressive feature expansion (Sec. 3.1). We exploit the ability of this model to efficiently draw samples from the posterior over the objective (Sec. 3.2) by combining it with Thompson sampling, and show how to optimize the resulting constrained acquisition function (Sec. 3.3). While in continuous BO the optimization of the acquisition function is difficult but has well-established solutions, this is not true for mixed-variable spaces, where doing it efficiently and accurately is a key challenge that hugely impacts the algorithm's performance. We also provide the first convergence analysis of a mixed-variable BO algorithm (Sec. 3.5). Finally, we demonstrate the effectiveness of MIVABO on a set of complex hyperparameter tuning tasks, where it outperforms state-of-the-art methods and is competitive with human experts (Sec. 4).

2 Problem Statement

We consider the problem of optimizing an unknown, costly-to-evaluate function defined over a mixed-variable domain, accessible through noisy evaluations and subject to known linear and quadratic constraints. Formally, we aim to solve

$$\min_{x \in \mathcal{X}} f(x) \quad \text{s.t.} \quad g_c(x) \le 0, \; g_d(x) \le 0, \qquad (1)$$

where $\mathcal{X} = \mathcal{X}_c \times \mathcal{X}_d$ with continuous subspace $\mathcal{X}_c$ and discrete subspace $\mathcal{X}_d$. Both the constraints $g_c(x) \le 0$ over $\mathcal{X}_c$ and $g_d(x) \le 0$ over $\mathcal{X}_d$ are known, and specifically the $g_d(x)$ are linear or quadratic. We assume that the domain of the continuous inputs is box-constrained and can thus, w.l.o.g., be scaled to the unit hypercube, $\mathcal{X}_c = [0, 1]^{D_c}$. We further assume, w.l.o.g., that the discrete inputs are binary, i.e., vectors $x_d \in \mathcal{X}_d = \{0, 1\}^{D_d}$ are vertices of the unit hypercube.
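The binary representation assumed here is easy to realize in practice; for instance, an integer variable with range $[0, 2^B)$ can be encoded in $B$ bits. A minimal sketch (our illustration, with hypothetical helper names, not part of the MIVABO implementation):

```python
def int_to_bits(value, n_bits):
    """Encode an integer in [0, 2**n_bits) as a list of binary variables (LSB first)."""
    assert 0 <= value < 2 ** n_bits
    return [(value >> i) & 1 for i in range(n_bits)]

def bits_to_int(bits):
    """Decode a list of binary variables back into the integer it represents."""
    return sum(b << i for i, b in enumerate(bits))
```

With such an encoding, any function over bounded integers becomes a function over binary vectors, as described next.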
This representation can effectively capture the domain of any discrete function. For example, a vector $x_d = [x_{d,i}]_{i=1}^{D_d} \in \mathcal{X}_d$ can encode a subset $A$ of a ground set of $D_d$ elements, such that $x_{d,i} = 1 \Leftrightarrow a_i \in A$ and $x_{d,i} = 0 \Leftrightarrow a_i \notin A$, yielding a set function. Alternatively, $x_d \in \mathcal{X}_d$ can be a binary encoding of integer variables, yielding a function defined over the integers.

Background. BO algorithms are iterative black-box optimization methods which, at every step $t$, select an input $x_t \in \mathcal{X}$ and observe a noise-perturbed output $y_t = f(x_t) + \epsilon$ with $\epsilon \overset{\text{iid}}{\sim} \mathcal{N}(0, \beta^{-1})$, $\beta > 0$. As evaluating $f$ is costly, the goal is to query inputs based on past observations so as to find a global minimizer $x^* \in \arg\min_{x \in \mathcal{X}} f(x)$ as efficiently and accurately as possible. To this end, BO algorithms leverage two components: (i) a probabilistic function model (or surrogate) that encodes the belief about $f$ based on the available observations, and (ii) an acquisition function $\alpha : \mathcal{X} \to \mathbb{R}$ that expresses how informative an input $x$ is about the location of $x^*$, given the surrogate of $f$. Based on the model of $f$, we query the best input as measured by the acquisition function, then update the model with the resulting observation, and repeat this procedure. The goal of the acquisition function is to simultaneously learn about inputs that are likely to be optimal and about poorly explored regions of the input space, i.e., to trade off exploitation against exploration. Thus, BO reduces the original hard black-box optimization problem to a series of cheaper problems $x_t \in \arg\max_{x \in \mathcal{X}} \alpha_t(x)$. However, in our case, these optimization problems involve mixed variables and exhibit linear and quadratic constraints, and are thus still challenging. We now present MIVABO, an algorithm to efficiently solve the optimization problem in Eq. (1).

3 MIVABO Algorithm

We first introduce the linear model used to represent the objective (Sec. 3.1) and describe how to do inference with it (Sec. 3.2). We then show how to use Thompson sampling to query informative inputs (Sec.
3.3) and, finally, provide a bound on the regret incurred by MIVABO (Sec. 3.5).

3.1 Model

We propose a surrogate model that accounts for both discrete and continuous variables in a principled way, while balancing two conflicting goals: model expressiveness versus feasibility of Bayesian inference and of the constrained optimization of the mixed-variable acquisition function. Linear models defined over non-linear feature mappings, $f(x) = w^\top \phi(x)$, are a class of flexible parametric models that strike a good trade-off between model capacity, interpretability, and ease of use through the definition of features $\phi : \mathcal{X} \to \mathbb{R}^M$. While the complexity of the model is controlled by the number of features $M$, its capacity depends on their definition. Therefore, to make the design of a set of expressive features more intuitive, we treat separately the contributions to the objective $f$ from the discrete part of the domain, from the continuous part of the domain, and from the interaction of the two:

$$f(x) = \sum_{j \in \{d,c,m\}} w_j^\top \phi_j(x_j), \qquad (2)$$

where, for $j \in \{d, c, m\}$, $\phi_j(x_j) = [\phi_{j,i}(x_j)]_{i=1}^{M_j} \in \mathbb{R}^{M_j}$ and $w_j \in \mathbb{R}^{M_j}$ are the feature and weight vectors for the discrete, continuous, and mixed function components, respectively. In many real-world domains, a large set of features can be discarded a priori to simplify the design space. It is common practice in high-dimensional BO to assume that only low-order interactions between the variables contribute significantly to the objective, which has been shown to hold for many practical problems [Rolland et al., 2018; Mutný and Krause, 2018], including deep neural network hyperparameter tuning [Hazan et al., 2017]. Similarly, we focus on features defined over small subsets of the inputs. Formally, we consider $\phi(x) = [\phi_k(x_k)]_{k=1}^{M}$, where $x_k$ is a subvector of $x$ containing exclusively continuous variables, exclusively discrete variables, or a mix of both.
Thus, the objective $f(x)$ decomposes into a sum of low-dimensional functions $f_k(x_k) \triangleq w_k^\top \phi_k(x_k)$ defined over subspaces $\mathcal{X}_k \subseteq \mathcal{X}$ with $\dim(\mathcal{X}_k) \ll \dim(\mathcal{X})$. This defines a generalized additive model [Rolland et al., 2018; Hastie, 2017], where the same variable can be included in multiple subvectors/features. The complexity of this model is controlled by the effective dimensionality (ED) of the subspaces, which is crucial under limited computational resources. In particular, let $\bar{D}_d \triangleq \max_{k \in [M]} \dim(\mathcal{X}_k^d)$ denote the ED of the discrete component in Eq. (2), i.e., the dimensionality of the largest subspace that exclusively contains discrete variables. Analogously, $\bar{D}_c$ and $\bar{D}_m$ denote the EDs of the continuous and mixed components, respectively. Intuitively, the ED corresponds to the maximum order of the variable interactions present in $f$. Then, the number of features $M \in \mathcal{O}\big(D_d^{\bar{D}_d} + D_c^{\bar{D}_c} + (D_d + D_c)^{\bar{D}_m}\big)$ scales exponentially in the EDs only (as modeling up to $L$-th order interactions of $N$ inputs requires $\sum_{l=0}^{L} \binom{N}{l} \in \mathcal{O}(N^L)$ terms), which are usually small, even if the true dimensionality is large.

Discrete Features $\phi_d$. We aim to define features $\phi_d$ that can effectively represent the discrete component of Eq. (2) as a linear function, which should in general be able to capture arbitrary interactions between the discrete variables. To this end, we consider all subsets $S$ of the discrete variables in $\mathcal{X}_d$ (or, equivalently, all elements $S$ of the powerset $2^{\mathcal{X}_d}$ of $\mathcal{X}_d$) and define a monomial $\prod_{j \in S} x_{d,j}$ for each subset $S$ (where, for $S = \emptyset$, $\prod_{j \in \emptyset} x_{d,j} \triangleq 1$). We then form a weighted sum of all monomials to yield the multi-linear polynomial $w_d^\top \phi_d(x_d) = \sum_{S \in 2^{\mathcal{X}_d}} w_S \prod_{j \in S} x_{d,j}$. This functional representation corresponds to the Fourier expansion of a pseudo-Boolean function (PBF) [Boros and Hammer, 2002]. In practice, an exponential number of features can be prohibitively expensive and may lead to high-variance estimators, as in BO one typically does not have access to enough data to robustly fit a large model.
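Truncating the Fourier expansion above at second order yields one constant feature, one linear feature per variable, and one feature per variable pair. A minimal sketch of this feature map (our illustration, not the authors' implementation):

```python
from itertools import combinations
import numpy as np

def quadratic_pbf_features(x_d):
    """Second-order monomial features of a binary vector x_d:
    the constant for S = {}, all linear terms x_i, and all pairwise
    products x_i * x_j for i < j."""
    x_d = np.asarray(x_d)
    feats = [1.0]                                    # empty subset S = {}
    feats.extend(x_d.astype(float))                  # singleton subsets {i}
    feats.extend(x_d[i] * x_d[j]                     # pair subsets {i, j}
                 for i, j in combinations(range(len(x_d)), 2))
    return np.array(feats)
```

For $D_d$ binary variables this produces $1 + D_d + \binom{D_d}{2}$ features, i.e., quadratic rather than exponential growth.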
Alternatively, [Baptista and Poloczek, 2018; Hazan et al., 2017] empirically found that a second-order polynomial in the Fourier basis provides a practical balance between expressiveness and efficiency, even when the true function is of higher order. In our model, we also consider quadratic PBFs,

$$w_d^\top \phi_d(x_d) = w_{\emptyset} + \sum_{i=1}^{D_d} w_{\{i\}} x_{d,i} + \sum_{1 \le i < j \le D_d} w_{\{i,j\}} x_{d,i} x_{d,j}.$$

For the continuous and mixed components, we use random Fourier features [Rahimi and Recht, 2008], which can approximate common GP kernels arbitrarily well.

3.2 Inference

We place a Gaussian prior $p(w \mid \alpha) = \mathcal{N}(0, \alpha^{-1} I)$ with precision $\alpha > 0$, which encourages $w$ to be uniformly small, so that the final predictor is a sum of all features, each giving a small, non-zero contribution. Given the likelihood and prior, we infer the posterior $p(w \mid X_{1:t}, y_{1:t}, \alpha, \beta) \propto p(y_{1:t} \mid X_{1:t}, w, \beta)\, p(w \mid \alpha)$, which due to conjugacy is Gaussian, $p(w \mid X_{1:t}, y_{1:t}) = \mathcal{N}(m, S^{-1})$, with mean $m = \beta S^{-1} \Phi_{1:t}^\top y_{1:t} \in \mathbb{R}^M$ and precision $S = \alpha I + \beta \Phi_{1:t}^\top \Phi_{1:t} \in \mathbb{R}^{M \times M}$ [Williams and Rasmussen, 2006]. This simple analytical treatment of the posterior distribution over $w$ is a main benefit of this model, which can be viewed as a GP with a linear kernel in feature space.

3.3 Acquisition Function

We propose to use Thompson sampling (TS) [Thompson, 1933], which samples weights $\tilde{w} \sim p(w \mid X_{1:t}, y_{1:t}, \alpha, \beta)$ from the posterior and chooses the next input by solving $\hat{x} \in \arg\min_{x \in \mathcal{X}} \tilde{w}^\top \phi(x)$. TS intuitively focuses on inputs that are plausibly optimal, and has previously been successfully applied in both discrete and continuous domains [Baptista and Poloczek, 2018; Mutný and Krause, 2018]. Solving $\hat{x} \in \arg\min_{x \in \mathcal{X}} \tilde{w}^\top \phi(x)$ is a challenging mixed-variable optimization problem. However, as $\tilde{w}^\top \phi(x)$ decomposes as in Eq. (2), we can naturally use an alternating optimization scheme which iterates between optimizing the discrete variables $x_d$ conditioned on a particular setting of the continuous variables $x_c$, and vice versa, until convergence to some local optimum. While this scheme provides no theoretical guarantees, it is simple and thus widely and effectively applied in many contexts where the objective is hard to optimize.
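The closed-form posterior update and the Thompson-sampling draw described above can be sketched in a few lines (a minimal illustration under the formulas stated in Sec. 3.2, not the authors' implementation):

```python
import numpy as np

def posterior(Phi, y, alpha=1.0, beta=1.0):
    """Bayesian linear regression posterior N(m, S^{-1}):
    precision S = alpha*I + beta * Phi^T Phi,
    mean      m = beta * S^{-1} Phi^T y."""
    M = Phi.shape[1]
    S = alpha * np.eye(M) + beta * Phi.T @ Phi
    m = beta * np.linalg.solve(S, Phi.T @ y)
    return m, S

def thompson_sample(m, S, rng):
    """Draw weights w~ ~ N(m, S^{-1}); the next query then minimizes w~^T phi(x)."""
    return rng.multivariate_normal(m, np.linalg.inv(S))
```

Because the posterior is over a finite weight vector, each draw is an explicit quadratic function of the inputs, which is what makes the acquisition optimization below tractable.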
In particular, we iteratively solve

$$\hat{x}_d \in \arg\min_{x_d \in \mathcal{X}_d} \big[\tilde{w}_d^\top \phi_d(x_d) + \tilde{w}_m^\top \phi_m(x_d, x_c = \hat{x}_c)\big],$$
$$\hat{x}_c \in \arg\min_{x_c \in \mathcal{X}_c} \big[\tilde{w}_c^\top \phi_c(x_c) + \tilde{w}_m^\top \phi_m(x_d = \hat{x}_d, x_c)\big].$$

Importantly, using the mixed features proposed in Sec. 3.1, these problems can be solved by purely discrete and purely continuous optimizers, respectively. This also holds in the presence of mixed constraints $g_m(x) \le 0$, provided they decompose accordingly into discrete and continuous constraints. This scheme leverages independent subroutines for discrete and continuous optimization. For the discrete part, we exploit the fact that optimizing a second-order pseudo-Boolean function is equivalent to solving a binary integer quadratic program (IQP) [Boros and Hammer, 2002], allowing us to use commonly-used efficient and robust solvers such as Gurobi or CPLEX. While solving general binary IQPs is NP-hard [Boros and Hammer, 2002], these optimizers are in practice very efficient for the dimensionalities we consider (i.e., $D_d < 100$). This approach allows us to use any functionality offered by these tools, such as the ability to optimize objectives subject to linear constraints $A x_d \le b$, with $A \in \mathbb{R}^{K \times D_d}$, $b \in \mathbb{R}^K$, or quadratic constraints $x_d^\top Q x_d + q^\top x_d \le b$, with $Q \in \mathbb{R}^{D_d \times D_d}$, $q \in \mathbb{R}^{D_d}$, $b \in \mathbb{R}$. For the continuous part, one can use optimizers commonly employed in continuous BO, such as L-BFGS or DIRECT. In our experiments, we use Gurobi as the discrete solver and L-BFGS as the continuous solver within the alternating optimization scheme, which we always run until convergence.

3.4 Model Discussion

BO algorithms comprise three major design choices: the surrogate model to estimate the objective, the acquisition function to measure the informativeness of inputs, and the acquisition function optimizer to select queries. Due to the widespread availability of general-purpose optimizers for continuous functions, continuous BO is mostly concerned with the first two design dimensions. However, this is different for mixed-variable constrained problems. We show in Sec.
4 that using a heuristic optimizer for the acquisition function leads to poor queries and, therefore, poor performance of the BO algorithm. Therefore, the tractability of the acquisition function optimization influences and couples the other design dimensions. In particular, the following considerations make the choice of a linear model and TS the ideal combination of surrogate and acquisition function for our problem. Firstly, the linear model is preferable to a GP with a mixed-variable kernel, as the latter would complicate the acquisition function optimization for two reasons: (i) the posterior samples would be arbitrary non-linear functions of the discrete variables, and (ii) it would be non-trivial to evaluate them at arbitrary points in the domain. In contrast, our explicit feature expansion solves both problems, while second-order interactions provide a valid discrete function representation [Baptista and Poloczek, 2018; Hazan et al., 2017] and lead to tractable quadratic MIPs with capacity for complex discrete constraints. Moreover, random Fourier features approximate common GP kernels arbitrarily well, and inference in MIVABO scales linearly with the number of data points, making it applicable in cases where GP inference, which scales cubically with the number of data points, would be prohibitive. Secondly, TS induces a simple relation between the surrogate and the resulting optimization problem for the acquisition function, allowing us to trade off model expressiveness against optimization tractability, which is a key challenge in mixed-variable domains. Finally, the combination of TS and the linear surrogate facilitates the convergence analysis described in Sec. 3.5, making MIVABO the first mixed-variable BO method with theoretical guarantees.
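The alternating scheme of Sec. 3.3 can be sketched with simple stand-ins for the paper's solvers: exhaustive enumeration replaces the IQP solver (feasible only for small discrete dimensionality) and a coordinate grid search replaces L-BFGS. This is our illustration of the scheme's structure, not the MIVABO implementation:

```python
from itertools import product
import numpy as np

def alternating_minimize(obj, n_disc, n_cont, n_rounds=10, grid=51, rng=None):
    """Alternating minimization of obj(x_d, x_c) over {0,1}^n_disc x [0,1]^n_cont.
    Each round fixes x_c and optimizes x_d, then fixes x_d and optimizes x_c."""
    if rng is None:
        rng = np.random.default_rng()
    x_c = rng.random(n_cont)                 # random continuous initialization
    levels = np.linspace(0.0, 1.0, grid)
    for _ in range(n_rounds):
        # Discrete step: enumerate all binary vectors, keep the best.
        x_d = np.array(min(product([0, 1], repeat=n_disc),
                           key=lambda xd: obj(np.array(xd), x_c)))
        # Continuous step: coordinate-wise grid search.
        for i in range(n_cont):
            vals = [obj(x_d, np.concatenate([x_c[:i], [v], x_c[i + 1:]]))
                    for v in levels]
            x_c[i] = levels[int(np.argmin(vals))]
    return x_d, x_c
```

In the actual algorithm, each step would additionally pass the discrete (linear/quadratic) constraints to the discrete solver.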
3.5 Convergence Analysis

Using a linear model and Thompson sampling, we can leverage the convergence analysis of linearly parameterized multi-armed bandits, a well-studied class of methods for solving structured decision-making problems [Abeille et al., 2017]. These also assume the objective to be linear in features $\phi(x) \in \mathbb{R}^M$ with a fixed but unknown weight vector $w \in \mathbb{R}^M$, i.e., $\mathbb{E}[f(x) \mid \phi(x)] = w^\top \phi(x)$, and aim to minimize the total regret up to time $T$: $R(T) = \sum_{t=1}^{T} \big(f(x_t) - f(x^*)\big)$. We obtain the following regret bound for MIVABO:

Proposition 1. Assume that the following conditions hold in every iteration $t = 1, \ldots, T$ of the MIVABO algorithm:
1. $\tilde{w}_t \sim \mathcal{N}\big(m, 24\, M \ln T \ln \tfrac{1}{\delta}\, S^{-1}\big)$, i.e., sampling with scaled variance;
2. $x_t = \arg\min_{x} \tilde{w}_t^\top \phi(x)$ is selected exactly;¹
3. $\|\tilde{w}_t\|_2 \le c$, $\|\phi(x_t)\|_2 \le c'$, and $|f(x^*) - f(x_t)| \le 2cc'$ for some $c, c' \in \mathbb{R}_+$.

Then, $R(T) \in \tilde{\mathcal{O}}\big(M^{3/2}\sqrt{T}\big)$ with probability $1 - \delta$.

Prop. 1 follows from Theorem 1 in [Abeille et al., 2017] and holds for infinitely many arms $x \in \mathcal{X}$, $|\mathcal{X}| = \infty$. In our setting, both the discrete and continuous Fourier features (and, thus, the mixed features) satisfy the standard boundedness assumption, such that the proof indeed holds. Prop. 1 implies no-regret, $\lim_{T \to \infty} R(T)/T = 0$, i.e., convergence to the global minimum, since the minimum found after $T$ iterations is no further away from $f(x^*)$ than the mean regret $R(T)/T$. To our knowledge, MIVABO is the first mixed-variable BO algorithm for which such a guarantee is known to hold.

¹To this end, one can use more expensive but theoretically backed optimization methods instead of the alternating one, such as the powerful and popular dual decomposition [Sontag et al., 2011].

4 Experiments

We present experimental results on tuning the hyperparameters of two machine learning algorithms, namely gradient boosting and a deep generative model, on multiple datasets.

Experimental Setup.
For MIVABO², we set the prior precision α, observation noise precision β, and kernel bandwidth σ to 1.0, and scale the variance as stated in Prop. 1. We compare against SMAC, TPE, random search, and the popular GPyOpt BO package. GPyOpt uses a GP model with the upper confidence bound acquisition function [Srinivas et al., 2010], and accounts for mixed variables by relaxing discrete variables to be continuous and later rounding them to the nearest discrete neighbor. To separate the influence of the model choice from that of the acquisition function optimization, we also consider the MIVABO model optimized by simulated annealing (SA) (MIVABO-SA) and the GP approach optimized by SA (GP-SA). We compare against the SA-based variants only in constrained settings, using the more principled methods in unconstrained ones. To handle constraints, SA assigns high energy values to invalid inputs, making the probability of moving there negligible. We use SMAC, TPE, GPyOpt, and SA with their respective default settings.

4.1 Gradient Boosting Tuning

The OpenML database [Vanschoren et al., 2014] contains evaluations of various machine learning methods trained on several datasets with many hyperparameter settings. We consider extreme gradient boosting (XGBoost) [Chen and Guestrin, 2016], one of the most popular OpenML benchmarks, and tune its ten hyperparameters (three discrete and seven continuous) to minimize the classification error on a held-out test set (without any constraints). We use two datasets, each containing more than 45,000 hyperparameter settings. To evaluate hyperparameter settings for which no data is available, we use a surrogate modeling approach based on nearest neighbors [Eggensperger et al., 2015], meaning that the objective returns the error of the closest (w.r.t. Euclidean distance) setting available in the dataset. Fig.
1 shows that MIVABO achieves performance that is either significantly stronger than (left dataset) or competitive with (right dataset) state-of-the-art mixed-variable BO algorithms on this challenging task. GPyOpt performs poorly, likely because it cannot account for discrete variables in a principled way. Compared to TPE and SMAC, MIVABO seems to benefit from more sophisticated uncertainty estimation.

4.2 Deep Generative Model (DGM) Tuning

DGMs have recently received considerable attention in the machine learning community. Despite their popularity and importance, effectively tuning their hyperparameters is a major challenge. We consider tuning the hyperparameters of a variational autoencoder (VAE) [Kingma and Welling, 2014] composed of a convolutional encoder and a deconvolutional decoder [Salimans et al., 2015]. The VAEs are evaluated on stochastically binarized MNIST, as in [Burda et al., 2016], and on Fashion-MNIST. They are trained on 60,000 images for 32 epochs, using Adam with a mini-batch size of 128. We report the negative log-likelihood (NLL; in nats) achieved by the VAEs on a held-out test set of 10,000 images, as estimated via importance sampling using 32 samples per test point. To our knowledge, no other BO paper has considered DGM tuning. VAE tuning is difficult due to the high-dimensional and structured nature of the hyperparameter space and, in particular, due to constraints arising from dependencies between some of its parameters. We tune 25 discrete parameters defining the model architecture, e.g., the number of convolutional layers, their stride, padding, and filter size, the number and width of fully-connected layers, and the latent space dimensionality. We further tune three continuous parameters for the optimizer and regularization.

²We provide a Python implementation of MIVABO at https://github.com/edaxberger/mixed_variable_bo.
Crucially, mutual dependencies between the discrete parameters result in complex constraints, as certain combinations of stride, padding, and filter size lead to invalid architectures. In particular, for the encoder, the shapes of all layers must be integral, and for the decoder, the output shape must match the input data shape, i.e., one channel of size 28×28 for (Fashion-)MNIST. The latter constraint is especially challenging, as only a small number of decoder configurations yield the required output shape. Thus, even for rather simple datasets such as (Fashion-)MNIST, tuning such a VAE is significantly more challenging than, say, tuning a convolutional neural network for classification. While MIVABO can conveniently capture these restrictions via linear and quadratic constraints, the competing methods cannot. To enable a comparison that is as fair as possible, we thus use the following sensible heuristic to incorporate knowledge about the constraints into the baselines: if a method tries to evaluate an invalid parameter configuration, we return a penalty error value, which discourages a model-based method from sampling this (or a similar) setting again. However, for fairness, we only report valid observations and ignore all configurations that violated a constraint. We set the penalty value to 500 nats, which is the error incurred by a uniformly random generator. We investigated the impact of the penalty value (e.g., we also tried 250 and 125 nats) and found that it does not qualitatively affect the results. Fig. 3 shows that MIVABO significantly outperforms the competing methods on this task, both on MNIST (left) and Fashion-MNIST (right). This is because MIVABO can naturally encode the constraints and thus directly optimize over the feasible region in parameter space, while TPE, SMAC, and GPyOpt need to learn the constraints from data. They fail to do so and get stuck in bad local optima early on.
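The encoder-validity constraint above comes from the standard convolution output-size relation, out = (in − k + 2p)/s + 1, which must yield a positive integer at every layer. A minimal validity check (our illustration, not the paper's constraint encoding, which expresses these conditions as linear/quadratic integer constraints):

```python
def conv_out_size(in_size, kernel, stride, padding):
    """Output size of a conv layer; None if the shapes do not divide evenly."""
    num = in_size - kernel + 2 * padding
    if num < 0 or num % stride != 0:
        return None
    return num // stride + 1

def encoder_is_valid(in_size, layers):
    """Check that every (kernel, stride, padding) layer keeps shapes integral."""
    size = in_size
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
        if size is None or size < 1:
            return False
    return True
```

The analogous decoder check would additionally require the final output size to equal 28.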
The model-based approaches likely struggle due to sharp discontinuities in hyperparameter space induced by the constraint violation penalties (i.e., invalid configurations may lie close to well-performing configurations). In contrast, random search is agnostic to these discontinuities and thus notably outperforms the model-based methods. Lastly, GP-SA and MIVABO-SA struggle as well, suggesting that while SA can avoid invalid inputs, the effective optimization of complex constrained objectives crucially requires more principled approaches to acquisition function optimization, such as the one we propose. This shows that all model choices for MIVABO (as discussed in Sec. 3.4) are necessary to achieve such strong results.

Figure 1: XGBoost hyperparameter tuning on monks-problem-1 (left) and steel-plates-fault (right). Mean ± one std. of the validation error over 16 random seeds, as a function of the number of sampled hyperparameter configurations. MIVABO significantly outperforms the baselines on the first dataset, and is competitive on the second.

Figure 2: Randomly chosen MNIST test images (left column) and their reconstructions by the best VAE models found by MIVABO, random search, GPyOpt, TPE and SMAC (left to right), thus ordered by NLL values, which seem to capture visual quality.

Figure 3: VAE hyperparameter tuning on MNIST (left) and Fashion-MNIST (right). Mean ± one std. of the NLL in nats, estimated using 32 importance samples, over 8 random seeds, as a function of the number of sampled hyperparameter configurations. Every model was trained for 32 epochs.
MIVABO significantly outperforms the state-of-the-art baselines, demonstrating its ability to handle the complex constrained nature of the VAE's parameter space.

Figure 4: Mean wall-clock time of one iteration (excluding function evaluation time) and mean negative log-likelihood (NLL) in nats, estimated with 5000 importance samples, of the best VAEs found after 32 BO iterations (as in Fig. 3), when trained for 3280 epochs. The human expert baseline for even deeper models is 82-83 nats.

Method  | Time  | NLL
SMAC    | 0.32s | 99.09
TPE     | 0.12s | 97.05
GPyOpt  | 0.65s | 97.33
Random  | 0.01s | 93.74
MIVABO  | 7.39s | 84.25

Although log-likelihood scores allow for a quantitative comparison, they are hard for humans to interpret. Thus, for a qualitative comparison, Fig. 2 visualizes the reconstruction quality achieved on MNIST by the best VAE configuration found by each method after 32 BO iterations. The VAEs were trained for 32 epochs each, as in Fig. 3. The likelihoods seem to correlate with visual quality, and the model found by MIVABO arguably produces the most visually appealing reconstructions among all models. Note that while MIVABO requires more time per iteration than the baselines (see Fig. 4), this is still negligible compared to the cost of a function evaluation, which involves training a deep generative model. Finally, the best VAE found by MIVABO achieves 84.25 nats on MNIST when trained for 3280 epochs and using 5000 importance samples for log-likelihood estimation, i.e., the setting used in [Burda et al., 2016] (see Fig. 4). This is comparable to the performance of 82-83 nats achieved by human-expert-tuned models, e.g., as reported in [Salimans et al., 2015] (which use even more convolutional layers and a more sophisticated inference method), highlighting MIVABO's effectiveness in tuning complex deep neural network architectures.
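The NLL numbers above are importance-sampling estimates in the style of [Burda et al., 2016]: with K posterior samples, log p(x) is estimated as a log-mean-exp of the K log importance weights. A minimal, numerically stable sketch of that reduction (our illustration, not the paper's evaluation code):

```python
import math

def log_mean_exp(log_weights):
    """Numerically stable log((1/K) * sum_k exp(log_weights[k])),
    i.e. the importance-sampling estimate of log p(x) from K log-weights."""
    m = max(log_weights)  # subtract the max to avoid overflow/underflow
    return (m
            + math.log(sum(math.exp(lw - m) for lw in log_weights))
            - math.log(len(log_weights)))
```

Averaging the resulting per-image estimates over the test set (and negating) gives the reported NLL in nats.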
5 Conclusion

We propose MIVABO, the first method for efficiently optimizing expensive mixed-variable black-box functions subject to linear and quadratic discrete constraints. MIVABO combines a linear model of expressive features with Thompson sampling, making it simple yet effective. Moreover, it is highly flexible due to the modularity of its components, i.e., the mixed-variable features and the optimization oracles for the acquisition procedure. This allows practitioners to tailor MIVABO to specific objectives, e.g., by incorporating prior knowledge in the feature design or by leveraging optimizers that handle specific types of constraints. We show that MIVABO enjoys theoretical convergence guarantees that competing methods lack. Finally, we empirically demonstrate that MIVABO significantly improves optimization performance compared to state-of-the-art methods for mixed-variable optimization on complex hyperparameter tuning tasks.

Acknowledgements

This research has been partially supported by SNSF NFP75 grant 407540 167189. Matteo Turchetta was supported through the ETH-MPI Center for Learning Systems. Erik Daxberger was supported through the EPSRC and Qualcomm. The authors thank Josip Djolonga, Mojmír Mutný, Johannes Kirschner, Alonso Marco Valle, David R. Burt, Ross Clarke, Wolfgang Roth, as well as the anonymous reviewers of an earlier version of this paper, for their helpful feedback.

References

[Abeille et al., 2017] Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. EJS, 2017.

[Baptista and Poloczek, 2018] Ricardo Baptista and Matthias Poloczek. Bayesian optimization of combinatorial structures. In ICML, 2018.

[Bergstra et al., 2011] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NIPS, 2011.

[Boros and Hammer, 2002] Endre Boros and Peter L. Hammer.
Pseudo-Boolean optimization. Discrete Applied Mathematics, 123(1-3):155-225, 2002.
[Burda et al., 2016] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.
[Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In KDD, 2016.
[Eggensperger et al., 2015] Katharina Eggensperger, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Efficient benchmarking of hyperparameter optimizers via surrogates. In AAAI, 2015.
[Falkner et al., 2018] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In ICML, 2018.
[Gardner et al., 2014] Jacob R. Gardner, Matt J. Kusner, Zhixiang Xu, Kilian Q. Weinberger, and John P. Cunningham. Bayesian optimization with inequality constraints. In ICML, 2014.
[Garrido-Merchán and Hernández-Lobato, 2018] Eduardo C Garrido-Merchán and Daniel Hernández-Lobato. Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes. CoRR, 2018.
[Hastie, 2017] Trevor J Hastie. Generalized additive models. In Statistical Models in S. Routledge, 2017.
[Hazan et al., 2017] Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: A spectral approach. In ICLR, 2017.
[Hernández-Lobato et al., 2015] José Miguel Hernández-Lobato, Michael A Gelbart, Matthew W Hoffman, Ryan P Adams, and Zoubin Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. JMLR, 2015.
[Hutter et al., 2011] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In LION, 2011.
[Jenatton et al., 2017] Rodolphe Jenatton, Cedric Archambeau, Javier González, and Matthias Seeger. Bayesian optimization with tree-structured dependencies. In ICML, 2017.
[Kingma and Welling, 2014] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[Krause et al., 2008] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008.
[Li et al., 2018] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 2018.
[Močkus, 1975] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques, 1975.
[Mutný and Krause, 2018] Mojmír Mutný and Andreas Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In NIPS, 2018.
[Negoescu et al., 2011] Diana M Negoescu, Peter I Frazier, and Warren B Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS, 2011.
[Oh et al., 2019] Changyong Oh, Jakub M. Tomczak, Efstratios Gavves, and Max Welling. Combinatorial Bayesian optimization using graph representations. NeurIPS, 2019.
[Rahimi and Recht, 2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2008.
[Rolland et al., 2018] Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. In AISTATS, 2018.
[Salimans et al., 2015] Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
[Shahriari et al., 2016] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2016.
[Sontag et al., 2011] David Sontag, Amir Globerson, and Tommi Jaakkola. Introduction to dual decomposition for inference. In Optimization for Machine Learning. 2011.
[Srinivas et al., 2010] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger.
Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.
[Sui et al., 2015] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In ICML, 2015.
[Thompson, 1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
[Vanschoren et al., 2014] Joaquin Vanschoren, Jan N van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49-60, 2014.
[Williams and Rasmussen, 2006] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006.