High-Dimensional Bayesian Optimization via Nested Riemannian Manifolds

Noémie Jaquier(1,2) (noemie.jaquier@kit.edu) and Leonel Rozo(2) (leonel.rozo@de.bosch.com)
(1) Idiap Research Institute, 1920 Martigny, Switzerland
(2) Bosch Center for Artificial Intelligence, 71272 Renningen, Germany

Abstract

Despite the recent success of Bayesian optimization (BO) in a variety of applications where sample efficiency is imperative, its performance may be seriously compromised in settings characterized by high-dimensional parameter spaces. A solution to preserve the sample efficiency of BO in such problems is to introduce domain knowledge into its formulation. In this paper, we propose to exploit the geometry of non-Euclidean search spaces, which often arise in a variety of domains, to learn structure-preserving mappings and optimize the acquisition function of BO in low-dimensional latent spaces. Our approach, built on Riemannian manifold theory, features geometry-aware Gaussian processes that jointly learn a nested-manifold embedding and a representation of the objective function in the latent space. We test our approach on several benchmark artificial landscapes and report that it not only outperforms other high-dimensional BO approaches in several settings, but, unlike geometry-unaware BO methods, consistently optimizes the objective functions.

1 Introduction

Bayesian optimization (BO) is a powerful machine-learning-based optimization method to globally maximize or minimize expensive black-box functions [50]. Thanks to its ability to model complex noisy cost functions in a data-efficient manner, BO has been successfully applied in a variety of applications ranging from hyperparameter tuning for machine learning algorithms [51] to the optimization of parametric policies in challenging robotic scenarios [11, 16, 41, 49]. However, BO performance degrades as the search space dimensionality increases, which recently opened the door to different approaches dealing with the curse of dimensionality.

A common assumption in high-dimensional BO approaches is that the objective function depends on a limited set of features, i.e., that it varies along an underlying low-dimensional latent space. Following this hypothesis, various solutions based either on random embeddings [57, 43, 9] or on latent space learning [13, 23, 42, 59] have been proposed. Although these methods perform well on a variety of problems, they usually assume simple bound-constrained domains and may not be straightforwardly extended to parameter spaces with complicated constraints. Interestingly, several works proposed to further exploit the observed values of the objective function to determine or shape the latent space in a supervised manner [59, 42, 4]. However, the integration of a priori domain knowledge about the parameter space is not considered in the learning process. Moreover, the aforementioned approaches may not easily recover query points in a complex parameter space from those computed in the learned latent space.

Other relevant works in high-dimensional BO substitute or combine the low-dimensional assumption with an additive property, assuming that the objective function decomposes as a sum of functions over low-dimensional subsets of dimensions [33, 37, 21, 44, 24]. Therefore, each low-dimensional partition can be treated independently.
Figure 1: Illustration of the low-dimensional assumption on Riemannian manifolds. (a) The function on S^2 is not influenced by the value of x_1 and may be represented more efficiently on the manifold S^1. (b) The stiffness matrix of a robot, lying on S^3_{++}, is optimized to push objects lying on a table. As the stiffness along the axis x_3 does not influence the pushing skill, the cost function may be better represented in a latent space S^2_{++}. Note that the dimensionality of the manifolds is kept low here due to the difficulty of visualizing high-dimensional parameter spaces; these examples extend to higher dimensions.

In a similar line, inspired by the dropout algorithm in neural networks, other approaches proposed to deal with high-dimensional parameter spaces by optimizing only a random subset of the dimensions at each iteration [36]. Although the aforementioned strategies are well adapted to simple Euclidean parameter spaces, they may not generalize easily to complex domains. If the parameter space is not Euclidean or must satisfy complicated constraints, the problem of partitioning the space into subsets becomes difficult. Moreover, these subsets may not be easily and independently optimized, as they must satisfy global constraints acting on the parameter domain.

Introducing domain knowledge into surrogate models and acquisition functions has recently been shown to improve the performance and scalability of BO [11, 3, 45, 30, 14]. Following this research line, we hypothesize that building and exploiting geometry-aware latent spaces may improve the performance of BO in high dimensions by considering the intrinsic geometry of the parameter space. Fig. 1 illustrates this idea for two widely-used Riemannian manifolds (see Section 2 for a short background). The objective function on the sphere S^2 (Fig. 1a) does not depend on the value of x_1 and is therefore better represented on the low-dimensional latent space S^1. In Fig. 1b, the stiffness matrix X ∈ S^3_{++} of a robot controller is optimized to push objects lying on a table, with S^d_{++} denoting the manifold of d × d symmetric positive definite (SPD) matrices. In this case, the stiffness along the vertical axis x_3 does not influence the robot's ability to push the objects. We may thus optimize the stiffness along the axes x_1 and x_2, i.e., in the latent space S^2_{++}. Therefore, similarly to high-dimensional BO frameworks that exploit a Euclidean latent space of a Euclidean parameter space, the objective function may be efficiently represented in a latent space that inherits the geometry of the original Riemannian manifold. In general, this latent space is unknown and may not be aligned with the coordinate axes.

Following these observations, this paper proposes a novel high-dimensional geometry-aware BO framework (hereinafter called HD-GaBO) for optimizing parameters lying on low-dimensional Riemannian manifolds embedded in high-dimensional spaces. Our approach is based on a geometry-aware surrogate model that learns both a mapping onto a latent space inheriting the geometry of the original space, and the representation of the objective in this latent space (see Section 3). The next query point is then selected on the low-dimensional Riemannian manifold using geometry-aware optimization methods. We evaluate the performance of HD-GaBO on various benchmark functions and show that it efficiently and reliably optimizes high-dimensional objective functions that feature an intrinsic low dimensionality (see Section 4).
Potential applications of our approach are discussed in Section 5.

2 Background

Riemannian Manifolds. In machine learning, diverse types of data do not belong to a vector space, so the use of classical Euclidean methods for treating and analyzing these variables is inadequate. A common example is unit-norm data, widely used to represent directions and orientations, which can be represented as points on the surface of a hypersphere. More generally, many data are normalized in a preprocessing step to discard superfluous scaling and hence are better explained through spherical representations [19]. Notably, spherical representations have been recently exploited to design variational autoencoders [58, 12]. SPD matrices are also extensively used: they coincide with the covariance matrices of multivariate distributions and are employed as descriptors in many applications, such as computer vision [56] and brain-computer interface classification [8]. SPD matrices are also widely used in robotics in the form of stiffness and inertia matrices, controller gains, and manipulability ellipsoids, among others.

Figure 2: Illustrations of the manifolds S^2 (left) and S^2_{++} (middle). Left: points on the surface of the sphere, such as x and y, belong to the manifold. Middle: one point corresponds to a matrix (T_11 T_12; T_12 T_22) ∈ Sym^2, the space in which the manifold is embedded. For both graphs, the shortest path between x and y is the geodesic, represented as a red curve, which differs from the Euclidean path depicted in blue; u lies on the tangent space of x. Right: the table of distance operations, namely d_{S^d}(x, y) = arccos(x^T y) on S^d [2] and d_{S^d_{++}}(X, Y) = ||log(X) − log(Y)||_F on S^d_{++} [5].

Both the sphere and the space of SPD matrices can be endowed with a Riemannian metric to form Riemannian manifolds. Intuitively, a Riemannian manifold M is a mathematical space in which each point locally resembles a Euclidean space. For each point x ∈ M, there exists a tangent space T_x M equipped with a smoothly-varying positive-definite inner product called a Riemannian metric. This metric permits the definition of curve lengths on the manifold. The minimum-length curves between two points of M, called geodesics, are the generalization of straight lines in Euclidean space to Riemannian manifolds. Fig. 2 illustrates the two manifolds considered in this paper and details the corresponding distance operations. The unit sphere S^d is a d-dimensional manifold embedded in R^{d+1}. The tangent space T_x S^d is the hyperplane tangent to the sphere at x. The manifold of d × d SPD matrices S^d_{++}, endowed here with the Log-Euclidean metric [5], can be represented as the interior of a convex cone embedded in its tangent space Sym^d. Supplementary manifold operations used to optimize acquisition functions in HD-GaBO are detailed in Appendix A.
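For concreteness, the two distances of Fig. 2 can be evaluated in a few lines. The following NumPy sketch is ours (not the code released with the paper); it computes the SPD matrix logarithm through an eigendecomposition and assumes unit-norm and symmetric positive-definite inputs, respectively.

```python
import numpy as np

def sphere_distance(x, y):
    """Geodesic distance on S^d: d(x, y) = arccos(x^T y), for unit-norm x, y."""
    # Clip to avoid NaNs from round-off when |x^T y| slightly exceeds 1.
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

def spd_log(X):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return V @ np.diag(np.log(w)) @ V.T

def spd_distance(X, Y):
    """Log-Euclidean distance on S^d_{++}: ||log(X) - log(Y)||_F."""
    return np.linalg.norm(spd_log(X) - spd_log(Y), ord="fro")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=4); x /= np.linalg.norm(x)        # point on S^3
    y = rng.normal(size=4); y /= np.linalg.norm(y)
    A = rng.normal(size=(3, 3)); X = A @ A.T + np.eye(3)  # SPD matrices in S^3_{++}
    B = rng.normal(size=(3, 3)); Y = B @ B.T + np.eye(3)
    print(sphere_distance(x, y), spd_distance(X, Y))
```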
Geometry-aware Bayesian Optimization. The geometry-aware BO (GaBO) framework [30] aims at finding a global maximizer (or minimizer) of an unknown objective function f, so that x* = argmax_{x ∈ X} f(x), where the design space of parameters X is a Riemannian manifold or a subspace of a Riemannian manifold, i.e., X ⊆ M. With GaBO, geometry-awareness is first brought into BO by modeling the unknown objective function f with a GP adapted to manifold-valued data. This is achieved by defining geometry-aware kernels measuring the similarity of the parameters on M. In particular, the geodesic generalization of the SE kernel is given by

k(x_i, x_j) = θ exp(−β d_M(x_i, x_j)^2),

where d_M(·, ·) denotes the Riemannian distance between two observations, and the parameters β and θ control the horizontal and vertical scales of the function [31]. For manifolds that are not isometric to a Euclidean space, this kernel is valid, i.e., positive definite, only for parameter values β > β_min [18], where β_min can be determined experimentally [17, 30]. Other types of kernels are available for specific manifolds and may also be used in BO (see, e.g., [45, 18, 25]). Secondly, the selection of the next query point x_{n+1} is achieved by optimizing the acquisition function on the manifold M. To do so, optimization algorithms on Riemannian manifolds are exploited [2]. These geometry-aware algorithms reformulate constrained problems as unconstrained optimization on manifolds and consider the intrinsic structure of the space of interest. They also tend to show lower computational complexity and better numerical properties [29].
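As an illustration of the geodesic SE kernel above and of the validity condition β > β_min, the sketch below (ours, with hypothetical helper names) builds a Gram matrix from any Riemannian distance function and checks its positive definiteness empirically.

```python
import numpy as np

def geodesic_se_kernel(points, dist_fn, theta=1.0, beta=2.0):
    """Gram matrix of k(x_i, x_j) = theta * exp(-beta * d_M(x_i, x_j)^2)."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            d = dist_fn(points[i], points[j])
            K[i, j] = K[j, i] = theta * np.exp(-beta * d ** 2)
    return K

def is_positive_definite(K, jitter=1e-10):
    """Empirical validity check: on curved manifolds the geodesic SE kernel
    is positive definite only for beta above some beta_min."""
    return bool(np.all(np.linalg.eigvalsh(K) > -jitter))

if __name__ == "__main__":
    def sphere_distance(x, y):
        return np.arccos(np.clip(x @ y, -1.0, 1.0))

    rng = np.random.default_rng(1)
    pts = [v / np.linalg.norm(v) for v in rng.normal(size=(20, 5))]  # points on S^4
    K = geodesic_se_kernel(pts, sphere_distance, theta=1.0, beta=2.0)
    print(is_positive_definite(K))
```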
3 High-Dimensional Geometry-aware Bayesian Optimization

In this section, we present the high-dimensional geometry-aware BO (HD-GaBO) framework, which naturally handles the case where the design space of parameters X is (a subspace of) a high-dimensional Riemannian manifold, i.e., X ⊆ M_D. We assume here that the objective function satisfies the low-dimensional assumption (i.e., some dimensions of the original parameter space do not influence its value) and thus only varies within a low-dimensional latent space. Moreover, we assume that this latent space can be identified with a low-dimensional Riemannian manifold M_d inheriting the geometry of the original manifold M_D, with d ≪ D. Notice that the same assumption is generally made by Euclidean high-dimensional BO frameworks, as the objective function is represented in a latent space R^d of R^D.

In particular, we model the objective function f : M_D → R as a composition of a structure-preserving mapping m : M_D → M_d and a function g : M_d → R, so that f = g ∘ m. A model of the objective function is thus available in the latent space M_d, which is considered as the optimization domain to maximize the acquisition function. As the objective function can only be evaluated in the original space M_D, the query point z ∈ Z, with Z ⊆ M_d, obtained by the acquisition function is projected back onto the high-dimensional manifold with the right-inverse projection mapping m† : M_d → M_D.

In HD-GaBO, the latent spaces are obtained via nested approaches on Riemannian manifolds featuring parametric structure-preserving mappings m : M_D → M_d. Moreover, the parameters Θ_m and Θ_g of the mapping m and the function g are determined jointly in a supervised manner using a geometry-aware GP model, as detailed in Section 3.1. Therefore, the observed values of the objective function are exploited not only to design the BO surrogate model, but also to drive the dimensionality-reduction process towards expressive latent spaces for data-efficient high-dimensional BO. Considering nested approaches also allows us to build a mapping m† that can be viewed as the pseudo-inverse of the mapping m. As explained in Section 3.3, the corresponding set of parameters Θ_{m†} includes the projection mapping parameters Θ_m and a set of reconstruction parameters Θ_r, so that Θ_{m†} = {Θ_m, Θ_r}. The parameters Θ_r are therefore determined so as to minimize the reconstruction error, as detailed in Section 3.2.

Similarly to GaBO [30], geometry-aware kernel functions are used in HD-GaBO (see Section 3.1), and the acquisition function is optimized using techniques on Riemannian manifolds, although in HD-GaBO the optimization is carried out on the latent Riemannian manifold. The proposed HD-GaBO framework is summarized in Algorithm 1.

Algorithm 1: HD-GaBO
Input: Initial observations D_0 = {(x_i, y_i)}_{i=1}^{N_0}, with x_i ∈ M_D and y_i ∈ R
Output: Final recommendation x*_N
1: for n = 0, 1, ..., N do
2:   Update the hyperparameters {Θ_m, Θ_g} of the geometry-aware mGP model
3:   Project the observed data into the latent space, so that z_i = m(x_i)
4:   Select the next query point z_{n+1} ∈ M_d by optimizing the acquisition function in the latent space, i.e., z_{n+1} = argmax_{z ∈ Z} γ_n(z; {(z_i, y_i)})
5:   Update the hyperparameters Θ_{m†} of the pseudo-inverse projection
6:   Obtain the new query point x_{n+1} = m†(z_{n+1}) in the original space
7:   Query the objective function to obtain y_{n+1}
8:   Augment the set of observed data D_{n+1} = {D_n, (x_{n+1}, y_{n+1})}

3.1 HD-GaBO Surrogate Model

The choice of latent space is crucial for the efficiency of HD-GaBO, as it determines the search space for the selection of the next query point x_{n+1}. In this context, it is desirable to base the latent-space learning process not only on the distribution of the observed parameters x_n in the original space, but also on the quality of the corresponding values y_n of the objective function. Therefore, we propose (i) to learn, in a supervised manner, a structure-preserving mapping onto a low-dimensional latent space, and (ii) to learn the representation of the objective function in this latent space along with the corresponding mapping. To do so, we exploit the so-called manifold Gaussian process (mGP) model introduced in [10]. It is important to notice that the term manifold here denotes a latent space whose parameters are learned by the mGP, which does not generally correspond to a Riemannian manifold. In an mGP, the regression process is considered as a composition g ∘ m of a parametric projection m onto a latent space and a function g. Specifically, an mGP is defined as a GP f ~ GP(µ_m, k_m) with mean function µ_m : X → R and positive-definite covariance function k_m : X × X → R defined as

µ_m(x) = µ(m(x))   and   k_m(x_i, x_j) = k(m(x_i), m(x_j)),

with µ : Z → R and k : Z × Z → R a kernel function. The mGP parameters are estimated by maximizing the marginal likelihood of the model, so that {Θ*_m, Θ*_g} = argmax_{Θ_m, Θ_g} p(y | X, Θ_m, Θ_g). In the mGP of [10], the original and latent spaces are subspaces of Euclidean spaces, so that X ⊆ R^D and Z ⊆ R^d, respectively. Note that the idea of jointly learning a projection mapping and a representation of the objective function with an mGP was also exploited in the context of high-dimensional BO in [42].

In [10, 42], the mapping m : R^D → R^d was represented by a neural network. However, in the HD-GaBO framework, the design parameter space X ⊆ M_D is a high-dimensional Riemannian manifold and we aim at learning a geometry-aware latent space Z ⊆ M_d that inherits the geometry of X. Thus, we define a structure-preserving mapping m : M_D → M_d as a nested projection from a high- to a low-dimensional Riemannian manifold of the same type, as described in Section 3.3. Moreover, as in GaBO, we use a geometry-aware kernel function k that allows the GP to properly measure the similarity between parameters z = m(x) lying on the Riemannian manifold M_d.
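To make the composition k_m(x_i, x_j) = k(m(x_i), m(x_j)) concrete, the following sketch wraps a latent-space geodesic kernel around a projection. The projection used here is a deliberately simple stand-in (keep a few coordinates of a point on the sphere and renormalize), not the nested mappings of Section 3.3, and in HD-GaBO the parameters of m and k are learned jointly by marginal-likelihood maximization rather than fixed as below.

```python
import numpy as np

def composed_mgp_kernel(points, mapping, latent_kernel):
    """Gram matrix of k_m(x_i, x_j) = k(m(x_i), m(x_j))."""
    Z = [mapping(x) for x in points]   # project observations into the latent space
    n = len(Z)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = latent_kernel(Z[i], Z[j])
    return K

# --- toy stand-in components (not the nested mappings of Section 3.3) ---
def toy_sphere_projection(x, d=3):
    """Illustrative map: keep the first d coordinates and renormalize,
    yielding a point on the (d-1)-sphere."""
    z = x[:d]
    return z / np.linalg.norm(z)

def latent_geodesic_se(z1, z2, theta=1.0, beta=2.0):
    dist = np.arccos(np.clip(z1 @ z2, -1.0, 1.0))
    return theta * np.exp(-beta * dist ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    points = [v / np.linalg.norm(v) for v in rng.normal(size=(10, 8))]  # points on S^7
    K = composed_mgp_kernel(points, toy_sphere_projection, latent_geodesic_se)
    print(K.shape, bool(np.all(np.linalg.eigvalsh(K) > -1e-10)))
```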
Therefore, the surrogate model of HD-GaBO is a geometry-aware mGP, which leads to a geometry-aware representation of the objective function in a locally-optimal low-dimensional Riemannian manifold M_d. Importantly, the predictive distribution of the mGP f ~ GP(µ_m, k_m) at a test input x̃ is equivalent to the predictive distribution of the GP g ~ GP(µ, k) at the test input z̃ = m(x̃). Therefore, the predictive distribution can be straightforwardly computed in the latent space. This allows the acquisition function to be defined and optimized on the low-dimensional Riemannian manifold M_d instead of the original high-dimensional parameter space M_D. The selected next query point z_{n+1} in the latent space then needs to be projected back onto M_D in order to evaluate the objective function.

3.2 Input Reconstruction from the Latent Embedding to the Original Space

After optimizing the acquisition function, the selected query point z_{n+1} in the latent space needs to be projected back onto the manifold M_D in order to evaluate the objective function. To solve this problem in the Euclidean case, Moriconi et al. [42] proposed to learn a reconstruction mapping r : R^d → R^D based on multi-output GPs. In contrast, we propose here to further exploit the nested structure-preserving mappings in order to project the selected query point back onto the original manifold. As shown in Section 3.3, a right-inverse parametric projection m† : M_d → M_D can be built from the nested Riemannian manifold approaches. This pseudo-inverse mapping depends on a set of parameters Θ_{m†} = {Θ_m, Θ_r}. Note that the parameters Θ_m are learned with the mGP surrogate model, but we still need to determine the reconstruction parameters Θ_r. While the projection mapping m aims at finding an optimal representation of the objective function, the corresponding pseudo-inverse mapping m† should (ideally) project the data z lying on the latent space M_d back onto their original representation x in the original space M_D. Therefore, the parameters Θ_r are obtained by minimizing the sum of the squared residuals on the manifold M_D, so that

Θ*_r = argmin_{Θ_r} Σ_i d²_{M_D}(x_i, m†(z_i; Θ_m, Θ_r)).    (1)
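As an illustration of the reconstruction objective (1), the sketch below fits the single reconstruction parameter of a one-level nested-sphere inverse map (its radius r, with the rotation assumed already fixed) by a bounded scalar search. The actual framework fits all reconstruction parameters jointly with (constrained) optimization on Riemannian manifolds, so SciPy's bounded minimizer is used here only for simplicity, and the data are generated synthetically.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sphere_distance(x, y):
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

def inverse_level(z, R, r):
    """One-level right-inverse S^{D-1} -> S^D: x = R^T [sin(r) z; cos(r)]."""
    return R.T @ np.concatenate([np.sin(r) * z, [np.cos(r)]])

def reconstruction_loss(r, X_obs, Z_lat, R):
    """Sum of squared geodesic residuals on the original manifold, as in Eq. (1)."""
    return sum(sphere_distance(x, inverse_level(z, R, r)) ** 2
               for x, z in zip(X_obs, Z_lat))

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    D, n, r_true = 5, 30, 0.9
    R, _ = np.linalg.qr(rng.normal(size=(D + 1, D + 1)))  # arbitrary orthogonal matrix
    Z_lat = [v / np.linalg.norm(v) for v in rng.normal(size=(n, D))]  # latent points on S^{D-1}
    X_obs = [inverse_level(z, R, r_true) for z in Z_lat]  # data generated exactly on a subsphere of S^D
    res = minimize_scalar(reconstruction_loss, bounds=(1e-3, np.pi / 2),
                          args=(X_obs, Z_lat, R), method="bounded")
    print(res.x)  # recovered radius, close to r_true
```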
3.3 Nested Manifolds Mappings

As mentioned previously, the surrogate model of HD-GaBO learns to represent the objective function in a latent space M_d inheriting the geometry of the original space M_D. To do so, the latent space is obtained via nested approaches, which map a high-dimensional Riemannian manifold to a low-dimensional latent space inheriting the geometry of the original Riemannian manifold. While various other dimensionality-reduction techniques have been proposed on Riemannian manifolds [20, 52, 53, 28, 46], the resulting latent space is usually formed by curves on the high-dimensional manifold M_D. This would still require optimizing the acquisition function on M_D with complex constraints, which may not be handled efficiently by optimization algorithms. In contrast, nested manifold mappings reduce the dimension of the search space in a systematic and structure-preserving manner, so that the acquisition function can be efficiently optimized on a low-dimensional Riemannian manifold with optimization techniques on Riemannian manifolds. Moreover, intrinsic latent spaces may naturally be encoded with nested manifold mappings in various applications (see Fig. 1). Nested mappings for the sphere and SPD manifolds are presented in the following.

Sphere manifold. The concept of nested spheres, introduced in [32], is illustrated in Fig. 3. Given an axis v ∈ S^D, the sphere is first rotated so that v aligns with the origin, typically defined as the north pole (0, ..., 0, 1)^T. Then, the data x ∈ S^D (in purple) are projected onto the subsphere A_{D−1} defined as A_{D−1}(v, r) = {w ∈ S^D : d_{S^D}(v, w) = r}, where r ∈ (0, π/2], so that the last coordinate of the rotated, projected data equals cos(r). This last coordinate is then discarded and the data z ∈ S^{D−1} (in blue) are obtained by identifying the subsphere A_{D−1}, of radius sin(r), with the nested unit sphere S^{D−1} via a scaling operation.

Figure 3: Illustration of the nested sphere projection mapping: (a) rotation of S^2; (b) projection onto A_1; (c) A_1 identified with S^1. Data on the sphere S^2, depicted by purple dots, are projected onto the subsphere A_1, which is then identified with the sphere S^1.

Specifically, given an axis v_D ∈ S^D and a distance r_D ∈ (0, π/2], the projection mapping m_D : S^D → S^{D−1} is computed as

z = m_D(x) = (1 / sin(r_D)) R_trunc [ sin(r_D) x + sin(d_{S^D}(v_D, x) − r_D) v_D ] / sin(d_{S^D}(v_D, x)),    (2)

where the factor 1/sin(r_D) rescales the subsphere to unit radius, R_trunc performs the rotation and the dimensionality reduction, and the bracketed term is the projection onto A_{D−1}. Here, d_{S^D} is defined as in the table of Fig. 2, R ∈ SO(D+1) is the rotation matrix that moves v_D to the origin on the manifold, and R_trunc is the matrix composed of the first D rows of R. Notice also that the order of the projection and rotation operations is interchangeable. In (2), the data are simultaneously rotated and reduced in dimension after being projected onto A_{D−1}. However, the same result may be obtained by projecting the rotated data Rx onto A_{D−1} using the rotated axis Rv and multiplying the obtained vector by the truncated identity matrix I_trunc ∈ R^{D×(D+1)}. This fact will be exploited later to define the SPD nested mapping.

The full projection mapping m : S^D → S^d is then defined via successive mappings (2), so that m = m_{d+1} ∘ ... ∘ m_{D−1} ∘ m_D, with parameters {v_D, ..., v_{d+1}, r_D, ..., r_{d+1}} such that v_k ∈ S^k and r_k ∈ (0, π/2]. Importantly, notice that the distance d_{S^d}(m(x_i), m(x_j)) between two points x_i, x_j ∈ S^D projected onto S^d is invariant w.r.t. the distance parameters {r_D, ..., r_{d+1}} (see Appendix B for a proof). Therefore, when using distance-based kernels, the parameter set of the mGP projection mapping corresponds to Θ_m = {v_D, ..., v_{d+1}}. The optimization of the mGP parameters is thus carried out with techniques on Riemannian manifolds on the domain S^D × ... × S^{d+1} × M_g, where M_g is the space of GP parameters Θ_g (usually M_g = R × ... × R).

As shown in [32], an inverse transformation m_D^{−1} : S^{D−1} → S^D can be computed as

x = m_D^{−1}(z) = R^T (sin(r_D) z^T, cos(r_D))^T.    (3)

Therefore, the query point selected by the acquisition function in the latent space can be projected back onto the original space with the inverse projection mapping m† : S^d → S^D given by m† = m_D^{−1} ∘ ... ∘ m_{d+1}^{−1}. As the axis parameters are determined within the mGP model, the set of reconstruction parameters is given by Θ_r = {r_D, ..., r_{d+1}}.
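The sketch below implements one level of the nested-sphere mapping (2) and its inverse (3) in NumPy: it builds a rotation sending the axis v to the north pole, projects onto the subsphere, rescales to S^{D−1}, and verifies the round trip for a point lying exactly on the subsphere. The rotation construction and variable names are ours; the released implementation may differ.

```python
import numpy as np

def rotation_to_north_pole(v):
    """Orthogonal matrix R (rotation in the plane span{v, e}) with R v = e = (0,...,0,1).
    Antipodal v (v = -e) is not handled in this sketch."""
    e = np.zeros_like(v); e[-1] = 1.0
    c = float(v @ e)
    if np.isclose(c, 1.0):
        return np.eye(len(v))
    b = e - c * v
    b /= np.linalg.norm(b)             # unit vector orthogonal to v in span{v, e}
    s = np.sqrt(1.0 - c ** 2)
    return (np.eye(len(v)) + s * (np.outer(b, v) - np.outer(v, b))
            + (c - 1.0) * (np.outer(v, v) + np.outer(b, b)))

def sphere_distance(x, y):
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

def nested_sphere_projection(x, v, r):
    """One level m_D: S^D -> S^{D-1}, following Eq. (2)."""
    dvx = sphere_distance(v, x)
    p = (np.sin(r) * x + np.sin(dvx - r) * v) / np.sin(dvx)  # projection onto A_{D-1}(v, r)
    R = rotation_to_north_pole(v)
    return (R @ p)[:-1] / np.sin(r)    # rotate, drop the last coordinate, rescale to S^{D-1}

def nested_sphere_inverse(z, v, r):
    """One level m_D^{-1}: S^{D-1} -> S^D, following Eq. (3)."""
    R = rotation_to_north_pole(v)
    return R.T @ np.concatenate([np.sin(r) * z, [np.cos(r)]])

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    D, r = 5, 0.8
    v = rng.normal(size=D + 1); v /= np.linalg.norm(v)  # axis on S^5
    z = rng.normal(size=D); z /= np.linalg.norm(z)      # latent point on S^4
    x = nested_sphere_inverse(z, v, r)                  # lies exactly on the subsphere A_4(v, r)
    z_back = nested_sphere_projection(x, v, r)
    print(np.allclose(z, z_back), np.isclose(np.linalg.norm(z_back), 1.0))
```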
SPD manifold. Although not explicitly named as such, the dimensionality-reduction technique for the SPD manifold introduced in [26, 27] can be understood as a nested manifold mapping. Specifically, Harandi et al. [26, 27] proposed the projection mapping m : S^D_{++} → S^d_{++} given by

Z = m(X) = W^T X W,    (4)

with W ∈ R^{D×d}. Note that the matrix Z ∈ S^d_{++} is guaranteed to be positive definite if W has full rank. As proposed in [26, 27], this can be achieved, without loss of generality, by imposing an orthogonality constraint on W such that W ∈ G_{D,d}, i.e., W^T W = I, where G_{D,d} denotes the Grassmann manifold corresponding to the space of d-dimensional subspaces of R^D [15]. Therefore, in the case of the SPD manifold, the projection mapping parameter set is Θ_m = {W}. Specifically, the mGP parameters are optimized on the product of Riemannian manifolds G_{D,d} × M_g. Also, the optimization of the mGP on the SPD manifold can be simplified as shown in Appendix C.

In order to project the query point Z ∈ S^d_{++} back onto the original space S^D_{++}, we propose to build an inverse projection mapping based on m. It can easily be observed that using the pseudo-inverse W†, so that X = W†^T Z W†, does not guarantee the recovered matrix X to be positive definite. Therefore, we propose a novel inverse mapping inspired by the nested sphere projections. To do so, we observe that an analogy can be drawn between the mappings (2) and (4). Namely, the mapping (4) first consists of a rotation R^T X R of the data X ∈ S^D_{++}, with R a rotation matrix whose first d columns equal W, i.e., R = (W V), where W can be understood as R_trunc in Eq. (2). Similarly to the nested sphere case, the rotated data can be projected onto a subspace of the manifold S^D_{++} by fixing their last coordinates. Therefore, the subspace is composed of matrices of the form

( W^T X W   C )
( C^T       B ),

where B ∈ S^{D−d}_{++} is a constant matrix. Finally, this subspace may be identified with S^d_{++} by multiplying the projected matrix by a truncated identity matrix I_trunc ∈ R^{D×d}. Therefore, the mapping (4) is equivalently expressed as

Z = m(X) = I_trunc^T ( W^T X W   C ; C^T   B ) I_trunc = W^T X W.

From the properties of block matrices with positive-definite diagonal blocks, the projected block matrix is positive definite if and only if W^T X W ≻ C B^{−1} C^T [6]. This corresponds to defining the side matrix as C = (W^T X W)^{1/2} K B^{1/2}, where K ∈ R^{d×(D−d)} is a contraction matrix, i.e., ||K|| ≤ 1 [6]. Based on the aforementioned equivalence, the inverse mapping m† : S^d_{++} → S^D_{++} is given by

X = m†(Z) = R ( Z                    Z^{1/2} K B^{1/2} ;
                B^{1/2} K^T Z^{1/2}   B ) R^T,    (5)

with reconstruction parameters Θ_r = {V, K, B}. The optimization (1) is thus carried out on the product of manifolds G_{D,D−d} × R^{d×(D−d)} × S^{D−d}_{++}, subject to ||K|| ≤ 1 and W^T V = 0. The latter condition is necessary for R to be a valid rotation matrix. We solve this optimization problem with the augmented Lagrangian method on Riemannian manifolds [38].
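As a small illustration of the SPD nested projection (4), the sketch below projects a random SPD matrix with an orthonormal W (a point on the Grassmann manifold obtained here by QR) and verifies that the result is positive definite, while the naive reconstruction W Z W^T is rank-deficient and hence not a valid SPD matrix, which motivates the inverse mapping (5) proposed above. This is an illustrative sketch, not the implementation released with the paper.

```python
import numpy as np

def spd_nested_projection(X, W):
    """Structure-preserving projection S^D_{++} -> S^d_{++}: Z = W^T X W."""
    return W.T @ X @ W

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    D, d = 10, 3
    A = rng.normal(size=(D, D))
    X = A @ A.T + np.eye(D)                        # SPD matrix in S^10_{++}
    W, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal W, i.e., W^T W = I
    Z = spd_nested_projection(X, W)
    print(bool(np.all(np.linalg.eigvalsh(Z) > 0)))  # True: Z is SPD since W has full rank
    # Naive reconstruction via the pseudo-inverse has rank d only, so it is not SPD:
    X_naive = W @ Z @ W.T
    print(np.linalg.matrix_rank(X_naive), bool(np.all(np.linalg.eigvalsh(X_naive) > 0)))
```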
4 Experiments

In this section, we evaluate the proposed HD-GaBO framework on high-dimensional functions that feature an intrinsic low-dimensional structure. We consider benchmark test functions defined on a low-dimensional manifold M_d embedded in a high-dimensional manifold M_D. The test functions are therefore defined as f : M_D → R, with y = f(x) = g(m(x)), where g is the benchmark function defined on M_d and m : M_D → M_d is a nested projection mapping as defined in Section 3.3. The projection mapping parameters are randomly set for each trial. The search space corresponds to the complete manifold for S^D and to SPD matrices with eigenvalues λ ∈ [0.001, 5] for S^D_{++}. We carry out the optimization by running 30 trials with random initialization. Both GaBO and HD-GaBO use the geodesic generalization of the SE kernel and their acquisition functions are optimized using trust regions on Riemannian manifolds [1] (see Appendix D). The other state-of-the-art approaches use the classical SE kernel and their constrained acquisition functions are optimized using sequential least squares programming [34]. All the tested methods use EI as the acquisition function and are initialized with 5 random samples. The GP parameters are estimated using MLE. All the implementations employ GPyTorch [22], BoTorch [7] and Pymanopt [55]. Source code is available at https://github.com/NoemieJaquier/GaBOtorch. Supplementary results are presented in Appendix F.

In the case of the sphere manifold S^D, we compare HD-GaBO against GaBO, the Euclidean BO, and three high-dimensional BO approaches, namely dropout BO [36], SIR-BO [59], and REMBO [57], which carry out all the operations in the Euclidean space. The optimization of the acquisition function of each Euclidean BO method was adapted to fulfill the constraint ||x|| = 1. Other approaches, such as the MGPC-BO of [42], are not considered here due to the difficulty of adapting them when the parameters lie on Riemannian manifolds. We minimize the Rosenbrock, Ackley, and product-of-sines functions (see also Appendix E) defined on the low-dimensional manifold S^5 embedded in S^50. Figs. 4a-4c display the median of the logarithm of the simple regret along 300 BO iterations and the distribution of the logarithm of the simple regret of the BO recommendation x*_N for the three functions. We observe that HD-GaBO generally converges fast and provides good optimizers for all the test cases. Moreover, it outperforms all the other BO methods for the product-of-sines function: it provides fast convergence and a better optimizer with low variance. In contrast, SIR-BO, which leads to the best optimizer for the Rosenbrock function, performs poorly on the product-of-sines function. Similarly, dropout achieves a performance similar to HD-GaBO for the Ackley function, but it is outperformed by HD-GaBO in the two other test cases. Moreover, it is worth noticing that GaBO converges faster to the best optimizer than the other approaches for the Ackley function and performs better than all the geometry-unaware approaches for the product-of-sines function. This highlights the importance of using geometry-aware approaches for optimizing objective functions lying on Riemannian manifolds.

Figure 4: Logarithm of the simple regret for benchmark test functions over 30 trials: (a) Rosenbrock, (b) Ackley, and (c) product of sines on S^5 embedded in S^50; (d) Rosenbrock, (e) Styblinski-Tang, and (f) product of sines on S^3_{++} embedded in S^10_{++}. The left graphs show the evolution of the median for the BO approaches and the random search baseline. The right graphs display the distribution of the logarithm of the simple regret of the BO recommendation x*_N after 300 iterations. The boxes extend from the first to the third quartiles and the median is represented by a horizontal line. Supplementary results are provided in Appendix F.

Regarding the SPD manifold S^D_{++}, we compare HD-GaBO against GaBO, the Euclidean BO, and SIR-BO (augmented with the constraint λ_min > 0). Moreover, we consider alternative implementations of BO, dropout, SIR-BO and REMBO that exploit the Cholesky decomposition of an SPD matrix A = LL^T, so that the resulting parameter is the vectorization of the lower-triangular matrix L (hereinafter denoted as Cholesky-methods). Note that we do not consider here the Euclidean version of the dropout and REMBO methods due to the difficulty of optimizing the acquisition function in the latent space while satisfying the constraint λ_min > 0 for the query point in the high-dimensional manifold. We minimize the Rosenbrock, Styblinski-Tang, and product-of-sines functions defined on the low-dimensional manifold S^3_{++} embedded in S^10_{++}. The corresponding results are displayed in Figs. 4d-4f (in logarithmic scale). We observe that HD-GaBO consistently converges fast and provides good optimizers for all the test cases.
Moreover, it outperforms all the other approaches for the Styblinski-Tang function. Similarly to the sphere cases, some methods are still competitive with respect to HD-GaBO for some of the test functions but perform poorly in other cases. Interestingly, GaBO performs well for both the Rosenbrock and Styblinski-Tang functions. Moreover, the Euclidean BO methods generally perform poorly compared to their Cholesky equivalents, suggesting that, although they do not account for the manifold geometry, Cholesky-based approaches provide a better representation of the SPD parameter space than the Euclidean methods.

5 Potential Applications

After evaluating the performance of HD-GaBO on various benchmark artificial landscapes, we discuss potential real-world applications of the proposed approach. First, HD-GaBO may be exploited for the optimization of controller parameters in robotics. Of particular interest is the optimization of the error gain matrix Q_t ∈ S^{D_x}_{++} and the control gain matrix R_t ∈ S^{D_u}_{++} in linear quadratic regulators (LQR), where D_x and D_u are the dimensionalities of the system state and control input, respectively. The system state may consist of the linear and angular position and velocity of the robot end-effector, so that D_x = 13, and D_u corresponds to Cartesian accelerations or wrench commands. Along some parts of the robot trajectory, the error w.r.t. some dimensions of the state space may not influence the execution of the task, i.e., it affects the LQR cost function negligibly. Therefore, the matrix Q_t for this trajectory segment may be efficiently optimized in a latent space S^{d_x}_{++} with d_x < D_x. A similar analysis applies to R_t. Notice that, although BO has been applied to optimize LQR parameters [40, 41], the problem was greatly simplified as only diagonal matrices Q and R were considered in the optimization, resulting in a loss of flexibility in the controller. From a broader point of view, the low-dimensional assumption may also apply to the optimization of gain matrices for other types of controllers.

Another interesting application is the identification of dynamic model parameters of (highly-)redundant robots. These parameters typically include the inertia matrix M ∈ S^D_{++}, with D the number of robot joints. As discussed in [60], a low-dimensional representation of the parameter space and state-action space may be sufficient to determine the system dynamics. Therefore, the inertia matrix may be more efficiently represented and identified in a lower-dimensional SPD latent space.

In the context of directional statistics [54, 47], HD-GaBO may be used to adapt mixtures of von Mises-Fisher distributions, whose mean directions belong to S^D. On a different topic, object shape spaces are typically characterized on high-dimensional unit spheres S^D. Several works have shown that the main features of the shapes are efficiently represented in a low-dimensional latent space S^d inheriting the geometry of the original manifold (see, e.g., [32]). Therefore, such latent spaces may be exploited for shape representation optimization. Along a similar line, skeletal models, which seek to capture the interior of objects, lie on a Cartesian product of manifolds that involves the unit hypersphere [48]. The relevant data structure is efficiently expressed in a product of low-dimensional manifolds of the same types, so that HD-GaBO may be exploited to optimize skeletal models.
6 Conclusion

In this paper, we proposed HD-GaBO, a high-dimensional geometry-aware Bayesian optimization framework that exploits geometric prior knowledge on the parameter space to optimize high-dimensional functions lying on low-dimensional latent spaces. To do so, we used a geometry-aware GP that jointly learns a nested structure-preserving mapping and a representation of the objective function in the latent space. We also considered the geometry of the latent space while optimizing the acquisition function and took advantage of the nested mappings to express the next query point in the high-dimensional parameter space. We showed that HD-GaBO not only outperformed other BO approaches in several settings, but also consistently performed well while optimizing various objective functions, unlike geometry-unaware state-of-the-art methods.

An open question, shared across various high-dimensional BO approaches, concerns the model dimensionality mismatch. In order to avoid suboptimal solutions where the optimum of the function may not be included in the estimated latent space, we hypothesize that the dimension d should be selected slightly higher in case of uncertainty about its value [35]. A limitation of HD-GaBO is that it depends on nested mappings that are specific to each Riemannian manifold, and such mappings may not be available for all kinds of manifolds. Also, the inverse map does not necessarily exist if the manifold contains self-intersections. In this case, a non-parametric reconstruction mapping may be learned (e.g., based on wrapped GPs [39]). However, most of the Riemannian manifolds encountered in machine learning and robotics applications do not self-intersect, so that this problem is avoided. Future work will investigate the aforementioned aspects.

Broader Impact

The HD-GaBO formulation presented in this paper makes a step towards more explainable and interpretable BO approaches. Indeed, in addition to the benefits in terms of performance, the inclusion of domain knowledge via Riemannian manifolds into the BO framework permits treating the parameter space in a principled way. This can notably be contrasted with approaches based on random features, which generally remain hard to interpret for humans. As often, the gains in terms of explainability and interpretability come at the expense of the low computational cost that characterizes random-feature-based approaches. However, the carbon footprint of the proposed approach remains low compared to many deep-learning approaches used nowadays in machine learning applications.

Acknowledgments and Disclosure of Funding

This work was mainly developed during a PhD sabbatical at the Bosch Center for Artificial Intelligence (Renningen, Germany). This work was also partially supported by the FNS/DFG project TACTHAND, as part of the PhD thesis of the first author, carried out at the Idiap Research Institute (Martigny, Switzerland), while also affiliated with the Ecole Polytechnique Fédérale de Lausanne (Lausanne, Switzerland). Noémie Jaquier is now affiliated with the Karlsruhe Institute of Technology (Karlsruhe, Germany).

References

[1] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7:303-330, 2007.
[2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2007.
[3] R. Antonova, A. Rai, and C. Atkeson. Deep kernels for optimizing locomotion controllers. In Conference on Robot Learning (CoRL), pages 47-56, 2017.
[4] R. Antonova, A. Rai, T. Li, and D. Kragic. Bayesian optimization in variational latent spaces with dynamic compression. In Conference on Robot Learning (CoRL), 2019.
[5] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine, 56(2):411-421, 2006.
[6] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.
[7] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: Programmable Bayesian optimization in PyTorch. arXiv preprint 1910.06403, 2019.
[8] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten. Multiclass brain-computer interface classification by Riemannian geometry. IEEE Trans. on Biomedical Engineering, 59(4):920-928, 2012.
[9] M. Binois, D. Ginsbourger, and O. Roustant. On the choice of the low-dimensional domain for global optimization via random embeddings. Journal of Global Optimization, 76(1):69-90, 2020.
[10] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian processes for regression. In Proc. IEEE Intl. Joint Conf. on Neural Networks (IJCNN), 2016.
[11] A. Cully, J. Clune, D. Tarapore, and J. B. Mouret. Robots that can adapt like animals. Nature, 521:503-507, 2015.
[12] T. R. Davidson, L. Falorsi, N. De Cao, T. Kipf, and J. M. Tomczak. Hyperspherical variational auto-encoders. In Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[13] J. Djolonga, A. Krause, and V. Cevher. High-dimensional Gaussian process bandits. In Neural Information Processing Systems (NeurIPS), 2013.
[14] D. K. Duvenaud. Automatic Model Construction with Gaussian Processes. PhD thesis, University of Cambridge, 2014.
[15] A. Edelman, T. A. Arias, and S. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303-351, 1998.
[16] P. Englert and M. Toussaint. Combined optimization and reinforcement learning for manipulation skills. In Robotics: Science and Systems (R:SS), 2016.
[17] A. Feragen and S. Hauberg. Open problem: Kernel methods on manifolds and metric spaces. What is the probability of a positive definite geodesic exponential kernel? In 29th Annual Conference on Learning Theory, pages 1647-1650, 2016.
[18] A. Feragen, F. Lauze, and S. Hauberg. Geodesic exponential kernels: When curvature and linearity conflict. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[19] N. I. Fisher, T. Lewis, and B. J. J. Embleton. Statistical Analysis of Spherical Data. Cambridge University Press, 1987.
[20] P. T. Fletcher and S. C. Joshi. Principal geodesic analysis on symmetric spaces: Statistics of diffusion tensors. In Proc. of the CVAMIA and MMBIA Workshops, pages 87-98, 2004.
[21] J. R. Gardner, C. Guo, K. Q. Weinberger, R. Garnett, and R. Grosse. Discovering and exploiting additive structure for Bayesian optimization. In Proc. of the Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), pages 1311-1319, 2017.
[22] J. R. Gardner, G. Pleiss, D. Bindel, K. Q. Weinberger, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Neural Information Processing Systems (NeurIPS), 2018.
[23] R. Garnett, M. A. Osborne, and P. Hennig. Active learning of linear embeddings for Gaussian processes. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 230-239, 2014.
[24] D. Gaudrie, R. Le Riche, V. Picheny, B. Enaux, and V. Herbert. Modeling and optimization with Gaussian processes in reduced eigenbases. Structural and Multidisciplinary Optimization, 61(6):2343-2361, 2020.
[25] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2066-2073, 2012.
[26] M. Harandi, M. Salzmann, and R. Hartley. From manifold to manifold: Geometry-aware dimensionality reduction for SPD matrices. In Proc. European Conf. on Computer Vision (ECCV), 2014.
[27] M. Harandi, M. Salzmann, and R. Hartley. Dimensionality reduction on SPD manifolds: The emergence of geometry-aware methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(1):48-62, 2018.
[28] S. Hauberg. Principal curves on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1915-1921, 2016.
[29] J. Hu, X. Liu, Z. Wen, and Y. Yuan. A brief introduction to manifold optimization. arXiv preprint 1906.05450, 2019.
[30] N. Jaquier, L. Rozo, S. Calinon, and M. Bürger. Bayesian optimization meets Riemannian manifolds in robot learning. In Conference on Robot Learning (CoRL), 2019.
[31] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on Riemannian manifolds with Gaussian RBF kernels. IEEE Trans. on Pattern Analysis and Machine Intelligence, 37(12):2464-2477, 2015.
[32] S. Jung, I. L. Dryden, and J. S. Marron. Analysis of principal nested spheres. Biometrika, 99(3):551-568, 2012.
[33] K. Kandasamy, J. Schneider, and B. Poczos. High dimensional Bayesian optimisation and bandits via additive models. In Intl. Conf. on Machine Learning (ICML), 2015.
[34] D. Kraft. A software package for sequential quadratic programming. Technical Report DFVLR-FB 88-28, Institut für Dynamik der Flugsysteme, Oberpfaffenhofen, 1988.
[35] B. Letham, R. Calandra, A. Rai, and E. Bakshy. Re-examining linear embeddings for high-dimensional Bayesian optimization. In Neural Information Processing Systems (NeurIPS), 2020.
[36] C. Li, S. Gupta, S. Rana, V. Nguyen, S. Venkatesh, and A. Shilton. High dimensional Bayesian optimization using dropout. In Intl. Joint Conf. on Artificial Intelligence (IJCAI), pages 2096-2102, 2017.
[37] C.-L. Li, K. Kandasamy, B. Póczos, and J. Schneider. High dimensional Bayesian optimization via restricted projection pursuit models. In Proc. of the Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), 2016.
[38] C. Liu and N. Boumal. Simple algorithms for optimization on Riemannian manifolds with constraints. Applied Mathematics & Optimization, pages 1-33, 2019.
[39] A. Mallasto and A. Feragen. Wrapped Gaussian process regression on Riemannian manifolds. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 5580-5588, 2018.
[40] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe. Automatic LQR tuning based on Gaussian process global optimization. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 270-277, 2016.
[41] A. Marco, P. Hennig, S. Schaal, and S. Trimpe. On the design of LQR kernels for efficient controller learning. In IEEE Conference on Decision and Control (CDC), pages 5193-5200, 2017.
[42] R. Moriconi, M. P. Deisenroth, and K. S. Sesh Kumar. High-dimensional Bayesian optimization using low-dimensional feature spaces. Machine Learning, 109:1925-1943, 2020.
[43] A. Munteanu, A. Nayebi, and M. Poloczek. A framework for Bayesian optimization in embedded subspaces. In Intl. Conf. on Machine Learning (ICML), volume 97, pages 4752-4761, 2019.
[44] M. Mutný and A. Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Neural Information Processing Systems (NeurIPS), 2018.
[45] C. Oh, E. Gavves, and M. Welling. BOCK: Bayesian optimization with cylindrical kernels. In Intl. Conf. on Machine Learning (ICML), pages 3868-3877, 2018.
[46] X. Pennec. Barycentric subspace analysis on manifolds. Annals of Statistics, 46(6A):2711-2746, 2018.
[47] A. Pewsey and E. García-Portugués. Recent advances in directional statistics. arXiv preprint 2005.06889, 2020.
[48] S. M. Pizer, S. Jung, D. Goswami, J. Vicory, X. Zhao, R. Chaudhuri, J. N. Damon, S. Huckemann, and J. S. Marron. Nested sphere statistics of skeletal models. Innovations for Shape Analysis, pages 93-115, 2012.
[49] A. Rai, R. Antonova, S. Song, W. Martin, H. Geyer, and C. Atkeson. Bayesian optimization using domain knowledge on the ATRIAS biped. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1771-1778, 2018.
[50] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.
[51] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems (NeurIPS), pages 2951-2959, 2012.
[52] S. Sommer, F. Lauze, S. Hauberg, and M. Nielsen. Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In European Conf. on Computer Vision, pages 43-56, 2010.
[53] S. Sommer, F. Lauze, and M. Nielsen. Optimization over geodesics for exact principal geodesic analysis. Advances in Computational Mathematics, 40(2):283-313, 2014.
[54] S. Sra. Directional statistics in machine learning: a brief review. In C. Ley and T. Verdebout, editors, Applied Directional Statistics, Chapman & Hall/CRC Interdisciplinary Statistics Series, pages 259-276. CRC Press, Boca Raton, 2018.
[55] J. Townsend, N. Koep, and S. Weichwald. Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17(137):1-5, 2016.
[56] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In European Conference on Computer Vision (ECCV), pages 589-600, 2006.
[57] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian optimization in high dimensions via random embeddings. In Intl. Joint Conf. on Artificial Intelligence (IJCAI), pages 1778-1784, 2013.
[58] J. Xu and G. Durrett. Spherical latent spaces for stable variational autoencoders. In Proc. of the Conf. on Empirical Methods in Natural Language Processing (EMNLP), 2018.
[59] M. Zhang, H. Li, and S. Su. High dimensional Bayesian optimization via supervised dimension reduction. In Proc. of the Intl. Joint Conf. on Artificial Intelligence (IJCAI), 2019.
[60] S. Zhu, D. Surovik, K. Bekris, and A. Boularias. Efficient model identification for tensegrity locomotion. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 2985-2990, 2018.