# sampleefficient_bayesian_optimisation_using_known_invariances__2dca93db.pdf

Sample-efficient Bayesian Optimisation Using Known Invariances

Theodore Brown 1,2 Alexandru Cioba 3 Ilija Bogunovic1

Equal contribution 1University College London 2United Kingdom Atomic Energy Authority 3Media Tek Research {theodore.brown.23, i.bogunovic}@ucl.ac.uk alexandru.cioba@mtkresearch.com

Bayesian optimisation (BO) is a powerful framework for global optimisation of costly functions, using predictions from Gaussian process models (GPs). In this work, we apply BO to functions that exhibit invariance to a known group of transformations. We show that vanilla and constrained BO algorithms are inefficient when optimising such invariant objectives, and provide a method for incorporating group invariances into the kernel of the GP to produce invariance-aware algorithms that achieve significant improvements in sample efficiency. We derive a bound on the maximum information gain of these invariant kernels, and provide novel upper and lower bounds on the number of observations required for invariance-aware BO algorithms to achieve ϵ-optimality. We demonstrate our method s improved performance on a range of synthetic invariant and quasi-invariant functions. We also apply our method in the case where only some of the invariance is incorporated into the kernel, and find that these kernels achieve similar gains in sample efficiency at significantly reduced computational cost. Finally, we use invariant BO to design a current drive system for a nuclear fusion reactor, finding a high-performance solution where non-invariant methods failed.

1 Introduction

Bayesian optimisation (BO) has been applied to many global optimisation tasks in science and engineering, such as improving the performance of web servers [30], reducing the losses in particle accelerators [26], or maximising the potency of a drug [1]. The most challenging real-world tasks tend to have large or continuous input domains, as well as target functions that are costly to evaluate.

One particular example of interest is the design and operation of nuclear fusion reactors, which promise to deliver huge amounts of energy with zero greenhouse gas emissions. However, the development of high-performance fusion power plant scenarios is a difficult optimisation task, involving expensive simulations and physical experiments [39, 11]. For tasks like this, it is desirable to develop sample efficient algorithms that only require a small number of evaluations to find the optimal solution.

Crucially, the physical world is characterized by underlying structures that can be leveraged to significantly improve sample efficiency. The concept of invariance, which describes properties that remain constant under various transformations, is one of the most fundamental structures in nature. A prime example of invariance is found in image classification tasks: a classifier should consistently identify a cat as a cat, regardless of its orientation, position, or size. The central question we answer is the extent to which the sample efficiency of BO can be improved by exploiting these invariances.

We study a general approach for incorporating invariances into any BO algorithm. We represent invariant objective functions as elements of a reproducing kernel Hilbert space (RKHS), and produce invariance-aware versions of existing Gaussian process (GP) bandit optimisation algorithms by using the kernel of this RKHS. We apply these algorithms to synthetic and nuclear fusion optimisation

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

(a) Permutation group, S2

(b) Cyclic group, C3

(c) Dihedral group, D5

Figure 1: Examples of group-invariant functions, sampled from a Gaussian process with the corresponding invariant kernel. Note that observing a point x (red) provides information about transformed locations {σ(x) : σ G} (white).

tasks, and show that incorporating the underlying invariance into the kernel significantly improves the sample efficiency. We also investigate the behaviour of the algorithms when the full invariant kernel is not known, and provide an empirical study of the performance of invariance-aware BO on target functions that are quasi-invariant (modelled as a sum of invariant and non-invariant components). Our main contribution is the derivation of novel upper and lower bounds on the regret achieved by invariance-aware BO algorithms.

1.1 Related work

Invariance in deep learning. The role of invariance in deep learning has been extensively explored in the literature. Invariance, particularly to geometric transformations like rotation, scaling, and translation, is crucial for the robust performance of deep learning models. Many deep learning methods incorporate invariance via data augmentation, in which additional training examples are generated by transforming the input data [28, 29]. However, the cost of kernelised learning methods often scales poorly with the number of training points n Gaussian process inference, for example, incurs O(n3) time and O(n2) memory cost [47] which means that data augmentation is prohibitively expensive. In this work, we adopt an alternative approach by directly embedding invariance properties into the model s structure, which effectively reduces computational complexity [44, 15].

Structure-informed BO. There is an extensive body of empirical BO literature that exploits known patterns present in the behaviour of a system. In [24], BO is applied after a variety of physics-informed feature transformations to the state; in [48] the authors employ BO to handle optimization over crystal configurations with symmetries found using [40]. However, in these works, only empirical results are given and asymptotic rates on sample complexity are not estimated.

Invariant kernel methods. The theory of group invariant kernels was introduced in [18, 27]. Using these ideas, [46] developed efficient variational methods for learning hyperparameters in invariant Gaussian processes. Invariant kernel ridge regression (KRR) for finite groups has been studied for translationand permutation-invariant kernels [33, 4]. [38] quantify the gain in sample complexity for kernels invariant to the actions of Lie groups on arbitrary manifolds and tie the sample complexity bounds to the dimensions of the group. Motivated by the invariance properties of convolutional neural networks, [27, 33, 4] focus on kernelised regression with the Neural Tangent Kernel; we instead consider the kernelised bandit optimisation setting with the more general Matérn family of kernels. These methods can be readily extended to less strict notions of so-called quasi-invariance [18, 46, 4].

Matérn kernels on manifolds. The previous studies of invariant kernel algorithms rely on the spectral properties of kernels on the hypersphere [33, 4]; the relevant spectral theory for Matérn kernels on Riemannian manifolds (including the hypersphere) was developed in [9]. These geometry-aware kernels were applied to optimisation tasks in [20]; however, the structure they include is restricted to the geometry of the domain only, rather than invariance to a set of transformations.

Regret bounds for Bayesian optimisation. The techniques for upper-bounding the regret of Bayesian optimisation algorithms using the kernel-dependent quantity known as maximum information gain were developed in [37]. Subsequent studies [43, 8, 14, 36, 19, 41] have focused on improving regret bounds in both cumulative and simple regret settings. [22] provided an upper bound for regret in BO using the convolutional NTK, which exhibits invariance only to the cyclic

group. [23] explored graph neural network bandits, offering maximum information gain rates by embedding permutation invariance using additive kernels. In [25], a permutation-invariant kernel is constructed in the same way as ours, and a regret analysis provided. Our work expands upon these approaches by considering more general (non-additive) kernels and broader transformation groups, demonstrating that the regret bounds from [22] can be shown to be a specialisation of ours. Lower bounds for kernelised bandits, including their robust variants ([7, 6]), have been examined in [35] and [12]. In this paper, we derive novel lower bounds on regret for the setting with invariant kernels.

1.2 Contributions

Our main contributions are:

We develop new upper bounds on sample complexity in Bayesian optimisation with invariant kernels, which explicitly quantify the gain in sample complexity due to the number of symmetries. We extend several results from the literature, as our statement applies to a wide class of groups. This is, to our knowledge, the first time such bounds are given in this degree of generality. We show how results from the literature (e.g. [12]) can be extended to obtain lower bounds on the sample complexity of invariant BO, using a novel construction and quantification of members of the RKHS of invariant functions. We extend the methods in the literature to the hyperspherical domain, giving a blueprint for similar constructions on other Riemannian manifolds. We conduct experiments that support our theoretical findings. We demonstrate that incorporating invariance to some subgroup of transformations is sufficient to achieve good performance, a result of key importance when the full invariance is not known or is too expensive to compute. We also demonstrate the robustness of the invariant kernel method, in the case where the target function deviates from true invariance. We apply these methods to solve a real-world design problem from nuclear fusion engineering that is known to exhibit invariance; our invariance-aware method finds better solutions in fewer samples compared to standard approaches.

2 Problem statement

We begin by providing an overview of invariant function spaces. Consider a finite subgroup of isometries, G, on a compact manifold, X, embedded in Rd. A function f : X R is invariant to the action of G if, for all σ G,

f σ = f i.e. f(x) = f(σ(x)) x X. (1)

We provide some examples of invariant functions for different groups G in Figure 1.

In this paper, we study the optimisation of invariant functions that belong to a reproducing kernel Hilbert space (RKHS). The invariance of the RKHS elements translates to invariance properties of the kernel. For k : X X R, we say that k is simultaneously invariant to the action of G if, for all σ G and x, y X, k(σ(x), σ(y)) = k(x, y), (2) and is totally invariant to the action of G if, for all σ, τ G and x, y X,

k(σ(x), τ(y)) = k(x, y), (3)

as introduced in [18]. Note that both inner product kernels k(x, y) = κ ( x, y ) and stationary kernels k(x, y) = κ( x y ) satisfy simultaneous invariance, as G consists of isometries (which preserve distances and inner products).

Next, we outline the relationship between the invariance of the kernel and the invariance of the functions in the associated RKHS. In particular, the following proposition (Proposition 1) identifies the space of invariant functions as a subspace of the RKHS of a simultaneously invariant kernel. Moreover, it turns out that this subspace is an RKHS in its own right, with a corresponding kernel that is totally invariant. For a proof of these properties, see Appendix A.1. Proposition 1 (RKHS of invariant functions). Let X and G be as defined above. Consider a positive definite and simultaneously invariant kernel k : X X R. The kernel defines a reproducing kernel Hilbert space Hk, whose elements are functions f : X R, and a corresponding Hilbert space norm Hk. Define the symmetrization operator SG : Hk Hk, such that

SG(f) = 1 |G|

σ G f σ. (4)

Figure 2: UCB run with standard k and invariant kernel k G applied to an invariant objective; queried points in white. UCB with the invariant kernel requires significantly fewer samples to find the optimum.

Then, SG is a self-adjoint projection operator, whose image Im[SG] is the subspace of Hk that contains G-invariant functions. Moreoever, Im[SG] is an RKHS in its own right, Hk G. For x, y X, the reproducing kernel of Hk G is

k G(x, y) = 1 |G|

σ G k(σ(x), y). (5)

We consider a kernelised optimisation setup with bandit feedback, in which the goal is to find the maximum of an unknown function f : X R that is invariant to a known group of transformations G. We model f as belonging to the invariant RKHS Hk G from Proposition 1. Without further smoothness assumptions, this task is intractable; similarly to the common non-invariant setting, we make the regularity assumption that f has bounded RKHS norm, f Hk G < B. The learner is given k and G. In Section 3.2, we explore setting k in k G to common kernels, such as those from the Matérn family.

At each round t, the learner selects a point xt X and receives a noisy function observation

yt = f(xt) + ηt, (6)

where ηt is drawn independently from a σn-sub-Gaussian noise distribution. The agent reports a point x t and incurs a corresponding simple regret, defined as

rt = arg max x X f(x) f(x t ). (7)

After T rounds, the cumulative regret is given by

arg max x X f(x) f(x t ) . (8)

In the next section, we quantify the impact of invariance on the number of function samples required to achieve a given regret, known as the sample complexity.

3 Sample complexity bounds for invariance-aware Bayesian optimisation

The bandit problem outlined in the previous section can be tackled using Bayesian optimisation (BO) [17]. Intuitively, we expect BO algorithms that utilize the G-invariance of the target function to achieve improvements in sample complexity, as each observed x simultaneously provides information about additional transformed locations σ(x) for each σ G (see Figure 1).

In this section, we quantify the performance improvements by deriving bounds on the sample complexity of BO with totally invariant kernels (Equation (3)). We begin with an outline of the BO framework.

3.1 Bayesian optimisation of invariant functions

In Bayesian optimisation, predictions from a Gaussian process (GP) are used in conjunction with an acquisition function to select points to query [17]. After collecting a sequence of input-observation pairs, D = {(x1, y1), . . . , (xt, yt)}, the learner models the invariant target function f Hk G using a GP with a totally invariant kernel, GP(0, k G). The posterior predictive distribution of the function value at a specified input location x will be Gaussian with mean and variance given by

µt(x) = k T G(KG + λI) 1y (9)

σ2 t (x) = k G(x, x) k T G(KG + λI) 1k G, (10)

Algorithm 1 Max. variance reduction [42]

Input: k G, X 1: initialize D = {} 2: for t = 1, . . . , T do 3: Select xt = arg maxx X σt 1(x) 4: Observe yt = f(xt) + ηt 5: Append (xt, yt) to D 6: Update µt, σt from Eqs. 9, 10 7: end for 8: return ˆx MVR opt = arg maxx X µT (x)

Algorithm 2 Upper confidence bound [37]

Input: k G, X 1: initialize D = {} 2: for t = 1, . . . , T do 3: Select xt = arg maxx X µt 1(x)+βσt 1(x) 4: Observe yt = f(xt) + ηt 5: Append (xt, yt) to D 6: Update µt, σt from Eqs. 9, 10 7: end for 8: return ˆx UCB opt = x T

where k G = [k G(x, x1), . . . , k G(x, xt)]T Rt 1, KG is the kernel matrix with elements [KG]i,j = k G(xi, xj) i, j 1, . . . , t, y = [y1, . . . , yt]T , and λ is a regularising term. Note that as this GP has a totally G-invariant kernel, k G, it can be interpreted as a distribution over G-invariant functions.

We consider two common acquisition functions. The tightest simple regret bounds in the literature are for the Maximum Variance Reduction algorithm (MVR), in which the learner iteratively queries the points with highest uncertainty before reporting the point with highest posterior mean [42]. Another algorithm that has been widely studied is Upper Confidence Bound (UCB), which balances exploration and exploitation by behaving optimistically in the face of uncertainty [37]. Pseudocode for MVR and UCB is provided in Algorithm 1 and Algorithm 2, respectively. Our analysis is framed in terms of MVR and simple regret, although in Section 4.1 we also investigate empirically the cumulative regret performance of UCB.

A popular family of kernels used in BO is the Matérn kernel. On Rd, the Matérn kernel with parameters l, ν is given by

k(x, x ) = 21 ν

where Kν is the modified Bessel function of the second kind. The parameter ν controls the differentiability of the corresponding GP, which enables Matérn kernels to define a diverse range of function spaces. As the Matérn kernel k is simultaneously invariant (Equation (2)), we can construct a totally invariant k G and corresponding RKHS following Proposition 1. The bounds that we derive both hold for the Matérn family; our upper bound also extends to any kernel that exhibits polynomial eigendecay.

In Figure 2, we provide a visual example of the impact of using k G in Bayesian optimisation. We run UCB on a group-invariant objective with the standard and invariant kernels. With the non-invariant kernel k, the algorithm spends many samples querying suboptimal points with similar objective values. In contrast, using the invariant kernel k G allows the algorithm to identify suboptimal regions across the parameter space with only a handful of samples, resulting in significantly faster convergence.

3.2 Upper bounds on sample complexity

In this section, let X = Sd 1 Rd and let G be a finite subgroup of the orthogonal group, O(d), equipped with its natural action on X. Let k be a simultaneously invariant kernel on Sd 1, and assume that the eigenvalues of the associated integral operator decay at a polynomial rate µk = O(k β p), for some constant β p > 1. Many common kernels exhibit this property, such as the standard Matérn kernel on Rd. For GPs defined intrinsically on Riemannian manifolds [45], the corresponding Matérn kernel also exhibits polynomial eigendecay [9]. We impose the following assumption on the spectra of elements in the group G, which is in line with the assumptions and spectral decay rates in [4]. Assumption 1. All elements σ G satisfy the following spectral property:

mλ < d 4 if λ = 1, (12)

for all λ Spec(σ), where mλ is the multiplicity of λ.

This assumption excludes certain groups in low dimensions, but we expect it can be relaxed through tighter analysis. In fact, our experiments for d = 2, 3 show it is essential as a theoretical artefact only.

Nevertheless, Equation (12) holds for all finite orthogonal groups in dimension d 10 (see Lemma 1) as well as for many groups of lower dimension; for example, groups for which mλ = 1, so that only distinct eigenvalues are present in the spectra. Under this assumption, we have the following upper bound on sample complexity. Theorem 1 (Upper bound on sample complexity of invariance-aware BO). Let k be a polynomial eigendecay kernel on Sd 1 which is simultaneously invariant w.r.t. the orthogonal group O(d), and let G O(d) be a subgroup of O(d) satisfying Assumption 1. Let k G be the totally invariant kernel, as defined in Proposition 1. Consider noisy bandit optimisation of f Hk G, where f Hk G < B and f : Sd 1 R. Then, the maximum information gain (MIG) for the totally invariant kernel after T rounds, γG T , is bounded by

γG T = O 1 |G|T

where O( ) hides logarithmic factors.

Moreover, for the MVR algorithm with invariant kernel k G to achieve expected simple regret E[r T ] < ϵ, it must hold that

1 |G| β p β p d+1 ϵ 2β p β p d+1 !

For the Matérn kernel, this corresponds to

T = O 1 |G| 2ν+d 1

2ν ϵ 2ν+d 1

Proof. See Appendix A.2.

Our upper bound demonstrates that incorporating the group invariance into the kernel is guaranteed to result in sample efficiency improvements that increase with the size of the group, due to the 1 |G| factor. Our result for the MIG γ of invariant kernels is of independent interest, as it is applicable to a wide range of scenarios. Notably, the linear 1 |G| improvement in the information gain from Equation (13) resembles the linear 1 |G| improvement in generalisation error in the invariant kernel ridge regression setting [3]. Intuitively, this reflects how the MIG criterion is determined solely by the ability of the kernel to approximate the objective function, and is agnostic to the algorithm used in optimisation.

In our proof, we begin with kernels on Sd 1 that are simultaneously invariant w.r.t. the full orthogonal group O(d), and use Proposition 1 to construct kernels that totally invariant kernels w.r.t finite subgroups G O(d). Simultaneously O(d)-invariant kernels on Sd 1 are in fact dot-product kernels, which allows us to use standard versions of Mercer s theorem in terms of known spectral decompositions. If an alternative decomposition is known for a particular k, then it is sufficient that k instead satisfies the weaker condition of simultaneous invariance w.r.t. G only.

3.3 Lower bounds on sample complexity

In the next theorem, we present a lower bound on sample complexity which nearly matches the upper bound in its dependence on |G|. Theorem 2 (Lower bound on sample complexity of invariance-aware BO with Matérn kernels). Let X be one of [0, 1]d or Sd. Let k be the standard Matérn kernel on Rd with smoothness ν, and G be a finite group of isometries acting on X. Let k G be the totally invariant kernel constructed from k and G according to Proposition 1. Consider noisy bandit optimisation of f Hk G, f : X R, where the observations are corrupted by σ-sub-Gaussian noise. Then, for sufficiently small ϵ

B ,there exists a function family FX (ϵ, B) such that the number of samples required for an algorithm to achieve simple regret less than ϵ for all functions in FX with probability at least 1 δ is bounded by:

T = Ω 1 |G| ν+d 1

Proof. See Appendix A.3.

Our lower bound guarantees that algorithms that are run for a number of sample less than T fail to be ϵ-optimal with high probability. At fixed T, there is a trade-off between this probability and the order of the group. However, when combined with the upper bound our results imply that as the order of the group increases, the probability that an algorithm is ϵ-optimal after a fixed number of iterations also increases.

We recover the previously known lower and upper bounds on sample complexity for non-invariant optimisation from [12] and [42] respectively by setting |G| = 1. We also note that there is a gap between our bounds, as the exponent of 1 |G| in Equation (15) is 1+ d 1

ν but is 1+ d 1

2ν in Equation (16). A likely reason for this gap is in the lower bound, which uses a packing argument for the support sets of functions in the RKHS. While in the absence of the group action, the lower bound and upper bound agree, it is not clear a priori how to achieve the tightest packing of invariant functions.

To achieve the largest improvement in sample complexity, one would like to choose the largest possible G when constructing k G. However, as we will see in our experimental section, there is a computational tradeoff between the sample complexity and the computation of SG, as in the case of very large groups the sums over the group elements becomes a computational bottleneck.

4 Experiments

We demonstrate the performance gains from incorporating known invariances in a number of synthetic and real-world optimisation tasks. Previous work has demonstrated the sample efficiency improvements from applying invariance-aware algorithms to regression tasks [3, 33, 38]; our experiments complement the literature by observing the corresponding improvements in optimisation tasks. Code for our experiments and implementations of invariant kernels can be found on Git Hub1.

4.1 Synthetic experiments

For each of our synthetic experiments, our goal is to find the maximum of a function drawn from an RKHS with a group-invariant isotropic Matérn-5/2 kernel. The groups we consider act by permutation of the coordinate axes, and therefore include reflections and discrete rotations. For Perm Inv-2D (Figure 1a) and Perm Inv-6D, the kernel of the GP is invariant to the full permutation group which acts on R2 and R6 respectively by permuting the coordinates. For Cycl Inv-3D it is invariant to the cyclic group acting on R3 by permuting the axes cyclically (Figure 1b). We provide the learner with the true kernel and observation noise variance to eliminate the effect of online hyperparameter learning. For more detail on our experiments, see Appendix B.

In Figure 3, we show that the algorithm s performance on invariant functions is improved by using an invariant kernel. In all cases, the performance of the invariance-aware algorithm significantly outperforms the baselines, converging to the optimum with many fewer samples. Notably, in the Perm Inv-6D task, the baseline algorithms fail to find the true optimum, whereas the permutationinvariant versions do so without difficulty. We also note that the performance improvement increases with increasing dimension and group size, as predicted by our regret bounds.

For the Perm Inv tasks, we provide an additional baseline: constrained Bayesian optimisation (CBO, [16]). In this case, we constrain the search space of the acquisition function optimiser to the fundamental region of the group s action. For arbitrary groups, finding an analytical formula for this region can be difficult, particularly when the dimension of the domain is high and the group is complicated; however, it is straightforward in the case of the full permutation group. We find that constrained BO performs worse than our method, and in high dimensions is identical to standard (non-invariant) BO. Intuitively, this matches our expectations, as in CBO the kernel is not aware of all of the structure that the RKHS elements possess that is, while the domain of exploration is restricted, the function class is not. Observations in CBO contribute no information across the boundaries of the region, unlike in the invariant kernel case, and so the performance improvement of CBO is upper bounded by 1 |G|.

For the Perm Inv-6D task, we also consider the case where the full invariance is not fully known to the learner; that is, the kernel is not totally invariant w.r.t. the full group, but only a subgroup. Our results show that incorporating any amount of the true invariance results in performance improvements. In this case, the true function is invariant to the full permutation group in 6D (720 elements). We evaluate

1github.com/theo-brown/invariantkernels and github.com/theo-brown/bayesopt_with_invariances

0 25 50 75 100 125 Number of evaluations

Cumulative regret

Perm Inv-2D

Standard Constrained Invariant

0 50 100 150 200 250 Number of evaluations

Cumulative regret

Cycl Inv-3D

Standard Invariant

0 200 400 600 Number of evaluations

Cumulative regret

Perm Inv-6D

Standard Constrained 3-block inv. 2-block inv. Fully inv.

0 25 50 75 100 125 Number of evaluations

Simple regret

Perm Inv-2D

Standard Constrained Invariant

0 50 100 150 200 250 Number of evaluations

Simple regret

Cycl Inv-3D

Standard Invariant

0 200 400 600 Number of evaluations

Simple regret

Perm Inv-6D

Standard Constrained 3-block inv.

2-block inv. Fully inv.

Figure 3: Regret performance of invariant UCB and MVR algorithms across 3 different tasks; lower is better. Non-invariant kernels (blue) are outperformed by the full group invariant kernel (red) as well as partially specified (subgroup-invariant) kernels (green and yellow). For the permutation invariant function, the search space of standard BO can be constrained by the fundamental domain of the group (purple), but this performs worse than the invariant kernel. Mean s.d., 32 seeds.

the performance of the full invariant kernel, the non-invariant kernel, and two subgroup invariant kernels. The subgroups we consider consist of block permutations, which involve reordering the coordinate indices in chunks of length 2 and 3 (leading to groups with 6 and 2 elements respectively). For both algorithms, we see that all kernels that incorporate invariance outperform the non-invariant kernel, even when the difference in the group size considered is large.

4.2 Extension: quasi-invariance using additive kernels

In real-world scenarios, it is possible that the underlying function only exhibits approximate symmetry. For example, in the fusion application from Section 4.3, the launchers may have slightly different properties and are no longer interchangeable. A 2D example of this quasi-invariance is given in the first column of Figure 4. Although the large-scale features (such as the layout of the peaks) remain approximately symmetric, invariance is not maintained in the detail (the values and shapes of the peaks).

We model quasi-invariance by introducing an additive non-invariant disturbance to the target function. The function can then be considered as belonging to the RKHS H(1 ε)k G+εk , where k G is a Ginvariant kernel, k is a non-invariant kernel, and ε sets the degree of deviation from invariance. In our experiments, we observe that performing BO with the fully invariant kernel still results in significant performance improvements over the non-invariant kernel when ε is small, achieving performance comparable to BO with the kernel of the function s RKHS. As this is a kind of model misspecification, we refer the reader to [5] for a theoretical discussion on the performance of BO with misspecified kernels.

4.3 Real-world application: nuclear fusion scenario design

In this section, we use invariance-aware BO to efficiently solve a design task from fusion engineering.

The leading concept for a controlled fusion power plant is the tokamak, a device that confines a hightemperature plasma using powerful magnetic fields. Finding steady-state configurations for tokamak plasma that are simultaneously robustly stable and high-performance is a challenging task. Moreover, evaluating the plasma operating point requires high-fidelity multi-physics simulations, which take several hours to evaluate one configuration due to the complexity and characteristic timescales of the physics involved [34]. Iterating on a design requires many such simulations, which has led to an increased focus on highly sample-efficient algorithms for tokamak design optimisation [13, 32, 10, 31].

We consider a scalarised version of the multi-objective optimisation task presented in [10], in which the goal is to shape the output of a plasma current actuator to maximise various performance and ro-

(a) ε = 0.01

0.00 0.25 0.50 0.75 1.00 x1

1.5 1.2 0.9 0.6 0.3 0.0 0.3 0.6 0.9 1.2

0 25 50 75 100 125 Number of evaluations

Simple regret

Standard Permutation invariant Quasi-invariant (truth)

0 25 50 75 100 125 Number of evaluations

Cumulative regret

Standard Permutation invariant Quasi-invariant (truth)

(b) ε = 0.05

0.00 0.25 0.50 0.75 1.00 x1

0 25 50 75 100 125 Number of evaluations

Simple regret

Standard Permutation invariant Quasi-invariant (truth)

0 25 50 75 100 125 Number of evaluations

Cumulative regret

Standard Permutation invariant Quasi-invariant (truth)

Figure 4: Performance of MVR and UCB on quasi-invariant functions. Regret shown for the noninvariant kernel k (standard, blue), the invariant kernel k G (invariant, green), and the quasi-invariant kernel k G + εk (additive, red). In all cases, the invariant kernel performs almost as well as the true quasi-invariant kernel.

bustness criteria based on the so-called safety factor profile. Using a weighted sum, we reduce the vector objective to [0, 1], with component weights selected to represent the proposed SPR45 design of the STEP tokamak [39, 31]. As the objective components are in direct competition, the highest achievable scores are around 0.7 0.8, while the lowest-scoring converged solutions achieve around 0.4 0.5.

In previous work, the actuator output was represented by arbitrary 1D functions (piecewise linear [31] and Bézier curves [10]). However, in practice the actuator will have a finite number of launchers that drive current in a localised region, and so will be unable to accurately recreate arbitrary profiles. In contrast to previous work, we directly optimise a parameterisation that reflects the actuator s real capabilities: a sum of 12 Gaussians, where each models a launcher targeted at a different point in the plasma cross-section (Figure 5a). In this setting, the objective function is f : R12 R, where the input is the normalised radial coordinate of the 12 launchers. As all of the launchers have identical behaviour, the objective is invariant to permutations. However, the full invariant kernel is too costly to compute (|G| = 12! = 479 106), so we instead use a partially specified kernel (3-block invariant, |G| = 4! = 24).

An additional challenge of this task is that large regions of parameter space are infeasible, corresponding to simulations that will not converge to a steady-state solution and will therefore be

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Normalised radial position

Normalised power

Total Launcher 1 Launcher 2 Launcher 3 Launcher 4 Launcher 5

(a) Example current drive actuator profile, shown for 5 launchers

0 2 4 6 8 10 12 14 16 Optimisation step

Objective value

Safety factor optimisation

Standard Invariant

(b) Performance comparison (mean of 5 seeds)

Figure 5: Nuclear fusion application: optimising safety factor by adjusting current drive actuator. In (a), observe that the order of launchers can be permuted without changing the total profile. Incorporating this invariance into the kernel of the UCB algorithm achieves improved performance (b).

3-block inv.

3-block aug.

2-block inv.

2-block aug.

GPU memory (Gi B)

(a) Memory cost

3-block inv.

3-block aug.

2-block inv.

2-block aug.

Time to fit (ms)

(b) Time cost

Figure 6: Effect of group size on cost of data augmentation and invariant kernel methods. Benchmark task is to fit a 6D GP with the given kernel to 100 random datapoints from Perm Inv-6D. Shown are results from 100 seeds, 64 repeats per seed, performed on one NVIDIA V100-32GB GPU using Bo Torch. Invariant kernels are (a) more memory efficient than data augmentation, and (b) can be computed faster. Incorporating full invariance via data augmentation exceeds the GPU memory.

impossible to observe. Selecting an appropriate method of dealing with this situation can impact the optimiser s performance, and is an open area of research. In this work, we choose to assign a fixed low score to failed observations for simplicity. As MVR is purely exploratory it has no mechanism that discourages querying points from infeasible regions, leading to many failed observations. UCB is a more natural choice in this setting, as the β parameter introduces a balance between querying unknown regions (that could be infeasible) and exploiting known high-scoring regions.

In Figure 5b, we see that the invariant version of UCB achieves significantly improved performance compared to the non-invariant version. In fact, the non-invariant algorithm totally fails to find the small region of high-performance solutions, and instead queries many suboptimal profiles that are similar under the action of permutations. The high sample efficiency of the invariant kernel method will enable tokamak designers to iterate on designs faster, with fewer calls to expensive simulations.

5 Limitations

Although the number of terms involved in computing the totally invariant kernel only scales linearly with the group size (Equation (5)), this cost can still become a bottleneck for very large groups; for example, the size of the permutation group scales with the factorial of the dimension, leading to |G| = 24 for d = 4, but |G| = 479 106 for d = 12. In Figure 6 we compare the compute and memory costs of performing a marginal log likelihood fit of the GP hyperparameters for each of the kernels in the previous section, given 100 observations from the Perm Inv-6D function. The standard, 3-block invariant, and 2-block invariant kernel are faster than the fully invariant kernel by a factor of 2. As we observed that the 2and 3-block invariant kernels still achieve significantly improved sample efficiency over the standard kernel (Figure 3), we propose that subgroup approximations of the full invariance group can be used as a low-cost alternative when the group size is large, reducing computational cost while maintaining improved performance. Finally, we note that our invariant kernel method scales significantly more favourably than data augmentation, both in terms of memory and compute.

6 Conclusion

We have developed new upper bounds on the sample complexity in Bayesian optimization with invariant kernels, highlighting the advantages gained from symmetries that hold for a broad range of practical groups. We also derived novel lower bounds by constructing and quantifying members of the RKHS of invariant functions, extending this approach to the hyperspherical domain and providing a framework for other Riemannian manifolds. Empirically, our experiments show that even partial invariance can effectively improve performance, which is crucial when full invariance is unavailable or computationally intensive. Additionally, we applied these methods to a real-world problem in nuclear fusion engineering, achieving superior outcomes with fewer samples compared to standard methods.

Broader Impact. The incorporation of symmetries in Bayesian optimization can have a profound impact on science and engineering. By exploiting symmetrical properties in various applications, such as nuclear engineering, material and drug design, and robotic control, it can reduce the number of samples needed to achieve (near) optimal solutions, thus accelerating research and development processes.

Acknowledgments

TB was supported by the Engineering and Physical Sciences Research Council (EPSRC) grants EP/T517793/1 and EP/W524335/1. IB was supported by the EPSRC New Investigator Award EP/X03917X/1; EPSRC grant EP/S021566/1; and Google Research Scholar award.

[1] Mohsen Ahmadi et al. Predicting potent compounds via model-based global optimization . In: Journal of Chemical Information and Modeling 53.3 (2013), pp. 553 559. [2] Maximilian Balandat et al. Bo Torch: a framework for efficient Monte-Carlo Bayesian optimization . In: Advances in Neural Information Processing Systems. Vol. 33. 2020. [3] Alberto Bietti and Francis Bach. Deep equals shallow for Re LU networks in kernel regimes . In: International Conference on Learning Representations. 2021. [4] Alberto Bietti, Luca Venturi, and Joan Bruna. On the sample complexity of learning under geometric stability . In: Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 18673 18684. [5] Ilija Bogunovic and Andreas Krause. Misspecified Gaussian process bandit optimization . In: Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 3004 3015. [6] Ilija Bogunovic et al. A robust phased elimination algorithm for corruption-tolerant Gaussian process bandits . In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 23951 23964. [7] Ilija Bogunovic et al. Adversarially robust optimization with Gaussian processes . In: Advances in Neural Information Processing Systems. Vol. 31. 2018, pp. 5765 5775. [8] Ilija Bogunovic et al. Truncated variance reduction: a unified approach to Bayesian optimization and level-set estimation . In: Advances in Neural Information Processing Systems. Vol. 29. 2016, pp. 1515 1523. [9] Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, et al. Matérn Gaussian processes on Riemannian manifolds . In: Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 12426 12437. [10] Theodore Brown et al. Multi-objective Bayesian optimization for design of Pareto-optimal current drive profiles in STEP . In: IEEE Transactions on Plasma Science 99 (2024), pp. 1 6. [11] RJ Buttery et al. DIII-D research to prepare for steady state advanced tokamak power plants . In: Journal of Fusion Energy 38 (2019), pp. 72 111. [12] Xu Cai and Jonathan Scarlett. On lower bounds for standard and robust Gaussian process bandit optimization . In: Proceedings of the 38th International Conference on Machine Learning. Vol. 139. 2021, pp. 1216 1226. [13] Ian Char et al. Offline contextual Bayesian optimization . In: Advances in Neural Information Processing Systems. Vol. 32. 2019, pp. 4627 4638. [14] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits . In: Proceedings of the 34th International Conference on Machine Learning. 2017, pp. 844 853. [15] Taco Cohen and Max Welling. Group equivariant convolutional networks . In: Proceedings of the 33rd International Conference on Machine Learning. Vol. 48. 2016, pp. 2990 2999. [16] Jacob R Gardner et al. Bayesian optimization with inequality constraints . In: Proceedings of the 31st International Conference on Machine Learning. Vol. 32. 2014, pp. 937 945. [17] Roman Garnett. Bayesian optimization. Cambridge University Press, 2023. [18] Bernard Haasdonk and Hans Burkhardt. Invariant kernel functions for pattern analysis and machine learning . In: Machine Learning 68 (2007), pp. 35 61. [19] David Janz, David Burt, and Javier González. Bandit optimisation of functions in the Matérn kernel RKHS . In: International Conference on Artificial Intelligence and Statistics. Vol. 108. 2020, pp. 2486 2495. [20] Noémie Jaquier et al. Geometry-aware Bayesian optimization in robotics using Riemannian Matérn kernels . In: Conference on Robot Learning. 2022, pp. 794 805. [21] Michael Kapovich. A note on properly discontinuous actions. 2023. eprint: arxiv:2301. 05325.

[22] Parnian Kassraie and Andreas Krause. Neural contextual bandits without regret . In: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics. Vol. 151. 2022, pp. 240 278. [23] Parnian Kassraie, Andreas Krause, and Ilija Bogunovic. Graph neural network bandits . In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 34519 34531. [24] Danial Khatamsaz et al. A physics-informed Bayesian optimization approach for material design: application to Ni Ti shape memory alloys . In: npj Computational Materials 9.1 (2023), p. 221. [25] Jungtaek Kim et al. Bayesian optimization with approximate set kernels . In: Machine Learning 110 (2021), pp. 857 879. [26] Johannes Kirschner et al. Tuning particle accelerators with safety constraints using Bayesian optimization . In: Physical Review Accelerators and Beams 25 (6 2022), p. 062802. [27] Imre Risi Kondor. Group theoretical methods in machine learning . Ph D thesis. Columbia University, 2008. [28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Image Net Classification with deep convolutional neural networks . In: Advances in Neural Information Processing Systems. Ed. by F. Pereira et al. Vol. 25. 2012. [29] Yann Lecun et al. Learning algorithms for classification: a comparison on handwritten digit recognition . In: Neural networks: The statistical mechanics perspective. World Scientific, 1995, pp. 261 276. [30] Benjamin Letham et al. Constrained Bayesian Optimization with noisy experiments . In: Bayesian Analysis 14.2 (2019), pp. 495 519. [31] S. Marsden et al. Using genetic algorithms to optimise current drive in STEP . In: Europhysics Conference Abstracts. Vol. 46A. 2022, P2b.123. [32] Viraj Mehta et al. Exploration via planning for information about the optimal trajectory . In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 28761 28775. [33] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Learning with invariances in random features and kernel models . In: Conference on Learning Theory. 2021, pp. 3351 3418. [34] Michele Romanelli et al. JINTRAC: a system of codes for integrated simulation of tokamak scenarios . In: Plasma and Fusion Research 9 (2014), pp. 3403023 3403023. [35] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization . In: Proceedings of the 2017 Conference on Learning Theory. Vol. 65. 2017. [36] Shubhanshu Shekhar and Tara Javidi. Gaussian process bandits with adaptive discretization . In: Electronic Journal of Statistics 12 (2018), pp. 3829 3874. [37] Niranjan Srinivas et al. Gaussian process optimization in the bandit setting: no regret and experimental design . In: Proceedings of the 27th International Conference on Machine Learning. 2010, pp. 1015 1022. [38] Behrooz Tahmasebi and Stefanie Jegelka. The exact sample complexity gain from invariances for kernel tegression . In: Advances in Neural Information Processing Systems. Vol. 36. 2023, pp. 55616 55646. [39] Emmi Tholerus et al. Flat-top plasma operational space of the STEP power plant . In: Nuclear Fusion 64 (Aug. 2024). [40] Atsushi Togo and Isao Tanaka. Spglib: a software library for crystal symmetry search. https://github.com/spglib/spglib. 2018. eprint: ar Xiv:1808.01590. [41] Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in Gaussian process bandits . In: International Conference on Artificial Intelligence and Statistics. Vol. 130. 2021, pp. 82 90. [42] Sattar Vakili et al. Optimal order simple regret for Gaussian process bandits . In: Advances in Neural Information Processing Systems 34 (2021), pp. 21202 21215. [43] Michal Valko et al. Finite-time analysis of kernelised contextual bandits . In: ar Xiv:1309.6869 (2013). [44] Elise Van der Pol et al. MDP homomorphic networks: group symmetries in reinforcement learning . In: Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 4199 4210.

[45] Peter Whittle. Stochastic processes in several dimensions . In: Bulletin of the International Statistical Institute 40.2 (1963), pp. 974 994. [46] Mark van der Wilk et al. Learning invariances using the marginal likelihood . In: Advances in Neural Information Processing Systems. Vol. 31. 2018, pp. 9960 9970. [47] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning. 3. MIT Press Cambridge, MA, 2006. [48] Yunxing Zuo et al. Accelerating materials discovery with Bayesian optimization and graph deep learning . In: Materials Today 51 (2021), pp. 126 135.

A.1 Proof of Proposition 1 (RKHS of invariant functions)

The kernel k defines a reproducing kernel Hilbert space, Hk, with inner product , Hk and norm f Hk = p

f, f Hk for f Hk.

Recall from Equation (4) that SG(f) = 1 |G| P

We begin by showing that SG is continuous and bounded. Observe that simultaneous invariance determines the following property of the representer elements kx( ) = k(x, ) = k( , x):

kx σ = kσ 1(x) σ G. (17)

Now observe that:

||kx σ||2 = kx σ, kx σ = kσ 1(x), kσ 1(x) = k(σ 1(x), σ 1(x)) = k(x, x) = ||kx||2

If f = Pn i=1 kxi, we have, similarly:

||f σ||2 = ||

i=1 kσ 1(xi)||2 =

j=1 k(σ 1(xi), σ 1(xj)) =

j=1 k(xi, xj) = ||f||2 (19)

Now, let fn f be a Cauchy sequence, where each fn lies in the span of the kx elements. We have

||f σ|| = || lim fn σ|| = lim ||fn σ|| = lim ||fn|| = ||f|| (20)

||SGf|| 1 |G|

σ G ||f σ|| ||f|| (21)

so SG is continuous, with ||SG|| 1.

We now verify that the image of SG, denoted as Im(SG), contains precisely the G-invariant functions (Equation (1)). For f Hk, σ G, we have

SG(f) σ = 1 |G|

τ G (f τ) σ (22)

τ G f (τ σ) (23)

σ G f σ (24)

= SG(f), (25)

where Equation (23) follows by associativity of G, and Equation (24) follows by letting σ := τ σ and using the fact that G is closed under the operator. Hence, SG(f) is invariant to σ G (Equation (1)).

Next, we show that the image of SG is a reproducing kernel Hilbert space with kernel k G (Equation (5)). First note that SG is a projection, as applying SG twice gives

SG(SG(f)) = 1 |G|

σ G SG(f) σ (26)

σ G SG(f) (27)

= SG(f), (28)

where Equation (27) follows by applying Equation (25).

Moreover, SG is self-adjoint. Let f Hk be an element in the span of the representer elements kx, such that f( ) = Pn i=1 αikxi( ), and let g Hk. Then,

SG(f), g Hk = 1 |G|

σ G αikσ 1(xi), g Hk (30)

σ G αig(σ 1(xi)) (31)

σ αikxi, g σ 1 Hk (32)

σ f, g σ Hk (33)

= f, SG(g) Hk, (34)

where Equation (30) follows by expanding f and recalling that for simultaneously invariant kernels kx σ = k(σ 1(x), ), Equation (31) follows by using the reproducing property of kσ(x), Equation (32) uses the decomposition of g σ in terms of the reproducing element kxi, and Equation (33) uses the definition of f.

Now, as before, let fn f be a Cauchy sequence, where fn lies in the span of the kx elements. We have

lim fn, SG(g) Hk = lim fn, SG(g) = lim SG(fn), g = lim SG(fn), g = SG(lim fn), g Hk, (35) by continuity of the inner product. Hence, SG is self-adjoint for all f, g Hk.

As SG is a projection (Equation (28)) and is self-adjoint Im(SG) is a closed subspace of Hk. Now, denoting the evaluation functional Lx : Hk R, Lx(f) = f(x), note that since Im(SG) is closed, then if Lx was bounded on Hk, it remains bounded on Im(SG). Hence Im(SG) is itself a RKHS, which we ll denote as H .

We will now relate k H , the kernel associated with H , to k and G. Note that, for f H ,

f(x) = f, kx = SG(f), kx = f, SG(kx) . (36)

So SG(kx) is the reproducing element for H , giving us

k H (x, y) = SG(ky), SG(kx) = SG(ky), kx = 1 |G|

σ G k(σ(x), y), (37)

which recovers the formula for the kernel in Equation 5.

Finally, note that SG(ky), SG(kx) recovers the more general "double sum" formula for producing a totally invariant kernel out of a kernel which is not necessarily simultaneously invariant [18, 33]:

k G(x, y) = 1 |G|2 X

σ,τ G k(σ(x), τ(y)). (38)

A.2 Proof of Theorem 1: upper bound on sample complexity

In this section we derive Theorem 1, the upper bound on regret for Bayesian optimisation algorithms that incorporate invariance. The main elements of this proof were introduced in [41]. Our derivation is similar to the one in section C.1 of [22], but extends that work to more general groups and kernels.

In Appendix A.2.1, we present a concise summary of already known bounds on the information gain of dot-product kernels with polynomial eigendecay on the hypersphere. In Appendix A.2.2, we will adapt this argument for information gain of totally invariant versions of these kernels. We then use the bounds on information gain to bound the regret in Appendix A.2.3.

We begin with the following remark which we use throughout this section. Namely, with the assumptions of Theorem 1, a kernel k on Sd 1 which is simultaneously invariant with respect to the whole of O(d) is a dot-product kernel. To see this, choose r [ 1, 1] and pick xr, yr such that xr, yr = r and define κ(r) = k(xr, yr). Observe that κ is well defined, as for any other choice of (x r, y r) there is a (non-unique) isometry M (x r,y r) (xr,yr) such that

x r = M (x r,y r) (xr,yr) xr (39)

y r = M (x r,y r) (xr,yr) yr (40)

and therefore k(xr, yr) = k(x r, y r). Hence, throughout this section we can consider dot-product kernels only.

A.2.1 Information gain for kernels on the hypersphere

Consider the hyperspherical domain Sd 1 = {x Rd s.t. x = 1}. The Mercer decomposition of a dot-product kernel on the hypersphere is

k(x T x ) =

i=1 Yi,k(x)Yi,k(x ), (42)

where Yi,p is the i-th spherical harmonic polynomial of degree k and N(d, k) is the number of spherical harmonics of degree k on Sd 1, given by

N(d, k) = 2k + d 2

k + d 3 d 2

Asymptotically in terms of k we have

N(d, k) = Θ kd 2 . (44)

In this representation, µk can be interpreted as an eigenvalue that is repeated N(d, k) times, with corresponding eigenfunctions Y1,k, . . . , YN(d,k),k. We let ψ be the maximum absolute value of the eigenfunctions, such that |Yi,k(x)| ψ x Sd 1.

We project onto a subspace of eigenfunctions corresponding to the first D eigenvalues. Let K be the number of distinct values in the first D eigenvalues. It follows that

k=0 N(d, k) (45)

k=0 kd 2 D c2

k=0 kd 2 (46)

d 1 D C2 Kd 1

D = Θ Kd 1 (48)

where Equation (46) follows by substituting Equation (44), and Equation (47) follows by bounding the sum with an integral, and ci, Ci are constants that ensure the asymptotic equalities hold.

Truncating the eigenvalues at D leaves a tail with mass

i=D+1 λiψ2 = ψ2 X

k=K µk N(d, k) (49)

k=K µkkd 2 (50)

k=K k β p+d 2 (51)

ψ2C3K β p+d 1 (52)

β p+d 1 d 1 (53)

where Equation (50) follows by substituting Equation (44), Equation (51) follows if the distinct eigenvalues have polynomial decay rate, µk = O(k β p);2, Equation (52) follows by bounding the sum with an integral; Equation (53) follows because D = Θ(Kd 1), and hence K = Θ(D 1 d 1 ).

Theorem 3 in [41] states that

2 log 1 + k T τD

where |k(x, x )| k x and τ is the observation noise variance. Substituting in Equation (53) gives

2 log 1 + k T τD

β p+d 1 d 1 (55)

We have freedom in setting D. We choose it so that the first term dominates, by setting

2 log 1 + k T

β p+d 1 d 1 (56)

This is satisfied by

τ log 1 + k T

which gives:

τ log 1 + k T

log 1 + k T

where we have used the fact that x x + 1. Hence,

β p log 1 d 1

β p T . (59)

A.2.2 Information gain for symmetrised kernels on the hypersphere

Our upper bound involves comparing the corresponding reduction in maximal information across the invariant vs the standard kernel. In this section we will use overbars to refer to properties of the symmetrised (invariant) kernel. We remind the reader that G O(d) is a finite subgroup of

2Matérn kernels on Sd 1 satisfy this property with β p = 2ν + d 1 [9].

isometries of the sphere Sd 1. Our aim will be to provide bounds on the asymptotics of the eigenspace dimensions N(d,k) N(d,k) where N(d, k) is defined as in A.2.1 and N(d, k) is the corresponding eigenspace dimension for the symmetrized kernel k G.

We will follow the exposition from [4], where asymptotic bounds on γ(d, k) := N(d,k) N(d,k) are given with two minor differences: 1. our treatment will handle any finite group of isometries and 2. we pay the price for generalizing to arbitrary groups by obtaining less strict decay rates, which will not trouble us for our analysis.

We will show the following: Lemma 1. With the notations described in this section, assume G O(d) for d 10 is a finite group. Then, if I G:

0 γ(d, k) C

k , for k odd, (60)

2 |G| γ(d, k) 2 |G| + C

k , for k even, (61)

and, otherwise: 1 |G| γ(d, k) 1 |G| + C

where C is a constant that depends on k, d and the spectra of elements in G.

Proof. Recall from [4], the formula:

γ(d, k) = 1 |G|

σ G γ(d, k, σ) = X

σ G Eµ [Pd,k ( σx, x )] , (64)

where µ is the uniform measure on the sphere Sd 1, and Pd,k is the corresponding Gegenbauer polynomial. To demystify the conditional statement in the Lemma, notice first that Pd,k ( x, x ) = Pd,k(1) = 1, and Pd,k ( x, x ) = Pd,k( 1) = 1 if k is even and -1 if k is odd. For all other σ G, we ll follow the argument in [4] to show that γ(d, k, σ) 0 with asymptotics at least O 1

We don t make significant contributions to the method in [4], here, other than to remark on the fact that many of the arguments they present follow through for finite subgroups of O(d).

First remark that for σ G O(d) there is a matrix representation Aσ, which admits a canonical basis in which the matrix elements are:

where each Rλ is a 2 2 matrix of the form a b b a with eigenvalues λ and λ.

Moreover, notice that since A|G| σ = I, all eigenvalues are roots of unity λ = e

q with q |G|. The the eigenvalues of such matrices are the same as the allowable eigenvalues of permutation matrices considered in [4], from which Proposition 6 gives:

γ(d, k, σ) Ck d+s + o(k d+s) (66)

where s = maxλ Spec(Aσ){mλ + 4 1(|λ| < 1)}. Now assume σ = I. We have m1 d 2, m 1 d 2 and mλ d

2 if λ = 1. Collectively this implies d + s 1 if d 10.

In the following assume that I / G. The corresponding argument for I G is similar. In view of Lemma 1

1 |G| N(d, k) N(d, k) 1 |G| + C

We take the same number of distinct eigenvalues for the invariant and noninvariant kernels, K. However, there will be different multiplicities for the eigenvalues, leading to a different number of eigenvalues D. The corresponding tail has mass

i= D+1 λi = ψ2 X

k=K µk N(d, k) (68)

k=K µk N(d, k) 1

k=K µk N(d, k) (70)

|G| δD (73)

where Equation (69) follows by substituting Equation (67), Equation (70) follows because 1

K for k > K, Equation (72) follows because K = Θ(D 1 d 1 ), and for D > Dϵ where Dϵ satisfies d! C

1 d 1 ϵ < ϵ.

Now, Theorem 3 in [41] gives:

γT D 2 log 1 + k T τ D

+ Tδ D 2τ , (74)

Proposition 2. If an > 0, bn > 0, with lim an = lim bn = , and lim bn/an = λ > 0, then ϵ > 0, N such that N > N we have P

n<N bn (1 ϵ)λ P

Proof. From the definition of the limit, noting that the 1 ϵ gives slack to bound λ P

n<N an above by ϵ P

Using this proposition on the sequences N(d, k) and N(d, k) we have further:

γT D(1 + ϵ)

2|G| log 1 + k T|G| τD(1 + ϵ)

2 log 1 + k T|G| τD(1 + ϵ)

2 log 1 + k T|G| τD(1 + ϵ)

β p+d 1 d 1

As previously we ll choose DT as a function of T such that the logarithmic term dominates (notice that since the first term is increasing in D any choice of DT greater than this will also satisfy), i.e.

DT log 1 + k T|G| τ(1 + ϵ)

β p+d 1 d 1 T (78)

log 1 + k T |G| τ(1+ϵ)

With this choice:

β p log 1 + k T|G| τ(1 + ϵ)

A.2.3 From information gain to sample complexity

Under the assumption of sub-Gaussian noise, it is shown in [42, Remark 2] that the maximum variance reduction algorithm presented in Algorithm 1 incurs simple regret satisfying

γT log(T d/δ)

with probability at least 1 δ, where δ (0, 1).

Note that we are able to use this result without modification, as our approach only incorporates G-invariance through the kernel; hence, any changes to the regret due to the G-invariance will only appear in the γT term. Substituting γT from Equation (80) into Equation (82) gives

|G| 1 2 T β p d+1

2β p (T|G|) log 1 2 T/δ

|G| 1 2 T α logα(T|G|) log 1 2 T/δ , (84)

where we have absorbed factors that do not depend on T or G, and let α = β p d+1

Our goal is for r T = O(ϵ). This can be achieved by setting

ϵ 1 α log 1

|G| 1 2α 1 1

with appropriate implied constants. Substitution of T from Equation (85) into Equation (84) makes it evident that for this T we have

r T = O(ϵ) + O(log log T + log log |G|). (86)

Substituting α into Equation (85) gives the final upper bound on sample complexity, which is

β p β p d+1 ϵ 2β p β p d+1 log

ϵ 2β p β p d+1

d 1 β p d+1

β p β p d+1

ϵ 2β p β p d+1

|G| β p β p d+1

For the Matérn kernel on the hypersphere, [9] showed that

β p = 2ν + d 1, (88)

and hence, the MVR algorithm with an invariant Matern GP defined on Sd 1 will achieve simple regret ϵ with probability 1 δ if

2ν ϵ 2ν+d 1

where O hides logarithmic factors in ϵ and |G|.

(a) Function invariant to 10-fold rotations

(b) Function invariant to all rotations along an axis

Figure 7: Examples of invariant functions on the sphere. The construction for our lower bound consists of functions invariant to finite rotations, as in Figure 7a, but does not include functions like Figure 7b (which produce different packings).

A.3 Proof of Theorem 2: lower bound on sample complexity

In this section, we derive Theorem 2, the lower bound on regret for the invariant Matérn kernel under permutation groups. The main elements of this proof were introduced in [12], which extended the method in [35]. This constructive method serves to demonstrate the tightness of the upper bound. As in [12] and [42], our method involves building up a set of candidate optimization targets in Hk G. However, unlike these papers, it is no longer clear in this setting that such constructions are optimal in terms of the resulting dependence of the lower bound on |G| (the current gap to the upper bound potentially measures this failure rate). Our methods will produce elements in SG(F) Hk G, for a particular set of functions F, and it s the object of further research if these constructions can be improved.In Figure 7a, we have an example of the types of functions we construct, by applying the operator SG to a simple bump function. In this example, G is a group of rotations along the vertical axis. In Figure 7b is an example of a function we cannot retrieve through our method. This function is similarly invariant (also to all other rotations along the axis). Our method relies on a packing argument for the supports of such functions and it is a priori unclear how the tightest packing can be obtained.

We identify a bounded subset of the RKHS of invariant functions that contains functions with the property that each point in X is ϵ-optimal for at most one function. We then construct a set of noisy bandit problems using functions from this class, before applying [12, Lemma 1] to lower-bound the number of samples that a learner requires in order to distinguish between them, and hence the best-case regret for this configuration.

A.3.1 An explicit construction

The proof for our lower bound will be constructive. We will exhibit a family of functions for which there is a computable lower bound of samples needed to distinguish their maxima. The challenge comes from constructing these functions natively in the RKHS of invariant functions. The strategy we employ here is to construct functions by symmetrization of simple bump functions. For the benefit of the reader, we ll show an explicit version of this construction first, on the hypercube X = [0, 1]d equipped with the action of G Sd and subsequently extend to general actions of subgroups of O(d) on the sphere Sd 1.

Lemma 2 (Bounded support function construction [12, Lemma 4]). Let h(x) =

exp 1 1 x 2 1{ x 2 < 1} be the d-dimensional bump function, and define g(x, w, ϵ) = 2ϵ h(0)h( x

w) for some w > 0 and ϵ > 0. Then, g satisfies the following properties:

1. g(x, w, ϵ) = 0 for all x outside the ℓ2-ball of radius w centred at the origin;

2. g(x, w, ϵ) [0, 2ϵ] for all x, and g(0) = 2ϵ;

3. g k c1 2ϵ

w ν h k when k is the Matérn-ν kernel on Rd, where c1 is constant. In

particular, we have g k B when w = 2ϵc1 h k

Let e1, . . . , ed be the canonical basis vectors in Rd. Then, the set

L(α) = 2λ1 1

2 αe1 + + 2λd 1

2 αed : λ1, . . . , λd Z (90)

forms a lattice of points on Rd with spacing α. Equivalently, L(α) defines a partitioning of Rd into disjoint regions or lattice cells {R(x0) : x0 L(α)}, where R(x0) is a d-dimensional hypercube of side length α centred on x0.

Define the function class F(w, ϵ) = {g(x x0, w, ϵ) : x0 L(w)}. By property (1) of Lemma 2, g(x x0, w, ϵ) is zero outside of R(x0). Hence, the support of functions f F(w, ϵ) are pairwise disjoint. In addition, as F(w, ϵ) consists only of translates of g(x, w, ϵ), we retain max f = 2ϵ and f k B f F(w, ϵ) for appropriate w from properties 2 and 3 of Lemma 2 respectively.

The hypercube [0, 1]d contains N of these functions, where

Now, we construct a set of G-invariant bump functions with disjoint support on the hypercube [0, 1]d, centred on the lattice cells for a class of groups G equipped with an action which we define below.

Lemma 3 (Set of invariant functions). Let G be a subset of Sd, the permutation group on d elements. G acts on Rd (and on any invariant subset X) by permutation of the coordinate axes. Let F(w, ϵ) = {g(x x0, w, ϵ) : x0 L(w)}. Then, F(w, ϵ, G) = {SG(f) : f F(|G|1/νw, |G|ϵ)}, where SG is defined as in Proposition 1, satisfies the following properties:

1. F(w, ϵ, G) is a set of G-invariant functions on Rd with pairwise disjoint support;

2. maxx fi(x) = 2|G|ϵ

|G xi|, where fi(x) = SG(g(x xi, |G|1/νw, |G|ϵ)) F(w, ϵ, G);

3. f B f F(w, ϵ, G).

Proof. First, we show that F(w, ϵ, G) is indeed a SG-invariant set of G-invariant functions with disjoint support. As all translates g(x x0) are symmetric around x0, it holds that g(σ(x) x0) = g(x σ 1(x0)). From Equation (90), it is evident that the lattice L(α) is invariant to permutations; that is, σ(x0) L(w) for all x0 L(w), σ G. Hence, F(w, ϵ) itself is invariant under SG, and a fortiori F(w, ϵ, G) = SG F(|G|1/νw, |G|ϵ) .

Next, we show that functions in F(w, ϵ, G) have maximum value 2|G|ϵ

|G xi|. Let fi F(|G|1/νw, |G|ϵ) be the function centred on xi. From Lemma 2, property 2, fi has maximum value 2|G|ϵ. Let Gxi = {σ : σ(xi) = xi} denote the stabilizer group of G at xi.

Since the lattice itself is invariant, the image G xi of the centre will have size

|G xi| = |G| |Gxi|. (92)

The preimage of any element in G xi consists of exactly |Gxi| elements. Therefore,

SG(fi) = 1 |G|

σ H fi σ (93)

fi σH, (94)

where σH is a representative of the coset H and fi σH is the unique value of fi σ on this coset, regardless of its representative.

Note that for any cosets H and K, fi σH and fi σK have disjoint supports from the same arguments as before, and thus,

max SG(fi) = |Gxi|

|G| max fi σ = 2|G|ϵ|Gxi|

|G| = 2|G|ϵ

|G xi|. (96)

Finally, we show that functions in F(w, ϵ, G) have norm bounded by B. Let f F(w, ϵ, G) and f F(|G|1/νw, |G|ϵ). From property (3) in Lemma 2,

f k c12|G|ϵ

where w is the same as before. From Proposition 1, recall that SG is a projection; hence, f k B implies SG(f) k G B, and so f k G f k B.

We now restrict the function set to the unit hypercube. The number of functions from F(|G|1/νw, |G|ϵ) (the set of rescaled non-invariant functions) that fit in the unit hypercube is given by N , where

N = 1 |G|1/νw

where N is defined in Equation (91).

Clearly, the number of orbits in F(|G|1/νw, |G|ϵ), i.e. the number of elements in F(w, ϵ, G) is equal to the number of orbits of G on L(w), by Lemma 3. We would like to count the number of orbits, or at least bound it from below. For specific groups G one can do this explicitly, however, in general it is impossible to give a count of the orbits, without further assumption. What we can, however say is that the number of orbits of size exactly |G| represents almost the total number of orbits when w is small.

Proposition 3. Let M = | F(w, ϵ, G)| = | L(|G|1/νw) T[0, 1]d /G|. Let K := 1 |G|1/νw . Then

|G|, and this bound is asymptotically tight in the limit of w 0.

Proof. Let W(s) be the set of points in L(|G|1/νw) T[0, 1]d with exactly s d repeated indices. Assume w is such that K := 1 |G|1/νw > d + 1. Then, W(0) = under the action of G, such points in L(w) will have |G x| = |G|.

Points in Sd s=1 W(s), i.e. points with repeated indices lie in a union of manifolds of dimension d 1 or below within the hypercube [0, 1]d. As such, w decreases,

This is easy to see as |W(0)| = K (K 1) ...(K d). (100)

Hence, we have that | F(w, ϵ, G)| approaches |F(|G|1/νw,|G|ϵ)|

|G| from above as the width w tends to 0:

lim w 0 | F(w, ϵ, G)| |F(|G|1/νw, |G|ϵ)| = 1 |G|. (101)

Note that M is also the number of functions from F(w, ϵ, G) that are supported in [0, 1]d. Then, we can rewrite the above inequality as:

1 |G|1+d/ν N, (103)

where N is the number of functions from F(w, ϵ) that fit in [0, 1]d, as defined in Equation (91). Equation (103) gives a direct relationship between the packing of invariant and non-invariant bump functions with bounded norm that fit in the unit hypercube.

Note that the above analysis shows that this bound is tight for small enough w which depends only on d and ν since |G| < d!. If one is interested only in Equation (103), this is easy to obtain as M, the number of orbits is greater than N divided by the size of the largest orbit.

Note also that in the case of noiseless observations any optimisation algorithm will have to query at least M (resp. N) points to be able to distinguish between functions in F(w, ϵ, G) (resp. F(w, ϵ)) defined on [0, 1]d; hence, Equation (103) (resp. Equation (91)) can also be interpreted as a lower bound on the sample complexity of finding the maximum of these functions.

A.3.2 Distinguishing between bandit instances

The subsequent section of the proof mirrors the approach used to derive a simple regret lower bound in the standard (non-invariant) setting, as detailed in [12, Section 4.2]. The primary distinction lies in the construction of the representative functions that exhibit group invariance, as outlined in Equation (104) and Equation (105).

Recall the definition of a region R( ) from the discussion following Equation (90). Define Ri as Ri = S

σ G R(σ(xi)), where xi L(|G|1/νw) is the center of the region. Note that these regions are self-contained under G, so that x Ri = σ(x) Ri, and are pairwise disjoint. Therefore, we define a set of M such regions { Ri}M i=1, which partition our domain. The value of M is lower bounded in Equation (103).

We replace B with B/3 in the construction of F(|G|1/νw, |G|ϵ) (see Lemma 3).3 Let fi and fj (i = j) be shifted bump functions from the set F(|G|1/νw, |G|ϵ), centered at some fixed xi and xj respectively, where xi, xj L(|G|1/νw). Furthermore, these are selected such that the sizes of the orbits, |G xi| and |G xj|, are both equal to |G|. Then, we construct the pair

f = SG(fi), (104) f = SG(fi) + 2SG(fj). (105)

The following holds f k G B/3 and f k G B (from Lemma 3 and triangle inequality). We observe that f and f coincide outside Rj i.e., f f is only non-zero in region Rj. Moreover, we have maxx Ri f(x) = 2ϵ and maxx Rj f (x) = 4ϵ. It follows that only points from Rj and Ri can be ϵ-optimal for f and f, respectively.

In the subsequent steps, let Pf[ ] and Ef[ ] denote probabilities and expectations with respect to the random noise, given a suitably chosen underlying function f. Additionally, we define Pf(y|x) as the conditional distribution N(f(x), σ2), consistent with the Gaussian noise model.

We fix T Z+, δ (0, 1/3), B > 0 and ϵ (0, 1/2), and assume the existence of an algorithm such that for any f Fk G(B), it achieves an average simple regret r(x(T )) ϵ with a probability of at least 1 δ. Here, x(T ) represents the final point reported by the algorithm. Note that an algorithm that achieves simple regret less than ϵ for all f Fk G(B) automatically solves the decision problem of distinguishing between f and f as above. If no such algorithm exists, the sample complexity necessary to achieve this bound on regret must be higher than T. Next, we establish a lower bound on the number of samples required for such an algorithm to differentiate between two bandit environments (with reward functions f and f and the same Gaussian noise model) that share the same input domain but have distinct optimal regions.

Let A denote the event that the returned point x(T ) falls within the region Ri. Assume that an algorithm achieves a simple regret of at most ϵ for both functions f and f (both f, f Fk G(B) as established before) each with a probability of at least 1 δ. Under our assumption on simple regret, this implies P f[A] 1 δ and P f [A] δ since only points within Ri can be ϵ-optimal under f, and only points within Rj can be ϵ-optimal under f .

3We note that this modification affects only the constant factors (also in the bound on M from Equation (103)), which are not important for our analysis.

We are now in position to apply [12, Lemma 1]. We apply this lemma using the previously defined f, f , and A.4 It therefore holds that, for δ (0, 1

m=1 E f [Cm] max x Rm KL P f( |x) || P f ( |x) log 1 2.4δ , (106)

where Cm is the number of queried points in Rm up to time T.

For the two bandit instances and any input x, the observation distributions are N( f(x), σ2) and N( f (x), σ2). Utilising the standard result for the KL divergence between two Gaussian distributions with the same variance, we obtain:

max x Rm KL P f( |x) || P f ( |x) = (maxx Rm f(x) maxx Rm f (x))2

σ2 m = j 0 otherwise, (108)

as f(x) = f (x) for x / Rj, and for x Rj we have that f(x) = 0 and maxx Rj f (x) = 4ϵ.

Hence, Equation (106) becomes

σ2 E f [Cj] log 1 2.4δ . (109)

Let J = {j I|xj W(0)} be the set of indices j such that fj is centred around a point with no repeated indices such as in Proposition 3. Let ρw = |J| |G|M be the fraction of all orbits of size exactly |G|. By Proposition 3, for any small δw, there is a correspondingly small enough w such that ρw 1 δw. Since the previous inequality holds for an arbitrary j, we sum over over j (J \ G{i}) /G to arrive at:

σ2 E f [Cj] log 1 2.4δ . (110)

By rearranging and noting that PM j=1 E f [Cj] = T we obtain

T (Mρw 1) σ2

8ϵ2 log 1 2.4δ . (111)

Substituting in M from Equation (103) gives

8ϵ2 log 1 2.4δ σ2

8ϵ2 log 1 2.4δ , (112)

and using N from Equation (91) gives the final bound

where the implied constants can depend on d, l, and ν.

A.4 Lower bounds on different underlying spaces.

In this section we ll investigate how to raise the lower bounds on sample complexity to embedded submanifolds X , Rd. The full treatment below is valid only for the hypersphere, X = Sd but many of the ingredients required are valid on other manifolds as well. When specific statements about the sphere are made in the treatment below, we will explicitly refer to Sd instead of X.

4Following the approach in [12], we deterministically set the stopping time in Lemma 1 to T, corresponding to the fixed-length setting in which the time horizon is pre-specified. For an adaptation to scenarios where the algorithm may determine its stopping time, refer to Remark 1 in [12].

In the proof in A.3, we used the existence of a G-invariant lattice to underpin our bump function construction. Moreover, we ve used the fact that the RKHS norm B := g k ϵ wν . We will replicate the underlying function class construction for F(w, ϵ, G) here.

We will recover the same property for bump functions on X but under a different notion of rescaling.

Pick a point p U X and consider (y1, ..., ym) a system of coordinates on U so that xi(y1, ..., ym) represent the standard coordinates on the ambient space Rd. Let fp,w,ϵ : Rd R+ be a bump function around p which is 0 outside Bw(p; Rd) and supported in Bw/2(p; Rd) with height ϵ, as in section A.3. Let ϕp,w,ϵ = fp,w,ϵ X .

If X = Sd, then the intersection Brw(p; X) := Bw(p; Rd) T X is a geodesic neighbourhood with

geodesic radius rw = arccos 2 w2

2 . For small enough w,

w rw w + w2, (114)

by computing the Taylor expansion.

We will lso need to compute the RKHS norm for ϕp,w,ϵ for some suitably defined kernel. Let kν : Rd Rd be the standard Matérn kernel, and consider kν its restriction to X.

Next, we recall the existence of a fundamental domain for the action of finite groups G on X. Following the definition of fundamental domain from [21], we say F X is a fundamental domain for the action of G if:

F is a domain. G F = X g F T F = = g = e, the identity. For any compact K X, the transporter set ( F|K)G is finite.

Note that the last condition is trivial if G is finite. The construction of F is simple given that G acts by isometries of the underlying space Rd and g X = ι g Rd.

Indeed, let x be a point such that |Gx| = |G| (such a point trivially exists, as each member of G other than the identity fixes a subspace of dimension at most d 1) on the X = Sd 1. Then, |Gx| = |G|, and the Dirichlet domain:

Fx = {y X|d X (x, y) < d X (gx, y), g G \ {e}} (115)

is a fundamental domain for the action (see [21] for proof). One can see this set of points as the interior of the points in X which are closest to x rather than any of its other translates under G. For example, for the action of Sd on the positive orthant Sd 1, it is any of the simplices in the barycentric subdivision.

Equipped with this definition, we construct the lattice of bump function centres on X as follows. First construct a 2rw-net L(rw; Fx) on Fx as the set of centres of an optimal geodesic ball packing of Fx. Then define

L(rw) := L(rw; X) = G L(rw; fx) (116)

We will need to compare the cardinalities of L(rw) and L(trw) for some scaling factor t > 0. Let Π(rw) be the packing number of F Sd 1 by geodesic balls of radius rw.

We have trivially that

V ol(F) V ol(B2rw(p; X)) Π(rw) V ol(F) V ol(Brw(p; X)) (117)

where the first inequality comes trivially from noticing that a maximal rw packing is a 2rw covering. Moreover, since the exponential map is a diffeomorphism on small enough neighbourhoods (with derivative I at the origin), we have that:

K1V ol(Brw(p; Rd 1)) V ol(Brw(p; X)) K2V ol(Brw(p; Rd 1)) (118)

for some 0 < K1 < K2. Combining equations 117 and 118 gives:

c1td 1Π(rw) Π(trw) c2td 1Π(rw) (119)

Following this discussion, we have the following lemma which is an extension to Lemma 3. Denote by wr the inverse mapping of w rw, defined on small enough geodesic balls on Sd 1. We write F(w, ϵ) = {fp,w,ϵ : p L(rw)} for a set of functions on the ambient space and FX (rw, ϵ) = ϕp,w,ϵ := fp,w,ϵ X : p L(rw) for the restriction F(w, ϵ) X . Similarly, we write F(w, ϵ, G) = {SG(f) : f F(|G|1/νw, |G|ϵ)} and FX (w, ϵ, G) = F(w, ϵ, G) X Lemma 4 (Invariant functions on the sphere). The sets FX (rw, ϵ, G) and F(w, ϵ, G) satisfy the following properties.

1. F(w, ϵ, G) is a set of G-invariant functions on Sd 1 with pairwise disjoint support;

2. maxx SG(fp,w,ϵ)(x) = 2|G|ϵ

3. f B f FX (w, ϵ, G).

Proof. The key ingredient for 1. is to note that each fp,w,ϵ(x) = g( x p

w ), and thus, each f is G-invariant around the point p, namely, for σ G, fp,w,ϵ σ = fσ 1(p),w,ϵ. Then, since L(rw) is G-invariant by construction, 1. follows. For 2. the proof is identical to Lemma 3 2. For 3. notice that

ϕp,tw,ϵ kν inf ϕe ϕe p,tw,ϵ kν fp,tw,ϵ kν = tν fp,w,ϵ kν (120)

where the second inequality follows from the restriction and ϕe is any extension of ϕ to the underlying space, since the norm of a restriction is less than that of all extensions.

Let N = |F(w, ϵ)|, M = F(w, ϵ, G). Recall that these functions are defined in terms of the packing numbers of F, so that N = |G|Π(rw) and M = Π(r|G|1/νw). Pick δw > 0. For all small enough w, we have that:

M Π(|G|1/ν(1 + δw)rw) c1|G| d 1

ν (1 + δw)d 1Π(rw) = c3|G|1+ d 1

where the first inequality comes from equation 114 and the fact that the packing number is a decreasing function of the radius, the second inequality comes from equation 119 and c3 is a constant which depends on the ambient dimension and the curvature of the space X.

Following the arguments of section A.3.2 we see, that by construction, L(rw) admits only orbits of size |G|, end hence in Equation 110, the sum is over all elements in FX (rw, ϵ, G). Hence we recover the same lower bound on sample complexity:

This concludes the proof of Theorem 2.

The key element is the original construction of the functions fp,w,ϵ(x), which are of the form g(x p) for some bump function constructed around the origin which is itself symmetric with respect to the action of O(d) (and hence its subgroups).

B Experimental details

B.1 Synthetic experiments

We use an isotropic Matérn-5/2 kernel as the base kernel k, and compute k G according to Equation (5). To generate the objective functions, we first generate a large number n of sample pairs (xi, f(xi)) from a zero-mean GP prior with kernel lengthscale l = 0.12. We scale n with the dimensionality of the problem d, so that n = 64 for d = 2, n = 256 for d = 3, and n = 512 for d = 6, to mitigate sparsity. We then use Bo Torch to fit the noise parameter of a noisy invariant GP with fixed lengthscale l = 0.12 to that data, and treat the mean of the trained GP as our objective function. In this way, we ensure that the objective function belongs to the desired RKHS.

During the BO loop, the learner was provided the true lengthscale, magnitude, and noise variance of the trained kernel, to avoid the impact of online hyperparameter learning. For UCB, we use β = 2.0, except in the quasi-invariant case with ε = 0.05 where we found that increasing β was necessary to achieve good performance (β = 3.0). MVR has no algorithm hyperparameters.

For the constrained BO baseline, we replaced Bo Torch s default initialisation heuristic with rejection sampling in order to generate multiple random restarts that lie within the fundamental domain. We found that this improved the diversity of the initial conditions, and hence boosted the performance of the acquisition function optimiser.

In all other cases Bo Torch defaults were used.

B.2 Fusion experiment

We consider the 3-block permutation group, with |G| = 4! = 24. The kernel hyperparameters are learned offline, based on 256 runs of the fusion simulator with parameters generated by lowdiscrepancy Sobol sampling. Due to the cost of evaluating the objective function we use a batched version of the acquisition function as implemented by Bo Torch, querying points in batches of 10. We use the multi-start optimisation routine from Bo Torch for finding the maximum acquisition function value.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: Our theoretical results have been listed with all accompanying conditions and assumptions. Equally, our empirical results have the appropriate limitations and special settings described. Both match what is listed in the abstract. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: In our Limitations section (Section 5), we outline the computational cost of our method; this is the main drawback of our approach. Importantly, we provide an alternative low-cost approximation for use when the full technique is too costly. In our theoretical results, we clearly state the group actions, kernels, and other assumptions for which our theorems hold (Proposition 1, Theorem 2, Theorem 1). Following the literature, we do not include the impact of online hyperparameter learning in our theory or experiments. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We provide formal proofs of all theorems and propositions in the main text in the appendix: Proposition 1 has proof in Appendix A.1, Theorem 2 has proof in Appendix A.3, Theorem 1 has proof in Appendix A.2. These are correct and complete to the best of our knowledge. Where our work depends on lemmas and theorems from other papers they are clearly cited and their use in our case explained in words (e.g. Lemma 2). Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: All resources needed to reproduce our results have been provided in the code itself. The paper merely summarizes and describes the experimental settings, and all results are available directly from our code. Aside from the code, the experimental section clearly describes how the synthetic test functions are generated, and the tools used to perform BO. A complete explanation of the fusion setting is provided, and additional resources cited for more detailed information on the experiental configuration. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We provide a running codebase with scripted experiment files. All will be made available online upon acceptance. We provide access to the relevant simulation environments where needed, otherwise, as our algorithms fundamentally deal with data acquisition, open access to data is not applicable. In particular, as the fusion experiment depends on proprietary code and models that belong to the United Kingdom Atomic Energy Authority, the full fusion experiment cannot be made public. We do, however, provide the optimisation script, with the calls to the proprietary models censored, so that reviewers can view our exact methodology.

Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: All the special experimental settings needed to understand, appraise and judge the usefulness of our results have been provided, and connections to our theoretical setting have been drawn. Details needed for the implementation only are part of the code we supply; this includes exact hyperparameters and seeds used to generate each of the plots.

Guidelines:

The answer NA means that the paper does not include experiments.

The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: In all our experiments we provide adequate statistics except where not applicable. This includes sample sizes for experimental runs, sources of randomness and the appropriate error bars where applicable. In the fusion experiment, providing estimates of the standard deviation is not meaningful due to the presence of failed simulations, as discussed in the text. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We make special mention of the hardware we use, the computational trade-offs in various applications of our method, with quantitative data summarized in a graph. Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: We conform in every respect.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: In our conclusion we include a separate set of comments on broader impact. Our work more broadly falls into the category of foundational and theoretical research.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: Our work poses no such risk.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We cite the papers for the code we use ([2, 10]), which are both available under open-source licenses.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: We provide commented Python code with informative docstrings. Upon publication, we will share the full code repository under a permissive open-source license.

Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: [NA]

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: [NA] Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.