# Bayesian Optimisation of Functions on Graphs

Xingchen Wan, Pierre Osselin, Henry Kenlay, Binxin Ru, Michael A. Osborne, Xiaowen Dong
Department of Engineering Science, University of Oxford
{xwan,osselinp,kenlay,robin,mosb,xdong}@robots.ox.ac.uk

The increasing availability of graph-structured data motivates the task of optimising over functions defined on the node set of graphs. Traditional graph search algorithms can be applied in this case, but they may be sample-inefficient and do not make use of information about the function values; on the other hand, Bayesian optimisation is a class of promising black-box solvers with superior sample efficiency, but it has scarcely been applied to such novel setups. To fill this gap, we propose a novel Bayesian optimisation framework that optimises over functions defined on generic, large-scale and potentially unknown graphs. Through the learning of suitable kernels on graphs, our framework has the advantage of adapting to the behaviour of the target function. The local modelling approach further guarantees the efficiency of our method. Extensive experiments on both synthetic and real-world graphs demonstrate the effectiveness of the proposed optimisation framework.

## 1 Introduction

Data collected in network environments, such as transportation, financial, social, and biological networks, have become pervasive in modern data analysis and processing tasks. Mathematically, such data can be modelled as functions defined on the node set of graphs that represent the networks. This poses a new type of optimisation problem over functions on graphs, i.e. searching for the node that possesses the most extreme value of the function. Real-world examples of such optimisation tasks are abundant. For instance, if the function measures the amount of delay at different locations in an infrastructure network, one may wish to identify network bottlenecks; if it measures the influencing power of users on a social network platform, one may be interested in finding the most influential users; if it measures the time at which individuals were infected in an epidemiological contact network, an important task would be to identify patient zero of the disease.

Optimisation of functions on graphs is challenging. Graphs are an example of discrete domains, and conventional algorithms, which are mainly designed for continuous spaces, do not apply straightforwardly. Real-world graphs are often extremely large and sometimes may not even be fully observable. Finally, the target function, such as in the examples given above, is often a black-box function that is expensive to evaluate at the node level and may exhibit complex behaviour on the graph. Traditional methods to traverse the graph, such as breadth-first search (BFS) or depth-first search (DFS) [14], are heuristics that may be adopted in this setting for small-scale graphs, but they are inefficient on large-scale real-world graphs and complex functions. Furthermore, these search methods rely only on the graph topology and ignore the function values on the graph, which could be exploited to make the search more efficient. On the other hand, Bayesian optimisation (BO) [16] is a sample-efficient sequential optimisation technique with proven successes in various domains and is suitable for solving black-box, expensive-to-evaluate optimisation problems. However, while BO has been combined with graph-related settings, e.g. optimising over graph structures
(i.e. the individual configurations we optimise over are graphs) in the context of neural architecture search [20, 34], graph adversarial examples [42] or molecule design [23], it has not been applied to the problem of optimising over functions on graphs (i.e. the search space is a graph and the configurations we optimise over are nodes of the graph). The closest attempt was COMBO [27], a framework designed for a specific purpose, namely combinatorial optimisation, where the search space is modelled as a synthetic graph restricted to one that can be expressed as a Cartesian product of subgraphs. It also assumes that the graph structure is available and that the function values are smooth in the graph space, to facilitate the use of a diffusion kernel. All of these assumptions may fail to hold when optimising generic functions on real-world graphs.

Figure 1: Illustration of one iteration of BayesOptG on an example graph. (a) At iteration $t$, we construct a local subgraph $\tilde{G}_t$ centred around $v^*_t$, the best node seen so far (marked in black); the nodes of $\tilde{G}_t$ are marked in orange-red, with a darker shade denoting a shorter distance to $v^*_t$, and nodes outside $\tilde{G}_t$ are marked in grey (see Section 3.2 for details). (b) We place a GP surrogate with the covariance function defined in Section 3.1 on $\tilde{G}_t$ and pick the maximiser of the acquisition function (acquisition values are marked in shades of blue, with a darker shade denoting a higher value) as the node $v_{t+1}$ to query at iteration $t+1$ (Section 3.2). (c) If querying $v_{t+1}$ leads to a better objective value ($f(v_{t+1}) < f(v^*_t)$, assuming minimisation), the neighbourhood around it is selected as the new subgraph $\tilde{G}_{t+1}$. The process continues until convergence or a pre-set number of evaluations is reached.

We address these limitations in our work, and our main contributions are as follows: we consider the problem setting of optimising functions supported on the node set of a generic, large-scale, and potentially unknown graph, a setup that is, to the best of our knowledge, by itself novel in the BO literature. We then propose a novel BO framework that effectively optimises in such a problem domain with 1) appropriate kernels for the graph search space, derived by spectral learning on the local subgraph structure and therefore flexible in adapting to the behaviour of the target function, and 2) efficient local modelling to handle the challenge that the graphs in question can be large and/or not completely known a priori. Finally, we deploy our method on various novel optimisation tasks on both synthetic and real-world graphs and demonstrate that it achieves very competitive results against baselines.

## 2 Preliminaries

BO is a zeroth-order (i.e. gradient-free) and sample-efficient sequential optimisation algorithm that aims to find the global optimum of a black-box function $f$ defined over a search space $\mathcal{X}$: $x^* = \arg\min_{x \in \mathcal{X}} f(x)$ (we consider a minimisation problem without loss of generality). BO uses a statistical surrogate model to approximate the objective function and an acquisition function $\alpha(x)$ to balance exploitation and exploration under the principle of optimism in the face of uncertainty.
At the $t$-th iteration of BO, the objective function is queried at a configuration $x_t$ and returns an output $y_t$, a potentially noisy estimate of the objective: $y_t = f(x_t) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$, where $\sigma_n^2$ is the noise variance. The statistical surrogate is trained on the data observed up to the $t$-th iteration, $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{t}$, to approximate the objective function. In this work, we use a Gaussian process (GP) surrogate, which is query-efficient and gives analytic posterior mean and variance estimates at unknown configurations. Formally, a GP is denoted as $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, where $m(x)$ and $k(x, x')$ are the mean function and the covariance function (or kernel), respectively. While the mean function is often set to zero or a simple function, the covariance function encodes our belief about the properties of the function we would like to model, and its choice is a crucial design decision when using a GP. The covariance function typically has kernel hyperparameters $\theta$, which are usually optimised by maximising the log-marginal likelihood (the reader is referred to Rasmussen [31] for detailed derivations). With $m(\cdot)$ and $k(\cdot, \cdot)$ defined, at iteration $t$, with inputs $X_{1:t} = [x_1, ..., x_t]$ and the corresponding output vector $y_{1:t} = [y_1, ..., y_t]^\top$, a GP gives analytic posterior mean and variance estimates at an unseen configuration $x_{t+1}$:

$$\mu(x_{t+1} \mid \mathcal{D}_t) = k(x_{t+1}, X_{1:t})\, K_{1:t}^{-1}\, y_{1:t}, \qquad k(x_{t+1}, x'_{t+1} \mid \mathcal{D}_t) = k(x_{t+1}, x'_{t+1}) - k(x_{t+1}, X_{1:t})\, K_{1:t}^{-1}\, k(X_{1:t}, x'_{t+1}),$$

where $[K_{1:t}]_{i,j} = k(x_i, x_j)$ is the $(i,j)$-th element of the Gram matrix induced on the training samples by the covariance function $k(\cdot, \cdot)$. With the posterior mean and variance predictions, the acquisition function is optimised at each iteration to recommend the configuration (or a batch of configurations in the case of batch BO) to be evaluated at the $(t+1)$-th iteration. For additional details on BO, the reader is referred to Frazier [15].
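The posterior equations above translate directly into a few lines of linear algebra. The following is a minimal sketch (our own illustration, not the authors' released implementation) of how the posterior mean and variance would be computed for a generic kernel, using a Cholesky solve rather than an explicit matrix inverse:

```python
# Minimal sketch of the GP posterior equations; not the authors' code.
import numpy as np

def gp_posterior(K_train, k_star, k_star_star, y_train, noise_var=1e-2):
    """GP posterior mean and (marginal) variance at test points.

    K_train:     (t, t) Gram matrix K_{1:t} on observed inputs.
    k_star:      (t, s) cross-covariance k(X_{1:t}, X_*) between train and test inputs.
    k_star_star: (s, s) prior covariance among test inputs.
    y_train:     (t,)   observed outputs y_{1:t}.
    noise_var:   observation noise variance sigma_n^2 (an assumed value here).
    """
    K = K_train + noise_var * np.eye(K_train.shape[0])
    L = np.linalg.cholesky(K)                       # stable alternative to forming K^{-1}
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = k_star.T @ alpha                         # k(x*, X) K^{-1} y
    v = np.linalg.solve(L, k_star)
    cov = k_star_star - v.T @ v                     # k(x*, x*') - k(x*, X) K^{-1} k(X, x*')
    return mean, np.diag(cov)
```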
## 3 Bayesian Optimisation on Graphs

Problem setting. Formally, we consider a novel setup with a graph $G = (V, E)$, where $V = \{v_i\}_{i=1}^{n}$ are the nodes and $E = \{e_k\}_{k=1}^{m}$ are the edges, each edge $e_k = \{v_i, v_j\}$ connecting nodes $v_i$ and $v_j$. The topology of $G$ may be succinctly represented by an adjacency matrix $A \in \{0, 1\}^{n \times n}$; in our case, $m$ and $n$ are potentially large, and the overall topology is not necessarily fully revealed to the search algorithm at running time. It is worth noting that, for simplicity, we focus on undirected, unweighted graphs, where elements of $A$ are binary and symmetric (i.e. $A_{ij} = A_{ji}$).² Specifically, we aim to optimise a black-box, typically expensive objective function defined over the nodes, i.e. one that assigns a scalar value to each node of the graph. In other words, the search space (i.e. $\mathcal{X}$ in Section 2) in our setup is the set of nodes $V$, and the goal of the optimisation problem is to find the configuration(s) (i.e. $x$ in Section 2) that minimise the objective function: $v^* = \arg\min_{v \in V} f(v)$.

Promises and challenges of BO on graphs. We argue that BO is particularly appealing under the described setup because (1) it is known to be query-efficient, making it suitable for optimising expensive functions, and (2) it is fully black-box and gradient-free; indeed, for many real-world functions we can only observe inputs and outputs, and gradients may not even exist in a practical setup.

However, various challenges in our setup make the adaptation of BO highly non-trivial, and despite the prevalence of problems that may be modelled as such and the successes of BO, it has not previously been extended to the optimisation of functions on graphs. Examples of such challenges are:

(i) Exotic search space. BO is conventionally applied in continuous Euclidean spaces, whereas we focus on discrete graph search spaces. This difference implies that notions key to BO, such as the similarity between two configurations and the expected smoothness of the objective function (the latter is often a key criterion in selecting the covariance function), may differ significantly. For example, while comparing two points in a Euclidean space requires only simple distance metrics (such as the $\ell_2$ distance), careful thought is required to compare two nodes in a graph in a way that additionally accounts for the topological properties of the graph.

(ii) Scalability. Real-world graphs such as citation and social networks often feature a very large number of nodes while not exhibiting convenient properties such as the graph Cartesian product structure assumed in Oh et al. [27] to accelerate computation. It is therefore a technical challenge to adapt BO to this setting while retaining computational tractability.

(iii) Imperfect knowledge of the graph structure. Related to the previous point, it may be prohibitively expensive or even impossible to obtain perfect, complete knowledge of real-world graphs beforehand or at any point during optimisation (e.g. obtaining the full contact-tracing graph for epidemiological modelling); as such, any prospective method should be able to handle the situation where the graph structure is only revealed incrementally, on the fly.

Overview of BayesOptG. To address these challenges while retaining the desirable properties of BO, we propose to extend BO to this novel setup and are, to the best of our knowledge, the first to do so. We propose Bayesian Optimisation on Graphs, or BayesOptG for short; an illustration of the overall procedure is shown in Fig. 1, and an algorithmic description is given in Algorithm 1. In the rest of this section, we discuss in detail the key components of BayesOptG and how the method addresses the challenges identified above.

² We note that it is possible to extend the proposed method to more complex cases by using the corresponding definitions of the Laplacian matrix. We defer a thorough analysis to future work.

### 3.1 Kernels for BO on Graphs

Algorithm 1: Bayesian Optimisation on Graphs (BayesOptG)
1: Inputs: number of random points at initialisation/restart $N_0$, total number of iterations $T$, subgraph size $Q$, graph $G = (V, E)$ (whose topology is not necessarily fully known a priori).
2: Output: the node $v^*_T$ that minimises the objective function $f(v)$, $v \in V$.
3: Initialise restart_flag ← True, visited nodes $S \leftarrow \emptyset$, training data $\mathcal{D}_0 \leftarrow \emptyset$, $h \leftarrow 1$.
4: for $t = 1, ..., T$ do
5:   if restart_flag then
6:     Initialise the GP surrogate with $N_0$ randomly selected points from $V \setminus S$ and their observations: $\mathcal{D}_t \leftarrow \{(v_i, y_i)\}_{i=1}^{N_0}$.
7:   end if
8:   Construct the subgraph $\tilde{G}_t = (\tilde{V}_t, \tilde{E}_t)$ around $v^*_t$ (the best node seen since the last restart) (see Algorithm 2 and Section 3.2).
9:   Fit a GP with a kernel from Table 1 on $\tilde{G}_t$ using $\mathcal{D}_t$ by optimising the log-marginal likelihood.
10:  Select the next query point $v_{t+1}$ by optimising the acquisition function.
11:  Query the objective function $f(\cdot)$ at $v_{t+1}$ to obtain a (potentially noisy) estimate $y_{t+1}$; update the training data $\mathcal{D}_{t+1} \leftarrow \mathcal{D}_t \cup \{(v_{t+1}, y_{t+1})\}$ and the visited nodes $S \leftarrow S \cup \{v_{t+1}\}$; determine the state of restart_flag with the criteria described in Section 3.2.
12: end for
13: return the node that minimises $f(\cdot)$ over all restarts.

Kernel design. Covariance functions are crucial to GP-based BO. To use BO in our setup, we require a covariance function that gives a principled similarity measure between two nodes $\{v_i, v_j\} \subseteq V$ in order to interpolate signals on the graph effectively. In this paper, we study several kernels, including both kernels proposed in the literature (e.g. the diffusion kernel on graphs and the graph Matérn kernel [4]) and two novel kernels of our own design. Following Smola & Kondor [37], all the kernels investigated can be considered under a general formulation. Formally, for a generic graph $\tilde{G} = (\tilde{V}, \tilde{E})$ with $\tilde{n}$ nodes and $\tilde{m}$ edges, we define

$$\tilde{L} := \tfrac{1}{2}\left(I - D^{-1/2} A D^{-1/2}\right),$$

where $I$ is the identity matrix of order $\tilde{n}$, and $A$ and $D$ are the adjacency matrix and the degree matrix of $\tilde{G}$, respectively (the term in parentheses is the normalised Laplacian matrix, whose eigenvalues lie in the range $[0, 2]$; we scale it by $\tfrac{1}{2}$ so that the eigenvalues lie in $[0, 1]$). It is worth emphasising that we use notation with a tilde (e.g. $\tilde{G}$, $\tilde{n}$ and $\tilde{m}$) to make clear that this graph is, in general, different from, and typically a subgraph of, the overall graph $G$ discussed at the start of this section, which might be too large or not fully available at the start of the optimisation; we defer a full discussion to Section 3.2. We further write $\tilde{L} = U \Lambda U^\top$ with $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_{\tilde{n}})$ and $U = [u_1, ..., u_{\tilde{n}}]$, where $\{\lambda_1, ..., \lambda_{\tilde{n}}\}$ are the eigenvalues of $\tilde{L}$ sorted in ascending order and $\{u_1, ..., u_{\tilde{n}}\}$ are the corresponding (unit) eigenvectors. Letting $p, q \in \{1, ..., \tilde{n}\}$ be two indices over the nodes of $\tilde{G}$, we may express the covariance between an arbitrary pair of nodes $v_p, v_q$ in terms of a regularisation function $r(\lambda_i)$, $i \in \{1, ..., \tilde{n}\}$, of the eigenvalues, as described in Smola & Kondor [37]:

$$k(v_p, v_q) = \sum_{i=1}^{\tilde{n}} r^{-1}(\lambda_i)\, u_i[p]\, u_i[q], \qquad (1)$$

where $u_i[p]$ and $u_i[q]$ are the $p$-th and $q$-th elements of the $i$-th eigenvector $u_i$. The specific functional form of $r(\lambda_i)$ depends on the kernel choice, and the kernels considered in this work are listed in Table 1. All of these kernels encode the smoothness of the function on the local subgraph $\tilde{G}$. In particular, the diffusion kernel has been adopted in Oh et al. [27]; the polynomial and Matérn kernels are inspired by recent work in the graph signal processing literature [11, 46, 3]; finally, the sum-of-inverse-polynomials kernel is designed as a variant of the polynomial kernel: in terms of the regularisation function, it can be interpreted (ignoring $\epsilon$) as a scaled harmonic mean of the different degree components of the polynomial kernel. We next discuss the behaviour of these kernels from the perspective of kernel hyperparameters.

Kernel hyperparameters. The coefficients $\beta := [\beta_0, ..., \beta_{\eta-1}] \in \mathbb{R}^{\eta}_{\geq 0}$ (for the polynomial and sum-of-inverse-polynomials kernels) or $[\beta_1, ..., \beta_{\tilde{n}}] \in \mathbb{R}^{\tilde{n}}_{\geq 0}$ (for the diffusion kernel) define the characteristics of the kernel. We constrain $\beta$ to be non-negative to ensure positive semi-definiteness of the resulting covariance matrix, and the coefficients are learned jointly via GP log-marginal likelihood optimisation. The parameter $\nu$ of the Matérn kernel controls the mean-square differentiability, as in the classical GP literature.
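To make Eq. 1 concrete, the sketch below (our own illustration of the construction, not the authors' code) computes the scaled normalised Laplacian of a subgraph, its eigendecomposition, and a node-by-node Gram matrix from any vector of regularisation values $r(\lambda_i)$:

```python
# Minimal sketch of the spectral kernel in Eq. 1; illustrative only.
import networkx as nx
import numpy as np

def scaled_laplacian_eigendecomposition(graph):
    """Eigenvalues/eigenvectors of L = 0.5 * (normalised Laplacian),
    so that the spectrum lies in [0, 1]."""
    L = 0.5 * nx.normalized_laplacian_matrix(graph).toarray()
    lam, U = np.linalg.eigh(L)        # ascending eigenvalues, orthonormal eigenvectors
    return lam, U

def spectral_kernel_matrix(U, r_values):
    """Gram matrix with K[p, q] = sum_i r(lambda_i)^{-1} * u_i[p] * u_i[q]."""
    return (U / r_values) @ U.T       # equivalent to U diag(1/r) U^T
```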
The polynomial and sum-of-inverse-polynomials kernels in Table 1 feature an additional hyperparameter, the kernel order $\eta \in \mathbb{Z}_{\geq 0}$. We set it to $\min\{5, \text{diameter}\}$, where the diameter is the length of the shortest path between the most distant pair of nodes in $\tilde{G}$ (a thorough ablation study on $\eta$ is presented in App. D). We argue that this allows both kernels to strike a balance between expressiveness, as all eigenvalues of the graph are used in full without truncation, and regularity, as fewer kernel hyperparameters need to be learned. This is in contrast to, for example, the diffusion kernel on graphs in Table 1, which typically has to learn $\tilde{n}$ hyperparameters for a graph of size $\tilde{n}$ and whose optimisation can be prone to overfitting. To address this issue, previous works often had to resort to strong sparsity priors (e.g. horseshoe priors [6]) and approximate marginalisation with Monte Carlo samplers, which significantly increase the computational cost and reduce the scalability of the algorithm [27]. In contrast, by constraining the order of the polynomials to a smaller value, the resulting kernels may adapt to the behaviour of the target function and can be better regularised against overfitting in certain problems, as we validate in Section 5.

Table 1: Kernels considered, in terms of the regularisation function $r(\lambda_i)$. We derive the positive semi-definiteness of the polynomial and sum-of-inverse-polynomials kernels in App. A.

| Kernel | Regularisation function $r(\lambda_i)$ | Kernel function $K(\tilde{V}, \tilde{V})$ |
| --- | --- | --- |
| Diffusion [37, 27]* | $\exp(\beta_i \lambda_i)$ | $\sum_{i=1}^{\tilde{n}} \exp(-\beta_i \lambda_i)\, u_i u_i^\top$ |
| Polynomial† | $\sum_{\alpha=0}^{\eta-1} \beta_\alpha \lambda_i^\alpha + \epsilon$ | $\sum_{i=1}^{\tilde{n}} \big(\sum_{\alpha=0}^{\eta-1} \beta_\alpha \lambda_i^\alpha + \epsilon\big)^{-1} u_i u_i^\top$ |
| Sum-of-inverse polynomials† | $\big(\sum_{\alpha=0}^{\eta-1} \frac{1}{\beta_\alpha \lambda_i^\alpha + \epsilon}\big)^{-1}$ | $\sum_{i=1}^{\tilde{n}} \big(\sum_{\alpha=0}^{\eta-1} \frac{1}{\beta_\alpha \lambda_i^\alpha + \epsilon}\big)\, u_i u_i^\top$ |
| Matérn [4] | $(\beta_\nu + \lambda_i)^{\nu}$ | $\sum_{i=1}^{\tilde{n}} (\beta_\nu + \lambda_i)^{-\nu}\, u_i u_i^\top$ |

*Can be ARD or non-ARD: for ARD, the coefficients $\{\beta_i\}_{i=1}^{\tilde{n}}$ are learned; for non-ARD, a single scalar $\beta$ is learned. †The coefficients $\{\beta_\alpha\}_{\alpha=0}^{\eta-1}$ are learned. $\epsilon$: small positive constant (e.g. $10^{-8}$). $\eta$: order of the kernel.
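As an illustration only (the exact parameterisation in the released implementation may differ), the two polynomial-type regularisation functions in Table 1 and the order heuristic $\eta = \min\{5, \text{diameter}\}$ could be written as follows; the resulting $r(\lambda_i)$ vectors can then be plugged into a Gram-matrix construction such as the `spectral_kernel_matrix` sketch after Eq. 1:

```python
# Sketch of the Table 1 regularisation functions and the order heuristic.
import networkx as nx
import numpy as np

def polynomial_r(lam, beta, eps=1e-8):
    """r(lambda) = sum_alpha beta_alpha * lambda^alpha + eps."""
    powers = np.vander(lam, N=len(beta), increasing=True)   # columns: lambda^0 ... lambda^{eta-1}
    return powers @ beta + eps

def sum_of_inverse_polynomials_r(lam, beta, eps=1e-8):
    """r(lambda) = (sum_alpha 1 / (beta_alpha * lambda^alpha + eps))^{-1}."""
    powers = np.vander(lam, N=len(beta), increasing=True)
    return 1.0 / np.sum(1.0 / (powers * beta + eps), axis=1)

def default_order(subgraph, cap=5):
    """eta = min(cap, graph diameter); assumes the subgraph is connected."""
    return min(cap, nx.diameter(subgraph))
```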
### 3.2 Tractable Optimisation via Local Modelling

Figure 2: Subgraphs $\tilde{G}_t$ determined by Algorithm 2 (marked in red, with a darker shade denoting a shorter distance to the central node $v^*_t = \arg\min_{v \in \{v_{t'}\}_{t'=1}^{t}} f(v)$, marked in black), for a high-degree central node (left) and a node far from any high-degree node (right). Note that in the latter case, the local subgraph can include nodes that are much further away.

As discussed previously, it is a technical challenge to develop high-performing yet efficient methods on 1) large, real-world graphs (e.g. social network graphs) and 2) graphs for which it is expensive, or even impossible, to obtain complete topological information beforehand (e.g. if we model the interactions between individuals as a graph, the complete topology may only be obtained after exhaustive interviews and contact tracing of all people involved). The previous work of Oh et al. [27] cannot handle the second scenario, and it addresses the first only by assuming a certain structure of the graph (e.g. a Cartesian product of subgraphs); these techniques are not applicable when dealing with a general graph $G$. To address this dual challenge, and inspired by trust-region-based BO methods [7, 13, 43, 10, 44], we adapt and simplify these techniques to our use case: we propose to leverage local modelling by focusing on a subset of nodes that evolves as the optimisation progresses.

At iteration $t \in \{1, ..., T\}$, let $\mathcal{D}_t = \{(v_{t'}, y_{t'})\}_{t'=1}^{t}$ be the collection of observed configurations and outputs, and let $v^*_t = \arg\min_{v \in \{v_{t'}\}_{t'=1}^{t}} f(v)$ be the node with the best objective value so far. We then use Algorithm 2 to select a neighbourhood around $v^*_t$ that is a subgraph of the overall graph $G$: $\tilde{G}_t \subseteq G$ with $Q$ nodes (we discuss how to choose $Q$ in the next paragraph), in a procedure similar to the neighbourhood sampling of the GraphSAGE framework [18], as illustrated in Fig. 2. In particular, during sampling, nodes closer to $v^*_t$ take precedence over nodes further away; we only sample the latter if the subgraph consisting of $v^*_t$ and the closer nodes has fewer than $Q$ nodes. The local subgraph is hence a form of ego-network of the central node $v^*_t$. We then impose the GP and compute the covariance matrix over this subgraph only. First, this effectively limits the computational cost: the time complexity in our case depends on both the number of training examples $N$ and the size $\tilde{n}$ of the graph on which we impose the GP, i.e. $\mathcal{O}(\tilde{n}^3 + N^3)$ assuming a naïve eigendecomposition algorithm. Second, it also effectively addresses the setting where the entire graph $G$ is not available a priori, as we only need to query and reveal the topological structure of the subgraph $\tilde{G}_t$ on the fly.

Algorithm 2: Selecting a local subgraph
1: Inputs: best input since the last restart up to iteration $t$: $v^*_t$; subgraph size $Q$.
2: Output: local subgraph $\tilde{G}_t$ with $Q$ nodes.
3: Initialise: $\tilde{V}_t \leftarrow \{v^*_t\}$, $h \leftarrow 1$.
4: while $|\tilde{V}_t| < Q$ do
5:   Find $N_h$, the $h$-hop neighbours of $v^*_t$.
6:   if $|\tilde{V}_t| + |N_h| \leq Q$ then
7:     Add all $h$-hop neighbours to $\tilde{V}_t$: $\tilde{V}_t \leftarrow \tilde{V}_t \cup N_h$.
8:     Increment $h$: $h \leftarrow h + 1$.
9:   else
10:    Randomly sample $Q - |\tilde{V}_t|$ nodes from $N_h$ and add them to $\tilde{V}_t$.
11:  end if
12: end while
13: return the subgraph $\tilde{G}_t$ induced by $\tilde{V}_t$ (i.e. the ego-network).

Determining the local subgraph size. The local subgraph size at iteration $t$, denoted $Q_t$, is a hyperparameter of the algorithm. We adapt the idea of trust regions from trust-region-based optimisation algorithms [7, 13, 43] to set $Q_t$ adaptively as the optimisation progresses: specifically, we initialise $Q_0$ (initial neighbourhood size), succ_tol (success tolerance), fail_tol (failure tolerance) and $\gamma > 1$ (multiplier) as hyperparameters,³ and count successes as occasions where BayesOptG improves the function value (i.e. at iteration $t$, $f(v^*_t) < f(v^*_{t-1})$) and failures otherwise. Upon succ_tol consecutive successes, we expand the neighbourhood size, $Q_t \leftarrow \min(\mathrm{round}(\gamma Q_{t-1}), n)$, to increase exploration, and upon fail_tol consecutive failures, we shrink it, $Q_t \leftarrow \max(\mathrm{round}(Q_{t-1}/\gamma), Q_{\min})$, to increase exploitation. Here $\mathrm{round}(\cdot)$ denotes rounding to the nearest integer, and $Q_{\min}$ denotes a minimum neighbourhood size (typically set to 1, i.e. the single node $v^*_t$, although alternative values may be used). When $Q_t \leq Q_{\min}$, we restart the BO by fitting the surrogate on randomly initialised nodes whose objective function values have not yet been evaluated.

³ We provide an ablation study in App. D to show the robustness of BayesOptG to its hyperparameters.
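The following sketch illustrates Algorithm 2 and the trust-region-style size update described above; it is our own rendering of the procedure, and the default hyperparameter values (e.g. `succ_tol`, `fail_tol`, `gamma`) are assumptions rather than the paper's settings:

```python
# Sketch of Algorithm 2 and the subgraph-size update; illustrative only.
import random
import networkx as nx

def select_local_subgraph(graph, centre, q):
    """Grow an ego-network of at most q nodes around `centre`: whole h-hop
    shells are added while they fit, and the last shell is randomly sub-sampled."""
    selected, h = {centre}, 1
    while len(selected) < q:
        within_h = nx.single_source_shortest_path_length(graph, centre, cutoff=h)
        shell = set(within_h) - selected            # nodes at exactly h hops
        if not shell:                               # no more reachable nodes
            break
        if len(selected) + len(shell) <= q:
            selected |= shell
            h += 1
        else:
            selected |= set(random.sample(list(shell), q - len(selected)))
    return graph.subgraph(selected)

def update_subgraph_size(q, n_success, n_failure, succ_tol=2, fail_tol=2,
                         gamma=1.5, q_min=1, n_total=None):
    """Expand q after succ_tol consecutive successes, shrink after fail_tol
    consecutive failures; a restart is triggered elsewhere when q <= q_min."""
    if n_success >= succ_tol:
        q = round(gamma * q)
        if n_total is not None:
            q = min(q, n_total)
    elif n_failure >= fail_tol:
        q = max(round(q / gamma), q_min)
    return q
```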
Remarks on the relation to trust-region BO methods. It is worth noting that, while conceptually influenced by previous trust-region BO methods, the local graph construction we use differs from these methods in several crucial aspects. First, we use a bespoke distance metric in the graph space. Second, whereas the purpose of trust regions in previous works is to alleviate over-exploration in high-dimensional spaces, local subgraphs in our case also serve the crucial purpose of allowing BayesOptG to handle imperfect knowledge of the graph, as we only need to reveal the topology of the subgraph (as opposed to the entire graph) at any given iteration. Lastly, as discussed, using trust regions also improves scalability; this is concretely exemplified by the massive speed-up shown in Fig. 3.

Figure 3: Trust regions enable efficient optimisation on large graphs: wall-clock time with and without trust regions in BayesOptG with different kernels over graphs of different sizes.

Optimisation of the acquisition function. With the local subgraph obtained, we fit a GP surrogate with the covariance function defined in Section 3.1 and optimise the log-marginal likelihood. Given that the local search space in our case is finite (of size $Q$), we simply enumerate all nodes within $\tilde{G}_t$, compute their acquisition function values $\mathrm{acq}(\cdot)$ (computed from the predictive mean and variance of the GP surrogate), and pick the maximiser as the recommended location at which to query the objective function $f(\cdot)$ for iteration $t+1$: $v_{t+1} = \arg\max_{v \in \tilde{V}_t} \mathrm{acq}(v)$. Any off-the-shelf acquisition function may be used, and we adopt expected improvement (EI) [16] in our experiments. It is worth noting that BayesOptG is also fully compatible with existing approaches such as Kriging believer fantasisation [17] for batch BO.
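Because the candidate set is finite, maximising the acquisition function reduces to evaluating it at every node of the subgraph. A minimal sketch of this step with EI (our own illustration, assuming the posterior mean and standard deviation per node are already available) is shown below:

```python
# Sketch of acquisition maximisation over the finite local subgraph (EI, minimisation).
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y, xi=0.0):
    """EI for minimisation: E[max(best_y - f(v) - xi, 0)] under the GP posterior."""
    sigma = np.maximum(sigma, 1e-12)
    z = (best_y - mu - xi) / sigma
    return (best_y - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def select_next_node(candidate_nodes, mu, sigma, best_y):
    """Enumerate all candidate nodes and return the EI maximiser."""
    ei = expected_improvement(np.asarray(mu), np.asarray(sigma), best_y)
    return candidate_nodes[int(np.argmax(ei))]
```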
## 4 Related Work

The setup we consider is by itself novel and largely under-explored. One of the few existing methods that can be used for optimisation over a graph search space is COMBO [27], where the search space is modelled as a graph that captures the relationship between different values of a group of categorical variables. It is, therefore, designed explicitly for combinatorial optimisation. Several studies have modified COMBO in various ways but follow essentially the same framework for similar tasks, e.g. optimisation over categorical variables [12, 19, 24]. Similarly, Ramachandram et al. [30] propose a specific graph construction to optimise multimodal fusion architectures. Our work differs from these studies in that: 1) we focus on optimisation over generic, large-scale and potentially unknown graphs; 2) the nodes of the graph are not limited to combinations of values of categorical variables and can represent arbitrary entities; and 3) the kernels we propose are not limited to diffusion-based ones and can adapt to the behaviour of the function to be optimised. Finally, the graph bandit setting [5, 39, 38] can be seen as similar to ours in the sense that it also aims at finding extreme values associated with nodes of a graph. However, the bandit problem considers a stochastic setting where nodes are influenced in a probabilistic fashion and the objective function is actively shaped by this process; in our case, we consider an underlying deterministic, black-box function, which is more aligned with the classical BO setting. Moreover, both Valko et al. [39] and Thaker et al. [38] require full graph access and prohibitive operations on the full graph Laplacian (decomposition/inversion), whereas BayesOptG can work on-the-fly with initially unknown graphs and is much more scalable thanks to the designs in Section 3.2.

Several works also leverage kernels on graphs to build Gaussian processes for graph-structured data [26, 41, 40, 46, 28, 4, 29]. While the kernels proposed in these approaches can, in theory, be used in a BO framework, these studies do not address the optimisation problem we consider. Another line of work focuses on optimisation over graph inputs (in contrast to a graph search space), where each input configuration is itself a graph; in our case, each input configuration is a node. Examples of the former include Ru et al. [34], who model neural architectures as graphs and use Weisfeiler-Lehman kernels [36] to perform BO, and Wan et al. [42], who devise a BO agent for adversarial attacks on graph classification models. Other representative examples include Kandasamy et al. [20], Korovina et al. [23] and Cui et al. [8, 9]. We emphasise that, while related, these works deal with a different setup and thus require different methods compared to the present work. For example, the kernels over graphs used in these methods typically aim to find vector embeddings of graphs that account for their topologies; once the embedding is computed, standard Euclidean covariance functions (e.g. the dot product or squared-exponential kernel) are applied. In the present work, by contrast, we aim to compute similarities between nodes, where topological information is crucial during the covariance computation itself.

Figure 4: Validation of the predictive power of the kernels considered on a BA graph of size $n = 200$ nodes and parameter $m = 1$, with (a) function values on the nodes given by the elements of the eigenvector corresponding to the second-smallest eigenvalue, and (b) the same as above, but corrupted with noise of standard deviation $\sigma = 0.05$. The leftmost column visualises the ground truth, and the remaining columns show the GP posterior mean and standard deviation (error bars) learned by the different kernels against the ground truth, together with the Spearman correlation $\rho$ and the learned $r^{-1}(\lambda)$ (Eq. 1).

## 5 Experiments

We first validate the predictive power of the GPs with the adopted kernels on graphs and then demonstrate the optimisation performance of BayesOptG on both synthetic and real-world tasks. We compare BayesOptG against baselines including random search, local search, BFS and DFS. The description of these baselines is given in App. B.2. In all figures, lines and shaded regions denote the mean and standard error, respectively, across ten trials.

Figure 5: Maximising centrality scores with the BA random graph model and $n = 1000$ nodes. Different panels show different values of the BA hyperparameter $m \in \{2, 3, 4\}$ and different centrality metrics (betweenness/eigenvector centrality).
Figure 6: Maximising centrality scores with the WS random graph model and $n = 2000$ nodes, with panels showing betweenness/eigenvector centrality for $(k, p) \in \{10, 30\} \times \{0.1, 0.2\}$. Refer to Fig. 5 for the legend and additional explanations.

### 5.1 Validating Predictive Power of Kernels

We first validate the predictive power of the adopted kernels in controlled regression experiments. To do so, we generate functions that are simply eigenvectors of the graph Laplacian and compare the predictive performance of the kernels on three graph types: 2D grid, Barabási-Albert (BA) [1] and Watts-Strogatz (WS) [45]. We compare performance in terms of validation error and show the results in Fig. 4 (results for other graph types are shown in App. C.1). We find that in the noiseless case, all kernels learn the underlying function effectively (except that the diffusion kernel with ARD learns a non-smooth transform of the spectrum due to its over-parameterisation, resulting in underestimation of the uncertainty in the noisy case). Still, the better-regularised kernels (described in Section 3.1) are considerably more robust to noise corruption.

Figure 7: Synthetic test function task with the Ackley/Rosenbrock functions and noise standard deviation $\sigma \in \{0.5, 1\}$. Regret is shown on a log scale for Rosenbrock; refer to Fig. 5 for the legend.

Figure 8: Identifying the patient zero task with different SIR model hyperparameters $\beta \in \{0.1, 0.2\}$ and $\gamma \in \{0.015, 0.15\}$, with $\epsilon = 0$. Refer to Figs. 20-23 for experiments with other hyperparameter combinations.

Figure 9: Identifying influential users in a social network task on different real-life social networks (Enron/Facebook pages/Twitch). Refer to Fig. 8 for the legend.

Figure 10: Team optimisation task with the number of skills $s \in \{2, 4\}$ and $\alpha \in \{1, 10\}$, with a Jaccard index threshold of 0.3 (refer to App. B.4.3 for explanations). Refer to Fig. 8 for the legend and Figs. 24-26 for experiments with other hyperparameter combinations.
### 5.2 Optimisation Tasks

We conduct experiments on a number of synthetic and real-life tasks that involve or imitate expensive optimisation, and we report all results in terms of simple regret (i.e. the difference between the best objective function value found and the ground-truth optimum). We consider the following synthetic tasks:

- Maximising centrality scores (Figs. 5 and 6; Fig. 18 in App. C.2): we aim to find the node with the maximum centrality measure in a graph sampled from a random graph model. We consider both eigenvector centrality and betweenness centrality as the centrality metrics, and use BA and WS models with different hyperparameters as the random graph generators. We consider graphs with sizes on the order of $10^3$ nodes in Figs. 5 and 6. In Fig. 18, we further scale the graphs to $10^6$ nodes to demonstrate the scalability of our method in a large-scale setup.
- Synthetic test functions (Fig. 7): we optimise a suite of discretised versions of commonly used synthetic test functions (Ackley and Rosenbrock) on graphs defined as 2D grids, in both noiseless and noisy setups. The reader is referred to App. B.3.2 for additional implementation details.

We consider the following real-life tasks:

- Identifying the patient zero (Fig. 8; Figs. 20 to 23 in App. C.3): we aim to find the patient zero of an epidemic in a contact network, i.e. the person identified as the first carrier of a communicable disease in an epidemic outbreak. We use a real-world contact network based on Bluetooth proximity [2] and simulate the epidemic process on top of it using the SIR model, the canonical compartmental model in epidemiology [22]. The function values are the times at which individuals are infected; the reader is referred to App. B.4.1 for more details of this task.
- Identifying influential users in a social network (Fig. 9): we aim to find the most influential user in a social network. There are multiple ways of defining the influencing power of a user; for simplicity, we follow the common practice of taking node degree as a proxy for influence [21]. We use three real-world networks: the Enron email network [25], the Facebook page network [33], and the Twitch social network [32]. The reader is referred to App. B.4.2 for more details.
- Team optimisation (Fig. 10; Figs. 24 to 26 in App. C.4): we design a task of optimising team structure, where the objective is to find a team whose members are experts in different skills and whose collective expertise covers a diverse skill set. In this case, teams are modelled as nodes, and edges represent the a priori similarity between teams. While there are various possible ways to model these similarities, in our experiments an edge exists between two nodes if the Jaccard index between the two sets of team members is greater than a certain threshold. We include additional details and a formal description of the objective function in App. B.4.3.

We designed these tasks to imitate expensive but realistic black-box optimisation problems for which the use of Bayesian optimisation is ideal. For example, the identifying patient zero task imitates real-life contact tracing: if executed in real life, each function evaluation requires expensive and potentially disruptive procedures such as interviews about an individual's travel history and the people they were in contact with.
On the other hand, the centrality maximisation and influential social network user identification problems mirror common online advertising tasks, where one wishes to identify the most influential users without access to the full social network information (which would be near-impossible to obtain given the number of users). Real-life social media platforms often limit how much one may interact with them, through pay-per-use APIs or hard limits (e.g. an upper limit on views). In either case, there is a strong incentive to identify the influential users in the most query-efficient manner.

Figure 11: Aggregated ranks of the methods (lower is better) vs. the number of evaluations, averaged across all experiments.

Discussions. In addition to the task-specific results, we aggregate the performance of the different methods over all tasks in terms of relative ranking in Fig. 11. We find that, both within individual tasks and aggregated across tasks, BayesOptG with any kernel choice generally outperforms all baselines in terms of efficiency, final converged values, or both. Specifically, Random is simple but typically weak on larger graphs, except for very rough/noisy functions (like Ackley) or when the variation in function values is small; DFS and BFS are relatively weak as they use graph topology information only, ignore the node information (on which the objective function is defined), and can be sensitive to initialisation; Local search is, on balance, the strongest baseline, and it does particularly well on smoother functions with fewer local minima. As is the case for any GP-based method, the kernel choice impacts performance, and performance is stronger when the underlying assumptions of the kernel match the actual objective function. For example, diffusion kernels work well for patient zero identification (Fig. 8) and team optimisation (Fig. 10), as the underlying generative functions of both problems are indeed smooth (in fact, the SIR model of disease propagation is heavily connected to diffusion processes). Diffusion without ARD further enforces isotropy, assuming the diffusion coefficient is the same in all directions, and thus typically underperforms, except on team optimisation, where the generated graph is well structured, and Ackley, which is indeed isotropic and symmetric; we recommend it only if we know that the underlying function satisfies its rather stringent assumptions. Finally, the Sum Inverse and Diff ARD kernels are generally better, as they offer more flexibility in learning from the data; we recommend using one of these by default in the absence of prior knowledge suggesting otherwise.

## 6 Conclusion

We address the problem of optimising over functions on graphs, a hitherto under-investigated problem. We demonstrate that BO, combined with learned kernels on graphs and efficient local modelling, provides an effective solution. The proposed framework works with generic, large-scale and potentially unknown graphs, a setting that existing BO methods cannot handle. Results on a diverse range of tasks support the effectiveness of the proposed method. The current work, however, only considers the case where the optimisation is over nodes; possible future work includes extensions to related settings, such as optimising over functions defined on edges and/or on hypergraphs.
## Acknowledgement

The authors would like to acknowledge the following sources of funding in direct support of this work: X.W. is supported by the Clarendon Scholarship at the University of Oxford; P.O. is supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (EP/L015897/1); X.D. acknowledges support from the Oxford-Man Institute of Quantitative Finance and the EPSRC (EP/T023333/1). The authors declare no conflict of interest.

## References

[1] Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.
[2] Barrat, A., Cattuto, C., Kivelä, M., Lehmann, S., and Saramäki, J. Effect of manual and digital contact tracing on COVID-19 outbreaks: a study on empirical contact data. Journal of the Royal Society Interface, 18(178):20201000, 2021.
[3] Borovitskiy, V., Azangulov, I., Terenin, A., Mostowsky, P., Deisenroth, M., and Durrande, N. Matérn Gaussian processes on graphs. In International Conference on Artificial Intelligence and Statistics, pp. 2593-2601. PMLR, 2021.
[4] Borovitskiy, V., Azangulov, I., Terenin, A., Mostowsky, P., Deisenroth, M. P., and Durrande, N. Matérn Gaussian processes on graphs. In International Conference on Artificial Intelligence and Statistics, 2021.
[5] Carpentier, A. and Valko, M. Revealing graph bandits for maximizing local influence. In International Conference on Artificial Intelligence and Statistics, 2016.
[6] Carvalho, C. M., Polson, N. G., and Scott, J. G. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pp. 73-80. PMLR, 2009.
[7] Conn, A. R., Gould, N. I., and Toint, P. L. Trust Region Methods. SIAM, 2000.
[8] Cui, J., Yang, B., and Hu, X. Deep Bayesian optimization on attributed graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[9] Cui, J., Tan, Q., Zhang, C., and Yang, B. A novel framework of graph Bayesian optimization and its applications to real-world network analysis. Expert Systems with Applications, 170(114524), 2021.
[10] Daulton, S., Eriksson, D., Balandat, M., and Bakshy, E. Multi-objective Bayesian optimization over high-dimensional search spaces. In Uncertainty in Artificial Intelligence, pp. 507-517. PMLR, 2022.
[11] Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29, 2016.
[12] Deshwal, A., Belakaria, S., and Doppa, J. R. Mercer features for efficient combinatorial Bayesian optimization. In AAAI Conference on Artificial Intelligence, 2021.
[13] Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019.
[14] Even, S. Graph Algorithms. Cambridge University Press, 2011.
[15] Frazier, P. I. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
[16] Garnett, R. Bayesian Optimization. Cambridge University Press, 2023.
[17] Ginsbourger, D., Le Riche, R., and Carraro, L. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pp. 131-162, 2010.
[18] Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.
[19] Imani, M. and Ghoreishi, S. F. Graph-based Bayesian optimization for large-scale objective-based experimental design.
IEEE Transactions on Neural Networks and Learning Systems, 33(10):5913-5925, 2022.
[20] Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., and Xing, E. P. Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems (NIPS), pp. 2016-2025, 2018.
[21] Kempe, D., Kleinberg, J., and Tardos, É. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137-146, 2003.
[22] Kermack, W. O. and McKendrick, A. G. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 115(772):700-721, 1927.
[23] Korovina, K., Xu, S., Kandasamy, K., Neiswanger, W., Poczos, B., Schneider, J., and Xing, E. P. ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations. In Artificial Intelligence and Statistics, 2020.
[24] Krummenauer, J., Kammoun, N., Stein, B., and Goetze, J. Encoding categorical variables in physics-informed graphs for Bayesian optimization. In International Conference on Omni-layer Intelligent Systems, 2022.
[25] Leskovec, J., Lang, K. J., Dasgupta, A., and Mahoney, M. W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29-123, 2009.
[26] Ng, Y. C., Colombo, N., and Silva, R. Bayesian semi-supervised learning with graph Gaussian processes. In Conference on Neural Information Processing Systems, 2018.
[27] Oh, C., Tomczak, J. M., Gavves, E., and Welling, M. Combinatorial Bayesian optimization using the graph Cartesian product. In Conference on Neural Information Processing Systems, 2019.
[28] Opolka, F. L. and Liò, P. Graph convolutional Gaussian processes for link prediction. In ICML Workshop on Graph Representation Learning and Beyond, 2020.
[29] Opolka, F. L., Zhi, Y.-C., Liò, P., and Dong, X. Adaptive Gaussian processes on graphs via spectral graph wavelets. In International Conference on Artificial Intelligence and Statistics, 2022.
[30] Ramachandram, D., Lisicki, M., Shields, T. J., Amer, M. R., and Taylor, G. W. Bayesian optimization on graph-structured search spaces: Optimizing deep multimodal fusion architectures. Neurocomputing, 298:80-89, 2018.
[31] Rasmussen, C. E. Gaussian processes in machine learning. Springer, 2004.
[32] Rozemberczki, B., Allen, C., and Sarkar, R. Multi-scale attributed node embedding, 2019.
[33] Rozemberczki, B., Davies, R., Sarkar, R., and Sutton, C. GEMSEC: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 65-72. ACM, 2019.
[34] Ru, B., Wan, X., Dong, X., and Osborne, M. Interpretable neural architecture search via Bayesian optimisation with Weisfeiler-Lehman kernels. In International Conference on Learning Representations (ICLR), 2021.
[35] Sapiezynski, P., Stopczynski, A., Lassen, D. D., and Lehmann, S. Interaction data from the Copenhagen Networks Study. Scientific Data, 6(1):1-10, 2019.
[36] Shervashidze, N., Schweitzer, P., Van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(9), 2011.
[37] Smola, A. J. and Kondor, R. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, pp. 144-158. Springer, 2003.
[38] Thaker, P., Malu, M., Rao, N., and Dasarathy, G. Maximizing and satisficing in multi-armed bandits with graph information. Advances in Neural Information Processing Systems, 35:2019-2032, 2022.
[39] Valko, M., Munos, R., Kveton, B., and Kocák, T. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pp. 46-54. PMLR, 2014.
[40] Venkitaraman, A., Chatterjee, S., and Handel, P. Gaussian processes over graphs. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
[41] Walker, I. and Glocker, B. Graph convolutional Gaussian processes. In International Conference on Machine Learning, 2019.
[42] Wan, X., Kenlay, H., Ru, B., Blaas, A., Osborne, M. A., and Dong, X. Adversarial attacks on graph classifiers via Bayesian optimisation. In Conference on Neural Information Processing Systems, 2021.
[43] Wan, X., Nguyen, V., Ha, H., Ru, B., Lu, C., and Osborne, M. A. Think global and act local: Bayesian optimisation over high-dimensional categorical and mixed search spaces. In International Conference on Machine Learning (ICML), 2021.
[44] Wan, X., Lu, C., Parker-Holder, J., Ball, P. J., Nguyen, V., Ru, B., and Osborne, M. Bayesian generational population-based training. In First Conference on Automated Machine Learning (Main Track), 2022.
[45] Watts, D. J. and Strogatz, S. H. Collective dynamics of small-world networks. Nature, 393(6684):440-442, 1998.
[46] Zhi, Y.-C., Ng, Y. C., and Dong, X. Gaussian processes on graphs via spectral kernel learning. IEEE Transactions on Signal and Information Processing over Networks, 2023.

## A Proof of Semi-Definiteness

In this section, we show that all kernels considered in this paper are positive semi-definite (p.s.d.). Specifically, we note that, using the terminology defined in Eq. 1, any map $r: \mathbb{R} \to [0, +\infty]$ defines a valid covariance kernel. Indeed, for any $X \subseteq \tilde{V}$,

$$k(X, X) = \sum_{i=1}^{\tilde{n}} r^{-1}(\lambda_i)\, u_i[X]\, u_i[X]^\top, \qquad (2)$$

where $u_i[X] = \big[u_i[x_1], u_i[x_2], ..., u_i[x_l]\big]^\top$ with $l = |X|$. The matrix $u_i[X] u_i[X]^\top$ is symmetric p.s.d. as the outer product of a vector with itself: for all $x \in \mathbb{R}^l$, $x^\top u_i[X] u_i[X]^\top x = \|u_i[X]^\top x\|_2^2 \geq 0$. As a result, our covariance matrix is symmetric p.s.d. as a weighted sum of symmetric positive semi-definite matrices with non-negative coefficients. The kernels presented in this paper correspond to a positive $r$; hence, they are all p.s.d.

## B Experimental Details

### B.1 Random Graph Models

Barabási-Albert model (BA). The network begins with an initial connected network of $m_0$ nodes. New nodes are added to the network one at a time. Each new node is connected to $m \leq m_0$ existing nodes with a probability proportional to the number of links that the existing nodes already have. The probability $p_i$ that the new node is connected to node $i$ is

$$p_i = \frac{k_i}{\sum_j k_j},$$

where $k_j$ is the degree of node $j$.

Watts-Strogatz model (WS). The WS model was introduced to explain the "small-world" phenomenon observed in a variety of networks. It achieves this by interpolating between a randomised structure close to ER graphs and a regular ring lattice. Given a mean degree $K$ and a parameter $\beta \in [0, 1]$, an undirected graph is constructed with $N$ nodes and $NK/2$ edges as follows:
1. Construct a regular one-dimensional network with only local connections of range $K$, meaning each node is connected to its $K/2$ nearest neighbours on each side.
2. For every node $i = 0, ..., N-1$, take every edge connecting $i$ to its $K/2$ rightmost neighbours. Rewire each of these edges with probability $\beta$ to a random node, while avoiding self-loops and duplicate links.
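Both random graph models are available directly in networkx, which is a natural way to reproduce graphs of this kind (the specific sizes and seeds below are illustrative, not the paper's exact settings):

```python
# Illustrative generation of BA and WS graphs with networkx.
import networkx as nx

ba_graph = nx.barabasi_albert_graph(n=1000, m=3, seed=0)          # BA: preferential attachment
ws_graph = nx.watts_strogatz_graph(n=2000, k=30, p=0.1, seed=0)   # WS: ring lattice + rewiring
```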
### B.2 Baseline Algorithms

Breadth-first search (BFS) and depth-first search (DFS). These algorithms aim to explore the whole graph data structure. They start with a root node and explore according to the depth or breadth of the graph: DFS explores as far as possible along each branch before backtracking, while BFS explores all nodes at the present depth before moving on to nodes at the next depth level.

Random search. In this algorithm, at each time step a random node is selected for evaluation of the objective function.

Local search. In this algorithm, at each time step we sample and query a random neighbour of the best node encountered so far, and we move to that neighbour if the queried value is better than the incumbent best. When the algorithm reaches a local optimum (i.e. all neighbours have worse values than the current optimum), we allow the algorithm to restart at a random, unvisited node of the graph.

Figure 12: Betweenness/eigenvector centrality on BA/WS graphs.

### B.3 Synthetic Optimisation Tasks

#### B.3.1 Maximising network centrality

Centrality measures were introduced in network analysis to study the importance of vertices with respect to desired characteristics. In this paper, we use two centrality measures: betweenness centrality and eigenvector centrality. We show examples of these functions on sample BA/WS graphs in Fig. 12.

Betweenness centrality. This centrality focuses not just on overall connectedness but on occupying positions that are pivotal to the network's connectivity. The betweenness of node $v$ is given by

$$b(v) = \sum_{\substack{s, t \in V \setminus \{v\},\ s \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}},$$

where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$, $\sigma_{st}(v)$ is the number of those paths that pass through $v$, and $V \setminus \{v\}$ denotes the set of nodes excluding $v$ itself.

Eigenvector centrality. This centrality measure accounts for the influence of a particular node within the network. The centrality scores for the whole set of vertices, represented as a vector $x$, form a solution to the equation

$$A x = \lambda x,$$

where $A$ is the adjacency matrix and $\lambda$ is its largest eigenvalue.

#### B.3.2 Synthetic test functions

In this subsection, we describe the Rosenbrock and Ackley test functions used for our tasks, both of which are discretised versions of their original, continuous forms. The mathematical definitions of the test functions are listed below and visualised in Fig. 13.

Rosenbrock function: $f(x, y) = 100\,(y - x^2)^2 + (x - 1)^2$.

Ackley function: $f(x, y) = -20 \exp\!\big(-0.2\sqrt{0.5(x^2 + y^2)}\big) - \exp\!\big(0.5(\cos 2\pi x + \cos 2\pi y)\big) + 20 + \exp(1)$.

Additive noise. In order to alter the smoothness of the graph signal defined over the grid, we add random noise governed by a noise standard deviation $\sigma_n$ to the objective function: $\hat{f}(x, y) = f(x, y) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$. In our experiments, we vary the noise standard deviation $\sigma_n \in \{0, 0.1, 1, 5\}$ for the Ackley function and $\sigma_n \in \{0, 0.1, 0.5, 1\}$ for the Rosenbrock function.

Figure 13: Test function values on a regular grid graph corresponding to the input space: Rosenbrock (left); Ackley (right).
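The sketch below shows one way to realise the discretised test functions as signals on a 2D grid graph (our own illustration; the grid extent `[-5, 5]` and resolution are assumptions, not the paper's exact settings):

```python
# Sketch of the discretised synthetic test functions on a 2D grid graph.
import networkx as nx
import numpy as np

def ackley(x, y):
    return (-20 * np.exp(-0.2 * np.sqrt(0.5 * (x**2 + y**2)))
            - np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
            + 20 + np.e)

def rosenbrock(x, y):
    return 100 * (y - x**2) ** 2 + (x - 1) ** 2

def grid_graph_with_signal(f, side=50, lo=-5.0, hi=5.0, noise_std=0.0, seed=0):
    """Build a side x side grid graph and attach f evaluated on a lattice of
    the continuous domain [lo, hi]^2 (plus optional Gaussian noise) to the nodes."""
    rng = np.random.default_rng(seed)
    graph = nx.grid_2d_graph(side, side)
    coords = np.linspace(lo, hi, side)
    for (i, j) in graph.nodes:
        graph.nodes[(i, j)]["y"] = f(coords[i], coords[j]) + noise_std * rng.standard_normal()
    return graph
```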
### B.4 Real-World Optimisation Tasks

#### B.4.1 Finding patient zero in a contact network

In this task, we simulate diffusion processes over a graph via the epidemic SIR model of Kermack & McKendrick [22]. We slightly modify this model to allow for a parameter $\epsilon$ representing the probability of spontaneous infection from unknown factors. More formally, given a graph $G = (V, E)$, the model is governed by the probability of infection $\beta$, the probability of recovery $\gamma$, the probability of spontaneous infection $\epsilon$, and the horizon $T$, the time elapsed since the beginning of the outbreak. Let $t = 1, ..., T$ denote the current time, $x_{v,t} \in \{I, S, R\}$ the node status (Infected, Susceptible, Recovered), and $S_{I,t}, S_{S,t}, S_{R,t}$ the sets of nodes in each category at time $t$. We have:

$$\forall v \in S_{I,t}: \quad P[x_{v,t+1} = R] = \gamma, \quad P[x_{v,t+1} = I] = 1 - \gamma. \qquad (3)$$

$$\forall v \in S_{S,t}: \quad P[x_{v,t+1} = I] = 1 - (1-\epsilon)(1-\beta)^{|N(v) \cap S_{I,t}|}, \quad P[x_{v,t+1} = S] = (1-\epsilon)(1-\beta)^{|N(v) \cap S_{I,t}|}. \qquad (4)$$

$$\forall v \in S_{R,t}: \quad P[x_{v,t+1} = R] = 1. \qquad (5)$$

Given such a process, we construct an objective function indicating how close a node is to the source of the infection as follows. At time $T$, when we consider the diffusion to have ended (or which corresponds to the present moment when looking for patient zero), we denote by $\tau_v$ the first time of infection for every node in $S_{R,T} \cup S_{I,T}$. The objective function is then defined as:

$$\forall v \in V, \quad f(v) = \begin{cases} 0 & \text{if } v \in S_{S,T}, \\ \left(1 - \frac{\tau_v}{T}\right)^2 & \text{if } v \in S_{I,T} \cup S_{R,T}. \end{cases} \qquad (6)$$

This function takes values in $[0, 1]$ and is maximised at the node corresponding to patient zero. We expect local methods to perform well in this setting, as local behaviour can trace the source of the infection through the diffusion, and the variation of the function on the graph is relatively smooth. The introduction of the parameter $\epsilon$ nevertheless adds some sources of "local" optima to the graph objective function; we show examples of this phenomenon in Fig. 14, which gives exemplary graph signals induced by the generative process described in this section.

Figure 14: Simulation of the SIR process on BA/WS graphs. It is worth noting that some local optima are visible due to the $\epsilon$ parameter.
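A minimal sketch of the process in Eqs. 3-5 and the objective in Eq. 6 is given below; it is our own illustration of the setup described above, and details such as the number of seed nodes are assumptions:

```python
# Sketch of the SIR-with-spontaneous-infection simulation and the patient-zero objective.
import numpy as np
import networkx as nx

def simulate_sir(graph, beta, gamma, eps, horizon, n_seeds=1, seed=0):
    """Return tau[v]: first infection time of each node (np.inf if never infected)."""
    rng = np.random.default_rng(seed)
    nodes = list(graph.nodes)
    status = {v: "S" for v in nodes}
    tau = {v: np.inf for v in nodes}
    for idx in rng.choice(len(nodes), size=n_seeds, replace=False):
        status[nodes[idx]], tau[nodes[idx]] = "I", 0
    for t in range(1, horizon + 1):
        new_status = dict(status)
        for v in nodes:
            if status[v] == "I" and rng.random() < gamma:
                new_status[v] = "R"                                 # recovery (Eq. 3)
            elif status[v] == "S":
                n_inf = sum(status[u] == "I" for u in graph.neighbors(v))
                if rng.random() < 1 - (1 - eps) * (1 - beta) ** n_inf:   # Eq. 4
                    new_status[v], tau[v] = "I", t
        status = new_status
    return tau

def patient_zero_objective(tau, horizon):
    """f(v) = (1 - tau_v / T)^2 if v was ever infected, else 0 (Eq. 6)."""
    return {v: 0.0 if np.isinf(t) else (1 - t / horizon) ** 2 for v, t in tau.items()}
```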
B.4.2 Finding influential users in a social network

In this task, we consider the common problem of identifying the most influential person within a social network. Influence, in this context, is often quantified approximately using degree centrality, which may account for, for example, the number of followers or connections an individual possesses. However, the enormity of social network graphs often restricts our access to complete graph information, necessitating alternative search approaches. To validate our methodology, we conduct tests on various real-world graphs derived from diverse social networks. These include the Enron email network, which represents email communication between members of a corporation; the Facebook page network, which is a network of interconnected Facebook pages; and the Twitch social network, which provides insight into the relationships among users on the Twitch platform.

B.4.3 Team optimisation

Figure 15: An exemplary graph induced from the team optimisation problem.

In this task, we aim to tackle the problem of optimising the performance of a team of individuals with different skills. We assume that a team performs most effectively when 1) all skills are covered by combining individual skills, and 2) each individual specialises in a particular skill. More formally, we consider a pool of N individuals, where each individual is represented by a vector of skills x_i ∈ [0, 1]^K and K is the number of skills. This setup fits our framework well in the scenario where the skills of individuals are unknown, and the pool of potential candidates is also unknown in advance. A sample graph generated from this problem is shown in Fig. 15.

Skill generative process. In each experiment, we assume individual skills to be generated according to a Dirichlet distribution with parameter α:

x_i ∼ Dir(α), α = [α_1, ..., α_K].

The parameter α encodes the sparsity of skill expertise in the general population: small α is more likely to generate individuals with specialised skills, whereas large α concentrates all skill levels around a score of 0.5.

Graph construction. To solve this task and allow flexible exploration of teams with a varying number of individuals, we construct a graph where nodes represent teams and edges are based on the Jaccard index between each pair of team member sets. More specifically, given two teams s_1 and s_2 (subsets of the N individuals), the similarity between them is computed as

w(s_1, s_2) = |s_1 ∩ s_2| / |s_1 ∪ s_2|.

Given N teams, we then construct an undirected graph with edges

∀ s_1, s_2 ∈ [N], (s_1, s_2) ∈ E ⟺ w(s_1, s_2) > Median({w(s_i, s_j) : s_i, s_j ∈ [N]}).

Objective function. To model the two desirable properties in terms of team composition, we choose the following objective function:

∀ s ∈ [N]: f(s) = H_K[E_n[x]] − E_n[H_K[x]],

where H_K denotes the entropy over the K skill dimensions and E_n the average over team members. Intuitively, the first term of the objective corresponds to the entropy of the skill distribution of the whole team, which is maximised when the skill distribution is close to the uniform distribution. The second term corresponds to the expected entropy of the skill distribution of each individual, which is minimised (and the objective maximised) when each individual specialises in one skill. As a result, we can expect this objective to be well suited for modelling an ideal composition of a team.
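A minimal sketch of this construction follows (assuming networkx and scipy; the helper name build_team_graph, the pool size and the fixed team size are illustrative assumptions, whereas the paper's setup allows teams of varying sizes):

```python
import itertools
import numpy as np
import networkx as nx
from scipy.stats import entropy

def build_team_graph(n_individuals=50, n_teams=200, team_size=4, K=4, alpha=1.0, seed=0):
    """Sample Dirichlet skill vectors, draw random teams, connect teams whose
    Jaccard similarity exceeds the median, and attach the entropy-based
    objective f(s) = H[E[x]] - E[H[x]] to each team node."""
    rng = np.random.default_rng(seed)
    skills = rng.dirichlet(np.full(K, alpha), size=n_individuals)  # x_i on the K-simplex
    teams = [frozenset(rng.choice(n_individuals, size=team_size, replace=False).tolist())
             for _ in range(n_teams)]

    def objective(members):
        x = skills[sorted(members)]
        coverage = entropy(x.mean(axis=0))                    # entropy of the team's skill mix
        specialisation = np.mean([entropy(xi) for xi in x])   # mean entropy of individuals
        return coverage - specialisation

    sims = {(i, j): len(teams[i] & teams[j]) / len(teams[i] | teams[j])
            for i, j in itertools.combinations(range(n_teams), 2)}
    threshold = np.median(list(sims.values()))

    G = nx.Graph()
    G.add_nodes_from(range(n_teams))
    G.add_edges_from(e for e, w in sims.items() if w > threshold)
    nx.set_node_attributes(G, {i: objective(teams[i]) for i in range(n_teams)}, "objective")
    return G
```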
C Additional Experiments

C.1 Kernel Validation

Complementary to Fig. 4 in the main text, we conduct further regression analyses to confirm the expressive power of the investigated kernels. The results are shown in Fig. 16 and Fig. 17.

Figure 16: Expressiveness of kernels on a grid graph of size n = 200 nodes. Refer to Fig. 4 for more explanations.

Figure 17: Expressiveness of kernels on a WS graph of size n = 200 nodes. Refer to Fig. 4 for more explanations.

C.2 Centrality Maximisation on Large Graphs

In this section, we consider a centrality maximisation problem similar to that described in App. B.3.1, but on significantly larger graphs: we use BA and WS random graph generators as in the experiments of Fig. 5 and Fig. 6 in the main text, but we generate graphs with 10^6 nodes instead and increase the query budget. We show the results in Fig. 18, and we find that the superiority of BayesOptG over the baseline methods persists in this setting.

Figure 18: Maximising centrality scores with the BA/WS random graph models and n = 10^6 nodes.

C.3 Finding Patient Zero Task in Real-World Graphs

Setup. In this section, we consider several SIR diffusion problems as described in App. B.4.1, where we aim to find the patient zero on a real-world interaction network from the Copenhagen Networks Study [35]. This network represents physical proximity among participants (estimated via Bluetooth signal strength) in a population of more than 700 university students, and is thus a good testbed to examine the diffusion process of a hypothetical epidemic outbreak. A visualisation of the objective function on this graph is given in Fig. 19.

Figure 19: Diffusion objective function on the real-world interaction network.

Results. The performance of each algorithm is presented in Figs. 20 to 23, where we use different values of the initially infected population fraction and the probability of recovery. It is clear that, due to the increased complexity revealed in Fig. 19, there is some performance degradation in all algorithms considered. However, we can see that in most cases our method performs at least as well as the local search baseline.

C.4 Team Optimisation

We show additional results for the team optimisation tasks in Figs. 24 to 26. We observe that the key findings from the main text on this problem (Fig. 10) largely hold true for these tasks induced by different parameters.

D Ablation and Sensitivity Studies

In this section, we perform a thorough ablation and sensitivity study on how much the additionally introduced hyperparameters affect the algorithm's performance. We report sensitivity analyses with respect to the most important hyperparameters below, namely Q_0 (the initial trust region size), fail_tol, η (the order of the kernels), and γ (the trust region multiplier applied upon successive successes or failures). We also study the effect of introducing the trust region itself in this section. We perform the ablation experiments in the setting with BA graphs and synthetic function optimisation. We show the sensitivity analyses in Figs. 27 to 29; it is evident that our algorithm is largely robust to the choice of hyperparameters as long as a value within a sensible range is chosen.

Figure 20: Identifying the patient zero task with different SIR model hyperparameters β ∈ {0.1, 0.2, 0.3} and γ ∈ {0.005, 0.015, 0.15}. A fraction of 0.0003 of the population was initially infected. The probability of spontaneous infection ϵ is set to 0.
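For intuition on the roles of Q_0, γ and fail_tol, the following is an illustrative TuRBO-style multiplicative update rule for the trust region size; it is only a sketch of the general mechanism under that assumption and does not reproduce the exact schedule used by BayesOptG:

```python
def update_trust_region(Q, improved, n_failures, gamma=2.0, fail_tol=20,
                        Q_min=2, Q_max=500):
    """Illustrative (assumed, not the paper's exact) trust-region update:
    grow the local subgraph size Q by a factor gamma after an improving
    query, and shrink it by the same factor after fail_tol consecutive
    non-improving queries."""
    if improved:
        return min(int(Q * gamma), Q_max), 0
    n_failures += 1
    if n_failures >= fail_tol:
        return max(int(Q / gamma), Q_min), 0
    return Q, n_failures

# Example: start from Q = Q_0 and update after each optimisation iteration.
Q, fails = 50, 0
Q, fails = update_trust_region(Q, improved=False, n_failures=fails)
```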
Use of trust regions. In Fig. 31, we compare our trust-region-based local modelling against constructing a surrogate model on the whole graph, on relatively small graphs (400 and 1000 nodes). As observed, there is a small drop in performance because of the use of local modelling. It is worth noting, however, that, as shown in Fig. 3 in the main text, the full Bayesian optimisation procedure with kernels defined on the whole graph becomes prohibitively expensive even for a relatively small graph of 1000 nodes, and that when the graph is unknown it is impossible in the first place to construct a whole-graph GP. This verifies that the trust region strikes a promising balance between efficiency and performance.

Figure 21: Identifying the patient zero task with different SIR model hyperparameters β ∈ {0.1, 0.2, 0.3} and γ ∈ {0.005, 0.015, 0.15}. A fraction of 0.0003 of the population was initially infected. The probability of spontaneous infection ϵ is set to 0.005.

Figure 22: Identifying the patient zero task with different SIR model hyperparameters β ∈ {0.1, 0.2, 0.3} and γ ∈ {0.005, 0.015, 0.15}. A fraction of 0.003 of the population was initially infected. The probability of spontaneous infection ϵ is set to 0.

Figure 23: Identifying the patient zero task with different SIR model hyperparameters β ∈ {0.1, 0.2, 0.3} and γ ∈ {0.005, 0.015, 0.15}. A fraction of 0.003 of the population was initially infected. The probability of spontaneous infection ϵ is set to 0.005.

Figure 24: Team optimisation task with #skills ∈ {2, 4, 8} and α ∈ {0.1, 1, 10}, with a Jaccard index threshold of 0.1 (refer to App. B.4.3 for explanations).
Figure 25: Team optimisation task with #skills ∈ {2, 4, 8} and α ∈ {0.1, 1, 10}, with a Jaccard index threshold of 0.2 (refer to App. B.4.3 for explanations).

Figure 26: Team optimisation task with #skills ∈ {2, 4, 8} and α ∈ {0.1, 1, 10}, with a Jaccard index threshold of 0.3 (refer to App. B.4.3 for explanations).

Figure 27: Sensitivity of performance to Q_0 on different tasks and kernels. From left to right: centrality maximisation on the Enron, Facebook and Twitch networks. Kernels from top to bottom: polynomial, sum-of-inverse polynomials, diffusion (with ARD), diffusion (without ARD), and graph Matérn.

Figure 28: Sensitivity of performance to fail_tol on different tasks and kernels. Refer to Fig. 27 for additional explanations.

Figure 29: Sensitivity of performance to γ on different tasks and kernels. Refer to Fig. 27 for additional explanations.

Figure 30: Sensitivity of performance to η on different tasks and kernels. Refer to Fig. 27 for additional explanations. Note that only the polynomial and sum-of-inverse-polynomials kernels, which require non-trivial η selection, are included.

Figure 31: Performance of the trust-region-based method compared with modelling with full graph knowledge, on WS graphs of 400 (left) and 1000 (right) nodes.
It is worth noting that while the use of trust regions leads to some trade-off in terms of optimisation performance, without trust regions the algorithms become significantly more computationally expensive. On the graph of 1000 nodes (right), the computational cost is more than 20× the cost of the local algorithm (shown in Fig. 3 in the main text, which focuses on wall-clock time comparison).