# Graph Neural Network Bandits

Parnian Kassraie (ETH Zurich, pkassraie@ethz.ch) · Andreas Krause (ETH Zurich, krausea@ethz.ch) · Ilija Bogunovic (University College London, i.bogunovic@ucl.ac.uk)

We consider the bandit optimization problem with the reward function defined over graph-structured data. This problem has important applications in molecule design and drug discovery, where the reward is naturally invariant to graph permutations. The key challenges in this setting are scaling to large domains and to graphs with many nodes. We resolve these challenges by embedding the permutation invariance into our model. In particular, we show that graph neural networks (GNNs) can be used to estimate the reward function, assuming it resides in the reproducing kernel Hilbert space of a permutation-invariant additive kernel. By establishing a novel connection between such kernels and the graph neural tangent kernel (GNTK), we introduce the first GNN confidence bound and use it to design a phased-elimination algorithm with sublinear regret. Our regret bound depends on the GNTK's maximum information gain, which we also provide a bound for. While the reward function depends on all $N$ node features, our guarantees are independent of the number of graph nodes $N$. Empirically, our approach exhibits competitive performance and scales well on graph-structured domains.

## 1 Introduction

Contemporary bandit optimization approaches consider problems on large or continuous domains and have been successfully applied to a significant number of machine learning and real-world applications, e.g., in mobile health, environmental monitoring, economics, and hyperparameter tuning, to name a few. The main idea behind them is to exploit the correlations between the rewards of similar actions. This, in turn, has resulted in increasingly rich models of reward functions (e.g., in linear and kernelized bandits [37, 12]), including several recent attempts to harness deep neural networks for bandit tasks (see, e.g., [45, 29]). The vast majority of previous works focus only on standard input domains and on obtaining theoretical regret bounds. Learning on graph-structured data, such as molecules or biological graph representations, requires designing sequential methods that can effectively exploit the structure of graphs. Consequently, graph neural networks (GNNs) have received attention as a rapidly expanding class of machine learning models. They have proven remarkably well-suited for prediction tasks in applications such as designing novel materials [20], drug discovery [22], and structure-based protein function prediction [16]. This raises the question of how to bridge the gap and design bandit optimization algorithms on graph-structured data that can exploit the power of graph neural networks to approximate a graph reward.

In this paper, we consider bandit optimization over graphs and propose to employ graph neural networks with one convolutional layer to estimate the unknown reward. To scale to large graph domains (both in the number of graphs and the number of nodes), we propose practical structural assumptions to model the reward function. In particular, we propose to use permutation-invariant additive kernels. We show a novel connection between such kernels and the graph neural tangent kernel (GNTK) that we define in Section 3. Our main results are GNN confidence bounds that can be readily used in sequential decision-making algorithms to achieve sublinear regret bounds (see Section 4.3).
**Related Work.** Our work extends the rich toolbox of methods for kernelized bandits and Bayesian optimization (BO) that operate under a norm-bounded reproducing kernel Hilbert space assumption [37, 14, 42, 12]. The majority of these methods are designed for general Euclidean domains and rely on kernelized confidence sets to select which action to query next. The exception is [43], which considers the spectral setting in which the reward function is a linear combination of the eigenvectors of the graph Laplacian and the bandit problem is defined over the nodes of a single graph. In contrast, our focus is on optimizing over graph domains (i.e., sets of graphs), and on constructing confidence sets that can quantify the uncertainty of graph neural network estimates.

This work contributes to the literature on neural bandits, in which a fully-connected [45, 44, 19] or a single-hidden-layer convolutional network [25] is used to estimate the reward function. These works provide sublinear cumulative regret bounds in their respective settings; however, when applied directly to graph features (as we demonstrate in Section 4.3), these approaches do not scale well with the number of graph nodes.

Due to its important applications in molecule design, sequential optimization on graphs has recently received considerable attention. For example, in [28], the authors propose a kernel to capture similarities between graphs and, at every step, select the next graph through a kernelized random walk. Other works (e.g., [17, 18, 23, 38]) encode graph representations in the (continuous) latent space of a variational autoencoder and perform BO in the latent space. While practically relevant for discovering novel molecules with optimized properties, these approaches lack theoretical guarantees and are computationally demanding.

A primary focus of our work is on embedding the natural structure of the data, i.e., permutation invariance, into the reward model. This is inspired by the works of [6, 32] that consider invariances in kernel-based supervised learning. Consequently, the graph neural tangent kernel plays an integral role in our theoretical analysis. Du et al. [15] provide a recursive expression for the tangent kernel of a GNN, without showing that the obtained expression is the limiting tangent kernel as defined in Jacot et al. [21] (i.e., as in Eq. (3)). In contrast, we analyze the learning dynamics of the GNN and the properties of the GNTK by exploiting the connection between the structure of a graph neural network and that of a fully-connected neural network (Section 3). In doing so, we show that the graph neural tangent kernel also encodes additivity. Additive models for bandit optimization have been previously studied in [24] and [35]; however, these works focus only on Euclidean domains and standard base kernels. Finally, we build upon the recent literature on elimination-based algorithms that make use of maximum-variance-reduction sampling [13, 8, 7, 9, 40, 30]. One of our proposed algorithms, GNN-PE, employs a phased elimination strategy together with our GNN confidence sets.

**Main Contributions.** We introduce a bandit problem over graphs and propose to capture prior knowledge by modeling the unknown reward function using a permutation-invariant additive kernel. We establish a key connection between such kernel assumptions and the graph neural tangent kernel (Proposition 3.2).
By exploiting this connection, we provide novel statistical confidence bounds for the graph neural network estimator (Theorem 4.2). We further prove that a phased elimination algorithm that uses our GNN confidence bounds (GNN-PE) achieves sublinear regret (Theorem 4.3). Importantly, our regret bound scales favorably with the number of graphs and is independent of the number of graph nodes (see Table 1). Finally, we empirically demonstrate that our algorithm consistently outperforms baselines across a range of problem instances.

## 2 Problem Statement

We consider a bandit problem where the learner aims to optimize an unknown reward function via sequential interactions with a stochastic environment. At every time step $t \in \{1, \dots, T\}$, the learner selects a graph $G_t$ from a graph domain $\mathcal{G}$ and observes a noisy reward $y_t = f^*(G_t) + \epsilon_t$, where $f^*: \mathcal{G} \to \mathbb{R}$ is the reward function and $\epsilon_t$ is i.i.d. zero-mean sub-Gaussian noise with known variance proxy $\sigma^2$. Over a time horizon $T$, the learner seeks a small cumulative regret

$$R_T = \sum_{t=1}^{T} \big( f^*(G^*) - f^*(G_t) \big), \quad \text{where } G^* \in \arg\max_{G \in \mathcal{G}} f^*(G).$$

The aim is to attain regret that is sublinear in $T$, meaning that $R_T / T \to 0$ as $T \to \infty$, which implies convergence to the optimal graph.

As an example application, consider drug or material design, where molecules may be represented with graph structures (e.g., from SMILES representations [1]) and the reward $f^*(G)$ can correspond to an unknown molecular property of interest, e.g., atomization energy. Evaluating such properties typically requires running costly simulations or experiments with noisy outcomes. To identify the most promising candidate, e.g., the molecule with the highest atomization energy, molecules are sequentially recommended for testing, and the goal is to find the optimal molecule with the fewest evaluations.

**Graph Domain.** We assume that the domain $\mathcal{G}$ is a finite set of undirected graphs with $N$ nodes. (This assumption is for ease of exposition; graphs with fewer than $N$ nodes can be treated by adding auxiliary nodes with no features that are disconnected from the rest of the graph.) Without exploiting structure, standard bandit algorithms (e.g., [3]) cannot generalize across graphs, and their regret depends linearly on $|\mathcal{G}|$. To capture the structure, we consider reward functions depending on features associated with the graph nodes. Similar to Du et al. [15], we associate each node $j \in [N]$ with a feature vector $h_{G,j} \in \mathbb{R}^d$, for every graph $G \in \mathcal{G}$. We use $h_G = (h_{G,j})_{j=1}^{N} \in \mathbb{R}^{Nd}$ to denote the concatenated vector of all node features, and $\mathcal{N}(j)$ to denote the neighborhood of node $j$, including itself. We define the aggregated node feature

$$\bar{h}_{G,j} = \frac{\sum_{i \in \mathcal{N}(j)} h_{G,i}}{\big\| \sum_{i \in \mathcal{N}(j)} h_{G,i} \big\|_2}$$

as the normalized sum of the neighboring nodes' features. Similarly, $\bar{h}_G \in \mathbb{R}^{Nd}$ denotes the aggregated features, stacked across all nodes. Lastly, we let $\mathcal{P}_N$ be the group of all permutations of length $N$, and use $c \cdot G$ to denote a permuted graph, where a permutation $c \in \mathcal{P}_N$ is a bijective mapping from $\{1, \dots, N\}$ onto itself. Permuting the nodes of a graph produces a permuted feature vector $h_{c \cdot G} := (h_{G,c(j)})_{j=1}^{N}$, and the same holds for the aggregated features $\bar{h}_{c \cdot G}$.
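To make the aggregation step concrete, the following is a minimal NumPy sketch of how the $\bar{h}_{G,j}$ can be computed from an adjacency matrix. The function name and the zero-norm guard are ours and not part of the paper's formal setup:

```python
import numpy as np

def aggregate_node_features(A: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Normalized neighborhood aggregation, as defined in the Problem Statement.

    A: (N, N) adjacency matrix of an undirected graph (0/1 entries).
    H: (N, d) matrix whose j-th row is the raw node feature h_{G,j}.
    Returns the (N, d) matrix whose j-th row is sum_{i in N(j)} h_{G,i},
    normalized to unit Euclidean norm, where N(j) includes node j itself.
    """
    N = A.shape[0]
    A_hat = A + np.eye(N)                 # add self-loops: N(j) includes j
    S = A_hat @ H                         # row j = sum of features over N(j)
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0             # guard for featureless auxiliary nodes (our choice)
    return S / norms
```

Each returned row has unit norm, so the aggregated features lie on the sphere $\mathbb{S}^{d-1}$, a fact used later when bounding the GNTK.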
**Reward Model.** Practical graph optimization problems, such as drug discovery and materials optimization, often do not depend on how the graphs' nodes are ordered in the dataset. We incorporate this structural prior into modeling the reward function and consider functions that are invariant to node permutations. We assume that $f^*$ depends on the graph only through the aggregated node features and gives the same reward for all permutations of a graph, i.e., $f^*(c \cdot G) = f^*(G)$ for any $G \in \mathcal{G}$ and $c \in \mathcal{P}_N$.

To guarantee such invariance, we assume that the reward belongs to the reproducing kernel Hilbert space (RKHS) corresponding to a permutation-invariant kernel

$$k(G, G') = \frac{1}{|\mathcal{P}_N|^2} \sum_{c, c' \in \mathcal{P}_N} \bar{k}\big( \bar{h}_{c \cdot G}, \bar{h}_{c' \cdot G'} \big),$$

where $\bar{k}$ can be any kernel defined on graph representations $\bar{h}_G$. This assumption further restricts the hypothesis space to permutation-invariant functions defined on $Nd$-dimensional vector representations of graphs. This is due to the reproducing property of the RKHS, which allows us to write $f(G) = \langle f, k(G, \cdot) \rangle = \langle f, k(c \cdot G, \cdot) \rangle = f(c \cdot G)$. To make progress when optimizing over graphs with a large number of nodes $N$, we assume that $\bar{k}$ decomposes additively over node features, i.e., $\bar{k}(\bar{h}_G, \bar{h}_{G'}) = \frac{1}{N} \sum_{j=1}^{N} k^{(j)}(\bar{h}_{G,j}, \bar{h}_{G',j})$. Thereby, we obtain an additive graph kernel that is invariant to node permutations:

$$k(G, G') = \frac{1}{N |\mathcal{P}_N|^2} \sum_{c, c' \in \mathcal{P}_N} \sum_{j=1}^{N} k^{(j)}\big( \bar{h}_{G,c(j)}, \bar{h}_{G',c'(j)} \big). \tag{1}$$

For an arbitrary choice of $k^{(j)}$, calculating $k$ requires a costly sum over $(N!)^2$ operands, since $|\mathcal{P}_N| = N!$. In Section 3, we select a base kernel for which the sum can be reduced to $N^2$ terms. We are now in a position to state our main assumption. We assume that $f^*$ belongs to the RKHS of $k$ and has a $B$-bounded RKHS norm. The norm-bounded RKHS regularity assumption is typical in the kernelized and neural bandits literature [37, 12, 45, 25]. Note that Eq. (1) only puts a structural prior on the kernel function, i.e., it describes the generic form of an additive permutation-invariant graph kernel. Specifying the base kernels $k^{(j)}$ determines the representation power of $k$: the smoother the base kernels are, the less complex the RKHS of $k$ will be. In Section 3, we set the base kernels $k^{(j)}$ such that $k$ becomes the expressive graph neural tangent kernel.

## 3 Graph Neural Networks

Graph neural networks are effective models for learning complex functions defined on graphs. As in Du et al. [15], we consider graph networks that have a single graph convolutional layer and $L$ fully-connected ReLU layers of equal width $m$. Such a network $f_{\mathrm{GNN}}(G; \theta): \mathcal{G} \to \mathbb{R}$ may be recursively defined as follows:

$$f^{(1)}(\bar{h}_{G,j}) = W^{(1)} \bar{h}_{G,j},$$
$$f^{(l)}(\bar{h}_{G,j}) = \sqrt{\tfrac{2}{m}}\, W^{(l)} \sigma_{\mathrm{relu}}\big( f^{(l-1)}(\bar{h}_{G,j}) \big) \in \mathbb{R}^m, \quad 1 < l \leq L,$$
$$f_{\mathrm{GNN}}(G; \theta) = \frac{1}{N} \sum_{j=1}^{N} \sqrt{2}\, W^{(L+1)} \sigma_{\mathrm{relu}}\big( f^{(L)}(\bar{h}_{G,j}) \big), \tag{2}$$

where $\theta := (W^{(i)})_{i \leq L+1}$ is initialized randomly with standard normal i.i.d. entries, and $\sigma_{\mathrm{relu}}(x) := \max(0, x)$. The network operates on aggregated node features $\bar{h}_{G,j}$, as is typical in graph convolutional networks [27]. For convenience, we assume that at initialization $f_{\mathrm{GNN}}(G; \theta_0) = 0$ for all $G \in \mathcal{G}$, similar to [25, 45]. This assumption can be fulfilled without loss of generality, with a similar treatment as in [25, Appendix B.2].

**Embedded Invariances.** In this work, we use graph neural networks to estimate the unknown reward function $f^*$. This choice is motivated by the expressiveness of the GNN, the fact that it scales well with graph size, and particularly by the invariances embedded in its structure. We observe that the graph neural network $f_{\mathrm{GNN}}$ is invariant to node permutations, i.e., for all $G \in \mathcal{G}$ and $c \in \mathcal{P}_N$, $f_{\mathrm{GNN}}(G; \theta) = f_{\mathrm{GNN}}(c \cdot G; \theta)$. The key step in showing this property is proving that $f_{\mathrm{GNN}}$ can be expressed as an additive model of $L$-layer fully-connected ReLU networks, $f_{\mathrm{GNN}}(G; \theta) = \frac{1}{N} \sum_{j=1}^{N} f_{\mathrm{NN}}(\bar{h}_{G,j}; \theta)$, where $f_{\mathrm{NN}}$ has a similar recursive definition as $f_{\mathrm{GNN}}$ (see Equation A.1). These properties are formalized in Lemma A.1 and Lemma A.2.
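To illustrate the recursion in Eq. (2) and the additive decomposition above, here is a minimal NumPy sketch of the forward pass. All names are ours, and the weight shapes are one concrete choice consistent with the definition; this is an illustration, not the authors' implementation:

```python
import numpy as np

def gnn_forward(H_bar: np.ndarray, weights: list[np.ndarray], m: int) -> float:
    """Sketch of the GNN forward pass of Eq. (2).

    H_bar:   (N, d) aggregated node features, one row per node.
    weights: [W1 of shape (m, d), W2..WL of shape (m, m), W_out of shape (1, m)],
             i.e., the parameters theta, drawn i.i.d. standard normal.
    The same fully-connected ReLU network is applied to every node and the
    outputs are averaged, so the result is invariant to row permutations.
    """
    relu = lambda x: np.maximum(x, 0.0)
    total = 0.0
    for h in H_bar:                                # f_GNN(G) = (1/N) sum_j f_NN(h_bar_j)
        f = weights[0] @ h                         # f^(1) = W^(1) h_bar_j
        for W in weights[1:-1]:                    # f^(l) = sqrt(2/m) W^(l) relu(f^(l-1))
            f = np.sqrt(2.0 / m) * (W @ relu(f))
        total += np.sqrt(2.0) * float(weights[-1] @ relu(f))  # output layer
    return total / H_bar.shape[0]
```

Since the loop applies the same per-node network $f_{\mathrm{NN}}$ to every row of `H_bar` and then averages, reordering the rows (i.e., permuting the nodes) leaves the output unchanged, which is exactly the embedded invariance discussed above.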
**Lazy (NTK) Regime.** We initialize and train $f_{\mathrm{GNN}}$ in the well-known lazy regime [11]. In this regime, when the width $m$ is large, training with gradient descent using a small learning rate causes little change in the network's parameters. Let $g_{\mathrm{GNN}}(G, \theta) = \nabla_{\theta} f_{\mathrm{GNN}}(G, \theta)$ denote the gradient of the network. It can be shown that during training, for all $G \in \mathcal{G}$, the network remains close to $f_{\mathrm{GNN}}(G, \theta_0) + g_{\mathrm{GNN}}^{\top}(G, \theta_0)(\theta - \theta_0)$, that is, its first-order approximation around the initialization parameters $\theta_0$. Training this linearized model with a squared-error loss is equivalent to kernel regression with the tangent kernel $\hat{k}_{\mathrm{GNN}}(G, G') := g_{\mathrm{GNN}}^{\top}(G; \theta_0)\, g_{\mathrm{GNN}}(G'; \theta_0)$. For networks of finite width, this kernel function is random, since it depends on the random network parameters at initialization. We show in Proposition 3.1 that, in the infinite-width limit, the tangent kernel converges to a deterministic kernel. This proposition introduces the Graph Neural Tangent Kernel as the limiting kernel and links it to the Neural Tangent Kernel ([21], also defined in Appendix A).

**Proposition 3.1.** Consider any two graphs $G$ and $G'$ with $N$ nodes and $d$-dimensional node features. In the infinite-width limit, the tangent kernel $\hat{k}_{\mathrm{GNN}}(G, G')/m$ converges to a deterministic kernel,

$$k_{\mathrm{GNN}}(G, G') := \lim_{m \to \infty} \hat{k}_{\mathrm{GNN}}(G, G')/m, \tag{3}$$

which we refer to as the Graph Neural Tangent Kernel (GNTK). Moreover,

$$k_{\mathrm{GNN}}(G, G') = \frac{1}{N^2} \sum_{j, j'=1}^{N} k_{\mathrm{NN}}\big( \bar{h}_{G,j}, \bar{h}_{G',j'} \big), \tag{4}$$

where $k_{\mathrm{NN}}: \mathbb{S}^{d-1} \times \mathbb{S}^{d-1} \to \mathbb{R}$ is the Neural Tangent Kernel.

The proof is given in Appendix A.1. We note that $\bar{h}_{G,j}$ lies on the $d$-dimensional sphere, since the aggregated node features are normalized. The NTK is bounded by 1 for any two points on the sphere [5]. Therefore, Proposition 3.1 implies that the GNTK is also bounded, i.e., $k_{\mathrm{GNN}}(G, G') \leq 1$ for any $G, G' \in \mathcal{G}$. This proposition yields a kernel that captures the behavior of the lazy GNN. While defined on graphs with $Nd$-dimensional representations, the effective input domain of this kernel is $d$-dimensional. This advantage stems directly from the additive construction of the GNTK. The next proposition uncovers the embedded structure of the GNTK by showing a novel connection between the GNTK and $k$, the permutation-invariant additive kernel from Eq. (1). The proof is presented in Appendix A.1.

**Proposition 3.2.** Consider $k$ from Eq. (1), where for every $1 \leq j \leq N$ the base kernel $k^{(j)}$ is set to be equal to $k_{\mathrm{NN}}$:

$$k(G, G') = \frac{1}{N |\mathcal{P}_N|^2} \sum_{c, c' \in \mathcal{P}_N} \sum_{j=1}^{N} k_{\mathrm{NN}}\big( \bar{h}_{G,c(j)}, \bar{h}_{G',c'(j)} \big).$$

Then the permutation-invariant additive kernel and the GNTK are identical, i.e., for all $G, G' \in \mathcal{G}$, $k(G, G') = k_{\mathrm{GNN}}(G, G')$.

This result implies that $k_{\mathrm{GNN}}$ inherits the favorable properties of the permutation-invariant additive kernel class. Hence, functions residing in $\mathcal{H}_{\mathrm{GNN}}$, the RKHS of $k_{\mathrm{GNN}}$, are additive, invariant to node permutations, and act on $G$ through its aggregated node features. While we use the GNTK as an analytical tool, this kernel can be of independent interest in kernel methods over graph domains. In particular, calculating $k_{\mathrm{GNN}}$ requires significantly fewer operations than a kernel $k$ with an arbitrary choice of $k^{(j)}$, for which calculating $k(G, G')$ requires super-exponentially many operations in $N$ (see Eq. (1)). In contrast, due to the decomposition in Eq. (4), calculating $k_{\mathrm{GNN}}$ only costs a quadratic number of summations.
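The quadratic-cost computation via Eq. (4) is easy to state in code. The sketch below is ours and assumes a user-supplied base kernel `k_nn`; evaluating the true NTK in closed form is beyond this snippet, so the usage example plugs in a linear kernel purely as a stand-in:

```python
import numpy as np

def gntk(H_bar_G: np.ndarray, H_bar_Gp: np.ndarray, k_nn) -> float:
    """GNTK between two graphs via the decomposition in Eq. (4).

    H_bar_G, H_bar_Gp: (N, d) aggregated node features of the two graphs.
    k_nn: base kernel on the unit sphere, mapping two d-vectors to a scalar
          (the analysis uses the Neural Tangent Kernel here).
    Cost: N^2 base-kernel evaluations, versus (N!)^2 terms in Eq. (1).
    """
    N = H_bar_G.shape[0]
    return sum(k_nn(x, y) for x in H_bar_G for y in H_bar_Gp) / N**2

# Stand-in base kernel for demonstration only (not the NTK):
k_lin = lambda x, y: float(x @ y)
```

Because the double sum runs over all node pairs of the two graphs, the value is unchanged under any reordering of either graph's nodes, mirroring Proposition 3.2.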
## 4 GNN Bandits

The bandit literature is rich with algorithms that effectively balance exploration and exploitation to achieve sublinear regret. Two components are common in kernelized bandit optimization algorithms: the maximum information gain, which characterizes the worst-case complexity of the learning problem [37, 24, 12, 40], and confidence sets, which quantify the learner's uncertainty [4, 39, 36, 12, 31]. Our first main result is an upper bound on the maximum information gain when the hypothesis space is $\mathcal{H}_{\mathrm{GNN}}$ (Theorem 4.1). We then propose valid confidence sets that utilize GNNs in Theorem 4.2. These theorems may be of independent interest, as they can be used towards bounding the regret of a variety of bandit algorithms on graphs. Lastly, we introduce the GNN-PE algorithm, together with its regret guarantee.

### 4.1 Information Gain

In bandit tasks, the learner seeks actions that give a large reward while, at the same time, providing information about the unknown reward function. The speed of learning about $f^*$ is commonly quantified via the maximum information gain. Assume that the learner chooses a sequence of actions $(G_1, \dots, G_T)$ and observes noisy rewards, where the noise is i.i.d. and drawn from a zero-mean sub-Gaussian distribution with variance proxy $\lambda$. The information gain of this sequence, calculated via the GNTK, is

$$I(G_1, \dots, G_T; k_{\mathrm{GNN}}) = \frac{1}{2} \log \det\big( I + \lambda^{-1} K_{\mathrm{GNN},T} \big)$$

with the kernel matrix $K_{\mathrm{GNN},T} = [k_{\mathrm{GNN}}(G_i, G_j)]_{i,j \leq T}$. The maximum information gain (MIG) [37] is then defined as

$$\gamma_{\mathrm{GNN},T} = \max_{(G_1, \dots, G_T):\, G_t \in \mathcal{G}} I(G_1, \dots, G_T; k_{\mathrm{GNN}}). \tag{5}$$

In Section 4.3, we express regret bounds in terms of this quantity, as is common in kernelized and neural bandits. In Theorem 4.1, we obtain a data-independent bound on the MIG. The proof is given in Appendix B.

**Theorem 4.1 (GNTK Information Gain Bound).** Suppose the observation noise is i.i.d. and drawn from a zero-mean sub-Gaussian distribution, and the input domain is $\mathcal{G}$. Then the maximum information gain associated with $k_{\mathrm{GNN}}$ is bounded by

$$\gamma_{\mathrm{GNN},T} = \mathcal{O}\big( T^{\frac{d-1}{d}} \log^{\frac{1}{d}} T \big).$$

We observe that the obtained MIG bound does not depend on $N$, the number of nodes in the graphs. To highlight this advantage, we compare Theorem 4.1 to the equivalent bound for the vanilla neural tangent kernel, which ignores the graph structure. We consider the neural tangent kernel that operates on graphs through the $Nd$-dimensional vector of aggregated node features $\bar{h}_G$, i.e., $\kappa_{\mathrm{NN}}(G, G') = k_{\mathrm{NN}}(\bar{h}_G, \bar{h}_{G'})$. For $\kappa_{\mathrm{NN}}$, the maximum information gain scales as $\gamma_{\mathrm{NN},T} = \mathcal{O}\big( T^{(Nd-1)/Nd} \log^{1/Nd} T \big)$, where $N$ appears in the exponent [25]. This results in poor scalability with graph size in the bandit optimization task, as we further demonstrate in Section 4.3. Table 1 summarizes this comparison.

### 4.2 Confidence Sets

Quantifying the uncertainty over the reward helps the learner guide exploration and balance it against exploitation. Confidence sets are an integral tool for uncertainty quantification. Conditioned on the history $H_{t-1} = (G_i, y_i)_{i < t}$, the learner constructs a confidence set around its estimate of the reward. We use $\lambda_0 > 0$ to denote the minimum eigenvalue of the kernel matrix calculated for the entire domain, i.e., $K_{\mathrm{GNN}} = [k_{\mathrm{GNN}}(G, G')]_{G, G' \in \mathcal{G}}$. Theorem 4.2 shows that this construction gives valid confidence intervals, i.e., it satisfies Eq. (7), when the reward function lies in $\mathcal{H}_{\mathrm{GNN}}$ and has a bounded RKHS norm.

**Theorem 4.2 (GNN Confidence Bound).** Set $\delta \in (0, 1)$. Suppose $f^* \in \mathcal{H}_{k_{\mathrm{GNN}}}$ with a bounded norm $\| f^* \|_{k_{\mathrm{GNN}}} \leq B$. Assume that the random sequences $(G_i)_{i < t}$ …