# Quantum Bayesian Optimization

Zhongxiang Dai*1, Gregory Kang Ruey Lau*1,2, Arun Verma1, Yao Shu1, Bryan Kian Hsiang Low1, Patrick Jaillet3
1Department of Computer Science, National University of Singapore
2CNRS@CREATE, 1 Create Way, #08-01 Create Tower, Singapore 138602
3Department of Electrical Engineering and Computer Science, MIT
dzx@nus.edu.sg, greglau@comp.nus.edu.sg, arun@comp.nus.edu.sg, shuyao95@u.nus.edu, lowkh@comp.nus.edu.sg, jaillet@mit.edu

* Equal contribution.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Kernelized bandits, also known as Bayesian optimization (BO), have been a prevalent method for optimizing complicated black-box reward functions. Various BO algorithms have been theoretically shown to enjoy upper bounds on their cumulative regret which are sub-linear in the number $T$ of iterations, and a regret lower bound of $\Omega(\sqrt{T})$ has been derived, which represents the unavoidable regret of any classical BO algorithm. Recent works on quantum bandits have shown that with the aid of quantum computing, it is possible to achieve regret upper bounds better than the corresponding classical lower bounds. However, these works are restricted to either multi-armed or linear bandits, and are hence unable to solve sophisticated real-world problems with non-linear reward functions. To this end, we introduce the quantum-Gaussian process-upper confidence bound (Q-GP-UCB) algorithm. To the best of our knowledge, our Q-GP-UCB is the first BO algorithm able to achieve a regret upper bound of $O(\mathrm{poly}\log T)$, which is significantly smaller than its regret lower bound of $\Omega(\sqrt{T})$ in the classical setting. Moreover, thanks to our novel analysis of the confidence ellipsoid, our Q-GP-UCB with the linear kernel achieves a smaller regret than the quantum linear UCB algorithm from previous work. We use simulations, as well as an experiment using a real quantum computer, to verify that the theoretical quantum speedup achieved by our Q-GP-UCB is also potentially relevant in practice.

## 1 Introduction

Kernelized bandits [9], also named Bayesian optimization (BO) [10, 11, 16, 19, 32], have been an immensely popular method for various applications involving the optimization of complicated black-box reward functions. For example, BO has been extensively used to optimize the hyperparameters of machine learning (ML) models [12, 14, 22], the parameters of computationally expensive simulators [5], etc. In every iteration $t = 1, \ldots, T$, BO chooses an arm/input $x_t$ and then queries the reward function $f$ for a noisy observation $y_t = f(x_t) + \zeta_t$, where $\zeta_t$ is a sub-Gaussian noise. In addition to its impressive practical performance, BO is also equipped with solid theoretical guarantees. A number of commonly adopted BO algorithms have been theoretically shown to enjoy sub-linear upper bounds on their cumulative regret [9, 33], which ensures that they are guaranteed to find the global optimum of the reward function as $T$ (i.e., the total number of function queries) increases. On the other hand, a lower bound of $\Omega(\sqrt{T})$ (ignoring additional log factors) on the cumulative regret of BO has been shown [31], which represents the fundamental limit of any BO algorithm in the classical setting. In other words, no classical BO algorithm can achieve a cumulative regret smaller than $\Omega(\sqrt{T})$. This naturally begs the question: can we leverage more advanced technology to go beyond
the classical setting and hence break this fundamental limit of $\Omega(\sqrt{T})$? In this work, we give an affirmative answer by showing that this can be achieved with the aid of quantum computing [34].

Quantum bandits, which incorporate quantum computing into bandit algorithms, have been studied in a number of recent works [8, 38, 40, 41]. Notably, the recent work of [38] has introduced quantum variants of classical algorithms for multi-armed bandits (MAB) and stochastic linear bandits (SLB). In the setting of quantum bandits adopted by [38], every query to the reward function $f$ at the arm $x_t$ (in the classical setting) is replaced by a chance to access a quantum oracle, which encodes the reward distribution for the arm $x_t$. For every selected arm $x_t$, [38] has proposed to adopt the quantum Monte Carlo (QMC) algorithm [4, 28] as a subroutine to obtain an accurate estimate of $f(x_t)$ efficiently, i.e., using a small number of queries to the quantum oracle (Lemma 1). As a result, [38] has shown that the regrets of the quantum algorithms for both MAB and SLB are significantly improved compared with their classical counterparts, and are smaller than the corresponding classical lower bounds. However, both MAB and SLB fall short when optimizing complex real-world functions (e.g., optimizing the hyperparameters of ML models): MAB is unable to exploit the correlations among the arms to model the reward function, and the assumption of a linear reward function adopted by SLB is usually too restrictive in practice. Therefore, designing a quantum bandit algorithm capable of optimizing sophisticated non-linear reward functions, which is a critical step towards practically useful quantum bandit algorithms, is still an open problem.

In this work, we resolve this open problem by proposing the first algorithm for quantum BO. Similar to [38], in our quantum BO problem setting, queries to the reward function in the classical setting are replaced by access to a quantum oracle encoding the reward distribution. As discussed in [38, 39], a quantum oracle is available when the learning environment is implemented by a quantum algorithm, which makes the setting of our quantum BO fairly general. For example, our quantum BO algorithm may be used to optimize the hyperparameters of quantum ML algorithms [3], such as quantum support vector machines (SVMs) [29] and quantum neural networks (NNs) [1, 18], among others. It may also be used to optimize the parameters of simulators implemented on a quantum computer, or for applications involving quantum systems where the data produced is inherently quantum. Moreover, our quantum BO algorithm could also be applied to classical data and algorithms because, as discussed in [39], any classical computer program can be converted to a quantum circuit, allowing it to serve as a quantum oracle in our quantum BO algorithm.

In this work, we introduce the first quantum BO algorithm: quantum-Gaussian process-upper confidence bound (Q-GP-UCB). For every selected arm $x_s$, our Q-GP-UCB algorithm (Algo. 1) adopts the QMC subroutine (Lemma 1) to efficiently estimate the corresponding reward function value $f(x_s)$ to within a target estimation error $\epsilon_s$. Every arm $x_s$ is selected based on our weighted GP posterior distribution, in which every previously selected arm $x_\tau$ is given a weight of $1/\epsilon_\tau^2$, i.e., inversely proportional to its squared estimation error.
The theoretical analysis of our Q-GP-UCB faces non-trivial technical challenges, and our contributions can be summarized as follows:

- We analyze the growth rate of the weighted information gain (Sec. 5.1), which arises due to the use of our weighted GP regression (Sec. 4.1), and show its connection with the standard maximum information gain commonly used in the analysis of BO/kernelized bandits [9, 33]. Our analysis here may be of broader independent interest for future works adopting weighted GP regression.
- We derive a tight confidence ellipsoid which guarantees the concentration of the reward function around the weighted GP posterior mean (Sec. 5.3), and discuss the intuition behind our proof, which corresponds to an interesting interpretation of our Q-GP-UCB algorithm. A particularly crucial and novel step in our analysis is the insight of applying the concentration inequality for 1-sub-Gaussian noise (Sec. 5.3).
- We prove that our Q-GP-UCB achieves a regret upper bound of $O(\mathrm{poly}\log T)$ for the commonly used squared exponential (SE) kernel (Secs. 5.4 and 5.5), which is considerably smaller than the classical regret lower bound of $\Omega(\sqrt{T})$ [31] and hence represents a significant quantum speedup.
- By using a proof technique similar to the one adopted for our tight confidence ellipsoid (Sec. 5.3), we improve the confidence ellipsoid for the quantum linear UCB (Q-LinUCB) algorithm [38]. This improved analysis leads to a tighter regret upper bound for Q-LinUCB, which matches the regret of our Q-GP-UCB with the linear kernel (Sec. 5.6).
- We use simulations implemented with the Qiskit package to verify the empirical improvement of our Q-GP-UCB over the classical GP-UCB and over Q-LinUCB, through a synthetic experiment and an experiment on automated machine learning (AutoML) (Sec. 6). Notably, we have also performed an experiment using a real quantum computer, in which our Q-GP-UCB still achieves a consistent performance advantage (Fig. 4, App. J).

## 2 Related Work

A number of recent works have studied bandit algorithms in the quantum setting. The works of [8] and [40] have focused on the problem of pure exploration in quantum bandits. [25] has studied a different problem setting where the learner aims to maximize some property of an unknown quantum state (i.e., the rewards) by sequentially selecting the observables (i.e., actions). The recent work of [38] has introduced algorithms for quantum multi-armed bandits (MAB) and stochastic linear bandits (SLB), and has shown that by incorporating the QMC subroutine into MAB and SLB, tight regret upper bounds can be achieved which are better than the classical lower bounds. More recently, the works of [23] and [41] have followed similar approaches to introduce quantum bandit algorithms for, respectively, stochastic convex bandits and bandits with heavy-tailed reward distributions. In addition to quantum bandits, some recent works have introduced quantum reinforcement learning (RL) algorithms. [39] has assumed access to a generative model of the environment and proved that their quantum RL algorithm achieves significantly better sample complexity than its classical counterparts. More recently, [17, 43] have removed the requirement for a generative model in quantum RL and achieved quantum speedup in terms of the regret. More comprehensive reviews of related works on quantum RL can be found in recent surveys (e.g., [27]). The problem setting of our quantum BO (Sec. 3.2) is the same as that of many of the above-mentioned previous works on quantum bandits [38, 41] and quantum RL [17, 39, 43].
However, to the best of our knowledge, none of the existing works on quantum bandits (and quantum RL) is able to handle non-linear reward functions, which we resolve in this work. Over the years, extensive efforts [7, 24, 30, 36] have been made to design BO/kernelized bandit algorithms whose regret upper bounds are small enough to (nearly) match the classical regret lower bound of $\Omega(\sqrt{T})$ (ignoring additional log factors). Our work provides an alternative route by proving that, with the aid of quantum computing, it is possible to achieve a tight regret upper bound which is significantly smaller than the classical regret lower bound.

## 3 Problem Setting and Background

### 3.1 Classical Kernelized Bandits

A BO/kernelized bandit algorithm aims to maximize a reward function $f : \mathcal{X} \to \mathbb{R}$ where $\mathcal{X} \subset \mathbb{R}^d$, i.e., it aims to find $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$. Consistent with previous works on BO/kernelized bandits [9, 37], we assume that $f$ lies in the reproducing kernel Hilbert space (RKHS) associated with a kernel $k$: $f \in \mathcal{H}_k(\mathcal{X})$. That is, we assume that $\|f\|_{\mathcal{H}_k} \le B$ for some $B > 0$, in which $\|\cdot\|_{\mathcal{H}_k}$ denotes the RKHS norm induced by the kernel $k$. Note that, unlike previous works on linear bandits [2], our assumption here allows $f$ to be non-linear w.r.t. the input $x$. In this work, we mainly focus on the widely used squared exponential (SE) kernel $k(x, x') \triangleq \exp(-\|x - x'\|_2^2 / (2l^2))$, in which $l$ is the length scale. We also extend our results to the Matérn kernel in Sec. 5.7. Without loss of generality, we assume that $k(\cdot, \cdot) \le 1$. In the following, we use $[t]$ to denote $\{1, \ldots, t\}$ for simplicity.

In every iteration $t \in [T]$ of BO, an input $x_t$ is selected, after which a corresponding noisy observation $y_t = f(x_t) + \zeta_t$ is collected, where $\zeta_t$ is a sub-Gaussian noise (e.g., bounded noise or Gaussian noise). The inputs $x_t$'s are sequentially selected by maximizing an acquisition function, which is calculated based on the Gaussian process (GP) posterior distribution. Specifically, after $t$ iterations, we use the current history $D_t \triangleq \{(x_1, y_1), \ldots, (x_t, y_t)\}$ to calculate the GP posterior predictive distribution at $x \in \mathcal{X}$: $\mathcal{N}(\mu_t(x), \sigma_t^2(x))$, where
$$\mu_t(x) \triangleq k_t(x)^\top (K_t + \lambda I)^{-1} Y_t, \qquad \sigma_t^2(x) \triangleq k(x, x) - k_t(x)^\top (K_t + \lambda I)^{-1} k_t(x), \qquad (1)$$
in which $k_t(x) \triangleq [k(x_\tau, x)]_{\tau \in [t]}^\top$ and $Y_t \triangleq [y_\tau]_{\tau \in [t]}^\top$ are column vectors, $K_t \triangleq [k(x_\tau, x_{\tau'})]_{\tau, \tau' \in [t]}$ is the $t \times t$ covariance matrix, and $\lambda > 1$ is a regularization parameter. Based on (1), a commonly used acquisition function is GP-UCB [33], which selects the next query by $x_{t+1} = \arg\max_{x \in \mathcal{X}} \mu_t(x) + \xi_{t+1} \sigma_t(x)$, where $\xi_{t+1}$ is a parameter carefully chosen to balance exploration and exploitation. The performance of a BO/kernelized bandit algorithm is usually theoretically analyzed by deriving an upper bound on its cumulative regret $R_T = \sum_{t=1}^{T} [f(x^*) - f(x_t)]$, and a tighter regret upper bound is an indicator of better theoretical convergence.
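To make the classical baseline concrete, below is a minimal NumPy sketch of the GP posterior (1) and the GP-UCB selection rule over a discrete domain. The length scale, $\lambda$, the grid, and the exploration parameter are illustrative assumptions of this sketch, not the paper's settings.

```python
import numpy as np

def se_kernel(A, B, l=0.2):
    # SE kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2)); l is an assumed length scale.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * l ** 2))

def gp_posterior(X_hist, y_hist, X_query, lam=1.1):
    # GP posterior mean/variance from equation (1), with regularizer lambda > 1.
    K = se_kernel(X_hist, X_hist)
    Kinv = np.linalg.inv(K + lam * np.eye(len(X_hist)))
    kq = se_kernel(X_hist, X_query)                      # t x q cross-covariances
    mu = kq.T @ Kinv @ y_hist
    var = 1.0 - np.einsum('tq,ts,sq->q', kq, Kinv, kq)   # k(x, x) = 1 for the SE kernel
    return mu, var

# GP-UCB: pick the maximizer of mu + xi * sigma over a grid domain.
X_domain = np.linspace(0, 1, 20)[:, None]
X_hist = np.array([[0.1], [0.7]]); y_hist = np.array([0.4, 0.9])
mu, var = gp_posterior(X_hist, y_hist, X_domain)
xi = 2.0  # illustrative exploration parameter
x_next = X_domain[np.argmax(mu + xi * np.sqrt(np.maximum(var, 0)))]
```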
### 3.2 Quantum Bandits/BO

A quantum state $|x\rangle$ in Hilbert space $\mathbb{C}^n$ can be expressed as a superposition of $n$ basis states (e.g., qubits) via a vector $x = [x_1, \ldots, x_n]^\top$, with $|x_i|^2$ representing the probability of being in the $i$th basis state. Given two quantum states $|x\rangle \in \mathbb{C}^n$ and $|y\rangle \in \mathbb{C}^m$, we use $|x\rangle |y\rangle = [x_1 y_1, \ldots, x_n y_m]^\top$ to denote their tensor product. A quantum algorithm typically works by applying unitary operators to quantum states, and may have access to input data encoded in unitary operators called quantum oracles (examples have been discussed in Sec. 1). We defer a more detailed introduction to quantum computing to related works dedicated to these topics (e.g., [26]).

Our quantum BO setting follows that of the quantum bandit works [38, 41]. In every iteration of quantum BO, after an input $x$ is selected, instead of observing a noisy reward as in classical BO/bandits (Sec. 3.1), we get a chance to access a quantum unitary oracle $O_x$ and its inverse that encode the noisy reward distribution. Specifically, let $P_x$ denote the reward distribution, $\Omega_x$ denote the finite sample space of $P_x$, and $y_x : \Omega_x \to \mathbb{R}$ denote the random reward associated with input $x$. Then, $O_x$ is formally defined as
$$O_x : |0\rangle \to \sum\nolimits_{\omega \in \Omega_x} \sqrt{P_x(\omega)}\, |\omega\rangle |y_x(\omega)\rangle. \qquad (2)$$

Quantum mean estimation, which aims to estimate the mean of an unknown distribution with better sample efficiency than classical algorithms, is a widely studied problem [4, 21]. Consistent with the works of [38, 41], we will make use of the following quantum Monte Carlo (QMC) algorithm:

Lemma 1 (Quantum Monte Carlo (QMC) [28]). Let $y_x : \Omega_x \to \mathbb{R}$ denote a random variable, where $\Omega_x$ is equipped with probability measure $P_x$, and the quantum unitary oracle $O_x$ encodes $P_x$ and $y_x$.
Bounded Noise: If the noisy output observation satisfies $y_x \in [0, 1]$, then there exists a constant $C_1 > 1$ and a QMC algorithm $\mathrm{QMC}(O_x, \epsilon, \delta)$ which returns an estimate $\hat{y}_x$ of $\mathbb{E}[y_x]$ such that $\mathbb{P}(|\hat{y}_x - \mathbb{E}[y_x]| > \epsilon) \le \delta$, using at most $\frac{C_1}{\epsilon} \log(1/\delta)$ queries to $O_x$ and its inverse.
Noise with Bounded Variance: If the variance of $y_x$ is at most $\sigma^2$, then for $\epsilon < 4\sigma$, there exists a constant $C_2 > 1$ and a QMC algorithm $\mathrm{QMC}(O_x, \epsilon, \delta)$ which returns an estimate $\hat{y}_x$ s.t. $\mathbb{P}(|\hat{y}_x - \mathbb{E}[y_x]| > \epsilon) \le \delta$, using at most $\frac{C_2 \sigma}{\epsilon} \log_2^{3/2}\big(\frac{8\sigma}{\epsilon}\big) \log_2\big(\log_2 \frac{8\sigma}{\epsilon}\big) \log(1/\delta)$ queries to $O_x$ and its inverse.

The quantum query complexity of a quantum algorithm is usually measured by the number of queries to the quantum oracle [3, 38], so the sample complexities from Lemma 1 can be compared with that of classical Monte Carlo (MC) estimation. In the classical setting, it can be easily shown using concentration inequalities that MC estimation requires $\tilde{O}(1/\epsilon^2)$ samples to reach a target mean estimation error of $\epsilon$. Therefore, the QMC algorithm (Lemma 1), which only needs $\tilde{O}(1/\epsilon)$ samples, achieves a quadratic reduction in the required number of samples. This dramatic improvement is crucial for the quantum speedup in terms of regret achieved by our algorithm (Sec. 5.4). During our Q-GP-UCB algorithm, after every input $x_s$ is selected, we apply the QMC algorithm from Lemma 1 with the quantum oracle $O_{x_s}$ to obtain an estimate $y_s$ of its reward $f(x_s)$ (line 6 of Algo. 1). Of note, the above-mentioned equivalence between a query to the quantum oracle and a sample for classical MC mean estimation implies that a query to the quantum oracle $O_x$ in quantum BO/bandits is equivalent to the pulling of an arm $x$ in classical BO/bandits. Therefore, the cumulative regret $R_T$ of our Q-GP-UCB algorithm is defined as the regret incurred after $T$ queries to the quantum oracle, which makes it amenable to comparisons with the regrets of classical algorithms.
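To make the quadratic saving concrete, the following small sketch tabulates the bounded-noise query budget from Lemma 1 against a Hoeffding-based classical MC budget for a few target errors. The constant $C_1 = 2$ and the failure probability are illustrative assumptions, not values from the paper.

```python
import math

C1, delta = 2.0, 0.05  # illustrative constant and failure probability

for eps in [0.1, 0.01, 0.001]:
    n_qmc = math.ceil(C1 / eps * math.log(1 / delta))           # QMC: O(1/eps) queries (Lemma 1)
    n_mc = math.ceil(math.log(2 / delta) / (2 * eps ** 2))      # Hoeffding: O(1/eps^2) samples
    print(f"eps={eps:6}: QMC queries ~ {n_qmc:>8}, classical MC samples ~ {n_mc:>10}")
```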
## 4 Quantum Bayesian Optimization

In this section, we first introduce weighted GP regression (Sec. 4.1), which will be used by our Q-GP-UCB algorithm to calculate the acquisition function for input selection, and then describe our Q-GP-UCB algorithm (Sec. 4.2) in detail.

### 4.1 Weighted GP Posterior Distribution

In contrast to standard GP-UCB [33], which uses the standard GP posterior distribution (1) to calculate the acquisition function, our Q-GP-UCB makes use of a weighted GP posterior distribution. That is, we assign a weight to every previous observation. Specifically, after stage $s$ of our Q-GP-UCB (i.e., given $D_s \triangleq \{(x_\tau, y_\tau)\}_{\tau \in [s]}$), we define the weight matrix $W_s \triangleq \mathrm{diag}(1/\epsilon_1^2, \ldots, 1/\epsilon_s^2)$. $W_s$ is an $s \times s$ diagonal matrix, in which the $\tau$th diagonal element represents the weight $1/\epsilon_\tau^2$ given to the $\tau$th observation $(x_\tau, y_\tau)$. We will set $\epsilon_\tau = \tilde{\sigma}_{\tau-1}(x_\tau)/\sqrt{\lambda}$ (Sec. 4.2), i.e., $\epsilon_\tau$ is calculated using the weighted GP posterior standard deviation (3) (conditioned on the first $\tau - 1$ observations) at $x_\tau$. Define $K_s \triangleq [k(x_\tau, x_{\tau'})]_{\tau, \tau' \in [s]}$, which is the $s \times s$ covariance matrix given the first $s$ observations, and define $\tilde{K}_s \triangleq W_s^{1/2} K_s W_s^{1/2}$, which is the weighted covariance matrix. Similarly, define $k_s(x) \triangleq [k(x, x_\tau)]_{\tau \in [s]}^\top$ and $\tilde{k}_s(x) \triangleq W_s^{1/2} k_s(x)$. Denote the collection of output observations by $Y_s \triangleq [y_\tau]_{\tau \in [s]}^\top$, and define $\tilde{Y}_s \triangleq W_s^{1/2} Y_s$. With these definitions, given $D_s$, our weighted GP posterior distribution at an input $x \in \mathcal{X}$ is a Gaussian distribution $\mathcal{N}(\tilde{\mu}_s(x), \tilde{\sigma}_s^2(x))$, in which
$$\tilde{\mu}_s(x) \triangleq \tilde{k}_s(x)^\top (\tilde{K}_s + \lambda I)^{-1} \tilde{Y}_s, \qquad \tilde{\sigma}_s^2(x) \triangleq k(x, x) - \tilde{k}_s(x)^\top (\tilde{K}_s + \lambda I)^{-1} \tilde{k}_s(x). \qquad (3)$$
Note that the GP posterior mean $\tilde{\mu}_s$ above is equivalently the solution to the following weighted kernel ridge regression problem: $\tilde{\mu}_s = \arg\min_{f \in \mathcal{H}_k(\mathcal{X})} \sum_{\tau=1}^{s} \frac{1}{\epsilon_\tau^2}(y_\tau - f(x_\tau))^2 + \lambda \|f\|_{\mathcal{H}_k}^2$. We give a more detailed analysis of the weighted GP posterior (3) in App. A. Weighted GP regression has also been adopted by previous works on BO such as [15]. However, our choice of the weights, algorithmic design and theoretical analyses all require significantly novel treatments.
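As a concrete reference, the sketch below implements the weighted GP posterior (3). It reuses the se_kernel helper from the earlier snippet (an assumption of this illustration, not part of the paper) and simply pre- and post-multiplies the covariance quantities by $W_s^{1/2}$.

```python
import numpy as np

def weighted_gp_posterior(X_hist, y_hist, eps_hist, X_query, lam=1.1):
    """Weighted GP posterior (3): observation tau carries weight 1/eps_tau^2."""
    w_sqrt = 1.0 / np.asarray(eps_hist)                       # diagonal of W^{1/2}
    K = se_kernel(X_hist, X_hist)                             # helper from the earlier sketch
    K_tilde = w_sqrt[:, None] * K * w_sqrt[None, :]           # W^{1/2} K W^{1/2}
    kq_tilde = w_sqrt[:, None] * se_kernel(X_hist, X_query)   # W^{1/2} k_s(x)
    y_tilde = w_sqrt * y_hist                                 # W^{1/2} Y_s
    A = np.linalg.inv(K_tilde + lam * np.eye(len(X_hist)))
    mu = kq_tilde.T @ A @ y_tilde
    var = 1.0 - np.einsum('tq,ts,sq->q', kq_tilde, A, kq_tilde)
    return mu, np.maximum(var, 0.0)
```

A stage with a small target error $\epsilon_\tau$ thus pulls the posterior mean strongly towards its (accurate) observation $y_\tau$, which is exactly the behaviour motivated above.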
### 4.2 Q-GP-UCB Algorithm

Algorithm 1 Q-GP-UCB
1: for stage $s = 1, 2, \ldots$ do
2: &nbsp;&nbsp; $x_s = \arg\max_{x \in \mathcal{X}} \tilde{\mu}_{s-1}(x) + \beta_s \tilde{\sigma}_{s-1}(x)$, computed via (3).
3: &nbsp;&nbsp; $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda}$.
4: &nbsp;&nbsp; (a) $N_{\epsilon_s} = \frac{C_1}{\epsilon_s} \log\frac{2\overline{m}}{\delta}$ (for bounded noise), or (b) $N_{\epsilon_s} = \frac{C_2 \sigma}{\epsilon_s} \log_2^{3/2}\big(\frac{8\sigma}{\epsilon_s}\big) \log_2\big(\log_2 \frac{8\sigma}{\epsilon_s}\big) \log\frac{2\overline{m}}{\delta}$ (for noise with variance bounded by $\sigma^2$).
5: &nbsp;&nbsp; If $\sum_{k=1}^{s} N_{\epsilon_k} > T$, terminate the algorithm.
6: &nbsp;&nbsp; Run the $\mathrm{QMC}(O_{x_s}, \epsilon_s, \delta/(2\overline{m}))$ algorithm to query the quantum oracle of $x_s$ for the next $N_{\epsilon_s}$ rounds, and obtain $y_s$ as an estimate of $f(x_s)$.
7: &nbsp;&nbsp; Update the weighted GP posterior (3) using $(x_s, y_s)$.

Our Q-GP-UCB algorithm is presented in Algo. 1 and proceeds in stages. In stage $s$, we first select the next input $x_s$ to query by maximizing the GP-UCB acquisition function calculated using the weighted GP posterior distribution (3) (line 2 of Algo. 1). Here $\beta_s \triangleq B + \sqrt{2(\tilde{\gamma}_{s-1} + 1 + \log(2/\delta))}$ (more details in Sec. 5.3), in which $\delta \in (0, 2/e]$ and $\tilde{\gamma}_{s-1} \triangleq \frac{1}{2} \log\det(I + \frac{1}{\lambda}\tilde{K}_{s-1})$ is the weighted information gain (more details in Sec. 5.1). Next, we calculate $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda}$ (line 3) and $N_{\epsilon_s}$ (line 4), in which $N_{\epsilon_s}$ depends on the type of noise and the value of $\epsilon_s$. Here $\overline{m}$ is an upper bound on the total number $m$ of stages, which we will analyze in Sec. 5.2. Subsequently, unless the algorithm is terminated (line 5), we run the QMC algorithm (Lemma 1) to estimate $f(x_s)$ by querying the quantum oracle of $x_s$ for $N_{\epsilon_s}$ rounds (line 6). The QMC procedure returns an estimate $y_s$ of the reward function value $f(x_s)$, for which the estimation error is guaranteed to be bounded: $|y_s - f(x_s)| \le \epsilon_s$ with probability of at least $1 - \delta/(2\overline{m})$. Lastly, we update the weighted GP posterior (3) using the newly collected input-output pair $(x_s, y_s)$, as well as its weight $1/\epsilon_s^2$ (line 7). Of note, the value of $\epsilon_s$ is used for both (a) calculating the number $N_{\epsilon_s}$ of queries to the quantum oracle for $x_s$, and (b) computing the weight $1/\epsilon_s^2$ assigned to $(x_s, y_s)$ in the weighted GP regression (3) in the subsequent iterations.

Regarding (a), our designs of $\epsilon_s$ and $N_{\epsilon_s} = \tilde{O}(1/\epsilon_s)$ have an interesting interpretation in terms of the exploration-exploitation trade-off. In the initial stages, the weighted GP posterior standard deviation $\tilde{\sigma}_{s-1}(x_s)$, and hence $\epsilon_s$, is usually large, which leads to a small number $N_{\epsilon_s}$ of queries for every $x_s$ and hence allows our Q-GP-UCB to favor the exploration of more unique inputs; in later stages, $\tilde{\sigma}_{s-1}(x_s)$ and $\epsilon_s$ usually become smaller, which results in larger $N_{\epsilon_s}$'s and hence causes our Q-GP-UCB to prefer the exploitation of a small number of unique inputs. Regarding (b), assigning a larger weight $1/\epsilon_s^2$ to an input $x_s$ with a smaller $\epsilon_s$ is reasonable, because a smaller $\epsilon_s$ indicates a smaller estimation error for $y_s$ as explained above, which makes the observation $y_s$ more accurate and reliable for calculating the weighted GP regression (3).

Note that we have modified the original GP-UCB by querying every selected input $x_s$ multiple times, in order to make it amenable to the integration of the QMC subroutine. The recent work of [6] has also adapted BO to evaluate a small number of unique inputs while querying each of them multiple times. It has been shown [6] that the resulting BO algorithm preserves both the theoretical guarantee (i.e., the regret upper bound) and the empirical performance of the original BO, while significantly reducing the computational cost. So, the findings from [6] can serve as justification for our modification to GP-UCB. Also note that, despite this similarity, [6] still focuses on the traditional setting (i.e., their regret upper bound is $R_T = O(\gamma_T \sqrt{T})$) and their regret analyses are entirely different from ours.
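The stage loop of Algo. 1 can be summarized in a few lines of Python. QMC is stubbed out by a caller-supplied estimator here (an assumption of this illustration, standing in for the quantum subroutine), and se_kernel and weighted_gp_posterior are the helpers sketched earlier.

```python
import numpy as np

def q_gp_ucb(domain, qmc_estimate, T, B=1.0, lam=1.1, delta=0.1, C1=2.0, m_bar=50):
    # Skeleton of Algo. 1 (bounded-noise case). qmc_estimate(x, eps) stands in for
    # QMC(O_x, eps, delta/(2 m_bar)): it must return an estimate of f(x) within eps.
    X, y, eps_hist, used = [], [], [], 0
    for s in range(1, m_bar + 1):
        if X:
            Xa, w = np.array(X), 1.0 / np.array(eps_hist)
            K_tilde = w[:, None] * se_kernel(Xa, Xa) * w[None, :]
            gamma = 0.5 * np.linalg.slogdet(np.eye(len(X)) + K_tilde / lam)[1]
            mu, var = weighted_gp_posterior(Xa, np.array(y), eps_hist, domain, lam)
        else:
            gamma, mu, var = 0.0, np.zeros(len(domain)), np.ones(len(domain))
        beta = B + np.sqrt(2 * (gamma + 1 + np.log(2 / delta)))
        ucb = mu + beta * np.sqrt(var)                            # line 2
        i = int(np.argmax(ucb))
        x_s = domain[i]
        eps_s = float(np.sqrt(var[i]) / np.sqrt(lam))             # line 3
        N_s = int(np.ceil(C1 / eps_s * np.log(2 * m_bar / delta)))  # line 4(a)
        if used + N_s > T:                                        # line 5
            break
        used += N_s
        y_s = qmc_estimate(x_s, eps_s)                            # line 6
        X.append(x_s); y.append(y_s); eps_hist.append(eps_s)      # line 7
    return X, y
```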
## 5 Theoretical Analysis

Throughout our analysis, we condition on the event that $|f(x_s) - y_s| \le \epsilon_s, \forall s = 1, \ldots, m$. This event holds with probability of at least $1 - \delta/2$, because we set the error probability in QMC (Lemma 1) to $\delta/(2\overline{m})$, in which $m \le \overline{m}$ (see Theorem 2 for details). For simplicity, we mainly focus on the scenario of bounded noise (i.e., the observations are bounded within $[0, 1]$; the first case of Lemma 1), because the theoretical analyses and insights for the other scenario of noise with bounded variance (i.e., the second case of Lemma 1) are almost identical (more details in Sec. 5.5).

### 5.1 Weighted Information Gain

To begin with, the following lemma (proof in App. B) gives an upper bound on the sum of the weights in all $m$ stages, which will be useful for upper-bounding the weighted information gain $\tilde{\gamma}_m$.

Lemma 2. Choose $\delta \in (0, 2/e]$. Then we have that $\sum_{s=1}^{m} 1/\epsilon_s^2 \le T^2$.

Denote by $\gamma_T$ the maximum information gain from any set of $T$ inputs: $\gamma_T \triangleq \max_{\{x_1, \ldots, x_T\} \subset \mathcal{X}} \frac{1}{2} \log\det(I + \frac{1}{\lambda} K_T)$, which is commonly used in the analysis of BO/kernelized bandit algorithms [9, 33]. Denote by $\tilde{\gamma}_m$ the weighted information gain from the $m$ selected inputs in all stages: $\tilde{\gamma}_m \triangleq \frac{1}{2} \log\det(I + \frac{1}{\lambda} \tilde{K}_m)$. Note that in the definition of $\tilde{\gamma}_m$, we do not take the maximum over all sets of $m$ inputs, so $\tilde{\gamma}_m$ still depends on the selected inputs $X_m \triangleq \{x_1, \ldots, x_m\}$. Also note that $\{\epsilon_1, \ldots, \epsilon_m\}$ are known constants conditioned on $X_m$. This is because the weighted GP posterior standard deviation $\tilde{\sigma}_{s-1}$ (3) does not depend on the output observations, so every $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda}$ only depends on $\{x_1, \ldots, x_s\}$. The next theorem gives an upper bound on $\tilde{\gamma}_m$.

Theorem 1. (a) Given $X_m \triangleq \{x_1, \ldots, x_m\}$, the growth rate of $\tilde{\gamma}_m$ is the same as that of $\gamma_{\sum_{s=1}^{m} 1/\epsilon_s^2}$. (b) $\tilde{\gamma}_m \le \gamma_{T^2}$.

We give the complete statement of result (a) in Theorem 8 (App. C), which presents the concrete growth rate. As stated by result (a), to obtain an upper bound on $\tilde{\gamma}_m$, we simply need to first use the upper bound on $\gamma_T$ from previous works [35], and then replace the $T$ in this upper bound by $\sum_{s=1}^{m} 1/\epsilon_s^2$. Based on this, result (b) further upper-bounds $\sum_{s=1}^{m} 1/\epsilon_s^2$ by $T^2$ through Lemma 2. Note that because $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda} \le 1$, we have that $1/\epsilon_s^2 \ge 1$ and hence $\sum_{s=1}^{m} 1/\epsilon_s^2 \ge m$. Therefore, our weighted information gain $\tilde{\gamma}_m$ has a looser upper bound than $\gamma_m$. This intuitively implies that our weighted covariance matrix $\tilde{K}_m$ is expected to contain more information than the original $K_m$; the implication of this will be discussed again in Sec. 5.2. The proof of Theorem 1 (App. C) follows the analysis of the information gain from [35]. Specifically, we have leveraged a finite-dimensional projection of the infinite-dimensional RKHS feature space, and hence upper-bounded the information gain in terms of the tail mass of the eigenvalues of the kernel $k$. The proof requires carefully tracking the impact of the weights $1/\epsilon_s^2$ throughout the analysis. Importantly, our Theorem 1 and its analysis may be of wider independent interest for future works which also adopt weighted GP regression (Sec. 4.1).

The growth rate of $\gamma_T$ has been characterized for commonly used kernels [35]. Based on these, the next corollary provides the growth rates of $\tilde{\gamma}_m$ for the linear and squared exponential (SE) kernels.

Corollary 1. For the linear and SE kernels, we have that
$$\tilde{\gamma}_m = O\Big(d \log \sum\nolimits_{s=1}^{m} 1/\epsilon_s^2\Big) = O(d \log T) \ \text{(linear kernel)}, \qquad \tilde{\gamma}_m = O\Big(\log^{d+1} \sum\nolimits_{s=1}^{m} 1/\epsilon_s^2\Big) = O\big((\log T)^{d+1}\big) \ \text{(SE kernel)}.$$

### 5.2 Upper Bound on the Total Number m of Stages

The next theorem gives an upper bound on the total number of stages of our Q-GP-UCB algorithm.

Theorem 2. For the linear kernel, our Q-GP-UCB algorithm has at most $\overline{m} = O(d \log T)$ stages. For the SE kernel, our Q-GP-UCB algorithm has at most $\overline{m} = O((\log T)^{d+1})$ stages.

Note that for the linear kernel, our upper bound on $\overline{m}$ matches that of the Q-LinUCB algorithm from [38] (see Lemma 2 of [38]). The proof of Theorem 2 is given in App. D; here we give a brief sketch. For stage $s$, define $V_s \triangleq \lambda I + \sum_{\tau=1}^{s} (1/\epsilon_\tau^2) \phi(x_\tau) \phi(x_\tau)^\top$, in which $\phi(\cdot)$ is the (potentially infinite-dimensional) RKHS feature mapping: $k(x, x') = \phi(x)^\top \phi(x')$. Intuitively, $\log\det V_s$ represents the amount of information contained in the first $s$ selected inputs $\{x_1, \ldots, x_s\}$. As the first step in our proof, we show that, thanks to our choice of $\epsilon_s$ (line 3 of Algo. 1) and our design of the weights $1/\epsilon_s^2$, the amount of information is doubled after every stage: $\det V_{\tau+1} = 2 \det V_\tau, \forall \tau = 0, \ldots, m-1$. This allows us to show that $m \log 2 = \log(\det(V_m)/\det(V_0))$, where $V_0 = \lambda I$. Next, we show that this term is also related to our weighted information gain: $\log(\det(V_m)/\det(V_0)) = 2\tilde{\gamma}_m$, which allows us to upper-bound $m$ in terms of $\tilde{\gamma}_m$ and hence in terms of $\sum_{s=1}^{m} 1/\epsilon_s^2$ (see Corollary 1). This eventually allows us to derive an upper bound on $m$ by using the fact that the total number of iterations (i.e., queries to the quantum oracles) is no larger than $T$. The key intuition of the proof is therefore that as long as we gather a sufficient amount of information in every stage, we do not need a large total number of stages.
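To spell out the doubling step in this sketch, the per-stage determinant update follows from the matrix determinant lemma together with line 3 of Algo. 1 (a restatement of the computation carried out in App. D):

```latex
\[
\det V_{\tau+1}
  = \det V_\tau \,\det\!\Big(I + \tfrac{1}{\epsilon_{\tau+1}^2}
      V_\tau^{-1/2}\phi(x_{\tau+1})\phi(x_{\tau+1})^\top V_\tau^{-1/2}\Big)
  = \det V_\tau \Big(1 + \tfrac{1}{\epsilon_{\tau+1}^2}
      \|\phi(x_{\tau+1})\|_{V_\tau^{-1}}^2\Big)
  = 2\,\det V_\tau ,
\]
since $\tilde{\sigma}_\tau^2(x_{\tau+1}) = \lambda\,\phi(x_{\tau+1})^\top V_\tau^{-1}\phi(x_{\tau+1})$
and $\epsilon_{\tau+1}^2 = \tilde{\sigma}_\tau^2(x_{\tau+1})/\lambda$, so the added term equals $1$.
Iterating over the stages gives $m\log 2 = \log(\det V_m/\det V_0) = 2\tilde{\gamma}_m$.
```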
### 5.3 Confidence Ellipsoid

The next theorem proves the concentration of $f$ around the weighted GP posterior mean (3).

Theorem 3. Let $\eta \triangleq 2/T$, $\lambda \triangleq 1 + \eta$, and $\beta_s \triangleq B + \sqrt{2(\tilde{\gamma}_{s-1} + 1 + \log(2/\delta))}$. With probability of at least $1 - \delta/2$,
$$|f(x) - \tilde{\mu}_{s-1}(x)| \le \beta_s \tilde{\sigma}_{s-1}(x), \quad \forall s \in [m],\ x \in \mathcal{X}.$$

The proof of Theorem 3 is presented in App. E; here we give a brief proof sketch. Denote by $\mathcal{F}_{s-1}$ the $\sigma$-algebra generated by $\{x_1, \ldots, x_s, \zeta_1, \ldots, \zeta_{s-1}\}$, where $\zeta_k = y_k - f(x_k)$ is the noise. Recall that $W_s^{1/2} \triangleq \mathrm{diag}(1/\epsilon_1, \ldots, 1/\epsilon_s)$ (Sec. 4.1), and define $\zeta_{1:s} \triangleq [\zeta_k]_{k \in [s]}^\top$. In the first part of the proof, we use the RKHS feature space view of the weighted GP posterior (3) (see App. A) to upper-bound $|f(x) - \tilde{\mu}_{s-1}(x)|$ in terms of $\tilde{\sigma}_{s-1}(x)$ and $\|W_s^{1/2} \zeta_{1:s}\|_{((\tilde{K}_s + \eta I)^{-1} + I)^{-1}}$. Next, the most crucial step in the proof is to upper-bound $\|W_s^{1/2} \zeta_{1:s}\|_{((\tilde{K}_s + \eta I)^{-1} + I)^{-1}}$ in terms of $\tilde{\gamma}_s$ by applying the self-normalized concentration inequality from Theorem 1 of [9] to 1-sub-Gaussian noise. This is feasible because (a) every $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda}$ is $\mathcal{F}_{s-1}$-measurable, and (b) conditioned on $\mathcal{F}_{s-1}$, the scaled noise term $\zeta_s/\epsilon_s$ (in $W_s^{1/2} \zeta_{1:s}$) is 1-sub-Gaussian. Statement (a) follows because $\epsilon_s$ only depends on $\{x_1, \ldots, x_s\}$, as we have discussed in Sec. 5.1. Statement (b) follows since our QMC subroutine ensures that every noise term $\zeta_k$ is bounded within $[-\epsilon_k, \epsilon_k]$ (Sec. 4.2), because $|y_k - f(x_k)| \le \epsilon_k$ (with high probability). This implies that, conditioned on $\mathcal{F}_{s-1}$ (which makes $\epsilon_s$ a known constant as discussed above), $\zeta_s/\epsilon_s$ is bounded within $[-1, 1]$ and is hence 1-sub-Gaussian.

This critical step in the proof is in fact also consistent with an interesting interpretation of our algorithm. That is, after $x_s$ is selected in stage $s$, we adaptively decide the number $N_{\epsilon_s}$ of queries to the quantum oracle in order to reach the target accuracy $\epsilon_s$ (line 4 of Algo. 1). This ensures that $|y_s - f(x_s)| \le \epsilon_s$ (with high probability), which guarantees that the noise $\zeta_s = y_s - f(x_s)$ is (conditionally) $\epsilon_s$-sub-Gaussian. Moreover, note that the vector of observations $\tilde{Y}_s$ used in the calculation of the weighted GP posterior mean (3) is $\tilde{Y}_s = W_s^{1/2} Y_s$. So, from the perspective of the weighted GP posterior (3), every noisy observation $y_s$, and hence every noise $\zeta_s$, is multiplied by $1/\epsilon_s$. The effective noise $\zeta_s/\epsilon_s$ of the observations $\tilde{Y}_s$ used in the weighted GP posterior is therefore 1-sub-Gaussian. This shows the underlying intuition as to why our proof of the concentration of the reward function $f$ under weighted GP regression can make use of 1-sub-Gaussian noise (in contrast to, e.g., the $\sigma$-sub-Gaussian noise used in the proof of classical GP-UCB [9]). Our tight confidence ellipsoid around $f$ from Theorem 3 is crucial for deriving a tight regret upper bound for our Q-GP-UCB (Sec. 5.4). Moreover, as we will discuss in more detail in Sec. 5.6, our proof technique for Theorem 3 can, interestingly, be adopted to obtain a tighter confidence ellipsoid for the Q-LinUCB algorithm from [38] and hence improve its regret upper bound.

### 5.4 Regret Upper Bound

Here we derive an upper bound on the cumulative regret of our Q-GP-UCB algorithm.

Theorem 4. With probability of at least $1 - \delta$, we have that
$$R_T = O\big(d^{3/2} (\log T)^{3/2} \log(d \log T)\big) \ \text{(linear kernel)}, \qquad R_T = O\big((\log T)^{3(d+1)/2} \log((\log T)^{d+1})\big) \ \text{(SE kernel)}.$$
(We have omitted the dependency on $\log(1/\delta)$ for simplicity; refer to App. F for the full expression.)

The proof of Theorem 4 is presented in App. F. The proof starts by following the standard technique in the analysis of UCB-type algorithms to show that $r_s = f(x^*) - f(x_s) \le 2\beta_s \tilde{\sigma}_{s-1}(x_s)$. Next, importantly, our design of $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda}$ allows us to show that $r_s \le 2\beta_s \epsilon_s \sqrt{\lambda}$. After that, the total regret incurred in a stage $s$ can be upper-bounded by $N_{\epsilon_s} \cdot 2\beta_s \epsilon_s \sqrt{\lambda} = O\big(\frac{1}{\epsilon_s} \log(\overline{m}/\delta) \cdot \beta_s \epsilon_s\big) = O(\beta_s \log(\overline{m}/\delta))$. Lastly, the regret upper bound follows by summing the regrets over all $m$ stages.
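Written out, the per-stage cancellation that drives the speedup is the following one-line computation (with $N_{\epsilon_s}$ from line 4(a) of Algo. 1; constants are absorbed into the $O(\cdot)$):

```latex
\[
\underbrace{N_{\epsilon_s}}_{\text{queries in stage } s}
  \cdot \underbrace{2\beta_s \epsilon_s \sqrt{\lambda}}_{\text{regret per query}}
  = \frac{C_1}{\epsilon_s}\log\frac{2\overline{m}}{\delta}\cdot 2\beta_s\epsilon_s\sqrt{\lambda}
  = O\big(\beta_s \log(\overline{m}/\delta)\big),
\]
whereas classical MC estimation would need $N_{\epsilon_s} = O(1/\epsilon_s^2)$ queries,
leaving a residual factor of $1/\epsilon_s$ and breaking the cancellation.
```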
Of note, for the commonly used SE kernel, our regret upper bound is of the order $O(\mathrm{poly}\log T)$, which is significantly smaller than the regret lower bound of $\Omega(\sqrt{T})$ in the classical setting shown by [31]. This improvement over the classical fundamental limit is mostly attributed to the use of QMC (Lemma 1), i.e., line 6 of Algo. 1. If the QMC procedure were replaced by classical MC estimation, then in contrast to the $N_{\epsilon_s} = O(1/\epsilon_s)$ queries by QMC, every stage would require $N_{\epsilon_s} = O(1/\epsilon_s^2)$ queries. As a result, this would introduce an additional factor of $1/\epsilon_s$ into the total regret of stage $s$ (as discussed above), which would render the derivation of our regret upper bound (Theorem 4) invalid. This verifies that our tight regret upper bound in Theorem 4 is only achievable with the aid of quantum computing. Our Theorem 4 is a demonstration of the immense potential of quantum computing to dramatically improve theoretical guarantees over the classical setting. Importantly, when the linear kernel is used, the regret upper bound of our Q-GP-UCB is in fact better than that of the Q-LinUCB algorithm from [38]. This improvement arises from our tight confidence ellipsoid in Theorem 3. In Sec. 5.6, we give a more detailed discussion of this improvement and adopt our proof technique for Theorem 3 to improve the confidence ellipsoid for Q-LinUCB, after which their improved regret upper bound matches that of our Q-GP-UCB with the linear kernel.

### 5.5 Regret Upper Bound for Noise with Bounded Variance

Here we analyze the regret of our Q-GP-UCB when the variance of the noise is bounded by $\sigma^2$, which encompasses common scenarios including $\sigma$-sub-Gaussian noise and Gaussian noise with variance $\sigma^2$. Consistent with the analysis of Q-LinUCB [38], in order to apply the theoretical guarantee provided by QMC (Lemma 1) for noise with bounded variance, we need to assume that the noise variance is not too small, i.e., we assume that $\sigma > 1/4$. In this case, the following theorem gives an upper bound on the regret of our Q-GP-UCB.

Theorem 5. With probability of at least $1 - \delta$, we have that
$$R_T = O\big(\sigma d^{3/2} (\log T)^{3/2} \log(d \log T) (\log_2 \sigma T)^{3/2} \log_2(\log_2 \sigma T)\big) \ \text{(linear kernel)},$$
$$R_T = O\big(\sigma (\log T)^{3(d+1)/2} \log((\log T)^{d+1}) (\log_2 \sigma T)^{3/2} \log_2(\log_2 \sigma T)\big) \ \text{(SE kernel)}. \qquad (4)$$

The proof of Theorem 5 is given in App. G. Note that, same as in Theorem 4, the regret for the SE kernel in Theorem 5 is of the order $O(\mathrm{poly}\log T)$, which is again significantly smaller than the classical regret lower bound of $\Omega(\sqrt{T})$ [31]. Moreover, Theorem 5 shows that for both the linear and SE kernels, the regret of our Q-GP-UCB is reduced when the noise variance $\sigma^2$ becomes smaller.

### 5.6 Improvement to Quantum Linear UCB (Q-LinUCB)

As mentioned in Sec. 5.4, the regret upper bound of our Q-GP-UCB with the linear kernel is $R_T = O(d^{3/2}(\log T)^{3/2} \log(d \log T))$, which is better than the corresponding $R_T = O(d^2 (\log T)^{3/2} \log(d \log T))$ for Q-LinUCB [38] (this is the high-probability regret, rather than the expected regret, from [38], which we focus on in this work) by a factor of $O(\sqrt{d})$. This improvement can be attributed to our tight confidence ellipsoid from Theorem 3. Here, following the general idea of the proof of our Theorem 3, we prove a tighter confidence ellipsoid for the Q-LinUCB algorithm, i.e., we improve Lemma 3 from the work of [38]. We follow the notations from [38], and defer the detailed notations, as well as the proof of Theorem 6 below, to App. H due to space limitations.
Theorem 6 (Improved Confidence Ellipsoid for [38]). With probability of at least $1 - \delta$,
$$\theta^* \in \mathcal{C}_s \triangleq \Big\{\theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_s\|_{V_s} \le \sqrt{2d \log\Big(1 + \Big(\sum\nolimits_{k=1}^{s} \frac{1}{\epsilon_k^2}\Big) \frac{L^2}{d\lambda}\Big)} + \sqrt{\lambda}\, S\Big\}, \quad \forall s \in [m].$$

Note that the size of the confidence ellipsoid from our Theorem 6 is of the order $O\big(\sqrt{d \log(\sum_{k=1}^{s} 1/\epsilon_k^2)}\big) = O(\sqrt{d \log T})$ (we have used Lemma 2 here), which improves over the $O(\sqrt{dm}) = O(\sqrt{d \cdot d \log T}) = O(d\sqrt{\log T})$ from Lemma 3 of [38] by a factor of $\sqrt{d}$. Plugging our tighter confidence ellipsoid (Theorem 6) into the regret analysis of Q-LinUCB, the regret upper bound for Q-LinUCB is improved to $R_T = O\big(d^{3/2} \log^{3/2} T \log(d \log T)\big)$ (more details at the end of App. H). This improved regret upper bound exactly matches that of our Q-GP-UCB with the linear kernel (Theorem 4). Similar to Theorem 3, the most crucial step in the proof of Theorem 6 is to apply the self-normalized concentration inequality from Theorem 1 of [2] to 1-sub-Gaussian noise. Again, this is feasible because $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda}$ is $\mathcal{F}_{s-1}$-measurable, and conditioned on $\mathcal{F}_{s-1}$, the scaled noise $\zeta_s/\epsilon_s$ is 1-sub-Gaussian.

### 5.7 Regret Upper Bound for the Matérn Kernel

So far, we have mainly focused on the linear and SE kernels in our analysis. Here we derive a regret upper bound for our Q-GP-UCB algorithm for the Matérn kernel with smoothness parameter $\nu$.

Theorem 7 (Matérn Kernel, Bounded Noise). With probability of at least $1 - \delta$, we have that $R_T = \tilde{O}\big(T^{3d/(2\nu + d)}\big)$.

The proof is in App. I. Note that the state-of-the-art regret upper bound for the Matérn kernel in the classical setting is $R_T = O(\sqrt{T \gamma_T}) = \tilde{O}(T^{(\nu+d)/(2\nu+d)})$ [7, 24, 30, 36], which matches the corresponding classical regret lower bound (up to logarithmic factors in $T$) [35]. Therefore, the regret upper bound of our Q-GP-UCB for the Matérn kernel (Theorem 7) improves over the corresponding classical regret lower bound when $\nu > 2d$, i.e., when the reward function is sufficiently smooth. Also note that, similar to the classical GP-UCB [33] (with regret $R_T = O(\gamma_T \sqrt{T}) = \tilde{O}(T^{(\nu + 3d/2)/(2\nu+d)})$), which requires the reward function to be sufficiently smooth (i.e., $\nu > d/2$) to attain sub-linear regret for the Matérn kernel, our Q-GP-UCB, as the first quantum BO algorithm, also requires the reward function to be smooth enough (i.e., $\nu > d$) in order to achieve a sub-linear regret upper bound for the Matérn kernel. We leave it to future works to further improve our regret upper bound and hence relax this smoothness requirement.

## 6 Experiments

We use the Qiskit Python package to implement the QMC algorithm (Lemma 1), following the recent work of [20]. Some experimental details are deferred to App. J due to space limitations.
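For reference, a minimal sketch of this style of QMC implementation via iterative quantum amplitude estimation [20] is shown below. The one-qubit oracle, the Bernoulli reward $p$, and the target accuracy are illustrative assumptions (the paper's actual setup is in App. J), and the qiskit_algorithms API used here may differ across Qiskit versions.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.primitives import Sampler
from qiskit_algorithms import EstimationProblem, IterativeAmplitudeEstimation

p = 0.3  # hypothetical Bernoulli reward f(x), as in the bounded-noise experiments

# One-qubit oracle O_x: |0> -> sqrt(1-p)|0> + sqrt(p)|1>, encoding the reward distribution.
oracle = QuantumCircuit(1)
oracle.ry(2 * np.arcsin(np.sqrt(p)), 0)

problem = EstimationProblem(state_preparation=oracle, objective_qubits=[0])
# Iterative amplitude estimation reaches accuracy eps with O(1/eps) oracle queries,
# versus the O(1/eps^2) samples needed classically (Lemma 1).
qmc = IterativeAmplitudeEstimation(epsilon_target=0.01, alpha=0.05, sampler=Sampler())
result = qmc.estimate(problem)
print(result.estimation)  # close to p = 0.3
```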
Synthetic Experiments. We use a grid of $|\mathcal{X}| = 20$ equally spaced points within $[0, 1]$ as the 1-dimensional input domain $\mathcal{X}$ ($d = 1$), and sample a function $f$ from a GP prior with the SE kernel. The sampled function $f$ is scaled so that its output is bounded within $[0, 1]$, and is then used as the reward function in the synthetic experiments. We consider two types of noise: (a) bounded noise (within $[0, 1]$) and (b) Gaussian noise, which correspond to the two cases in Lemma 1, respectively. For (a) bounded noise, we follow the practice of [38]: when an input $x$ is selected, we treat the function value $f(x)$ as the probability of a Bernoulli distribution, i.e., we observe an output of 1 with probability $f(x)$ and 0 otherwise. For (b) Gaussian noise, we simply add zero-mean Gaussian noise with variance $\sigma^2$ to $f(x)$. The results for (a) bounded noise and (b) Gaussian noise are shown in Figs. 1 (a) and (b), respectively. The figures show that for both types of noise, our Q-GP-UCB significantly outperforms the classical baseline of GP-UCB. Specifically, although our Q-GP-UCB incurs larger regrets in the initial stage, it is able to leverage the accurate observations provided by the QMC subroutine to rapidly find the global optimum. These results show that the quantum speedup of our Q-GP-UCB in terms of the tighter regret upper bounds (Theorems 4 and 5) may also be relevant in practice. We have additionally compared with linear UCB (LinUCB) [2] and Q-LinUCB [38], which are, respectively, the most representative classical and quantum linear bandit algorithms. The results in Fig. 2 (App. J) show that in these experiments, where the reward function is non-linear, algorithms based on linear bandits severely underperform compared with BO/kernelized bandit algorithms, in both the classical and quantum settings.

Figure 1: Cumulative regret (vs. iterations) for synthetic experiments with (a) bounded (Bernoulli) noise and (b) Gaussian noise; (c) cumulative regret for the AutoML experiment; (d) the AutoML experiment additionally including results for the linear bandit algorithms LinUCB and Q-LinUCB.

AutoML Experiment. In our experiment on automated machine learning (AutoML), we use our Q-GP-UCB algorithm to tune the hyperparameters of an SVM for a classification task. Here we consider Gaussian noise with variance $\sigma^2$, since it is more practical and more commonly used in real-world experiments. The results in Fig. 1 (c) show that our Q-GP-UCB again significantly outperforms the classical GP-UCB. Fig. 1 (d) additionally shows the comparisons with LinUCB and Q-LinUCB. The results again corroborate that BO/kernelized bandit algorithms are considerably superior in real-world applications with highly non-linear reward functions, in both the classical and quantum settings. These results further demonstrate the potential of our Q-GP-UCB to yield quantum speedup in practical applications.

More Realistic Experiments. We have additionally tested the performance of our Q-GP-UCB algorithm after accounting for the effect of quantum noise, by incorporating into our Qiskit simulations a noise model based on the actual performance of real IBM quantum computers. The results (Fig. 3 in App. J) show that although the addition of quantum noise slightly deteriorates the performance of our Q-GP-UCB, it is still able to significantly outperform classical GP-UCB. Notably, we have additionally performed an experiment using a real quantum computer (details in App. J), in which the performance of our Q-GP-UCB, although further worsened compared to the experiment using simulated noise, is still considerably better than classical GP-UCB (Fig. 4, App. J).
Also note that the work of [38] did not show that Q-LinUCB outperforms LinUCB in the presence of simulated quantum noise. Therefore, the consistently superior performance of our Q-GP-UCB over GP-UCB, under both simulated quantum noise and a real quantum computer, serves as important new support for the potential practical advantages of quantum bandit algorithms.

Discussion. In our experimental results (e.g., Fig. 1), our Q-GP-UCB usually incurs relatively larger regrets in the initial stages but quickly achieves near-zero regret thereafter (i.e., the curve plateaus), which has also been observed in [38] for the Q-LinUCB algorithm. This is because in the initial stages, our Q-GP-UCB explores a smaller number of unique arms than GP-UCB. However, after the initial exploration, our Q-GP-UCB quickly starts to perform reliable exploitation, because the accurate reward observations achieved thanks to our QMC subroutine allow us to rapidly learn the reward function and hence find the optimal arm.

## 7 Conclusion

We have introduced the first quantum BO algorithm, named Q-GP-UCB. Our Q-GP-UCB achieves a regret upper bound of $O(\mathrm{poly}\log T)$, which is significantly smaller than the classical regret lower bound of $\Omega(\sqrt{T})$. A limitation of our work is that the regret upper bound of our Q-GP-UCB, $O((\log T)^{3(d+1)/2})$ (ignoring polylog factors), has a worse dependency on the input dimension $d$ than the regret of the classical GP-UCB, $O((\log T)^{d+1} \sqrt{T})$. A similar limitation is shared by Q-LinUCB, since its regret has a dependency of $O(d^{3/2})$ in contrast to the $O(d)$ of LinUCB. It is an interesting future direction to explore potential approaches to remove this extra dependency on $d$. Another limitation of our work is that for the Matérn kernel, we require the reward function to be smooth enough in order to achieve a sub-linear regret upper bound (Sec. 5.7). So, another important future work is to further tighten the regret upper bound of our Q-GP-UCB for the Matérn kernel when the reward function is less smooth, as well as for other kernels encompassing non-smooth functions, such as the neural tangent kernel, which has been adopted by recent works on neural bandits [13, 42, 44]. Moreover, another interesting future direction is to derive regret lower bounds in our setting of quantum kernelized bandits, which can help evaluate the tightness of our regret upper bounds.

## Acknowledgements and Disclosure of Funding

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-PhD/2023-01-039J). DesCartes: this research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.

## References

[1] A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner. The power of quantum neural networks. Nature Computational Science, 1(6):403–409, 2021.
[2] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Proc. NeurIPS, volume 24, 2011.
[3] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd. Quantum machine learning. Nature, 549(7671):195–202, 2017.
[4] G. Brassard, P. Hoyer, M. Mosca, and A. Tapp. Quantum amplitude amplification and estimation. Contemporary Mathematics, 305:53–74, 2002.
[5] S. Cakmak, R. Astudillo Marban, P. Frazier, and E. Zhou. Bayesian optimization of risk measures. In Proc. NeurIPS, volume 33, pages 20130–20141, 2020.
[6] D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco. Scaling Gaussian process optimization by evaluating a few unique candidates multiple times. In Proc. ICML, pages 2523–2541. PMLR, 2022.
[7] R. Camilleri, K. Jamieson, and J. Katz-Samuels. High-dimensional experimental design and kernel bandits. In Proc. ICML, pages 1227–1237. PMLR, 2021.
[8] B. Casalé, G. Di Molfetta, H. Kadri, and L. Ralaivola. Quantum bandits. Quantum Machine Intelligence, 2:1–7, 2020.
[9] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proc. ICML, pages 844–853, 2017.
[10] Z. Dai, B. K. H. Low, and P. Jaillet. Federated Bayesian optimization via Thompson sampling. In Proc. NeurIPS, 2020.
[11] Z. Dai, B. K. H. Low, and P. Jaillet. Differentially private federated Bayesian optimization with distributed exploration. In Proc. NeurIPS, volume 34, 2021.
[12] Z. Dai, Y. Shu, B. K. H. Low, and P. Jaillet. Sample-then-optimize batch neural Thompson sampling. In Proc. NeurIPS, 2022.
[13] Z. Dai, Y. Shu, A. Verma, F. X. Fan, B. K. H. Low, and P. Jaillet. Federated neural bandits. In Proc. ICLR, 2023.
[14] Z. Dai, H. Yu, B. K. H. Low, and P. Jaillet. Bayesian optimization meets Bayesian optimal stopping. In Proc. ICML, pages 1496–1506, 2019.
[15] Y. Deng, X. Zhou, B. Kim, A. Tewari, A. Gupta, and N. Shroff. Weighted Gaussian process bandits for non-stationary environments. In Proc. AISTATS, pages 6909–6932. PMLR, 2022.
[16] P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
[17] B. Ganguly, Y. Wu, D. Wang, and V. Aggarwal. Quantum computing provides exponential regret improvement in episodic reinforcement learning. arXiv preprint arXiv:2302.08617, 2023.
[18] S. Garg and G. Ramakrishnan. Advances in quantum deep learning: An overview. arXiv preprint arXiv:2005.04316, 2020.
[19] R. Garnett. Bayesian Optimization. Cambridge Univ. Press, 2022.
[20] D. Grinko, J. Gacon, C. Zoufal, and S. Woerner. Iterative quantum amplitude estimation. npj Quantum Information, 7(1):52, 2021.
[21] L. K. Grover. A framework for fast quantum mechanical algorithms. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 53–62, 1998.
[22] F. Hutter, L. Kotthoff, and J. Vanschoren. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
[23] T. Li and R. Zhang. Quantum speedups of optimizing approximately convex functions with applications to logarithmic regret stochastic convex bandits. In Proc. NeurIPS, 2022.
[24] Z. Li and J. Scarlett. Gaussian process bandit optimization with few batches. In Proc. AISTATS, pages 92–107. PMLR, 2022.
[25] J. Lumbreras, E. Haapasalo, and M. Tomamichel. Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states. Quantum, 6:749, 2022.
[26] A. Messiah. Quantum Mechanics. Courier Corporation, 2014.
[27] N. Meyer, C. Ufrecht, M. Periyasamy, D. D. Scherer, A. Plinge, and C. Mutschler. A survey on quantum reinforcement learning. arXiv preprint arXiv:2211.03464, 2022.
[28] A. Montanaro. Quantum speedup of Monte Carlo methods. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2181):20150301, 2015.
[29] P. Rebentrost, M. Mohseni, and S. Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113(13):130503, 2014.
[30] S. Salgia, S. Vakili, and Q. Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. In Proc. NeurIPS, volume 34, pages 28836–28847, 2021.
[31] J. Scarlett, I. Bogunovic, and V. Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Proc. COLT, pages 1723–1742. PMLR, 2017.
[32] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
[33] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pages 1015–1022, 2010.
[34] A. Steane. Quantum computing. Reports on Progress in Physics, 61(2):117, 1998.
[35] S. Vakili, K. Khezeli, and V. Picheny. On information gain and regret bounds in Gaussian process bandits. In Proc. AISTATS, pages 82–90. PMLR, 2021.
[36] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Proc. UAI, 2013.
[37] A. Verma, Z. Dai, and B. K. H. Low. Bayesian optimization under stochastic delayed feedback. In Proc. ICML, pages 22145–22167. PMLR, 2022.
[38] Z. Wan, Z. Zhang, T. Li, J. Zhang, and X. Sun. Quantum multi-armed bandits and stochastic linear bandits enjoy logarithmic regrets. In Proc. AAAI.
[39] D. Wang, A. Sundaram, R. Kothari, A. Kapoor, and M. Roetteler. Quantum algorithms for reinforcement learning with a generative model. In Proc. ICML, pages 10916–10926. PMLR, 2021.
[40] D. Wang, X. You, T. Li, and A. M. Childs. Quantum exploration algorithms for multi-armed bandits. In Proc. AAAI, volume 35, pages 10102–10110, 2021.
[41] Y. Wu, C. Guan, V. Aggarwal, and D. Wang. Quantum heavy-tailed bandits. arXiv preprint arXiv:2301.09680, 2023.
[42] W. Zhang, D. Zhou, L. Li, and Q. Gu. Neural Thompson sampling. In Proc. ICLR, 2021.
[43] H. Zhong, J. Hu, Y. Xue, T. Li, and L. Wang. Provably efficient exploration in quantum reinforcement learning with logarithmic worst-case regret. arXiv preprint arXiv:2302.10796, 2023.
[44] D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with UCB-based exploration. In Proc. ICML, pages 11492–11502. PMLR, 2020.

## A Analysis of the Weighted GP Posterior (3)

Here we show an alternative view of the weighted GP posterior distribution (3) in the RKHS feature space. The notations in this section will be used extensively in the theoretical analyses in the subsequent sections.

We denote the kernel $k$ by $k(x, x') = \phi(x)^\top \phi(x')$, in which $\phi(x)$ is the potentially infinite-dimensional feature mapping of $x$ in the RKHS of $k$. Define $\Phi_s \triangleq [\phi(x_1), \ldots, \phi(x_s)]^\top$, which is an $s \times \infty$ matrix. Recall that we have defined $W_s \triangleq \mathrm{diag}(1/\epsilon_1^2, \ldots, 1/\epsilon_s^2)$, which is an $s \times s$ diagonal matrix. Define $V_s \triangleq \lambda I + \sum_{\tau=1}^{s} \frac{1}{\epsilon_\tau^2} \phi(x_\tau) \phi(x_\tau)^\top = \lambda I + \Phi_s^\top W_s \Phi_s$, which is an $\infty \times \infty$ matrix, and let $V_0 \triangleq \lambda I$. With these notations, it can be easily verified that $K_s = \Phi_s \Phi_s^\top = [k(x_\tau, x_{\tau'})]_{\tau, \tau' \in [s]}$ and $\tilde{K}_s = W_s^{1/2} \Phi_s \Phi_s^\top W_s^{1/2} = W_s^{1/2} K_s W_s^{1/2}$. Moreover, we can also easily show that $k_s(x) = \Phi_s \phi(x) = [k(x, x_\tau)]_{\tau \in [s]}^\top$, $\tilde{k}_s(x) = W_s^{1/2} \Phi_s \phi(x) = W_s^{1/2} k_s(x)$, and $\tilde{Y}_s = W_s^{1/2} Y_s$.

We start by deriving an alternative expression for the weighted GP posterior mean $\tilde{\mu}_s(x)$. To begin with, we have that
$$\Phi_s^\top W_s^{1/2} (\tilde{K}_s + \lambda I) = \Phi_s^\top W_s^{1/2} (W_s^{1/2} \Phi_s \Phi_s^\top W_s^{1/2} + \lambda I) = \Phi_s^\top W_s \Phi_s \Phi_s^\top W_s^{1/2} + \lambda \Phi_s^\top W_s^{1/2} = (\Phi_s^\top W_s \Phi_s + \lambda I) \Phi_s^\top W_s^{1/2} = V_s \Phi_s^\top W_s^{1/2}.$$
Now we multiply both sides by $V_s^{-1}$ on the left and by $(\tilde{K}_s + \lambda I)^{-1}$ on the right, which leads to
$$V_s^{-1} \Phi_s^\top W_s^{1/2} = \Phi_s^\top W_s^{1/2} (\tilde{K}_s + \lambda I)^{-1}.$$
Now we can show that
$$\tilde{\mu}_s(x) = \phi(x)^\top V_s^{-1} \Phi_s^\top W_s Y_s = \phi(x)^\top \Phi_s^\top W_s^{1/2} (\tilde{K}_s + \lambda I)^{-1} W_s^{1/2} Y_s = \tilde{k}_s(x)^\top (\tilde{K}_s + \lambda I)^{-1} \tilde{Y}_s.$$
Next, we derive an alternative expression for the weighted GP posterior variance $\tilde{\sigma}_s^2(x)$. To begin with, we have that
$$\Big(W_s^{-1} + \frac{1}{\lambda} \Phi_s \Phi_s^\top\Big)^{-1} = \lambda W_s^{1/2} \big(\lambda I + \tilde{K}_s\big)^{-1} W_s^{1/2}, \qquad (5)$$
which follows since $W_s^{-1} + \frac{1}{\lambda} \Phi_s \Phi_s^\top = \frac{1}{\lambda} W_s^{-1/2} (\lambda I + \tilde{K}_s) W_s^{-1/2}$. This allows us to show that
$$\begin{aligned} \tilde{\sigma}_s^2(x) &= \lambda \phi(x)^\top V_s^{-1} \phi(x) = \lambda \phi(x)^\top (\lambda I + \Phi_s^\top W_s \Phi_s)^{-1} \phi(x) \\ &\overset{(a)}{=} \lambda \phi(x)^\top \frac{1}{\lambda}\Big(I - \Phi_s^\top \Big(W_s^{-1} + \Phi_s \frac{1}{\lambda} \Phi_s^\top\Big)^{-1} \Phi_s \frac{1}{\lambda}\Big) \phi(x) \\ &= \phi(x)^\top \phi(x) - \phi(x)^\top \Phi_s^\top \Big(W_s^{-1} + \frac{1}{\lambda} \Phi_s \Phi_s^\top\Big)^{-1} \Phi_s \frac{1}{\lambda} \phi(x) \\ &\overset{(b)}{=} \phi(x)^\top \phi(x) - \phi(x)^\top \Phi_s^\top W_s^{1/2} \big(\lambda I + \tilde{K}_s\big)^{-1} W_s^{1/2} \Phi_s \phi(x) \\ &= k(x, x) - \tilde{k}_s(x)^\top (\tilde{K}_s + \lambda I)^{-1} \tilde{k}_s(x), \end{aligned}$$
in which we have used the matrix inversion lemma in step (a), and equation (5) in step (b). To summarize, we have derived the following alternative expressions for the weighted GP posterior distribution (3):
$$\tilde{\mu}_s(x) = \phi(x)^\top V_s^{-1} \Phi_s^\top W_s Y_s = \tilde{k}_s(x)^\top (\tilde{K}_s + \lambda I)^{-1} \tilde{Y}_s, \qquad \tilde{\sigma}_s^2(x) = \lambda \phi(x)^\top V_s^{-1} \phi(x) = k(x, x) - \tilde{k}_s(x)^\top (\tilde{K}_s + \lambda I)^{-1} \tilde{k}_s(x). \qquad (6)$$
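As a quick sanity check of (6), the snippet below compares the feature-space and kernel expressions for a finite-dimensional feature map (a random 3-dimensional $\phi$, an assumption made so that $V_s$ is a finite matrix); both pairs of expressions should agree to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)
s, D, lam = 5, 3, 1.1
Phi = rng.normal(size=(s, D))          # rows are phi(x_tau); k(x, x') = phi(x)^T phi(x')
y = rng.normal(size=s)
eps = rng.uniform(0.2, 1.0, size=s)
W = np.diag(1.0 / eps ** 2)
phi_x = rng.normal(size=D)             # feature vector of a test input x

# Feature-space view: V_s = lam I + Phi^T W Phi.
V = lam * np.eye(D) + Phi.T @ W @ Phi
mu_feat = phi_x @ np.linalg.solve(V, Phi.T @ W @ y)
var_feat = lam * phi_x @ np.linalg.solve(V, phi_x)

# Kernel view of (6), via K_tilde = W^{1/2} K W^{1/2}.
Wh = np.diag(1.0 / eps)
K_tilde = Wh @ (Phi @ Phi.T) @ Wh
k_tilde = Wh @ (Phi @ phi_x)
A = np.linalg.inv(K_tilde + lam * np.eye(s))
mu_ker = k_tilde @ A @ (Wh @ y)
var_ker = phi_x @ phi_x - k_tilde @ A @ k_tilde

assert np.allclose(mu_feat, mu_ker) and np.allclose(var_feat, var_ker)
```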
## B Proof of Lemma 2

To begin with, we can lower-bound the total number of iterations (i.e., the total number of queries to the quantum oracles) in the $m$ stages by
$$\sum\nolimits_{s=1}^{m} N_{\epsilon_s} = \sum\nolimits_{s=1}^{m} \frac{C_1}{\epsilon_s} \log\frac{2\overline{m}}{\delta} \ge \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s} \ge \sqrt{\sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2}}. \qquad (7)$$
The first inequality follows since $C_1 > 1$ (Lemma 1) and $\log(2\overline{m}/\delta) \ge 1$ because we have chosen $\delta \in (0, 2/e]$. Suppose that $\sum_{s=1}^{m} 1/\epsilon_s^2 > T^2$; then we have that $\sum_{s=1}^{m} \frac{C_1}{\epsilon_s} \log\frac{2\overline{m}}{\delta} > T$, which is a contradiction (the algorithm terminates before the total number of queries exceeds $T$). Therefore, we have that
$$\sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2} \le T^2. \qquad (8)$$

## C Proof of Theorem 1

Here we prove the growth rate of $\tilde{\gamma}_m$, i.e., the weighted information gain. Recall that $W_m = \mathrm{diag}(1/\epsilon_1^2, \ldots, 1/\epsilon_m^2)$ and $\tilde{\gamma}_m = \frac{1}{2} \log\det(I + \lambda^{-1} \tilde{K}_m)$. Our proof here follows closely the proof of Theorem 3 from the work of [35]. We will omit the subscript $m$ when the context is clear. Denote by $K_m = \Phi_m \Phi_m^\top = [k(x_i, x_j)]_{i,j \in [m]}$ the $m \times m$ covariance matrix, and define $\tilde{K}_m = W_m^{1/2} \Phi_m \Phi_m^\top W_m^{1/2} = W_m^{1/2} K_m W_m^{1/2}$. Following Theorem 1 of [35], our kernel $k$ can be written as $k(x, x') = \sum_{\tau=1}^{\infty} \lambda_\tau \psi_\tau(x) \psi_\tau(x')$, in which $\{\lambda_\tau\}_{\tau \ge 1}$ and $\{\psi_\tau\}_{\tau \ge 1}$ represent, respectively, the eigenvalues and eigenfunctions of the kernel $k$. We will make use of the projection onto a $D$-dimensional RKHS ($D < \infty$), which consists of the first $D$ features corresponding to the $D$ largest eigenvalues of the kernel $k$. Specifically, define $k_P$ as the $D$-dimensional projection in the RKHS of $k$, and let $k_O$ correspond to the orthogonal element such that $k(\cdot, \cdot) = k_P(\cdot, \cdot) + k_O(\cdot, \cdot)$. That is, following Section 3.2 of [35], define
$$k_P(x, x') = \sum\nolimits_{\tau=1}^{D} \lambda_\tau \psi_\tau(x) \psi_\tau(x'). \qquad (9)$$
We define $K_P \triangleq [k_P(x_i, x_j)]_{i,j \in [m]}$ and $K_O \triangleq [k_O(x_i, x_j)]_{i,j \in [m]}$. Similarly, define $\tilde{K}_P = W_m^{1/2} K_P W_m^{1/2}$ and $\tilde{K}_O = W_m^{1/2} K_O W_m^{1/2}$. This implies that $K_m = K_P + K_O$ and that $\tilde{K}_m = \tilde{K}_P + \tilde{K}_O$. Here, to be consistent with [35], we use $I_m$ to denote the $m \times m$ identity matrix.

Denote by $\tilde{I}(y_m; f)$ the information gain about the function $f$ from the observations $y_m = [y_1, \ldots, y_m]$. We have that
$$\tilde{I}(y_m; f) = \frac{1}{2} \log\det\Big(I_m + \frac{1}{\lambda} \tilde{K}_m\Big) = \frac{1}{2} \log\det\Big(I_m + \frac{1}{\lambda}\big(\tilde{K}_P + \tilde{K}_O\big)\Big) = \frac{1}{2} \log\det\Big(I_m + \frac{1}{\lambda} \tilde{K}_P\Big) + \frac{1}{2} \log\det\Big(I_m + \Big(I_m + \frac{1}{\lambda} \tilde{K}_P\Big)^{-1} \frac{1}{\lambda} \tilde{K}_O\Big). \qquad (10)$$
The last equality follows because $\det(AB) = \det(A)\det(B)$. In the following, we separately upper-bound the two terms in (10).

We start by upper-bounding the first term in (10). Denote by $\psi_D(x)$ the $D$-dimensional feature vector corresponding to the first $D$ principal features of the kernel $k$ (i.e., the features corresponding to the $D$ largest eigenvalues). Define $\Psi_{m,D} \triangleq [\psi_D(x_1), \ldots, \psi_D(x_m)]^\top$, which is an $m \times D$ matrix. Denote by $\Lambda_D = \mathrm{diag}(\lambda_1, \ldots, \lambda_D)$ the $D \times D$ diagonal matrix whose diagonal entries consist of the $D$ largest eigenvalues of the kernel $k$ in descending order. With these definitions, we have that
$$K_P = \Psi_{m,D} \Lambda_D \Psi_{m,D}^\top \quad \text{and} \quad \tilde{K}_P = W_m^{1/2} \Psi_{m,D} \Lambda_D \Psi_{m,D}^\top W_m^{1/2}. \qquad (11)$$
Also define the Gram matrix
$$\tilde{G}_m = \Lambda_D^{1/2} \Psi_{m,D}^\top W_m \Psi_{m,D} \Lambda_D^{1/2}. \qquad (12)$$
Based on these definitions, we have that
$$\det\Big(I_m + \frac{1}{\lambda} \tilde{K}_P\Big) = \det\Big(I_D + \frac{1}{\lambda} \tilde{G}_m\Big), \qquad (13)$$
which follows from the Weinstein–Aronszajn identity $\det(I_D + A A^\top) = \det(I_m + A^\top A)$ for a $D \times m$ matrix $A$, into which we have plugged $A = \frac{1}{\sqrt{\lambda}} \Lambda_D^{1/2} \Psi_{m,D}^\top W_m^{1/2}$. In addition, the following inequality will be useful in the subsequent proof:
$$\big\|\Lambda_D^{1/2} \psi_D(x)\big\|^2 = \sum\nolimits_{\tau=1}^{D} \lambda_\tau \psi_\tau^2(x) = k_P(x, x) \le k(x, x) \le 1, \qquad (14)$$
in which we have made use of the definition of $k_P$ from (9) and our assumption (w.l.o.g.) that $k(x, x') \le 1, \forall x, x'$ (Sec. 3.1). In the following, we denote the trace of a matrix $A$ by $\mathrm{Tr}(A)$. Next, we have that
$$\mathrm{Tr}\Big(I_D + \frac{1}{\lambda} \tilde{G}_m\Big) = D + \frac{1}{\lambda}\mathrm{Tr}\big(\Lambda_D^{1/2} \Psi_{m,D}^\top W_m \Psi_{m,D} \Lambda_D^{1/2}\big) \overset{(a)}{=} D + \frac{1}{\lambda} \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2} \mathrm{Tr}\big(\Lambda_D^{1/2} \psi_D(x_s) \psi_D^\top(x_s) \Lambda_D^{1/2}\big) \overset{(b)}{=} D + \frac{1}{\lambda} \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2} \big\|\Lambda_D^{1/2} \psi_D(x_s)\big\|^2 \overset{(c)}{\le} D + \frac{1}{\lambda} \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2}, \qquad (15)$$
in which (a) follows because $\Psi_{m,D}^\top W_m \Psi_{m,D} = \sum_{s=1}^{m} \frac{1}{\epsilon_s^2} \psi_D(x_s) \psi_D^\top(x_s)$, (b) makes use of the cyclic property of the trace operator, and (c) follows from (14). Also note that for a positive definite matrix $P \in \mathbb{R}^{n \times n}$, we have that
$$\log\det(P) \le n \log\frac{\mathrm{Tr}(P)}{n}. \qquad (16)$$
Therefore, we have that
$$\log\det\Big(I_m + \frac{1}{\lambda} \tilde{K}_P\Big) = \log\det\Big(I_D + \frac{1}{\lambda} \tilde{G}_m\Big) \le D \log\frac{D + \frac{1}{\lambda} \sum_{s=1}^{m} 1/\epsilon_s^2}{D} = D \log\Big(1 + \frac{\sum_{s=1}^{m} 1/\epsilon_s^2}{D\lambda}\Big). \qquad (17)$$
The first equality follows from (13), and the inequality results from (16) and (15).

Now we upper-bound the second term in (10). To begin with, we have that
$$\mathrm{Tr}\Big(\Big(I_m + \frac{1}{\lambda} \tilde{K}_P\Big)^{-1} \frac{1}{\lambda} \tilde{K}_O\Big) \le \frac{1}{\lambda} \mathrm{Tr}\big(\tilde{K}_O\big), \qquad (18)$$
which holds because the matrix $(I_m + \frac{1}{\lambda} \tilde{K}_P)^{-1}$ is positive semi-definite (PSD) with largest eigenvalue upper-bounded by 1, and $\mathrm{Tr}(P_1 P_2) \le \lambda_{P_1} \mathrm{Tr}(P_2)$, where $P_1$ and $P_2$ are PSD matrices and $\lambda_{P_1}$ is the largest eigenvalue of $P_1$. Next, define $\delta_D \triangleq \sum_{\tau=D+1}^{\infty} \lambda_\tau \bar{\psi}^2$, where $|\psi_\tau(x)| \le \bar{\psi}, \forall x, \tau$. Then we immediately have that $k_O(x, x') = \sum_{\tau=D+1}^{\infty} \lambda_\tau \psi_\tau(x) \psi_\tau(x') \le \delta_D, \forall x, x'$. Therefore, we have that
$$\mathrm{Tr}\big(\tilde{K}_O\big) = \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2} k_O(x_s, x_s) \le \delta_D \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2}. \qquad (19)$$
As a result, we have that
$$\mathrm{Tr}\Big(\Big(I_m + \frac{1}{\lambda} \tilde{K}_P\Big)^{-1} \frac{1}{\lambda} \tilde{K}_O\Big) \le \frac{\delta_D}{\lambda} \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2}, \qquad (20)$$
in which the inequality follows from (18) and (19). Next, again making use of (16) and incorporating the upper bound on the trace from (20), we have that
$$\log\det\Big(I_m + \Big(I_m + \frac{1}{\lambda} \tilde{K}_P\Big)^{-1} \frac{1}{\lambda} \tilde{K}_O\Big) \le m \log\Big(1 + \frac{\delta_D \sum_{s=1}^{m} 1/\epsilon_s^2}{m\lambda}\Big) \le \frac{\delta_D}{\lambda} \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2}, \qquad (21)$$
in which the last step is because $\log(1 + z) \le z$ for $z > -1$. Finally, combining the upper bounds from (17) and (21), (10) can be upper-bounded by
$$\tilde{I}(y_m; f) \le \frac{1}{2} D \log\Big(1 + \frac{\sum_{s=1}^{m} 1/\epsilon_s^2}{D\lambda}\Big) + \frac{1}{2}\frac{\delta_D}{\lambda} \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2}. \qquad (22)$$
Therefore, we have that
$$\tilde{\gamma}_m = \tilde{I}(y_m; f) = \frac{1}{2} \log\det\Big(I_m + \frac{1}{\lambda} \tilde{K}_m\Big) \le \frac{1}{2} D \log\Big(1 + \frac{\sum_{s=1}^{m} 1/\epsilon_s^2}{D\lambda}\Big) + \frac{1}{2}\frac{\delta_D}{\lambda} \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2}. \qquad (23)$$
Also recall that for the standard unweighted maximum information gain $\gamma_T$, we have that [35]
$$\gamma_T = \max_{\{x_1, \ldots, x_T\}} \frac{1}{2} \log\det\Big(I_T + \frac{1}{\lambda} K_T\Big) \le \frac{1}{2} D \log\Big(1 + \frac{T}{D\lambda}\Big) + \frac{1}{2}\frac{\delta_D}{\lambda} T. \qquad (24)$$
Therefore, the upper bound on the weighted information gain $\tilde{\gamma}_m$ is obtained by replacing the $T$ in the upper bound on the standard maximum information gain $\gamma_T$ by $\sum_{s=1}^{m} 1/\epsilon_s^2$. This is formalized by the following theorem.

Theorem 8 (More formal statement of Theorem 1 (a)). Given $X_m \triangleq \{x_1, \ldots, x_m\}$, we have that
$$\tilde{\gamma}_m \le \frac{1}{2} D \log\Big(1 + \frac{\sum_{s=1}^{m} 1/\epsilon_s^2}{D\lambda}\Big) + \frac{1}{2}\frac{\delta_D}{\lambda} \sum\nolimits_{s=1}^{m} \frac{1}{\epsilon_s^2},$$
i.e., $\tilde{\gamma}_m$ satisfies the same upper bound as $\gamma_{\sum_{s=1}^{m} 1/\epsilon_s^2}$.

Lastly, note that it has been shown that $\gamma_T = O(d \log T)$ for the linear kernel and $\gamma_T = O((\log T)^{d+1})$ for the SE kernel [35]. Therefore, plugging in Lemma 2, which allows us to upper-bound $\sum_{s=1}^{m} 1/\epsilon_s^2$ by $T^2$, leads to Corollary 1.
D Proof of Theorem 2

Our proof here follows the outline of the proof of Lemma 2 from the work of [38]. To begin with, we show that $\det(V_{\tau+1}) = 2\det(V_\tau), \forall \tau = 0, \ldots, m-1$, i.e., the amount of collected information is doubled after every stage:
$$\begin{aligned}
\det(V_{\tau+1}) &= \det\Big(\lambda I + \sum_{j=1}^{\tau+1}\frac{1}{\epsilon_j^2}\phi(x_j)\phi(x_j)^\top\Big) = \det\Big(V_\tau + \frac{1}{\epsilon_{\tau+1}^2}\phi(x_{\tau+1})\phi(x_{\tau+1})^\top\Big)\\
&= \det(V_\tau)\det\Big(I + \frac{1}{\epsilon_{\tau+1}^2}V_\tau^{-1/2}\phi(x_{\tau+1})\phi(x_{\tau+1})^\top V_\tau^{-1/2}\Big)\\
&\overset{(a)}{=} \det(V_\tau)\Big(1 + \frac{1}{\epsilon_{\tau+1}^2}\big\|\phi(x_{\tau+1})\big\|^2_{V_\tau^{-1}}\Big) \overset{(b)}{=} \det(V_\tau)\Big(1 + \frac{\tilde{\sigma}^2_\tau(x_{\tau+1})/\lambda}{\tilde{\sigma}^2_\tau(x_{\tau+1})/\lambda}\Big) = 2\det(V_\tau).
\end{aligned}$$ (25)
Step (a) has made use of the matrix determinant lemma $\det(I + ab^\top) = 1 + b^\top a$, and (b) follows from the expression of the weighted GP posterior variance (6) and our choice of $\epsilon_{\tau+1} = \tilde{\sigma}_\tau(x_{\tau+1})/\sqrt{\lambda}$ (line 3 of Algo. 1). An immediate consequence of (25) is that $\det(V_m) = 2^m\det(V_0)$. Recall we have defined earlier that $V_0 = \lambda I$ (App. A), and that, following the notations of App. A, $V_m = \lambda I + \Phi_m^\top W_m\Phi_m$. Next, we can also show that
$$\log\frac{\det(V_m)}{\det(V_0)} = \log\frac{\det(\lambda I + \Phi_m^\top W_m\Phi_m)}{\det(\lambda I)} = \log\det\Big(I + \frac{1}{\lambda}\big(\Phi_m^\top W_m^{1/2}\big)\big(W_m^{1/2}\Phi_m\big)\Big) \overset{(a)}{=} \log\det\Big(I + \frac{1}{\lambda}W_m^{1/2}\Phi_m\Phi_m^\top W_m^{1/2}\Big) = \log\det\Big(I + \frac{1}{\lambda}\tilde{K}_m\Big) = 2\tilde{\gamma}_m,$$ (26)
in which step (a) has again made use of the Weinstein–Aronszajn identity (similar to (13)), and the last two equalities have simply plugged in the expressions for $\tilde{K}_s = W_s^{1/2}\Phi_s\Phi_s^\top W_s^{1/2}$ (App. A) and $\tilde{\gamma}_m = \frac{1}{2}\log\det(I + \lambda^{-1}\tilde{K}_m)$ (Sec. 5.1).

According to Corollary 1, for the linear kernel, there exists an absolute constant $C > 0$ such that $\tilde{\gamma}_m \le Cd\log\big(\sum_{s=1}^m\frac{1}{\epsilon_s^2}\big)$. Combining equations (25) and (26), we can show that
$$m\log 2 = \log\frac{\det(V_m)}{\det(V_0)} = 2\tilde{\gamma}_m \le 2Cd\log\Big(\sum_{s=1}^m\frac{1}{\epsilon_s^2}\Big).$$ (27)
This allows us to show that
$$\sum_{s=1}^m\frac{1}{\epsilon_s^2} \ge \big(2^m\big)^{1/(2Cd)}.$$ (28)
Note that in every stage $s$, we query the quantum oracle $\frac{C_1}{\epsilon_s}\log\frac{2\bar{m}}{\delta}$ times, in which $C_1 > 1$, $\delta \in (0, 2/e]$, and $\bar{m}$ is an upper bound on the total number of stages. An immediate consequence is that $\log\frac{2\bar{m}}{\delta} \ge 1$. Therefore, the total number of iterations can be analyzed as
$$\sum_{s=1}^m\frac{C_1}{\epsilon_s}\log\frac{2\bar{m}}{\delta} \ge \sum_{s=1}^m\frac{1}{\epsilon_s} \ge \sqrt{\sum_{s=1}^m\frac{1}{\epsilon_s^2}} \ge \sqrt{\big(2^m\big)^{1/(2Cd)}}.$$ (29)
Now we derive an upper bound on $m$ by contradiction. Suppose $m > \frac{4Cd}{\log 2}\log T$; this immediately implies that
$$\sqrt{\big(2^m\big)^{1/(2Cd)}} > T.$$ (30)
This equation, combined with (29), implies that $\sum_{s=1}^m\frac{C_1}{\epsilon_s}\log\frac{2\bar{m}}{\delta} > T$, which is a contradiction. Therefore, we have that
$$m \le \frac{4Cd}{\log 2}\log T = O(d\log T).$$ (31)
That is, for the linear kernel, our algorithm runs for at most $O(d\log T)$ stages, which matches the upper bound on the total number of stages for the quantum linear UCB (Q-LinUCB) algorithm, as proved in Lemma 2 from the work of [38].

Next, for the SE kernel, Corollary 1 shows that there exists an absolute constant $C > 0$ such that $\tilde{\gamma}_m \le C\log^{d+1}\big(\sum_{s=1}^m\frac{1}{\epsilon_s^2}\big)$. Again combining equations (25) and (26), we can show that
$$m\log 2 = \log\frac{\det(V_m)}{\det(V_0)} = 2\tilde{\gamma}_m \le 2C\log^{d+1}\Big(\sum_{s=1}^m\frac{1}{\epsilon_s^2}\Big).$$ (32)
This allows us to show that
$$\sum_{s=1}^m\frac{1}{\epsilon_s^2} \ge \exp\Big(\Big(\frac{m\log 2}{2C}\Big)^{1/(d+1)}\Big).$$ (33)
Now again we derive an upper bound on $m$ by contradiction. Suppose $m > \frac{2C}{\log 2}(2\log T)^{d+1}$; then this implies that
$$\sqrt{\exp\Big(\Big(\frac{m\log 2}{2C}\Big)^{1/(d+1)}\Big)} > T.$$ (34)
This, combined with (29), implies that $\sum_{s=1}^m\frac{C_1}{\epsilon_s}\log\frac{2\bar{m}}{\delta} > T$, which is a contradiction. Therefore,
$$m \le \frac{2C}{\log 2}(2\log T)^{d+1} = O\big((\log T)^{d+1}\big)$$ (35)
for the SE kernel. This completes the proof.
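The determinant-doubling argument in (25) is easy to check numerically: with $\epsilon_{\tau+1} = \tilde{\sigma}_\tau(x_{\tau+1})/\sqrt{\lambda}$, each rank-one update multiplies $\det(V_\tau)$ by exactly 2. The finite-dimensional features below are an illustrative stand-in for $\phi(\cdot)$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 4, 1.1
V = lam * np.eye(d)                            # V_0 = lambda I
for tau in range(5):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)                 # keep k(x, x) = ||phi(x)||^2 <= 1
    var = lam * phi @ np.linalg.solve(V, phi)  # sigma_tau^2(x) = lam phi^T V^{-1} phi
    eps = np.sqrt(var / lam)                   # line 3 of Algo. 1
    V_next = V + np.outer(phi, phi) / eps ** 2
    assert np.isclose(np.linalg.det(V_next) / np.linalg.det(V), 2.0)
    V = V_next
print("det(V) doubles at every stage")
```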
E Proof of Theorem 3

Our proof here follows closely the proof of Theorem 2 from [9]. Denote by $\zeta_s$ the observation noise for the observation in stage $s$: $y_s = f(x_s) + \zeta_s$. Define $\zeta_{1:s} \triangleq [\zeta_1, \ldots, \zeta_s]^\top$, which is an $s \times 1$ vector. Denote by $\mathcal{F}_s$ the $\sigma$-algebra $\mathcal{F}_s = \sigma(x_1, \ldots, x_{s+1}, \zeta_1, \ldots, \zeta_s)$. With this definition, we have that $x_s$ is $\mathcal{F}_{s-1}$-measurable and $\zeta_s$ is $\mathcal{F}_s$-measurable. Note that we have used $\phi(\cdot)$ to denote the RKHS feature map of the kernel $k$: $k(x, x') = \phi(x)^\top\phi(x')$, and here we let $\phi(x) = k(x, \cdot)$. Then, using the reproducing property, we have that $f(x) = \langle\phi(x), f\rangle_k = \phi(x)^\top f$, in which the inner product is the one induced by the RKHS of $k$.

To begin with, we have that
$$\Phi_s^\top W_s^{1/2}\big(W_s^{1/2}\Phi_s\Phi_s^\top W_s^{1/2} + \lambda I\big)^{-1} = \big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\Phi_s^\top W_s^{1/2},$$ (36)
which follows from noting that $(\Phi_s^\top W_s\Phi_s + \lambda I)\Phi_s^\top W_s^{1/2} = \Phi_s^\top W_s^{1/2}(W_s^{1/2}\Phi_s\Phi_s^\top W_s^{1/2} + \lambda I)$, and then multiplying by $(\Phi_s^\top W_s\Phi_s + \lambda I)^{-1}$ on the left and by $(W_s^{1/2}\Phi_s\Phi_s^\top W_s^{1/2} + \lambda I)^{-1}$ on the right. Now let $f_s \triangleq [f(x_k)]_{k=1,\ldots,s}^\top = \Phi_sf$, which is an $s \times 1$ vector, and $\tilde{f}_s \triangleq W_s^{1/2}f_s = W_s^{1/2}\Phi_sf$. Next, we can prove that
$$\begin{aligned}
\big|f(x) - \tilde{k}_s^\top(x)(\tilde{K}_s + \lambda I)^{-1}\tilde{f}_s\big| &= \big|f(x) - \phi(x)^\top\Phi_s^\top W_s^{1/2}\big(W_s^{1/2}\Phi_s\Phi_s^\top W_s^{1/2} + \lambda I\big)^{-1}\tilde{f}_s\big|\\
&\overset{(a)}{=} \big|f(x) - \phi(x)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\Phi_s^\top W_s^{1/2}W_s^{1/2}\Phi_sf\big|\\
&= \big|\phi(x)^\top f - \phi(x)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\Phi_s^\top W_s\Phi_sf\big|\\
&= \big|\lambda\phi(x)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}f\big|\\
&\le \|f\|_k\,\big\|\lambda\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\phi(x)\big\|_k\\
&= \|f\|_k\sqrt{\lambda\phi(x)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\lambda I\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\phi(x)}\\
&\le \|f\|_k\sqrt{\lambda\phi(x)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\big(\lambda I + \Phi_s^\top W_s\Phi_s\big)\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\phi(x)}\\
&= \|f\|_k\sqrt{\lambda\phi(x)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\phi(x)} \overset{(b)}{\le} B\tilde{\sigma}_s(x),
\end{aligned}$$ (37)
in which step (a) has made use of (36), and step (b) follows from our assumption that $\|f\|_k \le B$ (Sec. 3.1) and the expression for the weighted GP posterior variance (6). Next, we have that
$$\begin{aligned}
\big|\tilde{k}_s^\top(x)(\tilde{K}_s + \lambda I)^{-1}W_s^{1/2}\zeta_{1:s}\big| &= \big|\phi(x)^\top\Phi_s^\top W_s^{1/2}\big(W_s^{1/2}\Phi_s\Phi_s^\top W_s^{1/2} + \lambda I\big)^{-1}W_s^{1/2}\zeta_{1:s}\big|\\
&\overset{(a)}{=} \big|\phi(x)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\Phi_s^\top W_s\zeta_{1:s}\big|\\
&\le \big\|\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1/2}\phi(x)\big\|_k\,\big\|\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1/2}\Phi_s^\top W_s\zeta_{1:s}\big\|_k\\
&= \sqrt{\phi(x)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\phi(x)}\sqrt{\big(\Phi_s^\top W_s\zeta_{1:s}\big)^\top\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\Phi_s^\top W_s\zeta_{1:s}}\\
&= \frac{\tilde{\sigma}_s(x)}{\sqrt{\lambda}}\sqrt{\zeta_{1:s}^\top W_s\Phi_s\big(\Phi_s^\top W_s\Phi_s + \lambda I\big)^{-1}\Phi_s^\top W_s^{1/2}W_s^{1/2}\zeta_{1:s}}\\
&\overset{(b)}{=} \frac{\tilde{\sigma}_s(x)}{\sqrt{\lambda}}\sqrt{\zeta_{1:s}^\top W_s\Phi_s\Phi_s^\top W_s^{1/2}\big(W_s^{1/2}\Phi_s\Phi_s^\top W_s^{1/2} + \lambda I\big)^{-1}W_s^{1/2}\zeta_{1:s}}\\
&= \frac{\tilde{\sigma}_s(x)}{\sqrt{\lambda}}\sqrt{\zeta_{1:s}^\top W_s^{1/2}\tilde{K}_s\big(\tilde{K}_s + \lambda I\big)^{-1}W_s^{1/2}\zeta_{1:s}},
\end{aligned}$$ (38)
in which (a) and (b) follow from (36). Since $\tilde{\mu}_s(x) = \tilde{k}_s^\top(x)(\tilde{K}_s + \lambda I)^{-1}\tilde{Y}_s$ and $\tilde{Y}_s = \tilde{f}_s + W_s^{1/2}\zeta_{1:s}$, the two equations above combine to tell us that
$$\big|f(x) - \tilde{\mu}_s(x)\big| \le \tilde{\sigma}_s(x)\Big(B + \frac{1}{\sqrt{\lambda}}\sqrt{\zeta_{1:s}^\top W_s^{1/2}\tilde{K}_s\big(\tilde{K}_s + \lambda I\big)^{-1}W_s^{1/2}\zeta_{1:s}}\Big) \le \tilde{\sigma}_s(x)\Big(B + \sqrt{\zeta_{1:s}^\top W_s^{1/2}\tilde{K}_s\big(\tilde{K}_s + \lambda I\big)^{-1}W_s^{1/2}\zeta_{1:s}}\Big),$$ (39)
in which the second inequality follows since $\lambda = 1 + \eta > 1$. Again following [9], we apply a few more steps of transformations. Note that for an invertible $K$, we have that $K(K + I)^{-1} = \big((K + I)K^{-1}\big)^{-1} = \big(I + K^{-1}\big)^{-1}$. Substituting $K = \tilde{K}_s + \eta I$, we have that $(\tilde{K}_s + \eta I)(\tilde{K}_s + (\eta + 1)I)^{-1} = \big(I + (\tilde{K}_s + \eta I)^{-1}\big)^{-1}$. Next, we have that
$$\zeta_{1:s}^\top W_s^{1/2}\tilde{K}_s\big(\tilde{K}_s + \lambda I\big)^{-1}W_s^{1/2}\zeta_{1:s} \le \zeta_{1:s}^\top W_s^{1/2}\big(\tilde{K}_s + \eta I\big)\big(\tilde{K}_s + (1 + \eta)I\big)^{-1}W_s^{1/2}\zeta_{1:s} = \zeta_{1:s}^\top W_s^{1/2}\big(\big(\tilde{K}_s + \eta I\big)^{-1} + I\big)^{-1}W_s^{1/2}\zeta_{1:s} = \big\|W_s^{1/2}\zeta_{1:s}\big\|^2_{((\tilde{K}_s + \eta I)^{-1} + I)^{-1}}.$$
This allows us to further re-write (39) as
$$\big|f(x) - \tilde{\mu}_s(x)\big| \le \tilde{\sigma}_s(x)\Big(B + \big\|W_s^{1/2}\zeta_{1:s}\big\|_{((\tilde{K}_s + \eta I)^{-1} + I)^{-1}}\Big).$$ (40)
Recall that $W_s = \mathrm{diag}\big(1/\epsilon_1^2, 1/\epsilon_2^2, \ldots, 1/\epsilon_s^2\big)$; therefore, $W_s^{1/2}\zeta_{1:s} = [\zeta_k\frac{1}{\epsilon_k}]_{k=1,\ldots,s}$. Of note, since we have that $|f(x_s) - y_s| \le \epsilon_s, \forall s = 1, \ldots, m$ (with high probability; refer to the beginning of Sec. 5), the noise $\zeta_s = y_s - f(x_s)$ is bounded: $|\zeta_s| \le \epsilon_s, \forall s = 1, \ldots, m$. In other words, $\zeta_s$ is $\epsilon_s$-sub-Gaussian.

To begin with, note that $\epsilon_k$ is $\mathcal{F}_{k-1}$-measurable. This can be seen recursively: conditioned on $\{x_1\}$, $\epsilon_1 = \tilde{\sigma}_0(x_1)/\sqrt{\lambda}$ is a deterministic constant (i.e., predictable); conditioned on $\{x_1, x_2\}$, $\tilde{\sigma}_1(\cdot)$ is predictable because it depends on $x_1$ and $\epsilon_1$ (via the weight $1/\epsilon_1^2$), and hence $\epsilon_2 = \tilde{\sigma}_1(x_2)/\sqrt{\lambda}$ is predictable conditioned on $\{x_1, x_2\}$. By induction, conditioned on $\{x_1, \ldots, x_k\}$, $\epsilon_k = \tilde{\sigma}_{k-1}(x_k)/\sqrt{\lambda}$ is predictable. Therefore, $\epsilon_k$ is $\mathcal{F}_{k-1}$-measurable, because $\mathcal{F}_{k-1} = \sigma(x_1, \ldots, x_k, \zeta_1, \ldots, \zeta_{k-1})$. Next, note that since $\zeta_k$ is $\mathcal{F}_k$-measurable, we have that $\zeta_k/\epsilon_k$ is $\mathcal{F}_k$-measurable. Moreover, conditioned on $\mathcal{F}_{k-1}$ (i.e., with $\epsilon_k$ predictable), $\zeta_k/\epsilon_k$ is 1-sub-Gaussian, because $\zeta_k$ is $\epsilon_k$-sub-Gaussian as discussed above.
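The normalization step above can be illustrated with a short simulation (the noise draws are hypothetical): because the QMC guarantee gives $|\zeta_k| \le \epsilon_k$, every entry $\zeta_k/\epsilon_k$ of $W_s^{1/2}\zeta_{1:s}$ lies in $[-1, 1]$ and is therefore 1-sub-Gaussian, regardless of how small each $\epsilon_k$ is.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
eps = rng.uniform(0.01, 0.5, size=m)         # per-stage QMC target errors
zeta = rng.uniform(-1.0, 1.0, size=m) * eps  # |zeta_k| <= eps_k (QMC guarantee)
normalized = zeta / eps                       # entries of W^{1/2} zeta
assert np.all(np.abs(normalized) <= 1.0)      # bounded in [-1, 1] => 1-sub-Gaussian
print(normalized.round(3))
```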
To summarize, every element $\zeta_k/\epsilon_k$ of the noise vector $W_s^{1/2}\zeta_{1:s}$ is (a) $\mathcal{F}_k$-measurable and (b) 1-sub-Gaussian conditionally on $\mathcal{F}_{k-1}$. Therefore, we can apply Theorem 1 of [9] for 1-sub-Gaussian noise to show that
$$\big\|W_s^{1/2}\zeta_{1:s}\big\|_{((\tilde{K}_s + \eta I)^{-1} + I)^{-1}} \le \sqrt{2\log\Bigg(\frac{\sqrt{\det\big((1 + \eta)I + \tilde{K}_s\big)}}{\delta}\Bigg)}$$ (41)
with probability of at least $1 - \delta$. When applying Theorem 1 of [9], we have also replaced the original unweighted covariance matrix $K_s$ by the corresponding weighted covariance matrix $\tilde{K}_s = W_s^{1/2}K_sW_s^{1/2}$. This is possible because every $\epsilon_k$ is $\mathcal{F}_{k-1}$-measurable. More concretely, in the proof of Theorem 1 of [9], when defining the super-martingale $\{M_t\}_t$ which is conditioned on $\mathcal{F}_\infty$, we only need to replace the covariance matrix $K_t$ (for the multivariate Gaussian distribution of $h(x_1), \ldots, h(x_t)$) by $\tilde{K}_t = W_t^{1/2}K_tW_t^{1/2}$, because both the original $K_t$ and our $\tilde{K}_t$ only require conditioning on $\{x_1, \ldots, x_t\}$. Combining (40) and (41) allows us to show that
$$\big|f(x) - \tilde{\mu}_s(x)\big| \le \tilde{\sigma}_s(x)\Bigg(B + \sqrt{2\log\Bigg(\frac{\sqrt{\det\big((1 + \eta)I + \tilde{K}_s\big)}}{\delta}\Bigg)}\Bigg)$$ (42)
with probability of at least $1 - \delta$. Next, we can further upper-bound the log-determinant term:
$$\log\det\big((1 + \eta)I + \tilde{K}_s\big) \overset{(a)}{\le} s\log(1 + \eta) + \log\det\Big(I + \frac{1}{1 + \eta}\tilde{K}_s\Big) \overset{(b)}{\le} s\eta + \log\det\Big(I + \frac{1}{\lambda}\tilde{K}_s\Big) \overset{(c)}{=} 2\tilde{\gamma}_s + s\eta \overset{(d)}{\le} 2\tilde{\gamma}_s + 2,$$ (43)
in which (a) follows since $\det(AB) = \det(A)\det(B)$ and $\det(cA) = c^s\det(A)$ for a scalar $c$ and an $s \times s$ matrix $A$, (b) has made use of $\log(1 + z) \le z$ and $\lambda = 1 + \eta$, (c) has made use of the definition of $\tilde{\gamma}_s$, and (d) follows because $\eta = 2/T$ and $s \le T$. This eventually allows us to show that
$$\big|f(x) - \tilde{\mu}_s(x)\big| \le \tilde{\sigma}_s(x)\Big(B + \sqrt{\log\det\big((1 + \eta)I + \tilde{K}_s\big) + 2\log(1/\delta)}\Big) \le \tilde{\sigma}_s(x)\Big(B + \sqrt{2\big(\tilde{\gamma}_s + 1 + \log(1/\delta)\big)}\Big),$$ (44)
which completes the proof of Theorem 3.

Of note, although the work of [15] has also adopted weighted GP regression in a similar way to our (3), our proof technique here cannot be applied in their analysis. This is because the work of [15] has chosen the weight to be $w_s = \eta^s$ for $\eta \in (0, 1)$ and hence $W_s = \mathrm{diag}[w_1, \ldots, w_s]$. Therefore, the elements of their scaled noise vector $W_s^{1/2}\zeta_{1:s}$ (i.e., $\zeta_k\sqrt{w_k}$) are not guaranteed to be 1-sub-Gaussian. As a result, this makes it infeasible for them to apply the self-normalizing concentration inequality from Theorem 1 of [9], i.e., our (41) above does not hold in their case. The work of [15] has instead developed other novel proof techniques suited to their setting. Therefore, our proof here only works in our problem setting of quantum BO.
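Before moving on, the push-through identity (36), which steps (a) and (b) of (37) and (38) rely on, can be verified numerically with illustrative random matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
s, d, lam = 6, 3, 1.1
Phi = rng.normal(size=(s, d))                    # stand-in for the feature matrix Phi_s
W_half = np.diag(rng.uniform(1.0, 3.0, size=s))  # W_s^{1/2}
lhs = Phi.T @ W_half @ np.linalg.inv(W_half @ Phi @ Phi.T @ W_half + lam * np.eye(s))
rhs = np.linalg.inv(Phi.T @ W_half @ W_half @ Phi + lam * np.eye(d)) @ Phi.T @ W_half
assert np.allclose(lhs, rhs)
print("identity (36) verified")
```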
F Proof of Theorem 4

To begin with, we can show that
$$r_s = f(x^*) - f(x_s) \overset{(a)}{\le} \tilde{\mu}_{s-1}(x^*) + \beta_s\tilde{\sigma}_{s-1}(x^*) - f(x_s) \overset{(b)}{\le} \tilde{\mu}_{s-1}(x_s) + \beta_s\tilde{\sigma}_{s-1}(x_s) - f(x_s) \overset{(c)}{\le} 2\beta_s\tilde{\sigma}_{s-1}(x_s) \overset{(d)}{=} 2\beta_s\epsilon_s\sqrt{\lambda},$$ (45)
in which (a) and (c) follow from Theorem 3, (b) results from our policy for selecting $x_s$ (line 2 of Algo. 1), and (d) follows from our design of $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda}$ (line 3 of Algo. 1). As a result, the total regret incurred in stage $s$ can be upper-bounded by
$$N_{\epsilon_s}r_s \le \frac{C_1}{\epsilon_s}\log\Big(\frac{2\bar{m}}{\delta}\Big)\cdot 2\beta_s\epsilon_s\sqrt{\lambda} = 2\sqrt{\lambda}C_1\beta_s\log\Big(\frac{2\bar{m}}{\delta}\Big).$$ (46)
Then an upper bound on the total cumulative regret can be obtained by summing up the regrets from all $m$ stages:
$$R_T = \sum_{s=1}^m N_{\epsilon_s}r_s \le 2\sqrt{\lambda}C_1m\beta_m\log\Big(\frac{2\bar{m}}{\delta}\Big) = O\Big(m\beta_m\log\frac{\bar{m}}{\delta}\Big).$$ (47)
For the linear kernel, recall from Theorem 3 and Corollary 1 that $\beta_m = O\big(\sqrt{\tilde{\gamma}_{m-1} + \log(1/\delta)}\big) = O\big(\sqrt{d\log T + \log(1/\delta)}\big)$. Also recall that Theorem 2 tells us that $m \le \bar{m} = O(d\log T)$. Therefore, for the linear kernel,
$$R_T = O\Big(d\log T\sqrt{d\log T + \log(1/\delta)}\,\log(d\log T)\Big) = O\Big(d^{3/2}\log^{3/2}T\log(d\log T)\Big),$$ (48)
in which we have ignored the dependency on $\log(1/\delta)$ in the second step to get a cleaner expression. For the SE kernel, recall from Theorem 3 and Corollary 1 that $\beta_m = O\big(\sqrt{\tilde{\gamma}_{m-1} + \log(1/\delta)}\big) = O\big(\sqrt{(\log T)^{d+1} + \log(1/\delta)}\big)$, and that Theorem 2 tells us that $m \le \bar{m} = O((\log T)^{d+1})$. Therefore, for the SE kernel,
$$R_T = O\Big((\log T)^{d+1}\sqrt{(\log T)^{d+1} + \log(1/\delta)}\,\log\big((\log T)^{d+1}\big)\Big) = O\Big((\log T)^{3(d+1)/2}\log\big((\log T)^{d+1}\big)\Big),$$ (49)
in which we have again ignored the dependency on $\log(1/\delta)$ to obtain a cleaner expression. The total error probability of $\delta$ results from (a) conditioning on the event that $|f(x_s) - y_s| \le \epsilon_s, \forall s = 1, \ldots, m$, which holds with probability of at least $1 - \delta/2$, and (b) Theorem 3, which also holds with probability of at least $1 - \delta/2$.
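For intuition, the following is a minimal sketch of a single stage of Q-GP-UCB in the bounded-noise scenario, showing exactly the two algorithmic steps invoked in (45); `qmc_estimate`, the constants, and the posterior callables are hypothetical stand-ins rather than the authors' implementation.

```python
import numpy as np

def q_gp_ucb_stage(domain, mu_prev, sigma_prev, beta_s, lam, C1, m_bar, delta,
                   qmc_estimate):
    # Line 2 of Algo. 1: select the arm maximizing the weighted-GP UCB.
    ucb = mu_prev(domain) + beta_s * sigma_prev(domain)
    x_s = domain[np.argmax(ucb)]
    # Line 3 of Algo. 1: per-stage target estimation error.
    eps_s = sigma_prev(x_s[None, :])[0] / np.sqrt(lam)
    # Bounded-noise query count from Lemma 1: N_eps = (C1/eps) log(2 m_bar / delta).
    n_queries = int(np.ceil(C1 / eps_s * np.log(2 * m_bar / delta)))
    # The QMC subroutine returns y_s with |y_s - f(x_s)| <= eps_s w.h.p.
    y_s = qmc_estimate(x_s, n_queries)
    return x_s, y_s, eps_s, n_queries
```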
G Proof of Theorem 5

Consistent with the analysis of Q-LinUCB [38], in order to apply the theoretical guarantee provided by QMC (Lemma 1) for noise with bounded variance, here we need to assume that the noise variance is not too small: we assume that $\sigma > 1/3$. We have chosen the value of $1/3$ just for convenience; in fact, any $\sigma$ larger than $1/4$ can be used in our analysis here, because these different values will only alter the constants in our regret upper bound, which are absorbed by the big-O notation. Note that $\epsilon_s = \tilde{\sigma}_{s-1}(x_s)/\sqrt{\lambda} \le 1$, which follows because $\tilde{\sigma}^2_{s-1}(x_s) \le 1$ (which can be easily seen from (3) and our assumption that $k(x, x') \le 1$) and $\lambda > 1$. As a result of the assumption of $\sigma > 1/3$, we have that $\sigma > 1/3 > 2^{\sqrt{2}}/8$. This, combined with $\epsilon_s \le 1$, allows us to show that $\log_2^{3/2}\big(\frac{8\sigma}{\epsilon_s}\big) \ge 2^{3/4}$ and that $\log_2\big(\log_2\frac{8\sigma}{\epsilon_s}\big) \ge \frac{1}{2}$.

Next, we derive an upper bound on $\sum_{s=1}^m 1/\epsilon_s^2$ which is analogous to (8). Note that in the case of noise with bounded variance, the number of queries to the quantum oracle in stage $s$ is given by $N_{\epsilon_s} = C_2\frac{\sigma}{\epsilon_s}\log_2^{3/2}\big(\frac{8\sigma}{\epsilon_s}\big)\log_2\big(\log_2\frac{8\sigma}{\epsilon_s}\big)\log\frac{2\bar{m}}{\delta}$ with $C_2 > 1$. As a result, we have the following inequality, which is in a similar spirit to (7):
$$\sum_{s=1}^m N_{\epsilon_s} = \sum_{s=1}^m C_2\frac{\sigma}{\epsilon_s}\log_2^{3/2}\Big(\frac{8\sigma}{\epsilon_s}\Big)\log_2\Big(\log_2\frac{8\sigma}{\epsilon_s}\Big)\log\frac{2\bar{m}}{\delta} \ge \sum_{s=1}^m\frac{1}{3}\cdot\frac{1}{\epsilon_s}\cdot 2^{3/4}\cdot\frac{1}{2} = \frac{1}{3\cdot 2^{1/4}}\sum_{s=1}^m\frac{1}{\epsilon_s},$$ (50)
in which we have used $C_2 > 1$, $\log\frac{2\bar{m}}{\delta} \ge 1$, $\sigma > 1/3$, and the two lower bounds on the logarithmic factors derived above. Now suppose that $\sum_{s=1}^m 1/\epsilon_s^2 > 9\sqrt{2}T^2$; then we have that $\sum_{s=1}^m 1/\epsilon_s \ge \sqrt{\sum_{s=1}^m 1/\epsilon_s^2} > 3\cdot 2^{1/4}T$ and hence, by (50), $\sum_{s=1}^m N_{\epsilon_s} > T$, which is a contradiction. Therefore, we have that
$$\sum_{s=1}^m\frac{1}{\epsilon_s^2} \le 9\sqrt{2}\,T^2.$$ (51)
Next, note that Theorem 1 is unaffected, since its proof does not depend on the noise. Moreover, Corollary 1 also stays the same, because it has only made use of (8), which gives an upper bound on $\sum_{s=1}^m 1/\epsilon_s^2$; compared with (8), its counterpart (51) in the analysis here only introduces an extra constant factor, which is absorbed by the big-O notation. Similarly, Theorem 2 is also unaltered for a similar reason, i.e., only an extra constant factor of $3\cdot 2^{1/4}$ will be introduced to the right-hand side of (30) and (34), which is absorbed by the big-O notation. Theorem 3 is also unaltered, since its proof does not depend on the type of noise. In particular, the key underlying reason why it is unaffected is that for both types of noise, given a particular $\epsilon_s$, we choose the corresponding number $N_{\epsilon_s}$ of queries (to the quantum oracle) depending on the type of noise, so that the error guarantee of $|y_s - f(x_s)| \le \epsilon_s$ is satisfied. This allows us to make use of the self-normalizing bound from [9] for 1-sub-Gaussian noise.

Lastly, we need to modify the proof of Theorem 4. To begin with, the total regret from stage $s$ can now be bounded as
$$N_{\epsilon_s}r_s \le C_2\frac{\sigma}{\epsilon_s}\log_2^{3/2}\Big(\frac{8\sigma}{\epsilon_s}\Big)\log_2\Big(\log_2\frac{8\sigma}{\epsilon_s}\Big)\log\Big(\frac{2\bar{m}}{\delta}\Big)\cdot 2\beta_s\epsilon_s\sqrt{\lambda} = O\Big(\sigma\beta_s\log_2^{3/2}\Big(\frac{\sigma}{\epsilon_s}\Big)\log_2\Big(\log_2\frac{\sigma}{\epsilon_s}\Big)\log\frac{\bar{m}}{\delta}\Big).$$ (52)
Note that every $1/\epsilon_s$ is upper-bounded by $1/\epsilon_s \le 3\cdot 2^{1/4}T = O(T)$, which can be easily inferred using (51). So, the total regret can be upper-bounded as
$$R_T = \sum_{s=1}^m O\Big(\sigma\beta_s\log_2^{3/2}\Big(\frac{\sigma}{\epsilon_s}\Big)\log_2\Big(\log_2\frac{\sigma}{\epsilon_s}\Big)\log\frac{\bar{m}}{\delta}\Big) = O\Big(m\beta_m\log\Big(\frac{\bar{m}}{\delta}\Big)\sigma\log_2^{3/2}(\sigma T)\log_2\big(\log_2(\sigma T)\big)\Big).$$ (53)
As a result, for the linear kernel, for which $m \le \bar{m} = O(d\log T)$ (Theorem 2) and $\beta_m = O\big(\sqrt{\tilde{\gamma}_{m-1} + \log(1/\delta)}\big) = O\big(\sqrt{d\log T + \log(1/\delta)}\big)$ (Theorem 3 and Corollary 1),
$$R_T = O\Big(d\log T\sqrt{d\log T + \log(1/\delta)}\,\log\Big(\frac{d\log T}{\delta}\Big)\sigma\log_2^{3/2}(\sigma T)\log_2\big(\log_2(\sigma T)\big)\Big) = O\Big(\sigma d^{3/2}\log^{3/2}T\log(d\log T)\log_2^{3/2}(\sigma T)\log_2\big(\log_2(\sigma T)\big)\Big),$$ (54)
in which we have ignored all $\log(1/\delta)$ factors for a cleaner expression. For the SE kernel, for which $m \le \bar{m} = O(\log^{d+1}T)$ (Theorem 2) and $\beta_m = O\big(\sqrt{\log^{d+1}T + \log(1/\delta)}\big)$ (Theorem 3 and Corollary 1),
$$R_T = O\Big(\log^{d+1}T\sqrt{\log^{d+1}T + \log(1/\delta)}\,\log\Big(\frac{\log^{d+1}T}{\delta}\Big)\sigma\log_2^{3/2}(\sigma T)\log_2\big(\log_2(\sigma T)\big)\Big) = O\Big(\sigma(\log T)^{3(d+1)/2}\log\big((\log T)^{d+1}\big)\log_2^{3/2}(\sigma T)\log_2\big(\log_2(\sigma T)\big)\Big),$$ (55)
in which we have again ignored all $\log(1/\delta)$ factors for a cleaner expression.
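For concreteness, here is a small sketch of the per-stage oracle-query counts in the two noise scenarios of Lemma 1, matching the expressions used in (7) and (50); the constants $C_1$, $C_2$ and the example arguments are illustrative.

```python
import numpy as np

def queries_bounded_noise(eps, m_bar, delta, C1=2.0):
    # Bounded noise: N_eps = (C1 / eps) log(2 m_bar / delta).
    return int(np.ceil(C1 / eps * np.log(2 * m_bar / delta)))

def queries_bounded_variance(eps, sigma, m_bar, delta, C2=2.0):
    # Bounded variance: N_eps = C2 (sigma/eps) log2^{3/2}(8 sigma/eps)
    #                           * log2(log2(8 sigma/eps)) * log(2 m_bar / delta).
    l2 = np.log2(8 * sigma / eps)  # > sqrt(2) whenever sigma > 1/3 and eps <= 1
    return int(np.ceil(C2 * sigma / eps * l2 ** 1.5 * np.log2(l2)
                       * np.log(2 * m_bar / delta)))

print(queries_bounded_noise(0.1, m_bar=50, delta=0.1),
      queries_bounded_variance(0.1, sigma=0.4, m_bar=50, delta=0.1))
```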
H Proof of Theorem 6

Our proof here follows closely the proof of Theorem 2 from the work of [2]. For consistency, in our proof here, we follow the notations from [38] and [2]. Since we focus on linear bandits here, we assume that the reward function is linear with a ground-truth $d$-dimensional parameter vector $\theta^*$: $f(x) = x^\top\theta^*, \forall x$. Again following [2] and [38], as well as many other works on linear bandits, we assume that $\|\theta^*\|_2 \le S$ and $\|x\|_2 \le L, \forall x$. We use $X_s$ to denote $X_s \triangleq [x_1, \ldots, x_s]^\top$, which is an $s \times d$ matrix, and denote $Y_s \triangleq [y_1, \ldots, y_s]^\top$, which is an $s \times 1$ vector of observations. We use $\zeta_{1:s} \triangleq [\zeta_1, \ldots, \zeta_s]^\top$ to denote the $s \times 1$ vector of noise. These notations allow us to write $Y_s = X_s\theta^* + \zeta_{1:s}$. We denote $V_s \triangleq X_s^\top W_sX_s + \lambda I$. Following [38], we use $\hat{\theta}_s$ to denote the regularized weighted least-squares estimate of the parameter $\theta^*$ given the observations after the first $s$ stages: $\hat{\theta}_s = V_s^{-1}X_s^\top W_sY_s$. Given these notations, we firstly have that
$$\begin{aligned}
\hat{\theta}_s &= \big(X_s^\top W_sX_s + \lambda I\big)^{-1}X_s^\top W_s\big(X_s\theta^* + \zeta_{1:s}\big)\\
&= \big(X_s^\top W_sX_s + \lambda I\big)^{-1}X_s^\top W_s\zeta_{1:s} + \big(X_s^\top W_sX_s + \lambda I\big)^{-1}\big(X_s^\top W_sX_s + \lambda I\big)\theta^* - \lambda\big(X_s^\top W_sX_s + \lambda I\big)^{-1}\theta^*\\
&= \big(X_s^\top W_sX_s + \lambda I\big)^{-1}X_s^\top W_s\zeta_{1:s} + \theta^* - \lambda\big(X_s^\top W_sX_s + \lambda I\big)^{-1}\theta^*.
\end{aligned}$$ (56)
Therefore,
$$x^\top\hat{\theta}_s - x^\top\theta^* = x^\top\big(X_s^\top W_sX_s + \lambda I\big)^{-1}X_s^\top W_s\zeta_{1:s} - \lambda x^\top\big(X_s^\top W_sX_s + \lambda I\big)^{-1}\theta^* = \big\langle x, X_s^\top W_s\zeta_{1:s}\big\rangle_{V_s^{-1}} - \lambda\big\langle x, \theta^*\big\rangle_{V_s^{-1}}.$$ (57)
This allows us to use the Cauchy–Schwarz inequality to show that
$$\big|x^\top\hat{\theta}_s - x^\top\theta^*\big| \le \|x\|_{V_s^{-1}}\Big(\big\|X_s^\top W_s\zeta_{1:s}\big\|_{V_s^{-1}} + \lambda\|\theta^*\|_{V_s^{-1}}\Big) \le \|x\|_{V_s^{-1}}\Big(\big\|X_s^\top W_s\zeta_{1:s}\big\|_{V_s^{-1}} + \sqrt{\lambda}S\Big).$$ (58)
The last inequality follows because $\|\theta^*\|^2_{V_s^{-1}} \le \frac{1}{\lambda_{\min}(V_s)}\|\theta^*\|_2^2 \le \frac{1}{\lambda}\|\theta^*\|_2^2$. Note that $X_s^\top W_s\zeta_{1:s} = \sum_{k=1}^s\frac{1}{\epsilon_k^2}\zeta_kx_k$.

Next, we make use of the self-normalized concentration inequality from Theorem 1 of [2]. Specifically, we will use $\{\zeta_s/\epsilon_s\}_{s=1}^{\infty}$ to replace the stochastic noise process $\{\zeta_s\}_{s=1}^{\infty}$ (i.e., $\{\eta_s\}_{s=1}^{\infty}$ in the notation of [2]), and use $\{\frac{1}{\epsilon_s}x_s\}_{s=1}^{\infty}$ to replace the vector-valued stochastic process $\{x_s\}_{s=1}^{\infty}$ (i.e., the sequence of inputs). Now we explain why this is feasible. Similar to the proof of Theorem 3, we again denote the observation noise as $\zeta_s$: $y_s = f(x_s) + \zeta_s$, and define $\mathcal{F}_s$ as the $\sigma$-algebra $\mathcal{F}_s = \sigma(x_1, x_2, \ldots, x_{s+1}, \zeta_1, \zeta_2, \ldots, \zeta_s)$. An immediate consequence is that $x_s$ is $\mathcal{F}_{s-1}$-measurable and $\zeta_s$ is $\mathcal{F}_s$-measurable. Following a similar analysis to that used in the proof of Theorem 3 allows us to show that $\epsilon_k$ is $\mathcal{F}_{k-1}$-measurable (i.e., $\epsilon_k$ can be predicted based on $\mathcal{F}_{k-1}$). Therefore, we have that $\frac{1}{\epsilon_k}x_k$ is $\mathcal{F}_{k-1}$-measurable and that $\zeta_k/\epsilon_k$ is $\mathcal{F}_k$-measurable. Moreover, conditioned on $\mathcal{F}_{k-1}$, $\zeta_k/\epsilon_k$ is 1-sub-Gaussian. This is because the QMC subroutine guarantees that $|y_k - f(x_k)| \le \epsilon_k, \forall k = 1, \ldots, m$ (with high probability), which means that the absolute value of the noise $\zeta_k = y_k - f(x_k)$ is upper-bounded by $\epsilon_k$: $|\zeta_k| \le \epsilon_k$.

Given that these conditions are satisfied, we can apply the self-normalized concentration inequality from Theorem 1 of [2] to $\{\zeta_s/\epsilon_s\}_{s=1}^{\infty}$ and $\{\frac{1}{\epsilon_s}x_s\}_{s=1}^{\infty}$ with (following the notations from Theorem 1 of [2])
$$\bar{V}_s = \lambda I + \sum_{k=1}^s\frac{1}{\epsilon_k^2}x_kx_k^\top = V_s,\qquad S_s = \sum_{k=1}^s\frac{1}{\epsilon_k^2}\zeta_kx_k = X_s^\top W_s\zeta_{1:s},$$ (59)
in which we have used $V = \lambda I$. This allows us to show that
$$\big\|S_s\big\|_{V_s^{-1}} = \big\|X_s^\top W_s\zeta_{1:s}\big\|_{V_s^{-1}} \le \sqrt{2\log\Big(\frac{\det(V_s)^{1/2}\det(\lambda I)^{-1/2}}{\delta}\Big)},$$ (60)
which holds with probability of at least $1 - \delta$. Next, again making use of the trace-determinant inequality from (16) and using the aforementioned assumption that $\|x\|_2 \le L$, we can show that $\det(V_s) \le \big(\lambda + \frac{L^2}{d}\sum_{k=1}^s\frac{1}{\epsilon_k^2}\big)^d$. Also note that $\det(\lambda I) = \lambda^d$. As a result, we can further upper-bound the equation above by
$$\big\|X_s^\top W_s\zeta_{1:s}\big\|_{V_s^{-1}} \le \sqrt{2\log\Bigg(\frac{\big(1 + \frac{L^2}{d\lambda}\sum_{k=1}^s\frac{1}{\epsilon_k^2}\big)^{d/2}}{\delta}\Bigg)} \le \sqrt{2d\log\Bigg(\frac{1 + \frac{L^2}{\lambda}\sum_{k=1}^s\frac{1}{\epsilon_k^2}}{\delta}\Bigg)},$$ (61)
in which we have used $d \ge 1$ in the last step to get a cleaner expression. Therefore, we can re-write (58) as
$$\big|x^\top\hat{\theta}_s - x^\top\theta^*\big| \le \|x\|_{V_s^{-1}}\Bigg(\sqrt{2d\log\Bigg(\frac{1 + \frac{L^2}{\lambda}\sum_{k=1}^s\frac{1}{\epsilon_k^2}}{\delta}\Bigg)} + \sqrt{\lambda}S\Bigg),\quad \forall x, s,$$ (62)
in which we have also used the above-mentioned assumption of $\|\theta^*\|_2 \le S$. Now, again following [2], we plug in $x = V_s(\hat{\theta}_s - \theta^*)$, which gives us
$$\big\|\hat{\theta}_s - \theta^*\big\|^2_{V_s} = \big(V_s(\hat{\theta}_s - \theta^*)\big)^\top\big(\hat{\theta}_s - \theta^*\big) \le \big\|V_s(\hat{\theta}_s - \theta^*)\big\|_{V_s^{-1}}\Bigg(\sqrt{2d\log\Bigg(\frac{1 + \frac{L^2}{\lambda}\sum_{k=1}^s\frac{1}{\epsilon_k^2}}{\delta}\Bigg)} + \sqrt{\lambda}S\Bigg) = \big\|\hat{\theta}_s - \theta^*\big\|_{V_s}\Bigg(\sqrt{2d\log\Bigg(\frac{1 + \frac{L^2}{\lambda}\sum_{k=1}^s\frac{1}{\epsilon_k^2}}{\delta}\Bigg)} + \sqrt{\lambda}S\Bigg).$$ (63)
Therefore, we have that
$$\big\|\hat{\theta}_s - \theta^*\big\|_{V_s} \le \sqrt{2d\log\Bigg(\frac{1 + \frac{L^2}{\lambda}\sum_{k=1}^s\frac{1}{\epsilon_k^2}}{\delta}\Bigg)} + \sqrt{\lambda}S.$$ (64)
This completes the proof of Theorem 6.

Now we briefly show how the improved confidence ellipsoid from our Theorem 6 leads to an improved regret upper bound for Q-LinUCB. Here we follow the proof of Theorem 3 from [38], and hence defer the detailed explanations of some of the steps to the proof there. To begin with, we have that
$$f(x^*) - f(x_s) = (x^* - x_s)^\top\theta^* \le \Big(\big\|\tilde{\theta}_s - \hat{\theta}_{s-1}\big\|_{V_{s-1}} + \big\|\hat{\theta}_{s-1} - \theta^*\big\|_{V_{s-1}}\Big)\|x_s\|_{V_{s-1}^{-1}} \le 2\Bigg(\sqrt{2d\log\Big(\frac{1 + L^2T^2/\lambda}{\delta}\Big)} + \sqrt{\lambda}S\Bigg)\|x_s\|_{V_{s-1}^{-1}} = O\Big(\sqrt{d\log T}\,\|x_s\|_{V_{s-1}^{-1}}\Big),$$ (65)
in which $\tilde{\theta}_s$ denotes the optimistic parameter used to select $x_s$. Here, for simplicity, we have upper-bounded $\sum_{k=1}^s 1/\epsilon_k^2$ by $T^2$ (Lemma 2) in the second inequality, and have omitted the dependency on $\log(1/\delta)$. Then, following the same steps as in the proof of Theorem 3 of [38], the total regret in stage $s$ can be upper-bounded by
$$N_{\epsilon_s}\big(f(x^*) - f(x_s)\big) = O\Big(\sqrt{d\log T}\log\frac{\bar{m}}{\delta}\Big).$$ (66)
Plugging our tighter confidence ellipsoid (Theorem 6) into the regret analysis of Q-LinUCB, we have that the regret upper bound of Q-LinUCB can be analyzed as
$$R_T = O\Big(m\sqrt{d\log T}\log\frac{\bar{m}}{\delta}\Big) = O\Big(d\log T\sqrt{d\log T}\log(d\log T)\Big) = O\Big(d^{3/2}(\log T)^{3/2}\log(d\log T)\Big),$$ (67)
which gives the tighter regret upper bound for Q-LinUCB that we have discussed in the main text (Sec. 5.6).
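A compact sketch (with hypothetical data) of the weighted estimate $\hat{\theta}_s = V_s^{-1}X_s^\top W_sY_s$ and the confidence radius of Theorem 6; $S$, $L$, $\lambda$, and the data-generating process here are illustrative assumptions.

```python
import numpy as np

def weighted_ellipsoid(X, Y, eps, lam, S, L, delta):
    s, d = X.shape
    W = np.diag(1.0 / eps ** 2)
    V = X.T @ W @ X + lam * np.eye(d)            # V_s = X^T W X + lam I
    theta_hat = np.linalg.solve(V, X.T @ W @ Y)  # V_s^{-1} X^T W Y
    sum_inv_eps2 = np.sum(1.0 / eps ** 2)
    # Theorem 6: ||theta_hat - theta*||_{V_s} <= radius with probability >= 1 - delta.
    radius = (np.sqrt(2 * d * np.log((1 + L ** 2 / lam * sum_inv_eps2) / delta))
              + np.sqrt(lam) * S)
    return theta_hat, radius

rng = np.random.default_rng(5)
s, d = 20, 3
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)   # S = 1
X = rng.normal(size=(s, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)  # L = 1
eps = rng.uniform(0.05, 0.5, size=s)
Y = X @ theta_star + rng.uniform(-1, 1, size=s) * eps  # |zeta_k| <= eps_k
print(weighted_ellipsoid(X, Y, eps, lam=1.1, S=1.0, L=1.0, delta=0.1))
```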
Similar to (27), we can show that
$$m\log 2 = \log\frac{\det(V_m)}{\det(V_0)} = 2\tilde{\gamma}_m \le 2C\Big(\sum_{s=1}^m\frac{1}{\epsilon_s^2}\Big)^{\frac{d}{2\nu+d}}.$$ (69)
This naturally leads to
$$\sum_{s=1}^m\frac{1}{\epsilon_s^2} \ge \Big(\frac{m\log 2}{2C}\Big)^{\frac{2\nu+d}{d}},$$ (70)
which is analogous to (28). Next, similar to (29), the total number of iterations can be analyzed as
$$\sum_{s=1}^m\frac{C_1}{\epsilon_s}\log\frac{2\bar{m}}{\delta} \ge \sum_{s=1}^m\frac{1}{\epsilon_s} \ge \sqrt{\sum_{s=1}^m\frac{1}{\epsilon_s^2}} \ge \sqrt{\Big(\frac{m\log 2}{2C}\Big)^{\frac{2\nu+d}{d}}}.$$ (71)
Now suppose $m > \frac{2C}{\log 2}T^{\frac{2d}{2\nu+d}}$; then this implies that
$$\sqrt{\Big(\frac{m\log 2}{2C}\Big)^{\frac{2\nu+d}{d}}} > T.$$ (72)
This equation, combined with (71), suggests that $\sum_{s=1}^m\frac{C_1}{\epsilon_s}\log\frac{2\bar{m}}{\delta} > T$, which is a contradiction. Therefore, we have that
$$m \le \frac{2C}{\log 2}T^{\frac{2d}{2\nu+d}} = \tilde{O}\big(T^{\frac{2d}{2\nu+d}}\big).$$ (73)
Next, note that our confidence ellipsoid in Theorem 3 does not depend on the choice of the kernel and hence also holds for the Matérn kernel. Lastly, we can modify the proof of Theorem 4 (App. F). Specifically, we can adopt the result from (47): $R_T \le O\big(m\beta_m\log\frac{\bar{m}}{\delta}\big)$. For the Matérn kernel, Theorem 3 and (68) allow us to show that $\beta_m = \tilde{O}\big(\sqrt{\tilde{\gamma}_{m-1}}\big) = \tilde{O}\big(\sqrt{T^{\frac{2d}{2\nu+d}}}\big) = \tilde{O}\big(T^{\frac{d}{2\nu+d}}\big)$. Combining this with (73) allows us to show that for the Matérn kernel,
$$R_T = \tilde{O}\big(T^{\frac{2d}{2\nu+d}}\cdot T^{\frac{d}{2\nu+d}}\big) = \tilde{O}\big(T^{\frac{3d}{2\nu+d}}\big).$$ (74)
Note that this bound is sub-linear in $T$ whenever $3d < 2\nu + d$, i.e., whenever $\nu > d$. This completes the proof of Theorem 7.

J More Experimental Details, and Experiment on a Real Quantum Computer

In our experiments, when implementing both the classical GP-UCB and our Q-GP-UCB algorithms, we use the random Fourier features (RFF) approximation to approximate the kernel: $k(x, x') \approx \tilde{\phi}(x)^\top\tilde{\phi}(x')$. As a result, we can implement the weighted GP posterior following (6), as discussed in App. A, in which the $M$-dimensional ($M < \infty$) random features $\tilde{\phi}(\cdot)$ are used to replace the infinite-dimensional RKHS features $\phi(\cdot)$. We have used $M = 100$ random features in the synthetic experiment and $M = 200$ in the AutoML experiment. As we have mentioned in the main paper (Sec. 6), we have implemented the QMC algorithm from the recent work of [20] using the Qiskit Python package. We have used 6 qubits for the Gaussian noise and 1 qubit for the Bernoulli noise. For each experiment, we perform multiple independent trials (10 trials for the synthetic experiment and 5 trials for the AutoML experiment) and plot the resulting mean and standard error as the solid lines and error bars in each figure, respectively. For our Q-GP-UCB, we choose $\beta_s = 1 + \log s, \forall s \ge 1$, following the order given by the theoretical value of $\beta_s$ (Theorem 3). For classical GP-UCB, we tried different approaches to setting $\beta_s$: $\beta_s = 1 + \log s$, $\beta_s = 2$, and $\beta_s = 1$, all of which are commonly adopted practices in BO; we found that $\beta_s = 2$ and $\beta_s = 1$ led to the best performances for GP-UCB in, respectively, the synthetic and AutoML experiments.

For the synthetic experiment, some important experimental details have been discussed in Sec. 6. The reward function $f$ used in the synthetic experiment is sampled from a GP with the SE kernel with a length scale of 0.1. In the AutoML experiment, an SVM is used for diabetes diagnosis. That is, we adopt the diabetes diagnosis dataset, which can be found at https://www.kaggle.com/uciml/pima-indians-diabetes-database and is released under the CC0 license. The task involves using 8 features to predict whether a patient has diabetes, which constitutes a two-class classification problem. We use 70% of the dataset as the training set and the remaining data as the validation set. We aim to tune two hyperparameters of the SVM: the penalty parameter and the RBF kernel parameter, both within the range of $[10^{-4}, 1]$. For simplicity, for each of the two hyperparameters, we discretize its domain into 5 points with equal spacing. This results in a domain with $|\mathcal{X}| = 25$. Here we adopt the SE kernel $k(x, x') \triangleq \sigma_0^2\exp\big(-\|x - x'\|_2^2/(2l^2)\big)$ with $\sigma_0^2 = 0.5$.
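A rough sketch of the RFF approximation mentioned above for the SE kernel $k(x, x') = \sigma_0^2\exp(-\|x - x'\|_2^2/(2l^2))$; the length scale, the feature count, and the test inputs below are illustrative (the experiments used $M = 100$ or $M = 200$).

```python
import numpy as np

def rff_features(X, M=100, ls=0.2, var0=0.5, seed=0):
    # Random Fourier features: phi(x)^T phi(x') ~ var0 exp(-||x - x'||^2 / (2 ls^2)).
    rng = np.random.default_rng(seed)
    Omega = rng.normal(scale=1.0 / ls, size=(X.shape[1], M))  # spectral frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=M)                   # random phases
    return np.sqrt(2.0 * var0 / M) * np.cos(X @ Omega + b)

rng = np.random.default_rng(6)
X = rng.uniform(size=(5, 2))
Phi = rff_features(X, M=20000)
K_exact = 0.5 * np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / (2 * 0.2 ** 2))
print(np.abs(Phi @ Phi.T - K_exact).max())  # approximation error shrinks as M grows
```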
Since we have implemented our algorithms (both classical GP-UCB and our Q-GP-UCB) using the RFF approximation as discussed above, we performed a grid search for the length scale $l$ within $\{0.1, 0.2, \ldots, 1.0\}$ for both GP-UCB and Q-GP-UCB when generating the random features. We found that $l = 1.0$ leads to the best performance for GP-UCB and $l = 0.2$ works best for Q-GP-UCB. This is likely because for our Q-GP-UCB, the effective noise is much smaller (thanks to the accurate estimation by the QMC algorithm), which allows us to use a smaller length scale (which can model more complicated functions) to accurately model the reward function.

[Figure 2: Cumulative regret for the synthetic experiment with (a) Bernoulli noise and (b) Gaussian noise, with additional comparisons with LinUCB and Q-LinUCB.]

[Figure 3: Cumulative regret for (a) the synthetic experiment with Bernoulli noise, (b) the synthetic experiment with Gaussian noise, and (c) the AutoML experiment, with additional comparisons with our Q-GP-UCB algorithm after accounting for quantum noise.]

All our experiments are run on a computer with an Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz, with 64 CPUs. No GPUs are needed.

More Experimental Results. Fig. 2 additionally shows the cumulative regret of the LinUCB and Q-LinUCB algorithms in the synthetic experiment. Consistent with the results in Fig. 1 (d) in the main paper (Sec. 6), which includes the results for LinUCB and Q-LinUCB in the AutoML experiment, LinUCB and Q-LinUCB again underperform significantly in Fig. 2 for both types of noise. These results further verify that in realistic experiments with non-linear reward functions, algorithms based on linear bandits are significantly outperformed by those based on BO/kernelized bandits. Fig. 3 additionally shows the results of our Q-GP-UCB after accounting for quantum noise. As we have discussed in the main paper (Sec. 6), the results show that quantum noise leads to slightly worse performance for our Q-GP-UCB, yet it is still able to significantly outperform classical GP-UCB.
Results Using a Real Quantum Computer. Here we additionally perform an experiment using a real quantum computer. Specifically, we ran our QMC subroutine on a real IBM quantum computer based on superconducting-qubit technology (7 qubits, Falcon r5.11H processor) through the IBM Quantum cloud service. We ran our synthetic experiment with Gaussian noise on this system for only 1 trial (instead of 10 as in the other simulations) due to time and resource constraints. The results in Fig. 4 show that although the performance of our Q-GP-UCB on a real quantum computer is worse than that obtained without quantum noise and with simulated quantum noise, our Q-GP-UCB run on a real quantum computer is still able to significantly outperform classical GP-UCB. This experiment provides a further demonstration of the potential of our Q-GP-UCB algorithm to achieve quantum speedup in real BO applications.

[Figure 4: Results using a real IBM quantum computer. Cumulative regret for the synthetic experiment with Gaussian noise with σ = 0.4.]