# The Inductive Bias of Quantum Kernels

Jonas M. Kübler, Simon Buchholz, Bernhard Schölkopf
Max Planck Institute for Intelligent Systems, Tübingen, Germany
{jmkuebler, sbuchholz, bs}@tue.mpg.de

JMK and SB contributed equally and are ordered randomly. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

It has been hypothesized that quantum computers may lend themselves well to applications in machine learning. In the present work, we analyze function classes defined via quantum kernels. Quantum computers offer the possibility to efficiently compute inner products of exponentially large density operators that are classically hard to compute. However, having an exponentially large feature space renders the problem of generalization hard. Furthermore, being able to evaluate inner products in high-dimensional spaces efficiently by itself does not guarantee a quantum advantage, as already classically tractable kernels can correspond to high- or infinite-dimensional reproducing kernel Hilbert spaces (RKHS). We analyze the spectral properties of quantum kernels and find that we can expect an advantage if their RKHS is low dimensional and contains functions that are hard to compute classically. If the target function is known to lie in this class, this implies a quantum advantage, as the quantum computer can encode this inductive bias, whereas there is no classically efficient way to constrain the function class in the same way. However, we show that finding suitable quantum kernels is not easy because the kernel evaluation might require exponentially many measurements. In conclusion, our message is a somewhat sobering one: we conjecture that quantum machine learning models can offer speed-ups only if we manage to encode knowledge about the problem at hand into quantum circuits, while encoding the same bias into a classical model would be hard. These situations may plausibly occur when learning on data generated by a quantum process; however, they appear to be harder to come by for classical datasets.

## 1 Introduction

In recent years, much attention has been dedicated to studies of how small and noisy quantum devices [1] could be used for near-term applications to showcase the power of quantum computers. Besides fundamental demonstrations [2], potential applications that have been discussed are in quantum chemistry [3], discrete optimization [4], and machine learning (ML) [5–12]. Initiated by the seminal HHL algorithm [13], early work in quantum machine learning (QML) focused on speeding up linear algebra subroutines commonly used in ML, offering the perspective of a runtime logarithmic in the problem size [14–17]. However, most of these works have an inverse polynomial scaling of the runtime in the error, and it was shown rigorously by Ciliberto et al. [18] that, due to the quantum mechanical measurement process, a runtime complexity of $O(\sqrt{n})$ is necessary for a convergence rate of $1/\sqrt{n}$.

Figure 1: Quantum advantage via inductive bias: (a) Data-generating quantum circuit, $f^*(x) = \operatorname{Tr}\!\big[\rho_V(x)\,(M^* \otimes \mathrm{id})\big] = \operatorname{Tr}\!\big[\rho^V_1(x)\, M^*\big]$. (b) The full quantum kernel $k(x, x') = \operatorname{Tr}\!\big[\rho_V(x)\,\rho_V(x')\big]$ is too general and cannot learn $f^*$ efficiently. (c) The biased quantum kernel $q(x, x') = \operatorname{Tr}\!\big[\rho^V_1(x)\,\rho^V_1(x')\big]$ meaningfully constrains the function space and allows $f^*$ to be learned with little data.

Rather than speeding up linear algebra subroutines, we focus on more recent suggestions that use a quantum device to define and implement the function class and do the optimization on a classical computer. There are two ways to do this: the first are so-called Quantum Neural Networks (QNN) or
parametrized quantum circuits [5–7], which can be trained via gradient-based optimization [5, 19–23]. The second approach is to use a predefined way of encoding the data in the quantum system and to define a quantum kernel as the inner product of two quantum states [7–11]. These two approaches essentially provide a parametric and a non-parametric path to quantum machine learning, which are closely related to each other [11]. Since the optimization of QNNs is non-convex and suffers from so-called Barren Plateaus [24], we here focus on quantum kernels, which allow for convex problems and thus lend themselves more readily to theoretical analysis.

The central idea of using a QML model is that it enables computations that are exponentially hard classically. However, also in classical ML, kernel methods allow us to implicitly work with high- or infinite-dimensional function spaces [25, 26]. Thus, purely studying the expressivity of QML models [27] is not sufficient to understand when we can expect speed-ups. Only recently have first steps been taken in this direction [10, 12, 28]. Assuming classical hardness of computing discrete logarithms, Liu et al. [10] proposed a task based on the discrete logarithm where a quantum computer, equipped with the right feature mapping, can learn the target function with exponentially less data than any efficient classical algorithm. Similarly, Huang et al. [12] analyzed generalization bounds and observed that the expressivity of quantum models can hinder generalization. They proposed a heuristic to optimize the labels of a dataset such that it can be learned well by a quantum computer but not by a classical machine.

In this work, we relate the discussion of quantum advantages to the classical concept of inductive bias. The no free lunch theorem informally states that no learning algorithm can outperform other algorithms on all problems. This implies that an algorithm that performs well on one type of problem necessarily performs poorly on other problems. A standard inductive bias in ML is to prefer functions that are continuous. An algorithm with that bias, however, will then struggle to learn functions that are discontinuous. For a QML model to have an edge over classical ML models, we could thus ensure that it is equipped with an inductive bias that cannot be encoded (efficiently) with a classical machine. If a given dataset fits this inductive bias, the QML model will outperform any classical algorithm. For kernel methods, the qualitative concept of inductive bias can be formalized by analyzing the spectrum of the kernel and relating it to the target function [25, 29–33].

Our main contribution is the analysis of the inductive bias of quantum machine learning models based on the spectral properties of quantum kernels. First, we show that quantum kernel methods will fail to generalize as soon as the data embedding into the quantum Hilbert space is too expressive (Theorem 1). Then we note that projecting the quantum kernel appropriately allows us to construct inductive biases that are hard to create classically (Figure 1).
However, our Theorem 2 also implies that estimating the biased kernel requires exponentially many measurements, a phenomenon reminiscent of the Barren plateaus observed in quantum neural networks. Finally, we show experiments supporting our main claims. While our work gives guidance on where to look for a quantum advantage in ML, it yields no recipe for obtaining a quantum advantage on a classical dataset. We conjecture that unless we have a clear idea how the data-generating process can be described with a quantum computer, we cannot expect an advantage from using a quantum model in place of a classical machine learning model.

## 2 Supervised learning

We briefly introduce the setting and notation for supervised learning as a preparation for our analysis of quantum mechanical methods in this context. The goal of supervised learning is the estimation of a functional mechanism based on data generated from this mechanism. For concreteness we focus on the regression setting, where we assume data is generated according to $Y = f^*(X) + \varepsilon$, with $\varepsilon$ denoting zero-mean noise. We focus on $X \in \mathcal{X} \subseteq \mathbb{R}^d$, $Y \in \mathbb{R}$. We denote the joint probability distribution of $(X, Y)$ by $D$, and we are given $n$ i.i.d. observations $D_n$ from $D$. We refer to the marginal distribution of $X$ as $\mu$, define the $L^2_\mu$ inner product $\langle f, g\rangle = \int f(x) g(x)\, \mu(dx)$, and denote the corresponding norm by $\|\cdot\|$. The least squares risk and the empirical risk of a hypothesis $h: \mathcal{X} \to \mathbb{R}$ are defined by $R(h) = \mathbb{E}_D\big[(h(X) - Y)^2\big]$ and $R_n(h) = \mathbb{E}_{D_n}\big[(h(X) - Y)^2\big]$. In supervised machine learning, one typically considers a hypothesis space $\mathcal{H}$ of functions $h: \mathcal{X} \to \mathbb{R}$ and tries to infer $\operatorname{argmin}_{h \in \mathcal{H}} R(h)$ (assuming for simplicity that the minimizer exists). Typically this is done by (regularized) empirical risk minimization, $\operatorname{argmin}_{h \in \mathcal{H}} R_n(h) + \lambda\, \Omega(h)$, where $\lambda > 0$ and $\Omega$ determine the regularization. The risk of $h$ can then be decomposed into generalization and training error, $R(h) = \big(R(h) - R_n(h)\big) + R_n(h)$.

**Kernel ridge regression.** We focus on solving the regression problem over a reproducing kernel Hilbert space (RKHS) [25, 26]. An RKHS $\mathcal{F}$ associated with a positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the space of functions such that for all $x \in \mathcal{X}$ and $h \in \mathcal{F}$ the reproducing property $h(x) = \langle h, k(x, \cdot)\rangle_{\mathcal{F}}$ holds. Kernel ridge regression regularizes the RKHS norm, i.e., $\Omega(h) = \|h\|^2_{\mathcal{F}}$. With observations $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$ we can compute the kernel matrix $K(X, X)_{ij} = k(x^{(i)}, x^{(j)})$, and the Representer Theorem [34] ensures that the empirical risk minimizer of kernel ridge regression is of the form $\hat{f}^{\lambda}_n(\cdot) = \sum_{i=1}^n \alpha_i k(x^{(i)}, \cdot)$, with $\alpha = \big(K(X, X) + \lambda\, \mathrm{id}\big)^{-1} y$. The goal of our work is to study when a (quantum) kernel is suitable for learning a particular problem. The central object to study this is the integral operator.

**Spectral properties and inductive bias.** For a kernel $k$ and marginal distribution $\mu$, the integral operator $\mathcal{K}$ is defined as $(\mathcal{K}f)(x) = \int k(x, x') f(x')\, \mu(dx')$. Mercer's Theorem ensures that there exists a spectral decomposition of $\mathcal{K}$ with (possibly infinitely many) eigenvalues $\gamma_i$ (ordered non-increasingly) and corresponding eigenfunctions $\phi_i$, which are orthonormal in $L^2_\mu$, i.e., $\langle \phi_i, \phi_j\rangle = \delta_{ij}$. We will assume that $\operatorname{Tr}[\mathcal{K}] = \sum_i \gamma_i = 1$, which we can ensure by rescaling the kernel. We can then write $k(x, x') = \sum_i \gamma_i\, \phi_i(x)\phi_i(x')$. While the functions $\phi_i$ form a basis of $\mathcal{F}$, they might not completely span $L^2_\mu$. In this case we simply complete the basis and implicitly take $\gamma_i = 0$ for the added functions. Then we can decompose functions in $L^2_\mu$ as

$$f(x) = \sum_i a_i\, \phi_i(x). \qquad (1)$$

We have $\|f\|^2 = \sum_i a_i^2$ and $\|f\|^2_{\mathcal{F}} = \sum_i a_i^2 / \gamma_i$ (if $f \in \mathcal{F}$).
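To make the estimator concrete, the following is a minimal numpy sketch of kernel ridge regression as described above; the Gaussian kernel, the toy data, and all function names are illustrative placeholders (and the normalization $\operatorname{Tr}[\mathcal{K}] = 1$ is not enforced here), not part of our formal setup.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # placeholder classical kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq_dists / (2 * sigma**2))

def fit_krr(X_train, y_train, kernel, lam=1e-3):
    # Representer Theorem solution: alpha = (K(X, X) + lambda * id)^{-1} y
    K = kernel(X_train, X_train)
    return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

def predict_krr(X_test, X_train, alpha, kernel):
    # f_hat(x) = sum_i alpha_i k(x^(i), x)
    return kernel(X_test, X_train) @ alpha

# toy usage: y = sin(x_1) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
alpha = fit_krr(X[:100], y[:100], gaussian_kernel)
y_pred = predict_krr(X[100:], X[:100], alpha, gaussian_kernel)
print("test MSE:", np.mean((y_pred - y[100:])**2))
```

The eigenvalues of the empirical kernel matrix divided by $n$ provide plug-in estimates of the Mercer eigenvalues $\gamma_i$, which is how the spectra shown later are estimated.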
Kernel ridge regression penalizes the RKHS norm of functions. The components corresponding to zero eigenvalues are infinitely penalized and cannot be learned, since they are not in the RKHS. For large regularization $\lambda$ the solution $\hat{f}^{\lambda}_n$ is heavily biased towards learning only the coefficients of the principal components (corresponding to the largest eigenvalues) and keeps the other coefficients small (at the risk of underfitting). Decreasing the regularization allows ridge regression to also fit the other components, however at the potential risk of overfitting to the noise in the empirical data. Finding good choices of $\lambda$ thus balances this bias-variance tradeoff. We are less concerned with the choice of $\lambda$, but rather with the spectral properties of a kernel that allow for a quantum advantage. Similar to the above considerations, a target function $f^*$ can easily be learned if it is well aligned with the principal components of a kernel. In the easiest case, the kernel has only a single non-zero eigenvalue and is just $k(x, x') = f^*(x) f^*(x')$. Such a construction is arguably the simplest path to a quantum advantage in ML.

**Example 1** (Trivial Quantum Advantage). Let $f^*$ be a scalar function that is easily computable on a quantum device but requires exponential resources to approximate classically. Generate data as $Y = f^*(X) + \epsilon$. The kernel $k(x, x') = f^*(x) f^*(x')$ then has an exponential advantage for learning $f^*$ from data.

To go beyond this trivial case, we introduce two qualitative measures to judge the quality of a kernel for learning the function $f^*$. The kernel target alignment of Cristianini et al. [30] is

$$A(k, f^*) = \frac{\langle k, f^* \otimes f^* \rangle}{\langle k, k \rangle^{1/2}\, \langle f^* \otimes f^*, f^* \otimes f^* \rangle^{1/2}} = \frac{\sum_i \gamma_i a_i^2}{\big(\sum_i \gamma_i^2\big)^{1/2} \sum_i a_i^2} \qquad (2)$$

and measures how well the kernel fits $f^*$. If $A = 1$, learning reduces to estimating a single real parameter, whereas for $A = 0$, learning is infeasible. We note that the kernel target alignment also weighs the contributions of $f^*$ depending on the corresponding eigenvalue, i.e., the alignment is better if large $|a_i|$ correspond to large $\gamma_i$. The kernel target alignment has been used extensively to optimize kernel functions [31] and recently also to optimize quantum kernels [35]. In a similar spirit, the task-model alignment of Canatar et al. [32] measures how much of the signal of $f^*$ is captured in the first $i$ principal components: $C(i) = \big(\sum_{j \le i} a_j^2\big)\big(\sum_j a_j^2\big)^{-1}$. The slower $C(i)$ approaches 1, the harder the task is to learn, as the target function is spread over more eigenfunctions.

## 3 Quantum computation in machine learning

In this section we introduce hypothesis spaces containing functions whose output is given by the result of a quantum computation. For a general introduction to the concepts of quantum computation we refer to the book of Nielsen and Chuang [36]. We consider quantum systems comprising $d \in \mathbb{N}$ qubits. Discussing such systems and their algebraic properties does not require in-depth knowledge of quantum mechanics. A pure state of a single qubit is described by a vector $(\alpha, \beta) \in \mathbb{C}^2$ with $|\alpha|^2 + |\beta|^2 = 1$, and we write $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, where $\{|0\rangle, |1\rangle\}$ forms the computational basis. A $d$-qubit pure state lives in the tensor product of the single-qubit state spaces, i.e., it is described by a normalized vector in $\mathbb{C}^{2^d}$. A mixed state of a $d$-qubit system can be described by a density operator $\rho \in \mathbb{C}^{2^d \times 2^d}$, i.e., a positive semi-definite matrix ($\rho \succeq 0$) with unit trace ($\operatorname{Tr}[\rho] = 1$). For a pure state $|\psi\rangle$ the corresponding density operator is $\rho = |\psi\rangle\langle\psi|$ (here, $\langle\psi|$ is the complex conjugate transpose of $|\psi\rangle$).
A general density operator can be thought of as a classical probabilistic mixture of pure states. We can extract information from $\rho$ by estimating (through repeated measurements) the expectation of a suitable observable, i.e., a Hermitian operator $M = M^\dagger$ (where the adjoint $(\cdot)^\dagger$ is the complex conjugate of the transpose), as

$$\operatorname{Tr}[\rho M]. \qquad (3)$$

Put simply, the potential advantage of a quantum computer arises from its state space being exponentially large in the number of qubits $d$, so that computing general expressions like (3) on a classical computer is exponentially hard. However, besides the huge obstacles in building quantum devices with high fidelity, the fact that the outcome of the quantum computation (3) has to be estimated from measurements often prohibits easily harnessing this power; see also Wang et al. [37] and Peters et al. [38]. We discuss this in the context of quantum kernels in Section 4.

We consider parameter-dependent quantum states $\rho(x) = U(x)\rho_0 U^\dagger(x)$ that are generated by evolving the initial state $\rho_0$ with the data-dependent unitary transformation $U(x)$ [7, 11]. Most often we will, without loss of generality, assume that the initial state is $\rho_0 = (|0\rangle\langle 0|)^{\otimes d}$. We then define quantum machine learning models via observables $M$ of the data-dependent state,

$$f_M(x) = \operatorname{Tr}\big[U(x)\rho_0 U^\dagger(x)\, M\big] = \operatorname{Tr}[\rho(x) M]. \qquad (4)$$

In the following we introduce the two most common function classes suggested for quantum machine learning. We note that there also exist proposals that do not fit into the form of Eq. (4) [27, 35, 39].

**Quantum neural networks.** A "quantum neural network" (QNN) is defined via a variational quantum circuit (VQC) [6, 40, 41]. Here the observable $M_\theta$ is parametrized by $p \in \mathbb{N}$ classical parameters $\theta \in \Theta \subseteq \mathbb{R}^p$. This defines a parametric function class $\mathcal{F}_\Theta = \{f_{M_\theta} \mid \theta \in \Theta\}$. The most common ansatz is to consider $M_\theta = U(\theta) M U^\dagger(\theta)$, where $U(\theta) = \prod_i U(\theta_i)$ is a composition of unitary evolutions, each acting on few qubits. For this and other common models of the parametric circuit it is possible to analytically compute gradients, and specific optimizers for quantum circuits based on gradient descent have been developed [5, 19–23]. Nevertheless, the optimization is usually a non-convex problem and suffers from additional difficulties due to oftentimes exponentially (in $d$) vanishing gradients [24]. This hinders a theoretical analysis. Note that the non-convexity does not arise from the fact that the QNN can learn non-linear functions, but rather because the observable $M_\theta$ depends non-linearly on the parameters. In fact, the QNN functions are linear in the fixed feature mapping $\rho(x)$. Therefore the analogy to classical neural networks is somewhat incomplete.

Table 1: Concepts in the quantum Hilbert space $\mathcal{H}$ and the reproducing kernel Hilbert space $\mathcal{F}$.

| Quantum space of $d$ qubits | RKHS |
| --- | --- |
| $x \mapsto \rho(x) \in \mathcal{H}$ (explicit feature map) | $x \mapsto k(\cdot, x) \in \mathcal{F}$ (canonical feature map) |
| $\mathcal{H} = \big\{\rho \in \mathbb{C}^{2^d \times 2^d} \mid \rho = \rho^\dagger,\ \rho \succeq 0,\ \operatorname{Tr}[\rho] = 1\big\}$ | |
| $k(x, x') = \operatorname{Tr}[\rho(x)\rho(x')] = \langle \rho(x), \rho(x')\rangle_{\mathcal{H}}$ | $k(x, x') = \langle k(\cdot, x), k(\cdot, x')\rangle_{\mathcal{F}}$ |
| $\mathcal{F} = \{f_M \mid f_M(\cdot) = \operatorname{Tr}[\rho(\cdot)M],\ M = M^\dagger\}$ | $\mathcal{F} = \operatorname{Span}\big(\{k(\cdot, x) \mid x \in \mathcal{X}\}\big)$ |

**Quantum kernels.** The class of functions we consider are RKHS functions where the kernel is expressed by a quantum computation. The key observation is that (4) is linear in $\rho(x)$. Instead of optimizing over the parametric function class $\mathcal{F}_\Theta$, we can define the nonparametric class of functions $\mathcal{F} = \{f_M \mid f_M(\cdot) = \operatorname{Tr}[\rho(\cdot)M],\ M = M^\dagger\}$.² To endow this function class with the structure of an RKHS, observe that the expression $\operatorname{Tr}[\rho_1\rho_2]$ defines a scalar product on density matrices.
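To illustrate this inner product, a small numpy sketch (the amplitudes are arbitrary example values, not tied to any encoding introduced so far): for pure states $\rho_i = |\psi_i\rangle\langle\psi_i|$, the quantity $\operatorname{Tr}[\rho_1\rho_2]$ reduces to the squared overlap $|\langle\psi_1|\psi_2\rangle|^2 \in [0, 1]$.

```python
import numpy as np

def density_matrix(psi):
    # rho = |psi><psi| for a normalized state vector psi
    psi = psi / np.linalg.norm(psi)
    return np.outer(psi, psi.conj())

# two single-qubit pure states |psi> = alpha|0> + beta|1>
psi1 = np.array([np.cos(0.3), 1j * np.sin(0.3)])
psi2 = np.array([np.cos(1.1), 1j * np.sin(1.1)])
rho1, rho2 = density_matrix(psi1), density_matrix(psi2)

# Hilbert-Schmidt inner product Tr[rho1 rho2]
hs = np.trace(rho1 @ rho2).real
# for pure states this equals the squared overlap |<psi1|psi2>|^2
overlap_sq = np.abs(np.vdot(psi1, psi2)) ** 2
print(hs, overlap_sq)  # both ~= cos(0.8)^2 for these example amplitudes
```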
We then define kernels via the inner product of data-dependent density matrices:

**Definition 1** (Quantum Kernel [7, 8, 11]). Let $\rho: x \mapsto \rho(x)$ be a fixed feature mapping from $\mathcal{X}$ to density matrices. Then the corresponding quantum kernel is $k(x, x') = \operatorname{Tr}[\rho(x)\rho(x')]$.

The Representer Theorem [34] reduces the empirical risk minimization over the exponentially large function class $\mathcal{F}$ to an optimization problem with a set of parameters whose dimensionality corresponds to the training set size. Since the ridge regression objective is convex (and so are many other common objective functions in ML), this can be solved efficiently with a classical computer. In the described setting, the quantum computer is only used to estimate the kernel. For pure-state encodings, this is done by inverting the data encoding transformation (taking its conjugate transpose) and measuring the probability that the resulting state equals the initial state $\rho_0$. To see this we use the cyclic property of the trace:

$$k(x, x') = \operatorname{Tr}[\rho(x)\rho(x')] = \operatorname{Tr}\big[U(x)\rho_0 U^\dagger(x)\, U(x')\rho_0 U^\dagger(x')\big] = \operatorname{Tr}\big[U^\dagger(x') U(x)\, \rho_0\, U^\dagger(x) U(x')\, \rho_0\big].$$

If $\rho_0 = (|0\rangle\langle 0|)^{\otimes d}$, then $k(x, x')$ corresponds to the probability of observing every qubit in the 0 state after the initial state was evolved with $U^\dagger(x')U(x)$. To evaluate the kernel, we thus need to estimate this probability from a finite number of measurements. For our theoretical analysis we work with the exact value of the kernel, and in our experiments we also simulate the full quantum state. We discuss the difficulties related to measurements in Sec. 4.

## 4 The inductive bias of simple quantum kernels

We now study the inductive bias of simple quantum kernels and their learning performance. We first give a high-level discussion of a general hurdle for quantum machine learning models to surpass classical methods and then analyze two specific kernel approaches in more detail.

**Continuity in classical machine learning.** Arguably the most important biases in nonparametric regression are continuity assumptions on the regression function. This becomes particularly apparent in, e.g., nearest neighbour regression or random regression forests [42], where the regression function is a weighted average of close points. Here we want to emphasize that there is a long list of results concerning the minimax optimality of kernel methods for regression problems [43–45]. In particular, these results show that asymptotically the convergence of kernel ridge regression of, e.g., Sobolev functions reaches the statistical limits, which also apply to any quantum method.

² $\mathcal{F}$ is defined for a fixed feature mapping $x \mapsto \rho(x)$. Although $M$ is finite dimensional and thus $\mathcal{F}$ can be seen as a parametric function class, we will be interested in the case where $M$ is exponentially large in $d$ and we can only access functions from $\mathcal{F}$ implicitly. Therefore we refer to it as a nonparametric class of functions.

**A simple quantum kernel.** We now restrict our attention to rather simple kernels to facilitate a theoretical analysis. As indicated above, we consider data in $\mathcal{X} \subseteq \mathbb{R}^d$ and we assume that the distribution $\mu$ of the data factorizes over the coordinates (i.e., $\mu$ can be written as $\mu = \bigotimes_i \mu_i$). This data is embedded in a $d$-qubit quantum circuit. Let us emphasize here that the RKHS based on a quantum state of $d$ qubits is at most $4^d$-dimensional, i.e., finite dimensional, and in the infinite data limit $n \to \infty$ standard convergence guarantees from parametric statistics apply. Here we consider growing dimension $d \to \infty$ and a sample size polynomial in the dimension, $n = n(d) \in \operatorname{Poly}(d)$.
In particular, the sample size $n \ll 4^d$ will be much smaller than the dimension of the feature space, and bounds from the parametric literature do not apply. Here we consider embeddings where each coordinate is embedded into a single qubit using a map $\phi_i$, followed by an arbitrary unitary transformation $V$, so that we can express the embedding in the quantum Hilbert space as $|\psi_V(x)\rangle = V \bigotimes_i |\phi_i(x_i)\rangle$ with corresponding density matrix (feature map)³

$$\rho_V(x) = |\psi_V(x)\rangle\langle\psi_V(x)|. \qquad (5)$$

³ When we can ignore $V$, we simply assume $V = \mathrm{id}$ and write $\rho(x)$ instead of $\rho_V(x)$. For the kernel, since $V^\dagger V = \mathrm{id} = V V^\dagger$ and due to the cyclic property of the trace, we have $k_V(x, x') = \operatorname{Tr}[\rho_V(x)\rho_V(x')] = \operatorname{Tr}[V\rho(x)V^\dagger\, V\rho(x')V^\dagger] = \operatorname{Tr}[V^\dagger V\rho(x)\, V^\dagger V\rho(x')] = \operatorname{Tr}[\rho(x)\rho(x')] = k(x, x')$.

Note that the corresponding kernel $k(x, x') = \operatorname{Tr}[\rho(x)\rho(x')]$ is independent of $V$ and factorizes, $k(x, x') = \operatorname{Tr}\big[\bigotimes_i \rho_i(x_i)\, \bigotimes_i \rho_i(x'_i)\big] = \prod_i \operatorname{Tr}[\rho_i(x_i)\rho_i(x'_i)]$, where $\rho_i(x_i) = |\phi_i(x_i)\rangle\langle\phi_i(x_i)|$. The product structure of the kernel allows us to characterize the RKHS generated by $k$ based on the one-dimensional case. The embedding of a single variable can be parametrized by complex-valued functions $a(x)$, $b(x)$ as

$$|\phi_i(x)\rangle = a(x)|0\rangle + b(x)|1\rangle. \qquad (6)$$

One important object characterizing this embedding turns out to be its mean density matrix, given by $\rho_{\mu_i} = \int \rho_i(y)\, \mu_i(dy) = \int |\phi_i(y)\rangle\langle\phi_i(y)|\, \mu_i(dy)$. This can be identified with the kernel mean embedding of the distribution [46]. Note that for a factorizing base measure $\mu$ the factorization $\rho_\mu = \bigotimes_i \rho_{\mu_i}$ holds. Let us give a concrete example to clarify the setting, see Fig. 1(b).

**Example 2** ([11, Example III.1]). We consider the cosine kernel, where $a(x) = \cos(x/2)$ and $b(x) = i\sin(x/2)$. This embedding can be realized using a single quantum $R_X(x) = \exp\!\big(i\tfrac{x}{2}\sigma_x\big)$ gate such that $|\psi(x)\rangle = R_X(x)|0\rangle = \cos(x/2)|0\rangle + i\sin(x/2)|1\rangle$. In this case the kernel is given by

$$k(x, x') = \big|\langle 0| R_X^\dagger(x') R_X(x) |0\rangle\big|^2 = \big|\cos(\tfrac{x}{2})\cos(\tfrac{x'}{2}) + \sin(\tfrac{x}{2})\sin(\tfrac{x'}{2})\big|^2 = \cos^2\!\Big(\frac{x - x'}{2}\Big). \qquad (7)$$

As a reference measure $\mu$ we consider the uniform measure on $[-\pi, \pi]$. Then the mean density matrix is the completely mixed state $\rho_\mu = \tfrac{1}{2}\mathrm{id}$. For $\mathbb{R}^d$-valued data whose coordinates are encoded independently, the kernel is given by $k(x, x') = \prod_i \cos^2\big((x_i - x'_i)/2\big)$ and $\rho_\mu = 2^{-d}\,\mathrm{id}_{2^d \times 2^d}$. We emphasize that due to the kernel trick this kernel can be evaluated classically in runtime $O(d)$.

**Quantum RKHS.** We now characterize the RKHS and the eigenvalues of the integral operator for quantum kernels. The RKHS consists of all functions $f \in \mathcal{F}$ that can be written as $f(x) = \operatorname{Tr}[\rho(x)M]$, where $M \in \mathbb{C}^{2^d \times 2^d}$ is a Hermitian operator. Using this characterization of the finite-dimensional RKHS we can rewrite the infinite-dimensional eigenvalue problem of the integral operator as a finite-dimensional problem. The action of the corresponding integral operator on $f$ can be written as

$$(\mathcal{K}f)(x) = \int f(y)\, k(y, x)\, \mu(dy) = \int \operatorname{Tr}[M\rho(y)]\operatorname{Tr}[\rho(y)\rho(x)]\, \mu(dy) = \int \operatorname{Tr}\big[(M \otimes \rho(x))(\rho(y) \otimes \rho(y))\big]\, \mu(dy) = \operatorname{Tr}\Big[(M \otimes \rho(x)) \int \rho(y) \otimes \rho(y)\, \mu(dy)\Big]. \qquad (8)$$

We denote by $O_\mu = \int \rho(y) \otimes \rho(y)\, \mu(dy)$ the operator for which $\operatorname{Tr}[O_\mu] = 1$ holds. Then we can write

$$(\mathcal{K}f)(x) = \operatorname{Tr}\big[O_\mu (M \otimes \rho(x))\big] = \operatorname{Tr}\big[O_\mu (M \otimes \mathrm{id})(\mathrm{id} \otimes \rho(x))\big] = \operatorname{Tr}\big[\operatorname{Tr}_1[O_\mu (M \otimes \mathrm{id})]\, \rho(x)\big], \qquad (9)$$

where $\operatorname{Tr}_1[\cdot]$ refers to the partial trace over the first factor. For the definition and a proof of the last equality we refer to Appendix A. The eigenvalues of $\mathcal{K}$ can now be identified with the eigenvalues of the linear map $T_\mu$ mapping $M \mapsto \operatorname{Tr}_1[O_\mu(M \otimes \mathrm{id})]$. As shown in the appendix, there is an eigendecomposition such that $T_\mu(M) = \sum_i \lambda_i A_i \operatorname{Tr}[A_i M]$, where the $A_i$ are orthonormal Hermitian matrices (for details, a proof and an example we refer to Appendix C).
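As a numerical illustration of this finite-dimensional reduction, the following sketch approximates $O_\mu$ and the map $T_\mu$ for the single-qubit cosine embedding of Example 2 (uniform measure on $[-\pi, \pi]$). Since $k(x, x') = \tfrac12 + \tfrac12\cos(x - x')$, the recovered spectrum should be approximately $\{1/2, 1/4, 1/4, 0\}$; the quadrature grid and the Pauli basis used for the matrix representation are implementation choices, not part of the formal development.

```python
import numpy as np

# single-qubit cosine embedding of Example 2: |phi(x)> = cos(x/2)|0> + i sin(x/2)|1>
def rho(x):
    psi = np.array([np.cos(x / 2), 1j * np.sin(x / 2)])
    return np.outer(psi, psi.conj())

# O_mu = E_x[ rho(x) (x) rho(x) ] under the uniform measure on [-pi, pi],
# approximated by a Riemann sum over a fine grid
xs = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
O_mu = np.mean([np.kron(rho(x), rho(x)) for x in xs], axis=0)

def T_mu(M):
    # T_mu(M) = Tr_1[ O_mu (M (x) id) ], partial trace over the first tensor factor
    prod = O_mu @ np.kron(M, np.eye(2))
    return np.einsum('ijil->jl', prod.reshape(2, 2, 2, 2))

# represent T_mu as a 4x4 matrix in the orthonormal Hermitian basis {I, X, Y, Z}/sqrt(2)
paulis = [np.eye(2), np.array([[0, 1], [1, 0]]),
          np.array([[0, -1j], [1j, 0]]), np.array([[1, 0], [0, -1]])]
basis = [P / np.sqrt(2) for P in paulis]
T = np.array([[np.trace(A @ T_mu(B)).real for B in basis] for A in basis])

print(np.round(np.linalg.eigvalsh(T), 3))  # should be approx [0, 0.25, 0.25, 0.5]
```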
The eigenfunctions of $\mathcal{K}$ are given by $f_i(x) = \operatorname{Tr}[\rho(x)A_i]$. We now state a bound that controls the largest eigenvalue of the integral operator $\mathcal{K}$ in terms of the eigenvalues of the mean density matrix $\rho_\mu$ (proof in Appendix C.2).

**Lemma 1.** The largest eigenvalue $\gamma_{\max}$ of $\mathcal{K}$ satisfies the bound $\gamma_{\max} \le \sqrt{\operatorname{Tr}\big[\rho_\mu^2\big]}$.

The lemma above shows that the squared eigenvalues of $\mathcal{K}$ are bounded by $\operatorname{Tr}[\rho_\mu^2]$, an expression known as the purity [36] of the density matrix $\rho_\mu$, which measures the diversity of the data embedding. On the other hand, the eigenvalues of $\mathcal{K}$ are closely related to the learning guarantees of kernel ridge regression. In particular, standard generalization bounds for kernel ridge regression [47] become vacuous when $\gamma_{\max}$ is exponentially smaller than the inverse of the training sample size (if $\operatorname{Tr}[\mathcal{K}] = 1$, which holds for pure-state embeddings). The next result shows that this is not just a matter of bounds.

**Theorem 1.** Suppose the purity of the embeddings $\rho_{\mu_i}$ satisfies $\operatorname{Tr}\big[\rho_{\mu_i}^2\big] \le \delta < 1$ as the dimension and number of qubits $d$ grows. Furthermore, suppose the training sample size only grows polynomially in $d$, i.e., $n \le d^l$ for some fixed $l \in \mathbb{N}$. Then there exists $d_0 = d_0(\delta, l, \varepsilon)$ such that for all $d \ge d_0$ no function can be learned using kernel ridge regression with the $d$-qubit kernel $k(x, x') = \operatorname{Tr}[\rho(x)\rho(x')]$, in the sense that for any $f^* \in L^2$, with probability at least $1 - \varepsilon$, for all $\lambda \ge 0$,

$$R(\hat{f}^{\lambda}_n) \ge (1 - \varepsilon)\, \|f^*\|^2. \qquad (10)$$

The proof of the theorem can be found in Appendix D. It relies on a general result (Theorem 3 in Appendix D) which shows that for any (not necessarily quantum) kernel the solution of kernel ridge regression cannot generalize when the largest eigenvalue in the Mercer decomposition is sufficiently small (depending on the sample size). The proof of Theorem 1 then essentially boils down to proving a bound on the largest eigenvalue using Lemma 1. Theorem 1 implies that generalization is only possible when the mean embedding of most coordinates is close to a pure state, i.e., when the embedding $x \mapsto |\phi_i(x)\rangle$ is almost constant. To make learning from data feasible we cannot use the full expressive power of the quantum Hilbert space; instead, only very restricted embeddings allow us to learn from data. This generalizes an observation already made in [12]. Since classical methods can also handle high- and infinite-dimensional RKHSs, the same problem occurs for classical kernels, where one solution is to adapt the bandwidth of the kernel to control the expressivity of the RKHS. In principle this is also possible in the quantum context, e.g., for the cosine embedding.

**Biased kernels.** We have discussed that without any inductive bias, the introduced quantum kernel cannot learn any function for large $d$. One suggestion to reduce the expressive power of the kernel is the use of projected kernels [12]. They are defined using reduced density matrices given by $\rho^V_m(x) = \operatorname{Tr}_{m+1\ldots d}[\rho_V(x)]$, where $\operatorname{Tr}_{m+1\ldots d}[\cdot]$ denotes the partial trace over qubits $m+1$ to $d$ (definition in Appendix A). One then considers the usual quantum kernel for this embedding, $q^V_m(x, x') = \operatorname{Tr}\big[\rho^V_m(x)\, \rho^V_m(x')\big]$. Physically, this corresponds to just measuring the first $m$ qubits, and the functions $f$ in the RKHS can be written in terms of a Hermitian operator $M$ acting on $m$ qubits, so that $f(x) = \operatorname{Tr}[\rho_V(x)(M \otimes \mathrm{id})] = \operatorname{Tr}[\rho^V_m(x) M]$. If $V$ is sufficiently complex, it is assumed that $f$ is hard to compute classically [48]. Indeed, the above procedure reduces the generalization gap.
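A minimal numpy/scipy sketch of the reduced density matrix and the biased kernel $q^V_1$; the product-state cosine encoding, the Haar-random $V$ drawn with scipy, and the helper names are illustrative assumptions rather than the exact circuit of Figure 1.

```python
import numpy as np
from scipy.stats import unitary_group

def embed(x):
    # product-state embedding: each coordinate into one qubit via the cosine embedding
    psi = np.array([1.0 + 0j])
    for xi in x:
        psi = np.kron(psi, np.array([np.cos(xi / 2), 1j * np.sin(xi / 2)]))
    return psi

def reduced_rho_first_qubit(psi, V):
    # rho^V_1(x) = Tr_{2...d}[ V |psi><psi| V^dagger ]
    d = int(np.log2(len(psi)))
    phi = V @ psi
    rho = np.outer(phi, phi.conj()).reshape(2, 2**(d - 1), 2, 2**(d - 1))
    return np.einsum('ajbj->ab', rho)           # partial trace over qubits 2..d

def biased_kernel(x, x_prime, V):
    # q^V_1(x, x') = Tr[ rho^V_1(x) rho^V_1(x') ]
    r1 = reduced_rho_first_qubit(embed(x), V)
    r2 = reduced_rho_first_qubit(embed(x_prime), V)
    return np.trace(r1 @ r2).real

d = 8
rng = np.random.default_rng(0)
V = unitary_group.rvs(2**d, random_state=0)     # Haar-random unitary on d qubits
x, x_prime = rng.uniform(0, 2 * np.pi, size=(2, d))
print(biased_kernel(x, x_prime, V))             # close to 1/2 for generic V, cf. Theorem 2a)
```

For a generic $V$ the printed value concentrates around $2^{-m} = 1/2$, which already foreshadows the measurement problem discussed below.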
This, however, comes at the price of an increased approximation error if the remaining RKHS cannot fully express the target function $f^*$ anymore, i.e., the learned function underfits. Without any reason to believe that the target function is well represented by the projected kernel, we cannot hope for a performance improvement by simply reducing the size of the RKHS in an arbitrary way. However, if we know something about the data-generating process, then this can lead to a meaningful inductive bias. For the projected kernel this could be the knowledge that the target function can be expressed as $f^*(x) = \operatorname{Tr}[\rho^V_m(x) M^*]$, see Fig. 1. In this case using $q^V_m$ improves the generalization error without increasing the approximation error. To emphasize this, we will henceforth refer to $q^V_m$ as the biased kernel.

Figure 2: Left: Spectral behavior of the biased kernel $q$ (eigenvalues $\gamma_0, \ldots, \gamma_3$ as a function of the number of qubits $d$), see Theorem 2b) and Equation (11). Right: Mean squared error (train and test) as a function of the number of qubits $d$. The biased kernel $q$, equipped with prior knowledge, easily learns the function for an arbitrary number of qubits and achieves optimal mean squared error (MSE). Models that are ignorant of the structure of $f^*$ fail to learn the function. The classical kernel $k_{\mathrm{rbf}}$ and the full quantum kernel $k$ overfit (they have low training error, but large test error). The biased kernel on the wrong qubit, $q_w$, has little capacity and the wrong bias and thus underfits (training and test error essentially overlap).

We now investigate the RKHS for reduced density matrices where $V$ is a Haar-distributed random unitary matrix (proof in Appendix E).

**Theorem 2.** Suppose $V$ is distributed according to the Haar measure on the group of unitary matrices. Fix $m$. Then the following two statements hold:

a) The reduced density operator satisfies, with high probability, $\rho^V_m = 2^{-m}\,\mathrm{id} + O(2^{-d/2})$, and the projected kernel satisfies, with high probability, $q^V_m(x, x') = 2^{-m} + O(2^{-d/2})$ as $d \to \infty$.

b) Let $T^V_{\mu,m}$ denote the linear integral operator for the kernel $q^V_m$ as defined above. Then the averaged operator $\mathbb{E}_V\, T^V_{\mu,m}$ has one eigenvalue $2^{-m} + O(2^{-2d})$ whose eigenfunction is constant (up to higher-order terms of order $O(2^{-2d})$), and $2^{2m} - 1$ eigenvalues $2^{-m-d} + O(2^{-2d})$.

The averaged integral operator in the second part of the result is not directly meaningful, however it gives some indication of the behavior of the operators $T^V_{\mu,m}$. In particular, we expect a similar result to hold for most $V$ if the mean embedding $\rho_\mu$ is sufficiently mixed. A proof of this result would require us to bound the variance of the matrix elements of $T^V_{\mu,m}$, which is possible using standard formulas for expectations of polynomials over the unitary group, but lengthy. Note that the dimension of the RKHS for the biased kernel $q^V_m$ with $m$ qubits is bounded by $4^m$. This implies that learning is possible when projecting to sufficiently low-dimensional biased kernels such that the training sample size satisfies $n \gtrsim 4^m \ge \dim(\mathcal{F})$.

Let us now focus on the case $m = 1$, that is, the biased kernel is solely defined via the first qubit. Assuming that Theorem 2b) also holds for fixed $V$, we can assume that the biased kernel has the form

$$q(x, x') := q^V_1(x, x') = \gamma_0\,\phi_0(x)\phi_0(x') + \sum_{i=1}^{3} \gamma_i\,\phi_i(x)\phi_i(x'), \qquad (11)$$

where $\gamma_0 = 1/2 + O(2^{-2d})$ and $\phi_0(x) = 1$ is the constant function up to terms of order $O(2^{-2d})$. For $i = 1, 2, 3$ we have $\gamma_i = O(2^{-d-1}) = O(2^{-d})$ (Fig. 2),
and $\phi_i$ is a function that is conjectured to be exponentially hard in $d$ to compute classically [48]. It is thus impossible to include a bias towards those three eigenfunctions classically. On the other hand, we can also include a strong bias towards the constant eigenfunction classically. The straightforward way to do this is to center the data in the RKHS (see Appendix B for details).

**Barren plateaus.** Another conclusion from Theorem 2a) is that the fluctuations of the reduced density matrix around its mean vanish exponentially in the number of qubits. In practice the value of the kernel would be determined by measurements, and exponentially many measurements are necessary to obtain exponential accuracy of the kernel function. The theorem therefore suggests that it is not possible in practice to learn anything beyond the constant function from generic biased kernels for (modestly) large values of $d$. This observation is closely related to the fact that for many quantum neural network architectures the gradient of the parameters with respect to the loss vanishes exponentially with the system size $d$, a phenomenon known as Barren plateaus [24, 49].

## 5 Experiments

Since for small $d$ we can simulate the biased kernel efficiently, we illustrate our theoretical findings in the following experiments. Our implementation, building on standard open source packages [50, 51], is available online at https://github.com/jmkuebler/quantumbias. We consider the case described above where we know that the data was generated by measuring an observable on the first qubit, i.e., $f^*(x) = \operatorname{Tr}[\rho^V_1(x) M^*]$, but we do not know $M^*$, see Fig. 1. We use the full kernel $k$ and the biased kernel $q$ for the case $m = 1$. To show the effect of selecting the wrong bias, we also include the behavior of a biased kernel defined only on the second qubit, denoted $q_w$. As a classical reference we also include the performance of a radial basis function kernel $k_{\mathrm{rbf}}(x, x') = \exp(-\|x - x'\|^2/2)$.

For the experiments we fix a single-qubit observable $M^* = \sigma_z$ and perform the experiment for a varying number $d$ of qubits. First we draw a random unitary $V$. The dataset is then generated by drawing $N = 200$ realizations $\{x^{(i)}\}_{i=1}^N$ from the $d$-dimensional uniform distribution on $[0, 2\pi]^d$. We then define the labels as $y^{(i)} = c f^*(x^{(i)}) + \epsilon^{(i)}$, where $f^*(x) = \operatorname{Tr}[\rho^V_1(x)\,\sigma_z]$, $\epsilon^{(i)}$ is Gaussian noise with $\operatorname{Var}[\epsilon] = 10^{-4}$, and $c$ is chosen such that $\operatorname{Var}[c f^*(X)] = 1$. Keeping the variances fixed ensures that we can interpret the behavior for varying $d$.

We first verify our findings from Theorem 2b) and Equation (11) by estimating the spectrum of $q$. Fig. 2 (left) shows that Theorem 2b) also holds for individual $V$ with high probability. We then use 2/3 of the data for training kernel ridge regression (we fit the mean separately) with preset regularization, and use 1/3 to estimate the test error. We average the results over ten random seeds (random $V$, $x^{(i)}$, $\epsilon^{(i)}$); the results are reported in the right panel of Fig. 2. This showcases that as the number of qubits increases, it is impossible to learn $f^*$ without the appropriate spectral bias. $k$ and $k_{\mathrm{rbf}}$ have too little bias and overfit, whereas $q_w$ has the wrong bias and severely underfits. The performance of $q_w$ underlines that randomly biasing the kernel does not significantly improve the performance over the full kernel $k$. In the appendix we show that this is not due to a bad choice of regularization, by reporting cherry-picked results over a range of regularizations.
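For reference, a condensed sketch of this data-generating process (again with the cosine encoding; the rescaling constant is estimated from the sample rather than computed analytically, and the snippet is a simplified stand-in for the released implementation):

```python
import numpy as np
from scipy.stats import unitary_group

sigma_z = np.diag([1.0, -1.0])

def embed(x):
    # product-state cosine embedding of x in d qubits
    psi = np.array([1.0 + 0j])
    for xi in x:
        psi = np.kron(psi, np.array([np.cos(xi / 2), 1j * np.sin(xi / 2)]))
    return psi

def f_star(x, V):
    # f*(x) = Tr[ rho_V(x) (sigma_z (x) id) ] = <psi| V^dagger (sigma_z (x) id) V |psi>
    phi = V @ embed(x)
    M = np.kron(sigma_z, np.eye(len(phi) // 2))
    return np.real(np.vdot(phi, M @ phi))

d, N = 5, 200
rng = np.random.default_rng(0)
V = unitary_group.rvs(2**d, random_state=0)
X = rng.uniform(0, 2 * np.pi, size=(N, d))
f_vals = np.array([f_star(x, V) for x in X])
c = 1.0 / np.std(f_vals)                          # rescale so that Var[c f*(X)] is ~1
y = c * f_vals + np.sqrt(1e-4) * rng.standard_normal(N)
# y can now be fed to kernel ridge regression with the full kernel k,
# the biased kernel q (first qubit), or a classical RBF kernel, as in Figure 2.
```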
To further illustrate the spectral properties, we empirically estimate the kernel target alignment [30] and the task-model alignment [32] that we introduced in Sec. 2. By using the centered kernel matrix (see App. B) and centering the data we can ignore the first eigenvalue in (11), corresponding to the constant function. In Figure 3 (left) we show the empirical (centered) kernel target alignment for 50 random seeds. The biased kernel is the only one well aligned with the task. The right panel of Fig. 3 shows the task-model alignment. This shows that $f^*$ can be completely expressed with the first four components of the biased kernel, while the other kernels essentially need the entire spectrum (we use a sample size of 200, hence the empirical kernel matrix is only 200-dimensional) and thus are unable to learn. Note that the RKHS of the kernel $q_w$ is four-dimensional, so the higher contributions correspond to functions outside its RKHS that it cannot learn at all.

Figure 3: Histogram of the kernel target alignment $A$ over 50 runs (left) and task-model alignment $C(i)$ (right) for $d = 7$.

## 6 Discussion

We provided an analysis of the reproducing kernel Hilbert space (RKHS) and the inductive bias of quantum kernel methods. Rather than the dimensionality of the RKHS, its spectral properties determine whether learning is feasible. Working with exponentially large RKHSs comes with the risk of having a correspondingly small inductive bias. This situation indeed occurs for naive quantum encodings and hinders learning unless datasets are of exponential size. To enable learning, we necessarily need to consider models with a stronger inductive bias. Encoding a bias towards continuous functions is likely not a promising path to a quantum advantage, as this is where classical machine learning models excel. Our results suggest that we can only achieve a quantum advantage if we know something about the data-generating process that we cannot efficiently encode classically, yet are able to use this information to bias the quantum model. We indeed observe an exponential advantage in the case where we know that the data comes from a single-qubit observable and constrain the RKHS accordingly. However, we find that evaluating the kernel requires exponentially many measurements, an issue related to the Barren Plateaus encountered in quantum neural networks.

With fully error-corrected quantum computers it becomes feasible to define kernels with a strong bias that do not require exponentially many measurements. An example of this kind was recently presented by Liu et al. [10]: here one knows that the target function is extremely simple after computing the discrete logarithm. A quantum computer is able to encode this inductive bias by using an efficient algorithm for computing the discrete logarithm. However, even for fully coherent quantum computers it is unclear how we can reasonably encode a strong inductive bias about a classical dataset (e.g., images of cancer cells, climate data, etc.). The situation might be better when working with quantum data, i.e., data that is collected via observations of systems at a quantum mechanical scale. To summarize, we conclude that there is no indication that quantum machine learning will substantially improve supervised learning on classical datasets.
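As a supplement to the empirical estimates reported in Figure 3, the following minimal sketch shows one way to compute such alignment estimates from a kernel matrix and labels; centering follows Appendix B, and the eigenvector-based plug-in estimate of $C(i)$ is a straightforward choice, not necessarily the exact evaluation used for the figure.

```python
import numpy as np

def center_kernel(K):
    # K_c = H K H with H = I - (1/n) 1 1^T (centering in feature space, cf. App. B)
    n = len(K)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_target_alignment(K, y):
    # centered kernel-target alignment A = <K_c, y y^T> / (||K_c|| ||y y^T||), cf. [30, 31]
    K_c = center_kernel(K)
    y_c = y - y.mean()
    return (y_c @ K_c @ y_c) / (np.linalg.norm(K_c) * np.dot(y_c, y_c))

def task_model_alignment(K, y):
    # empirical C(i): cumulative fraction of the (centered) target captured by the
    # leading eigenvectors of the centered kernel matrix, cf. [32]
    K_c = center_kernel(K)
    y_c = y - y.mean()
    eigvals, eigvecs = np.linalg.eigh(K_c)
    order = np.argsort(eigvals)[::-1]            # sort eigenvectors by decreasing eigenvalue
    coeffs = eigvecs[:, order].T @ y_c
    return np.cumsum(coeffs**2) / np.sum(coeffs**2)
```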
## Acknowledgments and Disclosure of Funding

The authors thank the anonymous reviewers for their helpful comments that made the theorems and their proofs more accessible. This work was in part supported by the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B) and the Machine Learning Cluster of Excellence, EXC number 2064/1, project number 390727645.

## References

[1] John Preskill. Quantum computing in the NISQ era and beyond. Quantum, 2:79, 2018.
[2] Frank Arute, Kunal Arya, Ryan Babbush, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
[3] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and Jeremy L. O'Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5:4213, 2014.
[4] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm. arXiv:1411.4028, 2014.
[5] Kosuke Mitarai, Makoto Negoro, Masahiro Kitagawa, and Keisuke Fujii. Quantum circuit learning. Phys. Rev. A, 98:032309, 2018.
[6] Edward Farhi and Hartmut Neven. Classification with quantum neural networks on near term processors. arXiv:1802.06002, 2018.
[7] Vojtěch Havlíček, Antonio D. Córcoles, Kristan Temme, Aram W. Harrow, Abhinav Kandala, Jerry M. Chow, and Jay M. Gambetta. Supervised learning with quantum-enhanced feature spaces. Nature, 567(7747):209, 2019.
[8] Maria Schuld and Nathan Killoran. Quantum machine learning in feature Hilbert spaces. Phys. Rev. Lett., 122:040504, 2019.
[9] Jonas M. Kübler, Krikamol Muandet, and Bernhard Schölkopf. Quantum mean embedding of probability distributions. Phys. Rev. Research, 1:033159, 2019.
[10] Yunchao Liu, Srinivasan Arunachalam, and Kristan Temme. A rigorous and robust quantum speed-up in supervised machine learning. Nature Physics, 17(9):1013–1017, 2021.
[11] Maria Schuld. Quantum machine learning models are kernel methods. arXiv:2101.11020, 2021.
[12] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hartmut Neven, and Jarrod R. McClean. Power of data in quantum machine learning. Nature Communications, 12(1):2631, 2021.
[13] Aram W. Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Phys. Rev. Lett., 103(15):150502, 2009.
[14] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Phys. Rev. Lett., 113(13), 2014.
[15] Nathan Wiebe, Daniel Braun, and Seth Lloyd. Quantum algorithm for data fitting. Phys. Rev. Lett., 109:050505, 2012.
[16] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning. arXiv:1307.0411, 2013.
[17] Iordanis Kerenidis and Anupam Prakash. Quantum gradient descent for linear systems and least squares. Phys. Rev. A, 101:022316, 2020.
[18] Carlo Ciliberto, Andrea Rocchetto, Alessandro Rudi, and Leonard Wossnig. Statistical limits of supervised quantum learning. Physical Review A, 102(4), 2020.
[19] Gian Giacomo Guerreschi and Mikhail Smelyanskiy. Practical optimization for hybrid quantum-classical algorithms. arXiv:1701.01450, 2017.
[20] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. Evaluating analytic gradients on quantum hardware. Physical Review A, 99(3):032331, 2019.
[21] James Stokes, Josh Izaac, Nathan Killoran, and Giuseppe Carleo. Quantum natural gradient. Quantum, 4:269, 2020.
[22] Ryan Sweke, Frederik Wilde, Johannes Jakob Meyer, Maria Schuld, Paul K. Fährmann, Barthélémy Meynard-Piganeau, and Jens Eisert. Stochastic gradient descent for hybrid quantum-classical optimization. Quantum, 4:314, 2020.
[23] Jonas M. Kübler, Andrew Arrasmith, Lukasz Cincio, and Patrick J. Coles. An adaptive optimizer for measurement-frugal variational algorithms. Quantum, 4:263, 2020.
[24] Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush, and Hartmut Neven. Barren plateaus in quantum neural network training landscapes. Nature Communications, 9:4812, 2018.
[25] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, USA, 2002.
[26] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[27] Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. Effect of data encoding on the expressive power of variational quantum-machine-learning models. Phys. Rev. A, 103:032430, 2021.
[28] Hsin-Yuan Huang, Richard Kueng, and John Preskill. Information-theoretic bounds on quantum advantage in machine learning. Phys. Rev. Lett., 126:190505, 2021.
[29] Robert C. Williamson, Alexander J. Smola, and Bernhard Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47(6):2516–2532, 2001.
[30] Nello Cristianini, Jaz Kandola, Andre Elisseeff, and John Shawe-Taylor. On kernel target alignment. In Dawn E. Holmes and Lakhmi C. Jain, editors, Innovations in Machine Learning: Theory and Applications, pages 205–256. Springer Berlin Heidelberg, 2006.
[31] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(28):795–828, 2012.
[32] Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1):2914, 2021.
[33] Arthur Jacot, Berfin Simsek, Francesco Spadaro, Clement Hongler, and Franck Gabriel. Kernel alignment risk estimator: Risk prediction from training data. In NeurIPS, 2020.
[34] Bernhard Schölkopf, Ralf Herbrich, and Alexander J. Smola. A generalized representer theorem. In COLT, 2001.
[35] Thomas Hubregtsen, David Wierichs, Elies Gil-Fuster, Peter-Jan H. S. Derks, Paul K. Faehrmann, and Johannes Jakob Meyer. Training quantum embedding kernels on near-term quantum computers. arXiv:2105.02276, 2021.
[36] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press, 2010.
[37] Xinbiao Wang, Yuxuan Du, Yong Luo, and Dacheng Tao. Towards understanding the power of quantum kernels in the NISQ era. Quantum, 5:531, 2021.
[38] Evan Peters, João Caldeira, Alan Ho, Stefan Leichenauer, Masoud Mohseni, Hartmut Neven, Panagiotis Spentzouris, Doug Strain, and Gabriel N. Perdue. Machine learning of high dimensional data on a noisy quantum processor. arXiv:2101.09581, 2021.
[39] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster, and José I. Latorre. Data re-uploading for a universal quantum classifier. Quantum, 4:226, 2020.
[40] M. Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C. Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R. McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, and Patrick J. Coles. Variational quantum algorithms. Nature Reviews Physics, 3(9):625–644, 2021.
[41] Kishor Bharti, Alba Cervera-Lierta, Thi Ha Kyaw, et al. Noisy intermediate-scale quantum (NISQ) algorithms. arXiv:2101.08448, 2021.
[42] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, 2001.
[43] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[44] Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In COLT, 2009.
[45] Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21(205):1–38, 2020.
[46] Krikamol Muandet, Kenji Fukumizu, Bharath K. Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Found. Trends Mach. Learn., 10(1–2):1–141, 2017.
[47] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2012.
[48] Aram W. Harrow and Ashley Montanaro. Quantum computational supremacy. Nature, 549(7671):203–209, 2017.
[49] M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, and Patrick J. Coles. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nature Communications, 12(1):1791, 2021.
[50] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[51] Ville Bergholm, Josh Izaac, Maria Schuld, Christian Gogolin, and Nathan Killoran. PennyLane: Automatic differentiation of hybrid quantum-classical computations. arXiv:1811.04968, 2018.
[52] Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2016.
[53] Zbigniew Puchała and Jarosław A. Miszczak. Symbolic integration with respect to the Haar measure on the unitary groups. Bulletin of the Polish Academy of Sciences: Technical Sciences, 65(1):21–27, 2017.
[54] Christoph Dankert, Richard Cleve, Joseph Emerson, and Etera Livine. Exact and approximate unitary 2-designs and their application to fidelity estimation. Phys. Rev. A, 80:012304, 2009.