# hierarchical_kernels_in_deep_kernel_learning__cdaf45e0.pdf Journal of Machine Learning Research 24 (2023) 1-30 Submitted 4/23; Revised 6/23; Published 12/23 Hierarchical Kernels in Deep Kernel Learning Wentao Huang huangwt55@mail2.sysu.edu.cn Houbao Lu luhb6@mail2.sysu.edu.cn Haizhang Zhang zhhaizh2@mail.sysu.edu.cn School of Mathematics (Zhuhai) Sun Yat-sen University Zhuhai 519082, P. R. China Editor: Lorenzo Rosasco Kernel methods are built upon the mathematical theory of reproducing kernels and reproducing kernel Hilbert spaces. They enjoy good interpretability thanks to the solid mathematical foundation. Recently, motivated by deep neural networks in deep learning, which construct learning functions by successive compositions of activation functions and linear functions, a class of methods termed as deep kernel learning has appeared in the literature. The core of deep kernel learning is hierarchical kernels that are constructed from a base reproducing kernel by successive compositions. In this paper, we characterize the corresponding reproducing kernel Hilbert spaces of hierarchical kernels, and study conditions ensuring that the reproducing kernel Hilbert space will be expanding as the layer of hierarchical kernels increases. The results will answer whether the expressive power of hierarchical kernels will be improving as the layer increases, and give guidance to the construction of hierarchical kernels for deep kernel learning. Keywords: hierarchical kernels, reproducing kernels, deep learning, compositional kernels, reproducing kernel Hilbert spaces 1. Introduction Machine learning has played an important role in recent advances of artificial intelligence. There are two major categories of learning methods: the classical kernel methods (Sch olkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004; Vapnik, 1998) and deep learning methods (Goodfellow et al., 2016; Le Cun et al., 2015). In many applications, the target of these two kinds of learning methods is the same, which is to learn a prediction function from given training data. The target can be approached by minimizing a functional of the form: min f F 1 n j=1 L(f(xj), yj), (1) subject to a constraint on the complexity of model f, where (xj, yj), 1 j n are prescribed training data from X Y with X being the input space and Y being the output space, L is a chosen loss function, and F is a set of candidate prediction functions. An essential . Corresponding author c 2023 Wentao Huang, Houbao Lu, and Haizhang Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/23-0538.html. Huang, Lu, and Zhang difference between kernel learning and deep learning lies in the choice of F. In the classical kernel methods, F is generated via a reproducing kernel K (also called a Mercer kernel (Mercer, 1909)) on X by j=1 cj K(xj, ) : cj R, 1 j n . In other words, the candidate functions are linear combinations of the reproducing kernel at the sampling points xj, 1 j n. In deep learning methods, F is generated from a deep neural network by consecutive compositions of linear functions and an activation function. Also, some other techniques such as pooling and batch normalization are adopted in deep neural networks. A great advantage of kernel learning is that it is built on the solid mathematical foundation of reproducing kernels and reproducing kernel spaces. In fact, long before the emergence of machine learning, renowned mathematicians including Aronszajn (Aronszajn, 1950), Bochner (Bochner, 1959), and Schoenberg (Schoenberg, 1938, 1942) had been studying positive-definite functions. These functions were latter found to be identical to reproducing kernels. Reproducing kernels and reproducing kernel spaces have been extensively studied since then (see, for example, Berlinet and Thomas-Agnan (2004); Cucker and Smale (2002); Cucker and Zhou (2007); Evgeniou et al. (2000); Fukumizu et al. (2004); Fitz Gerald et al. (1995); Wendland (2005); Wu (1995); Zhang et al. (2009) and references therein). Such a solid mathematical foundation endows good interpretability of kernel methods. For instance, we are able to analyze the generalization ability of many kernel methods by estimating the learning rates from an approximation theory viewpoint (Cucker and Smale, 2002; Cucker and Zhou, 2007). A disadvantage of kernel methods is that they cannot well handle many challenging learning tasks. Recently, motivated by the success of deep learning, a class of learning methods termed as deep kernel learning has appeared in the literature (Anselmi et al., 2015; Bohn et al., 2019; Chen et al., 2017; Cho and Saul, 2009; Wilson et al., 2016). The essence of deep kernel learning is the usage of hierarchical kernels (also known as compositional kernels) that are constructed from existing kernels by successive function compositions. Such constructions are mainly motivated by composition structure of deep neural networks. Function compositions are able to generate high-dimensional complicated functions from relatively low-dimensional simple functions. This can be explained by the wellknown Kolmogorov-Arnold representation theorem in mathematics (Morris, 2021). The applications and implications of the Kolmogorov-Arnold representation theorem to neural networks are discussed in Schmidt-Hieber (2021); Yarotsky (2017). Deep kernel learning based on hierarchical kernels have shown comparable performances in a few challenging learning problems (Chen et al., 2017; Cho and Saul, 2009; Wilson et al., 2016) . However, two theoretical questions remain unanswered for these hierarchical kernels. The first question concerns the understanding of the reproducing kernel Hilbert spaces of hierarchical kernels. The second question is about what conditions ensure that the expressive power of the hierarchical kernels increases as the number of layers of hierarchical kernels increases. In this paper, we aim to answer these two questions for a type of Hierarchical Kernels in Deep Kernel Learning hierarchical kernels generated by K0 = K, Kn = g(Kn 1), n N, where the initial kernel K is a commonly-used kernel in machine learning (such as the Gaussian kernel, the exponential kernels, or a polynomial kernel), and g is a chosen univariate function such as the exponential function or a fixed polynomial. Our objective is to characterize the reproducing kernel Hilbert space HKn of each Kn, and to investigate conditions ensuring that HKn+1 is strictly larger than HKn. These results will contribute to the mathematical foundation of deep kernel learning. The rest of the paper is organized as follows. In Section 2, we introduce some basic facts about reproducing kernels and reproducing kernel Hilbert spaces. Note that our study will build upon the existing results on inclusion relation of the reproducing kernel Hilbert spaces. In Section 3, we present some general results on compositional kernels. Sections 4-6 are devoted to hierarchical kernels generated from the compositions of the Gaussian kernel, the exponential kernel, the polynomial kernel with the exponential function or a fixed polynomial, respectively. Initial experiments on hierarchical kernels are conducted in Section 7 and the paper is concluded in Section 8. We shall see that the main analysis of hierarchical kernels is closely related to the high order Bell numbers. Moreover, the answer to the second question is not always affirmative, as we shall see in the case of hierarchical exponential kernels that the related reproducing kernel Hilbert spaces remain unchanged as the number of layers increases, indicating that one should not use the exponential kernels in deep kernel learning. Before we enter the formal investigation, we would like to discuss the differences between our mathematical study on hierarchical kernels and those studies aiming to give explanation to deep neural networks by kernels. Such studies include understanding the training process and mechanism of generalization of deep neural networks by neural tangent kernels (see, for example, Cho and Saul (2009); Daniely et al. (2016); Huang and Yau (2020); Huang et al. (2021); Lee et al. (2018); Jacot et al. (2018); Neal (1996)). Compositional kernels related to neural networks and neural tangent kernels are investigated in Bietti and Bach (2021); Chen and Xu (2021); Geifman et al. (2020). These kernels are dot-product kernels on the sphere and are different from the hierarchical kernels on the Euclidean spaces in this paper. There is another line of of recent work aiming to understand shallow neural networks by reproducing kernel Banach spaces (Bach, 2017; Bartolucci et al., 2021; Ongie et al., 2019; Parhi and Nowak, 2021; Spek et al., 2022). In particular, a function composition structure in Banach spaces was proposed in Parhi and Nowak (2022) to understand the functions learned by deep neural networks. The main purpose of the above researches is to give explanation to neural networks by neural tangent kernels or by reproducing kernel Banach spaces. Hierarchical kernels in the current work are constructed from existing kernels by successive function compositions. Such constructions are stimulated by the composition structure of deep neural networks. Hierarchical kernels can then be used for feature extraction or used in classical kernel methods. Hierarchical kernels are not to compete with deep learning. Therefore, the theme of the paper is different from those in the former paragraph. In the future, we plan to investigate compositional kernels on the sphere and hierarchical kernels involving compositions with multiple kernels. Huang, Lu, and Zhang 2. Preliminaries In this paper, we use R, C, N, Z, Z+ to denote the set of real numbers, the set of complex numbers, the set of positive integers, the set of integers, and the set of nonnegative integers, respectively. Let X be a prescribed input space. A reproducing kernel (or kernel for short) K on X is a function from X X to C such that for all finite pairwise distinct inputs xj X, 1 j n, the kernel matrix [K(xj, xk) : 1 j, k n] , is hermitian and positive semi-definite. A reproducing kernel K on X corresponds to a unique reproducing kernel Hilbert space (RKHS), denoted by HK, such that K(x, ) HK for all x X and f(x) = f, K(x, ) HK for all f HK, x X, (2) where , HK denotes the inner product on HK. Let g be a chosen univariate function. We shall concentrate on characterizing Hg(K) and examining the inclusion relationship between HK and Hg(K) in the paper. To this end, we will recall some useful notations and related results in this section. Given two kernels K, G on X, the inclusion relation HK HG was first investigated by Aronszajn in Aronszajn (1950) and then extensively studied in Xu and Zhang (2007, 2009); Zhang and Zhao (2013). The following result is well-known. Denote K G if G K remains a kernel on X. Lemma 1 (Aronszajn, 1950) Let K, G be two kernels on X. Then HK HG if and only if there exists a non-negative constant λ such that K λG. When HK HG, it was observed in Aronszajn (1950) by the closed graph theorem that the identity operator from HK into HG is bounded. The operator norm of this embedding is denoted by β(K, G), (Zhang and Zhao, 2013). We also denote λ(K, G) = inf{λ 0|K λG}. The relation between these two important constants was discovered in Zhang and Zhao (2013). Lemma 2 (Zhang and Zhao, 2013) When HK HG, it holds β(K, G) = p λ(K, G) and K λ(K, G)G. The following result from Aronszajn (1950) will also be needed. Denote by B the norm on a Banach space B. Lemma 3 (Aronszajn, 1950) Let K1, K2 be two kernels on X and K = K1 + K2. Then HK = {f1 + f2 : f1 HK1, f2 HK2}, and f 2 HK = inf{ f1 2 HK1 + f2 2 HK2 : f = f1 + f2, f1 HK1, f2 HK2}. In particular, if additionally, HK1 HK2 = {0}, then f1 + f2 2 HK = f1 2 HK1 + f2 2 HK2, f1 HK1, f2 HK2. Hierarchical Kernels in Deep Kernel Learning One of the most commonly-used classes of reproducing kernels on the Euclidean spaces is the class of translation-invariant kernels. A kernel K on Rd is said to be translationinvariant if K(x a, y a) = K(x, y) for all x, y, a Rd. It is clear that K on Rd Rd is translation-invariant if and only if there exists a function k on Rd such that K(x, y) = k(x y), x, y Rd. A celebrated result due to Bochner states that continuous translation-invariant kernels on Rd are exactly the Fourier transform of finite positive Borel measures on Rd. The Fourier transform and its inverse are defined by Rd f(x)e ix ξdx, ξ Rd, Rd f(ξ)eix ξdξ, x Rd, where x ξ is the standard inner product of x and ξ on Rd. We shall later use x = x x to denote the Euclidean norm on Rd. Denote by B(Rd) the set of finite positive Borel measures on Rd. By the Bochner theorem (Bochner, 1959), a continuous function K on Rd Rd is a translation-invariant kernel on Rd if and only if there exists a µ B(Rd) such that K(x, y) = Z Rd e i(x y) ξ dµ(ξ), x, y Rd. (3) For two translation-invariant kernels K, G on Rd, the inclusion relation HK HG was extensively studied in Zhang and Zhao (2013). We shall need two of the results. By the Jordan decomposition of measures, every measure µ B(Rd) can be factored into the sum of two positive measures µc and µs, which are absolutely continuous and singular with respect to the Lebesgue measure, respectively. For the absolutely continuous measure µc, the Radon-Nikodym theorem ensures the existence of a nonnegative function u L1(Rd) such that A u(ξ)dξ for every Lebegue measurable A Rd. Consequently, K defined by (3) can be factored as K = Kc + Ks, Kc(x, y) = Z Rd e i(x y) ξ dµc(ξ), Ks(x, y) = Z Rd e i(x y) ξdµs(ξ). (4) The two required results are as follows. Huang, Lu, and Zhang Lemma 4 (Zhang and Zhao, 2013) Let uc, us be two nonnegative Borel measures on Rd that are absolutely continuous and singular with respect to the Lebesgue measure, respectively. And let Kc, Ks be the associated translation-invariant kernels given by (4). Then HKc HKs = {0}. Lemma 5 (Zhang and Zhao, 2013) Let u, v be nonnegative functions in L1(Rd) and let K, G be defined by K(x, y) = Z Rd e i(x y) ξu(ξ)dξ, G(x, y) = Z Rd e i(x y) ξv(ξ)dξ, x, y Rd. (5) Then HK HG if and only if there exists a nonnegative constant C such that u(ξ) Cv(ξ) almost everywhere on Rd, in which case λ(K, G) = u/v L (Rd). We next introduce a characterization of the RKHS of a translation-invariant kernel. For a nonnegative function u L1(Rd), denote by L2 u(Rd) the Hilbert space of Borel measurable functions f on Rd such that Z Rd |f(t)|2u(t)dt < + . The inner product and norm on L2 u(Rd) are given by f, g L2u(Rd) = Z Rd f(t)g(t)u(t)dt, f L2u(Rd) = Z Rd |f(t)|2u(t)dt 1/2 . Lemma 6 (Wendland, 2005) Let u be a nonnegative functions in L1(Rd) and let K be defined by K(x, y) = Z Rd e i(x y) ξu(ξ)dξ, x, y Rd. (6) HK = f(x) = Z Rd fu(t)eix ξu(ξ)dξ : fu L2 u(Rd) f C(Rd) : Z Rd | ˆf(ξ)|2 u(ξ) dξ < + with inner product f, g HK = fu, gu L2u(Rd) = Z Another important class of reproducing kernels in machine learning is the polynomial kernel, which is a special form of the Hilbert-Schmidt kernel. By Mercer s theorem (Mercer, 1909), any continuous kernel on a compact metric space is a Hilbert-Schmidt kernel. For this sake, we shall first present the general form of Hilbert-Schmidt kernels. Let a be a nonnegative function on N and set an := a(n), n N. Denote by ℓ2 a(N) the Hilbert space of functions c on N such that c ℓ2a(N) := X n=1 an|cn|2 1/2 < + . Hierarchical Kernels in Deep Kernel Learning The inner product on ℓ2 a(N) is c, d ℓ2a(N) := n=1 ancndn, c, d ℓ2 a(N). Suppose that φn, n N is a sequence of functions on the input space X, such that for each x X the function Φ(x) defined on N as Φ(x)(n) := φn(x), n N, (7) belongs to ℓ2 a(N). The Hilbert-Schmidt kernel Ka associated with a is given as Ka(x, y) := (Φ(x), Φ(y))ℓ2a(N) = n=1 anφn(x)φn(y), x, y X. (8) Now suppose that there exists another nonnegative function b on N such that Φ(x) ℓ2 b(N) for all x X. Set Kb(x, y) := (Φ(x), Φ(y))ℓ2 b(N) = n=1 bnφn(x)φn(y), x, y X. (9) The inclusion relation HKa HKb was characterized in Zhang and Zhao (2013). Lemma 7 Suppose that b is nontrivial, and span{Φ(x) : x X} is dense in both ℓ2 a(N) and ℓ2 b(N). Then HKa HKb if and only if there is a constant λ > 0 such that an λbn for all n N. In this case, λ(Ka, Kb) = sup an bn : n N, bn > 0 . (10) We shall also need a characterization of the RKHS of a Hilbert-Schmidt kernel, which is well-known (Cucker and Smale, 2002). Note that when span{Φ(x) : x X} is dense in ℓ2 a(N), if c ℓ2 a(N) satisfies n=1 cnanφn(x) = 0 for all x X, then (c, Φ(x))ℓ2a(N) = 0 for all x X, which implies by the denseness condition that c = 0 in ℓ2 a(N). Lemma 8 Let Ka be the kernel defined by (8) and suppose that span{Φ(x) : x X} is dense in ℓ2 a(N). Then HKa = fc(x) := (c, Φ(x))ℓ2a(N) = n=1 cnanφn(x), x X, c ℓ2 a(N) with the inner product fc, fd HKa = c, d ℓ2a(N), c, d ℓ2 a(N). Huang, Lu, and Zhang 3. General Characterizations Let K be a kernel on an input space X. The hierarchical kernels constructed from K are defined recursively via compositing with a function g on R by Kn(x, y) = g(Kn 1(x, y)), x, y X, n 1, where K0 := K. The first question is for what g, Kn would always remain a reproducing kernel. This can be answered by a classical result on positive-definite functions (Fitz Gerald et al., 1995). Lemma 9 Let g be a function on C. Then for any reproducing kernel K, g(K) remains a reproducing kernel if and only if g is holomorphic on C and all the coefficients in its Maclaurin series are nonnegative. By the above lemma, typical choices of g in deep kernel methods for machine learning including g(x) = ex and g(x) = P(x), where P is a polynomial with nonnegative coefficients. We first investigate such hierarchical kernels. A couple of simple observations are in order. We shall need the well-known fact that the product of two reproducing kernels remains a reproducing kernel. This follows directly from the Schur product theorem (Horn and Johnson, 1991) that the component-wise product (Hadamard product) of two positive semidefinite matrices is still a positive semi-definite matrix. Proposition 10 Let g be a holomorphic function on C of the form n=0 anzn, z C, an 0, n Z, (11) where a1 > 0. Then HK Hg(K). In particular, HK He K. Proof By Lemma 9, g(K) a1K is a kernel on X. Then by Lemma 1, HK = Ha1K Hg(K). Clearly, the exponential function satisfies the requirements on g. Theorem 11 Let K, G be two kernels on X such that HK HG, and let g be given by (11). Then Hg(K) Hg(λG), where λ = λ(K, G). In particular, if λ(K, G) 1 then Hg(K) Hg(G). Proof As HK HG, by Lemma 2, K λG where λ = λ(K, G). Then there exists a kernel L such that λG = K + L. By (11), n=0 an(K + L)n n=0 an Kn + Hierarchical Kernels in Deep Kernel Learning Since the product of two kernels remains a kernel, each of the Kj Ln j is a kernel. Therefore, g(K) g(λG). By Lemma 1, Hg(K) Hg(λG). When λ(K, G) 1, one can choose λ = 1 and apply the above arguments to show that Hg(K) Hg(G). Corollary 12 If g given by (11) is a polynomial and HK HG then Hg(K) Hg(G). Proof If λ(K, G) 1 then the result is true by Theorem 11. Now suppose n=0 anzn, an 0, 0 n m, and λ = λ(K, G) 1. Then we have n=0 an(λG)n n=0 anλm Gn = λmg(G), which implies that g(K) λmg(G). By Lemma 1, Hg(K) Hg(G). We make two remarks about the above results. The first one is that when g is not a polynomial then HK HG may not imply Hg(K) Hg(G). To see a counterexample, we shall use Lemma 7 on the inclusion relation of RKHSs of Hilbert-Schmidt kernels. Let K(x, y) = 2xy, G(x, y) = xy be two polynomial kernels on R. Then HK = HG. But e K = e2xy = k! , e G = exy = 1 k! = 2k : k N = + , which implies by Lemma 7 that He K He G. The second remark is that λ(K, G) 1 is not necessary to imply Hg(K) Hg(G) from HK HG even when g is not a polynomial. An example is given by the following two kernels K(x, y) = e (x y)2 = Z R u(ξ)e i(x y)ξdξ, G(x, y) = e |x y| = Z R v(ξ)e i(x y)ξdξ, x, y R, where u(ξ) = 1 2 πe ξ2 4 , v(ξ) = 1 π(1 + ξ2). We compute that λ(K, G) = sup ξ R u(ξ) v(ξ) = π Huang, Lu, and Zhang By Lemma 5, HK HG. Now consider e K(x,y) = 1 + k! := 1 + K1(x, y) := 1 + Z R e i(x y)ξu1(ξ)dξ, e G(x,y) = 1 + k! := 1 + G1(x, y) := 1 + Z R e i(x y)ξv1(ξ)dξ, u1(ξ) = 1 2 π 4k , v1(ξ) = 1 1 k! k ξ2 + k2 . λ(K1, F1) = sup ξ R u1(ξ) v1(ξ) = π 4k P k=1 1 k! k ξ2+k2 e ξ2 4k P k=1 1 k!k 4 π P k=1 e 1 4k 1 (k 1)! k P k=1 1 k!k < + . By Lemma 5, He K He G despite that λ(K, G) > 1. 4. Hierarchical Gaussian Kernels The purpose of this section is to study the characterization and inclusion relation of RKHSs corresponding to hierarchical kernels generated from composition of the Gaussian kernel and a fixed function g. Popular choices of g including the exponential function and a polynomial. We start with the exponential function. 4.1 Composition with Exponential Function The Gaussian kernel is given by G0(x, y) = exp( λ x y 2) = Z Rd e i(x y) ξg0(ξ)dξ, x, y Rd, λ > 0, (12) g0(ξ) := 1 (2 λπ)d exp ξ 2 , ξ Rd. (13) The hierarchical Gaussian kernels from composition with the exponential function are given by Gn(x, y) = exp (Gn 1(x, y)) , n N. (14) In particular, G1(x, y) = exp(G0(x, y)) = k! := 1 + K1(x, y), x, y Rd, (15) Hierarchical Kernels in Deep Kernel Learning K1(x, y) := exp( λk x y 2) One computes that K1(x, y) = Z Rd e i(x y) ξg1(ξ)dξ, x, y Rd, g1(ξ) := 1 (2 1 k!kd/2 exp ξ 2 , ξ Rd. (17) Theorem 13 It holds HG0 HG1, but HG1 HG0. Proof Firstly, by Lemmas 3 and 4 HG1 = {c + f : c C, f HK1}, and ||c + f||2 HG1 = |c|2 + ||f||2 HK1. Apparently, g0(ξ) g1(ξ). It follows by Lemma 4 that HG0 HK1 HG1. On the other hand, one sees g1(ξ) g0(ξ) = P k=1 1 k!kd/2 exp ξ 2 1 k!kd/2 exp ξ 2 1 + exp ξ 2 Therefore, g1(ξ) g0(ξ) is unbounded on Rd. By Lemma 5, HK1 HG0. Consequently, HG1 HG0. Next we want to study the inclusion relation between HGn and HGn+1 for general n. By Lemma 5, the key problem is to estimate a dominating relation between the Fourier transforms of Gn and Gn+1. A sequence of coefficients from the exponential generating functions e0(x) = x, en(x) = exp (en 1(x)) , n N, (18) will play an important role. For this reason, we first define and study them (Asai et al., 2001; Bell, 1938). Huang, Lu, and Zhang Definition 14 The high order Bell numbers βn,k, n, k Z+, are the coefficients satisfying en(x) = en(0) k! xk, x R. (19) Note that β2,k, k 0 are refered as the Bell numbers. We next look at the properties of the high order Bell numbers. For T Zd +, we shall denote |T| = T1 + T2 + + Td. Also, for latter use, we set for βn,k, x Rd and T Zd +, = k! Qd i=1 Ti , βn,T = i=1 βn,Ti, x T = x T1 1 x Td d . (20) Proposition 15 The high order Bell numbers satisfy β1,k = 1, k 0, βn,0 = 1, n N, (21) βn,T , n, k 1. (22) Proof Equation (21) is obvious. Since en+1(x) = en+1(0) exp (en(x) en(0)), we could expand en+1(x) as follows en+1(x) = en+1(0) exp en(x) en(0) = en+1(0) exp en(0) = en+1(0) 1 + The term (P l=1 βn,l l! xl)j in the last equation above can be expanded into an infinite polynomial by X Using the notations (20), one observes that 1 k! k! Qj l=1 Tl! βn,T = 1 Hierarchical Kernels in Deep Kernel Learning Combining the above equation with (23), we have en+1(x) = en+1(0) 1 + = en+1(0) 1 + By comparing the coefficients in (19) and (24), we could see that (22) is true. We next estimate the ratio βn+1,k/βn,k with the results in Asai et al. (2001). Lemma 16 (Asai et al., 2001) The high order Bell numbers satisfy the inequality 2 k1 k2βn,k1+k2 βn,k1βn,k2, k1, k2 0. (25) Lemma 17 It holds βn+1,k β2,k 2k βn,k, n 1, k 1, (26) and for some positive constant C that βn+1,k C 2k 3 2 + log(2k) βn,k, n 1, k 1. (27) Proof By Proposition 15 and Lemma 16, we have 2 kβn,k = β2,k 2k βn,k, (28) where we have used the identity β1,k = 1 in equation (21). We then recall an asymptotic formula for the Bell numbers given in Bruijin (1981) k = log k log log k 1 + log log k log k + 1 log k + 1 2 + O log log k Thus log β2,k k = log k + O(log log k), which implies that for k large enough, Consequently, for k large enough, β2,k kk/2 = k k 2 2k2 k k 2 2(2k 3 2 + log(2k)). (29) Huang, Lu, and Zhang Notice that k k 2 2 = lim k 2kk2 k k 2 = lim k 2kk2 k k lim k 2kk2 3k = lim k k2 Thus, k k 2 2 2k for k large enough. This inequality together with (29) implies that for k large enough, β2,k 2k(2k 3 2 + log(2k)). Inequality (27) now follows from the above equation and equation (28). With the preparations above, we are ready to characterize the reproducing kernel Hilbert space HGn of the hierarchical Gaussian kernel Gn, and to show that HGn are strictly expanding as n increases. Below we shall use A B to denote that A is a proper subset of B. We shall also denote by x and x the least integer greater than or equal to x and the largest integer small than or equal to x, respectively. Theorem 18 Given the hierarchical Gaussian kernels defined by (12) and (14), it holds Gn(x, y) = en(0) k! exp( kλ x y 2), n N, x, y Rd, (30) and Gn(x, y) = en(0) + Kn(x, y) with Kn(x, y) = Z Rd e i(x y) ξgn(ξ)dξ, gn(ξ) = en(0) (2 βn,k k!kd/2 exp ξ 2 , ξ Rd. (31) Consequently, c + f : c R, f C(Rd) satisfying Z Rd | ˆf(ξ)|2 gn(ξ) dξ < + and HGn HGn+1, n N. (33) Proof Identity (31) is obtained by applying (19) and the Fourier transform (13) of Gaussian kernels. Thus, equation (32) is a direct consequence of Lemmas 3 and 6. Clearly, gn(ξ) = en(0) (2 βn,k k!kd/2 exp ξ 2 k!kd/2 exp ξ 2 Thus, HGn HGn+1. On the other hand, it holds gn(ξ) = en+1(0) P k=1 βn+1,k k!kd/2 exp ξ 2 en(0) P k=1 βn,k k!kd/2 exp ξ 2 Hierarchical Kernels in Deep Kernel Learning We shall show that gn+1(ξ)/gn(ξ) is unbounded by showing that there exists a positive constant C such that for ξ = 8m gn(ξ) C log(2m) for sufficiently large m. (34) The above equation can be rewritten as 1 k!kd/2 (en+1(0)βn+1,k Cen(0)βn,k log(2m)) exp 16m2 Let C be the constant in Lemma 17, by en+1(0) en(0), Lemma 17 and (35), it suffices to show that for large enough m βn,k k!kd/2 2k 3 2 + log(2k) log(2m) exp 16m2 First note that for all k log(2m) , 2k 3 2 + log(2k) log(2m) 0. Therefore, for sufficiently large m, using 2m 3 2 e and md/2 m!, we get βn,k k!kd/2 2k 3 2 + log(2k) log(2m) exp 16m2 βn,m m!md/2 2m 3 2 exp( 16m) + βn,k k!kd/2 2k 3 2 + log(2k) log(2m) exp 16m2 (m!)2 e 16m+1 βn,k k!kd/2 log(2m)e 16m2 (m!)2 e 16m+1 k! log(2m)e 8m2 log(2m) βn,m log(2m) (m!)2 e 16m+1 βn,m log(2m)e 8m2 log(2m) X = eβn,m log(2m) e 16m (m!)2 e 8m2 log(2m) = eβn,m log(2m)e 8m2 log(2m) exp 16m + 8m2 log(2m) = eβn,m log(2m)e 8m2 log(2m) exp 4m2 log(2m) 8m Therefore, it suffices to show that for m large enough, log(2m) 8m m!. Huang, Lu, and Zhang By m! mm, the above equation is true if log(2m) 8m m log m, which is clearly true for sufficiently large m. We conclude that inequality (34) holds, which implies that gn+1(ξ)/gn(ξ) is unbounded. By Lemma 5, HGn+1 HGn. 4.2 Composition with a Polynomial We consider hierarchical kernels generated from the composition of the Gaussian kernel and a fixed polynomial in this subsection. Let P be a given polynomial k=1 akxk, (36) where ak 0, a N > 0, N 2. The hierarchical kernels under investigation are defined recursively by Gn(x, y) = P(Gn 1(x, y)), x, y Rd, n Z+, (37) with G0 = exp( λ x y 2). To characterize HGn and their inclusion relations, we shall need a well-known result on the inclusion relation between RKHSs of Gaussian kernels (Steinwart et al., 2006). It can also be viewed as a direct consequence of Lemma 5 and the Fourier transform (13) of the Gaussian kernel. Lemma 19 (Steinwart et al., 2006) Given Gaussian kernels Gγ(x, y) = exp( γ x y 2), x, y Rd, it holds HGγ1 HGγ2 whenever γ1 < γ2. Another result in need follows directly from Lemma 3. Lemma 20 Given two kernels K and G, if HK HG then HK+G = HG. We are ready to present the main result on hierarchical Gaussian kernels from composition with a fixed polynomial. Theorem 21 Let P be a fixed polynomial given by (36) and define the hierarchical Gaussian kernels recursively by (37). Then, for every n 0, HGn = HGNnλ, (38) HGn HGn+1, (39) HGn = f C(Rd) : Z Rd | ˆf(ξ)|2 exp ξ 2 dξ < + . (40) Hierarchical Kernels in Deep Kernel Learning Proof Notice that each Gn is a linear combination of finitely many Gaussian kernels with positive coefficients. Assume that k=1 ck Gγk, where ck > 0 and γ1 < γ2 < < γm. One observes from the definition of Gn that γm = Nnλ. By Lemmas 19 and 20, for 0 < γ < γ and a, b > 0, Ha Gγ+b Gγ = Hb Gγ = HGγ . Therefore, HGn = HGγm = HGNnλ. Equation (39) follows from the above equation and Lemma 19. And equation (40) follows from (38) and Lemma 6. 5. Hierarchical Exponential Kernels We investigate hierarchical kernels generated from the composition of the exponential kernel with the exponential function or a polynomial in this section. For x = (x1, x2, , xd) Rd, denote x 1 = Pd i=1 |xi| and x 2 = q Pd i=1 x2 i . Also let Γ denote the Gamma function 0 ts 1e tdt, s > 0. The exponential kernel is given by Ep,0(x, y) = exp( λ x y p) = Z Rd e i(x y) ξφp,0(ξ)dξ, x, y Rd, λ > 0, p = 1, 2, (41) φ1,0(ξ) = 1 λ ξ2 j + λ2 , ξ Rd, (42) φ2,0(ξ) = Γ(d+1 2 λ2 + ξ 2 2 d+1 2 , ξ Rd. (43) 5.1 Composition with the Exponential Function Recall the exponential generating functions en defined in (18). The hierarchical exponential kernels via consecutively compositing with the exponential function are defined by Ep,n(x, y) = en(Ep,0(x, y)), n N, p = 1, 2. (44) We present the RKHS of the hierarchical exponential kernels and a surprise result on their inclusion relations. Huang, Lu, and Zhang Theorem 22 Given the hierarchical exponential kernels defined by (41) and (44), it holds HEp,n = c+f : c C, f C(R) satisfying Z Rd | ˆf(ξ)|2 φp,n(ξ)dξ < + , n N, p = 1, 2, (45) φ1,n(ξ) = en(0) kλ ξ2 j + k2λ2 , ξ Rd, (46) φ2,n(ξ) = en(0)Γ(d+1 k! k2λ2 + ξ 2 2 d+1 2 , ξ Rd. (47) HEp,n = HEp,n+1, n N, p = 1, 2. (48) Proof We first write Ep,n(x, y) = ep,n(0) + Fp,n(x, y), n N, n 1, p = 1, 2, Fp,n(x, y) = Z Rd e i(x y) ξφp,n(ξ)dξ, x, y Rd, p = 1, 2, and by the expansion (19), φp,n(ξ) = en(0) βn,k k!kd φp,0(ξ k), ξ Rd, p = 1, 2. (49) Combing the above equation and equations (42), (43), we obtain (46) and (47). The first result (45) thus follows directly from Lemmas 3 and 6. We now turn to the proof of identity (48). On one hand, since the high order Bell numbers satisfy βn,k βn+1,k, k N, it implies HEp,n HEp,n+1 for p = 1, 2. On the other hand, since k) φ1,0(ξ) = λ2 + ξ2 j λ2 + ξ2 j /k2 j=1 k2 = k2d, ξ Rd, k N+, k) φ2,0(ξ) = λ2 + ξ 2 2 λ2 + ξ 2 2/k2 2 k2d, ξ Rd, k N+, Hierarchical Kernels in Deep Kernel Learning with (49) we get φp,n(ξ) = en+1(0) P k=1 βn+1,k k!kd φp,0( ξ en(0) P k=1 βn,k k!kd φp,0( ξ en+1(0) P k=1 βn+1,k k!kd φp,0( ξ k) en(0)βn,1φp,0(ξ) k!kd φp,0( ξ = en+1(ed) < + , p = 1, 2, which implies by Lemma 5 that HEp,n+1 HEp,n for p = 1, 2. We conclude that (48) holds. The above theorem reveals that the hierarchical structure of reproducing kernels does not necessarily yield RKHSs with increasing expressive power. 5.2 Composition with a Polynomial We consider hierarchical exponential kernels generated from compositions with a fixed polynomial in this subsection. Let P be a polynomial as described in (36). We generate the hierarchical exponential kernels by Ep,n(x, y) = P(Ep,n 1(x, y)), n N, x, y Rd, p = 1, 2. (50) where Ep,0 is the exponential kernel Ep,0 given by (41). The following result was proved in Zhang and Zhao (2013). Lemma 23 Given exponential kernels Ep,λ(x, y) = exp( λ x y p), x, y Rd, p = 1, 2, it holds HEp,λ1 = HEp,λ2 for all λ1, λ2 > 0 and p = 1, 2. We show that composition of the exponential kernel with a polynomial will not enlarge the corresponding RKHS either. Theorem 24 Let P be a fixed polynomial given by (36) and define the hierarchical exponential kernels recursively by (50). Then for every n 0 and p = 1, 2, HEp,n = HEp,0. Huang, Lu, and Zhang Proof Notice that each Ep,n is a linear combination of finitely many exponential kernels with positive coefficients. Assume that k=1 ck Ep,λk. By Lemma 23 HEp,λ1 = HEp,λ2 = = HEp,λk. Then Lemma 20 implies that HEp,n = HEp,λ1 = HEp,λ = HEp,0 = HEp,0, which proves the result. 6. Hierarchical Polynomial Kernels We study hierarchical kernels generated from the compositions of a given polynomial kernel and the exponential function in this section. A polynomial kernel is a reproducing kernel of the form k=0 ak(x y)k, x, y Rd, (51) where ak 0 for every k Z+. Let r be the radius of convergence of the associated polynomial k=0 akzk. (52) Then the kernel K in (51) is well-defined on {x Rd : x < r}. The hierarchical kernels via compositions of K and the exponential function are generated by Kn = en(K), n N, (53) where en are the exponential generating functions given in (18). We first show that under a mild condition, the RKHS HKn is indeed expanding as n increases. Theorem 25 Suppose the radius of convergence of the polynomial P given in (52) is infinity and that P is not a constant. Then for each n N, HKn 1 HKn. Proof By Proposition 10, HKn 1 HKn for each n N. Assume that HK1 HK0. By Lemma 1, there exists a constant λ > 0 such that λK0 K1 is a kernel. Consequently, λK0(x, x) exp(K0(x, x)) 0 for all x Rd, (54) which implies that k=0 ak(x x)k, Hierarchical Kernels in Deep Kernel Learning is bounded on Rd. As a result, P(z) is bounded on C. By Liouville s theorem, P must be a constant, a contradiction. Similarly, if for some n N, HKn HKn 1 then Pn 1 must be a constant. Here, P0 = P, Pn = exp(Pn 1), n 1. But Pn 1 is constant if and only if P is constant. Therefore, we conclude that HKn 1 HKn for every n N. In the rest of this section, we focus on the most popular polynomial kernel in machine learning, which is of the form K(x, y) = (x y)q, x, y Rd, (55) where q N is fixed. Thus K corresponds to the polynomial Let Kn be the hierarchical kernels (53). Theorem 25 applies to this special case. Thus, HKn are expanding as n increases. As the final theoretical task of this paper, we desire to characterize HKn. Theorem 26 Let n N and Kn be the hierarchical polynomial kernels defined by (53) with K be given in (55). Then Kn(x, y) = en(0) k! (x y)qk, x, y Rd, (56) α Zd + |α|=qk α Zd + |α|=qk |ak,α|2 < + (57) with inner product fa, fb HKn = en(0) α Zd +,|α|=qk ak,αbk,α. (58) Proof Recall the exponential generating functions en given in (18). Apparently, Kn(x, y) = en(K(x, y)) = en((x y)q), x, y Rd. Huang, Lu, and Zhang Therefore, by (19), Kn(x, y) = en(0) k! (x1y1 + x2y2 + + xdyd)qk α Zd + |α|=qk Equations (57), (58) now follow directly from the above equation and Lemma 8. 7. Experiments To verify the theoretical results in the paper, we shall conduct initial experiments with hierarchical Gaussian kernels and hierarchical exponential kernels in this section. Specifically, we shall evaluate the hierarchical Gaussian kernels from consecutive composition with the exponential function on various tasks, including classification on the Scikit-learn (Pedregosa et al., 2011) moon scattering dataset, classification on CIFAR-10, and regression on UCI datasets. Considering the numerical stability, we shall only evaluate hierarchical kernels with at most three layers, and shall slightly modify the hierarchical Gaussian kernels as Gk+1(x, y) = exp (ek(1) (Gk(x, y) 1)) , G0(x, y) = exp( λ x y 2), x, y Rd. (59) To further confirm the theoretical results of the paper, we shall also evaluate the ℓ1 norm hierarchical exponential kernels on three regression tasks from LIBSVM (Chang and Lin, 2011). These ℓ1 norm hierarchical exponential kernels are also slightly modified as Hk+1(x, y) = exp (ek(1) (Hk(x, y) 1)) , H0(x, y) = exp( λ x y 1), x, y Rd. (60) For classification tasks, we shall use the C-support vector classification (Boser et al., 1992; Cortes and Vapnik, 1995) method min w,b,ξ 1 2w T w + C subject to yi(w T φ(xi) + b) 1 ξi ξi 0, i = 1, , n, where xi, 1 i n are the given training data in two classes, yi { 1, 1}, 1 i n are the corresponding labels, φ(xi) maps xi into a feature space, and C > 0 is the regularization parameter. For multi-class classification, we adapt the one-against-one strategy that trains N(N 1) 2 classifiers, where N is the number of classes, and predict the label of a new input through majority voting. Hierarchical Kernels in Deep Kernel Learning For regression tasks, we shall use the ϵ-support vector regression method min w,b,ξ,ξ 1 2w T w + C i=1 (ξi + ξ i ) subject to w T φ(xi) + b yi ϵ + ξi yi w T φ(xi) b ϵ + ξ i ξi, ξ i 0, i = 1, , n, where xi and yi, 1 i n are the input and output data, respectively, and C > 0, ϵ > 0 are given regularization parameters. Note that we shall use the Root Mean Square Error (RMSE) to measure the performance of regression tasks. Both the classification and regression tasks will be solved with the Sequential Minimal Optimization (SMO) solver. The computational cost of this method is analyzed in (Platt, 1998). According to (Platt, 1998), this algorithm does not involve matrix computations with the kernel matrices. This together with existence of regularization in C-support vector classification and ϵ-support vector regression ensure the stability of the numerical experiments in this section. As our focus is on the hierarchical kernels, we shall fix C = 1 and ε = 10 3 in all the tasks for fair comparison. For each task and each hierarchical kernel, the hyperparameter λ in the hierarchical kernels (59) and (60) will be optimally chosen. For evaluations with hierarchical Gaussian kernels, we implement with the thundersvm (Wen et al., 2018) to accelerate the training process. And for evaluations with hierarchical exponential kernels, we directly implement the SVM module in Scikit-learn since the datasets are relatively small. Notice that thundersvm only supports precomputed mode for custom kernels, which is inefficient and memory consuming for large scale data. Thus, we directly modify their source codes for hierarchical kernels. More details about the implementation of thundersvm could be found in (Wen et al., 2017), and our codes could be accessed via the github repository https://github.com/Saeba Huang/ Hierarchical-Kernel-in-Deep-Kernel-Learning. For hierarchical Gaussian kernels, we will see that in all tasks, the best results are obtained with G3 as the layer increases from 0 to 3. While for hierarchical exponential kernels, we will see that the result is not improving as the number of layers increases. These confirm our results in previous sections that as the number of layers increases, the RKHS of hierarchical Gaussian kernels is expanding while the RKHS of hierarchical exponential kernels remains the same. We present detailed results of the experiments as follows. 7.1 Scikit-learn moon scattering dataset with hierarchical Gaussian kernels We generate a set of points representing moon scattering using Scikit-learn, and perform evaluations with different hierarchical Gaussian kernels. We randomly choose 500 points as the training set which is equally divided into two classes. For each hierarchal Guassian kernel Gk, 0 k 3, the hyperparameter λ will be optimally chosen from [2 5, 2 4, , 29, 210]. The results are tabulated in Table 1. We also plot the decision boundaries of corresponding to the best λ in Figure 1. One could see that the decision boundary becomes tighter as the number of layers increases. Huang, Lu, and Zhang Kernel log2(λ) -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 G0 86.3% 87.2% 89.8% 98.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 99.9% 99.8% 99.0% G1 86.4% 88.6% 95.7% 99.9% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 99.9% 99.7% 99.3% G2 90.4% 98.6% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 99.8% 99.5% 98.8% G3 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 99.8% 99.3% 99.4% 96.1% 90.1% 85.4% 83.2% Table 1: Accuracy on randomly generated set of moon scattering points with different λ. 2 1 0 1 2 3 x 2 1 0 1 2 3 x 2 1 0 1 2 3 x 2 1 0 1 2 3 x Figure 1: Decision boundaries of different hierarchical Gaussian kernels with their own optimal λ. Hierarchical Kernels in Deep Kernel Learning 7.2 CIFAR-10 dataset with hierarchical Gaussian kernels For evaluations on CIFAR-10 dataset with hierarchical Gaussian kernels, the λ of each layer is optimally chosen from [2 19, , 2 18, , 2 8]. As shown in Table 2, the best result is obtained with G3. Kernel log2(λ) -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 G0 36.48% 38.03% 40.38% 43.01% 45.28% 48.13% 50.55% 53.73% 55.16% 54.30% 45.43% 29.84% G1 36.48% 38.19% 40.88% 43.59% 46.00% 48.96% 51.16% 52.97% 53.29% 50.32% 35.68% 16.45% G2 39.17% 42.04% 44.65% 46.91% 49.56% 52.15% 53.66% 54.13% 52.85% 48.45% 34.83% 16.81% G3 49.09% 51.77% 54.41% 55.55% 53.97% 47.63% 38.05% 32.45% 31.93% 31.96% 20.76% 11.19% Table 2: Accuracy of hierarchical Gaussian kernels on CIFAR-10 with different λ. 7.3 UCI Regression tasks with hierarchical Gaussian kernels We also evaluate the hierarchical Gaussian kernels on the UCI regression datasets elevators, kin40k, and servo. To read these datasets, we utilize code from the repository https: //github.com/treforevans/uci_datasets. For each evaluation, we normalize the data to be between 1 and 1, and perform 5-fold cross-validation. The results are presented in Tables 3-5. One sees that the best test RMSE are obtained with G3 for each dataset. Kernel log2(λ) -8 -7 -6 -5 -4 -3 -2 -1 0 1 G0 0.1320 0.0030 0.1227 0.0028 0.1165 0.0029 0.1119 0.0028 0.1077 0.0025 0.1051 0.0024 0.1032 0.0023 0.1021 0.0022 0.1021 0.0024 0.1068 0.0025 G1 0.1296 0.0030 0.1205 0.0029 0.1148 0.0029 0.1098 0.0027 0.1062 0.0025 0.1039 0.0024 0.1024 0.0023 0.1019 0.0024 0.1038 0.0025 0.1092 0.0026 G2 0.1181 0.0029 0.1130 0.0029 0.1081 0.0026 0.1052 0.0025 0.1032 0.0023 0.1020 0.0023 0.1017 0.0025 0.1037 0.0025 0.1077 0.0024 0.1163 0.0019 G3 0.1041 0.0023 0.1025 0.0023 0.1017 0.0023 0.1030 0.0024 0.1084 0.0026 0.1204 0.0021 0.1441 0.0023 0.1835 0.0025 0.2273 0.0028 0.2548 0.0035 Table 3: RMSE of hierarchical Gaussian kernels on elevators dataset with different λ. Kernel log2(λ) -8 -7 -6 -5 -4 -3 -2 -1 0 1 G0 0.2540 0.0024 0.2409 0.0030 0.2397 0.0036 0.2148 0.0036 0.1738 0.0062 0.1102 0.0031 0.0715 0.0013 0.0460 0.0006 0.0310 0.0005 0.0340 0.0007 G1 0.2450 0.0026 0.2378 0.0033 0.2208 0.0033 0.1739 0.0024 0.1096 0.0019 0.0672 0.0012 0.0430 0.0007 0.0300 0.0007 0.0285 0.0006 0.0436 0.0009 G2 0.2357 0.0034 0.2074 0.0028 0.1576 0.0025 0.0922 0.0014 0.0578 0.0014 0.0384 0.0007 0.0282 0.0007 0.0292 0.0006 0.0400 0.0008 0.0668 0.0014 G3 0.0829 0.0011 0.0515 0.0012 0.0361 0.0007 0.0279 0.0005 0.0386 0.0008 0.0755 0.0016 0.1580 0.0023 0.2472 0.0022 0.2734 0.0022 0.2770 0.0022 Table 4: RMSE of hierarchical Gaussian kernels on kin40k dataset with different λ. Kernel log2(λ) -7 -6 -5 -4 -3 -2 -1 0 1 2 G0 0.3481 0.0347 0.3258 0.0299 0.2813 0.0283 0.2364 0.0269 0.1851 0.0340 0.1632 0.0422 0.1561 0.0461 0.1440 0.0456 0.1568 0.0436 0.2007 0.0462 G1 0.3479 0.0332 0.3187 0.0304 0.2728 0.0284 0.2276 0.0305 0.1795 0.0372 0.1630 0.0454 0.1500 0.0465 0.1475 0.0489 0.1726 0.0477 0.2419 0.0465 G2 0.2991 0.0297 0.2571 0.0293 0.2027 0.0311 0.1732 0.0395 0.1581 0.0455 0.1430 0.0454 0.1457 0.0464 0.1717 0.0476 0.2255 0.0483 0.3212 0.0404 G3 0.1588 0.0437 0.1470 0.0434 0.1424 0.0461 0.1670 0.0439 0.2198 0.0483 0.3378 0.0396 0.4288 0.0337 0.4512 0.0329 0.4534 0.0329 0.4536 0.0329 Table 5: RMSE of hierarchical Gaussian kernels on servo dataset with different λ. Huang, Lu, and Zhang 7.4 LIBSVM regression datasets with hierarchical exponential kernels Finally, we shall evaluate the hierarchical exponential kernels on the datasets bodyfat, mpg, and triazines from LIBSVM. Similarly, we normalize the data to be between 1 and 1, and perform 5-fold cross-validation for each evaluation. The results are presented in Tables 6-8. One sees that the best test RMSE is not improving as the number of layers increases, which justifies our results about hierarchical exponential kernels in Section 5. Kernel log2(λ) -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 H0 0.3199 0.0201 0.3063 0.0200 0.2842 0.0205 0.2503 0.0214 0.2062 0.0244 0.1564 0.0285 0.1082 0.0331 0.0800 0.0354 0.0686 0.0361 0.0658 0.0360 0.0708 0.0362 H1 0.3199 0.0201 0.3063 0.0199 0.2843 0.0205 0.2504 0.0213 0.2064 0.0245 0.1582 0.0283 0.1110 0.0331 0.0816 0.0353 0.0721 0.0367 0.0718 0.0367 0.0792 0.0383 H2 0.2973 0.0202 0.2712 0.0208 0.2333 0.0222 0.1873 0.0261 0.1332 0.0304 0.0955 0.0346 0.0746 0.0356 0.0690 0.0359 0.0713 0.0373 0.0809 0.0375 0.0956 0.0397 H3 0.1360 0.0300 0.0953 0.0347 0.0743 0.0357 0.0669 0.0361 0.0676 0.0355 0.0753 0.0374 0.0914 0.0394 0.1184 0.0397 0.1714 0.0345 0.2605 0.0259 0.3228 0.0211 Table 6: RMSE of hierarchical exponential kernels on bodyfat dataset with different λ. Kernel log2(λ) -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 H0 0.3115 0.0723 0.3050 0.0749 0.2922 0.0797 0.2755 0.0859 0.2474 0.0977 0.2281 0.1044 0.2053 0.1120 0.1949 0.1174 0.1936 0.1179 0.1934 0.1173 0.1988 0.1177 H1 0.3115 0.0723 0.3051 0.0749 0.2923 0.0796 0.2758 0.0857 0.2483 0.0974 0.2294 0.1037 0.2070 0.1117 0.1962 0.1169 0.1946 0.1182 0.1941 0.1191 0.1992 0.1213 H2 0.3002 0.0766 0.2865 0.0816 0.2649 0.0906 0.2380 0.1014 0.2197 0.1106 0.1989 0.1127 0.1946 0.1180 0.1932 0.1184 0.1951 0.1198 0.2006 0.1211 0.2170 0.1187 H3 0.2205 0.1105 0.1987 0.1127 0.1942 0.1181 0.1934 0.1176 0.1951 0.1181 0.2007 0.1185 0.2172 0.1190 0.2454 0.1121 0.2728 0.1017 0.2931 0.0935 0.3036 0.0883 Table 7: RMSE of hierarchical exponential kernels on pyrim dataset with different λ. Kernel log2(λ) -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 H0 0.3969 0.0470 0.3924 0.0473 0.3866 0.0482 0.3758 0.0487 0.3668 0.0466 0.3577 0.0482 0.3495 0.0510 0.3429 0.0499 0.3370 0.0524 0.3382 0.0519 0.3365 0.0521 H1 0.3969 0.0470 0.3924 0.0473 0.3867 0.0482 0.3760 0.0488 0.3671 0.0466 0.3582 0.0486 0.3501 0.0513 0.3455 0.0507 0.3377 0.0528 0.3351 0.0515 0.3394 0.0542 H2 0.3903 0.0478 0.3824 0.0485 0.3716 0.0475 0.3628 0.0472 0.3545 0.0505 0.3478 0.0509 0.3398 0.0519 0.3376 0.0534 0.3339 0.0514 0.3358 0.0530 0.3401 0.0513 H3 0.3550 0.0500 0.3472 0.0513 0.3382 0.0505 0.3365 0.0526 0.3387 0.0513 0.3365 0.0511 0.3370 0.0495 0.3496 0.0453 0.3671 0.0481 0.3832 0.0489 0.3891 0.0465 Table 8: RMSE of hierarchical exponential kernels on triazines dataset with different λ. 8. Conclusion Kernel methods constitute an important category of machine learning methodologies. They enjoy solid mathematical foundations and good interpretability consequently. Motivated by deep neural networks, which generate learning functions through successive composition of activation functions and linear functions, a class of hierarchical kernels has appeared in the literature recently. Such kernels are generated by successive composition of a base kernel and a chosen univariate function. An important theoretical question about hierarchical kernels is whether the expressive power of the kernel will be improving as the number of layer increases. We investigate this question by studying the reproducing kernel Hilbert spaces of hierarchical kernels. It is shown in the paper that the RKHS of hierarchical Gaussian kernels and polynomial kernels is indeed expanding as the number of layer increases, while the RKHS of hierarchical exponential kernels always remains the same. The results reveal that we should not use the exponential kernels as bases kernels in deep kernel learning. In contrast, Gaussian kernels and polynomial kernels are good choices. Numerical experiments on the Scikit-learn demo datasets, the CIFAR-10, and UCI datasets confirm that the learning ability of the hierarchical Gaussian kernel is improving as the number of layer increases. And experiments on datasets from LIBSVM indicate that the learning ability of the hierarchical exponential kernel is not improving as the number of layer increases. These numerical findings justify the theorems in the paper. Hierarchical Kernels in Deep Kernel Learning Finally, we remark that the hierarchical kernels considered in the paper has a simple structure. For applications to complicated learning problems, hierarchical kernels with more sophisticated structures should be investigated in the future. Acknowledgments We would like to express gratitude to the reviewers for careful reading our manuscript, and insightful comments and suggestions that help improve the quality of this manuscript. This work is supported in part by the National Natural Science Foundation of China (NSFC) under grants 12371103, 11971490 and 12126610, and by Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University (2020B1212060032). F. Anselmi, L. Rosasco, C. Tan, and T. Poggio. Deep convolutional networks are hierarchical kernel machines. CBMM Memos, 35, 2015. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68: 337 404, 1950. N. Asai, I. Kubo, and H. Kuo. Bell numbers, log-concavity, and log-convexity. Acta Applicandae Mathematicae, 63: 79 87, 2001. F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(1): 629-681, 2017. F. Bartolucci, E. De Vito, L. Rosasco, and S. Vigogna. Understanding neural networks with reproducing kernel Banach spaces. Applied and Computational Harmonic Analysis, 62: 194 236, 2021. E. T. Bell. The iterated exponential integers. The Annals of Mathematics, 39: 539 557, 1938. A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston, MA, 2004. A. Bietti and F. Bach. Deep equals shallow for Re LU networks in kernel regimes. Proceedings of the International Conference on Learning Representations (ICLR), 2021. S. Bochner. Lectures on Fourier Integrals with an Author s Supplement on Monotonic Functions, Stieltjes Integrals, and Harmonic Analysis. Annals of Mathematics Studies 42, Princeton University Press, New Jersey, 1959. N. G. de Bruijn. Asymptotic Methods in Analysis. Dover Publications, New York, pp. 102109, 1981. B. Bohn, C. Rieger, and M. Griebel. A representer theorem for deep kernel learning. Journal of Machine Learning Research, 20: 1 32, 2019. Huang, Lu, and Zhang B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. Annual Conference Computational Learning Theory, 1992. C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2: 1 27, 2011. J. Chen, H. Avron, and V. Sindhwani. Hierarchically compositional kernels for scalable nonparametric learning. Journal of Machine Learning Research, 18: 1 42, 2017. L. Chen and S. Xu. Deep neural tangent kernel and laplace kernel have the same rkhs. Proceedings of the International Conference on Learning Representations (ICLR), 2021. Y. Cho and L. Saul. Kernel methods for deep learning. Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS), 2009. C. Cortes, and V. Vapnik. Support-vector network. Machine Learning, 20: 273 297, 1995. F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1 49, 2002. F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge Monographs on Applied and Computational Mathematics, 24, Cambridge University Press, Cambridge, 2007. A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. Advances in Neural Information Processing Systems (NIPS), 2016. T. Evgeniou, M. Pontil and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13: 1 50, 2000. C. H. Fitz Gerald, C. A. Micchelli, and A. Pinkus. Functions that preserve families of positive semidefinite matrices. Linear Algebra and its Applications, 221: 83 102, 1995. K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5: 73 99, 2004. A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and R. Basri. On the similarity between the laplace and neural tangent kernels. Advances in Neural Information Processing Systems (Neur IPS), 2020. I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, 2016. R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1991. J. Huang and H. T. Yau. Dynamics of deep neural networks and neural tangent hierarchy. Int. Conf. Mach. Learn., PMLR, 2020, 4542 4551. Hierarchical Kernels in Deep Kernel Learning W. Huang, W. Du and R. Y. Da Xu. On the neural tangent kernel of deep networks with orthogonal initialization. IJCAI, 2021. A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems (NIPS), 2018. Y. Le Cun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553): 436 444, 2015. J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. Proceedings of the International Conference on Learning Representations (ICLR), 2018. J. Mercer. Functions of positive and negative type and their connection with the theorey of integral equations. Philosophical Transactions of the Royal Society A, 209: 415 446, 1909. S. A. Morris. Hilbert 13: Are there any genuine continuous multivariate real-valued functions? Bulletin of the American Mathematical Society, 58(1): 107 118, 2021. R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996. G. Ongie, R. Willett, D. Soudry, and N. Srebro. A function space view of bounded norm infinite width Re LU nets: the multivariate case. International Conference on Learning Representations, 2019. R. Parhi and R. D. Nowak. Banach space representer theorems for neural networks and ridge splines. Journal of Machine Learning Research, 22(1): 1960-1999, 2021. R. Parhi and R. D. Nowak. What kinds of functions do deep neural networks learn? Insights from variational spline theory. SIAM Journal on Mathematics of Data Science, 4(2): 464489, 2022. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, G. Louppe, P. Prettenhofer, R. Weiss, R.J. Weiss, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825 2830, 2011. J. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research Technical Report, 1998. J. Schmidt-Hieber. The Kolmogorov-Arnold representation theorem revisited. Neural Networks, 137: 119-126, 2021. I. J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics (2), 39: 811 841, 1938. I. J. Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9: 96 108, 1942. B. Sch olkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, Massachusetts, 2002. Huang, Lu, and Zhang J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, 2004. L. Spek, T. J. Heeringa, and C. Brune. Duality for neural networks through reproducing kernel Banach spaces. ar Xiv:2211.05020, 2022. I. Steinwart, D. Hush, and C. Scovel. An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Transactions on Information Theory, 52: 4635 4643, 2006. V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. Z. Wen, J. Shi, Q. Li, B. He, and J. Chen. Thunder SVM: a fast SVM library on GPUs and CPUs. Journal of Machine Learning Research, 19: 1 5, 2018. Z. Wen, J. Shi, Q. Li, B. He, and J. Chen. Supplementary material of Thunder SVM: https: //github.com/zeyiwen/thundersvm/blob/master/thundersvm-full.pdf, 2017. H. Wendland. Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics 17, Cambridge University Press, Cambridge, 2005. A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. Proceedings of Machine Learning Research, 2016. Z. M. Wu. Compactly supported positive definite radial functions. Advances in Computational Mathematics, 4(3): 283 292, 1995. Y. Xu and H. Zhang. Refinable kernels. Journal of Machine Learning Research, 8: 2083 2120, 2007. Y. Xu and H. Zhang. Refinement of reproducing kernels. Journal of Machine Learning Research, 10: 107 140, 2009. D. Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94: 103 114, 2017. H. Zhang, Y. Xu, and J. Zhang. Reproducing kernel Banach spaces for machine learning. Journal of Machine Learning Research, 10: 2741 2775, 2009. H. Zhang and L. Zhao. On the inclusion relation of reproducing kernel Hilbert spaces. Analysis and Applications, 11: 1350014, 31 pages, 2013.