# Nearly-tight Bounds for Deep Kernel Learning

Yi-Fan Zhang 1 2, Min-Ling Zhang 2 3

1 School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China. 2 Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China. 3 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China. Correspondence to: Min-Ling Zhang.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

The generalization analysis of deep kernel learning (DKL) is a crucial and open problem of kernel methods for deep learning. The implicit nonlinear mapping in DKL makes existing methods of capacity-based generalization analysis for deep learning invalid. To overcome this challenge and close the gap in the generalization theory of DKL, we develop an analysis method based on the composite relationship of function classes and derive capacity-based bounds with mild dependence on the depth, which generalizes learning theory bounds to deep kernels and serves as a theoretical guarantee for the generalization of DKL. In this paper, we prove novel and nearly-tight generalization bounds based on the uniform covering number and the Rademacher chaos complexity for deep (multiple) kernel machines. In addition, for some common classes, we estimate their uniform covering numbers and Rademacher chaos complexities by bounding their pseudo-dimensions and kernel pseudo-dimensions, respectively. The mild bounds without strong assumptions partially explain the good generalization ability of deep learning combined with kernel methods.

## 1. Introduction

Recent work in machine learning has given a revival of attention to deep learning due to its impressive empirical advances across a wide range of tasks. Deep models are typically heavily over-parametrized, yet they still achieve good generalization performance. Zhang et al. (2017) showed that deep neural networks can almost perfectly fit the training data even with random labels, which sparked a rush to explain this phenomenon through generalization analysis based on complexity measures. The problem of understanding deep learning theoretically remains relatively under-explored. Meanwhile, kernel machines can perfectly fit the training data while still being guaranteed to generalize well (Bartlett & Mendelson, 2002; Belkin et al., 2018). Therefore, progress on understanding deep learning is also closely tied to new theoretical research on kernel methods. In recent years, explaining why deep models generalize well has become a hot and important open problem in learning theory.

Uniform convergence is a powerful tool in learning theory for understanding the generalization ability of learners, and it is also widely used in the generalization analysis of deep learning. Although ongoing endeavors have produced a considerable number of non-vacuous generalization error bounds that reflect weak dependence or even independence on the network width and depth, these theoretical results are elaborate and algorithm-based (i.e., the analysis incorporates the implicit regularization of stochastic gradient descent (SGD)) (Neyshabur et al., 2017; Soudry et al., 2018; He et al., 2019; Foret et al., 2021; Lei & Ying, 2021), which makes such theoretical results not applicable to all deep models.
Capacity/complexity-based generalization analysis can provide a general theoretical guarantee for the surprising generalization performance of deep learning. However, capacity-based generalization bounds tend to have a strong dependence on the network depth, which often limits the significance of the theoretical results. Deep kernel learning (DKL), which combines the representation power and structural prior knowledge of deep learning with the non-parametric flexibility of kernel methods, is an important problem of kernel methods for deep learning. A satisfactory and complete study of deep kernel learning should cover three aspects: 1) the design of deep kernels, 2) the efficiency of training algorithms, and 3) the analysis of generalization properties. The design of deep kernels has been developed to a certain extent (Cho & Saul, 2009; Wilson et al., 2016b;a; Al-Shedivat et al., 2017; Lee et al., 2018), mainly benefiting from Gaussian processes. In terms of training algorithms, the studies are often tied to specific deep kernel machines (Cho & Saul, 2009; Mairal et al., 2014; Al-Shedivat et al., 2017) or involve many heuristics (Zhuang et al., 2011). As for the theoretical analysis of generalization, the studies are still at an early stage. Almost all existing works address shallow kernel learning problems (Bartlett & Mendelson, 2002; Ying & Campbell, 2010; Lei & Ding, 2014), and the kernel machines involved do not exceed two layers (Zhuang et al., 2011). Therefore, it is necessary to perform generalization analysis for DKL. As a matter of fact, theoretical research on deep kernel learning can also promote the understanding of deep learning (Jacot et al., 2018).

This paper analyzes the generalization of deep kernel learning from the perspective of capacity, and aims to provide a general theoretical guarantee for the generalization ability of deep kernel machines (DKMs). The DKM is a special deep model based on kernel methods. The most significant difference between the DKM and the deep neural network (DNN) is that the nonlinear mapping (i.e., the kernel mapping) is implicitly induced by the kernel function. Golowich et al. (2018) revealed that existing tight capacity-based generalization bounds for deep neural networks are accompanied by strong assumptions about the properties of activation functions and norm constraints on the parameter matrix of each layer. One cannot make such assumptions on the implicit nonlinear mapping in DKMs and the Reproducing Kernel Hilbert Space (RKHS) norm, which makes existing methods of generalization analysis for DNNs inapplicable to DKMs. In view of the challenges brought by the implicit nonlinear mapping in deep kernel learning, we need to develop a specific generalization analysis method for deep kernel learning and further derive meaningful generalization error bounds. In this paper, we derive novel and nearly-tight capacity-based generalization bounds based on the uniform covering number and the Rademacher chaos complexity for DKMs and deep multiple kernel machines (DMKMs), respectively.
Specifically, we first obtain the composite relationship of uniform covering numbers, which is used to deal with the difficulties in theoretical analysis caused by the implicit nonlinear mapping of DKL. We then derive bounds for DKMs based on the uniform covering number, which can be bounded using pseudo-dimensions, and bounds for DMKMs based on the Rademacher chaos complexity, which can be bounded using kernel pseudo-dimensions. Furthermore, we bound pseudo-dimensions and kernel pseudo-dimensions for some specific classes, and further estimate their uniform covering numbers and Rademacher chaos complexities, respectively. Finally, we provide a lower bound for DKMs based on the Rademacher complexity. To our knowledge, this is the first formal attempt to extend generalization analysis to the case of deep kernels and derive capacity-based generalization bounds for DKL with mild dependence on the depth. Major contributions of the paper include:

- We prove novel and nearly-tight capacity-based generalization bounds based on the complexity of the whole hypothesis space of deep kernel models, which provides general theoretical guarantees for DKL.
- We introduce and formally describe the composite relationship between layers of DKMs/DMKMs for the generalization analysis of DKL, which overcomes the difficulties brought by the implicit nonlinear mapping in DKL and leads to bounds with square-root dependence on the depth (outside of log terms).
- We further show how to estimate the uniform covering number and the Rademacher chaos complexity for the function class of DKMs/DMKMs by bounding the pseudo-dimension and the kernel pseudo-dimension.

We structure our work as follows. We first introduce the related work in Section 2, followed by an overview of the definitions of related complexities, the problem setting of DKL and the notation in Section 3. We then present our main theoretical results in Sections 4 and 5, namely the generalization bound based on the uniform covering number for DKL and the generalization bound based on the Rademacher chaos complexity for deep multiple kernel learning (DMKL), respectively. In Section 6, we show how to estimate the relevant complexities for DKMs and DMKMs. In Section 7, we provide a lower bound based on the Rademacher complexity for DKL. In Section 8, we discuss the implications of our theoretical results. Finally, we conclude our work in Section 9.

## 2. Related Work

In this section, we introduce related work on kernel methods for deep learning and capacity-based generalization bounds for deep learning.

### 2.1. Kernel Methods for Deep Learning

A considerable number of deep kernels and DKMs have been proposed to link kernel methods with deep learning. Various deep kernels were designed to simulate the computation of deep neural networks based on Gaussian processes (Cho & Saul, 2009; Hazan & Jaakkola, 2015; Wilson et al., 2016a;b; Al-Shedivat et al., 2017; Lee et al., 2018). Convolution kernels and convolutional kernel networks were used to encode the invariance of image representations (Mairal et al., 2014; Mairal, 2016). Furthermore, SVM-based deep stacking networks (Wang et al., 2019), deep spectral kernel networks (Xue et al., 2019; Li et al., 2020; 2022) and deep models based on multiple kernel fusion (Song et al., 2017) were all designed to introduce kernels into deep learning.
In terms of theoretical analysis, the relationship between the representer theorem and DKMs was established (Bohn et al., 2019). Regularizing deep neural networks with the Reproducing Kernel Hilbert Space norm was proposed (Bietti et al., 2019) and generalization error bounds of specific deep networks were derived (Suzuki, 2018). Similarity indexes and kernel PCA were used to measure the relationship between representations in deep networks (Kornblith et al., 2019; Montavon et al., 2011). The neural tangent kernel obtained at initialization was proven to control the learning dynamics of gradient descent in the over-parameterized regime (Jacot et al., 2018), and its inductive bias was also studied (Bietti & Mairal, 2019). Nevertheless, there is still a lack of new theories for kernel methods to analyze deep learning (Belkin et al., 2018). To close this gap in the generalization theory, in this paper, we derive capacity-based generalization bounds for DKMs and DMKMs, respectively.

### 2.2. Capacity-based Generalization Bounds for Deep Learning

The capacity-based generalization bounds established by traditional statistical learning theory aim to provide general theoretical guarantees for deep learning. Goldberg & Jerrum (1995); Bartlett & Williamson (1996); Bartlett et al. (1998) proposed upper bounds based on the VC dimension for DNNs. Unfortunately, these theoretical results rely heavily on both the depth and the width of the network, which makes the bounds less attractive once the model size is extremely large, even for the tightest VC-dimension-based upper bounds of Bartlett et al. (2019). Many bounds alleviate the dependence on the width, but they often still depend strongly on the depth. Neyshabur et al. (2015) used the Rademacher complexity to prove a bound with exponential dependence on the depth. Neyshabur et al. (2018) and Bartlett et al. (2017) used PAC-Bayesian analysis and the covering number to obtain bounds with polynomial dependence on the depth, respectively. Golowich et al. (2018) provided bounds with (sublinear) square-root dependence on the depth and further improved the bounds to be depth-independent. However, these results are all accompanied by strong assumptions about the properties of activation functions and norm constraints on the parameter matrix of each layer, making these methods of generalization analysis invalid for DKL.

## 3. Preliminaries

In the context of classification, we are given a dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ with $n$ samples drawn independently and identically distributed (i.i.d.) from a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$. Let $[n] := \{1, \dots, n\}$ for any natural number $n$. For any function $f: \mathcal{X} \to \mathcal{Y}$, we denote the expected risk by $\mathcal{E}(f) = \mathbb{E}_{(x,y)\sim P}[\ell(f(x), y)]$ and the empirical risk with respect to the training dataset $D$ by $\hat{\mathcal{E}}_D(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i)$. Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ denote a kernel function and denote the induced RKHS by $\mathcal{H}$ with norm $\|\cdot\|_{\mathcal{H}}$. In this paper, we assume that all kernels fulfill $r := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)} < \infty$.

### 3.1. Deep Kernel Learning

Although the nonlinear mapping induced by a kernel is implicit, we can avoid this limitation and directly construct DKMs by combining nonlinear mappings in a composite way:

$$K_l(x, x') = \langle \Phi_l(x), \Phi_l(x') \rangle = \big\langle \phi_l \circ \phi_{l-1} \circ \cdots \circ \phi_1(x),\; \phi_l \circ \phi_{l-1} \circ \cdots \circ \phi_1(x') \big\rangle,$$

where $K_l$ denotes the $l$-layer deep kernel and $\phi_l$ denotes the $l$-th layer kernel mapping.
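To make the composite construction above concrete, the following sketch builds an $l$-layer deep kernel Gram matrix by repeatedly applying a composition map $g$ to the previous layer's Gram matrix, using a Gaussian base kernel. The function names are illustrative, and the choice $g(K) = e^{2\gamma(K-1)}$ is the Gaussian-induced composition derived in the two-layer example below; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def gaussian_gram(X, gamma=1.0):
    """Base-layer Gaussian Gram matrix K_1(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def gaussian_compose(K_prev, gamma=1.0):
    """Composition map g induced by a Gaussian layer on top of a normalized kernel
    (K_prev(x, x) = 1): ||phi(x) - phi(x')||^2 = 2 - 2*K_prev(x, x'),
    hence g(K) = exp(-2*gamma*(1 - K)) = exp(2*gamma*(K - 1))."""
    return np.exp(2.0 * gamma * (K_prev - 1.0))

def deep_gaussian_gram(X, gammas):
    """l-layer deep kernel Gram matrix K_l obtained by composing Gaussian layers."""
    K = gaussian_gram(X, gammas[0])
    for gamma in gammas[1:]:
        K = gaussian_compose(K, gamma)
    return K

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    K3 = deep_gaussian_gram(X, gammas=[0.5, 1.0, 2.0])   # a 3-layer deep kernel
    # Every layer preserves positive semi-definiteness and K(x, x) = 1.
    print(np.min(np.linalg.eigvalsh(K3)) >= -1e-8, np.allclose(np.diag(K3), 1.0))
```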
With this method of constructing deep kernels, we can write the $l$-layer DKM as

$$f(x) = \langle W_l, \Phi_l(x) \rangle = \sum_{i=1}^{n} \alpha_i K_l(x, x_i).$$

Let $\mathcal{H}_l$ represent the RKHS induced by the deep kernel $K_l$ and $\|\cdot\|_{\mathcal{H}_l}$ denote the norm in $\mathcal{H}_l$. We define a class of $i$-th layer deep kernels as follows:

$$\mathcal{K}^i_{\mathrm{sin}} = \big\{ K_i(\cdot,\cdot) = g_i\big(K_{i-1}(\cdot,\cdot)\big) \big\},$$

where $K_{i-1}(\cdot,\cdot) \in \mathcal{K}^{i-1}_{\mathrm{sin}}$, $i \in \{2, \dots, l\}$, and $g_i$ is the function induced by $K_i(\cdot,\cdot)$ that composites $K_{i-1}(\cdot,\cdot)$ while ensuring that the composite result is still a valid kernel.

For ease of understanding, we consider the example of a two-layer Gaussian kernel. A Gaussian kernel is typically defined as $K(x, x') = \exp\big(-\gamma \|x - x'\|^2\big)$, where $\gamma > 0$ is the kernel parameter. We can construct a two-layer Gaussian kernel by compositing the nonlinear mapping corresponding to the Gaussian kernel:

$$K_2(x, x') = \langle \Phi_2(x), \Phi_2(x') \rangle = \langle \phi(\phi(x)), \phi(\phi(x')) \rangle = e^{-\gamma \|\phi(x) - \phi(x')\|^2} = e^{-2\gamma(1 - K(x, x'))} = \kappa\, e^{2\gamma K(x, x')},$$

where $\kappa$ is a constant that can be omitted. Hence, the corresponding $g$ function (operation) induced by compositing the Gaussian kernel onto another kernel is $g(\cdot) = \kappa e^{2\gamma(\cdot)}$.

The function class of DKMs can be denoted as

$$\mathcal{F} = \Big\{ x \mapsto \sum_{i=1}^{n} \alpha_i K_l(x, x_i) : \sum_{i,j} \alpha_i \alpha_j K_p(x_i, x_j) \le 1,\; K_p(x_i, x_i) \le A^2,\; K_p(\cdot,\cdot) \in \mathcal{K}^p_{\mathrm{sin}},\; x_i \in \mathcal{X},\; p \in [l] \Big\}.$$

Then the DKL framework can be cast as the following minimization problem:

$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i f(x_i)\big) + \lambda \|f\|_{\mathcal{H}_l}, \qquad (1)$$

where $\ell(\cdot)$ is the loss function and $\lambda$ is a positive regularization parameter. When the loss function is the hinge loss, (1) reduces to deep support vector machines (SVMs). However, the DKMs described above have some limitations: they are often specific, non-automated and inflexible, since only a single type of kernel is involved. Therefore, we further consider allowing a combination of a variety of different kernels when designing deep kernels, which we refer to as DMKL. To this end, we first take the initial set of basis kernel functions as $\mathcal{K}^1_{\mathrm{mul}} = \{K^1, \dots, K^m\}$, and then define a domain (or class) of $i$-th layer deep multiple kernels as follows:

$$\mathcal{K}^i_{\mathrm{mul}} = \Big\{ K_i(\cdot,\cdot) = g_i\Big(h_{i-1}\big(K^1_{i-1}(\cdot,\cdot), \dots, K^m_{i-1}(\cdot,\cdot)\big)\Big) \Big\}, \qquad (2)$$

where $K^t_{i-1}(\cdot,\cdot) \in \mathcal{K}^{i-1}_{\mathrm{mul}}$ ($\forall t \in [m]$), $i \in \{2, \dots, l\}$, $h_{i-1}$ is a function that combines multiple $(i-1)$-layer deep kernels (such as the linear or convex combination of several base kernels in multiple kernel learning), and $g_i$ is a function that composites the result of $h_{i-1}$ while ensuring that the composite result is still a valid kernel. The function class of DMKMs is similar to that of DKMs; the main difference is that $K_p(\cdot,\cdot) \in \mathcal{K}^p_{\mathrm{mul}}$ instead of $\mathcal{K}^p_{\mathrm{sin}}$. Hence, we can formulate the DMKL framework as the following two-layer minimization problem:

$$\min_{K_l \in \mathcal{K}^l_{\mathrm{mul}}} \; \min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i f(x_i)\big) + \lambda \|f\|_{\mathcal{H}_l}, \qquad (3)$$

where $\ell(\cdot)$ is the loss function and $\lambda$ is a positive regularization parameter. As a matter of fact, if the $h_{i-1}$ operation is the identity transformation and the initial kernel domain contains only a single type of kernel, DMKL degenerates into DKL. The architecture of DMKL is shown in Figure 1.

Figure 1. The architecture of the DMKL framework: a deep multiple kernel machine with $l$ layers, where the kernels in the $i$-th layer are drawn from the kernel domain of the $i$-th layer.
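As a concrete illustration of problem (1), the sketch below fits a deep-kernel SVM-style model in the representer form $f(x) = \sum_i \alpha_i K_l(x, x_i)$ by minimizing the regularized hinge loss with plain subgradient descent on $\alpha$. The deep Gram matrix follows the composition sketch above; the solver, step sizes and function names are illustrative assumptions rather than the paper's training algorithm.

```python
import numpy as np

def deep_gaussian_gram(X, gammas):
    # l-layer deep Gaussian kernel from the composition sketch above.
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gammas[0] * np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
    for g in gammas[1:]:
        K = np.exp(2 * g * (K - 1.0))
    return K

def train_deep_kernel_svm(K, y, lam=0.1, lr=0.01, epochs=500):
    """Subgradient descent on alpha for
       (1/n) * sum_i hinge(y_i * f(x_i)) + lam * ||f||_{H_l},
    where f(x) = sum_j alpha_j K_l(x, x_j) and ||f||_{H_l} = sqrt(alpha' K alpha)."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(epochs):
        f = K @ alpha                                  # f(x_i) for all training points
        violated = (y * f < 1.0).astype(float)         # hinge-loss subgradient indicator
        grad_loss = -(K * (violated * y)[None, :]).mean(axis=1)
        norm = np.sqrt(max(alpha @ K @ alpha, 1e-12))  # RKHS norm of f
        grad_reg = lam * (K @ alpha) / norm
        alpha -= lr * (grad_loss + grad_reg)
    return alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 4))
    y = np.sign(np.sum(X[:, :2], axis=1))
    K = deep_gaussian_gram(X, gammas=[0.5, 1.0])       # two-layer deep kernel
    alpha = train_deep_kernel_svm(K, y)
    print("training accuracy:", np.mean(np.sign(K @ alpha) == y))
```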
### 3.2. Related Complexity Measures

In this paper, we use the uniform covering number to bound the sample complexity of DKL. The uniform covering number can in turn be bounded by the pseudo-dimension.

Definition 3.1 (uniform covering number). A subset $C \subseteq \mathcal{G}$ is an $\varepsilon$-cover of a function class $\mathcal{G}$ under the metric $d$ if for any $g \in \mathcal{G}$ and $\varepsilon > 0$ there exists $c \in C$ with $d(g, c) \le \varepsilon$. The covering number $\mathcal{N}(\varepsilon, \mathcal{G}, d)$ is the size of the smallest $\varepsilon$-cover of $\mathcal{G}$. Given a dataset $D$ of size $n$, we define the uniform covering number corresponding to the metric $d_p$ for a function class $\mathcal{G}$ as

$$\mathcal{N}_p(\varepsilon, \mathcal{G}, n) = \max_{|D|=n} \mathcal{N}\big(\varepsilon, \mathcal{G}|_D, d_p\big),$$

where $\mathcal{G}|_D$ denotes the restriction of the function class $\mathcal{G}$ to the dataset $D$ (that is, the set of restrictions to $D$ of all functions in $\mathcal{G}$).

Definition 3.2 (pseudo-dimension). Let $\mathcal{F}$ be a set of functions mapping from $\mathcal{X}$ to $\mathbb{R}$. We say that $D_n = \{x_1, x_2, \dots, x_n\} \subseteq \mathcal{X}$ is pseudo-shattered by $\mathcal{F}$ if there are real numbers $\{r_i \in \mathbb{R} : i \in [n]\}$ such that for any $b \in \{-1, 1\}^n$ there is a function $f \in \mathcal{F}$ with $\mathrm{sgn}(f(x_i) - r_i) = b_i$ for any $i \in [n]$. The pseudo-dimension $d^p_{\mathcal{F}}$ of $\mathcal{F}$ is the maximum cardinality of a set $D_n$ that is pseudo-shattered by $\mathcal{F}$.

In this paper, we also use the Rademacher complexity to provide a lower bound for DKL.

Definition 3.3 (Rademacher complexity). Let $\mathcal{F}$ be a class of real-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}$. Let $D = \{x_1, \dots, x_n\}$ be a set of $n$ i.i.d. samples. The empirical Rademacher complexity of $\mathcal{F}$ is defined by

$$\hat{R}_D(\mathcal{F}) = \mathbb{E}_{\epsilon}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i)\Big],$$

where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher random variables, and we refer to the expectation $R_n(\mathcal{F}) = \mathbb{E}_D[\hat{R}_D(\mathcal{F})]$ as the Rademacher complexity of $\mathcal{F}$.

The Rademacher chaos complexity was introduced into the discussion of learning rates of MKL machines (Ying & Campbell, 2010). Comprehensive studies and significant results justify that the Rademacher chaos complexity is appropriate for treating learning rates (Ying & Campbell, 2009; 2010; Lei & Ding, 2014), and also demonstrate that it inherits the advantages of the Rademacher complexity. Here we use the Rademacher chaos complexity to perform generalization analysis for DMKL.

Definition 3.4 (Rademacher chaos complexity). Let $\mathcal{F}$ be a class of functions mapping from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$. Let $D = \{x_1, \dots, x_n\}$ be a set of $n$ i.i.d. samples. The empirical Rademacher chaos complexity of $\mathcal{F}$ is defined by

$$\hat{U}_D(\mathcal{F}) = \mathbb{E}_{\epsilon}\Big[\sup_{f \in \mathcal{F}} \Big|\frac{1}{n} \sum_{i,j \in [n],\, i<j} \epsilon_i \epsilon_j f(x_i, x_j)\Big|\Big],$$

where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher random variables, and we refer to the expectation $U_n(\mathcal{F}) = \mathbb{E}_D[\hat{U}_D(\mathcal{F})]$ as the Rademacher chaos complexity of $\mathcal{F}$.

## 4. Generalization Bounds Based on the Uniform Covering Number

In this section, we establish generalization error bounds based on the uniform covering number for DKMs. The key step is the following composite relationship of uniform covering numbers, which handles the implicit nonlinear mapping between layers.

Lemma 4.1. Let $\mathcal{F}^{(1)}$ and $\mathcal{F}^{(2)}$ be two layers of the DKM function class (so that every $f_2 \in \mathcal{F}^{(2)}$ satisfies $\|f_2\|_{\mathcal{H}} \le 1$) and let $\mathcal{F} = \mathcal{F}^{(2)} \circ \mathcal{F}^{(1)}$ be their composition. Then for any $\varepsilon > 0$,

$$\mathcal{N}_p(2\varepsilon, \mathcal{F}, n) \le \mathcal{N}_p(\varepsilon, \mathcal{F}^{(1)}, n)\, \mathcal{N}_p(\varepsilon, \mathcal{F}^{(2)}, n).$$

Applying Lemma 4.1 layer by layer, the uniform covering number of the function class of $l$-layer DKMs can be bounded by the pseudo-dimensions of the layers.

Lemma 4.2. Let $\mathcal{F}$ be the function class of $l$-layer DKMs and let $d^p_{\max} = \max_{i \in [l]} d^p_i$, where $d^p_i$ is the pseudo-dimension of the $i$-th layer function class. Then

$$\mathcal{N}_\infty(\varepsilon, \mathcal{F}, n) \le \left(\frac{lenA}{\varepsilon}\right)^{l d^p_{\max}}.$$

With Lemma 4.2 and a standard uniform convergence argument, we obtain the generalization bound for DKMs.

Theorem 4.3. Let $\mathcal{F}$ be a class of functions representing $l$-layer DKMs. Let $\gamma > 0$, $0 < \delta < 1$, and define the $\gamma$-margin cost function by

$$\ell_\gamma(x) = \begin{cases} 0, & \gamma \le x \\ 1 - x/\gamma, & 0 \le x < \gamma \\ 1, & x < 0 \end{cases}.$$

Given a dataset $D$ of size $n$, with probability at least $1-\delta$, the following holds for any $f \in \mathcal{F}$:

$$\mathcal{E}(f) \le \hat{\mathcal{E}}_D(f) + \sqrt{\frac{8\ln(2/\delta) + 8\, l\, d^p_{\max} \ln\!\big(4leAn/\gamma\big)}{n}}.$$

Remark 4.4. The generalization error bounds for the traditional kernel learning problem do not apply to deep kernel learning since the impact of depth is not considered. The explicit nonlinear mapping in deep learning means one is often constrained to perform generalization analysis using a peeling argument, but this is not feasible in DKL. We introduce the composite relationship between layers to eliminate the challenges posed by the implicit nonlinear mapping in DKL, which makes our theoretical results non-trivial extensions of kernel learning. The bound in Theorem 4.3 is expressed in terms of the pseudo-dimension of kernels. The generalization error bound we derive depends only on the depth of the DKM and the number of samples, and the convergence rate is $\tilde{O}(\sqrt{l/n})$.
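The following small numerical sketch shows how the complexity term of Theorem 4.3, as stated above, scales with the depth and the sample size. The constants follow the theorem statement; the specific values of $d^p_{\max}$, $A$, $\gamma$ and $\delta$ are arbitrary choices for illustration only.

```python
import numpy as np

def theorem43_complexity_term(n, l, d_p_max, A=1.0, gamma=0.1, delta=0.05):
    """Complexity term of Theorem 4.3:
       sqrt((8*ln(2/delta) + 8*l*d_p_max*ln(4*l*e*A*n/gamma)) / n)."""
    return np.sqrt((8 * np.log(2 / delta)
                    + 8 * l * d_p_max * np.log(4 * l * np.e * A * n / gamma)) / n)

if __name__ == "__main__":
    d_p_max = 5                       # per-layer pseudo-dimension (illustrative)
    for l in [1, 2, 4, 8, 16]:        # the term grows roughly like sqrt(l), up to logs
        for n in [10**4, 10**6]:
            print(f"l={l:2d}  n={n:>7d}  bound term = "
                  f"{theorem43_complexity_term(n, l, d_p_max):.4f}")
```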
## 5. Generalization Bounds Based on the Rademacher Chaos Complexity

In this section, we are committed to establishing a tight generalization error bound based on the Rademacher chaos complexity for DMKMs. When considering the case of DMKL, the pseudo-dimension cannot capture the complexity information of multiple deep kernels, so we use the kernel pseudo-dimension to measure the complexity of a kernel domain (or class); the kernel pseudo-dimension $d^k(\mathcal{K})$ of a kernel domain $\mathcal{K}$ is the pseudo-dimension of $\mathcal{K}$ viewed as a class of real-valued functions on $\mathcal{X} \times \mathcal{X}$. Let $\mathcal{K}_{\mathrm{mul}}$ be a domain of (multiple) kernel functions on $\mathcal{X} \times \mathcal{X}$ and $D = \{x_1, \dots, x_n\}$. We define pseudo-metrics on $\mathcal{K}_{\mathrm{mul}}$ as follows:

$$d_\infty(f, g) := \max_{1 \le i,j \le n} |f(x_i, x_j) - g(x_i, x_j)|, \qquad d_2(f, g) := \sqrt{\frac{1}{n^2}\sum_{1 \le i,j \le n} \big(f(x_i, x_j) - g(x_i, x_j)\big)^2}.$$

With the composite relationship between layers, the uniform covering numbers of a domain of $l$-layer deep multiple kernels can be bounded by the kernel pseudo-dimension:

Lemma 5.1. Let $\mathcal{K}_{\mathrm{mul}}$ be a domain of $l$-layer deep multiple kernels and let $d^k_{\max} = \max_{i \in [l]} d^k_i$, where $d^k_i$ is the kernel pseudo-dimension of the $i$-th layer kernel domain. Then

$$\mathcal{N}_2(\varepsilon, \mathcal{K}_{\mathrm{mul}}, n) \le \left(\frac{8l^2eA^4}{\varepsilon^2}\right)^{l d^k_{\max}}, \qquad \mathcal{N}_\infty(\varepsilon, \mathcal{K}_{\mathrm{mul}}, n) \le \left(\frac{len^2A^2}{\varepsilon}\right)^{l d^k_{\max}}.$$

Lemma 5.2. Let $\mathcal{F}$ be the class of functions learned from problem (3) with kernel domain $\mathcal{K}_{\mathrm{mul}}$. Given a dataset $D$ of size $n$, let $\gamma > 0$, $0 < \delta < 1$ and let the loss function be the $\gamma$-margin cost function. Then, with probability at least $1-\delta$, the following holds for any $f \in \mathcal{F}$:

$$\mathcal{E}(f) \le \hat{\mathcal{E}}_D(f) + \frac{4}{\lambda\gamma}\sqrt{\frac{2U_n(\mathcal{K}_{\mathrm{mul}})}{n}} + \frac{4A}{\lambda\gamma\sqrt{n}} + \frac{2}{\sqrt{n}} + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$

Proof Sketch. Let $B_{\mathcal{K}_{\mathrm{mul}}}$ be the union of the unit balls of the RKHSs, $B_\lambda = \frac{1}{\lambda}B_{\mathcal{K}_{\mathrm{mul}}}$, and $\Phi(D) = \sup_{f \in B_\lambda} |\mathcal{E}(f) - \hat{\mathcal{E}}_D(f)|$. We first have $\Phi(D) \le \mathbb{E}_D[\Phi(D)] + \sqrt{\ln(1/\delta)/2n}$ by McDiarmid's inequality. Then, with the contraction property of Rademacher complexities and the reproducing property of kernels, we derive an upper bound on $\mathbb{E}_D[\Phi(D)]$. Substituting this upper bound into the above inequality, the desired bound is immediate.

With the composite relationship between layers, we can bound the Rademacher chaos complexity by the kernel pseudo-dimension for DMKMs:

Lemma 5.3. Let $\mathcal{K}_{\mathrm{mul}}$ be a domain of $l$-layer deep multiple kernels. For any $D = \{x_1, \dots, x_n\}$, there holds

$$U_n(\mathcal{K}_{\mathrm{mul}}) \le 220eA^2 l\, d^k_{\max} \ln l.$$

Proof Sketch. Combining Lemma 5.1 with the relationship between the uniform covering number and the Rademacher chaos complexity, the desired bound is derived by appropriate scaling.

With Lemma 5.2 and Lemma 5.3, the generalization error bound based on the Rademacher chaos complexity for DMKMs is given as follows:

Theorem 5.4. Let $\mathcal{F}$ be a class of functions representing $l$-layer DMKMs. Given a dataset $D$ of size $n$, let $\gamma > 0$, $0 < \delta < 1$ and let the loss function be the $\gamma$-margin cost function. Then, with probability at least $1-\delta$, there holds for any $f \in \mathcal{F}$:

$$\mathcal{E}(f) \le \hat{\mathcal{E}}_D(f) + \frac{8}{\lambda\gamma}\sqrt{\frac{110eA^2 l\, d^k_{\max} \ln l}{n}} + \frac{4A}{\lambda\gamma\sqrt{n}} + \frac{2}{\sqrt{n}} + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$

Remark 5.5. The theorems above reveal that the generalization analysis of deep kernel learning does not simply amount to bounding the (kernel) pseudo-dimension of each layer; one must also clarify the composite relationship of uniform covering numbers, which makes our bounds with square-root dependence (outside of log terms) on the depth non-trivial. The bound for DMKMs is tighter than that for DKMs, with a faster convergence rate $O(\sqrt{l\ln l/n})$, since the kernel pseudo-dimension is used to bound the Rademacher chaos complexity. Below we further show that the kernel pseudo-dimension of DMKMs is related to the number of basis kernel functions. Compared with the results of Ying & Campbell (2009), we extend the theoretical bounds of multiple kernel learning to the deep case (i.e., deep multiple kernel learning), give the generalization bound corresponding to the deep multiple kernel function class produced by multiple combinations and compositions of kernels, and show how to estimate the complexity of this class. Although the bound has some dependence on the parameter $\lambda$, our focus here is to show that at least some compromise assumption leads to bounds with square-root dependence (outside of log terms) on the depth, and we hope this dependence can be improved in future work.
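The empirical Rademacher chaos complexity of Definition 3.4 can be estimated directly by Monte Carlo for a small, finite kernel family. The sketch below does this for a handful of two-layer deep Gaussian kernels with different bandwidths; the family, the bandwidth grid and the function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def deep_gaussian_gram(X, gammas):
    # Deep Gaussian kernel Gram matrix (see the composition sketch in Section 3.1).
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gammas[0] * np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
    for g in gammas[1:]:
        K = np.exp(2 * g * (K - 1.0))
    return K

def empirical_rademacher_chaos(grams, n_mc=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher chaos complexity of a
    finite kernel family (Definition 3.4):
       U_hat = E_eps sup_K | (1/n) * sum_{i<j} eps_i eps_j K(x_i, x_j) |."""
    rng = np.random.default_rng(seed)
    n = grams[0].shape[0]
    iu = np.triu_indices(n, k=1)                    # index pairs i < j
    vals = np.stack([K[iu] for K in grams])         # one row per kernel in the family
    total = 0.0
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)
        chaos = (eps[:, None] * eps[None, :])[iu]   # eps_i * eps_j for i < j
        total += np.max(np.abs(vals @ chaos)) / n
    return total / n_mc

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 3))
    # A small family of 2-layer deep multiple kernels with different bandwidths.
    family = [deep_gaussian_gram(X, [g1, g2]) for g1 in (0.2, 1.0) for g2 in (0.5, 2.0)]
    print("estimated U_hat_D:", empirical_rademacher_chaos(family))
```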
## 6. Estimating the Uniform Covering Number and the Rademacher Chaos Complexity

In this section, for both DKMs and DMKMs, we further estimate the uniform covering number, which can be bounded by the pseudo-dimension, and the Rademacher chaos complexity, which can be bounded by the kernel pseudo-dimension, for some common classes.

### 6.1. Estimating the Uniform Covering Number

For $l$-layer DKMs, we bound the pseudo-dimension of each layer and then further bound the uniform covering number. The pseudo-dimension of kernels is as follows:

Theorem 6.1. Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function and let $\Phi: \mathcal{X} \to \mathcal{H}$ be a feature mapping associated with $K$. Let $D \subseteq \{x : K(x, x) \le r^2\}$ be a dataset of size $n$, and

$$\mathcal{F} = \big\{ x \mapsto \langle w, \Phi(x) \rangle : \min_{x \in D} |\langle w, \Phi(x) \rangle| = 1,\ \|w\|_{\mathcal{H}} \le \Lambda \big\}$$

for some $\Lambda \ge 0$. Then the pseudo-dimension $d^p$ of the function class $\mathcal{F}$ satisfies $d^p \le r^2\Lambda^2$.

Proof Sketch. We first obtain the relationship between $d^p$ and $\Lambda$ from the definition of the pseudo-dimension and introduce the expectation over $y = (y_1, \dots, y_{d^p}) \in \{-1, 1\}^{d^p}$. Then, using Jensen's inequality and the property of convex functions, the desired bound is derived.

### 6.2. Estimating the Rademacher Chaos Complexity

The above theoretical results reveal that the Rademacher chaos complexity of DMKMs is mainly affected by the depth and the kernel pseudo-dimension, which measures the complexity of a kernel domain $\mathcal{K}_{\mathrm{mul}}$. Moreover, different kernel domains correspond to different kernel pseudo-dimensions. Here, for some common kernel domains, we bound the kernel pseudo-dimension and further estimate the Rademacher chaos complexity corresponding to these specific classes of DMKMs.

According to the definition of kernel domains, the architecture of DMKMs is controlled by the combinations of multiple kernels in the $(i-1)$-th layer and the composition applied by the $i$-th layer to the combined results of the $(i-1)$-th layer, i.e., $h_{i-1}$ and $g_i$. For a kernel set $S$ with $k$ basis kernels, the common types of combination, i.e., the commonly used $h_{i-1}$ operations, are span, linear, and convex combinations:

$$\mathcal{K}_{\mathrm{span}}(S) \stackrel{\mathrm{def}}{=} \Big\{ \sum_{i=1}^{k} \lambda_i K_i \;\Big|\; \lambda_i \in \mathbb{R} \Big\}, \quad \mathcal{K}_{\mathrm{line}}(S) \stackrel{\mathrm{def}}{=} \Big\{ \sum_{i=1}^{k} \lambda_i K_i \;\Big|\; K_\lambda \succeq 0 \Big\}, \quad \mathcal{K}_{\mathrm{conv}}(S) \stackrel{\mathrm{def}}{=} \Big\{ \sum_{i=1}^{k} \lambda_i K_i \;\Big|\; \lambda_i \ge 0,\ \sum_{i=1}^{k}\lambda_i = 1 \Big\}.$$

It can be seen that $\mathcal{K}_{\mathrm{conv}}(S) \subseteq \mathcal{K}_{\mathrm{line}}(S) \subseteq \mathcal{K}_{\mathrm{span}}(S)$. We then consider the compositions produced by commonly used kernel functions, such as the polynomial kernel $K(x, y) = (x^\top y)^d$, the Gaussian kernel $K(x, y) = \exp(-\gamma\|x - y\|^2)$, and the Laplacian kernel $K(x, y) = \exp(-\mu\|x - y\|)$. The compositions of these kernels on top of other kernels yield the corresponding $g_i$ operations in the definition of kernel domains as follows:

$$g^i_{\mathrm{Pol}}(K) = K^d,\ d > 1; \qquad g^i_{\mathrm{Gau}}(K) = \kappa e^{2\gamma K},\ \kappa, \gamma > 0; \qquad g^i_{\mathrm{Lap}}(K) = e^{2\mu^2(K-1)},\ \mu > 0. \qquad (4)$$

Obviously, these $g_i$ operations are monotone (non-decreasing) with respect to their input kernels. With the above discussion on the methods of combination and composition, we can bound the Rademacher chaos complexity for DMKMs as follows:

Theorem 6.2. Let $\mathcal{K}_{\mathrm{mul}}$ be a domain of $l$-layer deep kernels. Given the initial set of basis kernels $\mathcal{K}^1 = \{K_1, \dots, K_m\}$, for any $i \in \{2, \dots, l\}$ let $\mathcal{K}^i_{\mathrm{mul}}$ be the kernel domain of $i$-th layer deep kernels. If the $h_{i-1}$ operation is a span combination and the $g_i$ operation is non-decreasing with respect to its input kernels, then the kernel pseudo-dimension of the domain $\mathcal{K}^i_{\mathrm{mul}}$ satisfies $d^k(\mathcal{K}^i_{\mathrm{mul}}) \le m$ and the Rademacher chaos complexity of $\mathcal{K}_{\mathrm{mul}}$ is bounded by

$$U_n(\mathcal{K}_{\mathrm{mul}}) \le 220eA^2 l (\ln l)\, m.$$

Proof Sketch. According to the definition of kernel domains, combined with the monotonicity of the kernel pseudo-dimension, the desired bounds can be derived.

Remark 6.3. If the given initial basis kernel set contains only a single type of kernel, i.e., $m = 1$, DMKMs degenerate into DKMs. Then we can immediately bound the Rademacher chaos complexity for DKMs: $U_n(\mathcal{K}_{\mathrm{sin}}) \le 220eA^2 l \ln l$. The Rademacher chaos complexity for DKMs thus leads to a generalization bound with convergence rate $O(\sqrt{l\ln l/n})$, which is tighter than Theorem 4.3. The main reason is that the two results use different complexity measures: the Rademacher chaos complexity takes the data distribution into account to some extent.
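To illustrate the combination and composition operations discussed in Section 6.2, the sketch below builds one deep multiple-kernel layer by applying a convex combination $h$ to a set of Gaussian basis Gram matrices and then a monotone Gaussian-induced composition $g$ of the form in (4), and checks that the result remains a valid (PSD) kernel. The choice of basis kernels, weights and bandwidths is an illustrative assumption.

```python
import numpy as np

def base_grams(X, gammas=(0.2, 0.5, 2.0)):
    """Initial basis kernel set K^1 = {K_1, ..., K_m}: Gaussian kernels with
    different bandwidths (an illustrative choice of basis kernels)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    return [np.exp(-g * d2) for g in gammas]

def h_convex(grams, weights):
    """Convex combination h: sum_t w_t * K_t with w_t >= 0 and sum_t w_t = 1."""
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0)
    w = w / w.sum()
    return sum(wt * K for wt, K in zip(w, grams))

def g_gaussian(K, gamma=1.0):
    """Monotone composition from (4) induced by a Gaussian layer: g(K) = exp(2*gamma*(K - 1))."""
    return np.exp(2 * gamma * (K - 1.0))

def dmkm_layer_gram(grams, weights, gamma=1.0):
    """One deep multiple-kernel layer: K_i = g_i(h_{i-1}(K^1_{i-1}, ..., K^m_{i-1}))."""
    return g_gaussian(h_convex(grams, weights), gamma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 4))
    layer1 = base_grams(X)
    layer2 = dmkm_layer_gram(layer1, weights=[0.5, 0.3, 0.2], gamma=1.0)
    # The combined-and-composed Gram matrix should remain positive semi-definite.
    print("min eigenvalue:", np.linalg.eigvalsh(layer2).min())
```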
## 7. A Lower Bound Based on the Rademacher Complexity

In this section, we provide a lower bound based on the Rademacher complexity for the DKMs studied here. The formal result is the following:

Theorem 7.1. Let $\mathcal{F}$ be a class of functions representing $l$-layer DKMs, taking values in $\{-1, 1\}$. Given a dataset $D$ of size $n$, there exists a constant $c > 0$ such that

$$\hat{R}_D(\mathcal{F}) \ge \frac{cA}{\sqrt{n}}.$$

Remark 7.2. The lower bound on $\hat{R}_D(\mathcal{F})$ is $\Omega(1/\sqrt{n})$, which results in a lower generalization bound of order $\Omega(1/\sqrt{n})$. Our derived upper bounds avoid exponential dependence on the depth; although not depth-independent, they are square-root dependent on the depth. The depth-independent lower bound in Theorem 7.1 demonstrates that the capacity-based upper bounds based on the uniform covering number are nearly tight.
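The quantity bounded from below in Theorem 7.1 can be estimated directly for the RKHS unit ball, since by the reproducing property $\sup_{\|f\|_{\mathcal{H}} \le 1} \sum_i \epsilon_i f(x_i) = \sqrt{\sum_{i,j}\epsilon_i\epsilon_j K(x_i, x_j)}$, the same identity used in the proof in Appendix E.1. The sketch below Monte Carlo-averages this closed form for a two-layer deep Gaussian kernel and compares it with $A/\sqrt{n}$; the kernel and sample sizes are illustrative choices, not the paper's experiments.

```python
import numpy as np

def deep_gaussian_gram(X, gammas):
    # Deep Gaussian kernel Gram matrix (see the composition sketch in Section 3.1).
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gammas[0] * np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
    for g in gammas[1:]:
        K = np.exp(2 * g * (K - 1.0))
    return K

def empirical_rademacher_unit_ball(K, n_mc=5000, seed=0):
    """Estimate the empirical Rademacher complexity of the RKHS unit ball of K:
       E_eps (1/n) * sup_{||f||_H <= 1} sum_i eps_i f(x_i)
         = E_eps (1/n) * sqrt(eps' K eps)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
    return np.mean(np.sqrt(np.einsum("mi,ij,mj->m", eps, K, eps)) / n)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    for n in [50, 200, 800]:
        X = rng.normal(size=(n, 3))
        K = deep_gaussian_gram(X, gammas=[0.5, 1.0])
        A = np.sqrt(np.max(np.diag(K)))          # here K(x, x) = 1, so A = 1
        print(f"n={n:4d}  R_hat ~ {empirical_rademacher_unit_ball(K):.4f}"
              f"   A/sqrt(n) = {A/np.sqrt(n):.4f}")
```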
## 8. Discussion

Our capacity-based generalization bounds provide general worst-case theoretical guarantees for the generalization of DKL. The reason our generalization bounds do not appear to be ideally non-vacuous (i.e., less than 1) is that they characterize the complexity of the whole hypothesis space rather than the effective hypothesis space learned by the algorithm. The effective hypothesis space is significantly smaller than the whole hypothesis space, so much tighter generalization bounds can be expected there. However, such tight generalization bounds are algorithm-based, reflecting the implicit regularization of the optimization algorithm, and they cannot provide general theoretical guarantees for deep learning.

A major challenge in obtaining non-trivial capacity-based theoretical results is that when the deep model goes beyond two layers, generalization bounds that originally shrink with increasing width become vacuous once the depth is introduced, so the heavy dependence on the depth should at least be reduced to something weaker than linear. The common capacity-based generalization analysis method is the peeling argument, i.e., the complexity bound for $l$-layer networks is reduced to a complexity bound for $(l-1)$-layer networks. In this method, each reduction introduces, due to scaling, a product factor consisting of a constant related to the Lipschitz property of the activation function and an upper bound on the norm of the weight matrix. After applying the reduction $l$ times, the multiplication of these product factors causes exponential dependence on the depth and makes the bound vacuous. Only strong mathematical techniques and assumptions yield tight bounds with weak dependence on the depth. The success of the peeling argument is premised on 1) an explicit nonlinear mapping; in addition, 2) norm-based assumptions and 3) assumptions on the properties of activation functions are sufficient conditions for obtaining non-vacuous generalization bounds. Obviously, the implicit mapping of DKL directly rules out the peeling argument. However, our generalization analysis based on the composite relationship of hypothesis spaces provides a good demonstration of how to overcome this challenge.

Since existing training algorithms for DKMs are difficult to generalize, how to train DKMs/DMKMs automatically and efficiently is still an urgent issue to be studied. However, there is a telling example of DMKMs that illustrates the difficulty of generalization analysis. Xue et al. (2019) and Li et al. (2020) proposed the (deep) spectral kernel network (DSKN), which uses the inverse Fourier transform to make the implicit mapping explicit; this makes it possible to train DKMs easily with gradient descent. Li et al. (2020) gave a bound for spectral kernel models based on the Rademacher complexity. Li et al. (2022) extended the bound to deep (convolutional) spectral kernel models, which is depth-independent under certain assumptions on the kernels. When these assumptions are not satisfied, or even if some method is used to make the kernel mappings explicit but the kernels involved in DKMs are not spectral kernels, the theoretical results in (Li et al., 2022) no longer apply. Although DSKN satisfies the premise of the peeling argument, this explicit mapping is relatively fixed, and it is still difficult to enforce conditions 2) and 3). In that case, in order to upper bound the Rademacher complexity of deep kernel classes, one would need an analysis method similar to that used in deep learning, i.e., the peeling argument, which would still result in a bound that is exponentially dependent on the depth. Fortunately, our generalization analysis based on the composite relationship of hypothesis spaces provides a good demonstration of how to overcome this challenge, reducing such exponential depth dependence to a square-root one. We provide general and efficient theoretical guarantees with square-root dependence on the depth for DKL/DMKL.

## 9. Conclusion

In this paper, we propose novel and nearly-tight capacity-based bounds on the uniform covering number and the Rademacher chaos complexity for DKL. We then bound pseudo-dimensions and kernel pseudo-dimensions for some common classes and further estimate their uniform covering numbers and Rademacher chaos complexities, respectively. In future work, we will extend our bounds to more general settings (especially more types of combinations and compositions of multiple kernels), derive tighter capacity-based generalization bounds for DKL with weaker dependence on the depth (i.e., log-dependent or even depth-independent), and further design general and efficient deep kernel training algorithms to construct end-to-end DKMs/DMKMs with good generalization performance.

## Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (62225602).

## References

Al-Shedivat, M., Wilson, A. G., Saatchi, Y., Hu, Z., and Xing, E. P. Learning scalable deep kernels with recurrent structure. Journal of Machine Learning Research, 18(82):1-37, 2017.

Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
Bartlett, P. L. and Williamson, R. C. The VC dimension and pseudodimension of two-layer neural networks with discrete inputs. Neural Computation, 8(3):625-628, 1996.

Bartlett, P. L., Maiorov, V., and Meir, R. Almost linear VC dimension bounds for piecewise polynomial networks. Advances in Neural Information Processing Systems, 11 (NIPS 1998):190-196, 1998.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30 (NIPS 2017):6240-6249, 2017.

Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1-17, 2019.

Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. Proceedings of Machine Learning Research, 80 (ICML 2018):540-548, 2018.

Bietti, A. and Mairal, J. On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems, 32 (NeurIPS 2019):12873-12884, 2019.

Bietti, A., Mialon, G., Chen, D., and Mairal, J. A kernel perspective for regularizing deep neural networks. Proceedings of Machine Learning Research, 97 (ICML 2019):664-674, 2019.

Bohn, B., Rieger, C., and Griebel, M. A representer theorem for deep kernel learning. Journal of Machine Learning Research, 20(64):1-32, 2019.

Cho, Y. and Saul, L. K. Kernel methods for deep learning. Advances in Neural Information Processing Systems, 22 (NIPS 2009):342-350, 2009.

Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), 2021.

Goldberg, P. W. and Jerrum, M. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18(2-3):131-148, 1995.

Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. International Conference on Computational Learning Theory, 75 (COLT 2018):297-299, 2018.

Graf, F., Zeng, S., Niethammer, M., and Kwitt, R. On measuring excess capacity in neural networks. arXiv:2202.08070v2, 2022.

Hazan, T. and Jaakkola, T. S. Steps toward deep kernel methods from infinite neural networks. arXiv:1508.05133v2, 2015.

He, F., Liu, T., and Tao, D. Control batch size and learning rate to generalize well: Theoretical and empirical evidence. Advances in Neural Information Processing Systems, 32 (NeurIPS 2019):1141-1150, 2019.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31 (NeurIPS 2018):8571-8580, 2018.

Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. E. Similarity of neural network representations revisited. Proceedings of Machine Learning Research, 97 (ICML 2019):3519-3529, 2019.

Ledent, A., Mustafa, W., Lei, Y., and Kloft, M. Norm-based generalisation bounds for multi-class convolutional neural networks. Proceedings of the 35th AAAI Conference on Artificial Intelligence, 35 (AAAI 2021):8279-8287, 2021.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian processes. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.
Lei, Y. and Ding, L. Refined Rademacher chaos complexity bounds with applications to the multi-kernel learning problem. Neural Computation, 26(4):739-760, 2014.

Lei, Y. and Ying, Y. Sharper generalization bounds for learning with gradient-dominated objective functions. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), 2021.

Li, J., Liu, Y., and Wang, W. Automated spectral kernel learning. Proceedings of the 34th AAAI Conference on Artificial Intelligence, 34 (AAAI 2020):4618-4625, 2020.

Li, J., Liu, Y., and Wang, W. Convolutional spectral kernel learning with generalization guarantees. Artificial Intelligence, 313:103803, 2022.

Mairal, J. End-to-end kernel learning with supervised convolutional kernel networks. Advances in Neural Information Processing Systems, 29 (NIPS 2016):1399-1407, 2016.

Mairal, J., Koniusz, P., Harchaoui, Z., and Schmid, C. Convolutional kernel networks. Advances in Neural Information Processing Systems, 27 (NIPS 2014):2627-2635, 2014.

Montavon, G., Braun, M. L., and Müller, K. Kernel analysis of deep networks. Journal of Machine Learning Research, 12:2563-2581, 2011.

Nagarajan, V. and Kolter, J. Z. Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. International Conference on Computational Learning Theory, 40 (COLT 2015):1376-1401, 2015.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 30 (NIPS 2017):5947-5956, 2017.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.

Song, H., Thiagarajan, J. J., Sattigeri, P., Ramamurthy, K. N., and Spanias, A. A deep learning approach to multiple kernel fusion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), pp. 2292-2296, 2017.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.

Srebro, N. and Ben-David, S. Learning bounds for support vector machines with learned kernels. International Conference on Computational Learning Theory (COLT 2006):169-183, 2006.

Suzuki, T. Fast generalization error bound of deep learning from a kernel perspective. Proceedings of Machine Learning Research, 84 (AISTATS 2018):1397-1406, 2018.

Wang, J., Feng, K., and Wu, J. SVM-based deep stacking networks. The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019):5273-5280, 2019.

Wei, C. and Ma, T. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. Advances in Neural Information Processing Systems, 32 (NeurIPS 2019):9722-9733, 2019.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Stochastic variational deep kernel learning. Advances in Neural Information Processing Systems, 29 (NIPS 2016):2586-2594, 2016a.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. Proceedings of Machine Learning Research, 51 (AISTATS 2016):370-378, 2016b.

Xue, H., Wu, Z., and Sun, W.
Deep spectral kernel learning. International Joint Conference on Artificial Intelligence (IJCAI 2019):4019-4025, 2019.

Ying, Y. and Campbell, C. Generalization bounds for learning the kernel problem. In International Conference on Computational Learning Theory (COLT 2009), 2009.

Ying, Y. and Campbell, C. Rademacher chaos complexities for learning the kernel problem. Neural Computation, 22(11):2858-2886, 2010.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.

Zhuang, J., Tsang, I. W., and Hoi, S. C. Two-layer multiple kernel learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), pp. 909-917, 2011.

## A. Appendix Outline

In the appendix, we give detailed proofs of the theoretical results for which we only provide proof sketches in the main paper. Our main proofs include:

- The relationship between uniform covering numbers of composite function classes (Lemma 4.1).
- The bound of the uniform covering number for DKMs (Lemma 4.2).
- The generalization bound based on the uniform covering number for DKMs (Theorem 4.3).
- Bounds of uniform covering numbers for DMKMs (Lemma 5.1).
- The generalization bound based on the Rademacher chaos complexity for kernels (Lemma 5.2).
- The bound of the Rademacher chaos complexity for DMKMs (Lemma 5.3).
- The pseudo-dimension of kernels (Theorem 6.1).
- The kernel pseudo-dimension of multiple deep kernels (Theorem 6.2).
- The lower bound of the Rademacher complexity for DKMs (Theorem 7.1).

## B. Generalization Bounds Based on the Uniform Covering Number

### B.1. Proof of Lemma 4.1

Proof. Given a dataset $D$ with $n$ training samples drawn independently from a probability distribution $P$ on $\mathcal{X} \times \mathbb{R}$. According to the definition of $\mathcal{F}$, we have

$$\mathcal{F}|_D = \big\{ (f_2(f_1(x_1)), \dots, f_2(f_1(x_n))) \;\big|\; f_1 \in \mathcal{F}^{(1)}, f_2 \in \mathcal{F}^{(2)} \big\} \subseteq \big\{ (f_2(z_1), \dots, f_2(z_n)) \;\big|\; f_2 \in \mathcal{F}^{(2)},\ (z_1, \dots, z_n) \in \mathcal{F}^{(1)}|_D \big\}.$$

Hence, we have $\mathcal{F}|_D \subseteq \mathcal{F}^{(2)}|_D \circ \mathcal{F}^{(1)}|_D$. Let $C_{\mathcal{F}^{(1)}|_D} \subseteq \mathcal{F}^{(1)}|_D$ be a smallest $\varepsilon$-cover of $\mathcal{F}^{(1)}|_D$, i.e., $\mathrm{card}(C_{\mathcal{F}^{(1)}|_D}) = \mathcal{N}(\varepsilon, \mathcal{F}^{(1)}|_D, d_{\mathcal{H}_1})$. For any element $c_1 \in C_{\mathcal{F}^{(1)}|_D}$, let $C_{\mathcal{F}^{(2)}|_D}(c_1) \subseteq \mathcal{F}^{(2)}|_D$ be a smallest $\varepsilon$-cover of $\mathcal{F}^{(2)}|_D$, i.e., $\mathrm{card}(C_{\mathcal{F}^{(2)}|_D}(c_1)) = \mathcal{N}(\varepsilon, \mathcal{F}^{(2)}|_D, d_{\mathcal{H}_2})$.

According to the reproducing property of the RKHS, combined with the Cauchy-Schwarz inequality, we have $|f(x) - f(x')| = |\langle f, \Phi(x) - \Phi(x')\rangle_{\mathcal{H}}| \le \|f\|_{\mathcal{H}}\, d_{\mathcal{H}}(x, x')$. This means that functions in the RKHS fulfill a Lipschitz-like condition, with Lipschitz constant given by the norm $\|f\|_{\mathcal{H}}$. Since $\|f\|^2_{\mathcal{H}} = \sum_{i,j}\alpha_i\alpha_j K(\cdot,\cdot) \le 1$, without loss of generality we can conclude that every $f_2 \in \mathcal{F}^{(2)}|_D$ is 1-Lipschitz. Let $c_2 \in C_{\mathcal{F}^{(2)}|_D}(c_1)$; according to the definitions, for any $f_1 \in \mathcal{F}^{(1)}|_D$ and any $f_2 \in \mathcal{F}^{(2)}|_D$, we have $\|f_1 - c_1\|_{\mathcal{H}_1} \le \varepsilon$ and $\|f_2 - c_2\|_{\mathcal{H}_2} \le \varepsilon$. Since

$$\|f_2 \circ f_1 - c_2 \circ c_1\|_{\mathcal{H}_1} = \|(f_2 \circ f_1 - f_2 \circ c_1) + (f_2 \circ c_1 - c_2 \circ c_1)\|_{\mathcal{H}_1} \le \|f_2 \circ f_1 - f_2 \circ c_1\|_{\mathcal{H}_1} + \|f_2 \circ c_1 - c_2 \circ c_1\|_{\mathcal{H}_1} \le \|f_1 - c_1\|_{\mathcal{H}_1} + \|f_2 - c_2\|_{\mathcal{H}_2} \le 2\varepsilon,$$

then we have that $C_{\mathcal{F}|_D} = \{c_2 \circ c_1 : c_1 \in C_{\mathcal{F}^{(1)}|_D}, c_2 \in C_{\mathcal{F}^{(2)}|_D}(c_1)\} \subseteq \mathcal{F}|_D$ is a $2\varepsilon$-cover of $\mathcal{F}|_D$. Hence,

$$\mathcal{N}(2\varepsilon, \mathcal{F}|_D, d_{\mathcal{H}_1}) \le \sum_{c_1 \in C_{\mathcal{F}^{(1)}|_D}} \mathrm{card}\big(C_{\mathcal{F}^{(2)}|_D}(c_1)\big) \le \mathrm{card}\big(C_{\mathcal{F}^{(1)}|_D}\big)\, \mathrm{card}\big(C_{\mathcal{F}^{(2)}|_D}(c_1)\big) \le \mathcal{N}(\varepsilon, \mathcal{F}^{(1)}|_D, d_{\mathcal{H}_1})\, \mathcal{N}(\varepsilon, \mathcal{F}^{(2)}|_D, d_{\mathcal{H}_2}) \le \max_{|D|=n} \mathcal{N}(\varepsilon, \mathcal{F}^{(1)}|_D, d_{\mathcal{H}_1}) \cdot \max_{|D|=n} \mathcal{N}(\varepsilon, \mathcal{F}^{(2)}|_D, d_{\mathcal{H}_2}) = \mathcal{N}_p(\varepsilon, \mathcal{F}^{(1)}, n)\, \mathcal{N}_p(\varepsilon, \mathcal{F}^{(2)}, n).$$

Since $D$ is arbitrary, this completes the proof.

### B.2. Proof of Lemma 4.2

Proof.
We start with the following lemma that bounds the d -covering numbers by a quantity involving the pseudodimension. Lemma B.1 (Theorem 12.2 of (Anthony & Bartlett, 1999)). Let F be a set of real functions from a domain X to the bounded interval [0, r]. Let ε > 0 and suppose that the pseudo-dimension of F is dp. Then the following holds for n dp N (ε, F, n) enr Since Np(2ε, F, n) Np(ε, F(1), n) Np(ε, F(2), n) holds for F = F(2) F(1), let F(i) = F(i 1) F(2) F(1), we denote the pseudo-dimension of the i-th layer function class F(i) as dp i 1, i [l], let dp max = maxi{dp i }, then for the function class of l-layer DKMs, we have N (lε, F, n) =N (lε, F(l) F(l) , n) N (ε, F(l), n) N ((l 1)ε, F(l) , n) N (ε, F(l), n) N (ε, F(l 1), n) N (ε, F(l 1) , n) i=1 N (ε, F(i), n) Therefore, N (ε, F, n) Ql i=1 lenri ε dp i . Taking the logarithm of the above inequality on both sides, ln N (ε, F, n) l ln len A dp max = ln len A N (ε, F, n) len A Nearly-tight Bounds for Deep Kernel Learning B.3. Proof of Theorem 4.3 Proof. We start with the following lemma that proves a uniform convergence result for a class of real-valued functions. Lemma B.2 (Theorem 10.1 of (Anthony & Bartlett, 1999)). Suppose that F is a set of real-valued functions defined on the domain X. Let P be any probability distribution on X { 1, 1}. Given a dataset D of size n. Let ε be any real number between 0 and 1, the loss function be the γ-margin cost function. Then for f F P n E(f) ˆED(f) + ϵ o 2N (γ/2, F, 2n) exp ϵ2n For the function class of l-layer DKMs, since N (ε, F, n) len A ε ldp max, we have 2N (γ/2, F, 2n) exp ϵ2n ldp max exp ϵ2n Let 2 4len A γ ldp max exp ϵ2n 8 = δ, then δ + 8ldp max ln( 4le An Substituting equation (5) into the above lemma, we complete the proof. C. Generalization Bounds Based on the Rademacher Chaos Complexity C.1. Proof of Lemma 5.1 Proof. For any 0 < ε r2 i , we have dk i 1, r4 i ε2 1 and en2r2 i ε 1, where dk i represents the kernel pseudo-dimension of the domain (class) K(i) mul of the i-th layer kernels and i [l]. Let dk max = maxi{dk i }. Then, with the following lemmas, Lemma C.1 (Theorem 3 in (Ying & Campbell, 2010)). If the kernel pseudo-dimension dk K of the set of basis kernels is finite, then we have that N (ε, K, d2) 2 4er4 Lemma C.2 (Lemma 3 in (Srebro & Ben-David, 2006)). For any kernel family K with kernel pseudo-dimension dk K: N (ε, K, n) en2r2 we have that N2(ε,K(i) mul, n) 2 4er4 i ε2 dk i 8er4 i ε2 N (ε, K(i) mul, n) en2r2 i ε Since Np(2ε, F, n) Np(ε, F(1), n) Np(ε, F(2), n) holds for F = F(1) F(2), then for a domain Kmul of l-layer deep kernels, we have N2(lε, Kmul, n) i=1 N2(ε, K(i) mul, n) N (lε, Kmul, n) i=1 N (ε, K(i) mul, n) Nearly-tight Bounds for Deep Kernel Learning Taking the logarithm of the above inequality on both sides, ln N2(ε, Kmul, n) l ln 8l2e A4 dk max = ln 8l2e A4 N2(ε, Kmul, n) 8l2e A4 Similarly, one can bound N (ε, Kmul, n) len2A2 C.2. Proof of Lemma 5.2 Proof. We denote the union of the unit balls of RKHSs as BKmul := {f : f HK and f HK 1, K Kmul} . For some RKHS HK, we have i=1 ψ (yif (xi)) + λ f HK 1 i=1 ψ(0) + λ 0 HK = 1, f = arg min K Kmul,f HK i=1 ψ (yif (xi)) + λ f HK. Hence, f HK 1/λ. This implies, for any samples in D, that λBKmul := f λ : f BKmul For any training set D = {(xi, yi) : i [n]}, let D = {(xi, yi) : i [n]} be the training set with only one sample different from D, where the k-th sample is replaced by (x k, y k). 
Let Φ(D) = supf Bλ E(f) ˆED(f) , then |Φ (D ) Φ(D)| E(f) ˆED (f) sup f Bλ E(f) ˆED(f) ˆED(f) ˆED (f) |ψ (ykf (xk)) ψ (y kf (x k))| n According to Mc Diarmid s inequality, for any 0 < δ < 1, with probability at least 1 δ over the training dataset D, the following holds: Φ(D) ED[Φ(D)] + Nearly-tight Bounds for Deep Kernel Learning Then, we will estimate the upper bound of ED[Φ(D)]. ED h ˆED (f) ˆED(f) i ˆED (f) ˆED(f) i=1 ψ (y if (x i)) ψ (yif (xi)) i=1 ϵi (ψ (y if (x i)) ψ (yif (xi))) i=1 ϵiψ (y if (x i)) i=1 ϵiψ (yif (xi)) i=1 ϵiψ (yif (xi)) To this end, we introduce the decomposition of γ-margin cost function that ψ(x) = ψ(x) ψ(0) : R R, which has the Lipschitz constant 1 γ and ψ(0) = 0 (ψ(0) = 1). Then, applying the contraction property of Rademacher complexities, with probability at least 1 δ, the following holds: i=1 ϵiψ (yif (xi)) i=1 ϵi ψ (yif (xi)) i=1 ϵif (xi) i=1 ϵif (xi) Hence, we will bound Eϵ,D supf Bλ 1 n |Pn i=1 ϵif (xi)| . i=1 ϵif (xi) " 1 λ sup K Kmul sup f K 1 i=1 ϵi K( , xi), f i,j=1 ϵiϵj K (xi, xj) λ sup K Kmul n + A λ n. (9) Nearly-tight Bounds for Deep Kernel Learning Substituting inequalities (8) (9) into (7), we have n + 4A γλ n + 2 n. (10) Combining with (6) and (10), then E(f) ˆED(f) + 4 n + 4A γλ n + 2 n + C.3. Proof of Lemma 5.3 Proof. Let Kmul be a domain of l-layer deep kernels. The empirical Rademacher chaos complexity ˆUD(Kmul) can be bounded by the metric entropy integral as follows: Lemma C.3 (Theorem 2 in (Ying & Campbell, 2010)). For any D = {x1, . . . , xn}, there holds ˆUD(K) r2 + 24e Z r2 0 log [1 + N (ε, K, d2)] dε. We assume that 0 K, then we have that ˆUD(K) 24e Z r2 0 log [N2(ε, K, n) + 1] dε. Since N2(ε, Kmul, n) 8l2e A4 ε2 ldk max, we have that Un(Kmul) = ED h ˆUD(Kmul) i 0 log [N2(ε, Kmul, n) + 1] dε ldk max + 1 0 ln 8.4l2e A4 = 24eldk max A2 ln(8.4l2e) + Z A2 = 24eldk max A2 ln(8.4l2e) + 4A2 172e A2ldk max + 48e A2ldk max ln l 220e A2ldk max ln l (l 3). D. Estimating the Uniform Covering Number and the Rademacher Chaos Complexity D.1. Proof of Theorem 6.1 Proof. Let {x1, . . . , xdp} be the set that can be pseudo-shattered by F, then according to the definition of pseudodimension, there are real numbers {ri R : i [dp]} such that for any y { 1, 1}dp there is a function f F with property sgn (f(xi) ri) = yi for any i [dp]. Nearly-tight Bounds for Deep Kernel Learning For any y = (y1, . . . , ydp) { 1, 1}dp, we can take the real number ri as ri = w Φ(xi) for any i [dp], then w, w yi w Φ(xi) ri 1 ( i [dp]), yi w Φ(xi) w Φ(xi) 1 ( i [dp]), let w = w w , then we have yi w Φ(xi) 1 ( i [dp]). Summing these dp inequalities, we have i=1 yiΦ(xi) i=1 yiΦ(xi) i=1 yiΦ(xi) Since the above inequality holds for any y { 1, 1}dp, taking the expectation of y1, . . . , ydp on both sides, where y1, . . . , ydp obey independent and uniform distribution. From the independence assumption, we have E [yiyj] = E [yi] E [yj] , i = j. From the uniform distribution, we have E [yiyj] = 0 (i = j), E [yiyj] = 1 (i = j). Then, i=1 yiΦ(xi) i=1 yiΦ(xi) i,j=1 Ey [yiyj] Φ(xi) Φ(xj) Φ(xi) Φ(xi) #1/2 i=1 (K(xi, xi)) where the second inequality uses the Jensen s inequality and the property of convex functions. Hence, we have dp r2Λ2. D.2. Proof of Theorem 6.2 Proof. It can be known that Kconv (S) Kline(S) Kspan(S), and Kspan(S) is a vector space with dimension d(Kspan(S)) k. 
Since the fact that the pseudo-dimension of a k-dimensional vector space of real-valued functions is k, the following relationship of the kernel pseudo-dimension holds: dk (Kconv(S)) dk (Kline(S)) dk (Kspan(S)) = d (Kspan(S)) k. (11) We then have the following lemma for the pseudo-dimension and compositions with non-decreasing functions: Nearly-tight Bounds for Deep Kernel Learning Lemma D.1 (Theorem 11.3 in (Anthony & Bartlett, 1999)). Suppose F is a class of real-valued functions and σ : R R is a non-decreasing function. Let σ(F) denote the class {σ f : f F}. Then dk(σ(F)) dk(F). Since the gi operations produced by the compositions of common-used kernels are monotonous (non-decreasing) with respect to their input kernels, therefore, if the kernel domain of their input kernels is denoted by K1 mul, then dk(K1 mul) = d(K1 mul) = m. For any i {2, . . . , l}, we have dk(Ki mul) dk(gi hi 1(Ki 1 mul)) dk(hi 1(Ki 1 mul)) dk(Ki 1 mul). (12) Hence, dk(Ki mul) m, and the bound of the Rademacher chaos complexity can be derived immediately. E. A Lower Bound Based on the Rademacher Complexity E.1. Proof of Theorem 7.1 Proof. We first define a function class of DKMs as follows: i=1 αi Kl (x, xi) : X i,j αiαj Kp (xi, xj) 1, Kp(xi, xi) A2, Kp( , ) Kp , xi X, p [l]}, where the class of q-th layer deep kernels for some q {2, , l} is defined as follows: Kq = Kq( , ) = gq Kq 1( , ) , where Kq 1( , ) Kq 1 , gq is the nonlinear function produced by Kq( , ) which composites Kq 1( , ), and the classes of all other layers deep kernels are linear kernel classes for any p [l] and p = q, i.e., the gp operation is the identity transformation. Therefore, the l-layer deep kernel can be denoted as Kl(x, x ) = Kq(x, x ). Note that the linear kernel does not produce nonlinear mapping, and Kq( , ) represents a single-layer nonlinear kernel. It can be shown that G F. i=1 ϵif (xi) sup W l: W l 1 sup Φl: Φl A i=1 ϵi W l Φl(xi) sup W q: W q 1 sup φq: φq A i=1 ϵi W q φq(xi) = sup φq: φq A i=1 ϵiφq(xi) sup φq: φq A i=1 φq(xi) 2 (Use Khintchine-Kahane inequality, c > 0) = sup φq: φq A i=1 Kq(xi, xi) Since G is a subset of F, so ˆRD(F) ˆRD(G) c A n.