Deep Spectral Kernel Learning

Hui Xue1,2, Zheng-Fan Wu1,2 and Wei-Xiang Sun1,2
1School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China
2MOE Key Laboratory of Computer Network and Information Integration (Southeast University), China
{hxue, zfwu, vex-soon}@seu.edu.cn

Abstract

Recently, spectral kernels have attracted wide attention in complex dynamic environments. These advanced kernels mainly focus on breaking through the crucial limitation of locality, that is, stationarity and monotonicity. However, owing to the inefficiency of shallow models in computational elements, they are often unable to accurately reveal dynamic and potential variations. In this paper, we propose a novel deep spectral kernel network (DSKN) to naturally integrate non-stationary and non-monotonic spectral kernels into elegant deep architectures in an interpretable way, which can be further generalized to cover most kernels. Concretely, we first derive the general form of spectral kernels via the inverse Fourier transform. Second, DSKN is constructed by embedding these preeminent spectral kernels into each layer to boost the efficiency in computational elements, so that it can effectively reveal dynamic input-dependent characteristics and potential long-range correlations by compactly representing complex advanced concepts. Third, detailed analyses of DSKN are presented. Owing to its universality, we propose a unified spectral transform technique to flexibly extend and reasonably initialize domain-related DSKNs. Furthermore, the representer theorem of DSKN is given. Systematic experiments demonstrate the superiority of DSKN over state-of-the-art relevant algorithms on a variety of standard real-world tasks.

1 Introduction

Kernel methods are a class of artful statistical learning approaches. Benefiting from their flexible modeling frameworks and excellent statistical theories, they have been successfully used in many traditional learning applications over the past few decades. However, with the rapid development of machine learning in recent years, most classic kernels are no longer applicable to complex tasks in practical dynamic environments. In fact, many theoretical and experimental analyses have shown that the most critical and fundamental limitation of these kernels is that they are local, that is, stationary and monotonic [Bengio et al., 2006]. For instance, the most widely used local Gaussian kernel k(x, x′) only considers the distance ‖x − x′‖ and quickly converges to a constant as ‖x − x′‖ increases. Consequently, it cannot reveal more essential information over feature spaces beyond this identical similarity [Remes et al., 2017].

In order to solve this problem, some new kinds of kernels have been presented to break the restriction of locality. They generally fall into three categories [Rasmussen and Williams, 2006]: (1) non-stationary kernels; (2) non-monotonic kernels; (3) non-stationary and non-monotonic kernels. The kernels in the first category only focus on non-stationarity and are constructed by mapping input spaces [Sampson and Guttorp, 1992] or taking input-dependent parameters [Gibbs, 1998; Heinonen et al., 2016]. However, they cannot adequately reflect pivotal long-range data correlations because they neglect non-monotonicity.
On the contrary, the kernels in the second category only focus on non-monotonicity and are generated by solving the inverse Fourier transform based on Bochner's theorem [Wilson and Adams, 2013; Lázaro-Gredilla et al., 2010]. But they lose the essential input-dependent characteristics that can be learned well by non-stationary kernels.

Just recently, several improved spectral kernels have been proposed to acquire non-stationarity and non-monotonicity simultaneously, following the generalized Fourier analysis theory of kernels [Yaglom, 1987]. Remes et al. [2017] derived a generalized spectral mixture kernel by defining the spectral density as bivariate Gaussian mixture components. Ton et al. [2018] directly solved the inverse Fourier transform using Monte Carlo approximation and proposed a generalized sparse spectrum kernel. To some extent, these innovative spectral kernels make up for the deficiency caused by locality. But, due to the inefficiency of shallow models in computational elements, they still cannot effectively and efficiently acquire the appropriate non-stationary and non-monotonic properties as learning problems become more and more complicated, which likely leads to their poor performance in real-world tasks.

Although Bochner's theorem [Gikhman and Skorohod, 1974] and Yaglom's theorem [Yaglom, 1987] theoretically guarantee that spectral kernels can approximate most kernels under specific conditions, they actually require exponentially many computational elements to represent complex kernels [Rahimi and Recht, 2008], which restricts the further development and application of spectral kernels. Indeed, this serious problem is ubiquitous in most shallow models, not only spectral kernels but also shallow neural networks [Delalleau and Bengio, 2011]. In neural networks, deep architectures have been introduced to obtain an exponential improvement in computational elements from hierarchical non-linear linking structures [Bengio et al., 2007]. Consequently, in order to break the limitation of locality and improve the efficiency in computational elements, it is quite necessary to apply deep architectures to spectral kernels.

In this paper, we focus on effectively and flexibly integrating non-stationary and non-monotonic spectral kernels into well-designed deep architectures in a natural and interpretable manner. Specifically, we first deal with the spectral kernel based on Yaglom's theorem and derive a more general spectral form. DSKN is then formulated and constructed by embedding the distinguished spectral kernels into each layer to boost the efficiency in computational elements, and it benefits a great deal from the elegant deep architecture, including universality, flexibility, efficiency and interpretability. It can be further generalized to cover most kernels naturally. Detailed analyses of DSKN are presented later. We propose a unified spectral transform technique to flexibly extend and reasonably initialize DSKN, owing to the universality of the deep architecture, and thus we can dynamically inject vital prior information to construct domain-related DSKNs. The representer theorem and the reproducing kernel Hilbert space of DSKN are also derived recursively. More importantly, DSKN strengthens the link between kernel methods and deep learning.
It can not only improve spectral kernels by compactly representing highly non-linear and highly-varying advanced concepts, but also significantly enhance the performance of deep learning on medium-scale and small-scale tasks, especially when deep architectures cannot be well constructed from a complex but sparse data distribution. Systematic experiments demonstrate the superiority of DSKN over state-of-the-art relevant algorithms on a variety of standard classification and regression tasks, which indicates that the proposed approach adequately inherits the intrinsic advantages of spectral kernels and deep architectures.

2 Related Work

Recently, some deep kernel algorithms have been presented that try to link kernel methods with deep learning. Most of them directly combine frequently-used deep modules as the front-end or back-end of kernels. Wilson et al. [2016b] placed a plain deep neural network as the front-end of a spectral mixture kernel to extract features, which was further extended to a structured kernel interpolation framework [Wilson and Nickisch, 2015] and stochastic variational inference [Wilson et al., 2016a]. Sun et al. [2018] used a sum-product network as the back-end of multiple kernels to merge the kernel mappings. However, the deep modules and the kernels are relatively detached in these algorithms, and thus they cannot intrinsically improve the efficiency of kernels in computational elements. Some other algorithms stack kernel mappings in a hierarchical composite way. Cho and Saul [2009] designed a class of arc-cosine kernels and integrated them into a deep hierarchical structure. Zhuang et al. [2011] proposed a 2-layer multiple kernel structure, which further led to a series of refined algorithms [Rebai et al., 2016]. But these models are relatively closed and inflexible: they are only suitable for specific kernels and are difficult to optimize. Moreover, there are some algorithms that aim to introduce kernels into deep learning and further construct end-to-end complete deep models. The stacked kernel network is derived by replacing the non-linear activation functions of a deep neural network with kernel mappings [Zhang et al., 2017]. The deep Gaussian process combines multiple Gaussian processes hierarchically [Cutajar et al., 2017]. But the models derived from these methods can no longer be regarded as kernel functions, and thus cannot be simply applied to kernel algorithms.

3 Deep Spectral Kernel Learning

In this section, we first introduce some brief concepts about spectral kernels with the non-stationary and non-monotonic properties. Subsequently, we explicitly formulate and construct the novel DSKN by hierarchically stacking the kernel mappings of the derived spectral kernels in a scalable manner. Furthermore, we present detailed analyses of DSKN and propose a unified spectral transform technique to flexibly extend and reasonably initialize domain-related DSKNs. The representer theorem and the reproducing kernel Hilbert space of DSKN are also derived recursively.

3.1 Spectral Kernel

Spectral kernels are constructed from the inverse Fourier transform in the frequency domain. Most commonly-used spectral kernels are stationary, such as the spectral mixture kernel [Wilson and Adams, 2013] and the sparse spectrum kernel [Lázaro-Gredilla et al., 2010]. These stationary kernels are shift-invariant functions that only depend on the distance τ = x − x′ of the inputs x and x′, and thus can be rewritten as k(x, x′) = k(x − x′) = k(τ).

A stationary kernel k(τ) can be uniquely identified by a spectral density s(ω) on the basis of Bochner's theorem [Gikhman and Skorohod, 1974; Stein, 1999]:

k(\tau) = \int_{\mathbb{R}^D} e^{i \omega^\top \tau} s(\omega) \, d\omega, \qquad s(\omega) = \int_{\mathbb{R}^D} e^{-i \omega^\top \tau} k(\tau) \, d\tau, \quad (1)

where s(ω) is the spectral density of a non-negative measure. Some important stationary kernels and their corresponding spectral densities are shown in Table 1 [Rahimi and Recht, 2008].

| Kernel | k(τ) | s(ω) |
|---|---|---|
| Gaussian | e^{-‖τ‖₂²/2} | (2π)^{-D/2} e^{-‖ω‖₂²/2} |
| Laplacian | e^{-‖τ‖₁} | ∏_{i=1}^{D} 1/(π(1+ω_i²)) |
| Cauchy | ∏_{i=1}^{D} 2/(1+τ_i²) | e^{-‖ω‖₁} |

Table 1: Important stationary kernels and spectral densities

Spectral kernels k(τ) = k(x − x′) derived from this theorem are non-monotonic but stationary: they can learn potential long-range relationships but neglect the important dynamic input-dependent characteristics.
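To make Bochner's theorem concrete, the following minimal NumPy sketch (not from the paper; the unit lengthscale and the number of samples are illustrative choices) approximates the Gaussian kernel in Table 1 by sampling frequencies from its spectral density and averaging random cosine features:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 20000                # input dimension and number of spectral samples

# Table 1: the Gaussian kernel k(tau) = exp(-||tau||^2 / 2) has spectral density
# s(omega) = (2*pi)^(-D/2) * exp(-||omega||^2 / 2), i.e. a standard normal N(0, I_D).
omega = rng.standard_normal((M, D))        # frequencies drawn from s(omega)
phase = rng.uniform(0.0, 2.0 * np.pi, M)   # random phases (standard random-feature trick)

def z(x):
    """Random Fourier feature map whose inner product approximates the Gaussian kernel."""
    return np.sqrt(2.0 / M) * np.cos(omega @ x + phase)

x, x_prime = rng.standard_normal(D), rng.standard_normal(D)
approx = z(x) @ z(x_prime)                          # Monte Carlo estimate of k(x, x')
exact = np.exp(-0.5 * np.sum((x - x_prime) ** 2))   # closed-form Gaussian kernel
print(f"approx={approx:.4f}  exact={exact:.4f}")    # agree up to Monte Carlo error
```

With a sufficiently large M, the Monte Carlo estimate matches the closed-form kernel value up to sampling noise, which is exactly the random-feature construction of Rahimi and Recht [2008] referenced above.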
Recently, the Fourier analysis theory of kernels has achieved tremendous improvements. Yaglom's theorem further indicates that a general kernel k(x, x′) is related to a spectral density s(ω, ω′) in accordance with the following Fourier duals:

k(x, x') = \int_{\mathbb{R}^D} \int_{\mathbb{R}^D} e^{i(\omega^\top x - \omega'^\top x')} s(\omega, \omega') \, d\omega \, d\omega', \qquad s(\omega, \omega') = \int_{\mathbb{R}^D} \int_{\mathbb{R}^D} e^{-i(\omega^\top x - \omega'^\top x')} k(x, x') \, dx \, dx', \quad (2)

where the kernel k(x, x′) is positive semi-definite if and only if s(ω, ω′) is the positive semi-definite bounded-variation spectral density of a Lebesgue-Stieltjes measure [Yaglom, 1987]. As a result, spectral kernels constructed by Eq. (2) can learn not only long-range relationships but also important input-dependent characteristics, benefiting from their non-locality, that is, non-stationarity and non-monotonicity.

3.2 Deep Spectral Kernel Network

Many theoretical and experimental studies have indicated that deep models have a significant efficiency superiority over shallow counterparts in computational elements [Delalleau and Bengio, 2011]. Inspired by this intrinsic superiority of deep learning, we explicitly design and construct DSKN through two progressive steps. We first deal with the non-stationary and non-monotonic spectral kernels and derive a more general spectral form. Then we naturally and flexibly integrate the derived spectral kernels into elegant deep architectures in a hierarchical composite way.

Following Eq. (2), we symmetrize the integral to alleviate the restriction of Yaglom's theorem. Concretely, the exponential component e^{i(ω^⊤x − ω′^⊤x′)} in Eq. (2) is replaced by an augmented part E_{ω,ω′}(x, x′):

E_{\omega,\omega'}(x, x') = \frac{1}{4} \left[ e^{i(\omega^\top x - \omega'^\top x')} + e^{i(\omega'^\top x - \omega^\top x')} + e^{i(\omega^\top x - \omega^\top x')} + e^{i(\omega'^\top x - \omega'^\top x')} \right]. \quad (3)

As a result, the spectral surface density s(ω, ω′) and the corresponding spectral surface S(ω, ω′) can be further regarded as a continuous probability density and its cumulative distribution function, respectively [Ton et al., 2018; Rahimi and Recht, 2008]. In other words, we can directly optimize ω, ω′ over the R^D × R^D Euclidean space. Subsequently, the inverse Fourier transform of s(ω, ω′) is solved as follows:

k(x, x') = \int_{\mathbb{R}^D} \int_{\mathbb{R}^D} E_{\omega,\omega'}(x, x') \, s(\omega, \omega') \, d\omega \, d\omega' = \mathbb{E}_{(\omega,\omega') \sim S} \left[ E_{\omega,\omega'}(x, x') \right] = \mathbb{E}_{(\omega,\omega') \sim S} \left[ T_{\omega,\omega'}(x, x') \right], \quad (4)

where T_{ω,ω′}(x, x′) is:

T_{\omega,\omega'}(x, x') = \frac{1}{4} \left[ \cos(\omega^\top x - \omega'^\top x') + \cos(\omega'^\top x - \omega^\top x') + \cos(\omega^\top x - \omega^\top x') + \cos(\omega'^\top x - \omega'^\top x') \right]. \quad (5)

Consequently, we derive the non-stationary and non-monotonic spectral kernel k(x, x′) by directly approximating the expectation with a Monte Carlo integral:

k(x, x') = \mathbb{E}_{(\omega,\omega') \sim S} \left[ T_{\omega,\omega'}(x, x') \right] \approx \langle \Psi(x), \Psi(x') \rangle, \quad (6)

where the spectral kernel mapping Ψ is:

\Psi(x) = \frac{1}{2\sqrt{M}} \begin{bmatrix} \cos(\Omega^\top x) + \cos(\Omega'^\top x) \\ \sin(\Omega^\top x) + \sin(\Omega'^\top x) \end{bmatrix}, \quad (7)

and M is the sampling number. The D × M frequency matrices Ω, Ω′ are denoted as:

\Omega = \left[ \omega_1, \cdots, \omega_M \right], \qquad \Omega' = \left[ \omega'_1, \cdots, \omega'_M \right]. \quad (8)

The frequency pairs {(ω_i, ω′_i)}_{i=1}^{M} are drawn i.i.d. from S [Ton et al., 2018]. In DSKN, we further adopt another, more general form Φ instead of Ψ:

\Phi(x) = \frac{1}{\sqrt{2M}} \left[ \cos(\Omega^\top x + \varphi) + \cos(\Omega'^\top x + \varphi') \right], \quad (9)

which is equivalent to Ψ in the sense of expectation, i.e., E[Φ(x)] = E[Ψ(x)]. The phase vectors φ and φ′ are drawn uniformly from [0, 2π]^M. Compared with Ψ, the adopted form Φ halves the computational overhead and alleviates the difficulty of implementation. The detailed structure of the spectral kernel mapping Φ(x) is illustrated in Figure 1. Without losing generality, it can be regarded as a slightly more complex neural network with a single hidden layer that uses cosine as the activation function.

[Figure 1: The structure of the kernel mapping Φ(x).]
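As an illustration of Eqs. (8)-(9), the sketch below builds the mapping Φ in NumPy. The correlated-Gaussian spectral surface, the sampling size and the 1/√(2M) scaling are assumptions made for this example rather than choices prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 5, 4000                 # input dimension and number of frequency pairs

# Frequency pairs (omega_i, omega'_i) drawn i.i.d. from a spectral surface S.
# Purely for illustration, S is taken to be a correlated Gaussian on R^D x R^D.
rho = 0.7
Omega = rng.standard_normal((D, M))
Omega_p = rho * Omega + np.sqrt(1.0 - rho ** 2) * rng.standard_normal((D, M))
phi = rng.uniform(0.0, 2.0 * np.pi, M)     # phase vectors drawn uniformly from [0, 2*pi]^M
phi_p = rng.uniform(0.0, 2.0 * np.pi, M)

def Phi(x):
    """Spectral kernel mapping of Eq. (9); the 1/sqrt(2M) scaling is an assumed normalization."""
    return (np.cos(Omega.T @ x + phi) + np.cos(Omega_p.T @ x + phi_p)) / np.sqrt(2.0 * M)

def k(x, x_prime):
    """Non-stationary, non-monotonic spectral kernel k(x, x') = <Phi(x), Phi(x')> (Eq. (6))."""
    return Phi(x) @ Phi(x_prime)

x, x_prime = rng.standard_normal(D), rng.standard_normal(D)
print(k(x, x_prime), k(x + 1.0, x_prime + 1.0))   # generally differ: k is not shift-invariant
```

Because the resulting kernel is an inner product of explicit feature maps, it is positive semi-definite by construction, while the distinct frequency pairs and phases keep it from reducing to a function of x − x′ alone.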
Therefore, according to Eq. (9), the non-stationary and non-monotonic spectral kernel k(x, x′) can be identified by a pair of frequency matrices Ω, Ω′ sampled from the spectral surface S. It is worth pointing out that we can not only approximate almost all kernels by assigning specific Ω, Ω′, but also derive more powerful spectral kernels by dynamically optimizing Ω, Ω′.

Although some fundamental theorems guarantee that the spectral kernels shown in Eq. (9) and Figure 1 can approximate most kernels under specific conditions [Gikhman and Skorohod, 1974; Yaglom, 1987], representing complex kernels in practical dynamic environments requires exponentially many computational elements Φ_1, ..., Φ_M [Rahimi and Recht, 2008]. Fortunately, considering the superiority of deep learning in improving the efficiency of computational elements, we further effectively and flexibly integrate the spectral kernels k(x, x′) into well-designed deep architectures in an interpretable way.

Actually, one of the most serious problems of existing deep kernels is that they cannot pertinently construct flexible deep structures for kernels. The reason most likely lies in the form restrictions of kernels: a kernel function is a bivariate scalar function and its kernel mapping is implicit. However, benefiting from the general spectral form of the spectral kernels shown in Eq. (9) and Figure 1, we directly avoid these limitations and naturally construct DSKN by stacking Φ in a deep hierarchical composite way:

K^{(l)}(x, x') = \left\langle \Phi_l(\Phi_{l-1}(\cdots \Phi_1(x))), \, \Phi_l(\Phi_{l-1}(\cdots \Phi_1(x'))) \right\rangle, \quad (10)

where K^{(l)} denotes the l-layer DSKN and Φ_l denotes the l-th layer spectral kernel mapping. The parenthesized superscript (l) here means that the l-th layer and all previous layers are included. We further simplify the notation of K^{(l)} as:

K^{(l)}(x, x') = \left\langle \Xi^{(l)}(x), \, \Xi^{(l)}(x') \right\rangle, \quad (11)

where the composite kernel mapping Ξ^{(l)} of DSKN is:

\Xi^{(l)}(x) = \Phi_l(\Phi_{l-1}(\cdots \Phi_1(x))). \quad (12)

Based on the structure of Φ illustrated in Figure 1, the complete architecture of the l-layer DSKN K^{(l)} is shown in Figure 2.
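A minimal PyTorch sketch of Eqs. (10)-(12) follows; the class names, layer widths and Gaussian initialization are illustrative assumptions, not the authors' implementation. Each layer is the spectral mapping Φ of Eq. (9) with trainable frequencies and phases, and the deep kernel K^{(l)} is the inner product of the composite mapping Ξ^{(l)}:

```python
import math
import torch
import torch.nn as nn

class SpectralLayer(nn.Module):
    """One spectral kernel mapping Phi_i of Eq. (9) with trainable frequencies and phases."""
    def __init__(self, d_in, n_freq):
        super().__init__()
        # Frequencies initialized from a Gaussian spectral surface (the prior used in the
        # paper's experiments); the exact initialization scale is an assumption here.
        self.Omega = nn.Parameter(torch.randn(d_in, n_freq))
        self.Omega_p = nn.Parameter(torch.randn(d_in, n_freq))
        self.phi = nn.Parameter(2.0 * math.pi * torch.rand(n_freq))
        self.phi_p = nn.Parameter(2.0 * math.pi * torch.rand(n_freq))

    def forward(self, x):                       # x: (batch, d_in) -> (batch, n_freq)
        scale = 1.0 / math.sqrt(2.0 * self.Omega.shape[1])
        return scale * (torch.cos(x @ self.Omega + self.phi)
                        + torch.cos(x @ self.Omega_p + self.phi_p))

class DSKN(nn.Module):
    """Composite mapping Xi^(l)(x) = Phi_l(...Phi_1(x)) of Eq. (12)."""
    def __init__(self, widths=(5, 1000, 500, 50)):        # illustrative layer widths
        super().__init__()
        self.layers = nn.ModuleList(
            SpectralLayer(widths[i], widths[i + 1]) for i in range(len(widths) - 1))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def kernel(self, x, x_prime):               # K^(l)(x, x') of Eqs. (10)-(11)
        return self.forward(x) @ self.forward(x_prime).T

net = DSKN()
x, x_prime = torch.randn(4, 5), torch.randn(3, 5)
K = net.kernel(x, x_prime)                      # a 4 x 3 cross-kernel matrix
```

Since every parameter of this sketch is an ordinary tensor, the whole kernel can be trained end-to-end with optimizers such as Adam, as discussed in the analysis below.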
3.3 Analysis of DSKN

The deep architecture of DSKN illustrated in Figure 2 is significantly more natural and elegant than the closed structures of existing deep kernels. DSKN benefits a great deal from this concise hierarchical architecture, including universality, flexibility, efficiency and interpretability. Specifically, DSKN is an unprecedented universal deep kernel framework that can be directly applied to most positive semi-definite kernels, without losing any generality, to construct their composite deep models. In other words, we can assign all kinds of positive semi-definite kernels as the internal elements of DSKN with the unified spectral transform technique. Concretely, given a kernel k̂ to be embedded, we first derive its spectral density ŝ and spectral surface Ŝ by solving the inverse Fourier transform in Eq. (1) or Eq. (2), where k̂ is uniquely identified by ŝ and Ŝ. Then, based on frequency pairs {(ω_i, ω′_i)}_{i=1}^{M} drawn i.i.d. from Ŝ and Eq. (9), we further construct a spectral kernel k to approximate the kernel k̂. Consequently, we can flexibly embed any positive semi-definite kernel k̂ by indirectly integrating the spectral kernel k ≈ k̂. What's more, these essentially different kernels k̂ can be uniformly optimized. Some commonly-used kernels and their corresponding spectral densities were shown in Table 1. In this way, according to practical applications, we can naturally assign well-designed kernels as the internal elements of DSKN via the unified spectral transform technique, and thus more pivotal prior knowledge can be reasonably injected to construct a domain-related DSKN.
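As a sketch of this spectral transform, the function below (reusing the hypothetical SpectralLayer from the previous sketch) embeds a Gaussian target kernel k̂ by sampling frequencies from its spectral density in Table 1; the lengthscale and the diagonal choice ω′ = ω, which recovers the stationary special case, are assumptions made for illustration:

```python
import math
import torch

def init_from_gaussian_kernel(layer, lengthscale=1.0):
    """Illustrative spectral transform: embed a Gaussian target kernel
    k_hat(tau) = exp(-||tau||^2 / (2 * lengthscale^2)) into a SpectralLayer by sampling
    its frequencies from the kernel's spectral density N(0, I / lengthscale^2) and placing
    them on the diagonal omega' = omega, which recovers the stationary special case."""
    with torch.no_grad():
        omega = torch.randn_like(layer.Omega) / lengthscale
        layer.Omega.copy_(omega)
        layer.Omega_p.copy_(omega)               # omega' = omega: stationary target kernel
        layer.phi.uniform_(0.0, 2.0 * math.pi)
        layer.phi_p.copy_(layer.phi)             # shared phases keep the diagonal structure

# e.g. inject the Gaussian prior into the first layer of the DSKN sketch above:
# init_from_gaussian_kernel(net.layers[0], lengthscale=1.0)
```

Other entries of Table 1 can be embedded in the same way by swapping the sampling distribution, e.g. drawing each frequency coordinate from a standard Cauchy distribution to approximate the Laplacian kernel.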
Furthermore, DSKN can be flexibly adjusted in both vertical and horizontal manners. For the vertical adjustment, we can increase the depth l of DSKN by directly integrating more spectral kernels into the network. For the horizontal adjustment, in addition to directly increasing the sampling number M, each layer Φ_i for i = 1, ..., l can be further replaced by an augmented heterogeneous multiple kernel structure with K components {φ^i_k}_{k=1}^{K}:

\Phi_i(\cdot) = \left[ \phi^i_1(\cdot)^\top, \phi^i_2(\cdot)^\top, \cdots, \phi^i_K(\cdot)^\top \right]^\top, \quad (13)

where each element φ^i_k(·) is itself a complete spectral kernel mapping as shown in Eq. (9) and Figure 1. Therefore, DSKN can effectively embed different kernels into any position, and the deep network can be adjusted easily.

Considering that all parameters of DSKN are represented as frequencies ω, DSKN can be uniformly optimized by most existing optimization algorithms in deep learning, such as SGD and Adam, via error backpropagation. Furthermore, in Θ notation, the computational complexity of DSKN is the same as that of a classic deep neural network with the same architecture.

Compared with other deep kernels, the architecture of DSKN is more interpretable because it explicitly stacks spectral kernel mappings in a hierarchical composite way. We derive the representer theorem and the reproducing kernel Hilbert space V^{(i)} of the i-layer DSKN K^{(i)} for all i = 1, ..., l recursively [Bohn et al., 2017]. Concretely, let H_i be the reproducing kernel Hilbert space of the mapping Φ_i associated with the kernel k_i, with finite-dimensional domain D_i and range R_i, where R_i ⊆ R^{M_i} with M_i ∈ N for i = 1, ..., l, such that R_{i−1} ⊆ D_i for i = 2, ..., l and D_1 ⊆ R^D. Let L be an arbitrary loss function and Θ_1, ..., Θ_l be strictly monotonically increasing functions. The minimization objective J is defined as:

\mathcal{J}(\Phi_1, \cdots, \Phi_l) = \sum_{n=1}^{N} L\left( \Phi_l(\Phi_{l-1}(\cdots \Phi_1(x_n))), y_n \right) + \sum_{i=1}^{l} \Theta_i\left( \|\Phi_i\|^2_{\mathcal{H}_i} \right). \quad (14)

Then, a set of minimizers {Φ_i}_{i=1}^{l} with Φ_i ∈ H_i fulfills Ξ^{(i)}(·) = Φ_i(Φ_{i−1}(⋯Φ_1(·))) ∈ V^{(i)} ⊆ H_i for all i = 1, ..., l, with the spanned reproducing kernel Hilbert space V^{(i)} of the i-layer DSKN K^{(i)}:

V^{(i)} = \operatorname{span} \left\{ k_i\left( \Phi_{i-1}(\Phi_{i-2}(\cdots \Phi_1(x_n))), \cdot \right) e_{m_i} \,\middle|\, n = 1, \cdots, N; \; m_i = 1, \cdots, M_i \right\}, \quad (15)

where e_{m_i} ∈ R^{M_i} is the m_i-th unit vector.

[Figure 2: The deep architecture of DSKN.]

Intuitively, according to the image spaces of the hidden layers, the kernels in the previous layers try to align the intrinsic features of the data in such a way that they can be easily resolved by the posterior ones, and thus the output of the last layer of DSKN can represent more important advanced concepts for the tasks at hand.

It is worth noting that DSKN further strengthens the link between kernel methods and deep learning. On the one hand, it is distinctly superior to shallow spectral kernels in all aspects by effectively applying deep architectures to spectral kernels, which significantly improves the learning capability of spectral kernels to compactly represent highly non-linear and highly-varying complex concepts by abstracting the advanced internal relationships of the inputs. On the other hand, compared with deep learning, it adapts better to medium-scale and small-scale learning tasks by naturally embedding the powerful non-stationary and non-monotonic spectral kernels into deep networks, which can incisively learn dynamic input-dependent characteristics and potential long-range correlations across the whole feature space.

4 Experiments

In this section, we experimentally evaluate the performance of DSKN compared with several state-of-the-art algorithms on a variety of typical tasks, which demonstrates that DSKN can achieve all-round performance improvements.

4.1 Experimental Setup

As prior knowledge, the most widely-used classic Gaussian kernels are used as the internal basic kernel elements of DSKN, whose spectral surfaces S are Gaussian distributions. Moreover, the scales of all deep architectures in the experiments are uniformly set to 1000-500-50. Sigmoid is applied as the activation function in the neural networks. DSKN and the compared kernels are applied to the same Gaussian process models for classification and regression, and optimized by Adam. The accuracy and the mean absolute error (MAE) are chosen as the evaluation criteria for classification tasks and regression tasks, respectively, which reflect the average performance better.

Compared Algorithms. DSKN is compared with several state-of-the-art relevant algorithms, including:
- 1-SKN [Ton et al., 2018]: the 1-layer Spectral Kernel Network is a shallow baseline, whose number of computational elements is set to be the same as that of DSKN.
- DNN [LeCun et al., 1998]: the Deep Neural Network is the most classic model in deep learning.
- DKL-GA [Wilson et al., 2016b]: Deep Kernel Learning with a GAussian kernel combines a DNN as the front-end of a Gaussian kernel.
- DKL-SM [Wilson et al., 2016b]: Deep Kernel Learning with a Spectral Mixture kernel combines a DNN as the front-end of a spectral mixture kernel.

Datasets. We systematically evaluate the performance of DSKN on several standard classification and regression tasks. We first conduct classification experiments on four benchmark datasets: four, ionosphere, splice and wbdc [Blake and Merz, 1998]. Second, we conduct regression experiments on another four datasets: airfoil, boston, concrete and energy [Blake and Merz, 1998]. All these data are scaled by z-score standardization and randomly divided into two non-overlapping training and test sets of equal size. The division, training and test processes are repeated ten times to generate ten independent results for each dataset, and we then assess the average performance with the corresponding evaluation criterion. Furthermore, to evaluate the performance of DSKN on training data of different scales, we additionally conduct an image classification experiment on the MNIST dataset [LeCun et al., 1998]. The division of training and test data is consistent with the default scheme: 10,000 images are randomly selected as test data, and the rest are training data. The training data are further sampled to different scales, from 5% to 100%.
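For concreteness, a small sketch of the data protocol described above (z-score standardization followed by a random, equally sized, non-overlapping train/test split, repeated ten times) is given below; the function and variable names are placeholders:

```python
import numpy as np

def one_repetition(X, y, rng):
    """One of the ten repetitions: z-score standardization and a random split into
    two equally sized, non-overlapping training and test sets (illustrative sketch)."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score standardization
    perm = rng.permutation(len(X))
    half = len(X) // 2
    train, test = perm[:half], perm[half:]
    return X[train], y[train], X[test], y[test]

rng = np.random.default_rng(0)
# splits = [one_repetition(X, y, rng) for _ in range(10)]   # ten independent repetitions
```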
4.2 Experimental Results

We evaluate the performance of DSKN on a variety of standard classification and regression tasks. The experimental results are shown in Table 2, where the best results are highlighted in bold. (↑) indicates that larger is better, while (↓) indicates that smaller is better.

| Algorithm | four | ionosphere | splice | wbdc | airfoil | boston | concrete | energy |
|---|---|---|---|---|---|---|---|---|
| 1-SKN | 0.984 ± 0.021 | 0.864 ± 0.065 | 0.674 ± 0.064 | 0.964 ± 0.023 | 0.282 ± 0.010 | 0.248 ± 0.019 | 0.269 ± 0.010 | 0.121 ± 0.013 |
| DNN | 0.864 ± 0.143 | 0.828 ± 0.102 | 0.777 ± 0.082 | 0.932 ± 0.106 | 0.198 ± 0.009 | 0.290 ± 0.018 | 0.237 ± 0.014 | 0.088 ± 0.008 |
| DKL-GA | 0.896 ± 0.157 | 0.743 ± 0.115 | 0.682 ± 0.134 | 0.902 ± 0.147 | 0.246 ± 0.033 | 0.249 ± 0.016 | 0.239 ± 0.012 | 0.099 ± 0.006 |
| DKL-SM | 0.963 ± 0.106 | 0.787 ± 0.105 | 0.748 ± 0.082 | 0.940 ± 0.103 | 0.280 ± 0.020 | 0.257 ± 0.024 | 0.258 ± 0.018 | 0.104 ± 0.004 |
| DSKN | **0.999 ± 0.000** | **0.917 ± 0.033** | **0.855 ± 0.012** | **0.974 ± 0.006** | **0.176 ± 0.012** | **0.244 ± 0.016** | **0.233 ± 0.012** | **0.069 ± 0.003** |

Table 2: Classification accuracy (↑, on four, ionosphere, splice and wbdc) and regression MAE (↓, on airfoil, boston, concrete and energy), reported as mean ± std for each compared algorithm on several real-world datasets. (↑) indicates the larger the better, while (↓) indicates the smaller the better. The best results are highlighted in bold.

| Training scale | 1-SKN | DNN | DKL-GA | DKL-SM | DSKN |
|---|---|---|---|---|---|
| 5% | 0.9052 | 0.9589 | 0.9498 | 0.9437 | 0.9762 |
| 10% | 0.9312 | 0.9733 | 0.9714 | 0.9595 | 0.9824 |
| 20% | 0.9479 | 0.9817 | 0.9820 | 0.9782 | 0.9881 |
| 40% | 0.9614 | 0.9862 | 0.9858 | 0.9866 | 0.9920 |
| 70% | 0.9726 | 0.9876 | 0.9874 | 0.9865 | 0.9936 |
| 100% | 0.9746 | 0.9891 | 0.9899 | 0.9881 | 0.9945 |

Table 3: Accuracy on the MNIST dataset with different training scales.

[Figure 3: Accuracy curves on the MNIST dataset with different training scales.]

In order to measure the statistical significance of the performance differences, pairwise t-tests at the 0.05 significance level are conducted between DSKN and each compared algorithm on each dataset [Xu et al., 2017].
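The evaluation protocol above can be reproduced with a few lines of SciPy; the helpers below are a hedged sketch in which the per-split score arrays are placeholders rather than the paper's numbers:

```python
import numpy as np
from scipy import stats

def mae(y_true, y_pred):
    """Mean absolute error, the regression criterion reported in Table 2."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def compare_to_dskn(scores_dskn, scores_other, alpha=0.05):
    """Paired t-test over the ten repeated splits; the score arrays hold one accuracy
    or MAE value per split (placeholder inputs, not the paper's results)."""
    t_stat, p_value = stats.ttest_rel(scores_dskn, scores_other)
    return {"mean_dskn": float(np.mean(scores_dskn)),
            "mean_other": float(np.mean(scores_other)),
            "significant": bool(p_value < alpha)}
```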
According to the results in Table 2, 1-SKN, with only a single hidden layer, performs relatively well on most of the easier classification tasks, benefiting to some extent from its non-stationary and non-monotonic properties. But the shallow 1-SKN lacks efficiency in computational elements and thus performs poorly on the regression tasks, which need to be learned more accurately. As a classic and commonly-used deep learning algorithm, DNN is still very competitive on average. The two compared deep kernels, DKL-GA and DKL-SM, have relatively poor performance: although additional kernels are combined as the back-end, these traditional deep kernels cannot achieve effective performance improvements. By contrast, the proposed DSKN evidently outperforms all compared algorithms on all tasks. The experimental results explicitly demonstrate the excellent performance and stability of DSKN, which further implicitly indicates the necessity of constructing natural deep architectures for non-stationary and non-monotonic spectral kernels.

We further conduct an experiment on the MNIST dataset with different training scales to demonstrate the excellent learning ability of DSKN. The results are collected in Table 3 and visualized in Figure 3. On this relatively complex dataset, 1-SKN performs poorly due to its shallow structure, which would need exponentially many computational elements to adequately represent image patterns. There is no significant performance gap between DNN, DKL-GA and DKL-SM, but the deep kernels DKL-GA and DKL-SM still perform a little worse than DNN: the kernels in the last layer are more likely to hinder the adequate backpropagation of error information. By contrast, DSKN not only stably achieves the best performance at all scales, but also enlarges its performance advantage over the compared algorithms on medium-scale and small-scale tasks. In fact, the sparser the data distribution, the more crucial the appropriate non-stationary and non-monotonic properties are. DSKN accurately learns these crucial properties, and thus performs better, by introducing deep architectures to improve the non-stationary and non-monotonic spectral kernels.

5 Conclusion

In view of the prominent superiority of deep architectures over shallow ones, we focus on effectively integrating non-stationary and non-monotonic spectral kernels into elegant deep architectures and propose the novel DSKN, which can be further generalized to cover most kernels. Specifically, we first deal with the spectral kernels via the inverse Fourier transform and present a more general spectral form. DSKN is then derived by naturally embedding the spectral kernels into each layer to achieve better efficiency in computational elements. Consequently, DSKN can effectively reveal dynamic input-dependent characteristics and potential long-range correlations by compactly representing complex advanced concepts. In addition, some intuitive analyses and the representer theorem of DSKN are also presented. Systematic experiments demonstrate the superiority of DSKN over state-of-the-art relevant algorithms on a variety of standard tasks, which implicitly indicates that DSKN inherits the strong points of spectral kernels and deep architectures while overcoming their weak points.

Acknowledgments

This work was supported by the National Key R&D Program of China (2017YFB1002801) and the National Natural Science Foundation of China (Grant No. 61876091). It is also supported by the Collaborative Innovation Center of Wireless Communications Technology.

References

[Bengio et al., 2006] Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. The curse of highly variable functions for local kernel machines. In Advances in Neural Information Processing Systems, pages 107–114, 2006.
[Bengio et al., 2007] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153–160, 2007.
[Blake and Merz, 1998] Catherine Blake and Christopher J. Merz. UCI repository of machine learning databases. Online at http://archive.ics.uci.edu/ml/, 1998.
[Bohn et al., 2017] Bastian Bohn, Michael Griebel, and Christian Rieger. A representer theorem for deep kernel learning. arXiv preprint arXiv:1709.10441, 2017.
[Cho and Saul, 2009] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
[Cutajar et al., 2017] Kurt Cutajar, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 884–893. JMLR.org, 2017.
[Delalleau and Bengio, 2011] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.
[Gibbs, 1998] Mark N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1998.
[Gikhman and Skorohod, 1974] Iosif Ilyich Gikhman and Anatoli Vladimirovich Skorohod. The Theory of Stochastic Processes, volume 1. Springer-Verlag, Berlin, 1974.
[Heinonen et al., 2016] Markus Heinonen, Henrik Mannerström, Juho Rousu, Samuel Kaski, and Harri Lähdesmäki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In Artificial Intelligence and Statistics, pages 732–740, 2016.
[Lázaro-Gredilla et al., 2010] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865–1881, 2010.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Rahimi and Recht, 2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
[Rasmussen and Williams, 2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[Rebai et al., 2016] Ilyes Rebai, Yassine Ben Ayed, and Walid Mahdi. Deep multilayer multiple kernel learning. Neural Computing and Applications, 27(8):2305–2314, 2016.
[Remes et al., 2017] Sami Remes, Markus Heinonen, and Samuel Kaski. Non-stationary spectral kernels. In Advances in Neural Information Processing Systems, pages 4642–4651, 2017.
[Sampson and Guttorp, 1992] Paul D. Sampson and Peter Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108–119, 1992.
[Stein, 1999] Michael L. Stein. Interpolation of Spatial Data. Springer, New York, 1999.
[Sun et al., 2018] Shengyang Sun, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. Differentiable compositional kernel learning for Gaussian processes. arXiv preprint arXiv:1806.04326, 2018.
[Ton et al., 2018] Jean-François Ton, Seth Flaxman, Dino Sejdinovic, and Samir Bhatt. Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial Statistics, 2018.
[Wilson and Adams, 2013] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pages 1067–1075, 2013.
[Wilson and Nickisch, 2015] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
[Wilson et al., 2016a] Andrew G. Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.
[Wilson et al., 2016b] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
[Xu et al., 2017] Hai-Ming Xu, Hui Xue, Xiao-Hong Chen, and Yun-Yun Wang. Solving indefinite kernel support vector machine with difference of convex functions programming. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[Yaglom, 1987] Akiva Moiseevich Yaglom. Correlation Theory of Stationary and Related Random Functions, volume 1. Springer Series in Statistics, 1987.
[Zhang et al., 2017] Shuai Zhang, Jianxin Li, Pengtao Xie, Yingchun Zhang, Minglai Shao, Haoyi Zhou, and Mengyi Yan. Stacked kernel network. arXiv preprint arXiv:1711.09219, 2017.
[Zhuang et al., 2011] Jinfeng Zhuang, Ivor W. Tsang, and Steven C. H. Hoi. Two-layer multiple kernel learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 909–917, 2011.