# Copula-Nested Spectral Kernel Network

Jinyue Tian 1,2, Hui Xue 1,2, Yanfang Xue 1,2, Pengfei Fang 1,2

1 School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China. 2 Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China. Correspondence to: Hui Xue.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Spectral Kernel Networks (SKNs) emerge as a promising approach in machine learning, melding the solid theoretical foundations of spectral kernels with the representation power of hierarchical architectures. At their core, the spectral density function plays a pivotal role by revealing essential patterns in data distributions, thereby offering deep insights into the underlying structure of real-world tasks. Nevertheless, prevailing designs of the spectral density often overlook the intricate interactions within data structures. This consequently neglects large expanses of the hypothesis space, thus curtailing the performance of SKNs. This paper addresses these issues through a novel approach, the Copula-Nested Spectral Kernel Network (Coke Net). Concretely, we first redefine the spectral density in the form of copulas to enhance the diversity of spectral densities. Next, the specific expression of the copula module is designed to allow the excavation of complex dependence structures. Finally, the unified kernel network is proposed by integrating the corresponding spectral kernel and the copula module. Through rigorous theoretical analysis and experimental verification, Coke Net demonstrates superior performance and significant advancements over state-of-the-art (SOTA) algorithms in the field.

1. Introduction

Kernel methods are an essential class of machine learning approaches (Shawe-Taylor & Cristianini, 2004). However, with the ever-growing scale of datasets in the machine learning community, the computation cost of kernel methods significantly increases. To overcome these limitations, researchers have proposed several approaches to improve the scalability of traditional kernels (Williams & Seeger, 2000; Rahimi & Recht, 2007; Yang et al., 2014; Liu et al., 2021). For instance, Random Fourier Features (RFF) (Rahimi & Recht, 2007), a widely adopted method, proposed an explicit kernel mapping based on the spectral representation of stationary kernels, which has inspired many researchers to delve deeper into the exploration of spectral kernels (Lázaro-Gredilla et al., 2010; Wilson & Adams, 2013; Sinha & Duchi, 2016). To further enhance the representation ability of spectral kernels, some works integrated them into hierarchical architectures, resulting in spectral kernel networks (SKNs) (Xue et al., 2019; Li et al., 2022; Xu et al., 2022), which share the benefits of reduced computation cost and improved representation capability while maintaining solid theoretical foundations. In SKNs, the spectral density serves as a crucial determinant of their properties. It is refined through the backpropagation procedure of the SKN, encapsulating the intrinsic characteristics of the data. However, the selection and formulation of the spectral density remain underexplored.
Previous works prefer to utilize the spectral density function of traditional kernels (Rahimi & Recht, 2007; Xue et al., 2019) or assume the spectral density to be a linear combination of Gaussian probability density functions (Wilson & Adams, 2013; Samo & Roberts, 2015; Remes et al., 2017). The former is more akin to approximating an existing kernel, rather than creating a novel data-based kernel. The latter fails to investigate complex architectures between variables, such as the dependence structures. Both of these methods limit the exploration of the hypothesis space. To adjust the spectral densities, Avron et al. related the spectral density to an appropriately defined ridge leverage function (Avron et al., 2017), but the modified spectral density can only be used in the 1-dimensional Gaussian kernel. Li et al. employed the empirical ridge leverage score distribution and proposed an algorithm to approximate the distribution and the leverage weights (Li et al., 2019). However, the distribution is not learnable and hence unable to dynamically mine the structure in the data. Xue and Wu derived a priorposterior bridge to strengthen the uncertainty of spectral density (Xue & Wu, 2020), but the bridge is manually chosen, reducing the flexibility in the parametric form of the spectral density. The insufficient exploration of spectral den- Copula-Nested Spectral Kernel Network sity prevents SKNs from fully demonstrating their inherent, powerful data mining capabilities. In this paper, we propose an expressive and flexible method named Copula-Nested Spectral Kernel Network (Coke Net). The core idea is to introduce copula networks into the design of the spectral density based on Sklar s theorem. This scheme can investigate a broader range of hypothesis space of SKNs and capture the complicated relations between variables in the data. Concretely, we first redefine the spectral density as a multiplication of copulas and pseudo probability density by employing Sklar s theorem. Secondly, we construct the copula module. Particularly, we select the Archimedean copulas due to their simplicity and ability to model complex distributions. Berstein s theorem is then utilized to further enrich the choice of generators in Archimedean copulas, which leads to a copula module in hierarchical architectures. Finally, we insert the redefined spectral density and the copula module into the spectral kernel mapping, resulting in Coke Net. With the copula and the kernel both in the representation of networks, we can train the network in an end-to-end manner and flexibly fit the network to the data, learning the spectral density most suitable for the given task. Our contributions are: We propose a learnable copula-nested spectral kernel network, Coke Net. By introducing the copula networks into the construction of spectral kernels, the proposed model can explore wider expanses of hypothesis spaces and capture the complicated relations in the data, thus deriving the most task-appropriate spectral density. We theoretically analyze the improvements of Coke Net. This includes a broadening in the variety of spectral densities, an enhanced capability for data structure extraction and an augmented generalization ability. We conduct experiments on synthetic datasets and real-world datasets to evaluate the effectiveness of our method. The experimental results on the synthetic data show that Coke Net can capture the dependence structure between data variables. 
Results of the real-world experiments demonstrate the superiority of Coke Net compared to state-of-the-art relevant methods. 2. Related Work Spectral Kernel Based on Bochner s theorem, Rahimi and Recht proposed the RFF method based on the spectral representation of stationary kernels (Rahimi & Recht, 2007). Its core idea is generating an explicit expression of the kernel mapping to approximate the implicit mapping of the target kernel, resulting in a succinct expression of cosine functions. This work motivates many researchers to further explore spectral kernels (Wilson & Adams, 2013; Sinha & Duchi, 2016; L azaro-Gredilla et al., 2010). Some researchers generalized spectral kernels to non-stationary scenarios. Remes et al. define spectral density as a combination of bivariate Gaussian components and present a family of non-stationary and non-monotonic kernels, which can learn input-dependent and long-range covariance between inputs (Remes et al., 2017). Ton et al. obtain the sparse spectrum kernel by solving a more general spectral characterization of non-stationary kernels (Ton et al., 2018). Samo and Roberts leverage Wiener s Tauberian theorem and Yaglom s theorem to construct families of kernels that can approximate arbitrarily well any bounded continuous non-stationary kernels (Samo & Roberts, 2015). To further enhance the representation ability of spectral kernels, ones combine spectral kernels and various deep architectures, resulting in SKNs (Xue et al., 2019; Li et al., 2022; Xu et al., 2022). These methods have not only strong mathematical guarantees of kernel methods but also the power to model sophisticated data patterns brought by deep learning. Furthermore, Xue et al. generalize the SKNs on the real number domain to the complex number domain by taking both the real and imaginary parts of the spectral kernel mapping into account (Xue et al., 2023). Now, SKNs are not merely a plug-in within kernel methods. Their improved integrability and diverse architectures enable them to be used as standalone learners. Copula Theory Copula is a powerful statistical tool for capturing the dependence structure between random variables, introduced by Sklar (Sklar, 1959). Along with the introduction of the notion of copula, Sklar s theorem points out that any d-dimensional continuous joint distribution is a product of d marginal distribution functions and a single d-dimensional copula, which reduces the complex issue of estimating the multivariate distribution to a much easier problem of estimating the univariate distribution and their dependence structure. Hence, various copulas have been developed and widely applied in fields where dependence matters, such as economics (Oh & Patton, 2018), biology (Disegna et al., 2017), medicine (Genest & Rivest, 1993) et al. However, the diversity of copula functions brings the challenge of selecting or estimating a well-suited and tractable copula. To tackle this issue, many copula generative models have been proposed (Ling et al., 2020; Ng et al., 2021; Janke et al., 2021). These methods enhance the ability to tailor copula functions to specific datasets, enabling more precise modeling of complex dependencies. 3. Preliminary In this section, we introduce some necessary concepts about spectral kernels and copulas. Throughout the paper, the vectors are denoted by bold letters (e.g. ω) while the scalars are not (e.g. i). Copula-Nested Spectral Kernel Network Figure 1. The overall structure of the Coke Net. 
First, we initialize the weights according to the pseudo probability density function and pass them to the copula module. Then, the copulas are passed into the spectral kernel mapping. Finally, the copula-nested spectral kernel mapping constitutes each layer of Coke Net.

3.1. Spectral Kernel

According to Yaglom's theorem (Yaglom, 1987), a non-stationary kernel $k(x, x')$ is positive definite in $\mathbb{R}^d$ if and only if it has the form
$$k(x, x') = \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{i(\omega^\top x - \omega'^\top x')}\, \mu(d\omega, d\omega'), \tag{1}$$
where $\mu(d\omega, d\omega')$ is the Lebesgue-Stieltjes measure associated with some positive definite function $p(\omega, \omega')$ of bounded variation, and they satisfy $\mu(d\omega, d\omega') = p(\omega, \omega')\, d\omega\, d\omega'$. Hence, $p(\omega, \omega') / \int p(\omega, \omega')\, d\omega\, d\omega'$ is a probability density function. Without loss of generality, we assume that $\int p(\omega, \omega')\, d\omega\, d\omega' < \infty$ and that $p(\omega, \omega')$ is a probability density function. There exists a one-to-one correspondence between $k$ and $p$ based on Fourier duality:
$$k(x, x') = \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{i(\omega^\top x - \omega'^\top x')} p(\omega, \omega')\, d\omega\, d\omega', \qquad p(\omega, \omega') = \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{-i(\omega^\top x - \omega'^\top x')} k(x, x')\, dx\, dx', \tag{2}$$
where $p$ is the spectral density of the kernel and $x, x', \omega, \omega' \in \mathbb{R}^d$. Thereby, $k$ is also called the spectral kernel.

3.2. Copula

Copulas are powerful mathematical tools that fully depict the dependence structure among random variables, offering great flexibility in building multivariate models. They are defined as follows.

Definition 3.1. A function $C : [0, 1]^d \to [0, 1]$ is a copula if and only if the following properties hold: (i) for every $j \in \{1, \dots, d\}$, $C(\mathbf{u}) = u_j$ when all elements of $\mathbf{u}$ are equal to 1 with the exception of the $j$-th one, which is equal to $u_j \in [0, 1]$; (ii) $C(\mathbf{u}) \le C(\mathbf{v})$ for all $\mathbf{u}, \mathbf{v} \in [0, 1]^d$ with $\mathbf{u} \le \mathbf{v}$; (iii) $C$ is $d$-increasing.

These properties guarantee that copulas are a special kind of cumulative distribution function. Their corresponding probability density function, namely the copula density function, is defined as
$$c(u_1, \dots, u_d) = \frac{\partial^d C(u_1, \dots, u_d)}{\partial u_1 \cdots \partial u_d}. \tag{3}$$

4. Coke Net

In this section, we first provide the overall architecture of our Coke Net, which consists of three parts. Subsequently, we introduce each part in detail.

Overall Architecture. Coke Net comprises three parts: the redefinition of the spectral density, the construction of the copula module, and the copula-nested spectral kernel mapping. The overall architecture is shown in Figure 1. Concretely, the spectral density is defined as
$$p(\omega, \omega') = c(\omega, \omega')\, \hat{p}(\omega, \omega'), \tag{4}$$
where $c$ is the integrated copula module and $\hat{p}$ represents the pseudo sampling distribution for initialization. Then, we obtain the copula module by representing a family of Archimedean copulas $C$ in network form. Combining the novel spectral density $p(\omega, \omega')$ and the copula module, we derive the spectral kernel mapping $\Phi$ as one layer of Coke Net. Stacking $L$ layers of spectral kernel mapping, our Coke Net is formulated as
$$\text{Coke Net}(x) = \Phi_L(\dots \Phi_2(\Phi_1(x))), \tag{5}$$
where $\Phi_l$ is the copula-nested spectral kernel mapping in the $l$-th layer.

4.1. Copula-Nested Spectral Density

Theorem 4.1. (Sklar's Theorem) (Jaworski et al., 2010) Let $F$ be a $d$-dimensional distribution function with univariate margins $F_1, F_2, \dots, F_d$. Let $A_j$ denote the range of $F_j$, $A_j = F_j(\mathbb{R})$ $(j = 1, 2, \dots, d)$. Then there exists a copula $C$ such that for all $(x_1, x_2, \dots, x_d) \in \mathbb{R}^d$,
$$F(x_1, x_2, \dots, x_d) = C(F_1(x_1), F_2(x_2), \dots, F_d(x_d)), \tag{6}$$
where $C$ is uniquely determined on $A_1 \times A_2 \times \dots \times A_d$. Hence, it is unique when $F_1, F_2, \dots, F_d$ are all continuous.

By Sklar's theorem, copulas serve as a link to build a joint distribution from its marginals. They depict the dependence structure between the margins.
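To make the role of Sklar's theorem concrete, the toy sketch below (our own illustration, not code from the paper; the Clayton copula, its parameter $\theta = 2$, and the exponential margins are arbitrary choices) builds a joint CDF from two margins and a copula, and checks that the margins are recovered, i.e. property (i) of Definition 3.1.

```python
# Illustrative only: glue exponential margins with a bivariate Clayton copula
# C(u, v) = (u^{-theta} + v^{-theta} - 1)^{-1/theta} via Sklar's theorem.
import numpy as np
from scipy.stats import expon

theta = 2.0  # Clayton dependence parameter (assumed value for illustration)

def clayton(u, v, theta=theta):
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

F1 = expon(scale=1.0).cdf   # marginal CDF of x1
F2 = expon(scale=2.0).cdf   # marginal CDF of x2

def joint_cdf(x1, x2):
    # Sklar's theorem (Equation (6)): F(x1, x2) = C(F1(x1), F2(x2))
    return clayton(F1(x1), F2(x2))

x1 = 0.7
print(joint_cdf(x1, 1e9), F1(x1))  # both ≈ 0.5034: the margin is recovered
```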
Furthermore, Corollary 4.2 explains how the copula density function connects probability density functions.

Corollary 4.2. (Jaworski et al., 2010) Following the notation in Theorem 4.1, let $f$ be a $d$-dimensional probability density function with univariate margins $f_1, f_2, \dots, f_d$. Let $A_j$ denote the range of $F_j$, $A_j = F_j(\mathbb{R})$ $(j = 1, 2, \dots, d)$. Then there exists a copula $C$ such that for all $(x_1, \dots, x_d) \in \mathbb{R}^d$,
$$f(x_1, x_2, \dots, x_d) = c(F_1(x_1), \dots, F_d(x_d)) \prod_{k=1}^{d} f_k(x_k), \qquad c(F_1(x_1), \dots, F_d(x_d)) = \frac{\partial^d C(F_1(x_1), \dots, F_d(x_d))}{\partial F_1(x_1) \cdots \partial F_d(x_d)}, \tag{7}$$
where $c$ is the copula density function uniquely determined on $A_1 \times A_2 \times \dots \times A_d$. Moreover, $c$ is unique when $f_1, f_2, \dots, f_d$ are all continuous.

Looking back at Equation (2), by Corollary 4.2 we formulate the spectral density function $p$ as
$$p(\omega, \omega') = c(\omega, \omega')\, \hat{p}(\omega, \omega') = c(P_1(\omega_1), \dots, P'_d(\omega'_d)) \prod_{j=1}^{d} p_j(\omega_j)\, p'_j(\omega'_j), \tag{8}$$
where $\omega_j$ and $\omega'_j$ are the $j$-th elements of the vectors $\omega, \omega' \in \mathbb{R}^d$ respectively, $p_j$ and $p'_j$ are the corresponding probability density functions, $P_j$ and $P'_j$ are the corresponding cumulative distribution functions, $c$ is the copula density and $\hat{p}$ is the pseudo probability density function. Note that when $c(\omega, \omega') \equiv 1$, the spectral density degenerates to the scenario where each feature is independent of the others.

4.2. Copula Module

Figure 2. The structure of the copula module.

Motivated by ACNet (Ling et al., 2020), we select the family of Archimedean copulas to link the marginal probability density functions to a joint probability density function, owing to its simple form and its ability to model complex distributions. The Archimedean copulas are given by
$$C(u_1, \dots, u_d) = \phi\big(\phi^{-1}(u_1) + \phi^{-1}(u_2) + \dots + \phi^{-1}(u_d)\big), \tag{9}$$
where $\phi$ is the generator of $C$, and $\phi : [0, \infty) \to [0, 1]$ is $d$-monotone, i.e. $(-1)^k \phi^{(k)}(t) \ge 0$ for all $k \le d$ and $t \ge 0$.

From Equation (9), it can be seen that the generator $\phi$ determines the characteristics of Archimedean copulas. Hence, an important problem is the selection of $\phi$, since many functions satisfy the $d$-monotone constraint. We strengthen the condition in Equation (9) to require that the generator is totally monotone, i.e. $(-1)^k \phi^{(k)}(t) \ge 0$ for all nonnegative integers $k$ and all $t \ge 0$. Such functions can be expressed in a unified form via Theorem 4.3.

Theorem 4.3. (Bernstein's theorem) (Bernstein, 1929) A function $\phi$ is totally monotone if and only if $\phi$ is the Laplace transform of a positive random variable $M$, i.e. $\phi(t) = \int_0^{\infty} e^{-Mt} p(M)\, dM$ and $P(M > 0) = 1$.

Hence, by Bernstein's theorem and Monte Carlo approximation, the generator $\phi$ can be written as
$$\phi(t) = \int_0^{\infty} e^{-Mt} p(M)\, dM = \mathbb{E}_{M \sim P(M)}\big[e^{-Mt}\big] \approx \frac{1}{T}\sum_{k=1}^{T} \exp(-M_k t), \tag{10}$$
where $T$ is the number of samplings and $M_k > 0$. Since a positive linear combination of totally monotone functions remains totally monotone, we can derive a more general form of Equation (10):
$$\phi(t) = \sum_{k=1}^{T} A_k \exp(-M_k t), \tag{11}$$
where $A_k \ge 0$. It can be seen as a one-layer neural network with the exponential function as its activation function. We name this network the Copula Net. Note that $\phi$ can be extended to a deep architecture, but that is not the focus of this paper; for more information, please refer to (Ling et al., 2020). Subsequently, by Equation (9), we generate copulas with respect to the variables $(\omega, \omega') \in \mathbb{R}^{2d}$:
$$C(\omega, \omega') = \sum_{k=1}^{T} A_k \exp\Big(-M_k \sum_{j=1}^{d} \big[\phi^{-1}(\omega_j) + \phi^{-1}(\omega'_j)\big]\Big), \tag{12}$$
where $\omega_j$ and $\omega'_j$ are the $j$-th elements of $\omega$ and $\omega'$ respectively, and $\phi$ is defined in Equation (11).
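The following is a minimal Py Torch sketch of our reading of the Copula Net generator in Equation (11) and the Archimedean construction in Equations (9) and (12); it is not the authors' implementation. The softmax/softplus parameterization (keeping $A_k \ge 0$, $M_k > 0$ and normalizing so that $\phi(0) = 1$) and the bisection inverse are our assumptions, and the inputs are assumed to lie in $[0, 1]$, e.g. the CDF-transformed weights in Equation (8).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopulaNet(nn.Module):
    """One-layer generator phi(t) = sum_k A_k exp(-M_k t), cf. Equation (11). Illustrative sketch."""

    def __init__(self, T: int = 16):
        super().__init__()
        self.a = nn.Parameter(torch.randn(T))  # mapped to A_k >= 0 (summing to 1) below
        self.m = nn.Parameter(torch.randn(T))  # mapped to M_k > 0 below

    def phi(self, t: torch.Tensor) -> torch.Tensor:
        A = torch.softmax(self.a, dim=0)   # A_k >= 0 and sum to 1, so phi(0) = 1 (our normalization)
        M = F.softplus(self.m)             # M_k > 0
        return (A * torch.exp(-M * t.unsqueeze(-1))).sum(-1)

    def phi_inv(self, u: torch.Tensor, iters: int = 60) -> torch.Tensor:
        """Bisection for phi^{-1}(u): phi decreases from phi(0) = 1 towards 0."""
        lo, hi = torch.zeros_like(u), torch.ones_like(u)
        while (self.phi(hi) > u).any():                 # expand the upper bracket
            hi = torch.where(self.phi(hi) > u, hi * 2.0, hi)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            above = self.phi(mid) > u                   # root lies in [mid, hi]
            lo = torch.where(above, mid, lo)
            hi = torch.where(above, hi, mid)
        return 0.5 * (lo + hi)

    def copula(self, u: torch.Tensor) -> torch.Tensor:
        """C(u_1, ..., u_d) = phi(sum_j phi^{-1}(u_j)), cf. Equations (9) and (12)."""
        return self.phi(self.phi_inv(u).sum(-1))

net = CopulaNet(T=8)
print(net.copula(torch.tensor([0.3, 0.7])))   # a scalar copula value in [0, 1]
```

The copula density $c$ is then obtained by differentiating the copula output with automatic differentiation, as discussed next.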
The inverse generator $\phi^{-1}$ can be calculated by numerical methods such as Newton's method or the bisection method, i.e. by solving $\phi(t) - \omega_i = 0$ for $t$. We present a simple case in Figure 2, where $t, t' \in \mathbb{R}$ are scalars and $\hat{t} = \phi(t)$. All steps in the copula module theoretically guarantee that Equation (12) is an efficient approximation of copulas. Further, the derivatives of the copula, namely the copula density function $c$, can be obtained with automatic differentiation libraries such as Py Torch (Paszke et al., 2019).

The copula is structured within the framework of a network architecture instead of a conventional statistical format, offering two significant advantages. Firstly, there is no need to choose the generator manually. According to Theorem 4.3, the exact parametric form of the generator relies on the optimization of the approximation (Equation (10)) and the selection of hyper-parameters in the copula network. This significantly improves the flexibility in the architecture of the generator, consequently enhancing the agility of the copulas. Secondly, using the expression of networks makes the copula more task-oriented. Through gradient descent, Coke Net optimizes the parameters in the copula module so that the consequent output is a reliable prediction. Therefore, the copulas are formed in the most task-oriented way.

4.3. Copula-Nested Spectral Kernel Mapping

Using the novel spectral density in Equation (8) and our elaborate copula density $c$, the spectral representation in Equation (2) can be written as
$$k(x, x') = \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{i(\omega^\top x - \omega'^\top x')} c(\omega, \omega')\, \hat{p}(\omega, \omega')\, d\omega\, d\omega'. \tag{13}$$
To loosen the symmetry restriction of the spectral kernel $k$, we rewrite Equation (13) as
$$k(x, x') = \int_{\mathbb{R}^d \times \mathbb{R}^d} \tau(\omega, \omega', x, x')\, c(\omega, \omega')\, \hat{p}(\omega, \omega')\, d\omega\, d\omega', \tag{14}$$
where $\tau(\omega, \omega', x, x')$ is defined as
$$\tau(\omega, \omega', x, x') = \frac{1}{8}\Big[e^{i(\omega^\top x - \omega'^\top x')} + e^{i(\omega'^\top x - \omega^\top x')} + e^{i(-\omega^\top x + \omega'^\top x')} + e^{i(-\omega'^\top x + \omega^\top x')} + e^{i(\omega^\top x - \omega^\top x')} + e^{i(\omega'^\top x - \omega'^\top x')} + e^{i(-\omega^\top x + \omega^\top x')} + e^{i(-\omega'^\top x + \omega'^\top x')}\Big]. \tag{15}$$
Applying Euler's formula, $e^{ix} = \cos(x) + i\sin(x)$, we rewrite $\tau(\omega, \omega', x, x')$ as
$$\tau(\omega, \omega', x, x') = \frac{1}{4}\Big[\cos(\omega^\top x - \omega'^\top x') + \cos(\omega'^\top x - \omega^\top x') + \cos(\omega^\top x - \omega^\top x') + \cos(\omega'^\top x - \omega'^\top x')\Big]. \tag{16}$$
We deem that $(\omega, \omega')$ follow the pseudo probability density function $\hat{p}$. Following the Monte Carlo approximation and Equation (16), we have
$$k(x, x') = \mathbb{E}_{(\omega, \omega') \sim \hat{P}}\big[\tau(\omega, \omega', x, x')\, c(\omega, \omega')\big] \approx \frac{1}{D}\sum_{m=1}^{D} \tau(\omega_m, \omega'_m, x, x')\, c(\omega_m, \omega'_m) = \Psi(x)^\top \Psi(x'), \tag{17}$$
$$\Psi(x) = \Big[\big(\cos(\omega_1^\top x) + \cos(\omega_1'^\top x)\big) c(\omega_1, \omega'_1),\ \dots,\ \big(\cos(\omega_D^\top x) + \cos(\omega_D'^\top x)\big) c(\omega_D, \omega'_D),\ \big(\sin(\omega_1^\top x) + \sin(\omega_1'^\top x)\big) c(\omega_1, \omega'_1),\ \dots,\ \big(\sin(\omega_D^\top x) + \sin(\omega_D'^\top x)\big) c(\omega_D, \omega'_D)\Big]^\top, \tag{18}$$
where $D$ is the number of random features and $(\omega_i, \omega'_i) \sim \hat{P}$ are the weights sampled for the $i$-th random feature. Further, we replace $\Psi$ with $\Phi$, given by
$$\Phi(x) = \Big[\big(\cos(\omega_1^\top x) + \cos(\omega_1'^\top x)\big) c(\omega_1, \omega'_1),\ \dots,\ \big(\cos(\omega_D^\top x) + \cos(\omega_D'^\top x)\big) c(\omega_D, \omega'_D)\Big]^\top, \tag{19}$$
since $\mathbb{E}[\Phi^\top(x)\Phi(x')] = \mathbb{E}[\Psi^\top(x)\Psi(x')]$. The explicit mapping defined by Equation (19) enables the kernel to be regarded as a neural network with the cosine function as its activation function. Its parameters can be updated via algorithms such as gradient descent. Furthermore, Coke Net is formulated as
$$\text{Coke Net}(x) = \Phi_L(\Phi_{L-1}(\dots \Phi_2(\Phi_1(x)))), \tag{20}$$
where $\Phi_l$ denotes the mapping of the $l$-th layer. The corresponding copula-nested spectral kernel (Coke) is
$$\hat{k}(x, x') = \big\langle \Phi_L(\dots \Phi_1(x)),\ \Phi_L(\dots \Phi_1(x')) \big\rangle. \tag{21}$$
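To illustrate Section 4.3, here is a minimal sketch (ours, not the authors' released code) of one copula-nested layer corresponding to Equation (19); the class name `CokeLayer`, the Gaussian initialization of the weights, and the placeholder copula weighting are illustrative assumptions. Passing a constant weighting of ones corresponds to the degenerate case $c \equiv 1$ noted after Equation (8); in the full model the weights come from the copula module above.

```python
import torch
import torch.nn as nn

class CokeLayer(nn.Module):
    """Sketch of one copula-nested spectral kernel mapping Phi (Equation (19)).

    The weight pairs (omega_m, omega'_m) are initialized from a Gaussian pseudo
    density and refined by backpropagation; `copula_density` stands in for the
    copula module output c(omega_m, omega'_m).
    """

    def __init__(self, in_dim: int, D: int, copula_density):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(in_dim, D))    # omega_1 .. omega_D
        self.omega_p = nn.Parameter(torch.randn(in_dim, D))  # omega'_1 .. omega'_D
        self.copula_density = copula_density                 # callable returning (D,) weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, in_dim)
        feats = torch.cos(x @ self.omega) + torch.cos(x @ self.omega_p)  # (batch, D)
        c = self.copula_density(self.omega, self.omega_p)                # (D,)
        return feats * c                                                 # Phi(x)

# Stacking L layers gives Coke Net(x) = Phi_L(...Phi_1(x)) as in Equation (20);
# the constant weighting below is a placeholder for the copula module.
layers = [CokeLayer(16, 64, lambda w, wp: torch.ones(64)),
          CokeLayer(64, 32, lambda w, wp: torch.ones(32))]
net = nn.Sequential(*layers)
out = net(torch.randn(8, 16))   # shape (8, 32)
```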
5. Analysis

By incorporating copulas into SKNs, our proposed model gains great improvements in its representation ability and flexibility. The explicit description of the dependence between variables enhances the diversity of the spectral densities, which makes it more convenient for users to add prior information about the data. Further, integrating copulas strengthens the generalization of the model. We discuss these advantages in detail below.

5.1. The Diversity and Uncertainty of Spectral Densities

We theoretically explain the capacity of Coke Net to increase the uncertainty of spectral densities via 1) the entropy of the weights and 2) the Wigner distribution of the copula-nested spectral kernel. Furthermore, we illustrate that this improvement makes Coke Net fall into local minima less easily compared to the plain SKN.

1) The entropy of the spectral density function. Equation (19) indicates that the spectral density, including the pseudo probability density function $\hat{p}$ as well as the copula module, determines the parameter space of Coke Net,
$$\Theta_{\text{Coke Net}} = \big\{(\Theta_c, \omega_i, \omega'_i) \mid (\omega_i, \omega'_i) \sim \hat{P},\ \omega_i, \omega'_i \in \mathbb{R}^d,\ \Theta_c \text{ are the parameters in the copula module}\big\}.$$
Similarly, the parameter space of the SKN without copulas is
$$\Theta_{\text{SKN}} = \big\{(\omega_i, \omega'_i) \mid (\omega_i, \omega'_i) \sim P,\ \omega_i, \omega'_i \in \mathbb{R}^d\big\}.$$
Therefore, the design of spectral densities strongly influences the performance of SKNs. In previous works, the spectral density $p$ is either the one-to-one corresponding spectral density (defined in Equation (2)) of a traditional kernel or a mixture of Gaussian components, leading to the following disadvantages. Firstly, using the spectral densities of existing kernels is closer to kernel approximation than to kernel design. Secondly, empirically choosing Gaussian components limits the selection of spectral kernels. In practice, many functions satisfying the condition described in Equation (2) are not taken into scope during the construction of spectral densities, which leaves a large part of the hypothesis space unexplored. Furthermore, the preconceived assumption may not be the most task-appropriate and can hinder the optimization process. The dependence structure of the weights is also neglected under this assumption.

Therefore, we consider incorporating copulas into the construction of spectral densities, which improves the diversity and uncertainty of spectral densities and depicts dependence relations. We demonstrate this improvement via the entropy of the weights $(\omega, \omega')$. By Equation (8), we know that $(\omega, \omega') \sim p(\omega, \omega') = c(\omega, \omega')\hat{p}(\omega, \omega')$, so the entropy is defined and decomposed as
$$H_{(\omega, \omega') \sim P}(\omega, \omega') = -\int p(\omega, \omega')\log\big(p(\omega, \omega')\big)\, d\omega\, d\omega' = H_{(\omega, \omega') \sim C}(\omega, \omega') + H_{(\omega, \omega') \sim \hat{P}}(\omega, \omega'). \tag{22}$$
The entropy of the weights is thus determined by the marginal distribution of each weight and by the copula function. When the copula functions vary dynamically according to the weights, they increase the entropy of the weights, which indicates the enhancement of the diversity and uncertainty of spectral densities. Hence, by involving copulas in the construction of spectral densities, the sampled weights in Coke Net contain more information and the parameter space is enriched.

2) Wigner distribution of the copula-nested spectral kernel. The Wigner transform (Flandrin, 1998) is a mathematical tool in time-frequency analysis, demonstrating the relation between input and frequency. The Wigner distribution function of a kernel $k$ is defined as $W_k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$,
$$W_k(x, \omega) = \int_{\mathbb{R}^d} k\Big(x + \frac{\tau}{2},\, x - \frac{\tau}{2}\Big) e^{-2i\pi \omega^\top \tau}\, d\tau. \tag{23}$$
Note that when the kernel is stationary, the Wigner distribution function reduces to the spectral density. It reveals the frequency structure of the kernel and provides insight into the properties of the spectral density. We demonstrate the Wigner distributions of Coke and the plain spectral kernel in Figure 3 and leave the detailed derivation and explanation to the appendix.
In brief, both Coke and the plain spectral kernel imply several sinusoidal signals. However, given a fixed input $x$, the Wigner distribution of Coke exhibits a clear value change with the frequency $\omega$ (denoted by the color change in the figure), whereas this variation is not apparent for the plain spectral kernel. This indicates that the copula-nested spectral density exhibits higher diversity.

Figure 3. The Wigner distribution of Coke (a) and the plain spectral kernel (b).

3) Coke Net falls into local minima less easily compared to the plain SKN. A wider parameter space, and hence a wider hypothesis space, enables the model to explore more deeply and can prevent it from getting stuck in local minima. Figure 4 demonstrates the change of the loss surface with respect to the parameters in an experiment described in Section 6.1. The loss surface of the SKN is rougher and fluctuates more, with multiple minima, while there is only one minimum in the loss surface of Coke Net. This indicates that the proposed model is less likely to fall into local minima during optimization.

Figure 4. The loss surface of the SKN (a) and Coke Net (b).

5.2. Description of Dependence Structures

Equation (22) shows that copulas bring extra information into the spectral densities. By further analyzing the increase in entropy brought by copulas, we obtain
$$H_{(\omega, \omega') \sim C}(\omega, \omega') = -\mathrm{KL}\big(p(\omega, \omega')\, \|\, \hat{p}(\omega, \omega')\big). \tag{24}$$
The entropy of $c$ corresponds to the (negative) Kullback-Leibler divergence of using $\hat{p}$ to approximate $p$. The difference between $p$ and $\hat{p}$ is the dependence structure between variables. Thus, the extra information indicated by Equation (22) is the internal dependence of the weights, verifying the capacity of copulas to tackle complicated relations between variables.

From another perspective, since the interactions between variables are explicitly described by copulas, we can incorporate reliable prior information into the construction of models via the copula module. For instance, when the data appear to be Gaussian, the pseudo probability density functions can be set to Gaussian components and the copula module is set to a constant. When the data appear to have tail dependence, which is hard to capture with Gaussian components, we can pre-train the copula module with the Clayton copula, which exhibits tail dependencies, to describe the interaction. The involved prior information can guide the learning process and lead to a better alignment between the model and existing patterns, enhancing interpretability.
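For intuition, consider a correlated Gaussian pair with standard normal margins, the same family used for the synthetic data in Section 6.1. Its copula contributes $\frac{1}{2}\log(1 - \rho^2)$ to the entropy, which equals $-\mathrm{KL}(p \,\|\, \hat{p})$ as in Equation (24). The short check below is our own illustration, not an experiment from the paper.

```python
# Illustrative check of Equations (22)/(24) for a bivariate Gaussian with
# correlation rho: the copula entropy 0.5*log(1 - rho^2) equals -KL(p || p_hat),
# where p_hat = p1 * p2 is the independent pseudo density.
import numpy as np
from scipy.stats import norm, multivariate_normal

rho = 0.8
p = multivariate_normal([0, 0], [[1, rho], [rho, 1]])

samples = p.rvs(size=200_000, random_state=0)
log_p = p.logpdf(samples)
log_p_hat = norm.logpdf(samples[:, 0]) + norm.logpdf(samples[:, 1])

kl_mc = np.mean(log_p - log_p_hat)          # Monte Carlo estimate of KL(p || p_hat)
h_copula = 0.5 * np.log(1 - rho ** 2)       # analytic copula entropy
print(f"-KL estimate: {-kl_mc:.4f}, analytic H_C: {h_copula:.4f}")  # both ≈ -0.5108
```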
5.3. The Generalization Ability

The integration of copulas improves the generalization of SKNs. We show the improvement via the following theorems. First, define the empirical Rademacher complexity.

Definition 5.1. The empirical Rademacher complexity of $\mathcal{F} = \{f \mid f \text{ is a binary classifier}\}$ is defined as
$$\hat{R}(\mathcal{F}) = \mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}} \Big(\frac{1}{N}\sum_{i=1}^{N} \sigma_i f(x_i)\Big)\Big], \tag{25}$$
where $\sigma_1, \dots, \sigma_N$ are independent random variables uniformly chosen from $\{-1, +1\}$ and $\{x_i\}_{i=1}^N$ is the dataset.

By Equation (19), the reproducing kernel Hilbert space (RKHS) of Coke Net is
$$\mathcal{G} = \big\{\Phi(\cdot) \mid \omega_m, \omega'_m \in \mathbb{R}^d,\ \Theta_c\big\}, \tag{26}$$
where $\Theta_c$ are the parameters in the copula modules. Note that the plain spectral kernel mapping is
$$\tilde{\Phi}(x) = \big[\cos(\omega_1^\top x) + \cos(\omega_1'^\top x),\ \dots,\ \cos(\omega_D^\top x) + \cos(\omega_D'^\top x)\big]^\top, \tag{27}$$
and its RKHS is defined as
$$\tilde{\mathcal{G}} = \big\{\tilde{\Phi}(\cdot) \mid \omega_m, \omega'_m \in \mathbb{R}^d\big\}. \tag{28}$$
We have the following theorem concerning the empirical Rademacher complexities of these hypothesis spaces.

Theorem 5.2. Following the notation in Equation (19) and Equation (27) and considering a dataset $\{(x_i, y_i)\}_{i=1}^N$, the empirical Rademacher complexity of $\mathcal{G}$ is bounded by
$$\hat{R}(\mathcal{G}) \le \frac{1}{N} \sup_{\Phi \in \mathcal{G}} \Big[\sum_{i=1}^{N}\sum_{m=1}^{D} 2c^2(\omega_m, \omega'_m)\big[1 + \cos((\omega_m - \omega'_m)^\top x_i)\big]\Big]^{1/2}. \tag{29}$$
The empirical Rademacher complexity of $\tilde{\mathcal{G}}$ is bounded by
$$\hat{R}(\tilde{\mathcal{G}}) \le \frac{1}{N} \sup_{\tilde{\Phi} \in \tilde{\mathcal{G}}} \Big[\sum_{i=1}^{N}\sum_{m=1}^{D} 2\big[1 + \cos((\omega_m - \omega'_m)^\top x_i)\big]\Big]^{1/2}. \tag{30}$$

Proof. The proof is given in the Appendix.

Theorem 5.3. (Shalev-Shwartz & Ben-David, 2014) Given a dataset $\{(x_i, y_i)\}_{i=1}^N$ and a hypothesis space $\mathcal{H}$, assume the loss function $\ell$ is $l$-Lipschitz and $\ell(\cdot) < \infty$. With probability at least $1 - \delta$, the following risk bound holds:
$$\epsilon(f^*) - \hat{\epsilon}(f) \le 2l\,\hat{R}(\mathcal{H}) + O\Big(\sqrt{\tfrac{\log(1/\delta)}{N}}\Big), \tag{31}$$
where $f \in \mathcal{H}$ and $f^* \in \mathcal{H}$ is the most accurate estimator in the hypothesis space, $\epsilon$ denotes the expected risk and $\hat{\epsilon}$ denotes the empirical risk.

By Theorem 5.2 and Theorem 5.3, we can guarantee that the proposed model has better generalization ability by constraining $\|c^2(\omega, \omega')\| \le 1$, which can be done by adding a regularizer to the loss function:
$$\ell(y, \hat{y}) = \mathrm{loss}(y, \hat{y}) + \lambda\big(1 - \|c^2(\omega, \omega')\|\big), \tag{32}$$
where $\lambda$ is the regularization parameter, $\hat{y}$ is the prediction and $y$ is the ground truth.

6. Experiment

In this section, systematic experiments are performed to evaluate our proposed Coke Net. We first empirically demonstrate the efficacy of Coke Net in depicting relations between variables on synthetic data. Then, we evaluate the performance of Coke Net compared with several state-of-the-art algorithms on six real-world datasets. All the experiments are implemented with Py Torch (Paszke et al., 2019).

6.1. Synthetic Data

To verify the capacity of Coke Net to capture the dependence structures between variables, we elaborate a series of synthetic datasets and conduct comparison and ablation experiments.

Figure 5. Data points of different dependence degrees. The dependence degrees of the blue points are all 0, while the dependence degree of the red points is 0.1 in (a), 0.5 in (b) and 0.9 in (c). Note that with an increase in the degree of dependence, the mixing of differently colored points becomes less distinct, thereby simplifying the classification task.

Table 1. Classification accuracy on synthetic data. The best results are highlighted in bold. DD stands for dependence degree.

| DD | SKN | Coke Net | Coke Net-P | Coke Net-R |
|---|---|---|---|---|
| 0.01 | 0.5616 | **0.5800** | 0.5250 | 0.5383 |
| 0.1 | 0.5366 | **0.5616** | 0.5516 | 0.5300 |
| 0.3 | 0.5433 | **0.5750** | 0.5483 | 0.5300 |
| 0.5 | 0.5916 | **0.5950** | 0.5816 | 0.5750 |
| 0.7 | 0.6516 | **0.6600** | 0.6583 | 0.6400 |
| 0.9 | 0.7616 | **0.7650** | 0.7566 | 0.7583 |
| 0.999 | 0.9700 | **0.9750** | **0.9750** | 0.9716 |

Data. We generate 7 pairs of 2-dimensional synthetic datasets with different degrees of dependence. Each pair contains two groups of data points: one group of individually independent data points sampled from $N(0, 1)$, and another group of data points sampled from $N(0, \Sigma)$, where $\Sigma = \begin{pmatrix} 1 & \alpha \\ \alpha & 1 \end{pmatrix}$. The $\alpha$, regarded as the dependence degree of each pair, is set to 0.1, 0.3, 0.5, 0.7, 0.9, 0.01 and 0.999 respectively (a minimal generation sketch is given after the figure discussion below). A binary classification task is then conducted on each pair. We illustrate three pairs of synthetic data points (dependence degrees 0.1, 0.5 and 0.9) in Figure 5 to show how the complexity of the task relates to the dependence degree. In Figure 5(a), the red and blue points are intermixed with no clear separation or pattern. In Figure 5(b), it starts to show that the variables of the red points are linearly correlated, while in Figure 5(c) the dependence relation is very obvious.
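The following is a minimal sketch of how one such pair can be generated (our illustration; the paper does not provide the generation code, and the per-class sample size and the value of $\alpha$ below are arbitrary choices).

```python
# Illustrative generation of one synthetic pair from Section 6.1: class 0 has
# independent coordinates, class 1 is drawn from N(0, Sigma) with off-diagonal
# dependence degree alpha. Sample size and alpha are assumed values.
import numpy as np

rng = np.random.default_rng(0)
n_per_class, alpha = 300, 0.5

x_indep = rng.standard_normal((n_per_class, 2))                    # N(0, I)
sigma = np.array([[1.0, alpha], [alpha, 1.0]])
x_dep = rng.multivariate_normal(np.zeros(2), sigma, size=n_per_class)

X = np.vstack([x_indep, x_dep])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])  # binary labels
```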
As the dependence degree increases from Figure 5(a) to Figure 5(c), the complexity and challenge of the classification task simultaneously drops. Compared Methods We compared the performances of (1) our method (Coke Net), (2) plain spectral kernel network without copulas (SKN), and several variants of our method: (3) Coke Net-P: replace the copula module c(ω, ω ) with parameters of the same dimension but with no dependence with ω, (4) Coke Net-R: change the copula network to a neural network with Re LU function as its activation function. The architectures of all networks are set to be 2 4 4 2, with a softmax function at the end for classification. Results Table 1 represents the results of all 7 pairs. The best results are highlighted in bold. The performance of all models improves with the increase in dependence degree, which is associated with the decrease in task complexity. Coke Net consistently leads across almost all tasks. The performance gap between Coke Net and other methods is more obvious at lower dependence degrees, which suggests the effectiveness of Coke Net in dealing with dependent data even when the dependence structure is nontrivial. The ablation experiment between Coke Net, Coke Net-R and Coke Net-P indicates that adding arbitrary parameters to the SKN cannot effectively capture the complex dependent relations. Copula-Nested Spectral Kernel Network Table 2. Classification accuracy in real-world datasets. The best results are highlighted in bold. DSKN ASKL Cos Net GRFF SRFF Coke Net-P Coke Net-R Coke Net Distal Phalanx Outline Correct 0.7789 0.7608 0.7717 0.6131 0.7754 0.7282 0.7355 0.8007 Earthquakes 0.7482 0.6762 0.6043 0.7482 0.6763 0.7482 0.7482 0.7553 Hand Outlines 0.9081 0.8378 0.8838 0.6459 0.9108 0.9081 0.9054 0.9135 Distal Phalanx TW 0.6474 0.6474 0.6043 0.6331 0.6331 0.6187 0.6402 0.6618 Proximal Phalanx Outline Correct 0.8797 0.8659 0.8729 0.6838 0.8866 0.9037 0.9003 0.9072 Proximal Phalanx TW 0.7804 0.8048 0.8049 0.7854 0.7805 0.7951 0.7707 0.8146 6.2. Real-World Data To demonstrate the efficacy of Coke Net, we conduct comparison and ablation experiments in several real-world datasets. Datasets We systematically evaluate the performance of Coke Net on six standard classification tasks from UCI repository (Blake, 1998). Compared Methods We compare the proposal with several mainstream spectral kernel methods, as follows: DSKN (Xue et al., 2019): Deep Spectral Kernel Network which embedded non-stationary spectral kernel into deep architectures; ASKL (Li et al., 2020): Automated Spectral Kernel Learning which incorporates the process of finding suitable kernels and model training in a learning framework; Cos Net (Xue et al., 2023): Complex-valued spectral kernel network, which generalizes spectral kernel mapping in real number domain to complex number domain; GRFF (Fang et al., 2023): Generative Random Fourier Features, a onestage kernel learning approach that models some latent distribution of the kernel via a generative network based on the random Fourier features; SRFF (Zhang et al., 2017): Stacked kernel network that learns a hierarchy of RKHS functions via random Fourier feature representation. Implementation Details Following the common practice, Gaussian probability density function is set to be the spectral density of all kernel methods. We denote the number of features in the dataset as nfeature, the number of classes as nclasses. The scale of all networks is uniformly set to nfeature 512 256 nclasses. 
The architecture of the neural network, with Re LU functions as its activation function in Coke Net-R, is set to be as same as the copula net. Detailed information about the setting in each dataset is in the Appendix. All method is trained by ADAM (Kingma & Ba, 2015) using mean squared error (MSE) loss. The learning rate is 0.001 without weight decay. Accuracy is the measurement. Table 2 demonstrates the results of real-world datasets. The best results are highlighted in bold. Results Overall, Coke Net consistently outperforms other models across all datasets. GRFF exhibits varied performances because of its progressive training strategy. In the backpropagation on a single batch, this scheme first only updates the last layer and the corresponding generator. Then parameters are added to the training sequence layer by layer, starting from the penultimate layer. Updating parameters multiple times during a single backpropagation process incurs unnecessary disturbances. The ablation studies between Coke Net, Coke Net-R and Coke Net-P indicate the efficacy of adding copula modules compared to arbitrary parameters and arbitrary neural networks. Additionally, while Coke Net-R has a lot more parameters than Coke Net-P, the performances of Coke Net-R are not always better. This suggests that merely making the architectures of the copula complicated does not guarantee better performance. 7. Conclusion In this paper, we propose Coke Net, a copula-nested spectral kernel network. In Coke Net, we first redefine spectral densities in the form of copulas. Secondly, we generalize Archimedean copulas to a hierarchical architecture and form the copula module. Subsequently, copula-nested spectral kernel mapping is obtained by integrating the copula module into the novel spectral density. Coke Net significantly expands the diversity of the hypothesis space and allows for an excavation of the complex dependencies inherent in data variables, overcoming the limitations posed by traditional spectral kernel networks. Theoretical analysis verifies the increase in the uncertainty of spectral densities, an improved data description capacity and a better generalization ability. The experimental results affirm the superiority of Coke Net over relevant state-of-the-art algorithms. Acknowledgements This work was supported by the National Natural Science Foundation of China (No. 62076062 and 62306070) and the Social Development Science and Technology Project of Jiangsu Province (No. BE2022811). Furthermore, the work was also supported by the Big Data Computing Center of Southeast University. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Copula-Nested Spectral Kernel Network Avron, H., Kapralov, M., Musco, C., Musco, C., Velingker, A., and Zandieh, A. Random fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning ICML, volume 70, pp. 253 262, Sydney, NSW, Australia, 2017. Bernstein, S. Sur les fonctions absolument monotones. Acta Mathematica, 52:1 66, 1929. Blake, C. Uci repository of machine learning databases. http://www. ics. uci. edu/ mlearn/MLRepository. html, 1998. Disegna, M., D Urso, P., and Durante, F. Copula-based fuzzy clustering of spatial time series. Spatial Statistics, 21:209 225, 2017. Fang, K., Liu, F., Huang, X., and Yang, J. 
End-to-end kernel learning via generative random fourier features. Pattern Recognition, 134:109057, 2023. Flandrin, P. Time-frequency/time-scale analysis. Academic press, 1998. Genest, C. and Rivest, L.-P. Statistical inference procedures for bivariate archimedean copulas. Journal of the American statistical Association, 88(423):1034 1043, 1993. Janke, T., Ghanmi, M., and Steinke, F. Implicit generative copulas. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), In Advances in Neural Information Processing Systems 34, volume 34, pp. 26028 26039, virtual, 2021. Jaworski, P., Durante, F., Hardle, W. K., and Rychlik, T. Copula theory and its applications, volume 198. 2010. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 2015. L azaro-Gredilla, M., Quinonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. Sparse spectrum gaussian process regression. The Journal of Machine Learning Research, 11:1865 1881, 2010. Li, J., Liu, Y., and Wang, W. Automated spectral kernel learning. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, volume 34, pp. 4618 4625, 2020. Li, J., Liu, Y., and Wang, W. Convolutional spectral kernel learning with generalization guarantees. Artificial Intelligence, 313:103803, 2022. Li, Z., Ton, J.-F., Oglic, D., and Sejdinovic, D. Towards a unified analysis of random fourier features. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, volume 97, pp. 3905 3914, Long Beach, California, USA, 2019. Ling, C. K., Fang, F., and Kolter, J. Z. Deep archimedean copulas. In Advances in Neural Information Processing Systems 33, volume 33, pp. 1535 1545, Virtual, 2020. Liu, F., Huang, X., Chen, Y., and Suykens, J. A. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7128 7148, 2021. Ng, Y., Hasan, A., Elkhalil, K., and Tarokh, V. Generative archimedean copulas. In de Campos, C. P., Maathuis, M. H., and Quaeghebeur, E. (eds.), Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI 2021, Virtual Event, 27-30 July 2021, volume 161 of Proceedings of Machine Learning Research, pp. 643 653. AUAI Press, 2021. Oh, D. H. and Patton, A. J. Time-varying systemic risk: Evidence from a dynamic copula model of cds spreads. Journal of Business & Economic Statistics, 36(2):181 195, 2018. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. 32:8024 8035, 2019. Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, volume 20, pp. 1177 1184, Vancouver, British Columbia, Canada, 2007. Remes, S., Heinonen, M., and Kaski, S. Non-stationary spectral kernels. In Advances in Neural Information Processing Systems 30, volume 30, pp. 4642 4651, Long Beach, CA, USA, 2017. Samo, Y.-L. K. and Roberts, S. Generalized spectral kernels. ar Xiv preprint ar Xiv:1506.02236, 2015. Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Shawe-Taylor, J. and Cristianini, N. Kernel methods for pattern analysis. 2004. Sinha, A. and Duchi, J. C. Learning kernels with random features. 
In Advances in Neural Information Processing Systems 29, pp. 1298 1306, Barcelona, Spain, 2016. Copula-Nested Spectral Kernel Network Sklar, M. Fonctions de r epartition a n dimensions et leurs marges. In Annales de l ISUP, volume 8, pp. 229 231, 1959. Ton, J.-F., Flaxman, S., Sejdinovic, D., and Bhatt, S. Spatial mapping with gaussian processes and nonstationary fourier features. Spatial statistics, 28:59 78, 2018. Williams, C. and Seeger, M. Using the nystr om method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pp. 682 688, Denver, CO, USA, 2000. Wilson, A. and Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on Machine Learning, ICML, volume 28, pp. 1067 1075, Atlanta, GA, USA, 2013. Xu, P., Wang, Y., Chen, X., and Tian, Z. Deep kernel learning networks with multiple learning paths. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 4438 4442, Virtual and Singapore, 2022. Xue, H. and Wu, Z.-F. Baker-nets: Bayesian random kernel mapping networks. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI, pp. 3073 3079, 2020. Xue, H., Wu, Z.-F., and Sun, W.-X. Deep spectral kernel learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI, pp. 4019 4025, Macao, China, 2019. Xue, Y., Fang, P., Tian, J., Zhu, S., et al. Cosnet: A generalized spectral kernel network. In Advances in Neural Information Processing Systems 37, 2023. Yaglom, A. M. Correlation Theory of Stationary and Related Random Functions, Volume I: Basic Results, volume 131. 1987. Yang, J., Sindhwani, V., Avron, H., and Mahoney, M. Quasimonte carlo feature maps for shift-invariant kernels. In Proceedings of the 31th International Conference on Machine Learning, ICML, volume 32, pp. 485 493, Beijing, China, 2014. Zhang, S., Li, J., Xie, P., Zhang, Y., Shao, M., Zhou, H., and Yan, M. Stacked kernel network. ar Xiv preprint ar Xiv:1711.09219, 2017. Copula-Nested Spectral Kernel Network In the Appendix, we provide: Detailed derivation of equations and theorems. The Wigner distribution functions of kernels. Loss surface of SKN and Coke Net in the experiments described in Section 6.1. Extensive experiments. Additional information about the real-world classification experiments. A. The Detailed Derivation of Equation (22) The entropy of weights (ω, ω ) is defined as H (ω,ω ) P(ω, ω ) = Z p(ω, ω )log(p(ω, ω ))dωdω , (33) where (ω, ω ) P. We define the joint probability density function of (ω, ω ) as c(ω, ω )ˆp(ω, ω ) and ˆp(ω, ω ) = p(ω)p(ω ) is the product of the marginal probability density of ω and ω . Therefore, Equation (33) can be written as H (ω,ω ) P(ω, ω ) = Z p(ω, ω )log(p(ω, ω ))dωdω = Z c(ω, ω )ˆp(ω, ω )[log(c(ω, ω )) + log(ˆp(ω, ω ))]dωdω = Z c(ω, ω )log(c(ω, ω ))dωdω Z ˆp(ω, ω )log(ˆp(ω, ω ))dωdω = H (ω,ω ) C(ω, ω ) + H (ω,ω ) P(ω, ω ). B. The Detailed Derivation of Equation (24) Given that p(ω, ω ) = c(ω, ω )ˆp(ω, ω ), we have KL(p(ω, ω )||ˆp((ω, ω )) = Z p(ω, ω )log(p(ω, ω ) ˆp(ω, ω ))dωdω = Z c(ω, ω )ˆp(ω, ω )log(c(ω, ω ))dωdω = Z c(ω, ω )log(c(ω, ω ))dωdω = H (ω,ω ) C(ω, ω ). C. The Wigner Distribution of Kernels. Recall that the Wigner distribution function of a kernel k is defined as Wk : Rd Rd R Wk(x, ω) = Z 2 )e 2iπω τdτ. 
(36) And the copula-nested spectral kernel can be written as k(x, x ) = E(ω,ω ) ˆ P [τ(ω, ω , x, x )c(ω, ω )] 1 m=1 τ(ωm, ω m, x, x )c(ωm, ω m), (37) Copula-Nested Spectral Kernel Network where τ(ω, ω , x, x ) is 1 4[cos(ω x ω x ) + cos(ω x ω x ) + cos(ω x ω x ) + cos(ω x ω x )]. (38) By inserting Equation (37) into Equation (36), and set Am = ωm ωm, Bm = ωm+ωm 2 , Wk(x, ω) of Coke is Rd τ(ωm, ω m, x + τ 2 )c(ωm, ω m)e 2iπω τdτ Rd[cos(A mx + B mτ) + cos( A mx + B mτ) + cos(ω mτ) + cos(ω m τ)]c(ωm, ω m)e 2iπω τdτ m=1 c(ωm, ω m) Z Rd[cos(A mx + B mτ) + cos( A mx + B mτ) + cos(ω mτ) + cos(ω m τ)]e 2iπω τdτ. (39) For simplicity of understanding, we break down the calculation of the integral into four parts. By Euler s formula, the calculation of the first part R Rd cos(A mx + B mτ)e 2iπω τdτ is as follows Rd[ei(A mx+B mτ) + e i(A mx+B mτ)]e 2iπω τdτ Rd ei(A mx+B mτ)e 2iπω τdτ + Z Rd e i(A mx+B mτ)]e 2iπω τdτ] 2[ei A mx Z Rd ei(Bm 2πω) τdτ + e i A mx Z Rd e i(Bm+2πω) τdτ] 2(2π)d[ei A mxδ(Bm 2πω) + e i A mxδ(Bm + 2πω)], where δ( ) is the dirac delta function. Similarly, that of the second part is the same. And we can obtain the integral of the third part is 1 2(2π)d[δ(ωm 2πω) + δ(ωm + 2πω)]. (41) While that of the fourth part is 1 2(2π)d[δ(ω m 2πω) + δ(ω m + 2πω). (42) Combining these four parts together, we obtain the Wigner distribution of the Coke m=1 c(ωm, ω m)[2ei A mxδ(Bm 2πω) + 2e i A mxδ(Bm + 2πω) + δ(ωm 2πω) + δ(ωm + 2πω) + δ(ω m 2πω) + δ(ω m + 2πω)]. And we can obtain the Wigner distribution of plain spectral kernel by setting c(ωm, ω m) 1. In terms of the figures in Figure 3, for the simplicity of visualization, we consider the simple scenario where d = 1, i.e. ωm, ω m, Am, Bm R. And set x = ai, a R, i is the imaginary unit. D. Loss Surface In Figure 4, we demonstrate the change of loss according to parameters in the experiment in Section 6.1. Here we represent the loss surface of Coke Net and SKN on the 7 pairs of synthetic data. Note that the figures in Figure 4 are the results of the dependence degree 0.1. Copula-Nested Spectral Kernel Network Figure 6. Loss surface of Coke Net on synthetic data of different dependence degrees. Figure 7. Loss surface of SKN on synthetic data of different dependence degrees. E. Extensive Experiments To further verify the superiority of Coke Net, we conduct regression tasks on several UCI datasets and an image classification task on the CIAFR10 datasets compared to some relevant algorithms. The results are represented in the following tables. Table 3. The MSE loss of DSKN, Coke Net, Cos Net, SRFF on several regression datasets. The best results are highlighted in bold. Dataset DSKN Coke Net Cos Net SRFF power 0.4812 0.1004 0.1021 0.1009 concrete 0.8803 0.7977 0.8891 0.8924 yacht 1.9578 1.6857 2.3417 2.1974 airfoil 0.1128 0.0980 0.1070 0.4986 boston 0.4938 0.4492 0.5476 0.7136 wine red 0.6058 0.5967 0.6560 0.6423 wine white 0.7598 0.7476 0.7525 0.8347 Table 4. The accuracy of DSKN, Coke Net, Cos Net, SRFF on CIFAR10. The best results are highlighted in bold. Dataset DSKN Coke Net Cos Net SRFF Accuracy 0.8148 0.8172 0.6631 0.7359 F. Proof of Theorem 5.2 Theorem F.1. Following the notation in Equation (27) and Equation (19) and considering a dataset {(xi, yi)}N i=1, the empirical Rademacher complexity of G is bounded by m=1 2c2(ωm, ω m)[1 + cos((ωm ω m) xi)]1/2. (44) The empirical Rademacher complexity of G is bounded by m=1 2[1 + cos((ωm ω m) xi)]1/2. (45) Copula-Nested Spectral Kernel Network Proof. 
Following the same notation as above and the definition of empirical Rademacher complexity (Equation (25)), we obtain that
$$\hat{R}(\mathcal{G}) = \mathbb{E}_{\sigma}\Big[\sup_{\Phi \in \mathcal{G}} \Big(\frac{1}{N}\sum_{i=1}^{N} \sigma_i \Phi(x_i)\Big)\Big] \le \frac{1}{N}\,\mathbb{E}_{\sigma}\Big[\sup_{\Phi \in \mathcal{G}} \Big\|\sum_{i=1}^{N} \sigma_i \Phi(x_i)\Big\|\Big]. \tag{46}$$
Since $\big\|\sum_{i=1}^{N} \sigma_i \Phi(x_i)\big\|^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} \sigma_i \sigma_j \Phi^\top(x_i)\Phi(x_j)$, we have
$$\hat{R}(\mathcal{G}) \le \frac{1}{N}\,\mathbb{E}_{\sigma}\Big[\sup_{\Phi \in \mathcal{G}} \Big(\sum_{i=1}^{N}\sum_{j=1}^{N} \sigma_i \sigma_j \Phi^\top(x_i)\Phi(x_j)\Big)^{1/2}\Big] = \frac{1}{N} \sup_{\Phi \in \mathcal{G}} \Big(\sum_{i=1}^{N} \Phi^\top(x_i)\Phi(x_i)\Big)^{1/2}. \tag{47}$$
Next, we use Equation (18), which is equivalent to Equation (19):
$$\Phi^\top(x)\Phi(x) = \sum_{m=1}^{D} c^2(\omega_m, \omega'_m)\Big[\big(\cos(\omega_m^\top x) + \cos(\omega_m'^\top x)\big)^2 + \big(\sin(\omega_m^\top x) + \sin(\omega_m'^\top x)\big)^2\Big] = \sum_{m=1}^{D} 2c^2(\omega_m, \omega'_m)\big[1 + \cos((\omega_m - \omega'_m)^\top x)\big]. \tag{48}$$
Inserting Equation (48) into Equation (47), we obtain that
$$\hat{R}(\mathcal{G}) \le \frac{1}{N} \sup_{\Phi \in \mathcal{G}} \Big[\sum_{i=1}^{N}\sum_{m=1}^{D} 2c^2(\omega_m, \omega'_m)\big[1 + \cos((\omega_m - \omega'_m)^\top x_i)\big]\Big]^{1/2}. \tag{49}$$
By setting $c(\omega, \omega') = 1$, we obtain the bound for the plain non-stationary spectral kernel.

G. Additional Information of Real-World Classification Experiments

Table 5. Detailed information about the real-world experiments.

| Dataset | Type | Input.Dim | Train.Num | Test.Num | Classes |
|---|---|---|---|---|---|
| Distal Phalanx Outline Correct | Image | 80 | 400 | 139 | 3 |
| Earthquakes | Sensor | 512 | 322 | 139 | 2 |
| Hand Outlines | Image | 2709 | 1000 | 370 | 2 |
| Distal Phalanx TW | Image | 80 | 400 | 139 | 6 |
| Proximal Phalanx Outline Correct | Image | 80 | 600 | 291 | 2 |
| Proximal Phalanx TW | Image | 80 | 400 | 205 | 6 |

Table 6. Architecture of networks in each dataset.

| Dataset | Architecture |
|---|---|
| Distal Phalanx Outline Correct | 80 → 512 → 256 → 3 |
| Earthquakes | 512 → 512 → 256 → 2 |
| Hand Outlines | 2709 → 512 → 256 → 2 |
| Distal Phalanx TW | 80 → 512 → 256 → 6 |
| Proximal Phalanx Outline Correct | 80 → 512 → 256 → 2 |
| Proximal Phalanx TW | 80 → 512 → 256 → 6 |