# Non-parametric Representation Learning with Kernels

Pascal Esser*, Maximilian Fleissner*, Debarghya Ghoshdastidar
Technical University of Munich, Germany
esser@cit.tum.de, fleissner@cit.tum.de, ghoshdas@cit.tum.de
*These authors contributed equally.

Unsupervised and self-supervised representation learning has become popular in recent years for learning useful features from unlabelled data. Representation learning has been mostly developed in the neural network literature, and other models for representation learning are surprisingly unexplored. In this work, we introduce and analyze several kernel-based representation learning approaches: firstly, we define two kernel Self-Supervised Learning (SSL) models using contrastive loss functions and, secondly, a Kernel Autoencoder (AE) model based on the idea of embedding and reconstructing data. We argue that the classical representer theorems for supervised kernel machines are not always applicable for (self-supervised) representation learning, and present new representer theorems, which show that the representations learned by our kernel models can be expressed in terms of kernel matrices. We further derive generalisation error bounds for representation learning with kernel SSL and AE, and empirically evaluate the performance of these methods both in small data regimes and in comparison with neural network based models.

Introduction

Representation learning builds on the idea that for most data, there exists a lower dimensional embedding that still retains most of the information useful for a downstream task (Bengio, Courville, and Vincent 2013). While early works relied on pre-defined representations, including image descriptors such as SURF (Bay, Tuytelaars, and Van Gool 2006) or SIFT (Lowe 1999) as well as bag-of-words approaches, over the past decade the focus has moved to representations learned from the data itself. Across a wide range of tasks, including image classification and natural language processing (Bengio, Courville, and Vincent 2013), this approach has proven to be more powerful than the use of hand-crafted descriptors. In addition, representation learning has gained increasing popularity in recent years as it provides a way of taking advantage of unlabelled data in a partially labelled data setting. Since the early works, methods for representation learning have predominantly relied on neural networks, and there has been little focus on other classes of models. This may be part of the reason why the field is still mostly driven from an experimental perspective. In this work, we focus on the following two learning paradigms that both fall under the umbrella of representation learning:

Self-supervised representation learning using contrastive loss functions has been established in recent years as an important method between supervised and unsupervised learning, as it does not require explicit labels but relies on implicit knowledge of what makes samples semantically close to others. SSL therefore builds on inputs and inter-sample relations $(\mathbf{X}, \tilde{\mathbf{X}})$, where $\tilde{\mathbf{X}}$ is often constructed through data augmentations of $\mathbf{X}$ known to preserve input semantics, such as additive noise or a horizontal flip of an image (Kanazawa, Jacobs, and Chandraker 2016). While the idea of SSL is not new (Bromley et al.
1993), the main focus has been on deep SSL models, which have been highly successful in domains such as computer vision (Chen et al. 2020; Jing and Tian 2019) and natural language processing (Misra and Maaten 2020; Devlin et al. 2019).

Unsupervised representation learning through reconstruction relies only on a set of features $\mathbf{X}$ without having access to the labels. The high-level idea is to map the data to a lower dimensional latent space, and then back to the features. The model is optimised by minimising the difference between the input data and the reconstruction. This has been formalized through principal component analysis (PCA) (Pearson 1901) and its nonlinear extension Kernel PCA (Schölkopf, Smola, and Müller 1998). While few approaches exist in traditional machine learning, the paradigm of representation through reconstruction has built the foundation of a large number of deep learning methods. Autoencoders (AE) (Kramer 1991) use a neural network for both the embedding into the latent space as well as for the reconstruction. The empirical success of autoencoders has given rise to a large body of work, developed for task-specific regularisation (e.g. (Yang et al. 2017)), as well as for a wide range of applications such as image denoising (Buades, Coll, and Morel 2005a), clustering (Yang et al. 2017) or natural language processing (Zhang et al. 2022). However, their theoretical understanding is still limited to analyzing critical points and dynamics in shallow linear networks (Kunin et al. 2019; Pretorius, Kroon, and Kamper 2018; Refinetti and Goldt 2022). An extended version of this paper including proofs can be found at https://arxiv.org/abs/2309.02028.

Kernel representation learning. In spite of the widespread use of deep learning, other models are still ubiquitous in data science. For instance, decision tree ensembles are competitive with neural networks in various domains (Shwartz-Ziv and Armon 2022; Roßbach 2018; Gu, Kelly, and Xiu 2020), and are preferred due to interpretability. Another well established approach is kernel methods, which we will focus on in this paper. At an algorithmic level, kernel methods rely on the pairwise similarities between datapoints, captured by a kernel $k(\mathbf{x}, \mathbf{x}')$. When the map $k$ is positive definite, $k(\mathbf{x}, \mathbf{x}')$ corresponds to the inner product between (potentially infinite-dimensional) nonlinear transformations of the data, and implicitly maps the data to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ through a feature map $\phi$ that satisfies $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$. Thus, any algorithm that relies exclusively on inner products can implicitly be run in $\mathcal{H}$ by simply evaluating the kernel $k$. Kernel methods are among the most successful models in machine learning, particularly due to their inherently non-linear and non-parametric nature, which nonetheless allows for a sound theoretical analysis. Kernels have been used extensively in regression (Kimeldorf and Wahba 1971; Wahba 1990) and classification (Cortes and Vapnik 1995; Mika et al. 1999). Since representation learning, or finding suitable features, is a key challenge in many scientific fields, we believe there is considerable scope for developing such models in these fields. The goal of this paper is to establish that one can construct non-parametric representation learning models based on data reconstruction and contrastive losses.
By reformulating the respective optimisation problems for such models using positive definite kernels (Aronszajn 1950; Schölkopf and Smola 2002), we implicitly make use of non-linear feature maps $\phi(\cdot)$. Moreover, the presented approaches do not reduce to traditional (unsupervised) kernel methods. In the reconstruction based setting we define a Kernel AE, and we also present kernel-based self-supervised methods by considering two different contrastive loss functions. Thereby, our work takes a significant step towards decoupling the representation learning paradigm from deep learning. To this end, kernel methods are an ideal alternative since (i) kernel methods are suitable for small data problems that are prevalent in many scientific fields (Xu et al. 2023; Todman, Bush, and Hood 2023; Chahal and Toner 2021); (ii) kernels are non-parametric, and yet considered to be quite interpretable (Ponte and Melko 2017; Hainmueller and Hazlett 2014); and (iii) as we show, there is a natural translation from deep SSL to kernel SSL, without compromising performance.

Contributions. The main contributions of this work are the development and analysis of kernel methods for reconstruction and contrastive SSL models. More specifically:

1. Kernel Contrastive Learning. We present kernel variants of a single hidden layer network that minimises two popular contrastive losses. For a simple contrastive loss (Saunshi et al. 2019), the optimisation is closely related to a kernel eigenvalue problem, while we show that the minimisation of the spectral contrastive loss (HaoChen et al. 2021) in the kernel setting can be rephrased as a kernel matrix based optimisation.
2. Kernel Autoencoder. We present a Kernel AE where the encoder learns a low-dimensional representation. We show that a Kernel AE can be learned by solving a kernel matrix based optimisation problem.
3. Theory. We present an extension of the existing representer theorem under orthogonality constraints. Furthermore, we derive generalisation error bounds for the proposed kernel models, which show that the predictions of the models improve with an increasing number of unlabelled data.
4. Experiments. We empirically demonstrate that the three proposed kernel methods perform on par with or outperform classification on the original features as well as Kernel PCA, and we compare them to neural network representation learning models.

Related Work

(Johnson, Hanchi, and Maddison 2022) show that minimising certain contrastive losses can be interpreted as learning kernel functions that approximate a fixed positive-pair kernel and, hence, propose an approach of combining deep SSL with Kernel PCA. Closer to our work appears to be (Kiani et al. 2022), where the neural network is replaced by a function learned on the RKHS of a kernel. However, their loss functions are quite different from ours. Moreover, by generalising the representer theorem, we can also enforce orthonormality on the embedding maps from the RKHS itself. (Zhai et al. 2023) studies the role of augmentations in SSL through the lens of the RKHS induced by an augmentation. (Shah et al. 2022) present a margin maximisation approach for contrastive learning that can be solved using kernel support vector machines. Their approach is close to our simple contrastive loss method (Definition 1), but not the same, as we obtain a kernel eigenvalue problem. While (Johnson, Hanchi, and Maddison 2022; Shah et al.
2022) consider specific contrastive losses, we present a wider range of kernel SSL models, including the Kernel AE, and provide generalisation error bounds for all proposed models.

Notation

We denote matrices by bold capital letters $\mathbf{A}$, vectors as $\mathbf{a}$, and write $\mathbf{I}_m$ for the identity matrix of size $m \in \mathbb{N}$. For a given kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we denote by $\phi: \mathbb{R}^d \to \mathcal{H}$ its canonical feature map into the associated RKHS $\mathcal{H}$. Given data $\mathbf{x}_1, \dots, \mathbf{x}_n$ collected in a matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$, we write $\Phi = (\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_n))$ and define $\mathcal{H}_X$ as the finite-dimensional subspace spanned by $\Phi$. Recall that $\mathcal{H}$ can be decomposed as $\mathcal{H} = \mathcal{H}_X \oplus \mathcal{H}_X^\perp$. We denote by $\mathbf{K} = \Phi^T\Phi \in \mathbb{R}^{n \times n}$ the kernel matrix, and define $k(\mathbf{x}_*, \mathbf{X}) = \Phi^T\phi(\mathbf{x}_*)$. Throughout the paper we assume $n$ datapoints are used to train the representation learning model, which embeds from $\mathbb{R}^d$ into an $h$-dimensional space. On a formal level, the problem could be stated within the generalised framework of matrix-valued kernels $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{h \times h}$, because the vector-valued RKHS $\mathcal{H}(K)$ associated with a matrix-valued kernel $K$ naturally contains functions $\mathbf{W}$ that map from $\mathbb{R}^d$ to $\mathbb{R}^h$. For the scope of this paper, however, it is sufficient to assume $K(x, y) = \mathbf{I}_h\, k(x, y)$ for some scalar kernel $k(x, y)$ with real-valued RKHS $\mathcal{H}$. Then, the norm of any $\mathbf{W} \in \mathcal{H}(K)$ is simply the Hilbert–Schmidt norm, which we denote as $\|\mathbf{W}\|$ (for finite-dimensional matrices, the Frobenius norm), and learning the embedding from $\mathbb{R}^d$ to $\mathbb{R}^h$ reduces to learning $h$ individual vectors $\mathbf{w}_1, \dots, \mathbf{w}_h \in \mathcal{H}$. In other words, we can interpret $\mathbf{W} \in \mathcal{H}(K)$ as a (potentially infinite-dimensional) matrix with columns $\mathbf{w}_1, \dots, \mathbf{w}_h$, sometimes writing $\mathbf{W} = (\mathbf{w}_1, \dots, \mathbf{w}_h)$ for notational convenience. To underline the similarity with the deep learning framework, we denote $\mathbf{W}^T\phi(x) = (\langle \mathbf{w}_t, \phi(x)\rangle)_{t=1}^{h} = (\mathbf{w}_1(x), \dots, \mathbf{w}_h(x)) \in \mathbb{R}^h$, where we invoke the reproducing property of the RKHS in the last step. We denote by $\mathbf{W}^*$ the adjoint operator of $\mathbf{W}$ (in a finite-dimensional setting, $\mathbf{W}^*$ simply becomes the transpose $\mathbf{W}^T$). The constraint $\mathbf{W}^*\mathbf{W} = \mathbf{I}_h$ enforces orthonormality between all pairs $(\mathbf{w}_i, \mathbf{w}_j)_{i,j \leq h}$.

Representer Theorems

In principle, kernel methods minimise a loss functional over the entire, possibly infinite-dimensional RKHS. It is the celebrated representer theorem (Kimeldorf and Wahba 1971; Schölkopf, Herbrich, and Smola 2001) that ensures the practical feasibility of this approach: under mild conditions on the loss $\mathcal{L}$, the optimiser is surely contained within the finite-dimensional subspace $\mathcal{H}_X$. For example, in standard kernel ridge regression, the loss functional is simply the regularised empirical squared error
$$\sum_{i=1}^{n} \big(\mathbf{w}(x_i) - y_i\big)^2 + \lambda \|\mathbf{w}\|_{\mathcal{H}}^2.$$
The fact that all minimisers of this problem indeed lie in $\mathcal{H}_X$ can be seen by simply decomposing $\mathcal{H} = \mathcal{H}_X \oplus \mathcal{H}_X^\perp$, observing that $\mathbf{w}(x_i) = 0$ for all $\mathbf{w} \in \mathcal{H}_X^\perp$, and concluding that projecting any $\mathbf{w}$ onto $\mathcal{H}_X$ can only ever decrease the functional. This very argument can be extended to representation learning, where regularisation is important to avoid mode collapse. We formally state the following result.

Theorem 1 (Representer Theorem for Representation Learning). Given data $\mathbf{x}_1, \dots, \mathbf{x}_n$, denote by $\mathcal{L}_X(\mathbf{w}_1, \dots, \mathbf{w}_h)$ a loss functional on $\mathcal{H}^h$ that does not change whenever $\mathbf{w}_1, \dots, \mathbf{w}_h$ are projected onto the finite-dimensional subspace $\mathcal{H}_X$ spanned by the data. Then, any minimiser of the regularised loss functional
$$\mathcal{L}(\mathbf{w}_1, \dots, \mathbf{w}_h) = \mathcal{L}_X(\mathbf{w}_1, \dots, \mathbf{w}_h) + \lambda \|\mathbf{W}\|^2$$
consists of $\mathbf{w}_1, \dots, \mathbf{w}_h \in \mathcal{H}_X$.
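To make the argument preceding Theorem 1 concrete, the following minimal sketch applies it in the kernel ridge regression case: since the minimiser lies in $\mathcal{H}_X$, both training and prediction only require the kernel matrix $\mathbf{K}$ and the vector $k(\mathbf{x}_*, \mathbf{X})$. This is illustrative code, not the authors' implementation; the Gaussian kernel, the synthetic data and all hyperparameters are assumptions.

```python
# Minimal sketch of the classical representer theorem for kernel ridge
# regression: the minimiser can be written as w = sum_i alpha_i * phi(x_i),
# so everything is computed from kernel evaluations only.
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # A: (d, n), B: (d, m) -> kernel matrix of shape (n, m)
    sq = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)
d, n = 5, 40
X = rng.normal(size=(d, n))                      # training inputs (columns)
y = np.sin(X[0]) + 0.1 * rng.normal(size=n)      # noisy targets
lam = 0.1

K = gaussian_kernel(X, X)                        # K = Phi^T Phi
alpha = np.linalg.solve(K + lam * np.eye(n), y)  # coefficients of w in H_X

x_star = rng.normal(size=(d, 1))
k_star = gaussian_kernel(X, x_star)[:, 0]        # k(x_*, X) = Phi^T phi(x_*)
f_star = k_star @ alpha                          # prediction via kernels only
print(f_star)
```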
This justifies the use of kernel methods when the norm of the embedding map is penalized. However, it does not address loss functionals that instead impose an orthonormality constraint on the embedding $\mathbf{W}$. It is natural to ask when a representer theorem exists for these settings as well. Below, we give a necessary and sufficient condition.

Theorem 2 (Representer theorem under orthonormality constraints). Given data $\mathbf{X}$ and an embedding dimension $h \in \mathbb{N}$, let $\mathcal{L}: \mathcal{H}^h \to \mathbb{R}$ be a loss function that vanishes on $\mathcal{H}_X^\perp$. Assume $\dim(\mathcal{H}_X) \geq h$. Consider the following constrained minimisation problem over $\mathbf{w}_1, \dots, \mathbf{w}_h \in \mathcal{H}$:
$$\text{minimise } \mathcal{L}(\mathbf{w}_1, \dots, \mathbf{w}_h) \quad \text{s.t.} \quad \mathbf{W}^*\mathbf{W} = \mathbf{I}_h. \qquad (1)$$
Furthermore, consider the inequality-constrained problem over $\mathcal{H}_X$:
$$\text{minimise } \mathcal{L}(\mathbf{w}_1, \dots, \mathbf{w}_h) \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} \preceq \mathbf{I}_h \text{ and } \mathbf{w}_1, \dots, \mathbf{w}_h \in \mathcal{H}_X. \qquad (2)$$
Then, every minimiser of (1) is contained in $\mathcal{H}_X^h$ if and only if every minimiser of (2) satisfies $\mathbf{W}^T\mathbf{W} = \mathbf{I}_h$.

In practice, the condition on (2) can often be verified directly by checking the gradient of $\mathcal{L}$ on $\mathcal{H}_X$, or under orthonormalization (see Appendix). Together with the standard representer theorem, this guarantees that kernel methods can indeed be extended to representation learning without sacrificing the appealing properties that the representer theorem provides us with.

Representation Learning with Kernels

Building on this foundation, we can now formalize the previously discussed representation learning paradigms in the kernel setting, namely SSL using contrastive loss functions, as well as unsupervised learning through a reconstruction loss.

Simple Contrastive Loss

For convenience, we restrict ourselves to a triplet setting with training samples $(\mathbf{x}_i, \mathbf{x}_i^+, \mathbf{x}_i^-)$, $i = 1, \dots, n$. The idea is to consider an anchor image $\mathbf{x}_i$, a positive sample $\mathbf{x}_i^+$ generated using data augmentation techniques, as well as an independent negative sample $\mathbf{x}_i^-$. The goal is to align the anchor more with the positive sample than with the independent negative sample. In the following, we consider two loss functions that implement this idea. In both cases, we kernelize a single hidden layer, mapping data $\mathbf{x} \in \mathbb{R}^d$ to an embedding $\mathbf{z} \in \mathbb{R}^h$:
$$\mathbf{x} \in \mathbb{R}^d \;\xrightarrow{\;\phi(\cdot)\;}\; \mathbf{r} \in \mathcal{H} \;\xrightarrow{\;\mathbf{W}\;}\; \mathbf{z} \in \mathbb{R}^h. \qquad (3)$$

We start with a simple contrastive loss inspired by (Saunshi et al. 2019), with additional regularisation. Intuitively, this loss directly compares the alignment between the anchor and the positive sample with the alignment between the anchor and the negative sample. Formally, we define it as follows.

Definition 1 (Contrastive Kernel Learning). We learn a representation of the form $f_{\mathbf{W}}(\mathbf{x}) = \mathbf{W}^T\phi(\mathbf{x})$ (see the mapping in Eq. 3) by optimising the objective function
$$\sum_{i=1}^{n} f_{\mathbf{W}}(\mathbf{x}_i)^T\big(f_{\mathbf{W}}(\mathbf{x}_i^-) - f_{\mathbf{W}}(\mathbf{x}_i^+)\big) \quad \text{s.t.} \quad \mathbf{W}^*\mathbf{W} = \mathbf{I}_h. \qquad (4)$$

By verifying the conditions of Theorem 2, we reduce the problem to a finite-dimensional optimisation. Theorem 3 then provides a closed form solution to the optimisation problem in Eq. 4.

Theorem 3 (Closed Form Solution and Inference at Optimal Parameterization). Consider the optimisation problem as stated in Definition 1. Let $\mathbf{X}, \mathbf{X}^+, \mathbf{X}^- \in \mathbb{R}^{d \times n}$ denote the data corresponding to the anchors, positive and negative samples, respectively. Define the kernel matrices
$$\mathbf{K} = [k(\mathbf{x}_i, \mathbf{x}_j)]_{i,j}, \quad \mathbf{K}_- = [k(\mathbf{x}_i, \mathbf{x}_j^-)]_{i,j}, \quad \mathbf{K}_+ = [k(\mathbf{x}_i, \mathbf{x}_j^+)]_{i,j},$$
$$\mathbf{K}_{--} = [k(\mathbf{x}_i^-, \mathbf{x}_j^-)]_{i,j}, \quad \mathbf{K}_{++} = [k(\mathbf{x}_i^+, \mathbf{x}_j^+)]_{i,j}, \quad \mathbf{K}_{-+} = [k(\mathbf{x}_i^-, \mathbf{x}_j^+)]_{i,j}.$$
Furthermore, define the matrices $\mathbf{K}_3 = \mathbf{K}_- - \mathbf{K}_+$ as well as
$$\mathbf{K}_\Delta = \mathbf{K}_{--} + \mathbf{K}_{++} - \mathbf{K}_{-+} - \mathbf{K}_{-+}^T, \qquad \mathbf{K}_1 = \begin{bmatrix} \mathbf{K} & \mathbf{K}_3 \\ \mathbf{K}_3^T & \mathbf{K}_\Delta \end{bmatrix},$$
$$\mathbf{B} = \begin{bmatrix} \mathbf{K}_3 \\ \mathbf{K}_\Delta \end{bmatrix}\begin{bmatrix} \mathbf{K} & \mathbf{K}_- - \mathbf{K}_+ \end{bmatrix}, \qquad \mathbf{K}_2 = \tfrac{1}{2}\big(\mathbf{B} + \mathbf{B}^T\big).$$
Let $\mathbf{A}_2$ consist of the top $h$ eigenvectors of the matrix $\mathbf{K}_1^{-1/2}\mathbf{K}_2\mathbf{K}_1^{-1/2}$, which we assume to have $h$ non-negative eigenvalues. Let $\mathbf{A} = \mathbf{K}_1^{-1/2}\mathbf{A}_2$. Then, at optimal parameterization, the embedding of any $\mathbf{x}_* \in \mathbb{R}^d$ can be written in closed form as
$$\mathbf{z}_* = \mathbf{A}^T \begin{bmatrix} k(\mathbf{x}_*, \mathbf{X}) \\ k(\mathbf{x}_*, \mathbf{X}^-) - k(\mathbf{x}_*, \mathbf{X}^+) \end{bmatrix}.$$
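The closed form of Theorem 3 can be computed directly from kernel matrices. The following sketch does so on synthetic triplets (anchors, augmented positives, independent negatives); it is illustrative only, and the Gaussian kernel, the data generation and the small jitter added before taking $\mathbf{K}_1^{-1/2}$ are assumptions rather than part of the theorem.

```python
# Sketch of the closed-form contrastive kernel embedding (Theorem 3).
import numpy as np

def gaussian_kernel(A, B, bw=1.0):
    sq = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / (2 * bw ** 2))

rng = np.random.default_rng(1)
d, n, h = 4, 30, 2
X = rng.normal(size=(d, n))              # anchors
Xp = X + 0.05 * rng.normal(size=(d, n))  # positives: small augmentation
Xm = rng.normal(size=(d, n))             # negatives: independent samples

K   = gaussian_kernel(X, X)
Km  = gaussian_kernel(X, Xm)             # K_-
Kp  = gaussian_kernel(X, Xp)             # K_+
Kmm = gaussian_kernel(Xm, Xm)            # K_--
Kpp = gaussian_kernel(Xp, Xp)            # K_++
Kmp = gaussian_kernel(Xm, Xp)            # K_-+

K3 = Km - Kp
Kd = Kmm + Kpp - Kmp - Kmp.T
K1 = np.block([[K, K3], [K3.T, Kd]])
B  = np.vstack([K3, Kd]) @ np.hstack([K, K3])
K2 = 0.5 * (B + B.T)

# K_1^{-1/2} via an eigendecomposition (jitter keeps it well conditioned)
vals, vecs = np.linalg.eigh(K1 + 1e-6 * np.eye(2 * n))
K1_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

# top-h eigenvectors of K_1^{-1/2} K_2 K_1^{-1/2}, as in Theorem 3
evals, evecs = np.linalg.eigh(K1_inv_sqrt @ K2 @ K1_inv_sqrt)
A2 = evecs[:, np.argsort(evals)[::-1][:h]]
A  = K1_inv_sqrt @ A2

# embed a new point x_* using kernel evaluations only
x_star = rng.normal(size=(d, 1))
feat = np.concatenate([gaussian_kernel(X, x_star)[:, 0],
                       gaussian_kernel(Xm, x_star)[:, 0]
                       - gaussian_kernel(Xp, x_star)[:, 0]])
z_star = A.T @ feat                      # h-dimensional embedding
print(z_star)
```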
Spectral Contrastive Loss

Let us now consider kernel contrastive learning based on an alternative, commonly used spectral contrastive loss function (HaoChen et al. 2021).

Definition 2 (Spectral Kernel Learning). We learn a representation of the form $f_{\mathbf{W}}(\mathbf{x}) = \mathbf{W}^T\phi(\mathbf{x})$ (see the mapping in Eq. 3) by optimising the following objective function:
$$\sum_{i=1}^{n} \Big[-2\, f_{\mathbf{W}}(\mathbf{x}_i)^T f_{\mathbf{W}}(\mathbf{x}_i^+) + \big(f_{\mathbf{W}}(\mathbf{x}_i)^T f_{\mathbf{W}}(\mathbf{x}_i^-)\big)^2\Big] + \lambda \|\mathbf{W}\|^2.$$

For universal kernels, we can directly rewrite the loss function using the kernel trick and optimise it using simple gradient descent. This allows us to state the following result, which yields an optimisation directly in terms of the embeddings $\mathbf{z}_1, \dots, \mathbf{z}_n \in \mathbb{R}^h$.

Theorem 4 (Gradients and Inference at Optimal Parameterization). Consider the optimisation problem as stated in Definition 2, with $\mathbf{K}$ denoting the kernel matrix of a universal kernel. Then, we can equivalently minimise the objective w.r.t. the embeddings $\mathbf{Z} \in \mathbb{R}^{h \times 3n}$. Denoting by $\mathbf{z}_1, \dots, \mathbf{z}_{3n}$ the columns of $\mathbf{Z}$, the loss to be minimised becomes
$$\min_{\mathbf{Z} \in \mathbb{R}^{h \times 3n}} \; \sum_{i=1}^{n} \Big[-2\, \mathbf{z}_i^T\mathbf{z}_{i+n} + \big(\mathbf{z}_i^T\mathbf{z}_{i+2n}\big)^2\Big] + \lambda\, \mathrm{Tr}\big(\mathbf{Z}\mathbf{K}^{-1}\mathbf{Z}^T\big).$$
The gradient of the loss with respect to the column $\mathbf{z}_i$ is then given by
$$\begin{cases} -2\,\mathbf{z}_{i+n} + 2\big(\mathbf{z}_i^T\mathbf{z}_{i+2n}\big)\mathbf{z}_{i+2n}, & i \in [n], \\ -2\,\mathbf{z}_{i-n}, & i \in [n+1, 2n], \\ 2\big(\mathbf{z}_i^T\mathbf{z}_{i-2n}\big)\mathbf{z}_{i-2n}, & i \in [2n+1, 3n]. \end{cases}$$
For any new point $\mathbf{x}_* \in \mathbb{R}^d$, the trained model maps it to $\mathbf{z}_* = \mathbf{Z}\mathbf{K}^{-1}k(\mathbf{X}, \mathbf{x}_*)$.
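A minimal sketch of the optimisation suggested by Theorem 4: plain gradient descent on $\mathbf{Z} \in \mathbb{R}^{h \times 3n}$ using the per-column gradients stated above, together with the gradient of the regulariser $\lambda\,\mathrm{Tr}(\mathbf{Z}\mathbf{K}^{-1}\mathbf{Z}^T)$, which equals $2\lambda\,\mathbf{Z}\mathbf{K}^{-1}$ for symmetric $\mathbf{K}$. The kernel choice, step size, number of iterations and initialisation are assumptions of this sketch, not prescriptions from the paper.

```python
# Sketch of kernelised spectral contrastive learning (Theorem 4) by
# gradient descent on the embedding matrix Z.
import numpy as np

def gaussian_kernel(A, B, bw=1.0):
    sq = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / (2 * bw ** 2))

rng = np.random.default_rng(2)
d, n, h, lam, lr = 4, 20, 2, 0.1, 1e-3
X  = rng.normal(size=(d, n))
Xp = X + 0.05 * rng.normal(size=(d, n))
Xm = rng.normal(size=(d, n))

X_all = np.hstack([X, Xp, Xm])                   # columns: anchors, positives, negatives
K = gaussian_kernel(X_all, X_all)
K_inv = np.linalg.inv(K + 1e-6 * np.eye(3 * n))  # jitter for numerical stability

Z = 0.1 * rng.normal(size=(h, 3 * n))
for _ in range(300):
    G = np.zeros_like(Z)
    for i in range(n):
        zi, zp, zm = Z[:, i], Z[:, i + n], Z[:, i + 2 * n]
        G[:, i]         = -2 * zp + 2 * (zi @ zm) * zm   # anchor columns
        G[:, i + n]     = -2 * zi                        # positive columns
        G[:, i + 2 * n] = 2 * (zm @ zi) * zi             # negative columns
    G += 2 * lam * Z @ K_inv                             # gradient of the regulariser
    Z -= lr * G

# inference for a new point: z_* = Z K^{-1} k(X_all, x_*)
x_star = rng.normal(size=(d, 1))
z_star = Z @ K_inv @ gaussian_kernel(X_all, x_star)[:, 0]
print(z_star)
```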
Kernel Autoencoders

In general, AE architectures involve mapping the input to a lower dimensional latent space (encoding), and then back to a reconstruction (decoding). In this work we propose a Kernel AE, where both encoder and decoder correspond to kernel machines, resulting in the mapping
$$\mathbf{x} \in \mathbb{R}^d \;\xrightarrow{\;\phi_1(\cdot)\;}\; \mathbf{r}_1 \in \mathcal{H}_1 \;\xrightarrow{\;\mathbf{W}_1\;}\; \mathbf{z} \in \mathbb{R}^h \;\xrightarrow{\;\phi_2(\cdot)\;}\; \mathbf{r}_2 \in \mathcal{H}_2 \;\xrightarrow{\;\mathbf{W}_2\;}\; \hat{\mathbf{x}} \in \mathbb{R}^d,$$
where typically $h < d$. While several materializations of this high-level idea come to mind, we define the Kernel AE as follows.

Definition 3 (Kernel AE). Given data $\mathbf{X} \in \mathbb{R}^{d \times n}$ and a regularisation parameter $\lambda > 0$, define the loss functional
$$\mathcal{L}(\mathbf{W}_1, \mathbf{W}_2) = \big\|\mathbf{X} - \mathbf{W}_2^T\phi_2\big(\mathbf{W}_1^T\phi_1(\mathbf{X})\big)\big\|^2 + \lambda\big(\|\mathbf{W}_1\|^2 + \|\mathbf{W}_2\|^2\big).$$
The Kernel AE corresponds to the optimisation problem
$$\min_{\mathbf{W}_1, \mathbf{W}_2} \mathcal{L}(\mathbf{W}_1, \mathbf{W}_2) \quad \text{s.t.} \quad \|\mathbf{W}_1^T\phi_1(\mathbf{x}_i)\|^2 = 1 \;\; \forall i \in [n]. \qquad (5)$$

Let us justify our choice of architecture briefly. Firstly, we include norm regularisations on both the encoder as well as the decoder. This is motivated by the following observation: when the feature map $\phi_2$ maps to the RKHS of a universal kernel, any choice of $n$ distinct points $\mathbf{z}_1, \dots, \mathbf{z}_n$ in the bottleneck allows for perfect reconstruction. We therefore encourage the Kernel AE to learn smooth maps by penalizing the norm in the RKHS. In addition, we include the constraint $\|\mathbf{W}_1^T\phi_1(\mathbf{x}_i)\|^2 = 1$ for all $i \in [n]$ to prevent the Kernel AE from simply pushing the points $\mathbf{z}_1, \dots, \mathbf{z}_n$ to zero. This happens whenever rescaling $\mathbf{z}_i$ affects the norm of the encoder $\mathbf{W}_1$ differently from the decoder $\mathbf{W}_2$ (as is the case for commonly used kernels such as Gaussian and Laplacian). Nonetheless, we stress that other choices of regularisation are also possible, and we explore some of them in the Appendix. While a closed form solution of Definition 3 is difficult to obtain, we show that for universal kernels, the optimisation can again be rewritten in terms of kernel matrices.

Theorem 5 (Kernel formulation and inference at optimal parameterization). For any bottleneck $\mathbf{Z} \in \mathbb{R}^{h \times n}$, define the reconstruction
$$\mathbf{Q}(\mathbf{Z}) = \mathbf{X}\big(\mathbf{K}_Z + \lambda\mathbf{I}_n\big)^{-1}\mathbf{K}_Z,$$
where $\mathbf{K}_Z$ denotes the kernel matrix of the bottleneck embeddings $\mathbf{Z}$ (with respect to the decoder kernel) and $\mathbf{K}_X$ the kernel matrix of the data $\mathbf{X}$ (with respect to the encoder kernel). For universal kernels, learning the Kernel AE from Definition 3 is then equivalent to minimising the following expression over all possible embeddings $\mathbf{Z} \in \mathbb{R}^{h \times n}$:
$$\big\|\mathbf{Q}(\mathbf{Z}) - \mathbf{X}\big\|^2 + \lambda\,\mathrm{Tr}\big(\mathbf{Z}\mathbf{K}_X^{-1}\mathbf{Z}^T + \mathbf{Q}\mathbf{K}_Z^{-1}\mathbf{Q}^T\big) \quad \text{s.t.} \quad \|\mathbf{z}_i\|^2 = 1 \;\; \forall i \in [n].$$
Given $\mathbf{Z}$, any new $\mathbf{x}_* \in \mathbb{R}^d$ is embedded in the bottleneck as $\mathbf{z}_* = \mathbf{Z}\mathbf{K}_X^{-1}k(\mathbf{x}_*, \mathbf{X})$ and reconstructed as $\hat{\mathbf{x}}_* = \mathbf{X}(\mathbf{K}_Z + \lambda\mathbf{I}_n)^{-1}k(\mathbf{z}_*, \mathbf{Z})$.

Remark 1 (Connection to Kernel PCA). In light of the known connections between linear autoencoders and standard PCA, it is natural to wonder how the above Kernel AE relates to Kernel PCA (Schölkopf, Smola, and Müller 1998). The latter performs PCA in the RKHS $\mathcal{H}$, and is hence equivalent to minimising the reconstruction error over all orthogonal basis transformations $\mathbf{W}$ in $\mathcal{H}$:
$$\sum_{i=1}^{n} \big\|\phi(\mathbf{x}_i) - \mathbf{W}^T\mathbf{P}_h\mathbf{W}\phi(\mathbf{x}_i)\big\|^2, \qquad (6)$$
where $\mathbf{P}_h$ denotes the projection onto the first $h$ canonical basis vectors, and we assume that the features $\phi(\mathbf{x}_i)$ are centered. How does the Kernel AE $\mathbf{W}_2^T\phi_2(\mathbf{W}_1^T\phi_1(\mathbf{x}))$ relate to this if we replace the regularisation terms on $\mathbf{W}_1, \mathbf{W}_2$ by an orthogonality constraint on both? For simplicity, let us assume $h = 1$. The optimisation problem then essentially becomes
$$\sum_{i=1}^{n} \big\|\mathbf{x}_i - \mathbf{W}_2(\mathbf{W}_1(\mathbf{x}_i))\big\|^2, \qquad (7)$$
where $\mathbf{W}_1: \mathbb{R}^d \to \mathbb{R}$ is a function from the RKHS over $\mathbb{R}^d$ (with unit norm), and $\mathbf{W}_2: \mathbb{R} \to \mathbb{R}^d$ consists of $d$ orthonormal functions from the RKHS over $\mathbb{R}$. Clearly, Eq. 7 evaluates the reconstruction error in the sample space, much in contrast to the loss function in Eq. 6, which computes distances in the RKHS. Additionally, the map $\mathbf{W}^T$ learned in Eq. 6 from the bottleneck back to $\mathcal{H}$ is given by the basis transformation $\mathbf{W}$ in Kernel PCA, whereas it is fixed as the feature map $\phi$ over $\mathbb{R}^h$ in the AE setting. Kernel PCA can be viewed as an AE architecture that maps solely within $\mathcal{H}$, via $\phi(\mathbf{x}) \mapsto \mathbf{W}\phi(\mathbf{x}) \mapsto \mathbf{P}_h\mathbf{W}\phi(\mathbf{x}) \mapsto \mathbf{W}^T\mathbf{P}_h\mathbf{W}\phi(\mathbf{x})$. Notably, the results of Kernel PCA usually do not translate back to the sample space easily. Given a point $\mathbf{x} \in \mathbb{R}^d$, the projection of $\phi(\mathbf{x})$ onto the subspace spanned by Kernel PCA is not guaranteed to have a pre-image in $\mathbb{R}^d$, and a direct interpretation of the learned representations can therefore be difficult. In contrast, our method is quite interpretable, as it also provides an explicit formula for the reconstruction $\hat{\mathbf{x}}$ of unseen data points, not just their projection onto a subspace in an abstract Hilbert space. In particular, by choosing an appropriate kernel (a choice that could be influenced by the type of functions that are considered interpretable in the domain of application) and tuning the regularisation parameter $\lambda$, a practitioner may directly control the complexity of both the decoder as well as the encoder.

Remark 2 (De-noising Kernel AE). In this section, we considered the standard setting where the model learns the reconstruction of the input data. A common extension is the de-noising setting (e.g. (Buades, Coll, and Morel 2005b; Vincent et al. 2010)), which formally moves the model from a reconstruction to an SSL setting, where we replace the input with a noisy version of the data. The goal is now to learn a function that removes the noise and, in the process, learns latent representations. More formally, the mapping becomes
$$\tilde{\mathbf{x}} \in \mathbb{R}^d \;\xrightarrow{\;\phi_1(\cdot)\;}\; \mathbf{r}_1 \in \mathcal{H}_1 \;\xrightarrow{\;\mathbf{W}_1\;}\; \mathbf{z} \in \mathbb{R}^h \;\xrightarrow{\;\phi_2(\cdot)\;}\; \mathbf{r}_2 \in \mathcal{H}_2 \;\xrightarrow{\;\mathbf{W}_2\;}\; \hat{\mathbf{x}} \in \mathbb{R}^d,$$
where $\tilde{\mathbf{x}}$ is given by $\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\varepsilon}$, with $\boldsymbol{\varepsilon}$ being the noise term. A precise formulation is provided in the Appendix. We again note that the simple extension to this setting further distinguishes our approach from Kernel PCA, where such augmentations are not as easily possible.
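The kernel-matrix formulation of Theorem 5 can be tackled with any optimiser that handles the unit-norm constraint on the columns of $\mathbf{Z}$. The sketch below only illustrates the structure: it evaluates the objective purely through kernel matrices, uses a crude finite-difference projected gradient step (clearly not the authors' optimiser), and then applies the out-of-sample embedding and reconstruction formulas. The Gaussian kernels for encoder and decoder, the jitter terms and all hyperparameters are assumptions.

```python
# Sketch of the Kernel AE in its kernel-matrix form (Theorem 5).
import numpy as np

def gaussian_kernel(A, B, bw=1.0):
    sq = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / (2 * bw ** 2))

rng = np.random.default_rng(3)
d, n, h, lam, eps, lr = 6, 15, 2, 0.1, 1e-5, 1e-2
X = rng.normal(size=(d, n))
K_X_inv = np.linalg.inv(gaussian_kernel(X, X) + 1e-6 * np.eye(n))

def objective(Z):
    K_Z = gaussian_kernel(Z, Z) + 1e-6 * np.eye(n)
    Q = X @ np.linalg.solve(K_Z + lam * np.eye(n), K_Z)     # Q(Z)
    fit = np.sum((Q - X) ** 2)                              # ||Q(Z) - X||^2
    reg = lam * (np.trace(Z @ K_X_inv @ Z.T)
                 + np.trace(Q @ np.linalg.solve(K_Z, Q.T)))
    return fit + reg

def project(Z):                      # enforce ||z_i||^2 = 1 for every column
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)

Z = project(rng.normal(size=(h, n)))
for _ in range(30):                  # crude finite-difference projected descent
    base, G = objective(Z), np.zeros_like(Z)
    for idx in np.ndindex(*Z.shape):
        Zp = Z.copy(); Zp[idx] += eps
        G[idx] = (objective(Zp) - base) / eps
    Z = project(Z - lr * G)

# out-of-sample embedding and reconstruction (Theorem 5)
x_star = rng.normal(size=(d, 1))
z_star = Z @ K_X_inv @ gaussian_kernel(X, x_star)[:, 0]
K_Z = gaussian_kernel(Z, Z)
x_hat = X @ np.linalg.solve(K_Z + lam * np.eye(n),
                            gaussian_kernel(Z, z_star[:, None])[:, 0])
print(z_star, np.linalg.norm(x_hat - x_star[:, 0]))
```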
Generalisation Error Bounds

Kernel methods in the supervised setting are well established, and previous works offer rigorous theoretical analysis (Wahba 1990; Schölkopf and Smola 2002; Bartlett and Mendelson 2002). In this section, we show that the proposed kernel methods for contrastive SSL as well as for the reconstruction setting can be analysed in a similar fashion, and we provide generalisation error bounds for each of the proposed models.

Error Bound for Representation Learning Setting

In general we are interested in characterizing $\mathcal{L}(f) = \mathbb{E}_{\mathbf{X}}[\,l(f(\mathbf{X}))\,]$, where $f(\mathbf{X})$ is the representation function and $l(\cdot)$ is a loss function, which is either a contrastive loss or based on reconstruction. However, since we do not have access to the distribution of the data, we can only observe the empirical (training) error $\widehat{\mathcal{L}}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f(\mathbf{X}_i))$, where $n$ is the number of unlabelled datapoints. We can characterise the generalisation error as
$$\mathcal{L}(f) \;\leq\; \widehat{\mathcal{L}}(f) + \text{complexity term} + \text{slack term}.$$
The exact form of the complexity and slack term depends on the embeddings and the loss. In the following, we precisely characterise them for all of the proposed models.

Theorem 6 (Error Bound for Kernel Contrastive Loss). Let $\mathcal{F} = \{\mathbf{X} \mapsto \mathbf{W}^T\phi(\mathbf{X}) : \|\mathbf{W}\| \leq \omega\}$ be the class of embedding functions we consider in the contrastive setting. Define $\alpha = \sqrt{h\,\mathrm{Tr}[\mathbf{K}_{X^-}]} + \sqrt{h\,\mathrm{Tr}[\mathbf{K}_{X^+}]}$ as well as $\kappa = \max\{k(\mathbf{x}'_i, \mathbf{x}'_i) : \mathbf{x}'_i \in \{\mathbf{x}_i, \mathbf{x}^-_i, \mathbf{x}^+_i\}_{i=1}^{n}\}$. We then obtain the generalisation error for the proposed losses as follows.
1. Simple Contrastive Loss. Let the loss be given by Definition 1. Then, for any $\delta > 0$, the following statement holds with probability at least $1 - \delta$ for any $f \in \mathcal{F}$:
$$\mathcal{L}(f) \;\leq\; \widehat{\mathcal{L}}(f) + O\!\left(\frac{\omega^2\sqrt{\kappa}\,\alpha}{n}\right) + \text{slack term}.$$
2. Spectral Contrastive Loss. Let the loss be given by Definition 2. Then, for any $\delta > 0$, the following statement holds with probability at least $1 - \delta$ for any $f \in \mathcal{F}$:
$$\mathcal{L}(f) \;\leq\; \widehat{\mathcal{L}}(f) + O\!\left(\lambda\omega^2 + \frac{\omega^3\kappa^{3/2}\,\alpha}{n}\right) + \omega^4\kappa^2 \cdot \text{slack term}.$$

Similarly to the contrastive setting, we obtain a generalisation error bound for the Kernel AE as follows.

Theorem 7 (Error Bound for Kernel AE). Assume the optimisation is given by Definition 3 and define the class of encoders/decoders as
$$\mathcal{F} = \big\{\mathbf{X} \mapsto \mathbf{W}_2^T\phi_2\big(\mathbf{W}_1^T\phi_1(\mathbf{X})\big) \,:\, \|\mathbf{W}_1^T\phi_1(\mathbf{x}_i)\|^2 = 1 \;\forall i,\; \|\mathbf{W}_1\| \leq \omega_1,\; \|\mathbf{W}_2\| \leq \omega_2\big\}.$$
Let $r = \lambda(\omega_1^2 + \omega_2^2)$ and $\gamma = \max_{\mathbf{s} \in \mathbb{R}^h}\{k(\mathbf{s}, \mathbf{s}) : \|\mathbf{s}\|^2 = 1\}$. Then for any $\delta > 0$, the following statement holds with probability at least $1 - \delta$ for any $f \in \mathcal{F}$:
$$\mathcal{L}_{AE}(f) \;\leq\; \widehat{\mathcal{L}}_{AE}(f) + O\!\big(r + \omega_2^2 \cdot \text{complexity term}\big) + \text{slack term}.$$
The explicit complexity and slack terms, which depend on $\gamma$ and $\delta$ and decay with the number of unlabelled samples $n$, are given in the extended version of this paper.

The above bounds demonstrate that with an increasing number of unlabelled datapoints, the complexity term in the generalisation error bound decreases. Thus, the proposed models follow the general SSL paradigm of increasing the number of unlabelled data to improve the model performance.

Error Bound for Supervised Downstream Task

While the above bounds provide us with insights on the generalisation of the representation learning setting, in most cases we are also interested in the performance on downstream tasks. Conveniently, we can use the setup presented in (Saunshi et al. 2019) to bound the error of the supervised downstream task in terms of the unsupervised loss, providing a bound of the form
$$\mathcal{L}_{sup}(f) \;\leq\; c_1\, \mathcal{L}_{un}(f) + c_2 \cdot \text{complexity term},$$
where $c_1$ and $c_2$ are data dependent constants. We present the formal version of this statement in the supplementary material for all presented models.
This highlights that a better representation (as given by a smaller loss on the unsupervised task) also improves the performance of the supervised downstream task.

Experiments

In this section we illustrate the empirical performance of the kernel-based representation learning models introduced in this paper. As discussed in the introduction, there is a wide range of representation learning models that are often quite specific to the given task. We mainly consider classification in a setting with only partially labelled data at our disposal, as well as image de-noising using the Kernel AE. We state the main setup and results in the following; all further details (as well as experiments on additional datasets) are provided in the arXiv version, and a Python implementation is available at https://github.com/pascalesser/Representation-Learning-with-Kernels.

Classification on Embedded Data. In this section, we consider the following four datasets: concentric circles, cubes (Pedregosa et al. 2011), Iris (Fisher 1936) and Ionosphere (Sigillito et al. 1989). We fix the following data split: unlabelled = 50%, labelled = 5% and test = 45%, and consider $h = 2$ as the embedding dimension.

Classification using $k$-nearest neighbours ($k$-nn) with embeddings as features. We investigate classification as an example of a supervised downstream task. The setting is the following: we have access to the datapoints $\mathbf{X}_{unlab.}$ and $\mathbf{X}_{lab.}$, which we use to train the representation learning model without access to labels. Then, as the downstream classification model, we consider a $k$-nn model (with $k = 3$) learned on the embedding of $\mathbf{X}_{lab.}$, with corresponding labels $\mathbf{Y}_{lab.}$. We test on $\mathbf{X}_{test}, \mathbf{Y}_{test}$. As a benchmark, we compare to $k$-nn both on the original features as well as on the embeddings obtained by standard Kernel PCA.

Choice of kernels and their parameterization. For the proposed kernel methods as well as for Kernel PCA we consider three standard kernels (Gaussian, Laplacian and linear) as well as a 1-layer ReLU kernel (Bietti and Bach 2021). For the Gaussian and Laplacian kernels we choose the bandwidth using a grid search over 15 steps spaced logarithmically between 0.01 and 100. We perform leave-one-out validation on $\mathbf{X}_{lab.}$ to pick the bandwidth of the method applied to the test set. The classification experiments on the above listed datasets are presented in Figure 1. All results show the mean and standard deviation over five splits of each dataset.

Figure 1: From left to right: $k$-nn on the original features, followed by $k$-nn on embeddings obtained by Kernel PCA and by the proposed methods (with linear, ReLU, Gaussian and Laplacian kernels).

It is apparent throughout the experiments that the choice of kernel plays a significant role in the overall performance of the model. This dependency is not surprising, as the performance of a specific kernel directly links to the underlying data structure, and the choice of kernel is an essential part of the model design. This is in accordance with existing kernel methods, and an important future direction is to analyze which kernel characteristics are beneficial in a representation learning setting.
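For concreteness, the evaluation protocol described above can be sketched as follows on the concentric-circles data, with Kernel PCA standing in for the embedding step (the proposed kernel SSL and Kernel AE models plug in at the same place). The noise level, bandwidth and scikit-learn components are assumptions of this sketch, not the exact experimental configuration.

```python
# Sketch of the downstream evaluation: 50% unlabelled / 5% labelled / 45% test,
# embedding dimension h = 2, and a 3-nn classifier on the embeddings.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

idx = rng.permutation(len(X))
n_unlab, n_lab = int(0.5 * len(X)), int(0.05 * len(X))
unlab, lab, test = np.split(idx, [n_unlab, n_unlab + n_lab])

# fit the representation on the unlabelled (and labelled) inputs, without labels
embed = KernelPCA(n_components=2, kernel="rbf", gamma=1.0)
embed.fit(X[np.concatenate([unlab, lab])])

# downstream 3-nn classifier on the embedded labelled points vs. raw features
knn_embed = KNeighborsClassifier(n_neighbors=3).fit(embed.transform(X[lab]), y[lab])
knn_orig = KNeighborsClassifier(n_neighbors=3).fit(X[lab], y[lab])

print("k-nn on original features:", knn_orig.score(X[test], y[test]))
print("k-nn on embeddings:       ", knn_embed.score(embed.transform(X[test]), y[test]))
```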
Comparison of supervised and representation learning. As stated in the introduction (and supported theoretically in the previous section), the main motivation for representation learning is to take advantage of unlabelled data by learning embeddings that outperform the original features on downstream tasks. To evaluate this empirically for the kernel representation learning models analyzed in this paper, we compare $k$-nn on the original data to $k$-nn on the embeddings, as shown in Figure 1. We observe that for Circles, Cubes, Iris and Ionosphere there always exists an embedding that outperforms $k$-nn on the original data.

Comparing different embedding methods. Having observed that learning a representation before classification is beneficial, we now focus on the different embedding approaches. While the performed experiments do not reveal clear trends between the different methods, we do note that the proposed methods overall perform on par with or outperform Kernel PCA, underlining their relevance for kernel SSL.

Comparison to Neural Networks for Classification and De-noising

Representation learning has mainly been established in the context of deep neural networks. In this paper, we make a step towards decoupling the representation learning paradigm from the widely used deep learning models. Nonetheless, we can still compare the proposed kernel methods to neural networks. We construct the corresponding NN model by replacing the linear function in the reproducing kernel Hilbert space, $\mathbf{W}^T\phi(\mathbf{x})$, by a one-hidden-layer neural network $\mathbf{W}_2\sigma(\mathbf{W}_1\mathbf{x})$, where $\sigma(\cdot)$ is a non-linear activation function (and we still minimise a similar loss function).

Classification. We compare the performance of both representation learning approaches in Figure 2 for the datasets CIFAR-10 (Krizhevsky, Hinton et al. 2009) as well as a subset of the first two classes of MNIST (Deng 2012) (i.e., $n = 500$). We observe that the kernel methods perform on par with, or even outperform, the neural networks. This indicates that there is not one dominant approach, but one has to choose depending on the given task.

Figure 2: Comparison of kernel methods (linear, ReLU, Gaussian, Laplacian) and neural network models for classification on MNIST ($d = 784$) and CIFAR ($d = 3072$).

De-noising. As a second task, we consider de-noising using the (Kernel) AE. Data is sampled from the first five classes of MNIST, CIFAR-10 and SVHN (Netzer et al. 2011) with $n = 225$, and the noisy versions are generated by $\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\varepsilon}$ with $\varepsilon_i \sim \mathcal{N}(0, 0.1)$. We compare the performance of the kernel-based approaches with the neural network reconstructions in Figure 3 by plotting the mean squared error on the test set between the AE output and the clean data. The Kernel AE outperforms the neural network AE in all considered settings. Moreover, there is little variation among the different kernels. This indicates that, at least in the presented settings, the proposed kernel methods pose a viable alternative to traditional neural network based representation learning.

Figure 3: De-noising using the NN AE and the Kernel AE (linear, ReLU, Gaussian and Laplacian kernels) on MNIST, CIFAR and SVHN.

Formal connection between the kernel and neural network models. While it is known that regression with infinite-width networks is equivalent to kernel regression with the neural tangent kernel (NTK) (Jacot, Gabriel, and Hongler 2018; Arora et al. 2019), similar results are not known for SSL, and this brings up the question: Is kernel SSL equivalent to SSL with infinitely-wide neural networks?
It is possible to show that single-layer Kernel AE with NTK is the infinite-width limit of over-parameterized AE (Nguyen, Wong, and Hegde 2021; Radhakrishnan, Belkin, and Uhler 2020). We believe that the same equivalence also holds for kernel contrastive learning (Definition 1) with NTK, but leave this as an open problem. We do not know if Definition 3 with NTK is the limit for bottleneck deep learning AE since, as we note earlier, there is no unique formulation for Kernel AE. Discussions and Outlook In this paper, we show that new variants of representer theorem allows one to rephrase SSL optimisation problems or the learned representations in terms of kernel functions. The resulting kernel SSL models provide natural tools for theoretical analysis. We believe that presented theory and method provide both scope for precise analysis of SSL and can also be extended to other SSL principles, such as other pretext tasks or joint embedding methods (Saunshi et al. 2019; Bardes, Ponce, and Le Cun 2022; Grill et al. 2020; Chen and He 2020). We conclude with some additional discussions. Computational limitations and small dataset setting. Exactly computing kernel matrices is not scalable, however random feature (RF) approximations of kernel methods are well suited for large data (Rahimi and Recht 2007; Carratino, Rudi, and Rosasco 2018). While one may construct scalable kernel representation learning methods using RF, it should be noted that RF models are lazy-trained networks (Ghorbani et al. 2019). So fully-trained deep representation learning models may be more suitable in such scenarios. However representation learning is relevant in all problems with availability of partially labelled data. This does not only apply to the big data regime where deep learning approaches are predominantly used, but also to small data settings where kernel methods are traditionally an important tool (Fern andez Delgado et al. 2014). The practical significance of developing kernel approaches is to broaden the scope of the representation learning paradigm beyond the deep learning community. Kernel SSL vs. non-parametric data embedding. Several non-parametric generalisations of PCA, including functional PCA, kernel PCA, principle curves etc., have been studied over decades and could be compared to Kernel AEs. However, unlike kernel SSL, embedding methods are typically not inductive. As shown previously, the inductive representation learning by Kernel AE and contrastive learning make them suitable for downstream supervised tasks. Kernel SSL vs. SSL with infinite-width neural networks. While it is known that regression with infinite-width networks is equivalent to kernel regression with neural tangent kernel (NTK) (Jacot, Gabriel, and Hongler 2018), similar results are not known for SSL. We believe that a study of the learning dynamics of neural network based SSL would show their equivalence with our kernel contrastive models with NTK. However, it is unclear to us whether a similar result can exist for kernel AE, as NTK approximations typically do not hold in the presence of bottleneck layers (Liu, Zhu, and Belkin 2020). The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Acknowledgments This work has been supported by the German Research Foundation (Priority Program SPP 2298, project GH 257/2-1, and Research Grant GH 257/4-1) and the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. References Aronszajn, N. 1950. 
Theory of reproducing kernels. Transactions of the American mathematical society. Arora, S.; Du, S. S.; Hu, W.; Li, Z.; Salakhutdinov, R.; and Wang, R. 2019. On Exact Computation with an Infinitely Wide Neural Net. In Advances in neural information processing systems. Bardes, A.; Ponce, J.; and Le Cun, Y. 2022. VICReg: Variance Invariance-Covariance Regularization for Self-Supervised Learning. In International Conference on Learning Representations. Bartlett, P. L.; and Mendelson, S. 2002. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research. Bay, H.; Tuytelaars, T.; and Van Gool, L. 2006. SURF: Speeded Up Robust Features. In Computer Vision ECCV 2006. Springer Berlin Heidelberg. Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence. Bietti, A.; and Bach, F. 2021. Deep Equals Shallow for Re LU Networks in Kernel Regimes. In International Conference on Learning Representations. Bromley, J.; Guyon, I.; Le Cun, Y.; S ackinger, E.; and Shah, R. 1993. Signature verification using a siamese time delay neural network. Advances in neural information processing systems. Buades, A.; Coll, B.; and Morel, J.-M. 2005a. A review of image denoising algorithms, with a new one. Multiscale modeling & simulation. Buades, A.; Coll, B.; and Morel, J.-M. 2005b. A review of image denoising algorithms, with a new one. Multiscale Modeling and Simulation: A SIAM Interdisciplinary Journal. Carratino, L.; Rudi, A.; and Rosasco, L. 2018. Learning with SGD and Random Features. In Advances in Neural Information Processing Systems 31. Chahal, H.; and Toner, H. 2021. Small Data Are Also Crucial for Machine Learning. Scientific American. Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. Chen, X.; and He, K. 2020. Exploring Simple Siamese Representation Learning. ar Xiv preprint ar Xiv:2011.10566. Cortes, C.; and Vapnik, V. 1995. Support-vector networks. Machine learning. Deng, L. 2012. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine. Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,. Fern andez-Delgado, M.; Cernadas, E.; Barro, S.; and Amorim, D. 2014. Do we need hundreds of classifiers to solve real world classification problems? The journal of machine learning research. Fisher, R. A. 1936. The use of multiple measurements in taxonomic problems. Annals of eugenics. Ghorbani, B.; Mei, S.; Misiakiewicz, T.; and Montanari, A. 2019. Limitations of Lazy Training of Two-layers Neural Network. In Advances in Neural Information Processing Systems 32. Grill, J.-B.; Strub, F.; Altch e, F.; Tallec, C.; Richemond, P. H.; Buchatskaya, E.; Doersch, C.; Pires, B. A.; Guo, Z. D.; Azar, M. G.; Piot, B.; Kavukcuoglu, K.; Munos, R.; and Valko, M. 2020. Bootstrap Your Own Latent a New Approach to Self Supervised Learning. In Advances in neural information processing systems. Gu, S.; Kelly, B.; and Xiu, D. 2020. Empirical asset pricing via machine learning. The Review of Financial Studies. Hainmueller, J.; and Hazlett, C. 2014. 
Kernel regularized least squares: Reducing misspecification bias with a flexible and interpretable machine learning approach. Political Analysis. Hao Chen, J. Z.; Wei, C.; Gaidon, A.; and Ma, T. 2021. Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. In Advances in neural information processing systems. Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in neural information processing systems. Jing, L.; and Tian, Y. 2019. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Johnson, D. D.; Hanchi, A. E.; and Maddison, C. J. 2022. Contrastive Learning Can Find An Optimal Basis For Approximately View-Invariant Functions. ar Xiv preprint ar Xiv:2210.01883. Kanazawa, A.; Jacobs, D. W.; and Chandraker, M. 2016. Warpnet: Weakly supervised matching for single-view reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition. Kiani, B. T.; Balestriero, R.; Chen, Y.; Lloyd, S.; and Le Cun, Y. 2022. Joint embedding self-supervised learning in the kernel regime. ar Xiv preprint ar Xiv:2209.14884. Kimeldorf, G.; and Wahba, G. 1971. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications. Kramer, M. A. 1991. Nonlinear principal component analysis using autoassociative neural networks. AICh E journal. Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Kunin, D.; Bloom, J.; Goeva, A.; and Seed, C. 2019. Loss Landscapes of Regularized Linear Autoencoders. In International Conference on Machine Learning. Liu, C.; Zhu, L.; and Belkin, M. 2020. On the linearity of large non-linear models: when and why the tangent kernel is constant. In Advances in Neural Information Processing Systems 33. Lowe, D. G. 1999. Object recognition from local scaleinvariant features. In Proceedings of the seventh IEEE international conference on computer vision. Mika, S.; Ratsch, G.; Weston, J.; Scholkopf, B.; and Mullers, K.-R. 1999. Fisher discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop. Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. Nguyen, T. V.; Wong, R. K. W.; and Hegde, C. 2021. Benefits of Jointly Training Autoencoders: An Improved Neural Tangent Kernel Analysis. IEEE Transactions on Information Theory. Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. Ponte, P.; and Melko, R. G. 2017. Kernel methods for interpretable machine learning of order parameters. Physical Review B. 
Pretorius, A.; Kroon, S.; and Kamper, H. 2018. Learning Dynamics of Linear Denoising Autoencoders. In International Conference on Machine Learning. Radhakrishnan, A.; Belkin, M.; and Uhler, C. 2020. Overparameterized neural networks implement associative memory. Proceedings of the National Academy of Sciences of the United States of America. Rahimi, A.; and Recht, B. 2007. Random Features for Large Scale Kernel Machines. In Advances in Neural Information Processing Systems 20. Refinetti, M.; and Goldt, S. 2022. The dynamics of representation learning in shallow, non-linear autoencoders. In International Conference on Machine Learning. Roßbach, P. 2018. Neural networks vs. random forests does it always have to be deep learning. Saunshi, N.; Plevrakis, O.; Arora, S.; Khodak, M.; and Khandeparkar, H. 2019. A Theoretical Analysis of Contrastive Unsupervised Representation Learning. In Proceedings of the 36th International Conference on Machine Learning. Sch olkopf, B.; Herbrich, R.; and Smola, A. J. 2001. A Generalized Representer Theorem. In Annual Conference on Computational Learning Theory. Sch olkopf, B.; and Smola, A. J. 2002. Learning with Kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning series. MIT Press. Sch olkopf, B.; Smola, A.; and M uller, K.-R. 1998. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation. Shah, A.; Sra, S.; Chellappa, R.; and Cherian, A. 2022. Max Margin Contrastive Learning. In Proceedings of the AAAI Conference on Artificial Intelligence. Shwartz-Ziv, R.; and Armon, A. 2022. Tabular data: Deep learning is not all you need. Inf. Fusion. Sigillito., V.; S., W.; L., H.; ; and K., B. 1989. Ionosphere. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5W01B. Todman, L. C.; Bush, A.; and Hood, A. S. 2023. Small Data for big insights in ecology. Trends in Ecology & Evolution. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.-A.; and Bottou, L. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research. Wahba, G. 1990. Spline Models for Observational Data. SIAM. Xu, P.; Ji, X.; Li, M.; and Lu, W. 2023. Small data machine learning in materials science. npj Computational Materials. Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In international conference on machine learning, 3861 3870. PMLR. Zhai, R.; Liu, B.; Risteski, A.; Kolter, Z.; and Ravikumar, P. 2023. Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation. ar Xiv preprint ar Xiv:2306.00788. Zhang, C.; Zhang, C.; Song, J.; Yi, J. S. K.; Zhang, K.; and Kweon, I. S. 2022. A survey on masked autoencoder for selfsupervised learning in vision and beyond. ar Xiv preprint ar Xiv:2208.00173. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)