Journal of Machine Learning Research 26 (2025) 1-48. Submitted 11/24; Revised 5/25; Published 6/25.

On Linear Separation Capacity of Self-Supervised Representation Learning

Shulei Wang (shuleiw@illinois.edu)
Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA

Editor: Kilian Weinberger

©2025 Shulei Wang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-2032.html.

Abstract

Recent advances in self-supervised learning have highlighted the efficacy of data augmentation in learning data representations from unlabeled data. Training a linear model atop these learned representations can yield an adept classifier. Despite the remarkable empirical performance, the underlying mechanisms that enable data augmentation to unravel nonlinear data structures into linearly separable representations remain elusive. This paper seeks to bridge this gap by investigating under what conditions learned representations can linearly separate manifolds when data is drawn from a multi-manifold model. Our investigation reveals that data augmentation offers additional information beyond the observed data and can thus improve the information-theoretic optimal rate of linear separation capacity. In particular, we show that self-supervised learning can linearly separate manifolds with a smaller distance than unsupervised learning, underscoring the additional benefits of data augmentation. Our theoretical analysis further underscores that the performance of downstream linear classifiers primarily hinges on the linear separability of the data representations rather than the size of the labeled data set, reaffirming the viability of constructing efficient classifiers with limited labeled data amid an expansive unlabeled data set.

Keywords: self-supervised learning, data augmentation, linear separation capacity

1. Introduction

1.1 Self-Supervised Representation Learning

The recent advance of large-scale machine learning models demonstrates their superior capacity and performance in various fields (Vaswani et al., 2017), but also demands millions of labeled samples for training, which can be inaccessible in some applications (Dosovitskiy et al., 2021). Self-supervised pre-training and transfer learning have been introduced to address the challenge of scarce labeled data in natural language processing and computer vision. Unlike the classical supervised learning framework, the current self-supervised learning framework usually involves two steps: self-supervised pre-training and fine-tuning on the downstream task. In the pre-training stage, self-supervised learning first learns data representations/features from a large unlabeled data set and pseudo-labels automatically generated from the unlabeled data (Hjelm et al., 2018; Bachman et al., 2019; Chen et al., 2020b; He et al., 2020; Chen and He, 2021; Zbontar et al., 2021; He et al., 2022). The pre-trained representations are then transferred to train a full model via fine-tuning on a labeled data set in the downstream task (Kumar et al., 2022). A comprehensive review is available in Balestriero et al. (2023). Recent progress shows that self-supervised pre-training can reduce the need for external supervision and extract information from a large amount of unlabeled data more efficiently than classical unsupervised learning methods (LeCun, 2022).
Due to these beneficial properties, self-supervised learning has been widely used in the pre-training of large foundational models (Zhou et al., 2023). For instance, self-supervised pre-training has played a crucial role in the success of recent large language models, such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2018, 2019).

Generating pseudo labels from unlabeled data is a pivotal component of self-supervised representation learning and a significant departure from traditional supervised and unsupervised learning methods. Various approaches to pseudo-label generation are employed in different applications of self-supervised pre-training, including data masking (Brown et al., 2020; He et al., 2022), data augmentation (Chen et al., 2020b; Zbontar et al., 2021), and multi-modal data fusion (Akbari et al., 2021). These generated pseudo labels serve as supervisory signals in self-supervised representation learning, with the trained representations subsequently fine-tuned for diverse downstream tasks. Notably, empirical observations suggest that linear probing of learned representations, achieved by training a linear model on top of frozen representations, can surpass training all parameters of the same model from scratch (Chen et al., 2020b,c; Kumar et al., 2022). This implies that the representations acquired through self-supervised pre-training exhibit linear separability, enhancing downstream task efficiency via a linear model. The improved linear separability brought by self-supervised pre-training makes it possible to construct an efficient classifier in scenarios with limited labeled data (Tan et al., 2018; Wang et al., 2020). Given these intriguing empirical findings, questions arise about why training with pseudo labels yields linearly separable representations and what underlying structural information self-supervised representation learning captures through pseudo labels. This paper endeavors to address these questions by theoretically investigating the linear separation capacity of self-supervised representation learning when pseudo labels are generated using data augmentation.

Motivated by recent empirical successes, numerous existing studies have embarked on exploring the theoretical foundations of self-supervised pre-training's advantages in downstream tasks (Tsai et al., 2020; Tian et al., 2020b; Wen and Li, 2022; Saunshi et al., 2022; Cabannes et al., 2023). Specifically, a prevalent approach to investigating self-supervised learning is grounded in the context of linear representations or linear prediction functions (Tosh et al., 2021; Lee et al., 2021; Wen and Li, 2021; Wang, 2023; Kumar et al., 2022). While the linear framework facilitates precise characterization and valuable insights in this specific scenario, it may fall short of comprehensively elucidating the success of self-supervised learning, given the inherently nonlinear nature of data distributions. In addition to linear structures, existing endeavors have embraced more adaptable models, including conditional distributions of discrete latent variables (Arora et al., 2019), distributions on graphs (Wei et al., 2020; Balestriero and LeCun, 2022), nonlinear representations via kernels (Johnson et al., 2022), and manifolds (HaoChen et al., 2021; Wang, 2025). Despite the remarkable progress, the mechanisms behind how and why pseudo labels facilitate the transformation of nonlinear structures into linearly separable representations remain enigmatic.
To bridge this gap, our study delves into self-supervised learning within a multi-manifold model, demonstrating the pivotal role of pseudo labels in learning linearly separable representations, particularly when manifolds are close to each other.

1.2 Linearly Separable Representation and Unsupervised Learning

In this and the next subsection, we demonstrate the key idea and results using a two-manifold case, with a more general setting and formal results provided in later sections. Specifically, we consider the scenario where the observed data is drawn from a union of two smooth and compact d-dimensional manifolds,

M = M_+ ∪ M_-,

where M_+ and M_- correspond to distinct classes and M ⊂ R^D. The objective is to develop a classifier that distinguishes data originating from these two manifolds. In representation learning, our aim is to construct a mapping Θ : M → R^S from the observed data, enabling us to build a more effective classifier using Θ(x) as opposed to the conventional data representation x. Recent advancements in self-supervised representation learning have indicated that a simple linear classifier can achieve perfect accuracy in downstream tasks when a linearly separable representation Θ(x) can be learned. Formally, the data representation Θ(x) is considered linearly separable if there exists a weight vector w ∈ R^S satisfying

sup_{x ∈ M_+} w^T Θ(x) < inf_{x ∈ M_-} w^T Θ(x).

Given a data set (X_1, ..., X_n) randomly drawn from M, it is natural to inquire about the conditions under which consistent learning of linearly separable representations is possible, as well as the potential benefits of using pseudo labels to enhance the learning process.

To address these questions, we first analyze the performance of the graph Laplacian-based method, a classical unsupervised learning approach widely used in spectral clustering methods (Ng et al., 2001; Belkin and Niyogi, 2003; Coifman and Lafon, 2006; Von Luxburg, 2007; Arias-Castro et al., 2017; García Trillos et al., 2021; Chen and Yang, 2021). Our investigation reveals that the resulting representation converges to a linearly separable one when the manifolds are well separated, as indicated by the condition

δ(M) ≳ (log n / n)^{1/d},   (1)

where δ(M) represents the smallest Euclidean distance between M_+ and M_-, given by

δ(M) := inf_{x ∈ M_+, y ∈ M_-} ‖x − y‖,

and ‖·‖ is the Euclidean distance in R^D. When condition (1) is not met, our analysis demonstrates that there exists an instance in which the graph Laplacian-based method treats M as a single manifold rather than a union of two separated manifolds. Furthermore, we establish an information-theoretic lower bound, revealing that any method fundamentally struggles to differentiate a single manifold from a union of multiple manifolds with δ(M) of order c (log n / n)^{1/d}, where c is a sufficiently small constant. These findings indicate that condition (1) serves as a necessary and sufficient requirement for learning a linearly separable representation when only an unlabeled data set is observed. Importantly, the graph Laplacian-based method achieves the rate-optimal linear separation capacity in the two-manifold setting. This leads us to the question of whether there are strategies to enhance the linear separation capacity stated in (1) by leveraging additional information.
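To make the graph Laplacian-based construction above concrete, the following is a minimal sketch assuming only a data matrix of n points in R^D; the radius r, embedding dimension S, and toy two-circle data are illustrative choices rather than the paper's experimental setup.

```python
import numpy as np

def laplacian_embedding(X, r, S):
    """Spectral embedding from an r-neighborhood graph (unnormalized Laplacian).

    X : (n, D) array of observations; r : connectivity radius; S : embedding dimension.
    Returns the n x S matrix whose columns are the eigenvectors of L with the S smallest
    eigenvalues, used as the learned representation Theta(X_i).
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = (dists <= r).astype(float)
    np.fill_diagonal(W, 0.0)                 # no self-loops
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvecs[:, :S]

# Example: two well-separated 1-dimensional manifolds (circles) in the plane.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=400)
X = np.vstack([np.c_[np.cos(t[:200]), np.sin(t[:200])],
               np.c_[np.cos(t[200:]) + 4.0, np.sin(t[200:])]])
U = laplacian_embedding(X, r=0.4, S=2)
# With delta(M) large relative to r, the leading coordinates are close to indicator
# functions of the two circles, so a linear rule on U separates the two classes.
```

When the two manifolds are far apart relative to r, the bottom eigenvectors behave like indicators of the connected components, which is precisely the separability mechanism analyzed in Section 3; when the manifolds are closer than r, this breaks down, motivating the data-augmentation approach discussed next.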
1.3 Data Augmentation Improves Linear Separation Capacity

The graph Laplacian-based method is efficient at separating two manifolds but often overlooks a rich source of invariant structure present in the observed data across many applications. For instance, an image can represent the same context and be considered equivalent even after applying data augmentation techniques such as flipping, rotation, and cropping (Shorten and Khoshgoftaar, 2019; Chen et al., 2020a). To capture such invariant structure in the data, we extend our assumption so that each of M_+ and M_- is an isometric embedding of a product manifold:

M_+ = T_+(N_{s,+} × N_{v,+}) and M_- = T_-(N_{s,-} × N_{v,-}),

where T_+ and T_- are isometric diffeomorphisms. Here, N_{s,+} and N_{s,-} represent the d_s-dimensional manifolds capturing the data-augmentation-invariant structure, while N_{v,+} and N_{v,-} correspond to the d_v-dimensional manifolds related to the irrelevant structure induced by data augmentation. A similar product manifold model is also employed in Lederman and Talmon (2018); Wang (2025). A toy example of a single product manifold is illustrated in Figure 1. This model offers a straightforward approach to modeling the data augmentation process. For instance, when X_i ∈ M_+, we can express the data point as X_i = T_+(φ_i, ψ_i), where φ_i ∈ N_{s,+} and ψ_i ∈ N_{v,+}, and describe its augmented data as X'_i = T_+(φ_i, ψ'_i), where ψ'_i is randomly drawn from N_{v,+}. Similarly, when X_i ∈ M_-, ψ'_i is drawn randomly from N_{v,-}.

Figure 1: A toy example of a single product manifold: the circle with the major radius captures the data-augmentation-invariant structure, and the circle with the minor radius captures the irrelevant structure due to data augmentation. Data augmentation can help randomly draw samples from each smaller circle.

Unlike classical unsupervised techniques, self-supervised learning explores the aforementioned invariant structure of the observed data when pseudo labels are generated by data augmentation. Specifically, the representations learned by most existing data-augmentation-based self-supervised learning methods aim to preserve similarity between augmented data, i.e., Θ(X_i) ≈ Θ(X'_i). Can we better learn a linearly separable representation by exploiting the similarity of augmented data? To address this question, this paper studies the performance of Augmentation Invariant Manifold Learning (AIML), introduced in Wang (2025), as it simultaneously leverages the manifold's low dimensionality and the structure induced by augmented data. Capturing the invariant structure in data does not appear directly related to linear separation capacity, because it is often believed that data augmentation helps to reduce redundant information (Chen et al., 2020a; Wang, 2023). However, our investigation suggests that augmentation invariant manifold learning can, surprisingly, yield a linearly separable representation while requiring a smaller distance between manifolds than the graph Laplacian-based method, i.e.,

δ(M) ≳ (log n / n)^{1/d_s}.   (2)

In other words, the linear separation capacity of augmentation invariant manifold learning relies solely on the dimension of the data-augmentation-invariant structure. These results suggest that selecting a data augmentation method that maintains a concise invariant structure (i.e., smaller d_s) can better facilitate the learning of a linearly separable representation. When data augmentation can be applied, our analysis further demonstrates that no method can consistently discern whether the observed data is drawn from a single manifold or a union of multiple manifolds with δ(M) of order c (log n / n)^{1/d_s} for some sufficiently small constant c. Combined with the upper bound in (2), we observe that augmentation invariant manifold learning can achieve the rate-optimal linear separation capacity when exploring the invariant structure within the observed data through data augmentation.
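The following is a minimal sketch of the product-manifold picture in Figure 1, assuming a torus in R^3 whose major angle φ plays the role of the augmentation-invariant coordinate and whose minor angle ψ is the nuisance coordinate; the radii and the `augment` helper are hypothetical choices used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def torus_point(phi, psi, R=2.0, a=0.5):
    """Embed (phi, psi) on a torus in R^3: phi is the major (invariant) angle,
    psi is the minor (nuisance) angle, and R / a are the major / minor radii."""
    return np.array([(R + a * np.cos(psi)) * np.cos(phi),
                     (R + a * np.cos(psi)) * np.sin(phi),
                     a * np.sin(psi)])

def augment(phi):
    """Data augmentation: keep phi fixed and redraw the nuisance angle psi uniformly,
    i.e., sample a fresh point from the fiber over phi."""
    return torus_point(phi, rng.uniform(0, 2 * np.pi))

phi = rng.uniform(0, 2 * np.pi)
X = torus_point(phi, rng.uniform(0, 2 * np.pi))   # observed sample X = T(phi, psi)
X_aug = augment(phi)                              # augmented copy X' = T(phi, psi')
```

Here augmentation keeps φ fixed and redraws ψ uniformly, which is exactly the fiber-resampling model described above.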
1.4 Impact on Downstream Classifier

If we compare unsupervised and self-supervised learning, data augmentation can improve the optimal rate of linear separation capacity from

(log n / n)^{1/d} to (log n / n)^{1/d_s}.

In other words, through data augmentation, self-supervised learning can achieve separation between manifolds with a finer resolution, detecting differences between manifolds in greater detail compared to unsupervised learning. However, how does this improved linear separation capacity impact downstream analysis?

To investigate the impact on downstream analysis, we consider training a logistic regression, one of the most widely used linear classifiers, on top of learned representations to predict whether data originates from M_+ or M_-. The logistic regression is trained using gradient descent on a limited labeled data set. Our investigation reveals that the misclassification rate of the resulting logistic classifier primarily hinges on how effectively the learned data representation linearly separates the data, rather than on the size of the labeled data set in downstream tasks. In other words, training on a modest number of labeled samples can lead to a precise linear classifier, as long as the nonlinear structure is disentangled by the data representation learned from unlabeled samples. Consequently, with the aid of data augmentation, the data representation learned through self-supervised learning can result in a superior linear classifier for downstream analysis compared to unsupervised learning. These theoretical findings are also validated through numerical examples in Section 6. Specifically, the numerical experiments suggest that 1) data augmentation can improve the linear separability of the learned representation and 2) the performance of the downstream linear model relies heavily on the linear separability of the representation rather than on the number of labeled data points.

2. Model and Linearly Separable Representation

In this section, we generalize the two-manifold setting in the Introduction to a multi-manifold model.

2.1 A Multi-Manifold Model for Data Augmentation

The conventional representation of data is often in a high-dimensional space, but empirical observations suggest that the data distribution in various applications, such as natural image data (Pope et al., 2021), can be characterized by latent low-dimensional variables. In particular, the manifold assumption is commonly used to model the underlying low-dimensional structure in high-dimensional space. Alongside the low-dimensional structure, data can exhibit grouping into small clusters (Martinez and Saar, 2001; Basri and Jacobs, 2003; Fu et al., 2005; Vidal and Ma, 2006). Motivated by these insights, we consider data drawn from a union of multiple manifolds

M = M_1 ∪ M_2 ∪ ... ∪ M_K,

where each M_k ⊂ R^D is a smooth and compact manifold of dimension d_k representing a distinct subclass of data. Each manifold M_k may correspond to a subset of a category or class, such as images of blue shirts, blue dresses, red shirts, and red dresses. We assume that the manifolds M_1, ..., M_K are well separated, meaning that

δ(M) := min_{1 ≤ k_1 < k_2 ≤ K} inf_{x ∈ M_{k_1}, y ∈ M_{k_2}} ‖x − y‖ > 0.

Here, δ(M) represents the smallest Euclidean distance between any pair of manifolds.
Since data augmentation transformations are believed not to change the subclass of data (Chen et al., 2020a), we adopt a similar approach as in Wang (2025) by assuming each M_k is an isometric embedding of a product manifold (Figure 1),

M_k = T_k(N_{s,k} × N_{v,k}),

where T_k is an isometric diffeomorphism, and N_{s,k}, N_{v,k} are two manifolds with dimensions d_{s,k} and d_{v,k} such that d_k = d_{s,k} + d_{v,k}. Here, N_{s,k} and N_{v,k} represent, respectively, M_k's invariant structure of interest and the irrelevant nuisance structure resulting from data augmentation. For simplicity, we assume d_{s,1} = ... = d_{s,K} = d_s, d_{v,1} = ... = d_{v,K} = d_v, and d_1 = ... = d_K = d.

Given the above multi-manifold assumption, we need to define how an observed data point X is sampled from M and how the augmented data X' is sampled given a data point X. We assume that the observed data point X is drawn from a mixture distribution π defined on M:

π(x) = Σ_{k=1}^K w_k π_k(x), where w_k > 0 and Σ_{k=1}^K w_k = 1.

Here, π(x) and π_k(x) are probability density functions defined on M (see the formal definition in the Appendix), with π_k(x) having support only on M_k. In other words, X is drawn from M_k with probability w_k. Next, we discuss how data are sampled from π_k on each M_k. The product manifold assumption suggests that all points on M_k can be represented as X = T_k(φ, ψ), where φ ∈ N_{s,k} and ψ ∈ N_{v,k}. Thus, it suffices to consider the probability distribution of φ and ψ on N_{s,k} and N_{v,k}. We sample independent random variables φ ∈ N_{s,k} and ψ ∈ N_{v,k} as follows:

φ ~ π^s_k(φ) and ψ ~ π^v_k(ψ).

Then, we set X = T_k(φ, ψ). For simplicity, we assume that π^v_k is a uniform distribution on N_{v,k}. The product manifold assumption also provides a straightforward way to model the sampling process of the augmented data X' given a data point X. Given X = T_k(φ, ψ) ∈ M_k, we assume that the augmented data X' = T_k(φ, ψ'), where ψ' is another independent random variable drawn from π^v_k on N_{v,k}. In other words, given X = T_k(φ, ψ) ∈ M_k, we randomly sample the augmented data point from the fiber

M(φ) = {x ∈ M_k : x = T_k(φ, ψ), ψ ∈ N_{v,k}}.
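As a concrete, purely illustrative instance of this sampling model, the sketch below extends the torus picture of Figure 1 to a K-component mixture: each manifold is a torus translated along the first coordinate, an observation is drawn by sampling (k, φ, ψ), and the augmented copy redraws ψ from the fiber M(φ). The weights, radii, and offsets are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def T(k, phi, psi, R=2.0, a=0.5, offset=6.0):
    """Embedding T_k of the product manifold N_{s,k} x N_{v,k}: a torus shifted by k * offset."""
    return np.array([(R + a * np.cos(psi)) * np.cos(phi) + k * offset,
                     (R + a * np.cos(psi)) * np.sin(phi),
                     a * np.sin(psi)])

def sample_with_augmentation(K=3):
    """Draw (X, X_aug, k): pick manifold M_k with probability w_k, sample the invariant
    coordinate phi ~ pi^s_k and the nuisance coordinate psi ~ pi^v_k (uniform), and
    augment by redrawing psi, i.e., by sampling from the fiber M(phi)."""
    w = np.full(K, 1.0 / K)                          # mixture weights w_k
    k = rng.choice(K, p=w)
    phi, psi, psi_new = rng.uniform(0, 2 * np.pi, size=3)
    return T(k, phi, psi), T(k, phi, psi_new), k

X, X_aug, k = sample_with_augmentation()
```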
2.2 Linearly Separable Representation

Representation learning aims to learn a mapping Θ : M → R^S that enhances the performance of downstream analyses, including clustering, classification, and regression. In this paper, we focus solely on classification as the downstream task. In a classification task, our goal is to train a classifier H : M → {−1, 1} that predicts the true label H⋆ : M → {−1, 1}, where −1 and 1 denote the two possible binary label values. Among the simplest and most commonly used classifiers is the linear classifier

H(x) = 1 if w^T x > ζ and H(x) = −1 if w^T x ≤ ζ,

where w ∈ R^D represents the weight vector and ζ ∈ R is a scalar threshold. Various algorithms have been proposed in the literature to train a linear classifier, such as logistic regression, linear discriminant analysis, and support vector machines (Hastie et al., 2009). A linear classifier can achieve 100% accuracy when the sets M_+(H⋆) = {x ∈ M : H⋆(x) = 1} and M_-(H⋆) = {x ∈ M : H⋆(x) = −1} are linearly separable in terms of their x values. However, in practice, M_+(H⋆) and M_-(H⋆) are often not linearly separable due to their complex shapes in high-dimensional space. Through representation learning, our aim is to enhance the classifier trained on Θ(x), making it more accurate than the classifier trained using x. Additionally, we aspire for the data representation Θ(x) to be versatile, applicable across multiple classification labels. Specifically, we address a collection of classification labels within the multi-manifold model:

H = {H⋆ : H⋆(x) ∈ {−1, 1}, H⋆(x) = H⋆(y) for all x, y ∈ M_k, 1 ≤ k ≤ K}.

In this definition, we enforce that the labels of data points within each subclass, i.e., manifold M_k, are identical. As a result, H encompasses 2^K distinct possible classification labels. Our objective is to develop a data representation that enhances the classifier's performance across all classification tasks within H. For the purpose of this study, we concentrate on a linear classifier and endeavor to construct a linearly separable representation. In particular, a data representation Θ(x) is deemed linearly separable if, for any classification task H⋆ ∈ H, a weight vector w_{H⋆} ∈ R^S can be found such that

sup_{x ∈ M_+(H⋆)} w_{H⋆}^T Θ(x) < inf_{x ∈ M_-(H⋆)} w_{H⋆}^T Θ(x).

Having introduced this definition, the natural inquiry arises: does a linearly separable representation exist, and if so, how can we learn it from our observed data?

3. Classical Graph Laplacian-Based Method

To learn a data representation on the multi-manifold, it is sufficient to find a way to separate each individual manifold. A commonly used unsupervised method to learn individual manifold structure is based on the spectral analysis of the graph Laplacian on a neighborhood graph (Belkin and Niyogi, 2003; Coifman and Lafon, 2006). Let X_1, ..., X_n be independent identically distributed samples drawn from the distribution π introduced in Section 2. Given the observed samples, we can construct a neighborhood graph by connecting an edge between two sample points X_i and X_j if and only if ‖X_i − X_j‖ ≤ r, where ‖·‖ is the Euclidean distance in R^D, and assigning weights W_{i,j} = I(‖X_i − X_j‖ ≤ r), where I(·) is the indicator function. With the neighborhood graph introduced, we define the graph Laplacian matrix L as

L_{i,j} = Σ_{i' ≠ i} W_{i,i'} if i = j, and L_{i,j} = −W_{i,j} if i ≠ j.

Then, the first S eigenvectors (corresponding to the S smallest eigenvalues) of L, denoted U_1, ..., U_S, can form the data representations for X_1, ..., X_n. Since these data representations can be used to separate manifolds, they have already been employed in various versions of spectral clustering algorithms (Ng et al., 2001; Von Luxburg, 2007; Arias-Castro et al., 2017; García Trillos et al., 2021; Chen and Yang, 2021).

To understand why and when the graph Laplacian can help recover the multi-manifold structure, we need to study the asymptotic behavior of the eigenvectors of L and introduce the continuum-level Laplacian operator. Specifically, we define the Laplacian operator on M_k as

Δ_{M_k} θ_k = −(1/π_k) div_{M_k}(π_k^2 ∇_{M_k} θ_k),

where θ_k : M_k → R is a function defined on M_k. After introducing the Laplacian operator on each M_k, we now define the tensorized Laplacian operator Δ_M on M as

Δ_M θ = (w_1 Δ_{M_1} θ_1, ..., w_K Δ_{M_K} θ_K),

where θ : M → R is a function defined on M and can be written as θ = (θ_1, ..., θ_K), with each θ_k defined on M_k. Since the operator Δ_M is defined in a manifold-wise fashion, its eigenfunctions have support on only one manifold; that is, the eigenfunctions take the form θ_{k,l_k}(x) = θ^k_{l_k}(x)/√(w_k) if x ∈ M_k and θ_{k,l_k}(x) = 0 if x ∉ M_k, where θ^k_{l_k}(x) is the l_k-th eigenfunction of Δ_{M_k}. In particular, the first K eigenfunctions of Δ_M have the form θ(x) = c_k I(x ∈ M_k), where c_k is a normalization constant and 1 ≤ k ≤ K. In other words, the structural information of different manifolds is mapped to different coordinates with the help of these eigenfunctions. Because of this special property, the representation based on these eigenfunctions can localize each manifold and thus leads to a linearly separable representation. It is also interesting to note that these eigenfunctions can capture the geometric information within each manifold.
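To see why indicator-like coordinates suffice for every task in H, the toy check below uses the idealized representation Θ(x) = (I(x ∈ M_1), ..., I(x ∈ M_K)) and verifies strict linear separation for all 2^K labelings; the random cluster assignment is only a stand-in for actual manifold membership.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K, n = 3, 300
labels_k = rng.integers(0, K, size=n)       # which manifold M_k each point lies on
Theta = np.eye(K)[labels_k]                 # indicator-style representation I(x in M_k)

# Every labeling H* in H assigns one sign s_k to all points on M_k (2^K tasks in total).
for signs in product([-1, 1], repeat=K):
    w = np.array(signs, dtype=float)        # candidate weight vector w_{H*}
    y = w[labels_k]                         # true labels H*(X_i)
    margin = y * (Theta @ w)                # y_i * w^T Theta(X_i)
    assert np.all(margin > 0)               # strict linear separation for every task
```

For a labeling H⋆ assigning sign s_k to M_k, the weight vector w_{H⋆} = (s_1, ..., s_K) gives w_{H⋆}^T Θ(x) = s_k on M_k, which is exactly the separation required by the definition above.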
The primary reason behind the capability of the eigenvectors of L to recover the multi-manifold structure stems from the fact that the graph Laplacian provides a reliable approximation of the Laplacian operator Δ_M, and the eigenvectors of L converge to the eigenfunctions of Δ_M. To establish this convergence, we need to introduce the following assumption:

Assumption 1 We assume that the following conditions hold:
1. There exists a constant C_π > 1 such that 1/C_π ≤ π^s_k(φ) ≤ C_π for 1 ≤ k ≤ K;
2. The parameter r is chosen such that 2r < min{1, i_0, Γ^{-1/2}, R/2}, where R and Γ denote the upper bounds of the reach and of the absolute values of the sectional curvatures, respectively, and i_0 is a lower bound on the injectivity radius.

These assumptions are commonly used in the analysis of the convergence of the spectrum of the graph Laplacian (Calder and García Trillos, 2022; García Trillos et al., 2021). Given Assumption 1, we can establish the convergence of the eigenvectors of L.

Theorem 1 If Assumption 1 holds and we assume

r → 0, (log n / n)^{1/d} ≪ r, and δ(M) > r,   (3)

then, with probability at least 1 − 4Kn^{-α}, if U_s is a normalized eigenvector of L with eigenvalue λ_s(L), there is a normalized eigenfunction θ_s of Δ_M with eigenvalue λ_s(Δ_M) such that

‖U_s − θ_s‖_{L²(π_n)} := ( (1/n) Σ_{i=1}^n (U_s(X_i) − θ_s(X_i))² )^{1/2} → 0, as n → ∞,

where θ_s = (θ_s(X_1), ..., θ_s(X_n)) ∈ R^n.

The proof of Theorem 1 adopts the same variational method as in Burago et al. (2015); García Trillos et al. (2020); Calder and García Trillos (2022); García Trillos et al. (2021). The detailed convergence rate of the eigenfunctions can be found in the Appendix. Theorem 1 shows that under certain conditions, including the main condition (3), the eigenvectors of L can approximate the eigenfunctions of Δ_M very well and thus lead to a linearly separable representation.

Can the conditions in (3) be relaxed? The lower bound condition for r in (3) is sharp since the connectivity threshold of the random geometric graph is of order (log n / n)^{1/d} (Penrose, 2003). The other condition, δ(M) > r, is also a necessary condition for separating different manifolds. To show this, we present the following example and theorem to argue that the eigenvectors of L cannot separate the multi-manifold structure when δ(M) ≪ r. Consider a simple example of two manifolds M = M_1 ∪ M_2, where M_1 and M_2 are defined in the following way:

M_1 = {(y, 0) : y ∈ M̃} and M_2 = {(y, z_o) : y ∈ M̃},

where M̃ is a d-dimensional smooth and compact manifold embedded in R^{D-1}, and z_o ∈ R is a positive constant. This construction clearly shows that δ(M) = z_o. The following theorem shows that the eigenvectors of L converge to eigenfunctions of Δ_{M̃} instead of Δ_M.

Theorem 2 Suppose w_1 = w_2 = 1/2, and π_1(x) and π_2(x) are the uniform distributions on M_1 and M_2 in the above example, so we can write our observed data as X_i = (Y_i, Z_i) for 1 ≤ i ≤ n, where Y_i ∈ R^{D-1} is drawn from the uniform distribution on M̃ and P(Z_i = z_o) = P(Z_i = 0) = 1/2. If Assumption 1 holds and we assume

r → 0, (log n / n)^{1/d} ≪ r, and δ(M) ≪ r,

then, with probability at least 1 − Cn^{-2} for some constant C, if U_s is a normalized eigenvector of L with the s-th eigenvalue, there is a normalized eigenfunction θ̃_s of Δ_{M̃} with the s-th eigenvalue such that

‖U_s − θ̃_s‖_{L²(π_n)} → 0, as n → ∞,

where θ̃_s = (θ̃_s(Y_1), ..., θ̃_s(Y_n)) ∈ R^n.

As θ̃_s depends solely on Y_1, ..., Y_n and not on Z_1, ..., Z_n, Theorem 2 suggests that the eigenvectors of L cannot effectively distinguish between M_1 and M_2, thereby failing to provide a linearly separable representation when δ(M) ≪ r.
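A small numerical illustration of this construction follows, assuming M̃ is the unit circle in R² (so d = 1 and D = 3); the radius r and the two values of z_o are arbitrary choices meant only to contrast the regimes z_o > r and z_o < r.

```python
import numpy as np

def laplacian_bottom_eigvecs(X, r, S):
    """Bottom-S eigenvectors of the unnormalized graph Laplacian of the r-neighborhood graph."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = (dists <= r).astype(float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.eigh(L)[1][:, :S]

rng = np.random.default_rng(0)
n, r = 600, 0.3
t = rng.uniform(0, 2 * np.pi, size=n)            # angle on the base manifold (a circle)
copy = rng.integers(0, 2, size=n)                # which copy: M1 (0) or M2 (1)
y = 2.0 * copy - 1.0                             # +/-1 label of the copy

for z_o in (2.0, 0.05):                          # z_o > r (separated) vs z_o < r (merged)
    X = np.c_[np.cos(t), np.sin(t), copy * z_o]  # two parallel copies of the circle
    U = laplacian_bottom_eigvecs(X, r, S=2)
    A = np.c_[U, np.ones(n)]                     # linear readout on the representation
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    acc = np.mean(np.sign(A @ w) == y)
    print(f"z_o = {z_o}: linear readout accuracy = {acc:.2f}")
# Expected behavior: accuracy near 1.0 when z_o > r (the copies occupy different
# eigenvector coordinates) and near chance when z_o < r (the graph merges the copies,
# so the embedding carries essentially no information about the copy label).
```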
The combined implications of Theorems 1 and 2 imply that the classical Laplacian-based method necessitates well-separated manifolds in order to learn a linearly separable representation, meaning that the minimum distance between manifolds must be sufficiently large:

δ(M) ≳ (log n / n)^{1/d}.

While this minimum distance requirement is optimal for the classical graph Laplacian-based method, it naturally raises the question of whether alternative methods can effectively separate closely situated manifolds. We investigate the information-theoretic lower bound for separating manifolds to answer this question. Following a similar strategy to Mossel et al. (2015), we address a simpler question: what is the smallest distance between manifolds that enables us to distinguish whether our observed data is drawn from one or two manifolds? Specifically, we consider the following hypothesis testing problem:

H_0 : M = M_0 versus H_1 : M = M_1 ∪ M_2.   (4)

Here M_0, M_1, and M_2 are smooth and compact d-dimensional manifolds such that the distance between M_1 and M_2 is δ(M) > 0. In other words, under the null hypothesis, the data is drawn from a single manifold, while under the alternative hypothesis, it is drawn from a union of two manifolds. Testing the hypothesis in (4) is a simpler problem than constructing a linearly separable representation because we can distinguish H_0 from H_1 if we can linearly separate M_1 and M_2. To detect the above hypothesis, we can consider a test T : (X_1, ..., X_n) → {0, 1} such that we reject the null hypothesis if T = 1. A test T is called an α-level test if P_0(T = 1) ≤ α, where P_0 is the probability measure under H_0. The following theorem characterizes the lower bound for separating manifolds.

Theorem 3 For any α-level test T(X_1, ..., X_n), there exists an instance with δ(M) ≥ b (log n / n)^{1/d} for some small enough constant b > 0 such that the type II error of this test converges to 1 − α.

This theorem demonstrates that any method is fundamentally limited in its ability to separate manifolds when the minimum distance requirement is not met. Combining this result with Theorem 1, we observe that the optimal rate of linear separation capacity is given by

(log n / n)^{1/d}.

While the classical graph Laplacian-based method can achieve this optimal rate, a natural question arises: can we uncover additional data structure and relax this minimum distance requirement by extracting more information from the data?
4. Augmentation Invariant Manifold Learning

Besides the observed samples, we can also generate additional augmented data from the observed data in various applications. For instance, cropping, rotation, colorization, and scaling can produce new images from the original images (Shorten and Khoshgoftaar, 2019). The augmented data is treated as an equivalent yet distinct version of the original data, offering supplementary information beyond the observed data. In recent years, self-supervised representation learning methods, encompassing both contrastive and non-contrastive approaches, have been introduced to learn data representations by harnessing this equivalence relationship among augmented data (Chen et al., 2020b; Grill et al., 2020; Tian et al., 2020a; Chen and He, 2021; Zbontar et al., 2021; Wang, 2025). The data-augmentation-invariant representations learned through these methods can enhance downstream analysis.

To leverage augmented data, existing self-supervised representation learning methods aim to learn a data representation that is invariant to augmented data, i.e., Θ(X_i) ≈ Θ(X'_i), where X'_i is a data point randomly generated by applying data augmentation techniques to X_i. Specifically, augmentation invariant manifold learning has recently been introduced to learn data representations by capturing both the geometric structure of the manifold and the invariance property of augmented data (Wang, 2025). To elaborate further, augmentation invariant manifold learning can be formulated as a stochastic optimization algorithm whose loss function on a small batch of samples S ⊂ {1, ..., n} is defined as

Σ_{i,j ∈ S} W_{i,j} ‖Θ_β(X'_i) − Θ_β(X'_j)‖² + λ_1 Σ_{i ∈ S} ‖Θ_β(X'_i) − Θ_β(X''_i)‖² + λ_2 R(Θ_β),   (5)

where X'_i and X''_i are independent augmented copies of X_i, W_{i,j} represents the weight between X'_i and X'_j, and R(Θ_β) is a regularization term enforcing an orthonormal representation. In the above loss function, the first term corresponds to a computationally efficient version of the graph Laplacian, while the second term aims to capture the similarity between augmented data. The detailed algorithm for augmentation invariant manifold learning can be found in Algorithm 1 in the Appendix. Wang (2025) demonstrated that preserving the similarity of augmented data is equivalent to incorporating weights between the augmented data of two samples. From this perspective, an equivalent form of the augmentation invariant manifold learning in Algorithm 1 follows a similar procedure to the classical Laplacian-based method, but assesses the similarity between two samples using the average of the weights between their augmented data:

W̄_{i,j} = E[ I(‖X'_i − X'_j‖ ≤ r) ],

where X'_i and X'_j are independent augmented data of X_i and X_j, and the expectation is taken over the data augmentation process. These new weights W̄_{i,j} lead to the construction of a graph Laplacian matrix L̄ in the same manner as the classical Laplacian-based method. Consequently, the first S eigenvectors of L̄ serve as the new data representations. In scenarios involving a single manifold, Wang (2025) shows that augmentation invariant manifold learning can reduce dimension nonlinearly more effectively than the classical Laplacian-based method.
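The following is a minimal PyTorch-style sketch of the minibatch loss in (5), written in the form used by Algorithm 1 in the Appendix; the encoder, the augmentation routine, and the tuning parameters are placeholders rather than the configuration used in the paper.

```python
import torch

def aiml_loss(encoder, X_batch, augment, r=1.0, lam1=1.0, lam2=1.0):
    """Minibatch loss in the spirit of (5) / Algorithm 1: a graph-Laplacian term on
    augmented points, an augmentation-invariance term, and an orthonormality penalty."""
    X1, X2 = augment(X_batch), augment(X_batch)            # two independent augmented copies
    Z1, Z2 = encoder(X1), encoder(X2)                      # representations, shape (n', S)
    W = (torch.cdist(X1.flatten(1), X1.flatten(1)) <= r).float()
    W.fill_diagonal_(0.0)                                  # neighborhood weights on the batch
    L = torch.diag(W.sum(dim=1)) - W                       # minibatch graph Laplacian
    laplacian_term = torch.trace(Z1.T @ L @ Z1)            # = 0.5 * sum_{i,j} W_ij ||Z1_i - Z1_j||^2
    invariance_term = ((Z1 - Z2) ** 2).sum()               # sum_i ||Theta(X'_i) - Theta(X''_i)||^2
    eye = torch.eye(Z1.shape[1], device=Z1.device)
    ortho_term = ((Z1.T @ Z1 - eye) ** 2).sum()            # ||Z'^T Z' - I_S||_F^2
    return laplacian_term + lam1 * invariance_term + lam2 * ortho_term
```

Minimizing this loss pulls the two augmented views of each sample together (the λ_1 term), smooths the representation along the neighborhood graph of the batch (the Laplacian term), and keeps the learned coordinates approximately orthonormal (the λ_2 term).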
Given the resemblance between augmentation invariant manifold learning and the classical Laplacian-based method, a natural question emerges: can augmentation invariant manifold learning also facilitate the learning of linearly separable representations? If so, how can data augmentation contribute to representation learning? To study the properties of augmentation invariant manifold learning, it is necessary to introduce the Laplacian operator on the augmentation-invariant manifolds N_s := N_{s,1} ∪ ... ∪ N_{s,K}. Given the multi-manifold N_s, we define a tensorized Laplacian operator Δ_{N_s} as follows:

Δ_{N_s} θ = ( w_1 Vol(N_{v,1}) Δ_{N_{s,1}} θ_1, ..., w_K Vol(N_{v,K}) Δ_{N_{s,K}} θ_K ),

where θ : N_s → R is a function defined on N_s, θ_k : N_{s,k} → R is a function defined on N_{s,k}, Vol(N_{v,k}) represents the volume of N_{v,k}, and Δ_{N_{s,k}} denotes the Laplacian operator on N_{s,k} given by

Δ_{N_{s,k}} θ_k = −(1/π^s_k) div_{N_{s,k}}((π^s_k)² ∇_{N_{s,k}} θ_k).

Similar to Δ_M, the eigenfunctions of Δ_{N_s} also have support on each individual manifold. Thus, a representation based on the eigenfunctions of Δ_{N_s} can effectively separate manifolds and lead to a linearly separable representation. In the following, we demonstrate that the eigenvectors of L̄ converge to eigenfunctions of Δ_{N_s}, enabling the detection of the multi-manifold structure.

Theorem 4 If Assumption 1 holds and we assume

r → 0, (log n / n)^{1/d_s} ≪ r, and δ(M) > r,   (6)

then, with probability at least 1 − 4Kn^{-α}, if U_s is a normalized eigenvector of L̄ with the s-th eigenvalue, there is a normalized eigenfunction θ̄_s of Δ_{N_s} with the s-th eigenvalue such that

‖U_s − θ̄_s‖_{L²(π_n)} → 0, as n → ∞,

where θ̄_s = (θ̄_s(φ_1), ..., θ̄_s(φ_n)) ∈ R^n.

Theorem 4 suggests that the graph Laplacian L̄ is a good approximation of the Laplacian operator Δ_{N_s} rather than Δ_M. Thus, the eigenvectors of L̄ can also detect each individual manifold and be used to construct a linearly separable representation when

δ(M) ≳ (log n / n)^{1/d_s}.

In other words, the weights W̄_{i,j} defined by augmented data can separate manifolds with a smaller δ(M) than the classical Laplacian-based method. The intuition behind this improvement is that W̄_{i,j} measures the similarity between the fibers M(φ_i) and M(φ_j) rather than between X_i and X_j, leading to a kernel between the latent variables φ_i and φ_j even though we cannot observe φ_i and φ_j directly. More specifically, W̄_{i,j} can be approximately written as the following kernel:

(1/r^{d_v}) W̄_{i,j} ≈ (V_{d_v} / Vol(N_{v,k})) ( 1 − d_{N_s}²(φ_i, φ_j)/r² )_+^{d_v/2},

where X_i, X_j ∈ M_k, V_{d_v} is the volume of a d_v-dimensional unit ball, (x)_+ = max(x, 0), and d_{N_s}(φ_i, φ_j) is the geodesic distance on N_s. Therefore, capturing the invariant data structure can help reduce dimension nonlinearly and separate different manifolds.

We have shown that multiple manifolds can be better separated by capturing the invariant structure of the data, but is the above minimum distance requirement sharp? To address this, we establish an information-theoretic lower bound for the task of separating manifolds by analyzing the hypothesis testing problem introduced in (4). Due to the presence of data augmentation, we are not limited to observing the individual samples X_1, ..., X_n; instead, we have access to a collection of fiber sets M(φ_1), ..., M(φ_n). When data augmentation is available, we can consider a test defined by these fibers, that is, T : (M(φ_1), ..., M(φ_n)) → {0, 1}, such that we reject the null hypothesis if T = 1. The ensuing theorem establishes a lower bound for the task of separating manifolds by leveraging the invariant structure of augmented data.

Theorem 5 For any α-level test T(M(φ_1), ..., M(φ_n)), there exists an instance with δ(M) ≥ b (log n / n)^{1/d_s} for some small enough constant b > 0 such that the type II error of this test converges to 1 − α.

The results presented in Theorems 4 and 5 demonstrate that augmentation invariant manifold learning can attain the optimal rate of linear separation capacity. By leveraging the additional information gained through data augmentation, augmentation invariant manifold learning enhances the rate of linear separation capacity to the following level:

(log n / n)^{1/d_s}.

A comparison between Theorems 3 and 5 underscores the indispensability of augmented data in acquiring linearly separable data representations, particularly when the manifolds exhibit proximity to one another.

5. Impact on Downstream Linear Classifier

The previous sections characterize the required conditions for learning linearly separable representations, but it is still unclear how these representations can lead to an efficient linear classifier.
To demystify the effectiveness of linearly separable representations, we consider a binary classification downstream task and study the performance of logistic regression. To be specific, suppose we observe a collection of labeled samples (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m), where the label Ỹ_i = H⋆(X̃_i) for some H⋆ ∈ H. In logistic regression, our goal is to minimize the following empirical logistic risk:

min_{β ∈ R^S} L_log(β) = (1/m) Σ_{i=1}^m ln( 1 + exp(−Ỹ_i β^T Θ̂(X̃_i)) ),

where Θ̂ : M → R^S is a data representation resulting from the previous sections. A commonly used way to minimize L_log(β) is the gradient descent method, i.e.,

β_{t+1} = β_t − η_t ∇L_log(β_t),

where η_t is the step size in the t-th iteration. After stopping the iterations, the resulting classifier is

Ĥ_{Θ̂,β̂}(x) = 1 if β̂^T Θ̂(x) > 0, and Ĥ_{Θ̂,β̂}(x) = −1 if β̂^T Θ̂(x) ≤ 0,

where β̂ is the resulting weight vector of the above gradient descent iterates. To study the performance of Ĥ_{Θ̂,β̂}, we consider the misclassification rate as our measure:

ξ(Θ̂) = P( Ĥ_{Θ̂,β̂}(X) ≠ H⋆(X) ).

To characterize the theoretical properties of Ĥ_{Θ̂,β̂}, we consider the following assumptions:

Assumption 2 It holds that
1. (X̃_1, ..., X̃_m) is independent of the data (X_1, ..., X_n) used in unsupervised/self-supervised learning, so (X̃_1, ..., X̃_m) is independent of Θ̂.
2. We choose the data representation as the first K eigenfunctions resulting from the classical Laplacian-based method or augmentation invariant manifold learning, i.e., Θ̂(x) = (θ̂_1(x), ..., θ̂_K(x)). We also assume θ̂_s(x) is close to θ_s(x) in the following sense:

Σ_{k=1}^K ∫_{M_k} (θ̂_s(x) − θ_s(x))² π_k(x) dx ≤ χ_n, 1 ≤ s ≤ K,

where θ_s(x) is the s-th eigenfunction of Δ_M or Δ_{N_s}.
3. For each 1 ≤ s ≤ K, we have at least one sample X̃_i ∈ M_s.
4. The step size of the gradient descent is chosen as η_t = 1 and the number of steps is large enough.

The above assumptions are mild in practice. In particular, the second assumption is just a population version of the results in Theorems 1, 2, and 4, and the last assumption is used for the convergence of the gradient descent method when the data is linearly separable (Soudry et al., 2018; Ji and Telgarsky, 2018). With these assumptions, the following theorem shows the convergence of the misclassification rate.

Theorem 6 If Assumption 2 holds, and m and K are upper bounded by some constant, then, with probability at least 1 − C_{m,K} χ_n, we have

ξ(Θ̂) ≤ C_K χ_n.

Here, C_{m,K} is a constant depending on m and K, and C_K is a constant depending on K.

Theorem 6 demonstrates that the performance of the downstream linear classifier hinges on the quality of our representation in approximating a linearly separable one, rather than on the sample size in the downstream analysis. This result implies that a linear classifier learned from a limited number of labeled samples can achieve a low misclassification rate when self-supervised learning effectively captures an accurate linearly separable representation through a large quantity of unlabeled samples. By combining Theorem 6 with the findings from the preceding sections, we can further compare the impacts of unsupervised and self-supervised learning on the downstream linear classifier. Let Θ̂_L and Θ̂_L̄ be the data representations acquired through the classical Laplacian-based method and augmentation invariant manifold learning, respectively. When

(log n / n)^{1/d_s} ≪ δ(M) ≪ (log n / n)^{1/d},

with probability tending to 1,

lim_{n → ∞} ξ(Θ̂_L̄) = 0,

and there exists an instance (corresponding to the scenario in Theorem 2) such that

lim_{n → ∞} ξ(Θ̂_L) ≥ 1/2.
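A minimal sketch of the downstream linear probe described above follows: logistic regression trained by gradient descent on top of a frozen representation Θ̂. The synthetic, nearly indicator-like features are a stand-in for the learned eigenvector coordinates, and all numerical choices are illustrative.

```python
import numpy as np

def train_logistic_probe(Theta, y, lr=1.0, steps=500):
    """Gradient descent on the empirical logistic risk (no intercept, frozen features).

    Theta : (m, S) frozen representation hat-Theta(X_i); y : labels in {-1, +1}.
    """
    m, S = Theta.shape
    beta = np.zeros(S)
    for _ in range(steps):
        margins = y * (Theta @ beta)
        # gradient of (1/m) * sum_i log(1 + exp(-y_i * beta^T Theta_i))
        grad = -(Theta * (y * (1.0 / (1.0 + np.exp(margins))))[:, None]).mean(axis=0)
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
m, K = 20, 3
k = rng.integers(0, K, size=m)                  # manifold index of each labeled sample
Theta = np.eye(K)[k] + 0.05 * rng.normal(size=(m, K))   # nearly indicator-like representation
y = np.where(k == 0, -1, 1)                     # a labeling H* that is constant on each manifold
beta = train_logistic_probe(Theta, y)
print("training accuracy:", np.mean(np.sign(Theta @ beta) == y))
```

With only a handful of labeled samples, the probe separates the classes as soon as the frozen representation is (approximately) linearly separable, mirroring the message of Theorem 6.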
6. Numerical Illustration

In this section, we present a series of numerical experiments to validate the theoretical results from the previous sections and to compare the performance of unsupervised and self-supervised learning methods. Specifically, we utilize the MNIST data set (LeCun et al., 1998), which consists of 60,000 training and 10,000 testing images of 28 × 28 gray-scale handwritten digits. For our self-supervised learning approach, we adopt the transformation introduced by Wang (2025) to generate augmented data. This involves random resizing, cropping, and rotation of the images. In all the numerical experiments, we compare the performance of Augmentation Invariant Manifold Learning (AIML), as defined in (5), with that of a continuous version of the classical graph Laplacian-based method (CML). To comprehensively study the impact of data augmentation, we formulate the unsupervised graph Laplacian-based method as an optimization problem similar to (5), but with all components related to data augmentation removed:

Σ_{i,j=1}^n W_{i,j} ‖Θ_β(X_i) − Θ_β(X_j)‖² + λ_2 R(Θ_β),   (7)

where W_{i,j} represents the weight between X_i and X_j, and Θ_β is the encoder. This optimization problem can be regarded as a computationally efficient and continuous version of the classical graph Laplacian-based method. For a fair comparison, we employ the same convolutional neural network encoder with two convolution+ReLU layers, two pooling layers, and a fully connected layer. Additionally, we use identical tuning parameters and optimization algorithms for both AIML and CML.

We present t-SNE plots (Hinton and Roweis, 2002) depicting the learned representations in Figure 2 when applying these two methods to the training images. It is important to note that t-SNE plots can distort the distances between data points, but they offer a visual representation of the cluster structure in the learned representations. As shown in Figure 2, each digit corresponds to multiple clusters, and leveraging data augmentation enhances the separation of distinct digits.

Figure 2: t-SNE plots of representations learned by augmentation invariant manifold learning (left) and the graph Laplacian-based method (right).

To quantitatively assess linear separability, we train a linear classifier on top of the learned representations and measure the resulting classifier's accuracy. We recode the digit labels as 1 when the digit is smaller than 5 and as −1 when the digit is equal to or larger than 5. Thus, the classification task aims to predict whether a digit is smaller than 5. We conduct two sets of numerical experiments to evaluate the influence of sample size on representation learning (via unsupervised or self-supervised methods) and on classifier training in the downstream task. In the first set of experiments, we use all training samples (without labels) to learn data representations, and the linear classifier is trained using 20%, 40%, 60%, 80%, and 100% of labeled samples. The left figure in Figure 3 presents the misclassification rates for the AIML and CML representations. It is evident that the performance of AIML and CML remains stable as the sample size increases from 40% to 100%, confirming the findings in Theorem 6. In the second set of experiments, we learn the representation from 40%, 60%, 80%, and 100% of unlabeled training samples, and the downstream classifier is trained with 40% of labeled samples.
The resulting linear classifier's misclassification rates are reported in the right figure of Figure 3. Clearly, a larger sample size in representation learning leads to a more accurate classifier. Combining the outcomes of both experiment sets, we observe that data augmentation aids in learning better linearly separable representations, and the classifier's accuracy primarily depends on the separability of the learned representations rather than on the sample size of the downstream analysis. These experimental results align with the theoretical findings presented in the previous sections.

Figure 3: Misclassification rate of AIML and CML: the left figure shows the result when the sample size in representation learning is fixed and the sample size in the downstream task varies; the right figure shows the result when the sample size in representation learning varies and the sample size in the downstream task is fixed.

7. Concluding Remarks

In this paper, we delve into the significance of data augmentation in learning linearly separable representations. Our investigation underscores the pivotal role played by the invariant structures introduced through data augmentation, facilitating the better transformation of intricate nonlinear data structures into linearly separable representations via unlabeled data sets. Coupled with the insights gleaned from the findings in Wang (2025), we conclude that data augmentation yields two principal advantages within self-supervised representation learning: enhanced separation of closely situated manifolds and nonlinear dimension reduction within each distinct manifold. Furthermore, exploring the implications for downstream analysis reveals that the crux of effective classification lies in the successful segregation of data originating from distinct manifolds. Remarkably, even with a limited number of labeled samples, the availability of an efficient linearly separable representation empowers the construction of accurate classifiers. This observation elucidates the feasibility of developing efficient algorithms in the context of few-shot learning (Wang et al., 2020), particularly in scenarios where an extensive unlabeled data set coexists with a smaller labeled counterpart.

To simplify the analysis, we assume the sampling distribution in data augmentation, π^v_k, is uniform on N_{v,k}. Although it seems restrictive, this assumption holds in many applications. For example, the rotation angle is usually uniformly sampled when an image is rotated randomly. If π^v_k is not a uniform distribution, the form of W̄_{i,j} becomes more complicated than the current one, but the conclusion of Theorem 4 still holds similarly because we can still consider W̄_{i,j} as a kernel between φ_i and φ_j. Another essential assumption in our investigation is that the data is drawn from a union of multiple manifolds where each manifold represents a subset of a class. This assumption is reasonable for many data sets with well-defined categories, including the MNIST and CIFAR-10 data sets. However, this assumption might be limited when it comes to complex data sets such as the ImageNet data set. Therefore, it would be interesting to investigate linear separation capacity in a broader setting beyond the multi-manifold assumption.

Acknowledgments

This project is supported by grants from the National Science Foundation (DMS-2113458).
Appendix A. Algorithm for Augmentation Invariant Manifold Learning

This section presents the stochastic optimization algorithm for augmentation invariant manifold learning introduced in Wang (2025), summarized in Algorithm 1.

Algorithm 1 Augmentation Invariant Manifold Learning
Input: A set of data {X_1, ..., X_n}, batch size n', encoder Θ_β, stochastic data augmentation transformation T, tuning parameters (r, λ_1, λ_2).
1: for each sampled minibatch {X_i : i ∈ S} do
2:   Generate two independent augmented copies of each sample, X'_i = T(X_i) and X''_i = T(X_i), for i ∈ S.
3:   Evaluate the representation of each augmented sample, Z' = {Θ_β(X'_i)}_{i ∈ S} and Z'' = {Θ_β(X''_i)}_{i ∈ S}, where Z', Z'' ∈ R^{n' × S}.
4:   Evaluate the kernel matrix W = {I(‖X'_i − X'_j‖ ≤ r)}_{i,j ∈ S} ∈ R^{n' × n'} and the corresponding Laplacian matrix L.
5:   Evaluate the loss ℓ̂(β) = tr(Z'^T L Z') + λ_1 ‖Z' − Z''‖²_F + λ_2 ‖Z'^T Z' − I_S‖²_F, where I_S is the S × S identity matrix and ‖·‖_F is the Frobenius norm of a matrix.
6:   Update Θ_β to minimize ℓ̂(β).
7: end for
Output: Encoder Θ_β

Appendix B. Proof

In this proof, ‖·‖ represents the Euclidean distance and d_M(·,·) represents the geodesic distance on the manifold M. C refers to a constant that can differ from place to place.

B.1 Proof Sketches

Proof sketches for Theorems 1, 2, and 4. The proof strategy in Theorems 1, 2, and 4 adopts the same variational method as in Burago et al. (2015); García Trillos et al. (2020); Calder and García Trillos (2022); García Trillos et al. (2021). The proof can be divided into eight steps.

Step 1: Dirichlet energy. In this step, we define the Dirichlet energy D(θ) for a continuous function θ. The eigenfunctions of the Dirichlet energy are eigenfunctions of the Laplacian operator of interest because D(θ) = ⟨θ, Δθ⟩. Different theorems define the function θ and the Laplacian operator differently.

Step 2: Discrete Dirichlet energy. In this step, we introduce the discrete version of the Dirichlet energy, b(U), for a discrete function U ∈ R^n, which is the discrete counterpart of the Dirichlet energy D(θ). The discrete Dirichlet energy is defined using the graph Laplacian derived from the observed data. In some theorems, we also need to introduce an intermediate discrete Dirichlet energy that helps establish the connection between the continuous and discrete Dirichlet energies in Step 4.

Step 3: Discretization and interpolation maps. This step connects discrete and continuous functions via discretization and interpolation maps. The discretization map P converts a continuous function to a discrete one, while the interpolation map I transforms a discrete function to a continuous one. We can show that the discretization and interpolation maps approximately preserve the norms of functions:

| ‖θ‖² − ‖Pθ‖² | = o( ‖θ‖ √(D(θ)) + ‖θ‖² ) and | ‖U‖² − ‖IU‖² | = o( ‖U‖ √(b(U)) + ‖U‖² ).

Step 4: Connection between continuous and discrete Dirichlet energies. This step aims to establish the connection between the Dirichlet energy D(θ) and its discrete counterpart b(U). Specifically, we aim to establish inequalities of the following form:

b(Pθ) ≤ (1 + o(1)) D(θ) + o(1) and D(IU) ≤ (1 + o(1)) b(U).

Different strategies are needed when D(θ) and b(U) are defined differently.

Step 5: Upper bound of eigenvalues. In this step, we characterize the upper bound for the eigenvalues of the discrete Dirichlet energy, λ_s(L), via the connection between the continuous and discrete Dirichlet energies. Specifically, we define an s-dimensional subspace S_o = { Σ_{j=1}^s a_j Pθ_j : a_j ∈ R }, where θ_1, ..., θ_s is an orthonormal set of eigenfunctions of D(θ).
By the Courant-Fisher minimax principle, we have λs(L) max U So, U 1 b(U) max a 1+o(1) b Pθ(a) , where θ(a) = Ps j=1 ajθj. If we apply the connection between continuous and discrete Dirichlet energy, we can easily get λs(L) (1 + o(1)) max a 1+o(1) D (θ(a)) + o(1) (1 + o(1)) λs(M) + o(1). Here, λs(M) is the eigenvalue of D(θ). Step 6: Lower bound of eigenvalues In this step, we characterize the lower bound for eigenvalues of discrete Dirichlet energy with a strategy similar to the last step. Define an s-dimensional subspace So = n Ps j=1 aj IUj : aj R o , where U1, . . . , Us is an orthonormal set of eigenfunctions of b(U). We can again apply the Courant-Fisher minimax principle to show λs(M) max θ So, θ 1 D(θ) max a 1+o(1) D IU(a) , where U(a) = Ps j=1 aj Uj. An application of the connection between continuous and discrete Dirichlet energy suggests λs(M) (1 + o(1)) max a 1+o(1) b (U(a)) (1 + o(1)) λs(L). Linear Separation Capacity of Self-Supervised Representation Learning Step 7: Convergence of the first K eigenvectors In this step, we study the convergence of the first K eigenvectors of b(U). The convergence of eigenvalues in steps 5 and 6 suggests λs(L) = o(1) for s = 1, . . . , K and λK+1(L) = (1 + o(1))λK+1(M). Let S be the subspace spanned by the first K eigenvectors of b(U), PS be the projection operator on S, and θ1, . . . , θK be a set of orthonormal basis for the eigenspace of eigenfunctions of D(θ). We can first show that Pθj almost belongs to the subspace S Pθj PS Pθj b( Pθj) λK+1(L) (1 + o(1))D(θj) + o(1) (1 + o(1))λK+1(M) = o(1). Since discretization map P can preserve the norm of the orthonormal basis, we can show that Pθj is approximately orthonormal, i.e., | Pθj1, Pθj2 δj1,j2| = o(1) where δj1,j2 is Kronecker delta and 1 j1 < j2 K. Because Pθj almost belongs to the subspace S and is approximately orthonormal, an orthonormal basis of S, named V 1, . . . , V K exists, such that Pθj V j = o(1). If we write θ = (θ(X1), . . . , θ(Xn)), we can also show that θj and Pθj are close, i.e., θj Pθj = o(1). Therefore, we show that θj V j = o(1) for j = 1, . . . , K. Equivalently, if we apply some rotation matrix, we can show the convergence of the first K eigenvectors. Step 8: Convergence of the rest eigenvectors In this step, we study the convergence of the rest eigenvectors. The analysis is similar to the last step, but we need more involved arguments to show Pθj PS Pθj = o(1), where S is the corresponding subspace spanned by eigenvectors of b(U), PS is the projection operator on S, and θj is the normalized eigenfunctions of D(θ) with eigenvalue equal to λ. Since the continuous and discrete Dirichlet energy is defined differently in steps 1 and 2, we need to adopt different strategies to build a connection between continuous and discrete Dirichlet energy in step 4 of different theorems. Once the connection is established, the analyses in steps 3, 5, 6, 7, and 8 are almost the same, so we only present them in the proof for Theorem 1. Proof sketches for Theorem 3 and 5 The information-theoretic lower bound is proved in a standard way. See more discussions in Tsybakov (2008). The proof is divided into three steps. Step 1: Hypothesis construction In this step, we construct the least favorable null and alternative hypothesis. The construction is slightly different in the proof of Theorem 3 and 5. Under the null hypothesis, we assume the observed data is drawn from a single manifold and write the probability distribution of observed data as P0. 
Under the alternative hypothesis, we assume the observed data is drawn from two disjoint manifolds that are very close to each other, and the union of these two manifolds is quite similar to the single manifold under the null hypotheses. Since there are many ways to construct the two manifolds, we consider a mixture distribution of these alternative hypothesis and the probability distribution of observed data as P1. Step 2: Bounding χ2 divergence In this step, we bound the χ2 divergence between the distributions under null and alternative hypotheses. Specifically, we aim to find a setting such that χ2(P1, P0) 0. Step 3: Applying theorem of fuzzy hypothesis Once the bound of χ2 divergence is established, we can apply the theorem of the fuzzy hypothesis (Tsybakov, 2008). The construction of the least favorable hypothesis and the way to bound χ2 divergence differ in different theorems. B.2 Proof for Theorem 1 Instead of proving Theorem 1 directly, we prove a more general version of Theorem 1. Theorem 7 Suppose Assumption 1 holds and for any 1 k K πk B(x, r) \ Mk β, x M \ Mk, where B(x, r) is a ball centered at x with radius r (in Euclidean distance). If we assume r 0, 1/d and β rd+2 + β nrd+2 0, then with probability at least 1 K2/t2 4Kn α for some α > 0, we have |λs(L) λs(M)| e1λ2(M) + e2, where λs(L) is sth eigenvalue of normalized matrix L in b(U) and λs(M) is sth eigenvalue of M. Here, e1 and e2 are defined as λs(M) + γ + r + δ and e2 = C βn + t β + tβ n 1 + λd/2+2 s (M) . Here, we choose δ = p r(c log n/n)1/d and γ = p (α + 1) log n/nδd for some large enough α so δ, γ, δ/r 0. With probability at least 1 K2/t2 4Kn α, if Us is normalized eigenvector of L with eigenvalue λs(L), there is a normalized eigenfunction θs of M with eigenvalue λs(M) such that Us θs L2(πn) C γ + δ + 1 γλ βn+t β+tβ n nrd+2 1/2 , s K λ + γ + r + δ r + βn+t β+tβ n nrd+2 λd/2+2 1/2 + Cδ, s > K , where λ = λs(M), θs = (θs(X1), . . . , θs(Xn)) and γλ is the eigengap. This theorem can lead to Theorem 1 if we choose β = 0. This theorem also suggests that the first K eigenvectors converge faster than the rest of eigenvectors. Linear Separation Capacity of Self-Supervised Representation Learning Step 1: Dirichlet energy To study the convergence of graph Laplacian, we need to introduce some notations. For a function θ : M R defined on M, we also write it in the following form θ(x) = (θ1(x), . . . , θK(x)), where each θk : Mk R is a function defined on each manifold Mk. Given two functions θA and θB defined on M, we define their inner product as θA, θB L2(π) = Mk θA k (x)θB k (x)πk(x)dx. We also define the weighted Dirichlet energy for a function θ as k=1 w2 k Dk(θk), where Dk(θk) = Z Mk θk(x) 2π2 k(x)dx. From this definition, it is straightforward to see D(θ) = θ, Mθ . Step 2: Discrete Dirichlet energy To study graph Laplacian, we now introduce a more general neighborhood graph and the corresponding graph Laplacian. Specifically, when Xi Xj r, we assign weights Wi,j = h Xi Xj where h is a function supported on [0, 1]. Clearly, h(x) = I( x 1) used in Section 3 is a special case. Given the general weights, we can define discrete Dirichlet energy as 1 σhrd Wi,j U(Xi) U(Xj) 2 = 1 σhn2rd+2 UT LU, where U(Xi) is the ith component of vector U (corresponds to Xi) and σh is the surface tension of h σh = Z |y1|2h(|y|)dy. In particular, when h(x) = I( x 1), σh = Vd/(d + 2), where Vd is the volume of ddimensional Euclidean unit ball. Although U Rn, it can also be considered as a function defined on discrete point {X1, . . . , Xn}. 
If we write the empirical measure of {X1, . . . , Xn} as then we can define the inner product between two discrete functions UA, UB L2(πn) = 1 i=1 UA(Xi)UB(Xi). Similarly, we can define within and cross manifold Dirichlet energy 1 σhrd Wi,j U(Xi) U(Xj) 2 , b W (U) = Xi Mk1,Xj Mk2 1 σhrd Wi,j U(Xi) U(Xj) Here, nk is the number of point on Mk, i.e., nk = |{i : Xi Mk}|. Step 3: Discretization and interpolation maps After introducing the notations, we now construct the relationship between Dirichlet energy D(θ) and b(U) via discretization and interpolation maps. The Proposition 2.12 in Calder and Garc ıa Trillos (2022) or Corollary A.3 in Garc ıa Trillos et al. (2021) suggests the following Proposition directly. We omit the proof. Proposition 8 With probability at least 1 Kn exp( Cnγ2δd) 2K exp( cn), there exists a probability measure µn,k with probability density function πn,k for k = 1, . . . , K such that πk πn,k L (Mk) C(γ + δ) and there exist transportation maps R1, . . . , RK such that sup x Mk d Mk(x, Rk(x)) δ. In Proposition 8, we can choose δ = p r(c log n/n)1/d and γ = p (α + 1) log n/nδd for some large enough α. Based on the transportation maps R1, . . . , RK in Proposition 8, we can also introduce the discretization and interpolation maps. First, we can define a partition of M by R1, . . . , RK. Specifically, if Xi Mk, then Ui = R 1 k (Xi) Mk. After defining the partition, we can introduce the discretization map P : L2(π) L2(πn) Pθ(Xi) = nk Ui θ(x) πn,k(x)dx, where Xi Mk. Similarly, we can introduce the associated extension map P : L2(πn) L2(π) i=1 u(Xi)I(x Ui). We then define the interpolation map I Iu = Λr 2δ P u Linear Separation Capacity of Self-Supervised Representation Learning where Λr 2δ is a convolution operator defined on M. More concretely, we define R Kr(x, y)θk(y)dy R Kr(x, y)dy , x Mk, where the kernel Kr(x, y) is defined as Kr(x, y) = 1 rd ψ d M(x, y) and ψ(t) = 1 Step 4: Connection between Dirichlet energy D(θ) and b(U) With the newly introduced discretization and interpolation maps, we can connect the Dirichlet energy D(θ) and b(U). We can directly apply the results from Proposition 4.1 and 4.2 in Calder and Garc ıa Trillos (2022) or Proposition A.4 and A.5 in Garc ıa Trillos et al. (2021) to show the following Proposition. Proposition 9 Suppose γ and δ are two parameters in Proposition 8. With probability at least 1 Kn exp( Cnγ2δd) 2K exp( cn), we have b W ( Pθ) 1 + C δ r + r + γ D(θ) and D( IU) 1 + C δ r + r + γ b(U). (8) In addition, we also have θ 2 L2(π) Pθ 2 L2(πn) Cδ θ L2(π) p D(θ) + C(γ + δ) θ 2 L2(π) (9) and U 2 L2(πn) IU 2 L2(π) Cr U L2(πn) p b(U) + C(γ + δ) U 2 L2(πn). (10) Here, θ L2(π) and U L2(πn). To connect Dirichlet energy D(θ) and b(U), we also need to find an upper bound for the cross manifold Dirichlet energy. Proposition 10 Suppose Assumption 1 holds and for any 1 k K πk B(x, r) \ Mk β, x M \ Mk, where B(x, r) is a ball centered at x with radius r (in Euclidean distance). When θ is in the span of M s eigenfunctions with corresponding eigenvalue less than λ, then we have b C( Pθ) C βn + t β + tβ n 1 + λd/2+2 θ 2 L2(π), with probability at least 1 K2/t2 2K exp( cn). Step 5: Upper bound of λs(L) To show the convergence of eigenvalues, we now construct an upper bound for λs(L) in terms of λs(M). Let θ1, . . . , θs be an orthonormal set of eigenfunctions of M. Then, we can define V j = Pθj, j = 1, . . . , s. 
We can apply (9) to θj1 θj2, so we have θj1 2 L2(π) + θj2 2 L2(π) 2 θj1, θj2 L2(π) V j1 2 L2(πn) V j2 2 L2(πn) + 2 V j1, V j2 L2(πn) λs(M) + C(γ + δ) So we can conclude that for some large enough n, we have θj1, θj2 L2(π) V j1, V j2 L2(πn) Cδ p λs(M) + C(γ + δ) < 1 Because θj1, θj2 L2(π) = 1 when j1 = j2 and θj1, θj2 = 0 when j1 = j2, we can know V 1, . . . , V s are linearly independent. Based on V 1, . . . , V s, we define a s-dimensional subspace j=1 aj V j : aj R By the Courant-Fisher minimax principle, we know λs(L) = min S Ss max U S\{0} b(U) U 2 L2(πn) , where Ss is the collection of all possible s-dimensional subspaces. This immediately suggests λs(L) max U So, U L2(πn)=1 b(U). For any U So, we can apply (8) and Proposition 10 to obtain b(U) = b( Pθ) 1 + C δ r + r + γ D(θ) + C βn + t β + tβ n 1 + λd/2+2 θ 2 L2(π) r + r + γ λs(M) θ 2 L2(π) + C βn + t β + tβ n 1 + λd/2+2 θ 2 L2(π). where θ = Ps j=1 ajθj if U = Ps j=1 aj V j and λ = λs(M). An application of (9) suggests θ 2 L2(π) U 2 L2(πn) + C(δ p λs(M) + γ + δ) θ 2 L2(π). If U L2(πn) = 1, then θ 2 L2(π) 1 + C(δ p λs(M) + γ + δ). This leads to λs(L) 1 + C δ p λs(M) + γ + r + δ λs(M)+C βn + t β + tβ n 1 + λd/2+2 s (M) . Step 6: Lower bound of λs(L) We can find a lower bound for λs(L) in terms of λs(M) with a similar idea. Let U1, . . . , Us be an orthonormal set of eigenvectors of L and θj = IUj, j = 1, . . . , s. Linear Separation Capacity of Self-Supervised Representation Learning We can apply (10) to Uj1 Uj2 θj1 2 L2(π) + θj2 2 L2(π) 2 θj1, θj2 L2(π) Uj1 2 L2(πn) Uj2 2 L2(πn) + 2 Uj1, Uj2 L2(πn) λs(L) + C(γ + δ) So we can conclude that for some large enough n, we have θj1, θj2 L2(π) Uj1, Uj2 L2(πn) Cr p λs(L) + C(γ + δ) < 1 Because Uj1, Uj2 L2(πn) = 1 when j1 = j2 and Uj1, Uj2 L2(πn) = 0 when j1 = j2, we can know θ1, . . . , θs are linearly independent. With θ1, . . . , θs, we define j=1 ajθj : aj R Again, by the Courant-Fisher minimax principle, λs(M) = min S Ss max θ S\{0} D(θ) θ 2 L2(π) max θ So, θ L2(π)=1 D(θ). For θ So, there exist a U such that θ = IU where U = Ps j=1 aj Uj. If we apply (8), then we have D(θ) = b( IU) 1 + C δ r + r + γ b(U) 1 + C δ r + r + γ λs(L) U 2 L2(πn). An application of (8) again suggests that if θ L2(π) = 1, then U 2 L2(πn) 1 + C(r p λs(L) + γ + δ). Now we can conclude λs(M) 1 + C r p λs(L) + γ + r + δ Putting upper and lower bounds together yields |λs(M) λs(L)| C r p λs(M) + γ + r + δ λs(M)+C βn + t β + tβ n 1 + λd/2+2 s (M) . In the rest of proof, we use the following notations λs(M) + γ + r + δ and e2 = C βn + t β + tβ n 1 + λd/2+2 s (M) . Step 7: Convergence of the first K eigenvectors After showing the convergence of eigenvalues, we now study the convergence of eigenvectors. We first show the convergence of the eigenvectors corresponding to the first K eigenvalues (λ1(M) = . . . = λK(M) = 0). Let γλ be the eigengap, i.e., γλ = λK+1(M). When n is large enough, we have |λs(M) λs(L)| γλ 4 , when s = 1, . . . , K + 1. Let S be the subspace spanned by the eigenvectors of L with eigenvalues λ1(L), . . . , λK(L) and PS be the projection operator on S. Let θ be a normalized eigenvector of M with eigenvalue 0 and U = Pθ. By Proposition 9 and 10, we can know that b(U) C βn + t β + tβ n The definition of eigenvector suggests b(U) = 1 σhn2rd+2 UT LU λ1(L) PSU 2 L2(πn)+λK+1(L) U PSU 2 L2(πn) = λK+1(L) U PSU 2 L2(πn). Putting above two inequalities together yields U PSU 2 L2(πn) C βn + t β + tβ n By the definition of P, we can know that if Xi Mk, then |U(Xi) θ(Xi)| = nk Ui (θ(x) θ(Xi)) πn,k(x)dx θk δ. 
Since θ is the eigenvector of M with eigenvalue 0, then we can know θk = 0 and U = θ, where θ = (θ(X1), . . . , θ(Xn)) Rn is a n dimensional vector. So we can know θ PS Pθ 2 L2(πn) C βn + t β + tβ n Let θ1, . . . , θK be a set of orthonormal basis for the eigenspace of eigenfunctions of M with eigenvalue 0, Uj = Pθj and V j = PS Pθj. By (12), we have θj V j 2 L2(πn) = Uj V j 2 L2(πn) C βn + t β + tβ n An application of (9) on θj1 θj2 suggests Uj1, Uj2 L2(πn) θj1, θj2 L2(π) C(γ + δ). We combine (11) and above inequality to obtain V j1, V j2 L2(πn) θj1, θj2 L2(π) C γ + δ + 1 γλ βn + t β + tβ n Linear Separation Capacity of Self-Supervised Representation Learning Let V 1, . . . , V K be the Gram-Schmidt orthogonalization of V 1, . . . , V K. Then we can know that V j V j L2(πn) C γ + δ + 1 γλ βn + t β + tβ n Therefore, we can conclude that V j θj L2(πn) C γ + δ + 1 γλ βn + t β + tβ n Equivalently, if we apply some rotation matrix, we can find an orthonormal set θ 1, . . . , θ K of eigenfunctions of M with eigenvalues 0 such that Uj θ j L2(πn) C γ + δ + 1 γλ βn + t β + tβ n Step 8: Convergence of the rest eigenvectors We next study the convergence of the other eigenvectors by induction. In particular, let λs 1(M) < λ = λs(M) = . . . = λs+l 1(M) < λs+l(M). Suppose θ1, . . . , θs 1 are a set of orthonormal basis for the eigenspace of eigenfunctions of M with eigenvalue smaller than λ and there exists a set of orthonormal basis for subspace spanned by the eigenvectors of L with eigenvalues smaller than λ such that V j Pθj L2(πn) C 1 λ + γ + r + δ + C βn + t β + tβ n 1 + λd/2+2 1/2 +Cδ (13) for j = 1, . . . , s 1. Let S be the subspace spanned by the eigenvectors of L with eigenvalues λs(L), . . . , λs+l 1(L) and PS be the projection operator on S. Similarly, we define S+ as the subspace spanned by the eigenvectors of L with eigenvalues larger than λs+l 1(L) and S as the subspace spanned by the eigenvectors of L with eigenvalues smaller than λs(L). PS+ and PS are the projection operator on S+ and S . Let θ be an eigenvector of M with eigenvalue λ and let U = Pθ. By Proposition 9 and 10, we have b(U) C βn + t β + tβ n 1 + λd/2+2 + 1 + C δ r + r + γ λ. By the spectrum decomposition, we have b(U) = 1 σhn2rd+2 UT LU λs(L) PSU 2 L2(πn) + λs+l(L) PS+U 2 L2(πn) λs(L) U 2 L2(πn) U PSU 2 L2(πn) + λs+l(L) U PSU 2 L2(πn) PS U 2 L2(πn) λs(L) U 2 L2(πn) + (λs+l(L) λs(L)) U PSU 2 L2(πn) λs+l(L) PS U 2 L2(πn). By (9), we can know U 2 L2(πn) 1 Cδ λ + C(γ + δ). The results in step 6 suggests |λ λk(L)| C r λ + γ + r + δ λ + C βn + t β + tβ n 1 + λd/2+2 . Therefore, we can have 2 U PSU 2 L2(πn) C r λ + γ + r + δ λ+C βn + t β + tβ n λd/2+2+λk+l(L) PS U 2 L2(πn). By the choice of orthonormal basis V 1, . . . , V s 1, we have j=1 V j, U L2(πn)V j. Since (13) and Pθj, U L2(πn) Cδ λ + C(γ + δ), which is implied by (9), we can know V j, U L2(πn) Cδ λ + γ + r + δ + C βn + t β + tβ n λd/2+2 1/2 . This leads to PS U L2(πn) Cδ λ + γ + r + δ + C βn + t β + tβ n λd/2+2 1/2 . The convergence of eigenvalue suggests |λs+l(M) λs+l(L)| C r p λk+l(M) + γ + r + δ λs+l(M)+C βn + t β + tβ n λd/2+2 s+l (M). This immediately leads to 2 U PSU 2 L2(πn) C r λ + γ + r + δ λ + C βn + t β + tβ n λd/2+2 + Cδ2. Therefore, we can conclude that U PSU 2 L2(πn) C r λ + γ + r + δ γλ + C βn + t β + tβ n λd/2+2 + Cδ2. By the definition of P, we can know that if Xi Mk, then |U(Xi) θ(Xi)| = nk Ui (θ(x) θ(Xi)) πn,k(x)dx θk δ. Since θ is the eigenvector of M, then we can know θk is finite. 
So we can know θ PS Pθ 2 L2(πn) C r λ + γ + r + δ γλ + C βn + t β + tβ n λd/2+2 + Cδ2. Linear Separation Capacity of Self-Supervised Representation Learning Then, we can follow the same argument in step 7 to show that if U1, . . . , Us+l 1 is an orthonormal basis for the eigenspace of L with eigenvalues smaller than λs+l(L), then there exists an orthonormal basis for the eigenspace of M with eigenvalues smaller than λs+l(M) such that Uj θj L2(πn) C λ λ + γ + r + δ + βn + t β + tβ n λd/2+2 1/2 + Cδ. B.3 Proof for Theorem 2 In this proof, we adopt the same notations in proof of Theorem 1. Besides the Dirichlet energy D(θ) and b(U), we define a new discrete Dirichlet energy based on Y1, . . . , Yn b Y (U) = 1 1 σh rd Wi,j U(Yi) U(Yj) r2 zo2 and the weight Wi,j is defined by Yi and Yj Wi,j = I ( Yi Yj r) . We also define a Dirichlet energy on M M θ(y) πY (y)dy, where πY is a uniform distribution on M. Similarly, we can also introduce the discretization map PY and interpolation map IY by Y1, . . . , Yn on M. We choose δ = p r(c log n/n)1/d (α + 1) log n/nδd for some large enough α in PY and IY . Instead of finding the connection between D(θ) and b(U), we will build the connection between DY (θ) and b(U). By the construction of M1 and M2, we have Yi Yj r Xi Xj r. Equivalently, we have I ( Yi Yj r) I ( Xi Xj r) or Wi,j Wi,j. This suggests b Y (U) = 1 1 σh rd Wi,j U(Yi) U(Yj) 1 σh rd Wi,j U(Yi) U(Yj) By Proposition 9, we have DY ( IY U) 1 + C δ r + r + γ b Y (U) r + r + γ r r + r + γ + zo On the other hand, if we write U = PY θ where θ is some eigenfucntion of M, then d+2 b(U) b Y (U) = 1 σhn2 rd+2 X Zi=Zj (I( Yi Yj r) I( Yi Yj r)) (U(Yi) U(Yj))2 = 1 σhn2 rd X Zi=Zj I( r < Yi Yj r) U(Yi) U(Yj) 1 σhn2 rd X Zi=Zj I( r < Yi Yj r) 3r θ 9r2 θ 2 σhn2 rd+2 X i,j I( r < Yi Yj r) Define a U-statistics Q = 2 n(n 1) i t 2 exp 2t2 2Σ2 + (2/3)tn 1/2 This suggest Q > C(rd rd) + C(rd rd) log n n + C log n With probability at least 1 2n 2, d+2 b(U) b Y (U) Cr2 θ 2 σh rd+2 (rd rd) + log n Linear Separation Capacity of Self-Supervised Representation Learning Therefore, we can conclude that b( PY θ) b Y ( PY θ) + C θ 2 σh r + r + γ DY (θ) + C θ 2 σh After building the connection between DY (θ) and b(U), we can apply the same arguments in step 5-8 of proof for Theorem 1 to show that, with probability at least 1 Cn 2, if Us is normalized eigenvector of L with sth eigenvalue, there is a normalized eigenfunction θs of M with sth eigenvalue such that Us θs L2(πn) C λ + γ + r + δ !!1/2 + Cδ, where λ = λs( M), θs = (θs(Y1), . . . , θs(Yn)) and γλ is the eigengap. The choices of δ, γ, zo and r suggest Us θs L2(πn) 0. B.4 Proof for Theorem 4 We adopt a similar strategy in Theorem 1 to prove results. Step 1: Dirichlet energy We need to introduce some notations parallel to the notations in the proof of Theorem 1. Given two functions θA and θB defined on Ns, we define their inner product as Ns,k θA k (φ) θB k (φ)πs k(φ)dφ. Similarly, we define the weighted Dirichlet energy for a function θ : Ns R as w2 k Vol Nv,k Dφ,k( θk), where Dφ,k( θk) = Z Ns,k θk(φ) 2πs k 2(φ)dφ. Step 2: Discrete Dirichlet energy Given the new weights, we can define corresponding discrete Dirichlet energy as 1 σhrd Wi,j U(φi) U(φj) where U(φi) is the ith component of vector U and σh is the surface tension when h(x) = (1 x2)dv/2 + where (y)+ = max(y, 0). 
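As an aside on the surface tension constant appearing here and in Step 2 of the proof of Theorem 7, the value $\sigma_h = V_d/(d+2)$ quoted for the indicator kernel $h(x) = I(x \le 1)$ can be verified directly; we include the short computation for completeness:
\[
\sigma_h = \int_{\|y\|\le 1} |y_1|^2\, dy
= \frac{1}{d}\int_{\|y\|\le 1} \|y\|^2\, dy
= \frac{1}{d}\int_0^1 \rho^2 \cdot d\,V_d\,\rho^{d-1}\, d\rho
= \frac{V_d}{d+2},
\]
where the first equality uses the symmetry of the unit ball in the coordinates $y_1, \ldots, y_d$ and $d\,V_d\,\rho^{d-1}$ is the surface area of the sphere of radius $\rho$. The surface tension of the kernel $h(x) = (1-x^2)^{d_v/2}_+$ used in this proof is obtained from the same integral with this choice of $h$.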
We can define within and cross manifold Dirichlet energy bφ,k( U) = 1 1 σhrd Wi,j U(φi) U(φj) 2 , bφ,W ( U) = 2 bφ,k( U), bφ,C( U) = 1 φi Ns,k1,φj Ns,k2 1 σhrd Wi,j U(φi) U(φj) We also need to define an intermediate discrete Dirichlet energy 1 σhrds h d Ns,k(φi, φj) U(φi) U(φj) and corresponding within manifold Dirichlet energy bφ,W ( U) = 2 1 Vol Nv,k bφ,k( U), bφ,k( U) = 1 1 σhrds h d Ns,k(φi, φj) U(φi) U(φj) Step 3: Discretization and interpolation maps By applying Proposition 8, we can know that with probability at least 1 Kn exp( Cnγ2δds) 2K exp( cn), there exists a probability measure µs n,k with probability density function πs n,k for k = 1, . . . , K such that πs k πs n,k L (Ns,k) C(γ + δ) and there exist transportation maps R1, . . . , RK such that sup x Ns,k d Ns,k(x, Rk(x)) δ. Similarly, we can choose δ = p r(c log n/n)1/ds and γ = p (α + 1) log n/nδds for some large enough α. Based on the transportation maps R1, . . . , RK, we can introduce the discretization map Pφ : L2(πs) Rn and the interpolation map Iφ : Rn L2(πs) in the same way as the proof of Theorem 1. Step 4: Connection between Dirichlet energy Dφ( θ) and bφ( U) To build the connection between Dφ( θ) and bφ( U), we need to find more explicit upper and lower bound for Wi,j. Linear Separation Capacity of Self-Supervised Representation Learning If φi, φj Ns,k and d Ns,k(φi, φj) r, then 1 rd Wi,j = 1 rd Vol2Nv,k M(φj) I( x y r)dxdy 1 rd Vol2Nv,k M(φj) I (d Mk(x, y) r) dxdy 1 rd Vol2Nv,k Nv,k I d Nv,k(ψi, ψj) q r2 d2 Ns,k(φi, φj) dψidψj 1 rd Vol2Nv,k 1 C(r2 d2 Ns,k(φi, φj)) Vdv r2 d2 Ns,k(φi, φj) dv/2 dψ1 Vdv rd Vol Nv,k 1 C(r2 d2 Ns,k(φi, φj)) r2 d2 Ns,k(φi, φj) dv/2 Vdv rds Vol Nv,k 1 d2 Ns,k(φi, φj) Here we use the following facts 1. x y d Mk(x, y); 2. d2 Mk(x, y) = d2 Ns,k(φi, φj) + d2 Nv,k(ψi, ψj) when x = Tk(φi, ψi) and y = Tk(φj, ψj); 3. |Vol(BNv,k(ψ, r)) Vdvrdv| Crdv+2, where BNv,k(ψ, r) is a geodesic ball with center at ψ and radius r; 4. Nv,k has no boundary. This suggests 1 rd Wi,j Vdv rds Vol Nv,k 1 Cr2 h d Ns,k(φi, φj) and bφ,k( U) 1 Cr2 Vdv Vol Nv,k bφ,k( U). If we apply Proposition 9, we have bφ,W ( U) 1 Cr2 Vdv bφ,W ( U) 1 Cr2 Vdv Dφ( Iφ U) 1 + C δ r + r + γ . Because bφ,C( U) 0, Dφ( Iφ U) 1 Vdv r + r + γ + r2 bφ,W ( U) r + r + γ + r2 bφ( U). Next, we work on the upper bound of Wi,j. If φi, φj Ns,k, then 1 rd Wi,j = 1 rd Vol2Nv,k M(φj) I( x y r)dxdy 1 rd Vol2Nv,k M(φj) I (d M(x, y) r) dxdy 1 rd Vol2Nv,k Nv,k I d Nv,k(ψi, ψj) q r2 d2 Ns,k(φi, φj) dψidψj 1 rd Vol2Nv,k 1 + c r2 Vdv( r2 d2 Ns,k(φi, φj))dv/2dψi 1 rd Vol Nv,k 1 + c r2 Vdv r2 d2 Ns,k(φi, φj) dv/2 rds Vol Nv,k 1 + c r2 Vdv 1 d2 Ns,k(φi, φj) Here r = r + 8r3/R2 and we use the following facts 1. d Mk(x, y) x y + 8 x y 3/R2, where R is the reach of the manifold; 2. d2 Mk(x, y) = d2 Ns,k(φi, φj) + d2 Nv,k(ψi, ψj) when x = Tk(φi, ψi) and y = Tk(φj, ψj); 3. |Vol(BNv,k(ψ, r)) Vdvrdv| Crdv+2, where BNv,k(ψ, r) is a geodesic ball with center at ψ and radius r; 4. Nv,k has no boundary. So we can have 1 rd Wi,j r d Vdv rds Vol Nv,k 1 + C r2 h d Ns,k(φi, φj) d 1 + Cr2 Vdv Vol Nv,k bφ,k( U). Note that the size of neighborhood in bφ,k( U) of the above inequality is r. An application of Proposition 9 leads to bφ,W ( Pφ θ) 1 + Cr2 Vdv bφ,W ( Pφ θ) 1 + Cr2 1 + C δ r + r + γ Vdv Dφ( θ) r + r + γ + r2 Vdv Dφ( θ). When δ(M) > r, we can know bφ,C( U) = 0. This immediately suggests bφ( Pφ θ) 1 + C δ r + r + γ + r2 Vdv Dφ( θ). After building the connection between Dφ( θ) and bφ( U), we can apply the same arguments in step 5-8 of proof for Theorem 1 to complete the proof. 
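Before moving to the lower bounds, the following small numerical sketch (our own illustration under simplified assumptions, not an experiment from the paper) shows the phenomenon that the preceding convergence results formalize: for points sampled from two disjoint one-dimensional manifolds whose separation exceeds the connectivity radius $r$, the bottom eigenvectors of the $r$-neighborhood graph Laplacian are nearly constant on each manifold, so a linear rule on this two-dimensional embedding separates the manifolds.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per = 150
# two disjoint "manifolds": unit circles centered at (0, 0) and (3, 0)
t1, t2 = rng.uniform(0, 2 * np.pi, n_per), rng.uniform(0, 2 * np.pi, n_per)
X = np.vstack([np.c_[np.cos(t1), np.sin(t1)],
               np.c_[np.cos(t2) + 3.0, np.sin(t2)]])
labels = np.r_[np.zeros(n_per), np.ones(n_per)]      # which manifold each point lies on

r = 0.4                                              # connectivity radius < manifold separation
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
W = (dist <= r).astype(float)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# bottom two eigenvectors of the graph Laplacian
vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, :2]

# each bottom eigenvector is (approximately) constant on each manifold, so the
# class-conditional means of the embedding are well separated
mean0 = embedding[labels == 0].mean(axis=0)
mean1 = embedding[labels == 1].mean(axis=0)
spread = max(embedding[labels == 0].std(axis=0).max(),
             embedding[labels == 1].std(axis=0).max())
print("smallest eigenvalues:", vals[:3])
print("between-manifold gap vs within-manifold spread:",
      np.linalg.norm(mean0 - mean1), spread)
```

Shrinking the separation relative to $r$ introduces cross-manifold edges and mixes the bottom eigenvectors, which is the difficulty that the lower bounds in the next subsections quantify.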
Linear Separation Capacity of Self-Supervised Representation Learning B.5 Proof for Theorem 3 The proof is divided into three steps. Step 1: Hypothesis construction In this step, we construct the least favorable hypothesis H0 and H1. Under H0, we assume the observed data is drawn from a single manifold M = M0 = {(x, 0D d) : x [0, 1]d}, where 0D d is a D d dimensional vector of zeros. We also assume that the probability distribution of our observed data is uniform distribution on M0. Given the manifold, we write P0 as the probability distribution of observed data X1, . . . , Xn under H0. Next, we construct the other hypothesis H1. We divide the space [0, 1]d into L = Md cubes of size M 1 . . . M 1 for some positive integer M and name these disjoint cubes Q1, . . . , QL. M will be specified in the third step. For each Ql, we also define Ql as a cube with the same center of Ql but a smaller size (3M) 1 . . . (3M) 1. Given Ql, we consider the following two manifolds M1(Ql) = {(x, 0D d) : x [0, 1]d \ Ql} and M2(Ql) = {(x, 0D d) : x Ql}. Clearly, M1(Ql) and M2(Ql) are two disjoint manifolds and the distance between them is (3M) 1. So δ(M(Ql)) = (3M) 1 if we define M(Ql) = M1(Ql) M2(Ql). Given these two manifolds, we consider a uniform distribution over M(Ql) and define it as hypothesis H1(Ql). Write P1(Ql) as the probability distribution of observed data X1, . . . , Xn under H1(Ql). The hypothesis H1 is a mixture of hypotheses H1(Ql) for l = 1, . . . , L and we write l=1 P1(Ql). By the construction, the observed data is drawn from single manifold under H0 and two disjoint manifolds under H1. Step 2: Bounding χ2 divergence The goal of this step is to bound χ2 divergence between P0 and P1. To the end, we write the likelihood between P1(Ql) and P0 as Fl(X1, . . . Xn) = d P1(Ql) ( 1 1 n , Xi ([0, 1]d \ Ql) Ql 0, Xi Ql \ Ql , where = M d (3M) d is the volume of set Ql \ Ql. Given the likelihood F1, . . . , FL, we can write the χ2 divergence between P0 and P1 as χ2(P1, P0) = E0 d P0 1 2 = Var0 where E0 and Var0 are the expectation and variance under the probability distribution P0. Therefore, it is sufficient to bound the variance and covariance of F1, . . . , FL. For any l1 = l2, we have 1 1 2n , Xi [0, 1]d \ (Ql1 Ql2) Ql1 Ql2 0, Xi (Ql1 \ Ql1) (Ql2 \ Ql2) . Because E0(Fl) = 1, we can bound the covariance of Fl1 and Fl2 in the following way |Cov0(Fl1, Fl2)| = |E0(Fl1Fl2) 1| 2n (1 2 )n 1 1 exp n 2/(1 ) Here we the fact that log(1 x) x/ 1 x when 0 < x < 1 and 1 x e x. For the variance term, we have Var0(Fl) E0(F 2 l ) = 1 1 = exp n log 1 + 1 where we use the fact log(1+x) x. Putting the bound of covariance and variance together yields χ2(P1, P0) 1 Step 3: Wrap up the proof Let T be any test. We still need to specify the choice of M in the hypotheses constructed in step 1. Specifically, we can choose M = (2n/ log n)1/d , so L 2n/ log n and log n Therefore, we have χ2(P1, P0) log n n1/3 + log2 n Therefore, we have P0(T = 1) + P1(T = 0) 1 χ2(P1, P0) 1. The construction suggests that under the hypothesis H1, we have Therefore, we can conclude that when c = 1/6, P0(T = 1) + P1(T = 0) 1. Linear Separation Capacity of Self-Supervised Representation Learning B.6 Proof for Theorem 5 We follow a similar strategy and the same notations in proof for Theorem 3, but need a different way to construct hypotheses. Step 1: Hypothesis construction Same to the proof for Theorem 3, we assume the observed data is drawn from a uniform distribution on the single manifold M0 under H0. 
The hypothesis H1 is constructed in a different way from the proof for Theorem 3. Specifically, we divide the space [0, 1]ds into L = Mds cubes of size M 1 . . . M 1 for some positive integer M, which will be specified later. We still name these disjoint cubes Q1, . . . , QL and define Ql as a cube with the same center of Ql but a smaller size (3M) 1 . . . (3M) 1. Under H1, we consider the following manifolds M1(Ql) = {(x, y, 0D d) : x [0, 1]ds \ Ql, y [0, 1]dv} and M2(Ql) = {(x, y, 0D d) : x Ql, y [0, 1]dv}. If we define M(Ql) = M1(Ql) M2(Ql), we have δ(M(Ql)) = (3M) 1. Given each pair of Ql and Ql, we define a uniform distribution over M(Ql) as hypothesis H1(Ql) and write P1(Ql) as the corresponding probability distribution of observed data X1, . . . , Xn. P1 is defined as a mixture distribution of P1(Ql) l=1 P1(Ql). Step 2: Bounding χ2 divergence We can bound the χ2 divergence in the same way as the proof for Theorem 3. Specifically, we can show χ2(P1, P0) 1 where = M ds (3M) ds. Step 3: Applying theorem of fuzzy hypothesis We can choose M = (2n/ log n)1/ds and still show P0(T = 1) + P1(T = 0) 1. B.7 Proof for Theorem 6 Without loss of generality, we can assume Yi = 1 when Xi Ms for 1 s K and Yi = 1 if Xi Ms for K < s K. Furthermore, we assume the first K eigenfunctions have the following forms ( 1, x Ms 0, x M \ Ms . Step 1 In this step, we aim to show that the learned representation ˆΘ( X1), . . . , ˆΘ( Xm) are linearly separable with respect to Y in a large probability. An application of Markov s inequality suggests P |ˆθs( Xi) θs( Xi)| > 1 3K E(ˆθs( Xi) θs( Xi))2 (1/3K)2 9K2χn. Therefore, with probability 1 9m K3χn, we have |ˆθs( Xi) θs( Xi)| 1 3K , 1 s K, 1 i m. (14) Let βo = (βo s) such that βo s = 1 when 1 s K and βo s = 1 when K < s K. When Yi = 1, then βo T ˆΘ( Xi) 1 1 3K (K K ) 1 On the other hand, if Yi = 1, then βo T ˆΘ( Xi) K 1 3K + (K K 1) 1 Therefore, {ˆΘ( Xi) : Yi = 1} and {ˆΘ( Xi) : Yi = 1} are linearly separable. Step 2 The goal of this step is to identify a set B which covers all converged weight vector β in the logistic regression. According to Soudry et al. (2018); Ji and Telgarsky (2018), the weight vector β of logistic regression converges to the direction of the max-margin solution. Specifically, Theorem 1.1 in Ji and Telgarsky (2018) suggests that lim t βt βt = β where β is the optimal solution of the following optimization problem min β β 2, s.t. YiβT ˆΘ( Xi) 1, 1 i n. (15) Now we study the property of β . Clearly, 3βo/2 satisfy the constraint in (15) and therefore we have β 2 3βo/2 2 = 9K/4. When Xi Ms for some 1 s K , (14) implies βT ˆΘ( Xi) βs + 1 s=1 |βs| βs + 1 An application of β 2 9K/4 yields 1 β T ˆΘ( Xi) β s + 1 Therefore, we can show that β s 1/2 when 1 s K . Through the same way, we can show that β s 1/2 when K < s K. Overall, we can conclude β B := β RK : β 2 9K 2 if 1 s K and βs 1 2 if K < s K . Linear Separation Capacity of Self-Supervised Representation Learning Step 3 Because β B, it is sufficient to study the misclassification rate of the following classifier ˆHˆΘ,β(x) = ( 1, βT ˆΘ(x) > 0 1, βT ˆΘ(x) 0 for any β B. Recall the true label is ( 1, x Ms when 1 s K 1, x Ms when K < s K . When β B, we can have P( ˆHˆΘ,β(X) = H (X)) P |ˆθs(X) θs(X)| > 1 6K , 1 s K because when |ˆθs(X) θs(X)| 1/6K, βT ˆΘ(X) βs 1 4 if X Ms for some 1 s K βT ˆΘ(X) βs + 1 4 if X Ms for some K < s K. 
By union bound and Markov s inequality, we have P |ˆθs(X) θs(X)| > 1 6K , 1 s K s=1 P |ˆθs(X) θs(X)| > 1 6K K E(ˆθs(X) θs(X))2 Therefore, we can conclude that with probability 1 9m K3χn, ξ(ˆΘ) 36K3χn. B.8 Proof for Proposition 10 Let k(Xi) be the manifold that Xi belongs to, i.e., k(Xi) = k if Xi Mk. The following analysis is conducted by conditioning on k(X1), . . . , k(Xn). We write the number of edges connecting points in Mk1 and Mk2 as Xi Mk1,Xj Mk2 I( Xi Xj r). The expectation of Zk1,k2 can be written as E(Zk1,k2) = E Xi Mk1,Xj Mk2 I( Xi Xj r) k(X1), . . . , k(Xn) nk1 sup x Mk1 E k(X1), . . . , k(Xn) To bound the variance of Zk1,k2, we apply Efron-Stein inequality (See Theorem 3.1 in Boucheron et al., 2013) Var(Zk1,k2) 1 I( Xi Xj r) I( X i Xj r) I( Xi Xj r) I( Xi X j r) where X i and X j are independent copies of Xi and Xj. The first term of the right hand side in above inequality can be written as I( Xi Xj r) I( X i Xj r) I( Xi Xj r) 4 sup x Mk1 Var I( Xi Xj r) 4 sup x Mk1 Var (I( x Xj r)) 4βnk2 + 4β2n2 k2. Here we use the fact I( Xi Xj r) We can apply the same strategy to bound the other term of the right hand side to obtain I( Xi Xj r) I( Xi X j r) 4βnk1 + 4β2n2 k1. Linear Separation Capacity of Self-Supervised Representation Learning Putting two terms together yields Var(Zk1,k2) 4βnk1nk2 + 2β2nk1nk2(nk1 + nk2). By Chebyshev s inequality, we can conclude that P Zk1,k2 2 βnk1nk2 + t p βnk1nk2 + tβ q nk1nk2(nk1 + nk2) 1 By union bound and the fact P(nwk/2 nk 2nwk, 1 k K) 1 2K exp( cn), P max 1 k1,k2 K Zk1,k2 cβn2 + t p βn + βn3/2 2K exp( cn) + K2 If we apply Proposition A.4 in Garc ıa Trillos et al. (2021), we can conclude that b C( Pθ) C βn + t β + tβ n 1 + λd/2+2 θ 2 L2(π). Appendix C. Notations In this section, we formally define the probability density function on the manifold. The volume of a Riemannian manifold (M, g) is defined in the following way: det(gij)dx1 . . . dxd, where (x1, . . . , xd) is a set of local coordinates. If X is a random variable defined on the manifold (M, g), the probability density function π of X is a function M R that satisfies det(gij)dx1 . . . dxd, for any measurable set A M. See also Hendriks (1990). H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and B. Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34:24206 24221, 2021. M. A. Arcones. A bernstein-type inequality for u-statistics and u-processes. Statistics & probability letters, 22(3):239 247, 1995. E. Arias-Castro, G. Lerman, and T. Zhang. Spectral clustering based on local pca. The Journal of Machine Learning Research, 18(1):253 309, 2017. S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi. A theoretical analysis of contrastive unsupervised representation learning. ar Xiv preprint ar Xiv:1902.09229, 2019. P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32, 2019. R. Balestriero and Y. Le Cun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. ar Xiv preprint ar Xiv:2205.11508, 2022. R. Balestriero, M. Ibrahim, V. Sobal, A. Morcos, S. Shekhar, T. Goldstein, F. Bordes, A. Bardes, G. Mialon, Y. Tian, A. Schwarzschild, A.G. Wilson, J. Geiping, Q. Garrido, P. Fernandez, A. Bar, H. Pirsiavash, Y. Le Cun, and M. Goldblum. A cookbook of selfsupervised learning. ar Xiv preprint ar Xiv:2304.12210, 2023. R. 
Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE transactions on pattern analysis and machine intelligence, 25(2):218-233, 2003.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373-1396, 2003.
S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020.
D. Burago, S. Ivanov, and Y. Kurylev. A graph discretization of the laplace-beltrami operator. Journal of Spectral Theory, 4(4):675-714, 2015.
V. Cabannes, B. T. Kiani, R. Balestriero, Y. Le Cun, and A. Bietti. The ssl interplay: Augmentations, inductive bias, and generalization. arXiv preprint arXiv:2302.02774, 2023.
J. Calder and N. García Trillos. Improved spectral convergence rates for graph laplacians on ε-graphs and k-nn graphs. Applied and Computational Harmonic Analysis, 60:123-175, 2022.
S. Chen, E. Dobriban, and J. H. Lee. A group-theoretic framework for data augmentation. The Journal of Machine Learning Research, 21(1):9885-9955, 2020a.
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597-1607. PMLR, 2020b.
X. Chen and K. He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750-15758, 2021.
X. Chen and Y. Yang. Diffusion k-means clustering on manifolds: Provable exact recovery via semidefinite relaxations. Applied and Computational Harmonic Analysis, 52:303-347, 2021.
X. Chen, H. Fan, R. Girshick, and K. He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.
R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5-30, 2006.
J. Devlin, M. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171-4186, 2019.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
Z. Fu, W. Hu, and T. Tan. Similarity based vehicle trajectory clustering and anomaly detection. In IEEE International Conference on Image Processing 2005, volume 2, pages II-602. IEEE, 2005.
N. García Trillos, M. Gerlach, M. Hein, and D. Slepčev. Error estimates for spectral convergence of the graph laplacian on random geometric graphs toward the laplace-beltrami operator. Foundations of Computational Mathematics, 20(4):827-887, 2020.
N. García Trillos, P. He, and C. Li. Large sample spectral analysis of graph-based multi-manifold clustering. arXiv preprint arXiv:2107.13610, 2021.
J. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent: a new approach to self-supervised learning.
Advances in Neural Information Processing Systems, 33:21271-21284, 2020.
J. Z. Hao Chen, C. Wei, A. Gaidon, and T. Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34, 2021.
T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729-9738, 2020.
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000-16009, 2022.
H. Hendriks. Nonparametric estimation of a probability density on a riemannian manifold using fourier expansions. The Annals of Statistics, pages 832-849, 1990.
G. E. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15, 2002.
R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
D. D. Johnson, A. E. Hanchi, and C. J. Maddison. Contrastive learning can find an optimal basis for approximately view-invariant functions. arXiv preprint arXiv:2210.01883, 2022.
A. Kumar, A. Raghunathan, R. Jones, T. Ma, and P. Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022.
Y. Le Cun. A path towards autonomous machine intelligence. Open Review, 62, 2022.
Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
R. R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509-536, 2018.
J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309-323, 2021.
V. J. Martinez and E. Saar. Statistics of the galaxy distribution. Chapman and Hall/CRC, 2001.
E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162:431-461, 2015.
A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14, 2001.
M. Penrose. Random geometric graphs, volume 5. OUP Oxford, 2003.
P. Pope, C. Zhu, A. Abdelkader, M. Goldblum, and T. Goldstein. The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894, 2021.
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
N. Saunshi, J. Ash, S. Goel, D. Misra, C. Zhang, S. Arora, S. Kakade, and A. Krishnamurthy.
Understanding contrastive learning requires incorporating inductive biases. In International Conference on Machine Learning, pages 19250-19286. PMLR, 2022.
C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1-48, 2019.
D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822-2878, 2018.
C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning - ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part III 27, pages 270-279. Springer, 2018.
Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. In European conference on computer vision, pages 776-794. Springer, 2020a.
Y. Tian, L. Yu, X. Chen, and S. Ganguli. Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578, 2020b.
C. Tosh, A. Krishnamurthy, and D. Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pages 1179-1206. PMLR, 2021.
Y. Tsai, Y. Wu, R. Salakhutdinov, and L. Morency. Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576, 2020.
A. B. Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
R. Vidal and Y. Ma. A unified algebraic approach to 2-d and 3-d motion segmentation and estimation. Journal of Mathematical Imaging and Vision, 25(3):403-421, 2006.
U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395-416, 2007.
S. Wang. Self-supervised metric learning in multi-view data: A downstream task perspective. Journal of the American Statistical Association, 118(544):2454-2467, 2023.
S. Wang. Augmentation invariant manifold learning. Journal of the Royal Statistical Society Series B: Statistical Methodology, page qkaf003, 2025.
Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys, 53(3):1-34, 2020.
C. Wei, K. Shen, Y. Chen, and T. Ma. Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622, 2020.
Z. Wen and Y. Li. Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, pages 11112-11122. PMLR, 2021.
Z. Wen and Y. Li. The mechanism of prediction head in non-contrastive self-supervised learning. arXiv preprint arXiv:2205.06226, 2022.
J. Zbontar, L. Jing, I. Misra, Y. Le Cun, and S. Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310-12320. PMLR, 2021.
C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419, 2023.