Understanding Representation Learnability of Nonlinear Self-Supervised Learning

Ruofeng Yang1, Xiangyuan Li1, Bo Jiang1, Shuai Li1*
1Shanghai Jiao Tong University
wanshuiyin@sjtu.edu.cn, lixiangyuan19@sjtu.edu.cn, bjiang@sjtu.edu.cn, shuaili8@sjtu.edu.cn
*Corresponding author

Abstract

Self-supervised learning (SSL) has empirically shown its data representation learnability in many downstream tasks. There are only a few theoretical works on data representation learnability, and many of those focus on the final data representation, treating the nonlinear neural network as a black box. However, the accurate learning results of neural networks are crucial for describing the data distribution features learned by SSL models. Our paper is the first to accurately analyze the learning results of the nonlinear SSL model. We consider a toy data distribution that contains two features: the label-related feature and the hidden feature. Unlike previous linear-setting work that depends on closed-form solutions, we use the gradient descent algorithm to train a 1-layer nonlinear SSL model with a certain initialization region and prove that the model converges to a local minimum. Furthermore, different from the complex iterative analysis, we propose a new analysis process which uses the exact version of the Inverse Function Theorem to accurately describe the features learned by the local minimum. With this local minimum, we prove that the nonlinear SSL model can capture the label-related feature and the hidden feature at the same time. In contrast, the nonlinear supervised learning (SL) model can only learn the label-related feature. We also present the learning processes and results of the nonlinear SSL and SL models via simulation experiments.

1 Introduction

In recent years, self-supervised learning has become an important paradigm in machine learning because it can use datasets without expensive target labels to learn useful data representations for many downstream tasks (Devlin et al. 2018; Radford et al. 2019; Wu et al. 2020). At present, contrastive learning, a common self-supervised learning method, has shown superior performance in learning data representations and has outperformed supervised learning in some downstream tasks (He et al. 2020; Chen and He 2021; Grill et al. 2020; Caron et al. 2020; Wang et al. 2022). Contrastive learning methods usually form a dual pair of siamese networks (Bromley et al. 1993) and use data augmentations for each datapoint. They treat two augmented datapoints of the same datapoint as a positive pair and maximize the similarity between positive pairs to learn data representations.

However, the siamese networks often collapse to a trivial solution during the training process, rendering the learned representation meaningless. To avoid the above problem, earlier contrastive learning methods such as MoCo (He et al. 2020) and SimCLR (Chen et al. 2020) treat augmented datapoints from different datapoints as negative pairs and prevent model collapse by the trade-off between positive and negative pairs. However, obtaining high-quality negative pairs is difficult (Khosla et al. 2020), which in turn requires additional changes to the model. Recently, other classes of SSL models, such as BYOL (Grill et al. 2020) and SimSiam (Chen and He 2021), which do not use negative pairs, have been studied.
These models do not collapse to a trivial solution because they construct a subtle asymmetry in the structure of the siamese network and create a dynamic buffer area (Tian, Chen, and Ganguli 2021). SimSiam further simplifies the structure of BYOL and retains only the core asymmetry. The simplified model makes training and analysis more convenient while obtaining competitive and meaningful data representations.

Despite the empirical success of SSL (He et al. 2020; Chen et al. 2020; Chen and He 2021; Zhong et al. 2022), there are only a few works that focus on data representation learnability (Arora et al. 2019; Tosh, Krishnamurthy, and Hsu 2021; Lee et al. 2021; HaoChen et al. 2021, 2022; Tian 2022a,b; Wen and Li 2021; Liu et al. 2021). However, studying the learnability is helpful for understanding why SSL models can obtain meaningful data representations. Many of the above works used the final data representation to study data representation learnability. Arora et al. (2019) obtained the data representation function by minimizing the empirical SSL loss in a special data representation function class. HaoChen et al. (2021) and HaoChen et al. (2022) studied the final data representation through closed-form solutions. They viewed the nonlinear neural network as a black box and ignored the learning result of the nonlinear neural network. Thus their results neither accurately describe the features captured by SSL models nor explain the encoding process of neural networks. Wen and Li (2021) and Tian et al. (2020) tried to understand the learning results of nonlinear SSL models by analyzing a relatively overparameterized neural network. However, their results do not provide an accurate answer to whether SSL models exactly capture the important features of the data distribution or just capture a mixture of features.

Liu et al. (2021) studied the learning results of SSL models, and it is the work most relevant to ours. They proved that SSL models can learn the label-related features and hidden features at the same time. However, their work is restricted to a linear framework, and their results depend on closed-form solutions of the learning results. When considering a nonlinear SSL model, we cannot obtain closed-form solutions due to nonconvexity. Therefore, which features can be exactly learned by nonlinear SSL models remains an important open question. We need a new analysis process to analyze the specific learning results of the nonlinear SSL model.

In this work, for the first time, we use gradient descent to train a nonlinear SSL model and analyze the data representation learnability by using the learning results of the neural network. We accurately describe the data distribution features captured by the SSL model. Specifically, we accomplish the following:
1. With a designed data distribution, we use gradient descent (GD) to train a 1-layer nonlinear SSL model and prove that the model converges to a local minimum under a certain initialization region. Using locally strong convexity, we also obtain the convergence rate of the algorithm.
2. We describe the properties of the local minimum using the exact version of the Inverse Function Theorem. Using these properties, we prove that the SSL model learns the label-related feature and the hidden feature at the same time.
3. We prove that the nonlinear SL model can only learn the label-related feature. In other words, SSL is superior to SL in learning data representation.
We verify the correctness of the above results through simulation experiments.

2 Related Work

Theoretical analyses for final data representation. For the analysis of data representation learnability, many works focus on the final data representation (the optimal solution of the pretext task) and measure the quality of the final data representation in downstream tasks by using a linear classifier (HaoChen et al. 2021, 2022; Arora et al. 2019; Lee et al. 2021; Tosh, Krishnamurthy, and Hsu 2021). The main difference in this line of work is how the final data representation is obtained. Arora et al. (2019) assumed that the data representation function class contains a function with low SSL loss and minimized the empirical SSL loss in this class. HaoChen et al. (2021) constructed the population positive-pair graph with augmented datapoints as vertices and the correlation of augmented datapoints as edge weights. Then they proved that the closed-form solutions of the data representations are approximately equivalent to the eigenvectors of the adjacency matrix of the above graph. Lee et al. (2021) used the nonlinear canonical correlation analysis (CCA) method to obtain the final data representation. The above works viewed the nonlinear neural network as a black box and ignored the learning results of the neural network. However, the learning results are crucial for analyzing which features are exactly captured by SSL methods. Hence we need to propose a new method to analyze the learning results.

Theoretical analyses for learning results of SSL. Liu et al. (2021) analyzed the learning results of SSL methods. With a 1-layer linear SSL model similar to SimSiam, they demonstrated that SSL models can learn label-related and hidden features simultaneously. Because of the linear structure and an objective function with a designed quartic regularization, they could directly obtain closed-form solutions of the learning results by using a spectral decomposition of the matrix related to the data distribution. Tian (2022a) and Tian (2022b) dealt with the learning results of the nonlinear SSL model by analyzing an objective function similar to traditional Principal Component Analysis (PCA). However, their analysis was restricted to a single hidden neuron. Hence their results cannot definitively answer which data features are captured by the model and which are ignored. Wen and Li (2021) and Tian et al. (2020) tried to understand the learning results of nonlinear SSL by using stochastic gradient descent (SGD). However, their results relied heavily on special data augmentation and relatively overparameterized neural networks. Furthermore, their results only showed that with a large number of neurons, the neural networks contain all data features. They did not accurately characterize the learning result of each neuron. In other words, these results did not show the features exactly captured by the SSL methods.

Theoretical guarantees for supervised learning. For the analysis of supervised learning, researchers focus on (1) how to characterize the landscape of the objective function; (2) how to converge to local minima through algorithms such as GD and SGD; and (3) how fast the algorithm converges to a local minimum (Allen-Zhu, Li, and Song 2019; Du et al. 2017; Li and Yuan 2017; Brutzkus and Globerson 2017; Du et al. 2019). Hence they focus on characterizing the relationship between the objective function and its gradient and less on the specific form or the properties of local minima.
However, the specific forms of local minima are helpful to determine whether SSL methods can capture important data distribution features.

3 Problem Formulation

In this section, we introduce the data distribution and the nonlinear SSL and SL models to be studied in this paper.

3.1 Data Distribution

The classification problem is a typical downstream task in machine learning, which can be used to measure the quality of data representation. We start with a simple binary classification and want to explore the differences in the data representations learned by SSL and SL models. To train models, we first build the data distribution. In most cases, the data distribution contains not only label-related features but also some hidden features. These hidden features may not be helpful for the current task but may be useful for other downstream tasks. We want to determine whether the nonlinear SSL models capture hidden features, resulting in a richer data representation. At the same time, we also wonder whether the SL models only learn label-related features. For simplicity of analysis, we consider the label-related features as a group, represented by the feature $e_1$. We also use $e_2$ to represent the hidden features.

Inspired by previous work (Liu et al. 2021), which solved the above question in the linear setting, we construct a data distribution containing four kinds of datapoints. The numbers of these four kinds of datapoints are $n_1, n_2, n_3, n_4$, with $n = n_1 + n_2 + n_3 + n_4$. Every time we generate a datapoint, we draw among the four kinds of datapoints with probability $1/4$ each, which means $\mathbb{E}[n_l] = n/4$, $l \in [4]$. Let $\tau > 1$ and $\rho > 0$ be two hyperparameters of the data distribution and $\xi_1, \dots, \xi_n \in \mathbb{R}^d$ be datapoint noise terms sampled from a Gaussian distribution $\mathcal{N}(0, I)$. Define
$$D_1 = \{x_i \mid x_i = e_1 + \rho\xi_i\}_{i=1}^{n_1}, \qquad D_2 = \{x_i \mid x_i = e_1 + \tau e_2 + \rho\xi_i\}_{i=n_1+1}^{n_1+n_2},$$
$$D_3 = \{x_i \mid x_i = -e_1 + \rho\xi_i\}_{i=n_1+n_2+1}^{n_1+n_2+n_3}, \qquad D_4 = \{x_i \mid x_i = -e_1 + \tau e_2 + \rho\xi_i\}_{i=n-n_4+1}^{n}, \tag{1}$$
as the datasets of the four kinds of datapoints, where $e_1, e_2 \in \mathbb{R}^d$ are two orthogonal unit-norm vectors. Then, the data distribution in this paper is $D = D_1 \cup D_2 \cup D_3 \cup D_4$. Because labels are required during the SL model training process, we modify the data distribution. Specifically, we denote the class label by $y \in \{0, 1\}$. When $x_i \in D_1 \cup D_2$, $y = 0$; otherwise $y = 1$. After the above steps, we obtain the data distribution $D_{SL}$ of the nonlinear SL model.

It is clear that the binary classification task can be completed using only the representative label-related feature $e_1$. However, since $\tau > 1$, $e_2$ is also an important hidden feature. Although this data distribution is a toy setting, it is sufficient to distinguish the learnability of the SSL and SL models. This data distribution is also representative. In Sec. 4.2, we explain that the proof process can be easily extended to a more general data distribution containing many label-related and hidden features.
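To make the construction concrete, the following is a minimal NumPy sketch of a sampler for Eq. (1), assuming the sign pattern reconstructed above ($+e_1$ for $D_1, D_2$ and $-e_1$ for $D_3, D_4$); the function name `sample_toy_dataset` and the component/label encoding are our own illustrative choices, not the paper's released code.

```python
import numpy as np

def sample_toy_dataset(n=100, d=10, tau=7.0, rho=None, seed=0):
    """Sample n datapoints from the four-component toy distribution of Eq. (1).

    Each datapoint is x = z * e1 + h * tau * e2 + rho * xi, where z in {+1, -1}
    is the label-related sign, h in {0, 1} indicates whether the hidden feature
    is present, and xi ~ N(0, I). Labels follow D_SL: y = 0 for D1 and D2
    (z = +1), y = 1 for D3 and D4 (z = -1).
    """
    rng = np.random.default_rng(seed)
    rho = d ** (-1.5) if rho is None else rho
    e1, e2 = np.eye(d)[0], np.eye(d)[1]
    comp = rng.integers(0, 4, size=n)          # 0..3 correspond to D1..D4, each with prob. 1/4
    z = np.where(comp < 2, 1.0, -1.0)          # D1, D2 carry +e1; D3, D4 carry -e1
    h = np.where(comp % 2 == 1, 1.0, 0.0)      # D2, D4 additionally carry tau * e2
    X = z[:, None] * e1 + (tau * h)[:, None] * e2 + rho * rng.standard_normal((n, d))
    y = (z < 0).astype(int)
    return X, y

X, y = sample_toy_dataset(n=100, d=10, tau=7.0)   # n = d^2 datapoints, as in Theorem 1
```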
3.2 Model

In this section, we introduce the activation function and then the nonlinear SSL and SL models. In this paper, we analyze nonlinear models, so it is necessary to introduce activation functions. We discuss two activation functions: the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ and the tanh function $\sigma_2(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.

The SSL model. We focus on a variant of SimSiam (Chen and He 2021). SimSiam has shown impressive performance in various downstream experiments using only positive pairs and has become a representative SSL model. Fig. 1 shows the structure of the model in this paper. The datapoint $x_i$ is augmented by data augmentation noise $\xi_{\mathrm{aug}}$ and $\xi'_{\mathrm{aug}}$ to obtain two augmented datapoints $x'_i$ and $x''_i$. The data representations $z'_i$ and $z''_i$ of $x'_i$ and $x''_i$ are obtained through the nonlinear encoder as $\sigma(Wx'_i)$ and $\sigma(Wx''_i)$. We use the inner product $\langle z'_i, z''_i \rangle$ to measure the similarity between $z'_i$ and $z''_i$. Tian, Chen, and Ganguli (2021) showed that a regularizer is essential for the existence of a non-collapsed solution. Hence, $\alpha\|W\|_F^2$ is added to $L$. The objective function $L$ is defined as
$$\min_W L = \alpha\|W\|_F^2 - \sum_{i=1}^{n} \mathbb{E}_{\xi_{\mathrm{aug}}, \xi'_{\mathrm{aug}}}\Big[\big\langle \sigma(W(x_i + \xi_{\mathrm{aug}})),\ \sigma(W(x_i + \xi'_{\mathrm{aug}})) \big\rangle\Big], \tag{2}$$
where $\alpha$ is the coefficient of the regularizer, $W = [w_1, w_2]^\top \in \mathbb{R}^{2 \times d}$, and $\xi_{\mathrm{aug}}, \xi'_{\mathrm{aug}} \sim \mathcal{N}(0, \rho^2 I)$. $W$ is the weight matrix of the encoder containing two neurons, and the parameters of the encoder are the same on both sides.

Figure 1: The structure of the SSL model.

Note that when Liu et al. (2021) took the expectation over $\xi_{\mathrm{aug}}, \xi'_{\mathrm{aug}}$, they canceled the effect of $\xi_{\mathrm{aug}}, \xi'_{\mathrm{aug}}$ due to their linear framework. In other words, the variance of $\xi_{\mathrm{aug}}, \xi'_{\mathrm{aug}}$ can be arbitrarily large. However, Jing et al. (2021) showed that strong augmentation causes dimensional collapse. Hence it is necessary to consider the variance of the data augmentation. In our formulation, $\xi_{\mathrm{aug}}$ and $\xi'_{\mathrm{aug}}$ cannot be canceled due to the nonlinear model. Thus our setting is more reasonable and more in line with the models in practice. To deal with the data augmentation operation, we adopt $\xi_{\mathrm{aug}}, \xi'_{\mathrm{aug}} \sim \mathcal{N}(0, \rho^2 I)$.

The SL model. We consider a simple two-layer nonlinear SL model to deal with the above binary classification problem. Define $f_{F, W^{SL}}(x) \triangleq F\sigma(W^{SL}x)$ with $F \in \mathbb{R}^{1 \times 2}$ as the projection matrix and $W^{SL} \triangleq [w_1^{SL}, w_2^{SL}]^\top \in \mathbb{R}^{2 \times d}$ as the weight matrix of the feature extractor. The usual process is to use the sigmoid function to transform $f_{F, W^{SL}}(x_i)$ into $\hat{y}_i \in (0, 1)$, $i \in [n]$. Then, a binary cross-entropy loss function can be constructed with $\hat{y}_i$ and the label information $y_i$, $i \in [n]$. However, in this paper, we focus on the performance of the feature extractor $\sigma(W^{SL}x)$. Therefore an objective function that minimizes the norm of the feature extractor matrix $W^{SL}$ under a margin constraint is used:
$$\min_{W^{SL}} L_{SL} = \|w_1^{SL}\|_2^2 + \|w_2^{SL}\|_2^2, \quad \text{s.t. } \sigma\big((w_{y+1}^{SL})^\top x\big) - \sigma\big((w_{y'+1}^{SL})^\top x\big) \ge \sigma(2) - \sigma(-2) - 5\rho d^{\frac{1}{10}}, \ \forall (x, y) \in D_{SL},\ y' \neq y. \tag{3}$$
The SL objective function in this paper is similar to that of the linear SL model in Liu et al. (2021). We set this SL objective function mainly out of intuition: if a supervised learning model is good enough to complete the classification task, it should satisfy the above margin constraint.

Definitions and notations. To characterize the objective functions, we give the following definitions and notations.

Definition 1 (Locally strong convexity and smoothness on $B_0$). A function $f: \mathbb{R}^d \to \mathbb{R}$ is locally $\mu$-strongly convex and $L_m$-smooth if
$$\mu I \preceq \nabla^2 f(x) \preceq L_m I, \quad \forall x \in B_0, \tag{4}$$
where $B_0 := \{x : \|x - x^*\|_2 \le \|x(0) - x^*\|_2\}$ and $x^* \in \arg\min_{x \in \mathcal{X}} f(x)$.

Definition 2 ($L_H$-Lipschitz continuous Hessian). A function $f: \mathbb{R}^d \to \mathbb{R}$ has an $L_H$-Lipschitz continuous Hessian if
$$\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L_H \|x - y\|_2, \quad \forall x, y \in \mathbb{R}^d. \tag{5}$$

Notations. For $x \in \mathbb{R}^d$, we denote by $\|x\|_2$ the vector's Euclidean norm. For $A \in \mathbb{R}^{d \times d}$, we denote by $\|A\|_F$ the standard Frobenius norm and define $\|A\|_2 = \sqrt{\lambda}$, where $\lambda$ is the largest eigenvalue of $A^\top A$. For $x \in \mathbb{R}^d$ and $\nabla^3 f(x) \in \mathbb{R}^{d \times d \times d}$, we give an upper bound of $\|\nabla^3 f(x)\|_2$ by considering $\nabla^3 f(x)$ as a matrix-vector whose elements are $\frac{\partial \nabla^2 f(x)}{\partial x_i} \in \mathbb{R}^{d \times d}$, so that $\|\nabla^3 f(x)\|_2^2 \le \sum_{i=1}^{d} \big\|\frac{\partial \nabla^2 f(x)}{\partial x_i}\big\|_F^2$. We denote by $O(\cdot)$ the standard Big-O notation, hiding only constants. We denote by $z^{(k)}$ the $k$-th element of $z \in \mathbb{R}^d$ and by $z(t)$ the $t$-th iterate of the gradient descent algorithm.
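As a concrete reading of the SSL objective in Eq. (2), the sketch below estimates $L$ by drawing a few Monte Carlo samples of the augmentation noise instead of taking the exact expectation; the helper name `ssl_loss`, the number of samples, and the use of PyTorch are our assumptions rather than the paper's implementation.

```python
import torch

def ssl_loss(W, X, alpha=1/800, rho=None, n_mc=8):
    """Monte Carlo estimate of the objective in Eq. (2):
    L = alpha * ||W||_F^2 - sum_i E[ <sigma(W(x_i + xi)), sigma(W(x_i + xi'))> ],
    with augmentation noise xi, xi' ~ N(0, rho^2 I) and sigmoid activation."""
    n, d = X.shape
    rho = d ** (-1.5) if rho is None else rho
    reg = alpha * (W ** 2).sum()               # alpha * ||W||_F^2
    sim = 0.0
    for _ in range(n_mc):                      # average over n_mc draws of (xi, xi')
        z1 = torch.sigmoid((X + rho * torch.randn_like(X)) @ W.T)   # (n, 2) view 1
        z2 = torch.sigmoid((X + rho * torch.randn_like(X)) @ W.T)   # (n, 2) view 2
        sim = sim + (z1 * z2).sum()            # sum_i <z'_i, z''_i>
    return reg - sim / n_mc

# Example: evaluate the loss for a random 2 x d encoder on X from the sampling sketch above.
W = torch.randn(2, 10, requires_grad=True)
loss = ssl_loss(W, torch.tensor(X, dtype=torch.float32))
```

The Monte Carlo average stands in for the expectation over $\xi_{\mathrm{aug}}, \xi'_{\mathrm{aug}}$; with more samples it concentrates around the population objective.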
4 SSL is Superior to SL in Learning Representation

In this section, we show that the nonlinear SSL model can capture the label-related feature and the hidden feature of the data distribution at the same time. In contrast, the nonlinear SL model can only learn the label-related feature. For simplicity, we assume $e_1 = (1, 0, \dots, 0)^\top$, $e_2 = (0, 1, \dots, 0)^\top \in \mathbb{R}^d$.

4.1 The Learning Abilities of SSL and SL

For convenience, we define $D_1(\tau) = \{x \in \mathbb{R}^d \mid x^{(1)} \in (3.1, 3.9),\ \tau x^{(2)} \in (8.5, 9),\ x^{(k)} \in (-\frac{3}{d^{0.49}}, \frac{3}{d^{0.49}}),\ k \in [3, d]\}$ and $D_2(\tau) = \{x \in \mathbb{R}^d \mid x^{(1)} \in (-3.9, -3.1),\ \tau x^{(2)} \in (8.5, 9),\ x^{(k)} \in (-\frac{3}{d^{0.49}}, \frac{3}{d^{0.49}}),\ k \in [3, d]\}$ as the initialization regions of $w_1$ and $w_2$ in Theorem 1.

Theorem 1. For $\alpha = 1/800$, $\tau = \max\{7, d^{\frac{1}{10}}\}$, $\rho = 1/d^{1.5}$ and $n = d^2$, with probability $1 - O(e^{-d^{\frac{1}{10}}})$, the SSL objective function $L$ has a local minimum $W^* = (w_1^*, w_2^*)$:
$$\|w_1^* - \widetilde{w}_1^*\|_2 \le O(d^{-1}), \qquad \|w_2^* - \widetilde{w}_2^*\|_2 \le O(d^{-1}),$$
where $\widetilde{w}_1^{*(1)} \in [3.1, 3.9]$, $\widetilde{w}_1^{*(1)} = -\widetilde{w}_2^{*(1)}$, $\tau\widetilde{w}_1^{*(2)} = \tau\widetilde{w}_2^{*(2)} \ge 9$, and $\widetilde{w}_1^{*(k)} = \widetilde{w}_2^{*(k)} = 0$, $k \in [3, d]$. Furthermore, when $w_1(0) \in D_1(\tau)$ and $w_2(0) \in D_2(\tau)$, using the gradient descent algorithm with learning rate $\eta = \frac{2}{4\alpha + \tau^2 + 1.5}$ and $\kappa = 1 + \frac{\tau^2 + 1.5 + 2d^{-0.1}}{2\alpha - d^{-0.1}}$, we have
$$\|w_1(t) - w_1^*\|_2 \le \Big(\frac{\kappa - 1}{\kappa + 1}\Big)^t \|w_1(0) - w_1^*\|_2, \qquad \|w_2(t) - w_2^*\|_2 \le \Big(\frac{\kappa - 1}{\kappa + 1}\Big)^t \|w_2(0) - w_2^*\|_2.$$
The projection of $e_1$ and $e_2$ onto the space spanned by $w_1^*$ and $w_2^*$ is very close to 1, i.e.,
$$|\Pi_{e_1}| \ge 1 - O(\tau^3 d^{-1}), \qquad |\Pi_{e_2}| \ge 1 - O(\tau^3 d^{-1}).$$

Theorem 1 shows that, using GD to train the nonlinear SSL model under a certain initialization region $D_1(\tau) \times D_2(\tau)$, the model converges to a local minimum $(w_1^*, w_2^*)$. Further, the projection of $e_1, e_2$ onto the space spanned by $w_1^*, w_2^*$ is almost 1. In other words, the nonlinear SSL model has simultaneously learned $e_1$ and $e_2$, which are the label-related and hidden features.

Theorem 2. Let $w_1^{SL,*}$ and $w_2^{SL,*}$ be the optimal solution of $L_{SL}$. Then with probability $1 - O(e^{-d^{\frac{1}{10}}})$,
$$\big(w_1^{SL,*(2)}\big)^2 + \big(w_2^{SL,*(2)}\big)^2 \le O\big(\rho d^{\frac{1}{10}}\big).$$
When $\rho = 1/d^{1.5}$, $\big(w_1^{SL,*(2)}\big)^2 + \big(w_2^{SL,*(2)}\big)^2 \le O(1/d^{1.4})$.

Theorem 2 shows that $(w_1^{SL,*(2)})^2 + (w_2^{SL,*(2)})^2$ is very small, which means the SL model can only learn the label-related feature and some noise terms. Note that previous works (Tian 2017; Zhang et al. 2019; Li and Liang 2018) used gradient-based algorithms to analyze the SL model with one hidden layer and obtained asymptotic convergence guarantees. They did not analyze the specific form of the learning results. Hence, Theorem 2 differs from these previous results. We obtain bounds on each dimension of the learning results by constructing margin constraints. These bounds accurately describe the features learned by the SL model and help to characterize the representation learnability of the SL model. Finally, Theorem 1 and Theorem 2 show that the nonlinear SSL model is superior to the nonlinear SL model in capturing important data features, which means SSL can obtain a more competitive data representation than SL.
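Theorem 1 quantifies feature recovery through the projections $|\Pi_{e_1}|$ and $|\Pi_{e_2}|$ onto $\mathrm{span}\{w_1^*, w_2^*\}$. The short sketch below shows one way to compute this quantity for a trained weight matrix; `projection_norm` is a hypothetical helper of ours, and the random `W_trained` merely stands in for the learned weights.

```python
import numpy as np

def projection_norm(v, W):
    """Norm of the orthogonal projection of v onto the row space of W,
    i.e. onto span{w_1, w_2}; a value close to 1 means the direction v
    is (almost) contained in the learned representation space."""
    Q, _ = np.linalg.qr(W.T)          # orthonormal basis of span{w_1, w_2}, shape (d, 2)
    return np.linalg.norm(Q.T @ v)

d = 10
e1, e2 = np.eye(d)[0], np.eye(d)[1]
W_trained = np.random.randn(2, d)     # placeholder for the weights after training
print(projection_norm(e1, W_trained), projection_norm(e2, W_trained))
```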
4.2 Discussion

The extension to more general data distributions. As described in Sec. 3.1, we treat label-related features as a group, represented by $e_1$ ($e_2$ represents hidden features), and obtain Theorem 1. In this part, we show that Theorem 1 can be extended to data distributions with many label-related and hidden features. Suppose there are $P$ label-related features $E_L = \{e_1, \dots, e_P\}$ and $Q + 1$ hidden features $E_H = \{e_{P+1}, \dots, e_{P+Q}, 0\}$, where $E = \{E_L, E_H\}$ is a column-orthogonal matrix. Each datapoint consists of a label-related feature and a hidden feature, $x_i = z_i e_i^L + \tau e_i^H$, where $P(z_i = 1) = P(z_i = -1) = 1/2$ and $e_i^L$, $e_i^H$ are features in $E_L$ and $E_H$, respectively. This general distribution only considers the relationship between label-related features and hidden features. Hence, the gradient can be decoupled, and the method of this paper can be applied. Finally, we can conclude that if $W$ contains $P + Q$ neurons, the learning results of $W$ will span the space spanned by $\{e_1, \dots, e_{P+Q}\}$. This conclusion can be regarded as the general version of Theorem 1.

The challenges for nonlinear models. Since the objective function is non-convex and nonlinear, it is difficult to get a closed-form solution with a process similar to that of Liu et al. (2021). We need to use an optimization algorithm such as GD to converge to a local minimum $(w_1^*, w_2^*)$ and determine the features captured by $(w_1^*, w_2^*)$. For nonlinear SSL models, previous work (Wen and Li 2021) used SGD to update the model step by step and observed the learning result during the iteration. However, the step-by-step process is complex, and it is easy to lose track of the change process. Therefore, this procedure makes it difficult to analyze the learning results of local minima accurately.

Different from the complex iterative analysis of the previous work, we propose a new analysis process. We first obtain the approximate region and properties of the local minimum from the simplified objective function $\widetilde{L}$ and then extend them to the original complex objective function $L$. For the transformation from $\widetilde{L}$ to $L$, we use the exact version of the Inverse Function Theorem as a bridge, avoiding the direct analysis of the local minimum of $L$. In the remainder of this part, we demonstrate the intuitions and techniques for each part.

Non-convex and nonlinear objective function. At this step, we consider the structure of the objective function, ignore noise terms, and take the expectation over the data distribution:
$$\min_W \widetilde{L} = -\mathbb{E}_{\widetilde{x}}\big[\big\langle \sigma(W\widetilde{x}),\ \sigma(W\widetilde{x}) \big\rangle\big] + \alpha\|W\|_F^2,$$
where $\widetilde{x}_i$ is the datapoint without the noise term $\rho\xi_i$. We use the intermediate value principle, the locally strong convexity of $\widetilde{L}$, and the properties of the activation function carefully to prove the existence of the local minimum $(\widetilde{w}_1^*, \widetilde{w}_2^*)$ of $\widetilde{L}$.

The exact version of the Inverse Function Theorem. There are many noise terms in $L$, such as $\rho\xi_i$, $i \in [n]$ (datapoint noise), $\xi_{\mathrm{aug}}$ (data augmentation noise), and the error terms due to the expectation operation over the data distribution. After obtaining the upper bound of these noise terms (Sec. 4.3), we need a bridge to deal with the transformation from $\widetilde{L}$ to $L$. Since $(\widetilde{w}_1^*, \widetilde{w}_2^*)$ is a local minimum of $\widetilde{L}$ and the noise terms are bounded, $L$ should be $\mu$-strongly convex and $L_m$-smooth in the neighborhood of $(\widetilde{w}_1^*, \widetilde{w}_2^*)$. With these properties, it is clear that $\frac{\partial L}{\partial w_1}$ is one-to-one in a small neighborhood of $\widetilde{w}_1^*$ by the original Inverse Function Theorem (Rudin 1976). However, we need an exact neighborhood to guarantee that the solution $w_1^*$ of $\frac{\partial L}{\partial w_1} = 0$ lies in this one-to-one region. Hence we introduce the Lipschitz continuous Hessian constant $L_H$ to build an open ball centered at $\widetilde{w}_1^*$ with radius $r = \frac{\mu}{2L_H}$ as the exact neighborhood, and we modify the Inverse Function Theorem to complete our proof.
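The argument above relies on the simplified objective $\widetilde{L}$ being locally strongly convex and smooth (Definition 1) around $(\widetilde{w}_1^*, \widetilde{w}_2^*)$. Purely as an illustration, the sketch below implements one reading of $\widetilde{L}$ over the reconstructed clean datapoints and numerically inspects its Hessian at a point inside $\widetilde{D}_1(\tau) \times \widetilde{D}_2(\tau)$; the loss implementation, the probe point, and the eigenvalue check are our assumptions and do not reproduce the paper's proof.

```python
import torch

def simplified_loss(w_vec, d=10, tau=7.0, alpha=1/800):
    """One reading of the noise-free objective:
    L~(W) = -E_x[ <sigma(W x), sigma(W x)> ] + alpha * ||W||_F^2,
    with x uniform over the clean datapoints {e1, e1 + tau*e2, -e1, -e1 + tau*e2}."""
    W = w_vec.view(2, d)
    e1 = torch.zeros(d); e1[0] = 1.0
    e2 = torch.zeros(d); e2[1] = 1.0
    xs = torch.stack([e1, e1 + tau * e2, -e1, -e1 + tau * e2])   # the four clean datapoints
    z = torch.sigmoid(xs @ W.T)                                  # (4, 2) clean representations
    return alpha * (W ** 2).sum() - (z * z).sum(dim=1).mean()

# Probe point: w1^(1) = 3.5 in [3.1, 3.9], w2^(1) = -3.5, tau * w^(2) = 9.
d, tau = 10, 7.0
w = torch.zeros(2, d)
w[0, 0], w[1, 0] = 3.5, -3.5
w[:, 1] = 9.0 / tau
H = torch.autograd.functional.hessian(lambda v: simplified_loss(v, d, tau), w.reshape(-1))
eigs = torch.linalg.eigvalsh(H)
# The smallest and largest eigenvalues play the roles of the local mu and L_m in Definition 1.
print(eigs[0].item(), eigs[-1].item())
```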
4.3 Proof Sketch of Main Theorem

Proof sketch of SSL. For the sake of discussion, we respectively define $\widetilde{D}_1(\tau) = \{x \in \mathbb{R}^d \mid x^{(1)} \in [3.1, 3.9],\ \tau x^{(2)} \in [9, +\infty),\ x^{(k)} = 0,\ k \in [3, d]\}$ and $\widetilde{D}_2(\tau) = \{x \in \mathbb{R}^d \mid x^{(1)} \in [-3.9, -3.1],\ \tau x^{(2)} \in [9, +\infty),\ x^{(k)} = 0,\ k \in [3, d]\}$ as the regions of $\widetilde{w}_1^*$ and $\widetilde{w}_2^*$. As a beginning, we focus on $\widetilde{L}$. To obtain the solution of $\frac{\partial\widetilde{L}}{\partial w_1} = 0$, we first solve $\frac{\partial\widetilde{L}}{\partial w_1^{(k)}} = 0$, $k \in [2]$, separately in $\widetilde{D}_1(\tau)$.

Figure 2: Theoretical Results of Theorem 1.

Subsequently, we use the intermediate value principle twice to prove the existence of $\widetilde{w}_1^*$. Finally, we use the Hessian matrix to prove that $\widetilde{w}_1^*$ is a local minimum. We demonstrate that $\widetilde{L}$ is $\widetilde{\mu}$-strongly convex and $\widetilde{L}_m$-smooth in the region around $\widetilde{w}_1^*$.

To prove Theorem 1, we need to deal with the noise terms in $L$. Due to the activation function, we cannot use a noise matrix to treat the noise terms as in Liu et al. (2021). Hence, we use Lagrange's Mean Value Theorem to separate $\xi_i$, $\xi_{\mathrm{aug}}$, $\xi'_{\mathrm{aug}}$ from the activation function and bound these noise terms using the tail bound of the Gaussian variable. There are also some error terms due to the expectation operation over the data distribution. With the intuition that $n_l$, $l \in [4]$, cannot be far from $n/4$, we bound these error terms. After obtaining the upper bounds of the above noise terms, we characterize the landscape of $L$ by using Matrix Eigenvalue Perturbation Theory (Kahan 1975). We summarize the properties of $L$ when $w_1$ is around $\widetilde{w}_1^*$ as follows.
1. $\frac{\partial L}{\partial w_1}\big|_{w_1 = \widetilde{w}_1^*}$ is very close to 0.
2. $L$ is $\mu$-strongly convex and $L_m$-smooth. Specifically, we show that $\widetilde{\mu} - \epsilon_1 \le \mu \le \widetilde{\mu}$ and $\widetilde{L}_m \le L_m \le \widetilde{L}_m + \epsilon_1$, where $\epsilon_1$ is a small term related to $\rho$ and $d$.
3. $L$ has an $L_H$-Lipschitz continuous Hessian.

Combining these properties, we use the exact version of the Inverse Function Theorem to prove the existence of the local minimum $(w_1^*, w_2^*)$ of $L$. Finally, we show that with good initialization, specifically initialization around the local minimum, $w_1(0)$ converges to $w_1^*$ using the gradient descent algorithm (Bubeck et al. 2015). We remark that the above process only analyzes $w_1$; we can obtain $w_2$ through symmetry.

Proof sketch of SL. The proof sketch of SL is similar to the proof for the linear SL model in Liu et al. (2021). However, because of the nonlinear SL model in this paper, we need to perform finer scaling to get a high-probability guarantee.

Different activation function. We can easily extend the results to the case where the activation function is tanh because sigmoid can be viewed as a compressed version of tanh. To get results similar to Theorem 1, we just need to modify the region of the local minimum and the initialization region. For $\widetilde{D}_1(\tau)$, we change the range of $x^{(1)}$ from $[3.1, 3.9]$ to $[2.7, 3.1]$ and the range of $x^{(2)}$ from $[9, +\infty)$ to $[6.1, +\infty)$ to obtain $\widetilde{D}_1^{\sigma_2}(\tau)$. For $D_1(\tau)$, we change the range of $x^{(1)}$ from $(3.1, 3.9)$ to $(2.7, 3.1)$ and the range of $x^{(2)}$ from $(8.5, 9)$ to $(5.75, 6.1)$ to obtain $D_1^{\sigma_2}(\tau)$. With a similar process, we can get $\widetilde{D}_2^{\sigma_2}(\tau)$ and $D_2^{\sigma_2}(\tau)$.

Figure 3: Experiment results of the SSL model with $d = 10$, $\tau = 7$. (a) Final weight matrix $W$; (b) learning curve; (c) the projection of $e_2$.
Figure 4: The experiment results with the correct sign.

5 Simulation Experiments

In this section, we illustrate the correctness of Theorem 1 and Theorem 2 experimentally. We conduct experiments for the nonlinear SSL model in Sec. 5.1 and Sec. 5.2. Furthermore, we show the training process of the nonlinear SL model with the projection matrix $F$ in Sec. 5.3. In this section, we choose $\tau = 7$, $d = 10$, $\rho = 1/d^{1.5}$, $\alpha = \frac{1}{800}$, $n = d^2$ and learning rate $\eta = 0.001$ if we do not specify otherwise. Experiments are averaged over 20 random seeds, and we show the average results with a 95% confidence interval for learning curves.

Figure 5: SSL final weight matrix $W$ with $d = 10$, $\tau = 3$.

5.1 SSL Model: the Correctness of Theorem 1

In this part, we validate the correctness of Theorem 1 by strictly following the settings of the theorem. Define $T_1 = 4000$ as the number of iterations of the SSL model; a sketch of this training run is given below.
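Under these settings ($\tau = 7$, $d = 10$, $\rho = 1/d^{1.5}$, $\alpha = 1/800$, $\eta = 0.001$, $T_1 = 4000$), a plain gradient descent run can be sketched as follows, reusing the hypothetical `sample_toy_dataset` and `ssl_loss` helpers from the earlier sketches; the initialization mimics $D_1(\tau) \times D_2(\tau)$, but the particular values and the manual update loop are our own choices, not the released code.

```python
import torch

torch.manual_seed(0)
d, tau, alpha, eta, T1 = 10, 7.0, 1 / 800, 1e-3, 4000
rho, n = d ** (-1.5), d ** 2

X_np, _ = sample_toy_dataset(n=n, d=d, tau=tau, rho=rho)        # hypothetical helper (Sec. 3.1 sketch)
X = torch.tensor(X_np, dtype=torch.float32)

# Initialize (w1, w2) inside D1(tau) x D2(tau): coordinates 3..d in (-3/d^0.49, 3/d^0.49),
# first coordinate in +-(3.1, 3.9), and tau * second coordinate in (8.5, 9).
W = torch.empty(2, d).uniform_(-3 / d ** 0.49, 3 / d ** 0.49)
W[0, 0], W[1, 0] = 3.5, -3.5
W[:, 1] = 8.75 / tau
W.requires_grad_(True)

for t in range(T1):
    loss = ssl_loss(W, X, alpha=alpha, rho=rho)                  # hypothetical helper (Sec. 3.2 sketch)
    loss.backward()
    with torch.no_grad():
        W -= eta * W.grad
    W.grad.zero_()

print(W.detach()[:, :3])   # per Theorem 1, w1 and w2 should be nearly mirror images across the e2-axis
```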
Fig. 3a shows the learning result of the weight matrix $W = [w_1, w_2]^\top$. The blue points are the learning results of $(w_1^{(1)}(T_1), w_1^{(2)}(T_1))$ and the red stars are the learning results of $w_2$. It is clear that $w_1(T_1)$ and $w_2(T_1)$ are almost symmetric about the $e_2$-axis, which is consistent with the theoretical result (Fig. 2). Fig. 3b shows the learning process of $w_1$ and $w_2$. Because we initialize $w_1(0)$ and $w_2(0)$ around the local minimum, $w_1$ and $w_2$ easily converge to $(w_1^*, w_2^*)$. Fig. 3c shows the projection of $e_2$ onto the space spanned by $w_1(T_1)$ and $w_2(T_1)$. We find that the projection is almost 1. These experimental results show that $W$ learns $e_1$ and $e_2$ at the same time. The results for larger $\tau$ are similar to the results for $\tau = 7$.

5.2 SSL Model: Results Beyond Analysis

In this part, we relax the requirements of Theorem 1, such as that $(w_1, w_2)$ must be initialized near $(w_1^*, w_2^*)$ and that $\tau$ must be large. We show that the SSL model still learns the label-related and hidden features even if these requirements are relaxed.

Good enough initialization. In Theorem 1, we initialize $w_1$ and $w_2$ around $(w_1^*, w_2^*)$. We experimentally show that only the correct initialization sign ($w_1^{(1)}(0) > 0$, $w_2^{(1)}(0) < 0$, $w_1^{(2)}(0) > 0$, $w_2^{(2)}(0) > 0$) is required. Fig. 4 shows that if the initialization sign is correct, the SSL model converges to $(w_1^*, w_2^*)$ with high probability. "With high probability" means there are still a few cases where the SSL model cannot converge to $(w_1^*, w_2^*)$. However, compared with the learning results of the SL model (Fig. 6c), the SSL model with the correct sign still shows the ability to learn $e_2$.

Large enough $\tau$. In the proof of Theorem 1, we need $\tau = \max\{7, d^{\frac{1}{10}}\}$ to use the monotonicity of the solution of $\frac{\partial\widetilde{L}}{\partial w_1^{(2)}} = 0$. We experimentally show that the SSL model can get a good result even if $\tau$ does not meet this requirement. Fig. 5 shows that even if $\tau = 3$, the space spanned by $w_1$ and $w_2$ is still very close to the space spanned by $e_1$ and $e_2$.

5.3 SL Experiment Results

In Theorem 2, we mainly focus on the performance of the feature extractor $W^{SL}$ and ignore the projection matrix $F$. In this section, we experimentally show that even if $F$ is considered, $W^{SL}$ still only learns the label-related feature. Specifically, we consider the binary cross-entropy loss function
$$\min_{W^{SL}, F} \widetilde{L}_{SL} = -\frac{1}{n}\sum_{i=1}^{n} \Big[ y_i \ln(\hat{y}_i) + (1 - y_i)\ln(1 - \hat{y}_i) \Big] + \beta\|W^{SL}\|_F^2 + \gamma\|F\|_2^2,$$
where $\hat{y}_i = \sigma(F\sigma(W^{SL}x_i))$, $i \in [n]$, $\beta$ is the coefficient of the $W^{SL}$ regularizer, and $\gamma$ is the coefficient of the $F$ regularizer. In this section, we choose $\beta = \gamma = 1/800$. Define $T_2 = 8000$ as the number of iterations of the nonlinear SL model; a sketch of this run is given at the end of this section.

Figure 6: Experiment results of the SL model with $d = 10$, $\tau = 7$. (a) Learning curve; (b) the projection of $e_1$; (c) the projection of $e_2$.

Fig. 6a shows the learning curve of $(\widetilde{w}_1^{SL}, \widetilde{w}_2^{SL})$. It is clear that $\widetilde{w}_1^{SL(1)}(T_2)$ and $\widetilde{w}_2^{SL(1)}(T_2)$ are the dominant terms, and the other terms $\widetilde{w}_j^{SL(k)}$, $j \in [2]$, $k \in [2, d]$, converge to 0. Fig. 6b and Fig. 6c show the projections of $e_1$ and $e_2$ onto the space spanned by $\widetilde{w}_1^{SL}(T_2)$ and $\widetilde{w}_2^{SL}(T_2)$. It is clear that the projection of $e_1$ is almost 1, and the projection of $e_2$ is almost 0. The above experiment results mean that the nonlinear SL model can only learn the label-related feature $e_1$, which is consistent with the results of Theorem 2.

All experiments are conducted on a desktop with an AMD Ryzen 7 5800H with Radeon Graphics (3.20 GHz) and 16 GB memory. The code for this section is available at https://github.com/wanshuiyin/AAAI-2023-The-Learnability-of-Nonlinear-SSL.
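For completeness, here is a sketch of the supervised run of Sec. 5.3 (binary cross-entropy plus the two regularizers, $\beta = \gamma = 1/800$, $T_2 = 8000$), again reusing the hypothetical `sample_toy_dataset` and `projection_norm` helpers from the earlier sketches; it is an illustrative reconstruction, not the code released with the paper.

```python
import numpy as np
import torch

torch.manual_seed(0)
d, tau, beta, gamma, eta, T2 = 10, 7.0, 1 / 800, 1 / 800, 1e-3, 8000
X_np, y_np = sample_toy_dataset(n=d ** 2, d=d, tau=tau)          # hypothetical helper
X = torch.tensor(X_np, dtype=torch.float32)
y = torch.tensor(y_np, dtype=torch.float32)

W_sl = (0.1 * torch.randn(2, d)).requires_grad_()                # feature extractor W^SL
F = (0.1 * torch.randn(1, 2)).requires_grad_()                   # projection matrix F

for t in range(T2):
    y_hat = torch.sigmoid(torch.sigmoid(X @ W_sl.T) @ F.T).squeeze(1)
    loss = torch.nn.functional.binary_cross_entropy(y_hat, y) \
           + beta * (W_sl ** 2).sum() + gamma * (F ** 2).sum()
    loss.backward()
    with torch.no_grad():
        W_sl -= eta * W_sl.grad
        F -= eta * F.grad
    W_sl.grad.zero_(); F.grad.zero_()

e1, e2 = np.eye(d)[0], np.eye(d)[1]
W_np = W_sl.detach().numpy()
# Theorem 2 and Fig. 6 predict a projection near 1 for e1 and near 0 for e2.
print(projection_norm(e1, W_np), projection_norm(e2, W_np))      # hypothetical helper
```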
6 Conclusion

Summary. Our paper is the first to analyze the data representation learnability of the nonlinear SSL model by analyzing the learning results of the neural network. We start with a 1-layer nonlinear SSL model and use GD to train this model. We prove that the model converges to a local minimum. Further, we accurately describe the properties of this local minimum and prove that the nonlinear SSL model can capture label-related features and hidden features at the same time. In contrast, the nonlinear SL model only learns label-related features. This conclusion shows that even though the nonlinear network significantly improves the learnability of the SL model, the SSL model still has a superior ability to capture important features compared with the SL model. We verify the correctness of the results through simulation experiments.

Due to the nonconvexity of the objective function and the noise terms, we propose a new analysis process to describe the properties of the local minimum. This analysis process is divided into two steps. In the first step, we focus on the structure of $L$ by ignoring all noise terms. Then we obtain the approximate region of the local minimum. In the second step, we use the exact version of the Inverse Function Theorem as a bridge to connect the simplified objective function $\widetilde{L}$ and $L$. Finally, we prove the existence of the local minimum $(w_1^*, w_2^*)$ and describe the properties of this local minimum. Compared with linear SSL models, nonlinear alternatives are closer to the state-of-the-art SSL methods. The conclusions in this paper can guide us further in understanding the learning results of SSL methods and provide a theoretical basis for subsequent improvements.

Future work. This paper analyzes a 1-layer nonlinear SSL model. Next, we plan to expand the scope of the analysis to a multi-layer nonlinear network. The multi-layer network analysis requires a more refined exploration of local minima. The weight matrix of each layer needs to be uniformly processed to analyze the landscape of the objective function, which we will do in follow-up work.

Acknowledgments

The corresponding author Shuai Li is supported by National Key Research and Development Program of China No. 2020AAA0107600, National Natural Science Foundation of China (62006151, 62076161) and Shanghai Sailing Program. The author Bo Jiang is supported by National Natural Science Foundation of China (62072302).

References

Allen-Zhu, Z.; Li, Y.; and Song, Z. 2019. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, 242–252. PMLR.
Arora, S.; Khandeparkar, H.; Khodak, M.; Plevrakis, O.; and Saunshi, N. 2019. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.
Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, R. 1993. Signature verification using a siamese time delay neural network. Advances in Neural Information Processing Systems, 6.
Brutzkus, A.; and Globerson, A. 2017. Globally optimal gradient descent for a convnet with gaussian inputs. In International Conference on Machine Learning, 605–614. PMLR.
Bubeck, S.; et al. 2015. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4): 231–357.
Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; and Joulin, A. 2020. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33: 9912–9924.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597–1607. PMLR.
Chen, X.; and He, K. 2021. Exploring Simple Siamese Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, 15750–15758. Computer Vision Foundation / IEEE.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Du, S.; Lee, J.; Li, H.; Wang, L.; and Zhai, X. 2019. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, 1675–1685. PMLR.
Du, S. S.; Jin, C.; Lee, J. D.; Jordan, M. I.; Singh, A.; and Poczos, B. 2017. Gradient descent can take exponential time to escape saddle points. Advances in Neural Information Processing Systems, 30.
Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. 2020. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33: 21271–21284.
HaoChen, J. Z.; Wei, C.; Gaidon, A.; and Ma, T. 2021. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34: 5000–5011.
HaoChen, J. Z.; Wei, C.; Kumar, A.; and Ma, T. 2022. Beyond separability: Analyzing the linear transferability of contrastive representations to related subpopulations. arXiv preprint arXiv:2204.02683.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
Jing, L.; Vincent, P.; LeCun, Y.; and Tian, Y. 2021. Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348.
Kahan, W. 1975. Spectra of nearly Hermitian matrices. Proceedings of the American Mathematical Society, 48(1): 11–17.
Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; and Krishnan, D. 2020. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33: 18661–18673.
Lee, J. D.; Lei, Q.; Saunshi, N.; and Zhuo, J. 2021. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34: 309–323.
Li, Y.; and Liang, Y. 2018. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31.
Li, Y.; and Yuan, Y. 2017. Convergence analysis of two-layer neural networks with ReLU activation. Advances in Neural Information Processing Systems, 30.
Liu, H.; HaoChen, J. Z.; Gaidon, A.; and Ma, T. 2021. Self-supervised learning is more robust to dataset imbalance. arXiv preprint arXiv:2110.05025.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9.
Rudin, W. 1976. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York.
Tian, Y. 2017. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, 3404–3413. PMLR.
Tian, Y. 2022a. Deep contrastive learning is provably (almost) principal component analysis. arXiv preprint arXiv:2201.12680.
Tian, Y. 2022b. Understanding the Role of Nonlinearity in Training Dynamics of Contrastive Learning. arXiv preprint arXiv:2206.01342.
Tian, Y.; Chen, X.; and Ganguli, S. 2021. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, 10268–10278. PMLR.
Tian, Y.; Yu, L.; Chen, X.; and Ganguli, S. 2020. Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578.
Tosh, C.; Krishnamurthy, A.; and Hsu, D. 2021. Contrastive estimation reveals topic posterior information to linear models. Journal of Machine Learning Research, 22(281): 1–31.
Wang, Y.; Wang, H.; Shen, Y.; Fei, J.; Li, W.; Jin, G.; Wu, L.; Zhao, R.; and Le, X. 2022. Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 4238–4247. IEEE.
Wen, Z.; and Li, Y. 2021. Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, 11112–11122. PMLR.
Wu, A.; Wang, C.; Pino, J.; and Gu, J. 2020. Self-supervised representations improve end-to-end speech translation. arXiv preprint arXiv:2006.12124.
Zhang, X.; Yu, Y.; Wang, L.; and Gu, Q. 2019. Learning one-hidden-layer ReLU networks via gradient descent. In The 22nd International Conference on Artificial Intelligence and Statistics, 1524–1534. PMLR.
Zhong, Y.; Tang, H.; Chen, J.; Peng, J.; and Wang, Y.-X. 2022. Is Self-Supervised Learning More Robust Than Supervised Learning? arXiv preprint arXiv:2206.05259.