# On the Power of Foundation Models

Yang Yuan 1,2,3

1 IIIS, Tsinghua University. 2 Shanghai Artificial Intelligence Laboratory. 3 Shanghai Qi Zhi Institute. Correspondence to: Yang Yuan.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

With infinitely many high-quality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for everything? This question cannot be answered by the existing theory of representation, optimization or generalization, because the issues they mainly investigate are assumed to be nonexistent here. In this paper, we show that category theory provides powerful machinery to answer this question. We have proved three results. The first one limits the power of prompt-based learning, saying that the model can solve a downstream task with prompts if and only if the task is representable. The second one says fine tuning does not have this limit, as a foundation model with the minimum required power (up to symmetry) can theoretically solve downstream tasks for the category defined by the pretext task, with fine tuning and enough resources. Our final result can be seen as a new type of generalization theorem, showing that the foundation model can generate unseen objects from the target category (e.g., images) using the structural information from the source category (e.g., texts). Along the way, we provide a categorical framework for supervised and self-supervised learning, which might be of independent interest.

1. Introduction

Foundation models have recently exhibited remarkable proficiency in addressing a myriad of complex downstream tasks that were very difficult or impossible for previous methods (Ramesh et al., 2021; Rombach et al., 2022; Ramesh et al., 2022; Sohl-Dickstein et al., 2015; Brown et al., 2020; Radford et al., 2018; 2019; He et al., 2022). Although they differ in network structure, dataset and training algorithm, these foundation models are similar in terms of the training process: the model is first trained with a large unlabeled dataset on a pretext task, and then applied to other downstream tasks with the parameters frozen. Freezing the parameters is necessary because training a foundation model is very expensive, and the downstream tasks usually have limited data samples, which makes retraining less attractive.

Given a set of downstream tasks, which of them are solvable by the foundation models, and which are not? This is a very fundamental question. The existing theory attacks this question through various perspectives like dataset quality and quantity, computational power, network structure, etc. However, as we collect trillions of data points, build gigantic GPU data centers, and optimize huge networks with billions of parameters, a simple question pops up: where are we heading? Specifically, with infinitely many high-quality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for every downstream task? As we will see, this question is not about model representation, optimization or generalization, but about the structural representation of the tasks.
Category theory, as the theory of mathematical structures, is the ideal tool for answering this question.

For the downstream tasks, there are mainly two types of methods: prompt tuning and fine tuning. Prompt tuning does not train on the downstream task. Instead, it only sends a task-specific prompt to the model, so that the model can switch its working mode for solving the task. Fine tuning trains a small network connected to the foundation model with the labeled dataset of the downstream task. In Theorem 1, we show that with prompt tuning, the model can solve the task if and only if the task is representable in the category defined by the pretext task. If the category does not have complicated structures, our theorem indicates that the power of prompt tuning is limited. For example, in Corollary 1, we show that by training with the pretext task of predicting the rotation of an image (Gidaris et al., 2018), the foundation model is not able to solve complicated downstream tasks like segmentation or classification.

On the other hand, the results on fine tuning are more promising. Our Theorem 2 proves that a foundation model with the minimum required power (up to symmetry) for the pretext task, given enough resources including training data, can potentially solve any downstream task for the category defined by the pretext task. The role of the pretext task is crucial, in the sense that if the pretext task fails to extract adequate information from the unlabeled dataset, the power of fine tuning remains restricted.

Along the way, we have provided a categorical framework for machine learning. Interestingly, the framework injects the learning perspective into category theory as well. Therefore, we also prove a generalization theorem for structural learning (Theorem 3), which explains why self-supervised learning for text-image generation tasks can generate images like an avocado chair, which do not exist in the dataset or the real world. Theorem 3 can be easily generalized to a compositional theorem over multiple categories; see Theorem 4.

Unlike most machine learning theory papers, our paper does not make any assumptions on the data distribution or network structure. Instead, we take the bird's-eye view that is model oblivious, and only focus on the structure defined by the pretext task. It is indeed possible that by designing a special network, one may get a more powerful model with better performance. However, we stick with our setting because:

Empirically, people do not customize network structures for different tasks. Instead, they tend to use similar structures like ResNet (He et al., 2016) or Transformer (Vaswani et al., 2017). By the no free lunch theorem (Shalev-Shwartz & Ben-David, 2014), if the model does not contain task-specific prior information, it will not be able to completely solve all the tasks. In other words, the standard models are not universally competent, and it is likely that the limitation we derived in the model-oblivious setting also applies to practical settings.

Pretext task design is a central problem in self-supervised learning. Currently, there are various datasets floating around and a wide range of diverse tasks to solve, but practitioners do not really know what kind of pretext task to pick for solving a given task, or the limitations of each pretrained model. They get the intuition by trial and error with experiments, which are both expensive and noisy.
Our framework will help them think about this problem in a more mathematical and systematic way, and provide guidance for better pretext task design.

2. Related Work

2.1. Self-supervised learning

Self-supervised learning. Recently, researchers have proposed many self-supervised learning algorithms for foundation models, including contrastive methods (Chen et al., 2020; He et al., 2020; Grill et al., 2020; Chen & He, 2021; Noroozi & Favaro, 2016; Zbontar et al., 2021), masked image models (He et al., 2022; Dosovitskiy et al., 2020; Doersch et al., 2015; Pathak et al., 2016), masked language models (Devlin et al., 2018; Raffel et al., 2020), pure language models (Brown et al., 2020; Radford et al., 2018; 2019), and other pretext tasks (Oord et al., 2018; Gidaris et al., 2018; Clark et al., 2020; Noroozi & Favaro, 2016; Pathak et al., 2017).

Multimodal learning. Self-supervised learning can also be applied to multimodal learning, including text + image (Ramesh et al., 2021; Rombach et al., 2022; Ramesh et al., 2022; Sohl-Dickstein et al., 2015) and video + audio (Arandjelovic & Zisserman, 2018). For generating images in multimodal tasks, diffusion models are the state-of-the-art approach (Rombach et al., 2022; Sohl-Dickstein et al., 2015; Dhariwal & Nichol, 2021; Ho et al., 2020).

Prompt tuning. There are various prompt tuning methods, including discrete prompts (Brown et al., 2020; Jiang et al., 2020; Shin et al., 2020; Gao et al., 2020) and continuous prompts (Liu et al., 2021; Li & Liang, 2021).

2.2. Theory for deep learning and self-supervised learning

Theory for deep learning has been an active research area recently. Optimization theory focuses on how and why first-order methods like stochastic gradient descent find the local/global optimum of neural networks (Du et al., 2019b; Arora et al., 2019a; Du et al., 2018; Allen-Zhu et al., 2019b;a; Zou et al., 2020; Li & Yuan, 2017). Generalization theory focuses on how the performance of the model on the training set transfers to the population distribution (Allen-Zhu et al., 2019a; Arora et al., 2019a; Bartlett et al., 2017; 2020; Yin et al., 2019). Representation theory focuses on the representation power of neural networks (Hornik et al., 1989; Cybenko, 1989; Raghu et al., 2017). There are also theoretical results analyzing various aspects of reinforcement learning (Du et al., 2019a;c; Jin et al., 2018; Cai et al., 2020).

Theory for self-supervised learning. There are many interesting theoretical results for self-supervised learning (Wen & Li, 2021; 2022; Luo et al., 2022; HaoChen et al., 2021; Arora et al., 2019b; Tosh et al., 2021; Lee et al., 2021; Zimmermann et al., 2021). For example, HaoChen et al. (2021) show that SimCLR is essentially computing spectral graph clustering on the unlabeled dataset. Liu (2022) gives a better rectified flow algorithm with elegant theoretical guarantees for improving diffusion models.

Application of category theory. Category theory has been applied to many research areas (Fong & Spivak, 2018), including physics (Marquis, 2008; Kuś & Skowron, 2019), design (Censi, 2015), and machine learning (Shiebler et al., 2021; Mahadevan, 2022a;b). The most relevant paper on category theory might be (Bradley et al., 2022), where they provide an enriched category theory of natural language.

3. Preliminaries

Category theory is used in almost all areas of mathematics.
Here we only introduce the notions necessary for understanding the results of our paper, and skip many important details (e.g., universes and diagrams). Curious readers may check Mac Lane (2013); Riehl (2017); Adámek et al. (1990) for more comprehensive introductions.

A category C has a set of objects Ob(C), and a set of morphisms Hom_C(X, Y) from X to Y for every X, Y ∈ Ob(C). Given f ∈ Hom_C(X, Y), g ∈ Hom_C(Y, Z), we define their composition as g ∘ f ∈ Hom_C(X, Z). Notice that ∘ is associative, i.e., (h ∘ g) ∘ f = h ∘ (g ∘ f). For every X ∈ Ob(C), there exists a unique identity morphism id_X ∈ Hom_C(X, X). A morphism f : X → Y is an isomorphism if there exists g : Y → X such that f ∘ g = id_Y and g ∘ f = id_X. In this case, we say X and Y are isomorphic and write X ≅ Y.

Given a category C, we define its opposite C^op by setting Ob(C^op) = Ob(C) and Hom_{C^op}(X, Y) = Hom_C(Y, X). Moreover, given f ∈ Hom_{C^op}(X, Y), g ∈ Hom_{C^op}(Y, Z), the new composition is g ∘^op f = f ∘ g ∈ Hom_{C^op}(X, Z).

We define Set to be the category of sets, where the objects are sets, and Hom_Set(X, Y) is the set of all functions with domain X and codomain Y. Notice that we ignore the subtleties about the universe for better presentation, so here we simply assume that Set does not contain strange objects like a set containing all sets.

A functor is like a function between two categories. Given two categories C, C′, a functor F : C → C′ maps objects from C to C′ with F : Ob(C) → Ob(C′) and morphisms from C to C′ with F : Hom_C(X, Y) → Hom_{C′}(F(X), F(Y)) for all X, Y ∈ C, so that F preserves identity and composition. Formally, we have F(id_X) = id_{F(X)} for all X ∈ C, and F(g ∘ f) = F(g) ∘ F(f) for all f : X → Y, g : Y → Z. We call F an embedding functor if it is injective on morphisms. Notice that if F is an embedding, it is also injective on objects. We call F a full functor if for any X, Y ∈ C, the map Hom_C(X, Y) → Hom_{C′}(FX, FY) is surjective.

A morphism of functors, also called a natural transformation, is the way to transform functors while preserving the structure. Given two categories C, C′ and two functors F1, F2 from C to C′, a morphism of functors θ : F1 → F2 consists of a morphism θ_X : F1(X) → F2(X) for every X ∈ C such that for all f : X → Y in Hom_C(X, Y), we have θ_Y ∘ F1(f) = F2(f) ∘ θ_X. We write F1 ≅ F2 if there exists an isomorphism between F1 and F2. A functor F : C → C′ is an isomorphism of categories if there exists G : C′ → C such that G ∘ F(X) = X and F ∘ G(Y) = Y for all X ∈ C, Y ∈ C′, and similarly for the morphisms. In this case, we say C and C′ are isomorphic and write C ≅ C′.

A category A is a subcategory of a category B if

1. Ob(A) ⊆ Ob(B),
2. for each A, A′ ∈ Ob(A), Hom_A(A, A′) ⊆ Hom_B(A, A′),
3. for each A-object A, the B-identity on A is the A-identity on A,
4. the composition law in A is the restriction of the composition law in B to the morphisms of A.

Moreover, A is a full subcategory of B if it is a subcategory of B and for each A, A′ ∈ Ob(A), Hom_A(A, A′) = Hom_B(A, A′). Every subcategory A of a category B naturally defines an inclusion functor E : A ↪ B, which is an embedding. For such embeddings, we have the following lemma.

Lemma 1 (Adámek et al. (1990)). A functor F : C → B is a full embedding if and only if there exists a full subcategory A of B with inclusion functor E : A ↪ B and an isomorphism G : C → A with F = E ∘ G.
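The definitions above are compact, so here is a minimal, hypothetical Python sketch (not from the paper) that stores a finite category as plain data and checks whether a candidate map between two such categories preserves identities and composition, i.e., is a functor. All names (FiniteCategory, is_functor, the toy objects) are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class FiniteCategory:
    """A finite category given by explicit data (a toy encoding)."""
    objects: set
    homs: dict          # (X, Y) -> set of morphism names; missing keys = empty set
    identity: dict      # X -> name of id_X
    compose: dict       # (g, f) -> name of g after f, for every composable pair

    def hom(self, X, Y):
        return self.homs.get((X, Y), set())

def is_functor(obj_map, mor_map, C, D):
    """Check that (obj_map, mor_map) preserves identities and composition."""
    for X in C.objects:
        if mor_map[C.identity[X]] != D.identity[obj_map[X]]:
            return False
    for (g, f), gf in C.compose.items():
        if D.compose.get((mor_map[g], mor_map[f])) != mor_map[gf]:
            return False
    return True

# Toy example: two objects and one non-identity arrow a: X -> Y.
C = FiniteCategory(
    objects={"X", "Y"},
    homs={("X", "X"): {"id_X"}, ("Y", "Y"): {"id_Y"}, ("X", "Y"): {"a"}},
    identity={"X": "id_X", "Y": "id_Y"},
    compose={("id_X", "id_X"): "id_X", ("id_Y", "id_Y"): "id_Y",
             ("a", "id_X"): "a", ("id_Y", "a"): "a"},
)
# The identity functor from C to itself passes the check.
print(is_functor({"X": "X", "Y": "Y"},
                 {"id_X": "id_X", "id_Y": "id_Y", "a": "a"}, C, C))  # True
```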
3.1. Reproducing Kernel Hilbert Space (RKHS)

Given two objects X, Y ∈ C, consider a feature map f : C → H, where the feature space H is usually much larger than C. We may define a kernel k that measures the similarity of X and Y as k(x, y) ≜ ⟨f(x), f(y)⟩_H, i.e., the inner product between the two objects after mapping them to the feature space. Any vector T ∈ H also corresponds to a function T(·) : C → R, defined as T(x) = ⟨T, f(x)⟩_H. Specifically, f(y) as a vector in H also represents the function k(·, y) : C → R, because for any x ∈ C, we have k(x, y) = ⟨f(x), f(y)⟩_H. Formally, we have:

Definition 1 (Reproducing kernel Hilbert space). Let H be a Hilbert space of R-valued functions defined on a non-empty set C. A function k : C × C → R is called a reproducing kernel of H, and H is a reproducing kernel Hilbert space (RKHS), if k satisfies

∀x ∈ C, k(·, x) ∈ H;
∀x ∈ C, ∀f ∈ H, ⟨f, k(·, x)⟩_H = f(x).

4. Categorical Framework of Supervised Learning

In supervised learning, we have a population distribution D = (D_X, D_Y), representing the ground-truth distribution of the input and output data points. The training set (X_train, Y_train) and test set (X_test, Y_test) are uniformly sampled from (D_X, D_Y). We hope to learn a function f : X → Y so that f(x) accurately predicts the label of each input x. We also define a loss function L(f, x, y) to measure the distance between the prediction f(x) and the correct label y, which is hopefully close to 0. The loss function on a dataset (X, Y) is L(f, X, Y) ≜ E_{(x,y)∼(X,Y)} L(f, x, y). The task of supervised learning is to minimize the population loss L_population ≜ E_{(X,Y)∼(D_X,D_Y)} L(f, X, Y), with access to the training set (X_train, Y_train).

Using the language of category theory, we have two categories X and Y, with objects x, y as input and output data points, respectively. To avoid confusion, below we switch notation from x, y to X, Y to represent the objects, as we will not use D_X, D_Y any more. The population distribution can be seen as a functor F_p from X to Y, assigning the correct label Y to each input X. Due to the inherent noise in the real world, there may not always exist a unique correct label for each input X; in other words, the Bayesian optimal solution does not give zero population loss. We will discuss this issue in Section 7.3, and for now let us simply assume that the unique correct labels always exist. With this formulation, the training/test set can be seen as samples over objects in X with correct labels in Y. Therefore, supervised learning investigates the following question: can we learn a functor F_p with samples of X ∈ X and F_p(X) ∈ Y?

Consider the special case where both categories are discrete, meaning that the only existing morphisms are identity morphisms like id_X ∈ Hom_X(X, X) and id_Y ∈ Hom_Y(Y, Y). In other words, both categories are sets without any morphisms between different objects. In this case, learning F_p is impossible, because it maps a set to another set without any prior knowledge. The no free lunch theorem tells us that unless the sample size is larger than half of the set size, the functor we learn will have constant generalization error with constant probability. Generalization theory deals with this problem by assuming that F_p is in a predefined hypothesis class H, or close to a function in H. In category theory, we do not add restrictions to the functors, but to the structure of the categories instead. For example, if we know X has a linear structure, and F_p preserves the structure, it is possible to learn F_p with a few samples.
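To illustrate the last point, here is a hypothetical toy sketch (an assumed setup, not the paper's): both the input and output categories are cyclic, carrying a single "shift" morphism s : i → i+1 (mod n), and any functor that maps the shift to the shift is determined by its value on one object, so a single labeled sample already fixes the prediction on every other object.

```python
# Toy illustration: structure plus structure preservation replaces sample size.
# Objects are 0..n-1; the only non-identity morphism is the shift s: i -> i+1 (mod n).
# A functor F with F(s) = s must satisfy F(i + 1) = F(i) + 1 (mod n).

n = 12

def extend_from_one_sample(x0, y0):
    """Recover the whole structure-preserving functor from one labeled pair."""
    offset = (y0 - x0) % n
    return {i: (i + offset) % n for i in range(n)}

F = extend_from_one_sample(3, 7)          # a single training example: 3 -> 7
# The recovered functor commutes with the shift on every object.
assert all(F[(i + 1) % n] == (F[i] + 1) % n for i in range(n))
print(F[0], F[11])                        # predictions on unseen objects: 4 3
```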
This formulation may seem useless, as it does not even characterize the loss function, which is crucial in supervised learning. This is because category theory takes the bird's-eye view. Our categorical framework will not replace the classical framework, or generate better supervised learning algorithms. Instead, it treats the existing supervised learning algorithms as a subroutine or a building block, which can be incorporated into a bigger picture. Therefore, it does not care about the loss functions or the optimization process, which are treated as implementation details. Instead, it focuses on the structure of the categories and functors, and tries to understand whether certain functors are learnable or not. It also investigates whether the ability of mastering one or more tasks can be generalized to other tasks.

5. Categorical Framework of Self-supervised Learning

5.1. Pretext tasks and morphisms

In self-supervised learning, we have a population distribution D of data points without labels, and try to extract useful information from D by setting up pretext tasks. As summarized in Section 2, there are different kinds of pretext tasks, including contrastive methods, masked image/language models, pure language models, etc. The model trained with the pretext task can be seen as a feature extractor f : D → H, mapping the input x to its feature representation f(x).

Using the language of category theory, we have a category C, where each object in C is a data point X ∈ D. For X, Y ∈ C, Hom_C(X, Y) is defined by the pretext task. More specifically, all the existing pretext tasks can be seen as building morphisms between two objects in C (the contrastive case is sketched in code after this list):

Contrastive methods. There are different variants of contrastive methods, but all of them essentially modify X ∈ C to get two semantically similar copies X′, X′′ ∈ C as the positive pair, and sometimes pick Y ∈ C as the negative sample; the pretext task says f(X′), f(X′′) should be close and both of them should be far away from f(Y). As observed by HaoChen et al. (2021), when using the spectral contrastive loss for these pairs, the pretext task is exactly doing spectral clustering in C. Specifically, the spectral graph is defined as follows: for X, Y ∈ C with X ≠ Y, X and Y are connected if they form a positive pair. Ideally, if the graph can be perfectly decomposed into k components, where k is the dimension of the embedding space, the category we obtain has the property that Hom_C(X, Y) ≠ ∅ if and only if X, Y are in the same component. More generally, all the objects in C will be mapped to the embedding space, so that their pairwise similarity matches the probability of being sampled together in the pretext task. In other words, Hom_C(X, Y) encodes Pr{(X, Y) gets sampled}.

Masked image/language models. Contrary to common belief, masking a portion of X ∈ C and then predicting the masked part is not doing prediction on X itself, because X and the masked X are not the same object. We can treat the masked X as a new object of C, by extending the definition of C to include the masked objects as well. Therefore, masked image/language models define morphisms as Hom_C(X, Y) ≠ ∅ if X is a masked version of Y.

Pure language models. Each object X ∈ C is a sentence of length at most N (the length restriction is not important, as the category can contain infinitely many objects), and Hom_C(X, Y) ≠ ∅ if and only if Y is a sentence successive to X that differs by the last word. This is a simplified version without probabilities for better presentation; we will discuss the full version in Section 7.2.
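As a concrete and purely illustrative instance of the contrastive case, the following sketch takes hypothetical positive pairs, builds the graph components with a union-find structure, and exposes the idealized test "Hom_C(X, Y) ≠ ∅ iff X, Y lie in the same component". The pair data and names are made up.

```python
# Contrastive pretext task as morphisms: positive pairs define a graph on the data;
# in the idealized setting, two objects admit a morphism iff they share a component.

positive_pairs = [("img1", "img1_crop"), ("img1", "img1_flip"),
                  ("img2", "img2_crop"), ("img3", "img3_blur"),
                  ("img2_crop", "img2_gray")]

parent = {}

def find(x):
    """Union-find root lookup with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b in positive_pairs:
    union(a, b)

def has_morphism(x, y):
    """Idealized 'Hom_C(X, Y) is non-empty' test induced by the pretext task."""
    return find(x) == find(y)

print(has_morphism("img1_crop", "img1_flip"))  # True  (same component)
print(has_morphism("img1", "img2_gray"))       # False (different components)
```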
Now we can formally define what we mean by an ideal foundation model.

5.2. Ideal foundation model

Definition 2 (Ideal foundation model). Given a category C defined by a pretext task, a foundation model f : C → H is ideal if there exists a data-oblivious function k : H × H → Set such that for any X, Y ∈ C, k(f(X), f(Y)) = Hom_C(X, Y).

Here, data-oblivious means k is predefined without seeing the data. When Hom_C(X, Y) takes values in R, H can be seen as the RKHS with k as the kernel function. For example, in SimCLR and MoCo, k is the Gaussian kernel for the two object embeddings. In pure language models, it is also the Gaussian kernel for the two sentence embeddings, due to the use of the softmax loss. For masked models, k can be a kernel function in the embedding space between two objects, denoting the probability of the first object being the masked version of the second one.

In Definition 2, the ideal foundation model is not unique. Theoretically, it is possible that the model f has general intelligence, and therefore can do everything even without learning the data in C. In this case, we can make no meaningful arguments about the power of f. Therefore, we should look at the opposite side and ask: is f guaranteed to have some property P, if it is ideal with C defined by a pretext task? This is the main motivation of our paper.

Before moving forward, we would like to point out that Definition 2 does not rule out the possibility that f simply memorizes all possible morphisms. The intuition is that the pretext task is so difficult that even memorizing all the morphisms in C is already powerful enough. Moreover, a simple memorization function will not have an efficient empirical implementation (although category theory does not care about the implementation).

Below we first introduce an important notion called the Yoneda embedding.

Definition 3 (Yoneda embedding h_C). Given any X ∈ C, h_C(X) ≜ Hom_C(−, X), which takes input Y ∈ C and outputs Hom_C(Y, X). Defining Ĉ as the category of functors from C^op to Set, we have Hom_C(−, X) ∈ Ĉ and h_C : C → Ĉ.

Notice that we can define Hom_C(X, −) similarly, which gives another Yoneda embedding k_C. How do we represent Ĉ? Similar to the RKHS, we assume that it is a space equipped with a kernel function k : Ĉ × Ĉ → Set such that k(h_C(X), h_C(Y)) = Hom_C(X, Y). In category theory, this k is denoted Hom_Ĉ. Note that such a space always exists. For example, we can set h_C(X) so that it explicitly includes the object X as the first part of its representation, with the remaining parts representing the other morphisms. By doing that, it suffices to send X to h_C(Y) to get Hom_C(X, Y).

Intuitively, h_C takes X and outputs all relationships of X, so we immediately have the following lemma.

Lemma 2. h_C is ideal.

The next lemma says that h_C is not only ideal, but also has the minimum power, up to symmetry (we say "up to symmetry" because there are other foundation models that are equally powerful, like k_C).

Lemma 3. Given any other ideal foundation model f, all the morphisms in h_C are also contained in f.

Proof. Since f gets all the pretext tasks correct and k is data-oblivious, we know the morphism set Hom_C(X, Y) is stored in f for every X, Y ∈ C. In other words, for every h ∈ Hom_C(X, Y), either h ∈ f(X) or h ∈ f(Y). Since h_C contains exactly all the morphisms of Hom_C(−, −), the lemma follows.

The above two lemmas tell us that we should work with h_C, as it is the weakest ideal model.
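The following hypothetical sketch spells out Lemma 2 on a three-object toy category: h_C(X) is just the table of all hom-sets into X (with the object kept as the first component, as in the construction above), and a data-oblivious kernel reads Hom_C(X, Y) back out, which is exactly the requirement of Definition 2. Objects and morphism names are invented.

```python
objects = ["A", "B", "C"]
hom = {  # Hom_C(X, Y); missing keys mean the empty hom-set
    ("A", "A"): {"id_A"}, ("B", "B"): {"id_B"}, ("C", "C"): {"id_C"},
    ("A", "B"): {"f"}, ("B", "C"): {"g"}, ("A", "C"): {"g.f"},
}

def h_C(X):
    """Yoneda embedding: X -> (X, table of Hom_C(Y, X) for every Y).
    The object itself is kept as the first component of the feature, so the
    kernel below stays data-oblivious, mirroring the construction above."""
    return (X, {Y: hom.get((Y, X), set()) for Y in objects})

def k(feat_X, feat_Y):
    """Data-oblivious kernel: k(h_C(X), h_C(Y)) = Hom_C(X, Y)."""
    X, _ = feat_X
    _, table_Y = feat_Y
    return table_Y[X]

# Definition 2 holds on every pair, so this h_C is ideal for the toy category.
assert all(k(h_C(X), h_C(Y)) == hom.get((X, Y), set())
           for X in objects for Y in objects)
print(k(h_C("A"), h_C("C")))  # {'g.f'}
```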
5.3. Downstream tasks

After the model is trained, we apply it to downstream tasks with two different approaches: prompt tuning and fine tuning. In this section, we investigate the power of both approaches, with different outcomes. We first define what we mean by solving a downstream task.

Definition 4 (Task). A task T is a functor in Ĉ.

Definition 5 (Task solving). We say the model solves a task T if for any input X ∈ C, the model outputs a solution that is isomorphic to T(X).

In category theory, when two objects (or functors) are isomorphic, we treat them as equal. The technical details for making them equal to each other are omitted, as we observe that this is not a problem for modern networks empirically. For example, if there exists an isomorphism f between two objects X and Y, so that f(X) = Y, empirically a deep neural network can easily fit this f. More generally, techniques like rectified flow (Liu, 2022) can fit very complicated isomorphisms between random noise and images, using a neural network with an ODE.

Can h_C solve the task T? To answer this question, we should first learn arguably the most important theorem in category theory, as follows.

Lemma 4 (Yoneda lemma). For A ∈ Ĉ and X ∈ C, Hom_Ĉ(h_C(X), A) ≅ A(X).

Given a task defined as a functor T ∈ Ĉ, prompt tuning means we freeze the parameters of the model and only use a task-specific prompt P (usually in text or image), followed by the actual input X of the task, to get the output T(X). Therefore, the prompt P and the input X are the only two inputs to the model. By Lemma 4, if we directly send h_C(X) and T (as a functor in Ĉ) to g, the fixed head that computes Hom_Ĉ, we have

Hom_Ĉ(h_C(X), T) ≅ T(X).   (1)

That is, g can accurately compute T(X) with these two representations. However, notice that the prompt has to be sent through h_C instead of g, which means we will get h_C(P) instead of T as the input of g. This brings up another important definition in category theory.

Definition 6 (Representable functor). A functor T ∈ Ĉ is representable if there is an isomorphism h_C(X) ≅ T for some X ∈ C. Such an object X is called a representative of T.

Based on this definition, we have the following characterization of prompt tuning.

Theorem 1 (Power of prompt tuning). h_C can solve T with prompt tuning if and only if the task T ∈ Ĉ is representable. When T is representable, the optimal prompt is the representative of T.

Proof. When the task T is representable, we can use its representative X as the prompt for T. By (1), we know h_C can solve T exactly. On the other hand, if the task T is not representable, there exists no object X ∈ C such that h_C(X) ≅ T. Therefore, if we use a task-specific prompt P and get the representation h_C(P), then since h_C(P) ≇ T, we can always find an object Y ∈ C such that h_C(P)(Y) ≇ T(Y). In that case, our model using g and h_C cannot solve T on Y.

Remark. There are some interesting results on continuous prompts (Liu et al., 2021; Li & Liang, 2021), which directly tune the prompt in the feature space of the neural network, so the resulting prompt is not a representation of any real words/sentences. By doing this, it is possible to obtain more power than the representable tasks, but the enhancement depends on the expressive power of the feature space.
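Here is a hypothetical finite-category sketch of how Theorem 1 can be used: a Set-valued task is recorded only by the size of the set it assigns to each object, and we search for a representative among the h_C(X). Comparing cardinalities object-wise is only a necessary condition for a natural isomorphism, but it already suffices to rule tasks out, which is how the theorem is applied in Corollary 1 below.

```python
# Crude representability check on a toy category: a task is promptable only if
# some h_C(X) = Hom_C(-, X) could be isomorphic to it; mismatched hom-set sizes
# immediately rule this out.

objects = ["A", "B", "C"]
hom = {("A", "A"): {"id_A"}, ("B", "B"): {"id_B"}, ("C", "C"): {"id_C"},
       ("A", "B"): {"f"}, ("A", "C"): {"h"}}

def h_C(X):
    return {Y: hom.get((Y, X), set()) for Y in objects}

def find_representative(task_sizes):
    """Return a candidate prompt X with |h_C(X)(Y)| = |T(Y)| for all Y, if any."""
    for X in objects:
        if all(len(h_C(X)[Y]) == task_sizes[Y] for Y in objects):
            return X
    return None

# T1 assigns a singleton to A and the empty set elsewhere ("is this object A?"):
print(find_representative({"A": 1, "B": 0, "C": 0}))   # 'A'  -> promptable
# T2 assigns a singleton to every object; no h_C(X) has that shape here:
print(find_representative({"A": 1, "B": 1, "C": 1}))   # None -> not representable
```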
Below we provide a simple application of Theorem 1.

Corollary 1. For the pretext task of predicting image rotations (Gidaris et al., 2018), prompt tuning cannot solve complicated downstream tasks like segmentation or classification.

Proof. The pretext task of predicting image rotations rotates a given image X ∈ C by four different degrees: 0°, 90°, 180°, and 270°. Therefore, the four objects form a simple group structure with 4 elements. For any object X ∈ C, we know Hom_C(−, X) contains exactly 4 morphisms. Clearly, tasks like segmentation or classification are not representable by objects in C.

Remark. The statement of Corollary 1 is a bit counter-intuitive, because as presented in (Gidaris et al., 2018), the model trained with the rotation pretext task indeed works for downstream tasks like classification and segmentation. However, our definition of task solving (Definition 5) means the model should generate the correct output for every input, so being partially accurate is not considered a success. This also matches the goal of our paper: with infinite resources, can the rotation pretext task be used for complicated downstream tasks? Corollary 1 gives the negative answer. Note that in Theorem 1 we consider the model with the minimum power, while empirically the network structure may contain prior knowledge, like convolutional layers, for some specific downstream tasks. However, this prior knowledge is inherently noisy and hard to characterize, and therefore cannot be relied upon for completely solving the downstream tasks.

The power of prompt tuning is limited, and can be characterized by representable functors. What about fine tuning? In the case of fine tuning, we may use the Kan extension. However, the details of the Kan extension are fairly abstract, so here we only provide one related theorem.

Lemma 5 (Yoneda extension of functors). Let F : C → Set be any functor. Then there exists an extension functor ĥ_C F : Ĉ → Set such that ĥ_C F ∘ h_C ≅ F.

The actual Yoneda extension theorem is more general (see, e.g., Proposition 2.7.1 in Kashiwara & Schapira (2006)): it can be applied to any category A, but here we only use Set to avoid unnecessary technical details of inductive limits. Using Lemma 5, we immediately get the following theorem.

Theorem 2 (Power of fine tuning). Ideally, given enough resources, h_C can solve any task F : C → Set.

Proof. Applying Lemma 5, we know that with enough training data for the downstream task, enough computational power, and a fine-tuning model h that learns ĥ_C F perfectly, we can concatenate h with the foundation model h_C to solve the task F.

The downstream tasks considered in Theorem 2 are based on the structure in C, not the data content in the dataset. As a result, the category defined by (Gidaris et al., 2018) still has a very simple group structure, but with Theorem 2 it is possible to solve more diverse tasks. For example, we can map all the objects to the same output, which cannot be achieved with prompt tuning. Theorem 2 conveys the importance of the pretext task, as more informative pretext tasks create a more structured category C, which further improves the power of fine tuning on Ĉ.

Even with a very informative C, Theorem 2 only provides a raw upper bound with strong assumptions on the resources needed. By contrast, empirically the fine tuning step is usually done with a small network. Therefore, instead of treating it as a strong backup of what people are doing right now, it should be read as exhibiting a future possibility. In other words, it implies that by transforming the objects to the corresponding feature space Ĉ, we have captured all the information encoded in C.
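The contrast between Corollary 1 and Theorem 2 can be made concrete with a hypothetical sketch of the rotation category: every h_C(X) only assigns morphisms to the objects in X's own rotation orbit, so a task that treats all images uniformly matches no h_C(X) and admits no prompt, yet one extra fine-tuned layer on the frozen features computes it immediately. The encoding below is invented for illustration and ignores everything about real images.

```python
# Rotation pretext task as a category: objects are the four rotations of each image;
# hom_size counts the unique rotation morphism relating two objects in one orbit.

images = ["cat", "dog"]
objects = [(img, d) for img in images for d in (0, 90, 180, 270)]

def hom_size(X, Y):
    """|Hom_C(X, Y)|: one rotation morphism when X, Y share an orbit, else 0."""
    return 1 if X[0] == Y[0] else 0

def h_C(X):
    """Frozen 'foundation model' feature: the profile of hom-set sizes into X."""
    return tuple(hom_size(Y, X) for Y in objects)

# No object represents the constant task T(Y) = {*}: every h_C(X) vanishes
# outside X's own orbit, so prompting cannot solve T (Corollary 1's argument).
print(any(all(s == 1 for s in h_C(X)) for X in objects))   # False

# Fine tuning may place any map on top of the feature layer (Theorem 2), and the
# constant task is one line on top of the frozen features.
def fine_tuned_head(feature):
    return "*"

print({fine_tuned_head(h_C(X)) for X in objects})          # {'*'}
```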
For readers who are familiar with machine learning theory, Theorem 2 may look similar to the theory of overparameterization at first glance. However, they analyze different steps of self-supervised learning. Overparameterization analyzes the pretraining step, saying that under certain assumptions, the optimization and generalization error will be very small for the pretext task, as long as the model is big enough and the learning rate is small enough. By contrast, Theorem 2 analyzes the fine tuning step after pretraining. Even if we have successfully pretrained a network on ImageNet with contrastive learning and zero generalization error, it remains unclear whether the model can be used for image segmentation as the downstream task, unless someone verifies it empirically. But Theorem 2 says that as long as the model is ideal, the feature layer contains all the information of C for any downstream task.

6. Multimodal Learning

In the previous section, we have seen how foundation models are related to learning a category defined by a pretext task. What happens if we have multiple different categories? We can first use the pretext tasks to learn different foundation models separately, and then connect these models together in the embedding space. This is exactly how multimodal models like CLIP (Radford et al., 2021), DALL-E 2 (Ramesh et al., 2022), Multilingual CLIP (Carlsson et al., 2022) and AltCLIP (Chen et al., 2022) work.

In this section, we analyze the functors between different categories. Similar to the previous section, we consider the case where the functors are perfectly learned, and investigate the implications under this assumption. This setting hides unnecessary details like how the loss is defined or how the network structure is designed. However, it provides interesting insights into how to connect different categories together, and what to expect after the functor connection. As a starting point, we would like to emphasize that we assume the categories we consider are "natural", which means they are not only self-consistent, but also consistent with each other. For example, when we use texts like white, red, blue or big, small, tiny to describe a chair in the language category, there are corresponding images in the image category. Such connections can be described by functors.

6.1. Generalization Theorem

We first assume that our functor between the embedding spaces learns the object mapping between the two categories perfectly, and then show that the structural information is preserved by this functor. Notice that below we still use the notations h_B, h_C and B̂, Ĉ to denote the foundation models and the corresponding embedding spaces, but our results hold for any other ideal foundation models and their embedding spaces as well.
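Before the formal definition, here is a minimal numerical sketch (synthetic Gaussian features, not the CLIP recipe) of what "connecting two frozen encoders in embedding space" means operationally: fit a map F̂ on paired features so that F̂(h_C(text)) ≈ h_B(image). Definition 7 below states the idealized version of this property.

```python
import numpy as np

# Synthetic alignment of two frozen feature spaces via least squares.
rng = np.random.default_rng(0)
d_text, d_img, n_pairs = 16, 24, 200

h_C = rng.normal(size=(n_pairs, d_text))             # frozen text features
true_map = rng.normal(size=(d_text, d_img))          # unknown ground-truth mapping
h_B = h_C @ true_map + 0.01 * rng.normal(size=(n_pairs, d_img))  # image features

# Fit F_hat so that h_C @ F_hat approximates h_B on the paired data.
F_hat, *_ = np.linalg.lstsq(h_C, h_B, rcond=None)

new_text = rng.normal(size=(1, d_text))              # an unseen text feature
aligned = new_text @ F_hat                           # its predicted image feature
print(np.allclose(aligned, new_text @ true_map, atol=0.1))  # True (up to noise)
```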
Definition 7 (Feature-aligned functor). Given two categories B, C and a full embedding F : C → B, denote the corresponding foundation models by h_B, h_C. A function F̂ : Ĉ → B̂ is feature-aligned with F if for any X ∈ Ob(C), F̂(h_C(X)) ≅ h_B(F(X)).

Theorem 3 (Generalization theorem for structural learning). Consider two categories B, C and a full embedding F : C → B. In the learning scenario, an ideal foundation model h_C for C, together with a feature-aligned functor F̂ : Ĉ → B̂, preserves the structure of C in a full subcategory A of B: for any X, Y ∈ C,

Hom_C(X, Y) ≅ Hom_Â(F̂(h_C(X)), F̂(h_C(Y))).

Moreover, when h_B is available and invertible, we have F(X) ≅ h_B^{-1}(F̂(h_C(X))) for any X ∈ C.

Proof. By Lemma 1, there exists a full subcategory A of B with inclusion functor E, which is isomorphic to C via a functor G, with F = E ∘ G. Therefore, for any X, Y ∈ C,

Hom_C(X, Y) ≅ Hom_A(GX, GY)
            = Hom_B(FX, FY)
            = Hom_A(FX, FY)
            ≅ Hom_Â(h_A(FX), h_A(FY))
            = Hom_Â(h_B(FX), h_B(FY))
            ≅ Hom_Â(F̂(h_C(X)), F̂(h_C(Y))).

The first equality holds as G is an isomorphism. The second equality holds because F is full. The third equality holds as E maps the objects in A to the same objects in B. The fourth equality uses the Yoneda lemma. The fifth equality holds because A is a full subcategory. The last equality holds since F̂ is feature-aligned. If h_B is invertible, we have F(X) = h_B^{-1}(h_B(F(X))) ≅ h_B^{-1}(F̂(h_C(X))).

Theorem 3 is much more powerful than it appears. We call it the generalization theorem because it provides another kind of generalization, different from the existing generalization theory based on stability or Rademacher complexity. It tells us that the structural information of one category can be recovered in the feature space from the structural information of another category, via a feature-aligned functor.

6.2. Application of Theorem 3

In this subsection, we apply Theorem 3 to analyze two models: CLIP and DALL-E 2.

CLIP. The dataset of CLIP contains millions of (image, text) pairs. During pretraining, for each batch of N pairs of data points, CLIP uses an image encoder and a text encoder to get N pairs of embeddings, and learns a function for matching the N pairs out of the N × N possible connections. Since the network used in CLIP is invertible, Theorem 3 immediately gives us the following corollary.

Corollary 2 (Creativity of CLIP). Let C be a category that describes images, which is a subcategory of the language category C′. Let B be the image category. Assume CLIP learns a feature-aligned functor F̂ : Ĉ → B̂ for a full embedding F; then CLIP is able to create new images that can be described in C, but do not exist in the training set.

Remark. Corollary 2 explains why CLIP can draw images like an avocado chair that did not exist previously. The full embedding assumption is crucial; otherwise the embedding computed in A cannot be used for generating images in B.

DALL-E 2. DALL-E 2 is the combination of CLIP and a diffusion model (Rombach et al., 2022; Sohl-Dickstein et al., 2015; Dhariwal & Nichol, 2021; Ho et al., 2020). If CLIP can be seen as the feature-aligned functor between two isomorphic subcategories, why do we need an extra diffusion model? This is because these two categories are not purely isomorphic. When we type "a photo of a dog", there exist millions of different matching images. Therefore, DALL-E 2 modifies the definition of the image category, so that each object is a probability distribution over images. From this perspective, the diffusion model is a generator of images, while CLIP learns the functor from the category of texts to the category of probability distributions over images.

6.3. Compositional Theorem

By applying Theorem 3 multiple times through a list of categories, we immediately get the following theorem.

Theorem 4 (Compositional theorem). Consider a list of categories {B_i}_{i=1}^n and n − 1 full embeddings {F_i}_{i=1}^{n−1}, where F_i : A_i → B_{i+1}, A_1 = B_1, and for i > 1, A_i is the full subcategory of B_i induced by F_{i−1}. Denote the foundation model for B_i by h_{B_i}, and the feature-aligned functors by {F̂_i}_{i=1}^{n−1}. Denote F ≜ F_{n−1} ∘ F_{n−2} ∘ ⋯ ∘ F_1 and F̂ ≜ F̂_{n−1} ∘ F̂_{n−2} ∘ ⋯ ∘ F̂_1. By composing all the F̂_i together, we preserve the structure of B_1 in A_n: for any X, Y ∈ B_1,

Hom_{B_1}(X, Y) ≅ Hom_{Â_n}(F̂(h_{B_1}(X)), F̂(h_{B_1}(Y))).

Moreover, when h_{B_n} is invertible, we have F(X) ≅ h_{B_n}^{-1}(F̂(h_{B_1}(X))) for any X ∈ B_1.

If we want to map objects from B_1 to B_n, but the paired training data between the two categories is limited, Theorem 4 provides an alternative: it suffices to find a path between the two categories and learn the functor for each edge of the path. This is especially useful when some of the categories are already pretrained (like GPT or CLIP). For example, Carlsson et al. (2022) and Chen et al. (2022) use multilingual text pairs to train the functors between various language categories and the English category, which are further connected to the image category using CLIP. By doing that, they easily extend CLIP to multilingual settings.
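A hypothetical numerical sketch of the composition in Theorem 4: with paired data only along a path of categories (say, another language's text, English text, image features), we fit one alignment map per edge and compose them, never seeing direct pairs between the endpoints. The dimensions and synthetic data below are arbitrary.

```python
import numpy as np

# Chain two feature-aligned maps learned edge-by-edge along a path of categories.
rng = np.random.default_rng(1)
d1, d2, d3, n = 8, 12, 16, 300

B1 = rng.normal(size=(n, d1))                     # features in the first category
M12 = rng.normal(size=(d1, d2))                   # ground-truth edge 1 -> 2
M23 = rng.normal(size=(d2, d3))                   # ground-truth edge 2 -> 3
B2, B3 = B1 @ M12, (B1 @ M12) @ M23               # paired features along the path

F12, *_ = np.linalg.lstsq(B1, B2, rcond=None)     # learned from (B1, B2) pairs
F23, *_ = np.linalg.lstsq(B2, B3, rcond=None)     # learned from (B2, B3) pairs
F13 = F12 @ F23                                   # composed map: no direct pairs used

x = rng.normal(size=(1, d1))
print(np.allclose(x @ F13, x @ M12 @ M23, atol=1e-6))   # True
```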
7. Discussion

7.1. Applying to small categories

In our abstract and introduction, we raised the question about an infinitely large model with infinite resources, which makes our theorems look unrealistic. However, these assumptions are purely rhetorical; our real assumption is simply that the model is ideal, i.e., that the model perfectly solves the pretext task. This assumption is much weaker than the infinite one, because it means our theorems can be directly applied to small datasets, which correspond to smaller categories. For instance, consider a dataset containing only 100k sentences, and a model trained to learn the data using a language model where each sentence is connected to its neighboring sentences with certain probabilities. While the sentences or words in the corresponding category C may not be as informative as the language category that humans possess, it is well-defined, and the ideal model is also well-defined. Training an ideal model for a small dataset of this scale is feasible. According to our Theorem 1, such a model can only perfectly solve, via prompting, tasks that are representable in C.

7.2. Modeling probabilities

In Section 5.1, we showed that the pretext task of a pure language model defines the morphisms in a category. However, the formulation we used is over-simplified, without probabilities, and here we provide a better one. In our new category C, every object X ∈ C is a probability measure µ on the space S of sentences of length at most N. When µ concentrates at a single point s ∈ S, it means we have a deterministic sentence s. Otherwise, µ represents a probability distribution over all possible sentences.

Given any sentence s_1 = (w_1, …, w_N), we define a probability distribution ν on the next word w_{N+1} (or character), so that we can define an object Y as the corresponding probability measure on the sentences obtained by concatenating (w_2, …, w_N) with w_{N+1}. By defining X as the object concentrated on the single sentence s_1, we have a morphism from X to Y. We call Y the canonical successor of X, and its probability distribution (on length-N sentences) the canonical successive measure of X.

Given any object Z ∈ C, assume Z puts nonzero probability on objects X_1, …, X_n (each concentrated at a single sentence), with probability measure µ. Denote the canonical successive distribution of X_i by ν_i. In this case, we define the canonical successive distribution µ′ of Z as

µ′(s) = ∫ dν_i(s) dµ(X_i),

i.e., the mixture µ′ = Σ_i µ(X_i) ν_i.

Intuitively, if s_1 is the first sentence of a paragraph, we may use an object with measure concentrated at s_1 to represent it. However, due to the nature of natural language (not dataset noise), the next word of s_1 is not deterministic. So we define s_2, which drops the first word and adds the next word of s_1, and should be represented as a probability distribution (the canonical successive distribution). What about the next word of s_2? Since s_2 is not a realized sentence, we have to enumerate all of its possibilities and use the canonical successive distribution of each of them to define s_3. By doing so, we have defined a category with probabilities, where the pretext task defines its morphisms.
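A toy, fully discrete sketch of this construction: an object is a probability measure over length-N sentences, each sentence has a next-word distribution, and the canonical successive distribution is the mixture displayed above. The bigram table below is made up for illustration.

```python
from collections import defaultdict

# Discrete form of the canonical successive distribution of Section 7.2.
next_word = {                      # nu_s, keyed by the last word of the sentence
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 1.0},
}

def canonical_successor(mu):
    """mu: dict mapping length-N sentences (tuples of words) to probabilities.
    Returns the mixture over shifted sentences weighted by mu(s) * nu_s(w)."""
    succ = defaultdict(float)
    for s, p in mu.items():
        for w, q in next_word[s[-1]].items():
            succ[s[1:] + (w,)] += p * q        # drop first word, append next word
    return dict(succ)

mu = {("the", "cat"): 0.5, ("the", "dog"): 0.5}    # an object of C
print(canonical_successor(mu))
# {('cat', 'sat'): 0.35, ('cat', 'ran'): 0.15, ('dog', 'ran'): 0.5}
```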
7.3. Bayesian optimal classifier

The notion of the Bayesian optimal classifier is widely used in generalization theory for supervised learning. It simply means that due to the inherent noise in the labels, even the best model will not be able to get zero loss on the population distribution. For example, given a blurry image of a dog, it might be difficult to tell its exact breed. However, in our categorical framework, we take the physicist's view of the world, and assume that all the objects and relationships are self-consistent, without any noise. For example, Alaskan Huskies and Siberian Huskies indeed look similar in their photos, but they are fundamentally different breeds, in terms of their origins, size, appearance, temperament and so on. Our framework models these conceptual differences, and treats the photos of the dogs (which contain noise) as the outcome of the last step of diffusion, as discussed in Section 6. For readers familiar with functional programming, the categorical framework can be seen as the pure functional part, which is predictable and precise. By contrast, the process that deals with the actual input and output can be seen as the "side effects", where the notion of the Bayesian optimal classifier comes in.

7.4. Relationship to RKHS

Our categorical framework can be seen as a natural generalization of the RKHS framework: in an RKHS, the kernel function outputs a value in R, while our framework generalizes it to Set. For example, applying the Yoneda lemma to an RKHS, we get the property

∀x ∈ C, ∀f ∈ H, ⟨k(·, x), f⟩_H = f(x),

where f can be seen as a functor in Ĉ, k(·, x) is a representable functor, and ⟨·, ·⟩_H computes Hom_Ĉ. Therefore, in an RKHS the morphisms between two objects are represented by a single real number computed by ⟨·, ·⟩_H. This works perfectly for algorithms like SimCLR, where the relationship between two objects is a real number representing their similarity. However, in practice the relationship between two objects can be much richer than a real number, especially for NLP tasks. In that case, an RKHS is insufficient, and the general Yoneda lemma is necessary.
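The quoted reproducing property can be checked numerically with an explicit finite-dimensional feature map; the sketch below (hypothetical feature map and coefficients) verifies that the inner product of f with the "representable" element k(·, x) evaluates f at x, the R-valued shadow of the Yoneda lemma.

```python
import numpy as np

# Reproducing property with an explicit feature map phi(x) = (1, x, x^2).
phi = lambda x: np.array([1.0, x, x**2])          # feature map C -> H
k = lambda x, y: phi(x) @ phi(y)                  # kernel k(x, y) = <phi(x), phi(y)>

anchors = np.array([0.5, -1.0, 2.0])              # points x_i defining f
alpha = np.array([1.0, -2.0, 0.3])                # coefficients alpha_i

f_vec = sum(a * phi(x) for a, x in zip(alpha, anchors))         # f as a vector in H
f_eval = lambda x: sum(a * k(x, xi) for a, xi in zip(alpha, anchors))

x = 1.7
print(np.isclose(f_vec @ phi(x), f_eval(x)))      # True: <f, k(., x)>_H = f(x)
```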
Acknowledgements

This paper was greatly inspired by the fruitful interactions with the Qianfang (functor) team at the Beijing Academy of Artificial Intelligence. Special thanks are extended to Yue Cao and the anonymous reviewers for their helpful discussions and suggestions. This paper is supported by the Ministry of Science and Technology of the People's Republic of China, the 2030 Innovation Megaprojects Program on New Generation Artificial Intelligence (Grant No. 2021AAA0150000).

References

Adámek, J., Herrlich, H., and Strecker, G. Abstract and concrete categories. Wiley-Interscience, 1990.

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019a.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019b.

Arandjelovic, R. and Zisserman, A. Objects that sound. In Proceedings of the European conference on computer vision (ECCV), pp. 435–451, 2018.

Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332. PMLR, 2019a.

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019b.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017.

Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.

Bradley, T.-D., Terilla, J., and Vlassopoulos, Y. An enriched category theory of language: from syntax to semantics. La Matematica, pp. 1–30, 2022.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pp. 1283–1294. PMLR, 2020.

Carlsson, F., Eisen, P., Rekathati, F., and Sahlgren, M. Cross-lingual and multilingual clip. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 6848–6854, 2022.

Censi, A. A mathematical theory of co-design. arXiv preprint arXiv:1512.08055, 2015.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020.

Chen, X. and He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758, 2021.

Chen, Z., Liu, G., Zhang, B.-W., Ye, F., Yang, Q., and Wu, L. Altclip: Altering the language encoder in clip for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430, 2015.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Du, S., Krishnamurthy, A., Jiang, N., Agarwal, A., Dudik, M., and Langford, J. Provably efficient rl with rich observations via latent state decoding. In International Conference on Machine Learning, pp. 1665–1674. PMLR, 2019a.
Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675–1685. PMLR, 2019b.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Du, S. S., Kakade, S. M., Wang, R., and Yang, L. F. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019c.

Fong, B. and Spivak, D. I. Seven sketches in compositionality: An invitation to applied category theory. arXiv preprint arXiv:1803.05316, 2018.

Gao, T., Fisch, A., and Chen, D. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.

Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.

HaoChen, J. Z., Wei, C., Gaidon, A., and Ma, T. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34:5000–5011, 2021.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738, 2020.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.

Jiang, Z., Xu, F. F., Araki, J., and Neubig, G. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018.

Kashiwara, M. and Schapira, P. Categories and Sheaves. Springer, 2006.

Kuś, M. and Skowron, B. Category Theory in Physics, Mathematics, and Philosophy. Springer, 2019.

Lee, J. D., Lei, Q., Saunshi, N., and Zhuo, J. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021.

Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.

Li, Y. and Yuan, Y. Convergence analysis of two-layer neural networks with relu activation. Advances in neural information processing systems, 30, 2017.

Liu, Q. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022.

Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021.
Luo, Z., Weng, C., Wu, S., Zhou, M., and Ge, R. One objective for all models: self-supervised learning for topic models. arXiv preprint arXiv:2203.03539, 2022.

Mac Lane, S. Categories for the working mathematician, volume 5. Springer Science & Business Media, 2013.

Mahadevan, S. Categoroids: Universal conditional independence. arXiv preprint arXiv:2208.11077, 2022a.

Mahadevan, S. Unifying causal inference and reinforcement learning using higher-order category theory. arXiv preprint arXiv:2209.06262, 2022b.

Marquis, J.-P. From a geometrical point of view: A study of the history and philosophy of category theory, volume 14. Springer Science & Business Media, 2008.

Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69–84. Springer, 2016.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544, 2016.

Pathak, D., Girshick, R., Dollár, P., Darrell, T., and Hariharan, B. Learning features by watching objects move. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2701–2710, 2017.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. OpenAI blog, 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In International conference on machine learning, pp. 2847–2854. PMLR, 2017.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. PMLR, 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

Riehl, E. Category theory in context. Courier Dover Publications, 2017.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

Shiebler, D., Gavranović, B., and Wilson, P. Category theory in machine learning. arXiv preprint arXiv:2106.07032, 2021.

Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., and Singh, S. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Tosh, C., Krishnamurthy, A., and Hsu, D. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pp. 1179–1206. PMLR, 2021.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Wen, Z. and Li, Y. Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, pp. 11112–11122. PMLR, 2021.

Wen, Z. and Li, Y. The mechanism of prediction head in non-contrastive self-supervised learning. arXiv preprint arXiv:2205.06226, 2022.

Yin, D., Kannan, R., and Bartlett, P. Rademacher complexity for adversarially robust generalization. In International conference on machine learning, pp. 7085–7094. PMLR, 2019.

Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 12310–12320. PMLR, 2021.

Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pp. 12979–12990. PMLR, 2021.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Gradient descent optimizes over-parameterized deep relu networks. Machine learning, 109(3):467–492, 2020.