# improving_selfsupervised_learning_by_characterizing_idealized_representations__c5e56c92.pdf Improving Self-Supervised Learning by Characterizing Idealized Representations Yann Dubois, Tatsunori Hashimoto, Stefano Ermon, Percy Liang Stanford University {yanndubs,thashim,ermon,pliang}@stanford.edu Despite the empirical successes of self-supervised learning (SSL) methods, it is unclear what characteristics of their representations lead to high downstream accuracies. In this work, we characterize properties that SSL representations should ideally satisfy. Specifically, we prove necessary and sufficient conditions such that for any task invariant to given data augmentations, desired probes (e.g., linear or MLP) trained on that representation attain perfect accuracy. These requirements lead to a unifying conceptual framework for improving existing SSL methods and deriving new ones. For contrastive learning, our framework prescribes simple but significant improvements to previous methods such as using asymmetric projection heads. For non-contrastive learning, we use our framework to derive a simple and novel objective. Our resulting SSL algorithms outperform baselines on standard benchmarks, including Sw AV+multicrops on linear probing of Image Net. 1 Introduction We study self-supervised learning (SSL), where the goal is to learn representations from minimal supervision, such that simple probes trained on these representations achieve high downstream accuracy. Recently, there has been many different SSL methods achieving impressive empirical results (e.g. Sim CLR [1], Sw AV [2]) using label-preserving augmentations (e.g. cropping or color jittering) as supervision. We dub this setting invariant SSL (ISSL). Despite these empirical successes, it remains unclear how these various SSL methods relate to one another, how to improve them, and how to derive new ones. Our goal is to provide a simple conceptual framework to think about those questions. To derive such a framework, we ask ourselves: what are the ideal requirements that ISSL representations should aim to satisfy? We prove necessary and sufficient requirements to ensure that probes from a specified family, e.g. linear or multi-layer perceptron (MLP), perfectly classify any task that is invariant to desired data augmentations. This complements theoretical work in ISSL [3 8], which analyze specific ISSL algorithms. Our work instead focuses on properties of representations that should serve as a goal for any ISSL algorithm. These ideal properties are: (i) desired predictors should be able to distinguish positive and negative examples from the representation; (ii) the dimensionality of the representation should be sufficiently large; (iii) augmented inputs should map to the same representations. Previous ISSL methods can be seen as approximations of our ideal requirements. Using our requirements, we derive simple improvements to those approximations as well as new ISSL objectives. Our theory thus results in a unifying conceptual ISSL framework, with practical prescriptions including: improvements to existing methods, such as increasing the dimensionality of representations and using asymmetric projections heads, which lead to 5% point gains on Tiny Image Net; a novel non-contrastive ISSL objective that outperforms all baselines, including Sw AV+multicrops, by at least 1% point on linear classification of Image Net; extensions of ISSL algorithms to learn representations that are better suited for non-linear probes. 
36th Conference on Neural Information Processing Systems (Neur IPS 2022). 2 Problem statement 2.1 Invariant Self-Supervised Learning Figure 1: ISSL setting. (Top) 1D inputs, partitioned into 3 equivalence classes (shapes), are encoded by φ into a 2D representation. (Bot. left) 3 - invariant tasks, where labels are the colors. (Bot. right) examples of probes for 2 of the invariant tasks. SSL learns an encoder φ that maps an input x from a finite space X (e.g., 256 256 images) into a representation φ(x) 2 Rd. Given the encoder and a dataset Dt drawn from some task of interest pt(X, Y ), we fit a classifier f from a desired family of probes F. Families of probes are sets of k-ary classifiers for any k 2 such as linear or MLP probes. For clarity, we consider linear probes F until Sec. 5. Supervision for ISSL comes from unlabeled data p X and label-preserving augmentations. Augmentations are ways of sampling positive x, x+ examples that are equivalent for downstream classification tasks. We formalize this using an equivalence relation x x+ that partitions the inputs X into equivalent classes [x] 2 X/ , and we consider the following downstream tasks T whose labelings are deterministic and constant within these classes (we allow stochastic labeling in appendices). Definition 1. The -invariant tasks T , is the set of all input-label distributions pt(X, Y ) such that the labeling pt(Y |X) is deterministic and invariant to , i.e., for all pt 2 T , x, x+ 2 X : x x+ =) arg max pt(y|x) = arg max pt(y|x+). (1) As an illustration, consider the 3 classes (triangle, square, and circle) shown in Fig. 1. Then T consists of all tasks that are predictable from those shapes, e.g., recognizing shapes with vertices (blue/orange in Fig. 1) or recognizing the shape (yellow/red/purple in Fig. 1). Importantly, equivalence classes (here shapes) are different from and essentially refinements of downstream classes Y (here colors). Note that equivalence relations can model arbitrary transformations including cropping and adding Gaussian noise, which contrasts with typical restrictions to augmentations defined by group actions[9 11]. 2.2 Idealized representations for ISSL In this section, we define the optimal encoders, those that induce idealized representations that ISSL should be striving for. Although such encoders exist, they will likely not be learned in practice. Those idealized representations will nevertheless allow us to derive practical algorithms in Sec. 4. Note that our approach to defining ideal representations is to take into account how they will be used downstream, and thus depends on the desired family of probe F and potential invariant tasks T . The goal of ISSL is to learn representations from which (typically simple) probes f 2 F classify well downstream tasks pt 2 T , i.e., they achieve low 0-1 risk Rt(φ, f) := Ept(X,Y )[1[Y 6= f(φ(X)))]]. Ideally, for any task of interest pt 2 T there will be a probe that can classify it perfectly. Definition 2. An encoder φ is population optimal for T , F, denoted as φ 2 Φpop, iff predictors realize the Bayes error on any invariant task, i.e., for all pt 2 T we have inff2F Rt(φ, f) = 0. Population optimality ensures the existence of perfect downstream probes, which would be learned with infinite downstream data. In practice, however, predictors will be trained from finite datasets Dt of possibly small size n with empirical risk minimization. 
When n is small a fitted probe (ERM) ˆf 2 b F(Dt, φ) := arg minf2F |Dt| 1 P x,y2Dt 1[y 6= f(φ(x)))] could be a terrible population predictor even when the underlying encoder is population optimal. Ideally, representations would thus also guarantee that any ERM performs as well as possible for any desired task and dataset size n. This suggests minimizing the following worst-case expected risk over tasks and ERMs. Wn(φ, F, T ) := sup sup ˆ f2b F(Dt,φ) Definition 3. An encoder φ is sample optimal for T , F iff if it is population optimal and minimizes the worst-case expected risk of ERMs for arbitrary sample sizes, i.e., for all n 1 : φ 2 arg min Wn(φ, F, T ). (3) (a) M not predictable (b) M non-lin. predictable (c) M linearly predictable (d) Tasks are not lin. pred. (e) Population optimal Figure 2: Representations are population optimal for linear F iff their dimensionality is sufficiently large and equivalence classes M(X) are linearly classifiable. The first 2 figures show representations from which M(X) is (a) never or (b) only non-linearly classifiable. Although (c) ensures linear classification of M(X), there exist invariant tasks, e.g. (d), that are not classifiable linearly. (e) Linear classifiability of M(X) ensures population optimality iff the dimensionality is at least |X/ | 1. 3 Theoretical framework for linear probes F 3.1 Characterizing optimal encoders for linear ISSL In this section, we characterize all sample-optimal encoders (Def. 3), with simple properties that give insights into the form of the idealized representation. The key for our theory is that any -invariant function g can be written as a composition between some function cg and a maximal invariant M [12], i.e., g = cg M where M : X ! {1, . . . , |X/ |} indexes the equivalence class: for any x, x+ 2 X : x x+ () M(x) = M(x+). (4) To build intuition for the final characterization, let us first discuss population optimal encoders (Def. 2) for unconstrained probes, then linear probes, and finally sample optimality (Def. 3). Unconstrained probes. By Def. 1, labels of downstream tasks pt 2 T are -invariant. Labels can thus be written as some function ct of M, i.e., arg maxy pt(y|X) = ct(M(X)). This shows that M(X) contains all and only information about desired tasks T . If probes are unconstrained, an encoder is population optimal iff M(X) is predictable/classifiable, i.e., there exist an h M such that M(X) = h M(φ(X)). Indeed, this ensures that the probe defined by ct h M can classify the task pt. Linear probes F. The problem with constrained probes is that they might not be able to use the information about desired tasks. In particular, the previous probe might not be linear ct h M 62 F. As an illustration, consider 4 equivalence classes: cats, dogs, oranges, and lemons. Fig. 2b shows a representation from which M(X) is predictable but invariant tasks are not linearly classifiable. In fact, even when M(X) is linearly predictable, i.e., h M 2 F as in Fig. 2c, downstream labels might not be, i.e., ct h M 62 F as shown in Fig. 2d. By standard VC dimension arguments [13, 14], such binary task is linearly predictable if the representation s dimensionality d is one less than the number of equivalence classes |X/ | 1, as in Fig. 2e. Building on this intuition we prove that population optimal encoders are essentially1 those that induce d |X/ | 1 dimensional representations from which M(X) is linearly predictable. 
Figure 3: Invariance of population-optimal encoders is (a) necessary and (b) sufficient to ensure that there exists no bad ERM. Sample optimality. Although population optimality ensures the existence of a perfect linear probe, ERM probes trained on finite samples might still be bad due to generalization issues (Fig. 3 left). Intuitively, one can remove such bad ERMs by mapping equivalent examples to the same representation, i.e., by using invariant encoders. Indeed, this ensures that ERMs that correctly predict one example in an equivalence class also correctly predict all the other ( Fig. 3 right). We prove that such invariance of population optimal encoders is necessary and sufficient for sample optimality. 1The difference with learning theory is that instead of (binary) shatterability of all examples, we want k-ary shatterability of all equivalence classes from representations. The key is that both notions coincide for specific probes (e.g. linear) and invariant encoders, which are necessary for sample optimality. Putting all together gives the following necessary and sufficient properties of sample-optimal encoders. Theorem 1. An encoder φ is sample optimal for -invariant tasks T and linear F if and only if F-predictability of M: there exists a max. invariant M and an f 2 F s.t. M(X) = f(φ (X)); Invariance: φ is -invariant, i.e., for any x, x+ 2 X we have x x+ =) φ (x) = φ (x+); Dimensionality: the effective dimensionality of the representations is at least one less than the number of equivalence classes, i.e., dim(span({φ (x)|x 2 X})) |X/ | 1. 3.2 The impact of augmentations on downstream performance Let us compute the worst-case excess risk Wn(φ, F, T ) of sample optimal encoders and show its dependence on the invariance structure. The key is that since sample optimal encoders are invariant (Theorem 1), ERMs only need to be trained on a single example per equivalence class to perfectly classify all other examples from that class. The risk then depends on the proportion of equivalence classes seen when training the probe, which can be computed in closed form as a function of the number of equivalence classes |X/ | and downstream samples n. For other similar results refer to Appx. B.3. Proposition 2. The worst expected risk of a sample optimal φ for -invariant tasks T and any F is Wn(φ , F, T ) = Prop. 2 shows that fewer equivalence classes, i.e., coarser , leads to better downstream sample efficiency. Of course, the convergence rate will be slower for practical encoders than for sample optimal ones. The result nevertheless suggests that good augmentations should be as strong as possible (induce coarser ) while being label-preserving. Examples of coarse augmentations are those from CLIP [15], which map many images to similar sentences. 4 Practical ISSL objectives for linear probes F The main remaining question is how to practically enforce requirements from Theorem 1. In this section, we derive simple objectives that learn optimal encoders in ideal settings (infinite data, perfect optimizers, universal approximators). In the process, we shed light on why previously proposed ISSL methods work, and how to improve them in practice. Figure 4: s ETF are idealized representations that collapse same-class examples and maximize the angle across classes. Our key insight is that we can learn sample-optimal encoders by jointly training an encoder and a logistic regression to predict M(X) from the representations. 
Indeed, the resulting representations will be characterized by three properties, each of which implies a requirement from Theorem 1 (F-predictability, invariance, effective dimensionality). First, the representations will allow linear predictability of M(X) due to the linearity of logistic regression. Second, the variance of same-class representations will be minimized due to Jensen s inequality. Finally, the effective dimensionality will be maximal. Indeed, logistic regression favors maximal angles between representations of different classes to increase the confidence of the predicted class. Specifically, the representations learned by this joint procedure form a simplex equiangular tight frame (s ETF) as illustrated in Fig. 4 and discussed in the neural collapse literature [16 21] and we prove that those are optimal. More formally, we prove in Prop. 3 that high dimension encoders trained to minimize the following multinomial logistic regression of M(X), dubbed -ISSL log loss, will be sample optimal, LI(φ; p X) := inf w2W1 Ep X w(M(X))>φ(X) m0=1 exp(w(m0)>φ(X)) where w maps classes to weights, e.g., by indexing a weight matrix. Following previous work on neural collapse, we assume for this section that classes are equiprobable p X([x]) = 1/|X/ |, and that classifier s weights and representations are unit-normalized, i.e., w 2 W1 := {1, . . . , |X/ |} ! S and φ 2 Φ1 := X ! S where S denotes the (d 1)-sphere (these assumptions can be relaxed e.g. [22, 23]). We then have the desired relation between ISSL log loss and sample-optimal encoders. Proposition 3. Let p X be a distribution with support X and equiprobable equivalence classes p X([x]) = 1/|X/ |, 8x 2 X. If d |X/ | 1 then any unit-normalized encoder that minimizes the -ISSL log loss φ 2 arg minφ2Φ1 LI(φ; p X) is sample-optimal for -invariant tasks and linear F. Prop. 3 shows that the ISSL log loss is a perfect pretext task in ideal settings. This suggests optimizing the ISSL log loss in practice and provides a formal relation between self-supervised and classical supervised learning. The challenge is that we typically neither have access to the maximal invariant M(X) nor the number of equivalence classes |X/ | required to compute the denominator of Eq. (6). Instead, knowledge about the equivalence structure comes from data augmentations A( X|X) from which we can sample examples that are equivalent to the input X X. Having established our framework, we can re-interpret previous ISSL methods as practical approximations of the ISSL log loss using data augmentations (e.g., Sim CLR, Sw AV, DINO [24], Sim Siam [25]). These approximations are nevertheless suboptimal: none of them learn sample- (nor population-) optimal encoders in idealized settings. By directly deriving ISSL methods from Eq. (6), we will prescribe improvements to previous ISSL objectives that ensure that sample-optimal encoders are learned in idealized settings. We broadly categorize prior methods into two families depending on whether they explicitly select the number of equivalence classes |X/ | or if it is implicitly inferred from augmentations. We call these approaches distillation and contrastive ISSL respectively. For derivations and Pytorch implementation see Appx. C. 4.1 Contrastive ISSL (CISSL) (a) Sim CLR Figure 5: CISSL corresponds to Sim CLR with a single (asymmetric) projection head g0(z+)>z. This ensures linear predictability of downstream tasks, which will be computed by Wt>z. 
Inspired by previous contrastive objectives [1, 26 28], we show how to optimize the ISSL log loss using data augmentations and negative samples to remove the need of knowing M(X) and the number of classes |X/ |. The resulting objective, dubbed CISSL, corresponds to Sim CLR using only a projection head g on one branch. See Fig. 5. This asymmetry is necessary to learn population optimal encoders for linear F. CISSL bypasses the need for M(X) by noticing that it only appears in the ISSL log loss through the class weights w(M(X)). CISSL thus learns a function g mapping equivalent (augmented) inputs X to w(M( X)). Such mapping exists since M is invariant, e.g., g := w M. When augmentations satisfy the Markov Chain X M(X) X, we show that we can replace the class weights w(M(X)) in the ISSL loss by g( X), where g is optimized over. Intuitively, this is because predicting X contains only information about M(X) due to the data processing inequality. The only remaining challenge is computing the denominator of Eq. (6) without summing over the unknown classes X/ . To do so we use ranking conditional noise contrastive estimation (NCE; [29 31]). NCE replaces classification of equivalence classes by classification of positives in a batch X := { X+, X 1 , . . . , X k } where the positive X+ is sampled from the conditional A( X | X), while the k negatives X i come from the marginal A( X) = Ep X[A( X|X)]. Using Monte Carlo (MC) estimates with an unlabeled dataset D i.i.d. p X we get our final empirical CISSL objective ˆLC(φ; D) := inf log exp g( X+)>φ(x) P X02 X exp g( X0)>φ(x) By Prop. 3 and NCE s consistency [31], encoders trained with CISSL φ 2 arg minφ2Φ1 ˆLC(φ; D) are sample optimal for linear F in our ideal setting assumption (|D| ! 1 and unconstrained g) and when X M(X) X forms a Markov Chain. While consistency (and thus optimality) holds for k 1, more negatives k improves statistical efficiency [31]. Typically X, X take value in the same space (e.g. images). If so, we can tie parameters by encoding X, i.e., we can replace g by g0 φ where g0 : Rd ! S is called a projection head. The logits inside of Eq. (7) are then computed by g0(φ( X))>φ(x) as shown in Fig. 5b. This is very similar to Sim CLR s objective g0(φ( X)))>g0(φ(x)) shown in Fig. 5a. The difference is that, by projecting the current representation, Sim CLR learns encoders that are not even population optimal for linear F. In contrast, CISSL learns sample optimal encoders in ideal settings. Intuitively, this is because CISSL trains the representations in the same way as they will be used in downstream tasks pt. Indeed, representations φ(X) will be dotted with the downstream tasks weights Wt>φ(x) to compute logits. In ISSL, representations φ(x) rather than their projections g0(φ(x)) should thus be used in the inner product with ISSL weights w(M(X)) = g0(φ( X))) to compute logits. CISSL thus derives, from first principles, an asymmetric use of projection heads. 4.2 Distillation ISSL (DISSL) Figure 6: DISSL. The teacher (top branch) is trained with the top loss to ensure ˆ M is a maximal invariant r.v. The student (bottom branch) distills the teacher by predicting ˆ M . In practice, CISSL requires contrasting many negatives. To avoid contrastive estimation, we can directly approximate M(X) instead of the weight w(M(X)) so that the denominator of Eq. (6) can be computed exactly. The resulting method, dubbed DISSL (Fig. 6), is a simpler and theoretically-grounded version of non-contrastive losses (e.g. Sw AV, DINO, Sim Siam). 
The challenge and main difference between each of those methods is how they estimate M(X). As M(X) is discrete, we use a conditional categorical distribution q( ˆ M | X) to estimate M(X) with a random variable ˆ M. Replacing terms in the ISSL log loss then gives the following objective. Differences with Eq. (6) are in red. inf w2W1 Ep Xq( ˆ M | X) log sφ,w( ˆ M | X) , sφ,w(m | x) = exp m=1 exp(w(m)>φ(x)) To highlight similarities with previous methods [2, 24, 32] we refer to Eq. (8) as distilling the teacher q( ˆ M | X) into the student sφ,w( ˆ M | X). Importantly, Eq. (8) is exactly the ISSL log loss if the teacher outputs a maximal invariant random variable, i.e., ˆ M = M(X). By Prop. 3, distillation thus learns sample-optimal encoders when the teacher takes at least C |X/ | values and satisfies the following requirements, which correspond to the definition of maximal invariance (Eq. (4)). Deterministic the teacher is a deterministic distribution maxm2{1,...,C} q(m | X) = 1; Invariant the teacher maps positives together x x+ =) q( ˆ M | x) = q( ˆ M | x); Maximal the teacher maps negatives separately x 6 x =) q( ˆ M | x) 6= q( ˆ M | x ). Intuitively, these requirements ensure that the teacher clusters examples by equivalence classes, which will then be classified by the student. There are many ways of enforcing such requirements. In the following, we use information-theoretic quantities (entropies and KL divergences) to jointly train the teacher (online clustering) and student. Such quantities have the advantage of being interpretable and computable in closed form for the categorical distributions given by our teacher and student. Determinism and Invariance. Both properties hold if and only if for any equivalent example x x+ the cross-entropy of the teacher s outputs is minimized Eq( ˆ M | x)[ log q( ˆ M | x+)] = 0. Indeed, zero cross-entropy simultaneously minimizes the conditional entropy H[ ˆ M | x] = 0 (equivalent to determinism) and the KL divergence DKL[q( ˆ M | x)kq( ˆ M | x+)] = 0 (equivalent to invariance). Maximality. A natural way of enforcing maximality is to train the teacher to predict differently negatives x 6 x . This requires accessing batches of negatives, which we wanted to avoid with DISSL. Instead, assume that we know (or have a prior on) the true distribution of equivalence classes p(M(X)). Then a deterministic and invariant teacher is maximal if and only if the KL divergence between its marginal q( ˆ M) = Ep X[q( ˆ M|X)] and the true one is minimized DKL[q( ˆ M)kp(M(X))] = 0. We can thus avoid contrasting negatives using (a prior on) the distribution of equivalence classes. Using a Lagrangian relaxation (with multipliers λ, β) of the teacher s constraints with an MC estimate of the distillation loss (Eq. (8)) we get the following empirical DISSL loss ˆLD(φ; D) := inf q,w λ DKL | {z } Maximality EA( X|x)q( ˆ M | x) β log q( ˆ M | X) | {z } Det. and Inv. + log sφ,w( ˆ M | x) | {z } Distillation Algorithm 1 Batched DISSL Require: head g, weights W, enc. φ, batched inputs X, aug. A, hyp. β, λ. 1: X, X+ sample(A( X | X), 2) 2: q( ˆ M| X) = softmax(g(φ( X)) 3: q( ˆ M| X+) = softmax(g(φ( X+)) 4: s( ˆ M| X+) = softmax(W T φ( X+)) 5: q(M) = batch_avg(q(M| X)) 6: mxml = P m q(m) log q(m) 7: det_inv = P m q(m| X) log q(m| X+) 8: dstl = P m q(m| X) log s(m| X+) 9: return λ mxml β det_inv dstl In ideal settings (|D| ! 1, unconstrained q, known p(M(X))) and for λ, β ! 
1 we show that encoders trained with DISSL are sample optimal for linear F and tasks that are invariant to the equivalences defined by the connected components of the augmentation graph [8]. If the marginal is unknown, we follow the Max Ent [33] and use a uniform prior p(M(X)) = Unif(1, C). The KL in Eq. (9) then corresponds to maximizing entropy H[ ˆ M]. As with CISSL, when X, X are in the same space, we tie parameters by encoding X before the teacher, i.e., q( ˆ M | X) = softmax(g0(φ(X))) for a head g0 : Rd ! RC. Algorithm 1 and Fig. 6 illustrate DISSL with a uniform prior, and a marginal q( ˆ M) estimated from batches. DISSL is one of the many distillation methods that can be derived from our teacher s requirements. These requirements generally lead to a framework for deriving, comparing, and analyzing distillation objectives. In Appx. D we provide a taxonomy of 12 previous SSL methods from this perspective none of which recover sample optimal encoders. Typically, previous methods favor: (i) determinism by sharpening the teacher with a temperature parameter; (ii) invariance by making matching the teacher and student on equivalent inputs q( ˆ M | x+) sφ,w( ˆ M | x); (iii) maximality through optimization tricks (e.g. stop gradients or momentum encoder) to avoid a constant teacher referred to as collapsing [25, 34 36]. Note that avoiding constant collapse q( ˆ M | x0) = q( ˆ M | x) for any x, x0 2 X is insufficient, e.g., mapping half the inputs to a constant is also undesirable. Maximality formalizes the desired requirement: non-equivalent examples should not be mapped together. 5 ISSL for non-linear predictors F + Linear probes F are standard for evaluating representations [27, 37, 38]. However, if the goal is to maximize performance it is natural to consider non-linear predictors F +. The question then becomes how should we learn optimal representations for any F +? In Appx. B.5 we extend our framework to certain non-linear probes, such as neural networks, that separate into a hidden component h and linear layer, i.e., f( ) = W T h( ). We informally summarize the theory and implications below. Theory. There are two main differences between our characterization of optimal encoders for linear and non-linear probes. First, encoders should ensure predictability of M(X) for the desired F +. Second, the dimensionality requirement decreases with the complexity of F +, e.g., for universal F + it is d 1. More generally, the necessary dimension and the sufficient dimension for optimality do not coincide and respectively depend on the predictors VC dimension [13] and memory capacity [39]. Implications. Both theoretical differences lead to direct implications for non-linear ISSL. First, we can decrease the dimensionality when performing ISSL for more complex non-linear probes. Second, we should use a projection head that ensures the predictability of the maximal invariant with using the desired probes F +. Similarly to the linear case, we should match how the representation is used in the ISSL log loss and downstream tasks. We should thus apply the non-linear predictor h in a similar asymmetric way. For CISSL we should thus compute the right hand side of Fig. 5 as g(φ( X))h(φ(x)). For DISSL, we should change the line 4 of Algorithm 1 to s( ˆ M| X+) = softmax(W T h(φ( X+))). 6 Summary of insights and relation to previous work Let summarize our framework s main insights and their relation to previous work. Details in Appx. D. Dimensionality. 
Theorem 1 shows large representation s dimensionality is needed to ensure probes can classify all invariant tasks (note: this is different from the projection s dimensionality analyzed in [1, 40]). This suggests: (i) increasing the dimensionality of the representation; and (ii) ensuring that representations do not live in a lower dimensional subspace. Although the first has surprisingly not been investigated in SSL, there have been recent empirical analyses about the second [41, 42]. Projection heads. Secs. 4 and 5 theoretically show how to choose projection heads, namely, one should be as large as possible while the other should have the architecture of downstream probes. To our knowledge, we are the first to relate the architecture of the probing family and the projection head. Furthermore, our theory suggests why projection heads empirically improve performances [1, 25, 43] in general SSL, i.e., beyond avoiding collapsing in non-contrastive learning [25, 36]. Augmentations. Prop. 2 shows the benefit of using coarse label-preserving augmentations, by proving the exact relation between optimal sample efficiency and the number of equivalence classes. This gives a new theoretical perspective on the use of augmentations that remove a lot of information of the input, which have been suggested to be useful for many different reasons [12, 44 48]. DISSL. In Sec. 4 we derive DISSL to learn optimal encoders in ideal settings, contrary to prior work. DISSL follows recent methods [49 51], in particular Sw AV and DINO, that learn representations by jointly predicting and learning clusters of examples. DISSL/DINO/SWAV each distill a categorical teacher but differ in how they enforce its maximality. DINO does it implicitly through optimization tricks, e.g., exponentially moving averages and stop-gradients. Sw AV explicitly enforces maximality by equiprobably clustering a queue of examples using Sinkhorn s algorithm [52]. DISSL is also explicit but uses a more efficient max entropy regularization (no queue/stop-gradient/internal algorithm). Furthermore, DISSL s student uses a linear projection head to ensure good linear probing (Sec. 4.1). Theoretical SSL. Theorem 1 characterizes optimal encoders for ISSL. This contrasts with standard theoretical work in SSL that typically analyze specific SSL algorithms [3, 5, 6, 8, 53 55] and / or focus on upper-bounding downstream performance [4]. The advantage of our theory is that it provides a simple unifying framework from which we can derive a variety of SSL methods and actionable insights. The downside is that by distancing itself from specific algorithms, the framework does not provide guarantees outside of ideal settings. We thus see our theory as complementary to prior work. 7 Experiments The experiments in the main paper focus on evaluating our frameworks prescriptive implications. For more results see Appx. G. For experimental details see Appx. F and our Git Hub repository. In summary, our experimental results show that: (i) CISSL/DISSL outperform their respective baselines on standard benchmarks as suggested by Sec. 4; (ii) increasing dimensionality improves downstream probing as suggested by Theorem 1; (iii) coarser augmentations improve sample efficiency as in Prop. 2; (iv) smaller ISSL log loss improves downstream performance as suggested by Prop. 3; (v) projection heads should be related to probing family as discussed in Secs. 4 and 5. For the first experiments, we use Tiny Image Net [56], 300 pretraining epochs, and Res Net18s. 
Note that for Res Net18s the dimensionality of the representation (res5avg) is d = 512. For contrastive baselines we use Sim CLR. For distilling baselines we use the best over DINO, Sw AV, Sim Siam. We use standard Tiny Image Net augmentations [34] (color jittering, grayscaling, cropping) with a parameter controlling the probability and strength of augmentations to study the effect of coarsening. Table 1: Our prescriptive implications improves ISSL for linear probes on Tiny Image Net with Res Net18 encoders. 3 seeds. CONTR. DISTIL. BASE. (SIMCLR | DINO) 44.9 0.2 43.5 0.2 OURS (CISSL | DISSL) 45.8 0.0 45.5 0.1 + DIM. " 47.6 0.1 47.9 0.1 + EPOCHS " 48.7 0.1 49.2 0.1 + COARSE AUG. 51.0 0.1 50.6 0.2 CISSL/DISSL improve linear probing Table 1 shows that each of our prescriptions significantly improve distillation and contrastive ISSL. Ours vs Base shows that removing one non-linear projection head (as in Fig. 5) gives a 1% gain in contrastive learning, while our new distillation objective achieves 2% gains compared to the best distillation baseline. dim. " shows that increasing the dimensionality of representation 512 ! 2048 further improves linear probing by 2%. Epochs " and coarse aug. show that training for longer (300 ! 1000 epochs ) and using coarser augmentations both further improve performance by 1 2%. Altogether our framework s insights significantly improve linear probing accuracy: 6% gains for contrastive learning and 7% for distillation. Coarse augmentations improve sample efficiency. Prop. 2 suggests that coarser label-preserving augmentations improve downstream sample efficiency. We test that by training four DISSL encoders with augmentations of varying strengths ([25%, 50%, 100%, 200%] relative to standard augmentations) and evaluating them at different downstream samples sizes. As suggested by the theory, Fig. 7a shows that stronger label-preserving augmentations improve sample efficiency. (a) Augmentations (b) ISSL log loss (c) Dimensionality Figure 7: As suggested by our theory: (a) coarsening augmentations improves sample efficiency; (b) decreasing ISSL log loss improves representations, which can be achieved by longer training; and (c) increasing the representation s dimensionality improves performance for linear F but less so for MLPs F+. Y-axis is Tiny Image Net probing performance. Each point is obtained by sweeping: (a) augmentation strengths for DISSL and downstream sample sizes; (b) CISSL s optimization hyperparameters including epochs; and (c) the dimensionality of CISSL s representations. ISSL log loss correlates with performance. Previous work [57] suggested contrastive losses may not predict probing performance. On the contrary, Prop. 3 shows that minimal ISSL log loss implies the optimality of encoders if d is large. This suggests that, for fixed augmentations, ISSL log loss is highly related to performance. To test this relation, we trained 80 CISSL models with various hyperparameters, while fixing augmentations and negatives k. Fig. 7b shows that the test ISSL loss indeed correlates with probing accuracy (gray points). Appx. G.4 shows similar results for the train ISSL loss. Longer training decreases ISSL log loss. An efficient way of decreasing ISSL log loss is longer training (blue points in Fig. 7b only vary the number ISSL epochs). Prop. 3 thus provides an explanation of the well-known performance gains of longer ISSL training [1, 58]. Namely, longer training decreases ISSL log loss, which results in representations that are closer to optimality. 
Increasing dimensionality improves performance. Theorem 1 shows that linear ISSL requires a dimensionality of d = |X/ | 1 to ensure predictability of all invariant tasks. In practice, the number of equivalence classes is most likely very large. Although such dimensionality is impractical, it does suggest increasing the dimensionality beyond the d = 512 of Res Net18s. To test this, we train CISSL with a larger dimensionality by increasing the output channels before the average pooling layer. To keep the number of parameters comparable, we use bottlenecks or low-rank linear layers before and after the representations layer as detailed in Appx. F. Fig. 7c shows that increasing the dimensionality has a significant effect on downstream performance for linear F (blue). Previous results concern linear F which are standard for ISSL evaluation. In practice, one would likely use more complex probes F + to improve results. In Sec. 5 we discussed how to perform ISSL for F +. In the following, we evaluate ISSL for MLPs F + with two hidden layers of 2048 units. Table 2: MLP probes outperform linear ones on Tiny Image Net when using MLP ISSL. F ISSL F EVAL ALL 10 SHOT - MLP 44.7 0.3 22.0 0.2 LIN. LIN. 45.9 0.0 25.3 0.0 LIN. MLP 46.4 0.1 25.6 0.0 MLP MLP 47.5 0.2 26.1 0.1 Performing ISSL for MLPs F+. Table 2 shows that MLP probes ( F EVAL: MLP ) outperforms linear ones ( F EVAL: lin. ) even in few-shot regimes (10 shot). Furthermore, when predicting with MLP probes it is desirable to train the ISSL encoders for MLPs ( F ISSL: MLP ) instead of linear probes (row 4 vs 3). As discussed in Sec. 5, the only difference between DISSL for different probes is that the student s representation is projected using a head with the same architecture as downstream probes. Dimensionality affects less MLPs F +. In Sec. 5 we saw that the optimal dimensionality d decreases for more complex predictors F +. Fig. 7c indeed shows that larger dimensions result in smaller gains when the training and evaluation probe are MLPs F + (orange) compared to linear ones F (blue). Given encouraging Tiny Image Net results, we investigated whether our objectives can be used as a replacement for standard objectives on Image Net [59]. We specifically replaced Sim CLR with CISSL and Sw AV with DISSL in VISSL s codebase [60] without modifying hyperparameters. Table 3: Our models outperform baselines on Image Net. All models use Res Net50, 100 epochs, 2560 batch size. For our models, we increase the dimensionality of the representations (2048 ! 8192). SIMCLR SWAV BARLOW T. CISSL DISSL 65.1 64.6 66.1 67.7 68.9 Our models outperform baselines on Image Net. Table 3 shows that both CISSL and our novel DISSL objective significantly outperform all considered baselines, including Sw AV by a 4%, when using standard augmentations. For CISSL we found (contrary to Table 1) that gains compared to Sim CLR were mostly due to increasing the dimensionality of the representation. Given encouraging results on standard augmentations, we compared the performance of DISSL with a near SOTA model: Sw AV trained with their special multi-crop augmentations. Table 4: Our DISSL outperforms Sw AV using 2 160 + 4 96 multi-crops on Image Net. 2560 batch size, Res Net50. 100 EPOCHS 400 EPOCHS SWAV 69.5 73.5 DISSL 70.7 74.0 DISSL is competitive with SOTA models at scale. Table 4 shows that DISSL outperforms Sw AV. Combined with Table 3 this suggests that DISSL works well in different settings. 
This is encouraging given the lack of tuning and the simplicity of DISSL s out-of-the-box objective. In contrast, Sw AV requires stopping gradients, storing a queue, running Sinkhorn s algorithm, and freezing certain parameters during initial training steps. Table 5: DISSL is competitive on transfer tasks. Same models as in Table 4 evaluated by linear probes. FOOD CIFAR10 CIFAR100 CARS AIRCRAFTS DTD PETS CALTECH FLOWERS SWAV 75.5 92.0 76.2 58.2 49.1 72.6 86.9 92.0 94.7 DISSL 77.9 93.6 77.6 62.2 48.1 73.9 88.0 91.5 95.3 DISSL is competitive on transfer. Table 5 shows that DISSL generally outperforms Sw AV on the standard transfer benchmarks from [61] even though our theory does not discuss transfer. 8 Summary and Outlook We presented a simple conceptual framework to understand, compare and improve ISSL methods. Inspired by recent work on optimal representations for supervised [62] and robust [63] learning, we derived such a framework by studying algorithmic-agnostic goals for ISSL. In particular, Theorem 1 provides the minimal and sufficient requirements for optimal encoders. On the algorithmic side, Sec. 4 uses our framework to derive simpler and theoretically-motivated variants of prior SSL objectives. Altogether our framework provides actionable insights for ISSL such as how to choose: SSL algorithms, the dimensionality of the representations, projection heads, and augmentations. On the empirical side, Sec. 7 shows that our prescriptions leads to significant gains on linear probing benchmarks. There are many limitations that should be addressed for a more realistic prescriptive framework for ISSL. See Appx. E for more details. First, our theory is binary, e.g., augmentations are either label preserving or not, and encoders are optimal or not. Although our framework likely captures the right intuition, nuance would be more realistic. Second, we consider unconstrained encoders φ, dimensionality, and infinite unlabeled data. As a result, our framework cannot currently improve the computational or data efficiency of the pretraining phase. Finally, we consider all invariant tasks T even though only a subset of those are meaningful. Despite those limitations, we showed that our framework s insights lead to substantial improvements in ISSL performance. Acknowledgments and Disclosure of Funding We thank Ananya Kumar, Shengjia Zhao, Mo Tiwari, Michael Xie, Niladri Chatterji, Shibani Santurkar, Jiaming Song, Ali Malik, Colin Wei, Yangjun Ruan and Chris Maddison for helpful feedback. YD is supported by a Knights-Hennessy Scholarship. The work is partially supported by an Open Philanthropy Project Award, Sloan Fellowship, and ARO (W911NF-21-1-0125). [1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, A simple framework for contrastive learning of visual representations, in International Conference on Machine Learning (ICML), 2020. (Cited on 1, 5, 7, 8, 9, 43, 44, 45, 49) [2] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, Unsupervised learning of visual features by contrasting cluster assignments, in Advances in Neural Information Processing Systems (Neur IPS), 2020. (Cited on 1, 6, 43, 44, 45) [3] N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar, A theoretical analysis of contrastive unsupervised representation learning, in International Conference on Machine Learning (ICML), 2019. (Cited on 1, 8, 37, 42, 43) [4] Y. Bansal, G. Kaplun, and B. 
Barak, For self-supervised learning, rationality implies general- ization, provably, in International Conference on Learning Representations (ICLR), 2021. (Cited on 8, 42) [5] J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo, Predicting what you already know helps: Provable self-supervised learning, in Advances in Neural Information Processing Systems (Neur IPS), 2021. (Cited on 8, 42) [6] C. Tosh, A. Krishnamurthy, and D. Hsu, Contrastive learning, multi-view redundancy, and linear models, in Conference on Algorithmic Learning Theory (ALT), 2021. (Cited on 8, 42) [7] Z. Wen and Y. Li, Toward understanding the feature learning process of self-supervised contrastive learning, in International Conference on Machine Learning (ICML), 2021. (Cited on 42, 45) [8] J. Z. Hao Chen, C. Wei, A. Gaidon, and T. Ma, Provable guarantees for self-supervised deep learning with spectral contrastive loss, in Advances in Neural Information Processing Systems (Neur IPS), 2021. (Cited on 1, 7, 8, 39, 42, 43) [9] T. Leen, From data distributions to regularization in invariant learning, in Advances in Neural Information Processing Systems (Neur IPS), 1994. (Cited on 2) [10] S. Chen, E. Dobriban, and J. H. Lee, A group-theoretic framework for data augmentation, in Advances in Neural Information Processing Systems (Neur IPS), 2020. (Cited on 43) [11] C. Lyle, M. van der Wilk, M. Kwiatkowska, Y. Gal, and B. Bloem-Reddy, On the benefits of invariance in neural networks, ar Xiv preprint ar Xiv:2005.00178, 2020. (Cited on 2, 43) [12] Y. Dubois, B. Bloem-Reddy, K. Ullrich, and C. J. Maddison, Lossy compression for lossless prediction, Advances in Neural Information Processing Systems (Neur IPS), 2021. (Cited on 3, 8, 21, 22, 25, 37, 42, 43, 45) [13] V. N. Vapnik and A. Y. Chervonenkis, On uniform convergence of the frequencies of events to their probabilities, Teoriya Veroyatnostei i ee Primeneniya, vol. 16, no. 2, pp. 264 279, 1971. (Cited on 3, 7, 32, 34, 35) [14] C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data mining and knowledge discovery, vol. 2, no. 2, pp. 121 167, 1998. (Cited on 3, 25, 32) [15] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, Learning transferable visual models from natural language supervision, in International Conference on Machine Learning (ICML), 2021. (Cited on 4, 21, 43) [16] V. Papyan, X. Y. Han, and D. L. Donoho, Prevalence of neural collapse during the terminal phase of deep learning training, Proceedings of the National Academy of Sciences, vol. 117, pp. 24 652 24 663, 2020. (Cited on 4, 43) [17] T. Ergen and M. Pilanci, Revealing the structure of deep neural networks via convex duality, in International Conference on Machine Learning (ICML), 2021. [18] J. Lu and S. Steinerberger, Neural collapse under cross-entropy loss, Applied and Computa- tional Harmonic Analysis, vol. 59, pp. 224 241, 2022. (Cited on 34, 43) [19] F. Graf, C. Hofer, M. Niethammer, and R. Kwitt, Dissecting supervised contrastive learning, in International Conference on Machine Learning (ICML), 2021. [20] J. Zarka, F. Guth, and S. Mallat, Separation and concentration in deep networks, in Interna- tional Conference on Learning Representations (ICLR), 2021. [21] W. E and S. Wojtowytsch, On the emergence of simplex symmetry in the final and penulti- mate layers of neural network classifiers, in Mathematical and Scientific Machine Learning Conference (MSML), 2021. (Cited on 4, 34) [22] Z. Zhu, T. 
Ding, J. Zhou, X. Li, C. You, J. Sulam, and Q. Qu, A geometric analysis of neural collapse with unconstrained features, in Advances in Neural Information Processing Systems (Neur IPS), 2021. (Cited on 4, 34, 42, 43) [23] W. Ji, Y. Lu, Y. Zhang, Z. Deng, and W. J. Su, An unconstrained layer-peeled perspective on neural collapse, in International Conference on Learning Representations (ICLR), 2022. (Cited on 4, 34, 42, 43) [24] M. Caron, H. Touvron, I. Misra, H. J egou, J. Mairal, P. Bojanowski, and A. Joulin, Emerging properties in self-supervised vision transformers, in International Conference on Computer Vision (ICCV), 2021. (Cited on 5, 6, 43, 44, 45) [25] X. Chen and K. He, Exploring simple siamese representation learning, in Conference on Computer Vision and Pattern Recognition (CVPR), 2021. (Cited on 5, 7, 8, 43, 44, 45) [26] A. Mnih and K. Kavukcuoglu, Learning word embeddings efficiently with noise-contrastive estimation, in Advances in Neural Information Processing Systems (Neur IPS), 2013. (Cited on 5) [27] A. van den Oord, Y. Li, and O. Vinyals, Representation learning with contrastive predictive coding, ar Xiv preprint ar Xiv:2110.02796, 2019. (Cited on 7, 43, 45) [28] I. Misra and L. van der Maaten, Self-supervised learning of pretext-invariant representations, in Conference on Computer Vision and Pattern Recognition (CVPR), 2020. (Cited on 5) [29] M. Gutmann and A. Hyvärinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, in Artificial Intelligence and Statistics (AISTATS), 2010. (Cited on 5, 37) [30] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, Exploring the limits of language modeling, ar Xiv preprint ar Xiv:1602.02410, 2016. (Cited on 37) [31] Z. Ma and M. Collins, Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency, in Empirical Methods in Natural Language Processing, 2018. (Cited on 5, 37) [32] Z. Fang, J. Wang, L. Wang, L. Zhang, Y. Yang, and Z. Liu, SEED: Self-supervised distillation for visual representation, in International Conference on Learning Representations (ICLR), 2021. (Cited on 6) [33] E. T. Jaynes, Information theory and statistical mechanics, Physical Review, vol. 106, no. 4, pp. 620 630, 1957. (Cited on 7) [34] A. Ermolov, A. Siarohin, E. Sangineto, and N. Sebe, Whitening for self-supervised represen- tation learning, in International Conference on Machine Learning (ICML), 2021. (Cited on 7, 8, 44, 45, 48) [35] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, Momentum contrast for unsupervised visual representation learning, in Conference on Computer Vision and Pattern Recognition (CVPR), 2020. (Cited on 44) [36] J. B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, Bootstrap Your Own Latent - a new approach to self-supervised learning, in Advances in Neural Information Processing Systems (Neur IPS), 2020. (Cited on 7, 8, 44, 45) [37] Y. Bengio, A. C. Courville, and P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798 1828, 2013. (Cited on 7, 43) [38] R. Zhang, P. Isola, and A. A. Efros, Colorful image colorization, in European Conference on Computer Vision (ECCV), 2016. (Cited on 7, 43) [39] T. M. 
Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Transactions on Electronic Computers, vol. EC-14, no. 3, pp. 326 334, Jun. 1965. (Cited on 7, 34, 35) [40] J. Zbontar, L. Jing, I. Misra, Y. Le Cun, and S. Deny, Barlow Twins: Self-supervised learning via redundancy reduction, in International Conference on Machine Learning (ICML), 2021. (Cited on 7, 44) [41] T. Hua, W. Wang, Z. Xue, S. Ren, Y. Wang, and H. Zhao, On feature decorrelation in self- supervised learning, in International Conference on Computer Vision (ICCV), 2021. (Cited on 7, 42) [42] L. Jing, P. Vincent, Y. Le Cun, and Y. Tian, Understanding dimensional collapse in contrastive self-supervised learning, in International Conference on Learning Representations (ICLR), 2022. (Cited on 7, 42) [43] T. Chen, C. Luo, and L. Li, Intriguing properties of contrastive losses, in Advances in Neural Information Processing Systems (Neur IPS), 2021. (Cited on 8, 43, 45) [44] Y. H. Tsai, Y. Wu, R. R. Salakhutdinov, and L. Morency, Self-supervised learning from a multi-view perspective, in International Conference on Learning Representations (ICLR), 2021. (Cited on 8, 42, 45) [45] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, What makes for good views for contrastive learning? in Advances in Neural Information Processing Systems (Neur IPS), 2020. [46] M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata, Learning robust representations via multi-view information bottleneck, in International Conference on Learning Representations (ICLR), 2020. (Cited on 45) [47] J. Mitrovic, B. Mc Williams, J. Walker, L. Buesing, and C. Blundell, Representation learning via invariant causal mechanisms, in International Conference on Learning Representations (ICLR), 2021. (Cited on 43) [48] M. Wu, C. Zhuang, M. Mosse, D. L. K. Yamins, and N. D. Goodman, On mutual information in contrastive learning for visual representations, ar Xiv preprint ar Xiv:2005.13149, 2021. (Cited on 8, 42) [49] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, Deep clustering for unsupervised learning of visual features, in European Conference on Computer Vision (ECCV), 2018. (Cited on 8, 43, 44) [50] Y. M. Asano, C. Rupprecht, and A. Vedaldi, Self-labelling via simultaneous clustering and representation learning, in International Conference on Learning Representations (ICLR), 2020. (Cited on 43, 44) [51] J. Yang, D. Parikh, and D. Batra, Joint unsupervised learning of deep representations and image clusters, in Conference on Computer Vision and Pattern Recognition (CVPR), 2016. (Cited on 8) [52] R. Sinkhorn and P. Knopp, Concerning nonnegative matrices and doubly stochastic matrices, Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343 348, 1967. (Cited on 8, 44) [53] C. Tosh, A. Krishnamurthy, and D. Hsu, Contrastive estimation reveals topic posterior information to linear models, Journal of Machine Learning Research (JMLR), vol. 22, pp. 281:1 281:31, 2021. (Cited on 8) [54] K. Nozawa and I. Sato, Understanding negative samples in instance discriminative self- supervised representation learning, in Advances in Neural Information Processing Systems (Neur IPS), 2021. (Cited on 42) [55] Y. Tian, X. Chen, and S. Ganguli, Understanding self-supervised learning dynamics without contrastive pairs, in International Conference on Machine Learning (ICML), 2021. (Cited on 8, 43, 45) [56] cs231n, Tinyimagenet, 2015. (Cited on 8, 47) [57] N. Saunshi, J. T. Ash, S. Goel, D. Misra, C. Zhang, S. 
Arora, S. M. Kakade, and A. Krishnamurthy, Understanding contrastive learning requires incorporating inductive biases, in International Conference on Machine Learning (ICML), 2022. (Cited on 9, 42)
[58] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, Unsupervised feature learning via non-parametric instance discrimination, in Conference on Computer Vision and Pattern Recognition (CVPR), 2018. (Cited on 9)
[59] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in Conference on Computer Vision and Pattern Recognition (CVPR), 2009. (Cited on 9, 47)
[60] P. Goyal, Q. Duval, J. Reizenstein, M. Leavitt, M. Xu, B. Lefaudeux, M. Singh, V. Reis, M. Caron, P. Bojanowski, A. Joulin, and I. Misra, VISSL, 2021. (Cited on 9, 48)
[61] S. Kornblith, T. Chen, H. Lee, and M. Norouzi, Why do better loss functions lead to less transferable features? in Advances in Neural Information Processing Systems (NeurIPS), 2021. (Cited on 10, 49)
[62] Y. Dubois, D. Kiela, D. J. Schwab, and R. Vedantam, Learning optimal representations with the decodable information bottleneck, in Advances in Neural Information Processing Systems (NeurIPS), 2020. (Cited on 10, 42, 45)
[63] Y. Ruan, Y. Dubois, and C. J. Maddison, Optimal representations for covariate shift, in International Conference on Learning Representations (ICLR), 2022. (Cited on 10, 42)
[64] M. L. Eaton, Group invariance applications in statistics, Regional Conference Series in Probability and Statistics, vol. 1, pp. i–133, 1989. (Cited on 22)
[65] E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses, 3rd ed., ser. Springer Texts in Statistics. Springer, 2008. (Cited on 22)
[66] C. Zalinescu, Convex Analysis in General Vector Spaces. World Scientific, 2002. (Cited on 25)
[67] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[68] L. Narici and E. Beckenstein, Topological Vector Spaces, 2nd ed. Chapman and Hall/CRC, 2010. (Cited on 25)
[69] A. Brøndsted, An introduction to convex polytopes. Springer Science & Business Media, 1983, no. 90. (Cited on 29)
[70] G. M. Ziegler, Lectures on polytopes. Springer Science & Business Media, 2012, vol. 152.
[71] B. Grünbaum, V. Klee, M. A. Perles, and G. C. Shephard, Convex polytopes. Springer, 1967, vol. 16. (Cited on 29)
[72] S. Straszewicz, Over exposed points of closed point sets, Fundamenta Mathematicae, vol. 24, pp. 139–143, 1935. (Cited on 30)
[73] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1970. (Cited on 30)
[74] S. Shalev-Shwartz and T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, Mathematical Programming, pp. 1–41, 2014. (Cited on 31, 32)
[75] S. Dutta and A. Goswami, Mode estimation for discrete distributions, Mathematical Methods of Statistics, vol. 19, no. 4, pp. 374–384, 2010. (Cited on 33)
[76] P. Flajolet, D. Gardy, and L. Thimonier, Birthday paradox, coupon collectors, caching algorithms and self-organizing search, Discrete Applied Mathematics, vol. 39, no. 3, pp. 207–229, 1992. (Cited on 33)
[77] P. Berenbrink and T. Sauerwald, The weighted coupon collector's problem and applications, in International Computing and Combinatorics Conference (COCOON), 2009. (Cited on 33)
[78] W. Feller, An introduction to probability theory and its applications. John Wiley & Sons, 2008. (Cited on 33)
[79] B. O'Neill, The classical occupancy distribution: Computation and approximation, The American Statistician, vol. 75, no. 4, pp. 364–375, 2021. (Cited on 33)
[80] C. Fang, H. He, Q. Long, and W. J. Su, Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training, Proceedings of the National Academy of Sciences, vol. 118, 2021. (Cited on 34, 42)
[81] D. J. C. MacKay, Information theory, inference, and learning algorithms. Cambridge University Press, 2003. (Cited on 35)
[82] A. Kowalczyk, Dense shattering and teaching dimensions for differentiable families (extended abstract), in Conference on Learning Theory (COLT), 1997. (Cited on 35, 36)
[83] E. D. Sontag, Remarks on interpolation and recognition using neural nets, in Advances in Neural Information Processing Systems (NeurIPS), 1990. (Cited on 35)
[84] E. D. Sontag, Shattering all sets of k points in general position requires (k - 1)/2 parameters, Neural Computation, vol. 9, no. 2, pp. 337–348, 1997. (Cited on 35, 36)
[85] E. B. Baum, On the capabilities of multilayer perceptrons, Journal of Complexity, vol. 4, no. 3, pp. 193–215, 1988.
[86] A. Sakurai, n-h-1 networks store no less n*h+1 examples, but sometimes no more, in International Joint Conference on Neural Networks (IJCNN), vol. 3, 1992.
[87] G. J. Mitchison and R. M. Durbin, Bounds on the learning capacity of some multi-layer networks, Biological Cybernetics, vol. 60, no. 5, pp. 345–365, 1989. (Cited on 35)
[88] T. M. Cover, Capacity problems for linear machines, Pattern Recognition, pp. 283–289, 1968. (Cited on 35, 36)
[89] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian, Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks, Journal of Machine Learning Research (JMLR), vol. 20, pp. 63:1–63:17, 2019. (Cited on 36)
[90] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. (Cited on 36)
[91] A. Kowalczyk, Estimates of storage capacity of multilayer perceptron with threshold logic hidden units, Neural Networks, vol. 10, no. 8, pp. 1417–1433, 1997. (Cited on 36)
[92] C. Yun, S. Sra, and A. Jadbabaie, Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity, in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[93] P. Baldi and R. Vershynin, The capacity of feedforward neural networks, Neural Networks, vol. 116, pp. 288–311, 2019. (Cited on 36)
[94] A. Xu and M. Raginsky, Minimum excess risk in Bayesian learning, IEEE Transactions on Information Theory, 2022. (Cited on 37)
[95] O. Kallenberg, Foundations of modern probability. Springer, 1997. (Cited on 37)
[96] H. Bao, Y. Nagano, and K. Nozawa, Sharp learning bounds for contrastive unsupervised representation learning, arXiv preprint arXiv:2110.02501, 2021. (Cited on 42)
[97] K. Nozawa, P. Germain, and B. Guedj, PAC-Bayesian contrastive unsupervised representation learning, in Conference on Uncertainty in Artificial Intelligence (UAI), 2020.
[98] J. T. Ash, S. Goel, A. Krishnamurthy, and D. Misra, Investigating the role of negatives in contrastive representation learning, in Artificial Intelligence and Statistics (AISTATS), 2022.
[99] P. Awasthi, N. Dikkala, and P. Kamath, Do more negative samples necessarily hurt in contrastive learning? in International Conference on Machine Learning (ICML), 2022. (Cited on 43)
[100] Y. Wang, Y. Zhang, Y. Wang, J. Yang, and Z. Lin, Chaos is a ladder: A new theoretical understanding of contrastive learning via augmentation overlap, in International Conference on Learning Representations (ICLR), 2022. (Cited on 42)
[101] Y. Tian, L. Yu, X. Chen, and S. Ganguli, Understanding self-supervised learning with dual deep networks, arXiv preprint arXiv:2010.00578, 2021. (Cited on 42, 43, 45)
[102] D. G. Mixon, H. Parshall, and J. Pi, Neural collapse with unconstrained features, Sampling Theory, Signal Processing, and Data Analysis, vol. 20, no. 2, pp. 1–13, 2022. (Cited on 42)
[103] T. Wang and P. Isola, Understanding contrastive representation learning through alignment and uniformity on the hypersphere, in International Conference on Machine Learning (ICML), 2020. (Cited on 42)
[104] F. Wang and H. Liu, Understanding the behaviour of contrastive loss, in Conference on Computer Vision and Pattern Recognition (CVPR), 2021. (Cited on 42)
[105] P. H. Richemond, J.-B. Grill, F. Altché, C. Tallec, F. Strub, A. Brock, S. L. Smith, S. De, R. Pascanu, B. Piot, and M. Valko, BYOL works even without batch statistics, arXiv preprint arXiv:2010.10241, 2020. (Cited on 43, 45)
[106] C. Zhang, K. Zhang, C. Zhang, T. X. Pham, C. D. Yoo, and I. S. Kweon, How does SimSiam avoid collapse without negative samples? A unified understanding with self-supervised contrastive learning, in International Conference on Learning Representations (ICLR), 2022. (Cited on 43, 45)
[107] L. Ericsson, H. Gouk, and T. M. Hospedales, Why do self-supervised models transfer? Investigating the impact of invariance on downstream tasks, arXiv preprint arXiv:2111.11398, 2021. (Cited on 43)
[108] A. Foster, R. Pukdee, and T. Rainforth, Improving transformation invariance in contrastive representation learning, in International Conference on Learning Representations (ICLR), 2021. (Cited on 43)
[109] J. von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, and F. Locatello, Self-supervised learning with data augmentations provably isolates content from style, in Advances in Neural Information Processing Systems (NeurIPS), 2021. (Cited on 43)
[110] T. Galanti, A. György, and M. Hutter, On the role of neural collapse in transfer learning, in International Conference on Learning Representations (ICLR), 2022. (Cited on 43)
[111] L. Hui, M. Belkin, and P. Nakkiran, Limitations of neural collapse for understanding generalization in deep learning, arXiv preprint arXiv:2202.08384, 2022. (Cited on 43, 56)
[112] A. Bardes, J. Ponce, and Y. LeCun, VICReg: Variance-invariance-covariance regularization for self-supervised learning, in International Conference on Learning Representations (ICLR), 2022. (Cited on 44)
[113] X. Yan, I. Misra, A. Gupta, D. Ghadiyaram, and D. Mahajan, ClusterFit: Improving generalization of visual representations, in Conference on Computer Vision and Pattern Recognition (CVPR), 2020. (Cited on 44)
[114] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, Learning deep representations by mutual information estimation and maximization, in International Conference on Learning Representations (ICLR), 2019. (Cited on 45)
[115] P. Bachman, R. D. Hjelm, and W. Buchwalter, Learning representations by maximizing mutual information across views, in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[116] Y. Tian, D. Krishnan, and P. Isola, Contrastive multiview coding, in European Conference on Computer Vision (ECCV), 2020. (Cited on 45)
[117] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, On mutual information maximization for representation learning, in International Conference on Learning Representations (ICLR), 2020. (Cited on 45)
[118] Y. Xu, S. Zhao, J. Song, R. Stewart, and S. Ermon, A theory of usable information under computational constraints, in International Conference on Learning Representations (ICLR), 2020. (Cited on 45)
[119] A. Pokle, J. Tian, Y. Li, and A. Risteski, Contrasting the landscape of contrastive and non-contrastive learning, in Artificial Intelligence and Statistics (AISTATS), 2022. (Cited on 45)
[120] L. Bossard, M. Guillaumin, and L. V. Gool, Food-101 – mining discriminative components with random forests, in European Conference on Computer Vision (ECCV), 2014. (Cited on 47)
[121] A. Krizhevsky, Learning multiple layers of features from tiny images, University of Toronto, Tech. Rep., 2009. (Cited on 47, 50)
[122] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, 3D object representations for fine-grained categorization, in ICCV Workshop on 3D Representation and Recognition, 2013. (Cited on 47)
[123] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi, Fine-grained visual classification of aircraft, arXiv preprint arXiv:1306.5151, 2013. (Cited on 47)
[124] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, Describing textures in the wild, in Conference on Computer Vision and Pattern Recognition (CVPR), 2014. (Cited on 47)
[125] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, Cats and dogs, in Conference on Computer Vision and Pattern Recognition (CVPR), 2012. (Cited on 47)
[126] F.-F. Li, M. Andreeto, M. A. Ranzato, and P. Perona, Caltech 101, 2022. (Cited on 47)
[127] M.-E. Nilsback and A. Zisserman, Automated flower classification over a large number of classes, in Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP), 2008. (Cited on 47)
[128] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. (Cited on 47)
[129] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in International Conference on Machine Learning (ICML), 2015, pp. 448–456. (Cited on 47)
[130] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, in International Conference on Learning Representations (ICLR), 2015. (Cited on 47)
[131] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research (JMLR), vol. 12, 2011. (Cited on 49)
[132] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20, pp. 273–297, 1995. (Cited on 49)
[133] L. McInnes, J. Healy, N. Saul, and L. Großberger, UMAP: Uniform manifold approximation and projection, Journal of Open Source Software, vol. 3, no. 29, p. 861, 2018. (Cited on 56)

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes], in Sec. 8, expanded in Appx. E.
   (c) Did you discuss any potential negative societal impacts of your work? [No]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes], in the main paper we mention the important ones succinctly; in Appx. A.2 we discuss each in detail.
   (b) Did you include complete proofs of all theoretical results? [Yes], in Appx. B.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes], all the code to reproduce our results can be found at github.com/YannDubs/Invariant-Self-Supervised-Learning.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes], in Appx. F.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes], for the Tiny ImageNet table. We did not for ImageNet due to computational constraints.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [No]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]