Published as a conference paper at ICLR 2021

SELF-SUPERVISED LEARNING FROM A MULTI-VIEW PERSPECTIVE

Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, Louis-Philippe Morency
Machine Learning Department, Carnegie Mellon University

As a subset of unsupervised representation learning, self-supervised representation learning adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as object detection and image captioning. Many proposed approaches for self-supervised learning naturally follow a multi-view perspective, where the input (e.g., original images) and the self-supervised signals (e.g., augmented images) can be seen as two redundant views of the data. Building on this multi-view perspective, this paper provides an information-theoretical framework to better understand the properties that encourage successful self-supervised learning. Specifically, we demonstrate that self-supervised learned representations can extract task-relevant information and discard task-irrelevant information. Our theoretical framework paves the way to a larger space of self-supervised learning objective design. In particular, we propose a composite objective that bridges the gap between prior contrastive and predictive learning objectives, and introduce an additional objective term to discard task-irrelevant information. To verify our analysis, we conduct controlled experiments to evaluate the impact of the composite objectives. We also explore our framework's empirical generalization beyond the multi-view perspective, where the cross-view redundancy may not be clearly observed.

1 INTRODUCTION

Self-supervised learning (SSL) (Zhang et al., 2016; Devlin et al., 2018; Oord et al., 2018; Tian et al., 2019) learns representations using a proxy objective (i.e., an SSL objective) between inputs and self-defined signals. Empirical evidence suggests that the learned representations can generalize well to a wide range of downstream tasks, even though the SSL objective has not utilized any downstream supervision during training. For example, SimCLR (Chen et al., 2020) defines a contrastive loss (i.e., an SSL objective) between images with different augmentations (i.e., one as the input and the other as the self-supervised signal). One can then use SimCLR as a feature extractor and adopt the features in various computer vision applications, spanning image classification, object detection, instance segmentation, and pose estimation (He et al., 2019). Despite its success in practice, only a few works (Arora et al., 2019; Lee et al., 2020; Tosh et al., 2020) provide theoretical insights into the learning efficacy of SSL.

Our work shares a similar goal of explaining the success of SSL, from the perspectives of information theory (Cover & Thomas, 2012) and multi-view representation [1]. To understand (a subset [2] of) SSL, we start with the following multi-view assumption. First, we regard the input and the self-supervised signal as two corresponding views of the data. Using our running example, in SimCLR (Chen et al., 2020), the input and the self-supervised signal are two differently augmented views of the same image. Second, we adopt a common assumption in multi-view learning: either view alone is (approximately) sufficient for the downstream tasks (see Assumption 1 in prior work (Sridharan & Kakade, 2008)).
The assumption suggests that image augmentations (e.g., changing the style of an image) should not affect the labels of the images, or, analogously, that the self-supervised signal contains most (if not all) of the information that the input has about the downstream tasks. With this assumption, our first contribution is to formally show that the self-supervised learned representations can 1) extract all the task-relevant information (from the input) with a potential loss; and 2) discard all the task-irrelevant information (from the input) with a fixed gap. Then, using a classification task as an example, we are able to quantify the smallest generalization error (the Bayes error rate) given the discussed task-relevant and task-irrelevant information.

[1] The works of Lee et al. (2020) and Tosh et al. (2020) were done concurrently and in parallel, and parts of their assumptions/conclusions are similar to ours. We elaborate on the differences in the related work section.
[2] We discuss the limitations of the multi-view assumption in Section 2.1.

As the second contribution, our analysis 1) connects prior contrastive (Oord et al., 2018; Bachman et al., 2019; Chen et al., 2020; Tian et al., 2019) and predictive learning (Zhang et al., 2016; Vondrick et al., 2016; Tulyakov et al., 2018; Devlin et al., 2018) approaches for SSL; and 2) paves the way to a larger space of composing SSL objectives that extract task-relevant and discard task-irrelevant information simultaneously. For instance, the combination of the contrastive and predictive learning approaches achieves better performance than a contrastive- or predictive-alone objective and suffers less from over-fitting. We also present a new objective to discard task-irrelevant information. The objective can be easily incorporated into prior self-supervised learning objectives.

We conduct controlled experiments on visual (the first set) and visual-textual (the second set) self-supervised representation learning. The first set of experiments is performed when the multi-view assumption is likely to hold. The goal is to compare different compositions of SSL objectives on extracting task-relevant and discarding task-irrelevant information. The second set of experiments is performed when the input and the self-supervised signal lie in very different modalities. Under this cross-modality setting, the task-relevant information may not mostly lie in the shared information between the input and the self-supervised signal. The goal is to examine the SSL objectives' generalization, where the multi-view assumption is likely to fail.

Figure 1: High-level takeaways for our main results using information diagrams. (a) We propose to learn minimal and sufficient representations for self-supervision: minimize H(Z_X|S) to discard task-irrelevant information and maximize I(Z_X; S) to extract task-relevant information. (b) The resulting learned representation Z_X contains all task-relevant information from the input with a potential loss ϵ_info and discards task-irrelevant information with a fixed gap I(X; S|T). (c) Our core assumption: the self-supervised signal is approximately redundant to the input for the task-relevant information.

2 A MULTI-VIEW INFORMATION-THEORETICAL FRAMEWORK

Notations. For the input, we denote its random variable as X, its sample space as 𝒳, and its outcome as x. We learn a representation (Z_X / 𝒵 / z_x) from the input through a deterministic mapping F_X: Z_X = F_X(X).
For the self-supervised signal, we denote its random variable/sample space/outcome as S / 𝒮 / s. The two sample spaces can differ between the input and the self-supervised signal: 𝒳 ≠ 𝒮 in general. The information required for downstream tasks is referred to as the task-relevant information: T / 𝒯 / t. Note that SSL has no access to the task-relevant information. Lastly, we use I(A; B) to represent mutual information, I(A; B|C) to represent conditional mutual information, H(A) to represent entropy, and H(A|B) to represent conditional entropy for random variables A/B/C. We provide high-level takeaways for our main results in Figure 1. We defer all proofs to the Supplementary.

2.1 MULTI-VIEW ASSUMPTION

In our paper, we regard the input (X) and the self-supervised signal (S) as two views of the data. Here, we provide a table showing different X/S in various SSL frameworks:

| Framework | Inputs (X) | Self-supervised Signals (S) |
| --- | --- | --- |
| BERT (Devlin et al., 2018) | Non-masked Words | Masked Words |
| Look & Listen (Arandjelovic & Zisserman, 2017) | Image | Audio Stream |
| SimCLR (Chen et al., 2020) | Image | Same Image with Augmentation |
| Colorization (Zhang et al., 2016) | Image Lightness | Image Color |

We note that not all SSL frameworks realize the inputs and the self-supervised signals as corresponding views. For instance, the Jigsaw puzzle (Noroozi & Favaro, 2016) considers (shuffled) image patches as the input and the positions of the patches as the self-supervised signals. Another example is Learning by Predicting Rotations (Gidaris et al., 2018), which considers an image (rotated by a specific angle) as the input and the rotation angle of the image as the self-supervised signal. We point out that the frameworks that regard X/S as two corresponding views (Chen et al., 2020; He et al., 2019) have much better empirical downstream performance than the frameworks that do not (Noroozi & Favaro, 2016; Gidaris et al., 2018). Our paper hence focuses on the multi-view setting between X and S. Next, we adopt the common assumption (i.e., the multi-view assumption (Sridharan & Kakade, 2008; Xu et al., 2013)) in multi-view learning between the input and the self-supervised signal:

Assumption 1 (Multi-view, restating Assumption 1 in prior work (Sridharan & Kakade, 2008)). The self-supervised signal is approximately redundant to the input for the task-relevant information. In other words, there exists an ϵ_info > 0 such that I(X; T|S) ≤ ϵ_info.

Assumption 1 states that, when ϵ_info is small, the task-relevant information lies mostly in the shared information between the input and the self-supervised signal. We argue this assumption is mild with the following example. For self-supervised visual contrastive learning (Hjelm et al., 2018; Chen et al., 2020), the input and the self-supervised signal are the same image with different augmentations. Image augmentation can be seen as changing the style of an image without affecting its content, and we argue that the information required for downstream tasks should be retained only in the content, not the style. Next, we point out the failure cases of the assumption (i.e., cases with large ϵ_info): the input and the self-supervised signal contain very different task-relevant information. For instance, a drastic image augmentation (e.g., adding large noise) may change the content of the image (e.g., the noise completely occludes the objects).
Another example is BERT (Devlin et al., 2018): with too much masking, downstream information may exist differently in the masked words (i.e., the self-supervised signal) and the non-masked words (i.e., the input). Analogously, too much masking leaves the non-masked words with insufficient context to predict the masked words.

2.2 LEARNING MINIMAL AND SUFFICIENT REPRESENTATIONS FOR SELF-SUPERVISION

We start by discussing supervised representation learning. The Information Bottleneck (IB) method (Tishby et al., 2000; Achille & Soatto, 2018) generalizes minimal sufficient statistics to representations that are minimal (i.e., less complexity) and sufficient (i.e., better fidelity). To learn such representations for downstream supervision, we consider the following objectives:

Definition 1 (Minimal and Sufficient Representations for Downstream Supervision). Let Z_X^sup be the sufficient supervised representation and Z_X^supmin be the minimal and sufficient supervised representation:

    Z_X^sup = argmax_{Z_X} I(Z_X; T)   and   Z_X^supmin = argmin_{Z_X} H(Z_X|T)  s.t.  I(Z_X; T) is maximized.

To reduce the complexity of the representation Z_X, the prior methods (Tishby et al., 2000; Achille & Soatto, 2018) proposed to minimize I(Z_X; X), while ours proposes to minimize H(Z_X|T). We provide a justification: minimizing H(Z_X|T) reduces the randomness from T to Z_X, and randomness is regarded as a form of incompressibility (Calude, 2013). Hence, minimizing H(Z_X|T) leads to a more compressed representation (discarding redundant information) [3]. Note that we do not constrain the downstream task T to be classification, regression, or clustering. Then, we present SSL objectives to learn sufficient (and minimal) representations for self-supervision:

Definition 2 (Minimal and Sufficient Representations for Self-supervision). Let Z_X^ssl be the sufficient self-supervised representation and Z_X^sslmin be the minimal and sufficient self-supervised representation:

    Z_X^ssl = argmax_{Z_X} I(Z_X; S)   and   Z_X^sslmin = argmin_{Z_X} H(Z_X|S)  s.t.  I(Z_X; S) is maximized.

Definition 2 defines our self-supervised representation learning strategy. Now, we are ready to relate the supervised and self-supervised learned representations:

Theorem 1 (Task-relevant information with a potential loss ϵ_info). The supervised learned representations (i.e., Z_X^sup and Z_X^supmin) contain all the task-relevant information in the input (i.e., I(X; T)). The self-supervised learned representations (i.e., Z_X^ssl and Z_X^sslmin) contain all the task-relevant information in the input with a potential loss ϵ_info. Formally,

    I(X; T) = I(Z_X^sup; T) = I(Z_X^supmin; T) ≥ I(Z_X^ssl; T) ≥ I(Z_X^sslmin; T) ≥ I(X; T) − ϵ_info.

[3] We do not claim that H(Z_X|T) minimization is better than I(Z_X; X) minimization for reducing the complexity of the representation Z_X. In the Supplementary, we show that H(Z_X|T) minimization and I(Z_X; X) minimization are interchangeable under our framework's setting.

Figure 2: Remarks on contrastive and predictive learning objectives for self-supervised learning. Between the representation Z_X and the self-supervised signal S, the contrastive objective performs mutual information maximization and the predictive objectives perform log conditional likelihood maximization. We show that the SSL objectives aim at extracting task-relevant and discarding task-irrelevant information. Last, we summarize the computational blocks for practical deployments of these objectives.
When ϵ_info is small, Theorem 1 indicates that the self-supervised learned representations can extract almost as much task-relevant information as the supervised ones. When ϵ_info is non-trivial, however, the learned representations may not always lead to good downstream performance. This result has also been observed in prior work (Tschannen et al., 2019) and InfoMin (Tian et al., 2020), which claim that representations with maximal mutual information may not have the best performance.

Theorem 2 (Task-irrelevant information with a fixed compression gap I(X; S|T)). The sufficient self-supervised representation (i.e., Z_X^ssl) contains more task-irrelevant information from the input than the minimal and sufficient self-supervised representation (i.e., Z_X^sslmin). The latter contains an amount of information, I(X; S|T), that cannot be discarded from the input. Formally,

    I(Z_X^ssl; X|T) = I(X; S|T) + I(Z_X^ssl; X|S, T) ≥ I(Z_X^sslmin; X|T) = I(X; S|T) ≥ I(Z_X^supmin; X|T) = 0.

Theorem 2 indicates that a compression gap (i.e., I(X; S|T)) exists when we discard the task-irrelevant information from the input. To be specific, I(X; S|T) is the amount of shared information between the input and the self-supervised signal excluding the task-relevant information. Hence, I(X; S|T) would be large if the downstream task requires only a portion of the shared information.

2.3 CONNECTIONS WITH CONTRASTIVE AND PREDICTIVE LEARNING OBJECTIVES

Theorems 1 and 2 state that our self-supervised learning strategies (i.e., min H(Z_X|S) and max I(Z_X; S) defined in Definition 2) can extract task-relevant and discard task-irrelevant information. A question emerges: what are the practical aspects of the presented self-supervised learning strategies? To answer this question, we present 1) the connections with prior SSL objectives, especially contrastive (Oord et al., 2018; Bachman et al., 2019; Chen et al., 2020; Tian et al., 2019; Hjelm et al., 2018; He et al., 2019) and predictive (Zhang et al., 2016; Pathak et al., 2016; Vondrick et al., 2016; Tulyakov et al., 2018; Peters et al., 2018; Devlin et al., 2018) learning objectives, showing that these objectives extract task-relevant information; and 2) a new inverse predictive learning objective to discard task-irrelevant information. We illustrate important remarks in Figure 2.

Contrastive Learning (extracts task-relevant information). The contrastive learning objective (Oord et al., 2018) maximizes the dependency/contrastiveness between the learned representation Z_X and the self-supervised signal S, which suggests maximizing the mutual information I(Z_X; S). Theorem 1 suggests that maximizing I(Z_X; S) results in Z_X containing (approximately) all the information required for the downstream tasks from the input X. To deploy the contrastive learning objective, we suggest contrastive predictive coding (CPC) (Oord et al., 2018) [4], which is a mutual information lower bound with low variance (Poole et al., 2019; Song & Ermon, 2019):

    L_CL := max_{Z_S = F_S(S), Z_X = F_X(X), G} E_{(z_{s_1}, z_{x_1}), ..., (z_{s_n}, z_{x_n}) ∼ P^n(Z_S, Z_X)} [ (1/n) Σ_{i=1}^n log ( e^{⟨G(z_{x_i}), G(z_{s_i})⟩} / ( (1/n) Σ_{j=1}^n e^{⟨G(z_{x_i}), G(z_{s_j})⟩} ) ) ],   (1)

where F_S: 𝒮 → 𝒵 is a deterministic mapping and G is a projection head that projects a representation in 𝒵 into a lower-dimensional vector. If the input and self-supervised signals share the same sample space, i.e., 𝒳 = 𝒮, we can impose F_X = F_S (e.g., self-supervised visual representation learning (Chen et al., 2020)). The projection head G can be an identity, a linear, or a non-linear mapping. Last, we note that modeling equation 1 often requires a large batch size (i.e., large n in equation 1) to ensure good downstream performance (He et al., 2019; Chen et al., 2020).

[4] Other contrastive learning objectives can be other mutual information lower bounds such as the DV- or NWJ-bound (Belghazi et al., 2018) or their JS-divergence variants (Poole et al., 2019; Hjelm et al., 2018). Among the different objectives, Tschannen et al. (2019) suggested that objectives with large variance (e.g., the DV-/NWJ-bound (Belghazi et al., 2018)) may lead to worse performance than their low-variance counterparts (e.g., CPC (Oord et al., 2018) and JS-div. (Poole et al., 2019)).
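To make equation 1 concrete, the following is a minimal PyTorch sketch of the CPC objective; it is not the authors' released implementation, and the names `z_x`, `z_s`, and `g` are placeholders for F_X(X), F_S(S), and G. Up to the additive constant log n, minimizing this loss maximizes the lower bound in equation 1.

```python
import torch
import torch.nn.functional as F

def cpc_loss(z_x, z_s, g):
    """InfoNCE/CPC objective over a batch of n paired samples.

    z_x, z_s: (n, d) representations of the inputs and the
    self-supervised signals; g: projection head G (identity, linear,
    or a small MLP). Row i treats (z_x_i, z_s_i) as the positive pair
    and the other n - 1 signals in the batch as negatives.
    """
    h_x, h_s = g(z_x), g(z_s)
    logits = h_x @ h_s.t()  # logits[i, j] = <G(z_x_i), G(z_s_j)>
    labels = torch.arange(z_x.size(0), device=z_x.device)
    return F.cross_entropy(logits, labels)
```

As noted above, a large batch size (large n) is typically needed; in practice this sketch would sit inside the usual training loop over paired views.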
Forward Predictive Learning (extracts task-relevant information). Forward predictive learning encourages the learned representation Z_X to reconstruct the self-supervised signal S, which suggests maximizing the log conditional likelihood E_{P_{S,Z_X}}[log P(S|Z_X)]. By the chain rule, I(Z_X; S) = H(S) − H(S|Z_X), where H(S) is irrelevant to Z_X. Hence, maximizing I(Z_X; S) is equivalent to maximizing −H(S|Z_X) = E_{P_{S,Z_X}}[log P(S|Z_X)], which is the forward predictive learning objective. Together with Theorem 1, if z_x can perfectly reconstruct s for any (s, z_x) ∼ P_{S,Z_X}, then Z_X contains (approximately) all the information required for the downstream tasks from the input X. A common approach to avoid the intractability of computing E_{P_{S,Z_X}}[log P(S|Z_X)] is to assume a variational distribution Q_φ(S|Z_X), with φ representing the parameters of Q_φ(·|·). Specifically, we propose to maximize E_{P_{S,Z_X}}[log Q_φ(S|Z_X)], which is a lower bound of E_{P_{S,Z_X}}[log P(S|Z_X)] [5]. Q_φ(·|·) can be any distribution such as Gaussian or Laplacian, and φ can be a linear model, a kernel method, or a neural network. Note that the choice of the reconstruction-type loss depends on the distribution type of Q_φ(·|·) and is not fixed. For instance, if we let Q_φ(S|Z_X) be the Gaussian N(S | R(Z_X), σI), with σI a diagonal matrix [6], the objective becomes

    L_FP := max_{Z_X = F_X(X), R} E_{(s, z_x) ∼ P_{S,Z_X}} [ −‖s − R(z_x)‖_2^2 ],   (2)

where R: 𝒵 → 𝒮 is a deterministic mapping to reconstruct S from 𝒵, and we ignore the constants derived from the Gaussian distribution. Last, in most real-world applications, the self-supervised signal S has a much higher dimension (e.g., a 224 × 224 × 3 image) than the representation Z_X (e.g., a 64-dim. vector). Hence, modeling a conditional generative model Q_φ(S|Z_X) will be challenging.

[5] E_{P_{S,Z_X}}[log P(S|Z_X)] = max_{Q_φ} E_{P_{S,Z_X}}[log Q_φ(S|Z_X)] + D_KL(P(S|Z_X) ‖ Q_φ(S|Z_X)) ≥ max_{Q_φ} E_{P_{S,Z_X}}[log Q_φ(S|Z_X)].
[6] The assumption of identity covariance in the Gaussian is only a particular parameterization of the distribution Q(·|·). Other examples are MoCoGAN (Tulyakov et al., 2018), which assumes Q is Laplacian (i.e., an ℓ1 reconstruction loss) and φ is a deconvolutional network (Long et al., 2015), and Transformer-XL (Dai et al., 2019), which assumes Q is a categorical distribution (i.e., a cross-entropy loss) and φ is a Transformer network (Vaswani et al., 2017). Although a Gaussian with diagonal covariance is not the best assumption, it is perhaps the simplest one.

Inverse Predictive Learning (discards task-irrelevant information). Inverse predictive learning encourages the self-supervised signal S to reconstruct the learned representation Z_X, which suggests maximizing the log conditional likelihood E_{P_{S,Z_X}}[log P(Z_X|S)]. Given Theorem 2 together with −H(Z_X|S) = E_{P_{S,Z_X}}[log P(Z_X|S)], we know that if s can perfectly reconstruct z_x for any (s, z_x) ∼ P_{S,Z_X} under the constraint that I(Z_X; S) is maximized, then Z_X discards the task-irrelevant information, excluding I(X; S|T). Similar to forward predictive learning, we use E_{P_{S,Z_X}}[log Q_φ(Z_X|S)] as a lower bound of E_{P_{S,Z_X}}[log P(Z_X|S)]. In our deployment, we take advantage of the design in equation 1 and let Q_φ(Z_X|S) be the Gaussian N(Z_X | F_S(S), σI):

    L_IP := max_{Z_S = F_S(S), Z_X = F_X(X)} E_{(z_s, z_x) ∼ P_{Z_S,Z_X}} [ −‖z_x − z_s‖_2^2 ].   (3)

Note that optimizing equation 3 alone results in a degenerate solution, e.g., learning Z_X and Z_S to be the same constant.
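Under the Gaussian parameterizations above, equations 2 and 3 reduce to squared reconstruction errors. A minimal sketch follows (same hypothetical setup as the previous block; `decoder` plays the role of R):

```python
import torch

def forward_predictive_loss(z_x, s, decoder):
    """L_FP (equation 2): reconstruct the self-supervised signal S from
    Z_X, with Q(S|Z_X) Gaussian with identity covariance, so the
    log-likelihood reduces to a (negative) squared error."""
    s_hat = decoder(z_x)  # R: Z -> S, e.g., a deconvolutional network
    return ((s - s_hat) ** 2).flatten(1).sum(-1).mean()

def inverse_predictive_loss(z_x, z_s):
    """L_IP (equation 3): predict Z_X from Z_S = F_S(S), with
    Q(Z_X|S) = N(Z_X | F_S(S), sigma * I). Optimized alone this
    degenerates (constant representations); it is meant to be used
    jointly with a contrastive or forward predictive term."""
    return ((z_x - z_s) ** 2).sum(-1).mean()
```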
Composing SSL Objectives (to extract task-relevant and discard task-irrelevant information simultaneously). So far, we have discussed how prior self-supervised learning approaches extract task-relevant information via the contrastive or the forward predictive learning objectives. Our analysis also inspires a new loss, the inverse predictive learning objective, to discard task-irrelevant information. We now present a composite loss to combine them:

    L_SSL = λ_CL L_CL + λ_FP L_FP + λ_IP L_IP,   (4)

where λ_CL, λ_FP, and λ_IP are hyper-parameters. This composite loss enables us to extract task-relevant and discard task-irrelevant information simultaneously.
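Using the hypothetical helper functions sketched above, equation 4 then amounts to a weighted sum; the default weights below are illustrative only, not values prescribed by the paper:

```python
def ssl_loss(x, s, f_x, f_s, g, decoder,
             lam_cl=1.0, lam_fp=0.0, lam_ip=0.1):
    """Composite SSL objective (equation 4):
    L_SSL = lam_cl * L_CL + lam_fp * L_FP + lam_ip * L_IP."""
    z_x, z_s = f_x(x), f_s(s)
    loss = lam_cl * cpc_loss(z_x, z_s, g)
    if lam_fp > 0:
        loss = loss + lam_fp * forward_predictive_loss(z_x, s, decoder)
    if lam_ip > 0:
        loss = loss + lam_ip * inverse_predictive_loss(z_x, z_s)
    return loss
```

Section 3 examines how sensitive the downstream performance is to the choice of λ_IP.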
2.4 THEORETICAL ANALYSIS - BAYES ERROR RATE FOR DOWNSTREAM CLASSIFICATION

In the last subsection, we saw the practical aspects of our designed SSL strategies. Now, we provide a theoretical analysis of the representations' generalization error when T is a categorical variable. We use the Bayes error rate as an example, which stands for the irreducible error (the smallest generalization error (Feder & Merhav, 1994)) when learning an arbitrary classifier from the representation to infer the labels. Specifically, let P_e be the Bayes error rate of an arbitrary learned representation Z_X and T̂ the estimate of T from our classifier:

    P_e := E_{z_x ∼ P_{Z_X}} [ 1 − max_{t ∈ 𝒯} P(T̂ = t | z_x) ].

To begin with, we present a general form of the sample complexity of mutual information (I(Z_X; S)) estimation using empirical samples from the distribution P_{Z_X,S}. Let P^(n)_{Z_X,S} denote the (uniformly sampled) empirical distribution of P_{Z_X,S} and Î^(n)_θ(Z_X; S) := E_{P^(n)_{Z_X,S}}[f̂_θ(z_x, s)], with f̂_θ being the estimated log density ratio (i.e., log p(s|z_x)/p(s)).

Proposition 1 (Mutual Information Neural Estimation, restating Theorem 1 by Tsai et al. (2020)). Let 0 < δ < 1. There exist d ∈ ℕ and a family of neural networks F := {f̂_θ : θ ∈ Θ ⊆ ℝ^d}, where Θ is compact, so that there exists θ* ∈ Θ such that, with probability at least 1 − δ over the draw of {z_{x_i}, s_i}_{i=1}^n ∼ P^n_{Z_X,S},

    | Î^(n)_{θ*}(Z_X; S) − I(Z_X; S) | ≤ O( √( (d + log(1/δ)) / n ) ).

This proposition shows that there exists a neural network θ* such that, with high probability, Î^(n)_{θ*}(Z_X; S) can approximate I(Z_X; S) with n samples at rate O(1/√n). Under this network θ* and the same parameters d and δ, we are ready to present our main results on the Bayes error rate. Formally, let |𝒯| be 𝒯's cardinality and Th(x) := min{max{x, 0}, 1 − 1/|𝒯|} be a thresholding function.

Theorem 3 (Bayes Error Rates for Arbitrary Learned Representations). For an arbitrary learned representation Z_X, P_e = Th(P̄_e) with

    P̄_e ≤ 1 − exp( −( H(T) + I(X; S|T) + I(Z_X; X|S, T) − Î^(n)_{θ*}(Z_X; S) + O( √( (d + log(1/δ)) / n ) ) ) ).

Given an arbitrary learned representation Z_X, Theorem 3 suggests that the corresponding Bayes error rate P_e is small when: 1) the estimated mutual information Î^(n)_{θ*}(Z_X; S) is large; 2) a larger number of samples n is used for estimating the mutual information; and 3) the task-irrelevant information (the compression gap I(X; S|T) and the superfluous information I(Z_X; X|S, T) defined in Theorem 2) is small. The first and second results support the claim that maximizing I(Z_X; S) may learn representations that are beneficial to downstream tasks. The third result implies the learned representations may perform better on the downstream task when the compression gap is small. Additionally, Z_X^sslmin is preferable to Z_X^ssl since I(Z_X^sslmin; X|S, T) = 0 and I(Z_X^ssl; X|S, T) ≥ 0.

Theorem 4 (Bayes Error Rates for Self-supervised Learned Representations). Let P_e^sup / P_e^ssl / P_e^sslmin be the Bayes error rates of the supervised or the self-supervised learned representations Z_X^sup / Z_X^ssl / Z_X^sslmin. Then, P_e^ssl = Th(P̄_e^ssl) and P_e^sslmin = Th(P̄_e^sslmin) with

    ( −log(1 − P_e^sup) − log 2 ) / log|𝒯| ≤ { P̄_e^ssl, P̄_e^sslmin } ≤ 1 − exp( −( log 2 + P_e^sup log|𝒯| + ϵ_info ) ).

Given our self-supervised learned representations (Z_X^ssl and Z_X^sslmin), Theorem 4 suggests a smaller upper bound on P̄_e^ssl (or P̄_e^sslmin) when ϵ_info (defined in Assumption 1) is small. This result implies the self-supervised learned representations may perform better on the downstream task when the multi-view redundancy is high, i.e., when ϵ_info is small.
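As a small numerical illustration of Theorem 3, the sketch below evaluates the bound for hypothetical information quantities (in nats); only the functional form comes from the theorem, all numbers are made up:

```python
import math

def thresholded(x, num_classes):
    """Th(x) = min(max(x, 0), 1 - 1/|T|)."""
    return min(max(x, 0.0), 1.0 - 1.0 / num_classes)

def bayes_error_upper_bound(h_t, comp_gap, superfluous,
                            mi_estimate, est_error, num_classes):
    """Theorem 3: Pe <= Th(1 - exp(-(H(T) + I(X;S|T) + I(Z_X;X|S,T)
    - I_hat(Z_X;S) + estimation error)))."""
    exponent = h_t + comp_gap + superfluous - mi_estimate + est_error
    return thresholded(1.0 - math.exp(-exponent), num_classes)

# Hypothetical: 10 balanced classes (H(T) = log 10 nats), a small
# compression gap, zero superfluous information, and an MI estimate of
# 2.2 nats. The bound shrinks as mi_estimate grows or comp_gap shrinks.
print(bayes_error_upper_bound(h_t=math.log(10), comp_gap=0.1,
                              superfluous=0.0, mi_estimate=2.2,
                              est_error=0.05, num_classes=10))  # ~0.22
```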
3 CONTROLLED EXPERIMENTS

This section aims at providing empirical support for Theorems 1 and 2 and comparing different SSL objectives. In particular, Theorems 1 and 2 present information inequalities regarding the amount of task-relevant and task-irrelevant information that will be extracted and discarded when learning self-supervised representations. Nonetheless, quantifying this information is notoriously hard and often leads to inaccurate quantifications in practice (McAllester & Stratos, 2020; Song & Ermon, 2019), not to mention that the information we aim to quantify is conditional information, which is believed to be even more challenging to quantify than the unconditional one (Póczos & Schneider, 2012). To address this concern, we instead study the generalization error of the self-supervised learned representations, theoretically (the Bayes error rate discussed in Section 2.4) and empirically (the test performance discussed in this section).

Another important aspect of the experimental design is examining equation 4, which can be viewed as a Lagrangian relaxation of learning representations that contain minimal and sufficient self-supervision (see Definition 2): a weighted combination between I(Z_X; S) and H(Z_X|S). In particular, the contrastive loss L_CL and the forward predictive loss L_FP represent different realizations of modeling I(Z_X; S), and the inverse predictive loss L_IP represents a realization of modeling H(Z_X|S). We design two sets of experiments. The first is when the input and self-supervised signals lie in the same modality (visual) and are likely to satisfy the multi-view redundancy assumption (Assumption 1). The second is when the input and self-supervised signals lie in very different modalities (visual and textual), thus challenging the SSL objectives' generalization ability.

Experiment I - Visual Representation Learning. We use the Omniglot dataset (Lake et al., 2015) in this experiment [7]. The training set contains images from 964 characters, and the test set contains 659 characters; the characters in the training and test sets do not overlap. Each character contains twenty examples drawn by twenty different people. We regard an image as the input (X) and generate the self-supervised signal (S) by first sampling an image from the same character as the input image and then applying translation/rotation to it. Furthermore, we represent the task-relevant information (T) by the labels of the images. Under this self-supervised signal construction, the information exclusive to X or S consists of drawing styles (i.e., by different people) and image augmentations, and only their shared information contributes to T. To show the latter formally: with T representing the label of X/S, P(T|X) and P(T|S) are Dirac. Hence, T ⊥ S | X and T ⊥ X | S, suggesting Assumption 1 holds. We train the feature mapping F_X(·) with SSL objectives (see equation 4), set F_S(·) = F_X(·), let R(·) be symmetrical to F_X(·), and let G(·) be an identity mapping. On the test set, we fix the mapping and randomly select 5 examples per character as the labeled examples. Then, we classify the rest of the examples using a 1-nearest-neighbor classifier based on feature (i.e., Z_X = F_X(X)) cosine similarity; a sketch of this protocol follows below. The random performance on this task stands at 1/659 ≈ 0.15%. One may refer to the Supplementary for more details.

[7] More complex datasets such as CIFAR10 (Krizhevsky et al., 2009) or ImageNet (Deng et al., 2009), to achieve similar performance, require a much larger training scale for both the contrastive and the forward predictive objective. For example, on ImageNet, MoCo (He et al., 2019) uses 8 GPUs for its contrastive objective and ImageGPT (Chen et al.) uses 2048 TPUs for its forward predictive objective. We choose Omniglot to ensure fair comparisons among different self-supervised learning objectives under a reasonable computation constraint.
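The 1-nearest-neighbor protocol mentioned above amounts to the following sketch (tensor names are placeholders; the features come from the frozen F_X):

```python
import torch
import torch.nn.functional as F

def one_nn_accuracy(ref_feats, ref_labels, test_feats, test_labels):
    """1-NN classification with cosine similarity on frozen features
    Z_X = F_X(X); ref_* holds the 5 labeled examples per character."""
    a = F.normalize(ref_feats, dim=-1)
    b = F.normalize(test_feats, dim=-1)
    sims = b @ a.t()               # (num_test, num_ref) cosine similarities
    nearest = sims.argmax(dim=-1)  # index of the most similar reference
    preds = ref_labels[nearest]
    return (preds == test_labels).float().mean().item()
```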
Figure 3: Comparisons for different compositions of SSL objectives on Omniglot and CIFAR10.

Results & Discussions. In Figure 3, we evaluate the generalization ability of different SSL objectives on the test set. First, we examine how the introduced inverse predictive learning objective L_IP can help improve performance along with the contrastive learning objective L_CL. We present the results in Figure 3 (a) and also provide experiments with SimCLR (Chen et al., 2020) on CIFAR10 (Krizhevsky et al., 2009) in Figure 3 (b), where λ_IP = 0 refers to the exact same setup as in SimCLR (which considers only L_CL). We find that adding L_IP to the objective can boost model performance, although the performance is sensitive to the hyper-parameter λ_IP. According to Theorem 2, the improved performance suggests that a more compressed representation results in better performance on the downstream tasks. Second, we add a discussion of the forward predictive learning objective L_FP. We present the results in Figure 3 (c). Compared to L_FP, L_CL 1) reaches better test accuracy; 2) requires fewer training epochs to reach the best performance; and 3) suffers from over-fitting with long-epoch training. Combining both of them (L_CL + 0.005 L_FP) brings their advantages together.

Experiment II - Visual-Textual Representation Learning. We provide experiments using the MS COCO dataset (Lin et al., 2014), which contains 328k multi-labeled images with 2.5 million labeled instances from 91 object categories. Each image has 5 annotated captions describing the relationships between objects in the scenes. We regard an image as the input (X) and its textual descriptions as the self-supervised signal (S). Since vision and text are two very different modalities, the multi-view redundancy may not be satisfied, which means ϵ_info may be large in Assumption 1. We adopt L_CL (+ λ_IP L_IP) as our SSL objective. We use a ResNet18 (He et al., 2016) image encoder for F_X(·) (trained from scratch or fine-tuned from ImageNet (Deng et al., 2009) pre-trained weights), a BERT-uncased (Devlin et al., 2018) text encoder for F_S(·) (trained from scratch or from BookCorpus (Zhu et al., 2015)/Wikipedia pre-trained weights), and a linear layer for G(·). After performing self-supervised visual-textual representation learning, we consider downstream multi-label classification over the 91 categories. We evaluate the learned visual representation (Z_X) using the downstream linear evaluation protocol (Oord et al., 2018; Hénaff et al., 2019; Tian et al., 2019; Hjelm et al., 2018; Bachman et al., 2019; Tschannen et al., 2019); a sketch follows the results discussion below. Specifically, a linear classifier is trained from the self-supervised learned (fixed) representation to the labels on the training set. Commonly used metrics for multi-label classification are reported on the MS COCO validation set: Micro ROC-AUC and Subset Accuracy. One may refer to the Supplementary for more details on these metrics.

(a) MS COCO (using L_CL as the SSL objective)

| Setting | Micro ROC-AUC | Subset Acc. |
| --- | --- | --- |
| Cross-modality Self-supervised Learning: | | |
| Raw BERT + Raw ResNet | 0.5963 ± 0.0034 | 0.0166 ± 0.0017 |
| Pre-trained BERT + Raw ResNet | 0.5915 ± 0.0035 | 0.0163 ± 0.0011 |
| Raw BERT + Pre-trained ResNet | 0.7049 ± 0.0040 | 0.2081 ± 0.0063 |
| Pre-trained BERT + Pre-trained ResNet | 0.7065 ± 0.0026 | 0.2123 ± 0.0040 |
| Non Self-supervised Learning: | | |
| Only Pre-trained ResNet | 0.6761 ± 0.0045 | 0.1719 ± 0.0015 |

Figure 4: Comparisons of different settings for self-supervised visual-textual representation training. We report metrics on the MS COCO validation set with mean and standard deviation from 5 random trials.

Results & Discussions. First, Figure 4 (a) suggests that the SSL strategy can still work when the input and self-supervised signals lie in different modalities. For example, a pre-trained ResNet with BERT (either raw or pre-trained) outperforms the pre-trained ResNet alone. We also see that the self-supervised learned representations benefit more if the ResNet is pre-trained than if the BERT is. This result is in accord with the fact that object recognition requires more understanding of vision, and hence the pre-trained ResNet is preferable to the pre-trained BERT. Next, Figure 4 (b) suggests that the self-supervised learned representations can be further improved by combining L_CL and L_IP, suggesting L_IP may be a useful objective to discard task-irrelevant information.
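A sketch of the linear evaluation protocol referenced above (the encoder stays frozen and only a linear head is trained; the data loader, dimensions, and hyper-parameters are illustrative assumptions, not the paper's exact settings):

```python
import torch
import torch.nn as nn

def linear_probe(encoder, loader, feat_dim=512, num_labels=91, epochs=10):
    """Downstream linear evaluation: train a linear classifier on top of
    frozen self-supervised features. For MS COCO multi-label
    classification, each of the 91 outputs is an independent binary logit."""
    encoder.eval()
    head = nn.Linear(feat_dim, num_labels)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    criterion = nn.BCEWithLogitsLoss()  # multi-label targets in {0, 1}^91
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():       # the representation stays fixed
                z = encoder(x)
            loss = criterion(head(z), y.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```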
Remarks on λ_IP and L_IP. As observed in the experimental results, performance is sensitive to the hyper-parameter λ_IP. We provide an optimization perspective to address this concern. Note that one of our goals is to examine the setting of learning minimal and sufficient representations for self-supervision (see Definition 2): minimize H(Z_X|S) under the constraint that I(Z_X; S) is maximized. However, this constrained optimization is not feasible when considering gradient methods in neural networks. Hence, our approach can be seen as its Lagrangian relaxation: a weighted combination between L_CL (or L_FP, representing I(Z_X; S)) and L_IP (representing H(Z_X|S)), with λ_IP being the Lagrangian coefficient. The optimal λ_IP can be obtained by solving the Lagrangian dual, which depends on the parametrization of L_CL (or L_FP) and L_IP. Different parameterizations lead to different loss and gradient landscapes, and hence the optimal λ_IP differs across experiments. This conclusion is verified by the results presented in Figure 3 (a) and (b) and Figure 4 (b). Lastly, we point out that, even without solving the Lagrangian dual, an empirical observation across experiments is that the best-performing λ_IP scales L_IP to about one-tenth of L_CL (or L_FP).

4 RELATED WORK

Prior work by Arora et al. (2019) and the recent concurrent works (Lee et al., 2020; Tosh et al., 2020) are landmarks for theoretically understanding the success of SSL. In particular, Arora et al. (2019) and Lee et al. (2020) showed a decreased sample complexity for downstream supervised tasks when adopting contrastive learning objectives (Arora et al., 2019) or predicting the known information in the data (Lee et al., 2020). Tosh et al. (2020) showed that linear functions of the learned representations are nearly optimal on downstream prediction tasks. Viewing the input and the self-supervised signal as two corresponding views of the data, we discuss the differences among these works and ours. On the one hand, the works by Arora et al. (2019) and Lee et al. (2020) assume strong independence between the views conditioned on the downstream task, i.e., I(X; S|T) ≈ 0. On the other hand, the work by Tosh et al. (2020) and ours assume strong independence between the downstream task and one view conditioned on the other view, i.e., I(T; X|S) ≈ 0. Prior works (Balcan et al., 2005; Du et al., 2010) have compared these two assumptions and pointed out that the former (I(X; S|T) ≈ 0) is too strong and not likely to hold in practice. We note that all these related works and ours show that self-supervised learning methods learn to extract task-relevant information. Our work additionally proposes to discard task-irrelevant information and quantifies the amount of information that cannot be discarded.

Our method also resembles the InfoMax principle (Linsker, 1988; Hjelm et al., 2018) and the Multi-view Information Bottleneck method (Federici et al., 2020). The InfoMax principle aims at preserving the information of the input itself, while ours aims at extracting the information in the self-supervised signal. On the other hand, to reduce the redundant information across views, the Multi-view Information Bottleneck method proposed to minimize the conditional mutual information I(Z_X; X|S), while ours proposes to minimize the conditional entropy H(Z_X|S). The conditional entropy minimization problem can be easily optimized via our proposed inverse predictive learning objective. Another related work is InfoMin (Tian et al., 2020); both InfoMin and our method suggest learning representations that do not contain too much information.
In particular, InfoMin proposes to augment the data (i.e., by constructing learnable data augmentations) such that the shared information between augmented variants is as minimal as possible, followed by mutual information maximization between the learned features of the augmented variants. Our method instead considers standard augmentations (e.g., rotations and translations), followed by learning representations that contain no more than the shared information between the augmented variants of the data. On the empirical side, we explain why contrastive (Oord et al., 2018; Bachman et al., 2019; Chen et al., 2020) and predictive learning (Zhang et al., 2016; Pathak et al., 2016; Vondrick et al., 2016; Chen et al.) approaches can extract task-relevant information without supervision. Different from these works, we present an objective to discard task-irrelevant information and show that its combination with existing contrastive or predictive objectives benefits performance.

5 CONCLUSION

This work studies both theoretical and empirical perspectives on self-supervised learning. We show that the self-supervised learned representations can extract task-relevant information (with a potential loss) and discard task-irrelevant information (with a fixed gap), along with their practical deployments such as contrastive and predictive learning objectives. We believe this work sheds light on the advantages of self-supervised learning and may help better understand when and why self-supervised learning is likely to work. In the future, we plan to connect our framework to recent SSL methods that cannot easily fit into our analysis, e.g., BYOL (Grill et al., 2020), SwAV (Caron et al., 2020), and Uniformity-Alignment (Wang & Isola, 2020).

ACKNOWLEDGEMENT

This work was supported in part by NSF IIS1763562, NSF Awards #1750439 and #1722822, the National Institutes of Health, IARPA D17PC00340, ONR Grant N000141812861, and a Facebook PhD Fellowship. We would also like to acknowledge NVIDIA's GPU support.

REFERENCES

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947-1980, 2018.

Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 609-617, 2017.

Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509-15519, 2019.

Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In Advances in Neural Information Processing Systems, pp. 89-96, 2005.

Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

Cristian S Calude. Information and randomness: an algorithmic perspective. Springer Science & Business Media, 2013.
Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.

Mark Chen, Alec Radford, Rewon Child, Jeff Wu, and Heewoo Jun. Generative pretraining from pixels.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Jun Du, Charles X Ling, and Zhi-Hua Zhou. When does cotraining work in real data? IEEE Transactions on Knowledge and Data Engineering, 23(5):788-799, 2010.

Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861-874, 2006.

Meir Feder and Neri Merhav. Relations between entropy and error probability. IEEE Transactions on Information Theory, 40(1):259-266, 1994.

M Federici, A Dutta, P Forré, N Kushmann, and Z Akata. Learning robust representations via multi-view information bottleneck. International Conference on Learning Representations, 2020.

Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are universal approximators.

Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740-755. Springer, 2014.

Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105-117, 1988.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.

David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, pp. 875-884, 2020.

Sudipto Mukherjee, Himanshu Asnani, and Sreeram Kannan. CCMI: Classifier based conditional mutual information estimation. In Uncertainty in Artificial Intelligence, pp. 1083-1093. PMLR, 2020.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69-84. Springer, 2016.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536-2544, 2016.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

Barnabás Póczos and Jeff Schneider. Nonparametric estimation of conditional information and divergences. In Artificial Intelligence and Statistics, pp. 914-923. PMLR, 2012.

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922, 2019.

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators. arXiv preprint arXiv:1910.06222, 2019.

Mohammad S Sorower. A literature survey on algorithms for multi-label learning.

Karthik Sridharan and Sham M Kakade. An information theoretic framework for multi-view learning. 2008.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models. arXiv preprint arXiv:2008.10150, 2020.

Yao-Hung Hubert Tsai, Han Zhao, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Neural methods for point-wise dependency estimation. arXiv preprint arXiv:2006.05553, 2020.

Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1526-1535, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pp. 613-621, 2016.

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv preprint arXiv:2005.10242, 2020.

Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pp. 649-666. Springer, 2016.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19-27, 2015.

A REMARKS ON LEARNING MINIMAL AND SUFFICIENT REPRESENTATIONS

In the main text, we discussed the objectives for learning minimal and sufficient representations (Definition 1). Here, we discuss the similarities and differences between the prior methods (Tishby et al., 2000; Achille & Soatto, 2018) and ours. First, to obtain sufficient representations (for the downstream task T), all the methods propose to maximize I(Z_X; T). Then, to maintain a minimal amount of information in the representations, the prior methods (Tishby et al., 2000; Achille & Soatto, 2018) propose to minimize I(Z_X; X), while ours proposes to minimize H(Z_X|T). Our goal is to relate I(Z_X; X) minimization and H(Z_X|T) minimization in our framework. To begin with, under the constraint that I(Z_X; T) is maximized, we see that minimizing I(Z_X; X) is equivalent to minimizing I(Z_X; X|T). The reason is that I(Z_X; X) = I(Z_X; X|T) + I(Z_X; X; T), where I(Z_X; X; T) = I(Z_X; T) due to the determinism from X to Z_X (our framework learns a deterministic function from X to Z_X) and I(Z_X; T) is maximized under our constraint. Then, I(Z_X; X|T) = H(Z_X|T) − H(Z_X|X, T), where H(Z_X|X, T) contains no randomness (no information) since Z_X is deterministic given X. Hence, I(Z_X; X|T) minimization and H(Z_X|T) minimization are interchangeable. The same claim can be made from the downstream task T to the self-supervised signal S: specifically, when X to Z_X is deterministic, I(Z_X; X|S) minimization and H(Z_X|S) minimization are interchangeable. As discussed in the related work section, for reducing the amount of redundant information, Federici et al. (2020) proposed I(Z_X; X|S) minimization, while ours proposes H(Z_X|S) minimization. We also note that directly minimizing the conditional mutual information (i.e., I(Z_X; X|S)) requires a min-max optimization (Mukherjee et al., 2020), which may cause instability in practice. To overcome this issue, Federici et al. (2020) assume a Gaussian encoder for X → Z_X and present an upper bound of the original objective.
B PROOFS FOR THEOREMS 1 AND 2

We start by presenting a useful lemma following from the fact that F_X(·) is a deterministic function:

Lemma 1 (Determinism). If P(Z_X|X) is Dirac, then the following conditional independences hold: T ⊥ Z_X | X and S ⊥ Z_X | X, inducing the Markov chain (S, T) → X → Z_X.

Proof. When Z_X is a deterministic function of X, for any A in the sigma-algebra induced by Z_X we have E[1[Z_X ∈ A] | X, {T, S}] = E[1[Z_X ∈ A] | X, S] = E[1[Z_X ∈ A] | X], which implies T ⊥ Z_X | X and S ⊥ Z_X | X.

Theorems 1 and 2 in the main text restated:

Theorem 5 (Task-relevant information with a potential loss ϵ_info, restating Theorem 1 in the main text). The supervised learned representations (i.e., Z_X^sup and Z_X^supmin) contain all the task-relevant information in the input (i.e., I(X; T)). The self-supervised learned representations (i.e., Z_X^ssl and Z_X^sslmin) contain all the task-relevant information in the input with a potential loss ϵ_info. Formally,

    I(X; T) = I(Z_X^sup; T) = I(Z_X^supmin; T) ≥ I(Z_X^ssl; T) ≥ I(Z_X^sslmin; T) ≥ I(X; T) − ϵ_info.

Proof. The proof contains two parts: the first covers the supervised learned representations and the second the self-supervised learned representations.

Supervised Learned Representations: Adopting the Data Processing Inequality (DPI, by Cover & Thomas (2012)) in the Markov chain of Lemma 1, I(Z_X; T) is maximized at I(X; T). Since both supervised learned representations (Z_X^sup and Z_X^supmin) maximize I(Z_X; T), we conclude I(Z_X^sup; T) = I(Z_X^supmin; T) = I(X; T).

Self-supervised Learned Representations: First, we have

    I(Z_X; S) = I(Z_X; T) − I(Z_X; T|S) + I(Z_X; S|T) = I(Z_X; T; S) + I(Z_X; S|T)

and

    I(X; S) = I(X; T) − I(X; T|S) + I(X; S|T) = I(X; T; S) + I(X; S|T).

By DPI in the Markov chain of Lemma 1, we know

- I(Z_X; S) is maximized at I(X; S);
- I(Z_X; S; T) is maximized at I(X; S; T);
- I(Z_X; S|T) is maximized at I(X; S|T).

Since both self-supervised learned representations (Z_X^ssl and Z_X^sslmin) maximize I(Z_X; S), we have I(Z_X^ssl; S) = I(Z_X^sslmin; S) = I(X; S). Hence, I(Z_X^ssl; S; T) = I(Z_X^sslmin; S; T) = I(X; S; T) and I(Z_X^ssl; S|T) = I(Z_X^sslmin; S|T) = I(X; S|T). Using the result I(Z_X^ssl; S; T) = I(Z_X^sslmin; S; T) = I(X; S; T), we get

    I(Z_X^ssl; T) = I(X; T) − I(X; T|S) + I(Z_X^ssl; T|S)  and  I(Z_X^sslmin; T) = I(X; T) − I(X; T|S) + I(Z_X^sslmin; T|S).

Now, we are ready to present the inequalities:

1. I(X; T) ≥ I(Z_X^ssl; T), since I(X; T|S) ≥ I(Z_X^ssl; T|S) by DPI.
2. I(Z_X^ssl; T) ≥ I(Z_X^sslmin; T), since I(Z_X^ssl; T|S) ≥ I(Z_X^sslmin; T|S) = 0; the latter holds because H(Z_X|S) is minimized at Z_X^sslmin, so I(Z_X^sslmin; T|S) = 0.
3. I(Z_X^sslmin; T) ≥ I(X; T) − ϵ_info, since I(X; T) − I(X; T|S) + I(Z_X^sslmin; T|S) ≥ I(X; T) − I(X; T|S) ≥ I(X; T) − ϵ_info, where I(X; T|S) ≤ ϵ_info by the redundancy assumption.

Theorem 6 (Task-irrelevant information with a fixed compression gap I(X; S|T), restating Theorem 2 in the main text). The sufficient self-supervised representation (i.e., Z_X^ssl) contains more task-irrelevant information from the input than the minimal and sufficient self-supervised representation (i.e., Z_X^sslmin). The latter contains an amount of information, I(X; S|T), that cannot be discarded from the input. Formally,

    I(Z_X^ssl; X|T) = I(X; S|T) + I(Z_X^ssl; X|S, T) ≥ I(Z_X^sslmin; X|T) = I(X; S|T) ≥ I(Z_X^supmin; X|T) = 0.
Proof. First, we see that I(Z_X; X|T) = I(Z_X; X; S|T) + I(Z_X; X|S, T) = I(Z_X; S|T) + I(Z_X; X|S, T), where I(Z_X; X; S|T) = I(Z_X; S|T) by DPI in the Markov chain of Lemma 1. We conclude the proof by combining the following:

- From the proof of Theorem 5, we showed I(Z_X^ssl; S|T) = I(Z_X^sslmin; S|T) = I(X; S|T).
- Since H(Z_X|S) is minimized at Z_X^sslmin, I(Z_X^sslmin; X|S, T) = 0.
- Since H(Z_X|T) is minimized at Z_X^supmin, I(Z_X^supmin; X|T) = 0.

C PROOF FOR PROPOSITION 1

Proposition 2 (Mutual Information Neural Estimation, restating Proposition 1 in the main text). Let 0 < δ < 1. There exist d ∈ ℕ and a family of neural networks F := {f̂_θ : θ ∈ Θ ⊆ ℝ^d}, where Θ is compact, so that there exists θ* ∈ Θ such that, with probability at least 1 − δ over the draw of {z_{x_i}, s_i}_{i=1}^n ∼ P^n_{Z_X,S},

    | Î^(n)_{θ*}(Z_X; S) − I(Z_X; S) | ≤ O( √( (d + log(1/δ)) / n ) ).

Sketch of Proof. The proof is a standard instance of a uniform convergence bound. First, we assume the boundedness and Lipschitzness of f̂_θ. Then, we use the universal approximation lemma of neural networks (Hornik et al.). Last, combining these two along with uniform convergence in terms of the covering number (Bartlett, 1998), we complete the proof. We note that the complete proof can be found in the prior work (Tsai et al., 2020). An alternative but similar proof can be found in another prior work (Belghazi et al., 2018), which gives | Î^(n)_θ(Z_X; S) − I(Z_X; S) | ≤ O( √( (d log d + log(1/δ)) / n ) ). The subtle difference between them is that, given a neural network function space Θ ⊆ ℝ^d and its covering number N(Θ, η), Tsai et al. (2020) have N(Θ, η) = O( (1/η)^d ) by Bartlett (1998), and Belghazi et al. (2018) have N(Θ, η) = O( (√d/η)^d ) by Shalev-Shwartz & Ben-David (2014). Both are valid, and the one used by Tsai et al. (2020) is tighter.

D PROOFS FOR THEOREMS 3 AND 4

To begin with, we see that

    I(Z_X; T) = I(Z_X; X) − I(Z_X; X|T) + I(Z_X; T|X)
              = I(Z_X; X) − I(Z_X; X|T)
              = I(Z_X; S) − I(Z_X; S|X) + I(Z_X; X|S) − I(Z_X; X|T)
              = I(Z_X; S) + I(Z_X; X|S) − I(Z_X; X|T)
              ≥ I(Z_X; S) − I(Z_X; X|T),

where I(Z_X; T|X) = I(Z_X; S|X) = 0 due to the determinism from X to Z_X. Then, in the proof of Theorem 6, we have shown I(Z_X; X|T) = I(Z_X; S|T) + I(Z_X; X|S, T). Hence,

    I(Z_X; T) ≥ I(Z_X; S) − I(Z_X; S|T) − I(Z_X; X|S, T) ≥ I(Z_X; S) − I(X; S|T) − I(Z_X; X|S, T),

where I(Z_X; S|T) ≤ I(X; S|T) by DPI.

Theorems 3 and 4 in the main text restated:

Theorem 7 (Bayes Error Rates for Arbitrary Learned Representations, restating Theorem 3 in the main text). For an arbitrary learned representation Z_X, P_e = Th(P̄_e) with

    P̄_e ≤ 1 − exp( −( H(T) + I(X; S|T) + I(Z_X; X|S, T) − Î^(n)_{θ*}(Z_X; S) + O( √( (d + log(1/δ)) / n ) ) ) ).

Proof. We use the inequality between P̄_e and H(T|Z_X) indicated by Feder & Merhav (1994): −log(1 − P̄_e) ≤ H(T|Z_X). Combining this with I(Z_X; T) = H(T) − H(T|Z_X) and I(Z_X; T) ≥ I(Z_X; S) − I(X; S|T) − I(Z_X; X|S, T), we have

    −log(1 − P̄_e) ≤ H(T) − I(Z_X; S) + I(X; S|T) + I(Z_X; X|S, T),

and hence P̄_e ≤ 1 − exp( −( H(T) + I(X; S|T) + I(Z_X; X|S, T) − I(Z_X; S) ) ). Next, by the definition of the Bayes error rate, we know 0 ≤ P_e ≤ 1 − 1/|𝒯|. We conclude the proof by combining Proposition 2: | Î^(n)_{θ*}(Z_X; S) − I(Z_X; S) | ≤ O( √( (d + log(1/δ)) / n ) ).

Theorem 8 (Bayes Error Rates for Self-supervised Learned Representations, restating Theorem 4 in the main text). Let P_e^sup / P_e^ssl / P_e^sslmin be the Bayes error rates of the supervised or the self-supervised learned representations Z_X^sup / Z_X^ssl / Z_X^sslmin. Then, P_e^ssl = Th(P̄_e^ssl) and P_e^sslmin = Th(P̄_e^sslmin) with

    ( −log(1 − P_e^sup) − log 2 ) / log|𝒯| ≤ { P̄_e^ssl, P̄_e^sslmin } ≤ 1 − exp( −( log 2 + P_e^sup log|𝒯| + ϵ_info ) ).

Proof. We use the two inequalities between P_e and H(T|Z_X) by Feder & Merhav (1994) and Cover & Thomas (2012): −log(1 − P_e) ≤ H(T|Z_X) and H(T|Z_X) ≤ log 2 + P_e log|𝒯|.
D PROOFS FOR THEOREM 3 AND 4

To begin with, we see that
$$\begin{aligned}
I(Z_X;T) &= I(Z_X;X) - I(Z_X;X|T) + I(Z_X;T|X) \\
&= I(Z_X;X) - I(Z_X;X|T) \\
&= I(Z_X;S) - I(Z_X;S|X) + I(Z_X;X|S) - I(Z_X;X|T) \\
&= I(Z_X;S) + I(Z_X;X|S) - I(Z_X;X|T) \\
&\ge I(Z_X;S) - I(Z_X;X|T),
\end{aligned}$$
where $I(Z_X;T|X) = I(Z_X;S|X) = 0$ due to the determinism from $X$ to $Z_X$. Then, in the proof of Theorem 6, we have shown $I(Z_X;X|T) = I(Z_X;S|T) + I(Z_X;X|S,T)$. Hence,
$$I(Z_X;T) \;\ge\; I(Z_X;S) - I(Z_X;S|T) - I(Z_X;X|S,T) \;\ge\; I(Z_X;S) - I(X;S|T) - I(Z_X;X|S,T),$$
where $I(Z_X;S|T) \le I(X;S|T)$ by DPI.

Theorems 3 and 4 in the main text, restated:

Theorem 7 (Bayes Error Rates for Arbitrary Learned Representations, restating Theorem 3 in the main text). For an arbitrary learned representation $Z_X$, $\bar{P}_e = \mathrm{Th}(P_e)$ with
$$P_e \le 1 - \exp\Big(-\Big(H(T) + I(X;S|T) + I(Z_X;X|S,T) - \hat{I}^{(n)}_\theta(Z_X;S) + O\big(\sqrt{\tfrac{d+\log(1/\delta)}{n}}\big)\Big)\Big),$$
where $\mathrm{Th}(\cdot)$ thresholds its argument into the feasible range $0 \le P_e \le 1 - \frac{1}{|\mathcal{T}|}$.

Proof. We use the inequality between $P_e$ and $H(T|Z_X)$ indicated by Feder & Merhav (1994): $-\log(1-P_e) \le H(T|Z_X)$. Combining this with $I(Z_X;T) = H(T) - H(T|Z_X)$ and $I(Z_X;T) \ge I(Z_X;S) - I(X;S|T) - I(Z_X;X|S,T)$, we have
$$\log(1-P_e) \;\ge\; -H(T) + I(Z_X;S) - I(X;S|T) - I(Z_X;X|S,T),$$
which gives
$$P_e \le 1 - \exp\Big(-\big(H(T) + I(X;S|T) + I(Z_X;X|S,T) - I(Z_X;S)\big)\Big).$$
Next, by the definition of the Bayes error rate, we know $0 \le P_e \le 1 - \frac{1}{|\mathcal{T}|}$. We conclude the proof by combining Proposition 2, $\big|\hat{I}^{(n)}_\theta(Z_X;S) - I(Z_X;S)\big| \le O\big(\sqrt{\tfrac{d+\log(1/\delta)}{n}}\big)$.

Theorem 8 (Bayes Error Rates for Self-supervised Learned Representations, restating Theorem 4 in the main text). Let $P^{\rm sup}_e / P^{\rm ssl}_e / P^{\rm sslmin}_e$ be the Bayes error rate of the supervised or the self-supervised learned representations $Z^{\rm sup}_X / Z^{\rm ssl}_X / Z^{\rm sslmin}_X$. Then, $\bar{P}^{\rm ssl}_e = \mathrm{Th}(P^{\rm ssl}_e)$ and $\bar{P}^{\rm sslmin}_e = \mathrm{Th}(P^{\rm sslmin}_e)$ with
$$\frac{-\log(1-P^{\rm sup}_e) - \log 2}{\log|\mathcal{T}|} \;\le\; \{P^{\rm ssl}_e,\, P^{\rm sslmin}_e\} \;\le\; 1 - \exp\big(-(\log 2 + P^{\rm sup}_e \log|\mathcal{T}| + \epsilon_{\rm info})\big).$$

Proof. We use the two inequalities between $P_e$ and $H(T|Z_X)$ by Feder & Merhav (1994) and Cover & Thomas (2012): $-\log(1-P_e) \le H(T|Z_X)$ and $H(T|Z_X) \le \log 2 + P_e\log|\mathcal{T}|$. Combining these with the results from Theorem 5, $I(Z^{\rm sup}_X;T) \ge I(Z^{\rm ssl}_X;T) \ge I(Z^{\rm sslmin}_X;T) \ge I(Z^{\rm sup}_X;T) - \epsilon_{\rm info}$, we obtain the following.

The upper bound on the self-supervised learned representations' Bayes error rates:
$$\{-\log(1-P^{\rm ssl}_e),\, -\log(1-P^{\rm sslmin}_e)\} \le \{H(T|Z^{\rm ssl}_X),\, H(T|Z^{\rm sslmin}_X)\} \le H(T|Z^{\rm sup}_X) + \epsilon_{\rm info} \le \log 2 + P^{\rm sup}_e\log|\mathcal{T}| + \epsilon_{\rm info},$$
which suggests $\{P^{\rm ssl}_e, P^{\rm sslmin}_e\} \le 1 - \exp\big(-(\log 2 + P^{\rm sup}_e \log|\mathcal{T}| + \epsilon_{\rm info})\big)$.

The lower bound on the self-supervised learned representations' Bayes error rates:
$$-\log(1-P^{\rm sup}_e) \le H(T|Z^{\rm sup}_X) \le \{H(T|Z^{\rm ssl}_X),\, H(T|Z^{\rm sslmin}_X)\} \le \{\log 2 + P^{\rm ssl}_e\log|\mathcal{T}|,\ \log 2 + P^{\rm sslmin}_e\log|\mathcal{T}|\},$$
which suggests $\frac{-\log(1-P^{\rm sup}_e) - \log 2}{\log|\mathcal{T}|} \le \{P^{\rm ssl}_e, P^{\rm sslmin}_e\}$.

We conclude the proof by having $P_e$ lie in the feasible range: $0 \le P_e \le 1 - \frac{1}{|\mathcal{T}|}$.

E TIGHTER BOUNDS FOR THE BAYES ERROR RATES

We note that the bounds used in Theorems 7 and 8, $-\log(1-P_e) \le H(T|Z_X) \le \log 2 + P_e\log|\mathcal{T}|$, are not tight. A tighter pair of bounds is $H_-(P_e) \le H(T|Z_X) \le H_+(P_e)$ with
$$H_-(P_e) := H\big(k(1-P_e)\big) + k(1-P_e)\log k \quad\text{when } \tfrac{k-1}{k} \le P_e \le \tfrac{k}{k+1},\ 1 \le k \le |\mathcal{T}|-1,$$
and
$$H_+(P_e) := H(P_e) + P_e\log(|\mathcal{T}|-1),$$
where $H(x) = -x\log(x) - (1-x)\log(1-x)$. It is clear that $-\log(1-P_e) \le H_-(P_e)$ and $H_+(P_e) \le \log 2 + P_e\log|\mathcal{T}|$. Hence, Theorems 7 and 8 can be improved as follows:

Theorem 9 (Tighter Bayes Error Rates for Arbitrary Learned Representations). For an arbitrary learned representation $Z_X$, $\bar{P}_e = \mathrm{Th}(P_e)$ with $P_e \le \hat{P}^{\rm upper}_e$, where $\hat{P}^{\rm upper}_e$ is derived from the program
$$\arg\max_{P_e}\ P_e \quad\text{s.t.}\quad H_-(P_e) \le H(T) - \hat{I}^{(n)}_\theta(Z_X;S) + I(X;S|T) + I(Z_X;X|S,T) + O\Big(\sqrt{\tfrac{d+\log(1/\delta)}{n}}\Big).$$

Theorem 10 (Tighter Bayes Error Rates for Self-supervised Learned Representations). Let $P^{\rm sup}_e / P^{\rm ssl}_e / P^{\rm sslmin}_e$ be the Bayes error rate of the supervised or the self-supervised learned representations $Z^{\rm sup}_X / Z^{\rm ssl}_X / Z^{\rm sslmin}_X$. Then, $\bar{P}^{\rm ssl}_e = \mathrm{Th}(P^{\rm ssl}_e)$ and $\bar{P}^{\rm sslmin}_e = \mathrm{Th}(P^{\rm sslmin}_e)$ with $\hat{P}^{\rm ssl}_{e,\rm lower} \le \{P^{\rm ssl}_e, P^{\rm sslmin}_e\} \le \hat{P}^{\rm ssl}_{e,\rm upper}$. $\hat{P}^{\rm ssl}_{e,\rm lower}$ is derived from the program
$$\arg\min_{P^{\rm ssl}_e}\ P^{\rm ssl}_e \quad\text{s.t.}\quad H_-(P^{\rm sup}_e) \le H_+(P^{\rm ssl}_e),$$
and $\hat{P}^{\rm ssl}_{e,\rm upper}$ is derived from the program
$$\arg\max_{P^{\rm ssl}_e}\ P^{\rm ssl}_e \quad\text{s.t.}\quad H_-(P^{\rm ssl}_e) \le H_+(P^{\rm sup}_e) + \epsilon_{\rm info}.$$

F MORE ON VISUAL REPRESENTATION LEARNING EXPERIMENTS

In the main text, we design controlled experiments on self-supervised visual representation learning to empirically support our theorems and examine different compositions of SSL objectives. In this section, we discuss 1) the architecture design; 2) different deployments of contrastive/forward predictive learning; and 3) different self-supervised signal construction strategies. We argue that these three additional sets of experiments may point toward interesting future work.

F.1 ARCHITECTURE DESIGN

The input image has size 105 × 105. For image augmentations, we adopt 1) rotation with degrees from −10° to +10°; 2) translation from −15 pixels to +15 pixels; 3) scaling both width and height from 0.85 to 1.0; 4) scaling width from 0.85 to 1.25 while fixing the height; and 5) resizing the image to 28 × 28. Then, a deep network takes a 28 × 28 image and outputs a 1024-dim. feature vector. The deep network has the structure: Conv → BN → ReLU → Conv → BN → ReLU → MaxPool → Conv → BN → ReLU → MaxPool → Conv → BN → ReLU → MaxPool → Flatten → Linear → L2Norm. Each Conv has a 3×3 kernel with 128 output channels, each MaxPool has a 2×2 kernel, and Linear is a 1152-to-1024 weight matrix. $R(\cdot)$ is symmetric to $F_X(\cdot)$, with the structure Linear → BN → ReLU → UnFlatten → DeConv → BN → ReLU → DeConv → BN → ReLU → DeConv → BN → ReLU → DeConv. $R(\cdot)$ has exactly the same number of parameters as $F_X(\cdot)$. Note that we use the same network designs in the $I(\cdot\,,\cdot)$ and $H(\cdot\,|\,\cdot)$ estimations.
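A minimal PyTorch sketch of this encoder is given below. The strides and paddings are our assumptions (3×3 convolutions with padding 1, so the three 2×2 max-pools take the spatial size 28 → 14 → 7 → 3 and the flattened feature has 128·3·3 = 1152 entries, matching the 1152-to-1024 linear layer); the released code is the authoritative reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Sketch of F_X: 28x28 grayscale image -> L2-normalized 1024-d feature."""
    def __init__(self):
        super().__init__()
        def block(cin, cout):  # Conv -> BN -> ReLU
            return [nn.Conv2d(cin, cout, 3, padding=1),
                    nn.BatchNorm2d(cout), nn.ReLU()]
        self.conv = nn.Sequential(
            *block(1, 128),                      # Conv-BN-ReLU
            *block(128, 128), nn.MaxPool2d(2),   # Conv-BN-ReLU-MaxPool, 28 -> 14
            *block(128, 128), nn.MaxPool2d(2),   # 14 -> 7
            *block(128, 128), nn.MaxPool2d(2))   # 7 -> 3
        self.linear = nn.Linear(128 * 3 * 3, 1024)  # the 1152 -> 1024 matrix

    def forward(self, x):                        # x: (batch, 1, 28, 28)
        h = self.conv(x).flatten(1)              # Flatten -> (batch, 1152)
        return F.normalize(self.linear(h), dim=-1)  # Linear -> L2Norm
```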
To reproduce the results in our experimental section, please refer to our released code: https://github.com/yaohungt/Self_Supervised_Learning_Multiview

Figure 5: Comparisons of different objectives/compositions of SSL objectives for self-supervised visual representation training. (a) Omniglot (composing SSL objectives with $L_{\rm FP}$ as MSE). We report the mean and its standard error from 5 random trials.

Objective                      Trained for     Test Accuracy
L_CL                           500 epochs      85.59 ± 0.05%
L_CL + L_IP                    500 epochs      85.90 ± 0.09%
L_FP                           20000 epochs    84.83 ± 0.07%
L_FP + 10 L_IP                 20000 epochs    84.96 ± 0.04%
L_CL + 10 L_FP                 9000 epochs     86.13 ± 0.21%
L_CL + 10 L_FP + L_IP          9000 epochs     86.17 ± 0.13%

F.2 DIFFERENT DEPLOYMENTS FOR CONTRASTIVE AND PREDICTIVE LEARNING OBJECTIVES

In the main text, for practical deployments, we suggest Contrastive Predictive Coding (CPC; Oord et al. (2018)) for $L_{\rm CL}$ and assume Gaussian distributions for the variational distributions in $L_{\rm FP}/L_{\rm IP}$. Many other practical deployments are possible, by using different mutual information approximations for $L_{\rm CL}$ and different distributional assumptions for $L_{\rm FP}/L_{\rm IP}$. In the following, we discuss a few examples.

Contrastive Learning. Other than CPC (Oord et al., 2018), another popular contrastive learning objective is JSD (Bachman et al., 2019), a lower bound on the Jensen-Shannon divergence between $P(Z_S, Z_X)$ and $P(Z_S)P(Z_X)$ (a variational bound of mutual information). Its objective can be written as
$$\max_{Z_S=F_S(S),\,Z_X=F_X(X),\,G}\ \mathbb{E}_{P(Z_S,Z_X)}\Big[-\mathrm{softplus}\big(-\langle G(z_x), G(z_s)\rangle\big)\Big] - \mathbb{E}_{P(Z_S)P(Z_X)}\Big[\mathrm{softplus}\big(\langle G(z_x), G(z_s)\rangle\big)\Big],$$
where $\mathrm{softplus}(x) = \log(1 + \exp(x))$.

Predictive Learning. The Gaussian distribution may be the simplest distributional form we can imagine, and it leads to a Mean Squared Error (MSE) reconstruction loss. Here, using forward predictive learning as an example, we discuss the case when $S$ lies in a discrete $\{0,1\}$ sample space. Specifically, we let $Q_\phi(S|Z_X)$ be a factorized multivariate Bernoulli:
$$\max_{Z_X=F_X(X),\,R}\ \mathbb{E}_{P(S,Z_X)}\Big[\sum_i s_i \log\, [R(z_x)]_i + (1-s_i)\log\big(1 - [R(z_x)]_i\big)\Big]. \tag{5}$$
This objective leads to a Binary Cross-Entropy (BCE) reconstruction loss. Since each reconstruction loss corresponds to a particular distributional form, if we do not commit to a specific variational distribution, we are free to choose an arbitrary reconstruction loss. For instance, by switching $s$ and $z_x$ in Eq. (5), the objective can be regarded as a Reverse Binary Cross-Entropy (RevBCE) reconstruction loss. In our experiments, we find RevBCE works best among {MSE, BCE, RevBCE}. Therefore, in the main text, we choose RevBCE as the example reconstruction loss for $L_{\rm FP}$.

More Experiments. We provide an additional set of experiments using {CPC, JSD} for $L_{\rm CL}$ and {MSE, BCE, RevBCE} reconstruction losses for $L_{\rm FP}$ in Figure 5. From the results, we find that different formulations of the objectives bring very different test generalization performance. We argue that, given a particular task, it is challenging but important to find the best deployments for the contrastive and predictive learning objectives.
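As an illustration, a batched PyTorch version of the JSD objective above might look as follows. This is a sketch under our assumptions: scores are inner products between projected features, the off-diagonal pairs of the batch serve as the product-of-marginals samples, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def jsd_contrastive_loss(gz_x, gz_s):
    """JSD-based contrastive loss for a batch of positive pairs.

    gz_x, gz_s: (batch, dim) tensors holding G(z_x) and G(z_s); row i of each
    forms a positive pair, so off-diagonal combinations act as negatives.
    Returns the negation of the JSD objective (to be minimized)."""
    scores = gz_x @ gz_s.t()                  # <G(z_x), G(z_s)> for all pairs
    pos = scores.diag()                       # samples from P(Z_S, Z_X)
    mask = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores[mask]                        # samples from P(Z_S) P(Z_X)
    return F.softplus(-pos).mean() + F.softplus(neg).mean()
```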
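Similarly, the three reconstruction losses discussed for $L_{\rm FP}$ can be sketched as below (again with our naming; `recon` stands for the decoder output $R(z_x)$, assumed to be passed through a sigmoid so its entries lie in $(0,1)$, and `s` is the self-supervised signal flattened to $[0,1]^d$):

```python
import torch

def fp_mse_loss(recon, s):
    """Gaussian Q(S|Z_X) -> mean squared error."""
    return ((recon - s) ** 2).mean()

def fp_bce_loss(recon, s, eps=1e-6):
    """Factorized Bernoulli Q(S|Z_X) -> binary cross-entropy, as in Eq. (5)."""
    recon = recon.clamp(eps, 1 - eps)         # guard the log terms
    return -(s * recon.log() + (1 - s) * (1 - recon).log()).mean()

def fp_rev_bce_loss(recon, s, eps=1e-6):
    """RevBCE: switch the roles of s and R(z_x) in the BCE above."""
    s = s.clamp(eps, 1 - eps)
    return -(recon * s.log() + (1 - recon) * (1 - s).log()).mean()
```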
F.3 DIFFERENT SELF-SUPERVISED SIGNAL CONSTRUCTION STRATEGIES

In the main text, we design a self-supervised signal construction strategy in which the input ($X$) and the self-supervised signal ($S$) differ in {drawing styles, image augmentations}. This strategy differs from the one commonly adopted in most self-supervised visual representation learning work (Tian et al., 2019; Bachman et al., 2019; Chen et al., 2020): there, the input and the self-supervised signal differ only in image augmentations. We provide additional experiments in Figure 6 to compare these two self-supervised signal construction strategies. We see that, compared to the common construction strategy (Tian et al., 2019; Bachman et al., 2019; Chen et al., 2020), the strategy introduced in our controlled experiments generalizes much better to the test set.

Figure 6: Comparisons of different self-supervised signal construction strategies. The differences between the input and the self-supervised signals are {drawing styles, image augmentations} for our construction strategy and only {image augmentations} for SimCLR's (Chen et al., 2020) strategy. We choose $L_{\rm CL}$ as our objective, reporting the mean and its standard error from 5 random trials.

It is worth noting that, although our construction strategy has access to the label information (i.e., we sample the self-supervised signal image from the same character as the input image), our SSL objectives do not train with the labels. Nonetheless, since we implicitly utilize the label information in our self-supervised signal construction strategy, it would be unfair to directly compare our strategy with the prior one. An interesting future research direction is to examine different self-supervised signal construction strategies, and even to combine full or partial label information into self-supervised learning.

G METRICS IN VISUAL-TEXTUAL REPRESENTATION LEARNING

Subset Accuracy ($A$) (Sorower), also known as the Exact Match Ratio (MR), ignores all partially correct outputs (treating them as incorrect) and extends accuracy from the single-label case to the multi-label setting:
$$A = \frac{1}{n}\sum_{i=1}^{n} 1_{[Y_i = H_i]},$$
where $Y_i$ is the ground-truth label set and $H_i$ is the predicted label set for the $i$-th example.

Micro AUC-ROC score (Fawcett, 2006) computes the AUC (area under the curve) of a receiver operating characteristic (ROC) curve, with the micro average computed globally over all (sample, label) pairs.
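Both metrics are available in standard tooling; a small sketch of how one might compute them with numpy/scikit-learn follows (hypothetical variable names: `Y` is the binary ground-truth indicator matrix, `H` the hard predictions, and `P` the predicted scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subset_accuracy(Y, H):
    """Exact Match Ratio: a prediction counts only if all labels match.
    Y, H: (n_samples, n_labels) binary arrays."""
    return float(np.mean(np.all(Y == H, axis=1)))

def micro_auc_roc(Y, P):
    """Micro-averaged AUC-ROC over the flattened label indicator matrix.
    P: (n_samples, n_labels) real-valued scores."""
    return roc_auc_score(Y, P, average="micro")
```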