Self-Supervised Learning Disentangled Group Representation as Feature

Tan Wang1 Zhongqi Yue1,3 Jianqiang Huang1,3 Qianru Sun2 Hanwang Zhang1
1Nanyang Technological University 2Singapore Management University 3Damo Academy, Alibaba Group
{tan317,yuez0003,hanwangzhang}@ntu.edu.sg jianqiang.jqh@gmail.com qianrusun@smu.edu.sg

A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics). In this paper, we formulate the notion of a good representation from a group-theoretic view using Higgins' definition of disentangled representation [40], and show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization, and is thus unable to modularize the remaining semantics. To break this limitation, we propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM), which successfully grounds the abstract semantics and the group acting on them into concrete contrastive learning. At each iteration, IP-IRM first partitions the training samples into two subsets that correspond to an entangled group element. Then, it minimizes a subset-invariant contrastive loss, where the invariance guarantees to disentangle the group element. We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks. Codes are available at https://github.com/Wangt-CN/IP-IRM.

1 Introduction

Figure 1: Disentangled representation is an equivariant map between the semantic space U and the vector space X, which is decomposed into "color" and "digit".

Deep learning is all about learning feature representations [5]. Compared to conventional end-to-end supervised learning, Self-Supervised Learning (SSL) first learns a generic feature representation (e.g., a network backbone) by training with unsupervised pretext tasks such as the prevailing contrastive objective [36, 16], and then this stage-1 feature is expected to serve various stage-2 applications with proper fine-tuning. SSL for visual representation is fascinating because, for the first time, we can obtain good visual features for free, just like the trending pre-training in the NLP community [26, 8]. However, most SSL works only care about how much stage-2 performance an SSL feature can improve, and overlook what feature SSL is learning, why it can be learned, what cannot be learned, what the gap between SSL and Supervised Learning (SL) is, and when SSL can surpass SL.

The crux of answering these questions is to formally understand what a feature representation is and what a good one is. We postulate the classic world model of visual generation and feature representation [1, 69] as in Figure 1. Let U be a set of (unseen) semantics, e.g., attributes such as "digit" and "color".

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Figure 2: (a) The heat map visualizes feature dimensions related to augmentations (aug. related) and unrelated to augmentations (aug. unrelated), whose respective classification accuracy is shown in the bar chart below. The dashed bars denote the accuracy using the full feature dimensions. The experiment was performed on STL10 [22] with representations learnt with SimCLR [16] and our IP-IRM. (b) Visualization of the CNN activations [77] of 4 filters on layers 29 and 18 of VGG [75] trained on ImageNet100 [81]. The filters were chosen by first clustering the aug. unrelated filters with k-means (k = 4) and then selecting the filters corresponding to the cluster centers.
There is a set of independent and causal mechanisms [66] ϕ : U → I generating images from semantics, e.g., writing a digit "0" when thinking of "0" [74]. A visual representation is the inference process φ : I → X that maps image pixels to vector-space features, e.g., a neural network. We define the semantic representation as the functional composition f : U → I → X. In this paper, we are only interested in the parameterization of the inference process for feature extraction, not the generation process, i.e., we assume that for each image sample I ∈ I there is a u ∈ U such that I = ϕ(u) is fixed as its observation. Therefore, we consider semantic and visual representations the same as feature representation, or simply representation, and we slightly abuse φ(I) := f(ϕ⁻¹(I)), i.e., φ and f share the same trainable parameters. We call the vector x = φ(I) the feature, where x ∈ X.

We propose to use Higgins' definition of disentangled representation [40] to define what is "good".

Definition 1. (Disentangled Representation) Let G be the group acting on U, i.e., g ∘ u ∈ U transforms u ∈ U, e.g., a "turn green" group element changes the semantic from "red" to "green". Suppose there is a direct product decomposition1 G = g1 × · · · × gm and U = U1 × · · · × Um, where gi acts on Ui respectively. A feature representation is disentangled if there exists a group G acting on X such that:

1. Equivariant: ∀g ∈ G, ∀u ∈ U, f(g ∘ u) = g ∘ f(u), e.g., the feature of the changed semantic, "red" to "green" in U, is equivalent to directly changing the color vector in X from "red" to "green".

2. Decomposable: there is a decomposition X = X1 × · · · × Xm, such that each Xi is fixed by the actions of all gj, j ≠ i, and affected only by gi, e.g., changing the "color" semantic in U does not affect the "digit" vector in X.

Compared to the previous definition of feature representation, which is a static mapping, the disentangled representation in Definition 1 is dynamic, as it explicitly incorporates group representation [35], which is a homomorphism from the group to its group actions on a space, e.g., G × X → X, and it is common to use the feature space X as a shorthand; this is where our title stands.

Definition 1 defines good features in the common views: 1) Robustness: a good feature should be invariant to the change of environmental semantics, such as external interventions [45, 87] or domain shifts [32]. By the above definition, such a change is always retained in a subspace Xi, while the others are not affected. Hence, the subsequent classifier will focus on the invariant features and ignore the ever-changing Xi. 2) Zero-shot Generalization: even if a new combination of semantics is unseen in training, each semantic has been learned as a feature. So, the metrics of each Xi trained on seen samples remain valid for unseen samples [95].

Are the existing SSL methods learning disentangled representations? No. We show in Section 4 that they can only disentangle representations according to the hand-crafted augmentations, e.g., color jitter and rotation.
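Before turning to the empirical evidence in Figure 2, the two conditions of Definition 1 can be made concrete with a toy sketch of our own (a made-up two-factor semantic space, not part of the paper): the feature is a color block concatenated with a digit block, and a "toggle color" generator acts only on the color block.

```python
import numpy as np

# Toy semantic space U = colors x digits; feature space X = one-hot(color) (+) one-hot(digit).
COLORS, DIGITS = 2, 10                     # e.g., {red, green} x {0, ..., 9}

def f(u):
    """A disentangled representation: two independent one-hot blocks."""
    color, digit = u
    x = np.zeros(COLORS + DIGITS)
    x[color] = 1.0
    x[COLORS + digit] = 1.0
    return x

def g_color(u):
    """Group generator acting on U: toggle the color semantic."""
    return ((u[0] + 1) % COLORS, u[1])

def g_color_on_X(x):
    """The same generator acting on X: permute the color block, leave the digit block alone."""
    x = x.copy()
    x[:COLORS] = np.roll(x[:COLORS], 1)
    return x

u = (0, 7)                                                 # a red "7"
assert np.allclose(f(g_color(u)), g_color_on_X(f(u)))      # Equivariant: f(g o u) = g o f(u)
assert np.allclose(f(g_color(u))[COLORS:], f(u)[COLORS:])  # Decomposable: digit block is untouched
```

An entangled feature extractor would break the second assertion: toggling the color would also perturb the digit block, which is exactly what Section 4 argues happens to the non-augmentation semantics in standard SSL.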
For example, in Figure 2 (a), even if we only use the augmentation-related feature, 1Note that gi can also denote a cyclic subgroup Gi such as rotation [0 : 1 : 360 ], or a countable one but treated as cyclic such as translation [(0, 0) : (1, 1) : (width, height)] and color [0 : 1 : 255]. the classification accuracy of a standard SSL (Sim CLR [16]) does not lose much as compared to the full feature use. Figure 2 (b) visualizes that the CNN features in each layer are indeed entangled (e.g., tyre, motor, and background in the motorcycle image). In contrast, our approach IP-IRM, to be introduced below, disentangles more useful features beyond augmentations. In this paper, we propose Iterative Partition-based Invariant Risk Minimization (IP-IRM [ ai"p@:m]) that guarantees to learn disentangled representations in an SSL fashion. We present the algorithm in Section 3, followed by the theoretical justifications in Section 4. In a nutshell, at each iteration, IP-IRM first partitions the training data into two disjoint subsets, each of which is an orbit of the already disentangled group, and the cross-orbit group corresponds to an entangled group element gi. Then, we adopt the Invariant Risk Minimization (IRM) [2] to implement a partition-based SSL, which disentangles the representation Xi w.r.t. gi. Iterating the above two steps eventually converges to a fully disentangled representation w.r.t. Qm i=1 gi. In Section 5, we show promising experimental results on various feature disentanglement and SSL benchmarks. 2 Related Work Self-Supervised Learning. SSL aims to learn representations from unlabeled data with hand-crafted pretext tasks [28, 63, 33]. Recently, Contrastive learning [65, 61, 38, 80, 16] prevails in most state-ofthe-art methods. The key is to map positive samples closer, while pushing apart negative ones in the feature space. Specifically, the positive samples are from the augmented views [82, 3, 94, 42] of each instance and the negative ones are other instances. Along this direction, follow-up methods are mainly four-fold: 1) Memory-bank [90, 61, 36, 18]: storing the prototypes of all the instances computed previously into a memory bank to benefit from a large number of negative samples. 2) Using siamese network [7] to avoid representation collapse [34, 19, 83]. 3) Assigning clusters to samples to integrate inter-instance similarity into contrastive learning [11, 12, 13, 88, 56]. 4) Seeking hard negative samples with adversarial training or better sampling strategies [73, 20, 44, 48]. In contrast, our proposed IP-IRM jumps out of the above frame and introduces the disentangled representation into SSL with group theory to show the limitations of existing SSL and how to break through them. Disentangled Representation. This notion dates back to [4], and henceforward becomes a highlevel goal of separating the factors of variations in the data [84, 79, 86, 58]. Several works aim to provide a more precise description [27, 29, 72] by adopting an information-theoretic view [17, 27] and measuring the properties of a disentangled representation explicitly [29, 72]. We adopt the recent group-theoretic definition from Higgins et al. [40], which not only unifies the existing, but also resolves the previous controversial points [78, 59]. Although supervised learning of disentangled representation is a well-studied field [100, 43, 10, 70, 49], unsupervised disentanglement based on GAN [17, 64, 57, 71] or VAE [39, 15, 99, 50] is still believed to be theoretically challenging [59]. 
Thanks to the Higgins definition, we prove that the proposed IP-IRM converges with full-semantic disentanglement using group representation theory. Notably, IP-IRM learns a disentangled representation with an inference process, without using generative models as all the existing unsupervised methods do, making IP-IRM applicable even on large-scale datasets.

Group Representation Learning. A group representation has two elements [47, 35]: 1) a homomorphism (e.g., a mapping function) from the group to its group actions acting on a vector space, and 2) the vector space. Usually, when there is no ambiguity, we can use either element as the definition. Most existing works focus on learning the first element: they first define the group of interest, such as spherical rotations [24] or image scaling [89, 76], and then learn the parameters of the group actions [23, 46, 68]. In contrast, we focus on the second element; more specifically, we are interested in learning a map between two vector spaces: the image pixel space and the feature vector space. Our representation learning is flexible because it delays the group action learning to downstream tasks on demand. For example, in a classification task, a classifier can be seen as a group action that is invariant to class-agnostic groups but equivariant to class-specific groups (see Section 4).

3 IP-IRM Algorithm

Notations. Our goal is to learn the feature extractor φ in a self-supervised fashion. We define a partition matrix P ∈ {0, 1}^{N×2} that partitions the N training images into 2 disjoint subsets: P_{i,k} = 1 if the i-th image belongs to the k-th subset and 0 otherwise. Suppose we have a pretext-task loss function L(φ, θ = 1, k, P) defined on the samples in the k-th subset, where θ = 1 is a dummy parameter used to evaluate the invariance of the SSL loss across the subsets (discussed later in Step 1). For example, L can be defined as:

L(\phi, \theta = 1, k, P) = \sum_{x \in X_k} -\log \frac{\exp(x^\top x' \cdot \theta)}{\sum_{x^* \in (X_k \cup X') \setminus \{x\}} \exp(x^\top x^* \cdot \theta)},   (1)

where X_k = φ({I_i | P_{i,k} = 1}) and x' ∈ X' is the augmented-view feature of x ∈ X_k.

Input. N training images. Randomly initialized φ. A partition matrix P initialized such that the first column of P is all ones, i.e., all samples belong to the first subset. Set P = {P}.

Output. Disentangled feature extractor φ.

Step 1 [Update φ]. We update φ by:

\min_{\phi} \sum_{P \in \mathcal{P}} \sum_{k=1}^{2} \Big[ L(\phi, \theta = 1, k, P) + \lambda_1 \| \nabla_{\theta=1} L(\phi, \theta = 1, k, P) \|^2 \Big],   (2)

where λ1 is a hyper-parameter. The second term delineates how far the contrast in one subset is from the constant baseline θ = 1. Minimizing both terms encourages the contrast computed on different subsets to stay close to the same baseline, i.e., to be invariant across the subsets. See IRM [2] for more details. In particular, the first iteration corresponds to standard SSL, with X_1 in Eq. (1) containing all training images.

Step 2 [Update P]. We fix φ and find a new partition P* by:

P^* = \arg\max_{P} \sum_{k=1}^{2} \Big[ L(\phi, \theta = 1, k, P) + \lambda_2 \| \nabla_{\theta=1} L(\phi, \theta = 1, k, P) \|^2 \Big],   (3)

where λ2 is a hyper-parameter. In practice, we use a continuous partition matrix in R^{N×2} during optimization and then threshold it to {0, 1}^{N×2}. We update the partition set P ← P ∪ {P*} and iterate the above two steps until convergence.
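To make Step 1 concrete, below is a minimal PyTorch-style sketch of the objective in Eq. (2) for a single partition: an InfoNCE-style loss per subset with the dummy scalar θ = 1, plus the squared gradient w.r.t. θ as the IRM penalty. This is our simplified illustration (in-batch negatives within each subset, no projection-head or momentum details), not the authors' released implementation; the function names are ours.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_theta(z1, z2, theta, temperature=0.5):
    """InfoNCE-style loss of Eq. (1) on one subset; the dummy scalar theta
    multiplies the logits so that a gradient w.r.t. theta is defined."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = (z1 @ z2.t()) / temperature * theta          # view-1 vs. view-2 similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)

def step1_loss(z1, z2, subset_idx, lam1=0.5):
    """Eq. (2) for one partition: sum over the two subsets of
    [contrastive loss + lam1 * || d loss / d theta ||^2]."""
    total = z1.new_zeros(())
    for k in (0, 1):
        mask = subset_idx == k                             # images assigned to subset k
        if mask.sum() < 2:
            continue
        theta = torch.ones(1, device=z1.device, requires_grad=True)
        loss_k = contrastive_loss_with_theta(z1[mask], z2[mask], theta)
        grad = torch.autograd.grad(loss_k, theta, create_graph=True)[0]
        total = total + loss_k + lam1 * grad.pow(2).sum()
    return total  # backpropagate this into the encoder that produced z1, z2
```

In the full algorithm this term is summed over every partition kept in the set P, and the ablation in Section 5.2 finds values of λ1 and λ2 around 0.2 to 0.5 to work best.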
4 Justification

Recall that IP-IRM uses training sample partitions to learn a representation disentangled w.r.t. ∏_{i=1}^{m} gi. As we have a G-equivariant feature map between the sample space I and the feature space X (the equivariance is later guaranteed by Lemma 1), we slightly abuse the notation by using X to denote both spaces. Also, we assume that X is a homogeneous space of G, i.e., any sample x ∈ X can be transited from another sample x′ by a group action g ∘ x′. Intuitively, G is all you need to describe the diversity of the training set. It is worth noting that g denotes any group element in G, while gi is a Cartesian building block of G, e.g., g can be decomposed as (g1, g2, . . . , gm).

We show that partition and group are tightly connected by the concept of orbit. Given a sample x ∈ X, its group orbit w.r.t. G is the sample set G(x) = {g ∘ x | g ∈ G}. As shown in Figure 3 (a), if G is a set of attributes shared by classes, e.g., "color" and "pose", the orbit is the sample set of the class of x; in Figure 3 (b), if G denotes augmentations, the orbit is the set of augmented images. In particular, we can see that the disjoint orbits in Figure 3 naturally form a partition. Formally, we have the following definition:

Definition 2. (Orbit & Partition [47]) Given a subgroup D ⊂ G, it partitions X into the disjoint subsets {D(c1 ∘ x), . . . , D(ck ∘ x)}, where k is the number of cosets {c1D, . . . , ckD}, and the cosets form a factor group1 G/D = {ci}_{i=1}^{k}. In particular, ci ∘ x can be considered as a sample of the i-th class, transited from any sample x ∈ X.

1Given G = D × K with K = c1 × · · · × ck, then D̄ = {(d, e) | d ∈ D} is a normal subgroup of G, and G/D̄ is isomorphic to K [47]. We write G/D = {ci}_{i=1}^{k} with slight abuse of notation.

Figure 3: (a) Supervised Learning, (b) Self-Supervised Learning, (c) our IP-IRM; each orbit is illustrated with only 5 samples. (a) Orbit: the training samples of a class; D in-orbit actions: intra-class variations (1→2: standing, 2→3: blacken, 3→4: jumping, 4→5: whiten, 5→1: running); G/D cross-orbit actions: inter-class variations. (b) Orbit: a sample and its augmented samples; D in-orbit actions: augmentations (1→2: clockwise rotation, 2→3: color jitter, 3→4: gray scale, 4→5: counter-clockwise rotation, 5→1: color); G/D cross-orbit actions: inter-sample variations. (c) Step 2 in IP-IRM discovers 2 orbits, where the cross-orbit action corresponds to a group action "green to red" or "red to green", which is yet to be disentangled.

Interestingly, the partition offers a new perspective on the training data format in Supervised Learning (SL) and Self-Supervised Learning (SSL). In SL, as shown in Figure 3 (a), the data is labeled with k classes, each of which is an orbit D(ci ∘ x) of training samples, whose variations are depicted by the class-sharing attribute group D. The cross-orbit group action, e.g., c_dog ∘ x, can be read as "turn x into a dog", and such a turn is always valid due to the assumption that X is a homogeneous space of G. In SSL, as shown in Figure 3 (b), each training sample x is augmented by the group D. So, D(ci ∘ x) consists of all the augmentations of the i-th sample, where the cross-orbit group action ci ∘ x can be read as "turn x into the i-th sample".

Thanks to the orbit and partition view of training data, we are ready to revisit model generalization in a group-theoretic view by using invariance and equivariance, the two sides of the coin whose name is disentanglement. For SL, we expect that a good feature is disentangled into a class-agnostic part and a class-specific part: the former (latter) is invariant (equivariant) to the G/D cross-orbit traverse, but equivariant (invariant) to the D in-orbit traverse. By using such a feature, a model can generalize to diverse testing samples (limited to |D| variations) by only keeping the class-specific feature.
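To ground the orbit-and-partition view (Definition 2 and Figure 3), here is a small self-contained sketch of our own, with a deliberately tiny made-up group: G is the product of a color factor and a rotation factor acting on toy samples, and choosing D to be the rotation subgroup makes the D-orbits partition the sample set into the two color classes, i.e., the cosets of D.

```python
from itertools import product

# Toy homogeneous space: identify each "sample" with its semantic tuple (color, rotation).
colors = ["red", "green"]             # acted on by a color factor (cyclic of order 2)
rotations = [0, 90, 180, 270]         # acted on by a rotation factor (cyclic of order 4)
X = list(product(colors, rotations))  # 8 samples; G = C2 x C4 acts transitively on X

def act(g, x):
    """Action of a group element g = (dc, dr) on a sample x = (color, rotation)."""
    dc, dr = g
    color, rot = x
    return (colors[(colors.index(color) + dc) % len(colors)], (rot + dr) % 360)

# Subgroup D = {rotations only}: its orbits partition X into disjoint subsets (Definition 2).
D = [(0, dr) for dr in rotations]
partition = {}
for x in X:
    orbit = frozenset(act(d, x) for d in D)
    partition.setdefault(orbit, []).append(x)

for i, orbit in enumerate(partition):
    print(f"subset {i}: {sorted(orbit)}")
# Two subsets, one per color: the cross-orbit action ("red to green") is exactly the kind of
# entangled group element that Step 2 of IP-IRM tries to expose as a 2-subset partition.
```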
Formally, we prove that we can achieve such disentanglement by contrastive learning:

Lemma 1. (Disentanglement by Contrastive Learning) The training loss -\log \big( \exp(x_i^\top x_j) / \sum_{x \in X} \exp(x_j^\top x) \big) disentangles X w.r.t. (G/D) × D, where xi and xj are from the same orbit.

We can draw the following interesting corollaries from Lemma 1 (details in Appendix):

1. If we use all the samples in the denominator of the loss, we can approximate G-equivariant features given limited training samples. This is because the loss minimization guarantees that ∀(xi, xj) ∈ X × X, i ≠ j ⇒ xi ≠ xj, i.e., any pair corresponds to a group action.

2. The conventional cross-entropy loss in SL is a special case, if we define x ∈ X = {x1, . . . , xk} as the k classifier weights. So, SL does not guarantee the disentanglement of G/D, which causes a generalization error if the class domain of the downstream task is different from that of SL pre-training, e.g., a subset of G/D.

3. In contrastive-learning-based SSL, D = augmentations (recall Figure 2), and the number of augmentations |Daug| is generally much smaller than the class-wise sample diversity |DSL| in SL. This enables the SL model to generalize to more diverse testing samples (|DSL|) by filtering out the class-agnostic features (e.g., background) and focusing on the class-specific ones (e.g., foreground), which explains why SSL is worse than SL in downstream classification.

4. In SL, if the number of training samples per orbit is not enough, i.e., smaller than |D(ci ∘ x)|, the disentanglement between D and G/D cannot be guaranteed, which underlies the challenges in few-shot learning [96]. Fortunately, in SSL, the number is always enough as we include all the augmented samples in training. Moreover, we conjecture that Daug only contains simple cyclic group elements such as rotation and colorization, which are easier for representation learning.

Lemma 1 does not guarantee the decomposability of each d ∈ D. Nonetheless, the downstream model can still generalize by keeping the class-specific features affected by G/D. Therefore, the key to filling the gap, or even letting SSL surpass SL, is to achieve the full disentanglement of G/Daug.

Method | DCI | IRS | MOD | EXP | LR | GBT | Average
CMNIST [2]
VAE [51] | 0.948 ± 0.004 | - | 0.664 ± 0.121 | 0.968 ± 0.007 | 0.824 ± 0.019 | 0.948 ± 0.004 | 0.849 ± 0.057
β-VAE [41] | 0.945 ± 0.002 | - | 0.705 ± 0.073 | 0.963 ± 0.006 | 0.809 ± 0.013 | 0.945 ± 0.003 | 0.874 ± 0.015
β-AnnealVAE [9] | 0.911 ± 0.002 | - | 0.790 ± 0.075 | 0.965 ± 0.007 | 0.821 ± 0.022 | 0.911 ± 0.002 | 0.880 ± 0.016
β-TCVAE [15] | 0.914 ± 0.008 | - | 0.864 ± 0.095 | 0.962 ± 0.010 | 0.801 ± 0.024 | 0.914 ± 0.008 | 0.891 ± 0.014
Factor-VAE [50] | 0.916 ± 0.004 | - | 0.893 ± 0.056 | 0.947 ± 0.011 | 0.770 ± 0.025 | 0.916 ± 0.005 | 0.888 ± 0.014
SimCLR [16] | 0.882 ± 0.019 | - | 0.767 ± 0.025 | 0.976 ± 0.011 | 0.863 ± 0.036 | 0.876 ± 0.015 | 0.873 ± 0.016
IP-IRM (Ours) | 0.917 ± 0.008 | - | 0.785 ± 0.031 | 0.990 ± 0.002 | 0.921 ± 0.009 | 0.916 ± 0.007 | 0.906 ± 0.011
Shapes3D [50]
VAE [51] | 0.351 ± 0.026 | 0.284 ± 0.009 | 0.820 ± 0.015 | 0.802 ± 0.054 | 0.421 ± 0.079 | 0.352 ± 0.027 | 0.505 ± 0.028
β-VAE [41] | 0.369 ± 0.021 | 0.283 ± 0.012 | 0.782 ± 0.034 | 0.807 ± 0.018 | 0.427 ± 0.025 | 0.368 ± 0.023 | 0.506 ± 0.011
β-AnnealVAE [9] | 0.327 ± 0.069 | 0.412 ± 0.049 | 0.743 ± 0.070 | 0.643 ± 0.013 | 0.259 ± 0.021 | 0.328 ± 0.070 | 0.452 ± 0.023
β-TCVAE [15] | 0.470 ± 0.035 | 0.291 ± 0.023 | 0.777 ± 0.031 | 0.821 ± 0.054 | 0.439 ± 0.084 | 0.469 ± 0.034 | 0.545 ± 0.032
Factor-VAE [50] | 0.340 ± 0.021 | 0.316 ± 0.016 | 0.815 ± 0.041 | 0.738 ± 0.043 | 0.319 ± 0.045 | 0.339 ± 0.021 | 0.478 ± 0.020
SimCLR [16] | 0.535 ± 0.016 | 0.439 ± 0.030 | 0.678 ± 0.050 | 0.949 ± 0.005 | 0.733 ± 0.055 | 0.536 ± 0.015 | 0.645 ± 0.026
IP-IRM (Ours) | 0.565 ± 0.023 | 0.420 ± 0.014 | 0.766 ± 0.036 | 0.959 ± 0.007 | 0.757 ± 0.025 | 0.565 ± 0.023 | 0.672 ± 0.017

Table 1: Results on disentanglement metrics of existing unsupervised disentanglement methods, standard SSL (SimCLR [16]), and our IP-IRM, using CMNIST [2] (top block) and Shapes3D [50] (bottom block). Note that IRS is based on intervening on the semantics, which requires access to the labels of all the semantics, and is hence not applicable to the CMNIST dataset. Results are averaged over 4 trials (mean ± std).
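For reference, the LR and GBT columns of Table 1 are readout probes on the frozen features; a minimal sketch of the linear probe under our own assumptions (an 80/20 split and default scikit-learn settings; the exact metric implementations follow the cited papers and the appendix) looks as follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_readout_accuracy(features, semantic_labels):
    """Average accuracy of predicting each ground-truth semantic from frozen
    features with a linear probe, in the spirit of the LR column of Table 1."""
    accs = []
    for j in range(semantic_labels.shape[1]):        # one column per semantic, e.g. digit, color
        x_tr, x_te, y_tr, y_te = train_test_split(
            features, semantic_labels[:, j], test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        accs.append(clf.score(x_te, y_te))
    return float(np.mean(accs))
```

The GBT column swaps the logistic regression for a gradient-boosted-tree classifier; DCI, EXP, MOD and IRS follow the formulas of the cited papers, as described in Section 5.1.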
Theorem 1. The representation is fully disentangled w.r.t. G/Daug if and only if, for every ci ∈ G/Daug, the contrastive loss in Eq. (1) is invariant across the 2 orbits of the partition {G′(ci ∘ x), G′(ci⁻¹ ∘ x)}, where G′ = G/ci = Daug × c1 × · · · × ci−1 × ci+1 × · · · × ck.

The maximization in Step 2 is based on the contrapositive of the sufficient condition of Theorem 1. Denote the currently disentangled group as D (initially Daug). If we can find a partition P* that maximizes the loss in Eq. (3), i.e., the SSL loss is variant across the orbits, then there exists h ∈ G/D such that the representation is entangled w.r.t. h, i.e., P* = {D(h ∘ x), D(h⁻¹ ∘ x)}. Figure 3 (c) illustrates a discovered partition about color. The minimization in Step 1 is based on the necessary condition of Theorem 1: given the discovered P*, if we minimize Eq. (2), we can further disentangle h and update D ← D × h. Overall, IP-IRM converges as G/Daug is finite. Note that an improved contrastive objective [92] can further disentangle each d ∈ Daug and achieve full disentanglement w.r.t. G.

5 Experiments

5.1 Unsupervised Disentanglement

Datasets. We used two datasets. CMNIST [2] has 60,000 digit images with semantic labels of digits (0-9) and colors (red and green). These images differ in other semantics (e.g., slant and font) that are not labeled. Moreover, there is a strong correlation between digits and colors (most 0-4 in red and 5-9 in green), increasing the difficulty of disentangling them. Shapes3D [50] contains 480,000 images with 6 labelled semantics, i.e., size, type, azimuth, as well as floor, wall and object color. Note that we only considered the first three semantics for evaluation, as the standard augmentations in SSL would contaminate any color-related semantics.

Settings. We adopted 6 representative disentanglement metrics: Disentangle Metric for Informativeness (DCI) [29], Interventional Robustness Score (IRS) [79], Explicitness Score (EXP) [72], Modularity Score (MOD) [72], and the accuracy of predicting the ground-truth semantic labels with two classification models, logistic regression (LR) and gradient boosted trees (GBT) [59]. Specifically, DCI and EXP measure explicitness, i.e., whether the values of the semantics can be decoded from the feature using a linear transformation. MOD and IRS measure modularity, i.e., whether each feature dimension is equivariant to the shift of a single semantic. See Appendix for the detailed formulas of the metrics. In evaluation, we trained CNN-based feature extractor backbones with a comparable number of parameters for all the baselines and our IP-IRM. The full implementation details are in Appendix.

Results. In Table 1, we compare the proposed IP-IRM to the standard SSL method SimCLR [16] as well as several generative disentanglement methods [51, 41, 9, 15, 50]. On both the CMNIST and Shapes3D datasets, IP-IRM outperforms SimCLR on all metrics except IRS, where the most relative gain is 8.8%, on MOD. For this MOD metric, we notice that VAE performs better than

Figure 4: The t-SNE [85] visualizations of learned feature spaces using SimCLR [16] and IP-IRM on CMNIST [2] and STL10 [22]. For CMNIST in (a), we annotate the digit and color near each cluster. We annotate only half of the feature points for SimCLR to avoid clutter.
For STL10 in (b), we show the labels of the classes. Partition #1 Partition #2 Partition #3 CMNIST Shapes3D Color Digit Size Type Azimuth #2 #1 Subset Figure 5: (a) Visualization of the obtained partitions P during training. Each partition has two subset and the displayed images are randomly sampled from each subset. (b) Visualization of the variance of each feature dimension when perturbing the semantic indicated on the left. The most equivariant dimensions are indicated by triangles and their corresponding indices. our IP-IRM by 6 points, i.e., 0.82 v.s. 0.76 for Shapes3D. This is because VAE explicitly pursues a high modularity score through regularizing the dimension-wise independence in the feature space. However, this regularization is adversarial to discriminative objectives [14, 95]. Indeed, we can observe from the column of LR (i.e., the performance of downstream linear classification) that VAE methods have clearly poor performance especially on the more challenging dataset Shapes3D. We can draw the same conclusion from the results of GBT. Different from VAE methods, our IP-IRM is optimized towards disentanglement without such regularization, and is thus able to outperform the others in downstream tasks while obtaining a competitive value of modularity. What do IP-IRM features look like? Figure 4 visualizes the features learned by Sim CLR and our IP-IRM on two datasets: CMNIST in Figure 4 (a) and STL10 dataset in Figure 4 (b). In the following, we use Figure 4 (a) as the example, and can easily draw the similar conclusions from Figure 4 (b). On the left-hand side of Figure 4 (a), it is obvious that there is no clear boundary to distinguish the semantic of color in the Sim CLR feature space. Besides, the features of the same digit semantic are scattered in two regions. On the right-hand side of (a), we have 3 observations for IP-IRM. 1) The features are well clustered and each cluster corresponds to a specific semantic of either digit or color. This validates the equivariant property of IP-IRM representation that it responds to any changes of the existing semantics, e.g., digit and color on this dataset. 2) The feature space has the symmetrical structure for each individual semantic, validating the decomposable property of IP-IRM representation. More specifically, i) mirroring a feature (w.r.t. * in the figure center) indicates the change on the only semantic of color, regardless of the other semantic (digit); and ii) a counterclockwise rotation (denoted by black arrows from same-colored 1 to 7) indicates the change on the only semantic of digit. 3) IP-IRM reveals the true distribution (similarity) of different classes. For example, digits 3, 5, 8 sharing sub-parts (curved bottoms and turnings) have closer feature points in the IP-IRM feature space. How does IP-IRM disentangle features? 1) Discovered P : To visualize the discovered partitions P at each maximization step, we performed an experiment on a binary CMNIST (digit 0 and 1 in color red and green), and show the results in Figure 5 (a). Please kindly refer to Appendix for the full results on CMNIST. First, each partition tells apart a specific semantic into two subsets, e.g., in Partition #1, red and green digits are separated. Second, besides the obvious semantics digit and color (labelled on the dataset), we can discover new semantics, e.g., the digit slant shown in Partition #3. 
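These partitions are produced by the Step 2 maximization in Eq. (3). As a rough sketch of how such a partition can be searched on frozen features and then thresholded for this kind of inspection (our simplification: soft assignments optimized by gradient ascent with plain Adam; the authors' exact parameterization and schedule are in their appendix and code):

```python
import torch
import torch.nn.functional as F

def per_sample_nce(z1, z2, theta, temperature=0.5):
    """Per-sample InfoNCE terms of Eq. (1), scaled by the dummy scalar theta."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = (z1 @ z2.t()) / temperature * theta
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels, reduction="none")

def find_partition(z1, z2, steps=200, lr=0.1, lam2=0.5):
    """Step 2 (Eq. 3): search a 2-way partition of N images that MAXIMIZES the
    IRM-penalized contrastive loss on frozen features, then threshold it."""
    z1, z2 = z1.detach(), z2.detach()                  # features are fixed during Step 2
    logits = torch.zeros(z1.size(0), 2, device=z1.device, requires_grad=True)  # continuous partition in R^{N x 2}
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        assign = F.softmax(logits, dim=1)              # soft membership of each image
        objective = 0.0
        for k in (0, 1):
            theta = torch.ones(1, device=z1.device, requires_grad=True)
            # softly restrict the loss to subset k (a stand-in for evaluating Eq. (1) on X_k)
            loss_k = (assign[:, k] * per_sample_nce(z1, z2, theta)).sum() / assign[:, k].sum()
            grad = torch.autograd.grad(loss_k, theta, create_graph=True)[0]
            objective = objective + loss_k + lam2 * grad.pow(2).sum()
        opt.zero_grad()
        (-objective).backward()                        # gradient ascent on Eq. (3)
        opt.step()
    return logits.detach().argmax(dim=1)               # hard subset index per image, as shown in Figure 5 (a)
```

In IP-IRM proper, the hard partition returned here is added to the set P and Step 1 resumes with the subset-invariance penalty applied to it.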
2) Disentangled Representation: In Figure 5 (b), we aim to visualize how equivariant each feature dimension is to the change of each semantic, i.e., a darker color shows that a dimension is more equivariant w.r.t. the semantic indicated on the left. We can see that Sim CLR fails to learn the decomposable representation, e.g., the 8-th dimension captures azimuth, type and size in Shapes3D. In contrast, our IP-IRM achieves disentanglement by representing the semantics into interpretable dimensions, e.g., the 6-th and 7-th dimensions captures the size, the 4-th for type and the 2-nd and 9-th for azimuth on the Shapes3D. Overall, the results support the justification in Section 4, i.e., we discover a new semantic (affected by h) through the partition P at each iteration and IP-IRM eventually converges with a disentangled representation. 5.2 Self-Supervised Learning Datasets and Settings. We conducted the SSL evaluations on 2 standard benchmarks following [88, 20, 48]. Cifar100 [54] contains 60,000 images in 100 classes and STL10 [22] has 113,000 images in 10 classes. We used Sim CLR [16], DCL [20] and HCL [48] as baselines, and learned the representations for 400 and 1000 epochs. We evaluated both linear and k-NN (k = 200) accuracies for the downstream classification task. Implementation details are in appendix. Method STL10 Cifar100 k-NN Linear k-NN Linear 400 epoch training Sim CLR [16] 73.60 78.89 54.94 66.63 DCL [20] 78.82 82.56 57.29 68.59 HCL [48] 80.06 87.60 59.61 69.22 Sim CLR+IP-IRM 79.66 84.44 59.10 69.55 DCL+IP-IRM 81.51 85.36 58.37 68.76 HCL+IP-IRM 84.29 87.81 60.05 69.95 1,000 epoch training Sim CLR [16] 78.60 84.24 59.45 68.73 Sim CLR [55] 79.80 85.56 63.67 72.18 Sim CLR +IP-IRM 85.08 89.91 65.82 73.99 Supervised - - - 73.72 Supervised +Mix Up [97] - - - 74.19 Table 2: Accuracy (%) of k-NN and linear classifiers on STL10 [22] and Cifar100 [54] using the representations of Sim CLR [16], DCL [20], HCL [48] and those after incorporating our IPIRM. Sim CLR denotes Sim CLR with Mix Up regularization. Supervised represents the supervised learning that keeps the same codebase, optimizer and parameters with SSL stage-2 fine-tuning while only adds the learning rate decay at 60 and 80 epoch. Results. We demonstrate our results and compare with baselines in Table 2. Incorporating IP-IRM to the 3 baselines brings consistent performance boosts to downstream classification models in all settings, e.g., improving the linear models by 5.55% on STL10 and 2.92% on Cifar100. In particular, we observe that IP-IRM brings huge performance gain with k-NN classifiers, e.g., 4.23% using HCL+IP-IRM on STL10, i.e., the distance metrics in the IP-IRM feature space more faithfully reflects the class semantic differences. This validates that our algorithm further disentangles compared to the standard SSL Moreover, by extending the training process to 1,000 epochs with Mix Up [55], Sim CLR+IP-IRM achieves further performance boost on both datasets, e.g., 5.28% for k-NN and 4.35% for linear classifier over Sim CLR baseline on STL10 dataset. Notably, our Sim CLR+IP-IRM surpasses vanilla supervised learning on Cifar100 under the same evaluation setting. Still, the quality of disentanglement cannot be fully evaluated when the training and test samples are identically distributed while the improved accuracy demonstrates that IP-IRM representation is more equivariant to class semantics, it does not reveal if the representation is decomposable. 
Hence we present an out-of-distribution (OOD) setting in Section 5.3 to further show this property. Is IP-IRM sensitive to the values of hyper-parameters? 1) λ1 and λ2 in Eq. (2) and Eq. (3). In Figure 6 (a), we observe that the best performance is achieved with λ1 and λ2 taking values from 0.2 to 0.5 on both datasets. All accuracies drop sharply if using λ1 = 1.0. The reason is that a higher λ1 forces the model to push the φ-induced similarity to fixed baseline θ = 1, rather than decrease the loss L on the pretext task, leading to poor convergence. 2) The number of epochs. In Figure 6 (b), we IP-IRM Sim CLR IP-IRM Sim CLR 50 150 250 450 550 650 Epochs 𝒌𝒌-NN Top-1 Acc (%) 0.1 0.2 0.5 1.0 0.1 0.2 0.5 1.0 0.1 0.2 0.5 1.0 Figure 6: Our ablation study on the STL10 and Cifar100 datasets. (a) The Top-1 accuracy (%) of linear classifiers using different values of λ1 and λ2 (in Eq. (2) and Eq. (3)), by training for 200 epochs on two datasets. (b) The Top-1 accuracy (%) of k-NN classifiers on two datasets, for which we trained the models for 700 epochs and updated P every 50 epochs. plot the Top-1 accuracies of using k-NN classifiers along the 700-epoch training of two kinds of SSL representations Sim CLR and IP-IRM. It is obvious that IP-IRM converges faster and achieves a higher accuracy than Sim CLR. It is worth to highlight that on the STL10, the accuracy of Sim CLR starts to oscillate and grow slowly after the 150-th epoch, while ours keeps on improving. This is an empirical evidence that IP-IRM keeps on disentangling more and more semantics in the feature space, and has the potential of improvement through long-term training. 5.3 Potential on Large-Scale Data Datasets. We evaluated on the standard benchmark of supervised learning Image Net ILSVRC2012 [25] which has in total 1,331,167 images in 1,000 classes. To further reveal if a representation is decomposable, we used NICO [37], which is a real-world image dataset designed for OOD evaluations. It contains 25,000 images in 19 classes, with a strong correlation between the foreground and background in the train split (e.g., most dogs on grass). We also studied the transferability of the learned representation following [30, 52]: FGVC Aircraft (Aircraft) [60], Caltech-101 (Caltech) [31], Stanford Cars (Cars) [93], Cifar10 [53], Cifar100 [53], DTD [21], Oxford 102 Flowers (Flowers) [62], Food-101 (Food) [6], Oxford-IIIT Pets (Pets) [67] and SUN397 (SUN) [91]. These datasets include coarseto fine-grained classification tasks, and vary in the amount of training data (2,000-75,000 images) and classes (10-397 classes), representing a wide range of transfer learning settings. Settings. For the Image Net, all the representations were trained for 200 epochs due to limited computing resources. We followed the common setting [80, 36], using a linear classifier, and report Top-1 classification accuracies. For NICO, we fixed the Image Net pre-trained Res Net-50 backbone and fine-tuned the classifier. See appendix for more training details. For the transfer learning, we followed [30, 52] to report the classification accuracies on Cars, Cifar-10, Cifar-100, DTD, Food, SUN and the average per-class accuracies on Aircraft, Caltech, Flowers, Pets. We call them uniformly as Accuracy. We used the few-shot n-way-k-shot setting for model evaluation. Specifically, we randomly sampled 2,000 episodes from the test splits of above datasets. 
An episode contains n classes, each with k training samples and 15 testing samples, where we fine-tuned the linear classifier (backbone weights frozen) for 100 epochs on the training samples, and evaluated the classifier on the testing samples. We evaluated with n = k = 5 (results of n = 5, k = 20 in Appendix). Image Net and NICO. In Table 3 Image Net accuracy, our IP-IRM achieves the best performance over all baseline models. Yet we believe that this does not show the full potential of IP-IRM, because Image Net is a larger-scale dataset with many semantics, and it is hard to achieve a full disentanglement of all semantics within the limited 200 epochs. To evaluate the feature decomposability of IPIRM, we compared the performance on NICO with various SSL baselines in Table 3, where our approach significantly outperforms the baselines by 1.5-4.2%. This validates IP-IRM feature is more decomposable if each semantic feature (e.g., background) is decomposed in some fixed dimensions and some classes vary with such semantic, then the classifier will recognize this as Method Image Net NICO Ins Dis [90] 56.5 65.6 PCL [56] 61.5 72.6 PIRL [61] 63.6 69.1 Mo Co-v1 [36] 60.6 69.3 Sim CLR (repro.) [16] 63.1 64.5 Mo Co-v2 (repro.) [18] 67.3 78.0 Sim Siam (repro.) [19] 68.8 66.7 Sim CLR+IP-IRM 64.8 66.7 Mo Co-v2+IP-IRM 67.6 79.5 Sim Siam+IP-IRM 69.1 70.9 Table 3: Image Net and NICO Top-1 Accuracy (%) of linear classifiers trained on the representations learnt with different SSL methods. Cat Elephant Bear Sheep Input Mo Co-v2 Mo Co-v2 Figure 7: Visualization of CAM [98] on images from NICO [37] dataset using representations of the baseline Mo Co-v2 [18] and our IP-IRM. Method Aircraft Caltech Cars Cifar10 Cifar100 DTD Flowers Food Pets SUN Average Ins Dis [90] 35.07 75.97 37.49 51.49 57.61 69.38 77.35 50.01 66.38 74.97 59.57 PCL [56] 36.86 90.72 39.68 59.26 60.78 69.53 67.50 57.06 88.31 84.51 65.42 PIRL [61] 36.70 78.63 39.21 49.85 55.23 70.43 78.37 51.61 69.40 76.64 60.61 Mo Co-v1 [36] 35.31 79.60 36.35 46.96 51.62 68.76 75.42 49.77 68.32 74.77 58.69 Mo Co-v2 [18] 31.98 92.32 41.47 56.50 63.33 78.00 80.05 57.25 83.23 88.10 67.22 IP-IRM (Ours) 32.98 93.16 42.87 60.73 68.54 79.30 82.68 59.61 85.23 89.38 69.44 Table 4: Accuracy (%) of 5-way-5-shot few-shot evaluation using the image representation learned on Image Net [25]. More detailed results are given in Appendix. a non-discriminative variant feature and hence focus on other more discriminative features (i.e., foreground). In this way, even though some classes are confounded by those non-discriminative features (e.g., most of the dog images are with grass background), the fixed dimensions still help classifiers neglect those non-discriminative ones. We further visualized the CAM [98] on NICO in Figure 7, which indeed shows that IP-IRM helps the classifier focus on the foreground regions. Few-Shot Tasks. As shown in Table 4, our IP-IRM significantly improves the performance of 5-way-5-shot setting, e.g., we outperform the baseline Mo Co-v2 by 2.2%. This is because IP-IRM can further disentangled G\Daug over SSL, which is essential for representations to generalize to different downstream class domains (recall Corollary 2 of Lemma 1). This is also in line with recent works [86] showing that a disentangled representation is especially beneficial in low-shot scenarios, and further demonstrates the importance of disentanglement in downstream tasks. 
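As a reference for the transfer protocol described in the Settings above, here is a minimal sketch of a single n-way-k-shot episode on frozen features (our simplification: features are precomputed per class and the linear head is trained with Adam; the paper's exact optimizer, schedule, and data pipeline are in its appendix).

```python
import random
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(backbone, images):
    """Frozen representation: the backbone is never updated during transfer."""
    backbone.eval()
    return backbone(images)

def run_episode(feats_by_class, n_way=5, k_shot=5, n_query=15, epochs=100, lr=0.01):
    """One few-shot episode: sample n classes with k support and n_query query images each
    (each class tensor must hold at least k_shot + n_query features), fit a linear head
    on the frozen features, and return the query accuracy."""
    classes = random.sample(list(feats_by_class), n_way)
    support, query, ys, yq = [], [], [], []
    for label, cls in enumerate(classes):
        idx = torch.randperm(feats_by_class[cls].size(0))[: k_shot + n_query]
        feats = feats_by_class[cls][idx]
        support.append(feats[:k_shot]);  ys += [label] * k_shot
        query.append(feats[k_shot:]);    yq += [label] * n_query
    xs, xq = torch.cat(support), torch.cat(query)
    ys, yq = torch.tensor(ys), torch.tensor(yq)

    head = torch.nn.Linear(xs.size(1), n_way)          # only the linear classifier is trained
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(head(xs), ys).backward()
        opt.step()
    return (head(xq).argmax(1) == yq).float().mean().item()

# Averaging this accuracy over e.g. 2,000 random episodes approximates the protocol behind Table 4.
```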
6 Conclusion We presented an unsupervised disentangled representation learning method called Iterative Partitionbased Invariant Risk Minimization (IP-IRM), based on Self-Supervised Learning (SSL). IP-IRM iteratively partitions the dataset into semantic-related subsets, and learns a representation invariant across the subsets using SSL with an IRM loss. We show that with theoretical guarantee, IP-IRM converges with a disentangled representation under the group-theoretical view, which fundamentally surpasses the capabilities of existing SSL and fully-supervised learning. Our proposed theory is backed by strong empirical results in disentanglement metrics, SSL classification accuracy and transfer performance. IP-IRM achieves disentanglement without using generative models, making it widely applicable on large-scale visual tasks. As future directions, we will continue to explore the application of group theory in representation learning and seek additional forms of inductive bias for faster convergence. Acknowledgments and Disclosure of Funding The authors would like to thank all reviewers for their constructive suggestions. This research is partly supported by the Alibaba-NTU Joint Research Institute, the A*STAR under its AME YIRG Grant (Project No. A20E6c0101), and the Singapore Ministry of Education (MOE) Academic Research Fund (Ac RF) Tier 2 grant. [1] Philip W Anderson. More is different. Science, 1972. [2] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. [3] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. ar Xiv preprint ar Xiv:1906.00910, 2019. [4] Yoshua Bengio. Learning deep architectures for AI. Now Publishers Inc, 2009. [5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798 1828, 2013. [6] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 mining discriminative components with random forests. In European conference on computer vision, 2014. [7] Jane Bromley, Isabelle Guyon, Yann Le Cun, Eduard Säckinger, and Roopak Shah. Signature verification using a" siamese" time delay neural network. Advances in neural information processing systems, 6:737 744, 1993. [8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020. [9] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-vae. ar Xiv preprint ar Xiv:1804.03599, 2018. [10] Ruichu Cai, Zijian Li, Pengfei Wei, Jie Qiao, Kun Zhang, and Zhifeng Hao. Learning disentangled semantic representation for domain adaptation. In IJCAI: proceedings of the conference, 2019. [11] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. 
In Proceedings of the European Conference on Computer Vision (ECCV), pages 132 149, 2018. [12] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pretraining of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2959 2968, 2019. [13] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. ar Xiv preprint ar Xiv:2006.09882, 2020. [14] Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In CVPR, 2018. [15] Ricky TQ Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in neural information processing systems, 2018. [16] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597 1607. PMLR, 2020. [17] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, 2016. [18] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. ar Xiv preprint ar Xiv:2003.04297, 2020. [19] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. ar Xiv preprint ar Xiv:2011.10566, 2020. [20] Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba, and Stefanie Jegelka. Debiased contrastive learning. ar Xiv preprint ar Xiv:2007.00224, 2020. [21] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014. [22] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215 223. JMLR Workshop and Conference Proceedings, 2011. [23] Taco Cohen and Max Welling. Learning the irreducible representations of commutative lie groups. In International Conference on Machine Learning, 2014. [24] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. In ICLR, 2018. [25] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A largescale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248 255. Ieee, 2009. [26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019. [27] Kien Do and Truyen Tran. Theory and evaluation metrics for learning disentangled representations. In International conference on learning representations, 2020. [28] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pages 1422 1430, 2015. 
[29] Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. In International conference on learning representations, 2018. [30] Linus Ericsson, Henry Gouk, and Timothy M. Hospedales. How Well Do Self-Supervised Models Transfer? In CVPR, 2021. [31] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, 2004. [32] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 2016. [33] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. ar Xiv preprint ar Xiv:1803.07728, 2018. [34] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. ar Xiv preprint ar Xiv:2006.07733, 2020. [35] W.F.J. Harris, W. Fulton, and J. Harris. Representation Theory: A First Course. 1991. [36] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. ar Xiv preprint ar Xiv:1911.05722, 2019. [37] Yue He, Zheyan Shen, and Peng Cui. Towards non-iid image classification: A dataset and baselines. Pattern Recognition, 110:107383, 2021. [38] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pages 4182 4192. PMLR, 2020. [39] I. Higgins, Loïc Matthey, A. Pal, C. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017. [40] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. ar Xiv preprint ar Xiv:1812.02230, 2018. [41] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. International conference on learning representations, 2017. [42] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. ar Xiv preprint ar Xiv:1808.06670, 2018. [43] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In Advances in neural information processing systems, 2018. [44] Qianjiang Hu, Xiao Wang, Wei Hu, and Guo-Jun Qi. Adco: Adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries. ar Xiv preprint ar Xiv:2011.08435, 2020. [45] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, 2019. [46] Andrew Jaegle, Stephen Phillips, Daphne Ippolito, and Kostas Daniilidis. 
Understanding image motion with group representations. 2018. [47] Thomas W. Judson. Abstract Algebra: Theory and Applications (The Prindle, Weber & Schmidt Series in Advanced Mathematics). Prindle Weber & Schmidt, 1994. [48] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. ar Xiv preprint ar Xiv:2010.01028, 2020. [49] Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. International conference on learning representations, 2015. [50] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, 2018. [51] Diederik Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014. [52] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. [53] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2012. [54] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [55] Kibok Lee, Yian Zhu, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin, and Honglak Lee. i-mix: A domain-agnostic strategy for contrastive representation learning. In ICLR, 2021. [56] Junnan Li, Pan Zhou, Caiming Xiong, Richard Socher, and Steven CH Hoi. Prototypical contrastive learning of unsupervised representations. ar Xiv preprint ar Xiv:2005.04966, 2020. [57] Zinan Lin, Kiran Thekumparampil, Giulia Fanti, and Sewoong Oh. Infogan-cr and modelcentrality: Self-supervised model training and selection for disentangling gans. In International Conference on Machine Learning, 2020. [58] F. Locatello, M. Tschannen, S. Bauer, G. Rätsch, B. Schölkopf, and O. Bachem. Disentangling factors of variations using few labels. In 8th International Conference on Learning Representations (ICLR), 2020. [59] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, 2019. [60] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Finegrained visual classification of aircraft. ar Xiv preprint ar Xiv:1306.5151, 2013. [61] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707 6717, 2020. [62] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008. [63] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69 84. Springer, 2016. [64] Utkarsh Ojha, Krishna Kumar Singh, Cho-Jui Hsieh, and Yong Jae Lee. Elastic-infogan: Unsupervised disentangled representation learning in class-imbalanced data. In Advances in neural information processing systems, 2020. [65] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. [66] Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. 
In Proceedings of the 35th International Conference on Machine Learning, pages 4036 4044, 2018. [67] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, 2012. [68] Robin Quessard, Thomas Barrett, and William Clements. Learning disentangled representations and group structure of dynamical environments. Advances in Neural Information Processing Systems, 2020. [69] Rajesh PN Rao and Daniel L Ruderman. Learning lie groups for invariant visual perception. Advances in neural information processing systems, 1999. [70] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In International conference on machine learning, 2014. [71] Xuanchi Ren, Tao Yang, Yuwang Wang, and Wenjun Zeng. Do generative models know disentanglement? contrastive learning is all you need. ar Xiv preprint ar Xiv:2102.10543, 2021. [72] Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss. In Advances in neural information processing systems, 2018. [73] Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. ar Xiv preprint ar Xiv:2010.04592, 2020. [74] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, 2012. [75] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, 2015. [76] Ivan Sosnovik, Michał Szmaja, and Arnold Smeulders. Scale-equivariant steerable networks. 2020. [77] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. ar Xiv preprint ar Xiv:1412.6806, 2014. [78] Raphael Suter, Djordje Miladinovic, Stefan Bauer, and Bernhard Schölkopf. Interventional robustness of deep latent variable models. ar Xiv, 2018. [79] Raphael Suter, Djordje Miladinovic, Bernhard Schölkopf, and Stefan Bauer. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In International Conference on Machine Learning, 2019. [80] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. ar Xiv preprint ar Xiv:1906.05849, 2019. [81] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, European conference on computer vision, 2020. [82] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. ar Xiv preprint ar Xiv:2005.10243, 2020. [83] Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. ar Xiv preprint ar Xiv:2102.06810, 2021. [84] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning gan for poseinvariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [85] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [86] Sjoerd Van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. 
Are disentangled representations helpful for abstract visual reasoning? In Advances in neural information processing systems, 2019. [87] Tan Wang, Chang Zhou, Qianru Sun, and Hanwang Zhang. Causal attention for unbiased visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. [88] Xudong Wang, Ziwei Liu, and Stella X Yu. Unsupervised feature learning by cross-level discrimination between instances and groups. ar Xiv preprint ar Xiv:2008.03813, 2020. [89] Daniel E Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. In Neur IPS, 2019. [90] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733 3742, 2018. [91] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, 2010. [92] Tete Xiao, Xiaolong Wang, Alexei A Efros, and Trevor Darrell. What should not be contrastive in contrastive learning. In International Conference on Learning Representations, 2021. [93] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. [94] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6210 6219, 2019. [95] Zhongqi Yue, Tan Wang, Hanwang Zhang, Qianru Sun, and Xian-Sheng Hua. Counterfactual zero-shot and open-set visual recognition. In CVPR, 2021. [96] Zhongqi Yue, Hanwang Zhang, Qianru Sun, and Xian-Sheng Hua. Interventional few-shot learning. In Neur IPS, 2020. [97] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018. [98] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921 2929, 2016. [99] Yizhe Zhu, Martin Renqiang Min, Asim Kadav, and Hans Peter Graf. S3vae: Self-supervised sequential vae for representation disentanglement and data generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. [100] Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Multi-view perceptron: a deep model for learning face identity and view representations. Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.