# Self-Supervised Relational Reasoning for Representation Learning

Massimiliano Patacchiola, School of Informatics, University of Edinburgh, mpatacch@ed.ac.uk
Amos Storkey, School of Informatics, University of Edinburgh, a.storkey@ed.ac.uk

Abstract

In self-supervised learning, a system is tasked with achieving a surrogate objective by defining alternative targets on a set of unlabeled data. The aim is to build useful representations that can be used in downstream tasks, without costly manual annotation. In this work, we propose a novel self-supervised formulation of relational reasoning that allows a learner to bootstrap a signal from information implicit in unlabeled data. Training a relation head to discriminate how entities relate to themselves (intra-reasoning) and to other entities (inter-reasoning) results in rich and descriptive representations in the underlying neural network backbone, which can be used in downstream tasks such as classification and image retrieval. We evaluate the proposed method following a rigorous experimental procedure, using standard datasets, protocols, and backbones. Self-supervised relational reasoning outperforms the best competitor in all conditions by an average of 14% in accuracy, and the most recent state-of-the-art model by 3%. We link the effectiveness of the method to the maximization of a Bernoulli log-likelihood, which can be considered as a proxy for maximizing the mutual information, resulting in a more efficient objective with respect to the commonly used contrastive losses.

1 Introduction

Learning useful representations from unlabeled data can substantially reduce dependence on costly manual annotation, which is a major limitation in modern deep learning. Toward this end, one solution is to develop learners able to self-generate a supervisory signal by exploiting implicit information, an approach known as self-supervised learning (Schmidhuber, 1987, 1990). Humans and animals are naturally equipped with the ability to learn via an intrinsic signal, but how machines can build similar abilities has been material for debate (Lake et al., 2017). A common approach consists of defining a surrogate task (pretext) which can be solved only by learning generalizable representations, and then using those representations in downstream tasks, e.g. classification and image retrieval (Jing and Tian, 2020).

A key factor in self-supervised human learning is the acquisition of new knowledge by relating entities, whose positive effects are well established in studies of adult learning (Gentner and Kurtz, 2005; Goldwater et al., 2018). Developmental studies have shown something similar in children, who can build complex taxonomic names when they have the opportunity to compare objects (Gentner and Namy, 1999; Namy and Gentner, 2002). Comparison allows the learner to neglect irrelevant perceptual features and focus on non-obvious properties. Here, we argue that it is possible to exploit a similar mechanism in self-supervised machine learning via relational reasoning. The relational reasoning paradigm is based on a key design principle: the use of a relation network as a learnable function to quantify the relationships between a set of objects.
Starting from this principle, we propose a new formulation of relational reasoning which can be used as a pretext task to build useful representations in a neural network backbone, by training the relation head on unlabeled data. Differently from the canonical relational approach, which focuses on relations between objects in the same scene (Santoro et al., 2017), we focus on relations between views of the same object (intra-reasoning) and relations between different objects in different scenes (inter-reasoning). In doing so, we allow the learner to acquire both intra-class and inter-class knowledge without the need for labeled data.

We evaluate our method following a rigorous experimental methodology, since comparing self-supervised learning methods can be problematic (Kolesnikov et al., 2019; Musgrave et al., 2020). Gains may be largely due to the backbone and learning schedule used, rather than the self-supervised component. To neutralize these effects we provide a benchmark environment where all methods are compared using standard datasets (CIFAR-10, CIFAR-100, CIFAR-100-20, STL-10, tiny-ImageNet, SlimageNet), evaluation protocol (Kolesnikov et al., 2019), learning schedule, and backbones (both shallow and deep). Results show that our method largely outperforms the best competitor in all conditions by an average of 14% accuracy, and the most recent state-of-the-art method by 3%.

Main contributions: 1) we propose a novel algorithm based on relational reasoning for the self-supervised learning of visual representations; 2) we show its effectiveness on standard benchmarks with an in-depth experimental analysis, outperforming concurrent state-of-the-art methods (code released with an open-source license¹); and 3) we highlight how the maximization of a Bernoulli log-likelihood, in concert with a relation module, results in a more effective and efficient objective function with respect to the commonly used contrastive losses.

1.1 Overview

Following the terminology used in the self-supervised literature (Jing and Tian, 2020), we consider relational reasoning as a pretext task for learning useful representations in the underlying neural network backbone. Once the joint system (backbone + relation head) has been trained, the relation head is discarded, and the backbone is used in downstream tasks (e.g. classification, image retrieval). To achieve this goal we provide a new formulation of relational reasoning. The canonical formulation defines it as the process of learning the ways in which entities are connected, using this knowledge to accomplish higher-order goals (Santoro et al., 2017, 2018). The proposed formulation defines it as the process of learning the ways entities relate to themselves (intra-reasoning) and to other entities (inter-reasoning), using this knowledge to accomplish downstream goals.

Consider a set of objects $\mathcal{O} = \{o_1, \dots, o_N\}$. The canonical approach is within-scene, meaning that all the elements in $\mathcal{O}$ belong to the same scene (e.g. fruits from a basket). The within-scene approach is not very useful in our case: ideally, we would like our learner to be able to differentiate between objects taken from every possible scene. Therefore we first define between-scenes reasoning: the task of relating objects from different scenes (e.g. fruits from different baskets).
Starting from the between-scenes setting, consider the case where the learner is tasked with discriminating whether two objects $\{o_i, o_j\} \subset \mathcal{O}$ belong to the same category, $\{o_i, o_j\} \rightarrow \text{same}$, or to different ones, $\{o_i, o_j\} \rightarrow \text{different}$. Often a single attribute is informative enough to solve the task. For instance, in the pair $\{\text{apple}_i, \text{orange}_j\}$ the color alone is a strong predictor of the class; it follows that the learner does not need to pay attention to other features, and this results in poor representations. To solve the issue we alter the object $o_i$ via random augmentations $\mathcal{A}(o_i)$ (e.g. geometric transformation, color distortion), making between-scenes reasoning more complicated. The color of an orange can be randomly changed, or the shape resized, such that it is much more difficult to discriminate it from an apple. In this challenging setting, the learner is forced to take account of the correlation between a wider set of features (e.g. color, size, texture, etc.). However, it is not possible to create pairs of similar and dissimilar objects when labels are not given. To overcome the problem we bootstrap a supervisory signal directly from the (unlabeled) data, and we do so by introducing intra-reasoning and inter-reasoning. Intra-reasoning consists of sampling two random augmentations of the same object, $\{\mathcal{A}(o_i), \mathcal{A}(o_i)\} \rightarrow \text{same}$ (positive pair), whereas inter-reasoning consists of coupling two random objects, $\{\mathcal{A}(o_i), \mathcal{A}(o_{\setminus i})\} \rightarrow \text{different}$ (negative pair). This is like coupling different views of the same apple to build the positive pair, and coupling an apple with a random fruit to build the negative pair. In this work we show that it is possible to train a relation module via intra-reasoning and inter-reasoning, with the aim of learning useful representations.

¹ https://github.com/mpatacchiola/self-supervised-relational-reasoning

2 Previous work

Relational reasoning. In the last decades there have been entire sub-fields interested in relational learning, e.g. reinforcement learning (Džeroski et al., 2001) and statistics (Koller et al., 2007). However, only recently has the relational paradigm gained traction in the deep learning community, with applications in question answering (Santoro et al., 2017; Raposo et al., 2017), graphs (Battaglia et al., 2018), sequential streams (Santoro et al., 2018), deep reinforcement learning (Zambaldi et al., 2019), few-shot learning (Sung et al., 2018), and object detection (Hu et al., 2018). Our work differs from previous work in several ways: (i) previous work is based on labeled data, while we use relational reasoning on unlabeled data; (ii) previous work has focused on within-scene relations, while here we focus on relations between different views of the same object (intra-reasoning) and between different objects in different scenes (inter-reasoning); (iii) in previous work training the relation head was the main goal, while here it is a pretext task for learning useful representations in the underlying backbone.
Solving pretext tasks. There has been a substantial effort in defining self-supervised pretext tasks which can be solved only if generalizable representations have been learned. Examples are: predicting the augmentation applied to a patch (Dosovitskiy et al., 2014), predicting the relative location of patches (Doersch et al., 2015), solving jigsaw puzzles (Noroozi and Favaro, 2016), learning to count (Noroozi et al., 2017), spotting artifacts (Jenni and Favaro, 2018), predicting image rotations (Gidaris et al., 2018) or image channels (Zhang et al., 2017), generating color versions of grayscale images (Zhang et al., 2016; Larsson et al., 2016), and generating missing patches (Pathak et al., 2016).

Metric learning. The aim of metric learning (Bromley et al., 1994) is to use a distance metric to bring closer the representations of similar inputs (positives), while moving away the representations of dissimilar inputs (negatives). Commonly used losses are the contrastive loss (Hadsell et al., 2006), the triplet loss (Weinberger et al., 2006), Noise-Contrastive Estimation (NCE; Gutmann and Hyvärinen, 2010), and the margin (Schroff et al., 2015) and magnet (Rippel et al., 2016) losses. At first glance relational reasoning and metric learning may seem related; however, they are fundamentally different: (i) metric learning explicitly aims at organizing representations by similarity, whereas self-supervised relational reasoning aims at learning a relation measure and, as a byproduct, learning useful representations; (ii) metric learning directly applies a distance metric over the representations, whereas relational reasoning collects representations into a set, aggregates them, and then estimates relations; (iii) the relation score is not a distance metric (see Section 3.3) but rather a learnable (probabilistic) similarity measure.

Contrastive learning. Metric learning methods based on the contrastive loss and NCE are often referred to as contrastive learning methods. Contrastive learning via NCE has recently obtained the state of the art in self-supervised learning. However, one limiting factor is that NCE relies on a large quantity of negatives, which are difficult to obtain in mini-batch stochastic optimization. Recent work has used a memory bank to dynamically store negatives during training (Wu et al., 2018), followed by a plethora of other methods (He et al., 2019; Tian et al., 2019; Misra and van der Maaten, 2019; Zhuang et al., 2019). However, a memory bank has several issues: it introduces additional overhead and a considerable memory footprint. SimCLR (Chen et al., 2020) tries to circumvent the problem by mining negatives in-batch, but this requires specialized optimizers to stabilize the training at scale. We compare relational reasoning and contrastive learning in Section 3.1 and Section 5.

Pseudo-labeling. Self-supervision can be achieved by providing pseudo-labels to the learner, which are then used for standard supervised learning. A way to obtain pseudo-labels is to use the model itself, picking the class which has the maximum predicted probability (Lee, 2013; Sohn et al., 2020). A neural network ensemble can also be used to provide the labels (Gupta et al., 2020). In DeepCluster (Caron et al., 2018), pseudo-labels are produced by running a k-means clustering algorithm, which can be forced to induce equipartition (Asano et al., 2020). Recent studies have shown that pseudo-labeling is not competitive against other methods (Oliver et al., 2018), since it is often prone to degenerate solutions with all points assigned to the same label (or cluster).
InfoMax. A recent line of work has investigated the use of mutual information for unsupervised and self-supervised representation learning, following the InfoMax principle (Linsker, 1988). Mutual information is often maximized at different scales (global and local) on single views (Deep InfoMax; Hjelm et al., 2019), multi-views (Bachman et al., 2019; Ji et al., 2019), or sequentially (Oord et al., 2018). Those methods are often strongly dependent on the choice of feature extractor architecture (Tschannen et al., 2020).

Figure 1: Overview of the proposed method. The mini-batch $\mathcal{B}$ is augmented $K$ times (e.g. via random flip and crop-resize) and passed through a neural network backbone $f_\theta$ to produce the representations $\mathcal{Z}^{(1)}, \dots, \mathcal{Z}^{(K)}$. An aggregation function $a$ joins positives (representations of the same images) and negatives (randomly paired representations) through a commutative operator. The relation module $r_\phi$ estimates the relation score $y$, which must be 1 for positives and 0 for negatives. The model is optimized by minimizing the binary cross-entropy (BCE) between prediction and target $t$.

3 Description of the method

Consider an unlabeled dataset $\mathcal{D} = \{x_n\}_{n=1}^{N}$ and a non-linear function $f_\theta(\cdot)$, parameterized by a vector of learnable weights $\theta$ and modeled as a neural network (backbone). A forward pass generates a vector $f_\theta(x_n) = z_n$ (representation), which can be collected in a set $\mathcal{Z} = \{z_n\}_{n=1}^{N}$. The notation $\mathcal{A}(x_n)$ is used to express the probability distribution of instances generated by applying stochastic data augmentation to $x_n$, while $x_n^{(i)} \sim \mathcal{A}(x_n)$ is the $i$-th sample from this distribution (a particular augmented version of the input instance), and $\mathcal{D}^{(i)} = \{x_n^{(i)}\}_{n=1}^{N}$ is the $i$-th set of random augmentations over all instances. Likewise, $z_n^{(i)} = f_\theta(x_n^{(i)})$ is grouped in $\mathcal{Z}^{(i)} = \{z_n^{(i)}\}_{n=1}^{N}$. Let $K$ indicate the total number of augmentations $\mathcal{D}^{(1)}, \dots, \mathcal{D}^{(K)}$ and their representations $\mathcal{Z}^{(1)}, \dots, \mathcal{Z}^{(K)}$. Now, let us define a relation module $r_\phi(\cdot)$ as a non-linear function approximator parameterized by $\phi$, which takes as input a pair of aggregated representations and returns a relation score $y$. Indicating with $a(\cdot, \cdot)$ an aggregation function and with $\mathcal{L}(y, t)$ the loss between the score and a target value $t$, the complete learning objective can be specified as

$$\sum_{n=1}^{N} \sum_{i=1}^{K} \sum_{j=1}^{K} \Big[ \underbrace{\mathcal{L}\big(r_\phi(a(z_n^{(i)}, z_n^{(j)})),\ t=1\big)}_{\text{intra-reasoning}} + \underbrace{\mathcal{L}\big(r_\phi(a(z_n^{(i)}, z_{\setminus n}^{(j)})),\ t=0\big)}_{\text{inter-reasoning}} \Big], \quad \text{with } z_n = f_\theta(x_n), \tag{1}$$

where $\setminus n$ is an index randomly sampled from $\{1, \dots, N\} \setminus \{n\}$. In practice (1) can be framed as a standard binary classification problem (see Section 3.4) and minimized by stochastic gradient descent, sampling a mini-batch $\mathcal{B} \subset \mathcal{D}$ with pairs built by repeatedly applying $K$ augmentations to $\mathcal{B}$. Positives can be obtained by pairing two encodings of the same input (intra-reasoning term), and negatives by randomly coupling representations of different inputs (inter-reasoning term), relying on the assumption that in common settings this yields a very low probability of false negatives. An overview of the model is given in Figure 1 and the pseudo-code in Appendix C (supp. material).

Mutual information. Following the recent work of Boudiaf et al. (2020), we can interpret (1) in terms of mutual information. Let us define the random variables $Z|X$ and $T|Z$, representing embeddings and targets. Now consider the generative view of mutual information

$$I(Z; T) = H(Z) - H(Z|T). \tag{2}$$

Intra-reasoning is a tightening factor which can be expressed as a bound over the conditional entropy $H(Z|T)$. Inter-reasoning is a scattering factor which can be linked to the entropy of the representations $H(Z)$. In other words, each representation is pushed towards a positive neighborhood (intra-reasoning) and repelled from a complementary set of negatives (inter-reasoning). Under this interpretation, (1) can be considered as a proxy for maximizing Equation (2). We refer the reader to Boudiaf et al. (2020) for a more detailed analysis.
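To make objective (1) concrete, the following is a minimal PyTorch sketch of one training step (the official pseudo-code is in Appendix C; `backbone`, `relation_head`, and `augment` are placeholder names, and plain BCE stands in for the focal-weighted loss of Section 3.4):

```python
import torch
import torch.nn.functional as F

def training_step(backbone, relation_head, augment, x, optimizer, K=32):
    """One self-supervised relational reasoning step on a mini-batch x of M images.

    Sketch of objective (1): positives pair two augmentations of the same image,
    negatives pair augmentations of different images."""
    M = x.size(0)
    # K augmented views of the mini-batch, encoded by the backbone: list of (M, D) tensors.
    z = [backbone(augment(x)) for _ in range(K)]
    pairs, targets = [], []
    for i in range(K):
        for j in range(i + 1, K):  # one of the two symmetric tuples, no identical pairs
            # Intra-reasoning: same image, different augmentations -> target 1.
            pairs.append(torch.cat([z[i], z[j]], dim=1))
            targets.append(torch.ones(M))
            # Inter-reasoning: roll by a random non-zero shift so that every image
            # is paired with a *different* image, i.e. the index \n never equals n.
            shift = int(torch.randint(1, M, (1,)))
            pairs.append(torch.cat([z[i], torch.roll(z[j], shifts=shift, dims=0)], dim=1))
            targets.append(torch.zeros(M))
    pairs = torch.cat(pairs, dim=0)      # (P, 2D) with P = M * (K^2 - K), see eq. (4)
    targets = torch.cat(targets, dim=0)  # (P,)
    scores = relation_head(pairs).squeeze(1)        # relation scores y in [0, 1]
    loss = F.binary_cross_entropy(scores, targets)  # Bernoulli log-likelihood, eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```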
3.1 Inputs augmentation

Given a random mini-batch of $M$ input instances $\mathcal{B} \subset \mathcal{D}$, we recursively apply data augmentation $K$ times, obtaining $\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(K)}$, and then propagate through $f_\theta$ with a forward pass to generate the corresponding representations $\mathcal{Z}^{(1)}, \dots, \mathcal{Z}^{(K)}$. Representations are coupled across augmentations to generate positive and negative tuples

$$\forall i, j \in \{1, \dots, K\} \quad \underbrace{\big(\mathcal{Z}^{(i)}, \mathcal{Z}^{(j)}\big)}_{\text{positives}} \quad \text{and} \quad \underbrace{\big(\mathcal{Z}^{(i)}, \tilde{\mathcal{Z}}^{(j)}\big)}_{\text{negatives}}, \tag{3}$$

where $\tilde{\mathcal{Z}}$ indicates random assignment of each representation $z_n^{(i)}$ to a different element $z_{\setminus n}^{(j)}$. In practice, we discard identical pairs (identity mapping is learned across augmentations) and take just one of the symmetric tuples $(z^{(i)}, z^{(j)})$ and $(z^{(j)}, z^{(i)})$ (the aggregation function ensures commutativity, see Section 3.2). If a certain amount of data in $\mathcal{D}$ is labeled (semi-supervised setting), then positive pairs include representations of different augmented inputs belonging to the same category.

Computational cost. Having defined $M$ as the number of inputs in the mini-batch $\mathcal{B}$, and $K$ as the number of augmentations, the total number of pairs $P$ (positive and negative) is given by

$$P = M(K^2 - K). \tag{4}$$

The number of comparisons $P$ scales quadratically with the number of augmentations $K$, and linearly with the size of the mini-batch $M$; whereas in recent contrastive learning methods (Chen et al., 2020) they scale as $P = (MK)^2$, which is quadratic in both augmentations and mini-batch size.

Augmentation strategy. Here, we consider the particular case where the input instances are color images. Following previous work (Chen et al., 2020) we focus on two augmentations: random crop-resize and color distortion. Crop-resize enforces comparisons between views: global-to-global, global-to-local, and local-to-local. Since augmentations are sampled from the same color distribution, the color alone may suffice to distinguish positives and negatives. Color distortion enforces color-invariant encodings and neutralizes learning shortcuts. Additional details about the augmentations used in this work are reported in Section 4 and Appendix A.3 (supp. material).

3.2 Aggregation function

Relation networks operate over sets. To avoid a combinatorial explosion due to an increasing cardinality, a commutative aggregation function is applied. Given $f_\theta(x_i) = z_i$ and $f_\theta(x_j) = z_j$, there are different possible choices for the aggregation function:

$$a_{\text{sum}}(z_i, z_j) = z_i + z_j, \qquad a_{\text{max}}(z_i, z_j) = \max(z_i, z_j), \qquad a_{\text{cat}}(z_i, z_j) = [z_i, z_j], \tag{5}$$

where sum and max are applied elementwise. Concatenation $a_{\text{cat}}$ is not commutative, but it has been previously used when the cardinality is small (Hu et al., 2018; Sung et al., 2018), as in our case.

3.3 Relation module

The relation module is a function $r_\phi(\cdot)$ parameterized by a vector of learnable weights $\phi$, modeled as a multi-layer perceptron (MLP). Given a pair of representations $z_i$ and $z_j$, the module takes as input the aggregated pair and produces a scalar $y$ (relation score):

$$r_\phi\big(a(z_i, z_j)\big) = y. \tag{6}$$

The relation score respects two properties: (i) $r(a(z_i, z_j)) \in [0, 1]$; (ii) $r(a(z_i, z_j)) = r(a(z_j, z_i))$.
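As an illustration, a relation module matching the implementation details given in Section 4 (MLP with 256 hidden units, batch-norm, leaky-ReLU, a single sigmoid output, concatenation aggregation) could be sketched as follows; this is an assumption-laden sketch, not the released implementation:

```python
import torch.nn as nn

class RelationModule(nn.Module):
    """Relation head r_phi: maps an aggregated pair to a relation score in [0, 1].

    Sketch following Section 4: one hidden layer of 256 units with batch-norm
    and leaky-ReLU, sigmoid output enforcing property (i)."""
    def __init__(self, feature_dim, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim * 2, hidden_dim),  # *2: input is a_cat(z_i, z_j) = [z_i, z_j]
            nn.BatchNorm1d(hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # property (i): y in [0, 1]
        )

    def forward(self, pair):
        # pair: aggregated representations of shape (batch, 2 * feature_dim).
        # With a_sum or a_max the input size would be feature_dim instead,
        # and property (ii) (symmetry) would hold exactly by construction.
        return self.mlp(pair)
```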
It is crucial not to mistake the relation score for a pairwise distance metric. Given a set of input vectors $\{v_i, v_j, v_k\}$, a distance metric $d(\cdot, \cdot)$ respects four properties: (i) $d(v_i, v_j) \geq 0$; (ii) $d(v_i, v_j) = 0 \iff v_i = v_j$; (iii) $d(v_i, v_j) = d(v_j, v_i)$; (iv) $d(v_i, v_k) \leq d(v_i, v_j) + d(v_j, v_k)$. The relation score does not satisfy all the conditions of a distance metric; therefore it is not a distance metric, but rather a probabilistic estimate (see Section 3.4).

3.4 Definition of the loss

The learning objective (1) can be framed as a binary classification problem over the $P$ representation pairs. Under this interpretation, the relation score $y$ represents a probabilistic estimate of representation membership, which can be induced through a sigmoid activation function. It follows that the objective reduces to the maximization of a Bernoulli log-likelihood or, equivalently, the minimization of a binary cross-entropy loss

$$\mathcal{L}(y, t, \gamma) = -\frac{1}{P} \sum_{i=1}^{P} w_i \Big[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \Big], \tag{7}$$

with target $t_i = 1$ for positives and $t_i = 0$ for negatives. The optional weight $w_i$ is a scaling factor

$$w_i = \Big[ (1 - t_i)\, y_i + t_i\, (1 - y_i) \Big]^{\gamma}, \tag{8}$$

where $\gamma \geq 0$ defines how sharp the weight should be. This factor gives more importance to uncertain estimations and is also known as the focal loss (Lin et al., 2017).
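A direct transcription of (7)-(8) might look as follows (a sketch; the released code may differ in details such as clamping and reduction):

```python
import torch

def focal_bce(y, t, gamma=2.0, eps=1e-7):
    """Focal-weighted binary cross-entropy of eqs. (7)-(8).

    y: relation scores in (0, 1), shape (P,); t: targets in {0, 1}, shape (P,).
    gamma=2 is the value used in the experiments (Section 4)."""
    y = y.clamp(eps, 1.0 - eps)                    # numerical stability for the logs
    w = ((1.0 - t) * y + t * (1.0 - y)) ** gamma   # eq. (8): up-weights uncertain pairs
    bce = -(t * torch.log(y) + (1.0 - t) * torch.log(1.0 - y))  # eq. (7) summand
    return (w * bce).mean()                        # 1/P average over the P pairs
```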
4 Experiments

Evaluating self-supervised methods is problematic because of substantial inconsistency in the way methods have been compared (Kolesnikov et al., 2019; Musgrave et al., 2020). We provide a standardized environment implemented in PyTorch (see code in supp. material) using standard datasets (CIFAR-10, CIFAR-100, CIFAR-100-20, STL-10, tiny-ImageNet, SlimageNet), different backbones (shallow and deep), the same learning schedule (epochs), and well-known evaluation protocols (Kolesnikov et al., 2019). In most conditions our method shows superior performance.

Implementation. Hyperparameters (relation learner): mini-batch of 64 images ($K = 16$ for ResNet-32 on tiny-ImageNet, $K = 25$ for ResNet-34 on STL-10, $K = 32$ for the rest), Adam optimizer with learning rate $10^{-3}$, binary cross-entropy loss with focal factor ($\gamma = 2$). Relation module: MLP with 256 hidden units (batch-norm + leaky-ReLU) and a single output unit (sigmoid). Aggregation: we used concatenation as it showed to be more effective (see Appendix B.8, Table 13, supp. material). Augmentations: horizontal flip (50% chance), random crop-resize, conversion to grayscale (20% chance), and color jitter (80% chance). Backbones: Conv-4, ResNet-8/32/56, and ResNet-34 (He et al., 2016). Baselines: DeepCluster (Caron et al., 2018), RotationNet (Gidaris et al., 2018), Deep InfoMax (Hjelm et al., 2019), and SimCLR (Chen et al., 2020). Those are recent (hard) baselines, with SimCLR being the current state of the art in self-supervised learning. As an upper bound we include the performance of a fully supervised learner (it has access to the labels), and as a lower bound a network initialized with random weights, evaluated by training only the linear classifier. All results are the average over three random seeds. Additional details in supp. material (Appendix A).

Linear evaluation. We follow the linear evaluation protocol defined by Kolesnikov et al. (2019): the backbone is trained for 200 epochs using the unlabeled training set, then a linear classifier is trained for 100 epochs on top of the backbone features (without backpropagation into the backbone weights). The accuracy of this classifier on the test set is taken as the final metric to assess the quality of the representations. Our method largely outperforms the other baselines with an accuracy of 46.2% (CIFAR-100) and 30.5% (tiny-ImageNet), which is an improvement of +4.0% and +4.7% over the best competitor (SimCLR), see Table 1. Best results are also obtained with the Conv-4 backbone on all datasets. Only on CIFAR-10/ResNet-32 does SimCLR do better, with a score of 77% against 75% for our method, see Appendix B.1 (supp. material). In the appendix we report the results on the challenging SlimageNet dataset used in few-shot learning (Antoniou et al., 2020): 160 low-resolution images for each one of the 1000 classes in ImageNet. On SlimageNet our method has the highest accuracy (15.8%, $K = 16$), being better than RotationNet (7.2%) and SimCLR (14.3%).

Domain transfer. We evaluate the performance of all methods in transfer learning by training on the unlabeled CIFAR-10 with linear evaluation on the labeled CIFAR-100 (and vice versa). Our method once again outperforms all the others in every condition. In particular, it is very effective in generalizing from a simple dataset (CIFAR-10) to a complex one (CIFAR-100), obtaining an accuracy of 41.5%, which is a gain of +5.3% over SimCLR and +7.5% over the supervised baseline (with linear transfer). For results see Table 1 and Appendix B.2 (supp. material).

Table 1: Comparison on various benchmarks. Mean accuracy (percentage) and standard deviation over three runs (ResNet-32). Best results in bold. Linear Evaluation: training on unlabeled data and linear evaluation on labeled data. Domain Transfer: training on unlabeled CIFAR-10 and linear evaluation on labeled CIFAR-100 (10→100), and vice versa (100→10). Grain: training on unlabeled CIFAR-100, linear evaluation on coarse-grained CIFAR-100-20 (20 super-classes). Finetune: training on the unlabeled set of STL-10, finetuning on the labeled set (ResNet-34).

| Method | Linear Eval: CIFAR-100 | Linear Eval: tiny-ImgNet | Transfer: 10→100 | Transfer: 100→10 | Grain: CIFAR-100-20 | Finetune: STL-10 |
|---|---|---|---|---|---|---|
| Supervised (upper bound) | 65.32 ± 0.22 | 50.09 ± 0.32 | 33.98 ± 0.71 | 71.01 ± 0.44 | 76.35 ± 0.57 | 69.82 ± 3.36 |
| Random Weights (lower bound) | 7.65 ± 0.44 | 3.24 ± 0.43 | 7.65 ± 0.44 | 27.47 ± 0.83 | 16.56 ± 0.48 | n/a |
| DeepCluster (Caron et al., 2018) | 20.44 ± 0.80 | 11.64 ± 0.21 | 18.37 ± 0.41 | 43.39 ± 1.84 | 29.49 ± 1.36 | 73.37 ± 0.55 |
| RotationNet (Gidaris et al., 2018) | 29.02 ± 0.18 | 14.73 ± 0.48 | 27.02 ± 0.20 | 52.22 ± 0.70 | 40.45 ± 0.39 | 83.29 ± 0.44 |
| Deep InfoMax (Hjelm et al., 2019) | 24.07 ± 0.05 | 17.51 ± 0.15 | 23.73 ± 0.04 | 45.05 ± 0.24 | 33.92 ± 0.34 | 76.03 ± 0.37 |
| SimCLR (Chen et al., 2020) | 42.13 ± 0.35 | 25.79 ± 0.35 | 36.20 ± 0.16 | 65.59 ± 0.76 | 51.88 ± 0.48 | 89.31 ± 0.14 |
| Relational Reasoning (ours) | **46.17 ± 0.17** | **30.54 ± 0.42** | **41.50 ± 0.35** | **67.81 ± 0.42** | **52.44 ± 0.47** | **89.67 ± 0.33** |

[Figure 2: three-panel plot omitted in this text version.] Figure 2: (a) Difference in accuracy using the deeper backbone (Conv-4 → ResNet-32, linear evaluation) on CIFAR-10, CIFAR-100, and tiny-ImageNet. As the complexity of the dataset rises, our method performs increasingly better than the others. (b) Correlation between validation accuracy (3 seeds, Conv-4, CIFAR-10) and the number of mini-batch augmentations $K \in \{2, 4, 8, 16, 32\}$. Only in our method is the accuracy positively correlated with the number of augmentations. (c) Semi-supervised accuracy with an increasing percentage of labels (ResNet-32), compared to the supervised accuracy on CIFAR-10 and CIFAR-100.
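For reference, the linear evaluation protocol used throughout Table 1 can be sketched as below; the optimizer choice and learning rate are illustrative assumptions, not the exact settings of Appendix A:

```python
import torch
import torch.nn as nn

def linear_evaluation(backbone, feature_dim, num_classes, train_loader, epochs=100):
    """Linear evaluation protocol (Kolesnikov et al., 2019): freeze the pretrained
    backbone and train only a linear classifier on top of its features."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False         # no backpropagation into the backbone
    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # assumed settings
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                z = backbone(x)         # frozen features
            loss = criterion(classifier(z), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier                   # test-set accuracy of this head is the metric
```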
Grain. Different methods produce different representations: some may be better on datasets with a small number of labels (coarse-grained), others on datasets with a large number of labels (fine-grained). To investigate the granularity of the representations we train on unlabeled CIFAR-100, then perform linear evaluation using the 100 labels (fine-grained; e.g. apple, fox, bee, etc.) and the 20 super-labels (coarse-grained; e.g. fruits, mammals, insects, etc.). Also in this case our method is superior in all conditions, with an accuracy of 52.4% on CIFAR-100-20, see Table 1 and Appendix B.3 (supp. material). In comparison, the method does better in the fine-grained case, indicating that it is well suited for datasets with a large number of classes.

Finetuning. We used the STL-10 dataset (Coates et al., 2011), which provides a set of unlabeled data coming from a similar but different distribution than the labeled data. Methods have been trained for 300 epochs on the unlabeled set (100K images), finetuned for 20 epochs on the labeled set (5K images), and finally evaluated on the test set (8K images). We used a mini-batch of 64 with $K = 25$ and a ResNet-34. Implementation details are reported in Appendix A.6 (supp. material). Results in Table 1 show that our method obtains the highest accuracy: 89.67% (best seed 90.04%). Moreover, a wider comparison reported in Appendix B.4 (supp. material) shows that the method outperforms strong supervised baselines and the previous self-supervised state of the art (88.80%, Ji et al., 2019).

Depth of the backbone. In Appendix B.5 we report an extensive comparison on four backbones of increasing depth: Conv-4, ResNet-8, ResNet-32, and ResNet-56. We tested the three best methods (RotationNet, SimCLR, and Relational Reasoning) on CIFAR-10/100 linear evaluation, grain, and domain transfer, for a total of 24 conditions. Results show that our method has the highest accuracy in 21 of those conditions, with SimCLR performing better on CIFAR-10 linear evaluation with ResNet backbones. A distilled version of those results is reported in Figure 2a. The figure shows the gain in accuracy from using a ResNet-32 instead of a Conv-4 backbone for datasets of increasing complexity (10, 100, and 200 classes). As the complexity of the dataset rises, our method performs increasingly better than the others. The relative gain against SimCLR gets larger: −2.6% (CIFAR-10), +1.1% (CIFAR-100), +3.3% (tiny-ImageNet). The relative gain against RotationNet is even more evident: +8.7%, +11.2%, +11.9%.

Additional experiments. Figure 2b and Appendix B.6 (supp. material) show the difference in accuracy between $K = 2$ and $K \in \{4, 8, 16, 32\}$ mini-batch augmentations for a fixed mini-batch size. There is a clear positive correlation between the number of augmentations and the performance of our model, while the same does not hold for a self-supervised baseline (RotationNet) or the supervised baseline. Figure 2c and Appendix B.7 (supp. material) show the accuracy obtained via linear evaluation when the number of available labels is gradually increased (0%, 1%, 10%, 25%, 50%, 100%), on both CIFAR-10 and CIFAR-100 (ResNet-32). The accuracy is positively correlated with the proportion of labels available, approaching the supervised upper bound when 100% of labels are available.
Ablations. In Appendix B.8 we report the results of ablation studies on the aggregation function and relation head. We compare four aggregation functions: sum, mean, maximum, and concatenation. Results show that concatenation and maximum are, respectively, the most and the least effective functions. Concatenation may favor backpropagation, improving the quality of the representations, as supported by similar results in previous work (Sung et al., 2018). Ablations of the relation head have followed two directions: (i) removing the head, and (ii) replacing the relation module with an encoder. In the first condition we removed the head and replaced it with a simple dot product between representation pairs (BCE-focal loss). In the second condition we followed an approach similar to SimCLR (Chen et al., 2020), replacing the relation head with an encoder and applying the dot product to representations at the higher level (BCE-focal loss). The second condition differs from SimCLR in the loss type (BCE vs. contrastive) and the total number of mini-batch augmentations ($K = 32$ vs. $K = 2$). In both conditions we observe a severe degradation of the performance with respect to the complete model (from a minimum of 3% to a maximum of 23%), confirming that the relation module is a fundamental component of the pipeline (see discussion in Section 5).

Qualitative analysis. Appendix B.9 (supp. material) presents a qualitative comparison between the proposed method and RotationNet on an image retrieval downstream task. Given a random query image (not cherry-picked), the top-10 most similar images in representation space are retrieved. Our method shows better distinction between categories which are hard to separate (e.g. ships vs. planes, trucks vs. cars). The lower sample variance and the higher similarity with the query confirm the fine-grained organization of the representations, which account for color, texture, and geometry. An analysis of retrieval errors in Appendix B.10 (supp. material) shows that the proposed method is superior in accuracy across all categories while being more robust against misclassification, with a top-10 retrieval accuracy of 67.8% against 47.7% for RotationNet. In Appendix B.11 (supp. material) we report a qualitative analysis of the representations (ResNet-32, CIFAR-10) using t-SNE (Maaten and Hinton, 2008). Relational reasoning is able to aggregate the data in a more effective way, and to better capture high-level relations with lower scattering (e.g. the vehicles vs. animals super-categories).

5 Discussion and conclusions

Self-supervised relational reasoning is effective on a wide range of tasks, in both a quantitative and qualitative sense, and with backbones of different size (ResNet-32, ResNet-56, and ResNet-34, with $0.5 \times 10^6$, $0.9 \times 10^6$, and $21.3 \times 10^6$ parameters). Representations learned through comparison can be easily transferred across domains; they are fine-grained and compact, which may be due to the direct correlation between accuracy and number of augmentations. An instance is pushed towards a positive neighborhood (intra-reasoning) and repelled from a complementary set of negatives (inter-reasoning). The number of augmentations may have a primary role in this process, affecting the quality of the clusters. The possibility of exploiting a high number of augmentations, by generating them on the fly, could be decisive in the low-data regime (e.g. unsupervised few-shot/online learning), where self-supervised relational reasoning has the potential to thrive. These are factors that require further consideration and investigation.
From self-supervised to supervised. Recent work has shown that contrastive learning can be used in a supervised setting with competitive results (Khosla et al., 2020). In our experiments we have observed a similar trend, with relational reasoning approaching the supervised performance when all the labels are available. However, we have obtained those results using the same hyperparameters and augmentations used in the self-supervised case, while there may be alternatives that are more effective. Learning by comparison could help in disentangling fine-grained differences in a fully supervised setting with a high number of classes, and could be decisive in building complex taxonomic representations, as pointed out in cognitive studies (Gentner and Namy, 1999; Namy and Gentner, 2002).

Comparison with contrastive methods. We have compared relational reasoning to a state-of-the-art contrastive learning method (SimCLR) using the same backbone, head, augmentation strategy, and learning schedule. Relational reasoning outperforms SimCLR (+3% on average) using a lower number of pairs, being more efficient. Given a mini-batch of size 64, relational reasoning uses $6.35 \times 10^4$ ($K = 32$) and $1.5 \times 10^4$ ($K = 16$) pairs, against $6.55 \times 10^4$ for SimCLR with mini-batch 128. Contrastive losses need a large number of negatives, which can be gathered by increasing $M$, the size of the mini-batch, or $K$, the number of augmentations (both solutions incur a quadratic cost, see Section 3.1). High-quality negatives can only be gathered following the first solution, since the second provides lower sample variance. A typical mini-batch in SimCLR encloses 98% negatives and 2% positives; in our method, 50% negatives and 50% positives. The larger set of positives could be one of the reasons why relational reasoning is more effective in disentangling fine-grained representations. In addition to the difference in loss type, there is an important structural difference between the two approaches: in SimCLR pairs are allocated in the loss space and then compared via dot product, while in relational reasoning they are aggregated in the space of transferable representations and compared through a relation head. Ablation studies in Section 4 have shown that this structural difference is fundamental for obtaining higher performance, but the way it influences the learning dynamics and the optimization process is not clear and requires further investigation.

Why does cross-entropy work so well? We argue that, in the context of recent state-of-the-art methods, cross-entropy has been overlooked in favor of contrastive losses. Our experiments show that cross-entropy is a more efficient and effective objective function with respect to the commonly used contrastive losses. Based on the results of the ablation studies, we hypothesize that the difference in performance is mainly due to the use of a relation module in conjunction with the binary cross-entropy loss. When the BCE is split from the relation head and applied directly to the representations there is a drastic drop in performance; applying the BCE to surrogate representations in a second encoding stage (as in SimCLR) is equally ineffective. Therefore, the use of BCE on its own does not provide any advantage, but in concert with the relation head it becomes effective. A more thorough analysis is necessary to substantiate these findings, which is left for future work.
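As a quick arithmetic check of the pair counts quoted in this section (a worked example using eq. (4) and the in-batch count from Section 3.1):

```python
def relational_pairs(M, K):
    """Pairs used by relational reasoning, eq. (4): P = M * (K^2 - K)."""
    return M * (K**2 - K)

def simclr_pairs(M, K=2):
    """In-batch comparisons in contrastive learning, Section 3.1: P = (M * K)^2."""
    return (M * K) ** 2

print(relational_pairs(64, 32))  # 63488 ~= 6.35e4
print(relational_pairs(64, 16))  # 15360 ~= 1.5e4
print(simclr_pairs(128))         # 65536 ~= 6.55e4
```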
Broader Impact

The motivation behind this work is to build systems able to exploit a large amount of unlabeled data. Applications that could benefit from the proposed method span from standard supervised classifiers to medical diagnostic systems. Therefore, there is a large number of individuals who may benefit from, or be harmed by, this research. This requires putting some effort into selecting the data source, especially when the system is scaled. In most cases a large body of unlabeled images can be easily gathered from the internet; to avoid biases, those images should be representative of different categories. Our method does not guarantee unbiased predictions, and therefore it should be used with caution in critical applications. Individuals who may want to use it should consider the particular source of data at hand and evaluate how it could impact the system performance after the final deployment.

Acknowledgments and Disclosure of Funding

This work was supported by a Huawei DDMPLab Innovation Research Grant. MP and AS would like to thank the anonymous reviewers for useful comments and suggestions, and the BayesWatch team for feedback and discussion, in particular Elliot J. Crowley, Luke Darlow, and Joseph Mellor. MP would like to thank the Becchi team for revising the preliminary version of the manuscript, in particular Valerio Biscione, Riccardo Polvara, and Luca Surace.

References

Antoniou, A., Patacchiola, M., Ochal, M., and Storkey, A. (2020). Defining benchmarks for continual few-shot learning. arXiv preprint arXiv:2004.11967.

Asano, Y. M., Rupprecht, C., and Vedaldi, A. (2020). Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations.

Bachman, P., Hjelm, R. D., and Buchwalter, W. (2019). Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

Boudiaf, M., Rony, J., Ziko, I. M., Granger, E., Pedersoli, M., Piantanida, P., and Ayed, I. B. (2020). Metric learning: cross-entropy vs. pairwise losses. arXiv preprint arXiv:2003.08983.

Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., and Shah, R. (1994). Signature verification using a siamese time delay neural network. In Advances in Neural Information Processing Systems.

Caron, M., Bojanowski, P., Joulin, A., and Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

Coates, A., Ng, A., and Lee, H. (2011). An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics.

DeVries, T. and Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

Doersch, C., Gupta, A., and Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision.

Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems.
Džeroski, S., De Raedt, L., and Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43(1-2):7-52.

Gentner, D. and Kurtz, K. (2005). Relational categories. In W. K. Ahn, R. L. Goldstone, B. C. Love, A. B. Markman, and P. W. Wolff (Eds.), pages 151-175.

Gentner, D. and Namy, L. L. (1999). Comparison in the development of categories. Cognitive Development, 14(4):487-513.

Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations.

Goldwater, M. B., Don, H. J., Krusche, M. J., and Livesey, E. J. (2018). Relational discovery in category learning. Journal of Experimental Psychology: General, 147(1):1.

Gupta, D., Ramjee, R., Kwatra, N., and Sivathanu, M. (2020). Unsupervised clustering using pseudo-semi-supervised learning. In International Conference on Learning Representations.

Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Conference on Artificial Intelligence and Statistics.

Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition.

Haeusser, P., Plapp, J., Golkov, V., Aljalbout, E., and Cremers, D. (2018). Associative deep clustering: Training a classification network with no labels. In German Conference on Pattern Recognition. Springer.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2019). Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. (2019). Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations.

Hu, H., Gu, J., Zhang, Z., Dai, J., and Wei, Y. (2018). Relation networks for object detection. In Computer Vision and Pattern Recognition.

Jenni, S. and Favaro, P. (2018). Self-supervised feature learning by learning to spot artifacts. In Computer Vision and Pattern Recognition.

Ji, X., Henriques, J. F., and Vedaldi, A. (2019). Invariant information clustering for unsupervised image classification and segmentation. In International Conference on Computer Vision.

Jing, L. and Tian, Y. (2020). Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. (2020). Supervised contrastive learning. arXiv preprint arXiv:2004.11362.

Kolesnikov, A., Zhai, X., and Beyer, L. (2019). Revisiting self-supervised visual representation learning. In Computer Vision and Pattern Recognition.

Koller, D., Friedman, N., Džeroski, S., Sutton, C., McCallum, A., Pfeffer, A., Abbeel, P., Wong, M.-F., Heckerman, D., Meek, C., et al. (2007). Introduction to statistical relational learning. MIT Press.

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

Larsson, G., Maire, M., and Shakhnarovich, G. (2016). Learning representations for automatic colorization. In European Conference on Computer Vision. Springer.
Lee, D.-H. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In International Conference on Computer Vision.

Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3):105-117.

Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605.

Misra, I. and van der Maaten, L. (2019). Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991.

Musgrave, K., Belongie, S., and Lim, S.-N. (2020). A metric learning reality check. arXiv preprint arXiv:2003.08505.

Namy, L. L. and Gentner, D. (2002). Making a silk purse out of two sow's ears: Young children's use of comparison in category learning. Journal of Experimental Psychology: General, 131(1):5.

Noroozi, M. and Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision. Springer.

Noroozi, M., Pirsiavash, H., and Favaro, P. (2017). Representation learning by learning to count. In International Conference on Computer Vision.

Oliver, A., Odena, A., Raffel, C. A., Cubuk, E. D., and Goodfellow, I. (2018). Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems.

Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Oyallon, E., Belilovsky, E., and Zagoruyko, S. (2017). Scaling the scattering transform: Deep hybrid networks. In International Conference on Computer Vision.

Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition.

Raposo, D., Santoro, A., Barrett, D., Pascanu, R., Lillicrap, T., and Battaglia, P. (2017). Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068.

Rippel, O., Paluri, M., Dollar, P., and Bourdev, L. (2016). Metric learning with adaptive density discrimination. In International Conference on Learning Representations.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252.

Santoro, A., Faulkner, R., Raposo, D., Rae, J., Chrzanowski, M., Weber, T., Wierstra, D., Vinyals, O., Pascanu, R., and Lillicrap, T. (2018). Relational recurrent neural networks. In Advances in Neural Information Processing Systems.

Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. (2017). A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems.

Schmidhuber, J. (1987). Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Diplomarbeit, Technische Universität München.

Schmidhuber, J. (1990). Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Computer Vision and Pattern Recognition.

Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. (2020). FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In Computer Vision and Pattern Recognition.

Tian, Y., Krishnan, D., and Isola, P. (2019). Contrastive multiview coding. arXiv preprint arXiv:1906.05849.

Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. (2020). On mutual information maximization for representation learning. In International Conference on Learning Representations.

Weinberger, K. Q., Blitzer, J., and Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems.

Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In Computer Vision and Pattern Recognition.

Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lockhart, E., et al. (2019). Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations.

Zhang, R., Isola, P., and Efros, A. A. (2016). Colorful image colorization. In European Conference on Computer Vision. Springer.

Zhang, R., Isola, P., and Efros, A. A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Computer Vision and Pattern Recognition.

Zhuang, C., Zhai, A. L., and Yamins, D. (2019). Local aggregation for unsupervised learning of visual embeddings. In International Conference on Computer Vision.