Published as a conference paper at ICLR 2021

CONTRASTIVE LEARNING WITH HARD NEGATIVE SAMPLES

Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, Stefanie Jegelka
Massachusetts Institute of Technology, Cambridge, MA, USA
{joshrob,cychuang,suvrit,stefje}@mit.edu

ABSTRACT

How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead.

1 INTRODUCTION

Owing to their empirical success, contrastive learning methods (Chopra et al., 2005; Hadsell et al., 2006) have become one of the most popular self-supervised approaches for learning representations (Oord et al., 2018; Tian et al., 2019; Chen et al., 2020a). In computer vision, unsupervised contrastive learning methods have even outperformed supervised pre-training for object detection and segmentation tasks (Misra & Maaten, 2020; He et al., 2020).

Contrastive learning relies on two key ingredients: notions of similar (positive) pairs $(x, x^+)$ and dissimilar (negative) pairs $(x, x^-)$ of data points. The training objective, typically noise-contrastive estimation (Gutmann & Hyvärinen, 2010), guides the learned representation $f$ to map positive pairs to nearby locations, and negative pairs farther apart; other objectives have also been considered (Chen et al., 2020a). The success of the associated methods depends on the design of informative positive and negative pairs, which cannot exploit true similarity information since there is no supervision.

Much research effort has addressed sampling strategies for positive pairs, and this has been a key driver of recent progress in multi-view and contrastive learning (Blum & Mitchell, 1998; Xu et al., 2013; Bachman et al., 2019; Chen et al., 2020a; Tian et al., 2020). For image data, positive sampling strategies often apply transformations that preserve semantic content, e.g., jittering, random cropping, separating color channels, etc. (Chen et al., 2020a;c; Tian et al., 2019). Such transformations have also been effective in learning control policies from raw pixel data (Srinivas et al., 2020). Positive sampling techniques have also been proposed for sentence, audio, and video data (Logeswaran & Lee, 2018; Oord et al., 2018; Purushwalkam & Gupta, 2020; Sermanet et al., 2018).

Surprisingly, the choice of negative pairs has drawn much less attention in contrastive learning. Often, given an anchor point $x$, a negative $x^-$ is simply sampled uniformly from the training data, independent of how informative it may be for the learned representation.
In supervised and metric learning settings, hard (true negative) examples can help guide a learning method to correct its mistakes more quickly (Schroff et al., 2015; Song et al., 2016). For representation learning, informative negative examples are intuitively those pairs that are mapped nearby but should be far apart. This idea is successfully applied in metric learning, where true pairs of dissimilar points are available, as opposed to unsupervised contrastive learning.

Code available at: https://github.com/joshr17/HCL

With this motivation, we address the challenge of selecting informative negatives for contrastive representation learning. In response, we propose a solution that builds a tunable sampling distribution that prefers negative pairs whose representations are currently very similar. This solution faces two challenges: (1) we do not have access to any true similarity or dissimilarity information; (2) we need an efficient sampling strategy for this tunable distribution. We overcome (1) by building on ideas from positive-unlabeled learning (Elkan & Noto, 2008; Du Plessis et al., 2014), and (2) by designing an efficient, easy-to-implement importance sampling technique that incurs no computational overhead. Our theoretical analysis shows that, as a function of the tuning parameter, the optimal representations for our new method place similar inputs in tight clusters, whilst spacing the clusters as far apart as possible. Empirically, our hard negative sampling strategy improves the downstream task performance for image, graph and text data, supporting that our negative examples are indeed more informative.

Contributions. In summary, we make the following contributions:

1. We propose a simple distribution over hard negative pairs for contrastive representation learning, and derive a practical importance sampling strategy with zero computational overhead that takes into account the lack of true dissimilarity information;
2. We theoretically analyze the hard negatives objective and optimal representations, showing that they capture desirable generalization properties;
3. We empirically observe that the proposed sampling method improves the downstream task performance on image, graph and text data.

Before moving on to the problem formulation and our results, we summarize related work below.

1.1 RELATED WORK

Contrastive Representation Learning. Various frameworks for contrastive learning of visual representations have been proposed, including SimCLR (Chen et al., 2020a;b), which uses augmented views of other items in a minibatch as negative samples, and MoCo (He et al., 2020; Chen et al., 2020c), which uses a momentum-updated memory bank of old negative representations to enable the use of very large batches of negative samples. Most contrastive methods are unsupervised; however, there exist some that use label information (Sylvain et al., 2020; Khosla et al., 2020). Many works study the role of positive pairs and, e.g., propose to apply large perturbations for images (Chen et al., 2020a;c), or argue to minimize the mutual information within positive pairs, apart from information relevant for the ultimate prediction task (Tian et al., 2020).
Beyond visual data, contrastive methods have been developed for sentence embeddings (Logeswaran & Lee, 2018), sequential data (Oord et al., 2018; Hénaff et al., 2020), graphs (Sun et al., 2020; Hassani & Khasahmadi, 2020; Li et al., 2019), node representation learning (Velickovic et al., 2019), and learning representations from raw images for off-policy control (Srinivas et al., 2020). The role of negative pairs has been much less studied. Chuang et al. (2020) propose a method for debiasing, i.e., correcting for the fact that not all negative pairs may be true negatives. Their method takes the viewpoint of positive-unlabeled learning and exploits a decomposition of the true negative distribution. Kalantidis et al. (2020) consider applying Mixup (Zhang et al., 2018) to generate hard negatives in latent space, and Jin et al. (2018) exploit the specific temporal structure of video to generate negatives for object detection.

Negative Mining in Deep Metric Learning. As opposed to the contrastive representation learning literature, selection strategies for negative samples have been thoroughly studied in (deep) metric learning (Schroff et al., 2015; Song et al., 2016; Harwood et al., 2017; Wu et al., 2017; Ge, 2018; Suh et al., 2019). Most of these works observe that it is helpful to use negative samples that are difficult for the current embedding to discriminate. Schroff et al. (2015) qualify this, observing that some examples are simply too hard, and propose selecting semi-hard negative samples. The well-known importance of negative samples in metric learning, where (partial) true dissimilarity information is available, raises the question of negative samples in contrastive learning, the subject of this paper.

2 CONTRASTIVE LEARNING SETUP

We begin with the setup and the idea of contrastive representation learning. We wish to learn an embedding $f : \mathcal{X} \to \mathbb{S}^{d-1}/t$ that maps an observation $x$ to a point on the hypersphere $\mathbb{S}^{d-1}/t$ in $\mathbb{R}^d$ of radius $1/t$, where $t$ is the temperature scaling hyperparameter.

Following the setup of Arora et al. (2019), we assume an underlying set of discrete latent classes $\mathcal{C}$ that represent semantic content, so that similar pairs $(x, x^+)$ have the same latent class.

[Figure 1: Schematic illustration of negative sampling methods for the example of classifying species of tree (willow, sycamore, maple). Top row: uniform sampling of negative examples (red rings) mostly selects data points very different from the anchor (yellow triangle), and may even sample examples from the same class. Bottom row: hard negative sampling prefers examples that are (incorrectly) close to the anchor in embedding space.]

Denoting the distribution over latent classes by $\rho(c)$ for $c \in \mathcal{C}$, we define the joint distribution $p_{x,c}(x, c) = p(x \mid c)\rho(c)$, whose marginal $p(x)$ we refer to simply as $p$, and assume $\mathrm{supp}(p) = \mathcal{X}$. For simplicity, we assume $\rho(c) = \tau^+$ is uniform, and let $\tau^- = 1 - \tau^+$ be the probability of another class. Since the class prior $\tau^+$ is unknown in practice, it must either be treated as a hyperparameter, or estimated (Christoffel et al., 2016; Jain et al., 2016). Let $h : \mathcal{X} \to \mathcal{C}$ be the true underlying hypothesis that assigns class labels to inputs. We write $x \sim x'$ to denote the label equivalence relation $h(x) = h(x')$.
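In implementations, the radius-$1/t$ constraint is typically realized by normalizing encoder outputs. The following is a minimal sketch of this convention, not code from the paper; the encoder and the temperature value are placeholders:

```python
import torch
import torch.nn.functional as F

def embed_on_sphere(encoder: torch.nn.Module, x: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Map inputs onto the hypersphere S^{d-1}/t of radius 1/t.

    `encoder` is any network producing d-dimensional features; dividing the
    unit-normalized output by t enforces ||f(x)|| = 1/t, so inner products
    f(x)^T f(x') lie in [-1/t^2, 1/t^2] as assumed in the analysis.
    """
    z = encoder(x)                      # (batch, d) raw features
    return F.normalize(z, dim=1) / t    # rows now have norm 1/t
```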
We denote by $p^+_x(x') = p(x' \mid h(x') = h(x))$ the distribution over points with the same label as $x$, and by $p^-_x(x') = p(x' \mid h(x') \neq h(x))$ the distribution over points with labels different from $x$. We drop the subscript $x$ when the context is clear. Following the usual convention, we overload notation and also write $x \sim p$ to denote a point sampled from $p$.

For each data point $x \sim p$, the noise-contrastive estimation (NCE) objective (Gutmann & Hyvärinen, 2010) for learning the representation $f$ uses a positive example $x^+$ with the same label as $x$, and negative examples $\{x^-_i\}_{i=1}^N$ with (supposedly) different labels, $h(x^-_i) \neq h(x)$, sampled from $q$:

$$\mathbb{E}_{\substack{x \sim p,\; x^+ \sim p^+_x \\ \{x^-_i\}_{i=1}^N \sim q}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \frac{Q}{N}\sum_{i=1}^N e^{f(x)^\top f(x^-_i)}} \right]. \tag{1}$$

The weighting parameter $Q$ is introduced for the purpose of analysis. When $N$ is finite we take $Q = N$, yielding the usual form of the contrastive objective. The negative sample distribution $q$ is frequently chosen to be the marginal distribution $p$, or, in practice, an empirical approximation of it (Tian et al., 2019; Chen et al., 2020a;c; He et al., 2020; Oord et al., 2018; Hénaff et al., 2020). In this paper we ask: is there a better way to choose $q$?
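For concreteness, here is a minimal PyTorch sketch of the finite-sample objective (1) with the conventional choices $Q = N$ and $q = p$ (negatives drawn uniformly from the data); the function and tensor names are ours, not the paper's:

```python
import torch

def nce_loss(f_x: torch.Tensor, f_pos: torch.Tensor, f_neg: torch.Tensor) -> torch.Tensor:
    """Objective (1) with Q = N and uniformly sampled negatives (q = p).

    f_x, f_pos: (B, d) anchor and positive embeddings.
    f_neg:      (B, N, d) negative embeddings for each anchor.
    """
    pos = torch.exp((f_x * f_pos).sum(dim=1))                # e^{f(x)^T f(x^+)}
    neg = torch.exp(torch.einsum("bd,bnd->bn", f_x, f_neg))  # e^{f(x)^T f(x_i^-)}
    # With Q = N, the weight Q/N on the sum over negatives equals 1.
    return -torch.log(pos / (pos + neg.sum(dim=1))).mean()
```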
3 HARD NEGATIVE SAMPLING

In this section we describe our approach for hard negative sampling. We begin by asking: what makes a good negative sample? To answer this question we adopt the following two guiding principles:

Principle 1. $q$ should only sample true negatives $x^-_i$ whose labels differ from that of the anchor $x$.

Principle 2. The most useful negative samples are ones that the embedding currently believes to be similar to the anchor.

In short, negative samples that have a different label from the anchor, but that are embedded nearby, are likely to be most useful and provide significant gradient information during training. In metric learning there is access to true negative pairs, automatically fulfilling the first principle.

In unsupervised contrastive learning there is no supervision, so upholding Principle 1 exactly is impossible. In this paper we propose a method that upholds Principle 1 approximately, and simultaneously combines this idea with the key additional conceptual ingredient of hardness (encapsulated in Principle 2). The level of hardness in our method can be smoothly adjusted, allowing the user to select the hardness that best trades off between an improved learning signal from hard negatives, and the harm due to the correction of false negatives being only approximate. This is important since the hardest points are those closest to the anchor, and are expected to have a high propensity to have the same label. Therefore the damage from the approximation not removing all false negatives becomes larger for harder samples, creating the trade-off. As a special case of our method, when the hardness level is tuned fully down, we obtain the method proposed in (Chuang et al., 2020) that only upholds Principle 1 (approximately) but not Principle 2. Finally, beyond Principles 1 and 2, we wish to design an efficient sampling method that does not add additional computational overhead during training.

3.1 PROPOSED HARD SAMPLING METHOD

Our first goal is to design a distribution $q$ on $\mathcal{X}$ that is allowed to depend on the embedding $f$ and the anchor $x$. From $q$ we sample a batch of negatives $\{x^-_i\}_{i=1}^N$ according to the principles noted above. We propose sampling negatives from the distribution $q^-_\beta$ defined as

$$q^-_\beta(x^-) := q_\beta(x^- \mid h(x) \neq h(x^-)), \quad \text{where} \quad q_\beta(x^-) \propto e^{\beta f(x)^\top f(x^-)} \cdot p(x^-),$$

for $\beta \geq 0$. Note that $q^-_\beta$ and $q_\beta$ both depend on $x$, but we suppress the dependence in the notation. The exponential term in $q_\beta$ is an unnormalized von Mises-Fisher distribution with mean direction $f(x)$ and concentration parameter $\beta$ (Mardia & Jupp, 2000). There are two key components to $q^-_\beta$, corresponding to each principle: 1) conditioning on the event $\{h(x) \neq h(x^-)\}$, which guarantees that $(x, x^-)$ correspond to different latent classes (Principle 1); 2) the concentration parameter $\beta$, which controls the degree to which $q_\beta$ up-weights points $x^-$ that have large inner product (similarity) with the anchor $x$ (Principle 2). Since $f$ lies on the surface of a hypersphere of radius $1/t$, we have $\|f(x) - f(x^-)\|^2 = 2/t^2 - 2 f(x)^\top f(x^-)$, so preferring points with large inner product is equivalent to preferring points with small squared Euclidean distance.

Although we have designed $q^-_\beta$ to have all of the desired components, it is not clear how to sample efficiently from it. To work towards a practical method, note that we can rewrite this distribution by adopting a PU-learning viewpoint (Elkan & Noto, 2008; Du Plessis et al., 2014; Chuang et al., 2020). That is, by conditioning on the event $\{h(x) = h(x^-)\}$ we can split $q_\beta(x^-)$ as

$$q_\beta(x^-) = \tau^- q^-_\beta(x^-) + \tau^+ q^+_\beta(x^-), \tag{2}$$

where $q^+_\beta(x^-) = q_\beta(x^- \mid h(x) = h(x^-)) \propto e^{\beta f(x)^\top f(x^-)} \cdot p^+(x^-)$. Rearranging equation 2 yields a formula $q^-_\beta(x^-) = \big(q_\beta(x^-) - \tau^+ q^+_\beta(x^-)\big)/\tau^-$ for the negative sampling distribution $q^-_\beta$ in terms of two distributions that are tractable, since we have samples from $p$ and can approximate samples from $p^+$ using a set of semantics-preserving transformations, as is typical in contrastive learning methods (see Appendix E for extra discussion of the practical implications of this approximation).

It is possible to generate samples from $q_\beta$, and approximately from $q^+_\beta$, using rejection sampling and data augmentations to generate positives. However, rejection sampling involves an algorithmic complication, since the procedure for sampling batches must be modified. To avoid this, we instead take an importance sampling approach. To obtain this, first note that fixing the number $Q$ and taking the limit $N \to \infty$ in the objective (1) yields

$$L(f, q) = \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + Q\, \mathbb{E}_{x^- \sim q}[e^{f(x)^\top f(x^-)}]} \right].$$

The original objective (1) can be viewed as a finite negative sample approximation to $L(f, q)$ (note that $L(f, q)$ implicitly depends on $Q$). Inserting $q = q^-_\beta$ and using the rearrangement of equation (2), we obtain the following hardness-biased objective:

$$\mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \frac{Q}{\tau^-}\big(\mathbb{E}_{x^- \sim q_\beta}[e^{f(x)^\top f(x^-)}] - \tau^+ \mathbb{E}_{v \sim q^+_\beta}[e^{f(x)^\top f(v)}]\big)} \right].$$

This objective suggests that we need only approximate the expectations $\mathbb{E}_{x^- \sim q_\beta}[e^{f(x)^\top f(x^-)}]$ and $\mathbb{E}_{v \sim q^+_\beta}[e^{f(x)^\top f(v)}]$ over $q_\beta$ and $q^+_\beta$ (rather than explicitly sampling). This can be achieved using classical Monte Carlo importance sampling techniques with samples from $p$ and $p^+$ as follows:

$$\mathbb{E}_{x^- \sim q_\beta}[e^{f(x)^\top f(x^-)}] = \mathbb{E}_{x^- \sim p}[e^{f(x)^\top f(x^-)} q_\beta/p] = \mathbb{E}_{x^- \sim p}[e^{(\beta+1) f(x)^\top f(x^-)}/Z_\beta],$$
$$\mathbb{E}_{v \sim q^+_\beta}[e^{f(x)^\top f(v)}] = \mathbb{E}_{v \sim p^+}[e^{f(x)^\top f(v)} q^+_\beta/p^+] = \mathbb{E}_{v \sim p^+}[e^{(\beta+1) f(x)^\top f(v)}/Z^+_\beta],$$

where $Z_\beta$ and $Z^+_\beta$ are the partition functions of $q_\beta$ and $q^+_\beta$ respectively. The right-hand terms readily admit empirical approximations by replacing $p$ and $p^+$ with $\hat{p}(x) = \frac{1}{N}\sum_{i=1}^N \delta_{x^-_i}(x)$ and $\hat{p}^+(x) = \frac{1}{M}\sum_{i=1}^M \delta_{x^+_i}(x)$ respectively ($\delta_w$ denotes the Dirac delta function centered at $w$).
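The paper's reference PyTorch pseudocode is in Fig. 13 of Appendix D, which is not reproduced here; the following is our own minimal sketch of the resulting reweighted loss under the choices used in the experiments ($Q = N$, $M = 1$). The partition function estimates $\widehat{Z}_\beta$, $\widehat{Z}^+_\beta$ described next are folded into the softmax normalization, and the clamp floor is an assumption we add for numerical safety:

```python
import torch

def hard_nce_loss(f_x, f_pos, f_neg, beta: float = 1.0, tau_plus: float = 0.1):
    """Hardness-biased, debiased NCE loss (a sketch, not the reference code).

    f_x, f_pos: (B, d) anchor / positive embeddings.
    f_neg:      (B, N, d) negatives drawn uniformly from the data (from p).
    """
    N = f_neg.shape[1]
    s_pos = (f_x * f_pos).sum(dim=1)                # f(x)^T f(x^+)
    s_neg = torch.einsum("bd,bnd->bn", f_x, f_neg)  # f(x)^T f(x_i^-)

    pos = torch.exp(s_pos)
    # Importance weights e^{beta s_i} / Z_beta-hat estimate E_{q_beta}[e^{f^T f^-}].
    w = torch.softmax(beta * s_neg, dim=1)
    neg_mass = N * (w * torch.exp(s_neg)).sum(dim=1)
    # PU correction: with M = 1, E_{q^+_beta}[e^{f(x)^T f(v)}] is estimated by e^{s_pos}.
    Ng = (neg_mass - N * tau_plus * pos) / (1.0 - tau_plus)
    Ng = torch.clamp(Ng, min=1e-6)                  # numerical guard (our addition)

    return -torch.log(pos / (pos + Ng)).mean()
```

Setting `beta=0` recovers the debiased objective of Chuang et al. (2020), and additionally setting `tau_plus=0` recovers the standard SimCLR-style loss, matching the special cases discussed above.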
The only unknowns left are the partition functions, $Z_\beta = \mathbb{E}_{x^- \sim p}[e^{\beta f(x)^\top f(x^-)}]$ and $Z^+_\beta = \mathbb{E}_{x^+ \sim p^+}[e^{\beta f(x)^\top f(x^+)}]$, which are themselves expectations over $p$ and $p^+$ and therefore admit the empirical estimates

$$\widehat{Z}_\beta = \frac{1}{N}\sum_{i=1}^N e^{\beta f(x)^\top f(x^-_i)}, \qquad \widehat{Z}^+_\beta = \frac{1}{M}\sum_{i=1}^M e^{\beta f(x)^\top f(x^+_i)}.$$

It is important to emphasize the simplicity of the implementation of our proposed approach. Since we propose to reweight the objective instead of modifying the sampling procedure, only two extra lines of code are needed to implement our approach, with no additional computational overhead. PyTorch-style pseudocode for the objective is given in Fig. 13 in Appendix D.

4 ANALYSIS OF HARD NEGATIVE SAMPLING

4.1 HARD SAMPLING INTERPOLATES BETWEEN MARGINAL AND WORST-CASE NEGATIVES

Intuitively, the concentration parameter $\beta$ in our proposed negative sample distribution $q^-_\beta$ controls the level of hardness of the negative samples. As discussed earlier, the debiasing method of Chuang et al. (2020) can be recovered as a special case: taking $\beta = 0$ yields the distribution $q^-_0$. This case amounts to correcting for the fact that some samples in a negative batch sampled from $p$ will have the same label as the anchor. But what interpretation does large $\beta$ admit? Specifically, what does the distribution $q^-_\beta$ converge to in the limit $\beta \to \infty$, if anything? We show that in the limit, $q^-_\beta$ approximates an inner solution to the following zero-sum two-player game:

$$\inf_f \sup_{q \in \Pi} \; L(f, q) = \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + Q\, \mathbb{E}_{x^- \sim q}[e^{f(x)^\top f(x^-)}]} \right],$$

where $\Pi = \{q = q(\cdot\,; x, f) : \mathrm{supp}\big(q(\cdot\,; x, f)\big) \subseteq \{x^- \in \mathcal{X} : x^- \not\sim x\} \text{ for all } x \in \mathcal{X}\}$ is the set of distributions whose support is disjoint from points with the same class as $x$ (without loss of generality we assume $\{x^- \in \mathcal{X} : x^- \not\sim x\}$ is non-empty). Since $q = q(\cdot\,; x, f)$ depends on $x$ and $f$, it can be thought of as a family of distributions. The formal statement is as follows.

Proposition 3. Let $L^*(f) = \sup_{q \in \Pi} L(f, q)$. Then for any $t > 0$ and $f : \mathcal{X} \to \mathbb{S}^{d-1}/t$ we observe the convergence $L(f, q^-_\beta) \to L^*(f)$ as $\beta \to \infty$.

Proof. See Appendix A.1.

To develop a better intuitive understanding of the worst-case negative distribution objective $L^*(f) = \sup_{q \in \Pi} L(f, q)$, we note that the supremum can be characterized analytically. Indeed,

$$\sup_{q \in \Pi} L(f, q) = -\mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \big[f(x)^\top f(x^+)\big] + \sup_{q \in \Pi} \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \log \Big\{ e^{f(x)^\top f(x^+)} + Q\, \mathbb{E}_{x^- \sim q}[e^{f(x)^\top f(x^-)}] \Big\}$$
$$= -\mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \big[f(x)^\top f(x^+)\big] + \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \log \Big\{ e^{f(x)^\top f(x^+)} + Q\, \sup_{q \in \Pi} \mathbb{E}_{x^- \sim q}[e^{f(x)^\top f(x^-)}] \Big\}.$$

The supremum over $q$ can be pushed inside the expectation since $q$ is a family of distributions indexed by $x$, reducing the problem to maximizing $\mathbb{E}_{x^- \sim q}[e^{f(x)^\top f(x^-)}]$, which is solved by any $q$ whose support is a subset of $\arg\sup_{x^- : x^- \not\sim x} e^{f(x)^\top f(x^-)}$, if the supremum is attained. However, computing such points involves maximizing a neural network. Instead of taking this challenging route, using $q^-_\beta$ defines a lower bound by placing higher probability on $x^-$ for which $f(x)^\top f(x^-)$ is large. This lower bound becomes tight as $\beta \to \infty$ (Proposition 3).
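The convergence in Proposition 3 is easy to see numerically: the importance weights $e^{\beta s}/\widehat{Z}_\beta$ concentrate on the largest inner product as $\beta$ grows, so the estimated negative expectation approaches $e^{M(x)}$. A small illustrative script (the score values are arbitrary, chosen by us):

```python
import torch

# As beta grows, softmax(beta * s) concentrates on the hardest negative,
# so the estimate of E_{q_beta}[e^{f(x)^T f(x^-)}] approaches e^{max_i s_i}.
s = torch.tensor([0.10, 0.40, 0.80, 0.95])  # inner products with the anchor
for beta in [0.0, 1.0, 10.0, 100.0]:
    est = (torch.softmax(beta * s, dim=0) * torch.exp(s)).sum().item()
    print(f"beta={beta:6.1f}  estimate={est:.4f}  e^max={s.max().exp().item():.4f}")
```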
4.2 OPTIMAL EMBEDDINGS ON THE HYPERSPHERE FOR WORST-CASE NEGATIVE SAMPLES

What desirable properties does an optimal contrastive embedding (global minimizer of $L$) possess that make the representation generalizable? To study this question, we first analyze the distribution of an optimal embedding $f$ on the hypersphere when negatives are sampled from the adversarial worst-case distribution. We consider a different limiting viewpoint of objective (1) as the number of negative samples $N \to \infty$. Following the formulation of Wang & Isola (2020), we take $Q = N$ in (1) and subtract $\log N$. This changes neither the set of minimizers, nor the geometry of the loss surface. Taking the number of negative samples $N \to \infty$ yields the limiting objective

$$L_\infty(f, q) = \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{\mathbb{E}_{x^- \sim q}[e^{f(x)^\top f(x^-)}]} \right].$$

Theorem 4. Suppose the downstream task is classification (i.e., $\mathcal{C}$ is finite), and let $L^*_\infty(f) = \sup_{q \in \Pi} L_\infty(f, q)$. The infimum $\inf_{f: \text{ measurable}} L^*_\infty(f)$ is attained, and any $f^*$ achieving the global minimum is such that $f^*(x) = f^*(x^+)$ almost surely. Furthermore, letting $v_c = f^*(x)$ for any $x$ such that $h(x) = c$ (so $v_c$ is well defined up to a set of $x$ of measure zero), $f^*$ is characterized as being any solution to the following ball-packing problem:

$$\max_{\{v_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}} \; \sum_{c \in \mathcal{C}} \rho(c) \min_{c' \neq c} \|v_c - v_{c'}\|^2. \tag{7}$$

Proof. See Appendix A.2.

Interpretation: The first component of the result is that $f^*(x) = f^*(x^+)$ almost surely for an optimal $f^*$. That is, an optimal embedding $f^*$ must be invariant across pairs of similar inputs $x, x^+$. The second component characterizes solutions via the classical geometrical ball-packing problem of Tammes (1930) (Eq. 7), which has only been solved exactly for uniform $\rho$, for specific values of $|\mathcal{C}|$, and typically for $\mathbb{S}^2$ (Schütte & Van der Waerden, 1951; Musin & Tarasov, 2015; Tammes, 1930). When the distribution $\rho$ over classes is uniform, this problem is solved by a set of $|\mathcal{C}|$ points on the hypersphere such that the average squared $\ell_2$ distance from a point to the nearest other point is as large as possible. In other words, suppose we wish to place $|\mathcal{C}|$ balls¹ on $\mathbb{S}^{d-1}$ so that they do not intersect. Then solutions to the Tammes problem (7) express (twice) the largest possible average squared radius that the balls can have. So, we have a ball-packing problem where, instead of trying to pack as many balls as possible of a fixed size, we aim to pack a fixed number of balls (one for each class) with radii as large as possible. Non-uniform $\rho$ adds importance weights to each fixed ball. In summary, solutions of the problem $\min_f L^*_\infty(f)$ are a maximum margin clustering.

This understanding of global minimizers of $L^*_\infty(f) = \sup_{q \in \Pi} L_\infty(f, q)$ can be further developed into a better understanding of generalization on downstream tasks. The next result shows that representations achieving small excess risk on the objective $L^*_\infty$ still separate clusters well, in the sense that a simple 1-nearest-neighbor classifier achieves low classification error.

Theorem 5. Suppose $\rho$ is uniform on $\mathcal{C}$ and $f$ is such that $L^*_\infty(f) - \inf_{\bar{f} \text{ measurable}} L^*_\infty(\bar{f}) \leq \varepsilon$ with $\varepsilon \leq 1$. Let $\{v^*_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}$ be a solution to Problem 7, and define the constant $\xi = \min_{c, c' : c \neq c'} \|v^*_c - v^*_{c'}\| > 0$. Then there exists a set of vectors $\{v_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}$ such that the 1-nearest-neighbor classifier $\hat{h}(x) = \arg\min_{c \in \mathcal{C}} \|f(x) - v_c\|$ (ties broken arbitrarily) achieves misclassification risk

$$P_{x,c}\big(\hat{h}(x) \neq c\big) \leq \frac{8\varepsilon}{\big(\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}\big)^2}.$$

Proof. See Appendix A.3.

In particular, $P(\hat{h}(x) \neq c) = O(\varepsilon)$ as $\varepsilon \to 0$, and in the limit $\varepsilon \to 0$ we recover the invariance claim of Theorem 4 as a special case. The result can be generalized to arbitrary $\rho$ by replacing $|\mathcal{C}|$ in the bound by $1/\min_c \rho(c)$. The result also implies that it is possible to build simple classifiers for tasks that involve only a subset of classes from $\mathcal{C}$, or classes that are a union of classes from $\mathcal{C}$. The constant $\xi = \min_{c, c' : c \neq c'} \|v^*_c - v^*_{c'}\| > 0$ is a purely geometrical property of spheres, and describes the minimum separation distance between a set of points that solves the Tammes ball-packing problem.

¹For a manifold $\mathcal{M} \subseteq \mathbb{R}^d$, we say $C \subseteq \mathcal{M}$ is a ball if it is connected, and there exists a Euclidean ball $B = \{x \in \mathbb{R}^d : \|x\|_2 \leq R\}$ for which $C = \mathcal{M} \cap B$.
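The classifier in Theorem 5 is straightforward to realize given labeled examples for readout: build one prototype per class and assign each point to the nearest prototype. A sketch follows; Theorem 5 defines the vectors $v_c$ abstractly, so using class means re-projected onto the sphere is our own practical stand-in:

```python
import torch

def nearest_prototype_classifier(f_train, y_train, f_test, num_classes, t=0.5):
    """1-nearest-neighbor classification over class prototypes (cf. Theorem 5).

    f_train: (n, d) embedded labeled points; y_train: (n,) integer labels.
    Prototypes are class means projected back onto the sphere of radius 1/t.
    """
    d = f_train.shape[1]
    protos = torch.zeros(num_classes, d)
    for c in range(num_classes):
        v = f_train[y_train == c].mean(dim=0)
        protos[c] = v / (t * v.norm())          # back onto S^{d-1}/t
    dists = torch.cdist(f_test, protos)         # (n_test, |C|) distances
    return dists.argmin(dim=1)                  # h-hat(x) = argmin_c ||f(x) - v_c||
```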
5 EMPIRICAL RESULTS

Next, we evaluate our hard negative sampling method empirically, and apply it as a modification to state-of-the-art contrastive methods on image, graph, and text data. For all experiments $\beta$ is treated as a hyperparameter (see the ablations in Fig. 2 for more understanding of how to pick $\beta$). Values for $M$ and $\tau^+$ must also be determined. We fix $M = 1$ for all experiments, since taking $M > 1$ would increase the number of inputs for the forward-backward pass. Lemma 11 in the appendix gives a theoretical justification for the choice of $M = 1$. Choosing the class prior $\tau^+$ can be done in two ways: estimating it from data (Christoffel et al., 2016; Jain et al., 2016), or treating it as a hyperparameter. The first option requires the possession of labeled data before contrastive training.

5.1 IMAGE REPRESENTATIONS

We begin by testing the hard sampling method on vision tasks using the STL10, CIFAR100 and CIFAR10 data. We use SimCLR (Chen et al., 2020a) as the baseline method, and all models are trained for 400 epochs. The results in Fig. 2 show consistent improvement over SimCLR ($q = p$), and over the particular case of our method with $\beta = 0$ proposed in (Chuang et al., 2020) (called debiasing), on STL10 and CIFAR100. For $N = 510$ negative examples per data point we observe absolute improvements of 3% and 7.3% over SimCLR on CIFAR100 and STL10 respectively, and absolute improvements over the best debiased baseline of 1.9% and 3.2%. On Tiny ImageNet (Tab. 1) we observe an absolute improvement of 3.6% over SimCLR, while on CIFAR10 there is a slight improvement for smaller $N$, which disappears at larger $N$. See Appendix C.1 for results using MoCo-v2 with large negative batch sizes, and Appendix D.1 for full setup details.

[Figure 2: Classification accuracy on downstream tasks. Embeddings trained using hard (e.g., β = 0.5 or 1, τ+ > 0), debiased (β = 0, τ+ > 0), and standard (β = 0, τ+ = 0) versions of SimCLR, evaluated using linear readout accuracy, plotted against the negative sample size N ∈ {30, 62, 126, 254, 510} on STL10, CIFAR100, and CIFAR10.]

5.2 GRAPH REPRESENTATIONS

Table 1: Top-1 linear readout accuracy on Tiny ImageNet. The class prior is set to τ+ = 0.01.

  SimCLR   Debiased   Hard (β = 1)
  53.4%    53.7%      57.0%

Second, we consider hard negative sampling in the context of learning graph representations. We use the state-of-the-art InfoGraph method introduced by Sun et al. (2020) as the baseline, which is suitable for downstream graph-level classification. The objective is of a slightly different form from the NCE loss. Because of this we use a generalization of the formulation presented in Section 3 (see Appendix B for details). In doing so, we illustrate that it is easy to adapt our hard sampling method to other contrastive frameworks. Fig. 3 shows the results of fine-tuning an SVM (Boser et al., 1992; Cortes & Vapnik, 1995) on the fixed, learned embedding for a range of different values of $\beta$. Hard sampling does as well as InfoGraph in all cases, and better in 6 out of 8 cases.
For ENZYMES and REDDIT, hard negative samples improve the accuracy by 3.2% and 2.4%, respectively; for DD and PTC by 1-2%; and for IMDB-B and MUTAG by at least 0.5%. Usually, multiple different choices of $\beta > 0$ were competitive with the InfoGraph baseline: 17 out of the 24 values of $\beta > 0$ tried (across all 8 datasets) achieve accuracy as high as or better than InfoGraph ($\beta = 0$).

[Figure 3: Classification accuracy on downstream tasks. We compare graph representations learned with InfoGraph (β = 0) and with hard sampling (β = 1, 2, 10) on graph classification datasets including DD, PTC, IMDB-B, REDDIT, PROTEINS, ENZYMES, and MUTAG. Accuracies are obtained by fine-tuning an SVM readout function, and are the average of 10 runs, each using 10-fold cross validation. Results in bold indicate the best performer.]

5.3 SENTENCE REPRESENTATIONS

Third, we test hard negative sampling on learning representations of sentences using the quick-thoughts (QT) vectors framework introduced by Logeswaran & Lee (2018), which uses adjacent sentences (before/after) as positive samples. Embeddings are trained using the unlabeled BookCorpus dataset (Kiros et al., 2015), and evaluated following the protocol of Logeswaran & Lee (2018) on six downstream tasks. The results are reported in Table 2. Hard sampling outperforms or equals the QT baseline in 5 out of 6 cases, the debiased baseline (Chuang et al., 2020) in 4 out of 6, and both in 3 out of 6 cases. Setting $\tau^+ > 0$ led to numerical issues in optimization for hard sampling.

Table 2: Classification accuracy on downstream tasks. Sentence representations are learned using quick-thoughts (QT) vectors on the BookCorpus dataset and evaluated on six classification tasks. Evaluation of the binary classification tasks (MR, CR, SUBJ, MPQA) uses 10-fold cross validation.

  Objective                MR    CR    SUBJ  MPQA  TREC  MSRP(Acc)  MSRP(F1)
  QT (β = 0, τ+ = 0)       76.8  81.3  86.6  93.4  89.8  73.6       81.8
  Debiased (τ+ = 0.01)     76.2  82.9  86.9  93.7  89.1  74.7       82.7
  Hard (β = 1, τ+ = 0)     77.1  82.5  87.0  92.9  89.2  73.9       82.2
  Hard (β = 2, τ+ = 0)     77.4  83.6  86.8  93.4  88.7  73.5       82.0

6 A CLOSER LOOK AT HARD SAMPLING

6.1 ARE HARDER SAMPLES NECESSARILY BETTER?

By setting $\beta$ to large values, one can focus on only the hardest samples in a training batch. But is this desirable? Fig. 4 (left, middle) shows that for vision problems, taking larger $\beta$ does not necessarily lead to better representations. In contrast, when one uses true positive pairs during training (green curve; label information is used for positive but not negative pairs), the downstream performance monotonically increases with $\beta$ until convergence (Fig. 4, middle). Interestingly, this is achieved without using label information for the negative pairs.

This observation suggests an explanation for why bigger $\beta$ hurts performance in practice. Debiasing (conditioning on the event $\{h(x) \neq h(x^-)\}$) using the true $p^+$ corrects for sampling $x^-$ with the same label as $x$. However, since in practice we approximate $p^+$ using a set of data transformations, we can only partially correct. This is harmful for large $\beta$, since this regime strongly prefers $x^-$ for which $f(x^-)$ is close to $f(x)$, many of which will have the same label as $x$ if not corrected for.

We note also that by annealing $\beta$ (gradually decreasing $\beta$ to 0 throughout training; see Appendix D.1 for details) it is possible to be more robust to the choice of initial $\beta$, with marginal impact on downstream accuracy compared to the best fixed value of $\beta$.
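The exact annealing schedule is given in Appendix D.1, which is not reproduced here; one plausible minimal version, decaying $\beta$ linearly to zero over training, is sketched below (the linear form is our assumption):

```python
def annealed_beta(beta0: float, epoch: int, total_epochs: int) -> float:
    """Linearly anneal the concentration parameter beta from beta0 down to 0."""
    return beta0 * max(0.0, 1.0 - epoch / total_epochs)
```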
6.2 DOES AVOIDING FALSE NEGATIVES IMPROVE HARD SAMPLING?

Our proposed hard negative sampling method conditions on the event $\{h(x) \neq h(x^-)\}$ in order to avoid false negatives (termed debiasing (Chuang et al., 2020)). But does this help? To test this, we train four embeddings: hard sampling with and without debiasing, and uniform sampling ($\beta = 0$) with and without debiasing. The results in Fig. 4 (right) show that hard sampling with debiasing obtains the highest linear readout accuracy on STL10; using only hard sampling or only debiasing yields (in this case) similar accuracy. All improve over the SimCLR baseline.

[Figure 4: Left: the effect of varying the concentration parameter β on linear readout accuracy for STL10. Middle: linear readout accuracy on CIFAR100 as β varies, for fully unsupervised contrastive learning, for training with true positive samples (uses label information), and for an annealing method that improves robustness to the choice of β (see Appendix D.1 for details). Right: STL10 linear readout accuracy for hard sampling with and without debiasing, and non-hard sampling (β = 0) with and without debiasing; the four combinations attain 87.44%, 84.41%, 84.26%, and 80.15%, with the best results coming from using both simultaneously.]

[Figure 5: Histograms of cosine similarity of pairs of points with the same label (top) and different labels (bottom) for embeddings trained on STL10 with four different objectives. H = hard sampling, D = debiasing. Histograms are overlaid pairwise to allow for convenient comparison.]

Fig. 5 compares the histograms of cosine similarities of positive and negative pairs for the four learned representations. The representation trained with hard negatives and debiasing assigns much lower similarity scores to pairs of negative samples than the other methods. On the other hand, the SimCLR baseline assigns higher cosine similarity scores to pairs of positive samples. However, to discriminate positive and negative pairs, a key property is the amount of overlap between the positive and negative histograms. Our hard sampling method achieves less overlap than SimCLR, by better trading off higher dissimilarity of negative pairs against lower similarity of positive pairs. Similar trade-offs are observed for the debiased objective, and for hard sampling without debiasing.

6.3 HOW DO HARD NEGATIVES AFFECT OPTIMIZATION?

Fig. 11 (in Appendix C due to space constraints) shows the performance on STL10 and CIFAR100 of SimCLR versus using hard negatives throughout training. We use weighted $k$-nearest neighbors with $k = 200$ as the classifier and evaluate each model once every five epochs.
Hard sampling with $\beta = 1$ leads to much faster training: on STL10, hard sampling takes only 60 epochs to reach the same performance as SimCLR achieves in 400 epochs. On CIFAR100, hard sampling takes only 125 epochs to reach the same performance as SimCLR achieves in 400 epochs. We speculate that the speedup is, in part, due to hard negatives providing non-negligible gradient information during training.

7 CONCLUSION

We argue for the value of hard negatives in unsupervised contrastive representation learning, and introduce a simple hard negative sampling method. Our work connects two major lines of work: contrastive learning, and negative mining in metric learning. Doing so requires overcoming an apparent roadblock: negative mining in metric learning uses pairwise similarity information as a core component, while contrastive learning is unsupervised. Our method enjoys several nice properties: desirable theoretical guarantees, a very simple implementation that requires modifying only a couple of lines of code, no changes to the data sampling pipeline, zero extra computational overhead, and principled handling of false negatives.

REFERENCES

Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In Int. Conference on Machine Learning (ICML), pp. 5628-5637, 2019.

Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems (NeurIPS), pp. 15535-15545, 2019.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92-100, 1998.

Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, 1992.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Int. Conference on Machine Learning (ICML), pp. 10709-10719, 2020a.

Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020b.

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv:2003.04297, 2020c.

Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pp. 539-546, 2005.

Marthinus Christoffel, Gang Niu, and Masashi Sugiyama. Class-prior estimation for learning from positive and unlabeled data. In Asian Conference on Machine Learning, pp. 221-236, 2016.

Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba, and Stefanie Jegelka. Debiased contrastive learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, pp. 350. Association for Computational Linguistics, 2004.
Marthinus C Du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In Advances in Neural Information Processing Systems (NeurIPS), pp. 703-711, 2014.

Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 213-220, 2008.

Weifeng Ge. Deep metric learning with hierarchical triplet loss. In Europ. Conference on Computer Vision (ECCV), pp. 269-285, 2018.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proc. Int. Conference on Artificial Intelligence and Statistics (AISTATS), pp. 297-304, 2010.

Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1735-1742, 2006.

Ben Harwood, Vijay Kumar BG, Gustavo Carneiro, Ian Reid, and Tom Drummond. Smart mining for deep metric learning. In Int. Conference on Computer Vision (ICCV), pp. 2821-2829, 2017.

Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In Int. Conference on Machine Learning (ICML), pp. 3451-3461, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729-9738, 2020.

Olivier J Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. In Int. Conference on Machine Learning (ICML), pp. 6661-6671, 2020.

Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177, 2004.

Shantanu Jain, Martha White, and Predrag Radivojac. Estimating the class prior and posterior from noisy positives and unlabeled data. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2693-2701, 2016.

SouYoung Jin, Aruni RoyChowdhury, Huaizu Jiang, Ashish Singh, Aditya Prasad, Deep Chakraborty, and Erik Learned-Miller. Unsupervised hard example mining from videos for improved object detection. In Europ. Conference on Computer Vision (ECCV), pp. 307-324, 2018.

Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv:2004.11362, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Int. Conf. on Learning Representations (ICLR), 2015.

Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3294-3302, 2015.
Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. Graph matching networks for learning the similarity of graph structured objects. In Int. Conference on Machine Learning (ICML), pp. 3835-3845, 2019.

Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. In Int. Conf. on Learning Representations (ICLR), 2018.

K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd., second edition, 2000.

Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6707-6717, 2020.

Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In Graph Representation Learning and Beyond, ICML Workshop, 2020.

Oleg R Musin and Alexey S Tarasov. The Tammes problem for n = 14. Experimental Mathematics, 24(4):460-468, 2015.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (NeurIPS), pp. 271-279, 2016.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.

Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004.

Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 115-124. Association for Computational Linguistics, 2005.

Senthil Purushwalkam and Abhinav Gupta. Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases. arXiv:2007.13916, 2020.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815-823, 2015.

K Schütte and BL Van der Waerden. Auf welcher Kugel haben 5, 6, 7, 8 oder 9 Punkte mit Mindestabstand Eins Platz? Mathematische Annalen, 123(1):96-124, 1951.

Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134-1141, 2018.

Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4004-4012, 2016.

Aravind Srinivas, Michael Laskin, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In Int. Conference on Machine Learning (ICML), pp. 10360-10371, 2020.

Yumin Suh, Bohyung Han, Wonsik Kim, and Kyoung Mu Lee. Stochastic class-based hard example mining for deep metric learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7251-7259, 2019.

Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In Int. Conf. on Learning Representations (ICLR), 2020.
Tristan Sylvain, Linda Petrini, and Devon Hjelm. Locality and compositionality in zero-shot learning. In Int. Conf. on Learning Representations (ICLR), 2020.

Pieter Merkus Lambertus Tammes. On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil des travaux botaniques néerlandais, 27(1):1-84, 1930.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Europ. Conference on Computer Vision (ECCV), pp. 770-786, 2019.

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? arXiv:2005.10243, 2020.

Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep Graph Infomax. In Int. Conf. on Learning Representations (ICLR), 2019.

Ellen M Voorhees and Donna Harman. Overview of TREC 2002. 2002.

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Int. Conference on Machine Learning (ICML), pp. 9574-9584, 2020.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165-210, 2005.

Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Int. Conference on Computer Vision (ICCV), pp. 2840-2848, 2017.

Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv:1304.5634, 2013.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In Int. Conf. on Learning Representations (ICLR), 2019.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In Int. Conf. on Learning Representations (ICLR), 2018.

A ANALYSIS OF HARD SAMPLING

A.1 HARD SAMPLING INTERPOLATES BETWEEN MARGINAL AND WORST-CASE NEGATIVES

We begin by proving Proposition 3. Recall that the proposition stated the following.

Proposition 6. Let $L^*(f) = \sup_{q \in \Pi} L(f, q)$. Then for any $t > 0$ and measurable $f : \mathcal{X} \to \mathbb{S}^{d-1}/t$ we observe the convergence $L(f, q^-_\beta) \to L^*(f)$ as $\beta \to \infty$.

Proof. Consider the following essential supremum,

$$M(x) = \operatorname*{ess\,sup}_{x^- \in \mathcal{X} : x^- \not\sim x} f(x)^\top f(x^-) = \sup\{m > 0 : m \leq f(x)^\top f(x^-) \text{ with non-zero probability for } x^- \sim p^-_x\}.$$

The second equality holds since $\mathrm{supp}(p) = \mathcal{X}$. We may rewrite

$$L^*(f) = \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + Q e^{M(x)}} \right], \qquad L(f, q^-_\beta) = \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + Q\, \mathbb{E}_{x^- \sim q^-_\beta}[e^{f(x)^\top f(x^-)}]} \right].$$

The difference between these two terms can be bounded as follows:

$$\big| L^*(f) - L(f, q^-_\beta) \big| \leq \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left| \log\Big( e^{f(x)^\top f(x^+)} + Q\, \mathbb{E}_{x^- \sim q^-_\beta}[e^{f(x)^\top f(x^-)}] \Big) - \log\Big( e^{f(x)^\top f(x^+)} + Q e^{M(x)} \Big) \right|$$
$$\leq \frac{Q e^{1/t}}{Q + 1}\, \mathbb{E}_{x \sim p}\, \mathbb{E}_{x^- \sim q^-_\beta} \left[ e^{M(x)} - e^{f(x)^\top f(x^-)} \right],$$

where for the second inequality we have used the fact that $f$ lies on the hypersphere of radius $1/t$ to restrict the domain of the logarithm to values greater than $(Q + 1)e^{-1/t}$. Because of this, the logarithm is Lipschitz with parameter $e^{1/t}/(Q + 1)$.
Using again the fact that $f$ lies on the hypersphere, we know that $f(x)^\top f(x^-) \leq 1/t^2$, and hence have the following inequality:

$$\mathbb{E}_{x \sim p}\, \mathbb{E}_{x^- \sim q^-_\beta} \left[ e^{M(x)} - e^{f(x)^\top f(x^-)} \right] \leq e^{1/t^2}\, \mathbb{E}_{x \sim p}\, \mathbb{E}_{x^- \sim q^-_\beta} \left[ M(x) - f(x)^\top f(x^-) \right].$$

Let us consider the inner expectation $E_\beta(x) = \mathbb{E}_{x^- \sim q^-_\beta}\big[ M(x) - f(x)^\top f(x^-) \big]$. Note that since $f$ is bounded, $E_\beta(x)$ is uniformly bounded in $x$. Therefore, in order to show the convergence $L(f, q^-_\beta) \to L^*(f)$ as $\beta \to \infty$, it suffices by the dominated convergence theorem to show that $E_\beta(x) \to 0$ pointwise as $\beta \to \infty$ for arbitrary fixed $x \in \mathcal{X}$.

From now on we denote $M = M(x)$ for brevity, and consider a fixed $x \in \mathcal{X}$. From the definition of $q^-_\beta$ it is clear that $q^-_\beta \ll p^-$. That is, since $q^-_\beta = c \cdot p^-$ for some (non-constant) $c$, it is absolutely continuous with respect to $p^-$. So $M(x) \geq f(x)^\top f(x^-)$ almost surely for $x^- \sim q^-_\beta$, and we may therefore drop the absolute value signs from our expectation. Define the event $G_\varepsilon = \{x^- : f(x)^\top f(x^-) \geq M - \varepsilon\}$, where $G$ refers to a "good" event, and define its complement $B_\varepsilon = G_\varepsilon^c$, where $B$ is for "bad". For a fixed $x \in \mathcal{X}$ and $\varepsilon > 0$, consider

$$E_\beta(x) = \mathbb{E}_{x^- \sim q^-_\beta}\big[ M(x) - f(x)^\top f(x^-) \big]$$
$$= P_{x^- \sim q^-_\beta}(G_\varepsilon)\, \mathbb{E}_{x^- \sim q^-_\beta}\big[ M(x) - f(x)^\top f(x^-) \mid G_\varepsilon \big] + P_{x^- \sim q^-_\beta}(B_\varepsilon)\, \mathbb{E}_{x^- \sim q^-_\beta}\big[ M(x) - f(x)^\top f(x^-) \mid B_\varepsilon \big]$$
$$\leq P_{x^- \sim q^-_\beta}(G_\varepsilon)\, \varepsilon + 2 P_{x^- \sim q^-_\beta}(B_\varepsilon) \leq \varepsilon + 2 P_{x^- \sim q^-_\beta}(B_\varepsilon).$$

We need to control $P_{x^- \sim q^-_\beta}(B_\varepsilon)$. Expanding,

$$P_{x^- \sim q^-_\beta}(B_\varepsilon) = \int_{\mathcal{X}} \mathbb{1}\big\{ f(x)^\top f(x^-) < M(x) - \varepsilon \big\} \frac{e^{\beta f(x)^\top f(x^-)} p^-(x^-)}{Z^-_\beta}\, dx^-,$$

where $Z^-_\beta = \int_{\mathcal{X}} e^{\beta f(x)^\top f(x^-)} p^-(x^-)\, dx^-$ is the partition function of $q^-_\beta$. We may bound this expression by

$$\int_{\mathcal{X}} \mathbb{1}\big\{ f(x)^\top f(x^-) < M - \varepsilon \big\} \frac{e^{\beta(M - \varepsilon)} p^-(x^-)}{Z^-_\beta}\, dx^- \leq \frac{e^{\beta(M - \varepsilon)}\, P_{x^- \sim p^-}(B_\varepsilon)}{Z^-_\beta} \leq \frac{e^{\beta(M - \varepsilon)}}{e^{\beta(M - \varepsilon/2)}\, P_{x^- \sim p^-}\big( f(x)^\top f(x^-) \geq M - \varepsilon/2 \big)},$$

where the final step lower bounds the partition function by $Z^-_\beta \geq e^{\beta(M - \varepsilon/2)}\, P_{x^- \sim p^-}\big( f(x)^\top f(x^-) \geq M - \varepsilon/2 \big)$. By the definition of $M = M(x)$, the probability $\rho_\varepsilon = P_{x^- \sim p^-}\big( f(x)^\top f(x^-) \geq M - \varepsilon/2 \big) > 0$, and we may therefore bound

$$P_{x^- \sim q^-_\beta}(B_\varepsilon) \leq \frac{e^{\beta(M - \varepsilon)}}{e^{\beta(M - \varepsilon/2)} \rho_\varepsilon} = e^{-\beta\varepsilon/2}/\rho_\varepsilon \to 0 \quad \text{as } \beta \to \infty.$$

We may therefore take $\beta$ sufficiently big so as to make $P_{x^- \sim q^-_\beta}(B_\varepsilon) \leq \varepsilon$, and therefore $E_\beta(x) \leq 3\varepsilon$. In other words, $E_\beta(x) \to 0$ as $\beta \to \infty$.

A.2 OPTIMAL EMBEDDINGS ON THE HYPERSPHERE FOR WORST-CASE NEGATIVE SAMPLES

In order to study properties of global optima of the contrastive objective using the adversarial worst-case hard sampling distribution, recall that we have the following limiting objective:

$$L_\infty(f, q) = \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{\mathbb{E}_{x^- \sim q}[e^{f(x)^\top f(x^-)}]} \right].$$

We may separate the logarithm of a quotient into the sum of two terms plus a constant:

$$L_\infty(f, q) = L_{\mathrm{align}}(f) + L_{\mathrm{unif}}(f, q) - 1/t^2,$$

where $L_{\mathrm{align}}(f) = \mathbb{E}_{x, x^+}\|f(x) - f(x^+)\|^2/2$ and $L_{\mathrm{unif}}(f, q) = \mathbb{E}_{x \sim p} \log \mathbb{E}_{x^- \sim q}\, e^{f(x)^\top f(x^-)}$. Here we have used the fact that $f$ lies on the boundary of the hypersphere of radius $1/t$, which gives us the following equivalence between inner products and squared Euclidean norms:

$$2/t^2 - 2 f(x)^\top f(x^+) = \|f(x)\|^2 + \|f(x^+)\|^2 - 2 f(x)^\top f(x^+) = \|f(x) - f(x^+)\|^2. \tag{9}$$

Taking the supremum to obtain $L^*_\infty(f) = \sup_{q \in \Pi} L_\infty(f, q)$, we find that the second term simplifies to

$$L^*_{\mathrm{unif}}(f) = \sup_{q \in \Pi} L_{\mathrm{unif}}(f, q) = \mathbb{E}_{x \sim p} \log \sup_{x^- \not\sim x} e^{f(x)^\top f(x^-)} = \mathbb{E}_{x \sim p} \sup_{x^- \not\sim x} f(x)^\top f(x^-).$$

Using Eqn. (9), this can be re-expressed as

$$\mathbb{E}_{x \sim p} \sup_{x^- \not\sim x} f(x)^\top f(x^-) = -\mathbb{E}_{x \sim p} \inf_{x^- \not\sim x} \|f(x) - f(x^-)\|^2/2 + 1/t^2. \tag{10}$$

The forthcoming theorem exactly characterizes the global optima of $\min_f L^*_\infty(f)$.

Theorem 7. Suppose the downstream task is classification (i.e., $\mathcal{C}$ is finite), and let $L^*_\infty(f) = \sup_{q \in \Pi} L_\infty(f, q)$. The infimum $\inf_{f: \text{ measurable}} L^*_\infty(f)$ is attained, and any $f^*$ achieving the global minimum is such that $f^*(x) = f^*(x^+)$ almost surely.
Furthermore, letting $v_c = f^*(x)$ for any $x$ such that $h(x) = c$ (so $v_c$ is well defined up to a set of $x$ of measure zero), $f^*$ is characterized as being any solution to the following ball-packing problem:

$$\max_{\{v_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}} \; \sum_{c \in \mathcal{C}} \rho(c) \min_{c' \neq c} \|v_c - v_{c'}\|^2. \tag{11}$$

Proof. Any minimizer of $L_{\mathrm{align}}(f)$ has the property that $f(x) = f(x^+)$ almost surely. So, in order to prove the first claim, it suffices to show that there exist functions $f \in \arg\inf_f L^*_{\mathrm{unif}}(f)$ for which $f(x) = f(x^+)$ almost surely. This is because, at that point, we have shown that $\arg\min_f L_{\mathrm{align}}(f)$ and $\arg\min_f L^*_{\mathrm{unif}}(f)$ intersect, and therefore any solution of $L^*_\infty(f) = L_{\mathrm{align}}(f) + L^*_{\mathrm{unif}}(f)$ must lie in this intersection.

To this end, suppose that $f \in \arg\min_f L^*_{\mathrm{unif}}(f)$ but that $f(x) \neq f(x^+)$ with non-zero probability. We shall show that we can construct a new embedding $\hat{f}$ such that $\hat{f}(x) = \hat{f}(x^+)$ almost surely, and $L^*_{\mathrm{unif}}(\hat{f}) \leq L^*_{\mathrm{unif}}(f)$. Due to Eqn. (10), this last condition is equivalent to showing

$$\mathbb{E}_{x \sim p} \inf_{x^- \not\sim x} \|\hat{f}(x) - \hat{f}(x^-)\|^2 \geq \mathbb{E}_{x \sim p} \inf_{x^- \not\sim x} \|f(x) - f(x^-)\|^2. \tag{12}$$

Fix a $c \in \mathcal{C}$, and let $x_c \in \arg\max_{x : h(x) = c} \inf_{x^- \not\sim x} \|f(x) - f(x^-)\|^2$. The maximum is guaranteed to be attained, as we explain now. Indeed, we know the maximum is attained at some point in the closure $\overline{\{x : h(x) = c\}} \supseteq \{x : h(x) = c\}$. Since $\mathcal{X}$ is compact and connected, any point $\bar{x} \in \overline{\{x : h(x) = c\}} \setminus \{x : h(x) = c\}$ is such that $\inf_{x^- \not\sim \bar{x}} \|f(\bar{x}) - f(x^-)\|^2 = 0$, since $\bar{x}$ must belong to $\overline{\{x : h(x) = c'\}}$ for some other $c'$. Such an $\bar{x}$ cannot be a solution unless all points in $\{x : h(x) = c\}$ also achieve 0, in which case we can simply take $x_c$ to be a point in the interior of $\{x : h(x) = c\}$.

Now, define $\hat{f}(x) = f(x_c)$ for any $x$ such that $h(x) = c$, and $\hat{f}(x) = f(x)$ otherwise. Let us first aim to show that Eqn. (12) holds for this $\hat{f}$. Expanding the left-hand side of Eqn. (12):

$$\mathbb{E}_{x \sim p} \inf_{x^- \not\sim x} \|\hat{f}(x) - \hat{f}(x^-)\|^2 = \mathbb{E}_{\hat{c} \sim \rho}\, \mathbb{E}_{x \sim p(\cdot \mid \hat{c})} \inf_{x^- \not\sim x} \|\hat{f}(x) - \hat{f}(x^-)\|^2$$
$$= \rho(c)\, \mathbb{E}_{x \sim p(\cdot \mid c)} \inf_{x^- \not\sim x} \|\hat{f}(x) - \hat{f}(x^-)\|^2 + (1 - \rho(c))\, \mathbb{E}_{\hat{c} \sim \rho(\cdot \mid \hat{c} \neq c)}\, \mathbb{E}_{x \sim p(\cdot \mid \hat{c})} \inf_{x^- \not\sim x} \|\hat{f}(x) - \hat{f}(x^-)\|^2$$
$$= \rho(c)\, \inf_{x^- \not\sim x_c} \|f(x_c) - f(x^-)\|^2 + (1 - \rho(c))\, \mathbb{E}_{\hat{c} \sim \rho(\cdot \mid \hat{c} \neq c)}\, \mathbb{E}_{x \sim p(\cdot \mid \hat{c})} \inf_{h(x^-) \neq \hat{c}} \|\hat{f}(x) - \hat{f}(x^-)\|^2. \tag{13}$$

By construction, the first term can be lower bounded via $\inf_{x^- \not\sim x_c} \|f(x_c) - f(x^-)\|^2 \geq \mathbb{E}_{x \sim p(\cdot \mid c)} \inf_{h(x^-) \neq c} \|f(x) - f(x^-)\|^2$, since $x_c$ maximizes the infimum over all $x$ with $h(x) = c$. To lower bound the second term, consider any fixed $\hat{c} \neq c$ and $x \sim p(\cdot \mid \hat{c})$ (so $h(x) = \hat{c}$). Define the following two subsets of the embedding space:

$$A = \{f(x^-) : h(x^-) \neq \hat{c},\ x^- \in \mathcal{X}\}, \qquad \hat{A} = \{\hat{f}(x^-) : h(x^-) \neq \hat{c},\ x^- \in \mathcal{X}\}.$$

Since by construction the range of $\hat{f}$ is a subset of the range of $f$, we know that $\hat{A} \subseteq A$. Combining this with the fact that $\hat{f}(x) = f(x)$ whenever $h(x) = \hat{c} \neq c$, we see that

$$\inf_{h(x^-) \neq \hat{c}} \|\hat{f}(x) - \hat{f}(x^-)\|^2 = \inf_{u \in \hat{A}} \|f(x) - u\|^2 \geq \inf_{u \in A} \|f(x) - u\|^2 = \inf_{h(x^-) \neq \hat{c}} \|f(x) - f(x^-)\|^2.$$

Using these two lower bounds, we may conclude that Eqn. (13) is lower bounded by

$$\rho(c)\, \mathbb{E}_{x \sim p(\cdot \mid c)} \inf_{h(x^-) \neq c} \|f(x) - f(x^-)\|^2 + (1 - \rho(c))\, \mathbb{E}_{\hat{c} \sim \rho(\cdot \mid \hat{c} \neq c)}\, \mathbb{E}_{x \sim p(\cdot \mid \hat{c})} \inf_{h(x^-) \neq \hat{c}} \|f(x) - f(x^-)\|^2,$$

which equals $\mathbb{E}_{x \sim p} \inf_{x^- \not\sim x} \|f(x) - f(x^-)\|^2$. We have therefore proved Eqn. (12).

To summarize the current progress: given an embedding $f$, we have constructed a new embedding $\hat{f}$ that attains a lower $L_{\mathrm{unif}}$ loss and which is constant on $\{x : h(x) = c\}$. Enumerating $\mathcal{C} = \{c_1, c_2, \ldots, c_{|\mathcal{C}|}\}$, we may repeatedly apply the same argument to construct a sequence of embeddings $f_1, f_2, \ldots, f_{|\mathcal{C}|}$ such that $f_i$ is constant on each of the sets $\{x : h(x) = c_j\}$ for $j \leq i$.
The final embedding in the sequence, $f' = f_{|\mathcal{C}|}$, is such that $L^*_{\mathrm{unif}}(f') \leq L^*_{\mathrm{unif}}(f)$, and therefore $f'$ is a minimizer. This embedding is constant on each of $\{x : h(x) = c_j\}$ for $j = 1, 2, \ldots, |\mathcal{C}|$. In other words, $f'(x) = f'(x^+)$ almost surely. We have proved the first claim.

Obtaining the second claim is a matter of manipulating $L^*_\infty(f^*)$. Indeed, we know that $L^*_\infty(f^*) = L^*_{\mathrm{unif}}(f^*) - 1/t^2$, and defining $v_c = f^*(x) = f(x_c)$ for each $c \in \mathcal{C}$, this expression is minimized if and only if $f^*$ attains

$$\max_f \mathbb{E}_{x \sim p} \inf_{x^- \not\sim x} \|f(x) - f(x^-)\|^2 = \max_f \mathbb{E}_{c \sim \rho}\, \mathbb{E}_{x \sim p(\cdot \mid c)} \inf_{h(x^-) \neq c} \|f(x) - f(x^-)\|^2 = \max_{\{v_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}} \sum_{c \in \mathcal{C}} \rho(c) \min_{c' \neq c} \|v_c - v_{c'}\|^2,$$

where the final equality inserts the optimal $f^*$ and reparameterizes the maximum to be over the set of vectors $\{v_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}$.

A.3 DOWNSTREAM GENERALIZATION

Theorem 8. Suppose $\rho$ is uniform on $\mathcal{C}$ and $f$ is such that $L^*_\infty(f) - \inf_{\bar{f} \text{ measurable}} L^*_\infty(\bar{f}) \leq \varepsilon$ with $\varepsilon \leq 1$. Let $\{v^*_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}$ be a solution to Problem 7, and define $\xi = \min_{c, c' : c \neq c'} \|v^*_c - v^*_{c'}\| > 0$. Then there exists a set of vectors $\{v_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}$ such that the 1-nearest-neighbor classifier $\hat{h}(x) = \hat{c}$, where $\hat{c} = \arg\min_{c \in \mathcal{C}} \|f(x) - v_c\|$ (ties broken arbitrarily), achieves misclassification risk

$$P_{x,c}\big(\hat{h}(x) \neq c\big) \leq \frac{8\varepsilon}{\big(\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}\big)^2}.$$

Proof. To begin, using the definition of $\hat{h}$ we know that for any $0 < \delta < \xi$,

$$P_{x,c}\big(\hat{h}(x) = c\big) \geq P_{x,c}\Big( \|f(x) - v_c\| \leq \delta \text{ and } \delta < \min_{c' : c' \neq c} \|f(x) - v_{c'}\| \Big) \geq 1 - P_{x,c}\big( \|f(x) - v_c\| > \delta \big) - P_{x,c}\Big( \min_{c' : c' \neq c} \|f(x) - v_{c'}\| \leq \delta \Big).$$

So to prove the result, our goal is now to bound these two probabilities. To do so, we use the bound on the excess risk. Indeed, combining the fact $L^*_\infty(f) - \inf_{\bar{f} \text{ measurable}} L^*_\infty(\bar{f}) \leq \varepsilon$ with the notational rearrangements before Theorem 7, we observe that

$$2\varepsilon \geq \mathbb{E}_{x, x^+} \|f(x) - f(x^+)\|^2 = \mathbb{E}_{c \sim \rho}\, \mathbb{E}_{x^+ \sim p(\cdot \mid c)}\, \mathbb{E}_{x \sim p(\cdot \mid c)} \|f(x) - f(x^+)\|^2.$$

For fixed $c$, let $x_c \in \arg\min_{x^+ \in \overline{\{x^+ : h(x^+) = c\}}} \mathbb{E}_{x \sim p(\cdot \mid c)} \|f(x) - f(x^+)\|^2$, where we extend the minimum to be over the closure, a compact set, to guarantee it is attained. Then we have

$$2\varepsilon \geq \mathbb{E}_{c \sim \rho}\, \mathbb{E}_{x^+ \sim p(\cdot \mid c)}\, \mathbb{E}_{x \sim p(\cdot \mid c)} \|f(x) - f(x^+)\|^2 \geq \mathbb{E}_{c \sim \rho}\, \mathbb{E}_{x \sim p(\cdot \mid c)} \|f(x) - v_c\|^2,$$

where we have now defined $v_c = f(x_c)$ for each $c \in \mathcal{C}$. Note in particular that $v_c$ lies on the surface of the hypersphere $\mathbb{S}^{d-1}/t$. This enables us to obtain the following bound using Markov's inequality:

$$P_{x,c}\big( \|f(x) - v_c\| > \delta \big) = P_{x,c}\big( \|f(x) - v_c\|^2 > \delta^2 \big) \leq \frac{\mathbb{E}_{x,c} \|f(x) - v_c\|^2}{\delta^2} \leq \frac{2\varepsilon}{\delta^2},$$

so it remains still to bound $P_{x,c}\big( \min_{c' : c' \neq c} \|f(x) - v_{c'}\| \leq \delta \big)$. Defining $\tilde{\xi} = \min_{c, c' : c \neq c'} \|v_c - v_{c'}\|$, we have the following fact (proven later).

Fact (see Lemma 9): $\tilde{\xi} \geq \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\sqrt{\varepsilon}}$.

Using this fact, together with the triangle inequality $\min_{c' \neq c} \|f(x) - v_{c'}\| \geq \tilde{\xi} - \|f(x) - v_c\|$, we are able to control the tail probability as follows:

$$P_{x,c}\Big( \min_{c' : c' \neq c} \|f(x) - v_{c'}\| \leq \delta \Big) \leq P_{x,c}\big( \|f(x) - v_c\| > \tilde{\xi} - \delta \big) \leq P_{x,c}\Big( \|f(x) - v_c\| > \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}} - \delta \Big) \leq \frac{2\varepsilon}{\big( \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}} - \delta \big)^2},$$

where this inequality holds for any $0 \leq \delta \leq \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}}$. Gathering together our tail probability bounds, we find that for any such $\delta$,

$$P_{x,c}\big(\hat{h}(x) \neq c\big) \leq \frac{2\varepsilon}{\delta^2} + \frac{2\varepsilon}{\big( \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}} - \delta \big)^2},$$

and hence

$$P_{x,c}\big(\hat{h}(x) \neq c\big) \leq \min_{0 \leq \delta \leq \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}}} \left\{ \frac{2\varepsilon}{\delta^2} + \frac{2\varepsilon}{\big( \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}} - \delta \big)^2} \right\}.$$

Elementary calculus shows that the minimum is attained at the midpoint $\delta = \frac{1}{2}\sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}}$. Plugging this in yields the final bound,

$$P\big(\hat{h}(x) \neq c\big) \leq \frac{8\varepsilon}{\big(\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}\big)^2}.$$

Lemma 9. Consider the same setting as introduced in Theorem 5. In particular, define

$$\tilde{\xi} = \min_{c, c' : c \neq c'} \|v_c - v_{c'}\|, \qquad \xi = \min_{c, c' : c \neq c'} \|v^*_c - v^*_{c'}\|,$$
where $\{v^*_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}$ is a solution to Problem 7, and $\{v_c \in \mathbb{S}^{d-1}/t\}_{c \in \mathcal{C}}$ is defined via $v_c = f(x_c)$ with $x_c \in \arg\min_{x^+: h(x^+) = c} \mathbb{E}_{x \sim p(\cdot|c)} \|f(x) - f(x^+)\|^2$ for each $c \in \mathcal{C}$. Then we have,

$$\bar{\xi} \geq \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}}.$$

Proof. Define,

$$X = \min_{c': c' \neq c} \|v_c - v_{c'}\|^2, \qquad X^* = \min_{c': c' \neq c} \|v^*_c - v^*_{c'}\|^2.$$

$X$ and $X^*$ are random due to the randomness of $c \sim \rho$. We can split up the following expectation by conditioning on the event $\{X \geq X^*\}$ and its complement,

$$\mathbb{E}|X - X^*| = \mathbb{P}(X \geq X^*)\, \mathbb{E}\big[X - X^* \mid X \geq X^*\big] + \mathbb{P}(X < X^*)\, \mathbb{E}\big[X^* - X \mid X < X^*\big]. \tag{14}$$

Using $\mathcal{L}^\infty(f) - \inf_{\bar{f} \text{ measurable}} \mathcal{L}^\infty(\bar{f}) \leq \varepsilon$ and the notational re-writing of the objective $\mathcal{L}^\infty$ introduced before Theorem 7, we observe the following fact, whose proof we give in a separate lemma after the conclusion of this proof.

Fact (see Lemma 10): $\mathbb{E}X^* - 2(1 + 1/t)\sqrt{\varepsilon} \leq \mathbb{E}X \leq \mathbb{E}X^*$.

This fact implies in particular that $\mathbb{E}[X - X^*] \leq 0$ and $\mathbb{E}[X^* - X] \leq 2(1 + 1/t)\sqrt{\varepsilon}$. Inserting both inequalities into Eqn. (14), we find that $\mathbb{E}|X - X^*| \leq 2(1 + 1/t)\sqrt{\varepsilon}$. In other words, since $\rho$ is uniform,

$$\frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \Big| \min_{c': c' \neq c} \|v_c - v_{c'}\|^2 - \min_{c': c' \neq c} \|v^*_c - v^*_{c'}\|^2 \Big| \leq 2(1 + 1/t)\sqrt{\varepsilon}.$$

From which we can say that for any $c \in \mathcal{C}$,

$$\min_{c': c' \neq c} \|v_c - v_{c'}\|^2 \geq \min_{c': c' \neq c} \|v^*_c - v^*_{c'}\|^2 - 2|\mathcal{C}|(1 + 1/t)\sqrt{\varepsilon}.$$

So $\min_{c': c' \neq c} \|v_c - v_{c'}\| \geq \sqrt{\min_{c': c' \neq c} \|v^*_c - v^*_{c'}\|^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}} \geq \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}}$. Since this holds for any $c \in \mathcal{C}$, we conclude that $\bar{\xi} \geq \sqrt{\xi^2 - 2|\mathcal{C}|(1 + 1/t)\varepsilon^{1/2}}$.

Lemma 10. Consider the same setting as introduced in Theorem 5. Define also,

$$X = \min_{c': c' \neq c} \|v_c - v_{c'}\|^2, \qquad X^* = \min_{c': c' \neq c} \|v^*_c - v^*_{c'}\|^2,$$

where $v_c = f(x_c)$ with $x_c \in \arg\min_{x^+: h(x^+) = c} \mathbb{E}_{x \sim p(\cdot|c)} \|f(x) - f(x^+)\|^2$ for each $c \in \mathcal{C}$. We have,

$$\mathbb{E}X^* - 2(1 + 1/t)\sqrt{\varepsilon} \leq \mathbb{E}X \leq \mathbb{E}X^*.$$

Proof. By Theorem 7 we know there is an $f^*$ attaining the minimum $\inf_{\bar{f} \text{ measurable}} \mathcal{L}^\infty(\bar{f})$, that this $f^*$ attains $\mathcal{L}_{\text{align}}(f^*) = 0$, and that it also minimizes the uniformity term, taking the value $\mathcal{L}_{\text{unif}}(f^*) = \mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} {v^*_c}^\top v^*_{c'}$. Because of this we find,

$$\mathcal{L}_{\text{unif}}(f) = \mathcal{L}^\infty(f) - \mathcal{L}^\infty(f^*) + \mathcal{L}_{\text{align}}(f^*) - \mathcal{L}_{\text{align}}(f) + \mathcal{L}_{\text{unif}}(f^*) \leq \mathcal{L}^\infty(f) - \mathcal{L}^\infty(f^*) + \mathcal{L}_{\text{unif}}(f^*) \leq \varepsilon + \mathcal{L}_{\text{unif}}(f^*) = \varepsilon + \mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} {v^*_c}^\top v^*_{c'}.$$

Since we would like to bound $\mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} v_c^\top v_{c'}$ in terms of $\mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} {v^*_c}^\top v^*_{c'}$, this observation means that it suffices to bound $\mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} v_c^\top v_{c'}$ in terms of $\mathcal{L}_{\text{unif}}(f)$. To this end, note that for a fixed $c$, and $x$ such that $h(x) = c$, we have,

$$\begin{aligned}
\sup_{x^-: h(x^-) \neq c} f(x)^\top f(x^-)
&= \sup_{x^-: h(x^-) \neq c} \Big( v_c^\top f(x^-) + (f(x) - v_c)^\top f(x^-) \Big) \\
&\geq \sup_{x^-: h(x^-) \neq c} v_c^\top f(x^-) - \|f(x) - v_c\| / t \\
&\geq \max_{x^- \in \{x_{c'}\}_{c' \neq c}} v_c^\top f(x^-) - \|f(x) - v_c\| / t \\
&= \max_{c': c' \neq c} v_c^\top v_{c'} - \|f(x) - v_c\| / t,
\end{aligned}$$

where the second inequality follows since $\{x_{c'}\}_{c' \neq c}$ is a subset of the closure of $\{x^- : h(x^-) \neq c\}$. Taking expectations over $c, x$,

$$\mathcal{L}_{\text{unif}}(f) = \mathbb{E}_{x,c} \sup_{x^-: h(x^-) \neq c} f(x)^\top f(x^-) \geq \mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} v_c^\top v_{c'} - \mathbb{E}_{x,c} \|f(x) - v_c\| / t \geq \mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} v_c^\top v_{c'} - \sqrt{\mathbb{E}_{x,c} \|f(x) - v_c\|^2}\,/\, t \geq \mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} v_c^\top v_{c'} - \sqrt{\varepsilon} / t.$$

So since $\varepsilon \leq \sqrt{\varepsilon}$ (recall $\varepsilon \leq 1$), we have found that,

$$\mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} v_c^\top v_{c'} \leq \sqrt{\varepsilon} / t + \varepsilon + \mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} {v^*_c}^\top v^*_{c'} \leq (1 + 1/t)\sqrt{\varepsilon} + \mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} {v^*_c}^\top v^*_{c'}.$$

Of course we also have,

$$\mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} {v^*_c}^\top v^*_{c'} = \mathcal{L}_{\text{unif}}(f^*) \leq \mathbb{E}_{c \sim \rho} \max_{c': c' \neq c} v_c^\top v_{c'},$$

since the embedding $\bar{f}(x) = v_c$ whenever $h(x) = c$ is also a feasible solution. Combining these two inequalities with the simple identity $x^\top y = 1/t^2 - \|x - y\|^2 / 2$ for all vectors $x, y$ of length $1/t$, we find,

$$1/t^2 - \mathbb{E}_{c \sim \rho} \min_{c': c' \neq c} \|v^*_c - v^*_{c'}\|^2 / 2 \;\leq\; 1/t^2 - \mathbb{E}_{c \sim \rho} \min_{c': c' \neq c} \|v_c - v_{c'}\|^2 / 2 \;\leq\; 1/t^2 - \mathbb{E}_{c \sim \rho} \min_{c': c' \neq c} \|v^*_c - v^*_{c'}\|^2 / 2 + (1 + 1/t)\sqrt{\varepsilon}.$$

Subtracting $1/t^2$ and multiplying by $-2$ yields the result.

B GRAPH REPRESENTATION LEARNING

We describe in detail the hard sampling method for graphs whose results are reported in Section 5.2. Before getting to that point, in the interests of completeness we cover some required background details on the InfoGraph method of Sun et al. (2020). For further information see the original paper (Sun et al., 2020).
B.1 BACKGROUND ON GRAPH REPRESENTATIONS

We observe a set of graphs $\mathbf{G} = \{G_j \in \mathcal{G}\}_{j=1}^n$ sampled according to a distribution $p$ over an ambient graph space $\mathcal{G}$. Each node $u$ in a graph $G$ is assumed to have features $h_u^{(0)}$ living in some Euclidean space. We consider a $K$-layer graph neural network, whose $k$-th layer iteratively computes updated embeddings for each node $v \in G$ in the following way,

$$h_v^{(k)} = \text{COMBINE}^{(k)}\Big( h_v^{(k-1)},\; \text{AGGREGATE}^{(k)}\big( \big\{ \big( h_v^{(k-1)}, h_u^{(k-1)}, e_{uv} \big) : u \in \mathcal{N}(v) \big\} \big) \Big),$$

where $\text{COMBINE}^{(k)}$ and $\text{AGGREGATE}^{(k)}$ are parameterized learnable functions and $\mathcal{N}(v)$ denotes the set of neighboring nodes of $v$. The $K$ embeddings for a node $u$ are collected together to obtain a single final summary embedding for $u$. As recommended by Xu et al. (2019) we use concatenation,

$$h_u = h_u(G) = \text{CONCAT}\big( \{ h_u^{(k)} \}_{k=1}^K \big),$$

to obtain an embedding in $\mathbb{R}^d$. Finally, the node representations are combined into a length-$d$ graph-level embedding using a readout function,

$$H(G) = \text{READOUT}\big( \{ h_u \}_{u \in G} \big),$$

which is typically taken to be a simple permutation invariant function such as the sum or mean. The InfoGraph method aims to maximize the mutual information between the graph-level embedding $H(G)$ and the patch-level embeddings $h_u(G)$ using the following objective,

$$\max_h\; \mathbb{E}_{G \sim p}\, \frac{1}{|G|} \sum_{u \in G} I\big( h_u(G);\, H(G) \big).$$

In practice the population distribution $p$ is replaced by its empirical counterpart, and the mutual information $I$ is replaced by a variational approximation $I_T$. In line with Sun et al. (2020) we use the Jensen-Shannon mutual information estimator as formulated by Nowozin et al. (2016). It is defined using a neural network discriminator $T : \mathbb{R}^{2d} \to \mathbb{R}$ as,

$$I_T\big( h_u(G); H(G) \big) = \mathbb{E}_{G \sim p}\Big[ -\mathrm{sp}\big( -T\big( h_u(G), H(G) \big) \big) \Big] - \mathbb{E}_{(G, G') \sim p \times p}\Big[ \mathrm{sp}\big( T\big( h_u(G), H(G') \big) \big) \Big],$$

where $\mathrm{sp}(z) = \log(1 + e^z)$ denotes the softplus function. The final objective is the joint maximization over the parameters $\theta, \psi$ of $h$ and $T$,

$$\max_{\theta, \psi}\; \mathbb{E}_{G \sim p}\, \frac{1}{|G|} \sum_{u \in G} I_{T_\psi}\big( h_u^\theta(G); H_\theta(G) \big).$$

B.2 HARD NEGATIVE SAMPLING FOR LEARNING GRAPH REPRESENTATIONS

In order to derive a simple modification of the NCE hard sampling technique that is appropriate for use with InfoGraph, we first provide a mildly generalized view of hard sampling. Recall that the NCE contrastive objective can be decomposed into two constituent pieces,

$$\mathcal{L}(f, q) = \mathcal{L}_{\text{align}}(f) + \mathcal{L}_{\text{unif}}(f, q),$$

where $q$ is in fact a family of distributions $q(x^-; x)$ over $x^-$, indexed by the possible values of the anchor $x$. $\mathcal{L}_{\text{align}}$ performs the role of aligning positive pairs (embedding them near one another), while $\mathcal{L}_{\text{unif}}$ repels negative pairs. The hard sampling framework aims to solve,

$$\inf_f \sup_q \mathcal{L}(f, q).$$

In the case of the NCE loss we take,

$$\mathcal{L}_{\text{align}}(f) = -\mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} f(x)^\top f(x^+), \qquad \mathcal{L}_{\text{unif}}(f, q) = \mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \log\Big\{ e^{f(x)^\top f(x^+)} + Q\, \mathbb{E}_{x^- \sim q}\big[ e^{f(x)^\top f(x^-)} \big] \Big\}.$$

With this view, we can easily adapt to the InfoGraph framework, taking

$$\mathcal{L}_{\text{align}}(h, T) = -\mathbb{E}_{G \sim p}\, \frac{1}{|G|} \sum_{u \in G} \mathrm{sp}\big( -T\big( h_u(G), H(G) \big) \big), \qquad \mathcal{L}_{\text{unif}}(h, T, q) = \mathbb{E}_{G \sim p}\, \frac{1}{|G|} \sum_{u \in G} \mathbb{E}_{G' \sim q}\, \mathrm{sp}\big( T\big( h_u(G), H(G') \big) \big).$$

Denote by $\hat{p}$ the distribution over nodes defined by first sampling $G \sim p$, then sampling $u$ uniformly over the nodes of $G$. Then these two terms can be simplified to,

$$\mathcal{L}_{\text{align}}(h, T) = -\mathbb{E}_{u \sim \hat{p}}\, \mathrm{sp}\big( -T\big( h_u(G), H(G) \big) \big), \qquad \mathcal{L}_{\text{unif}}(h, T, q) = \mathbb{E}_{(u, G') \sim \hat{p} \times q}\, \mathrm{sp}\big( T\big( h_u(G), H(G') \big) \big).$$

At this point it becomes clear that, just as with NCE, a distribution $q^* \in \arg\max_q \mathcal{L}(f, q)$ in the InfoGraph framework must be supported on $\arg\max_{G' \in \mathcal{G}} \mathrm{sp}\big( T\big( h_u(G), H(G') \big) \big)$. Although this is still hard to compute exactly, it can be approximated by,

$$q^u_\beta(G') \propto e^{\beta T(h_u(G), H(G'))}\, p(G').$$
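To make the within-batch approximation concrete, the following is a minimal PyTorch-style sketch of the reweighted uniformity term. It is illustrative rather than a reproduction of the official InfoGraph code: the tensor names (scores, own_graph) and the helper hard_unif_term are our own conventions, and the sketch assumes scores[i, j] holds the discriminator value $T(h_{u_i}(G), H(G_j))$ for every node $u_i$ and graph $G_j$ in a batch.

import torch
import torch.nn.functional as F

def hard_unif_term(scores: torch.Tensor, own_graph: torch.Tensor, beta: float) -> torch.Tensor:
    # scores    : [n_nodes, n_graphs], scores[i, j] = T(h_{u_i}(G), H(G_j))
    # own_graph : [n_nodes] long tensor, index of the graph containing node u_i
    n_graphs = scores.shape[1]
    own = F.one_hot(own_graph, n_graphs).bool()
    # Exclude each node's own graph: only G' != G act as negatives.
    neg_scores = scores.masked_fill(own, float("-inf"))
    # q_beta^u(G') is proportional to exp(beta * T(h_u(G), H(G'))) p(G'); with the
    # empirical batch distribution standing in for p, importance sampling reduces
    # to a softmax reweighting of the in-batch negatives.
    weights = torch.softmax(beta * neg_scores, dim=1)
    return (weights * F.softplus(scores)).sum(dim=1).mean()

Setting β = 0 recovers the uniform average over in-batch negatives used by standard InfoGraph, mirroring the way the hard sampling NCE objective reduces to the standard objective.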
C ADDITIONAL EXPERIMENTS

C.1 HARD NEGATIVES WITH LARGE BATCH SIZES

Figure 6: Hard negative sampling using the MoCo-v2 framework (top-1 linear readout accuracy on CIFAR10 for β ∈ {0, 0.1, 0.2, 0.5}; the β = 0, τ⁺ = 0 point corresponds to standard MoCo-v2). Results show that hard negative samples can still be useful when the negative memory bank is very large (in this case N = 65536).

The vision experiments in the main body of the paper are all based on the SimCLR framework (Chen et al., 2020a), and use a relatively small batch size (up to 512). In order to test whether our hard negative sampling method can help when the negative batch size is very large, we also run experiments using MoCo-v2 with the standard negative memory bank size N = 65536 (He et al., 2020; Chen et al., 2020c). We adopt the official MoCo-v2 code (https://github.com/facebookresearch/moco). Embeddings are trained for 200 epochs, with batch size 128. Figure 6 summarizes the results. We find that hard negative sampling can still improve the generalization of embeddings trained on CIFAR10: MoCo-v2 attains a linear readout accuracy of 88.08%, while MoCo-v2 with hard negatives (β = 0.2, τ⁺ = 0) attains 88.47%.

C.2 ABLATIONS

Figure 7: The effect of varying the concentration parameter β on linear readout accuracy for CIFAR10 (β ∈ {0, 0.5, 1, 2, 6}; the baseline SimCLR corresponds to β = 0, τ⁺ = 0). (Complements the left and middle plots of Figure 4.)

To study the effect of varying the concentration parameter β on the learned embeddings, Figure 8 plots cosine similarity histograms of pairs of similar and dissimilar points. The results show that moving β from 0 through 0.5 to 2 causes both the positive and negative similarities to gradually skew left. In terms of downstream classification, an important property is the relative difference in similarity between positive and negative pairs. In this respect β = 0.5 finds the best balance, since it achieves the highest downstream accuracy. When β is taken very large (β = 6), we see a qualitative change: both positive and negative pairs are assigned higher similarities in general, and visually the positive and negative histograms for β = 6 overlap much more than for smaller values, which helps explain why the linear readout accuracy is lower for β = 6.

Figure 12 gives real examples of hard versus uniformly sampled negatives. Given an anchor x (a monkey) and a trained embedding f (trained on STL10 using standard SimCLR for 400 epochs), we sample a batch of 128 images. The top row shows the ten negatives $x^-$ that have the largest inner product $f(x)^\top f(x^-)$, while the bottom row shows a random sample from the same batch. The negatives with the largest inner product with the anchor are the most important terms in the objective, since they are given the highest weighting by $q_\beta$. Figure 12 shows that real hard negatives match the intuition proposed in Figure 1: hard negatives are semantically similar to the anchor, sharing various characteristics with it, including color (browns and greens), texture (fur), and object type (animals vs. machinery).
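The procedure behind this comparison can be sketched in a few lines. The code below is an illustration, not part of the released implementation: the encoder f, the anchor image, and the candidate batch are placeholder names, and we simply rank in-batch negatives by their inner product with the anchor, exactly as done to produce the top row of Figure 12.

import torch

def rank_negatives(f, anchor, batch, k=10):
    # Embed the anchor and the batch of candidate negatives.
    with torch.no_grad():
        z_a = f(anchor.unsqueeze(0))            # [1, d] anchor embedding
        z_b = f(batch)                          # [B, d] candidate embeddings
        sims = (z_b @ z_a.t()).squeeze(1)       # inner products f(x)^T f(x^-)
    hard_idx = sims.topk(k).indices             # hardest: largest inner product
    unif_idx = torch.randperm(len(batch))[:k]   # uniform baseline
    return hard_idx, unif_idx

The images indexed by hard_idx correspond to the top row of Figure 12, and those indexed by unif_idx to the bottom row.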
Figure 8: Histograms of cosine similarity of pairs of points with the same label (top) and different labels (bottom) for embeddings trained on CIFAR100 with different values of β (β = 0: acc = 67.73%; β = 0.5: acc = 69.61%; β = 2: acc = 67.68%; β = 6: acc = 65.54%). Histograms are overlaid pairwise to allow for easy comparison.

Figure 9: Histograms of cosine similarity of pairs of points with the same label (top) and different labels (bottom) for embeddings trained on CIFAR100 with four different objectives: H & D, H & not D, not H & D, and not H & not D (SimCLR), where H = hard sampling and D = debiasing. Histograms are overlaid pairwise to allow for convenient comparison.

Figure 10: Histograms of cosine similarity of pairs of points with the same label (top) and different labels (bottom) for embeddings trained on CIFAR10 with four different objectives: H & D, H & not D, not H & D, and not H & not D (SimCLR), where H = hard sampling and D = debiasing. Histograms are overlaid pairwise to allow for convenient comparison.

Figure 11: Hard sampling takes far fewer epochs to reach the same accuracy as SimCLR reaches in 400 epochs: on STL10 with β = 1 it takes only 60 epochs, and on CIFAR100 it takes 125 epochs (also with β = 1). (Readout accuracy over 400 training epochs; curves shown are Hard (β = 1.0, τ⁺ = 0.05) and Hard (β = 0.5, τ⁺ = 0.05) for CIFAR100, and Hard (β = 1.0, τ⁺ = 0.1) and Hard (β = 0.5, τ⁺ = 0.1) for STL10.)

D EXPERIMENTAL DETAILS

Figure 13 shows PyTorch-style pseudocode for the standard objective, the debiased objective, and the hard sampling objective. The proposed hard-sample loss is very simple to implement, requiring only two extra lines of code compared to the standard objective.

D.1 VISUAL REPRESENTATIONS

We implement SimCLR in PyTorch. We use a ResNet-50 (He et al., 2016) as the backbone with embedding dimension 2048 (the representation used for linear readout), and a projection head mapping into a 128-dimensional space (the embedding used in the contrastive objective). We use the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001 and weight decay $10^{-6}$. Code is available at https://github.com/joshr17/HCL. Since we adopt the SimCLR framework, the number of negative samples is N = 2 × (batch size − 1). Since we always take the batch size to be a power of 2 (16, 32, 64, 128, 256), the negative batch sizes are 30, 62, 126, 254, 510 respectively. Unless otherwise stated, all models are trained for 400 epochs.

Annealing β Method: We detail the annealing method whose results are given in Figure 4. The idea is to reduce the concentration parameter β down to zero as training progresses. Specifically, suppose we have e total training epochs, and specify a number ℓ of changes to the concentration parameter we shall make.
We initialize the concentration parameter to β₁ = β (where this β is the value reported in Figure 4), then once every e/ℓ epochs we reduce it by β/ℓ. In other words, if we are currently on βᵢ, then βᵢ₊₁ = βᵢ − β/ℓ, and we switch from βᵢ to βᵢ₊₁ at epoch number i · e/ℓ. The idea of this method is to select particularly difficult negative samples early on, in order to obtain useful gradient information early in training, and then, once the embedding is already quite good, to reduce the hardness level so as to lessen the harmful effect of only approximately correcting for false negatives (negatives with the same label as the anchor). We also found that annealing in the opposite direction (increasing β rather than decreasing it) achieved similar performance.

Figure 12: Qualitative comparison of hard negatives (top row) and uniformly sampled negatives (bottom row) for an embedding trained on STL10 for 400 epochs using SimCLR. Top row: the 10 images from a batch of 128 inputs with the highest inner product with the anchor in latent space. Bottom row: a set of random samples from the same batch. Hard negatives are semantically much more similar to the anchor than uniformly sampled negatives, sharing many characteristics with it, including textures, colors, and content (animals vs. machinery).

import math
import torch

# pos : exp of inner products for positive examples
# neg : exp of inner products for negative examples
# N : number of negative examples
# t : temperature scaling
# tau_plus : class prior probability
# beta : concentration parameter

# Original objective
standard_loss = -torch.log(pos.sum() / (pos.sum() + neg.sum()))

# Debiased objective: subtract the estimated false-negative contribution,
# clamping at the theoretical minimum e^{-1/t}
Neg = torch.clamp((-N * tau_plus * pos + neg).sum() / (1 - tau_plus), min=math.e ** (-1 / t))
debiased_loss = -torch.log(pos.sum() / (pos.sum() + Neg))

# Hard sampling objective (Ours): reweight each negative in proportion
# to its exponentiated similarity with the anchor
reweight = (beta * neg) / neg.mean()
Neg = torch.clamp((-N * tau_plus * pos + reweight * neg).sum() / (1 - tau_plus), min=math.e ** (-1 / t))
hard_loss = -torch.log(pos.sum() / (pos.sum() + Neg))

Figure 13: Pseudocode for our proposed new hard-sample objective, as well as the original NCE contrastive objective and the debiased contrastive objective. In each case we take the number of positive samples to be M = 1. The implementation of our hard sampling method requires only two additional lines of code compared to the standard objective.

Bias-variance of empirical estimates in hard-negative objective: Recall the final hard negative sampling objective we derive is,

$$\mathbb{E}_{\substack{x \sim p \\ x^+ \sim p^+_x}} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \frac{Q}{\tau^-} \Big( \mathbb{E}_{x^- \sim q_\beta}\big[ e^{f(x)^\top f(x^-)} \big] - \tau^+\, \mathbb{E}_{v \sim q^+_\beta}\big[ e^{f(x)^\top f(v)} \big] \Big)} \right].$$

This objective admits a practical counterpart obtained by using empirical approximations to $\mathbb{E}_{x^- \sim q_\beta}[e^{f(x)^\top f(x^-)}]$ and $\mathbb{E}_{v \sim q^+_\beta}[e^{f(x)^\top f(v)}]$. In practice we use a fairly large number of samples (e.g., N = 510) to approximate the first expectation, and only M = 1 samples to approximate the second. Clearly in both cases the resulting estimator is unbiased. Further, since the first expectation is approximated using many samples, and the integrand is bounded, the resulting estimator is well concentrated (e.g., apply Hoeffding's inequality out of the box). But what about the second expectation? This might seem uncontrolled, since we use only one sample; however, it turns out that the random variable $X = e^{f(x)^\top f(v)}$, where $x \sim p$ and $v \sim q^+_\beta$, has variance that is bounded by $\mathcal{L}_{\text{align}}(f)$.

Lemma 11. Consider the random variable $X = e^{f(x)^\top f(v)}$, where $x \sim p$ and $v \sim q^+_\beta$. Then,

$$\mathrm{Var}(X) \leq O\big( \mathcal{L}_{\text{align}}(f) \big).$$
Recall that $\mathcal{L}_{\text{align}}(f) = \mathbb{E}_{x, x^+} \|f(x) - f(x^+)\|^2 / 2$ is termed alignment, and that Wang & Isola (2020) show the contrastive objective jointly optimizes alignment and uniformity. Lemma 11 therefore shows that as training evolves, the variance of $X = e^{f(x)^\top f(v)}$, where $x \sim p$ and $v \sim q^+_\beta$, is bounded by a term that we expect to become small, suggesting that using a single sample (M = 1) to approximate this expectation is not unreasonable. We cannot, however, say more than this, since we have no guarantee that $\mathcal{L}_{\text{align}}(f)$ goes to zero.

Proof. Fix an $x$ and recall that we are considering $q^+_\beta(\cdot) = q^+_\beta(\cdot\,; x)$. First let $X'$ be an i.i.d. copy of $X$, and note that, conditioning on $x$, we have

$$2\, \mathrm{Var}(X \mid x) = \mathrm{Var}(X \mid x) + \mathrm{Var}(X' \mid x) = \mathrm{Var}(X - X' \mid x) \leq \mathbb{E}\big[ (X - X')^2 \mid x \big].$$

Bounding this difference,

$$\begin{aligned}
\mathbb{E}\big[ (X - X')^2 \mid x \big] &= \mathbb{E}_{v, v' \sim q^+_\beta}\big( e^{f(x)^\top f(v)} - e^{f(x)^\top f(v')} \big)^2 \\
&\leq e^{2/t^2}\, \mathbb{E}_{v, v' \sim q^+_\beta}\big( f(x)^\top f(v) - f(x)^\top f(v') \big)^2 \\
&\leq e^{2/t^2}\, \mathbb{E}_{v, v' \sim q^+_\beta}\, \|f(x)\|^2\, \|f(v) - f(v')\|^2 \\
&= \frac{e^{2/t^2}}{t^2}\, \mathbb{E}_{v, v' \sim q^+_\beta}\, \|f(v) - f(v')\|^2 \\
&\leq O\big( \mathbb{E}_{v, v' \sim p^+}\, \|f(v) - f(v')\|^2 \big),
\end{aligned}$$

where the first inequality follows since $f$ lies on the sphere of radius $1/t$ (so the exponents are bounded by $1/t^2$, on which range $z \mapsto e^z$ is Lipschitz), the second by Cauchy-Schwarz, the subsequent equality again since $f$ lies on the sphere of radius $1/t$, and the final inequality since $q^+_\beta$ is absolutely continuous with respect to $p^+$ with bounded ratio. Since $p^+_x(x^+) = p(x^+ \mid h(x))$ depends only on $c = h(x)$, rather than on $x$ itself, taking expectations over $x \sim p$ is equivalent to taking expectations over $c \sim \rho$. Further, $\rho(c)\, p(v \mid c)\, p(v' \mid c) = p(v)\, p(v' \mid c) = p(v)\, p^+_v(v')$. So,

$$\mathbb{E}_{c \sim \rho}\, \mathbb{E}_{v, v' \sim p(\cdot|c)}\, \|f(v) - f(v')\|^2 = \mathbb{E}_{x, x^+}\, \|f(x) - f(x^+)\|^2 = 2\, \mathcal{L}_{\text{align}}(f),$$

where $x \sim p$ and $x^+ \sim p^+_x$. Thus we obtain the lemma.

import math
import torch

# pos : exp of inner products for positive examples
# neg : exp of inner products for negative examples
# N : number of negative examples
# t : temperature scaling
# tau_plus : class prior probability
# beta : concentration parameter

# Clipping negatives trick before computing the reweighting
reweight = 2 * neg / torch.max(neg.max().abs(), neg.min().abs())
reweight = (beta * reweight) / reweight.mean()
Neg = torch.clamp((-N * tau_plus * pos + reweight * neg).sum() / (1 - tau_plus), min=math.e ** (-1 / t))
hard_loss = -torch.log(pos.sum() / (pos.sum() + Neg))

Figure 14: In cases where the learned embedding is not normalized to lie on a hypersphere, we found that clipping the negatives to lie in a fixed range (in this case [−2, 2]) stabilizes optimization.

D.2 GRAPH REPRESENTATIONS

All datasets we benchmark on can be downloaded at www.graphlearning.io from the TUDataset repository of graph classification problems (Morris et al., 2020). Basic statistics of the datasets are included in Tables 3 and 4. For fair comparison to the original InfoGraph method, we adopt the official code, which can be found at https://github.com/fanyun-sun/InfoGraph. We modify only the gan_losses.py script, adding in our proposed hard sampling via reweighting. For simplicity we train all models using the same set of hyperparameters: we use the GIN architecture (Xu et al., 2019) with K = 3 layers and embedding dimension d = 32. Each model is trained for 200 epochs with batch size 128 using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001 and weight decay $10^{-6}$. Each embedding is evaluated using the average accuracy of 10-fold cross-validation, using an SVM as the classifier (in line with the approach taken by Morris et al. (2020)). Each experiment is repeated from scratch 10 times, and the distribution of results from these 10 runs is plotted in Figure 3.
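For concreteness, the following is a minimal sketch of this evaluation protocol, assuming embeddings is an [n_graphs, d] array of frozen graph-level embeddings and labels the corresponding class labels; the array names and the SVM hyperparameters shown are illustrative defaults, not the exact values used in our runs.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate_embeddings(embeddings: np.ndarray, labels: np.ndarray) -> float:
    # 10-fold cross-validation with an SVM classifier on frozen embeddings.
    scores = cross_val_score(SVC(), embeddings, labels, cv=10, scoring="accuracy")
    return float(scores.mean())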
Since the graph embeddings are not constrained to lie on a hypersphere, for each batch we clip all the inner products to lie in the interval [−2, 2] while computing the reweighting, as illustrated in Figure 14. We found this to be important for stabilizing optimization.

Table 3: Basic statistics for graph datasets.

Dataset       DD       PTC     REDDIT-B   PROTEINS
No. graphs    1178     344     2000       1113
No. classes   2        2       2          2
Avg. nodes    284.32   14.29   429.63     39.06
Avg. edges    715.66   14.69   497.75     72.82

Table 4: Basic statistics for graph datasets.

Dataset       ENZYMES   MUTAG   IMDB-B   IMDB-M
No. graphs    600       188     1000     1500
No. classes   6         2       2        3
Avg. nodes    32.63     17.93   19.77    13.00
Avg. edges    62.14     19.79   96.53    65.94

D.3 SENTENCE REPRESENTATIONS

We adopt the official quick-thought vectors experimental settings, which can be found at https://github.com/lajanugen/S2V. We keep all hyperparameters at their default values and change only the s2v-model.py script. Since the official BookCorpus dataset (Kiros et al., 2015) is not available, we use an unofficial version obtained using the following repository: https://github.com/soskek/bookcorpus. Since the sentence embeddings are also not constrained to lie on a hypersphere, we use the same clipping trick as for the graph embeddings, illustrated in Figure 14. After training on the BookCorpus dataset, we evaluate the embeddings on six different classification tasks: paraphrase identification (MSRP) (Dolan et al., 2004), question type classification (TREC) (Voorhees & Harman, 2002), opinion polarity (MPQA) (Wiebe et al., 2005), subjectivity classification (SUBJ) (Pang & Lee, 2004), product reviews (CR) (Hu & Liu, 2004), and sentiment of movie reviews (MR) (Pang & Lee, 2005).

E FURTHER DISCUSSION

Comparison with Kalantidis et al. (2020): Kalantidis et al. (2020) also consider ways to sample negatives, and propose a mixing strategy for hard negatives, called MoCHi. The main points of difference are: 1) MoCHi considers the benefit of hard negatives, but does not consider the possibility of false negatives (Principle 1), which we found to be valuable; 2) MoCHi introduces three extra hyperparameters, while our method introduces only two (β, τ⁺), and if we discard Principle 1 (i.e., τ⁺) then only β requires tuning; 3) our method introduces zero computational overhead by utilizing within-batch reweighting, whereas MoCHi involves a small amount of extra computation.

Limitations of the proposed method: There are two main sources of approximation in the hard negative sampling algorithm we propose to use in practice. First, our hard negatives objective requires computing an expectation over $p^+_x(x^+)$, the distribution over points with the same class label as $x$. Due to the lack of supervision, we are not able to sample exactly from $p^+_x$. To circumvent this, we propose using samples generated via data augmentation, which we find works well in practice, but which deviates from the original formulation and framework. We find this approximation has important implications for practitioners: when using oracle access to $p^+_x$, the downstream performance improves monotonically as the concentration parameter β increases (see Fig. 4, middle), while when using the practical approximation, large values of β start to hurt performance, thereby requiring that β be tuned (values in the range (0.5, 2) tend to work well in general). Second, we estimate the expectation $\mathbb{E}_{v \sim q^+_\beta}[e^{f(x)^\top f(v)}]$ using just a single (importance weighted) sample.
This is done in the interests of efficiency, since using M > 1 samples significantly increases the effective batch size (this is the more costly multi-view setting), increasing memory and runtime costs. We anticipate that using M > 1 would improve performance further, in exchange for this extra cost, but leave that experimentation to future work. That said, our analysis shows that the variance of $e^{f(x)^\top f(v)}$ with $v \sim q^+_\beta$ is controlled by the alignment objective, as shown in Lemma 11, which suggests that in later stages of training the variance of the single-sample estimator may be reasonably small.
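For illustration, a sketch of what the M-sample counterpart of this estimator might look like, under the assumption that pos holds $e^{f(x)^\top f(v_i)}$ for M augmented positives $v_i$ of a single anchor (the function name and tensor layout are placeholders; our released code uses M = 1):

import torch

def weighted_pos_term(pos: torch.Tensor, beta: float) -> torch.Tensor:
    # pos: [M] tensor of exp(f(x)^T f(v_i)) for M positives of one anchor.
    # q_beta^+ upweights positives in proportion to exp(beta * f(x)^T f(v)),
    # so a self-normalized importance sampling estimate of
    # E_{v ~ q_beta^+}[exp(f(x)^T f(v))] reweights the M samples by a softmax.
    logits = pos.log()                            # recover f(x)^T f(v_i)
    weights = torch.softmax(beta * logits, dim=0) # importance weights for q_beta^+
    return (weights * pos).sum()

With M = 1 the softmax weight is identically 1, and the estimate reduces to the single-sample one used in our experiments.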