# Large-Margin Contrastive Learning with Distance Polarization Regularizer

Shuo Chen 1, Gang Niu 1, Chen Gong 2, Jun Li 2, Jian Yang 2, Masashi Sugiyama 1 3

**Abstract.** Contrastive learning (CL) pretrains models in a pairwise manner, where given a data point, other data points are all regarded as dissimilar, including some that are semantically similar. The issue has been addressed by properly weighting similar and dissimilar pairs as in positive-unlabeled learning, so that the objective of CL is unbiased and CL is consistent. However, in this paper, we argue that this great solution is still not enough: its weighted objective hides the issue that the semantically similar pairs are still pushed away; as CL is pretraining, this phenomenon is not our desideratum and might affect downstream tasks. To this end, we propose large-margin contrastive learning (LMCL) with a distance polarization regularizer, motivated by the distribution characteristic of pairwise distances in metric learning. In LMCL, we can distinguish between intra-cluster and inter-cluster pairs, and then only push away inter-cluster pairs, which solves the above issue explicitly. Theoretically, we prove a tighter error bound for LMCL; empirically, the superiority of LMCL is demonstrated across multiple domains, i.e., image classification, sentence representation, and reinforcement learning.

1 RIKEN Center for Advanced Intelligence Project, Japan; 2 PCA-Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology, China; 3 Graduate School of Frontier Sciences, The University of Tokyo, Japan. Correspondence to: Shuo Chen, Jun Li. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Figure 1. Conceptual illustration of unregularized contrastive learning and contrastive learning regularized by distance polarization (DP). The unregularized model enlarges the distances between all pairs of instances, potentially leading to ambiguous intra-/inter-cluster distances. Our DP-regularized learning algorithm encourages pairwise distances to be either extremely large or extremely small, and thus yields unambiguous distance determination with a large margin between intra-cluster and inter-cluster pairs.

## 1. Introduction

Machine learning without human annotation is a long-standing and important problem. Recently, the unsupervised learning approach has been greatly promoted by contrastive learning (CL), which shows encouraging performance compared to fully supervised learning methods (Wu et al., 2018; Saunshi et al., 2019). CL directly learns a generic feature embedding for the original data, and the learned embedding can be widely employed in many downstream recognition tasks such as classification (Chen et al., 2020a) and clustering (Zhong et al., 2020). Thereby, CL has become one of the most important unsupervised learning approaches. As human annotation is not available in an unsupervised learning setting, CL algorithms usually consider building pseudo supervision in their learning objectives (Saunshi et al., 2019; Jing & Tian, 2020).
In general, most existing CL frameworks regard any two instances in the training data as a negative pair (including the false-negative pairs consisting of semantically similar instances), and meanwhile construct positive pairs by combining each instance with its perturbation (Wu et al., 2018; Song & Ermon, 2020). Owing to the continued success brought by positive pairs, many recent efforts have increasingly focused on various data augmentation techniques to further enrich the training data (Oord et al., 2018; Tian et al., 2020a) while simultaneously preserving semantic content (Logeswaran & Lee, 2018; Tian et al., 2020b).

While positive-pair sampling has drawn much attention, relatively few works consider the influence of negative pairs in CL (Jing & Tian, 2020). In fact, as most existing CL methods directly repel all pairs of instances in the training data, semantically similar instances are undesirably pushed apart. Recent works propose weighting the positive and negative pairs as in positive-unlabeled learning (Chen et al., 2020b) to counteract the impact of false-negative pairs (Chuang et al., 2020; Robinson et al., 2020). Nevertheless, the weighted learning objectives still encourage repelling each pair of original instances in the training data (Huynh et al., 2020), so they are not able to faithfully reflect the similarity between two semantically similar instances.

Although existing CL algorithms have achieved promising results to some extent, most of their objectives do not explicitly discriminate the semantic similarity of each instance pair, and thus they cannot adequately capture intrinsic features in the training data. To address this issue, we provide theoretical results revealing that when conventional CL encourages repelling each pair of original instances, the finally learned pairwise distances nearly obey a unimodal distribution in the region (0, 1). This implies that conventional CL fails to yield a notable margin for distinguishing the similarities of data pairs (see the left panel of Fig. 1). This observation inspires us to propose large-margin contrastive learning (LMCL) with a distance polarization (DP) regularizer, which clearly separates similar pairs from dissimilar pairs with a large margin. The DP regularizer is motivated by the general goal of metric learning (Xing et al., 2002): it casts a penalty onto all pairwise distances falling within the margin region, and thereby encourages polarized distances for similarity determination (see the bimodal distribution in the right panel of Fig. 1). Theoretically, we prove that the proposed DP regularizer effectively tightens the error bound of the conventional CL algorithm. Experimentally, our approach consistently improves over state-of-the-art methods on vision, language, and reinforcement learning benchmarks. The proposed DP regularizer is simple yet generic, and can be easily deployed in many existing CL methods.

Our main contributions are summarized below:

- We propose a novel distance polarization regularizer that enhances the generalization ability of the conventional CL algorithm by explicitly discriminating the pairwise similarity between two original instances.
- We establish a complete theoretical guarantee for our method, analyzing the error bounds of the similarity measure and of downstream classification, respectively.
- We conduct extensive experiments on synthetic and real-world datasets to validate the superiority of our method over state-of-the-art CL approaches.

## 2. Background & Related Work

In this section, we first introduce some necessary notation. Then, we briefly review the background of contrastive learning. We also introduce the main concepts of metric learning and the regularization technique, which are related to this paper.

**Notations.** We write matrices and vectors as bold uppercase and bold lowercase characters, respectively. We denote the training dataset $X = \{\boldsymbol{x}_i \in \mathbb{R}^m \mid i = 1, 2, \dots, N\}$, where $m$ is the data dimensionality and $N$ is the total number of instances. The operator $\odot$ denotes the element-wise product of two vectors/matrices. The operators $\|\cdot\|_0$ and $\|\cdot\|_1$ denote the vector/matrix $\ell_0$-norm and $\ell_1$-norm, respectively.

### 2.1. Contrastive Learning

As an unsupervised / self-supervised learning approach, the basic goal of a contrastive learning (CL) algorithm is to learn a generic feature embedding $\phi : \mathbb{R}^m \mapsto \mathbb{R}^d$, which transforms a data point from the $m$-dimensional sample space to the $d$-dimensional embedding space for extracting intrinsic features. The primitive CL method, called instance discrimination, learns such an embedding by directly repelling each pair of two instances in the training data (Wu et al., 2018). Subsequent works such as momentum contrast (MoCo) encourage using a larger negative-pair batch size for better learning results (He et al., 2020). Recently, the SimCLR framework further introduced data augmentation to generate positive pairs, which incorporates more semantic information into the learning objective (Chen et al., 2020a).

In general, the effectiveness of existing CL algorithms relies on two key components: the negative pairs $(x, x^-)$ sampled from every two original instances in the training data, and the positive pairs $(x, x^+)$ built from each single instance $x$ and its perturbation $x^+$. When the noise contrastive estimation (NCE) loss (Gutmann & Hyvärinen, 2010) is employed to learn a feature embedding $\phi$ from positive and negative pairs, the general learning objective can be formulated as

$$\mathcal{L}_{\mathrm{NCE}}(\phi) = -\mathbb{E}_{x,\,\{x_j^-\} \sim X}\left[\log \frac{e^{\phi(x)^\top \phi(x^+)}}{e^{\phi(x)^\top \phi(x^+)} + \sum_{j=1}^{n} e^{\phi(x)^\top \phi(x_j^-)}}\right], \tag{1}$$

where $x$ and $\{x_j^-\}_{j=1}^{n}$ are uniformly sampled from the training data $X$, and $n$ is the batch size of negative pairs.

It is worth noting that the conventional NCE loss for contrastive learning is misleading, as semantically similar (i.e., false-negative) data pairs might be pushed apart while all negative pairs are repelled. To alleviate this issue, a clustering approach (Li et al., 2020) has been applied to the learned embedding to gather similar instances, though the reliability of the clustering results is easily influenced by the learned embedding itself. Recent works adopt popular practices from positive-unlabeled (PU) learning (Chen et al., 2020b) to reweight the NCE loss by increasing the importance of positive pairs (Chuang et al., 2020) or allocating different importance to negative pairs (Robinson et al., 2020). Although a few works have thus been proposed to alleviate the undesirable repelling of semantically similar instances, their learning objectives still cannot clearly discriminate the pairwise similarity between two original instances. In this paper, we address this issue from a different viewpoint, which employs the basic property of metric learning (Chu et al., 2020) to constrain the similarity of negative pairs.
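To make the objective in Eq. (1) concrete, the following is a minimal PyTorch-style sketch of the NCE loss for a single anchor. The function name, the temperature argument, and the tensor shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def nce_loss(anchor, positive, negatives, temperature=1.0):
    """NCE loss of Eq. (1) for one anchor embedding phi(x).

    anchor:    (d,)   embedding phi(x) of the original instance
    positive:  (d,)   embedding phi(x+) of its perturbation
    negatives: (n, d) embeddings phi(x_j^-) of n instances sampled from X
    """
    pos_logit = (anchor @ positive).unsqueeze(0) / temperature   # (1,)
    neg_logits = negatives @ anchor / temperature                # (n,)
    logits = torch.cat([pos_logit, neg_logits])                  # (n+1,)
    # negative log-probability that the positive pair wins the softmax
    return -F.log_softmax(logits, dim=0)[0]
```

In practice the loss is averaged over the anchors of a mini-batch, which matches the stochastic form used later in Section 3.3.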
### 2.2. Metric Learning

As a supervised learning problem, metric learning aims to learn a distance metric that faithfully measures the pairwise similarity between two instances in the sample space (Davis et al., 2007; Chu et al., 2020). For the training data $X = \{\boldsymbol{x}_i \in \mathbb{R}^m \mid i = 1, 2, \dots, N\}$, the class labels $\{y_i \in \{1, 2, \dots, C\} \mid i = 1, 2, \dots, N\}$ are provided for supervision, where $C$ is the number of classes. As this supervisory information is available, the positive and negative pairs in metric learning can be built directly from the semantic labels $\{y_i\}_{i=1}^{N}$, yielding the well-known $(n+1)$-tuplet loss (Sohn, 2016)

$$-\mathbb{E}_{\,y_i = y_k \neq y_{\hat{b}_j}}\left[\log \frac{e^{\phi(x_i)^\top \phi(x_k)}}{e^{\phi(x_i)^\top \phi(x_k)} + \sum_{j=1}^{n} e^{\phi(x_i)^\top \phi(x_{\hat{b}_j})}}\right], \tag{2}$$

which encourages reducing the intra-class distance $\|\phi(x_i) - \phi(x_k)\|_2^2$ and enlarging the inter-class distance $\|\phi(x_i) - \phi(x_{\hat{b}_j})\|_2^2$ for $i, k, \hat{b}_j = 1, 2, \dots, N$, where $j = 1, 2, \dots, n$ and $\{\hat{b}_j\}_{j=1}^{n}$ is the index set of the batch of negative points. As in Eq. (1), $n$ is the batch size of negative pairs. Minimizing such a supervised learning objective leads to a margin between the intra-class and inter-class distances, and thereby discriminates the pairwise similarity between each two original instances (Yu & Tao, 2019). Although Eq. (2) has a very similar form to Eq. (1), Eq. (2) is fully supervised, so its negative pairs are unbiased. In this paper, we convert this basic property of the above metric learning model into a regularizer that constrains the learning objective of the CL algorithm.

### 2.3. Regularization Technique

Regularization is a generic and effective technique that has been well studied and widely applied in statistics and machine learning (Dong et al., 2014; Scholkopf & Smola, 2018). Generally speaking, a regularization term (i.e., a regularizer) introduces a specific inductive bias into the empirical loss, thus reducing the complexity of the hypothesis space and improving the model's generalizability (Guo et al., 2017). For example, the well-known $\ell_2$-norm regularizer (i.e., weight decay (Krogh & Hertz, 1992) in some deep learning models) restricts the scale of the learning parameters so that the learned embedding can successfully capture scale-invariant features (Yang et al., 2011). The $\ell_1$-norm regularizer (i.e., sparse regularization) assumes that only a few learning parameters should be activated in practical recognition tasks, thereby alleviating the impact of over-fitting (Arpit et al., 2016). Our proposed method can also be regarded as a type of regularization technique. Similar to most existing regularizers, our method effectively reduces the complexity of the hypothesis space by introducing critical prior knowledge, which is acquired from the metric learning algorithm.

## 3. Methodology

In this section, we first investigate the distribution of pairwise distances learned by the conventional CL algorithm. After that, we propose a new large-margin contrastive learning algorithm by building a distance polarization regularizer. The learning objective and the corresponding optimization algorithm are finally designed with a convergence guarantee.

### 3.1. Motivation

As mentioned before, the key element of CL is the similarity relation between pairwise instances. For a learnable mapping $\phi : \mathbb{R}^m \mapsto \mathbb{R}^d$, the (squared) Euclidean distance $\|\phi(x_i) - \phi(x_j)\|_2^2$ measures the (dis)similarity between two original instances $x_i$ and $x_j$ from the training data $X$.
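Eq. (3) below rescales this (dis)similarity into a normalized distance based on the inner product of unit-length embeddings. As a rough illustration (our own sketch, not the authors' code), the full distance matrix for a batch can be computed as follows.

```python
import torch
import torch.nn.functional as F

def normalized_distances(embeddings):
    """Pairwise normalized distances D_ij = (1 - phi(x_i)^T phi(x_j)) / 2.

    embeddings: (N, d) outputs of the embedding phi; rows are L2-normalized,
    so that ||phi(x_i) - phi(x_j)||^2 = 2 - 2 * phi(x_i)^T phi(x_j).
    Returns an (N, N) matrix with entries in [0, 1].
    """
    z = F.normalize(embeddings, dim=1)  # enforce unit norm
    return (1.0 - z @ z.t()) / 2.0
```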
Since $\phi(x)$ is usually normalized to reduce overfitting, the pairwise distance in the embedding space satisfies $\|\phi(x_i) - \phi(x_j)\|_2^2 = 2 - 2\phi(x_i)^\top\phi(x_j)$. For simplicity, we further denote the normalized Euclidean distance

$$D^{\phi}_{ij} = \big(1 - \phi(x_i)^\top \phi(x_j)\big)/2, \tag{3}$$

which measures the similarity between instances $x_i, x_j \in X$ with a real value $D^{\phi}_{ij} \in [0, 1]$. Then, from both empirical and theoretical aspects, we investigate the distribution of the distance $D^{\phi}_{ij}$ for all $1 \le i < j \le N$. Conventional CL aims to repel each pair of instances, i.e., to enlarge the distance $D^{\phi}_{ij}$ toward the maximal value $1$ for all $1 \le i < j \le N$; for a sufficiently large $N$, the learned pairwise distances nearly obey a unimodal distribution in the region $(0, 1)$, without a clear margin separating similar from dissimilar pairs.

### 3.2. Distance Polarization Regularizer

To explicitly create such a margin, we introduce two thresholds $0 < \delta_+ < \delta_- < 1$ that define a margin region $(\delta_+, \delta_-)$ in the distance space, and we penalize every pairwise distance falling into this region. Writing $D^{\phi} = [D^{\phi}_{ij}] \in [0, 1]^{N \times N}$, the distance polarization (DP) regularizer is the $\ell_0$-norm based penalty

$$R_0(\phi) = \big\|\min\big((D^{\phi} - \delta_+\mathbf{1}) \odot (D^{\phi} - \delta_-\mathbf{1}),\, \mathbf{0}\big)\big\|_0, \tag{6}$$

and the learning objective of our large-margin contrastive learning (LMCL) is

$$\min_{\phi \in \mathcal{H}} \ \mathcal{L}_{\mathrm{NCE}}(\phi) + \lambda R_0(\phi), \tag{8}$$

where the regularization parameter $\lambda > 0$ is tuned by users. As a regularized learning objective, LMCL is simple and generic because the loss term $\mathcal{L}_{\mathrm{NCE}}(\phi)$ can be implemented by many existing CL algorithms. In the next subsection, we show that Eq. (8) can be easily solved by existing stochastic optimization methods.

### 3.3. Optimization

Minimizing the objective function in Eq. (8) is a classical $\ell_0$-norm optimization problem, which is usually non-continuous and non-convex. Fortunately, for the original $\ell_0$-norm based regularizer in Eq. (6), we can easily find that $|D^{\phi}_{ij} - \delta_+| \in (0, 1)$ and $|\delta_- - D^{\phi}_{ij}| \in (0, 1)$ for any $i, j = 1, 2, \dots, N$, so $|\min((D^{\phi} - \delta_+\mathbf{1}) \odot (D^{\phi} - \delta_-\mathbf{1}), \mathbf{0})| \in [0, 1]^{N \times N}$. (Any distance $D^{\phi}_{ij}$ falling into the margin region $(\delta_+, \delta_-)$ incurs a negative product $(D^{\phi}_{ij} - \delta_+)(D^{\phi}_{ij} - \delta_-)$, so that $\min((D^{\phi}_{ij} - \delta_+)(D^{\phi}_{ij} - \delta_-), 0) \neq 0$, which increases the $\ell_0$-norm and hence the value of the regularizer $R_0(\phi)$.) As the $\ell_1$-norm is the convex envelope of the $\ell_0$-norm in the unit hypercube $[0, 1]^{N \times N}$, we can simply convert the $\ell_0$-norm based regularizer in Eq. (6) to the $\ell_1$-norm based form $R_1(\phi) = \|\min((D^{\phi} - \delta_+\mathbf{1}) \odot (D^{\phi} - \delta_-\mathbf{1}), \mathbf{0})\|_1$, which is a good approximation to the $\ell_0$-norm in the unit hypercube. By integrating such a differentiable almost everywhere (a.e.) function, we finally have the following learning objective $F(\phi)$:

$$\min_{\phi \in \mathcal{H}} \ \big\{ F(\phi) = \mathcal{L}_{\mathrm{NCE}}(\phi) + \lambda R_1(\phi) \big\}. \tag{9}$$

We now show that the above objective can be solved by existing stochastic optimization methods. For $n+1$ (i.e., the batch size) randomly selected data points $\{x_{b_j} \mid x_{b_j} \in X,\ \boldsymbol{b} \in \mathcal{B}\}_{j=1}^{n+1}$, the NCE loss defined by Eq. (1) already has a stochastic form $\ell(\phi; \{x_{b_j}\}_{j=1}^{n+1})$ with $\mathcal{L}_{\mathrm{NCE}}(\phi) = \mathbb{E}[\ell(\phi; \{x_{b_j}\}_{j=1}^{n+1})]$, where

$$\ell(\phi; \{x_{b_j}\}_{j=1}^{n+1}) = -\log \frac{e^{\phi(x_{b_{n+1}})^\top \phi(x^+_{b_{n+1}})}}{e^{\phi(x_{b_{n+1}})^\top \phi(x^+_{b_{n+1}})} + \sum_{j=1}^{n} e^{\phi(x_{b_{n+1}})^\top \phi(x_{b_j})}}.$$

Hence we only need a stochastic form of the regularizer over a mini-batch, i.e.,

$$R_1(\phi) = \frac{1}{|\mathcal{B}|} \sum_{\boldsymbol{b} \in \mathcal{B}} r(\phi; \{x_{b_j}\}_{j=1}^{n+1}), \qquad r(\phi; \{x_{b_j}\}_{j=1}^{n+1}) = \sum_{i, j = 1}^{n+1} \Big|\min\big((D^{\phi}_{b_i b_j} - \delta_+)(D^{\phi}_{b_i b_j} - \delta_-),\, 0\big)\Big|, \tag{10}$$

where the index vector set is $\mathcal{B} = \{\boldsymbol{b} = (b_1, \dots, b_{n+1})^\top \mid b_i \in \{1, \dots, N\},\ b_i \neq b_j \ \text{for} \ i \neq j\}$. Thus $F(\phi)$ in Eq. (9) has the stochastic form $f(\phi; \{x_{b_j}\}_{j=1}^{n+1}) = \ell(\phi; \{x_{b_j}\}_{j=1}^{n+1}) + \lambda\, r(\phi; \{x_{b_j}\}_{j=1}^{n+1})$. Based on this stochastic loss, we provide the Adam iteration steps for solving Eq. (9) in Algorithm 1.

**Algorithm 1.** Solving Eq. (9) via Adam.
Input: training data $X = \{x_i\}_{i=1}^{N}$; step size $\eta > 0$; regularization parameter $\lambda > 0$; batch size $n \in \mathbb{N}_+$.
Initialize: momentum vectors $\boldsymbol{m}^{(0)} = \boldsymbol{v}^{(0)} = \boldsymbol{0}$; decay rates $\alpha_1, \alpha_2 \in (0, 1)$; iteration number $t = 0$.
For $t$ from 1 to $T$:
1. Uniformly pick $n+1$ data points $\{x_{b_j}\}_{j=1}^{n+1}$ from $X$;
2. Compute the stochastic gradient via Eq. (10): $\boldsymbol{g}^{(t)} = \nabla_{\phi}\big(\ell(\phi; \{x_{b_j}\}_{j=1}^{n+1}) + \lambda\, r(\phi; \{x_{b_j}\}_{j=1}^{n+1})\big)$; (11)
3. Compute the moment vectors: $\boldsymbol{m}^{(t+1)} \leftarrow \alpha_1 \boldsymbol{m}^{(t)} + (1 - \alpha_1)\boldsymbol{g}^{(t)}$ and $\boldsymbol{v}^{(t+1)} \leftarrow \alpha_2 \boldsymbol{v}^{(t)} + (1 - \alpha_2)\,\boldsymbol{g}^{(t)} \odot \boldsymbol{g}^{(t)}$;
4. Update the learning parameter: $\phi^{(t+1)} \leftarrow \phi^{(t)} - \eta\,\dfrac{\boldsymbol{m}^{(t+1)}/(1 - \alpha_1^{t+1})}{\sqrt{\boldsymbol{v}^{(t+1)}/(1 - \alpha_2^{t+1})} + \epsilon}$; (12)
End.
Output: the converged $\tilde{\phi}$.
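As a rough illustration of the $\ell_1$-form regularizer and the mini-batch objective $f(\phi;\cdot)$ in Eqs. (9)–(10), the sketch below penalizes distances that fall inside the margin region $(\delta_+, \delta_-)$. The distance matrix `D` can be produced by the `normalized_distances` helper sketched in Section 3.1, and the default thresholds follow the values used in Section 5; this is an assumed implementation, not the authors' released code.

```python
import torch

def dp_regularizer(D, delta_plus=0.1, delta_minus=0.5):
    """l1-form distance polarization penalty for one mini-batch (cf. Eq. (10)).

    D: (n+1, n+1) matrix of normalized pairwise distances in [0, 1].
    A distance inside the margin region (delta_plus, delta_minus) makes the
    product (D - delta_plus)(D - delta_minus) negative and is penalized;
    distances outside the margin contribute zero.
    """
    penalty = torch.clamp((D - delta_plus) * (D - delta_minus), max=0.0)
    return penalty.abs().sum()

def lmcl_batch_objective(nce, D, lam=0.1, delta_plus=0.1, delta_minus=0.5):
    """Stochastic LMCL objective f = NCE loss + lambda * DP penalty (cf. Eq. (9))."""
    return nce + lam * dp_regularizer(D, delta_plus, delta_minus)
```

Because the clamped product is differentiable almost everywhere, this penalty can be minimized together with the NCE loss by Adam, exactly as in Algorithm 1.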
In summary, introducing the DP regularizer merely incurs an additional stochastic gradient in Eq. (11). This means that our method can be easily implemented in most existing CL methods and introduces very little computational overhead. Furthermore, the convergence of Adam has been well studied in previous works (Zaheer et al., 2018). It can be verified that $\ell(\phi; \{x_{b_j}\}_{j=1}^{n+1})$ and $r(\phi; \{x_{b_j}\}_{j=1}^{n+1})$ are both Lipschitz-smooth and gradient-bounded, as long as the embedding $\phi$ is Lipschitz-smooth and gradient-bounded. In this case, the iteration sequence $\phi^{(1)}, \dots, \phi^{(T)}$ in Algorithm 1 converges to a stationary point of the learning objective $F$ with a convergence rate $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of iterations (Huang et al., 2019; 2020).

## 4. Theoretical Analyses

In this section, we provide in-depth theoretical analyses of our proposed method. We first investigate the reliability of our method as a similarity measure. After that, we demonstrate the generalizability of our method on the downstream classification task.

### 4.1. Error Bound for Similarity Measure

In general, CL considers the similarity between pairwise instances, so the reliability of a CL algorithm depends on whether the pairwise similarity can be faithfully measured. We follow the common practice in learning theory (Xie et al., 2017) and study the error bound determined by the minimizer of our learning objective in Eq. (9). Specifically, we investigate the correctness of the pairwise distances $D^{\phi}_{ij}$ by building the expectations $\mathbb{E}_{y_i \neq y_j}[\max(\delta_-^{\mu} - D^{\phi}_{ij}, 0)]$ and $\mathbb{E}_{y_k = y_l}[\max(D^{\phi}_{kl} - \delta_+^{\mu}, 0)]$ to evaluate the false negatives and false positives, respectively. The corresponding error bound is provided in Theorem 3.

**Theorem 3.** Assume that $\phi^* \in \arg\min_{\phi \in \mathcal{H}} \mathcal{L}_{\mathrm{NCE}}(\phi) + \lambda R_1(\phi)$, and that the underlying class labels of the training data $\{x_i\}_{i=1}^{N}$ are $\{y_i\}_{i=1}^{N}$. Then we have

$$\mathbb{E}_{y_i \neq y_j}\big[\max(\delta_-^{\mu} - D^{\phi^*}_{ij}, 0)\big] + \mathbb{E}_{y_k = y_l}\big[\max(D^{\phi^*}_{kl} - \delta_+^{\mu}, 0)\big] \le (\delta_- - \delta_+)\,R_1(\phi^*) + \frac{K_{\max}/K_{\min}}{C} \le \frac{4(\delta_- - \delta_+)}{\lambda} + \frac{K_{\max}/K_{\min}}{C}, \tag{13}$$

where the constants $\delta_-^{\mu} = \delta_- - \mu$, $\delta_+^{\mu} = \delta_+ + \mu$, $\mu \in (0, \delta_- - \delta_+)$, $K_{\min} = \min_{1 \le k \le C} \|\boldsymbol{y}^k \odot \boldsymbol{1}_N\|_0$, and $K_{\max} = \max_{1 \le k \le C} \|\boldsymbol{y}^k \odot \boldsymbol{1}_N\|_0$.

Eq. (13) clearly reveals that the error bound of the similarities measured by our method gradually converges to 0 as the class number $C$ increases and the regularizer value $R_1(\phi^*)$ decreases. Firstly, it implies that the diversity of the data (i.e., a large $C$) benefits the reliability of the similarity measured by CL algorithms. This conclusion is consistent with existing theoretical findings that a larger $C$ leads to better generalizability (Saunshi et al., 2019). Secondly, the error bound also relies on a small regularizer value $R_1(\phi^*)$; since the minimizer satisfies $R_1(\phi^*) \le 4/\lambda$, the second inequality in Eq. (13) follows. This demonstrates the necessity and usefulness of our proposed DP regularizer, because increasing the regularization parameter $\lambda$ assists the error bound in converging to zero.

### 4.2. Error Bound for Downstream Classification

The experimental performance of most CL algorithms is usually evaluated by a downstream classification task. Therefore, we provide the generalization error bound (GEB) of our method for the classification task, which trains a softmax classifier by minimizing the traditional cross-entropy loss (Zhang & Sabuncu, 2018), i.e., $\mathcal{L}_{\mathrm{SM}}(\phi; X) = \inf_{\boldsymbol{W} \in \mathbb{R}^{C \times d}} \mathcal{L}_{\mathrm{CE}}(\boldsymbol{W}\phi; X)$. For a feature embedding $\phi$, the generalization error is defined by $\mathcal{L}^{\mathcal{T}}_{\mathrm{SM}}(\phi) = \mathbb{E}_{X \sim \mathcal{T}}[\mathcal{L}_{\mathrm{SM}}(\phi; X)]$, where $\mathcal{T}$ is the underlying distribution of the training data $X$.
Then we investigate how far such a generalization error $\mathcal{L}^{\mathcal{T}}_{\mathrm{SM}}(\phi)$ is from the learning objective $\mathcal{L}_{\mathrm{NCE}}(\phi)$ of contrastive learning.

**Theorem 4.** Let $\phi^* \in \arg\min_{\phi \in \mathcal{H}} \mathcal{L}_{\mathrm{NCE}}(\phi) + \lambda R_1(\phi)$. Then with probability at least $1 - \delta$, we have

$$\mathcal{L}^{\mathcal{T}}_{\mathrm{SM}}(\phi^*) \le \mathcal{L}_{\mathrm{NCE}}(\phi^*) + \mathcal{O}\!\left(Q_1\,\frac{\mathcal{R}_{\mathcal{H}}(\lambda)}{N} + \sqrt{\frac{Q_2}{N}}\right), \tag{14}$$

where $Q_1 = \sqrt{1 + 1/n}$, $Q_2 = \log(1/\delta)\,\log^2(n)$, and $\mathcal{R}_{\mathcal{H}}(\lambda)$ is monotonically decreasing w.r.t. $\lambda$. (Specifically, the Rademacher complexity is $\mathcal{R}_{\mathcal{H}}(\lambda) = \mathbb{E}_{\sigma \sim \{\pm 1\}^{3dN}}\big[\sup_{\phi \in \mathcal{H}(\lambda)} \langle \sigma, \boldsymbol{f} \rangle\big]$, in which the restricted hypothesis space is $\mathcal{H}(\lambda) = \{\phi \mid \phi \in \mathcal{H} \ \text{and} \ R_1(\phi) \le 4/\lambda\}$.)

We observe that the error bound in Eq. (14) gradually decreases as the training sample size $N$ increases, which is consistent with traditional supervised learning methods (Niu et al., 2016). Moreover, the negative-pair size $n$ in the error term $\sqrt{Q_2/N}$ is negligible for a large sample size $N$; in this case, a relatively large negative-pair size $n$ effectively reduces the first error term $Q_1\,\mathcal{R}_{\mathcal{H}}(\lambda)/N$, thereby tightening the error bound. This conclusion is also in line with the empirical observations of existing works (He et al., 2020; Kim et al., 2020). Finally, when we enlarge the regularization parameter $\lambda$, the Rademacher complexity $\mathcal{R}_{\mathcal{H}}(\lambda)$ decreases, further reducing the error bound and improving the generalizability of the contrastive learning algorithm.

## 5. Experimental Results

In this section, we show experimental results on both synthetic and real-world datasets to validate the effectiveness of our proposed method. In detail, we first give visualization results on synthetic data to demonstrate the efficacy of the DP regularizer. Then, we compare our proposed learning algorithm with existing state-of-the-art models on vision and language tasks. Finally, we test our method on the CL-based reinforcement learning task. The regularization parameter $\lambda$ of our method is fixed to 0.1. The thresholds $\delta_+$ and $\delta_-$ are fixed to 0.1 and 0.5, respectively. The hyper-parameters of the compared methods are set to the recommended values from their original papers.

Figure 3. Visualization results of the conventional CL method and our proposed LMCL method on the two toy datasets: (a) the Three-Bars dataset (classes 1–3), (b) the Nested-Moons dataset, (c)–(d) projections by the conventional CL, and (e)–(f) projections by our LMCL.

Table 1. K-means clustering accuracy rates (mean ± std) of baseline methods and our proposed method on the toy datasets.

| METHOD | Three-Bars | Nested-Moons |
|---|---|---|
| Euclidean Space | 75.2 ± 1.2 | 77.3 ± 2.3 |
| Conventional CL | 78.3 ± 2.2 | 77.5 ± 1.2 |
| LMCL (Ours) | 84.2 ± 0.2 | 85.2 ± 2.3 |

### 5.1. Experiments on Synthetic Data

We first consider learning a linear embedding $\phi(x) = \boldsymbol{P}x$ on two-dimensional synthetic data, where the matrix $\boldsymbol{P} \in \mathbb{R}^{2 \times 2}$ is the learning parameter. Here we employ the Three-Bars and Nested-Moons datasets (Chen et al., 2018) to evaluate the performance of the conventional CL algorithm and our proposed LMCL algorithm. For each data point in the two datasets (see Fig. 3(a) and (b)), we build its data augmentation by adding Gaussian noise to the original data point. Then, we simply regard each data point and its augmentation as a positive pair, and sample every two data points as a negative pair. For these positive and negative pairs, we use the Adam optimizer (learning rate = 0.001) for both the conventional CL (i.e., Eq. (1)) and our proposed LMCL (i.e., Eq. (9) with $\lambda = 0.1$).
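A rough sketch of this toy setup is given below. The data are random stand-ins for the Three-Bars/Nested-Moons points, the temperature of 0.2 and the small random initialization of $\boldsymbol{P}$ are our own choices to keep the sketch short and trainable (the paper's zero initialization is described next), and only $\lambda$, $\delta_+$, $\delta_-$, and the learning rate follow the stated settings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(200, 2)                     # stand-in for the 2-D toy points
P = 0.1 * torch.randn(2, 2)                 # linear embedding phi(x) = P x
P.requires_grad_(True)
opt = torch.optim.Adam([P], lr=1e-3)        # Adam with learning rate 0.001

for step in range(1000):
    X_aug = X + 0.05 * torch.randn_like(X)  # positive views via Gaussian-noise augmentation
    z = F.normalize(X @ P.t(), dim=1)       # phi(x) for the original points
    zp = F.normalize(X_aug @ P.t(), dim=1)  # phi(x+) for the augmented points
    logits = (z @ zp.t()) / 0.2             # diagonal: positive pairs; off-diagonal: negatives
    nce = F.cross_entropy(logits, torch.arange(len(X)))
    D = (1.0 - z @ z.t()) / 2.0             # normalized pairwise distances, Eq. (3)
    dp = torch.clamp((D - 0.1) * (D - 0.5), max=0.0).abs().mean()
    loss = nce + 0.1 * dp                   # LMCL objective of Eq. (9) with lambda = 0.1
    opt.zero_grad(); loss.backward(); opt.step()
```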
The projection matrices of the conventional method and of our method (i.e., $\boldsymbol{P}_{\mathrm{CL}}, \boldsymbol{P}_{\mathrm{LM}} \in \mathbb{R}^{2 \times 2}$) are both initialized to 0. After obtaining the learned matrices $\boldsymbol{P}_{\mathrm{CL}}$ and $\boldsymbol{P}_{\mathrm{LM}}$, we record the projected points $\boldsymbol{P}_{\mathrm{CL}}x$ and $\boldsymbol{P}_{\mathrm{LM}}x$ to visualize the distribution of the data points in the embedding space.

We can clearly observe that although the conventional CL algorithm finds a projection matrix $\boldsymbol{P}_{\mathrm{CL}}$ that roughly distinguishes each class of data points (as shown in Fig. 3(c) and (d)), it still yields many ambiguous points between each two classes in the embedding (projection) space. In comparison, when the DP regularizer is employed, our LMCL further improves the separability of the data points and successfully obtains unambiguous projected points between each two classes (Fig. 3(e) and (f)). Furthermore, the K-means (Bradley & Fayyad, 1998) clustering accuracies (mean ± std, 20 random trials) of the conventional CL and our LMCL are reported in Tab. 1, and we can observe that our LMCL consistently outperforms the conventional CL algorithm. We also perform a t-test at significance level 0.05, which indicates that our method is significantly better than the baseline method.

### 5.2. Experiments on Image Classification

In this subsection, we validate the effectiveness of our method on the image classification task. Here we select SimCLR (Chen et al., 2020a) and contrastive multiview coding (CMC) (Tian et al., 2020a) as baseline methods, and implement our LMCL under these two classical frameworks. We also compare our method with three additional state-of-the-art methods, namely debiased contrastive learning (DCL) (Chuang et al., 2020), hard-negative based contrastive learning (HCL) (Robinson et al., 2020), and the clustering-based method SwAV (Caron et al., 2020), on the STL-10 (Coates et al., 2011), CIFAR-10 (Krizhevsky et al., 2009), and ImageNet-100 (Russakovsky et al., 2015) datasets. All methods are fairly implemented with a ResNet-50 backbone and the same number of training epochs (100).

Figure 4. Classification accuracy (%) of all compared methods on (a) the STL-10 dataset and (b) the CIFAR-10 dataset, with the negative sample size varying from 32 to 512. Compared methods: SimCLR, DCL, HCL, LMCL (SimCLR+DP), LMCL (DCL+DP), and LMCL (HCL+DP).

For the STL-10 and CIFAR-10 datasets, we record the classification accuracy of all compared methods with varying numbers of negative samples. From Fig. 4, we can clearly observe that our LMCL (SimCLR+DP) improves the baseline by at least 1% on the CIFAR-10 dataset and 2% on the STL-10 dataset.

Table 2. Classification accuracy (%) of all methods on the ImageNet-100 dataset with negative sample sizes 1024 and 4096.

| METHOD | 1024 Top-1 | 1024 Top-5 | 4096 Top-1 | 4096 Top-5 |
|---|---|---|---|---|
| CMC | 60.23 | 79.23 | 73.58 | 92.06 |
| SwAV | 60.93 | 79.43 | 75.78 | 92.86 |
| DCL | 61.01 | 78.99 | 74.60 | 92.08 |
| HCL | 60.89 | 79.33 | 74.66 | 92.32 |
| LMCL (CMC+DP) | 61.23 | 79.44 | 75.67 | 93.02 |
| LMCL (DCL+DP) | 61.12 | 79.20 | 75.89 | 92.89 |
| LMCL (HCL+DP) | 60.92 | 79.43 | 74.94 | 92.39 |
Similar experiments are conducted on the ImageNet-100 dataset, and Tab. 2 shows that our method improves the baseline method CMC from 73.58% to 75.88%. For different negative sample sizes, the accuracy rates of our method are competitive with or superior to the compared methods DCL and HCL, which clearly demonstrates the effectiveness of our method. Furthermore, our regularizer can also be incorporated into the two existing methods (i.e., DCL+DP and HCL+DP) to achieve improved recognition accuracy. Therefore, our method has good compatibility with existing CL algorithms on the image classification task.

**Parametric Sensitivity.** Here we further investigate the parametric sensitivities of $\lambda$ and $\tau$. Specifically, we vary $\lambda$ in [0.01, 5] and $\tau$ in [0.1, 0.4], and record the classification accuracy of our method on the STL-10 dataset (batch size = 256). Tab. 3 shows that the accuracy variation of our method is smaller than 1.5 percentage points, so the hyper-parameters of our method can be easily tuned in practical use.

Table 3. Parametric sensitivities of $\lambda$ and $\tau$. Here $\lambda$ and $\tau$ are varied in [0.01, 5] and [0.1, 0.4], respectively.

| $\lambda$ \ $\tau$ | 0.1 | 0.2 | 0.25 | 0.3 | 0.4 |
|---|---|---|---|---|---|
| 0.01 | 80.4 | 81.3 | 81.2 | 81.2 | 80.8 |
| 0.1 | 81.5 | 81.9 | 81.7 | 81.8 | 81.9 |
| 0.5 | 81.6 | 81.6 | 80.7 | 81.7 | 81.9 |
| 5 | 80.9 | 81.9 | 80.9 | 80.6 | 80.5 |

### 5.3. Experiments on Sentence Representation

In this subsection, we employ the BookCorpus dataset (Kiros et al., 2015) to evaluate the performance of all compared methods on six text classification tasks, including movie review sentiment (MR), product reviews (CR), subjectivity classification (SUBJ), opinion polarity (MPQA), question type classification (TREC), and paraphrase identification (MSRP). We follow the experimental settings of the baseline method quick-thought (QT) (Logeswaran & Lee, 2018), which chooses neighboring sentences as positive pairs. Here 10-fold cross-validation is adopted, and the average classification accuracy is listed in Tab. 4.

Table 4. Classification accuracy (%) of all methods on the BookCorpus dataset across six text classification tasks.

| METHOD | MR | CR | SUBJ | MPQA | TREC | MSRP |
|---|---|---|---|---|---|---|
| QT | 76.8 | 81.3 | 86.6 | 93.4 | 89.8 | 73.6 |
| DCL | 76.2 | 82.9 | 86.9 | 93.7 | 89.1 | 74.7 |
| HCL | 77.4 | 83.6 | 86.8 | 93.4 | 88.7 | 73.5 |
| LMCL (QT+DP) | 77.3 | 82.3 | 86.9 | 93.7 | 90.2 | 74.1 |
| LMCL (DCL+DP) | 77.2 | 83.7 | 87.2 | 93.8 | 90.1 | 75.1 |
| LMCL (HCL+DP) | 78.1 | 83.5 | 87.2 | 94.0 | 89.1 | 74.2 |

For the six classification tasks, our method improves the classification accuracy of the baseline method QT by at least one percentage point on most benchmarks. The distance histograms of QT, DCL, and our LMCL are shown in Fig. 5. We clearly observe that our method obtains more accurate distance determination than the baseline methods, which reveals that our method is effective for the text classification task.

Figure 5. Distance histograms of correct and incorrect pairs obtained by different methods (QT, DCL, and our proposed LMCL) on the BookCorpus dataset.

### 5.4. Experiments on Reinforcement Learning

This subsection further extends our experiments to the reinforcement learning task, which is another application scenario of contrastive learning. Here the contrastive unsupervised representations for reinforcement learning (CURL) method (Laskin et al., 2020) is employed to perform image-based policy control on the representation learned by the CL algorithm. All methods are tested on the DeepMind control suite (Tassa et al., 2018), which consists of the six control tasks listed in Tab. 5.

Table 5. 100K scores (mean ± std, 3 random trials) achieved by all methods on the six control tasks.

| METHOD | Spin | Swingup | Easy | Run | Walk | Catch |
|---|---|---|---|---|---|---|
| CURL | 413 ± 53 | 680 ± 32 | 908 ± 86 | 298 ± 38 | 621 ± 121 | 826 ± 42 |
| DCL | 422 ± 23 | 672 ± 52 | 878 ± 96 | 248 ± 98 | 626 ± 98 | 836 ± 12 |
| HCL | 420 ± 61 | 678 ± 82 | 869 ± 116 | 268 ± 42 | 623 ± 26 | 819 ± 62 |
| LMCL (CURL+DP) | 423 ± 63 | 682 ± 13 | 926 ± 73 | 296 ± 32 | 625 ± 53 | 842 ± 27 |
| LMCL (DCL+DP) | 423 ± 33 | 683 ± 93 | 909 ± 87 | 287 ± 67 | 625 ± 93 | 843 ± 37 |
| LMCL (HCL+DP) | 421 ± 51 | 681 ± 83 | 910 ± 95 | 292 ± 78 | 626 ± 89 | 832 ± 83 |
Following the experimental settings in CURL, the positive pair is built by simply cropping a single image, and the negative pairs are composed of every two images in the control sequence. All methods are retrained 3 times, and the corresponding means and standard deviations of the 100K scores are shown in Tab. 5. For the six control tasks, our method consistently outperforms the baseline method CURL with higher means. When compared to the DCL and HCL methods, our method achieves better results in most cases. Although our LMCL (CURL+DP) has slightly lower scores than DCL or HCL on the last two control tasks, our method shows smaller variance. Moreover, when we incorporate our DP regularizer into DCL and HCL, our method further improves the overall scores of the compared methods on the six tasks. This also reveals that our method is compatible with existing CL algorithms on the reinforcement learning task.

## 6. Conclusion

In this paper, we first revealed that existing CL algorithms fail to maintain a margin region in the distance space to discriminate semantically similar and dissimilar data pairs. To overcome this issue, we proposed a distance polarization (DP) regularizer, which encourages polarized distances and thus obtains a large margin in the distance space in an unsupervised way. To the best of our knowledge, this is the first work in CL that considers introducing a margin region in the distance space. We conducted intensive theoretical analyses to guarantee the effectiveness of our method. Visualization experiments on toy data and comparison experiments on real-world datasets across multiple domains indicate that our learning algorithm acquires a more reliable feature embedding than state-of-the-art methods. Considering the uncertainty of similarity determination in the distance polarization would be interesting future work.

## Acknowledgments

SC, GN, and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. MS was also supported by the Institute for AI and Beyond, UTokyo. CG, JL, and JY were supported by NSFC 62072242, 61973162, 61836014, U19B2034, and U1713208, Program for Changjiang Scholars, China Postdoctoral Science Foundation (No: 2020M681606), the Fundamental Research Funds for the Central Universities (No: 30920032202), and CCF-Tencent Open Fund (No: RAGR20200101).

## References

Arpit, D., Zhou, Y., Ngo, H., and Govindaraju, V. Why regularized auto-encoders learn sparse representation? In International Conference on Machine Learning (ICML), pp. 136–144, 2016.

Bradley, P. S. and Fayyad, U. M. Refining initial points for k-means clustering. In International Conference on Machine Learning (ICML), volume 98, pp. 91–99, 1998.

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1401–1413, 2020.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6571–6583, 2018.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pp. 1597–1607, 2020a.
Chen, X., Chen, W., Chen, T., Yuan, Y., Gong, C., Chen, K., and Wang, Z. Self-PU: Self boosted and calibrated positive-unlabeled training. In International Conference on Machine Learning (ICML), pp. 1510–1519, 2020b.

Chu, X., Lin, Y., Wang, Y., Wang, X., Yu, H., Gao, X., and Tong, Q. Distance metric learning with joint representation diversification. In International Conference on Machine Learning (ICML), pp. 1962–1973, 2020.

Chuang, C.-Y., Robinson, J., Yen-Chen, L., Torralba, A., and Jegelka, S. Debiased contrastive learning. Advances in Neural Information Processing Systems (NeurIPS), 33, 2020.

Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 215–223, 2011.

Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic metric learning. In International Conference on Machine Learning (ICML), pp. 209–216, 2007.

Dong, W., Shi, G., Li, X., Ma, Y., and Huang, F. Compressive sensing via nonlocal low-rank regularization. IEEE Transactions on Image Processing, 23(8):3618–3632, 2014.

Guo, Z.-C., Shi, L., and Wu, Q. Learning theory of distributed regression with bias corrected regularization kernel network. The Journal of Machine Learning Research, 18(1):4237–4261, 2017.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 297–304, 2010.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738, 2020.

Huang, F., Chen, S., and Huang, H. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In International Conference on Machine Learning (ICML), pp. 2839–2848, 2019.

Huang, F., Gao, S., Pei, J., and Huang, H. Accelerated zeroth-order momentum methods from mini to minimax optimization. arXiv preprint arXiv:2008.08170, 2020.

Huynh, T., Kornblith, S., Walter, M. R., Maire, M., and Khademi, M. Boosting contrastive self-supervised learning with false negative cancellation. arXiv preprint arXiv:2011.11765, 2020.

Jing, L. and Tian, Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Kim, M., Tack, J., and Hwang, S. J. Adversarial self-supervised contrastive learning. Advances in Neural Information Processing Systems (NeurIPS), 33, 2020.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. Advances in Neural Information Processing Systems (NeurIPS), 28:3294–3302, 2015.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems (NeurIPS), pp. 950–957, 1992.

Laskin, M., Srinivas, A., and Abbeel, P. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning (ICML), pp. 5639–5650, 2020.
Li, J., Zhou, P., Xiong, C., Socher, R., and Hoi, S. C. Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966, 2020.

Liu, G., Lin, Z., and Yu, Y. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning (ICML), pp. 663–670, 2010.

Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. In International Conference on Learning Representations (ICLR), 2018.

Niu, G., du Plessis, M. C., Sakai, T., Ma, Y., and Sugiyama, M. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1199–1207, 2016.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations (ICLR), 2018.

Robinson, J., Chuang, C.-Y., Sra, S., and Jegelka, S. Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592, 2020.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Saunshi, N., Plevrakis, O., Arora, S., Khodak, M., and Khandeparkar, H. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning (ICML), pp. 5628–5637, 2019.

Scholkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series, 2018.

Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. Advances in Neural Information Processing Systems (NeurIPS), 29:1857–1865, 2016.

Song, J. and Ermon, S. Multi-label contrastive predictive coding. Advances in Neural Information Processing Systems (NeurIPS), 33, 2020.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. In European Conference on Computer Vision (ECCV), pp. 1–18, 2020a.

Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and Isola, P. What makes for good views for contrastive learning. Advances in Neural Information Processing Systems (NeurIPS), 33, 2020b.

Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3733–3742, 2018.

Xie, P., Deng, Y., Zhou, Y., Kumar, A., Yu, Y., Zou, J., and Xing, E. P. Learning latent space models with angular constraints. In International Conference on Machine Learning (ICML), pp. 3799–3810, 2017.

Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems (NeurIPS), volume 15, pp. 12, 2002.

Yang, Y., Shen, H. T., Ma, Z., Huang, Z., and Zhou, X. L2,1-norm regularized discriminative feature selection for unsupervised learning. In International Joint Conference on Artificial Intelligence (IJCAI), 2011.
Yu, B. and Tao, D. Deep metric learning with tuplet margin loss. In IEEE International Conference on Computer Vision (ICCV), pp. 6490–6499, 2019.

Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems (NeurIPS), pp. 9793–9803, 2018.

Zhang, Z. and Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in Neural Information Processing Systems (NeurIPS), 31:8778–8788, 2018.

Zhong, H., Chen, C., Jin, Z., and Hua, X.-S. Deep robust clustering by contrastive learning. arXiv preprint arXiv:2008.03030, 2020.