# Unbiased Supervised Contrastive Learning

Published as a conference paper at ICLR 2023

Carlo Alberto Barbano (University of Turin; LTCI, Télécom Paris, IP Paris), Benoit Dufumier (LTCI, Télécom Paris, IP Paris), Enzo Tartaglione (LTCI, Télécom Paris, IP Paris), Marco Grangetto (University of Turin), Pietro Gori (LTCI, Télécom Paris, IP Paris)

ABSTRACT

Many datasets are biased, namely they contain easy-to-learn features that are highly correlated with the target class only in the dataset but not in the true underlying distribution of the data. For this reason, learning unbiased models from biased data has become a very relevant research topic in recent years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that allows us to clarify why recent contrastive losses (InfoNCE, SupCon, etc.) can fail when dealing with biased data. Based on that, we derive a novel formulation of the supervised contrastive loss (ϵ-SupInfoNCE), providing more accurate control of the minimal distance between positive and negative samples. Furthermore, thanks to our theoretical framework, we also propose FairKL, a new debiasing regularization loss that works well even with extremely biased data. We validate the proposed losses on standard vision datasets including CIFAR10, CIFAR100, and ImageNet, and we assess the debiasing capability of FairKL with ϵ-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real instances of biases "in the wild".

1 INTRODUCTION

Deep learning models have become the predominant tool for learning representations suited for a variety of tasks. Arguably, the most common setup for training deep neural networks in supervised classification tasks consists of minimizing the cross-entropy loss. Cross-entropy drives the model towards learning the correct label distribution for a given sample.
However, it has been shown in many works that this loss can be affected by biases in the data (Alvi et al., 2018; Kim et al., 2019; Nam et al., 2020; Sagawa et al., 2019; Tartaglione et al., 2021; Torralba et al., 2011) or suffer from noise and corruption in the labels (Elsayed et al., 2018; Graf et al., 2021). In fact, in recent years, it has become increasingly evident that neural networks tend to rely on simple patterns in the data (Geirhos et al., 2019; Li et al., 2021). As deep neural networks grow in size and complexity, guaranteeing that they do not learn spurious elements in the training set is becoming a pressing issue. It is indeed a known fact that most of the commonly used datasets are biased (Torralba et al., 2011) and that this affects the learned models (Tommasi et al., 2017). In particular, when the biases correlate very well with the target task, it is hard to obtain predictions that are independent of the biases. This can happen, e.g., in the presence of selection biases in the data. Furthermore, if the bias is easy to learn (e.g., a simple pattern or color), we will most likely obtain a biased model, whose predictions rely mostly on these spurious attributes and not on the true, generalizable, and discriminative features.

Learning fair and robust representations of the underlying samples, especially when dealing with highly biased data, is the main objective of this work. Contrastive learning has recently gained attention for this purpose, showing superior robustness to cross-entropy (Graf et al., 2021). For this reason, in this work, we adopt a metric learning approach for supervised representation learning. Based on that, we provide a unified framework to analyze and compare existing formulations of contrastive losses¹ such as the InfoNCE loss (Chen et al., 2020; Oord et al., 2019), the InfoL1O loss (Poole et al., 2019) and the SupCon loss (Khosla et al., 2020). Furthermore, we also propose a new supervised contrastive loss that can be seen as the simplest extension of the InfoNCE loss (Chen et al., 2020; Oord et al., 2019) to a supervised setting with multiple positives. Using the proposed metric learning approach, we can reformulate each loss as a set of contrastive, and surprisingly sometimes even non-contrastive, conditions. We show that the widely used SupCon loss is not a straightforward extension of the InfoNCE loss, since it actually contains a set of latent non-contrastive constraints. Our analysis results in an in-depth understanding of the different loss functions, fully explaining their behavior from a metric point of view.

Corresponding author: carlo.barbano@unito.it

¹ We refer to any contrastive loss and not necessarily to losses based on pairs of samples as in (Sohn, 2016).

Figure 1: With ϵ-SupInfoNCE (a) we aim at increasing the minimal margin ϵ between the distance $d^+$ of a positive sample $x^+$ (+ symbol inside) from an anchor $x$ and the distance $d^-$ of the closest negative sample $x^-$ (− symbol inside). By increasing the margin, we can achieve a better separation between positive and negative samples. We show two different scenarios without margin (b) and with margin (c). Filling colors of datapoints represent different biases. We observe that, without imposing a margin, biased clusters containing both positive and negative samples might appear (b). This issue can be mitigated by increasing the ϵ margin (c).

Furthermore, by leveraging the proposed metric learning approach, we explore the issue of biased learning. We outline the limitations of the studied contrastive loss functions when dealing with biased data, even if the loss on the training set is apparently minimized. By analyzing such cases, we provide a more formal characterization of bias. This eventually allows us to derive a new set of regularization constraints for debiasing that is general and can be added to any contrastive or non-contrastive loss.
Our contributions are summarized below:

1. We introduce a simple but powerful theoretical framework for supervised representation learning, from which we derive different contrastive loss functions. We show how existing contrastive losses can be expressed within our framework, providing a uniform understanding of the different formulations. We derive a generalized form of the SupCon loss (ϵ-SupCon), propose a novel loss, ϵ-SupInfoNCE, and demonstrate empirically its effectiveness;
2. We provide a more formal definition of bias, thanks to the proposed metric learning approach, which is based on the distances among representations. This allows us to derive a new set of effective debiasing regularization constraints, which we call FairKL. We also analyze, theoretically and empirically, the debiasing power of the different contrastive losses, comparing ϵ-SupInfoNCE and SupCon.

2 RELATED WORKS

Our work is related to the literature in contrastive learning, metric learning, fairness, and debiasing.

Contrastive Learning. Many different contrastive losses and frameworks have been proposed (Chen et al., 2020; Khosla et al., 2020; Oord et al., 2019; Poole et al., 2019). Supervised contrastive learning approaches aim at pulling representations of the same class close together while pushing representations of different classes apart. It has been shown that, in a supervised setting, this kind of optimization can yield better results than standard cross-entropy, and can be more robust against label corruption (Khosla et al., 2020; Graf et al., 2021). Related to contrastive learning, we can also find methods such as triplet losses (Chopra et al., 2005; Hermans et al., 2017) and methods based on distance metrics (Schroff et al., 2015; Weinberger et al., 2006). The latter are the most relevant for this work, as we propose a metric learning approach for supervised representation learning.
Debiasing. Addressing the issue of biased data and how it affects generalization in neural networks has been the subject of numerous works. Some approaches in this direction include the use of different data sources in order to mitigate biases (Gupta et al., 2018) and data clean-up through the use of a GAN (Sattigeri et al., 2018; Xu et al., 2018). However, they share some major limitations due to the complexity of working directly on the data. In the debiasing literature, we most often find approaches based on ensembling or adversarial setups, and regularization terms that aim at obtaining an unbiased model using biased data. The typical adversarial approach is represented by BlindEye (Alvi et al., 2018), which employs an explicit bias classifier, trained adversarially on the same representation space as the target classifier, forcing the encoder to extract unbiased representations. This is also similar to Xie et al. (2017). Kim et al. (2019) use adversarial learning and gradient inversion to reach the same goal. Wang et al. (2019b) adopt an adversarial approach to remove unwanted features from intermediate representations of a neural network. All of these works share the limitations of adversarial training, which is well known for its potential training instability. Ensembling approaches can be found in Clark et al. (2019) and Wang et al. (2020), where feature independence among the different models is promoted. Bahng et al. (2020) propose ReBias, which promotes independence between biased and unbiased representations with a min-max optimization. In Lee et al. (2021), disentanglement between bias features and target features is maximized in order to perform augmentation in the latent space. Nam et al. (2020) propose LfF, where a bias-capturing model is trained with a focus on easier (bias-aligned) samples, while a debiased network is trained by giving more importance to the samples that the bias-capturing model struggles to discriminate.
Another approach is proposed in Wang et al. (2019a) with HEX, where a neural-network-based gray-level co-occurrence matrix (Haralick et al., 1973; Lam, 1996) is employed to learn representations that are invariant to some bias. However, all these methods require training additional models, which in practice can be resource- and time-consuming. Obtaining representations that are robust and/or invariant to some secondary attribute can also be achieved by applying constraints and regularization to the model. Using regularization terms for debiasing has gained traction also due to the typically lower complexity when compared to methods such as ensembling. For example, recent works attempt to discourage the learning of certain features with the aim of data privacy (Barbano et al., 2021; Song et al., 2017) and fairness (Beutel et al., 2019). Sagawa et al. (2019) propose Group-DRO, which aims at improving the model performance on the worst group, defined based on prior knowledge of the bias distribution. In RUBi (Cadene et al., 2019), logits re-weighting is used to promote the independence of the predictions from the bias features. Tartaglione et al. (2021) propose EnD, a regularization term that aims at bringing representations of positive samples closer together when they have different biases, and at pulling apart representations of negative samples sharing the same bias attributes. A similar method is presented in (Hong & Yang, 2021), where a contrastive formulation is employed to reach a similar goal. Our method belongs to this latter class of approaches, as it consists of a regularization term which can be optimized during training.

3 CONTRASTIVE LEARNING: AN ϵ-MARGIN POINT OF VIEW

Let $x \in \mathcal{X}$ be an original sample (i.e., anchor), $x_i^+$ a similar (positive) sample, $x_j^-$ a dissimilar (negative) sample, and $P$ and $N$ the number of positive and negative samples, respectively.
Contrastive learning methods look for a parametric mapping function $f : \mathcal{X} \to \mathbb{S}^{d-1}$ that maps semantically similar samples close together in the representation space (a $(d-1)$-sphere) and dissimilar samples far away from each other. Once pre-trained, $f$ is fixed and its representation is evaluated on a downstream task, such as classification, through linear evaluation on a test set. In general, positive samples $x_i^+$ can be defined in different ways depending on the problem: using transformations of $x$ (unsupervised setting), samples belonging to the same class as $x$ (supervised), or samples with image attributes similar to those of $x$ (weakly supervised). The definition of negative samples $x_j^-$ varies accordingly. Here, we focus on the supervised case, where samples belong to the same or to a different class, but the proposed framework could be easily applied to the other cases. We define $s(f(a), f(b))$ as a similarity measure (e.g., cosine similarity) between the representations of two samples $a$ and $b$. Please note that since $\|f(a)\|_2 = \|f(b)\|_2 = 1$, using a cosine similarity is equivalent to using an L2-distance ($d(f(a), f(b)) = \|f(a) - f(b)\|_2^2$). Similarly to Chopra et al. (2005); Hadsell et al. (2006); Schroff et al. (2015); Sohn (2016); Wang et al. (2014; 2019c); Weinberger et al. (2006); Yu & Tao (2019), we propose to use a metric learning approach which allows us to better formalize recent contrastive losses, such as InfoNCE (Chen et al., 2020; Oord et al., 2019), InfoL1O (Poole et al., 2019) and SupCon (Khosla et al., 2020), and to derive new losses that better approximate the mutual information and can take into account data biases.
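The equivalence between cosine similarity and squared L2 distance on the unit sphere follows from $\|a - b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 - 2\,a^\top b = 2 - 2\,s(a, b)$. The sketch below checks this identity numerically; it is only an illustration, and the helper names (`normalize`, `cosine_similarity`, `squared_l2`) and the example vectors are assumptions that do not come from the paper.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere (L2 norm equal to 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Dot product; this equals the cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two arbitrary example embeddings (hypothetical values), normalized as f(a), f(b).
a = normalize([1.0, 2.0, 3.0])
b = normalize([-2.0, 0.5, 1.0])

s = cosine_similarity(a, b)
d = squared_l2(a, b)

# On the unit sphere, d = 2 - 2 s, so a larger similarity means a smaller distance.
gap = abs(d - (2.0 - 2.0 * s))
print(f"s = {s:.4f}, d = {d:.4f}, |d - (2 - 2s)| = {gap:.2e}")
```

Because the map from $s$ to $d$ is affine and decreasing, any ordering of samples by similarity is the reverse of their ordering by distance, which is why the conditions in this section can be stated in either form.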
Using an ϵ-margin metric learning point of view, probably the simplest contrastive learning formulation looks for a mapping function $f$ such that the following ϵ-condition is always satisfied:

$$\underbrace{d(f(x), f(x^+))}_{d^+} - \underbrace{d(f(x), f(x_j^-))}_{d_j^-} < -\epsilon \iff \underbrace{s(f(x), f(x_j^-))}_{s_j^-} - \underbrace{s(f(x), f(x^+))}_{s^+} \le -\epsilon \quad \forall j \tag{1}$$

where $\epsilon \ge 0$ is a margin between positive and negative samples and we consider, for now, a single positive sample.

Derivation of InfoNCE. The constraint of Eq. 1 can be transformed into an optimization problem using, as is common in contrastive learning, the max operator and its smooth approximation LogSumExp (full derivation in Appendix A.1.1):

$$\arg\min_f \max\left(-\epsilon, \{s_j^- - s^+\}_{j=1,\dots,N}\right) \approx \arg\min_f \underbrace{-\log\left(\frac{\exp(s^+)}{\exp(s^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-InfoNCE}} \tag{2}$$

Here, we can notice that when $\epsilon = 0$ we retrieve the InfoNCE loss, also known as N-Pair loss (Sohn, 2016), whereas when $\epsilon \to \infty$ we obtain the InfoL1O loss. It has been shown in Poole et al. (2019) that these two losses are a lower and an upper bound of the mutual information $I(X^+, X)$, respectively:

$$\underbrace{\log\frac{\exp s^+}{\exp s^+ + \sum_j \exp s_j^-}}_{\text{InfoNCE}} \le I(X^+, X) \le \underbrace{\log\frac{\exp s^+}{\sum_j \exp s_j^-}}_{\text{InfoL1O}} \tag{3}$$

By using a value of $\epsilon \in [0, \infty)$, one might find a tighter approximation of $I(X^+, X)$, since the exponential function in the denominator, $\exp(-\epsilon)$, monotonically decreases as $\epsilon$ increases.

Proposed supervised loss (ϵ-SupInfoNCE). The inclusion of multiple positive samples ($s_i^+$) can lead to different formulations. Some of them can be found in Appendix A.1.2. Here, considering a supervised setting, we propose to use the following one, which we call ϵ-SupInfoNCE:

$$s_j^- - s_i^+ \le -\epsilon \quad \forall i, j$$

$$\sum_i \max\left(-\epsilon, \{s_j^- - s_i^+\}_{j=1,\dots,N}\right) \approx \underbrace{-\sum_i \log\left(\frac{\exp(s_i^+)}{\exp(s_i^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-SupInfoNCE}} \tag{4}$$

Please note that this loss could also be used in other settings, such as an unsupervised one, where positive samples could be defined as transformations of the anchor. Furthermore, even here, the ϵ value can be adjusted in the loss function in order to increase the ϵ-margin.
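To make the role of ϵ concrete, the following sketch evaluates the ϵ-SupInfoNCE expression of Eq. 4 for a single anchor, written in its LogSumExp form. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the similarity values are hypothetical, similarities are assumed to be already scaled by any temperature, and plain Python lists stand in for batched tensors.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))), the smooth approximation of max."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def eps_sup_infonce(pos_sims, neg_sims, eps=0.0):
    """Eq. 4 for one anchor: -sum_i log(exp(s_i+) / (exp(s_i+ - eps) + sum_j exp(s_j-))).

    Each term is computed as LogSumExp over {-eps} and {s_j- - s_i+},
    which is algebraically the same quantity.
    """
    total = 0.0
    for s_pos in pos_sims:
        total += logsumexp([-eps] + [s_neg - s_pos for s_neg in neg_sims])
    return total

def infonce(s_pos, neg_sims):
    """Standard single-positive InfoNCE, used here only as a reference."""
    denom = math.exp(s_pos) + sum(math.exp(s) for s in neg_sims)
    return -math.log(math.exp(s_pos) / denom)

# Hypothetical similarities between an anchor and its positives / negatives.
positives = [0.9, 0.7]
negatives = [0.2, -0.1, 0.4]

for eps in (0.0, 0.25, 1.0):
    print(f"eps = {eps}: loss = {eps_sup_infonce(positives, negatives, eps):.4f}")
```

With a single positive and `eps=0.0` the function coincides with `infonce`, and for very large `eps` each term approaches the InfoL1O form, matching the two limits discussed for Eq. 2.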
Unlike Eq. 2, where $\epsilon = 0$ recovers InfoNCE, setting $\epsilon = 0$ in Eq. 4 does not yield the SupCon loss.

Derivation of ϵ-SupCon (generalized SupCon). It is interesting to notice that Eq. 4 is similar to $\mathcal{L}^{sup}_{out}$, which is one of the two SupCon losses proposed in Khosla et al. (2020), but they differ by a sum over the positive samples in the denominator. The $\mathcal{L}^{sup}_{out}$ loss, presented as the most straightforward way to generalize the InfoNCE loss, actually contains another non-contrastive constraint on the positive samples: $s_t^+ - s_i^+ \le 0 \ \forall i, t$. Fulfilling this condition alone would force all positive samples to collapse to a single point in the representation space. However, it does not take into account negative samples. That is why we define it as a non-contrastive condition. Considering both contrastive and non-contrastive conditions, we obtain:

$$s_j^- - s_i^+ \le -\epsilon \quad \forall i, j \qquad \text{and} \qquad s_t^+ - s_i^+ \le 0 \quad \forall i, t \ne i$$

$$\frac{1}{P}\sum_i \max\left(0, \{s_j^- - s_i^+ + \epsilon\}_j, \{s_t^+ - s_i^+\}_{t \ne i}\right) \approx \underbrace{\epsilon - \frac{1}{P}\sum_i \log\left(\frac{\exp(s_i^+)}{\sum_t \exp(s_t^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-SupCon}} \tag{5}$$

When $\epsilon = 0$ we retrieve exactly $\mathcal{L}^{sup}_{out}$. The second loss proposed in Khosla et al. (2020), called $\mathcal{L}^{sup}_{in}$, minimizes a different contrastive problem, which is a less strict condition and probably explains why this loss did not work well in practice (Khosla et al., 2020):

$$\max_j(s_j^-) < \max_i(s_i^+) \approx \log\Big(\sum_j \exp(s_j^-)\Big) - \log\Big(\sum_i \exp(s_i^+)\Big) < 0 \tag{6}$$

$$\arg\min_f \max\left(0, \max_j(s_j^-) - \max_i(s_i^+)\right) \approx \underbrace{-\log\left(\sum_i \frac{\exp(s_i^+)}{\sum_t \exp(s_t^+) + \sum_j \exp(s_j^-)}\right)}_{\mathcal{L}^{sup}_{in}} \tag{7}$$

It is easy to see that, differently from Eq. 4 and $\mathcal{L}^{sup}_{out}$, this condition is fulfilled when just one positive sample is more similar to the anchor than all negative samples. Similarly, another contrastive condition that should be avoided is:

$$\sum_j s(f(x), f(x_j^-)) - \sum_i s(f(x), f(x_i^+)) < -\epsilon \tag{8}$$

since one would need only one (or a few) negative samples far away from the anchor in the representation space (i.e., orthogonal) to fulfil the condition.
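The relation between the ϵ-SupCon closed form and its underlying conditions can be checked numerically. The sketch below computes Eq. 5 in two ways (closed form, and LogSumExp over the contrastive and non-contrastive terms) and also contrasts the weaker $\mathcal{L}^{sup}_{in}$ condition of Eq. 6 with the pairwise ϵ-condition of Eq. 4. It is an illustration only: all function names and similarity values are assumptions, not part of the paper or its code.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def eps_supcon(pos_sims, neg_sims, eps=0.0):
    """Closed form of Eq. 5 for one anchor."""
    denom = sum(math.exp(s - eps) for s in pos_sims) + sum(math.exp(s) for s in neg_sims)
    mean_log = sum(math.log(math.exp(s) / denom) for s in pos_sims) / len(pos_sims)
    return eps - mean_log

def eps_supcon_from_conditions(pos_sims, neg_sims, eps=0.0):
    """Same quantity as a smooth max over {0}, {s_j- - s_i+ + eps} and {s_t+ - s_i+, t != i}."""
    total = 0.0
    for i, s_i in enumerate(pos_sims):
        terms = [0.0]
        terms += [s_j - s_i + eps for s_j in neg_sims]                    # contrastive terms
        terms += [s_t - s_i for t, s_t in enumerate(pos_sims) if t != i]  # non-contrastive terms
        total += logsumexp(terms)
    return total / len(pos_sims)

def eps_condition(pos_sims, neg_sims, eps=0.0):
    """Pairwise condition of Eq. 4: every positive beats every negative by at least eps."""
    return all(s_j - s_i <= -eps for s_i in pos_sims for s_j in neg_sims)

def sup_in_condition(pos_sims, neg_sims):
    """Condition of Eq. 6: only the best positive has to beat the best negative."""
    return max(neg_sims) < max(pos_sims)

# Hypothetical case: one positive is very close to the anchor, the other is not.
pos, neg = [0.9, 0.1], [0.5, 0.3]
print(sup_in_condition(pos, neg), eps_condition(pos, neg, 0.0))
```

In this example the $\mathcal{L}^{sup}_{in}$ condition holds while the pairwise condition does not, which is the weakness described in the text: a single well-placed positive is enough to satisfy Eq. 6.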
3.1 FAILURE CASE OF INFONCE: THE ISSUE OF BIASES

Satisfying the ϵ-condition (1) can generally guarantee good downstream performance; however, it does not take into account the presence of biases (e.g., selection biases). A model could therefore base its decisions on certain visual features, i.e., the bias, that are correlated with the target downstream task but do not actually characterize it. This means that a model relying on these bias features would probably perform worse if transferred to a different dataset (e.g., different acquisition settings or image quality). Specifically, in contrastive learning, this can lead to settings where we are still able to minimize any InfoNCE-based loss (e.g., SupCon or ϵ-SupInfoNCE), but with degraded classification performance (Fig. 1b). To tackle this issue, in this work, we propose the FairKL regularization technique, a set of debiasing constraints that prevent the use of the bias features within the proposed metric learning approach. In order to give a more in-depth explanation of the ϵ-InfoNCE failure case, we employ the notion of bias-aligned and bias-conflicting samples as in Nam et al. (2020). In our context, a bias-aligned sample shares the same bias attribute as the anchor, while a bias-conflicting sample does not. In this work, we assume that the bias attributes are either known a priori or that they can be estimated using a bias-capturing model, such as in Hong & Yang (2021).

Characterization of bias. We denote bias-aligned samples with $x^{\cdot,b}$ and bias-conflicting samples with $x^{\cdot,b'}$. Given an anchor $x$, if the bias is strong and easy to learn, a positive bias-aligned sample $x^{+,b}$ will probably be closer to the anchor $x$ in the representation space than a positive bias-conflicting sample $x^{+,b'}$ (of course, the same reasoning can be applied to the negative samples).
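The four groups used in the rest of this section (positive or negative, bias-aligned or bias-conflicting) depend only on how a sample's class and bias attribute compare with those of the anchor. A small sketch of this partition is given below; the function name `split_by_bias`, the tuple layout and the example labels are assumptions made for illustration, and bias labels are assumed to be known.

```python
def split_by_bias(anchor, samples):
    """Partition sample indices relative to an anchor.

    anchor:  (class_label, bias_label) of the anchor
    samples: list of (class_label, bias_label) for the other samples
    A sample is positive if it shares the anchor's class and bias-aligned
    if it shares the anchor's bias attribute.
    """
    groups = {
        "pos_aligned": [],
        "pos_conflicting": [],
        "neg_aligned": [],
        "neg_conflicting": [],
    }
    anchor_class, anchor_bias = anchor
    for idx, (label, bias) in enumerate(samples):
        kind = "pos" if label == anchor_class else "neg"
        alignment = "aligned" if bias == anchor_bias else "conflicting"
        groups[f"{kind}_{alignment}"].append(idx)
    return groups

# Hypothetical example: digit class with background colour as the bias attribute.
anchor = (3, "red")
samples = [(3, "red"), (3, "blue"), (5, "red"), (5, "green")]
groups = split_by_bias(anchor, samples)
print(groups)
```

In a strongly biased training set most positives fall into `pos_aligned`, so the few `pos_conflicting` samples have little weight in any loss that sums uniformly over positives.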
This is why, even when the ϵ-condition is satisfied and the ϵ-SupInfoNCE is minimized, we could still be able to distinguish between bias-aligned and bias-conflicting samples. Hence, we say that there is a bias if we can identify an ordering on the learned representations, such as:

$$\underbrace{d(f(x), f(x_i^{+,b}))}_{d_i^{+,b}} < \underbrace{d(f(x), f(x_k^{+,b'}))}_{d_k^{+,b'}} \le \underbrace{d(f(x), f(x_t^{-,b}))}_{d_t^{-,b}} - \epsilon < \underbrace{d(f(x), f(x_j^{-,b'}))}_{d_j^{-,b'}} - \epsilon \quad \forall i, k, t, j$$

This represents the worst-case scenario, where the ordering is total (i.e., it holds $\forall i, k, t, j$). Of course, there can also be cases in which the bias is not as strong, and the ordering may be partial.

FairKL regularization for debiasing. Ideally, we would enforce the conditions $d_k^{+,b'} - d_i^{+,b} = 0 \ \forall i, k$ and $d_t^{-,b'} - d_j^{-,b} = 0 \ \forall t, j$, meaning that every positive (resp. negative) bias-conflicting sample should have the same distance from the anchor as any other positive (resp. negative) bias-aligned sample. However, in practice, this condition is very strict, as it would enforce uniform distance among all positive (resp. negative) samples. A more relaxed condition would instead force the distributions of distances, $\{d_k^{\cdot,b'}\}$ and $\{d_i^{\cdot,b}\}$, to be similar. Here, we propose two new debiasing constraints for both positive and negative samples, using either the first moment (mean) of the distributions or the first two moments (mean and variance). Using only the average of the distributions, we obtain:

$$\frac{1}{P_a}\sum_i d_i^{+,b} - \frac{1}{P_c}\sum_k d_k^{+,b'} = 0 \iff \frac{1}{P_a}\sum_i s_i^{+,b} - \frac{1}{P_c}\sum_k s_k^{+,b'} = 0 \tag{9}$$

where $P_a$ and $P_c$ are the number of positive bias-aligned and bias-conflicting samples, respectively².
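The difference between constraining only the mean and constraining mean and variance can be seen on a toy example. The sketch below evaluates the first-moment gap of Eq. 9 on two hypothetical sets of anchor distances that share the same mean but have different spreads; the numbers and function names are assumptions chosen for illustration.

```python
def mean(values):
    """First moment of a list of distances."""
    return sum(values) / len(values)

def variance(values):
    """Second central moment (population variance) of a list of distances."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def first_moment_gap(d_aligned, d_conflicting):
    """Absolute violation of the mean-matching constraint of Eq. 9."""
    return abs(mean(d_aligned) - mean(d_conflicting))

# Hypothetical distances from an anchor to its positive samples.
d_aligned = [0.2, 0.3, 0.4]          # bias-aligned positives, tightly clustered
d_conflicting = [0.05, 0.3, 0.55]    # bias-conflicting positives, same mean, wider spread

print("mean gap:", first_moment_gap(d_aligned, d_conflicting))
print("variances:", variance(d_aligned), variance(d_conflicting))
```

Here the mean constraint is satisfied although the two groups are still distributed differently around the anchor, which is the motivation for also matching the second moment.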
Denoting the first moments with $\mu_{+,b} = \frac{1}{P_a}\sum_i d_i^{+,b}$, $\mu_{+,b'} = \frac{1}{P_c}\sum_k d_k^{+,b'}$, and the second moments of the distance distributions with $\sigma^2_{+,b} = \frac{1}{P_a}\sum_i (d_i^{+,b} - \mu_{+,b})^2$, $\sigma^2_{+,b'} = \frac{1}{P_c}\sum_k (d_k^{+,b'} - \mu_{+,b'})^2$, and making the hypothesis that the distance distributions follow a normal distribution, we can define a new set of debiasing constraints using, for example, the Kullback-Leibler divergence:

$$D_{KL}\left(\{d_i^{+,b}\} \,\|\, \{d_k^{+,b'}\}\right) = \frac{1}{2}\left[\frac{\sigma^2_{+,b} + (\mu_{+,b} - \mu_{+,b'})^2}{\sigma^2_{+,b'}} - \log\frac{\sigma^2_{+,b}}{\sigma^2_{+,b'}} - 1\right] \tag{10}$$

In practice, one could also use another distribution such as the log-normal, the Jeffreys divergence ($D_{KL}(p\|q) + D_{KL}(q\|p)$), or a simplified version, such as the difference of the two statistics (e.g., $(\mu_{+,b} - \mu_{+,b'})^2 + (\sigma_{+,b} - \sigma_{+,b'})^2$). The proposed debiasing constraints can be easily added to any contrastive loss using the method of Lagrange multipliers, as a regularization term $\mathcal{R}^{FairKL} = D_{KL}(\{d_i^{+,b}\} \,\|\, \{d_k^{+,b'}\})$. Thus, the final loss function that we propose to minimize is:

$$\arg\min_f \underbrace{-\alpha \sum_i \log\left(\frac{\exp(s_i^+)}{\exp(s_i^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-SupInfoNCE}} + \lambda \mathcal{R}^{FairKL} \tag{11}$$

where $\alpha$ and $\lambda$ are positive hyperparameters.

3.1.1 COMPARISON WITH OTHER DEBIASING METHODS

SupCon. It is interesting to notice that the non-contrastive conditions in Eq. 5, $s_t^+ - s_i^+ \le 0 \ \forall i, t \ne i$, are actually all fulfilled only when $s_i^+ = s_t^+ \ \forall i, t \ne i$. This means that one tries to align all positive samples, regardless of their bias $b$, to a single point in the representation space. In other terms, at the optimal solution, one would also fulfill the following conditions:

$$s_i^{+,b} = s_t^{+,b}, \quad s_i^{+,b'} = s_t^{+,b'}, \quad s_i^{+,b} = s_t^{+,b'}, \quad s_i^{+,b'} = s_t^{+,b} \quad \forall i, t \ne i \tag{12}$$

Realistically, this could lead to suboptimal solutions: we argue that the optimization process would mainly focus on the easier task, namely aligning bias-aligned samples, while neglecting the bias-conflicting ones. In highly biased settings, this could lead to worse performance than ϵ-SupInfoNCE. More empirical results supporting this hypothesis are presented in Appendix C.2.

EnD. The constraint in Eq. 9 is very similar to what was recently proposed in Tartaglione et al. (2021) with EnD. However, EnD lacks the additional constraint on the standard deviation of the distances, which is given by Eq. 10. An intuitive difference can be found in Fig. 3 and 4 of the Appendix. The constraints imposed by EnD (only first moments) can be fulfilled even if there is an effect of the bias features on the ordering of the positive samples. The use of constraints on the second moments, as in the proposed method, can remove the effect of the bias. An analytical comparison can be found in Appendix A.3.

BiasCon. In Hong & Yang (2021), the authors propose the BiasCon loss, which is similar to SupCon but only aligns positive bias-conflicting samples. It looks for an encoder $f$ that fulfills:

$$s_j^- - s_i^{+,b'} \le -\epsilon \quad \forall i, j \qquad \text{and} \qquad s_p^{+,b} - s_i^{+,b'} \le 0 \quad \forall i, p \qquad \text{and} \qquad s_t^{+,b'} - s_i^{+,b'} \le 0 \quad \forall i, t \ne i \tag{13}$$

The problem here is that we try to separate the negative samples from only the positive bias-conflicting samples, ignoring the positive bias-aligned samples. This is probably why the authors proposed to combine this loss with a standard cross-entropy.

² The same reasoning can be applied to negative samples (omitted for brevity).

4 EXPERIMENTS

In this section, we describe the experiments we perform to validate our proposed losses. We perform two sets of experiments. First, we benchmark our framework, presented in Sec. 3, on standard vision datasets such as CIFAR-10 (Krizhevsky et al., a), CIFAR-100 (Krizhevsky et al., b) and ImageNet-100 (Deng et al., 2009). Then, we analyze biased settings, employing Biased-MNIST (Bahng et al., 2020), Corrupted-CIFAR10 (Hendrycks & Dietterich, 2019), bFFHQ (Lee et al., 2021), 9-Class ImageNet (Ilyas et al., 2019) and ImageNet-A (Hendrycks et al., 2021). The code can be found at https://github.com/EIDOSLAB/unbiased-contrastive-learning.
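Independently of the official repository linked above, the Gaussian form of the FairKL regularizer in Eq. 10 and its combination with a contrastive term as in Eq. 11 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `tiny` stabilizer added to the variances, and the example distances are hypothetical, and only the positive samples of one anchor are considered.

```python
import math

def mean_and_variance(values):
    """Population mean and variance of a list of distances."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def fair_kl(d_aligned, d_conflicting, tiny=1e-12):
    """KL divergence between two Gaussians fitted to the distance sets (Eq. 10).

    `tiny` is an assumption of this sketch: it avoids a division by zero when a
    group has zero variance and is not part of the formula in the text.
    """
    mu_a, var_a = mean_and_variance(d_aligned)
    mu_c, var_c = mean_and_variance(d_conflicting)
    var_a += tiny
    var_c += tiny
    return 0.5 * ((var_a + (mu_a - mu_c) ** 2) / var_c - math.log(var_a / var_c) - 1.0)

def total_loss(contrastive_term, regularizer, alpha=1.0, lam=1.0):
    """Weighted sum of Eq. 11: alpha * contrastive loss + lambda * FairKL."""
    return alpha * contrastive_term + lam * regularizer

# Hypothetical anchor-to-positive distances with equal means but different spreads.
d_aligned = [0.2, 0.3, 0.4]
d_conflicting = [0.05, 0.3, 0.55]

reg = fair_kl(d_aligned, d_conflicting)
print(f"FairKL = {reg:.4f}, total = {total_loss(1.5, reg, alpha=1.0, lam=0.5):.4f}")
```

Unlike the mean-only constraint, this regularizer stays positive for the example above because the variances differ, and it vanishes when both groups have the same mean and variance. The KL divergence is not symmetric, so swapping the arguments generally gives a different value; the Jeffreys divergence mentioned in Sec. 3.1 is the symmetric alternative.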
4.1 EXPERIMENTS ON GENERIC VISION DATASETS

We conduct an empirical analysis of the ϵ-SupCon and ϵ-SupInfoNCE losses on standard vision datasets to evaluate the different formulations and to assess the impact of the ϵ parameter. We compare our results with baseline implementations including cross-entropy (CE) and SupCon.

Experimental details. We use the original setup from SupCon (Khosla et al., 2020), employing a ResNet-50, a large batch size (1024), a learning rate of 0.5, a temperature of 0.1, and multiview augmentation, for CIFAR-10 and CIFAR-100. Additional experimental details (including ImageNet-100³) and the different hyperparameter configurations are provided in Sec. B of the Appendix.

³ Due to computing resource constraints, we were not able to evaluate ImageNet-1k.

Table 1: Comparison of ϵ-SupInfoNCE and ϵ-SupCon on ImageNet-100.

| Loss | Accuracy (%) |
|---|---|
| ϵ-SupInfoNCE | 83.3 ± 0.06 |
| ϵ-SupCon | 82.83 ± 0.11 |

Results. First, we compare our proposed ϵ-SupInfoNCE loss with the ϵ-SupCon loss derived in Sec. 3. As reported in Tab. 1, ϵ-SupInfoNCE performs better than ϵ-SupCon: we conjecture that the lack of the non-contrastive term of Eq. 5 leads to increased robustness, as will also be shown in Sec. 4.2. For this reason, we focus on ϵ-SupInfoNCE. Further comparison with different values of ϵ can be found in Sec. C.1, showing that SupCon ≤ ϵ-SupCon ≤ ϵ-SupInfoNCE in terms of accuracy. Results on general computer vision datasets are presented in Tab. 2, in terms of top-1 accuracy. We report the performance for the best value of ϵ; the complete results can be found in Sec. C.1. The results are averaged across 3 trials for every configuration, and we also report the standard deviation. We obtain a significant improvement with respect to all baselines and, most importantly, SupCon, on all benchmarks: on CIFAR-10 (+0.5%), on CIFAR-100 (+0.63%), and on ImageNet-100 (+1.31%).

Table 2: Accuracy on vision datasets. SimCLR and Max-Margin results from Khosla et al. (2020). Results denoted with * are (re)implemented with mixed precision due to memory constraints.

| Dataset | Network | SimCLR | Max-Margin | SimCLR* | CE* | SupCon* | ϵ-SupInfoNCE* |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | ResNet-50 | 93.6 | 92.4 | 91.74 ± 0.05 | 94.73 ± 0.18 | 95.64 ± 0.02 | 96.14 ± 0.01 |
| CIFAR-100 | ResNet-50 | 70.7 | 70.5 | 68.94 ± 0.12 | 73.43 ± 0.08 | 75.41 ± 0.19 | 76.04 ± 0.01 |
| ImageNet-100 | ResNet-50 | - | - | 66.14 ± 0.08 | 82.1 ± 0.59 | 81.99 ± 0.08 | 83.3 ± 0.06 |

4.2 EXPERIMENTS ON BIASED DATASETS

Next, we move on to analyzing how our proposed loss performs in biased learning settings. We employ five datasets, ranging from synthetic data to real facial images: Biased-MNIST, Corrupted CIFAR-10, bFFHQ, and 9-Class ImageNet along with ImageNet-A. The detailed setup and experimental details are provided in Appendix B.

Biased-MNIST is a biased version of MNIST (Deng, 2012), proposed in Bahng et al. (2020). A color bias is injected into the dataset by colorizing the image background with ten predefined colors associated with the ten different digits. Given an image, the background is colored with the predefined color for that class with a probability ρ, and with any one of the other colors with a probability 1 − ρ. Higher values of ρ lead to more biased data. In this work, we explore four values of ρ: 0.999, 0.997, 0.995 and 0.99. An unbiased test set is built with ρ = 0.1. We compare with a cross-entropy baseline and with other debiasing techniques, namely EnD (Tartaglione et al., 2021), LNL (Kim et al., 2019), and BiasCon and BiasBal (Hong & Yang, 2021).

Figure 2: Comparison of ϵ-SupCon and ϵ-SupInfoNCE on Biased-MNIST (four panels; x-axis: ϵ ∈ {0.0, 0.1, 0.25, 0.5, 1.0, 2.5}; y-axis: accuracy (%); curves: SupCon, CE, ϵ-SupCon, ϵ-SupInfoNCE). It is noticeable that for ρ ≤ 0.997, ϵ-SupInfoNCE and ϵ-SupCon are comparable, while for ρ = 0.999 the gap is significantly larger: this could be due to the additional non-contrastive condition of SupCon.
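The Biased-MNIST colour assignment described above can be sketched as a simple sampling rule. The code below is an illustration only: the function name `assign_background`, the integer encoding of colours and the sample sizes are assumptions, and it does not reproduce the dataset generation code of Bahng et al. (2020).

```python
import random

def assign_background(label, rho, num_colors=10, rng=random):
    """Pick a background colour index for a digit with class `label`.

    With probability rho the class-specific colour (index == label) is used,
    otherwise one of the remaining colours is drawn uniformly at random.
    """
    if rng.random() < rho:
        return label
    others = [c for c in range(num_colors) if c != label]
    return rng.choice(others)

def aligned_fraction(rho, n=20000, seed=0):
    """Empirical fraction of bias-aligned samples for a given rho."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        label = rng.randrange(10)
        if assign_background(label, rho, rng=rng) == label:
            hits += 1
    return hits / n

for rho in (0.999, 0.99, 0.1):
    print(f"rho = {rho}: aligned fraction ~ {aligned_fraction(rho):.3f}")
```

With ten colours, ρ = 0.1 makes every colour equally likely for every digit (0.1 for the class colour and 0.9 / 9 = 0.1 for each of the others), which is why this value yields an unbiased test set.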
Analysis of ϵ-SupInfoNCE and ϵ-SupCon: First, we perform an evaluation of the ϵ-SupCon and ϵ-SupInfoNCE losses alone, without our debiasing regularization term. Fig. 2 shows the accuracy on the unbiased test set for the different values of ρ. Baseline results of a cross-entropy model (CE) are reported in Tab. 3. Both losses result in higher accuracy compared to cross-entropy. The generally higher robustness of contrastive-based formulations is also confirmed by the related literature (Khosla et al., 2020). Interestingly, in the most biased setting (ρ = 0.999), we observe that ϵ-SupInfoNCE obtains higher accuracy than ϵ-SupCon. Our conjecture is that the non-contrastive term of SupCon in Eq. 5 ($s_t^+ - s_i^+ \le 0 \ \forall i, t$) can lead, in highly biased settings, to more biased representations, as the bias-aligned samples will be especially predominant among the positives. For this reason, we focus on ϵ-SupInfoNCE in the remainder of this work.

Debiasing with FairKL: Next, we apply our regularization technique FairKL jointly with ϵ-SupInfoNCE, and compare it with the other debiasing methods. The results are shown in Tab. 3. Our technique achieves the best results in all experiments, with large gaps in accuracy, especially in the most difficult settings (higher ρ). For completeness, we also evaluate the debiasing power of FairKL with different losses, i.e., CE and ϵ-SupCon. With FairKL we obtain better results than most of the other baselines with either CE, ϵ-SupCon or ϵ-SupInfoNCE; the latter achieves the best performance, confirming the results observed in Sec. 4.1. For this reason, in the rest of the work, we focus on ϵ-SupInfoNCE.

Table 3: Top-1 accuracy (%) on Biased-MNIST for different values of ρ. Reference results from Hong & Yang (2021). Results denoted with * are re-implemented without color-jittering and bias-conflicting oversampling.

| Method | ρ = 0.999 | ρ = 0.997 | ρ = 0.995 | ρ = 0.99 |
|---|---|---|---|---|
| CE (Hong & Yang, 2021) | 11.8 ± 0.7 | 62.5 ± 2.9 | 79.5 ± 0.1 | 90.8 ± 0.3 |
| LNL (Kim et al., 2019) | 18.2 ± 1.2 | 57.2 ± 2.2 | 72.5 ± 0.9 | 86.0 ± 0.2 |
| ϵ-SupCon | 24.36 ± 3.23 | 74.35 ± 0.09 | 84.13 ± 1.31 | 91.12 ± 0.35 |
| ϵ-SupInfoNCE | 33.16 ± 3.57 | 73.86 ± 0.81 | 83.65 ± 0.36 | 91.18 ± 0.49 |
| EnD (Tartaglione et al., 2021) | 59.5 ± 2.3 | 82.70 ± 0.3 | 94.0 ± 0.6 | 94.8 ± 0.3 |
| BiasCon+BiasBal* (Hong & Yang, 2021) | 30.26 ± 11.08 | 82.83 ± 4.17 | 88.20 ± 2.27 | 95.04 ± 0.86 |
| BiasBal (Hong & Yang, 2021) | 76.8 ± 1.6 | 91.2 ± 0.2 | 93.9 ± 0.1 | 96.3 ± 0.2 |
| BiasCon+CE* (Hong & Yang, 2021) | 15.06 ± 2.22 | 90.48 ± 5.26 | 95.95 ± 0.11 | 97.67 ± 0.09 |
| CE + FairKL | 79.9 ± 4.29 | 93.86 ± 1.13 | 94.85 ± 0.55 | 95.92 ± 0.17 |
| ϵ-SupCon + FairKL | 89.45 ± 1.82 | 95.75 ± 0.16 | 96.31 ± 0.81 | 96.72 ± 0.2 |
| ϵ-SupInfoNCE + FairKL | 90.51 ± 1.55 | 96.19 ± 0.23 | 97.00 ± 0.06 | 97.86 ± 0.02 |

Corrupted CIFAR-10 is built from the CIFAR-10 dataset by correlating each class with a certain texture (brightness, frost, etc.), following the protocol proposed in Hendrycks & Dietterich (2019). Similarly to Biased-MNIST, the dataset is provided with five different levels of ratio between bias-conflicting and bias-aligned samples. The results are shown in Tab. 4. Notably, we obtain the best results in the most difficult scenario, when the amount of bias-conflicting samples is the lowest. For the other settings, we obtain results comparable with the state of the art.

bFFHQ is proposed by Lee et al. (2021), and contains facial images. The dataset is constructed in such a way that most of the females are young (age range 10-29), while most of the males are older (age range 40-59). The ratio between bias-conflicting and bias-aligned samples provided for this dataset is 0.5. The results are shown in Tab. 4, where our technique outperforms all other methods.

9-Class ImageNet and ImageNet-A. We also test our method on the more complex and realistic 9-Class ImageNet (Ilyas et al., 2019) dataset. This dataset is a subset of ImageNet, which is known to contain textural biases (Geirhos et al., 2019). It aggregates 42 of the original classes into 9 macro categories.
Following Hong & Yang (2021), we train a BagNet18 (Brendel & Bethge, 2019) as the bias-capturing model, which we then use to compute a bias score for the training samples, to apply within our regularization term. More details and the experimental setup can be found in Sec. B.2.4. We evaluate the accuracy on the (biased) test set along with the unbiased accuracy (UNB), computed with the texture labels assigned in Brendel & Bethge (2019). We also report accuracy on the ImageNet-A (IN-A) dataset, which contains bias-conflicting samples (Hendrycks et al., 2021). Results are shown in Tab. 5. On the biased test set, the results are comparable with SoftCon, while on the harder unbiased and ImageNet-A sets we achieve state-of-the-art results.

Table 4: Top-1 accuracy (%) on Corrupted CIFAR-10 with different corruption ratios (%) and on bFFHQ. Reference results are taken from Lee et al. (2021).

| Method | Corrupted CIFAR-10, ratio 0.5 | ratio 1.0 | ratio 2.0 | ratio 5.0 | bFFHQ, ratio 0.5 |
|---|---|---|---|---|---|
| Vanilla (Lee et al., 2021) | 23.08 ± 1.25 | 25.82 ± 0.33 | 30.06 ± 0.71 | 39.42 ± 0.64 | 56.87 ± 2.69 |
| EnD (Tartaglione et al., 2021) | 19.38 ± 1.36 | 23.12 ± 1.07 | 34.07 ± 4.81 | 36.57 ± 3.98 | 56.87 ± 1.42 |
| HEX (Wang et al., 2019a) | 13.87 ± 0.06 | 14.81 ± 0.42 | 15.20 ± 0.54 | 16.04 ± 0.63 | 52.83 ± 0.90 |
| ReBias (Bahng et al., 2020) | 22.27 ± 0.41 | 25.72 ± 0.20 | 31.66 ± 0.43 | 43.43 ± 0.41 | 59.46 ± 0.64 |
| LfF (Nam et al., 2020) | 28.57 ± 1.30 | 33.07 ± 0.77 | 39.91 ± 0.30 | 50.27 ± 1.56 | 62.2 ± 1.0 |
| DFA (Lee et al., 2021) | 29.95 ± 0.71 | 36.49 ± 1.79 | 41.78 ± 2.29 | 51.13 ± 1.28 | 63.87 ± 0.31 |
| ϵ-SupInfoNCE + FairKL | 33.33 ± 0.38 | 36.53 ± 0.38 | 41.45 ± 0.42 | 50.73 ± 0.90 | 64.8 ± 0.43 |

Table 5: Top-1 accuracy (%) on 9-Class ImageNet biased and unbiased (UNB) sets, and ImageNet-A (IN-A). Reference results from Hong & Yang (2021).

| | Vanilla | SIN | LM | RUBi | ReBias | LfF | SoftCon | ϵ-SupInfoNCE + FairKL |
|---|---|---|---|---|---|---|---|---|
| Biased | 94.0 ± 0.1 | 88.4 ± 0.9 | 79.2 ± 1.1 | 93.9 ± 0.2 | 94.0 ± 0.2 | 91.2 ± 0.1 | 95.3 ± 0.2 | 95.1 ± 0.1 |
| UNB | 92.7 ± 0.2 | 86.6 ± 1.0 | 76.6 ± 1.2 | 92.5 ± 0.2 | 92.7 ± 0.2 | 89.6 ± 0.3 | 94.1 ± 0.3 | 94.8 ± 0.3 |
| IN-A | 30.5 ± 0.5 | 24.6 ± 2.4 | 19.0 ± 1.2 | 31.0 ± 0.2 | 30.5 ± 0.2 | 29.4 ± 0.8 | 34.1 ± 0.6 | 35.7 ± 0.5 |

5 CONCLUSIONS

In this work, we propose a metric-learning-based framework for supervised representation learning. We propose a new loss, called ϵ-SupInfoNCE, that is based on the definition of the ϵ-margin, which is the minimal margin between positive and negative samples. By adjusting this value, we are able to find a tighter approximation of the mutual information and achieve better results compared to standard cross-entropy and to the SupCon loss. Then, we tackle the problem of learning unbiased representations when the training data contains strong biases. This represents a failure case for InfoNCE-like losses. We propose FairKL, a debiasing regularization term derived from our framework. With it, we enforce equality between the distribution of distances of bias-conflicting samples and that of bias-aligned samples. This, together with the increase of the ϵ margin, allows us to reach state-of-the-art performance in the most extreme cases of bias in different datasets, comprising both synthetic data and real-world images.

REFERENCES

Mohsan Alvi, Andrew Zisserman, and Christoffer Nellåker. Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0-0, 2018.

Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning (ICML), 2020.

Carlo Alberto Barbano, Enzo Tartaglione, and Marco Grangetto. Bridging the gap between debiasing and privacy for deep learning.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 3806-3815, October 2021.

Alex Beutel, Jilin Chen, Tulsee Doshi, Hai Qian, Allison Woodruff, Christine Luu, Pierre Kreitmann, Jonathan Bischof, and Ed H Chi. Putting fairness principles into practice: Challenges, metrics, and improvements. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 453-459, 2019.

Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. 7th International Conference on Learning Representations, ICLR 2019, 3 2019. doi: 10.48550/arxiv.1904.00760. URL https://arxiv.org/abs/1904.00760v1.

Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. RUBi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems, pp. 841-852, 2019.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning, pp. 1597-1607. PMLR, November 2020. URL http://proceedings.mlr.press/v119/chen20j.html. ISSN: 2640-3498.

S. Chopra, R. Hadsell, and Y. LeCun. Learning a Similarity Metric Discriminatively, with Application to Face Verification. In CVPR, volume 1, pp. 539-546. IEEE, 2005. ISBN 978-0-7695-2372-9. doi: 10.1109/CVPR.2005.202. URL http://ieeexplore.ieee.org/document/1467314/.

Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 4067-4080. Association for Computational Linguistics, 2019. doi: 10.18653/v1/D19-1418.
URL https://doi.org/10.18653/v1/D19-1418.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

Gamaleldin F. Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. Advances in Neural Information Processing Systems, 2018-December:842-852, 3 2018. ISSN 10495258. doi: 10.48550/arxiv.1803.05598. URL https://arxiv.org/abs/1803.05598v2.

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bygh9j09KX.

Florian Graf, Christoph D Hofer, Marc Niethammer, and Roland Kwitt. Dissecting supervised contrastive learning. 2021.

Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. In Advances in Neural Information Processing Systems, pp. 9094-9104, 2018.

R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping. In CVPR, volume 2, pp. 1735-1742. IEEE, 2006.

Robert M. Haralick, K. Shanmugam, and Its'Hak Dinstein. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(6):610-621, 1973. doi: 10.1109/TSMC.1973.4309314.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December:770-778, 12 2015. ISSN 10636919. doi: 10.48550/arxiv.1512.03385. URL https://arxiv.org/abs/1512.03385v1.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. 7th International Conference on Learning Representations, ICLR 2019, 3 2019. doi: 10.48550/arxiv.1903.12261. URL https://arxiv.org/abs/1903.12261v1.

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. CVPR, 2021.

Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

Youngkyu Hong and Eunho Yang. Unbiased classification through bias-contrastive and bias-balanced learning. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=2OqZZAqxnn.

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in Neural Information Processing Systems, 32, 5 2019. ISSN 10495258. doi: 10.48550/arxiv.1905.02175. URL https://arxiv.org/abs/1905.02175v4.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. volume 33, pp. 18661-18673. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/d89a66c7c80a29b1bdbab0f2a1a94af8-Paper.pdf.

Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo Kim. Learning not to learn: Training deep neural networks with biased data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). a. URL http://www.cs.toronto.edu/~kriz/cifar.html.
Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-100 (Canadian Institute for Advanced Research). b. URL http://www.cs.toronto.edu/~kriz/cifar.html.

S.W.-C. Lam. Texture feature extraction using gray level gradient based co-occurence matrices. In 1996 IEEE International Conference on Systems, Man and Cybernetics. Information Intelligence and Systems (Cat. No.96CH35929), volume 1, pp. 267-271 vol.1, 1996. doi: 10.1109/ICSMC.1996.569778.

Jungsoo Lee, Eungyeup Kim, Juyoung Lee, Jihyeon Lee, and Jaegul Choo. Learning debiased representation via disentangled feature augmentation. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=-oUhJJILWHb.

Yingwei Li, Qihang Yu, Mingxing Tan, Jieru Mei, Peng Tang, Wei Shen, Alan Yuille, and Cihang Xie. Shape-texture debiased neural network training. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Db4yerZTYkz.

Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: Training debiased classifier from biased classifier. In Advances in Neural Information Processing Systems, 2020.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 [cs, stat], January 2019. URL http://arxiv.org/abs/1807.03748. arXiv: 1807.03748.

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, and George Tucker. On Variational Bounds of Mutual Information. In ICML, 2019.

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2019.

Prasanna Sattigeri, Samuel C Hoffman, Vijil Chenthamarakshan, and Kush R Varshney. Fairness GAN. arXiv preprint arXiv:1805.09910, 2018.

Florian Schroff, Dmitry Kalenichenko, and James Philbin.
FaceNet: A Unified Embedding for Face Recognition and Clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815-823, June 2015. doi: 10.1109/CVPR.2015.7298682. URL http://arxiv.org/abs/1503.03832. arXiv: 1503.03832.

Kihyuk Sohn. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://papers.nips.cc/paper/2016/hash/6b180037abbebea991d8b1232f8a8ca9-Abstract.html.

Congzheng Song, Thomas Ristenpart, and Vitaly Shmatikov. Machine learning models that remember too much. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 587-601. ACM, 2017.

Enzo Tartaglione, Carlo Alberto Barbano, and Marco Grangetto. EnD: Entangling and disentangling deep representations for bias correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13508-13517, June 2021.

Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. A deeper look at dataset bias. In Domain adaptation in computer vision applications, pp. 37-55. Springer, 2017.

Antonio Torralba, Alexei A Efros, et al. Unbiased look at dataset bias. In CVPR, pp. 7. Citeseer, 2011.

Haohan Wang, Zexue He, Zachary L. Lipton, and Eric P. Xing. Learning robust representations by projecting superficial statistics out. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJEjjoR9K7.

Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jinbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning Fine-grained Image Similarity with Deep Ranking. In CVPR, 2014.

Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In International Conference on Computer Vision (ICCV), October 2019b.
Xinshao Wang, Yang Hua, Elyor Kodirov, and Neil M. Robertson. Ranked List Loss for Deep Metric Learning. In CVPR, 2019c.

Zeyu Wang, Klint Qinami, Ioannis Karakozis, Kyle Genova, Prem Nair, Kenji Hata, and Olga Russakovsky. Towards fairness in visual recognition: Effective strategies for bias mitigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Kilian Q Weinberger, John Blitzer, and Lawrence Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In Advances in Neural Information Processing Systems, volume 18. MIT Press, 2006. URL https://proceedings.neurips.cc/paper/2005/hash/a7f592cef8b130a6967a90617db5681b-Abstract.html.

Qizhe Xie, Zihang Dai, Yulun Du, E. Hovy, and Graham Neubig. Controllable invariance through adversarial feature learning. In NIPS, 2017.

Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu. FairGAN: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data (Big Data), pp. 570-575. IEEE, 2018.

Baosheng Yu and Dacheng Tao. Deep Metric Learning With Tuplet Margin Loss. In IEEE ICCV, pp. 6489-6498, 2019.

A THEORETICAL RESULTS

A.1 COMPLETE DERIVATIONS FOR SECTION 3

In this section, we present the complete analytical derivation for the equations found in Sec. 3. All of the presented derivations are based on the smooth max approximation with the LogSumExp (LSE) operator:

$$\max(x_1, x_2, \dots, x_n) \approx \log\Big(\sum_i \exp(x_i)\Big) \tag{14}$$

A.1.1 COMPUTATIONS OF ϵ-INFONCE (2)

We consider Eq.
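2 and we obtain:

$$\arg\min_f \max(-\epsilon, \{s_j^- - s^+\}_{j=1,\dots,N}) \approx \arg\min_f \underbrace{-\log\left(\frac{\exp(s^+)}{\exp(s^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-InfoNCE}} \tag{15}$$

Starting from the left-hand side, we have:

$$\begin{aligned}
\max(-\epsilon, \{s_j^- - s^+\}_{j=1,\dots,N}) &\approx \log\Big(\exp(-\epsilon) + \sum_j \exp(s_j^- - s^+)\Big) \\
&= \log\Big(\exp(-\epsilon) + \exp(-s^+) \sum_j \exp(s_j^-)\Big) \\
&= \log\Big(\exp(-s^+)\Big(\exp(s^+ - \epsilon) + \sum_j \exp(s_j^-)\Big)\Big) \\
&= \log \exp(-s^+) + \log\Big(\exp(s^+ - \epsilon) + \sum_j \exp(s_j^-)\Big) \\
&= \underbrace{-\log\left(\frac{\exp(s^+)}{\exp(s^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-InfoNCE}}
\end{aligned} \tag{16}$$

A.1.2 MULTIPLE POSITIVE EXTENSION

Extending Eq. 2 to multiple positives can be done in different ways. Here, we list four possible choices (Eq. 17). Empirically, we found that solution c) gave the best results and is the most convenient to implement for efficiency reasons.

a) $\max\big(-\epsilon, \{s_j^- - s_i^+\}_{i=1,\dots,P;\ j=1,\dots,N}\big) \approx -\log\left(\dfrac{\exp(\sum_i s_i^+)}{\exp(\sum_i s_i^+ - \epsilon) + \big(\sum_j \exp(s_j^-)\big)\big(\sum_i \exp(\sum_{t \neq i} s_t^+)\big)}\right)$

b) $\sum_j \max\big(-\epsilon, \{s_j^- - s_i^+\}_{i=1,\dots,P}\big) \approx -\sum_j \log\left(\dfrac{\exp(\sum_i s_i^+)}{\exp(\sum_i s_i^+ - \epsilon) + \exp(s_j^-)\big(\sum_i \exp(\sum_{t \neq i} s_t^+)\big)}\right)$

c) $\sum_i \max\big(-\epsilon, \{s_j^- - s_i^+\}_{j=1,\dots,N}\big) \approx -\sum_i \log\left(\dfrac{\exp(s_i^+)}{\exp(s_i^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)$

d) $\sum_i \sum_j \max\big(-\epsilon, s_j^- - s_i^+\big) \approx -\sum_i \sum_j \log\left(\dfrac{\exp(s_i^+)}{\exp(s_i^+ - \epsilon) + \exp(s_j^-)}\right)$

A.1.3 COMPUTATIONS OF ϵ-SUPINFONCE (4)

The computations are very similar to Eq. 16. We obtain:

$$\arg\min_f \sum_i \max(-\epsilon, \{s_j^- - s_i^+\}_{j=1,\dots,N}) \approx \arg\min_f \sum_i \log\Big(\exp(-\epsilon) + \sum_j \exp(s_j^- - s_i^+)\Big) \tag{18}$$

Starting from the left-hand side, we have:

$$\begin{aligned}
\sum_i \max(-\epsilon, \{s_j^- - s_i^+\}_{j=1,\dots,N}) &\approx \sum_i \log\Big(\exp(-\epsilon) + \sum_j \exp(s_j^- - s_i^+)\Big) \\
&= \sum_i \log\left(\frac{\exp(s_i^+ - \epsilon) + \sum_j \exp(s_j^-)}{\exp(s_i^+)}\right) \\
&= \underbrace{-\sum_i \log\left(\frac{\exp(s_i^+)}{\exp(s_i^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-SupInfoNCE}}
\end{aligned} \tag{19}$$

A.1.4 COMPUTATIONS OF ϵ-SUPCON (5)

We extend Eq.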
4 by adding the non-contrastive conditions:

$$s_j^- - s_i^+ \le -\epsilon \;\; \forall i, j \quad \text{and} \quad s_t^+ - s_i^+ \le 0 \;\; \forall i, t \neq i \tag{20}$$

and we show

$$\frac{1}{P}\sum_i \max\big(0, \{s_j^- - s_i^+ + \epsilon\}_j, \{s_t^+ - s_i^+\}_{t \neq i}\big) \approx \epsilon \; \underbrace{- \frac{1}{P}\sum_i \log\left(\frac{\exp(s_i^+)}{\sum_t \exp(s_t^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-SupCon}} \tag{21}$$

Starting from the left-hand side, we have:

$$\begin{aligned}
\frac{1}{P}\sum_i \max\big(0, \{s_j^- - s_i^+ + \epsilon\}_{j=1,\dots,N}, \{s_t^+ - s_i^+\}_{t \neq i}\big) &\approx \frac{1}{P}\sum_i \log\Big(1 + \sum_j \exp(s_j^- - s_i^+ + \epsilon) + \sum_{t \neq i} \exp(s_t^+ - s_i^+)\Big) \\
&= \frac{1}{P}\sum_i \log\left(1 + \frac{\sum_j \exp(s_j^-)}{\exp(s_i^+ - \epsilon)} + \frac{\sum_{t \neq i} \exp(s_t^+)}{\exp(s_i^+)}\right) \\
&= \frac{1}{P}\sum_i \log\left(\frac{\exp(s_i^+ - \epsilon) + \sum_j \exp(s_j^-) + \sum_{t \neq i} \exp(s_t^+ - \epsilon)}{\exp(s_i^+ - \epsilon)}\right) \\
&= -\frac{1}{P}\sum_i \log\left(\frac{\exp(s_i^+ - \epsilon)}{\sum_t \exp(s_t^+ - \epsilon) + \sum_j \exp(s_j^-)}\right) \\
&= \epsilon \; \underbrace{- \frac{1}{P}\sum_i \log\left(\frac{\exp(s_i^+)}{\sum_t \exp(s_t^+ - \epsilon) + \sum_j \exp(s_j^-)}\right)}_{\epsilon\text{-SupCon}}
\end{aligned} \tag{22}$$

A.1.5 COMPUTATIONS OF $\mathcal{L}^{sup}_{in}$ IN (7)

Here we show that:

$$\max_j(s_j^-) < \max_i(s_i^+) \;\approx\; \underbrace{-\log\left(\sum_i \frac{\exp(s_i^+)}{\sum_t \exp(s_t^+) + \sum_j \exp(s_j^-)}\right)}_{\mathcal{L}^{sup}_{in}} \tag{23}$$

Starting from the left-hand side, and given that, by Eq. 14:

$$\max_j(s_j^-) < \max_i(s_i^+) \;\approx\; \log\Big(\sum_j \exp(s_j^-)\Big) - \log\Big(\sum_i \exp(s_i^+)\Big) < 0$$

we have:

$$\begin{aligned}
\arg\min_f \max\Big(0, \log\Big(\sum_j \exp(s_j^-)\Big) - \log\Big(\sum_i \exp(s_i^+)\Big)\Big) &\approx \log\Big(1 + \exp\Big(\log\Big(\sum_j \exp(s_j^-)\Big) - \log\Big(\sum_i \exp(s_i^+)\Big)\Big)\Big) \\
&= \log\left(1 + \frac{\sum_j \exp(s_j^-)}{\sum_i \exp(s_i^+)}\right) \\
&= \log\left(\frac{\sum_i \exp(s_i^+) + \sum_j \exp(s_j^-)}{\sum_i \exp(s_i^+)}\right) \\
&= \underbrace{-\log\left(\sum_i \frac{\exp(s_i^+)}{\sum_t \exp(s_t^+) + \sum_j \exp(s_j^-)}\right)}_{\mathcal{L}^{sup}_{in}}
\end{aligned} \tag{24}$$

A.1.6 COMPUTATIONS OF EQ. 17-A

$$\begin{aligned}
\arg\min_f \max\big(-\epsilon, \{s_j^- - s_i^+\}_{i=1,\dots,P;\ j=1,\dots,N}\big) &\approx \arg\min_f \log\Big(\exp(-\epsilon) + \sum_i \sum_j \exp(s_j^- - s_i^+)\Big) \\
&= \log\Big(\exp(-\epsilon) + \Big(\sum_j \exp(s_j^-)\Big)\Big(\sum_i \frac{1}{\exp(s_i^+)}\Big)\Big) \\
&= \log\left(\frac{\exp(-\epsilon)\prod_i \exp(s_i^+) + \big(\sum_j \exp(s_j^-)\big)\big(\sum_i \prod_{t \neq i} \exp(s_t^+)\big)}{\prod_i \exp(s_i^+)}\right) \\
&= -\log\left(\frac{\exp(\sum_i s_i^+)}{\exp(\sum_i s_i^+ - \epsilon) + \big(\sum_j \exp(s_j^-)\big)\big(\sum_i \exp(\sum_{t \neq i} s_t^+)\big)}\right)
\end{aligned}$$

Figure 3: When considering only Eq. 9a-left (average of distances) or Eq. 9a-right (average of similarities), that is when using EnD (Tartaglione et al., 2021), we may obtain a suboptimal configuration such as (a), where we can still (partially) order the distances of positive samples from the anchor based on the bias features. The conditions in Eq. 9 are fulfilled, namely the averages of the distances of bias-aligned and bias-conflicting samples from the anchor are the same ($\mu_{+,b} = \mu_{+,b'} = 2$). This is only partially mitigated when using a margin ϵ > 0 (b). However, the standard deviations of the distances of bias-aligned and bias-conflicting samples in (a) and (b) are different ($\sigma_{+,b} = 0$, while $\sigma_{+,b'} = 1$). This can be computed using the distances d reported in the figure. If we also consider the conditions on the standard deviations of the distances, as proposed in FairKL (Eq. 10), the ordering is removed and thus also the effect of the bias (c). In (c), we show the case in which both the mean and the standard deviation of the distributions match (in a simplified case with σ = 0). A simulated example is shown in Fig. 4.

A.2 BOUNDEDNESS OF THE ϵ-MARGIN

In this section, we give insights on how an optimal value of ϵ can be estimated. First of all, it is easy to show that ϵ is bounded and cannot grow to infinity. Given that $\|f(x)\|_2 = 1$, we have

$$\max(d^+ - d^-) = 2.$$

This is the case in which the two samples are aligned at opposite poles of the hypersphere. We can conclude that, in general, ϵ will be less than 2. If we also take into account the temperature τ, then ϵ ≤ 2/τ. This always holds; however, a stricter upper bound can be found if we consider the geometric properties of the latent space. For example, Graf et al. (2021) show that when the SupCon loss converges to its minimum value, the representations of the different classes are aligned on a regular simplex. This property could be used to compute a precise upper bound of the ϵ margin, depending on the number of classes in the dataset. We leave further analysis on this matter as future work.
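The bound on the margin can be checked numerically. The following is a minimal sketch, assuming that the similarity is the cosine similarity between unit-norm embeddings divided by the temperature τ; the helper names (`unit_vector`, `cosine`, `max_similarity_gap`) and the random data are hypothetical and only serve as an illustration.

```python
import math
import random


def unit_vector(dim, rng):
    """Sample a random point on the unit hypersphere (||v||_2 = 1)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Dot product; equals the cosine similarity for unit-norm inputs."""
    return sum(a * b for a, b in zip(u, v))


def max_similarity_gap(tau=1.0):
    """Upper bound on the gap between two similarities scaled by 1/tau."""
    return 2.0 / tau


rng = random.Random(0)
tau = 0.1  # temperature reported in Sec. B
largest = 0.0
for _ in range(1000):
    anchor, pos, neg = (unit_vector(8, rng) for _ in range(3))
    gap = (cosine(anchor, pos) - cosine(anchor, neg)) / tau
    largest = max(largest, gap)

# Extreme case (assumed layout): the positive coincides with the anchor
# and the negative sits at the opposite pole of the hypersphere.
anchor = [1.0, 0.0]
extreme = (cosine(anchor, [1.0, 0.0]) - cosine(anchor, [-1.0, 0.0])) / tau
```

In this sketch no sampled gap exceeds `max_similarity_gap(tau)`, and the antipodal configuration attains it, which is why values of ϵ beyond 2/τ cannot be satisfied.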
A.3 THEORETICAL COMPARISON WITH END Here, we present a more detailed theoretical analysis of En D (Tartaglione et al., 2021), and we show that the En D regularization term can be equivalent to the condition in Eq. 9: i |s+,b i | = 0 b) 1 t |s ,b t | 1 j |s ,b j | = 0 (27) which can be turned into a minimization term R, using the method of Lagrange multipliers: i |s+,b i | t |s ,b t | 1 j |s ,b j | Now, if we assume λ1 = λ2 = 1, we can re-arrange the terms, obtaining: i |s+,b i | + 1 j |s ,b j | t |s ,b t | Published as a conference paper at ICLR 2023 aligned conflicting Figure 4: Toy example with simulated data to better explain the suboptimal solution of Fig. 3. We make the hypothesis that the distributions of the distances do follow a Gaussian distribution. In blue and in orange are shown the bias-aligned and the bias-conflicting samples respectively. The green sample represents the anchor. On the left, data points are sampled from two normal distributions with the same mean but different std. We can see that the two distributions do not match. This shows that, even if the first order constraints of En D are fulfilled, there might still be an effect of the bias. On the contrary, on the right, the two distributions have almost the same statistics (both average and std) and the KL divergence is almost 0. In that case, the bias effect is basically removed. The two terms we obtain are equivalent to the disentangling term R and to the entangling term R of En D: R tries to decorrelate all of the samples which share the same bias attribute, while the R tries to maximize the correlation of samples which belong to the same class but have different bias attributes. However, some practical differences between the two formulation remain in how the different terms are weighted: for Eq. 
9 we would have:

$$R = \lambda_1(\mu_{+,b} - \mu_{+,b'}) + \lambda_2(\mu_{-,b} - \mu_{-,b'}) \tag{30}$$

while for EnD we would have:

$$R = \lambda_1(\mu_{+,b} - \mu_{-,b}) + \lambda_2(\mu_{+,b'} - \mu_{-,b'}) \tag{31}$$

B EXPERIMENTAL SETUP

All of our experiments were run using PyTorch 1.10.0. We used a cluster with 4 NVIDIA V100 GPUs and a cluster of 8 NVIDIA A40 GPUs. For consistency, when training with contrastive losses we use a temperature value τ = 0.1 across all of our experiments.

B.1 GENERIC VISION DATASETS

B.1.1 CIFAR-10 AND CIFAR-100

We use the original setup from SupCon (Khosla et al., 2020), employing a ResNet-50, a large batch size (1024), a learning rate of 0.5, a temperature of 0.1 and multiview augmentation, for both CIFAR-10 and CIFAR-100. We use SGD as optimizer with a momentum of 0.9, and train for 1000 epochs. The learning rate is decayed with a cosine policy, with a warmup from 0.01 lasting 10 epochs.

B.1.2 IMAGENET-100

For ImageNet-100 we employ the ResNet-50 architecture (He et al., 2015). We use SGD as optimizer with a weight decay of $10^{-4}$ and a momentum of 0.9, with an initial learning rate of 0.1. We train for 100 epochs with a batch size of 512, and we decay the learning rate by a factor of 0.1 every 30 epochs.

B.2 BIASED DATASETS

When employing our debiasing term, we find that scaling the ϵ-SupInfoNCE loss by a small factor α (< 1) and using λ closer to 1 is more stable than using values of λ ≫ 1 (as done in Tartaglione et al. (2021)) and tends to produce better results. For biased datasets, we do not make use of the projection head used in Khosla et al. (2020) and Chen et al. (2020). For this reason, we also avoid the aggressive augmentation usually employed by contrastive methods (more on this in Sec. C.3). Furthermore, as also done by Hong & Yang (2021), we experimented with a small contribution of the cross-entropy loss for training the model end-to-end; however, we did not find any benefit in doing so, compared to training a linear classifier separately.
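As a rough illustration of this weighting, the sketch below computes ϵ-SupInfoNCE for a single anchor, following the form derived in Sec. A.1.3, and combines it with a regularization value as α·L + λ·R. It is a sketch under stated assumptions, not the training code: the function names and the toy similarities are hypothetical, and it is assumed that the temperature divides the raw similarities while ϵ is subtracted after scaling.

```python
import math


def eps_sup_infonce(pos_sims, neg_sims, eps=0.5, tau=0.1):
    """-sum_i log( exp(s_i+) / (exp(s_i+ - eps) + sum_j exp(s_j-)) ) for one anchor."""
    pos = [s / tau for s in pos_sims]  # assumption: temperature applied to raw similarities
    neg = [s / tau for s in neg_sims]
    neg_sum = sum(math.exp(s) for s in neg)
    loss = 0.0
    for s in pos:
        denom = math.exp(s - eps) + neg_sum
        loss += -(s - math.log(denom))
    return loss


def total_objective(contrastive, regularizer, alpha=0.1, lam=1.0):
    """Weighted sum alpha * L + lam * R of the target loss and the debiasing term."""
    return alpha * contrastive + lam * regularizer


# Toy similarities in [-1, 1] for one anchor (hypothetical values).
easy = eps_sup_infonce([0.9, 0.8], [0.1, -0.2])
hard = eps_sup_infonce([0.9, 0.8], [0.85, 0.7])
objective = total_objective(hard, regularizer=0.3, alpha=0.1, lam=1.0)
```

With no negatives the loss of a single positive reduces to −ϵ, its minimum, and negatives that are closer to the anchor increase it. A small α keeps the contrastive term from dominating the regularizer.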
B.2.1 BIASED-MNIST

We employ the SimpleConvNet architecture proposed by Bahng et al. (2020), consisting of four convolutional layers with 7 × 7 kernels. We use the Adam optimizer with a learning rate of 0.001, a weight decay of $10^{-5}$ and a batch size of 256. We train for 80 epochs and decay the learning rate by a factor of 0.1 at 1/3 and 2/3 of the epochs (26 and 53).

Hyperparameters configuration: The hyperparameters for Tab. 3 are reported in Tab. 6.

Table 6: Biased-MNIST hyperparameters of Tab. 3.

| Loss | Parameter | ρ = 0.999 | ρ = 0.997 | ρ = 0.995 | ρ = 0.990 |
|---|---|---|---|---|---|
| ϵ-SupCon | α | 0.03 | 0.03 | 0.03 | 0.03 |
| ϵ-SupCon | λ | 0.75 | 0.5 | 0.5 | 0.5 |
| ϵ-SupCon | ϵ | 0.25 | 0 | 0.5 | 0 |
| ϵ-SupInfoNCE | α | 0.03 | 0.03 | 0.03 | 0.03 |
| ϵ-SupInfoNCE | λ | 0.75 | 0.75 | 0.75 | 0.5 |
| ϵ-SupInfoNCE | ϵ | 0.5 | 0.5 | 0.5 | 0.5 |

B.2.2 CORRUPTED CIFAR-10

For this dataset we employ the ResNet-18 architecture. We use the Adam optimizer, with an initial learning rate of 0.001 and a weight decay of 0.0001. The other Adam parameters are the PyTorch defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$). We train for 200 epochs with a batch size of 256. We decay the learning rate using a cosine annealing policy.

Hyperparameters configuration: Table 7 shows the hyperparameters for the results reported in Tab. 4 of the main paper.

Table 7: Corrupted CIFAR-10 hyperparameters.

| Ratio (%) | 0.5 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|
| α | 0.1 | 0.1 | 0.1 | 0.1 |
| λ | 1.0 | 1.0 | 1.0 | 1.0 |
| ϵ | 0.1 | 0.25 | 0.5 | 0.25 |

B.2.3 BFFHQ

Following Lee et al. (2021), we use the ResNet-18 architecture. We use the Adam optimizer, with an initial learning rate of 0.0001, and train for 100 epochs. For this experiment, we set α = 0.1, ϵ = 0.25 and λ = 1.5. Differently from Lee et al. (2021), we use a batch size of 256 (vs 64), as contrastive losses benefit more from larger batch sizes (Chen et al., 2020; Khosla et al., 2020). Additionally, we use a weight decay of $10^{-4}$, rather than 0. These changes do not provide advantages to the debiasing task: we obtain an accuracy of 54.8% without FairKL, which is in line with the 56.87% reported for the vanilla model.
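The step schedule described for Biased-MNIST in Sec. B.2.1 can be sketched as follows. This is an illustrative helper only: the function names are hypothetical, and it is assumed that the milestones are obtained by integer division of the epoch count and that the decay takes effect from the milestone epoch onwards.

```python
def lr_milestones(total_epochs=80):
    """Epochs at 1/3 and 2/3 of training (integer division, an assumption)."""
    return [total_epochs // 3, 2 * total_epochs // 3]


def step_lr(epoch, base_lr=1e-3, total_epochs=80, gamma=0.1):
    """Learning rate at a given epoch: base_lr times gamma per milestone passed."""
    drops = sum(1 for m in lr_milestones(total_epochs) if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate for every epoch of an 80-epoch run.
schedule = [step_lr(e) for e in range(80)]
```

For 80 epochs the helper reproduces the milestones 26 and 53 quoted in the text.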
B.2.4 9-CLASS IMAGENET AND IMAGENET-A

Our proposed method can also be applied, with only a slight change in formulation, in cases in which no prior label about the bias attributes is given, as for example for ImageNet. Similarly to other works (Nam et al., 2020; Hong & Yang, 2021), we exploit a bias-capturing model for this purpose.

FairKL with bias-capturing model: To use a continuous score, rather than a discrete bias label, we compute the similarity of the bias features $b_i^+ = s(g(x), g(x_i^+))$, where $g(\cdot)$ is the bias-capturing BagNet18 model. The bias similarity $b_i^+$ is used to obtain a weighted sample similarity: $\hat{s}_i^{+,b} = s_i^+ b_i^+$ for bias-aligned samples, and $\hat{s}_i^{+,b'} = s_i^+ (1 - b_i^+)$ for bias-conflicting samples. By doing so, for example, the terms $\mu_{+,b} = \frac{1}{P_a}\sum_i d_i^{+,b}$ and $\mu_{+,b'} = \frac{1}{P_c}\sum_k d_k^{+,b'}$ become $\hat{\mu}_{+,b} = \frac{1}{N}\sum_i d_i^+ \hat{b}_i^+$ and $\hat{\mu}_{+,b'} = \frac{1}{N}\sum_i d_i^+ (1 - \hat{b}_i^+)$, where N is the batch size.

Setup: We pretrain the bias-capturing model BagNet18 (Brendel & Bethge, 2019) for 120 epochs. For the main model, a ResNet-18, we use the Adam optimizer with a learning rate of 0.001, $\beta_1 = 0.9$ and $\beta_2 = 0.999$, a weight decay of 0.0001 and a cosine decay of the learning rate. We use a batch size of 256 and train for 200 epochs. As augmentation we employ random resized crop, random flip and, as done in Hong & Yang (2021), random color jitter and random gray scale (p = 0.2). We use ϵ = 0.5 and λ = 1. Given the higher complexity of this dataset, we employ α = 0.5.

C ADDITIONAL EXPERIMENTS

In this section, we present some additional experiments we conducted, for a more in-depth analysis of our proposed framework.

C.1 COMPLETE RESULTS FOR COMMON VISION DATASETS

In Table 8 we report the results on CIFAR-10, CIFAR-100 and ImageNet-100 for different values of ϵ. In Table 9 the full comparison between ϵ-SupCon and ϵ-SupInfoNCE on ImageNet-100 is presented. Our proposed ϵ-SupInfoNCE outperforms SupCon in all datasets for all the ϵ values, reaching the best results.
Furthermore, on ImageNet-100, we observe that the lowest accuracy obtained by ϵ-SupInfoNCE (83.02%) is still higher than the best accuracy obtained by ϵ-SupCon (82.83%) on the same dataset, even though ϵ-SupCon is always higher than SupCon. In terms of accuracy, the results in Tab. 9 show that SupCon ≤ ϵ-SupCon ≤ ϵ-SupInfoNCE.

Table 8: Complete results for common vision datasets, for different values of ϵ, in terms of top-1 accuracy (%). With every value of ϵ we obtain better results than SupCon (and CE) on the same dataset.

| Dataset | CE | SupCon | ϵ-SupInfoNCE (ϵ = 0.1) | ϵ-SupInfoNCE (ϵ = 0.25) | ϵ-SupInfoNCE (ϵ = 0.5) |
|---|---|---|---|---|---|
| CIFAR-10 | 94.73 ± 0.18 | 95.64 ± 0.02 | 95.93 ± 0.02 | 96.14 ± 0.01 | 95.95 ± 0.12 |
| CIFAR-100 | 73.43 ± 0.08 | 75.41 ± 0.19 | 75.85 ± 0.07 | 76.04 ± 0.01 | 75.99 ± 0.06 |
| ImageNet-100 | 82.1 ± 0.59 | 81.99 ± 0.08 | 83.25 ± 0.39 | 83.02 ± 0.41 | 83.3 ± 0.06 |

Table 9: Complete comparison of ϵ-SupInfoNCE and ϵ-SupCon on ImageNet-100 in terms of top-1 accuracy (%). The results of ϵ-SupInfoNCE are higher than any result of ϵ-SupCon.

| Loss | ϵ = 0.1 | ϵ = 0.25 | ϵ = 0.5 |
|---|---|---|---|
| ϵ-SupInfoNCE | 83.25 ± 0.39 | 83.02 ± 0.41 | 83.3 ± 0.06 |
| ϵ-SupCon | 82.83 ± 0.11 | 82.54 ± 0.09 | 82.77 ± 0.14 |

C.2 ANALYSIS OF ϵ-SUPCON FOR DEBIASING

We perform a more in-depth analysis of the debiasing capabilities of ϵ-SupInfoNCE and ϵ-SupCon. In Sec. 4.2 of the main text, we hypothesize that the non-contrastive condition of Eq. 20

$$s_t^+ - s_i^+ \le 0 \;\; \forall i, t \neq i$$

might actually be the reason for the loss of accuracy of ϵ-SupCon when compared to ϵ-SupInfoNCE, as shown in the analysis on Biased-MNIST in Fig. 2 of the main text. In this section, we provide more empirical insights supporting this hypothesis. We plot the similarity of bias-aligned samples ($s^{+,b}$) and bias-conflicting samples ($s^{+,b'}$) during training, to understand how they are affected. Fig. 5 shows the bivariate histogram of the similarities obtained with the two loss functions, at different training epochs and values of ϵ, on Biased-MNIST, with a training ρ of 0.999.
Focusing on the bias-aligned samples (first two columns), we observe that, in both cases, most values are close to 1. However, while this is true for most of the shown histograms, the presence of the non-contrastive condition of Eq. 20 produces a much tighter distribution for ϵ-SupCon than for ϵ-SupInfoNCE. In fact, with ϵ-SupInfoNCE we obtain significantly more bias-aligned samples with a similarity smaller than 1. This is especially evident in the last training epochs. More interestingly, if we focus on the bias-conflicting similarities (last two columns), we can also notice that, on average, the distribution of similarities of bias-conflicting samples for ϵ-SupCon tends to be more concentrated around 0. This means that bias-conflicting samples have dissimilar representations even though they are all positives and should be mapped to the same point in the representation space. The effect of the bias is thus still quite important and has not been discarded. On the other hand, with ϵ-SupInfoNCE we obtain a much more spread-out distribution, especially as the number of training epochs increases. This means that a higher number of bias-conflicting samples have a greater similarity (in the representation space), leading to more robust representations. Clearly, ϵ-SupCon focuses more on bias-aligned samples, as most of them have a similarity close to 1, whereas most of the bias-conflicting samples have a similarity close to 0. With our proposed loss ϵ-SupInfoNCE, this behavior is less pronounced, as the lack of the non-contrastive condition leads the model to be less focused on bias-aligned samples. This could explain why ϵ-SupInfoNCE can perform better than ϵ-SupCon in highly biased settings.

C.3 TRAINING WITH A PROJECTION HEAD

In Tab. 10 we show the results on Corrupted CIFAR-10 with and without using a projection head.
When employing a projection head, the loss term and the regularization are applied on the projected and the original space respectively, and the final classification is performed in the original latent space, before the projection head. We conjecture that the loss in accuracy is likely due to two factors: (i) the absence of the aggressive augmentation typically used for generating multiviews in contrastive setups, whose effects are probably attenuated by the projection head; (ii) minimizing ϵ-SupInfoNCE and the FairKL term on the same latent space, rather than on two different ones, could be more beneficial for the optimization process.

Table 10: Accuracy on Corrupted CIFAR-10 with and without projection head.

| Ratio (%) | 0.5 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|
| With head | 30.85 ± 0.19 | 32.75 ± 0.57 | 37.95 ± 0.14 | 45.67 ± 0.66 |
| Without head | 33.33 ± 0.38 | 36.53 ± 0.38 | 41.45 ± 0.42 | 50.73 ± 0.90 |

Figure 5: (first and second columns) Distribution of positive bias-aligned similarities $s^{+,b}$ over the training epochs. Here ϵ-SupCon tends to produce a much tighter distribution, with similarities close to 1; (third and fourth columns) Distribution of positive bias-conflicting similarities $s^{+,b'}$ over the training epochs. Here ϵ-SupInfoNCE, even if marginally, is able to increase the number of similar bias-conflicting samples. ϵ-SupCon focuses more on bias-aligned samples, resulting in more biased representations. With ϵ-SupInfoNCE, this behavior is less pronounced, as the lack of the non-contrastive condition leads the model to be less focused on bias-aligned samples and more focused on the bias-conflicting ones.
We hypothesize that this is the reason ϵ-Sup Info NCE appears to obtain better results than ϵ-Sup Con in more biased datasets.

C.4 ABLATION STUDY OF DEBIASING REGULARIZATION

We perform an ablation study of our debiasing regularization on Corrupted CIFAR-10 and on b FFHQ. We test two variants of the regularization term:

1. Only with the conditions on the means of the representations µ+ and µ− (Eq. 9-a and Eq. 9-b), similarly to (Tartaglione et al., 2021), but with the differences in formulation of Sec. A.3;
2. Full Fair KL debiasing term of Eq. 9-c and Eq. 9-d.

The results are shown in Tab. 11. Employing the full regularization constraint consistently results in better accuracy.

Table 11: Ablation study of R^{Fair KL} on Corrupted CIFAR-10 and b FFHQ.

| Method | Corrupted CIFAR-10, ratio 0.5% | Corrupted CIFAR-10, ratio 1.0% | Corrupted CIFAR-10, ratio 2.0% | Corrupted CIFAR-10, ratio 5.0% | b FFHQ, ratio 0.5% |
|---|---|---|---|---|---|
| Fair KL (mean) | 32.37 ± 1.72 | 35.65 ± 0.75 | 39.94 ± 0.50 | 50.25 ± 0.16 | 60.55 ± 1.05 |
| Fair KL (full) | 33.33 ± 0.38 | 36.53 ± 0.38 | 41.45 ± 0.42 | 50.73 ± 0.90 | 63.70 ± 0.90 |

C.5 IMPORTANCE OF THE REGULARIZATION WEIGHT

We analyze the importance and stability of the weights α and λ of Eq. 11. We perform multiple experiments selecting α ∈ {0.01, 0.1, 1.0}. For simplicity, we fix ϵ = 0, and we report the accuracy on the Biased-MNIST test set. The results are shown in Tab. 12. There seems to be a correlation between the value of α and the strength of the bias: for stronger biases it is better to give more importance to the regularization term than to the target loss function. Additionally, we find that α depends on the complexity of the dataset: for example, on Corrupted CIFAR-10 and b FFHQ we use α = 0.1, while for 9-Class Image Net we use α = 0.5.

Table 12: Importance of the weights α and λ.
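| Corr. | λ = 0.5, α = 0.01 | λ = 0.5, α = 0.1 | λ = 0.5, α = 1.0 | λ = 1, α = 0.01 | λ = 1, α = 0.1 | λ = 1, α = 1.0 |
|---|---|---|---|---|---|---|
| 0.999 | 89.55 ± 1.43 | 31.63 ± 2.30 | 38.38 ± 1.26 | 84.98 ± 2.29 | 43.21 ± 10.08 | 31.84 ± 4.31 |
| 0.997 | 94.08 ± 0.10 | 82.11 ± 2.48 | 78.91 ± 2.48 | 91.08 ± 0.82 | 84.98 ± 10.23 | 79.03 ± 3.15 |
| 0.995 | 92.42 ± 3.76 | 90.60 ± 3.35 | 86.63 ± 2.26 | 88.39 ± 5.00 | 93.97 ± 1.83 | 88.27 ± 1.72 |
| 0.99 | 95.00 ± 0.21 | 96.60 ± 0.17 | 93.75 ± 0.25 | 90.72 ± 0.51 | 97.13 ± 0.38 | 94.74 ± 0.40 |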
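
As a rough illustration of how the weights in Tab. 12 trade off the two terms, the sketch below assumes that Eq. 11 is a weighted sum of the form α·L + λ·R, where L is the target loss (ϵ-Sup Info NCE) and R is the Fair KL regularizer. This form is an assumption made for illustration. The function names `combined_objective`, `regularizer_share`, and `weight_grid` are hypothetical, and the numeric inputs are placeholders, not values from the experiments.

```python
from itertools import product

ALPHAS = (0.01, 0.1, 1.0)  # values of alpha explored in Tab. 12
LAMBDAS = (0.5, 1.0)       # values of lambda explored in Tab. 12


def combined_objective(target_loss, regularizer, alpha, lam):
    """Weighted sum alpha * target_loss + lam * regularizer.

    Assumed form of Eq. 11; the inputs are scalar loss values.
    """
    return alpha * target_loss + lam * regularizer


def regularizer_share(alpha, lam):
    """Fraction of the total weight assigned to the regularizer.

    This is only a crude indicator: it ignores the actual magnitudes
    of the two loss terms.
    """
    return lam / (alpha + lam)


def weight_grid(alphas=ALPHAS, lambdas=LAMBDAS):
    """All (lambda, alpha) pairs of the sweep, grouped by lambda."""
    return list(product(lambdas, alphas))


if __name__ == "__main__":
    for lam, alpha in weight_grid():
        share = regularizer_share(alpha, lam)
        print(f"lambda={lam}, alpha={alpha}: regularizer share = {share:.3f}")
```

With λ fixed, lowering α from 1.0 to 0.01 raises the relative weight of the regularizer, which is in line with the observation that the smallest α gives the best accuracy at the strongest correlation (0.999) in Tab. 12.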