# Learning Debiased Classifier with Biased Committee

Nayeong Kim, Sehyun Hwang, Sungsoo Ahn, Jaesik Park, Suha Kwak
Pohang University of Science and Technology (POSTECH), South Korea
{nayeong.kim, sehyun03, sungsoo.ahn, jaesik.park, suha.kwak}@postech.ac.kr

**Abstract.** Neural networks are prone to be biased towards spurious correlations between classes and latent attributes exhibited in a major portion of training data, which ruins their generalization capability. We propose a new method for training debiased classifiers with no spurious attribute label. The key idea is to employ a committee of classifiers as an auxiliary module that identifies bias-conflicting data, i.e., data without spurious correlation, and assigns large weights to them when training the main classifier. The committee is learned as a bootstrapped ensemble so that a majority of its classifiers are biased as well as being diverse, and intentionally fail to predict classes of bias-conflicting data accordingly. The consensus within the committee on prediction difficulty thus provides a reliable cue for identifying and weighting bias-conflicting data. Moreover, the committee is also trained with knowledge transferred from the main classifier so that it gradually becomes debiased along with the main classifier and emphasizes more difficult data as training progresses. On five real-world datasets, our method outperforms prior arts using no spurious attribute label like ours and even surpasses those relying on bias labels occasionally. Our code is available at https://github.com/nayeong-v-kim/LWBC.

## 1 Introduction

Most supervised learning algorithms for classification rely on the empirical risk minimization (ERM) principle [40]. However, ERM has been known to cause a learned classifier to be biased toward spurious correlations between predefined classes and latent attributes that appear in a majority of training data [12]. In the case of hair color classification, for example, when most people with blond hair (i.e., target class) are female (i.e., latent attribute) in a dataset, a classifier learned by ERM exploits femaleness as a shortcut for the classification due to its spurious correlation with blond hair, and consequently often misclassifies non-blond-haired women as blond-haired. We call data with such spurious correlations, which hold a majority of the training data, bias-guiding samples, and the rest bias-conflicting samples.

The issue of model bias has often been addressed by exploiting explicit spurious attribute labels [22, 28, 36, 2, 39, 38, 45] or knowledge about bias types given a priori [3]. However, these methods are impractical because such supervision and prior knowledge are costly, and the methods demand extensive post hoc analysis. Hence, a body of research has been conducted on learning debiased classifiers with no additional label for spurious attributes [42, 27, 29, 32, 23, 26]. A common approach in this line of work is to employ an intentionally biased classifier as an auxiliary module [29, 32, 23, 26]. In this approach, samples that the biased classifier has trouble handling are regarded as bias-conflicting ones and assigned large weights when used for training the main classifier, so as to reduce the effect of their bias-guiding counterparts. Although it has driven remarkable success, this approach has drawbacks due to the use of a single biased classifier.
First, the quality of the biased classifier could vary with hyper-parameters [29] and with its initial parameter values [11]. Further, the data that the biased classifier fails to handle could include not only bias-conflicting samples but also bias-guiding ones, to a degree that depends on the quality of the classifier.

Figure 1: Analysis of the instability of a single biased classifier in mining and weighting bias-conflicting samples. The experiments are conducted on the CelebA dataset, in which samples with blond and male attributes are bias-conflicting. (a) The ratio of bias-conflicting samples to all samples incorrectly predicted by a single biased classifier. The ratio fluctuates strongly with the learning rate of the classifier and varies by up to 4%p across different initializations even with a fixed learning rate, meaning that a single biased classifier is sensitive to hyper-parameters. (b) Disagreement between predictions of biased classifiers. For pairs of biased classifiers initialized differently, we measure the number of bias-conflicting samples on which the classifiers predict differently. The results suggest that individual biased classifiers are sensitive to initialization. (c) Comparison between a single biased classifier and a committee of biased classifiers in terms of enrichment [29]. Higher enrichment implies more precise mining and weighting of bias-conflicting samples; the formal definition of enrichment is given in the appendix. The committee clearly outperforms the single biased classifier in terms of enrichment.

These drawbacks limit the reliability and performance of debiasing methods that depend on a single biased classifier, as demonstrated in Figure 1. To overcome these limitations, we propose a new method using a committee of biased classifiers as the auxiliary module, coined learning with biased committee (LWBC). LWBC identifies bias-conflicting samples and determines their weights through consensus on their prediction difficulty within the committee. To this end, the committee is built as a bootstrapped ensemble, i.e., each of its classifiers is trained on a randomly sampled subset of the entire training dataset. This strategy not only guarantees diversity among the classifiers, but also lets a majority of the classifiers be biased, since random subsets of training data are highly likely to be dominated by bias-guiding samples. Accordingly, a majority of the committee tends to classify bias-guiding samples correctly and to fail on bias-conflicting ones. The consensus on prediction difficulty within the committee thus gives a strong cue for identifying and weighting bias-conflicting samples. Also, using the consensus of multiple classifiers makes LWBC robust to the varying quality of individual classifiers and consequently lets it focus more precisely on bias-conflicting samples, as shown in Figure 1.

Moreover, unlike the biased classifier trained independently of the main classifier in previous work, the committee in LWBC is trained with knowledge of the main classifier as well as the random subsets of training data, so as to serve the main classifier better. Specifically, the knowledge is distilled in the form of classification logits of the main classifier [18], and each classifier of the committee utilizes the knowledge as pseudo labels for training data outside its own training set.
We expect that this strategy allows the committee to become debiased gradually so that it does not give large weights to easy bias-conflicting samples, i.e., those already well handled by the main classifier, and focuses more on difficult ones. Note that, even with this strategy, the classifiers of the committee remain biased in different ways due to their different training sets with ground-truth labels.

Finally, we further improve the proposed method by adopting a self-supervised representation as the frozen backbone of the committee and the main classifier. Since self-supervised learning does not depend on class labels, it is less affected by the spurious correlations between classes and latent attributes, leading to a robust and less biased representation. Also, by installing the committee and the main classifier on top of this representation, the classifiers can be implemented efficiently in both space and time while enjoying the rich and largely bias-free features given by the backbone.

LWBC is validated extensively on five real-world datasets. It substantially outperforms existing methods using no bias label and even occasionally surpasses previous arts demanding bias labels. We also demonstrate that all of the main components, i.e., the use of the committee, its training with knowledge transfer, and the self-supervised representation, contribute to the outstanding performance. The main contribution of this paper is four-fold:

- We present LWBC, a new approach to learning a debiased classifier with no spurious attribute label. The use of consensus within the committee allows LWBC to address limitations of previous work relying on a single biased classifier.
- We propose to learn the committee using knowledge of the main classifier, unlike previous work whose auxiliary modules do not consider the main classifier.
- We investigate the potential of self-supervised learning for debiasing, and find that it is a solid yet unexplored baseline for the task.
- LWBC demonstrates superior performance on five real-world datasets. It outperforms existing methods using no additional supervision like ours and even occasionally surpasses those relying on spurious attribute labels.

## 2 Related work

### 2.1 Debiasing with prior knowledge on bias

A majority of previous work on learning debiased classifiers mitigates bias by leveraging explicit labels on bias types [22, 28, 36, 2, 39, 38, 45]. For instance, Kim et al. [22] build a debiased model that classifies digits by explicitly leveraging RGB color labels in the Colored MNIST dataset. Other methods are designed to handle predefined domain-specific bias types (e.g., texture bias in ImageNet) [41, 5, 3]. Recent methods try to reduce annotation costs for bias supervision by utilizing only a small set of bias-labeled data. For instance, Nam et al. [33] and Jung et al. [21] train an auxiliary bias predictor with a small set of bias-labeled data and assign pseudo bias labels to the entire training set using its predictions. These methods have two inherent limitations. First, they cannot handle biases whose types are not predefined before training. Second, manually annotating all bias types is often expensive and laborious. Even obtaining a small set of bias labels can be expensive, since the bias labels in a real-world dataset are often long-tailed [30] or multi-labeled [16].
### 2.2 Debiasing without spurious attribute labels

Recent debiasing methods that require no bias supervision [19, 8, 32, 26, 23, 29, 7, 1, 37, 25, 34] are based on the assumption that attributes causing malignant bias are easier to learn than those of target classes. Under this assumption, previous methods estimate the degree of bias within samples and train robust models by emphasizing bias-conflicting samples. To identify the bias of samples, previous work leverages high gradients of the latent vector [19, 8] or employs an intentionally biased classifier as an auxiliary module [32, 23, 26] trained with the generalized cross-entropy (GCE) loss [44]. Liu et al. [29] treat misclassified samples as bias-conflicting samples, and Creager et al. [7] derive a bias-guiding and bias-conflicting partition that maximally violates the invariance principle. On the other hand, Bahng et al. [3] propose a method tailored to texture bias, building a debiased model that learns features independent of a biased model whose receptive field size is limited so as to capture texture bias in image classification.

While LWBC is also based on the aforementioned assumption, we are the first to propose identifying bias-conflicting samples and up-weighting them through the consensus of a committee of biased classifiers. By exploiting the consensus of multiple classifiers, LWBC is robust against the varying quality of individual classifiers and learns debiased classifiers more reliably and effectively.

## 3 Proposed method

We propose a new method that learns a debiased classifier with a committee of biased classifiers, dubbed LWBC. It first learns a feature representation with self-supervision, which is used as the frozen backbone providing rich and largely bias-free features to downstream modules (Section 3.1). Next, it trains a committee of $m$ auxiliary classifiers $f_1, f_2, \dots, f_m$ and the main classifier $g$ on top of the self-supervised representation (Section 3.2); thanks to this representation, the classifiers can be designed concisely, using only two fully-connected layers each.

Figure 2: The overall pipeline of LWBC. It adopts a frozen backbone trained by self-supervised learning. A committee of auxiliary classifiers $f_i$ is trained on top of the backbone; a random subset of training data is assigned to each classifier of the committee as labeled data for supervised learning (Eq. (1)). The committee determines the weight of each training sample based on the consensus of its members, i.e., the number of members that correctly predict the class of the sample (Eq. (2)). The main classifier $g$ is trained using the weights by minimizing the weighted cross-entropy loss (Eq. (3)). In turn, knowledge of the main classifier is transferred to the committee through knowledge distillation (Eq. (4)).

In a nutshell, the committee identifies bias-conflicting samples and assigns them large weights to reduce the effect of bias-guiding samples during the training of the main classifier. To this end, the committee is trained as a bootstrapped ensemble of classifiers so that a majority of its classifiers are biased as well as diverse, and accordingly fail to predict the classes of bias-conflicting samples.
Hence, the consensus within the committee on the prediction difficulty of a sample (e.g., the number of classifiers that fail to predict its class label) indicates how likely the sample is to be bias-conflicting, and is used to compute weights for training samples. Moreover, the committee is also trained with knowledge of the main classifier so that it gradually becomes debiased along with the main classifier and emphasizes more difficult samples as training progresses. Note that the committee is an auxiliary module used only in training and thus imposes no additional computation or memory footprint at test time. The overall process of LWBC is illustrated conceptually in Figure 2 and given formally in Algorithm 1. The following sections elaborate on each step of LWBC.

### 3.1 Self-supervised representation learning

As the feature extractor, we train a backbone network by self-supervised learning with BYOL [14] on the target dataset. During the self-supervised learning, a random patch of the input image is cropped, resized to 224×224 pixels, flipped horizontally at random, and distorted by a random sequence of brightness, contrast, saturation, and hue adjustments and grayscale conversion. A self-supervised model can capture diverse patterns shared by the data without being biased towards a particular class even when the training set is biased. We empirically demonstrate that adopting a self-supervised representation leads to a model less biased than one built on a supervised representation. Although the representation offers rich and less biased features, the main classifier can still be biased when it is trained by ERM. In other words, the self-supervised representation alone is not enough to learn a debiased model, which motivates a debiasing method for the classifier. Hence, we propose LWBC, a new debiasing method described in the next section.

### 3.2 Learning a debiased classifier with a biased committee

First, we randomly sample $m$ subsets of the same size, denoted by $S_1, \dots, S_m$, from the entire training dataset $D$ with replacement. Then the $m$ auxiliary classifiers $f_1, \dots, f_m$ of the committee are initialized randomly, and each subset $S_l$ is assigned to the auxiliary classifier $f_l$ as its training data.

Algorithm 1: Learning a debiased classifier with a biased committee

Input: training set $D$, batch size $b$, size of the committee $m$, learning rate $\eta$, scale hyper-parameter $\alpha$, balancing hyper-parameter $\lambda$, total number of iterations $t$, number of warm-up iterations $t_w$
1: Draw $m$ random subsets $S_1, \dots, S_m$ of $D$.
2: Initialize the auxiliary classifiers of the committee $\{f_l(x; \theta_l)\}_{l=1}^{m}$.
3: Initialize the main classifier $g(x; \phi)$.
4: for $j = 1, \dots, t_w$ do
5: &nbsp;&nbsp; Draw a mini-batch $B = \{(x_i, y_i)\}_{i=1}^{b}$ from $D$.
6: &nbsp;&nbsp; $\theta_l \leftarrow \theta_l - \eta \nabla_{\theta_l} \mathcal{L}_{\text{CE}}$ for $l = 1, \dots, m$ (Eq. (1))
7: end for
8: for $j = t_w + 1, \dots, t$ do
9: &nbsp;&nbsp; Draw a mini-batch $B = \{(x_i, y_i)\}_{i=1}^{b}$ from $D$.
10: &nbsp;&nbsp; $w(x_i) = 1 \big/ \left(\sum_{l=1}^{m} \mathbb{1}(f_l(x_i) = y_i)/m + \alpha\right)$ for all $x_i \in B$ (Eq. (2))
11: &nbsp;&nbsp; $\phi \leftarrow \phi - \eta \nabla_{\phi} \mathcal{L}_{\text{WCE}}$ (Eq. (3))
12: &nbsp;&nbsp; $\theta_l \leftarrow \theta_l - (1-\lambda)\eta \nabla_{\theta_l} \mathcal{L}_{\text{CE}} - \lambda\eta \nabla_{\theta_l} \mathcal{L}_{\text{KD}}$ for $l = 1, \dots, m$ (Eq. (1), Eq. (4))
13: end for

The first step of LWBC is warm-up training of the committee; this is required to ensure that the committee is capable of identifying and weighting bias-conflicting samples at the beginning of the main training process. Given a mini-batch $B$ at each warm-up iteration, the committee is trained by minimizing the cross-entropy loss below:

$$\mathcal{L}_{\text{CE}} = \sum_{(x,y) \in S_l \cap B} \text{CE}(f_l(x), y). \tag{1}$$

Since each subset is sampled from a training set dominated by bias-guiding samples, a majority of the auxiliary classifiers are also biased. At the same time, the classifiers are diverse due to their differences in initialization and training data.
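To make the bootstrapping concrete, below is a minimal PyTorch-style sketch of the committee construction and the warm-up stage (Eq. (1)). It is our own illustration, not the authors' released implementation: the names `Committee`, `warmup_step`, `backbone`, and `feat_dim` are hypothetical, each committee member is the two-fully-connected-layer head described above, and the default `m` and `subset_size` follow the BAR/NICO settings reported in Section 4.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Committee(nn.Module):
    """Bootstrapped ensemble of m small classifier heads on a frozen backbone."""
    def __init__(self, feat_dim, num_classes, dataset_size, m=30, subset_size=10):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                          nn.Linear(feat_dim, num_classes))
            for _ in range(m)
        ])
        # Each head gets its own random subset S_l of D, drawn with replacement,
        # stored as sample indices into the training set.
        self.subsets = [torch.randint(0, dataset_size, (subset_size,))
                        for _ in range(m)]

def warmup_step(committee, backbone, batch, optimizer):
    """One warm-up iteration: each head f_l is trained only on S_l ∩ B (Eq. (1))."""
    idx, x, y = batch                       # idx: indices of the samples within D
    with torch.no_grad():
        z = backbone(x)                     # frozen self-supervised features
    losses = []
    for head, subset in zip(committee.heads, committee.subsets):
        mask = torch.isin(idx, subset)      # samples of this batch that lie in S_l
        if mask.any():
            losses.append(F.cross_entropy(head(z[mask]), y[mask]))
    if losses:                              # skip if the batch misses every subset
        optimizer.zero_grad()
        torch.stack(losses).sum().backward()
        optimizer.step()
```

Because each $S_l$ is tiny relative to $D$, most heads see no example of the rare bias-conflicting groups, which is exactly what keeps the majority of the committee biased.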
After the warm-up stage, the main classifier and the committee are trained while interacting with each other. First, the main classifier is trained with a weighted cross-entropy loss on the entire training set, where the sample weights are computed from the consensus within the committee on the prediction difficulty of the samples. Since a majority of the auxiliary classifiers have trouble handling bias-conflicting samples, we identify and weight bias-conflicting samples based on the number of auxiliary classifiers whose predictions are correct for the samples. The weight function $w$ is given by

$$w(x) = \frac{1}{\sum_{l=1}^{m} \mathbb{1}(f_l(x) = y)/m + \alpha}, \tag{2}$$

where $m$ is the size of the committee, $f_l$ denotes the $l$-th classifier of the committee, and $\alpha$ is a scale hyper-parameter. The weight reflects how likely the sample $x$ is to be bias-conflicting and decreases rapidly as the number of correctly predicting classifiers increases. We then train the main classifier $g$ with emphasis on the bias-conflicting samples through the weight function $w$. The weighted cross-entropy loss $\mathcal{L}_{\text{WCE}}$ is given by

$$\mathcal{L}_{\text{WCE}} = \sum_{(x,y) \in B} w(x)\, \text{CE}(g(x), y), \tag{3}$$

where $B$ is the mini-batch. During training, as the main classifier is gradually debiased, the samples useful for debiasing it change accordingly. To focus more on the bias-conflicting samples that are difficult for the main classifier, we inform the committee of the state of the main classifier by distilling the knowledge of the main classifier in the form of its classification logits [18], transferring the knowledge by minimizing the following KD loss:

$$\mathcal{L}_{\text{KD}} = \sum_{(x,y) \in B \setminus S_l} \text{KL}\big(\text{softmax}(g(x)/\tau),\ \text{softmax}(f_l(x)/\tau)\big), \tag{4}$$

where $\tau$ is a temperature parameter. Note that we apply $\mathcal{L}_{\text{KD}}$ to the complement of $S_l$ to avoid the auxiliary classifiers in the committee becoming identical to each other. By interacting with the main classifier, the committee gradually becomes debiased along with the main classifier. Hence, samples correctly predicted by the main classifier are weighted less, and those with incorrect predictions are weighted more by the committee.
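The per-iteration interaction of Eqs. (2)-(4), together with the combined committee update of Algorithm 1 (line 12), can be sketched as follows. This continues the previous snippet (same imports and hypothetical `Committee`) and is again our own illustration rather than the official code; note that with the paper's $\alpha = 0.02$, Eq. (2) yields $w \approx 0.98$ when every member is correct and $w = 50$ when none is.

```python
def joint_step(committee, g, backbone, batch, opt_g, opt_c,
               alpha=0.02, lam=0.6, tau=1.0):
    """One post-warm-up iteration: weight samples (Eq. 2), update the main
    classifier g (Eq. 3), then update the committee (Eqs. 1, 4, 5)."""
    idx, x, y = batch
    with torch.no_grad():
        z = backbone(x)                               # frozen features
        # Eq. (2): consensus-based sample weights, detached from the graph.
        correct = torch.stack([head(z).argmax(1).eq(y).float()
                               for head in committee.heads])   # shape (m, b)
        w = 1.0 / (correct.mean(0) + alpha)           # e.g. 50 if no member is correct

    # Eq. (3): weighted cross-entropy for the main classifier (mean over the batch).
    loss_wce = (w * F.cross_entropy(g(z), y, reduction='none')).mean()
    opt_g.zero_grad(); loss_wce.backward(); opt_g.step()

    # Teacher probabilities come from the *updated* main classifier.
    with torch.no_grad():
        p_teacher = F.softmax(g(z) / tau, dim=1)

    # Eq. (5): L_committee = (1 - lam) * L_CE + lam * L_KD for each member.
    loss_c = 0.0
    for head, subset in zip(committee.heads, committee.subsets):
        in_S = torch.isin(idx, subset)
        logits_f = head(z)
        ce = F.cross_entropy(logits_f[in_S], y[in_S]) if in_S.any() else 0.0
        # Eq. (4): KD on B \ S_l only, so the members stay distinct.
        kd = F.kl_div(F.log_softmax(logits_f[~in_S] / tau, dim=1),
                      p_teacher[~in_S], reduction='batchmean')
        loss_c = loss_c + (1 - lam) * ce + lam * kd
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
```

The design choice worth noting is that `w` is computed under `torch.no_grad()`: the weights steer the main classifier's loss but no gradient flows back into the committee through them.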
After the warm-up training, the main classifier and the auxiliary classifiers are alternately updated with each mini-batch. First, we forward every sample in the mini-batch through each auxiliary classifier and compute the sample weights from the predictions of the auxiliary classifiers (Eq. (2)). With these weights, the main classifier is updated by Eq. (3), and the knowledge of the updated main classifier is transferred to the auxiliary classifiers by Eq. (4). Thus, the auxiliary classifiers are updated by minimizing both the losses of Eq. (1) and Eq. (4):

$$\mathcal{L}_{\text{committee}} = (1 - \lambda)\mathcal{L}_{\text{CE}} + \lambda \mathcal{L}_{\text{KD}}, \tag{5}$$

where $\lambda$ is a balancing hyper-parameter.

## 4 Experiments

### 4.1 Implementation details

We adopt ResNet-18 [15] as the self-supervised model and train it following the strategy of BYOL [14] on the target dataset. Since color information is a key feature for Hair Color classification on the CelebA dataset, we vary only brightness and contrast in the color-distortion data augmentation when training the model on CelebA. Since NICO and BAR are very small datasets, we initialize the self-supervised model with ImageNet [9] pretrained parameters; except for the experiments using BAR and NICO, the self-supervised model is trained from scratch. We use the self-supervised ResNet-18 as the backbone network, excluding its last fully-connected layer. We set the batch size to {64, 64, 128, 256}, the learning rate to {1e-3, 1e-3, 1e-4, 6e-3}, the size of the committee $m$ to {30, 30, 30, 40}, the size of each subset $S_l$ to {10, 10, 80, 300}, $\lambda$ to {0.9, 0.6, 0.6, 0.6}, and $\tau$ to {1, 1, 1, 2.5}, respectively for {BAR, NICO, ImageNet-9, CelebA}, and $\alpha$ to 0.02. Note that we run LWBC with 3 random seeds and report the average and the standard deviation.

Table 1: GUIDING, UNBIASED, and CONFLICTING metrics (%) for the CelebA dataset. For methods without spurious attribute labels, we mark the best and the second-best performance in bold and underline, respectively.

| Method | Spurious attribute label | Hair Color GUIDING | Hair Color UNBIASED | Hair Color CONFLICTING | Heavy Makeup GUIDING | Heavy Makeup UNBIASED | Heavy Makeup CONFLICTING |
|---|---|---|---|---|---|---|---|
| GroupDRO [36] | ✓ | 87.46 | 85.43 ±0.53 | 83.40 ±0.67 | 79.52 | 64.88 ±0.42 | 50.24 ±0.68 |
| EnD [38] | ✓ | 94.97 | 91.21 ±0.22 | 87.45 ±1.06 | 98.16 | 75.93 ±1.31 | 53.70 ±5.24 |
| CSAD [45] | ✓ | 91.19 | 89.36 | 87.53 | 82.32 | 67.88 | 53.44 |
| ERM | ✗ | 87.98 | 70.25 ±0.35 | 52.52 ±0.19 | 90.25 | 62.00 ±0.02 | 33.75 ±0.28 |
| LfF [32] | ✗ | 87.24 | 84.24 ±0.37 | 81.24 ±1.38 | 86.92 | 66.20 ±1.21 | 45.48 ±4.43 |
| SSL+ERM | ✗ | 94.15 ±0.57 | 80.48 ±0.91 | 66.79 ±2.20 | 93.00 ±0.49 | 66.30 ±1.15 | 39.50 ±2.47 |
| LWBC | ✗ | 90.57 ±2.15 | 88.90 ±1.55 | 87.22 ±1.14 | 86.42 ±0.82 | 70.29 ±1.14 | 51.28 ±5.74 |

Table 2: UNBIASED and WORST-GROUP metrics (%) for the CelebA Hair Color dataset. We also report the difference between UNBIASED and WORST-GROUP as GAP. For methods without spurious attribute labels, we mark the best and the second-best performance in bold and underline, respectively.

| Method | Backbone network | Spurious attribute label | UNBIASED | WORST-GROUP | GAP |
|---|---|---|---|---|---|
| GroupDRO [36] | ResNet50 | ✓ | 93.1 ±0.21 | 88.5 ±1.16 | 4.6 |
| SSA [33] | ResNet50 | ✓ | 92.8 ±0.11 | 89.8 ±1.28 | 3.0 |
| ERM | ResNet50 | ✗ | 95.6 | 47.2 | 48.4 |
| CVaR DRO [27] | ResNet50 | ✗ | 82.4 | 64.4 | 18.0 |
| LfF [32] | ResNet50 | ✗ | 86.0 | 70.6 | 15.4 |
| EIIL [7] | ResNet50 | ✗ | 91.9 | 83.3 | 8.6 |
| JTT [29] | ResNet50 | ✗ | 88.0 | 81.1 | 6.9 |
| SSL+ERM | ResNet18 | ✗ | 80.5 ±0.9 | 38.5 ±4.1 | 42.0 |
| LWBC | ResNet18 | ✗ | 88.9 ±1.6 | 85.5 ±1.4 | 3.4 |

Table 3: VALIDATION, UNBIASED, and TEST metrics (%) evaluated on the ImageNet-9 and ImageNet-A datasets. For methods without spurious attribute labels, we mark the best and the second-best performance in bold and underline, respectively.

| Method | Spurious attribute label | ImageNet-9 VALIDATION | ImageNet-9 UNBIASED | ImageNet-A TEST |
|---|---|---|---|---|
| StylisedIN [13] | ✓ | 88.4 ±0.5 | 86.6 ±0.6 | 24.6 ±1.4 |
| Learned Mixin [6] | ✓ | 64.1 ±4.0 | 62.7 ±3.1 | 15.0 ±1.6 |
| RUBi [5] | ✓ | 90.5 ±0.3 | 88.6 ±0.4 | 27.7 ±2.1 |
| ERM | ✗ | 90.8 ±0.6 | 88.8 ±0.6 | 24.9 ±1.1 |
| Biased (BagNet18) [4] | ✗ | 67.7 ±0.3 | 65.9 ±0.3 | 18.8 ±1.15 |
| ReBias [3] | ✗ | 91.9 ±1.7 | 90.5 ±1.7 | 29.6 ±1.6 |
| LfF [32] | ✗ | 86.0 | 85.0 | 24.6 |
| CaaM [42] | ✗ | 95.7 | 95.2 | 32.8 |
| SSL+ERM | ✗ | 94.18 ±0.07 | 93.18 ±0.04 | 34.21 ±0.49 |
| LWBC | ✗ | 94.03 ±0.23 | 93.04 ±0.32 | 35.97 ±0.49 |

Evaluation metrics. Six metrics are adopted for evaluation (a small sketch computing the derived ones follows this list):

- VALIDATION / TEST: average accuracy on the validation / test splits.
- GUIDING: average accuracy on bias-guiding samples per class.
- CONFLICTING: average accuracy on bias-conflicting samples per class.
- UNBIASED: average of GUIDING and CONFLICTING per class.
- WORST-GROUP: minimum average accuracy over groups, where the validation and test samples are grouped by the target and the bias.
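As a concrete reading of these definitions, here is a minimal sketch (our own, not from the paper's code) that derives UNBIASED and WORST-GROUP from per-group accuracies, where a group is a (target, bias) pair and, as an assumption of this sketch, UNBIASED averages all groups with equal weight:

```python
from collections import defaultdict

def group_metrics(targets, biases, correct):
    """targets, biases: group labels per sample; correct: 1/0 per sample."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t, b, c in zip(targets, biases, correct):
        hits[(t, b)] += c
        counts[(t, b)] += 1
    acc = {g: hits[g] / counts[g] for g in counts}   # per-group accuracy

    worst_group = min(acc.values())
    # Equal-weight average over (target, bias) groups, so rare
    # bias-conflicting groups count as much as large bias-guiding ones.
    unbiased = sum(acc.values()) / len(acc)
    return unbiased, worst_group

# Toy example: (blond, male) is the rare bias-conflicting group.
u, w = group_metrics(targets=["blond", "blond", "dark", "dark"],
                     biases=["f", "m", "f", "m"],
                     correct=[1, 0, 1, 1])
print(u, w)   # 0.75 0.0
```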
### 4.2 Datasets

CelebA. CelebA [31] is a face-attribute dataset in which each sample is labeled with 40 attributes. Following the experimental configuration suggested by Nam et al. [32], we focus on the Hair Color and Heavy Makeup attributes, which are spuriously correlated with the Gender attribute, i.e., most CelebA images with blond hair are of women. As a result, a biased model suffers from degraded performance when predicting the Hair Color attribute on males. We therefore use Hair Color as the target attribute and Gender as the spurious attribute, and likewise for Heavy Makeup.

ImageNet-9. ImageNet-9 [20] is a subset of ImageNet [35] containing nine super-classes. Following the setting adopted by Bahng et al. [3], we conduct experiments with 54,600 training images and 2,100 validation images. ImageNet-9 is known to exhibit a correlation between object class and image texture. We follow the evaluation scheme adopted by Bahng et al. [3] and report the unbiased accuracy on the validation set, computed as the average accuracy over every object-texture combination.

ImageNet-A. ImageNet-A [17] contains real-world images misclassified by an ImageNet-trained ResNet-50 [15]. Since such failures arise when the model relies too heavily on colors, textures, and backgrounds, ImageNet-A can be regarded as a bias-conflicting set w.r.t. various ImageNet biases. This dataset is used only for evaluating a model trained on ImageNet-9.

BAR. The Biased Action Recognition (BAR) dataset [32] is a real-world image dataset intentionally designed to exhibit spurious correlations between human actions and places in its images. Originally, the training set of BAR consists only of bias-guiding samples and its test set consists only of bias-conflicting samples. In our setting, we use 10% of the original BAR training set as validation and set the bias-conflicting ratio of the training set to 1%.

NICO. NICO [16] is a real-world dataset for simulating out-of-distribution image classification scenarios. Following the setting used by Wang et al. [42], we use the animal subset of NICO, which is labeled with 10 object and 10 context classes, for evaluating the debiasing methods. The training set consists of 7 context classes per object class with a long-tailed distribution (e.g., dog images are more frequently coupled with the "on grass" context than with any of the other 6 contexts). The validation and test sets consist of 7 seen and 3 unseen context classes per object class. Evaluation on NICO thus verifies the ability to debias a model from object-context correlations.

Table 4: CONFLICTING, VALIDATION, and TEST metrics (%) evaluated on the BAR dataset (a) and the NICO dataset (b). For methods without spurious attribute labels, we mark the best and the second-best performance in bold and underline, respectively.

(a) BAR dataset

| Method | Spurious attribute label | CONFLICTING |
|---|---|---|
| ERM | ✗ | 35.32 ±0.46 |
| ReBias [3] | ✗ | 37.02 ±0.26 |
| LfF [32] | ✗ | 48.15 ±0.93 |
| LDD [26] | ✗ | 52.31 ±1.00 |
| SSL+ERM | ✗ | 60.88 ±0.80 |
| LWBC | ✗ | 62.03 ±0.74 |

(b) NICO dataset

| Method | Spurious attribute label | VALIDATION | TEST |
|---|---|---|---|
| Cutout [10] | ✗ | 43.69 | 43.77 |
| RUBi [5] | ✗ | 43.86 | 44.37 |
| IRM [2] | ✗ | 40.62 | 41.46 |
| Unshuffle [39] | ✗ | 43.15 | 43.00 |
| REx [24] | ✗ | 41.00 | 41.15 |
| ERM | ✗ | 43.77 | 42.61 |
| CBAM [43] | ✗ | 42.15 | 42.46 |
| ReBias [3] | ✗ | 44.92 | 45.23 |
| LfF [32] | ✗ | 41.83 | 40.18 |
| CaaM [42] | ✗ | 46.38 | 46.62 |
| SSL+ERM | ✗ | 55.63 ±0.54 | 52.24 ±0.27 |
| LWBC | ✗ | 56.05 ±0.45 | 52.84 ±0.31 |

Table 5: Ablation studies using the WORST-GROUP metric (%) on the CelebA Hair Color dataset. We study the impact of learning from a single biased classifier (row 2), learning with a committee (row 3), and transferring the knowledge of the main classifier (row 4). We mark the best performance in bold.

| SSL | ERM | Single | Committee | KD | WORST-GROUP |
|---|---|---|---|---|---|
| ✓ | ✓ | | | | 38.5 ±4.1 |
| ✓ | | ✓ | | | 64.1 ±2.4 |
| ✓ | | | ✓ | | 81.3 ±2.3 |
| ✓ | | | ✓ | ✓ | **85.5 ±1.4** |
Figure 3: Effect of knowledge distillation (Eq. (4)) and of the size of the committee $m$. (a) ImageNet-9: accuracy of the main classifier and the committee with and without the KD loss on the training set (left) and the unbiased validation set (right). (b) CelebA Hair Color: WORST-GROUP accuracy of LWBC versus the number of classifiers in the committee, where the green dot indicates the value used in the main paper.

### 4.3 Quantitative results

LWBC shows superior classification accuracy among the methods that do not use spurious attribute labels on the five real-world datasets. In Tables 1 and 2, we observe that LWBC outperforms existing debiasing methods using no spurious attribute label and shows CONFLICTING performance comparable to methods that exploit spurious attribute labels on CelebA, which reflects gender bias in the real world. In particular, the gap between the average group accuracy and the worst-group accuracy of LWBC is much smaller than that of the other methods, suggesting that our model predicts fairly for samples of every group.

Figure 4: Qualitative examples of sample weighting by LWBC on CelebA and BAR. Each graph plots the weight function of Eq. (2) against the ratio of correctly predicting classifiers, along with example images corresponding to specific weights. (a) Examples from the CelebA dataset. Hair Color is the target and Gender is the bias; a majority of the blond-hair images in the training set are female. (b) Examples of the diving class from the BAR dataset. Action is the target and Place is the bias; a majority of the diving images in the training set include a body of water or its surface. The red-bordered images are bias-conflicting test samples.

Table 3 shows the results on ImageNet-9, which is dominated by texture bias, and ImageNet-A, which is regarded as a bias-conflicting set of ImageNet. LWBC is 9.7% better than CaaM [42] on the ImageNet-A dataset, i.e., LWBC is robust to texture bias and to various ImageNet biases. Table 4a shows the results on the BAR dataset, where LWBC is 18.6% better than LDD, the previous state of the art. Compared with LfF [32] and LDD [26], which are debiasing methods built on a single biased classifier, learning a debiased classifier with a biased committee is more effective. Table 4b shows the results on the NICO dataset. LWBC is 13.3% better than CaaM, which suggests that LWBC generalizes better to unseen spurious attributes; note that the validation and test sets of NICO contain unseen context classes and are unbiased.

### 4.4 Ablation study

Self-supervised representation as a solid baseline. We empirically investigate the potential of the self-supervised representation as a solid baseline for the debiasing task. We train a classifier on top of the self-supervised representation by ERM. The results, denoted by SSL+ERM, are compared with ERM, a fully supervised classification model, in Tables 1, 2, 3, 4a, and 4b. SSL+ERM outperforms not only ERM but also the previous state of the art on all the datasets except CelebA, i.e., SSL+ERM is less biased than a model trained by fully supervised learning.

Importance of each module in LWBC. Table 5 presents ablation studies on three components: (1) learning from a single biased classifier as in prior work [29, 32, 23, 26], (2) learning with a committee, and (3) transferring the knowledge of the main classifier.
First, we train a classifier by ERM (row 1). We then assign a weight of 50 to the incorrectly predicted samples and 1 to the others, and retrain the classifier with these weights (row 2). Comparing these two results demonstrates that the up-weighting scheme with a single biased classifier is effective for debiasing a classifier. Next, we increase the number of biased classifiers and compute the sample weights with our weight function, Eq. (2) (row 3). Learning with the committee yields a remarkable improvement in worst-group accuracy. Moreover, the knowledge distillation that lets the committee interact with the main classifier further improves performance (row 4).

Effectiveness of transferring knowledge of the main classifier. Figure 3a shows the range of unbiased validation accuracies of the classifiers in the committee and the unbiased validation accuracy of the main classifier during training. The mean accuracy of the classifiers in the committee gradually increases, following the accuracy of the main classifier, and the accuracy of the main classifier itself gradually increases as training progresses. In contrast, the accuracy of the classifiers in a committee trained without the KD loss does not increase, or even decreases.

Number of classifiers. We compare main classifiers trained with different numbers of auxiliary classifiers. Figure 3b shows the worst-group accuracy versus the number of auxiliary classifiers $m$ on CelebA. With a single auxiliary classifier, the main classifier shows the lowest worst-group accuracy, but the accuracy increases as $m$ grows. Beyond 40, the number of classifiers has little effect on learning the main classifier.

### 4.5 Qualitative results

Figure 4 shows the graph of the weight function in Eq. (2) along with example images corresponding to specific weights. LWBC not only up-weights the bias-conflicting samples and down-weights the bias-guiding ones but also imposes fine-grained weights according to the difficulty of the samples. For example, a majority of the training samples of the diving class on BAR are images of scuba diving or of falling into the water against a blue background, whereas the test samples of the diving class are sport images of falling into the water from a platform or springboard. As illustrated in Figure 4b, LWBC imposes fine-grained weights based on the proportion of background in the image, in addition to down-weighting scuba-diving images and up-weighting falling-from-platform images. These results demonstrate that LWBC precisely distinguishes samples according to their bias attributes.

## 5 Limitations

Self-supervised learning increases the overall training complexity, as it usually takes longer to converge than supervised learning. Also, since the committee is designed as a bootstrapped ensemble, it involves randomness in training and thus causes noticeable performance fluctuation on CelebA, although it demonstrates stable performance on the other datasets.

## 6 Conclusion

We have proposed a new method for learning a debiased classifier with a committee of auxiliary classifiers. The committee is learned in such a way that the consensus of its classifiers' predictions offers a strong and reliable cue for identifying and weighting bias-conflicting data. The main debiased classifier is then trained with an emphasis on the bias-conflicting data to reduce the effect of their bias-guiding counterparts. The committee is also trained to become debiased gradually along with the main classifier so that it highlights more challenging data as training progresses.
Moreover, we demonstrated that self-supervised learning is a solid yet unexplored baseline for debiasing. Coupled with a self-supervised feature extractor, our method achieved state of the art by large margins on most existing real-world datasets.

Acknowledgments and Disclosure of Funding. This work was supported by NRF grants and IITP grants funded by the Ministry of Education and the Ministry of Science and ICT, Korea (NRF-2021R1A2C3012728, 40%; NRF-2022R1A6A1A03052954, 20%; IITP-2022-0-00290, 30%; IITP-2019-0-01906, 10%).

## References

[1] Faruk Ahmed, Yoshua Bengio, Harm van Seijen, and Aaron Courville. Systematic generalisation with group invariant predictions. In Proc. International Conference on Learning Representations (ICLR), 2021.
[2] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[3] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In Proc. International Conference on Machine Learning (ICML), pages 528-539. PMLR, 2020.
[4] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. arXiv preprint arXiv:1904.00760, 2019.
[5] Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. RUBi: Reducing unimodal biases for visual question answering. In Proc. Neural Information Processing Systems (NeurIPS), volume 32, pages 841-852, 2019.
[6] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4069-4082, 2019.
[7] Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. Environment inference for invariant learning. In Proc. International Conference on Machine Learning (ICML), pages 2189-2200. PMLR, 2021.
[8] Luke Darlow, Stanisław Jastrzębski, and Amos Storkey. Latent adversarial debiasing: Mitigating collider bias in deep neural networks. arXiv preprint arXiv:2011.11486, 2020.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: a large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[10] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[11] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
[12] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665-673, 2020.
[13] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proc. International Conference on Learning Representations (ICLR), 2019.
[14] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Proc.
Neural Information Processing Systems (NeurIPS), 2020.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[16] Yue He, Zheyan Shen, and Peng Cui. Towards non-i.i.d. image classification: A dataset and baselines. Pattern Recognition, 110:107383, 2021.
[17] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 15262-15271, 2021.
[18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[19] Zeyi Huang, Haohan Wang, Eric P Xing, and Dong Huang. Self-challenging improves cross-domain generalization. In Proc. European Conference on Computer Vision (ECCV), pages 124-140. Springer, 2020.
[20] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Proc. Neural Information Processing Systems (NeurIPS), volume 32, pages 125-136, 2019.
[21] Sangwon Jung, Sanghyuk Chun, and Taesup Moon. Learning fair classifiers with partially annotated group labels. arXiv preprint arXiv:2111.14581, 2021.
[22] Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo Kim. Learning not to learn: Training deep neural networks with biased data. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9012-9020, 2019.
[23] Eungyeup Kim, Jihyeon Lee, and Jaegul Choo. BiaSwap: Removing dataset bias with bias-tailored swapping augmentation. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 14992-15001, 2021.
[24] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). In Proc. International Conference on Machine Learning (ICML), pages 5815-5826. PMLR, 2021.
[25] Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demographics through adversarially reweighted learning. In Proc. Neural Information Processing Systems (NeurIPS), volume 33, pages 728-740, 2020.
[26] Jungsoo Lee, Eungyeup Kim, Juyoung Lee, Jihyeon Lee, and Jaegul Choo. Learning debiased representation via disentangled feature augmentation. In Proc. Neural Information Processing Systems (NeurIPS), volume 34, 2021.
[27] Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. In Proc. Neural Information Processing Systems (NeurIPS), volume 33, pages 8847-8860, 2020.
[28] Yi Li and Nuno Vasconcelos. REPAIR: Removing representation bias by dataset resampling. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9572-9581, 2019.
[29] Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Proc. International Conference on Machine Learning (ICML), pages 6781-6792. PMLR, 2021.
[30] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
[31] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 3730-3738, 2015.
[32] Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: Training debiased classifier from biased classifier. In Proc. Neural Information Processing Systems (NeurIPS), 2020.
[33] Junhyun Nam, Jaehyung Kim, Jaeho Lee, and Jinwoo Shin. Spread spurious attribute: Improving worst-group accuracy with spurious attribute estimation. In Proc. International Conference on Learning Representations (ICLR), 2022.
[34] Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. In Proc. Neural Information Processing Systems (NeurIPS), volume 34, pages 1256-1272, 2021.
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pages 1-42, April 2015.
[36] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks. In Proc. International Conference on Learning Representations (ICLR), 2019.
[37] Nimit Sohoni, Jared Dunnmon, Geoffrey Angus, Albert Gu, and Christopher Ré. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In Proc. Neural Information Processing Systems (NeurIPS), volume 33, pages 19339-19352, 2020.
[38] Enzo Tartaglione, Carlo Alberto Barbano, and Marco Grangetto. EnD: Entangling and disentangling deep representations for bias correction. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 13508-13517, 2021.
[39] Damien Teney, Ehsan Abbasnejad, and Anton van den Hengel. Unshuffling data for improved generalization in visual question answering. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 1417-1427, 2021.
[40] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.
[41] Haohan Wang, Zexue He, Zachary C Lipton, and Eric P Xing. Learning robust representations by projecting superficial statistics out. In Proc. International Conference on Learning Representations (ICLR), 2018.
[42] Tan Wang, Chang Zhou, Qianru Sun, and Hanwang Zhang. Causal attention for unbiased visual recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 3091-3100, 2021.
[43] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proc. European Conference on Computer Vision (ECCV), pages 3-19, 2018.
[44] Zhilu Zhang and Mert R Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Proc. Neural Information Processing Systems (NeurIPS), 2018.
[45] Wei Zhu, Haitian Zheng, Haofu Liao, Weijian Li, and Jiebo Luo. Learning bias-invariant representation by cross-sample mutual information minimization. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 15002-15012, October 2021.