# Angular Visual Hardness

Beidi Chen 1, Weiyang Liu 2, Zhiding Yu 3, Jan Kautz 3, Anshumali Shrivastava 1, Animesh Garg 3 4 5, Anima Anandkumar 3 6

1Rice University, 2Georgia Institute of Technology, 3NVIDIA, 4University of Toronto, 5Vector Institute, Toronto, 6Caltech. Correspondence to: Beidi Chen, Weiyang Liu, Zhiding Yu.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Recent convolutional neural networks (CNNs) have led to impressive performance but often suffer from poor calibration. They tend to be overconfident, with the model confidence not always reflecting the underlying true ambiguity and hardness. In this paper, we propose angular visual hardness (AVH), a score given by the normalized angular distance between the sample feature embedding and the target classifier, to measure sample hardness. We validate this score with an in-depth and extensive scientific study, and observe that CNN models with the highest accuracy also have the best AVH scores. This agrees with an earlier finding that state-of-the-art models improve on the classification of harder examples. We observe that the training dynamics of AVH are vastly different from those of the training loss. Specifically, AVH quickly reaches a plateau for all samples even though the training loss keeps improving. This suggests the need to design better loss functions that target harder examples more effectively. We also find that AVH has a statistically significant correlation with human visual hardness. Finally, we demonstrate the benefit of AVH in a variety of applications such as self-training for domain adaptation and domain generalization.

## 1. Introduction

Convolutional neural networks (CNNs) have achieved great progress on many computer vision tasks such as image classification (He et al., 2016; Krizhevsky et al., 2012), face recognition (Sun et al., 2014; Liu et al., 2017b; 2018a), and scene understanding (Zhou et al., 2014; Long et al., 2015a). On certain large-scale benchmarks such as ImageNet (Deng et al., 2009), CNNs have even surpassed human-level performance. Despite this notable progress, CNNs are still far from matching human-level visual recognition in terms of robustness (Goodfellow et al., 2014; Wang et al., 2018c), adaptability (Finn et al., 2017) and few-shot generalizability (Hariharan & Girshick, 2017; Liu et al., 2019), and can suffer from various biases. For example, ImageNet-trained CNNs are reported to be biased towards textures, and these biases may result in CNNs being overconfident, or prone to domain gaps and adversarial attacks (Geirhos et al., 2019).

[Figure 1. Example images that confuse humans. Top row: images with degradation. Bottom row: images with semantic ambiguity. Panel labels: Plate Rack, Sharpness, Contrast, Blur, Dishwasher, Saltshaker, Nail, Oil Filter.]

The softmax score has been widely used as a confidence measure for CNNs, but it tends to give overconfident outputs (Guo et al., 2017; Li & Hoiem, 2018). To fix this issue, one line of work considers confidence calibration from a Bayesian point of view (Springenberg et al., 2016; Lakshminarayanan et al., 2017). Most of these methods focus on calibrating and rescaling model confidence by matching expected error or by ensembling; how well they correlate with human confidence is yet to be thoroughly studied.
On the other hand, several recent works (Liu et al., 2016; 2017c; 2018b) conjecture that softmax feature embeddings tend to naturally decouple into norms and angular distances that are related to intra-class confidence and inter-class semantic difference, respectively. Though inspiring, the conjecture lacks thorough investigation, and we make surprising observations that partially contradict it on intra-class confidence. This motivates us to conduct rigorous studies towards a reliable and semantics-related confidence measure.

Human vision is considered much more robust than current CNNs, but this does not mean humans cannot be confused.

[Figure 2. Visualization of embeddings on MNIST by setting their dimension to 2 in a CNN. Legend: digit classes 0-9; annotations: norm ‖x‖, angle θ(x, w_y), classifier weight w_y.]

Many images appear ambiguous or hard for humans due to various image degradation factors such as lighting conditions, occlusions, and visual distortions, or due to semantic ambiguity in not understanding the label category, as shown in Figure 1. It is therefore natural to consider such human ambiguity or visual hardness on images as the gold standard for confidence measures. However, explicitly encoding human visual hardness in a supervised manner is generally not feasible, since hardness scores can be highly subjective and difficult to obtain. Fortunately, a surrogate for human visual hardness was recently made available on the ImageNet validation set (Recht et al., 2019). It is based on Human Selection Frequency (HSF): the fraction of annotators in a crowd who pick an image as belonging to a certain specified category. We adopt HSF as a surrogate for human visual hardness in this paper to validate our proposed angular hardness measure in CNNs.

Contribution: Angular Visual Hardness (AVH). Given a CNN, we propose a novel score function for measuring sample hardness: the normalized angular distance between the image feature embedding and the weights of the target category (see Figure 2 for a toy example). The normalization takes into account the angular distances to the other categories. We make observations on the dynamic evolution of AVH scores during ImageNet training. We find that AVH plateaus early in training even though the training (cross-entropy) loss keeps decreasing. This is due to the parameterization of the softmax loss, whose minimization can proceed in two directions: aligning the angles between feature embeddings and classifiers, or increasing the norms of the feature embeddings. We observe two phases during training: (1) Phase 1, where the softmax improvement is primarily due to angular alignment, and later, (2) Phase 2, where the improvement is primarily due to a significant increase in feature-embedding norms.

The above findings suggest that AVH can be a robust universal measure of hardness, since angular scores are mostly frozen early in training. They also suggest the need to design loss functions, beyond the softmax loss, that improve performance on hard examples and focus on optimizing angles, e.g., (Liu et al., 2017b; Deng et al., 2019; Wang et al., 2018b;a). We verify that better models tend to have better average AVH scores, which supports the argument in (Recht et al., 2019) that improving on hard examples is the key to improved generalization.
We show that AVH has a statistically significantly stronger correlation with human selection frequency than widely used confidence measures such as the softmax score and the embedding norm, across several CNN models. This makes AVH a potential proxy for human-perceived hardness when such information is not available. Finally, we empirically show the superiority of AVH through its application to self-training for unsupervised domain adaptation and to domain generalization. With AVH as an improved confidence measure, our proposed self-training framework yields considerably improved pseudo-label selection and category estimation, leading to state-of-the-art results with significant performance gains over baselines. Our proposed new loss function based on AVH also shows drastic improvement on the task of domain generalization.

## 2. Related Work

Example hardness measures. Automatic detection of examples that are hard for human vision has numerous applications. (Recht et al., 2019) showed that state-of-the-art models perform better on hard examples. This implies that in order to improve generalization, models need to improve accuracy on hard examples. This can be achieved through various learning algorithms such as curriculum learning (Bengio et al., 2009) and self-paced learning (Kumar et al., 2010), where being able to detect hard examples is crucial. Measuring sample confidence is also important in partially supervised problems such as semi-supervised learning (Zhu; Zhou et al., 2012), unsupervised domain adaptation (Chen et al., 2011) and weakly supervised learning (Tang et al., 2017) due to their under-constrained nature. Sample hardness can also be used to identify implicit distribution imbalance in datasets to ensure fairness and remove societal biases (Buolamwini & Gebru, 2018).

Angular distance in neural networks. (Zhang et al., 2018) uses deep features to quantify the semantic difference between images, indicating that deep features contain the most crucial semantic information. It empirically shows that the angular distance between feature maps in deep neural networks is highly consistent with human judgments of semantic difference. (Liu et al., 2017c) proposes a hyperspherical neural network that constrains the parameters of neurons to a unit hypersphere and replaces the inner-product similarity with angular similarity. (Liu et al., 2018b) proposes to decouple the inner product into norm and angle, arguing that norms correspond to intra-class variation and angles correspond to inter-class semantic difference. However, that work does not perform in-depth studies to prove the conjecture. Recent research (Liu et al., 2018a; Lin et al., 2020; Liu et al., 2020) proposes an angle-based hyperspherical energy to characterize neuron diversity and improves generalization by minimizing this energy.

[Figure 3. Toy example of two overlapping Gaussian distributions (classes) on a unit sphere. Left: samples from the distributions as input to a multilayer perceptron (MLP). Middle: AVH heat map produced by the MLP, where samples in lighter colors (higher hardness) are mostly overlapping hard examples. Right: 2-norm heat map, where certain non-overlapping samples also have higher values.]

Deep model calibration. Confidence calibration aims to predict probability estimates representative of the true correctness likelihood (Guo et al., 2017).
It is well known that deep neural networks tend to be miscalibrated, and a rich body of literature tries to solve this problem (Kumar et al., 2018; Guo et al., 2017). While this line of work establishes the correlation between model confidence and prediction correctness, the connection to human confidence has not been widely studied from a training-dynamics perspective.

Uncertainty estimation. In uncertainty estimation, two types of uncertainty are often considered: (1) aleatoric uncertainty, which captures noise inherent in the observations; and (2) epistemic uncertainty, which accounts for uncertainty in the model due to limited data (Der Kiureghian & Ditlevsen, 2009). The latter is widely modeled by Bayesian inference (Kendall & Gal, 2017) and its approximation with dropout (Gal & Ghahramani, 2016; Gal et al., 2017), but often at the cost of additional computation. The fact that AVH correlates well with Human Selection Frequency indicates its underlying connection to aleatoric uncertainty, which makes it suitable for tasks such as self-training. Yet unlike Bayesian inference, AVH is naturally computed during regular softmax training, making it a drop-in uncertainty measure for most existing neural networks, obtained with only one-time training.

## 3. Discoveries in CNN Training Dynamics

Notation. Denote $\mathbb{S}^n$ as the unit $n$-sphere, i.e., $\mathbb{S}^n = \{x \in \mathbb{R}^{n+1} \mid \|x\|_2 = 1\}$. By $\mathcal{A}(\cdot, \cdot)$ we denote the angular distance between two points on $\mathbb{S}^n$, i.e., $\mathcal{A}(u, v) = \arccos\big(\tfrac{\langle u, v \rangle}{\|u\| \|v\|}\big)$. Let $x$ be the feature embedding input to the last layer of the classifier in a pretrained CNN (e.g., FC-1000 in VGG-19). Let $C$ be the number of classes in a classification task. Denote $\mathcal{W} = \{w_i \mid 0 < i \le C\}$ as the set of weights for all $C$ classes in the final layer of the classifier.

Definition 1 (Model Confidence). We define Model Confidence on a single sample as the probability score of the true class $y$ output by the CNN model: $\frac{e^{w_y^\top x}}{\sum_{i=1}^{C} e^{w_i^\top x}}$.

Definition 2 (Human Selection Frequency). We define one way to measure human visual hardness of images as Human Selection Frequency (HSF). Quantitatively, given $m$ human workers in the labeling process described in (Recht et al., 2019), if $b$ out of $m$ label an image as a particular class and that class is the target class of that image in the final dataset, then HSF is defined as $b/m$.

### 3.1. Proposal and Intuition

Definition 3 (Angular Visual Hardness). The AVH score, for any $(x, y)$, is defined as:

$$\mathrm{AVH}(x) = \frac{\mathcal{A}(x, w_y)}{\sum_{i=1}^{C} \mathcal{A}(x, w_i)},$$

where $w_y$ denotes the weights of the target class $y$ (a minimal implementation sketch follows the theoretical discussion below).

Theoretical foundations of AVH. There is theoretical support for AVH from both machine learning and vision science. On the machine learning side, we briefly discussed above that AVH is directly related to the angle between the feature embedding and the classifier weight of the ground-truth class. (Soudry et al., 2018) shows theoretically that the logit of the ground-truth class must diverge to infinity in order to minimize the cross-entropy loss to zero under gradient descent; assuming input feature embeddings have fixed unit norm, the norm of the classifier weight grows to infinity. A similar result is shown in (Wei & Ma, 2019), where the generalization error of a linear classifier is controlled by the output margins normalized by the classifier norm. Although the above analyses make certain assumptions, they indicate that the norm is a less calibrated variable than the angle for measuring properties of the model and data.
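To make Definition 3 concrete, here is a minimal NumPy sketch of the AVH computation. It assumes $x$ is the last-layer feature embedding and the classifier is bias-free, matching the definitions above; the function names are ours, not from any released code.

```python
import numpy as np

def angular_distance(u, V):
    """A(u, v) = arccos(<u, v> / (||u|| ||v||)) between vector u and each row of V."""
    cos = V @ u / (np.linalg.norm(V, axis=1) * np.linalg.norm(u) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def avh(x, W, y):
    """AVH(x) = A(x, w_y) / sum_i A(x, w_i)  (Definition 3).

    x: (d,) feature embedding input to the final classifier layer.
    W: (C, d) final-layer weight matrix, one row w_i per class.
    y: index of the target class.
    """
    angles = angular_distance(x, W)   # (C,) angular distances to all class weights
    return angles[y] / angles.sum()
```

Note that, unlike the softmax score, AVH ignores ‖x‖ entirely: scaling x by any positive constant leaves all the angles, and hence the score, unchanged.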
The conclusion that the angle is better calibrated than the norm is comprehensively validated by our experiments in Section 3. On the vision science side, extensive studies show that human vision is highly adapted for extracting structural information (Zhang et al., 2018; Wang et al., 2004), and the angular distance in AVH is precisely good at capturing such information (Liu et al., 2018b). This also justifies our angle-based design as an inductive bias towards measuring human visual hardness.

[Figure 4. Averaged training dynamics across different Human Selection Frequency levels on the ImageNet validation set. Columns from left to right: number of epochs vs. average ℓ2 norm, number of epochs vs. average AVH score, and number of epochs vs. model accuracy. Rows from top to bottom: dynamics for AlexNet, VGG-19, ResNet-50, and DenseNet-121. Curves are grouped by HSF bins [0.0, 0.2], [0.2, 0.4], [0.4, 0.6], [0.6, 0.8], [0.8, 1.0]; shaded regions in the first two columns denote the corresponding standard deviations.]

The AVH score is inspired by the observation from Figure 2, as well as (Liu et al., 2018b), that samples from each class concentrate in a convex cone in the embedding space, along with the theoretical results discussed above. Naturally, we conjecture that AVH, a measure carrying angle and margin information, could be the component of the softmax score that indicates input sample hardness. We also provide a simulation giving visual intuition of how AVH, rather than the feature embedding norm, corresponds to visually hard examples on two Gaussians in Figure 3 (simulation details and analyses in Appendix A).

### 3.2. Observations and Conjecture

Setup. We aim to observe the complete training dynamics of models trained from scratch on ImageNet, rather than pretrained models. We therefore follow the standard training process of AlexNet (Krizhevsky et al., 2012), VGG-19 (Simonyan & Zisserman, 2014), ResNet-50 (He et al., 2016) and DenseNet-121 (Huang et al., 2017). For consistency, we train all models for 90 epochs and decay the initial learning rate by a factor of 10 every 30 epochs. The initial learning rate is 0.01 for AlexNet and VGG-19, and 0.1 for DenseNet-121 and ResNet-50.
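For reference, this schedule corresponds to a standard PyTorch-style setup like the sketch below. The momentum and weight-decay values are common ImageNet defaults we assume here; they are not stated in the text above.

```python
import torch
import torchvision

# Assumed defaults: SGD with momentum 0.9 and weight decay 1e-4 (not specified above).
model = torchvision.models.resnet50(weights=None)   # trained from scratch, initial LR 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 10 every 30 epochs, for 90 epochs total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... one standard ImageNet training epoch ...
    scheduler.step()
```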
We split all the validation images into 5 bins, [0.0, 0.2], [0.2, 0.4], [0.4, 0.6], [0.6, 0.8], [0.8, 1.0], based on their HSF. In Appendix B, we further provide experimental results on other datasets, such as MNIST, CIFAR-10/100, and degraded ImageNet with different contrast or noise levels, to further validate our proposal. In all figures in this section, epochs start from 1.

Optimization algorithms update the internal parameters of a model (its weights and biases) to improve the training loss. Both the angles between the feature embedding and the classifier weights and the $\ell_2$ norm of the embedding can influence the loss. While it is well known that the training loss and accuracy keep improving, it is not obvious how the angles and norms evolve separately during training. We therefore design experiments to observe the training dynamics of various network architectures.

Observation 1: The norm of feature embeddings keeps increasing during training. The first column of Figure 4 presents the dynamics of the average $\|x\|_2$ over 90 epochs of training, for validation samples grouped by HSF range and across different network architectures. Note that we use the validation data only to observe the dynamics; these samples are never used for training. The average $\|x\|_2$ increases with a small initial slope but climbs suddenly after 30 epochs, when the first learning rate decay happens. The accuracy curve is very similar to that of the average $\|x\|_2$. These observations are consistent across all models and compatible with (Soudry et al., 2018), although that work concerns the norm of the classifier weights. More interestingly, we find that neural networks with shortcuts (e.g., ResNets and DenseNets) tend to equalize the norms of images with different HSF, while neural networks without shortcuts (e.g., AlexNet and VGG) tend to keep the norm gap among images with different human visual hardness.

Observation 2: AVH hits a plateau very early, even while the accuracy or loss is still improving. The middle column of Figure 4 shows the change of the average AVH for validation samples over 90 epochs of training. The average AVH for AlexNet and VGG-19 decreases sharply at the beginning and then bounces back slightly before converging. The dynamics of the average AVH for DenseNet-121 and ResNet-50 are different: both decrease slightly and then quickly hit a plateau in all three learning-rate-decay stages. The common observation, however, is that AVH stops improving even while $\|x\|_2$ and model accuracy keep increasing. AVH is more important than $\|x\|_2$ in the sense that it is the key factor deciding which class an input sample is classified to. However, optimizing the norm under the softmax cross-entropy loss is easier, which causes the plateau of angles for easy examples. The plateau for hard examples, in contrast, can be caused by the limitation of the model itself; we give a simple illustration in Appendix C. This shows the necessity of designing loss functions that focus on optimizing angles.

Observation 3: AVH's correlation with Human Selection Frequency holds consistently across models throughout the training process. In Figure 4, we average over validation samples in the five HSF bins (or five degradation-level bins) separately, and then compute the average embedding norm, the average AVH, and the model accuracy (a sketch of this measurement follows below).
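The per-bin measurement can be sketched as follows. The helper names here (`model.features`, `model.fc`, a loader yielding `(images, labels, hsf)`) are our assumptions for illustration; the AVH computation reuses Definition 3.

```python
import torch
import torch.nn.functional as F

HSF_BINS = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]

@torch.no_grad()
def snapshot(model, loader):
    """Per-epoch averages of ||x||_2 and AVH for each HSF bin (cf. Figure 4).

    Assumes model.features(images) returns last-layer embeddings of shape
    (B, d), model.fc.weight holds the classifier weights (C, d), and the
    loader yields (images, labels, hsf) with per-sample HSF values.
    """
    sums = {b: [0.0, 0.0, 0] for b in HSF_BINS}  # per bin: [norm sum, AVH sum, count]
    W = F.normalize(model.fc.weight, dim=1)      # unit-norm class weights (C, d)
    for images, labels, hsf in loader:
        x = model.features(images)                        # (B, d)
        cos = F.normalize(x, dim=1) @ W.t()               # cosine to every class weight
        ang = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))  # A(x, w_k), shape (B, C)
        avh = ang.gather(1, labels[:, None]).squeeze(1) / ang.sum(dim=1)
        norm = x.norm(dim=1)
        for lo, hi in HSF_BINS:
            m = (hsf >= lo) & ((hsf <= hi) if hi == 1.0 else (hsf < hi))
            if m.any():
                s = sums[(lo, hi)]
                s[0] += norm[m].sum().item()
                s[1] += avh[m].sum().item()
                s[2] += int(m.sum())
    return {b: (s[0] / s[2], s[1] / s[2]) for b, s in sums.items() if s[2]}
```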
We observe that for $\|x\|_2$, the gaps between samples with different human visual hardness are not obvious in ResNet and DenseNet, while they are quite obvious in AlexNet and VGG. For AVH, however, the gaps are significant and consistent across every network architecture during the entire training process. Interestingly, even when the network is far from converged, these AVH gaps are already consistent across different HSF levels, and the norm gaps (where present) are consistent as well. The intuition behind this could be that the angles for hard examples are much harder to decrease and probably never reach the region of correct classification; the corresponding norms therefore do not increase, since doing so would otherwise hurt the loss. This validates that AVH is a consistent and robust measure of visual hardness (and even of generalization).

Observation 4: AVH is an indicator of a model's generalization ability. From Figure 4, we observe that better models (i.e., those with higher accuracy) have lower average AVH throughout the training process, and also across samples of different human visual hardness. For instance, AlexNet is the weakest model, and both its overall average AVH and its average AVH on each of the five bins are higher than those of the other three models. In addition, when testing Hypothesis 3 on better models, we found that their AVH correlations with HSF are significantly stronger than the corresponding correlations of Model Confidence. These observations align with the earlier finding of (Recht et al., 2019) that better models also tend to generalize better on samples across different levels of human visual hardness. In addition, AVH is potentially a better measure of the generalization of a pretrained model. As shown in (Liu et al., 2017b), the norms of feature embeddings are often related to training-data priors such as data imbalance and class granularity (Krizhevsky et al., 2012).

[Figure 5. Left: HSF vs. AVH(x), showing a strong correlation. Middle: HSF vs. Model Confidence for ResNet-50; unsurprisingly, the density is highest in the top-right corner. Right: HSF vs. ‖x‖₂, with no obvious correlation. Colors indicate the density of samples in each bin.]

Table 1. Spearman's rank correlation coefficients between HSF and AVH, Model Confidence, and the ℓ2 norm of the embedding in ResNet-50, for different visual-hardness bins of samples. We show the absolute value of each coefficient, which represents the strength of the correlation. For example, [0, 0.2] denotes the samples with HSF from 0 to 0.2.

| | z-score | Total Coef | [0, 0.2] | [0.2, 0.4] | [0.4, 0.6] | [0.6, 0.8] | [0.8, 1.0] |
|---|---|---|---|---|---|---|---|
| Number of Samples | - | 29987 | 837 | 2732 | 6541 | 11066 | 8811 |
| AVH | 0.377 | 0.360 | 0.228 | 0.125 | 0.124 | 0.103 | 0.094 |
| Model Confidence | 0.337 | 0.325 | 0.192 | 0.122 | 0.102 | 0.078 | 0.056 |
| ‖x‖₂ | - | 0.0017 | 0.0013 | 0.0007 | 0.0005 | 0.0004 | 0.0003 |

However, when extracting features from unseen classes that do not exist in the training set, such a training-data prior is often undesired. Since AVH does not consider feature-embedding norms, it potentially presents a better measure of the open-set generalization of a deep network.

Conjecture on the training dynamics of CNNs.
From Figure 4 and the observations above, we conjecture that the training of a CNN has two phases. 1) At the beginning of training, the softmax cross-entropy loss first optimizes the angles among different classes, while the norm fluctuates and increases very slowly. We argue that this is because changing the norm does not decrease the loss while the angles are not yet separated enough for correct classification. As a result, the angles get optimized first. 2) As training continues, the angles become stable and change very slowly, while the norm increases rapidly. For easy examples, this is because once the angles have decreased enough for correct classification, the softmax cross-entropy loss can be further minimized purely by increasing the norm. For hard examples, the plateau is caused by the CNN being unable to decrease the angle enough to classify them correctly, and thereby also unable to increase their norms (because doing so could otherwise increase the loss).

## 4. Connections to Human Visual Hardness

From Section 3.2, we conjecture that AVH has a strong correlation with Human Selection Frequency, a reflection of human visual hardness that is related to aleatoric uncertainty. To validate this claim, we design statistical tests for the connections between Model Confidence, AVH, $\|x\|_2$, and HSF. Studying the precise connection or gap between human visual hardness and model uncertainty is usually prohibitive, because collecting such highly subjective human annotations is laborious. In addition, these annotations are application- or dataset-specific, which significantly reduces the scalability of uncertainty estimation models directly supervised by them. This is yet another motivation for this work, since AVH is obtained for free, without any confidence supervision. In our case, we only leverage the human-annotated visual hardness measure for correlation testing. In this section, we present four hypotheses and test them accordingly.

Hypothesis 1. AVH has a correlation with Human Selection Frequency. Outcome: Null Hypothesis Rejected.

We use the pretrained network model to extract the feature embedding x from each validation sample and use the class weights w to compute AVH(x). Note that we linearly scale the range of AVH(x) to [0, 1]. Table 1 shows a consistent and strong correlation between AVH(x) and HSF (p-value < 0.001 rejects the null hypothesis). From the coefficients in different bins of sample hardness, we can see that the harder the sample, the weaker the correlation. Note also that we validated the results across different CNN architectures and found that better models tend to have higher coefficients. The left plot in Figure 5 indicates the strong correlation between AVH(x) and HSF on validation images. One intuition behind this correlation is that the class weights W might correspond to human-perceived semantics for each
category, and thereby AVH(x) corresponds to a human's semantic categorization of an image. To test whether the strong correlation holds for all models, we perform the same set of experiments on different backbones, including AlexNet, VGG-19 and DenseNet-121.

Table 2. Spearman's rank, Pearson, and Kendall's Tau correlation coefficients between Human Selection Frequency and AVH / Model Confidence on ResNet-50, along with significance tests between coefficient pairs. A p-value < 0.05 indicates that the result is statistically significant.

| Type | Coef with AVH | Coef with Model Confidence | Z_avh | Z_mc | Z value | p-value |
|---|---|---|---|---|---|---|
| Spearman's rank | 0.360 | 0.325 | 0.377 | 0.337 | 4.85 | < .00001 |
| Pearson | 0.385 | 0.341 | 0.406 | 0.355 | 6.2 | < .00001 |
| Kendall's Tau | 0.257 | 0.231 | 0.263 | 0.235 | 3.38 | .0003 |

Hypothesis 2. Model Confidence has a correlation with Human Selection Frequency. Outcome: Null Hypothesis Rejected.

An interesting observation in (Recht et al., 2019) is that HSF strongly influences Model Confidence; specifically, examples with low HSF tend to have relatively low Model Confidence. Naturally, we examine whether the correlation between Model Confidence and HSF is strong. All ImageNet validation images are evaluated by the pretrained models, and the corresponding output is simply the Model Confidence on each image. From Table 1, since the p-value is < 0.001, Model Confidence does have a significant correlation with HSF. However, the correlation coefficient for Model Confidence and HSF is consistently lower than that of AVH and HSF. The middle plot in Figure 5 presents a two-dimensional histogram visualizing this correlation: the x-axis represents HSF, the y-axis represents Model Confidence, and each bin shows the number of images in the corresponding range. We observe high density in the top-right corner, meaning that the majority of images have both high human and high model accuracy. However, there is considerable density in the range of medium human accuracy but either extremely low or extremely high model accuracy. Since the difference between the two correlation coefficients is not large, our next step is to test whether the difference is statistically significant.

Hypothesis 3. AVH has a stronger correlation with Human Selection Frequency than Model Confidence. Outcome: Null Hypothesis Rejected.

There are three steps in testing whether two correlation coefficients are significantly different (sketched in code below). The first step is to apply the Fisher z-transformation to both coefficients; this transforms the sampling distribution of a correlation coefficient so that it becomes approximately normally distributed. Applying it to each coefficient, the z-score for AVH becomes 0.377 and that for Model Confidence becomes 0.337. The second step is to compute the Z value from the two z-scores and the sample sizes, which gives Z = 4.85. The last step is to look up the p-value in the Z table, which gives p < 0.00001. We therefore reject the null hypothesis and conclude that AVH has a statistically significantly stronger correlation with HSF than Model Confidence. In Section 5.1, we also empirically show that this stronger correlation brings cumulative advantages in some applications. In Table 2, besides the Spearman correlation coefficient, we also show the Pearson and Kendall's Tau coefficients. In addition, in Appendix D, we run the same tests on four different architectures to check whether the same conclusion holds for different models.
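The following is a sketch of this three-step test using SciPy. We assume aligned per-sample arrays and, as in the simple Fisher test, treat the two coefficients as if they came from independent samples.

```python
import numpy as np
from scipy import stats

def compare_correlations(hsf, avh_scores, model_conf):
    """Hypothesis-3 style test: is corr(HSF, AVH) significantly stronger
    than corr(HSF, Model Confidence)?  All inputs are 1-D arrays over the
    same n validation samples."""
    n = len(hsf)
    r_avh, _ = stats.spearmanr(hsf, avh_scores)
    r_mc, _ = stats.spearmanr(hsf, model_conf)
    r_avh, r_mc = abs(r_avh), abs(r_mc)
    # Step 1: Fisher z-transformation, z = arctanh(r), makes the sampling
    # distribution of each coefficient approximately normal.
    z_avh, z_mc = np.arctanh(r_avh), np.arctanh(r_mc)
    # Step 2: Z value of the difference; each z-score has variance 1/(n - 3).
    Z = (z_avh - z_mc) / np.sqrt(2.0 / (n - 3))
    # Step 3: one-sided p-value from the standard normal distribution.
    p = stats.norm.sf(Z)
    return z_avh, z_mc, Z, p
```

Plugging in the Table 2 values (r = 0.360 and 0.325, n = 29987) reproduces the z-scores 0.377 and 0.337 and gives Z ≈ 4.9, consistent with the reported Z = 4.85 given the rounded coefficients.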
Our conclusion is that for all the considered models, AVH correlates significantly more strongly than Model Confidence, and the correlation is even stronger for better models. This indicates that, besides what we showed in Section 3, AVH is also better aligned with human visual hardness, which is related to aleatoric uncertainty.

Hypothesis 4. $\|x\|_2$ has a correlation with Human Selection Frequency. Outcome: Failure to Reject Null Hypothesis.

(Liu et al., 2018b) conjectures that $\|x\|_2$ accounts for intra-class human/model confidence; in particular, a larger norm implies, to some extent, a more confident model prediction. We therefore conduct experiments similar to those in the previous sections to probe the correlation between $\|x\|_2$ and HSF. First, we compute $\|x\|_2$ for every validation sample for all models; then we normalize $\|x\|_2$ within each class. Table 1 presents the results of the correlation test. We omit the p-values from the table and report here that they are all much higher than 0.05, indicating no correlation between $\|x\|_2$ and HSF. The right plot in Figure 5 uses a two-dimensional histogram to show the correlation over all validation images. Given that the norm is normalized within each class, there is naturally notable density at norm values 0 and 1; apart from that, there is no obvious correlation between $\|x\|_2$ and HSF. We also provide a detailed discussion of the difference between AVH and Model Confidence in Appendix E.

## 5. Applications

### 5.1. AVH for Self-training and Domain Adaptation

Unsupervised domain adaptation (Ben-David et al., 2010) is an important transfer learning problem, and deep self-training (Lee, 2013) recently emerged as a powerful framework for it (Saito et al., 2017a; Shu et al., 2018; Zou et al., 2018; 2019). Here we show that AVH, as an improved confidence measure in self-training, can significantly benefit domain adaptation.

Dataset: We conduct experiments on the VisDA-17 dataset (Peng et al., 2017), a widely used benchmark for domain adaptation in image classification. The dataset contains a total of 152,409 2D synthetic images from 12 categories in the source training set, and 55,400 real images from MS-COCO (Lin et al., 2014) with the same set of categories as the target-domain validation set. We follow the protocol of previous works: we train a source model on the synthetic training set and report the model performance on the target validation set after adaptation.

Baseline: We use class-balanced self-training (CBST) (Zou et al., 2018) as a state-of-the-art self-training baseline. We also compare our model with confidence-regularized self-training (CRST)¹ (Zou et al., 2019), a more recent framework that improves over CBST by regularizing the network predictions/pseudo-labels with smoothness. Our work follows the exact implementation of CBST/CRST. Given the labeled source-domain training set $x_s \in X_S$ and the unlabeled target-domain data $x_t \in X_T$, with known source labels $y_s = (y_s^{(1)}, \ldots, y_s^{(K)}) \in Y_S$ and unknown target labels $\hat{y}_t = (\hat{y}_t^{(1)}, \ldots, \hat{y}_t^{(K)}) \in \hat{Y}_T$ over $K$ classes, CBST performs joint network learning and pseudo-label estimation by treating pseudo-labels as discrete learnable latent variables, with the following loss:

$$\min_{w, \hat{Y}_T} \mathcal{L}_{CB}(w, \hat{Y}) = -\sum_{s \in S} \sum_{k=1}^{K} y_s^{(k)} \log p(k|x_s; w) - \sum_{t \in T} \sum_{k=1}^{K} \hat{y}_t^{(k)} \log \frac{p(k|x_t; w)}{\lambda_k}, \quad \text{s.t. } \hat{y}_t \in E^K \cup \{0\}, \ \forall t, \qquad (2)$$
where the feasible set of pseudo-labels is the union of $\{0\}$ and the $K$-dimensional one-hot vector space $E^K$, and $w$ and $p(k|x; w)$ denote the network weights and the classifier's softmax probability for class $k$, respectively. In addition, $\lambda_k$ serves as a class-balancing parameter controlling the pseudo-label selection of class $k$, and is determined by the softmax confidence ranked at portion $p$ (in descending order) among samples predicted to class $k$. Therefore, only one parameter $p$ determines all $\lambda_k$'s. The optimization problem in (2) can be solved by alternately minimizing with respect to $w$ and $\hat{Y}$, and the solver for $\hat{Y}$ can be written as:

$$\hat{y}_t^{(k)*} = \begin{cases} 1, & \text{if } k = \arg\max_c \big\{ \frac{p(c|x_t; w)}{\lambda_c} \big\} \text{ and } p(k|x_t; w) > \lambda_k \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

The optimization with respect to $w$ is simply network re-training with source labels and estimated pseudo-labels. The complete self-training process alternates between network re-training and pseudo-label estimation.

¹We consider MRKLD+LRENT, which is reported to be the best variant in (Zou et al., 2019).

CBST+AVH: We seek to improve the pseudo-label solver with the better confidence measure given by AVH. We propose the following definition of angular visual confidence (AVC) to represent the predicted probability of class $c$:

$$\mathrm{AVC}(c|x; w) = \frac{\pi - \mathcal{A}(x, w_c)}{\sum_{k=1}^{K} \big(\pi - \mathcal{A}(x, w_k)\big)},$$

and pseudo-label estimation in CBST+AVH is defined as:

$$\hat{y}_t^{(k)*} = \begin{cases} 1, & \text{if } k = \arg\max_c \big\{ \frac{p(c|x_t; w)}{\lambda_c} \big\} \text{ and } \mathrm{AVC}(k|x_t; w) > \beta_k \\ 0, & \text{otherwise,} \end{cases} \qquad (4)$$

where $p(k|x_t; w)$ is the softmax output for $x_t$. $\lambda_k$ and $\beta_k$ are determined by ranking $p(k|x_t; w)$ and $\mathrm{AVC}(k|x_t; w)$, respectively, at a particular portion among samples predicted to class $k$, following the same definition of $\lambda_k$ in CBST. In addition, network re-training in CBST+AVH follows the softmax self-training loss in (2). One can see that AVH changes the self-training behavior through the improved pseudo-label selection in (4), namely the condition $\mathrm{AVC}(k|x_t; w) > \beta_k$, which determines, based on AVC, which samples are not ignored during self-training. With an improved confidence measure that better resembles human visual hardness, this is likely to influence the final performance of self-training.

Experimental Results: We present the results of the proposed method in Table 3, and also show its performance across self-training epochs in Figure 6. CBST+AVH outperforms both CBST and CRST by a very significant margin. We emphasize that this is a compelling result under an apples-to-apples comparison with the same source model, implementation and hyper-parameters.

Analysis: A major challenge of self-training is the amplification of error due to misclassified pseudo-labels. Therefore, traditional self-training methods such as CBST often use model confidence to select confidently labeled examples, in the hope that higher confidence implies a lower error rate. While this generally proves useful, the model tends to focus on the less informative samples, while ignoring the more informative, harder ones near classifier boundaries that can be essential for learning a better classifier. More details are in Appendix F.
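Below is a small NumPy sketch of the AVC computation and the pseudo-label solver (4). The array layout and the use of -1 for ignored samples are our illustrative choices, and the thresholds `lam` and `beta` are assumed to be precomputed by the class-wise ranking procedure described above.

```python
import numpy as np

def avc(angles):
    """AVC(c|x; w): per-class angular visual confidence from a (B, K) matrix
    of angular distances A(x, w_k); each row sums to 1."""
    conf = np.pi - angles
    return conf / conf.sum(axis=1, keepdims=True)

def pseudo_labels(softmax_p, avc_p, lam, beta):
    """Pseudo-label solver of CBST+AVH, eq. (4).

    softmax_p, avc_p: (B, K) per-sample softmax probabilities / AVC scores.
    lam, beta: (K,) class-wise thresholds from the ranking procedure.
    Returns (B,) labels, with -1 marking ignored samples (the {0} vector).
    """
    k = np.argmax(softmax_p / lam, axis=1)           # class-balanced prediction
    keep = avc_p[np.arange(len(k)), k] > beta[k]     # AVC-based selection
    return np.where(keep, k, -1)
```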
Table 3. Class-wise and mean classification accuracies (%) on VisDA-17.

| Method | Aero | Bike | Bus | Car | Horse | Knife | Motor | Person | Plant | Skateboard | Train | Truck | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source (Saito et al., 2018) | 55.1 | 53.3 | 61.9 | 59.1 | 80.6 | 17.9 | 79.7 | 31.2 | 81.0 | 26.5 | 73.5 | 8.5 | 52.4 |
| MMD (Long et al., 2015b) | 87.1 | 63.0 | 76.5 | 42.0 | 90.3 | 42.9 | 85.9 | 53.1 | 49.7 | 36.3 | 85.8 | 20.7 | 61.1 |
| DANN (Ganin et al., 2016) | 81.9 | 77.7 | 82.8 | 44.3 | 81.2 | 29.5 | 65.1 | 28.6 | 51.9 | 54.6 | 82.8 | 7.8 | 57.4 |
| ENT (Grandvalet & Bengio, 2005) | 80.3 | 75.5 | 75.8 | 48.3 | 77.9 | 27.3 | 69.7 | 40.2 | 46.5 | 46.6 | 79.3 | 16.0 | 57.0 |
| MCD (Saito et al., 2017b) | 87.0 | 60.9 | 83.7 | 64.0 | 88.9 | 79.6 | 84.7 | 76.9 | 88.6 | 40.3 | 83.0 | 25.8 | 71.9 |
| ADR (Saito et al., 2018) | 87.8 | 79.5 | 83.7 | 65.3 | 92.3 | 61.8 | 88.9 | 73.2 | 87.8 | 60.0 | 85.5 | 32.3 | 74.8 |
| Source (Zou et al., 2019) | 68.7 | 36.7 | 61.3 | 70.4 | 67.9 | 5.9 | 82.6 | 25.5 | 75.6 | 29.4 | 83.8 | 10.9 | 51.6 |
| CBST (Zou et al., 2019) | 87.2 | 78.8 | 56.5 | 55.4 | 85.1 | 79.2 | 83.8 | 77.7 | 82.8 | 88.8 | 69.0 | 72.0 | 76.4 |
| CRST (Zou et al., 2019) | 88.0 | 79.2 | 61.0 | 60.0 | 87.5 | 81.4 | 86.3 | 78.8 | 85.6 | 86.6 | 73.9 | 68.8 | 78.1 |
| Proposed | 93.3 | 80.2 | 78.9 | 60.9 | 88.4 | 89.7 | 88.9 | 79.6 | 89.5 | 86.8 | 81.5 | 60.0 | 81.5 |

[Figure 6. Adaptation accuracy vs. epoch for the compared methods (CBST+AVH, CRST, CBST) on VisDA-17.]

Table 4. Statistics of the examples selected by CBST+AVH and CBST/CRST.

| Method | TP Rate | AVH (avg) | Model Confidence | Norm ‖x‖ |
|---|---|---|---|---|
| CBST+AVH | 0.844 | 0.118 | 0.961 | 20.84 |
| CBST/CRST | 0.848 | 0.117 | 0.976 | 21.28 |

An advantage we observe from AVH is that the improved calibration leads to more frequent sampling of harder examples, on which the pseudo-label classification generally outperforms the softmax results. Table 4 shows the statistics of examples selected with AVH and with model confidence, respectively, at the beginning of the training process. The true positive rate (TP Rate) for CBST+AVH remains similar to that of CBST/CRST, indicating that AVH overall does not introduce additional noise compared to model confidence. On the other hand, the average model confidence of AVH-selected samples is lower, indicating more selected hard samples closer to the decision boundary. The average sample norm selected by AVH is also lower, confirming the influence of sample norms on the final model confidence.

### 5.2. AVH-based Loss for Domain Generalization

The problem of domain generalization (DG) is to learn from multiple training domains and extract a domain-agnostic model that can then be applied to an unseen domain. Since we make no assumption on what the unseen domain looks like, generalization to unseen domains mostly depends on the generalizability of the neural network. We use the challenging PACS dataset (Li et al., 2017), which consists of the Art painting, Cartoon, Photo and Sketch domains. For each domain, we leave it out as the test set and train our models on the remaining three domains. Specifically, we train a 10-layer plain CNN with the following AVH-based loss (additional details in Appendix F):

$$\mathcal{L}_{AVH} = -\sum_i \log \frac{e^{s \, (\pi - \mathcal{A}(x_i, w_{y_i}))}}{\sum_{k=1}^{K} e^{s \, (\pi - \mathcal{A}(x_i, w_k))}},$$

where $s$ is a hyperparameter that adjusts the scale of the output logits and implicitly controls the optimization difficulty. This hyperparameter is typically set by cross-validation.
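A minimal PyTorch sketch of this loss follows, assuming the reconstructed form above with logits $s(\pi - \mathcal{A}(x_i, w_k))$; the default scale and the weight initialization are our illustrative choices, not values from the paper.

```python
import math
import torch
import torch.nn.functional as F

class AVHLoss(torch.nn.Module):
    """AVH-based classification loss: softmax cross-entropy over angle-only
    logits s * (pi - A(x, w_k)), ignoring feature and weight norms."""

    def __init__(self, feat_dim, num_classes, s=10.0):
        super().__init__()
        self.s = s  # scale hyperparameter, typically set by cross-validation
        self.weight = torch.nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, x, target):
        # Cosine between each embedding and each class weight, then the angle.
        cos = F.normalize(x, dim=1) @ F.normalize(self.weight, dim=1).t()
        angles = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))  # A(x_i, w_k)
        logits = self.s * (math.pi - angles)                  # larger when angle is small
        return F.cross_entropy(logits, target)
```

Because the logits depend only on angles, the loss cannot be reduced by inflating embedding norms, which is exactly the failure mode of the softmax loss identified in Section 3.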
Experimental results are reported in Table 5.

Table 5. Domain generalization accuracy (%) on the PACS dataset.

| Method | Painting | Cartoon | Photo | Sketch | Avg |
|---|---|---|---|---|---|
| AlexNet (Li et al., 2017) | 62.86 | 66.97 | 89.50 | 57.51 | 69.21 |
| MLDG (Li et al., 2018) | 66.23 | 66.88 | 88.00 | 58.96 | 70.01 |
| MetaReg (Balaji et al., 2018) | 69.82 | 70.35 | 91.07 | 59.26 | 72.62 |
| Feature-critic (Li et al., 2019) | 64.89 | 71.72 | 89.94 | 61.85 | 72.10 |
| Baseline CNN-10 | 66.46 | 67.88 | 89.70 | 51.72 | 68.94 |
| CNN-10 + AVH | 72.02 | 66.42 | 90.12 | 61.26 | 72.46 |

With the proposed new loss, which has a directly AVH-based design, a simple CNN outperforms the baseline and recent methods based on more complex models. In fact, similar learning objectives have also been shown useful in image recognition (Liu et al., 2017c) and face recognition (Wang et al., 2017; Ranjan et al., 2017), indicating that AVH is generally effective for improving generalization in various tasks.

## 6. Concluding Remarks

We propose a novel measure for CNN models, Angular Visual Hardness. Our comprehensive empirical studies show that AVH can serve as an indicator of the generalization ability of neural networks, and that improving state-of-the-art accuracy entails improving accuracy on hard examples. AVH also has a significantly stronger correlation with Human Selection Frequency than model confidence. We empirically show the advantage of AVH over Model Confidence in self-training for domain adaptation and as a loss function for domain generalization. AVH can also be useful in other applications such as deep metric learning, fairness and knowledge transfer, which we plan to investigate in the future (discussions in Appendix G).

## Acknowledgements

Work done during an internship at NVIDIA. We would like to thank Shiyu Liang, Yue Zhu and Yang Zou for the valuable discussions that enlightened our research. We are also grateful to the anonymous reviewers for their constructive comments that significantly helped to improve our paper. Weiyang Liu is partially supported by a Baidu scholarship and an NVIDIA GPU grant. This work was supported by NSF-1652131, NSF-BIGDATA 1838177, AFOSR-YIP FA9550-18-1-0152, an Amazon Research Award, and an ONR BRC grant for Randomized Numerical Linear Algebra.

## References

Balaji, Y., Sankaranarayanan, S., and Chellappa, R. MetaReg: Towards domain generalization using meta-regularization. In NeurIPS, 2018.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1-2):151-175, 2010.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In ICML, 2009.

Berardino, A., Ballé, J., Laparra, V., and Simoncelli, E. P. Eigen-distortions of hierarchical representations, 2017.

Buolamwini, J. and Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77-91, 2018.

Chen, M., Weinberger, K. Q., and Blitzer, J. Co-training for domain adaptation. In NeurIPS, pp. 2456-2464, 2011.

Dekel, R. Human perception in computer vision, 2017.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Deng, J., Guo, J., Xue, N., and Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 2019.

Der Kiureghian, A. and Ditlevsen, O. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105-112, 2009.

Dodge, S. and Karam, L. A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), 2017.

Fellbaum, C. WordNet and wordnets. 2005.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In NeurIPS, 2017.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. JMLR, 2016.

Geirhos, R., Temme, C. R., Rauber, J., Schütt, H. H., Bethge, M., and Wichmann, F. A. Generalisation in humans and deep neural networks. In NeurIPS, pp. 7538-7550, 2018.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. 2019.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In NeurIPS, 2005.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In ICML, pp. 1321-1330, 2017.

Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.

Hariharan, B. and Girshick, R. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770-778, 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, pp. 4700-4708, 2017.

Kendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, 2017.

Kheradpisheh, S. R., Ghodrati, M., Ganjtabesh, M., and Masquelier, T. Deep networks can resemble human feed-forward vision in invariant object recognition. Scientific Reports, 6(1), 2016.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097-1105, 2012.

Kumar, A., Sarawagi, S., and Jain, U. Trainable calibration measures for neural networks from kernel mean embeddings. In ICML, pp. 2810-2819, 2018.

Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In NeurIPS, 2010.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, pp. 6402-6413, 2017.

Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.

Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Deeper, broader and artier domain generalization. In ICCV, pp. 5542-5550, 2017.

Li, Y., Yang, Y., Zhou, W., and Hospedales, T. M. Feature-critic networks for heterogeneous domain generalization.
arXiv preprint arXiv:1901.11448, 2019.

Li, Z. and Hoiem, D. Reducing over-confident errors outside the known distribution. arXiv preprint arXiv:1804.03166, 2018.

Li, Z., Xu, Z., Ramamoorthi, R., Sunkavalli, K., and Chandraker, M. Learning to reconstruct shape and spatially-varying reflectance from a single image. In SIGGRAPH Asia 2018 Technical Papers, pp. 269. ACM, 2018.

Lin, R., Liu, W., Liu, Z., Feng, C., Yu, Z., Rehg, J. M., Xiong, L., and Song, L. Regularizing neural networks via minimizing hyperspherical energy. In CVPR, 2020.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In ECCV, pp. 740-755, 2014.

Lindsay, P. H. and Norman, D. A. Human information processing: An introduction to psychology. Academic Press, 2013.

Liu, W., Wen, Y., Yu, Z., and Yang, M. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.

Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L. B., Rehg, J. M., and Song, L. Iterative machine teaching. In ICML, 2017a.

Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017b.

Liu, W., Zhang, Y.-M., Li, X., Yu, Z., Dai, B., Zhao, T., and Song, L. Deep hyperspherical learning. In NeurIPS, 2017c.

Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., and Song, L. Learning towards minimum hyperspherical energy. In NeurIPS, 2018a.

Liu, W., Liu, Z., Yu, Z., Dai, B., Lin, R., Wang, Y., Rehg, J. M., and Song, L. Decoupled networks. In CVPR, 2018b.

Liu, W., Liu, Z., Rehg, J. M., and Song, L. Neural similarity learning. In NeurIPS, 2019.

Liu, W., Lin, R., Liu, Z., Rehg, J. M., Xiong, L., Weller, A., and Song, L. Orthogonal over-parameterized training. arXiv preprint arXiv:2004.04690, 2020.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In CVPR, 2015a.

Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In ICML, 2015b.

Martin Cichy, R., Khosla, A., Pantazis, D., and Oliva, A. Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks. NeuroImage, 153:346-358, 2017.

Oh Song, H., Xiang, Y., Jegelka, S., and Savarese, S. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.

Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., and Saenko, K. VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.

Pramod, R. T. and Arun, S. P. Do computational models differ systematically from human object perception? In CVPR, 2016.

Ranjan, R., Castillo, C. D., and Chellappa, R. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? arXiv preprint arXiv:1902.10811, 2019.

Saito, K., Ushiku, Y., and Harada, T. Asymmetric tri-training for unsupervised domain adaptation. In ICML, 2017a.

Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. 2017b.

Saito, K., Ushiku, Y., Harada, T., and Saenko, K. Adversarial dropout regularization. In ICLR, 2018.
Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.

Shu, R., Bui, H. H., Narui, H., and Ermon, S. A DIRT-T approach to unsupervised domain adaptation. 2018.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Sohn, K. Improved deep metric learning with multi-class N-pair loss objective. In NeurIPS, 2016.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. JMLR, 19(1):2822-2878, 2018.

Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. Bayesian optimization with robust Bayesian neural networks. In NeurIPS, pp. 4134-4142, 2016.

Sun, Y., Wang, X., and Tang, X. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.

Tang, P., Wang, X., Bai, X., and Liu, W. Multiple instance detection network with online instance classifier refinement. In CVPR, 2017.

Wang, F., Xiang, X., Cheng, J., and Yuille, A. L. NormFace: L2 hypersphere embedding for face verification. In ACM MM, 2017.

Wang, F., Liu, W., Liu, H., and Cheng, J. Additive margin softmax for face verification. arXiv preprint arXiv:1801.05599, 2018a.

Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., and Liu, W. CosFace: Large margin cosine loss for deep face recognition. In CVPR, 2018b.

Wang, Y., Liu, W., Ma, X., Bailey, J., Zha, H., Song, L., and Xia, S.-T. Iterative learning with open-set noisy labels. In CVPR, 2018c.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, 2004.

Wei, C. and Ma, T. Improved sample complexities for deep networks and robust classification via an all-layer margin. arXiv preprint arXiv:1910.04284, 2019.

Wu, C.-Y., Manmatha, R., Smola, A. J., and Krahenbuhl, P. Sampling matters in deep embedding learning. In ICCV, 2017.

Yi, D., Lei, Z., Liao, S., and Li, S. Z. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586-595, 2018.

Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. Learning deep features for scene recognition using places database. In NeurIPS, 2014.

Zhou, Y., Kantarcioglu, M., and Thuraisingham, B. Self-training with selection-by-rejection. In ICDM, pp. 795-803, 2012.

Zhu, X. Semi-supervised learning tutorial.

Zou, Y., Yu, Z., Kumar, B., and Wang, J. Domain adaptation for semantic segmentation via class-balanced self-training. arXiv preprint arXiv:1810.07911, 2018.

Zou, Y., Yu, Z., Liu, X., Kumar, B., and Wang, J. Confidence regularized self-training. arXiv preprint arXiv:1908.09822, 2019.