# From Label Smoothing to Label Relaxation

Julian Lienen, Eyke Hüllermeier
Heinz Nixdorf Institute and Department of Computer Science, Paderborn University, 33098 Paderborn, Germany
{julian.lienen,eyke}@upb.de

## Abstract

Regularization of (deep) learning models can be realized at the model, loss, or data level. As a technique somewhere in-between loss and data, label smoothing turns deterministic class labels into probability distributions, for example by uniformly distributing a certain part of the probability mass over all classes. A predictive model is then trained on these distributions as targets, using cross-entropy as loss function. While this method has shown improved performance compared to non-smoothed cross-entropy, we argue that the use of a smoothed though still precise probability distribution as a target can be questioned from a theoretical perspective. As an alternative, we propose a generalized technique called label relaxation, in which the target is a set of probabilities represented in terms of an upper probability distribution. This leads to a genuine relaxation of the target instead of a distortion, thereby reducing the risk of incorporating an undesirable bias in the learning process. Methodically, label relaxation leads to the minimization of a novel type of loss function, for which we propose a suitable closed-form expression for model optimization. The effectiveness of the approach is demonstrated in an empirical study on image data.

## Introduction

In standard settings of supervised learning, the result of a learning process is essentially determined by the interplay of the model class, the learning algorithm (or rather the loss function this algorithm seeks to minimize in order to identify the presumably optimal model), and the training data. Of utmost practical importance, especially for flexible models such as neural networks, is a regularization of the learner, so as to prevent it from overfitting the training data. In the case where models are non-deterministic and produce probabilistic predictions, for example class probabilities in the case of classification, overfitting also manifests itself in overly confident predictors with a tendency to assign probabilities close to the extremes of 0 or 1.

While the role of the model class and the loss function in helping to regularize the learner is quite obvious, this is arguably less true for the training data. The role of the data becomes especially important when supervision is only indirect, in the sense that the target of the predictor is not directly observed. Again, class probabilities constitute an important example: even if probabilistic predictions are sought, the data will normally not provide such probabilities as training information. Instead, it typically consists of examples with a single class label attached. In such cases, the formalization of the learning problem also involves the modeling of the training data.

This aspect, the modeling of data, is at the core of this paper. Compared to the model class and the learning algorithm, it has received rather little attention in the literature so far, and is mostly done in an implicit way.
For example, when training a model by optimizing losses such as the log-loss or the Brier score, an observed class label is implicitly treated as a degenerate (one-point) distribution, which assigns the entire probability mass to that label. Unsurprisingly, feeding the learner with extreme distributions of that kind aggravates the problems of overfitting and over-confidence. So-called label smoothing (Szegedy et al. 2016) has recently been proposed to address these issues. The idea is to remove a certain amount of probability mass from the observed class and spread it across the other classes, thereby making the distribution less extreme. While the probability mass can be spread in any way, the authors suggest a uniform distribution over all classes. This modification of the data encourages the model to be less confident about its predictions, which effectively narrows the gap between the logit of the observed class and those of the others. The method has proved successful in various applications, such as the classification of image data using deep convolutional neural networks (Szegedy et al. 2016; Müller, Kornblith, and Hinton 2019).

Although label smoothing undoubtedly provokes a regularization effect (Lukasik et al. 2020), it can be questioned from a data modeling point of view. In particular, since the smoothed probability distribution is still unlikely to match the true underlying conditional class probability, it is likely to introduce a bias that may harm the generalization performance (Li, Dasarathy, and Berisha 2020). Indeed, while label smoothing helps to calibrate the degree of confidence of a model, and improves upon the conventional cross-entropy loss with one-point probabilities, explicit calibration methods such as temperature scaling (Guo et al. 2017) turn out to calibrate models even better (Müller, Kornblith, and Hinton 2019).

In this paper, we propose label relaxation as an alternative approach to data modeling. To avoid a possibly undesirable bias, the key idea is to replace a degenerate probability distribution associated with an observed class label, not by a single smoothed distribution, but by a larger set of candidate distributions. All distributions in this set still assign the highest probability to the observed class, but the concrete degree is not fixed, and the remaining mass can be distributed freely over the other classes. This way, the learner itself can decide on the most appropriate distribution. In other words, instead of predetermining an alleged ground-truth distribution as a target, this distribution will be determined in a data-driven way as a result of the learning process itself.

To put label relaxation into practice, we devise a suitable generalization of the Kullback-Leibler (KL) divergence loss, which is able to compare a predicted probability distribution with a class of candidate distributions. The effectiveness of learning by minimizing this generalized loss is demonstrated on commonly used image datasets. While being competitive with label smoothing and other related regularization techniques in terms of classification performance, label relaxation does indeed improve in terms of calibration, i.e., accurate estimation of probabilities, often even compared to explicit calibration techniques that require extra data.
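As a concrete illustration of the smoothing operation just described, the following minimal NumPy sketch (our own illustration, not code from the paper; the function name and the value of α are arbitrary) turns observed class labels into smoothed target distributions using a uniform spreading distribution.

```python
import numpy as np

def smooth_labels(labels, num_classes, alpha=0.1):
    """Turn integer class labels into label-smoothed target distributions.

    Each one-hot target p is replaced by (1 - alpha) * p + alpha * u,
    where u is the uniform distribution over all classes.
    """
    one_hot = np.eye(num_classes)[labels]  # degenerate (one-point) distributions
    uniform = np.full((len(labels), num_classes), 1.0 / num_classes)
    return (1.0 - alpha) * one_hot + alpha * uniform

# Example: three observed labels out of K = 4 classes
targets = smooth_labels(np.array([2, 0, 3]), num_classes=4, alpha=0.1)
print(targets)  # observed class gets 0.925, every other class 0.025
```

The resulting soft targets are then used together with the conventional cross-entropy loss.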
## Related Work

As already said, different actions could be taken to improve learning, including the manipulation or modification of the original training data. In this regard, one can distinguish between methods acting on the instance, feature, and label level. While there are approaches to augment the instance set, e.g., by synthetically generating additional training examples (Cireşan et al. 2010; Krizhevsky, Sutskever, and Hinton 2012), or to introduce noise in the input features (e.g., Vincent et al. 2008; van der Maaten et al. 2013), our focus is on the adjustment of the labels.

One of the most prominent approaches of this kind, label smoothing (Szegedy et al. 2016), was already mentioned. It turns deterministic observations of class labels into probability distributions, assigning a predefined amount of probability mass to the non-observed classes; by default, a uniform distribution is used for that purpose. The newly generated targets are then used together with a conventional cross-entropy loss. As a result, the learner no longer tries to perfectly predict the original class (with probability 1), which may lead to drastic differences among the class logits and cause numerical instabilities. Label smoothing has been applied successfully not only in the domain of image classification, but also in other domains such as machine translation. More recently, Müller, Kornblith, and Hinton (2019) observed a calibration effect produced by label smoothing. Moreover, the authors analyzed the activation patterns in the penultimate layers of neural networks. Apparently, label smoothing supports a regular distribution of the classes in these layers, in which clusters of instances associated with a class are well separated and tend to be equidistant. Additionally, label smoothing has also turned out to be effective against label noise (Lukasik et al. 2020).

While label smoothing in its original form distributes probability mass to the non-observed classes uniformly, distributions other than uniform are of course conceivable. Although their work does not directly build upon label smoothing, Hinton, Vinyals, and Dean (2015) follow a similar approach: a teacher network predicts target probabilities, which are then used to train a student network in order to distill the teacher's knowledge. In a different approach, a bootstrapping technique is proposed that makes use of the model's own distribution to adjust the training labels (Reed et al. 2015). Similarly, self-distillation approaches gathering the target labels from the model itself have shown regularizing effects that improve generalization performance (Zhang et al. 2019; Yun et al. 2020).

As an alternative approach to prevent the model from becoming overconfident, and closely related to label smoothing, Pereyra et al. (2017) propose the penalization of confident distributions by adding the negative entropy of the predicted distribution to the original loss. Following this principle, Dubey et al. (2018) transfer the method to fine-grained classification. Related to this, with similar effects, the so-called focal loss (Lin et al. 2020) aims to reduce the loss for well-classified instances, i.e., predictions close to the actual target, by dynamically scaling the cross-entropy loss. Originally designed to cope with class imbalance in object detection problems, it reduces the danger of overconfidence by flattening the loss near the true target and thereby shrinking the gradients for confident predictions.
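For illustration, the following sketch shows per-example versions of the two regularizers just mentioned, the confidence penalty of Pereyra et al. (2017) and the focal loss of Lin et al. (2020). This is our own simplified rendering (the class-balancing weight of the focal loss is omitted), not the authors' code.

```python
import numpy as np

def confidence_penalty_loss(p_hat, y, beta=0.1, eps=1e-12):
    """Cross-entropy on the observed class minus beta times the prediction's
    entropy, so that low-entropy (overconfident) predictions are penalized."""
    ce = -np.log(p_hat[y] + eps)
    entropy = -np.sum(p_hat * np.log(p_hat + eps))
    return ce - beta * entropy

def focal_loss(p_hat, y, gamma=2.0, eps=1e-12):
    """Cross-entropy scaled by (1 - p_t)^gamma, which flattens the loss
    for well-classified examples (p_t close to 1)."""
    p_t = p_hat[y]
    return -((1.0 - p_t) ** gamma) * np.log(p_t + eps)

p_hat = np.array([0.05, 0.9, 0.05])  # confident prediction for class 1
print(confidence_penalty_loss(p_hat, y=1), focal_loss(p_hat, y=1))
```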
In addition to the approaches outlined above, further ideas to adjust the given labels in order to achieve better generalization properties can be found in the literature. For instance, the approach by Xie et al. (2016) randomly flips targets with a fixed probability, resulting in training on a dataset ensemble with shared weights. As a result, an averaging effect lowers the risk of overfitting. Motivated by Hinton, Vinyals, and Dean (2015), Li et al. (2017) propose a related distillation approach that incorporates noisy side information. Bagherinezhad et al. (2018) describe a model that iteratively refines label probabilities from previous model predictions in a chain of multiple networks. By refining the labels across all models, the data is augmented with soft targets to prevent overfitting.

In neural network learning, the predicted probabilities should ideally match the true distribution. However, as shown empirically by Guo et al. (2017), modern neural networks tend to be calibrated very poorly. While there exists a wide range of calibration methods, including isotonic regression (Zadrozny and Elkan 2002), Bayesian binning techniques (Naeini, Cooper, and Hauskrecht 2015), and beta calibration (Kull, Filho, and Flach 2017), a simple technique called temperature scaling has proved to provide strong performance compared to its competitors (Guo et al. 2017). However, most calibration methods require additional data to determine the calibration parameters, leaving less data for training the model. Typically, this comes with a loss of generalization performance. Although label smoothing reduces the calibration error compared to non-smoothed training, it is still slightly inferior to temperature scaling (Müller, Kornblith, and Hinton 2019).

## Label Relaxation

In the following, we detail our idea of label relaxation as an alternative to label smoothing, i.e., the idea of modeling deterministic data, namely observed class labels, in terms of a set of probability distributions instead of a single target distribution. We also propose a generalization of an underlying loss function, which compares probabilistic predictions with a set of candidate distributions, and derive a closed-form expression for the case of the KL divergence.

### Motivation

Consider a conventional setting of supervised learning, in which we are interested in learning a probabilistic classifier $\hat{p}: \mathcal{X} \to \mathbb{P}(\mathcal{Y})$, where $\mathcal{X}$ is an instance space, $\mathcal{Y} = \{y_1, \ldots, y_K\}$ a set of class labels, and $\mathbb{P}(\mathcal{Y})$ the space of probability distributions over $\mathcal{Y}$. To this end, we typically proceed from training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{X} \times \mathcal{Y}$, i.e., observations in the form of instances labeled by one of the classes. Thus, even if we assume a ground-truth (conditional) probability distribution $p_i^* = p^*(\cdot \mid x_i)$ to exist for each $x_i \in \mathcal{X}$, this distribution will normally not be provided as training information. Instead, the training will be based on the deterministic label $y_i$, which is (explicitly or implicitly) treated as a degenerate (one-point) distribution $p_i \in \mathbb{P}(\mathcal{Y})$ such that $p_i(y_i \mid x_i) = 1$ and $p_i(y \mid x_i) = 0$ for $y \neq y_i$.

Needless to say, under the realistic assumption of a non-deterministic dependency between $\mathcal{X}$ and $\mathcal{Y}$, the true distribution $p_i^*$ will normally be less extreme than the surrogate $p_i$. Therefore, providing the latter as training information may suggest a level of determinism that is actually not warranted. As a consequence, the learner will be encouraged to make extreme predictions, suggesting a high degree of confidence, leading to biased probability estimates and a tendency to overfit the training data, all the more when training flexible models such as neural networks.
In label smoothing, a surrogate distribution $p$ is replaced by a less extreme surrogate

$$p^s = (1 - \alpha) \cdot p + \alpha \cdot u$$

as a target for the learner, where $u \in \mathbb{P}(\mathcal{Y})$ is a fixed distribution and $\alpha \in (0, 1]$ a smoothing factor. As shown by Szegedy et al. (2016), the resulting cross-entropy $H$, which often serves as a loss function for the learner, is of the following form for a prediction $\hat{p} \in \mathbb{P}(\mathcal{Y})$:

$$H(p^s, \hat{p}) = (1-\alpha)\, H(p, \hat{p}) + \alpha\, H(u, \hat{p}) = (1-\alpha)\, H(p, \hat{p}) + \alpha \big( D_{KL}(u \,\|\, \hat{p}) + H(u) \big) . \quad (1)$$

Since $H(p) = 0$ for a degenerate $p$, the first term on the right-hand side simplifies to $H(p, \hat{p}) = D_{KL}(p \,\|\, \hat{p}) + H(p) = D_{KL}(p \,\|\, \hat{p})$, with $D_{KL}$ the Kullback-Leibler divergence. Moreover, assuming $u$ to be independent of $\hat{p}$, $H(u)$ can be treated as a constant with no influence on loss minimization. Furthermore, as pointed out by Szegedy et al. (2016), the divergence $D_{KL}(u \,\|\, \hat{p})$ essentially corresponds to the negative entropy of $\hat{p}$ for the case where $u$ is the uniform distribution. Thus, we eventually end up with a loss function of the form

$$\mathcal{L}(p^s, \hat{p}) = (1-\alpha)\, D_{KL}(p \,\|\, \hat{p}) - \alpha\, H(\hat{p}) , \quad (2)$$

i.e., a loss that augments the original cross-entropy loss by a penalty that enforces a higher entropy for the prediction $\hat{p}$, and which hence serves the purpose of regularization, as empirical results have confirmed (Pereyra et al. 2017). Training a learner with the loss (2) will obviously lead to less extreme predictions (for $\alpha = 1$, the learner will always predict the uniform distribution on $\mathcal{Y}$).

Thus, there are different ways of looking at label smoothing. According to what we just explained, it can be seen as a regularization technique, which may explain its practical usefulness. On the other side, coming back to the discussion we started with, it can also be seen as an attempt at presenting the training information in a more faithful way: a smoothed target probability $p^s$ is arguably more realistic than a degenerate distribution $p$ assigning the full probability mass to a single class label. However, it is still unlikely that the adjusted distribution $p^s$ matches the ground truth $p^*$. Therefore, using $p^s$ as a more or less arbitrary target, the learner will still be biased in a possibly undesirable way. Related to this, one may wonder whether a systematic penalization of the learner for overly correct predictions, i.e., predictions $\hat{p}$ that are closer to the original $p$ (the truly observed class label) than $p^s$, is indeed appropriate. At least in some cases, such predictions could be justified, and indeed be closer to the ground truth.

As a presumably better, but at least more faithful, representation of our knowledge about the ground truth $p^*$, we propose to replace the original target $p$ by a set $Q \subset \mathbb{P}(\mathcal{Y})$ of candidate probabilities that are sufficiently close to the original target $p$. While the replacement of $p$ by a single distribution $p^s$ can be seen as a distortion of the original target, this can be considered as a relaxation of the target: as long as the learner predicts any distribution $\hat{p} \in Q$ inside the candidate set, it should not be penalized at all, i.e., the loss should be 0. This is to some extent comparable to the use of loss functions like the $\epsilon$-insensitive loss in support vector regression, where the loss is 0 in the $\epsilon$-neighborhood of the original target; essentially, this means that the original target, which is a real number, is relaxed and replaced by a set in the form of an interval.
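Returning briefly to label smoothing, note that the decomposition in Eq. (1) follows from the linearity of the cross-entropy in its first argument and can be checked numerically. The small script below is our own illustration with an arbitrary prediction, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha = 5, 0.2
p = np.eye(K)[2]                   # degenerate target: observed class y = 2
u = np.full(K, 1.0 / K)            # uniform spreading distribution
p_s = (1 - alpha) * p + alpha * u  # smoothed target
p_hat = rng.dirichlet(np.ones(K))  # some probabilistic prediction

def cross_entropy(q, p_hat):
    return -np.sum(q * np.log(p_hat))

lhs = cross_entropy(p_s, p_hat)
rhs = (1 - alpha) * cross_entropy(p, p_hat) + alpha * cross_entropy(u, p_hat)
assert np.isclose(lhs, rhs)
# For the degenerate p, H(p, p_hat) also equals D_KL(p || p_hat), since H(p) = 0.
```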
Note that, by using set-valued targets $Q_i$ for the training instances $x_i$, a regularization effect can also be expected: by accepting all predictions as correct that are sufficiently close to the original target distribution, the learner is still allowed to produce extreme predictions, but no longer urged to do so. Instead, the learner is more flexible and can freely choose a target $p^r_i \in Q_i$ that appears most appropriate. Since the $p^r_i$ are the result of a learning process and determined in a data-driven way, one expects them to be closer to the $p^*_i$ than the surrogates $p^s_i$, which are chosen arbitrarily. This is completely in line with the idea of data disambiguation in the context of learning from set-valued data (Hüllermeier and Cheng 2015). See Fig. 1 for an illustration of the conceptual differences between label smoothing and label relaxation.

Figure 1: An illustration of label smoothing (left) and label relaxation (right), using a barycentric representation, in which points correspond to (3-class) distributions and probabilities are given by the lengths of the projections onto the sides of the triangle. In the former, the original distribution $p$ (in the lower right corner) is shifted toward the uniform distribution $u$, and the loss of a prediction $\hat{p}$ depends on the (KL) distance to $p^s$, hence on the distance to $p$ as well as $u$. In label relaxation, $p$ is replaced by a set $Q$ of distributions, indicated by the shaded region. The learner is free to choose any of the distributions inside this set (non-filled circles inside the region), and the loss is determined by the minimal distance between $\hat{p}$ and any of the distributions in $Q$ (filled red circle).

### Loss Formulation

To formalize the ideas sketched above, we leverage the theory of imprecise probabilities (Walley 1991). A convenient way to express a set of probability distributions is to provide upper probabilities, i.e., upper bounds on the probabilities of events. So-called possibility distributions $\pi: \mathcal{Y} \to [0, 1]$ are often interpreted in this way (Dubois and Prade 2004), i.e., in the sense that $\pi(y)$ is an upper bound on $p^*(y)$. More generally, since a possibility distribution $\pi$ induces a measure $\Pi$ on $\mathcal{Y}$ defined by $\Pi(Y) = \max_{y \in Y} \pi(y)$ for all $Y \subseteq \mathcal{Y}$, the set of probability distributions associated with a distribution $\pi$ is given by

$$Q_\pi := \Big\{\, p \in \mathbb{P}(\mathcal{Y}) \;\Big|\; \forall\, Y \subseteq \mathcal{Y}: \ \sum_{y \in Y} p(y) \le \max_{y \in Y} \pi(y) \,\Big\} .$$

Note that a possibility distribution $\pi$ is assumed to be normalized in the sense that $\pi(y) = 1$ for at least one $y \in \mathcal{Y}$. In other words, there is at least one alternative that appears completely plausible. In our case, this alternative naturally corresponds to the class label that has actually been observed for a training instance: potentially, this class may have a (conditional) probability of 1. However, by assigning a certain degree $\pi(y) > 0$ of possibility also to the other classes, we can express that these classes are not completely excluded either. More specifically, consider a distribution of the following kind:

$$\pi_i(y) = \begin{cases} 1 & \text{if } y = y_i \\ \alpha & \text{if } y \neq y_i \end{cases}$$

where $\alpha \in [0, 1]$ is a parameter. By definition, the associated set $Q^\alpha_{\pi_i}$ is then given by the set of probability distributions $p$ that assign a probability mass of at most 1 to the observed class $y_i$ and at most $\alpha$ to the other classes:

$$Q^\alpha_i := \Big\{\, p \in \mathbb{P}(\mathcal{Y}) \;\Big|\; \sum_{y \in \mathcal{Y} \setminus \{y_i\}} p(y) \le \alpha \,\Big\} \quad (3)$$

Replacing the class labels $y_i$ observed as training information by sets $Q^\alpha_i$ as new targets for the learner, we need to define a suitably generalized loss function $\mathcal{L}^*$.
Since the learner is still assumed to produce probabilistic predictions, the loss should be able to compare a predicted distribution $\hat{p}_i$ with a candidate set $Q^\alpha_i$. According to what we said before, namely that a prediction inside $Q^\alpha_i$ should be considered as perfect, a natural definition is

$$\mathcal{L}^*(Q, \hat{p}) := \min_{p \in Q} \mathcal{L}(p, \hat{p}) , \quad (4)$$

where $\mathcal{L}$ is a standard loss on probability distributions, i.e., a loss $\mathcal{L}: \mathbb{P}(\mathcal{Y})^2 \to \mathbb{R}$. Interestingly, (4) can be seen as a special case of what has been introduced under the notion of the optimistic superset loss in the context of superset learning (Hüllermeier and Cheng 2015), and more recently as the infimum loss by Cabannes, Rudi, and Bach (2020). In the following, we shall refer to (4) as the label relaxation (LR) loss. As a theoretically convenient case, instantiating $\mathcal{L}$ with the Kullback-Leibler divergence, that is,

$$\mathcal{L}(p, \hat{p}) := D_{KL}(p \,\|\, \hat{p}) = \sum_{y \in \mathcal{Y}} p(y) \log \frac{p(y)}{\hat{p}(y)} ,$$

(4) simplifies as follows for sets $Q^\alpha_i$ of the form (3):

$$\mathcal{L}^*(Q^\alpha_i, \hat{p}_i) = \begin{cases} 0 & \text{if } \hat{p}_i \in Q^\alpha_i \\ D_{KL}(p^r_i \,\|\, \hat{p}_i) & \text{otherwise} \end{cases} \quad (5)$$

where

$$p^r_i(y) = \begin{cases} 1 - \alpha & \text{if } y = y_i \\[2pt] \alpha \cdot \dfrac{\hat{p}_i(y)}{\sum_{y' \neq y_i} \hat{p}_i(y')} & \text{otherwise.} \end{cases} \quad (6)$$

We refer to the technical appendix for a formal proof of this result.

Fig. 2 shows a comparison of label smoothing (cross-entropy losses with and without smoothing, left side) with our label relaxation loss (right side). As can be seen (and is proven in the appendix), $\mathcal{L}^*$ based on the Kullback-Leibler divergence is convex, which makes the optimization computationally feasible. Moreover, the label smoothing loss is not monotone and increases again for predictions close to 1, while the LR loss vanishes for predicted true-class probabilities of at least $1 - \alpha$. This cut-off reflects the relaxation of the problem. For multi-class problems, since the proposed label relaxation loss projects the predicted probabilities $\hat{p}_i$ according to their own distribution onto $p^r_i$, it is invariant to the concrete probabilities predicted for the classes other than the observed class.

Figure 2: A comparison of the different losses discussed in this paper for binary classification, plotted as a function of the predicted true class probability. Left: cross-entropy losses with and without label smoothing ($\alpha \in \{0.1, 0.25, 0.5\}$). Right: LR loss $\mathcal{L}^*$ based on the Kullback-Leibler divergence ($\alpha \in \{0.1, 0.25, 0.5\}$).

Interestingly, the resulting losses shown in Fig. 2 for varying $\alpha$ appear closely related to the focal loss introduced by Lin et al. (2020). However, while the focal loss merely de-emphasizes predictions in the well-classified region by means of almost flat regions, our loss completely eliminates the loss in this region, constituting a genuine relaxation.
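The closed form (5)-(6) is straightforward to implement. The following NumPy sketch (ours, not the authors' implementation; names and example values are illustrative) tests membership in $Q^\alpha_i$ via the predicted probability of the observed class and, outside the set, computes the KL divergence to the projected target $p^r_i$.

```python
import numpy as np

def label_relaxation_loss(p_hat, y, alpha=0.1, eps=1e-12):
    """LR loss with KL divergence, following Eqs. (5)-(6).

    p_hat : predicted probability vector over the classes
    y     : index of the observed class
    """
    # p_hat lies in Q_alpha iff the non-observed classes receive at most alpha,
    # i.e., iff the observed class receives at least 1 - alpha.
    if p_hat[y] >= 1.0 - alpha:
        return 0.0
    # Project p_hat onto Q_alpha: the observed class gets 1 - alpha, and the
    # remaining mass alpha is distributed proportionally to p_hat itself.
    p_r = alpha * p_hat / (np.sum(p_hat) - p_hat[y] + eps)
    p_r[y] = 1.0 - alpha
    # KL divergence D_KL(p_r || p_hat)
    return float(np.sum(p_r * np.log((p_r + eps) / (p_hat + eps))))

p_hat = np.array([0.7, 0.2, 0.1])
print(label_relaxation_loss(p_hat, y=0, alpha=0.4))  # 0.0 -- prediction lies in Q_alpha
print(label_relaxation_loss(p_hat, y=0, alpha=0.1))  # > 0 -- penalized toward p_r
```

In a batch setting, the same case distinction can be vectorized with a boolean mask; inside the candidate set, the loss, and hence the gradient, vanishes.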
## Evaluation

To demonstrate the effectiveness of label relaxation, we conduct an empirical evaluation on image classification datasets, assessing both classification performance and calibration.

### Experimental Setting

In our empirical evaluation, we compare models trained with conventional cross-entropy (CE), label smoothing (LS), confidence penalization (CP) as described by Pereyra et al. (2017), and the focal loss (FL) of Lin et al. (2020) to our label relaxation (LR) approach. To this end, we study the performance of neural networks on the task of image classification. Although the losses are completely general and not tailored to any specific domain, this problem serves as a good representative and has been used in related studies in the past.

In addition to assessing the generalization accuracy in terms of the classification rate, we also measure the degree of calibration of the networks, i.e., the quality of the predicted class probabilities. To this end, we use the estimated expected calibration error (ECE) as done by Guo et al. (2017). This measure requires probabilities to be discretized through binning; as suggested by Müller, Kornblith, and Hinton (2019), we fix the number of bins to 15. To compare label smoothing and our approach with explicit calibration methods, models without explicit calibration and temperature scaling (Guo et al. 2017) serve as baselines.

Within our study, we consider MNIST (LeCun et al. 1998), Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), as well as CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton 2009) as image datasets. While MNIST and Fashion-MNIST both have 60k training and 10k test examples, CIFAR-10 and CIFAR-100 each consist of 50k training and 10k test instances. For the first two datasets, we train our models on a simple fully connected, ReLU-activated neural network architecture with two hidden layers of 1024 neurons each. For the latter two datasets, we train the commonly used deep architectures VGG16 (Simonyan and Zisserman 2015), ResNet56 (V2) (He et al. 2016), and DenseNet-BC-100-12 (Huang et al. 2017). While we repeated every experiment for MNIST and Fashion-MNIST with 10 different seeds, we ran each of the latter experiments 5 times. The runs were conducted on 20 Nvidia RTX 2080 Ti and 10 Nvidia GTX 1080 Ti GPUs.

For a fair comparison, all hyperparameters are fixed, except for the parameter $\alpha$ in the case of label relaxation and label smoothing, $\beta$ as the degree of confidence penalization in CP, and $\gamma$ as used to adjust the focal loss. For every combination of model and dataset, we empirically determined hyperparameters (such as the learning rate schedule and additional regularization) that work reasonably well for all losses. Since all losses are quite similar to each other, this was possible without favoring some of them while putting others at a disadvantage. To diminish regularization effects by additional means, we tried to exclude other techniques (such as extensive weight decay or Dropout (Srivastava et al. 2014)) as much as possible, thereby emphasizing the effect of the different loss functions while still achieving performance close to the originally published results.

To optimize the models, SGD with a Nesterov momentum of 0.9 was used as optimizer. In all experiments, the batch size was fixed to 64. Depending on the model, we set the initial learning rates to 0.01 (VGG), 0.05 (simple dense), and 0.1 (ResNet and DenseNet). For each model, we optimized the learning rate schedule for generalization performance by decaying the learning rate by a constant factor. We trained for 25 (MNIST), 50 (Fashion-MNIST), 200 (CIFAR-10), or 300 (CIFAR-100) epochs. Furthermore, we used data augmentation by randomly flipping the input images horizontally and shifting them in width and height. We refer to the appendix for a more comprehensive overview of the fixed hyperparameters.
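For completeness, the ECE estimate used in this evaluation can be computed as in the following sketch (a standard binned formulation written by us, with 15 equal-width confidence bins as in our setup; it is not meant to replicate the exact implementation).

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Estimated ECE: confidences are grouped into n_bins equal-width bins,
    and the accuracy/confidence gap is averaged, weighted by bin size."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Example with random predictions over 3 classes
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=1000)
labels = rng.integers(0, 3, size=1000)
print(expected_calibration_error(probs, labels))
```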
Since the parameters $\alpha$ for LR and LS, $\beta$ for CP, and $\gamma$ for FL are of critical importance, they have been optimized separately on a hold-out validation set consisting of 1/6 of the original training data. In the first set of experiments, we optimize these parameters for the highest classification rate, whereas in the second evaluation, we focus on a low ECE. In both cases, we assessed the values $\alpha \in \{0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4\}$, $\beta \in \{0.1, 0.3, 0.5, 1, 2, 4, 8\}$, and $\gamma \in \{0.1, 0.2, 0.5, 1, 2, 3.5, 5\}$, suggested as reasonable choices in the corresponding publications. The best model is then retrained on the original training data and evaluated on a separate test set. For each seed, the original training and test splits are merged and resampled to increase the variance of the experiments. This way, we achieve a better estimate of the generalization error. However, as a consequence, the presented results are not directly comparable to previously published results based on the original splits, although this special case is also covered in our experiments.

For temperature scaling, a separate hold-out validation set is used to optimize the parameter $T$ among the values $T \in \{0.25, 0.5, 0.75, 1, 1.1, 1.2, \ldots, 2, 2.5, 3\}$. This parameter strongly depends on the actually trained model and does not generalize well, i.e., a value performing well in an inner optimization run does not necessarily imply good calibration of the finally trained model. Therefore, the evaluation scenario differs slightly from the optimization of $\alpha$: for the latter, as opposed to the case of temperature scaling, the hold-out validation set is included in the final training using the optimized parameters. This can be regarded as the price paid for explicitly calibrating the model with temperature scaling, as opposed to the implicit calibration achieved by label smoothing and label relaxation. Here, we use 15% of the training data for calibration, which is roughly comparable to the validation set used for optimizing $\alpha$.

| Loss | MNIST Acc. | MNIST ECE | Fashion-MNIST Acc. | Fashion-MNIST ECE | Avg. Rank Acc. | Avg. Rank ECE |
|---|---|---|---|---|---|---|
| CE ($\alpha = 0$) | 0.985 ± 0.002 | 0.010 ± 0.001 | 0.912 ± 0.003 | 0.129 ± 0.184 | 2 | 3.5 |
| LS ($\alpha$ opt. for acc.) | 0.988 ± 0.001 | 0.106 ± 0.144 | 0.915 ± 0.002 | 0.155 ± 0.128 | 1 | 5 |
| CP ($\beta$ opt. for acc.) | 0.985 ± 0.002 | 0.012 ± 0.002 | 0.911 ± 0.004 | 0.075 ± 0.005 | 3 | 3.5 |
| FL ($\gamma$ opt. for acc.) | 0.984 ± 0.002 | 0.009 ± 0.002 | 0.911 ± 0.002 | 0.062 ± 0.002 | 4.5 | 2 |
| LR ($\alpha$ opt. for acc.) | 0.985 ± 0.001 | 0.007 ± 0.002 | 0.912 ± 0.003 | 0.059 ± 0.008 | 2 | 1 |
| CE ($\alpha = 0$, $T$ opt.) | 0.983 ± 0.001 | 0.003 ± 0.001 | 0.908 ± 0.004 | 0.030 ± 0.003 | 4 | 2.5 |
| LS ($\alpha$ opt. for ECE) | 0.987 ± 0.001 | 0.014 ± 0.001 | 0.915 ± 0.003 | 0.016 ± 0.002 | 1 | 3.5 |
| CP ($\beta$ opt. for ECE) | 0.984 ± 0.001 | 0.011 ± 0.001 | 0.911 ± 0.003 | 0.072 ± 0.003 | 2.5 | 4 |
| FL ($\gamma$ opt. for ECE) | 0.982 ± 0.001 | 0.004 ± 0.001 | 0.907 ± 0.003 | 0.011 ± 0.002 | 5 | 1.5 |
| LR ($\alpha$ opt. for ECE) | 0.985 ± 0.002 | 0.003 ± 0.001 | 0.911 ± 0.003 | 0.015 ± 0.003 | 2 | 1.5 |

Table 1: Results on MNIST and Fashion-MNIST using a simple 2-layer dense architecture. The resulting ranks are averaged over both datasets for the respective metric.

Table 1 shows the results of all assessed loss variants with regard to their classification performance and calibration error on MNIST and Fashion-MNIST. As can be seen, with a single exception, our label relaxation approach provides the lowest calibration error on both datasets, regardless of the optimization target (accuracy or ECE). Although models trained with the focal loss deliver competitive calibration results, they generalize worse than LR-optimized models. Label smoothing delivers the highest classification rate, while lacking calibration ability. By still achieving a classification rate competitive with LS, our method offers a reasonable compromise between strong generalization (in terms of classification rate) and good calibration.
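As a side note before turning to the CIFAR experiments, the temperature-scaling baseline can be realized by a simple grid search as sketched below (our own illustration; we assume, following Guo et al. (2017), that the temperature is selected by minimizing the negative log-likelihood on the held-out calibration split).

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels,
                    grid=(0.25, 0.5, 0.75, 1, 1.1, 1.2, 1.5, 2, 2.5, 3)):
    """Pick the temperature minimizing the NLL on a held-out validation set.
    The default grid is an abridged version of the candidate set given in the text."""
    def nll(T):
        p = softmax(val_logits, T)
        return -np.mean(np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12))
    return min(grid, key=nll)

rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 10)) * 3.0
labels = rng.integers(0, 10, size=500)
# With uninformative logits, a large temperature (flat predictions) minimizes the NLL.
print(fit_temperature(logits, labels))
```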
Since the accuracies on MNIST and Fashion-MNIST are already quite high, a more insightful evaluation is given by the experiments on CIFAR-10 and CIFAR-100, using multiple popular deep convolutional network architectures. Table 2 summarizes the results for both datasets and the different topologies. With few exceptions, LR minimizes the calibration error in terms of ECE among the assessed losses. At the same time, in accordance with the results presented before, it provides competitive classification rates. While FL-based models also yield relatively low calibration errors, they sometimes drop significantly in terms of classification performance (e.g., VGG16 and DenseNet-BC on CIFAR-100 optimized for ECE). Also, although temperature scaling uses separate data to explicitly optimize the temperature for a low calibration error, the implicit calibration of LR outperforms temperature scaling in most cases. Thus, relying on losses that implicitly calibrate models seems to be a reasonable strategy for model calibration.

| Model | Loss | CIFAR-10 Acc. | CIFAR-10 ECE | CIFAR-100 Acc. | CIFAR-100 ECE | Avg. Rank Acc. | Avg. Rank ECE |
|---|---|---|---|---|---|---|---|
| VGG16 | CE ($\alpha = 0$) | 0.930 ± 0.002 | 0.041 ± 0.001 | 0.708 ± 0.003 | 0.196 ± 0.003 | 1.5 | 3.5 |
| | LS ($\alpha$ opt. for acc.) | 0.929 ± 0.001 | 0.148 ± 0.119 | 0.711 ± 0.003 | 0.149 ± 0.049 | 1.5 | 3.5 |
| | CP ($\beta$ opt. for acc.) | 0.927 ± 0.001 | 0.059 ± 0.002 | 0.703 ± 0.003 | 0.228 ± 0.013 | 3 | 4.5 |
| | FL ($\gamma$ opt. for acc.) | 0.921 ± 0.001 | 0.038 ± 0.004 | 0.700 ± 0.005 | 0.190 ± 0.009 | 5 | 2.5 |
| | LR ($\alpha$ opt. for acc.) | 0.927 ± 0.002 | 0.033 ± 0.008 | 0.701 ± 0.006 | 0.133 ± 0.069 | 3.5 | 1 |
| | CE ($\alpha = 0$, $T$ opt.) | 0.922 ± 0.001 | 0.017 ± 0.003 | 0.689 ± 0.005 | 0.053 ± 0.003 | 3.5 | 2 |
| | LS ($\alpha$ opt. for ECE) | 0.932 ± 0.002 | 0.028 ± 0.010 | 0.711 ± 0.003 | 0.085 ± 0.005 | 1 | 4 |
| | CP ($\beta$ opt. for ECE) | 0.922 ± 0.003 | 0.050 ± 0.003 | 0.700 ± 0.004 | 0.209 ± 0.004 | 3 | 5 |
| | FL ($\gamma$ opt. for ECE) | 0.918 ± 0.003 | 0.027 ± 0.001 | 0.684 ± 0.007 | 0.048 ± 0.014 | 5 | 2.5 |
| | LR ($\alpha$ opt. for ECE) | 0.926 ± 0.001 | 0.022 ± 0.001 | 0.703 ± 0.006 | 0.046 ± 0.005 | 2 | 1.5 |
| ResNet56 (V2) | CE ($\alpha = 0$) | 0.940 ± 0.002 | 0.041 ± 0.002 | 0.737 ± 0.003 | 0.126 ± 0.003 | 3 | 3 |
| | LS ($\alpha$ opt. for acc.) | 0.938 ± 0.002 | 0.132 ± 0.145 | 0.733 ± 0.004 | 0.110 ± 0.061 | 4.5 | 4 |
| | CP ($\beta$ opt. for acc.) | 0.939 ± 0.003 | 0.046 ± 0.004 | 0.738 ± 0.004 | 0.151 ± 0.007 | 2 | 4 |
| | FL ($\gamma$ opt. for acc.) | 0.941 ± 0.002 | 0.036 ± 0.006 | 0.738 ± 0.005 | 0.107 ± 0.017 | 1 | 1.5 |
| | LR ($\alpha$ opt. for acc.) | 0.938 ± 0.003 | 0.059 ± 0.090 | 0.738 ± 0.003 | 0.092 ± 0.030 | 2.5 | 2.5 |
| | CE ($\alpha = 0$, $T$ opt.) | 0.933 ± 0.002 | 0.030 ± 0.002 | 0.709 ± 0.005 | 0.041 ± 0.006 | 5 | 3.5 |
| | LS ($\alpha$ opt. for ECE) | 0.940 ± 0.002 | 0.017 ± 0.002 | 0.730 ± 0.004 | 0.053 ± 0.003 | 2 | 3 |
| | CP ($\beta$ opt. for ECE) | 0.940 ± 0.002 | 0.044 ± 0.001 | 0.741 ± 0.003 | 0.140 ± 0.003 | 1 | 5 |
| | FL ($\gamma$ opt. for ECE) | 0.938 ± 0.002 | 0.017 ± 0.002 | 0.738 ± 0.005 | 0.024 ± 0.003 | 3 | 2 |
| | LR ($\alpha$ opt. for ECE) | 0.939 ± 0.002 | 0.016 ± 0.002 | 0.729 ± 0.003 | 0.017 ± 0.003 | 3.5 | 1 |
| DenseNet-BC (100-12) | CE ($\alpha = 0$) | 0.929 ± 0.003 | 0.050 ± 0.002 | 0.706 ± 0.005 | 0.229 ± 0.005 | 1 | 4 |
| | LS ($\alpha$ opt. for acc.) | 0.927 ± 0.004 | 0.046 ± 0.011 | 0.704 ± 0.008 | 0.182 ± 0.063 | 4 | 1.5 |
| | CP ($\beta$ opt. for acc.) | 0.929 ± 0.002 | 0.056 ± 0.004 | 0.698 ± 0.011 | 0.252 ± 0.018 | 3 | 5 |
| | FL ($\gamma$ opt. for acc.) | 0.928 ± 0.003 | 0.047 ± 0.004 | 0.703 ± 0.001 | 0.223 ± 0.001 | 3.5 | 3 |
| | LR ($\alpha$ opt. for acc.) | 0.928 ± 0.002 | 0.039 ± 0.014 | 0.706 ± 0.003 | 0.203 ± 0.023 | 2 | 1.5 |
| | CE ($\alpha = 0$, $T$ opt.) | 0.921 ± 0.003 | 0.009 ± 0.004 | 0.687 ± 0.006 | 0.096 ± 0.006 | 4 | 2 |
| | LS ($\alpha$ opt. for ECE) | 0.928 ± 0.003 | 0.020 ± 0.002 | 0.704 ± 0.015 | 0.077 ± 0.035 | 1 | 2.5 |
| | CP ($\beta$ opt. for ECE) | 0.928 ± 0.002 | 0.054 ± 0.002 | 0.704 ± 0.003 | 0.237 ± 0.003 | 1 | 5 |
| | FL ($\gamma$ opt. for ECE) | 0.915 ± 0.003 | 0.016 ± 0.001 | 0.681 ± 0.004 | 0.133 ± 0.004 | 5 | 3 |
| | LR ($\alpha$ opt. for ECE) | 0.922 ± 0.002 | 0.017 ± 0.003 | 0.703 ± 0.008 | 0.085 ± 0.006 | 3 | 2.5 |

Table 2: Results on CIFAR-10 and CIFAR-100 for the assessed model architectures. The ranks are averaged over both datasets as done before.

To get a better overview of the presented results, Table 3 shows the resulting aggregated ranks over all datasets and models per metric and parameter optimization target (accuracy or ECE). For both optimization schemes, LR clearly dominates the other losses in terms of the calibration error. At the same time, its overall classification performance is reasonably close to the best loss, especially when applying accuracy-based hyperparameter optimization. As the results demonstrate, LR balances both metrics and provides a compelling alternative to the other losses, particularly for applications in which the aim is to predict probabilities matching the underlying true probabilities of the classes.

| Loss | Acc. Opt.: Acc. | Acc. Opt.: ECE | ECE Opt.: Acc. | ECE Opt.: ECE | Overall: Acc. | Overall: ECE |
|---|---|---|---|---|---|---|
| CE | 1.88 | 3.5 | 4.13 | 2.5 | 3 | 3 |
| LS | 2.75 | 3.5 | 1.25 | 3.25 | 2 | 3.38 |
| CP | 2.75 | 4.25 | 1.88 | 4.75 | 2.31 | 4.5 |
| FL | 3.5 | 2.25 | 4.5 | 2.25 | 4 | 2.25 |
| LR | 2.5 | 1.5 | 2.63 | 1.63 | 2.56 | 1.56 |

Table 3: Average ranks of the losses with regard to accuracy and ECE when (a) optimizing the accuracy, (b) optimizing the ECE, and (c) overall.

## Conclusion

We proposed label relaxation as an alternative to label smoothing, an established technique for preventing overfitting and over-confidence in classifier learning: instead of replacing the original (degenerate) distribution associated with an observed class label by another, smoother yet still precise distribution, we relax the problem by letting the learner choose from a larger set of such distributions. This kind of imprecisiation of the training data relieves the learner from the need to reproduce unrealistically definite observations, very much like label smoothing, but also allows it to predict probabilities in a flexible way. This flexibility appears to be important, not only for accurate classification, but even more so for producing less biased and better calibrated probability estimates. These reflections are confirmed by an empirical study on image classification. Here, the calibration of deep convolutional neural network models could be improved without a loss in classification accuracy compared to label smoothing, the penalization of confident output distributions, and focal-loss-based optimization. Label relaxation even outperforms explicit calibration methods like temperature scaling, which, due to requiring extra data for calibration, often pay with a drop in classification performance.

The idea of modeling targets in supervised learning in terms of imprecise probabilities, combined with the minimization of generalized losses penalizing deviations from the set of associated precise distributions, is very general and could be instantiated in various ways. Here, we considered the problem of classification and generalized the KL divergence.
However, motivated by the promising empirical results, we also plan to look at other problems and other combinations of data imprecisiation and loss functions. Eventually, a broader study of different instantiations should lead to a deeper understanding and a general methodology of label relaxation.

## Acknowledgments

This work was partially supported by the German Research Foundation (DFG) under Grant No. 3050231323. The authors gratefully acknowledge the funding of this project by computing time provided by the Paderborn Center for Parallel Computing (PC2).

## References
Bagherinezhad, H.; Horton, M.; Rastegari, M.; and Farhadi, A. 2018. Label Refinery: Improving ImageNet Classification through Label Progression. CoRR abs/1805.02641.

Cabannes, V.; Rudi, A.; and Bach, F. R. 2020. Structured Prediction with Partial Labelling through the Infimum Loss. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual Event, July 13-18, 2020, volume 119, 1230-1239. PMLR.

Cireşan, D. C.; Meier, U.; Gambardella, L. M.; and Schmidhuber, J. 2010. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation 22(12): 3207-3220.

Dubey, A.; Gupta, O.; Raskar, R.; and Naik, N. 2018. Maximum-Entropy Fine Grained Classification. In Bengio, S.; Wallach, H. M.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 635-645.

Dubois, D.; and Prade, H. 2004. Possibility Theory, Probability Theory and Multiple-Valued Logics: A Clarification. Annals of Mathematics and Artificial Intelligence 32: 35-66.

Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, August 6-11, 2017, volume 70 of Proceedings of Machine Learning Research, 1321-1330. PMLR.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. In Proceedings of the 14th European Conference on Computer Vision, ECCV 2016, Amsterdam, The Netherlands, October 11-14, 2016, Part IV, 630-645. Springer.

Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531.

Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2261-2269. IEEE Computer Society.

Hüllermeier, E.; and Cheng, W. 2015. Superset Learning Based on Generalized Loss Minimization. In Appice, A.; Rodrigues, P. P.; Costa, V. S.; Gama, J.; Jorge, A.; and Soares, C., eds., Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part II, volume 9285 of Lecture Notes in Computer Science, 260-275. Springer.

Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Canada.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Bartlett, P. L.; Pereira, F. C. N.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Proceedings of the 26th Annual Conference on Neural Information Processing Systems, NIPS 2012, Lake Tahoe, NV, USA, December 3-6, 2012, 1106-1114.

Kull, M.; Filho, T. S.; and Flach, P. 2017. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Singh, A.; and Zhu, J., eds., Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, Fort Lauderdale, FL, USA, April 20-22, 2017, volume 54 of Proceedings of Machine Learning Research, 623-631. PMLR.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11): 2278-2324.
Li, W.; Dasarathy, G.; and Berisha, V. 2020. Regularization via Structural Label Smoothing. In Chiappa, S.; and Calandra, R., eds., 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, Online [Palermo, Sicily, Italy], August 26-28, 2020, volume 108 of Proceedings of Machine Learning Research, 1453-1463. PMLR.

Li, Y.; Yang, J.; Song, Y.; Cao, L.; Luo, J.; and Li, L. 2017. Learning from Noisy Labels with Distillation. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 1928-1936. IEEE Computer Society.

Lin, T.; Goyal, P.; Girshick, R. B.; He, K.; and Dollár, P. 2020. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 42(2): 318-327.

Lukasik, M.; Bhojanapalli, S.; Menon, A. K.; and Kumar, S. 2020. Does Label Smoothing Mitigate Label Noise? In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual Event, July 13-18, 2020, volume 119, 6448-6458. PMLR.

Müller, R.; Kornblith, S.; and Hinton, G. E. 2019. When Does Label Smoothing Help? In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS 2019, Vancouver, BC, Canada, December 8-14, 2019, 4696-4705.

Naeini, M. P.; Cooper, G. F.; and Hauskrecht, M. 2015. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, Texas, USA, January 25-30, 2015, 2901-2907. AAAI Press.

Pereyra, G.; Tucker, G.; Chorowski, J.; Kaiser, L.; and Hinton, G. E. 2017. Regularizing Neural Networks by Penalizing Confident Output Distributions. CoRR abs/1701.06548.

Reed, S. E.; Lee, H.; Anguelov, D.; Szegedy, C.; Erhan, D.; and Rabinovich, A. 2015. Training Deep Neural Networks on Noisy Labels with Bootstrapping. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings.

Simonyan, K.; and Zisserman, A. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015.

Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1): 1929-1958.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2818-2826. IEEE Computer Society.

van der Maaten, L.; Chen, M.; Tyree, S.; and Weinberger, K. Q. 2013. Learning with marginalized corrupted features. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, 410-418.

Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML 2008, Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference Proceeding Series, 1096-1103. ACM.

Walley, P. 1991. Statistical reasoning with imprecise probabilities. Chapman & Hall.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR abs/1708.07747.

Xie, L.; Wang, J.; Wei, Z.; Wang, M.; and Tian, Q. 2016. DisturbLabel: Regularizing CNN on the Loss Layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 4753-4762. IEEE Computer Society.

Yun, S.; Park, J.; Lee, K.; and Shin, J. 2020. Regularizing Class-Wise Predictions via Self-Knowledge Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 13873-13882. IEEE.

Zadrozny, B.; and Elkan, C. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada, 694-699. ACM.

Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; and Ma, K. 2019. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, 3712-3721. IEEE.