# Criterion Collapse and Loss Distribution Control

**Matthew J. Holland**¹

¹SANKEN, Osaka University, Japan. Correspondence to: Matthew J. Holland.

*Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).*

**Abstract.** In this work, we consider the notion of criterion collapse, in which optimization of one metric implies optimality in another, with a particular focus on conditions for collapse into error probability minimizers under a wide variety of learning criteria, ranging from DRO and OCE risks (CVaR, tilted ERM) to non-monotonic criteria underlying recent ascent-descent algorithms explored in the literature (Flooding, SoftAD). We show how collapse in the context of losses with a Bernoulli distribution goes far beyond existing results for CVaR and DRO, then expand our scope to include surrogate losses, showing conditions where monotonic criteria such as tilted ERM cannot avoid collapse, whereas non-monotonic alternatives can.

## 1. Introduction

As machine learning systems become more widespread and integrated in our daily lives, the tension between the performance metrics we ideally wish to optimize (at test time) and those we actually optimize in practice (at training time) becomes increasingly important. Broadly speaking, tasks are characterized by losses and what we will call learning criteria. Losses are diverse, depending on the underlying problem to be solved (e.g., classification, regression, ranking, clustering, etc.) and a wide range of application-specific needs. Losses depend on random data, and are themselves random. In principle, performance can be described in terms of random loss distributions, but working with complex and unwieldy distributions is not practical; a concise numerical summary is usually desired, both for designing objectives to optimize and for disseminating interpretable results. Providing such a summary is the role of learning criteria. In contrast with the diversity of losses, there is a single de facto standard learning criterion, namely the expected value of the loss distribution, a criterion which is central to the general setting of the learning problem (Vapnik, 1999) that is pervasive in both theory and practice.

In recent years, however, the need to capture a wider variety of performance metrics at training time has motivated the use of novel learning criteria going beyond the expected value. Arguably the best-known families of such criteria are those which control sensitivity to the right tail of the loss distribution, e.g., CVaR (Curi et al., 2020), positive-tilted ERM (Li et al., 2021), and Cressie-Read DRO (Duchi & Namkoong, 2021), though criteria designed to capture both tails at once have also begun to appear (Holland, 2022). While all of these criteria can be clearly shown to be distinct from the expected value, when such criteria are optimized at training time, can we expect the outcome to actually change? This question is somewhat delicate, since it depends on the nature of the loss distribution. For example, when we are working with the zero-one loss for classification, or any binary performance indicator for that matter, previous work has noted that vanilla empirical risk minimization (ERM) is always sufficient for minimizing DRO and CVaR (Hu et al., 2018; Zhai et al., 2021), making it meaningless to distinguish between optimizing such criteria.
What about when we make use of surrogate loss functions for training? The traditional theory of transferring consistency from a surrogate to a target loss is tied tightly to the expected value criterion (and its linearity) (Bartlett et al., 2006; Reid & Williamson, 2011), meaning that a change in criterion can drastically change the relations between loss distributions. For example, given a model which is CVaR-optimal in terms of a surrogate loss, it may or may not be CVaR-optimal in terms of the zero-one loss; understanding when such properties hold is critical to ensure that the learning algorithms we use in practice are aligned with our true objectives. For many classes of learning criteria, however, such questions remain completely unexplored.

In this work, we consider the notion of criterion collapse, in which optimization of one metric implies optimality in another, with a particular focus on conditions for collapse into error probability minimizers (i.e., minimizers of the expected zero-one loss) under a wide variety of learning criteria, including learning scenarios that involve a surrogate loss. We start in §2 by showing how collapse in the context of losses with a Bernoulli distribution goes far beyond CVaR and DRO. We then expand our scope to include surrogate losses in §3. First, we highlight cases where optimality diverges across loss distributions under a common criterion (§3.1), and then on the other hand show conditions under which, for monotonic criteria (e.g., positive-tilted ERM), collapse into error probability minimizers is unavoidable (§3.2), though non-monotonic alternatives can be used to avoid this when such collapse is not desired (§3.3). The main limitation of our results in §3 is that we only cover margin-type losses for binary classification. Extensions to asymmetric, multi-class surrogate losses are left as future work. We complement our basic theory with a set of experiments in §4, training non-linear neural network models (e.g., ResNet-34) for image classification from scratch, comparing across a variety of learning criteria with a common base loss. We find that when criterion selection is optimized for validation accuracy, the non-monotonic criteria (with clear links to ascent-descent learning algorithms) provide an appealing balance across other metrics such as model norm and average (surrogate) loss at test time.

## 2. Criterion Collapse

Underlying this section is the following simple question: is it meaningful to introduce a new criterion? Even if two criteria are distinct in a mathematical sense, i.e., they return different values for the same loss distribution, if the solution set of one criterion completely includes that of the other, then an obvious sufficiency property holds; no tradeoffs between the criteria arise. When we have a sufficient condition (for optimality in both criteria) that is easy to satisfy, it essentially renders one criterion meaningless. Here we will look at how such a phenomenon occurs frequently under binary distributions.

### 2.1. Preliminaries

**Basic notation.** Let $\mathcal{X} \times \mathcal{Y}$ denote our underlying data space. Random data points will be denoted by individual random pairs distributed on $\mathcal{X} \times \mathcal{Y}$. For a single data point, often representing data at test time, we write $(X, Y)$. When we are dealing with non-random quantities, $X$ and $Y$ will be replaced by $x$ and $y$ respectively. As general-purpose notation for decisions, we write $h$, assumed to be an element of a set of admissible decisions $\mathcal{H}$ (the hypothesis class).
Throughout this paper, when we use $\operatorname{E}$ and $\operatorname{P}$ to represent expectation and probability, it will always be with respect to the distribution of $(X, Y)$, unless otherwise noted. For an arbitrary real-valued function $f : \mathcal{H} \to \mathbb{R}$, we denote the set of minimizers by $\arg\min_{h \in \mathcal{H}} f(h) \subseteq \mathcal{H}$, with $\arg\min_{h \in \mathcal{H}} f(h) = \emptyset$ in the case that no minimizers exist in $\mathcal{H}$. We remark that unlike $h$ (always $h : \mathcal{X} \to \mathcal{Y}$), the notation $f$ is not reserved here; we use $f$ and also $g$ to denote various helper functions throughout the paper.

**Loss function and criterion mapping.** Central to this paper is the notion of a numerical loss that can be used to provide feedback to a learning algorithm, or as a metric for evaluation after learning is complete. A loss function $\ell : \mathcal{H} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ maps (decision, data point) pairs to real values $\ell(h; x, y)$, called losses. When the data is random, so is the loss. We denote random losses by $L(h) := \ell(h; X, Y)$, making the dependence on the decision $h \in \mathcal{H}$ explicit in our notation. Running over the set $\mathcal{H}$, we end up with a set of random losses, denoted by $\mathcal{L} := \{L(h) : h \in \mathcal{H}\}$. We will sometimes refer to individual random losses $L \in \mathcal{L}$ without specifying which $h \in \mathcal{H}$ is associated with $L$. The other basic notion we require is that of a criterion mapping, denoted $C : \mathcal{L} \to \mathbb{R}$, which maps random losses $L \in \mathcal{L}$ to numerical values $C(L)$, called criteria. As mentioned in §1, the traditional criterion is $C(L) = \operatorname{E}[L]$, often called the risk, but we will go well beyond this criterion in the following sub-sections.

### 2.2. Random error and criterion collapse

The most basic and pervasive loss function used for evaluation is the zero-one loss, a simple classification error penalty that we denote as

$$\ell_{01}(h; x, y) := \begin{cases} 1, & \text{if } h(x) \neq y \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

Denoting the random error by $L_{01}(h) := \ell_{01}(h; X, Y)$, it follows a Bernoulli distribution that is completely determined by its expected value, the error probability $\operatorname{E}[L_{01}(h)] = \operatorname{P}\{h(X) \neq Y\}$. Since this quantity will appear frequently, we reserve the following symbols for the error probability and its solution set:

$$\mathcal{E}(h) := \operatorname{P}\{h(X) \neq Y\}, \qquad \mathcal{H}^{*}_{\mathcal{E}} := \arg\min_{h \in \mathcal{H}} \mathcal{E}(h). \tag{2}$$

The expected value $L \mapsto \operatorname{E}[L]$ also happens to be the traditional choice of criterion mapping in machine learning, but as discussed earlier in §1, a variety of new choices for the criterion mapping have arisen in the context of risk-sensitive learning. In general, by introducing new criteria (instead of the expected value), one gains an additional degree of freedom (beyond the choice of loss function) in terms of what qualities we evaluate, and how we quantify such qualities. However, for loss functions such as $\ell_{01}$ in (1), the resulting random loss $L_{01}$ has such a simple distribution that many criteria collapse into the expected value. Two important special cases of this phenomenon have been noted in the previous literature. We recall these two cases here in chronological order, adapted to our notation for readability.

**Theorem 2.1** (DRO criterion; Hu et al. (2018, Thm. 1)). *For an arbitrary random loss $L \in \mathcal{L}$, denote the distributionally robust optimization (DRO) criterion by $\mathrm{DRO}(L) := \sup_{\mu \in \mathcal{P}} \operatorname{E}_{\mu}[L]$, where the uncertainty set $\mathcal{P}$ is taken to be a ball centered at some pre-defined data distribution on $\mathcal{X} \times \mathcal{Y}$, with finite radius measured by a valid $f$-divergence. Under the zero-one loss $L = L_{01}$, error probability minimizers are always optimal in terms of the DRO criterion, namely we have $\mathcal{H}^{*}_{\mathcal{E}} \subseteq \arg\min_{h \in \mathcal{H}} \mathrm{DRO}(L_{01}(h))$, with $\mathcal{H}^{*}_{\mathcal{E}}$ defined earlier in (2).*
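To make the mechanism behind Theorem 2.1 concrete, the following minimal sketch treats the uncertainty set as a χ²-ball around the induced Bernoulli loss distribution, a deliberate two-point simplification of the data-distribution ball in the theorem statement (the function name, the radius value, and the closed-form derivation are our own illustrative choices, not taken from Hu et al. (2018)). For two-point distributions the χ²-divergence reduces to $(q - p)^{2}/(p(1-p))$, so the worst case simply shifts as much mass as allowed onto the loss-1 outcome, and the resulting value is non-decreasing in $p$.

```python
import numpy as np

def worst_case_error(p: float, radius: float) -> float:
    """Worst-case expected zero-one loss over a chi-squared ball around
    Bernoulli(p); a two-point simplification of the DRO criterion."""
    if p in (0.0, 1.0):
        return p  # the ball around a degenerate two-point distribution is trivial
    return min(1.0, p + np.sqrt(radius * p * (1.0 - p)))

# Non-decreasing in p: minimizing the error probability is sufficient for
# minimizing this DRO-style criterion, in the spirit of Theorem 2.1.
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p = {p:.1f} -> worst-case error = {worst_case_error(p, 0.1):.4f}")
```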
Another closely related insight has been presented in the literature, this time for conditional value-at-risk (CVaR), a well-studied criterion that is central to the quantification of financial risk.

**Theorem 2.2** (CVaR criterion; Zhai et al. (2021, Prop. 1)). *Again taking any loss $L \in \mathcal{L}$, we write the conditional value-at-risk (CVaR) criterion as*

$$\mathrm{CVaR}(L) := \inf_{\theta \in \mathbb{R}} \left[ \theta + \frac{1}{1 - \beta} \operatorname{E}[(L - \theta)_{+}] \right]$$

*where $0 < \beta < 1$ controls the degree of right-tail sensitivity, and $(\cdot)_{+} := \max\{0, \cdot\}$. Under the zero-one loss $L = L_{01}$, error probability minimizers are optimal in terms of CVaR, i.e., $\mathcal{H}^{*}_{\mathcal{E}} \subseteq \arg\min_{h \in \mathcal{H}} \mathrm{CVaR}(L_{01}(h))$.*

The basic message of Theorems 2.1 and 2.2 is clear: under losses with a Bernoulli distribution, it is essentially meaningless to consider DRO and CVaR as distinct from the expected value, since minimizing the expected loss is always sufficient for optimality in terms of DRO and CVaR as well. In §2.3 to follow, we show that this collapse into error probability minimizers arises for criteria going well beyond DRO and CVaR.

### 2.3. Which classes lead to collapse?

Here we take a look at several large classes of criteria, showing how most collapse in the sense illustrated in §2.2.

#### 2.3.1. Expectation of a fixed function

Starting with the simplest class, we consider criteria that are computed as

$$L \mapsto \operatorname{E}[f(L)] \tag{3}$$

where $f : \mathbb{R} \to \mathbb{R}$ is any function such that the expectation is finite. By setting $L = L_{01}$ and reflecting the dependence on $h \in \mathcal{H}$, note that $\operatorname{E}[f(L_{01}(h))] = f(0) + \mathcal{E}(h)(f(1) - f(0))$, where $\mathcal{E}(\cdot)$ is the error probability defined in (2). If $f(0) = f(1)$ happens to hold, the criterion is constant and thus not very interesting. When $f(0) \neq f(1)$, minimizers of $\operatorname{E}[f(L_{01}(\cdot))]$ are easy to characterize: they either minimize or maximize $\mathcal{E}(\cdot)$, depending on whether $f(1) > f(0)$ or not. As such, no matter how $f$ is designed, the result of minimizing such a criterion will always be one of these two extremes.

#### 2.3.2. Quantiles

As a natural alternative to the mean, let us consider a standard definition of quantiles, namely the $\beta$-level left quantile of the random loss, defined for $L \in \mathcal{L}$ by

$$Q_{\beta}(L) := \min\{x \in \mathbb{R} : \operatorname{P}\{L \le x\} \ge \beta\} \tag{4}$$

for all $0 < \beta \le 1$, and $Q_{\beta}(L) := -\infty$ for $\beta = 0$. While in general the relation between quantiles and the mean can be very complicated, for the special case of Bernoulli losses this relation is very simple for all choices of $\beta$, and we can readily show how the quantile criterion collapses to the mean.

**Proposition 2.3** (Collapse of left quantiles). *Given $Q_{\beta}$ as defined in (4) and $L = L_{01}$, we have $\mathcal{H}^{*}_{\mathcal{E}} \subseteq \arg\min_{h \in \mathcal{H}} Q_{\beta}(L_{01}(h))$ for all probability levels $0 \le \beta \le 1$.*

**Remark 2.4** (Related case: right quantiles). A very similar discussion can be carried out with the right quantiles defined by

$$Q^{+}_{\beta}(L) := \max\{x \in \mathbb{R} : \operatorname{P}\{L < x\} \le \beta\} \tag{5}$$

for $0 \le \beta < 1$, and $Q^{+}_{\beta}(L) := \infty$ for $\beta = 1$.

#### 2.3.3. Distribution-dependent functions

As a natural extension of the rudimentary fixed-function expectation criteria seen in §2.3.1, one can consider a family of functions $f(\cdot; \theta)$ parameterized by $\theta \in \mathbb{R}$, for which the value of $\theta$ is allowed to be determined based on the $L \in \mathcal{L}$ being evaluated. While countless possibilities exist, one class of criteria that captures many important special cases in the literature is to take $f(u; \theta) = \theta + \rho(u - \theta)$, and to optimize with respect to $\theta$ after taking the expectation. As a typical example, consider

$$C_{\rho}(L) := \inf_{\theta \in \mathbb{R}} \left[ \theta + \operatorname{E}[\rho(L - \theta)] \right] \tag{6}$$

where $\rho : \mathbb{R} \to \mathbb{R}$ used in (6) is assumed to be such that for each $L \in \mathcal{L}$, the function $\theta \mapsto \theta + \operatorname{E}[\rho(L - \theta)]$ achieves its minimum on $\mathbb{R}$, and that $\inf\{C_{\rho}(L) : L \in \mathcal{L}\} > -\infty$.
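As a quick numerical companion to the definition (6), the sketch below evaluates $C_{\rho}(L_{01})$ over a grid of $\theta$ values for a Bernoulli($p$) loss, using two choices of $\rho$ that reappear in Remark 2.6 below (the CVaR and tilted-risk choices); the helper name `c_rho` and the grid bounds are our own. The printed values are non-decreasing in $p$, previewing Proposition 2.5.

```python
import numpy as np

def c_rho(p: float, rho) -> float:
    """C_rho(L01) = inf_theta [theta + E[rho(L01 - theta)]] for a Bernoulli(p)
    zero-one loss (value 1 w.p. p, else 0), minimized over a theta grid."""
    thetas = np.linspace(-2.0, 2.0, 4001)
    objective = thetas + p * rho(1.0 - thetas) + (1.0 - p) * rho(-thetas)
    return float(np.min(objective))

beta, gamma = 0.5, 2.0
rho_cvar = lambda u: np.maximum(0.0, u) / (1.0 - beta)   # CVaR choice of rho
rho_tilt = lambda u: (np.exp(gamma * u) - 1.0) / gamma   # tilted-risk choice

# Both columns are non-decreasing in p; note that CVaR plateaus at 1 once
# p >= 1 - beta, reflecting its non-strict monotonicity.
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p = {p:.1f}: CVaR = {c_rho(p, rho_cvar):.3f}, "
          f"tilted = {c_rho(p, rho_tilt):.3f}")
```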
As we will discuss later, special cases of the learning criterion $C_{\rho}(\cdot)$, as well as close variants, arise frequently in the literature, particularly in the case where $\rho$ is monotonic (non-decreasing) and convex on $\mathbb{R}$. As the following result shows, under the zero-one loss, monotonicity alone is sufficient to imply collapse into error probability minimizers.

**Proposition 2.5** (Collapse under monotonic dispersion). *With $C_{\rho}$ as defined in (6), we have $\mathcal{H}^{*}_{\mathcal{E}} \subseteq \arg\min_{h \in \mathcal{H}} C_{\rho}(L_{01}(h))$ when $\rho(\cdot)$ is non-decreasing, with equality when $\rho(\cdot)$ is increasing.*

As we note in the following remarks, a wide range of well-known criteria are captured by Proposition 2.5, or can be shown to collapse using an analogous argument.

**Remark 2.6** (Special case: OCE criteria). When $\rho$ in (6) is assumed to be a closed, convex, sub-differentiable function, normalized such that $\rho(0) = 0$ and $1 \in \partial\rho(0)$, which is monotonically non-decreasing on $\mathbb{R}$, the resulting learning criterion is called an optimized certainty equivalent (OCE) (see also Figure 4). Clearly, setting $\rho(u) = u$ recovers the expected value. Another well-known special case of OCE criteria is CVaR, which is recovered by setting $\rho(u) = \max\{0, u\}/(1 - \beta)$ with $0 < \beta < 1$. Clearly this $\rho$ is non-decreasing, so the inclusion in Proposition 2.5 holds; note that Theorem 2.2 is thus implied. On the other hand, the $\rho$ used for CVaR is not strictly increasing. Under the zero-one loss, are there CVaR-optimal solutions that are not optimal in terms of error probability? This is indeed possible, but it depends on both $\mathcal{H}$ and $\beta$; we discuss this in detail in §B.2. Another well-studied special case from the OCE class is the tilted risk (also called the entropic risk), recovered by setting $\rho(u) = (e^{\gamma u} - 1)/\gamma$ with $\gamma > 0$, taking the form

$$\frac{1}{\gamma} \log \operatorname{E}\left[ e^{\gamma L} \right]. \tag{7}$$

Note that since the exponential function is monotonically increasing, so is this $\rho$, and thus by Proposition 2.5, tilted risk minimization under $L = L_{01}$ is identical to $\mathcal{E}(\cdot)$ minimization. To conclude this remark, we note that more generally, the monotonicity of all OCE risks immediately implies that all $\mathcal{E}(\cdot)$-optimal rules are also $C_{\rho}$-optimal, even without the other assumptions of convexity and normalization.

**Remark 2.7** (Related case: Cressie-Read DRO). An important class of distributionally robust optimization (DRO) criteria is defined in a very similar way to the OCE class given in (6). In particular, using the Cressie-Read family of $f$-divergences leads to criteria of the form

$$\mathrm{DRO}_{c,\varepsilon}(L) := \inf_{\theta \in \mathbb{R}} \left[ \theta + \left( \operatorname{E}\left[ \rho_{c,\varepsilon}(L - \theta) \right] \right)^{1/c_{*}} \right]$$

where we have defined $\rho_{c,\varepsilon}(x) := (1 + c(c - 1)\varepsilon)^{c_{*}/c} (x)_{+}^{c_{*}}$ and $c_{*} := c/(c - 1)$, and these criteria are parameterized by $c > 1$ and $\varepsilon \ge 0$. Note that the function $\rho_{c,\varepsilon}$ is clearly non-decreasing on $\mathbb{R}$, just like in our discussion of CVaR in Remark 2.6. With $c > 1$, Cressie-Read DRO criteria are not strictly speaking OCE criteria, since the expected value is wrapped within $(\cdot)^{1/c_{*}}$. That said, an argument analogous to that in the proof of Proposition 2.5 using monotonicity can be used to show the same result, i.e., that $\mathcal{H}^{*}_{\mathcal{E}}$ is included in the solution set of $\mathrm{DRO}_{c,\varepsilon}(L_{01}(\cdot))$.

**Remark 2.8** (Related case: criteria based on Orlicz regret). Let $f : \mathbb{R}_{+} \to \mathbb{R}_{+}$ be a proper, lower semi-continuous convex function, satisfying $f(1) = 0$, $f(0) < \infty$, and super-coercivity in that $f(u)/u \to \infty$ as $u \to \infty$.
Letting $f^{*}$ denote the usual convex conjugate of $f$, namely $f^{*}(u) := \sup_{v \in \mathbb{R}}[uv - f(v)]$, in recent work by Fröhlich & Williamson (2023), a class of learning criteria is introduced with the form

$$C_{f,\varepsilon}(L) := \inf_{\theta \in \mathbb{R},\, \sigma > 0} \left[ \sigma\varepsilon + \theta + \sigma \operatorname{E}\left[ f^{*}\!\left( \frac{L - \theta}{\sigma} \right) \right] \right] \tag{8}$$

where the infimum is taken over $\theta \in \mathbb{R}$ and $\sigma > 0$, and $\varepsilon > 0$ is a parameter of the criterion. Note that $f^{*}$ is finite, convex, and monotonically increasing on $\mathbb{R}$ (Fröhlich & Williamson, 2023, Prop. 3.2). Using this strong monotonicity property, just as we did for OCE criteria with increasing $\rho$, we can prove that under the zero-one loss, $C_{f,\varepsilon}(L_{01}(\cdot))$-optimal decisions coincide with the $\mathcal{E}(\cdot)$-optimal decisions.

**Remark 2.9** (Non-monotonic alternative: variantile). The variance of a random variable, say $L$, can be naturally expressed as the minimum value of the function $\operatorname{E}(L - \theta)^{2}$, with the minimum taken with respect to $\theta \in \mathbb{R}$. The dispersion around $\theta$ here is measured in a symmetric fashion, since the same function is used both when $L > \theta$ and $L \le \theta$. A natural asymmetric extension considers re-scaling these two cases with $2\tau$ and $2(1 - \tau)$ respectively, where $\tau$ is a free parameter such that $0 < \tau < 1$. Writing this explicitly as a learning criterion, we have

$$C_{\tau}(L) := \min_{\theta \in \mathbb{R}} 2 \operatorname{E}\left[ |\mathbb{1}_{\theta}(L) - \tau| (L - \theta)^{2} \right] \tag{9}$$

where $\mathbb{1}_{\theta}(L) := \operatorname{I}\{L \le \theta\}$. The value of $\theta$ that minimizes the function on the right-hand side of (9) is well known in the economics literature as the expectile, and while the residual $C_{\tau}(L)$ itself has received less attention, it has appeared in recent work under names such as variancile and variantile (Frongillo & Kash, 2021). Using the latter term here, the variantile in (9) generalizes the variance, with $\tau = 1/2$ recovering the variance as a special case. While of a similar nature to the $C_{\rho}$ criteria given in (6), the critical difference here is that $(\cdot)^{2}$ is not monotonic on $\mathbb{R}$. As a result, collapse in the sense of Proposition 2.5 and Remarks 2.6-2.8 is not guaranteed; instead the solution set collapses into the set of either minimizers or maximizers of $\mathcal{E}(\cdot)$, analogous to §2.3.1. See §B.3 for a more detailed demonstration of this.
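Since the variantile may be less familiar than the criteria above, here is a small sketch that evaluates (9) by direct grid search over $\theta$ (the function name `variantile` and the grid resolution are our own choices). With $\tau = 1/2$ the output matches the sample variance, while $\tau > 1/2$ weights deviations above the expectile more heavily.

```python
import numpy as np

def variantile(losses: np.ndarray, tau: float, grid_size: int = 1001) -> float:
    """Variantile (9): min over theta of 2 E[|I{L <= theta} - tau| (L - theta)^2].

    tau = 1/2 recovers the variance; the minimizing theta is the tau-expectile."""
    thetas = np.linspace(losses.min(), losses.max(), grid_size)
    dev = losses[:, None] - thetas[None, :]
    weights = np.abs((dev <= 0).astype(float) - tau)
    return float(np.min(2.0 * np.mean(weights * dev ** 2, axis=0)))

rng = np.random.default_rng(0)
sample = rng.exponential(size=2000)               # a right-skewed loss sample
print(variantile(sample, tau=0.5), sample.var())  # these two agree closely
print(variantile(sample, tau=0.9))                # heavier right-tail emphasis
```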
## 3. Relationship with Surrogate Losses

Our analysis and discussion of criterion collapse in the preceding section was centered around losses with a Bernoulli distribution, which typically arise when using the zero-one loss $\ell_{01}$ defined in (1). While this loss function is ubiquitous as a metric for evaluation, during training we rarely use $\ell_{01}$ to design a criterion to be optimized directly. It is far more common to introduce a surrogate function, say $\ell(h; x, y)$, that is more congenial to numerical optimization. This results in there being two distinct loss distributions for each candidate $h$, namely that of the binary error $L_{01}(h) = \ell_{01}(h; X, Y)$ and that of the surrogate loss $L(h) = \ell(h; X, Y)$. Say we introduce some new criterion map $C(\cdot)$. We already know from §2 how most typical criteria collapse under $L_{01}(h)$, but these insights do not apply in general to $L(h)$, which may have a much more complicated distribution than a simple Bernoulli. Placing our focus on the surrogate criterion $C(L(\cdot))$, there are two natural instances of collapse with respect to error probability minimizers that we can conceive of:

$$\mathcal{H}^{*}_{\mathcal{E}} \subseteq \arg\min_{h \in \mathcal{H}} C(L(h)) \tag{10}$$

$$\arg\min_{h \in \mathcal{H}} C(L(h)) \subseteq \mathcal{H}^{*}_{\mathcal{E}}. \tag{11}$$

In some learning scenarios, the properties (10) and (11) may of course be desirable. In the case of (10), which is the surrogate version of the collapse notion seen in §2, we have that it is possible to be optimal in $C(L(\cdot))$ without having to make a sacrifice in terms of the error probability $\mathcal{E}(\cdot)$. As for (11), in the traditional setup where $C(\cdot) = \operatorname{E}[\cdot]$, proving that (11) holds is the central goal of classification-calibrated surrogate design (Bartlett et al., 2006). On the other hand, the inclusion in (11) means that error probability minimization is unavoidable (under $C(\cdot)$ and $\ell(\cdot)$), which could have unintended repercussions in terms of other metrics such as fairness or privacy, for which tradeoffs with accuracy are well known (Menon et al., 2021). In §3.1 we first give a simple example showing how both (10) and (11) can fail to hold under a limited model $\mathcal{H}$. In contrast, when $\mathcal{H}$ is a highly expressive model, we show in §3.2 that for most important (binary) surrogate losses and a large class of criteria (in particular the tilted risk criteria), average error optimizers are in fact unavoidable (i.e., (11) holds). We complement this in §3.3 by showing how criteria underlying typical ascent-descent gradient-based algorithms can be used to avoid unintentional collapse into $\mathcal{H}^{*}_{\mathcal{E}}$.

### 3.1. Disentangling the error and surrogate loss

For the remainder of §3, we focus on the binary classification case, where our random data takes the form $Z = (X, Y)$, taking values in $\mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \{-1, 1\}$. Let $\mathcal{S}$ denote a set of scoring functions $s : \mathcal{X} \to \mathbb{R}$, with decision set $\mathcal{H} := \{\operatorname{sign}(s(\cdot)) : s \in \mathcal{S}\}$ based on $\mathcal{S}$, noting that for any $u \in \mathbb{R}$, $\operatorname{sign}(u)$ is $1$ when $u \ge 0$, and is $-1$ when $u < 0$. In this setting, we show how under a model $\mathcal{H}$ with limited expressive power, it is very easy to construct an example where both notions of collapse just mentioned cannot hold.

For simplicity and ease of visualization, consider an example on the plane, i.e., where $\mathcal{X} = \mathbb{R}^{2}$. Let $X = (V_1, V_2)$, where $V_1$ and $V_2$ are real-valued random variables. Fixing an arbitrary $a > 1$, let $V_1$ take values in $\{-1, 1, a\}$ and let $V_2$ take values in $\{-a, -1, 1\}$. Assume that $\operatorname{P}\{X = (-1, 1)\} = \operatorname{P}\{X = (1, -1)\} > 0$, and writing $p := \operatorname{P}\{X = (-1, 1)\} + \operatorname{P}\{X = (1, -1)\}$, let us say that $\operatorname{P}\{X = (a, -a)\} = 1 - p$ and $p < 1$ also hold. As for the labeling, let us assume

$$Y := \operatorname{I}\{\|X\|_{2} \le 2\} \operatorname{sign}(V_2 - V_1) + \operatorname{I}\{\|X\|_{2} > 2\} \operatorname{sign}(V_1 - V_2). \tag{12}$$

Set $\mathcal{S} := \{s_1, s_2\}$, where $s_1(v_1, v_2) := v_2 - v_1$ and $s_2(v_1, v_2) := v_1 - v_2$ for all $v_1, v_2 \in \mathbb{R}$. The resulting classifiers are $h_1(x) := \operatorname{sign}(s_1(x))$ and $h_2(x) := \operatorname{sign}(s_2(x))$ for $x \in \mathbb{R}^{2}$. See Figure 1 for a visualization of the three possible data points and the role of $p$. The error probabilities are $\operatorname{P}\{h_1(X) \neq Y\} = 1 - p$ and $\operatorname{P}\{h_2(X) \neq Y\} = p$. When $p > 1/2$, clearly $\mathcal{H}^{*}_{\mathcal{E}} = \{h_1\}$. As a typical surrogate loss, let us consider the usual binary logistic loss, namely the binary cross-entropy loss with scores converted to probabilities via the logistic function. Losses take the form $\ell_{\log}(s; x, y) := \log(1 + \exp(-y s(x)))$ for each $x \in \mathbb{R}^{2}$ and $y \in \{-1, 1\}$. Writing $L_{\log}(s) := \ell_{\log}(s; X, Y)$, the surrogate loss distributions that result are quite simple; some arithmetic shows that we have

$$L_{\log}(s_1) \in \{\log(1 + e^{-2}), \log(1 + e^{2a})\}, \qquad L_{\log}(s_2) \in \{\log(1 + e^{-2a}), \log(1 + e^{2})\}$$

with probability 1. For any fixed value of $a > 1$, it is always possible to take $0 < p < 1$ close enough to 1 that, in terms of the expected surrogate loss, we have $\arg\min_{s \in \mathcal{S}} \operatorname{E}[L_{\log}(s)] = \{s_1\}$, which is in agreement with the error probability minimizer. On the other hand, regardless of what value of $0 < p < 1$ is assumed, both the maximum and minimum losses incurred by $s_2$ are respectively smaller than those of $s_1$; that is,

$$\max L_{\log}(s_2) < \max L_{\log}(s_1), \qquad \min L_{\log}(s_2) < \min L_{\log}(s_1).$$
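The arithmetic behind this example is easy to verify directly. The following sketch sets $a = 2$ and $p = 0.8 > 1/2$ (matching Figure 1) and reproduces the loss values and error probabilities above; the variable names are our own.

```python
import numpy as np

a, p = 2.0, 0.8  # outlier scale and inlier mass, with p > 1/2
points = np.array([[-1.0, 1.0], [1.0, -1.0], [a, -a]])
probs = np.array([p / 2, p / 2, 1.0 - p])
labels = np.array([1.0, -1.0, 1.0])  # the rule (12) flips the outlier's label

scores = {"s1": points[:, 1] - points[:, 0],   # s1(v1, v2) = v2 - v1
          "s2": points[:, 0] - points[:, 1]}   # s2(v1, v2) = v1 - v2

for name, s in scores.items():
    losses = np.log1p(np.exp(-labels * s))      # binary logistic loss
    error = probs[np.sign(s) != labels].sum()   # no zero scores arise here
    print(f"{name}: E[loss] = {probs @ losses:.3f}, "
          f"losses in [{losses.min():.3f}, {losses.max():.3f}], "
          f"error prob = {error:.3f}")
```

Running this prints a smaller expected loss and error probability for `s1`, while `s2` attains both the smaller minimum and the smaller maximum loss.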
This simple observation means that criteria which tend to emphasize optimizing either the worst case or the best case in terms of surrogate loss values will end up disagreeing with the error probability $\mathcal{E}(\cdot)$, and thus neither (10) nor (11) can hold.

[Figure 1: two panels; the left panel shows the regions $\{\|X\|_{2} \le 2\}$ and $\{\|X\|_{2} > 2\}$ with the three possible data points, and the right panel illustrates the choice of $p$.]

Figure 1. In the left plot, we show the three possible data points that can arise in the example described in §3.1. Points above the dashed silver line are assigned a label of 1 by $h_1$ and $-1$ by $h_2$; signs are reversed for all points below this line. For the outlying point in the bottom right, we have set $a = 2$ in this example. In the right plot, we illustrate setting $p > 1/2$ to ensure the optimality of $h_1$ and $h_2$ diverges across distinct criteria.

This behavior is not limited to the extreme case where the criterion map is $C(\cdot) = \max(\cdot)$ or $C(\cdot) = \min(\cdot)$. For example, analogous results can easily be shown to hold under the $\gamma$-tilted risk $L \mapsto (1/\gamma)\log(\operatorname{E}[e^{\gamma L}])$ (recalling Remark 2.6) for $\gamma \neq 0$ with $|\gamma|$ sufficiently large, with $\gamma < 0$ emphasizing the best case, and $\gamma > 0$ emphasizing the worst case.

### 3.2. Criteria that cannot avoid collapse

Let us assume that $\ell$ is a surrogate loss function which applies penalties to classification margins, i.e., assume that $\ell(s; x, y) := \varphi(y s(x))$, where the margin penalizer $\varphi : \mathbb{R} \to \mathbb{R}_{+}$ is assumed to be convex and non-negative over $\mathbb{R}$, as well as differentiable at 0; one typical example is $\ell = \ell_{\log}$ from §3.1. Let us also generalize beyond expected loss criteria by considering the optimized certainty equivalent (OCE) criterion, defined for any $L \in \mathcal{L}$ by

$$\mathrm{OCE}(L) := \inf_{\theta \in \mathbb{R}} \left[ \theta + \operatorname{E}[\rho(L - \theta)] \right] \tag{13}$$

and characterized by $\rho : \mathbb{R} \to \mathbb{R}$ satisfying the properties mentioned in Remark 2.6. Our running assumption is that for each $L \in \mathcal{L}$, there is always a $\theta^{*} \in \mathbb{R}$ such that $\mathrm{OCE}(L) = \theta^{*} + \operatorname{E}[\rho(L - \theta^{*})]$ holds. We show below that for a large sub-class of OCE criteria and loss functions, criterion collapse in the sense of (11) often cannot be avoided.

**Proposition 3.1.** *Assume that the margin penalizer satisfies $\varphi'(0) < 0$, and that $\rho$ in (13) is increasing (not just non-decreasing) and differentiable. With $\mathcal{H} := \{\operatorname{sign}(s(\cdot)) : s \in \mathcal{S}\}$, let $\mathcal{S}$ be the set of all measurable functions on $\mathcal{X}$, and let $(s_1, s_2, \ldots)$ be a sequence from $\mathcal{S}$, with the resulting classifiers denoted by $h_n(\cdot) := \operatorname{sign}(s_n(\cdot))$. Write $L_{\varphi}(h) := \ell(s_h; X, Y) = \varphi(Y s_h(X))$ for the losses induced by $\varphi$. Taking $n \to \infty$, the following implication holds:*

$$\mathrm{OCE}(L_{\varphi}(h_n)) \to \inf_{h \in \mathcal{H}} \mathrm{OCE}(L_{\varphi}(h)) \implies \mathcal{E}(h_n) \to \mathcal{E}^{*},$$

*where $\mathcal{E}^{*} := \inf_{h \in \mathcal{H}} \mathcal{E}(h)$. In addition, minimizers of the OCE surrogate satisfy $\arg\min_{h \in \mathcal{H}} \mathrm{OCE}(L_{\varphi}(h)) \subseteq \mathcal{H}^{*}_{\mathcal{E}}$.*

**Remark 3.2** (Classification-calibrated surrogates). Our assumptions on the margin penalizer $\varphi(\cdot)$ in Proposition 3.1 amount to it being classification calibrated in the sense of Bartlett et al. (2006), a basic property common to all popular surrogates for binary classification. Many choices of $\varphi(\cdot)$ are allowed by Proposition 3.1: in addition to monotonic losses such as the logistic loss $\varphi(u) = \log(1 + \exp(-u))$, the exponential loss $\varphi(u) = \exp(-u)$, and the hinge loss $\varphi(u) = \max\{0, 1 - u\}$, it also captures non-monotonic choices such as the quadratic loss $\varphi(u) = (1 - u)^{2}$, the ARC-X4 loss $\varphi(u) = |1 - u|^{5}$ (Breiman, 1999), and shifted Huber-Catoni penalties (Holland, 2019).
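For reference, the sketch below collects the margin penalizers named in Remark 3.2 and checks the key calibration condition $\varphi'(0) < 0$ by central finite differences; the dictionary layout and names are our own.

```python
import numpy as np

# Margin penalizers allowed by Proposition 3.1 (convex, differentiable at 0).
penalizers = {
    "logistic": lambda u: np.log1p(np.exp(-u)),
    "exponential": lambda u: np.exp(-u),
    "hinge": lambda u: np.maximum(0.0, 1.0 - u),
    "quadratic": lambda u: (1.0 - u) ** 2,
    "arc_x4": lambda u: np.abs(1.0 - u) ** 5,
}

h = 1e-6  # central finite-difference check that phi'(0) < 0
for name, phi in penalizers.items():
    slope = (phi(h) - phi(-h)) / (2.0 * h)
    print(f"{name:12s} phi'(0) ~ {slope:+.2f}")
```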
**Remark 3.3** (Strict monotonicity of $\rho$). While the class of OCE criteria in its most general form asks only that $\rho$ be non-decreasing, in Proposition 3.1 we require that it be strictly monotonic (increasing). Should a flat (zero-slope) region be allowed to exist, $\mathrm{OCE}(L_{\log}(\cdot))$-optimal but $\mathcal{E}(\cdot)$-sub-optimal solutions may exist. This is akin to the positive steepness condition seen in related work on DRO (Hu et al., 2018, Lem. 1). It means that the special case of CVaR, amounting to $\mathrm{OCE}(\cdot)$ with $\rho(u) = \max\{0, u\}/(1 - \beta)$ and $0 < \beta < 1$, is excluded. On the other hand, the closely related $\gamma$-tilted risk, with $\rho(u) = (e^{\gamma u} - 1)/\gamma$ and $\gamma > 0$, is included. Under models with high expressive capacity, Proposition 3.1 suggests that methods such as tilted ERM may not always be a trustworthy choice for learning tasks where some evaluation metric of interest (e.g., fairness, interpretability) is at odds with the error probability.

### 3.3. Criteria that can avoid collapse

Here we consider alternatives to the class of criteria captured by the unavoidability result in Proposition 3.1. To do so, we introduce minor changes to the functional form of the OCE criterion in (13), and highlight links to existing algorithms in the literature (see Remark 3.5). Let $\tilde{\rho} : \mathbb{R} \to [0, \infty)$ be a continuous convex function which is coercive (i.e., $|u| \to \infty$ implies $\tilde{\rho}(u) \to \infty$), takes its minimum at 0 (assuming $\tilde{\rho}(0) = 0$), and is strictly convex on a neighborhood of 0. We use the symbol $\tilde{\rho}$ (instead of $\rho$ as in §3.2) because these assumptions imply that $\tilde{\rho}$ cannot be monotonic on $\mathbb{R}$. Let us denote the inner and outer loss-restraining criterion maps by $C_{\mathrm{inn}}(\cdot)$ and $C_{\mathrm{out}}(\cdot)$ respectively, defined for any $L \in \mathcal{L}$ by

$$C_{\mathrm{inn}}(L; \theta) := \tilde{\rho}(\operatorname{E}[L] - \theta), \tag{14}$$

$$C_{\mathrm{out}}(L; \theta) := \operatorname{E}[\tilde{\rho}(L - \theta)]. \tag{15}$$

See Figure 4 for some examples of valid $\tilde{\rho}$ compared with valid $\rho$ used for OCE criteria. Intuitively, $C_{\mathrm{inn}}(L; \theta)$ simply asks the learning algorithm to arrive at a loss distribution whose mean is close to some threshold $\theta$; when $\theta = 0$, $C_{\mathrm{inn}}(L; \theta)$ and $\operatorname{E}[L]$ are equivalent (i.e., their solution sets coincide). In contrast, $C_{\mathrm{out}}(L; \theta)$ asks that the loss distribution be well-concentrated near $\theta$. In both cases, the notions of closeness and concentration are broad, and can be asymmetric (e.g., heavier right tails than left tails are allowed) when $\tilde{\rho}$ is. The following result shows how, even under a highly expressive model as in §3.2, constraining the loss distribution by means of criteria such as $C_{\mathrm{inn}}$ and $C_{\mathrm{out}}$ lets us circumvent the unavoidability of Proposition 3.1.

**Proposition 3.4.** *Under the assumptions of Proposition 3.1, assume that $\mathcal{E}^{*} = 0$ and the margin penalizer $\varphi(\cdot)$ is monotonic (non-increasing) and unbounded above. Consider the non-trivial case where $\mathcal{H}^{*}_{\mathcal{E}} \neq \mathcal{H}$. Then, there exists $\theta > 0$ such that (11) cannot hold under either of the loss-restraining criteria, i.e., we have*

$$\mathcal{H}^{*}_{\mathcal{E}} \cap \arg\min_{h \in \mathcal{H}} C(L_{\varphi}(h); \theta) = \emptyset$$

*for both $C = C_{\mathrm{inn}}$ and $C = C_{\mathrm{out}}$.*

**Remark 3.5** (Links to algorithms in the literature). While we have introduced the loss-restraining criteria $C_{\mathrm{inn}}$ and $C_{\mathrm{out}}$ in (14) and (15) in rather general form, plugging in concrete examples of $\tilde{\rho}$ recovers specific learning algorithms from the literature. Perhaps the best known is Flooding, as studied by Ishida et al. (2020). By setting $\tilde{\rho}(\cdot) = |\cdot|$ and applying sub-gradient descent with step size $\alpha \ge 0$ to the resulting $C_{\mathrm{inn}}(L(\cdot); \theta)$ for any differentiable loss $L$, we end up with the update rule

$$h_{t+1} = h_t - \alpha \operatorname{sign}(\operatorname{E}[L(h_t)] - \theta) \operatorname{E}[\nabla L(h_t)]. \tag{16}$$

In practice, of course, the true expectation is replaced with the empirical mean over a finite sample. From (16) it is clear that as the sequence $(h_1, h_2, \ldots)$ progresses, as long as the expected loss is above the threshold $\theta$, vanilla gradient descent (on the expected loss) is run; it is only if the expected loss falls below this threshold that the sign flips, changing the update to gradient ascent.
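In code, the Flooding objective is a one-line wrapper around the mean loss. Below is a minimal PyTorch sketch (the function name is ours); backpropagating through the returned scalar reproduces the sign-flipping update (16), since $\nabla|m - \theta| = \operatorname{sign}(m - \theta)\nabla m$ whenever $m \neq \theta$.

```python
import torch

def flooding_objective(batch_losses: torch.Tensor, theta: float) -> torch.Tensor:
    """Flooding criterion of Ishida et al. (2020): |E[L] - theta| + theta.

    Descends on the mean loss while it sits above theta, and ascends once it
    dips below; the added constant theta leaves gradients unchanged."""
    mean_loss = batch_losses.mean()
    return (mean_loss - theta).abs() + theta
```

In an ordinary training loop, one would simply call `flooding_objective(per_example_losses, theta).backward()` in place of the usual `per_example_losses.mean().backward()`.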
The original intuition behind this approach was simply to avoid over-optimizing a surrogate loss function, but since the original paper, links between this method and sharpness-aware minimization (Foret et al., 2021) have been established; see for example Karakida et al. (2023). The hard condition for switching between ascent and descent was softened in recent work by Holland & Nakatani (2023); their approach can be interpreted in our framework as using $C_{\mathrm{out}}$ rather than $C_{\mathrm{inn}}$, and replacing the absolute value function with $\tilde{\rho}(\cdot) = \sqrt{(\cdot)^{2} + 1} - 1$. The resulting gradient descent update is

$$h_{t+1} = h_t - \alpha \operatorname{E}[\tilde{\rho}'(L(h_t) - \theta) \nabla L(h_t)]. \tag{17}$$

This procedure is called softened ascent-descent (SoftAD) by the authors; note that the hard switch of $\operatorname{sign}(\cdot)$ in (16) is replaced in (17) by a smooth switch $\tilde{\rho}'(\cdot)$, and that the modulation is applied in a per-point fashion before averaging, rather than after. Given that the Flooding and SoftAD methods just described are captured as special cases in Proposition 3.4, we shall pay particular attention to them in our empirical analysis to follow in §4.
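The corresponding $C_{\mathrm{out}}$ objective admits an equally small sketch (again, the function name is ours, and this is a sketch of (15) with the stated $\tilde{\rho}$, not the authors' reference implementation). Autograd applies the per-example modulation $\tilde{\rho}'(u) = u/\sqrt{u^{2} + 1}$ of update (17) before averaging.

```python
import torch

def softad_objective(batch_losses: torch.Tensor, theta: float) -> torch.Tensor:
    """Outer loss-restraining criterion (15) with the smooth non-monotonic
    dispersion rho(u) = sqrt(u^2 + 1) - 1 used by SoftAD.

    Each loss is penalized by its (smoothed) distance from theta, so the
    gradient modulation happens per example, before averaging."""
    dev = batch_losses - theta
    return (torch.sqrt(dev ** 2 + 1.0) - 1.0).mean()
```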
**Remark 3.6** (Alternative approaches in the literature). In the context of avoiding collapse of DRO criteria (e.g., Theorem 2.1), Hu et al. (2018, §4) consider constraining the uncertainty set (our $\mathcal{P}$) such that the underlying data distribution is effectively controlled by a simple discrete latent variable. This nice trick allows them to express their structural ARM criterion in terms of a weighted sum of expected zero-one loss and variance-like terms (e.g., their equation 18), which in principle makes it possible to avoid collapse. Another clever approach to avoiding such collapse (this time for CVaR; Theorem 2.2) is to consider the task of learning an ensemble of $k$ candidates, as done by Zhai et al. (2021, §3), i.e., choosing $h := (h_1, \ldots, h_k)$ with each $h_j \in \mathcal{H}$ and a weight vector $\lambda := (\lambda_1, \ldots, \lambda_k)$. The underlying loss function is then a weighted sum of zero-one losses, namely $\ell^{(k)}_{01}(h, \lambda; x, y) := \sum_{j=1}^{k} \lambda_j \ell_{01}(h_j; x, y)$, and thus in principle we can avoid collapse since the distribution is no longer Bernoulli. While both of these results describe settings in which collapse need not occur under $L = L_{01}$, collapse is still possible, and there are no conditions for avoiding collapse with surrogates (as in our Proposition 3.4).

## 4. Empirical Analysis

Within our theoretical analysis of §3, we contrasted learning criteria that can and cannot avoid collapse into error probability minimizers. In particular, we emphasized the role of monotonicity in measuring the dispersion of losses around a threshold ($\rho$ vs. $\tilde{\rho}$). DRO, CVaR, and tilted ERM are typical monotonic methods, whereas Flooding and SoftAD are representative non-monotonic methods. To complement the points established in the previous section, here we implement the aforementioned methods and apply them to the training of non-linear neural network models for classification. Our interest is in tradeoffs. Namely, having optimized each method for its respective learning criterion (under a base surrogate loss), we compare performance in terms of three alternative metrics: average surrogate loss, accuracy, and model complexity measured by the norm of the model parameters.

[Figure 2: trajectories of the metrics "loss", "acc", and "norm" over 250 epochs for CVaR, DRO, Flood, SoftAD, and Tilted; left panels for CIFAR-100, right panels for SVHN.]

Figure 2. Key metrics of interest (vertical axis) over epochs (horizontal axis). Here "loss" refers to average surrogate loss, "acc" refers to accuracy, and "norm" refers to the model L2 norm. Loss and accuracy are given for both training (dotted lines) and test data (solid lines). Plots on the left are for CIFAR-100, and plots on the right are for SVHN.

|        | CIFAR-10    | CIFAR-100   | Fashion-MNIST | SVHN        |
|--------|-------------|-------------|---------------|-------------|
| CVaR   | 0.14 (0.15) | 0.28 (0.13) | 0.46 (0.13)   | 0.20 (0.12) |
| DRO    | 0.02 (0.04) | 0.0 (0.0)   | 0.0 (0.0)     | 0.17 (0.20) |
| Flood  | 0.05 (0.06) | 0.03 (0.05) | 0.01 (0.0)    | 0.01 (0.0)  |
| SoftAD | 0.06 (0.05) | 0.15 (0.13) | 0.01 (0.0)    | 0.19 (0.19) |
| Tilted | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)     | 0.0 (0.0)   |

Table 1. Hyperparameter values selected for each method to maximize validation accuracy (averaged over trials). Standard deviation (again over trials) is given in parentheses.

**Setup.** Our experiment design essentially follows that described by Ishida et al. (2020) (who first proposed the Flooding technique). The basic learning task is image classification from scratch. We use four standard datasets: CIFAR-10, CIFAR-100, Fashion-MNIST, and SVHN, accessed via the torchvision package. Each of these datasets has a default training-test split, which we leave fixed across all randomized trials. In each trial, however, we shuffle the training data and do an 80-20 split for training-validation. For CIFAR-10/100, we use ResNet-34; for Fashion-MNIST, we flatten the images and use a simple feed-forward network (one hidden layer, 1000 units, batch normalization, ReLU activation); finally, for SVHN, we use ResNet-18. As an optimizer, we use vanilla SGD with step size 0.1 and momentum 0.9, run for 250 epochs with a mini-batch size of 200. In all cases, no pre-training is done; initial weights are randomly determined anew for each trial, and we note that the initial weights for any given trial are common across all methods being tested. The base surrogate loss function used for training is the multi-class cross-entropy loss.

**Code.** A GitHub repository with code and seed values to re-create all the results presented in this paper is available at this URL: https://github.com/feedbackward/collapse.

**Methods.** We are comparing the impact of different learning criteria, so the choice of method here amounts to choosing from the following: CVaR, DRO, Flooding, SoftAD, and tilted ERM. For DRO, we use Cressie-Read divergences (Remark 2.7), with $c = 2$ fixed and $\varepsilon$ as a hyperparameter. For CVaR (cf. Theorem 2.2), the quantile level $\beta$ is the hyperparameter. For both CVaR and DRO, we optimize the threshold $\theta$ simultaneously with the model parameters. Tilted ERM has a closed form (7) that we use, with hyperparameter $\gamma$. For the non-monotonic methods (Remark 3.5), Flooding is $C_{\mathrm{inn}}$ with $\tilde{\rho}(u) = |u|$, and SoftAD is $C_{\mathrm{out}}$ with $\tilde{\rho}(u) = \sqrt{u^{2} + 1} - 1$; both have the threshold $\theta$ as a hyperparameter. For each method, we consider ten distinct hyperparameter candidates, as follows.

- CVaR: $\beta$ ranges between 0 and 0.9.
- DRO: with $\varepsilon = (1/(1 - \tilde{\varepsilon}) - 1)^{2}/2$, $\tilde{\varepsilon}$ ranges between 0 and 0.9.
- Flooding: $\theta$ ranges between 0.01 and 1.0.
- SoftAD: $\theta$ ranges between 0.01 and 0.75.
- Tilted ERM: $\gamma$ ranges between 0 and 2.0.

We remark that vanilla ERM is captured as a special case of the three monotonic methods ($\beta = 0$, $\varepsilon = 0$, and $\gamma = 0$).

**Results.** For evaluation, before and after each epoch, we record three metrics: average surrogate loss, average accuracy (one minus the average zero-one loss), and the L2 norm of the model parameters. Here "average" refers to taking an average over each relevant data set (training, validation, test). Furthermore, we have run five independent trials, and the results presented are averaged over all trials. To choose a representative hyperparameter setting for each method, we select the value which maximizes the validation accuracy. In Figures 2 and 5 we show the trajectory of our metrics of interest over epochs (with epoch 0 being the initial state). In Table 1, we show the hyperparameter values that were selected for each method (here also, averaged over trials).

**Discussion.** There are several salient points that can be extracted from the results just described. First, it is clear that in terms of accuracy, all methods that we have tested include hyperparameter settings which can achieve near-perfect training accuracy, with test accuracy essentially the same across all methods. Note that for all trials and all datasets, tilted ERM is such that optimizing for validation accuracy leads to a setting of $\gamma = 0$, namely vanilla ERM. Second, while the best accuracy levels are essentially uniform across the methods tested, from the perspective of tradeoffs, things are far less uniform. In terms of the average surrogate loss and model norm metrics, there is substantial dispersion between methods. We see that the two non-monotonic methods tend to achieve a far better average test surrogate loss than the monotonic methods, with SoftAD being unique in realizing a small model norm as well. Considering these trends (in Figures 2 and 5) together with the selection trends (in Table 1), the results suggest that even weakly constraining the surrogate loss distribution via $C_{\mathrm{inn}}$ and $C_{\mathrm{out}}$ (corresponding to Flooding and SoftAD here) offers an appealing balance between the metrics we have studied. Whether or not this behavior can be extended in a straightforward manner to metrics from other domains (e.g., fairness, prediction calibration, etc.) remains to be seen.

## 5. Concluding Remarks

In this work, we have formulated and studied the notion of criterion collapse in machine learning, with particular emphasis placed on settings in which unintended collapse may occur, and on how this phenomenon changes when surrogate functions are introduced into the picture. By highlighting situations in which unintentional collapse is possible, we hope to slowly but surely build a methodological roadmap of sorts, highlighting potential risks with model/algorithm design decisions, analogous to the AI model cards that have attracted interest in recent years (Liang et al., 2024). By complementing our theory with experiments, we try to highlight how learning algorithms derived from distinct criterion classes (here the key difference is monotonicity) can lead to rather different algorithm behavior and learned model properties, both during and after training.
The idea is to take these empirical insights together with the aforementioned basic principles to guide smart AI system design when our goals are more diverse than just "maximize the accuracy on a big training set."

## Acknowledgements

This work was supported by JST PRESTO Grant Number JPMJPR21C6, and a grant from the SECOM Science and Technology Foundation.

## Impact Statement

This paper presents work whose goal is to advance the field of machine learning. While there are many potential societal consequences of forwarding this field, these are already well established and are not specific to this work. As such, there are no particular concerns we feel must be highlighted here.

## References

Ash, R. B. and Doléans-Dade, C. A. Probability and Measure Theory. Academic Press, 2nd edition, 2000.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

Ben-Tal, A. and Teboulle, M. An old-new concept of convex risk measures: The optimized certainty equivalent. Mathematical Finance, 17(3):449-476, 2007.

Bertsekas, D. P. Convex Optimization Algorithms. Athena Scientific, 2015.

Breiman, L. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1517, 1999.

Curi, S., Levy, K. Y., Jegelka, S., and Krause, A. Adaptive sampling for stochastic risk-averse learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 1036-1047, 2020.

Duchi, J. C. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378-1406, 2021.

Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=6Tm1mposlrM.

Fröhlich, C. and Williamson, R. C. Tailoring to the tails: Risk measures for fine-grained tail sensitivity. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=UntUoeLwwu.

Frongillo, R. M. and Kash, I. A. Elicitation complexity of statistical properties. Biometrika, 108(4):857-879, 2021.

Holland, M. J. Classification using margin pursuit. In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pp. 712-720, 2019.

Holland, M. J. Learning with risks based on M-location. Machine Learning, 111:4679-4718, 2022. doi: 10.1007/s10994-022-06217-5.

Holland, M. J. and Nakatani, K. Implicit regularization via soft ascent-descent. arXiv preprint arXiv:2310.10006v1, 2023.

Holland, M. J. and Tanabe, K. A survey of learning criteria going beyond the usual risk. Journal of Artificial Intelligence Research, 73:781-821, 2023.

Hu, S., Wang, X., and Lyu, S. Rank-based decomposable losses in machine learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 2023.

Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp. 2029-2037, 2018.

Ishida, T., Yamane, I., Sakai, T., Niu, G., and Sugiyama, M. Do we need zero training loss after achieving zero training error?
In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pp. 4604-4614, 2020.

Karakida, R., Takase, T., Hayase, T., and Osawa, K. Understanding gradient regularization in deep learning: Efficient finite-difference computation and implicit bias. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, pp. 15809-15827, 2023.

Kashima, H. Risk-sensitive learning via minimization of empirical conditional value-at-risk. IEICE Transactions on Information and Systems, 90(12):2043-2052, 2007.

Lee, J., Park, S., and Shin, J. Learning bounds for risk-sensitive learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 13867-13879, 2020.

Li, T., Beirami, A., Sanjabi, M., and Smith, V. Tilted empirical risk minimization. In The 9th International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=K5YasWXZT3O.

Li, T., Beirami, A., Sanjabi, M., and Smith, V. On tilted losses in machine learning: Theory and applications. Journal of Machine Learning Research, 24:1-79, 2023.

Liang, W., Rajani, N., Yang, X., Ozoani, E., Wu, E., Chen, Y., Smith, D. S., and Zou, J. What's documented in AI? Systematic analysis of 32K AI model cards. arXiv preprint arXiv:2402.05160, 2024.

Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit, A., and Kumar, S. Long-tail learning via logit adjustment. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=37nvvqkCo5.

Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29 (NIPS 2016), volume 29, pp. 2208-2216, 2016.

Reid, M. D. and Williamson, R. C. Composite binary losses. Journal of Machine Learning Research, 11:2387-2422, 2010.

Reid, M. D. and Williamson, R. C. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731-817, 2011.

Rockafellar, R. T. and Uryasev, S. Optimization of conditional value-at-risk. Journal of Risk, 2:21-42, 2000.

Rockafellar, R. T. and Uryasev, S. The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surveys in Operations Research and Management Science, 18(1-2):33-53, 2013.

Royset, J. O. Risk-adaptive approaches to learning and decision making: A survey. arXiv preprint arXiv:2212.00856, 2022.

Rudin, W. Principles of Mathematical Analysis. McGraw-Hill, 3rd edition, 1976.

Takeda, A. and Sugiyama, M. ν-support vector machine as conditional value-at-risk minimization. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 1056-1063, 2008.

Vapnik, V. N. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer, 2nd edition, 1999.

Williamson, R. C., Vernet, E., and Reid, M. D. Composite multiclass losses. Journal of Machine Learning Research, 17:1-52, 2016.

Zhai, R., Dan, C., Suggala, A., Kolter, Z., and Ravikumar, P. Boosted CVaR classification. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pp. 21860-21871, 2021.
## A. Additional Notes

**Tradeoffs between metrics.** The navigation between measures and their proxies is a well-known challenge in the field of public policy design, and has recently been highlighted as an important challenge by researchers at OpenAI.¹

**Surveys on learning criteria.** In recent years, several in-depth surveys looking at trends in criterion design for machine learning and related disciplines have appeared. See Holland & Tanabe (2023), Hu et al. (2023), and Royset (2022) for representative work.

**OCE and DRO references.** See Lee et al. (2020) and Holland & Tanabe (2023, §3.2) for background on OCE criteria in machine learning, and Ben-Tal & Teboulle (2007) for an authoritative initial reference. Tilted ERM with $\gamma > 0$ is a special case of OCE, but it should be noted that a more general notion of tilted ERM can be considered by using the right-hand side of (7) with general $\gamma \neq 0$ (Li et al., 2021). For $\gamma < 0$ the link to OCE criteria breaks down (reducing the role of right-side tails instead of emphasizing them), but this leads to an array of different applications (Li et al., 2023). For the well-known special case of CVaR, see Rockafellar & Uryasev (2000) for the landmark paper that presents the form of CVaR used in our Theorem 2.2. See Kashima (2007) and Takeda & Sugiyama (2008) for early links to machine learning. CVaR can be derived as a limiting case of DRO under a specific $f$-divergence (Duchi & Namkoong, 2021, Ex. 3). For more general results on DRO with uncertainty sets constructed using $f$-divergences, see Namkoong & Duchi (2016) and Duchi & Namkoong (2021).

**Tail-sensitive criteria.** The basic argument of §3.1 can be extended well beyond the tilted ERM criterion to a wide variety of other criteria, including many OCE risks and their inverted variants; see Lee et al. (2020) or Holland & Tanabe (2023) for some representative examples of such criteria.

**References on surrogate design.** There is a very large literature on the topic of designing surrogate functions under the $\operatorname{E}[\cdot]$ criterion. For binary classification, see for example Bartlett et al. (2006) and Reid & Williamson (2010; 2011). For the multi-class setting, see Williamson et al. (2016) and the references within.

**Remark A.1** (Coherent risks). One of the most well-known and well-studied classes of risk functions is that of coherent risk criteria.² Coherent risks are characterized by several properties, including a very weak notion of monotonicity, which simply requires that for any random losses $L$ and $L'$, a criterion map $C(\cdot)$ satisfies the following:

$$L \le L' \text{ (almost surely)} \implies C(L) \le C(L'). \tag{18}$$

While a very natural property in a general context, when we restrict ourselves to Bernoulli losses, with say $L = L_{01}(h_1)$ and $L' = L_{01}(h_2)$ based on candidates $h_1, h_2 \in \mathcal{H}$, this monotonicity property becomes rather vacuous. Just consider when the condition in (18) can actually hold. It can only be valid in two cases: either we have $L = L'$ (almost surely) or $L < L'$ (almost surely). Note that for $L < L'$ to hold almost surely, we must have $L = 0$ and $L' = 1$ almost surely. In any case, we are dealing with either constant random variables or identical distributions, so the notion of monotonicity only really becomes interesting for more complicated distributions.

**Gaps between theory and practice?** As discussed in §3.2, our Proposition 3.1 could be interpreted as a negative result showing that criteria like tilted ERM (with positive tilt) cannot avoid error probability minimization.
On the other hand, such criteria have been applied to good effect in classification problems under class imbalance (e.g., Li et al. (2023, §7.2.4)). We remark that there is no conflict between our results and such empirical findings, since in practice we have only limited expressive power and, more importantly, limited capability to optimize a typically non-convex objective (viewed as a function on $\mathcal{H}$). When effective model capacity is constrained, it is easy to find settings in which the zero-one loss mean is not controlled by the relevant surrogate loss criterion (e.g., our example from §3.1).

**Notes on experiment results.** Looking at Table 1, note that for CIFAR-100 and Fashion-MNIST, in all five trials, the hyperparameter setting selected for both DRO and tilted ERM was zero, i.e., vanilla ERM was selected. In the case of Fashion-MNIST, the trajectories in Figure 5 (right plot) align perfectly, but for CIFAR-100 (Figure 2, left plot), they do not; this is due to randomness inherent in the PyTorch implementation of ResNet-34 (no pre-training) that we have left uncontrolled.

¹ https://openai.com/research/measuring-goodharts-law
² See Rockafellar & Uryasev (2013, §3) for background.

## B.1. Proofs of results given in the main text

*Proof of Proposition 2.3.* Fixing $h \in \mathcal{H}$ and taking $0 < \beta \le 1 - \mathcal{E}(h)$, we know that for any choice of $x \ge 0$ we have $\operatorname{P}\{L_{01}(h) \le x\} \ge \beta$, but this fails for any $x < 0$. Thus we have $Q_{\beta}(L_{01}(h)) = 0$ for any $\beta$ in this range. For larger probability levels, namely where $1 - \mathcal{E}(h) < \beta \le 1$, we know that $\operatorname{P}\{L_{01}(h) \le x\} \ge \beta$ holds for any $x \ge 1$, but fails for any $x < 1$, giving us $Q_{\beta}(L_{01}(h)) = 1$ for this range. Taking these facts together, we have

$$Q_{\beta}(L_{01}(h)) = \begin{cases} 1, & \text{if } 1 - \mathcal{E}(h) < \beta \le 1 \\ 0, & \text{if } 0 < \beta \le 1 - \mathcal{E}(h). \end{cases}$$

Now, consider the task of minimizing $Q_{\beta}(L_{01}(h))$ as a function of $h \in \mathcal{H}$ given a fixed level $0 < \beta < 1$. Note that in all situations, we can always say that

$$\mathcal{H}^{*}_{\mathcal{E}} \subseteq \arg\min_{h \in \mathcal{H}} Q_{\beta}(L_{01}(h)). \tag{19}$$

This should be intuitively clear, since achieving a small quantile amounts to achieving a sufficiently small error probability, though the probability need not be minimal. For completeness, let us deal with the corner cases. If $\min_{h \in \mathcal{H}} \mathcal{E}(h) > 1 - \beta$, then regardless of which $h$ we choose, the value is $Q_{\beta}(L_{01}(h)) = 1$, and so every $h \in \mathcal{H}$ is trivially optimal. The same trivial optimality arises if $\max_{h \in \mathcal{H}} \mathcal{E}(h) \le 1 - \beta$, since all $h$ result in $Q_{\beta}(L_{01}(h)) = 0$. When neither of these conditions holds, choosing $h \in \mathcal{H}$ such that the error probability is small enough gives us the desired minimum quantile value. In all cases, the inclusion (19) holds. ∎

*Proof of Proposition 2.5.* For convenience, let us write $C_{\rho}(L; \theta) := \theta + \operatorname{E}[\rho(L - \theta)]$ for any $L \in \mathcal{L}$ and $\theta \in \mathbb{R}$. Since our running assumption is that the minimum (in $\theta$) is achieved on $\mathbb{R}$, we have that $C_{\rho}(L) = \min_{\theta \in \mathbb{R}} C_{\rho}(L; \theta)$. Let $h_1, h_2 \in \mathcal{H}$ be any candidates such that $\mathcal{E}(h_1) \le \mathcal{E}(h_2)$. In addition, let $\theta_1$ and $\theta_2$ respectively minimize $C_{\rho}(L_{01}(h_1); \cdot)$ and $C_{\rho}(L_{01}(h_2); \cdot)$. With these assumptions in place, the following chain of inequalities holds:

$$C_{\rho}(L_{01}(h_1)) = C_{\rho}(L_{01}(h_1); \theta_1) \le C_{\rho}(L_{01}(h_1); \theta_2) = \theta_2 + \rho(-\theta_2) + \mathcal{E}(h_1)\left( \rho(1 - \theta_2) - \rho(-\theta_2) \right) \le \theta_2 + \rho(-\theta_2) + \mathcal{E}(h_2)\left( \rho(1 - \theta_2) - \rho(-\theta_2) \right) = C_{\rho}(L_{01}(h_2)).$$

Note that the second inequality uses the assumed monotonicity of $\rho$, which implies $\rho(1 - \theta_2) \ge \rho(-\theta_2)$. Since the choice of $h_1$ and $h_2$ was arbitrary, we can immediately conclude for any $h_1, h_2 \in \mathcal{H}$ that $\mathcal{E}(h_1) \le \mathcal{E}(h_2) \implies C_{\rho}(L_{01}(h_1)) \le C_{\rho}(L_{01}(h_2))$. Another immediate conclusion is that

$$\mathcal{H}^{*}_{\mathcal{E}} \subseteq \arg\min_{h \in \mathcal{H}} C_{\rho}(L_{01}(h)). \tag{20}$$
If $\rho$ is monotonically increasing (rather than just non-decreasing), then it follows that $\mathcal{E}(h_1) < \mathcal{E}(h_2) \implies C_{\rho}(L_{01}(h_1)) < C_{\rho}(L_{01}(h_2))$. This means that one cannot be sub-optimal in $\mathcal{E}(\cdot)$ while still being optimal in $C_{\rho}(L_{01}(\cdot))$. As such, the two solution sets must coincide:

$$\mathcal{H}^{*}_{\mathcal{E}} = \arg\min_{h \in \mathcal{H}} C_{\rho}(L_{01}(h)). \tag{21}$$

Taking the conditions for (20) and (21) together, we have the desired result. ∎

*Proof of Proposition 3.1.* To get started, we would like to express the OCE criterion in terms of a modified penalty applied to binary classification margins. Let us denote the modified penalty based on $\varphi$ by

$$\tilde{\varphi}(u; \theta) := \theta + \rho(\varphi(u) - \theta) \tag{22}$$

with the corresponding $\theta$-dependent expectation denoted by

$$C_{\varphi}(s; \theta) := \operatorname{E}[\tilde{\varphi}(Y s(X); \theta)]. \tag{23}$$

By our running assumptions, there always exists a minimizer of $\theta \mapsto C_{\varphi}(s; \theta)$ on $\mathbb{R}$, which for convenience we will denote by

$$\theta^{*}(s) \in \arg\min_{\theta \in \mathbb{R}} C_{\varphi}(s; \theta). \tag{24}$$

With this notation in hand, it follows that for each $s \in \mathcal{S}$ we have

$$C_{\varphi}(s; \theta^{*}(s)) = \mathrm{OCE}(\varphi(Y s(X))). \tag{25}$$

Moving forward, we will want to study sequences $(s_1, s_2, \ldots)$ of functions in $\mathcal{S}$ that minimize the OCE criterion in terms of the surrogate loss. More formally, let us assume $(s_1, s_2, \ldots)$ is any sequence satisfying

$$\lim_{n \to \infty} \mathrm{OCE}(\varphi(Y s_n(X))) = \inf_{s \in \mathcal{S}} \mathrm{OCE}(\varphi(Y s(X))). \tag{26}$$

With such a sequence in hand, note that there exists a value $\bar{\theta} > 0$ such that for all integers $n \ge 1$, we have

$$0 \le \theta^{*}(s_n) \le \bar{\theta}. \tag{27}$$

The lower bound in (27) holds because $\varphi(\cdot) \ge 0$, and thus the support of the surrogate losses $\varphi(Y s(X))$ is bounded below by zero.³ We cannot, however, guarantee that the support of $\varphi(Y s(X))$ is bounded above. That said, $(s_1, s_2, \ldots)$ is not a completely arbitrary sequence; from (26) we are assuming this sequence minimizes the OCE criterion. If the sequence $\theta^{*}(s_n)$ were to grow without bound, it would contradict (26).⁴ As such, while the choice of $\bar{\theta} < \infty$ may depend on the sequence, (26) always implies (27).

Next, we would like to link the surrogate losses (under the OCE criterion) to the error probability $\mathcal{E}(\cdot)$. To do this, first note that the excess value in the OCE criterion can be bounded below as

$$\mathrm{OCE}(\varphi(Y s(X))) - \inf_{s' \in \mathcal{S}} \mathrm{OCE}(\varphi(Y s'(X))) = C_{\varphi}(s; \theta^{*}(s)) - \inf_{s' \in \mathcal{S}} \inf_{\theta \in \mathbb{R}} C_{\varphi}(s'; \theta) \ge C_{\varphi}(s; \theta^{*}(s)) - \inf_{s' \in \mathcal{S}} C_{\varphi}(s'; \theta^{*}(s)). \tag{28}$$

Next, let us note that the normalization of $\rho$ (slope 1 and value 0 at the origin) combined with its convexity and monotonicity immediately implies $\rho(u) \ge u$ for all $u \in \mathbb{R}$, and thus for any choice of $\theta \in \mathbb{R}$, we have $\tilde{\varphi}(u; \theta) \ge \theta + (\varphi(u) - \theta) = \varphi(u)$. As such, it follows that

$$\varphi(\cdot) \ge 0 \implies \tilde{\varphi}(\cdot; \theta) \ge 0. \tag{29}$$

This fact is useful because it allows us to leverage structural results of Bartlett et al. (2006). To do so, let us denote the so-called conditional $\tilde{\varphi}$-risk by

$$C_{\beta}(u; \theta) := \beta \tilde{\varphi}(u; \theta) + (1 - \beta)\tilde{\varphi}(-u; \theta) \tag{30}$$

for each $u \in \mathbb{R}$, $\theta \in \mathbb{R}$, and $0 \le \beta \le 1$. Note that we can connect this function to our modified penalty through the equality

$$C_{\varphi}(s; \theta) = \operatorname{E}_{X}\left[ C_{\beta(X)}(s(X); \theta) \right] \tag{31}$$

that holds for any choice of $\theta \in \mathbb{R}$ and $s \in \mathcal{S}$, where we have denoted $\beta(x) := \operatorname{P}\{Y = 1 \mid X = x\}$. Optimal values of this function (with and without constraints) are denoted by

$$H^{-}(\beta; \theta) := \inf\{C_{\beta}(u; \theta) : u(2\beta - 1) \le 0\}, \tag{32}$$

$$H(\beta; \theta) := \inf_{u \in \mathbb{R}} C_{\beta}(u; \theta). \tag{33}$$

³ That this implies $0 \le \theta^{*}(s)$ for any $s \in \mathcal{S}$ is a basic property of OCE criteria; see for example Ben-Tal & Teboulle (2007, Prop. 2.1).
⁴ To see this, just note that the $\theta$ term in the OCE criterion increases (with $\theta$) no slower than the $\rho(-\theta)$ term decreases (again, as $\theta$ grows), due to the slope, convexity, and monotonicity conditions on $\rho$.
The so-called ψ-transform is based on the gap between these two optimal values. Denoting

$$\tilde{\psi}(u; \theta) := H^{-}\!\left(\frac{1+u}{2}; \theta\right) - H\!\left(\frac{1+u}{2}; \theta\right) \tag{34}$$

for all −1 ≤ u ≤ 1, define ψ(·; θ) := ψ̃**(·; θ), namely the Fenchel-Legendre biconjugate of ψ̃. The first key structural results we wish to leverage are as follows:

$$\psi(\cdot\,; \theta) \text{ is continuous on } [0, 1], \qquad \psi(\cdot\,; \theta) \ge 0, \qquad \psi(0; \theta) = 0. \tag{35}$$

The properties in (35) hold for all θ ∈ R.⁵ Furthermore, using (29) and applying the proof of Bartlett et al. (2006, Thm. 1(1)), for any choice of s ∈ S and θ ∈ R, we have

$$\psi\left(E(h_s) - E^{\star}; \theta\right) \le E_X\left[C_{\beta(X)}(s(X); \theta) - H(\beta(X); \theta)\right], \tag{36}$$

where hs(·) := sign(s(·)), E⋆ := min_{h∈H} E(h), and β(x) := P{Y = 1 | X = x} as before. By assuming that S is the set of all measurable functions on X, we have

$$E_X\left[H(\beta(X); \theta)\right] = \inf_{s' \in S} E_X\left[C_{\beta(X)}(s'(X); \theta)\right] \tag{37}$$

for any choice of θ ∈ R.⁶ Combining (36), (37), and (31) with the inequality (28) established earlier, we may conclude that

$$\mathrm{OCE}(\phi(Y s(X))) - \inf_{s' \in S} \mathrm{OCE}(\phi(Y s'(X))) \ge \psi\left(E(h_s) - E^{\star}; \theta^{\star}(s)\right) \tag{38}$$

holds for any choice of s ∈ S. Applying this bound (38) to any sequence (s1, s2, ...) satisfying (26), it follows from the basic properties in (35) that

$$\lim_{n \to \infty} \psi\left(E(h_n) - E^{\star}; \theta^{\star}(s_n)\right) = 0, \tag{39}$$

where we have denoted hn(·) := sign(sn(·)) for readability.

All that remains is to analyze how the sequence (E(h1), E(h2), ...) behaves. In order to establish E(hn) → E⋆, we need to utilize some additional properties of ρ. We have assumed that ρ is differentiable, and can easily confirm that the first two derivatives of φ̃(·; θ) are as follows:

$$\tilde{\phi}'(u; \theta) = \rho'(\phi(u) - \theta)\,\phi'(u), \qquad \tilde{\phi}''(u; \theta) = \rho''(\phi(u) - \theta)\,(\phi'(u))^2 + \rho'(\phi(u) - \theta)\,\phi''(u).$$

Since ρ and φ are both assumed to be convex, and strict monotonicity of ρ implies ρ′(·) > 0, we can immediately conclude that φ̃″(·; θ) ≥ 0, and thus φ̃(·; θ) is convex on R, for any choice of θ ∈ R. Furthermore, since we have assumed φ′(0) < 0, it also follows that φ̃′(0; θ) < 0. This property along with convexity implies that we have

$$\psi(u; \theta) > 0 \text{ for } 0 < u \le 1, \tag{40}$$
$$\psi(u; \theta) = \tilde{\phi}(0; \theta) - H\!\left(\frac{1+u}{2}; \theta\right) \tag{41}$$

for any θ ∈ R.⁷ Using the definitions of φ̃ and H(·; θ) along with (41), we can write

$$\psi(u; \theta) = \rho(\phi(0) - \theta) - \inf_{v \in \mathbb{R}}\left[\frac{1+u}{2}\,\rho(\phi(v) - \theta) + \frac{1-u}{2}\,\rho(\phi(-v) - \theta)\right].$$

Introducing the helper function

$$g(\theta, u; v) := \frac{1+u}{2}\,\rho(\phi(v) - \theta) + \frac{1-u}{2}\,\rho(\phi(-v) - \theta),$$

we clearly have ψ(u; θ) = ρ(φ(0) − θ) − inf_{v∈R} g(θ, u; v). Due to the monotonicity of ρ and the convexity of both φ and ρ, it is easy to directly verify that (θ, v) ↦ g(θ, u; v) is convex on R², for any choice of 0 ≤ u ≤ 1. In addition, partial minimization preserves convexity, so θ ↦ inf_{v∈R} g(θ, u; v) is also convex and continuous.⁸ As such, for any convergent sequence (θ1, θ2, ...) and any choice of 0 ≤ u ≤ 1, we have

$$\lim_{n \to \infty} \psi(u; \theta_n) = \psi\left(u; \lim_{n \to \infty} \theta_n\right). \tag{42}$$

⁵These properties follow from Lemma 2 (parts 6, 7, and 8) of Bartlett et al. (2006) after applying (29).
⁶This fact is also used in the proof of Bartlett et al. (2006, Thm. 1(1)).
⁷The properties (40) and (41) follow by applying Theorem 2 and Lemma 2(9) of Bartlett et al. (2006).
⁸See for example Bertsekas (2015, §B.3.3 and §B.1.3).
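The representation ψ(u; θ) = ρ(φ(0) − θ) − inf_v g(θ, u; v) is straightforward to evaluate numerically. The following sketch does so for the logistic surrogate and the tilted ρ (again, illustrative assumptions rather than choices fixed by the proof), confirming ψ(0; θ) = 0 from (35) and the positivity (40) at a fixed θ:

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi = lambda u: np.log1p(np.exp(-u))                  # logistic surrogate
gamma = 1.5                                           # tilt parameter (assumed)
rho = lambda u: (np.exp(gamma * u) - 1.0) / gamma     # convex, strictly increasing

# Helper g(theta, u; v) and psi(u; theta) = rho(phi(0) - theta) - inf_v g(theta, u; v).
def g(theta, u, v):
    return 0.5 * (1 + u) * rho(phi(v) - theta) + 0.5 * (1 - u) * rho(phi(-v) - theta)

def psi(u, theta):
    # g is convex in v, so a bounded scalar minimization suffices here.
    inner = minimize_scalar(lambda v: g(theta, u, v),
                            bounds=(-20.0, 20.0), method="bounded").fun
    return rho(phi(0.0) - theta) - inner

theta = 0.3  # an arbitrary fixed threshold, for illustration
for u in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(u, psi(u, theta))   # psi(0; theta) ~ 0; psi(u; theta) > 0 for u > 0
```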
With this fact established, suppose that under (26), the sequence E(hn) does not converge to E⋆. In such a case, there exists some ε > 0 such that lim sup_{n→∞} [E(hn) − E⋆] = ε. From the definition of the limit superior, we can always find a convergent subsequence, denoted (s′1, s′2, ...), such that E(h′n) − E⋆ → ε, where h′n(·) := sign(s′n(·)). This implies that for all large enough n, we have E(h′n) − E⋆ ≥ ε/2 > 0, and further that

$$\psi\left(E(h'_n) - E^{\star}; \theta^{\star}(s'_n)\right) \ge \psi\left(\varepsilon/2; \theta^{\star}(s'_n)\right) > 0 \tag{43}$$

for all such n.⁹ Noting that from (27), all elements of the sequence θ⋆(s′n) must be included in a bounded interval (i.e., a compact set), there exists a further sub-sequence of (s′n), which we denote by (s″1, s″2, ...), such that θ⋆(s″n) converges.¹⁰ Utilizing the continuity property (42) and positivity (40), it follows that

$$\lim_{n \to \infty} \psi\left(\varepsilon/2; \theta^{\star}(s''_n)\right) = \psi\left(\varepsilon/2; \lim_{n \to \infty} \theta^{\star}(s''_n)\right) > 0. \tag{44}$$

This, however, would contradict (39), which we have already shown to be implied by any sequence satisfying (26). As such, we can conclude that under (26), we have E(hn) → E⋆.

Finally, consider our statement regarding the inclusion of solution sets. In the trivial case that no minimizer of h ↦ OCE(Lφ(h)) exists, the inclusion holds vacuously; let us consider the non-trivial case where a minimizer exists. Write such a minimizer as

$$s^{\star} \in \operatorname*{arg\,min}_{s \in S} \mathrm{OCE}(\phi(Y s(X))). \tag{45}$$

Writing h⋆(·) := sign(s⋆(·)), if E(h⋆) > E⋆ were to hold, then via positivity (40), we would have a contradiction to the inequality (38). As such, h⋆ ∈ argmin_{h∈H} E(h) must hold, concluding the proof.

⁹The first inequality holds using the fact that monotonicity of u ↦ ψ(u; θ) is implied by the properties of being minimized at zero (see (35)) and convexity on [0, 1] for all θ ∈ R; see Lemma 2(2) of Bartlett et al. (2006). The second inequality follows from (41).
¹⁰See for example Rudin (1976, Thm. 2.37).

Proof of Proposition 3.4. To start, fix a threshold θ = θε := φ(0) + ε using some ε > 0 to be determined later. Next, note that for any h ∈ H, the expected surrogate loss can be written as

$$E[L_\phi(h)] = E\left[\mathrm{I}\{h(X) = Y\}\, L_\phi(h)\right] + E\left[\mathrm{I}\{h(X) \ne Y\}\, L_\phi(h)\right]. \tag{46}$$

Now, taking an arbitrary h⋆ ∈ H_E and the assumption that E(h⋆) = 0, it follows that

$$E\left[\mathrm{I}\{h^{\star}(X) \ne Y\}\, L_\phi(h^{\star})\right] = 0. \tag{47}$$

Combining (46) and (47) with the monotonicity of φ, we have that the expected surrogate loss of h⋆ ∈ H_E can be bounded above by

$$E[L_\phi(h^{\star})] = E\left[\mathrm{I}\{h^{\star}(X) = Y\}\, L_\phi(h^{\star})\right] \le \phi(0). \tag{48}$$

With this in place, consider any h ∉ H_E, namely any h ∈ H such that E(h) > 0. Recall that by definition of H, there is some s ∈ S such that h(x) = sign(s(x)) for all x ∈ X. Again by assumption, this s(·) is measurable.¹¹ Consider a simple re-scaling of s: for some constant b > 0, let us define

$$\tilde{s}(x) := \begin{cases} b, & s(x) \ge 0 \\ -b, & s(x) < 0. \end{cases}$$

Using the measurability of s(·) and the fact that s̃(·) is a simple function (in the usual measure-theoretic sense), we have s̃ ∈ S, and thus defining h̃(x) := sign(s̃(x)), we have h̃ ∈ H as well. In fact, since signs are not changed, we have h(x) = h̃(x) for all x ∈ X, and thus E(h̃) = E(h) > 0. On the other hand, re-scaling impacts the margin penalties incurred via the surrogate loss, and the resulting expected value is

$$E[L_\phi(\tilde{h})] = E\left[\mathrm{I}\{\tilde{h}(X) = Y\}\, L_\phi(\tilde{h})\right] + E\left[\mathrm{I}\{\tilde{h}(X) \ne Y\}\, L_\phi(\tilde{h})\right] = E(h)\,\phi(-b) + (1 - E(h))\,\phi(b). \tag{49}$$

Using the running assumption that φ(·) is both unbounded and non-negative, from (49) it is clear that we can take b > 0 large enough that E[Lφ(h̃)] > φ(0). Returning to the threshold θε, if we set ε = E[Lφ(h̃)] − φ(0), we have found a value of θε such that

$$E[L_\phi(h^{\star})] \le \phi(0) < E[L_\phi(\tilde{h})] = \theta_\varepsilon,$$

noting that the first inequality comes from (48). Note that this implies

$$E[L_\phi(\tilde{h})] - \theta_\varepsilon = 0 > E[L_\phi(h^{\star})] - \theta_\varepsilon,$$

and using the local strict convexity of ρ̃ near the origin, we have

$$\tilde{\rho}\left(E[L_\phi(\tilde{h})] - \theta_\varepsilon\right) = 0 < \tilde{\rho}\left(E[L_\phi(h^{\star})] - \theta_\varepsilon\right).$$

In other words, we have that h⋆ ∈ H_E is sub-optimal in terms of Cinn(Lφ(·); θε). Note that the preceding argument (in particular the choice of b) does not depend on the particular choice of h⋆ ∈ H_E, and thus we can conclude that all elements of H_E share this sub-optimality, which is the desired result for C = Cinn.

¹¹We have a typical notion of Borel measurability in mind. It is sufficient, for example, if for any a ∈ R, we have that the event {s(X) ≤ a} is measurable, i.e., an element of the sigma-field composing the underlying measure space. See for example Ash & Doléans-Dade (2000, Ch. 1) for more background.
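To see the role played by the unboundedness of φ in the rescaling step, the short sketch below evaluates the right-hand side of (49) for increasing b, using the logistic surrogate and a hypothetical error probability of 0.1 (both our own illustrative choices):

```python
import numpy as np

# Rescaling construction from the proof of Prop. 3.4 (illustrative values only).
phi = lambda u: np.log1p(np.exp(-u))   # logistic: non-negative, unbounded as u -> -inf
err = 0.1                              # error probability E(h) > 0 of a non-optimal h

for b in [1.0, 5.0, 10.0, 20.0]:
    exp_loss = err * phi(-b) + (1.0 - err) * phi(b)   # E[L_phi(h_tilde)], per (49)
    print(b, round(exp_loss, 4), bool(exp_loss > phi(0.0)))
# Since phi(-b) -> infinity, the expectation eventually exceeds phi(0) = log(2),
# which by (48) upper-bounds E[L_phi(h_star)] for any zero-error h_star.
```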
As for the remaining case of C = Cout, the preceding argument can fail, because a large b used in defining s̃ means that the distribution of Lφ(h̃) becomes arbitrarily widely spread out, and thus can be strongly penalized by Cout. As a simple alternative, we consider re-scaling and flipping the sign of any E(·)-optimal classifier. Take any h⋆ ∈ H_E, with h⋆(x) = sign(s⋆(x)) for some s⋆ ∈ S. Define a modified version of the scoring function as follows:

$$\tilde{s}^{\star}(x) := \begin{cases} -s^{\star}(x) / \lvert s^{\star}(x) \rvert, & s^{\star}(x) \ne 0 \\ 0, & s^{\star}(x) = 0. \end{cases}$$

As mentioned earlier, measurability is preserved, so s̃⋆ ∈ S, and thus setting h̃⋆(x) := sign(s̃⋆(x)), we also have h̃⋆ ∈ H. Since this classifier always disagrees with h⋆, we have 1 = E(h̃⋆) > E(h⋆) = 0. With probability 1, we have

$$0 \le L_\phi(h^{\star}) \le \phi(0) < L_\phi(\tilde{h}^{\star}) = \phi(-1).$$

As such, by taking θ = φ(−1), clearly we have that

$$E\left[\tilde{\rho}\left(L_\phi(\tilde{h}^{\star}) - \theta\right)\right] = 0 < E\left[\tilde{\rho}\left(L_\phi(h^{\star}) - \theta\right)\right],$$

where the inequality holds because Lφ(h⋆) − φ(−1) < 0 holds with probability 1, plus the strict convexity of ρ̃ around the origin. This shows how, while h⋆ is optimal in E(·), it is sub-optimal in terms of Cout(Lφ(·); θ). Again, this approach holds for any choice of h⋆ ∈ H_E, and thus the desired result holds for C = Cout as well.

B.2. Divergence of CVaR from error probability

As mentioned in Remark 2.6, CVaR satisfies the non-decreasing condition of Proposition 2.5, but ρ(u) = (u)₊/(1 − β) is not strictly monotonic on R (i.e., it is not increasing). With ρ that is non-decreasing, we know the inclusion (20) holds, telling us that all error probability minimizers are also optimal in terms of CVaR. Are there any CVaR-optimal solutions that are not optimal in terms of error probability? The answer is that it depends on both H and β. To consider this a bit more precisely, first note that quantiles (recall §2.3.2) of the zero-one loss are also binary valued, namely we have

$$Q_\beta(L_{01}(h)) = \begin{cases} 1, & \text{if } 1 > \beta > P\{L_{01}(h) = 0\} \\ 0, & \text{if } 0 < \beta \le P\{L_{01}(h) = 0\}. \end{cases}$$

It is straightforward to check that the following implications hold:

$$\min_{h \in H} E(h) > 1 - \beta \implies Q_\beta(L_{01}(h)) = 1 \text{ for all } h \in H, \tag{50}$$
$$\max_{h \in H} E(h) \le 1 - \beta \implies Q_\beta(L_{01}(h)) = 0 \text{ for all } h \in H. \tag{51}$$

When the model H is poor enough (or β is large/strict enough) that the condition in (50) holds, the CVaR risk is constant, i.e., Cρ(L01(h)) = 1 for all h ∈ H, and thus trivially all h ∈ H are optimal in CVaR, though they certainly need not be optimal in E(·). When the model is good enough (or β is small/loose enough) that the condition in (51) holds, we have for all h ∈ H that

$$\rho(1 - Q_\beta(L_{01}(h))) - \rho(-Q_\beta(L_{01}(h))) = \rho(1) > 0,$$

and thus the strict monotonicity argument used earlier applies, telling us that CVaR-optimal solutions and error probability minimizers are identical (i.e., (21) holds).
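The two regimes in (50) and (51) are easy to reproduce numerically via the Rockafellar-Uryasev representation of CVaR. In the sketch below (with a hypothetical level β = 0.9), the criterion tracks p/(1 − β) while the error probability p stays below 1 − β, and then saturates at the constant value 1:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# CVaR of a Bernoulli(p) zero-one loss, via
# CVaR = min_theta { theta + E[(L - theta)_+] / (1 - beta) }.
def cvar_bernoulli(p, beta):
    obj = lambda th: th + ((1 - p) * max(-th, 0.0)
                           + p * max(1.0 - th, 0.0)) / (1.0 - beta)
    return minimize_scalar(obj, bounds=(0.0, 1.0), method="bounded").fun

beta = 0.9   # CVaR level (hypothetical)
for p in [0.02, 0.05, 0.08, 0.15, 0.5]:
    # p <= 1 - beta: CVaR = p / (1 - beta), so ordering by p is preserved;
    # p > 1 - beta: CVaR = 1 regardless of p, the constant regime of (50).
    print(p, round(cvar_bernoulli(p, beta), 4))
```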
B.3. Collapse under variantile

With Remark 2.9 as context, note that for 0 ≤ θ ≤ 1, it is straightforward to confirm that under the zero-one loss L = L01, for any h ∈ H we have

$$2\,E\left[\lvert 1_\theta(L_{01}(h)) - \tau \rvert\,(L_{01}(h) - \theta)^2\right] = 2\left[(1-\tau)(1-E(h))\,\theta^2 + \tau E(h)\,(1-\theta)^2\right].$$

Checking first-order conditions, this function can be minimized (in θ) by setting

$$\theta^{\star} = \frac{\tau E(h)}{(1-\tau)(1-E(h)) + \tau E(h)}.$$

Plugging this solution in to obtain an explicit form of Cτ(L01(h)), we have

$$C_\tau(L_{01}(h)) = 2\left[(1-\tau)(1-E(h))\left(\frac{\tau E(h)}{(1-\tau)(1-E(h)) + \tau E(h)}\right)^2 + \tau E(h)\left(\frac{(1-\tau)(1-E(h))}{(1-\tau)(1-E(h)) + \tau E(h)}\right)^2\right].$$

With this form in mind, simply replace E(h) with a variable p ranging over [0, 1], and introduce the helper function

$$g_\tau(p) := 2\left[(1-\tau)(1-p)\left(\frac{\tau p}{(1-\tau)(1-p) + \tau p}\right)^2 + \tau p\left(\frac{(1-\tau)(1-p)}{(1-\tau)(1-p) + \tau p}\right)^2\right]. \tag{52}$$

One can readily check that this function gτ(·) is strictly concave on the unit interval for any choice of 0 < τ < 1; see Figure 3 for a numerical example. As a result, we may conclude that depending on the nature of H, the minimizer set of Cτ(L01(·)) is either the set of minimizers or the set of maximizers of E(·), a result analogous to the point raised in §2.3.1.

Figure 3. Graphs of gτ(·) in (52) over the unit interval, for τ ∈ {0.15, 0.50, 0.85}.

C. Additional Figures

As supplementary material for the main text, here we include two additional figures. Examples of valid ρ for OCE criteria and ρ̃ for loss-restraining criteria are shown in Figure 4. In §4 we only gave metric trajectories for CIFAR-100 and SVHN; analogous results for CIFAR-10 and Fashion MNIST are given in Figure 5.

Figure 4. Examples of valid choices of ρ (left panel: Linear, Tilted, CVaR, Quadratic) and ρ̃ for Cinn and Cout (right panel: Quadratic, Huber-like, Absolute, log-cosh) for use in defining OCE criteria (13) and loss-restraining criteria (14) and (15), respectively.

Figure 5. Trajectories of metrics (loss, accuracy, and model norm) over epochs 0 to 250 for CIFAR-10 and Fashion MNIST, comparing the CVaR, DRO, Flood, Soft AD, and Tilted criteria; see the caption of Figure 2 for details.
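The strict concavity of gτ(·) claimed above (and depicted in Figure 3) can be verified numerically in a few lines; the grid and the values of τ below mirror those in the figure:

```python
import numpy as np

# Helper g_tau(p) from (52); strict concavity on (0, 1) is what drives the
# variantile criterion to collapse onto minimizers or maximizers of E( ).
def g_tau(p, tau):
    d = (1 - tau) * (1 - p) + tau * p
    return 2 * ((1 - tau) * (1 - p) * (tau * p / d) ** 2
                + tau * p * ((1 - tau) * (1 - p) / d) ** 2)

ps = np.linspace(0.001, 0.999, 999)
for tau in [0.15, 0.50, 0.85]:
    vals = g_tau(ps, tau)
    print(tau, bool(np.all(np.diff(vals, 2) < 0)))   # negative second differences
```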