# Supervised Learning with General Risk Functionals

Liu Leqi¹, Audrey Huang², Zachary C. Lipton¹, Kamyar Azizzadenesheli³

¹Machine Learning Department, Carnegie Mellon University. ²Department of Computer Science, University of Illinois Urbana-Champaign. ³Department of Computer Science, Purdue University. Correspondence to: Liu Leqi, Audrey Huang.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Standard uniform convergence results bound the generalization gap of the expected loss over a hypothesis class. The emergence of risk-sensitive learning requires generalization guarantees for functionals of the loss distribution beyond the expectation. While prior works specialize in uniform convergence of particular functionals, our work provides uniform convergence for a general class of Hölder risk functionals, for which closeness in the Cumulative Distribution Function (CDF) entails closeness in risk. We establish the first uniform convergence results for estimating the CDF of the loss distribution, which yield uniform convergence guarantees that hold simultaneously both over a class of Hölder risk functionals and over a hypothesis class. Thus licensed to perform empirical risk minimization, we develop practical gradient-based methods for minimizing distortion risks (a widely studied subset of Hölder risks that subsumes the spectral risks, including the mean, conditional value at risk, cumulative prospect theory risks, and others) and provide convergence guarantees. In experiments, we demonstrate the efficacy of our learning procedure, both in settings where uniform convergence results hold and in high-dimensional settings with deep networks.

1. Introduction

To date, the vast majority of supervised, unsupervised, and reinforcement learning research has focused on objectives expressible as expectations (over some dataset or distribution) of an underlying loss (or reward) function. This focus is understandable. The expected loss is mathematically convenient, a reasonable default, and a special case of nearly every proposed family of risks. To be sure, this focus has paid off: we now possess a rich body of theory and methods for evaluating, optimizing, and providing theoretical guarantees on the expected loss. However, real-world concerns such as risk aversion, equitable allocation of benefits and harms, or alignment with human preferences often demand that we address other functionals of the loss distribution. For example, in finance, the expectation of returns must be weighed against their variance to determine an ideal portfolio allocation, as codified, e.g., in the mean-variance objective (Björk et al., 2014). Focusing on supervised learning, consider the common scenario in which a population contains a minority (constituting fraction α of the population) but where group membership was not recorded in the available data. If the pattern relating the features to the label were different for different demographics, a naively trained model might adversely harm members of the minority group. Absent further information, one sensible strategy could be to optimize the worst-case performance over all subsets (of size up to α). This translates to the familiar Conditional Value at Risk (CVaR) objective (Rockafellar et al., 2000).
In addition, even in settings where a model is evaluated in terms of the expected loss at test time, the training objective may be chosen to be a different functional, for reasons including distribution shifts (Duchi & Namkoong, 2018), noisy labels (Lee et al., 2020), imbalanced datasets (Li et al., 2020), etc. Risk-sensitive learning research addresses the problem of learning models under many families of (risk) functionals, including (among others) distortion risks (Wirch & Hardy, 2001), coherent risks (Artzner et al., 1999), spectral risks (Acerbi, 2002), and cumulative prospect theory risks (Prashanth et al., 2016). Subsuming these risks under a common framework addressing bounded losses/rewards, Huang et al. (2021) recently introduced Lipschitz risk functionals, for which differences in the risk are bounded by (sup norm) differences in the Cumulative Distribution Function (CDF) of losses. Thus, because a single CDF estimate can be used to estimate all Lipschitz risks, sup norm concentration of the CDF estimate entails corresponding (simultaneous) concentration of all Lipschitz risks calculated on that CDF estimate. However, this concentration result applies only to a single hypothesis. In contrast, uniform convergence results in learning theory have largely concentrated on the expected loss (Vapnik, 1999; 2013; Bartlett & Mendelson, 2002). While uniform convergence results are known for several specific risk classes, including the spectral/rank-weighted risks (Khim et al., 2020) and optimized certainty equivalent risks (Lee et al., 2020), no results to date provide uniform convergence guarantees that hold simultaneously over both a hypothesis class and a broad class of risks. Tackling this problem, we present, to our knowledge, the first uniform convergence guarantee on estimation of the loss CDF. Our bounds rely on appropriate complexity measures of the hypothesis class. In addition to the traditional Rademacher complexity and VC dimension, we propose a new notion of permutation complexity that is especially suited to CDF estimation. For general risk estimation, we adopt the broader class (subsuming the Lipschitz risks) of Hölder risk functionals, for which closeness in distribution entails closeness in risk. Combined with our uniform convergence guarantees for CDF estimation, this property allows us to establish uniform convergence guarantees for risk estimation of supervised learning models, which hold simultaneously over all hypotheses in the model class and over all Hölder risks. These results license us to optimize general risks, assuring that for appropriate model classes and given sufficient data, the empirical risk minimizer will indeed generalize, and that whichever objective is optimized, all Hölder risk estimates will be close to their true values. Generalization aside, optimizing complex risks is non-trivial. To tackle this problem, we propose a new algorithm for optimizing distortion risks, a subset of Hölder risks that subsumes the spectral risks, including the expectation, CVaR, cumulative prospect theory risks, and others. Our approach extends traditional gradient-based empirical risk minimization methods to handle distortion risks (Denneberg, 1990; Wang, 1996). In particular, we calculate the empirical distortion risk by re-weighting losses based on CDF values, and we establish convergence guarantees for the proposed optimization method.
Finally, we experimentally validate our algorithm, both in settings where uniform convergence results hold and in high-dimensional settings with deep networks. In summary, we contribute the following:

1. The first uniform convergence result for CDF estimation, together with corresponding new complexity measures suited to the task (Section 4).
2. Uniform convergence for risk estimation that holds simultaneously for all Hölder risks, yielding learning guarantees for empirical risk minimization for the broad class of distortion risks (Section 5).
3. A gradient-based method for minimizing distortion risks that re-weights examples dynamically based on the empirical CDF of losses, and corresponding convergence guarantees (Section 6).
4. Experiments confirming the practical usefulness of our learning algorithm (Section 7).

2. Related Literature

Risk functionals have long been studied in diverse contexts (Sharpe, 1966; Artzner et al., 1999; Rockafellar et al., 2000; Krokhmal, 2007; Shapiro et al., 2014; Acerbi, 2002; Prashanth et al., 2016; Jie et al., 2018). CVaR, value-at-risk, and mean-variance (Cassel et al.; Sani et al., 2013; Vakili & Zhao, 2015; Zimin et al., 2014) rank among the most widely studied risks. Prashanth et al. (2016) introduce the cumulative prospect theory risks, which have been studied in bandit (Gopalan et al., 2017) and supervised learning settings (Leqi et al., 2019). Many previous works have tackled the evaluation (Huang et al., 2021; Chandak et al., 2021) and optimization (Torossian et al., 2019; Munos, 2014) of risk functionals. Recent work on risk-sensitive supervised learning has established the uniform convergence of a single risk functional when the losses incurred by the models are bounded (Khim et al., 2020; Lee et al., 2020), or the excess risk of a particular learning procedure in cases where the loss could be unbounded (Holland & Haress, 2021). Collectively, these works have addressed the class of spectral risks (L-risks or rank-weighted risks) that includes the expected value, CVaR and cumulative prospect theory risks (Khim et al., 2020; Holland & Haress, 2021), as well as the class of optimized certainty equivalent risks that includes the expected value, CVaR and entropic risks (Ben-Tal & Teboulle, 1986; Lee et al., 2020). The aforementioned risk-sensitive learning results are considerably narrower than ours: the analyses apply only to smaller families of risks, and the guarantees hold only for a single risk functional (not simultaneously over the family). By contrast, we establish uniform convergence results that hold simultaneously over both a broader class of risks and over an entire model class (constrained by an appropriate complexity measure). The key to our approach is to estimate the CDF of losses and control its sup norm error uniformly over a hypothesis class. CDF estimation is a central topic in learning theory (Devroye et al., 2013). Strong approximation results provide concentration bounds on the Kolmogorov-Smirnov distance (sup norm) between the true and estimated CDF (Massart, 1990). As our uniform convergence results are over a hypothesis class with a possibly infinite number of hypotheses, we control the complexity of the hypothesis class using data-dependent complexity notions (e.g., Rademacher complexity) and data-independent complexity notions (e.g., VC dimension) (Alexander, 1984; Vapnik, 2006; Gänssler & Stute, 1979).
3. Preliminaries

We use $\mathcal{X}$ to denote the space of covariates, $\mathcal{Y}$ the space of labels, and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Let $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ denote a loss function and $\mathcal{F}$ a hypothesis class, where for any $f \in \mathcal{F}$, $f : \mathcal{X} \to \mathcal{Y}$. The set $\mathcal{F}_\ell$, with elements $\ell_f$, denotes the class of functions that are compositions of the loss function $\ell$ and a hypothesis $f \in \mathcal{F}$, i.e., for all $z = (x, y) \in \mathcal{Z}$, $\ell_f(z) = \ell(f(x), y)$. Furthermore, we use $\ell_f(Z)$ to denote the random variable of the loss incurred by $f \in \mathcal{F}$ under data $Z = (X, Y)$. For any $n \in \mathbb{N}$, $[n] := \{1, \ldots, n\}$. We use $\mathcal{U}$ to denote the space of real-valued random variables that admit CDFs. For any $U \in \mathcal{U}$, its CDF is denoted by $F_U$. A risk functional $\rho : \mathcal{U} \to \mathbb{R}$ is a mapping from a space of real-valued random variables to the reals. A risk functional is called law-invariant (or version-independent) if for any pair of random variables $U, U' \in \mathcal{U}$ with the same law ($F_U = F_{U'}$), we have $\rho(U) = \rho(U')$ (Kusuoka, 2001). We work with law-invariant risk functionals in this paper, and with some abuse of notation, we refer to $\rho(F_U)$ and $\rho(U)$ interchangeably.

4. Uniform Convergence for CDF Estimation

We begin with an important building block for risk estimation: CDF estimation with uniform convergence guarantees. Given a loss function $\ell$ and a data set of $n$ labeled data points $\{Z_i\}_{i=1}^n$ where $Z_i = (X_i, Y_i)$, we are interested in estimating the CDF of $\ell_f(Z)$ for all $f \in \mathcal{F}$. We use the unbiased empirical CDF estimator:

$$\hat{F}(r; f) := \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{\ell_f(Z_i) \le r\}, \quad (1)$$

where $\mathbb{E}[\hat{F}(r; f)] = P(\ell_f(Z) \le r) = F(r; f)$. To establish the uniform convergence of the estimator, our central goal is to analyze the following quantity:

$$e_n(\mathcal{F}, \ell) = \sup_{f \in \mathcal{F}} \sup_{r \in \mathbb{R}} \left| \hat{F}(r; f) - F(r; f) \right|. \quad (2)$$

In Section 4.1, we exploit the special structure of CDF estimation and propose a new notion of permutation complexity that captures the complexity of the hypothesis class used for CDF estimation. In Section 4.2, we apply the more classical approach for analyzing uniform convergence, which does not exploit any special structure of CDF estimation. Each approach offers a unique perspective and contributes to our understanding of CDF estimation. We highlight that the uniform convergence guarantees we provide hold for any loss distribution, whether or not the loss is binary or bounded. We first introduce notation key to our analysis. The Rademacher complexity in our setting (for a given loss function $\ell$) is given as follows:

$$\mathfrak{R}(n, \mathcal{F}) = \mathbb{E}_{P,R}\left[\sup_{f \in \mathcal{F}} \sup_{r \in \mathbb{R}} \frac{1}{n}\sum_{i=1}^n \xi_i \mathbb{1}\{\ell_f(Z_i) \le r\}\right] = \mathbb{E}_{P,R}\left[\sup_{f \in \mathcal{F}} \sup_{g \in \mathcal{G}(1)} \frac{1}{n}\sum_{i=1}^n \xi_i\, g(\ell_f(Z_i))\right], \quad (3)$$

with $R$ being a Rademacher measure on a set of Rademacher random variables $\{\xi_i\}_{i=1}^n$ and $\mathcal{G}(1) := \{\mathbb{1}\{\cdot \le r\} : r \in \mathbb{R}\}$ the set of indicator functions parameterized by a real-valued threshold $r$. Using McDiarmid's inequality and symmetrization, we obtain the following classical result that bounds $e_n(\mathcal{F}, \ell)$ in terms of the Rademacher complexity. All proofs in this section can be found in Appendix B.

Theorem 4.1. Given a hypothesis class $\mathcal{F}$, any loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, and $n$ samples $\{Z_i\}_{i=1}^n$, we have that with probability at least $1 - \delta$,

$$e_n(\mathcal{F}, \ell) \le 2\mathfrak{R}(n, \mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}.$$

In general, the Rademacher complexity $\mathfrak{R}(n, \mathcal{F})$ is hard to obtain. Researchers have come up with different ways to control it for various hypothesis classes, e.g., hypothesis classes with finite VC dimension (Wainwright, 2019). In the following, we discuss how we work with $\mathfrak{R}(n, \mathcal{F})$.
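Before turning to these complexity bounds, here is a minimal sketch (ours, not the authors' released code) of the estimator (1) and the error (2): it computes the empirical loss CDF for each hypothesis in a small finite class and the resulting sup-norm deviation. The synthetic linear-regression data, the hypothesis grid, the threshold grid standing in for the supremum over $r$, and the large held-out sample standing in for the population CDF are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cdf(losses, grid):
    """\hat F(r; f) = (1/n) sum_i 1{loss_i <= r}, evaluated on a grid of thresholds r."""
    return (losses[None, :] <= grid[:, None]).mean(axis=1)

# Illustrative setup: squared loss of 1-d linear predictors f_w(x) = w * x.
n, n_ref = 200, 100_000
X, Y = rng.normal(size=n), rng.normal(size=n)
X_ref, Y_ref = rng.normal(size=n_ref), rng.normal(size=n_ref)
hypotheses = np.linspace(-1.0, 1.0, 5)       # a finite hypothesis class F

grid = np.linspace(0.0, 10.0, 1_000)         # thresholds r (grid approximation of sup over r)
errors = []
for w in hypotheses:
    losses = (w * X - Y) ** 2                # \ell_f(Z_i) on the n training samples
    losses_ref = (w * X_ref - Y_ref) ** 2    # proxy for the population loss law
    ks = np.max(np.abs(empirical_cdf(losses, grid) - empirical_cdf(losses_ref, grid)))
    errors.append(ks)

# e_n(F, l): the sup over hypotheses of the sup-norm CDF deviation, cf. (2).
print(f"e_n(F, l) ~= {max(errors):.3f}")
```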
4.1. Permutation Complexity

We first notice that $\mathfrak{R}(n, \mathcal{F})$ depends jointly on both the hypothesis class $\mathcal{F}$ and $\mathcal{G}(1)$. A direct approach that follows from classical statistical learning theory is to work with the function class that combines $\mathcal{F}$ and $\mathcal{G}(1)$, for which we provide more details in Section 4.2. In this section, we propose a new way of thinking about $\mathfrak{R}(n, \mathcal{F})$. By exploiting the special structure of CDF estimation (the structure of $\mathcal{G}(1)$), we uncover that $\mathfrak{R}(n, \mathcal{F})$ can be controlled by the complexity of the hypothesis class $\mathcal{F}$ (or $\mathcal{F}_\ell$ with elements $\ell_f$) alone. To do so, we first introduce the notion of permutation complexity. This complexity measure is data-dependent and enables us to work with $\mathfrak{R}(n, \mathcal{F})$ by disentangling the complexity of $\mathcal{F}$ (or $\mathcal{F}_\ell$) from that of $\mathcal{G}(1)$. For a measurable space $\mathcal{V}$, let $\zeta : \mathcal{V} \to \mathbb{R}$ denote a measurable function and $\{v_i\}_{i=1}^n$ a set of $n$ points in $\mathcal{V}$. Satisfying the conditions of the selection and maximum theorems (Guide, 2006, Chapter 17), a permutation function $\pi : [n] \to [n]$ in the space $\Pi(n)$ of all permutations of $[n]$ exists that permutes the indices of $\{v_i\}_{i=1}^n$ such that $\zeta(v_{\pi(1)}) \le \zeta(v_{\pi(2)}) \le \ldots \le \zeta(v_{\pi(n)})$. We note that the permutation function can depend on the specific data points $\{v_i\}_{i=1}^n$ and on the function $\zeta$ of interest. In the following definitions, we consider a function class $\mathcal{J}$ of functions $\zeta : \mathcal{V} \to \mathbb{R}$ and a probability measure $\mu$ on $(\mathcal{V}, \sigma(\mathcal{V}))$, where $\sigma(\mathcal{V})$ denotes the $\sigma$-algebra generated by $\mathcal{V}$.

Definition 4.1 (Permutation Complexity). The instance-dependent permutation complexity of $\mathcal{J}$ at $n$ data points $\{v_i\}_{i=1}^n$, denoted $N_\Pi(\mathcal{J}, \{v_i\}_{i=1}^n)$, is the minimum number of permutation functions $\pi \in \Pi(n)$ needed to sort the elements of $\{\zeta(v_i)\}_{i=1}^n$ for every $\zeta \in \mathcal{J}$. The permutation complexity of $\mathcal{J}$ at $n$ (random) data points $\{V_i\}_{i=1}^n$ with measure $\mu$ is $N_\Pi(n, \mathcal{J}, \mu) := \mathbb{E}_\mu[N_\Pi(\mathcal{J}, \{V_i\}_{i=1}^n)]$.

To obtain a better understanding of permutation complexity, we provide the permutation complexity of monotone real-valued functions.

Lemma 4.1. When $\mathcal{V}$ is the space of reals and $\mathcal{J}$ is a set of real-valued non-decreasing functions on $\mathcal{V}$, the permutation complexity satisfies $N_\Pi(n, \mathcal{J}, \mu) = 1$.

Proof. For a set of real-valued points $\{v_i\}_{i=1}^n$, we can construct a permutation function $\pi$ such that $v_{\pi(1)} \le v_{\pi(2)} \le \ldots \le v_{\pi(n)}$. Since all functions in $\mathcal{J}$ are non-decreasing, for any $\zeta \in \mathcal{J}$ we have $\zeta(v_{\pi(1)}) \le \zeta(v_{\pi(2)}) \le \ldots \le \zeta(v_{\pi(n)})$. Thus, $N_\Pi(\mathcal{J}, \{v_i\}_{i=1}^n) = 1$ and $N_\Pi(n, \mathcal{J}, \mu) = 1$ for any measure $\mu$.

Remark 4.1. This result implies that the permutation complexity of the threshold functions $\mathcal{G}(1)$ is one.

Mapping the definition to our setting, the function class $\mathcal{J}$ of interest is $\mathcal{F}_\ell$ with elements $\ell_f$, the data points are $V_i = Z_i = (X_i, Y_i)$, and the measure is $\mu = P$ (the probability measure for $Z$). An immediate observation is that when $|\mathcal{F}|$ is finite, we need at most $|\mathcal{F}|$ permutation functions to sort $\{\ell_f(z_i)\}_{i=1}^n$ (one permutation function for each $f \in \mathcal{F}$), i.e., $N_\Pi(\mathcal{F}_\ell, \{z_i\}_{i=1}^n) \le |\mathcal{F}|$. For the special case of binary classification, where $\mathcal{Y} = \{0, 1\}$ and the loss function $\ell$ is the 0/1 loss, a coarse upper bound on the permutation complexity when $\mathcal{F}$ has finite VC dimension $\nu(\mathcal{F})$ is $N_\Pi(n, \mathcal{F}_\ell, P) \le (n + 1)^{\nu(\mathcal{F})}$. This is due to Sauer's Lemma: there are at most $(n + 1)^{\nu(\mathcal{F})}$ ways of labeling the data, which suggests that we need at most $(n + 1)^{\nu(\mathcal{F})}$ permutation functions. However, as one may have noticed, the number of permutation functions needed may be (much) smaller than this number. For example, consider a binary classification setting where we have 3 data points and $\mathcal{F}$ is large enough that all $2^3$ possible loss sequences (000, 001, ...) can be incurred. In such a case, at most 4 permutation functions are needed, since loss sequences that are non-decreasing (or non-increasing) can share the same permutation function; e.g., for the loss sequences 111, 011, 001, 000, we can use the same permutation function $\pi(i) = i$, $i \in [3]$. We note that the permutation complexity is defined not just for binary-valued function classes. Precisely characterizing the permutation complexity for different combinations of function classes, data distributions, and loss functions is of future interest; the brute-force check below makes the three-point example concrete.
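As a sanity check on Definition 4.1, the following sketch (ours, purely illustrative) computes the instance-dependent permutation complexity of the three-point example by exhaustive search: it finds the smallest set of permutations such that every observed loss sequence is sorted by at least one of them.

```python
from itertools import combinations, permutations

def is_sorted_under(seq, perm):
    """True if the loss sequence is non-decreasing when read in the order given by perm."""
    reordered = [seq[p] for p in perm]
    return all(a <= b for a, b in zip(reordered, reordered[1:]))

def permutation_complexity(loss_seqs, n):
    """Minimum number of permutations of [n] such that every observed loss
    sequence is sorted by at least one of them (Definition 4.1)."""
    perms = list(permutations(range(n)))
    for k in range(1, len(perms) + 1):
        for subset in combinations(perms, k):
            if all(any(is_sorted_under(s, p) for p in subset) for s in loss_seqs):
                return k

# All 2^3 binary loss sequences on 3 data points (the example in the text).
seqs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print(permutation_complexity(seqs, 3))  # prints 3: a small constant, far below (n+1)^nu
```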
Theorem 4.2. For any hypothesis class $\mathcal{F}$ and loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, we have that

$$\mathfrak{R}(n, \mathcal{F}) \le \sqrt{\frac{2 \log(4 N_\Pi(n, \mathcal{F}_\ell, P))}{n}}.$$

Theorem 4.2 indicates that, although $\mathfrak{R}(n, \mathcal{F})$ involves a supremum over both the function class $\mathcal{F}$ and the infinite set $\mathcal{G}(1)$, it can be controlled by the complexity of $\mathcal{F}_\ell$ alone. When the permutation complexity $N_\Pi(n, \mathcal{F}_\ell, P)$ is polynomial in the number of samples $n$, we obtain that $\mathfrak{R}(n, \mathcal{F}) = O(\sqrt{\log(n)/n})$. The immediate consequence of Theorem 4.2 when the hypothesis class $\mathcal{F}$ is finite is provided below.

Corollary 4.1. For a finite hypothesis class $\mathcal{F}$,

$$\mathfrak{R}(n, \mathcal{F}) \le \sqrt{\frac{2 \log(4 |\mathcal{F}|)}{n}}.$$

Corollary 4.1 shows that the generalization bound for CDF estimation follows the same rate as the classical generalization bound for the expected loss over finite function classes, i.e., $O(\sqrt{\log(|\mathcal{F}|)/n})$.

4.2. Classical Approach

In this section, we present a more classical approach for analyzing uniform convergence that does not exploit the specific structure of CDF estimation. As we have noted before, $\mathfrak{R}(n, \mathcal{F})$ depends on both $\mathcal{F}$ and $\mathcal{G}(1)$. In this approach, we directly work with the function class that combines $\mathcal{F}$ and $\mathcal{G}(1)$: for a given loss function $\ell$, we define

$$\mathcal{H} := \{h : \mathcal{Z} \to \{0, 1\} : h(z; r) = \mathbb{1}\{\ell_f(z) \le r\},\ f \in \mathcal{F},\ r \in \mathbb{R}\}.$$

We note that even when $\mathcal{F}$ is finite, $\mathcal{H}$ is an infinite set. The Rademacher complexity (3) can be re-written as

$$\mathfrak{R}(n, \mathcal{F}) = \mathbb{E}_{P,R}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \xi_i\, h(Z_i)\right].$$

Lemma 4.2 (Wainwright, 2019, Lemma 4.14). Let $|\mathcal{H}(\{z_i\}_{i=1}^n)|$ denote the maximum cardinality of the set $\mathcal{H}(\{z_i\}_{i=1}^n) = \{(h(z_1), \ldots, h(z_n)) : h \in \mathcal{H}\}$, where $n \in \mathbb{N}$ is fixed and $\{z_i\}_{i=1}^n$ can be any data collection with $z_i \in \mathcal{Z}$. Then, we have

$$\mathfrak{R}(n, \mathcal{F}) \le \sqrt{\frac{2 \log(|\mathcal{H}(\{z_i\}_{i=1}^n)|)}{n}}.$$

Since $\mathcal{H}$ consists of binary functions, for any data collection $\{z_i\}_{i=1}^n$, the set $\mathcal{H}(\{z_i\}_{i=1}^n)$ is finite. When $\mathcal{F}$ is finite, we obtain that $|\mathcal{H}(\{z_i\}_{i=1}^n)| \le (n + 1)|\mathcal{F}|$. This is true since for a given data collection $\{z_i\}_{i=1}^n$ and hypothesis $f \in \mathcal{F}$, after sorting $\{\ell_f(z_i)\}_{i=1}^n$, there are at most $(n + 1)$ ways of labeling them using the indicator functions $\mathbb{1}\{\cdot \le r\}$ for $r \in \mathbb{R}$. Thus, if we directly apply Lemma 4.2, we obtain that the Rademacher complexity $\mathfrak{R}(n, \mathcal{F})$ is on the order of $O(\sqrt{(\log|\mathcal{F}| + \log n)/n})$, which has an extra $\log(n)$ term in the numerator compared to our bound in Corollary 4.1. In general, when $\mathcal{H}$ has finite VC dimension $\nu(\mathcal{H})$, we obtain that $\mathfrak{R}(n, \mathcal{F}) \le O(\sqrt{\nu(\mathcal{H}) \log(n+1)/n})$. However, this result is far from sharp. Using more advanced techniques, e.g., chaining and Dudley's entropy integral (Wainwright, 2019), one can remove the extra $\log(n)$ factor from the numerator and obtain that $\mathfrak{R}(n, \mathcal{F}) \le O(\sqrt{\nu(\mathcal{H})/n})$ (for more details, see Wainwright (2019, Example 5.24)). As a consequence, when $|\mathcal{F}| < \infty$, we have $\nu(\mathcal{H}) = O(\log(|\mathcal{F}|))$ and thus obtain that $\mathfrak{R}(n, \mathcal{F}) \le O(\sqrt{\log(|\mathcal{F}|)/n})$, which is of the same rate as our bound in Corollary 4.1. A Monte Carlo estimate below illustrates these rates for a small finite class.
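As a rough empirical companion to these bounds, the following sketch (ours; the exponential losses and hypothesis count are illustrative) estimates (3) directly for a finite class via Monte Carlo, using the same sorting reduction that drives the proof of Theorem 4.2 (Appendix B.3): for each draw of $\xi$, the supremum over thresholds $r$ equals the best prefix sum of $\xi$ in sorted-loss order (or 0, taking $r$ below the smallest loss).

```python
import numpy as np

rng = np.random.default_rng(1)

def rademacher_complexity_mc(loss_matrix, n_draws=2000):
    """Monte Carlo estimate of (3): E[ sup_f sup_r (1/n) sum_i xi_i 1{l_f(z_i) <= r} ].
    loss_matrix has one row of losses per hypothesis; since the indicators are
    piecewise constant in r, only prefix sums in sorted-loss order matter."""
    n_hyp, n = loss_matrix.shape
    order = np.argsort(loss_matrix, axis=1)   # pi_f for each hypothesis f
    total = 0.0
    for _ in range(n_draws):
        xi = rng.choice([-1.0, 1.0], size=n)
        prefix = np.cumsum(xi[order], axis=1)  # prefix sums of xi, re-ordered per f
        total += max(0.0, prefix.max()) / n
    return total / n_draws

losses = rng.exponential(size=(8, 500))        # |F| = 8 hypotheses, n = 500 (illustrative)
print(rademacher_complexity_mc(losses))        # roughly of order sqrt(log|F| / n)
```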
5. Uniform Convergence for Hölder Risk Estimation

Using our results in Section 4, we show uniform convergence for a broad class of risk functionals: the Hölder risks. As illustrated in Section 6, uniform convergence for a single risk functional provides grounding for learning models by minimizing the empirical risk. In addition to uniform convergence for a single risk, we also provide uniform convergence results that hold for a collection of risks simultaneously. The second result is important for the following reasons. Although models are trained to optimize a single risk objective, evaluating their performance under multiple risks can give a holistic assessment of their behavior, a task we call risk assessment. For example, models minimizing CVaR at different α levels may have different tradeoffs with their expected loss, and monitoring the progress of both objectives throughout the training process can inform the choice of the best model. In addition, when given a set of models obtained under different learning mechanisms, one may want to compare them in terms of different risks. To this end, using our results on uniform convergence for CDF estimation (Section 4), we demonstrate how a collection of models may be assessed under many risks simultaneously, with estimation errors of the same order as the CDF estimation error.

5.1. Hölder Risk Functionals

We begin by introducing a new class of risks, the Hölder risk functionals, which includes many popularly studied risks and generalizes the notion of Lipschitz risk functionals (Huang et al., 2021) and of functionals that are Hölder continuous in Wasserstein distance (Bhat & LA, 2019).

Definition 5.1. Let $d$ denote a quasi-metric¹ on the space of CDFs. A risk functional $\rho$ is $L(\rho, p, d)$-Hölder on a space of real-valued random variables $\mathcal{U}$ if there exist constants $p > 0$ and $L(\rho, p, d) > 0$ such that for all $U, U' \in \mathcal{U}$ with CDFs $F_U$ and $F_{U'}$ respectively, the following holds:

$$|\rho(F_U) - \rho(F_{U'})| \le L(\rho, p, d)\, d(F_U, F_{U'})^p.$$

The class of Hölder risk functionals subsumes many other risk functional classes. In particular, Lipschitz risk functionals (Huang et al., 2021) are $L(\cdot, 1, \|\cdot\|_\infty)$-Hölder on bounded random variables. As a direct result, distortion risk functionals with Lipschitz distortion functions (Denneberg, 1990; Wang, 1996; Wang et al., 1997; Balbás et al., 2009; Wirch & Hardy, 1999; 2001), cumulative prospect theory risks (Prashanth et al., 2016; Leqi et al., 2019), variance, and linear combinations of the aforementioned functionals are all Hölder on bounded random variables. In addition, the optimized certainty equivalent risks (Lee et al., 2020) and spectral risks (Khim et al., 2020; Holland & Haress, 2021) recently studied in the risk-sensitive supervised learning literature are Lipschitz (hence Hölder) on the space of bounded random variables. We provide proofs and further details in Appendix A. To be more specific, we present a subset of Hölder risk functionals, the distortion risks with Lipschitz distortion functions, which includes many well-studied risks such as the expected value, CVaR and cumulative prospect theory risks. When the loss is non-negative, the distortion risk of $\ell_f(Z)$ is defined to be:

$$\rho(F(\cdot; f)) = \int_0^\infty g(1 - F(r; f))\, dr, \quad (4)$$

where the distortion function $g : [0, 1] \to [0, 1]$ is non-decreasing with $g(0) = 0$ and $g(1) = 1$. In the case of the expected value, the distortion function $g$ is 1-Lipschitz. For CVaR at level α, the distortion function $g$ is $(1/\alpha)$-Lipschitz. For more details, we refer the reader to Huang et al. (2021). In the following, we show uniform convergence for estimating Hölder risks using our proposed estimator (Section 5.2) and develop optimization procedures to minimize distortion risks (Section 6).

¹Quasi-metrics are defined in Appendix A.
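To make (4) concrete, here is a minimal sketch (ours, illustrative only) of a plug-in distortion risk computed from a sample of non-negative losses. It uses the finite-sum rewriting of (4) for the empirical CDF that reappears in Section 6, and instantiates $g$ for the mean ($g(x) = x$) and for CVaR ($g(x) = \min(x/\alpha, 1)$); the exponential loss sample is an assumption for the demo.

```python
import numpy as np

def distortion_risk(losses, g):
    """Plug-in distortion risk (4) for non-negative losses:
    rho(\hat F) = sum_i g((n-i+1)/n) * (l_(i) - l_(i-1)), with l_(0) := 0,
    where l_(1) <= ... <= l_(n) are the sorted losses."""
    sorted_losses = np.sort(losses)
    n = len(sorted_losses)
    survival = g((n - np.arange(n)) / n)           # g(1 - (i-1)/n) for i = 1..n
    increments = np.diff(sorted_losses, prepend=0.0)
    return float(np.sum(survival * increments))

losses = np.random.default_rng(0).exponential(size=10_000)

mean_g = lambda x: x                               # g(x) = x recovers the expectation
alpha = 0.05
cvar_g = lambda x: np.minimum(x / alpha, 1.0)      # (1/alpha)-Lipschitz distortion for CVaR_alpha

print(distortion_risk(losses, mean_g))             # ~1.0, the mean of Exp(1)
print(distortion_risk(losses, cvar_g))             # ~1 + log(20) ~= 4.0, CVaR_.05 of Exp(1)
```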
5.2. Uniform Convergence for Risk Estimation

For a given hypothesis $f \in \mathcal{F}$ and loss function $\ell$, we estimate the risk $\rho(F(\cdot; f))$ by plugging the CDF estimator $\hat{F}(\cdot; f)$ into the functional of interest: $\rho(\hat{F}(\cdot; f))$. Many existing risk estimators in the supervised learning literature, including the traditional empirical risk and the estimators used for spectral risks (Khim et al., 2020) and optimized certainty equivalent risks (Lee et al., 2020), can be viewed as examples of the above estimator. Leveraging the uniform convergence results for the CDF estimator, we present a uniform convergence result for the proposed risk estimator. The uniform convergence holds both over the hypothesis class $\mathcal{F}$ and over a set of Hölder risk functionals. As the Hölder class contains a large set of popularly studied risks, our result demonstrates that models can be assessed under many risks without loss of statistical power. This is formalized in Theorem 5.1 below.

Theorem 5.1. For a hypothesis class $\mathcal{F}$, a bounded loss function $\ell$, and $\delta \in (0, 1]$, if $P(e_n(\mathcal{F}, \ell) \le \epsilon) \ge 1 - \delta$, then with probability $1 - \delta$, for all $\rho \in \mathcal{T}$, we have

$$\sup_{f \in \mathcal{F}} |\rho(F(\cdot; f)) - \rho(\hat{F}(\cdot; f))| \le L(\rho, 1, \|\cdot\|_\infty)\, \epsilon,$$

where $\mathcal{T}$ is the set of $L(\cdot, 1, \|\cdot\|_\infty)$-Hölder risk functionals on the space of bounded random variables.

In cases where the CDF estimation error $\epsilon$ is of order $O(1/\sqrt{n})$, we can estimate the set of Hölder risks $\mathcal{T}$ for all hypotheses in $\mathcal{F}$ at rate $O(1/\sqrt{n})$. Because our result is uniform over both the hypothesis class $\mathcal{F}$ and the risk functional class $\mathcal{T}$, it generalizes existing uniform convergence results that are uniform over $\mathcal{F}$ but hold for a single risk functional (Khim et al., 2020; Lee et al., 2020). In Appendix C, we provide similar uniform convergence results where the set of risk functionals is Hölder smooth in Wasserstein distance. As a direct consequence of Theorem 5.1, uniform convergence of a single Hölder risk, e.g., a distortion risk (4), is given below.

Corollary 5.1. For a hypothesis class $\mathcal{F}$, a bounded loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, D]$, and $\delta \in (0, 1]$, if $P(e_n(\mathcal{F}, \ell) \le \epsilon) \ge 1 - \delta$, then with probability $1 - \delta$, we have

$$\sup_{f \in \mathcal{F}} |\rho(F(\cdot; f)) - \rho(\hat{F}(\cdot; f))| \le L\epsilon,$$

where $\rho$ is a distortion risk with an $(L/D)$-Lipschitz distortion function.

Remark 5.1. As an example, consider a binary classification setting where the loss function $\ell$ is the 0/1 loss and the hypothesis class $\mathcal{F}$ is finite. Using Corollary 5.1, we obtain that the generalization error for the expected value is $O(\sqrt{\log(|\mathcal{F}|)/n})$ and the generalization error for CVaR is $O(\sqrt{\log(|\mathcal{F}|)/(\alpha^2 n)})$, which are of the same rates (with better constants) as the ones in Lee et al. (2020).

6. Empirical Risk Minimization

For a single risk functional, one may want to learn models that optimize it. Our uniform convergence results for risk estimation license us to learn models that minimize the population risk through Empirical Risk Minimization (ERM). We denote the population and empirical risk minimizers as

$$f^\star \in \arg\min_{f \in \mathcal{F}} \rho(F(\cdot; f)), \qquad \hat{f} \in \arg\min_{f \in \mathcal{F}} \rho(\hat{F}(\cdot; f)). \quad (5)$$

The excess risk of the empirical risk minimizer $\hat{f}$ can be bounded by

$$\rho(F(\cdot; \hat{f})) - \rho(F(\cdot; f^\star)) = \left[\rho(F(\cdot; \hat{f})) - \rho(\hat{F}(\cdot; \hat{f}))\right] + \left[\rho(\hat{F}(\cdot; \hat{f})) - \rho(\hat{F}(\cdot; f^\star))\right] + \left[\rho(\hat{F}(\cdot; f^\star)) - \rho(F(\cdot; f^\star))\right] \le 2 \sup_{f \in \mathcal{F}} \left|\rho(\hat{F}(\cdot; f)) - \rho(F(\cdot; f))\right|,$$

where the middle term is non-positive by the definition of $\hat{f}$. We study ERM when the loss function is non-negative and the risk functional of interest is a distortion risk with a Lipschitz distortion function (4).
Such distortion risk functionals include many well-studied risks, including the expected value, CVaR, cumulative prospect theory risks, and other spectral risks (Bäuerle & Glauner, 2021). Using Corollary 5.1, we obtain that when $e_n(\mathcal{F}, \ell) = O(1/\sqrt{n})$ and the loss $\ell$ is bounded, the excess risk of $\hat{f}$ is $O(1/\sqrt{n})$. We consider settings where the hypothesis class $\mathcal{F}$ is a class of parameterized functions, e.g., linear models and neural networks, and use $\Theta \subseteq \mathbb{R}^d$ to denote the set of parameters. For a hypothesis $f \in \mathcal{F}$ parameterized by $\theta \in \Theta$, we denote $F(\cdot; f)$ and $\ell_f(z_i)$ by $F_\theta$ and $\ell_\theta(i)$, respectively. Similarly, we use $\theta^\star$ and $\hat{\theta}$ to refer to $f^\star$ and $\hat{f}$. As in Section 4.1, we use $\pi_\theta : [n] \to [n]$ to denote the permutation function such that $\ell_\theta(\pi_\theta(i))$ is the $i$-th smallest loss under the current model $\theta$ and the fixed dataset $\{z_i\}_{i=1}^n$. Using the CDF estimator $\hat{F}_\theta$ (1), the empirical distortion risk $\rho(\hat{F}_\theta)$ can be re-written as

$$\rho(\hat{F}_\theta) = \sum_{i=1}^n g\left(1 - \frac{i-1}{n}\right)\left(\ell_\theta(\pi_\theta(i)) - \ell_\theta(\pi_\theta(i-1))\right),$$

where for all $\theta \in \Theta$, we set $\ell_\theta(\pi_\theta(0)) := 0$ since the losses are non-negative. To employ first-order methods for minimizing the empirical distortion risk, it is natural to first identify when $\rho(\hat{F}_\theta)$ is differentiable.

Lemma 6.1. If $\{\ell_\theta(z_i)\}_{i=1}^n$ are Lipschitz continuous in $\theta \in \Theta$, then for all $i \in [n]$, $\ell_\theta(\pi_\theta(i))$, i.e., the $i$-th smallest loss, is Lipschitz continuous in $\theta$, and $\rho(\hat{F}_\theta)$ is differentiable in $\theta$ almost everywhere.

When $\rho(\hat{F}_\theta)$ (and $\ell_\theta(\pi_\theta(i))$) is differentiable, the gradient $\nabla_\theta \rho(\hat{F}_\theta)$ can be written as

$$\nabla_\theta \rho(\hat{F}_\theta) = \sum_{i=1}^n \left[g\left(\frac{n-i+1}{n}\right) - g\left(\frac{n-i}{n}\right)\right] \nabla_\theta\, \ell_\theta(\pi_\theta(i)). \quad (6)$$

When $g(x) = x$, the distortion risk is the same as the expected loss, and we recover the gradient of the traditional empirical risk. To avoid the non-differentiable points, we add a small noise term to the gradient descent steps. By doing so, we ensure that the descent steps end up at differentiable points almost surely. Choose an initial point $\theta_1 \in \Theta$. At iteration $t$, the parameter is updated as follows:

$$\theta_{t+1} \leftarrow \theta_t - \eta\left(\nabla_\theta \rho(\hat{F}_{\theta_t}) + w_t\right), \quad (7)$$

where $\eta$ is the learning rate, $\nabla_\theta \rho(\hat{F}_{\theta_t})$ is given in (6), and $w_t$ is sampled from a $d$-dimensional Gaussian with mean 0 and covariance $\frac{1}{d} I_d$. In general, even when the loss function $\ell$ is convex in the parameter $\theta$, the empirical distortion risk $\rho(\hat{F}_\theta)$ may not be convex in $\theta$. In Corollary 6.1, we show local convergence of the iterates $\theta_t$ obtained by following (7).

Corollary 6.1. If $\{\ell_\theta(z_i)\}_{i=1}^n$ are Lipschitz continuous and $\rho(\hat{F}_\theta)$ is $\beta$-smooth in $\theta$, then the following holds almost surely when the learning rate in (7) is $\eta = \frac{1}{\beta\sqrt{T}}$:

$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\left[\left\|\nabla_\theta \rho(\hat{F}_{\theta_t})\right\|^2\right] \le \frac{2\beta}{\sqrt{T}}\left(\rho(\hat{F}_{\theta_1}) - \rho(\hat{F}_{\theta^\star}) + \frac{1}{2\beta}\right).$$

Corollary 6.1 demonstrates that the average gradient magnitude over $T$ iterations shrinks as $T$ goes to infinity, suggesting that performing gradient descent by following (7) converges to an approximate stationary point.
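Before turning to experiments, here is a minimal PyTorch sketch (ours, not the paper's released code) of one update of (7): per-sample losses are sorted, re-weighted by the distortion increments from (6), and the resulting scalar is backpropagated before the Gaussian perturbation $w_t$ is added. The toy linear model, batch shapes, and CVaR level are illustrative assumptions.

```python
import torch

def distortion_weights(n, g):
    """w_i = g((n-i+1)/n) - g((n-i)/n) for the i-th smallest loss, cf. (6)."""
    i = torch.arange(1, n + 1, dtype=torch.float32)
    return g((n - i + 1) / n) - g((n - i) / n)

def distortion_risk_step(model, loss_fn, x, y, g, lr):
    """One noisy gradient step (7) on the empirical distortion risk."""
    losses = loss_fn(model(x), y)                  # per-sample losses, shape (n,)
    sorted_losses, _ = torch.sort(losses)          # l_theta(pi_theta(i)), ascending
    w = distortion_weights(len(losses), g)
    risk = torch.sum(w * sorted_losses)            # empirical distortion risk rho(\hat F_theta)
    model.zero_grad()
    risk.backward()
    with torch.no_grad():
        d = sum(p.numel() for p in model.parameters())
        for p in model.parameters():
            noise = torch.randn_like(p) / d ** 0.5  # w_t ~ N(0, I/d), avoids kinks a.s.
            p -= lr * (p.grad + noise)
    return risk.item()

# Illustrative usage: CVaR_{0.05} on a small linear classifier with a synthetic batch.
alpha = 0.05
g_cvar = lambda s: torch.clamp(s / alpha, max=1.0)
model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
x, y = torch.randn(128, 10), torch.randint(0, 2, (128,))
print(distortion_risk_step(model, loss_fn, x, y, g_cvar, lr=5e-3))
```

With $g(x) = x$ the weights collapse to $1/n$ each, recovering an ordinary SGD step on the mean loss, which matches the remark after (6).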
7. Experiments

In our experiments, we demonstrate the efficacy of the proposed estimator for risk estimation and of the proposed learning procedure for obtaining risk-sensitive models. In Section 7.1, we work with a risk assessment setting where we simultaneously inspect a finite set of models in terms of multiple risks. In Section 7.2, we show the performance of empirical risk minimization under various distortion risks. After showing that classifiers learned under different risk objectives behave differently in a toy example, we learn risk-sensitive models for CIFAR-10.

7.1. Risk Assessment on ImageNet Models

We perform risk assessments on pretrained PyTorch models for ImageNet classification. In particular, we choose models with similar accuracy (both reported on the official PyTorch website and confirmed by us) on the validation set (50,000 images) for the ImageNet classification challenge (Russakovsky et al., 2015). The models are VGG-11 (Simonyan & Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), ShuffleNet (Ma et al., 2018), Inception (Szegedy et al., 2016) and ResNet-18 (He et al., 2016), and the accuracy of these models evaluated on the validation set is around 69% (Table 1).

| | VGG-11 | GoogLeNet | ShuffleNet | Inception | ResNet-18 |
|---|---|---|---|---|---|
| Accuracy | 69.022% | 69.772% | 69.356% | 69.542% | 69.756% |
| $\mathbb{E}[\ell_f]$ | 1.261 | 1.283 | 1.360 | 1.829 | 1.247 |
| $\mathrm{CVaR}_{.05}(\ell_f)$ | 1.327 | 1.350 | 1.431 | 1.925 | 1.313 |
| $\mathbb{E}[\ell_f] + 0.5\,\mathrm{Var}(\ell_f)$ | 5.215 | 4.376 | 6.718 | 14.416 | 5.353 |
| $\mathrm{HRM}_{.3,.4}(\ell_f)$ | 1.374 | 1.336 | 1.542 | 2.214 | 1.382 |
| $\mathrm{HRM}_{.2,.8}(\ell_f)$ | 1.233 | 1.239 | 1.344 | 1.845 | 1.225 |

Table 1. Risks for different ImageNet classification models evaluated on the validation set. $\ell_f(Z)$ is the cross-entropy loss for each model $f$; for simplicity, we omit the argument $Z$ in the table. $\mathrm{CVaR}_\alpha$ is the expected value of the top $100\alpha$ percent of losses. $\mathrm{HRM}_{a,b}$ is the cumulative prospect theory risk defined in Leqi et al. (2019). All results are rounded to 3 digits.

By assessing the risks of models with similar accuracy, we highlight how models with similar performance under traditional metrics (e.g., accuracy) can have very different risk profiles. For example, though Inception has accuracy similar to the other models, its loss variance is much higher, which may be detrimental in settings where highly variable performance is undesirable. We also evaluated the CVaR of these models under different values of α (Figure 3 in Appendix E). Our theoretical results guarantee that all of these evaluations hold simultaneously across the risk functionals and models of interest, with error of order $O(\sqrt{\log|\mathcal{F}|/n})$ ($|\mathcal{F}| = 5$ in this experiment). In addition to showcasing the power of our theoretical results, this example demonstrates how model assessments under multiple risk notions provide a better understanding of model behaviors; the sketch below illustrates the workflow.
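The assessment workflow can be sketched as follows (ours, illustrative): a single vector of per-example losses per model is collected once, and several risks from Table 1 are computed from that same sample; by Theorem 5.1, their estimation errors are controlled simultaneously. The HRM risks are omitted here since their definition (Leqi et al., 2019) is not reproduced in this paper, and the gamma-distributed loss samples are stand-ins for real model losses.

```python
import numpy as np

def cvar(losses, alpha):
    """Expected value of the top 100*alpha percent of losses (plug-in estimate)."""
    q = np.quantile(losses, 1 - alpha)
    return losses[losses >= q].mean()

def assess(losses):
    """Several Holder risks computed from the same sample of per-example losses."""
    return {
        "mean": losses.mean(),
        "CVaR_0.05": cvar(losses, 0.05),
        "mean + 0.5 * var": losses.mean() + 0.5 * losses.var(),
    }

# Illustrative: per-example cross-entropy losses for two hypothetical models.
rng = np.random.default_rng(0)
for name, losses in {"model_a": rng.gamma(2.0, 0.6, size=50_000),
                     "model_b": rng.gamma(1.2, 1.0, size=50_000)}.items():
    print(name, {k: round(float(v), 3) for k, v in assess(losses).items()})
```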
7.2. Empirical Distortion Risk Minimization

To illustrate the differences among models learned under different risk objectives, we first present a toy example comparing models learned under the expected loss objective and the CVaR objective, respectively. We then show the efficacy of our proposed optimization procedure by training deep neural networks on CIFAR-10. In both cases, the models are learned by following (7). For more details on these experiments, we refer the reader to Appendix E.

[Figure 1: panels (a) and (c) show prediction contour plots of models trained under the expected value and CVaR.05, respectively; panels (b) and (d) show the corresponding loss histograms.]

Figure 1. Prediction (predicted probability of a covariate being labeled as 1) contours and loss histograms of two models learned under the expected loss and the CVaR.05 objective, respectively. The blue pluses and orange dots represent two classes. The loss distribution for the expected-loss model has extremely high values for a small subset of the covariates.

[Figure 2: panels (a) training objective values, (b) testing objective values, (c) training risk evaluations, and (d) testing risk evaluations over 150 training epochs, for the objectives expected value, CVaR0.05, CVaR0.7 and HRM.3,.4.]

Figure 2. Performance of VGG-16 models trained under expected loss, CVaR.05, CVaR.7 and HRM.3,.4. In Figures 2a and 2b, each model is trained and evaluated on the same objective. In Figures 2c and 2d, we train only one model, under the expected loss, but report all four objectives for that model. All results are averaged over 5 runs.

Toy Example. In the toy example, we work with a binary classification task where the covariates are 2-dimensional. In Figure 1, the blue pluses and orange dots represent the two classes. We learned logistic regression models to minimize the expected loss and CVaR.05 (the expected value of the top 5% of losses) by minimizing their empirical risks. The loss distributions, along with the prediction contours of the two classifiers, showcase the difference between the two models. In particular, the model learned under the expected loss suffers high loss on a small subset of the covariates, while the model learned under CVaR.05 has all losses concentrated around a small value. As indicated by the (uniform) grey color in the contour plot, the predictions (predicted probability of a covariate being labeled as 1) of the CVaR.05 model are around 0.5. In contrast, the predictions of the expected-loss model spread across a wide range between 0 and 1.

CIFAR-10. We trained VGG-16 models on CIFAR-10 by minimizing the empirical risks for the expected loss, CVaR.05, CVaR.7 and HRM.3,.4 (Leqi et al., 2019) using the gradient descent step presented in (7). The models are trained over 150 epochs, and the learning rate is chosen to be 0.005. As shown in Figures 2a and 2b, the objective values generally decrease over the epochs during training and testing. In addition, we observe that minimizing the empirical risk for the expected loss does not necessarily minimize other risks, e.g., CVaR.05 (Figure 2c). These results demonstrate the efficacy of our proposed optimization procedure for minimizing distortion risks.

8. Discussion

We have presented a principled framework, including analytic tools and algorithms for risk-sensitive learning and assessment, that: (1) obtains the empirical CDF; (2) estimates the risks of interest by plugging in the empirical CDF; and (3) minimizes the empirical risk (for risk-sensitive learning). Our theoretical results on the uniform convergence of the proposed risk estimators hold simultaneously over a hypothesis class (constrained by an appropriate complexity measure) and over Hölder risks. The key building block for these results is the uniform convergence of the CDF estimator. There are multiple future directions for our work. First, we hope to characterize the permutation complexity more precisely (under various hypothesis classes). Second, our gradient descent procedure (7) requires sorting all losses. An important next step would be to allow minibatches (sorting only a small subset of losses) when minimizing empirical distortion risks. Third, as shown in Figure 1, models learned under different risk objectives behave distinctly. Characterizing these model behaviors theoretically and empirically, and understanding the trade-offs among these objectives, is crucial for building future models.

Acknowledgements

LL is generously supported by an Open Philanthropy AI Fellowship.
References

Acerbi, C. Spectral measures of risk: A coherent representation of subjective risk aversion. Journal of Banking & Finance, 26(7):1505–1518, 2002.

Alexander, K. S. Probability inequalities for empirical processes and a law of the iterated logarithm. The Annals of Probability, pp. 1041–1067, 1984.

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

Balbás, A., Garrido, J., and Mayoral, S. Properties of distortion risk measures. Methodology and Computing in Applied Probability, 11(3):385–399, 2009.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Bäuerle, N. and Glauner, A. Minimizing spectral risk measures applied to Markov decision processes. Mathematical Methods of Operations Research, pp. 1–35, 2021.

Ben-Tal, A. and Teboulle, M. Expected utility, penalty functions, and duality in stochastic nonlinear programming. Management Science, 32(11):1445–1466, 1986.

Bhat, S. P. and LA, P. Concentration of risk measures: A Wasserstein distance approach. Advances in Neural Information Processing Systems, 32:11762–11771, 2019.

Björk, T., Murgoci, A., and Zhou, X. Y. Mean-variance portfolio optimization with state-dependent risk aversion. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 24(1):1–24, 2014.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Cassel, A., Mannor, S., and Zeevi, A. A general framework for bandit problems beyond cumulative objectives.

Chandak, Y., Shankar, S., and Thomas, P. S. High-confidence off-policy (or counterfactual) variance estimation, 2021.

Denneberg, D. Distorted probabilities and insurance premiums. Methods of Operations Research, 63(3):3–5, 1990.

Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. Annals of Statistics, 2018.

Gänssler, P. and Stute, W. Empirical processes: a survey of results for independent and identically distributed random variables. The Annals of Probability, pp. 193–243, 1979.

Gopalan, A., Prashanth, L., Fu, M., and Marcus, S. Weighted bandits or: How bandits learn distorted values that are not expected. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Guide, A. H. Infinite Dimensional Analysis. Springer, 2006.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Holland, M. J. and Haress, E. M. Spectral risk-based learning using unbounded losses. arXiv preprint arXiv:2105.04816, 2021.

Huang, A., Leqi, L., Lipton, Z. C., and Azizzadenesheli, K. Off-policy risk assessment in contextual bandits. arXiv preprint arXiv:2104.08977, 2021.

Jie, C., Prashanth, L., Fu, M., Marcus, S., and Szepesvári, C. Stochastic optimization in a cumulative prospect theory framework. IEEE Transactions on Automatic Control, 63(9):2867–2882, 2018.

Khim, J., Leqi, L., Prasad, A., and Ravikumar, P. Uniform convergence of rank-weighted learning.
In International Conference on Machine Learning, pp. 5254–5263. PMLR, 2020.

Krokhmal, P. A. Higher moment coherent risk measures. 2007.

Kusuoka, S. On law invariant coherent risk measures. In Advances in Mathematical Economics, pp. 83–95. Springer, 2001.

Lee, J., Park, S., and Shin, J. Learning bounds for risk-sensitive learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Leqi, L., Prasad, A., and Ravikumar, P. On human-aligned risk minimization. In Advances in Neural Information Processing Systems, 2019.

Li, T., Beirami, A., Sanjabi, M., and Smith, V. Tilted empirical risk minimization. In International Conference on Learning Representations, 2020.

Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131, 2018.

Massart, P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pp. 1269–1283, 1990.

Massart, P. Some applications of concentration inequalities to statistics. In Annales de la Faculté des Sciences de Toulouse: Mathématiques, volume 9, pp. 245–303, 2000.

Munos, R. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. 2014.

Oliveira, L. P. R. and Yang, S. S. Topics in statistical mathematics: Lecture 2. https://w3.impa.br/~rimfo/estatistica_a16/notas_alunos/shangjie.pdf. Accessed: 2021-9-15.

Prashanth, L., Jie, C., Fu, M., Marcus, S., and Szepesvári, C. Cumulative prospect theory meets reinforcement learning: Prediction and control. In International Conference on Machine Learning, pp. 1406–1415. PMLR, 2016.

Rockafellar, R. T. and Wets, R. J.-B. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

Rockafellar, R. T., Uryasev, S., et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

Sani, A., Lazaric, A., and Munos, R. Risk-aversion in multi-armed bandits. arXiv preprint arXiv:1301.1936, 2013.

Shapiro, A., Dentcheva, D., and Ruszczyński, A. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.

Sharpe, W. F. Mutual fund performance. The Journal of Business, 39(1):119–138, 1966.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Torossian, L., Garivier, A., and Picheny, V. X-armed bandits: Optimizing quantiles, CVaR and other risks. In Asian Conference on Machine Learning, pp. 252–267. PMLR, 2019.

Vakili, S. and Zhao, Q. Mean-variance and value at risk in multi-armed bandit problems.
In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1330–1335. IEEE, 2015.

Vallender, S. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4):784–786, 1974.

Vapnik, V. Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, 2006.

Vapnik, V. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

Vapnik, V. N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Wang, S. Premium calculation by transforming the layer premium density. ASTIN Bulletin: The Journal of the IAA, 26(1):71–92, 1996.

Wang, S. S., Young, V. R., and Panjer, H. H. Axiomatic characterization of insurance prices. Insurance: Mathematics and Economics, 21(2):173–183, 1997.

Wirch, J. L. and Hardy, M. R. A synthesis of risk measures for capital adequacy. Insurance: Mathematics and Economics, 25(3):337–347, 1999.

Wirch, J. L. and Hardy, M. R. Distortion risk measures: Coherence and stochastic dominance. In International Congress on Insurance: Mathematics and Economics, pp. 15–17, 2001.

Zimin, A., Ibsen-Jensen, R., and Chatterjee, K. Generalized risk-aversion in stochastic multi-armed bandits. arXiv preprint arXiv:1405.0833, 2014.

A. Hölder Risk Functionals

In the definition of Hölder risk functionals, we require $d$ to be a quasi-metric, whose definition we provide here.

Definition A.1. A function $d : L^\infty(\mathbb{R}, \mathcal{B}(\mathbb{R})) \times L^\infty(\mathbb{R}, \mathcal{B}(\mathbb{R})) \to [0, +\infty)$ is a quasi-metric if the following two conditions hold: for all $F_U, F_{U'} \in L^\infty(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, $d(F_U, F_{U'}) = 0$ if and only if $F_U = F_{U'}$; and for all $F_U, F_{U'}, F_{Z'} \in L^\infty(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, $d(F_U, F_{Z'}) \le d(F_U, F_{U'}) + d(F_{U'}, F_{Z'})$.

If a quasi-metric is symmetric, i.e., for all $F_U, F_{U'}$, $d(F_U, F_{U'}) = d(F_{U'}, F_U)$, it is also a metric. The set of quasi-metrics contains symmetric quasi-metrics, e.g., the sup norm $\|\cdot\|_\infty$ and the Wasserstein distance, along with non-symmetric quasi-metrics, e.g., the Kullback-Leibler divergence. We now discuss why optimized certainty equivalent (OCE) risks (e.g., mean-variance, entropic risk, CVaR) and spectral risks with bounded spectrum (e.g., CVaR, certain CPT-inspired risks) are Lipschitz on bounded random variables. OCE risks, first introduced by Ben-Tal & Teboulle (1986), are defined as

$$\rho_{\mathrm{oce}}(F_U) := \inf_{\lambda \in \mathbb{R}} \{\lambda + \mathbb{E}[\phi(U - \lambda)]\},$$

where $\phi : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is a non-decreasing, closed and convex function with $\phi(0) = 0$ and $1 \in \partial\phi(0)$. To complement the risk-averse OCEs, a risk-seeking version (the inverted OCE) has been proposed:

$$\bar\rho_{\mathrm{oce}}(F_U) := \sup_{\lambda \in \mathbb{R}} \{\lambda - \mathbb{E}[\phi(\lambda - U)]\}.$$

Proposition A.1. If $\phi$ is continuously differentiable, then the OCE risks $\rho_{\mathrm{oce}}$ and inverted OCE risks $\bar\rho_{\mathrm{oce}}$ are Lipschitz on the space of bounded random variables with support $[0, D]$:

$$|\rho_{\mathrm{oce}}(F_U) - \rho_{\mathrm{oce}}(F_{U'})| \le \max_{x \in [0,D]}\big(\phi(D - x) - \phi(-x)\big)\, \|F_U - F_{U'}\|_\infty,$$
$$|\bar\rho_{\mathrm{oce}}(F_U) - \bar\rho_{\mathrm{oce}}(F_{U'})| \le \max_{x \in [0,D]}\big(\phi(x) - \phi(x - D)\big)\, \|F_U - F_{U'}\|_\infty.$$

Remark A.1. Similar to Huang et al. (2021, Lemma 4.1), Proposition A.1 shows that the expected value and $\mathrm{CVaR}_\alpha$ are $D$- and $\frac{D}{\alpha}$-Lipschitz on random variables with support $[0, D]$, respectively. In addition, this result also provides Lipschitzness of the entropic risks (since the corresponding $\phi$ is continuously differentiable (Lee et al., 2020, Table 1)) and of other OCE and inverted OCE risks that do not belong to the distortion risk functionals.
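As a quick numerical check before the proof, the following sketch (ours) evaluates an OCE risk by direct minimization over λ, using the standard fact (Rockafellar et al., 2000) that $\mathrm{CVaR}_\alpha$ is the OCE with $\phi(t) = \max(t, 0)/\alpha$; the grid search over λ and the exponential loss sample are illustrative shortcuts, not part of the paper's method.

```python
import numpy as np

def oce(losses, phi):
    """rho_oce(F_U) = min over lambda of { lambda + E[phi(U - lambda)] },
    with the minimization done crudely over a grid (illustrative only)."""
    lambdas = np.linspace(losses.min(), losses.max(), 2_000)
    return min(lam + phi(losses - lam).mean() for lam in lambdas)

alpha = 0.05
phi_cvar = lambda t: np.maximum(t, 0.0) / alpha    # recovers CVaR_alpha
phi_mean = lambda t: t                             # recovers the expectation

losses = np.random.default_rng(0).exponential(size=100_000)
print(oce(losses, phi_mean))                       # ~1.0, the mean of Exp(1)
print(oce(losses, phi_cvar))                       # ~1 + log(20), matches the distortion form
```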
Proof. When $U$ has support $[0, D]$, as shown in Lee et al. (2020, Lemma 9), we can re-write the OCE and inverted OCE risks as follows:

$$\rho_{\mathrm{oce}}(F_U) = \min_{\lambda \in [0,D]} \{\lambda + \mathbb{E}[\phi(U - \lambda)]\}, \qquad \bar\rho_{\mathrm{oce}}(F_U) = \max_{\lambda \in [0,D]} \{\lambda - \mathbb{E}[\phi(\lambda - U)]\}.$$

For any $U, U'$ with support $[0, D]$, denote $\lambda_U \in \arg\min_{\lambda \in [0,D]} \lambda + \mathbb{E}_U[\phi(U - \lambda)]$ and $\lambda_{U'} \in \arg\min_{\lambda \in [0,D]} \lambda + \mathbb{E}_{U'}[\phi(U' - \lambda)]$. Consider the case where $\rho_{\mathrm{oce}}(F_U) < \rho_{\mathrm{oce}}(F_{U'})$. Then

$$|\rho_{\mathrm{oce}}(F_U) - \rho_{\mathrm{oce}}(F_{U'})| = \rho_{\mathrm{oce}}(F_{U'}) - \rho_{\mathrm{oce}}(F_U) = \lambda_{U'} + \mathbb{E}_{U'}[\phi(U' - \lambda_{U'})] - \lambda_U - \mathbb{E}_U[\phi(U - \lambda_U)]$$
$$\stackrel{(i)}{\le} \lambda_U + \mathbb{E}_{U'}[\phi(U' - \lambda_U)] - \lambda_U - \mathbb{E}_U[\phi(U - \lambda_U)] = \int_0^D \phi(u - \lambda_U)\, d\big(F_{U'}(u) - F_U(u)\big)$$
$$\stackrel{(ii)}{=} \Big[\phi(u - \lambda_U)\big(F_{U'}(u) - F_U(u)\big)\Big]_{u=0}^{u=D} - \int_0^D \phi'(u - \lambda_U)\big(F_{U'}(u) - F_U(u)\big)\, du$$
$$\stackrel{(iii)}{\le} \|F_U - F_{U'}\|_\infty \int_0^D \phi'(u - \lambda_U)\, du = \big(\phi(D - \lambda_U) - \phi(-\lambda_U)\big)\, \|F_U - F_{U'}\|_\infty,$$

where (i) comes from the definition of $\lambda_{U'}$ as a minimizer, (ii) uses integration by parts, and (iii) uses the fact that $\phi$ is non-decreasing, i.e., $\phi'$ is non-negative, together with $F_U(0) = F_{U'}(0) = 0$ and $F_U(D) = F_{U'}(D) = 1$ (so the boundary term vanishes). The case where $\rho_{\mathrm{oce}}(F_{U'}) < \rho_{\mathrm{oce}}(F_U)$ proceeds similarly:

$$|\rho_{\mathrm{oce}}(F_U) - \rho_{\mathrm{oce}}(F_{U'})| = \rho_{\mathrm{oce}}(F_U) - \rho_{\mathrm{oce}}(F_{U'}) \le \int_0^D \phi(u - \lambda_{U'})\, d\big(F_U(u) - F_{U'}(u)\big) \le \big(\phi(D - \lambda_{U'}) - \phi(-\lambda_{U'})\big)\, \|F_U - F_{U'}\|_\infty.$$

Putting it together, we have that

$$|\rho_{\mathrm{oce}}(F_U) - \rho_{\mathrm{oce}}(F_{U'})| \le \max_{x \in [0,D]}\big(\phi(D - x) - \phi(-x)\big)\, \|F_U - F_{U'}\|_\infty.$$

For the inverted OCE risks, denote $\bar\lambda_U \in \arg\max_{\lambda \in [0,D]} \lambda - \mathbb{E}_U[\phi(\lambda - U)]$ and $\bar\lambda_{U'} \in \arg\max_{\lambda \in [0,D]} \lambda - \mathbb{E}_{U'}[\phi(\lambda - U')]$. The proof proceeds similarly, using the facts that $\bar\rho_{\mathrm{oce}}(F_{U'}) - \bar\rho_{\mathrm{oce}}(F_U) \le \mathbb{E}_U[\phi(\bar\lambda_{U'} - U)] - \mathbb{E}_{U'}[\phi(\bar\lambda_{U'} - U')]$ and $\bar\rho_{\mathrm{oce}}(F_U) - \bar\rho_{\mathrm{oce}}(F_{U'}) \le \mathbb{E}_{U'}[\phi(\bar\lambda_U - U')] - \mathbb{E}_U[\phi(\bar\lambda_U - U)]$. Following similar steps, we obtain that

$$|\bar\rho_{\mathrm{oce}}(F_U) - \bar\rho_{\mathrm{oce}}(F_{U'})| \le \max_{x \in [0,D]}\big(\phi(x) - \phi(x - D)\big)\, \|F_U - F_{U'}\|_\infty.$$

Spectral risks (also known as L-risks or rank-weighted risks) are a subset of distortion risk functionals. As noted in Bäuerle & Glauner (2021), a spectral risk can be written as a distortion risk (Equation (4)) with the distortion function

$$g(t) = \int_0^t h(1 - s)\, ds \quad \text{for } t \in [0, 1],$$

where $h : [0, 1] \to \mathbb{R}_+$ is the non-decreasing spectrum function that integrates to 1. Since $g$ is Lipschitz when $h$ is bounded (as $g'(t) = h(1 - t)$ for $t \in (0, 1)$), spectral risks are Lipschitz on the space of bounded random variables when their spectrum is bounded. Finally, examples of risk functionals that are Hölder but not Lipschitz include distortion risks whose distortion functions are Hölder but not Lipschitz.

B. Proofs of Results in Section 4

We note that in the following proofs, $Z$ is used to denote a generic random variable and the loss function is denoted by $\ell : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. (The loss function presented in the main text is a special case of this.) For a given $(X, Y, f(X))$, the loss is denoted by $\ell(X, Y, f(X))$.

B.1. Auxiliary Lemmas

The two auxiliary lemmas below are mainly adaptations of the class notes of Professor Roberto Imbuzeiro Oliveira (Oliveira & Yang).

Lemma B.1. Let $\mathcal{G}(1) := \{g(\cdot; r) := \mathbb{1}\{\cdot \le r\} : r \in \mathbb{R}\}$. For a fixed $f \in \mathcal{F}$, an iid sample $\{X_i, Y_i\}_{i=1}^n$ with joint probability measure $P$, and a loss function $\ell : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, we have that

$$\sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_i\, g(\ell(X_i, Y_i, f(X_i))) \overset{d}{=} \max\left(0,\ \max_{j \in [n]} \sum_{i=1}^j \xi_i\right),$$

and further, for any $\lambda \ge 0$,

$$\mathbb{E}_{P,R}\left[\exp\left(\frac{\lambda}{n} \sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_i\, g(\ell(X_i, Y_i, f(X_i)))\right)\right] \le 2\, \mathbb{E}_{R}\left[\exp\left(\frac{\lambda}{n} \max_{j \in [n]} \sum_{i=1}^j \xi_i\right) \mathbb{1}\left\{\max_{j \in [n]} \sum_{i=1}^j \xi_i \ge 0\right\}\right].$$

Remark B.1. We note that an important property of Lemma B.1 is that the bound is independent of the samples $\{X_i, Y_i\}_{i=1}^n$.

Proof. Let $\{R_i\}_{i=1}^n$ denote the sorted sequence of $\{\ell(X_i, Y_i, f(X_i))\}_{i=1}^n$, where $R_1 \le R_2 \le \ldots \le R_n$. Since the Rademacher variables are iid and independent of the data, re-indexing them along the sorted losses does not change the joint distribution, so

$$\frac{\lambda}{n} \sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_i\, g(\ell(X_i, Y_i, f(X_i))) \overset{d}{=} \frac{\lambda}{n} \sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_i\, g(R_i).$$

Consider a function $g(t; r) = \mathbb{1}\{t \le r\}$. For such a function, $\sum_{i=1}^n \xi_i\, g(R_i; r)$ is equal to: 0 if $r < \min_{i \in [n]} R_i$; $\sum_{i=1}^j \xi_i$ when $R_j \le r < R_{j+1}$ for some $j \in \{1, \ldots, n-1\}$; and $\sum_{i=1}^n \xi_i$ otherwise.
Therefore, we have that

$$\sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_i\, g(R_i) = \max\left(0,\ \max_{j \in [n]} \sum_{i=1}^j \xi_i\right).$$

Finally, we notice that

$$\mathbb{E}\left[\exp\left(\frac{\lambda}{n} \sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_i\, g(R_i)\right)\right] = \mathbb{E}\left[\exp\left(\frac{\lambda}{n} \max_{j}\sum_{i=1}^j \xi_i\right) \mathbb{1}\left\{\max_{j}\sum_{i=1}^j \xi_i \ge 0\right\} + \mathbb{1}\left\{\max_{j}\sum_{i=1}^j \xi_i < 0\right\}\right] \le 2\, \mathbb{E}\left[\exp\left(\frac{\lambda}{n} \max_{j}\sum_{i=1}^j \xi_i\right) \mathbb{1}\left\{\max_{j}\sum_{i=1}^j \xi_i \ge 0\right\}\right],$$

where the last inequality uses $\mathbb{P}(\max_j \sum_{i \le j} \xi_i < 0) \le 1/2 \le \mathbb{E}[\exp(\frac{\lambda}{n}\max_j \sum_{i \le j} \xi_i)\, \mathbb{1}\{\max_j \sum_{i \le j} \xi_i \ge 0\}]$, since $\mathbb{P}(\max_j \sum_{i \le j} \xi_i \ge 0) \ge \mathbb{P}(\xi_1 = 1) = 1/2$.

Lemma B.2. Let $\{Z_i\}_{i=1}^n$, taking values in $\mathcal{Z}$, denote independent samples drawn from $P$, and let $w : \mathcal{Z} \to \mathbb{R}_+$. For $n$ independent Rademacher random variables $\{\xi_i\}_{i=1}^n$, we have that for all $\lambda \ge 0$,

$$\mathbb{E}\left[\exp\left(\frac{\lambda}{n} \max_{j \in [n]} \sum_{i=1}^j w(Z_i)\xi_i\right) \mathbb{1}\left\{\max_{j \in [n]} \sum_{i=1}^j w(Z_i)\xi_i \ge 0\right\}\right] \le 2\, \mathbb{E}\left[\exp\left(\frac{\lambda}{n} \sum_{i=1}^n w(Z_i)\xi_i\right)\right].$$

Further, if $\frac{1}{n}\sum_{i=1}^n w(Z_i)\xi_i$ is mean-zero $\frac{\gamma^2}{n}$-sub-Gaussian, then

$$\mathbb{E}\left[\exp\left(\frac{\lambda}{n} \max_{j \in [n]} \sum_{i=1}^j w(Z_i)\xi_i\right) \mathbb{1}\left\{\max_{j \in [n]} \sum_{i=1}^j w(Z_i)\xi_i \ge 0\right\}\right] \le 2 \exp\left(\frac{\lambda^2 \gamma^2}{2n}\right).$$

Remark B.2. We note that an immediate consequence of Lemma B.2 is

$$\mathbb{E}\left[\exp\left(\frac{\lambda}{n}\sum_{i=1}^n w(Z_i)\xi_i\right)\mathbb{1}\left\{\sum_{i=1}^n w(Z_i)\xi_i \ge 0\right\}\right] \le \mathbb{E}\left[\exp\left(\frac{\lambda}{n}\max_{j \in [n]}\sum_{i=1}^j w(Z_i)\xi_i\right)\mathbb{1}\left\{\max_{j \in [n]}\sum_{i=1}^j w(Z_i)\xi_i \ge 0\right\}\right] \le 2\, \mathbb{E}\left[\exp\left(\frac{\lambda}{n}\sum_{i=1}^n w(Z_i)\xi_i\right)\right].$$

Proof. The proof contains two main steps.

Step 1: We show that for all $t > 0$,

$$\mathbb{P}\left(\max_{j \in [n]} \sum_{i=1}^j w(Z_i)\xi_i \ge t\right) \le 2\, \mathbb{P}\left(\sum_{i=1}^n w(Z_i)\xi_i \ge t\right). \quad (11)$$

To show (11), for $t > 0$, consider the events $E_0 := \emptyset$ and $E_j := \{\sum_{i=1}^j w(Z_i)\xi_i \ge t,\ \sum_{i=1}^l w(Z_i)\xi_i < t\ \text{for all}\ l < j\}$ for $j \in [n]$, which states that $j$ is the first index at which the partial sum $\sum_{i=1}^j w(Z_i)\xi_i$ is at least $t$. We first notice that

$$\mathbb{P}\left(\max_{j \in [n]} \sum_{i=1}^j w(Z_i)\xi_i \ge t\right) = \sum_{j=1}^n \mathbb{P}(E_j).$$

Since $E_j$ together with $\sum_{i=j+1}^n w(Z_i)\xi_i \ge 0$ implies that $\sum_{i=1}^n w(Z_i)\xi_i \ge t$, we obtain

$$\mathbb{P}\left(E_j \cap \left\{\sum_{i=j+1}^n w(Z_i)\xi_i \ge 0\right\}\right) \le \mathbb{P}\left(E_j \cap \left\{\sum_{i=1}^n w(Z_i)\xi_i \ge t\right\}\right). \quad (12)$$

Moreover, for any $j \in [n]$, we have $\mathbb{P}(\sum_{i=j+1}^n w(Z_i)\xi_i \ge 0) \ge \frac{1}{2}$, since $\sum_{i=j+1}^n w(Z_i)\xi_i$ is symmetric around 0 (i.e., $\sum_{i=j+1}^n w(Z_i)\xi_i \overset{d}{=} -\sum_{i=j+1}^n w(Z_i)\xi_i$). For any $j \in [n]$, since the event $E_j$ (depending only on $\{w(Z_i), \xi_i\}_{i \le j}$) is independent of $\{\sum_{i=j+1}^n w(Z_i)\xi_i \ge 0\}$, we have

$$\mathbb{P}\left(E_j \cap \left\{\sum_{i=j+1}^n w(Z_i)\xi_i \ge 0\right\}\right) = \mathbb{P}(E_j)\, \mathbb{P}\left(\sum_{i=j+1}^n w(Z_i)\xi_i \ge 0\right) \ge \frac{1}{2}\mathbb{P}(E_j). \quad (13)$$

As a result, we have

$$\mathbb{P}\left(\max_{j \in [n]} \sum_{i=1}^j w(Z_i)\xi_i \ge t\right) = \sum_{j=1}^n \mathbb{P}(E_j) \le 2 \sum_{j=1}^n \mathbb{P}\left(E_j \cap \left\{\sum_{i=1}^n w(Z_i)\xi_i \ge t\right\}\right) \le 2\, \mathbb{P}\left(\sum_{i=1}^n w(Z_i)\xi_i \ge t\right),$$

where the first equality holds because $E_i \cap E_j = \emptyset$ for $i \ne j$, the first inequality follows from (12) and (13), and the last inequality comes from the union bound.

Step 2: For any differentiable $f$ and random variable $X$,

$$\mathbb{E}[f(X)\mathbb{1}\{X \ge 0\}] = f(0)\,\mathbb{P}(X \ge 0) + \int_0^\infty f'(t)\,\mathbb{P}(X \ge t)\, dt, \quad (14)$$

where the equality follows from Fubini's theorem. Putting it all together, with $X = \frac{1}{n}\max_{j \in [n]} \sum_{i=1}^j w(Z_i)\xi_i$ and $f(t) = e^{\lambda t}$, we have

$$\mathbb{E}\left[\exp\left(\frac{\lambda}{n}\max_{j}\sum_{i=1}^j w(Z_i)\xi_i\right)\mathbb{1}\{X \ge 0\}\right] = \mathbb{P}(X \ge 0) + \int_0^\infty \lambda e^{\lambda t}\, \mathbb{P}(X \ge t)\, dt$$
$$\le 2\,\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n w(Z_i)\xi_i \ge 0\right) + 2\int_0^\infty \lambda e^{\lambda t}\, \mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n w(Z_i)\xi_i \ge t\right) dt = 2\,\mathbb{E}\left[\exp\left(\frac{\lambda}{n}\sum_{i=1}^n w(Z_i)\xi_i\right)\mathbb{1}\left\{\sum_{i=1}^n w(Z_i)\xi_i \ge 0\right\}\right] \le 2\,\mathbb{E}\left[\exp\left(\frac{\lambda}{n}\sum_{i=1}^n w(Z_i)\xi_i\right)\right],$$

where the first inequality follows from (11). Finally, if $\frac{1}{n}\sum_{i=1}^n w(Z_i)\xi_i$ is mean-zero $\frac{\gamma^2}{n}$-sub-Gaussian, then, using the definition of a sub-Gaussian random variable, we obtain

$$\mathbb{E}\left[\exp\left(\frac{\lambda}{n}\sum_{i=1}^n w(Z_i)\xi_i\right)\right] \le \exp\left(\frac{\lambda^2\gamma^2}{2n}\right).$$

B.2. Proof of Theorem 4.1

Theorem 4.1. Given a hypothesis class $\mathcal{F}$, any loss function $\ell$, and $n$ samples $\{Z_i\}_{i=1}^n$, we have that with probability at least $1 - \delta$,

$$e_n(\mathcal{F}, \ell) \le 2\mathfrak{R}(n, \mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}.$$

Proof. We first analyze the sensitivity of the sup-norm of the CDF estimator over $\mathcal{F}$ and $\mathcal{G}(1)$.
Given two data sets $\{x_i, y_i\}_{i=1}^n$ and $\{x'_i, y'_i\}_{i=1}^n$ that differ only in the $j$-th entry, let

$$\hat{F}_1(r; f) := \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{\ell(x_i, y_i, f(x_i)) \le r\} \quad \text{and} \quad \hat{F}_2(r; f) := \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{\ell(x'_i, y'_i, f(x'_i)) \le r\}.$$

Then, if $\sup_{f} \sup_{r} |\hat{F}_1(r; f) - F(r; f)| \ge \sup_{f} \sup_{r} |\hat{F}_2(r; f) - F(r; f)|$, since the two empirical sums differ only in the $j$-th indicator, we have

$$\sup_{f \in \mathcal{F}} \sup_{r \in \mathbb{R}} \left|\hat{F}_1(r; f) - F(r; f)\right| - \sup_{f \in \mathcal{F}} \sup_{r \in \mathbb{R}} \left|\hat{F}_2(r; f) - F(r; f)\right| \le \sup_{f \in \mathcal{F}} \sup_{r \in \mathbb{R}} \frac{1}{n}\left|\mathbb{1}\{\ell(x_j, y_j, f(x_j)) \le r\} - \mathbb{1}\{\ell(x'_j, y'_j, f(x'_j)) \le r\}\right| \le \frac{1}{n}.$$

This bound holds regardless of which index $j$ and which data sets we choose. Using the bounded difference inequality, i.e., McDiarmid's inequality (Boucheron et al., 2013), we have

$$\sup_{f \in \mathcal{F}} \sup_{r \in \mathbb{R}} \left|\hat{F}(r; f) - F(r; f)\right| \le \mathbb{E}_P\left[\sup_{f \in \mathcal{F}} \sup_{r \in \mathbb{R}} \left|\hat{F}(r; f) - F(r; f)\right|\right] + \sqrt{\frac{\log(1/\delta)}{2n}} \quad (15)$$

with probability at least $1 - \delta$. Using a ghost sample set $\{X'_i, Y'_i\}_{i=1}^n$, we have

$$\mathbb{E}_P\left[\sup_{f} \sup_{r} \left|\hat{F}(r; f) - F(r; f)\right|\right] = \mathbb{E}_P\left[\sup_{f} \sup_{r} \left|\frac{1}{n}\sum_{i=1}^n \mathbb{1}\{\ell(X_i, Y_i, f(X_i)) \le r\} - \mathbb{E}_P\left[\frac{1}{n}\sum_{i=1}^n \mathbb{1}\{\ell(X'_i, Y'_i, f(X'_i)) \le r\}\right]\right|\right]$$
$$\le \mathbb{E}_P\left[\sup_{f} \sup_{g \in \mathcal{G}(1)} \left|\frac{1}{n}\sum_{i=1}^n \big(g(\ell(X_i, Y_i, f(X_i))) - g(\ell(X'_i, Y'_i, f(X'_i)))\big)\right|\right]$$
$$= \mathbb{E}_{P,R}\left[\sup_{f} \sup_{g \in \mathcal{G}(1)} \left|\frac{1}{n}\sum_{i=1}^n \xi_i\big(g(\ell(X_i, Y_i, f(X_i))) - g(\ell(X'_i, Y'_i, f(X'_i)))\big)\right|\right] \le 2\, \mathbb{E}_{P,R}\left[\sup_{f} \sup_{g \in \mathcal{G}(1)} \frac{1}{n}\sum_{i=1}^n \xi_i\, g(\ell(X_i, Y_i, f(X_i)))\right] = 2\mathfrak{R}(n, \mathcal{F}), \quad (16)$$

where the first inequality uses Jensen's inequality and the second equality uses the symmetry of the ghost sample. Putting (15) and (16) together concludes the proof.

B.3. Permutation Complexity

We note that this proof, along with many other proofs for Section 4, is based on the machinery and techniques in Massart's finite class lemma (Massart, 2000) and the DKW inequality in Devroye et al. (2013).

Theorem 4.2. For any hypothesis class $\mathcal{F}$ and loss function $\ell$, we have that

$$\mathfrak{R}(n, \mathcal{F}) \le \sqrt{\frac{2\log(4 N_\Pi(n, \mathcal{F}_\ell, P))}{n}}.$$

Proof. For a positive $\lambda$, by Jensen's inequality we have

$$\exp(\lambda \mathfrak{R}(n, \mathcal{F})) \le \mathbb{E}_{P,R}\left[\exp\left(\frac{\lambda}{n}\sup_{f \in \mathcal{F}}\sup_{g \in \mathcal{G}(1)}\sum_{i=1}^n \xi_i\, g(\ell(X_i, Y_i, f(X_i)))\right)\right] = \mathbb{E}_P\left[\mathbb{E}_R\left[\exp\left(\frac{\lambda}{n}\sup_{f}\sup_{g}\sum_{i=1}^n \xi_i\, g(\ell(X_i, Y_i, f(X_i)))\right) \,\middle|\, \sigma(\{X_i, Y_i\}_{i=1}^n)\right]\right]. \quad (17)$$

For any $f \in \mathcal{F}$, let $\pi_f$ denote a permutation such that $\ell(X_{\pi_f(i)}, Y_{\pi_f(i)}, f(X_{\pi_f(i)})) \le \ell(X_{\pi_f(j)}, Y_{\pi_f(j)}, f(X_{\pi_f(j)}))$ for any $i, j \in [n]$ with $i \le j$. Therefore we have

$$\sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_i\, g(\ell(X_i, Y_i, f(X_i))) = \sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_{\pi_f(i)}\, g(\ell(X_{\pi_f(i)}, Y_{\pi_f(i)}, f(X_{\pi_f(i)}))).$$

Consider a function $g(r') = \mathbb{1}\{r' \le r\}$. For such a function, $\sum_{i=1}^n \xi_{\pi_f(i)}\, g(\ell(X_{\pi_f(i)}, Y_{\pi_f(i)}, f(X_{\pi_f(i)})))$ is equal to: 0 if $r < \min_i \ell(X_{\pi_f(i)}, Y_{\pi_f(i)}, f(X_{\pi_f(i)}))$; $\sum_{i=1}^j \xi_{\pi_f(i)}$ when $\ell(X_{\pi_f(j)}, Y_{\pi_f(j)}, f(X_{\pi_f(j)})) \le r < \ell(X_{\pi_f(j+1)}, Y_{\pi_f(j+1)}, f(X_{\pi_f(j+1)}))$ for some $j \in \{1, \ldots, n-1\}$; and $\sum_{i=1}^n \xi_{\pi_f(i)}$ otherwise. Using this property, we have

$$\sup_{g \in \mathcal{G}(1)} \sum_{i=1}^n \xi_{\pi_f(i)}\, g(\ell(X_{\pi_f(i)}, Y_{\pi_f(i)}, f(X_{\pi_f(i)}))) = \max\left(0,\ \max_{j \in [n]} \sum_{i=1}^j \xi_{\pi_f(i)}\right).$$

Using this equality, we can further extend Eq. (17):

$$\exp(\lambda \mathfrak{R}(n, \mathcal{F})) \le \mathbb{E}_P\left[\mathbb{E}_R\left[\exp\left(\frac{\lambda}{n}\sup_{f \in \mathcal{F}} \max\left(0,\ \max_{j}\sum_{i=1}^j \xi_{\pi_f(i)}\right)\right) \,\middle|\, \sigma(\{X_i, Y_i\}_{i=1}^n)\right]\right]$$
$$\le \mathbb{E}_P\left[N_\Pi(\mathcal{F}_\ell, \{X_i, Y_i\}_{i=1}^n)\right] \max_{\pi \in \Pi(n)} \mathbb{E}_R\left[\exp\left(\frac{\lambda}{n}\max\left(0,\ \max_j \sum_{i=1}^j \xi_{\pi(i)}\right)\right)\right] = N_\Pi(n, \mathcal{F}_\ell, P)\, \mathbb{E}_R\left[\exp\left(\frac{\lambda}{n}\max\left(0,\ \max_j \sum_{i=1}^j \xi_i\right)\right)\right],$$

where the second inequality follows from the fact that there are effectively at most $N_\Pi(\mathcal{F}_\ell, \{X_i, Y_i\}_{i=1}^n)$ distinct permutations $\pi_f$, and the last equality uses that for any fixed permutation $\pi$, $(\xi_{\pi(i)})_{i=1}^n \overset{d}{=} (\xi_i)_{i=1}^n$. Using the same derivation as in the proof of Lemma B.1, we have

$$\exp(\lambda \mathfrak{R}(n, \mathcal{F})) \le 2 N_\Pi(n, \mathcal{F}_\ell, P)\, \mathbb{E}_R\left[\exp\left(\frac{\lambda}{n}\max_j \sum_{i=1}^j \xi_i\right)\mathbb{1}\left\{\max_j \sum_{i=1}^j \xi_i \ge 0\right\}\right].$$

By Lemma B.2, we have

$$\exp(\lambda \mathfrak{R}(n, \mathcal{F})) \le 4 N_\Pi(n, \mathcal{F}_\ell, P) \exp\left(\frac{\lambda^2}{2n}\right).$$

Now, taking the log of both sides and dividing by $\lambda$, we have $\mathfrak{R}(n, \mathcal{F}) \le \frac{\log(4 N_\Pi(n, \mathcal{F}_\ell, P))}{\lambda} + \frac{\lambda}{2n}$.
B.3. Permutation Complexity

We note that this proof, along with many of the other proofs for Section 4, builds on the machinery and techniques of Massart's finite class lemma (Massart, 2000) and the DKW inequality in (Devroye et al., 2013).

Theorem 4.2. For any hypothesis class $\mathcal{F}$ and loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, we have
$$R(n, \mathcal{F}) \le \sqrt{\frac{2\log(4N_\Pi(n, \mathcal{F}_\ell, P))}{n}}.$$

Proof. For a positive $\lambda$, we have, by Jensen's inequality,
$$\exp(\lambda R(n, \mathcal{F})) = \exp\left(\lambda\,\mathbb{E}_{P,\mathcal{R}}\left[\sup_{f \in \mathcal{F}}\sup_{g \in \mathcal{G}(1)} \left|\frac{1}{n}\sum_{i=1}^{n} \xi_i g(\ell(Y_i, f(X_i)))\right|\right]\right) \le \mathbb{E}_P\left[\mathbb{E}_{\mathcal{R}}\left[\exp\left(\lambda \sup_{f \in \mathcal{F}}\sup_{g \in \mathcal{G}(1)} \left|\frac{1}{n}\sum_{i=1}^{n} \xi_i g(\ell(Y_i, f(X_i)))\right|\right) \Bigg|\, \sigma(\{X_i, Y_i\}_{i=1}^{n})\right]\right]. \tag{17}$$
For any $f \in \mathcal{F}$, let $\pi_f$ denote a permutation that sorts the losses in non-decreasing order, i.e., $\ell(Y_{\pi_f(i)}, f(X_{\pi_f(i)})) \le \ell(Y_{\pi_f(j)}, f(X_{\pi_f(j)}))$ for any $i, j \in [n]$ with $i \le j$. Therefore we have
$$\sup_{g \in \mathcal{G}(1)} \left|\sum_{i=1}^{n} \xi_i g(\ell(Y_i, f(X_i)))\right| = \sup_{g \in \mathcal{G}(1)} \left|\sum_{i=1}^{n} \xi_{\pi_f(i)} g\big(\ell(Y_{\pi_f(i)}, f(X_{\pi_f(i)}))\big)\right|.$$
Consider a function $g(r') = \mathbb{1}\{r' \le r\}$. For such a function, $\sum_{i=1}^{n} \xi_{\pi_f(i)} g(\ell(Y_{\pi_f(i)}, f(X_{\pi_f(i)})))$ is equal to: $0$ if $r < \min_{i} \ell(Y_i, f(X_i))$; $\sum_{i=1}^{j} \xi_{\pi_f(i)}$ when $\ell(Y_{\pi_f(j)}, f(X_{\pi_f(j)})) \le r < \ell(Y_{\pi_f(j+1)}, f(X_{\pi_f(j+1)}))$ for some $j \in \{1, \ldots, n-1\}$; and $\sum_{i=1}^{n} \xi_{\pi_f(i)}$ otherwise. Using this property, we have
$$\sup_{g \in \mathcal{G}(1)} \left|\frac{1}{n}\sum_{i=1}^{n} \xi_{\pi_f(i)} g\big(\ell(Y_{\pi_f(i)}, f(X_{\pi_f(i)}))\big)\right| = \frac{1}{n}\max_{j \in [n]} \left|\sum_{i=1}^{j} \xi_{\pi_f(i)}\right|.$$
Using this equality, we can further extend (17). Conditioned on $\{X_i, Y_i\}_{i=1}^{n}$, there are effectively at most $N_\Pi(\mathcal{F}_\ell, \{X_i, Y_i\}_{i=1}^{n})$ distinct permutations $\pi_f$, and for any fixed permutation $\pi$ the vector $(\xi_{\pi(1)}, \ldots, \xi_{\pi(n)})$ has the same distribution as $(\xi_1, \ldots, \xi_n)$. Hence
$$\exp(\lambda R(n, \mathcal{F})) \le \mathbb{E}_P\left[N_\Pi(\mathcal{F}_\ell, \{X_i, Y_i\}_{i=1}^{n})\,\mathbb{E}_{\mathcal{R}}\left[\exp\left(\frac{\lambda}{n}\max_{j \in [n]} \Big|\sum_{i=1}^{j} \xi_i\Big|\right)\right]\right] \le N_\Pi(n, \mathcal{F}_\ell, P)\,\mathbb{E}_{\mathcal{R}}\left[\exp\left(\frac{\lambda}{n}\max_{j \in [n]} \Big|\sum_{i=1}^{j} \xi_i\Big|\right)\right].$$
Using the same derivation as for (10) (Lemma B.1), i.e., splitting on the sign of the maximal partial sum and using $\xi \stackrel{d}{=} -\xi$, we have
$$\exp(\lambda R(n, \mathcal{F})) \le 2N_\Pi(n, \mathcal{F}_\ell, P)\,\mathbb{E}_{\mathcal{R}}\left[\exp\left(\frac{\lambda}{n}\max_{j \in [n]} \sum_{i=1}^{j} \xi_i\right)\mathbb{1}\Big\{\max_{j \in [n]}\sum_{i=1}^{j} \xi_i \ge 0\Big\}\right].$$
By Lemma B.2 (with $w \equiv 1$, for which $\frac{1}{n}\sum_{i=1}^{n} \xi_i$ is mean-zero $\frac{1}{n}$-sub-Gaussian, i.e., $\gamma = 1$), we have
$$\exp(\lambda R(n, \mathcal{F})) \le 4N_\Pi(n, \mathcal{F}_\ell, P)\exp\left(\frac{\lambda^2}{2n}\right).$$
Now, taking the logarithm of both sides and dividing by $\lambda$, we have $R(n, \mathcal{F}) \le \frac{\log(4N_\Pi(n, \mathcal{F}_\ell, P))}{\lambda} + \frac{\lambda}{2n}$. Choosing $\lambda = \sqrt{2n\log(4N_\Pi(n, \mathcal{F}_\ell, P))}$, we obtain the final result
$$R(n, \mathcal{F}) \le \sqrt{\frac{2\log(4N_\Pi(n, \mathcal{F}_\ell, P))}{n}}. \qquad \blacksquare$$

Corollary 4.1. For a finite hypothesis class $\mathcal{F}$,
$$R(n, \mathcal{F}) \le \sqrt{\frac{2\log(4|\mathcal{F}|)}{n}}.$$

Proof. The result follows since, when $|\mathcal{F}| < \infty$, $N_\Pi(n, \mathcal{F}_\ell, P) \le |\mathcal{F}|$: we need at most one permutation to sort the losses for each $f \in \mathcal{F}$. ∎

C. Proofs of Results in Section 5

Theorem 5.1. For a hypothesis class $\mathcal{F}$, a bounded loss function $\ell$, and $\delta \in (0, 1]$, if $P(e_n(\mathcal{F}, \ell) \le \epsilon) \ge 1 - \delta$, then with probability at least $1 - \delta$, for all $\rho \in \mathcal{T}$, we have
$$\sup_{f \in \mathcal{F}} \big|\rho(F(\cdot; f)) - \rho(\hat{F}(\cdot; f))\big| \le L(\rho, 1, L_\infty)\,\epsilon,$$
where $\mathcal{T}$ is the set of $L(\cdot, 1, L_\infty)$-Hölder risk functionals on the space of bounded random variables.

Proof. For all $\rho \in \mathcal{T}$, since $\rho$ is $L(\rho, 1, L_\infty)$-Hölder, we have that
$$\big|\rho(F(\cdot; f)) - \rho(\hat{F}(\cdot; f))\big| \le L(\rho, 1, L_\infty)\,\|F - \hat{F}\|_\infty.$$
The desired result then follows. ∎

The error of risk assessment can be bounded using distances other than the sup-norm, such as the Wasserstein distance. For two random variables $U$ and $U'$ with bounded support $[0, D]$, the dual form of the Wasserstein distance $W_1(F_U, F_{U'})$ is given by (Vallender, 1974)
$$W_1(F_U, F_{U'}) = \int_0^D |F_U(t) - F_{U'}(t)|\,dt \le D\,\|F_U - F_{U'}\|_\infty.$$
This inequality suggests the following corollary.

Corollary C.1. Under the setting of Theorem 5.1 where the loss has support $[0, D]$, with probability at least $1 - \delta$, for all $\rho \in \mathcal{T}$, we have
$$\sup_{f \in \mathcal{F}} \big|\rho(F(\cdot; f)) - \rho(\hat{F}(\cdot; f))\big| \le L(\rho, p, W_1)\,D^p\epsilon^p,$$
where $\mathcal{T}$ is the set of $L(\cdot, p, W_1)$-Hölder risk functionals on the space of bounded random variables.

Proof. For all $\rho \in \mathcal{T}$, since $\rho$ is $L(\rho, p, W_1)$-Hölder, we have that
$$\big|\rho(F(\cdot; f)) - \rho(\hat{F}(\cdot; f))\big| \le L(\rho, p, W_1)\,W_1(F, \hat{F})^p \le L(\rho, p, W_1)\,D^p\,\|F - \hat{F}\|_\infty^p.$$
The desired result then follows. ∎

Corollary 5.1. For a hypothesis class $\mathcal{F}$, a bounded loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, D]$, and $\delta \in (0, 1]$, if $P(e_n(\mathcal{F}, \ell) \le \epsilon) \ge 1 - \delta$, then with probability at least $1 - \delta$, we have
$$\sup_{f \in \mathcal{F}} \big|\rho(F(\cdot; f)) - \rho(\hat{F}(\cdot; f))\big| \le L\epsilon,$$
where $\rho$ is a distortion risk with an $L/D$-Lipschitz distortion function.

Proof. As is shown in Huang et al. (2021, Lemma 4.1), such a $\rho$ is $L$-Lipschitz with respect to the sup-norm. Thus, directly applying Theorem 5.1 concludes the proof. ∎

D. Proofs for Results in Section 6

D.1. Proof of Lemma 6.1

Before proving Lemma 6.1, we first present two auxiliary lemmas.

Lemma D.1. For any continuous functions $g, h: \mathbb{R}^d \to \mathbb{R}$, $\min(g, h)$ and $\max(g, h)$ are continuous. Similarly, for any Lipschitz continuous functions $g, h: \mathbb{R}^d \to \mathbb{R}$, $\min(g, h)$ and $\max(g, h)$ are Lipschitz continuous.

Proof. Denote $f_{\min} = \min(g, h)$ and $f_{\max} = \max(g, h)$.

Continuity: We first consider the case when both $g$ and $h$ are continuous. Define $\Theta = \{\theta \in \mathbb{R}^d : g(\theta) = h(\theta)\}$. For $\theta \notin \Theta$, $f_{\min}$ and $f_{\max}$ are continuous at $\theta$ since $g$ and $h$ are continuous and one of them is strictly larger in a neighborhood of $\theta$. Consider $\theta \in \Theta$. For every $\epsilon > 0$, there exist $\delta_g, \delta_h > 0$ such that for all $\theta' \in \mathbb{R}^d$,
$$\|\theta - \theta'\| \le \delta_g \implies |g(\theta) - g(\theta')| \le \epsilon \quad \text{and} \quad \|\theta - \theta'\| \le \delta_h \implies |h(\theta) - h(\theta')| \le \epsilon.$$
In addition, $f_{\min}(\theta) = g(\theta) = h(\theta) = f_{\max}(\theta)$, and $f_{\min}(\theta')$, $f_{\max}(\theta')$ can each be either $g(\theta')$ or $h(\theta')$. Combining both facts gives us that, for $\|\theta - \theta'\| \le \min(\delta_g, \delta_h)$, $|f_{\min}(\theta) - f_{\min}(\theta')| \le \epsilon$ and $|f_{\max}(\theta) - f_{\max}(\theta')| \le \epsilon$.

Lipschitz continuity: We next work with the case where both $g$ and $h$ are Lipschitz continuous. The only nontrivial case is when the maximum is attained by different functions at $\theta$ and $\theta'$, say $|f_{\max}(\theta) - f_{\max}(\theta')| = |g(\theta) - h(\theta')|$ with $f_{\max}(\theta) = g(\theta)$ and $f_{\max}(\theta') = h(\theta')$. We then have the following two cases:
1. $g(\theta) > h(\theta')$: $|g(\theta) - h(\theta')| = g(\theta) - h(\theta') \le g(\theta) - g(\theta')$, since $f_{\max}(\theta') = h(\theta') \ge g(\theta')$.
2. $g(\theta) \le h(\theta')$: $|g(\theta) - h(\theta')| = h(\theta') - g(\theta) \le h(\theta') - h(\theta)$, since $f_{\max}(\theta) = g(\theta) \ge h(\theta)$.
Since both $g$ and $h$ are Lipschitz continuous, we obtain that
$$|f_{\max}(\theta) - f_{\max}(\theta')| \le \max_{f \in \{g, h\}} |f(\theta) - f(\theta')| \lesssim \|\theta - \theta'\|,$$
showing that $f_{\max}$ is Lipschitz continuous. The proof completes with the fact that $\min(g, h) = -\max(-g, -h)$. ∎
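A small numerical check (ours, for illustration only) of the min-max representation of order statistics, which is the key observation in the proof of Lemma D.2 below; it brute-forces all subsets of size $i$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
losses = rng.normal(size=7)
n = len(losses)

for i in range(1, n + 1):
    # i-th smallest value as a min over size-i subsets of their maxima:
    # min{ max{losses[j] : j in J} : J subset of [n], |J| = i }
    minmax = min(max(losses[j] for j in J)
                 for J in itertools.combinations(range(n), i))
    assert np.isclose(minmax, np.sort(losses)[i - 1])
print("min-max representation matches the sorted order statistics")
```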
Lemma D.2. If $\{\ell_\theta(x_j, y_j)\}_{j=1}^{n}$ are Lipschitz continuous in $\theta$, then for all $i \in [n]$, $\ell_\theta(\pi_\theta(i))$, i.e., the $i$-th smallest loss evaluated using the data points $\{x_j, y_j\}_{j=1}^{n}$, is Lipschitz continuous in $\theta$.

Proof. The key observation is that the $i$-th smallest loss can be written as
$$\ell_\theta(\pi_\theta(i)) = \min\big\{\max\{\ell_\theta(j) : j \in J\} : J \subseteq [n], |J| = i\big\}.$$
Since each $\ell_\theta(j)$ is Lipschitz continuous in $\theta$ and, for $j' \in J$,
$$\max\{\ell_\theta(j) : j \in J\} = \max\big\{\ell_\theta(j'),\ \max\{\ell_\theta(j) : j \in J \setminus \{j'\}\}\big\},$$
applying Lemma D.1 inductively shows that $\max\{\ell_\theta(j) : j \in J\}$ is Lipschitz continuous in $\theta$. Similarly, since each $\max\{\ell_\theta(j) : j \in J\}$ is Lipschitz continuous in $\theta$, so is $\min\{\max\{\ell_\theta(j) : j \in J\} : J \subseteq [n], |J| = i\}$. ∎

Lemma 6.1. If $\{\ell_\theta(z_i)\}_{i=1}^{n}$ are Lipschitz continuous in $\theta \in \Theta$, then for all $i \in [n]$, $\ell_\theta(\pi_\theta(i))$, i.e., the $i$-th smallest loss, is Lipschitz continuous in $\theta$, and $\rho(\hat{F}_\theta)$ is differentiable in $\theta$ almost everywhere.

Proof. Using Lemma D.2, we obtain that $\ell_\theta(\pi_\theta(i))$ is Lipschitz in $\theta \in \Theta$. By a classical result of Rademacher (Rockafellar & Wets, 2009, Theorem 9.60), a locally Lipschitz function is differentiable almost everywhere; hence $\rho(\hat{F}_\theta)$ is differentiable almost everywhere. ∎

D.2. Local Convergence

The proof of Corollary 6.1 is standard (Bottou et al., 2018); we provide it for completeness.

Corollary 6.1. If $\{\ell_\theta(z_i)\}_{i=1}^{n}$ are Lipschitz continuous and $\rho(\hat{F}_\theta)$ is $\beta$-smooth in $\theta$, then the following holds almost surely when the learning rate in (7) is $\eta = \frac{1}{\beta\sqrt{T}}$:
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\big\|\nabla_\theta \rho(\hat{F}_{\theta_t})\big\|^2\right] \le \frac{2\beta\big(\rho(\hat{F}_{\theta_1}) - \rho(\hat{F}_{\theta^*})\big) + 1}{\sqrt{T}}.$$

Proof. For notational simplicity, we use $h(\theta)$ to denote $\rho(\hat{F}_\theta)$ and $g_t$ to denote $\nabla_\theta \rho(\hat{F}_\theta)$ at $\theta = \theta_t$. Since $h(\theta)$ is differentiable almost everywhere, following (7), the sequence $\{h(\theta_t)\}_{t=1}^{T}$ is differentiable almost surely. Since $h$ is $\beta$-smooth, we have that
$$h(\theta_{t+1}) \le h(\theta_t) + \nabla h(\theta_t)^\top(\theta_{t+1} - \theta_t) + \frac{\beta}{2}\|\theta_{t+1} - \theta_t\|^2.$$
We denote the filtration of the stochastic process $\{\theta_t\}_{t=1}^{T}$ by $\{\mathcal{F}_t\}_{t=1}^{T}$. Since $\theta_{t+1} - \theta_t = -\eta(g_t + w_t)$, extending the above inequality, we have
$$h(\theta_{t+1}) - h(\theta_t) \le -\eta\,\nabla h(\theta_t)^\top(g_t + w_t) + \frac{\beta}{2}\eta^2\|g_t + w_t\|^2,$$
which, using $\mathbb{E}[w_t] = 0$ and $g_t = \nabla h(\theta_t)$, suggests that
$$\mathbb{E}\big[h(\theta_{t+1}) - h(\theta_t) \mid \mathcal{F}_t\big] \le -\eta\,\nabla h(\theta_t)^\top g_t + \frac{\beta}{2}\eta^2\big(\|g_t\|^2 + d\sigma_w^2\big) = -\Big(\eta - \frac{\beta\eta^2}{2}\Big)\big\|\nabla h(\theta_t)\big\|^2 + \frac{\beta\eta^2}{2},$$
where $\sigma_w^2 = 1/d$ is the per-coordinate variance of $w_t$ (so that $\mathbb{E}\|w_t\|^2 = d\sigma_w^2 = 1$). Using the telescoping sum and the law of total expectation, we obtain
$$\mathbb{E}\big[h(\theta_T) - h(\theta_1)\big] \le -\Big(\eta - \frac{\beta\eta^2}{2}\Big)\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla h(\theta_t)\|^2\big] + \frac{T\beta\eta^2}{2}.$$
Using the fact that $h(\theta^*) \le h(\theta_T)$, we have $\mathbb{E}[h(\theta_1) - h(\theta_T)] \le h(\theta_1) - h(\theta^*)$, which implies
$$\Big(\eta - \frac{\beta\eta^2}{2}\Big)\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla h(\theta_t)\|^2\big] \le h(\theta_1) - h(\theta^*) + \frac{T\beta\eta^2}{2}.$$
Plugging in the learning rate $\eta = \frac{1}{\beta\sqrt{T}}$, we have
$$\eta - \frac{\beta\eta^2}{2} = \frac{1}{\beta\sqrt{T}} - \frac{1}{2\beta T} \ge \frac{1}{2\beta\sqrt{T}} > 0.$$
Therefore we have
$$\frac{1}{2\beta\sqrt{T}}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla h(\theta_t)\|^2\big] \le h(\theta_1) - h(\theta^*) + \frac{1}{2\beta}.$$
Rearranging the above inequality gives the result. ∎

E. Additional Experimental Details

E.1. Risk Assessment on ImageNet Models

Figure 3 shows the CVaR of the models presented in Table 1 under different values of $\alpha$. We note that $\text{CVaR}_\alpha$ is the expected loss over the largest $100\alpha$ percent of losses.

[Figure 3 omitted: line plot of $\text{CVaR}_\alpha$ across $\alpha$ for vgg11 (accuracy: 0.69022), googlenet (accuracy: 0.69772), shufflenet (accuracy: 0.69356), inception (accuracy: 0.69542), and resnet18 (accuracy: 0.69756).]

Figure 3. $\text{CVaR}_\alpha(\ell_f(Z))$ across different values of $\alpha$, where $\ell_f$ is the cross-entropy loss evaluated on the ImageNet validation dataset.
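For reference, a plug-in estimate of $\text{CVaR}_\alpha$ from a sample of losses can be computed as sketched below. This is our illustration, not the paper's implementation; conventions for handling a fractional tail index vary, and here we simply round the tail size up.

```python
import numpy as np

def cvar(losses, alpha):
    """Plug-in CVaR_alpha: mean of the largest ceil(alpha * n) losses."""
    losses = np.sort(np.asarray(losses))[::-1]  # sort descending
    k = max(1, int(np.ceil(alpha * len(losses))))
    return losses[:k].mean()

# Example with synthetic per-sample losses; alpha = 1 recovers the mean.
losses = np.random.default_rng(0).exponential(size=10_000)
for alpha in (0.05, 0.25, 1.0):
    print(f"CVaR_{alpha:.2f} = {cvar(losses, alpha):.3f}")
```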
E.2. Empirical Distortion Risk Minimization

Toy Example. The data used in this experiment are generated using the make_blobs function from sklearn.datasets with the following parameters: n_samples = [1000, 50], centers = [[0.0, 0.0], [1.0, 1.0]], cluster_std = [1.5, 0.5], random_state = 0, shuffle = False.

CIFAR-10. For completeness, we report the average test accuracy of the VGG-16 models obtained by minimizing the empirical risks for the expected loss, CVaR.05, CVaR.7, and HRM.3,.4. The models are trained for 150 epochs with a learning rate of 0.005. The resulting accuracies are 55.2%, 13.2%, 51.3%, and 53.9%, respectively. We note that the goal of this experiment is to illustrate the efficacy of our proposed optimization procedure for minimizing distortion risks, not to argue for the use of any particular risk functional for finding the model with the highest test accuracy.
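For convenience, the toy data generation described above can be reproduced with the call below (a minimal sketch; only the listed parameter values come from the experiment description).

```python
from sklearn.datasets import make_blobs

# Majority cluster of 1000 points and minority cluster of 50 points,
# with different centers and spreads, as specified above.
X, y = make_blobs(
    n_samples=[1000, 50],
    centers=[[0.0, 0.0], [1.0, 1.0]],
    cluster_std=[1.5, 0.5],
    random_state=0,
    shuffle=False,
)
print(X.shape, y.shape)  # (1050, 2) (1050,)
```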