# Uniform Convergence of Rank-weighted Learning

Justin Khim, Liu Leqi, Adarsh Prasad, Pradeep Ravikumar
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA. Correspondence to: Justin Khim.

Abstract. The decision-theoretic foundations of classical machine learning models have largely focused on estimating model parameters that minimize the expectation of a given loss function. However, as machine learning models are deployed in varied contexts, such as high-stakes decision-making and societal settings, it is clear that these models are not evaluated solely by their average performance. In this work, we propose and study a novel notion of L-Risk based on the classical idea of rank-weighted learning. These L-Risks, induced by rank-dependent weighting functions with bounded variation, unify popular risk measures such as conditional value-at-risk and those defined by cumulative prospect theory. We give uniform convergence bounds for this broad class of risk measures and study their consequences on a logistic regression example.

## 1. Introduction

The statistical decision-theoretic foundations of modern machine learning have largely focused on solving tasks by minimizing the expectation of some loss function. This ensures that the resulting models have high average-case performance. However, as machine learning models are deployed alongside humans in decision-making, it is clear that they are evaluated not only by their average-case performance but also by properties such as fairness. There has been increasing interest in capturing these additional properties via appropriate modifications of the classical objective of expected loss (Duchi et al., 2019; GarcΓ­a & FernΓ‘ndez, 2015; Sra et al., 2012).

In parallel, there is a long line of work exploring alternatives to risk measures based on expected loss in decision-making (Howard & Matheson, 1972) and in reinforcement learning, where percentile-based risk measures have been used to quantify the tail risk of models. A recent line of work borrows classical ideas from behavioral economics to make machine learning models more human-aligned. In particular, Prashanth et al. (2016) brought ideas from cumulative prospect theory (Tversky & Kahneman, 1992) into reinforcement learning and bandits, and Leqi et al. (2019) used cumulative prospect theory to introduce the notion of human-aligned risk measures.

A common theme in these prior works is the notion of rank-weighted risks. The aforementioned risk measures weight each loss by its relative rank, and are based upon classical rank-dependent expected utility theory (Diecidue & Wakker, 2001). These rank-dependent utilities have also been used in several different contexts in machine learning. For example, rank-weighted risks have been used to speed up the training of deep networks (Jiang et al., 2019). They have also played a key role in designing estimators that are robust to outliers in data; in particular, trimming procedures that simply discard data with high losses have been used to design outlier-robust estimators (Daniell, 1920; Bhatia et al., 2015; Lai et al., 2016; Prasad et al., 2018; Lugosi & Mendelson, 2019; Prasad et al., 2019; Shah et al., 2020).
While these rank-based objectives have found widespread use in machine learning, establishing their statistical properties has remained elusive. By contrast, we have a clear understanding of the generalization properties of average-case risk measures: loosely speaking, given a collection of models and finite training data, if we choose a model by minimizing the average training error, then we can roughly guarantee how well the chosen model performs on average. Such guarantees are typically obtained by studying the uniform convergence of average-case risk measures. However, as noted above, uniform convergence of rank-weighted risk measures has not been explored in detail. The difficulty comes from the weights depending on the whole data set, thereby inducing complex dependencies. Hence, existing work on generalization has either settled for the weaker notion of pointwise concentration bounds (Bhat, 2019; Duchi & Namkoong, 2018) or focused on specific forms of rank-weighted risk measures such as conditional value-at-risk (CVaR) (Duchi & Namkoong, 2018).

**Contributions.** In this work, we propose the study of rank-weighted risks and prove uniform convergence results. In particular, we propose a new notion of L-Risk in Section 2 that unifies existing rank-dependent risk measures, including CVaR and those defined by cumulative prospect theory. In Section 3, we present uniform convergence results and observe that the learning rate depends on the weighting function. In particular, when the weighting function is Lipschitz, we recover the standard $O(n^{-1/2})$ convergence rate. Finally, we instantiate our result on logistic regression in Section 4 and empirically study the convergence performance of the L-Risks in Section 6.

## 2. Background and Problem Setup

In this section, we provide the necessary background on rank-weighted risk minimization and introduce the notion of bounded variation functions that we consider in this work. Additionally, we provide standard learning-theoretic definitions and notation before moving on to our main results.

### 2.1. L-Risk Estimation

We assume that there is a joint probability distribution $P(X, Y)$ over the space $\mathcal Z = \mathcal X \times \mathcal Y$, and our goal is to learn a function $f: \mathcal X \to \mathcal Y$. In the standard decision-theoretic framework, $f^*$ is chosen from a class of functions $\mathcal F$ using a non-negative loss function $\ell: \mathcal F \times \mathcal Z \to \mathbb R_+$.

**Classical Risk Estimation.** In the traditional setting of risk minimization, the population risk of a function $f$ is the expectation of the loss $\ell(f, Z)$ when $Z$ is drawn according to the distribution $P$:
$$\mathcal R(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z)\big].$$
Given $n$ i.i.d. samples $\{Z_i\}_{i=1}^{n}$, empirical risk minimization substitutes the empirical risk
$$\mathcal R_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, Z_i)$$
for the population risk in the risk minimization objective.

**L-Risk.** As noted in Section 1, there are many scenarios in machine learning where we want to evaluate a function by metrics other than the average loss. To this end, we first define the notion of an L-statistic, which dates back to the classical work of Daniell (1920).

Definition 1. Let $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ be the order statistics of the sample $X_1, X_2, \ldots, X_n$. Then, an L-statistic is a linear combination of order statistics,
$$T_w\big(\{X_i\}_{i=1}^{n}\big) = \frac{1}{n}\sum_{i=1}^{n} w\!\left(\frac{i}{n}\right) X_{(i)},$$
where $w: [0, 1] \to [0, \infty)$ is a scoring function.
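As a quick illustration of Definition 1, the following sketch (our own illustration, assuming only NumPy; it is not part of the paper's released code) evaluates the L-statistic for a few scoring functions. With $w \equiv 1$ it recovers the sample mean, while the other two choices anticipate the CVaR and trimmed-mean weights of Section 2.2.

```python
import numpy as np

def l_statistic(x, w):
    """L-statistic of Definition 1: (1/n) * sum_i w(i/n) * x_(i)."""
    x_sorted = np.sort(x)                 # order statistics x_(1) <= ... <= x_(n)
    n = len(x_sorted)
    ranks = np.arange(1, n + 1) / n       # i/n for i = 1, ..., n
    return np.mean(w(ranks) * x_sorted)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

mean_w    = lambda t: np.ones_like(t)                                 # w = 1: the sample mean
cvar_w    = lambda t, a=0.1: (t >= 1 - a) / a                         # average of the top 10%
trimmed_w = lambda t, a=0.1: ((t >= a) & (t <= 1 - a)) / (1 - 2 * a)  # 10%-trimmed mean

print(l_statistic(x, mean_w), x.mean())   # the first two numbers agree
print(l_statistic(x, cvar_w))
print(l_statistic(x, trimmed_w))
```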
With the notion of an empirical L-statistic at hand, we define the empirical L-risk of a function $f$ by simply replacing the empirical average of the losses with the corresponding L-statistic.

Definition 2. The empirical L-risk of $f$ is
$$\mathcal{LR}_{w,n}(f) = \frac{1}{n}\sum_{i=1}^{n} w\!\left(\frac{i}{n}\right) \ell_{(i)}(f),$$
where $\ell_{(1)}(f) \le \cdots \le \ell_{(n)}(f)$ are the order statistics of the sample losses $\ell(f, Z_1), \ldots, \ell(f, Z_n)$.

Note that the empirical L-risk can alternatively be written in terms of the empirical cumulative distribution function $F_{f,n}$ of the sample losses $\{\ell(f, Z_i)\}_{i=1}^{n}$:
$$\mathcal{LR}_{w,n}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, Z_i)\, w\big(F_{f,n}(\ell(f, Z_i))\big). \tag{1}$$
Accordingly, the population L-risk of any function $f$ can be defined as
$$\mathcal{LR}_{w}(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z)\, w\big(F_f(\ell(f, Z))\big)\big], \tag{2}$$
where $F_f(\cdot)$ is the cumulative distribution function of $\ell(f, Z)$ for $Z$ drawn from $P$.

### 2.2. Illustrative Examples of L-Risk

The framework of risk minimization is a central paradigm of statistical estimation and is widely applicable. In this section, we provide illustrative examples showing that the L-risk generalizes the classical risk and encompasses several other notions of risk. To begin with, observe that by simply setting the weighting function to $w(t) = 1$ for all $t \in [0, 1]$, L-Risk minimization reduces to classical empirical risk minimization.

**Conditional Value-at-Risk (CVaR).** As noted in Section 1, in settings where low-probability events have catastrophic losses, using the classical risk is inappropriate. Conditional value-at-risk was introduced to handle such tail events; it measures the expected loss conditioned on the event that the loss exceeds a certain threshold. Moreover, CVaR has several desirable properties as a risk measure; in particular, it is convex and coherent (Krokhmal et al., 2013). Hence, CVaR is studied across a number of fields such as mathematical finance (Rockafellar et al., 2000), decision making, and more recently machine learning (Duchi et al., 2019). Formally, the CVaR of a function $f$ at confidence level $1 - \alpha$ is defined as
$$\mathcal R_{\mathrm{CVaR},\alpha}(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z) \,\big|\, \ell(f, Z) \ge \mathrm{VaR}_\alpha(\ell(f, Z))\big], \tag{3}$$
where $\mathrm{VaR}_\alpha(\ell(f, Z)) = \inf\{x : F_f(x) \ge 1 - \alpha\}$ is the value-at-risk. Observe that CVaR is a special case of the L-Risk in (2) and can be obtained by choosing $w(t) = \frac{1}{\alpha}\mathbf 1\{t \ge 1 - \alpha\}$, where $\mathbf 1\{\cdot\}$ is the indicator function.

**Human-Aligned Risk (HRM).** Cumulative prospect theory (CPT), which is motivated by empirical studies of human decision-making from behavioral economics (Tversky & Kahneman, 1992), has recently been studied in machine learning (Prashanth et al., 2016; Gopalan et al., 2017; Leqi et al., 2019). In particular, Leqi et al. (2019) proposed the following human-aligned risk objective,
$$\mathcal R_{\mathrm{HRM},a,b}(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z)\, w_{a,b}\big(F_f(\ell(f, Z))\big)\big], \quad\text{where}\quad w_{a,b}(t) = \frac{3 - 3b}{a^2 - a + 1}\big(3t^2 - 2(a+1)t + a\big) + 1.$$

**Trimmed Risk.** The trimmed mean is a measure of the central tendency of a distribution, calculated by discarding samples above and below certain thresholds and averaging the remaining samples. It is known to be more robust to outliers and heavy tails and is widely used across a variety of disciplines, such as finance and the aggregation of scores in sports. The trimmed risk of a function $f$ at trimming level $\alpha$ can be defined as
$$\mathcal R_{\mathrm{TRIM},\alpha}(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z) \,\big|\, F_f(\ell(f, Z)) \in [\alpha, 1 - \alpha]\big].$$
The trimmed risk is also a special case of the L-Risk in (2) and is obtained by setting $w(t) = \frac{1}{1 - 2\alpha}\mathbf 1\{\alpha \le t \le 1 - \alpha\}$.
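To make the equivalence between Definition 2 and equation (1) concrete, the sketch below (our own illustration, assuming NumPy; the exponential draws stand in for the sample losses $\{\ell(f, Z_i)\}$, and the HRM parameters $(a, b) = (0.6, 0.4)$ follow the paper's logistic regression experiment) evaluates the empirical L-risk both from the sorted losses and through the empirical CDF.

```python
import numpy as np

def hrm_weight(t, a=0.6, b=0.4):
    """Human-aligned risk weight w_{a,b}(t) of Leqi et al. (2019)."""
    return (3 - 3 * b) / (a ** 2 - a + 1) * (3 * t ** 2 - 2 * (a + 1) * t + a) + 1

def empirical_l_risk_sorted(losses, w):
    """Definition 2: (1/n) * sum_i w(i/n) * loss_(i)."""
    n = len(losses)
    return np.mean(w(np.arange(1, n + 1) / n) * np.sort(losses))

def empirical_l_risk_ecdf(losses, w):
    """Equation (1): (1/n) * sum_i loss_i * w(F_{f,n}(loss_i))."""
    n = len(losses)
    ecdf = np.searchsorted(np.sort(losses), losses, side="right") / n  # F_{f,n}(loss_i)
    return np.mean(losses * w(ecdf))

rng = np.random.default_rng(1)
losses = rng.exponential(size=500)                 # stand-in for {l(f, Z_i)}
print(empirical_l_risk_sorted(losses, hrm_weight))
print(empirical_l_risk_ecdf(losses, hrm_weight))   # identical when the losses are distinct
```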
### 2.3. Bounded Variation Weighting Functions

Recall from the previous section that the L-risk of any function $f$ depends crucially on the weighting function $w(\cdot)$. Moreover, for popular risk measures such as CVaR and the trimmed risk, this weighting function is neither differentiable nor Lipschitz. In this section, we formally introduce a class of weighting functions called bounded variation functions, which can be viewed as a strict generalization of Lipschitz functions (Carothers, 2000; Musielak & Orlicz, 1959). More formally, we define the $p$-variation as follows.

Definition 3. Let $f: [0, 1] \to \mathbb R$ be a function. Let $P = \{x_1, \ldots, x_N\} \subseteq [0, 1]$ be a partition of $[0, 1]$; without loss of generality, we assume $x_1 \le \cdots \le x_N$. Let $\mathcal P$ be the set of all partitions of $[0, 1]$, where the size of the partitions may vary between elements of $\mathcal P$. The $p$-variation of $f$ with respect to a partition $P$ is
$$V_p(f, P) = \left(\sum_{i=1}^{N-1} \big|f(x_{i+1}) - f(x_i)\big|^p\right)^{1/p}.$$
The $p$-variation of $f$ is $V_p(f) = \sup_{P\in\mathcal P} V_p(f, P)$.

When $p = 1$, the variation $V_p(f) = V_1(f)$ is also called the total variation. Moreover, when the weighting function is $\lambda$-Lipschitz, the $p$-variation is upper bounded by $\lambda$ for all $p \ge 1$. Table 1 summarizes the bounded variation constants for the aforementioned scoring functions; the proofs of these claims can be found in the Appendix.

Table 1. Bounded variation for different weight functions.

| L-Risk | $w(t)$ | $V_1(w)$ | $V_2(w)$ |
| --- | --- | --- | --- |
| $\mathcal R$ | $1$ | $0$ | $0$ |
| $\mathcal R_{\mathrm{CVaR},\alpha}$ | $\frac{1}{\alpha}\mathbf 1\{1-\alpha \le t\}$ | $\frac{1}{\alpha}$ | $\frac{1}{\alpha}$ |
| $\mathcal R_{\mathrm{HRM},a,b}$ | $\frac{3-3b}{a^2-a+1}\big(3t^2 - 2(a+1)t + a\big) + 1$ | $\frac{6(1-b)(2-a)}{a^2-a+1}$ | $\frac{6(1-b)(2-a)}{a^2-a+1}$ |
| $\mathcal R_{\mathrm{TRIM},\alpha}$ | $\frac{1}{1-2\alpha}\mathbf 1\{\alpha \le t \le 1-\alpha\}$ | $\frac{2}{1-2\alpha}$ | $\frac{\sqrt 2}{1-2\alpha}$ |

Moving forward, we work under the assumption that the weighting function $w(\cdot)$ has bounded variation.

Assumption 1. The weighting function $w$ has bounded variation $V_p(w)$ for $p = 1, 2$.

**Stability to $\ell_\infty$ Perturbations.** Under Assumption 1, we next present a deterministic bound which controls the stability of bounded-variation functions to $\ell_\infty$ perturbations.

Lemma 1. Let $P^* = \{\tfrac{i}{n}\}_{i=1}^{n}$ be the $n$-sized equally spaced partition of the interval $[0, 1]$. Then, for any perturbed partition $P = \{\tfrac{i}{n} + \Delta_i\}_{i=1}^{n}$ such that $\sup_i |\Delta_i| = \Delta$, we have
$$\sum_{i=1}^{n}\left|w\!\left(\frac{i}{n} + \Delta_i\right) - w\!\left(\frac{i}{n}\right)\right| \le \lceil 2n\Delta\rceil\, V_1(w).$$

Proof Sketch. The key idea is to construct sufficiently spaced-out partitions $\mathcal P$ so that, for all $i \in [n]$, there exists a partition $P \in \mathcal P$ such that both $i/n$ and $i/n + \Delta_i$ are in $P$ and no other points of $P$ lie between $i/n$ and $i/n + \Delta_i$. Since for all $i \in [n]$ we know that $i/n + \Delta_i \in (i/n - \Delta,\ i/n + \Delta)$, it suffices to have partitions that are $2\Delta$ spaced out. This implies that it suffices to use roughly $2\Delta/(1/n) = 2n\Delta$ partitions. Since $w$ has total variation $V_1(w)$, each partition contributes at most $V_1(w)$, which gives the desired result. Figure 1 illustrates the necessity of having spaced-out partitions.

Figure 1. An illustration of the partition argument. Here, $\Delta_i = \varepsilon_i$ and $\Delta = \varepsilon$. The blue points, $\{(i+1)/n,\ (i+1)/n + \varepsilon_{i+1}\}$, and the red points, $\{i/n,\ i/n + \varepsilon_i,\ (i+2)/n\}$, are in two separate partitions. We construct the partitions this way to ensure that $\varepsilon_{i+2}$ and $(i+2)/n$ do not lie between $i/n$ and $i/n + \varepsilon_i$, while the order of $i/n$, $\varepsilon_i$, $(i+1)/n$, and $\varepsilon_{i+1}$ does not matter because $i/n$ and $(i+1)/n$ are in separate partitions.

The above result is a key tool for studying uniform convergence of L-Risks with weighting functions of bounded variation. To the best of our knowledge it is novel and may be of independent interest.
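As a numerical check on Table 1, the sketch below (ours, assuming NumPy; not part of the paper's code) computes the $p$-variation of the three non-constant weight functions on a fine grid. For $p \ge 1$ and piecewise-monotone $w$, the supremum in Definition 3 is attained by breaking the partition at the local extrema, which is what the helper does. The CVaR and trimmed-weight outputs match $1/\alpha$, $2/(1-2\alpha)$, and $\sqrt 2/(1-2\alpha)$; for the smooth HRM weight the computed variation is finite, consistent with its Lipschitz continuity.

```python
import numpy as np

def p_variation(values, p):
    """p-variation of a piecewise-monotone function sampled on a grid.

    For p >= 1 the supremum over partitions is attained by splitting at the
    local extrema, so we sum |total change|^p over maximal monotone pieces.
    """
    diffs = np.diff(values)
    diffs = diffs[diffs != 0]
    if diffs.size == 0:
        return 0.0
    flips = np.flatnonzero(np.sign(diffs[1:]) != np.sign(diffs[:-1])) + 1
    pieces = np.split(diffs, flips)                       # maximal monotone pieces
    changes = np.array([abs(piece.sum()) for piece in pieces])
    return float((changes ** p).sum() ** (1.0 / p))

t = np.linspace(0.0, 1.0, 100_001)
alpha, a, b = 0.1, 0.6, 0.4
cvar_w = (t >= 1 - alpha) / alpha
trim_w = ((t >= alpha) & (t <= 1 - alpha)) / (1 - 2 * alpha)
hrm_w  = (3 - 3 * b) / (a ** 2 - a + 1) * (3 * t ** 2 - 2 * (a + 1) * t + a) + 1

for name, w in [("CVaR", cvar_w), ("trimmed", trim_w), ("HRM", hrm_w)]:
    print(name, p_variation(w, 1), p_variation(w, 2))
```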
### 2.4. Rademacher Complexity, Covering Numbers, and VC Dimension

Finally, we discuss the notions of function class complexity that we use in our results: Rademacher complexity, covering numbers, and VC dimension. We use all three because Rademacher complexity is often an intermediate step but is difficult to analyze directly; covering number bounds help untangle the product structure of the L-Risks; and, to deal with an empirical cumulative distribution function, it is ultimately simplest to use VC dimension.

We start with Rademacher complexity. Let $\sigma_1, \ldots, \sigma_n$ be i.i.d. Rademacher random variables, i.e., random variables that take the values $+1$ and $-1$ each with probability $1/2$. The empirical Rademacher complexity of a function class $\mathcal F$ given a sample $S = (Z_1, \ldots, Z_n)$ is
$$\hat{\mathfrak R}_n(\mathcal F) = \mathbb E_\sigma \sup_{f\in\mathcal F} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(Z_i). \tag{4}$$
The Rademacher complexity of $\mathcal F$ is then the expectation of the empirical Rademacher complexity with respect to the sample $S$, i.e., $\mathfrak R_n(\mathcal F) = \mathbb E_S \hat{\mathfrak R}_n(\mathcal F)$.

Given a class of functions $\mathcal F: \mathcal X \to \mathbb R$, a finite collection of functions $f_1, \ldots, f_N$ mapping $\mathcal X$ to $\mathbb R$ is called an $\varepsilon$-cover of $\mathcal F$ with respect to a semi-norm $\|\cdot\|$ if, for every $f \in \mathcal F$, we have $\min_{j=1,\ldots,N} \|f - f_j\| \le \varepsilon$. The $\varepsilon$-covering number of $\mathcal F$ with respect to $\|\cdot\|$ is the size of the smallest $\varepsilon$-cover of $\mathcal F$ and is denoted by $\mathcal N(\varepsilon, \mathcal F, \|\cdot\|)$. Bounds are often stated in terms of $\log \mathcal N(\varepsilon, \mathcal F, \|\cdot\|)$, which is also called the metric entropy of $\mathcal F$. One norm of particular interest is the empirical 2-norm: given $n$ samples $X_1, \ldots, X_n$, the empirical 2-norm of $f$ is
$$\|f\|_n = \left(\frac{1}{n}\sum_{i=1}^{n} f(X_i)^2\right)^{1/2}.$$

Finally, we discuss VC dimension. Let $G$ be a class of functions from $\mathcal X$ to some finite set, e.g., $\{0, 1\}$. We define the growth function $\Pi_G: \mathbb N \to \mathbb N$ by
$$\Pi_G(n) = \max_{x_1,\ldots,x_n\in\mathcal X} \big|\{(g(x_1), \ldots, g(x_n)) : g\in G\}\big|.$$
In words, this is the maximum number of ways that functions in $G$ may classify $n$ points. Now, suppose that the functions in $G$ map to a set of two classes, such as $\{0, 1\}$. Then, a set $S = (x_1, \ldots, x_n)$ of $n$ points in $\mathcal X$ is said to be shattered by $G$ if there are functions in $G$ realizing all possible label assignments, i.e.,
$$\Pi_G(n) = 2^n = \big|\{(g(x_1), \ldots, g(x_n)) : g\in G\}\big|.$$
Finally, the VC dimension of $G$, which we denote by $\mathrm{VC}(G)$, is given by
$$\mathrm{VC}(G) = \max\{n : \Pi_G(n) = 2^n\}. \tag{5}$$
In words, if the VC dimension of $G$ is $V$, then there is some set of $V$ points shattered by $G$. The class we are most interested in using with the VC dimension is
$$\mathcal L := \big\{g_{f,t}: \mathcal Z \to \{0, 1\} \,\big|\, g_{f,t}(z) = \mathbf 1\{\ell(f, z) \le t\} \text{ for } f\in\mathcal F,\ t\in[0, B_\ell]\big\}.$$
Thus, the VC dimension depends on both the function class $\mathcal F$ and the loss function. We note that it is not too large for logistic regression (Lemma 5), and for linear regression in $\mathbb R^d$ with squared error loss, the VC dimension is upper bounded by a constant times $d$ (Akama et al., 2010). This VC dimension plays a key role in the following assumption.

Assumption 2. Assume that
$$\sup_{f\in\mathcal F}\ \sup_{x\in[0, B_\ell]} \big|F_{f,n}(x) - F_f(x)\big| \le \varepsilon.$$

We make a few remarks about this assumption. First, it can be thought of as a Glivenko-Cantelli-type assumption, except that here we require uniformity over $\mathcal F$. Second, our main results relying on variation use this assumption in two places. Third, the assumption can be shown to hold with high probability with $\varepsilon = O(n^{-1/2})$ when $\mathcal L$ has bounded VC dimension, via a standard symmetrization and VC-dimension upper bound argument. We state this sufficient condition as an alternative assumption.

Assumption 2 (sufficient condition). The function class $\mathcal L$ has bounded VC dimension.

Consequently, it is natural to wonder whether this can be relaxed into a statement about the VC dimension of $\mathcal F$. Unfortunately, such a result depends on the loss function, so a general statement is unknown to the best of our knowledge. However, we prove it to be the case for logistic regression in Lemma 5, and the proof extends to other widely used continuous classification losses. For linear regression with squared error loss, the VC dimension is bounded by a constant times $d$ (Akama et al., 2010). We speculate that $\mathrm{VC}(\mathcal L)$ is on the order of $\mathrm{VC}(\mathcal F)$ for reasonable continuous losses.
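Assumption 2 is easy to probe empirically. The following sketch (our own illustration, assuming NumPy; the one-dimensional logistic model, the parameter grid, and the large "population" sample used in place of the true $F_f$ are all choices we make for illustration) estimates $\sup_{f}\sup_{x}|F_{f,n}(x) - F_f(x)|$ and lets one watch it shrink roughly like $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, theta_star=1.0):
    """Toy 1-d data: X ~ Uniform(-1, 1), P(Y = +1 | X) = (1 + theta_star * X) / 2."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(rng.uniform(size=n) < (1 + theta_star * x) / 2, 1.0, -1.0)
    return x, y

def losses(theta, x, y):
    return np.log1p(np.exp(-y * theta * x))          # logistic loss

def uniform_cdf_gap(n, n_pop=200_000, thetas=np.linspace(-1.5, 1.5, 50)):
    """Max over theta of the Kolmogorov distance between F_{f,n} and a proxy for F_f."""
    x_n, y_n = sample(n)
    x_p, y_p = sample(n_pop)
    gap = 0.0
    for theta in thetas:
        ln = np.sort(losses(theta, x_n, y_n))
        lp = np.sort(losses(theta, x_p, y_p))
        f_n = np.arange(1, n + 1) / n                          # F_{f,n} at its own jump points
        f_pop = np.searchsorted(lp, ln, side="right") / n_pop  # proxy for F_f at the same points
        gap = max(gap, float(np.max(np.abs(f_n - f_pop))))
    return gap

for n in [100, 1000, 10000]:
    print(n, uniform_cdf_gap(n))
```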
## 3. Uniform Convergence Results

In this section, we present our main generalization result in terms of a Rademacher complexity depending on $w$. Then, we specialize the upper bound using an entropy integral argument in two cases: (a) $w$ with bounded variation and (b) $w$ that is Lipschitz. The former result allows for far more general $w$, but our proofs lead to slower rates; at best, a rate of $O(n^{-1/4})$ can be achieved by instantiating our bounds, although a more refined argument could possibly improve this. The latter result for Lipschitz $w$ allows for the usual $O(n^{-1/2})$ learning rate.

We use $F_{\mathcal F}$ to denote the set $\{F_f\}_{f\in\mathcal F}$ and define $\ell(\mathcal F)\cdot w(F_{\mathcal F})$ to be the set $\{\ell_f\, w(F_f) \mid f\in\mathcal F\}$. We have the following bound.

Theorem 1. Let $\mathcal F$ be a set of predictors and let the loss function $\ell$ take values in $[0, B_\ell]$. Under Assumption 1, with probability at least $1 - \delta$ we have
$$\sup_{f\in\mathcal F}\big|\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)\big| \le 2\,\mathfrak R_n\big(\ell(\mathcal F)\cdot w(F_{\mathcal F})\big) + 4C B_\ell V_1(w)\,\mathfrak R_n(\mathcal L) + 6 B_\ell V_1(w)\sqrt{\frac{\log(2/\delta)}{2n}}.$$

In contrast to standard generalization bounds, our result has the terms $\mathfrak R_n(\mathcal L)$ and $\mathfrak R_n(\ell(\mathcal F)\cdot w(F_{\mathcal F}))$. Since the former is a Rademacher complexity of indicator variables, we simply use a VC-dimension upper bound. The VC dimension then needs to be analyzed for particular losses and $\mathcal F$; examples of losses and $\mathcal F$ that permit finite VC dimension include linear regression with squared error loss and arbitrary $\mathcal F$ with the logistic loss. To analyze the latter Rademacher complexity, we use the standard Dudley entropy integral in order to pass to covering numbers. From there, we can decompose the covering number into a product of covering numbers.

Lemma 2. Suppose that the loss function $\ell$ is $\lambda_\ell$-Lipschitz in $f(X)$ and bounded by $B_\ell$. Additionally, assume that $w$ is bounded by $B_w$. Then, we have
$$\mathcal N\big(t,\ \ell(\mathcal F)\cdot w(F_{\mathcal F}),\ \|\cdot\|_n\big) \le \mathcal N\!\left(\frac{t}{2 B_w \lambda_\ell},\ \mathcal F,\ \|\cdot\|_n\right)\cdot \mathcal N\!\left(\frac{t}{2 B_\ell},\ w(F_{\mathcal F}),\ \|\cdot\|_n\right).$$

Now, we use separate tools to analyze the covering number of $w(F_{\mathcal F})$ for $w$ of bounded variation and for Lipschitz $w$.

### 3.1. Weight Functions of Bounded Variation

In this section, we consider $w$ of bounded 2-variation $V_2(w)$; we have the following lemma.

Lemma 3. Let $\mathcal C(\varepsilon', F_{\mathcal F})$ be an $\varepsilon'$-cover of $F_{\mathcal F}$ in $\|\cdot\|_\infty$. Suppose that Assumption 2 holds with the $\varepsilon$ from the statement of the assumption. Then the set $w(\mathcal C(\varepsilon', F_{\mathcal F}))$ is a $V_2(w)\sqrt{3(\varepsilon + \varepsilon')}$-cover of $w(F_{\mathcal F})$ in $\|\cdot\|_n$.

Lemma 3 is one of our key technical and conceptual contributions: we can bound a covering number even though $w$ may be discontinuous, and this relies on constructing a number of partitions that can be upper bounded via the total variation, as mentioned previously. The shortcoming of this result is also clear: since $\varepsilon$ from Assumption 2 is of order $n^{-1/2}$, the cover is only at resolution $\varepsilon^{1/2} = n^{-1/4}$.

Corollary 1. Let $\mathcal F$ be a set of predictors, and let the loss function $\ell$ take values in $[0, B_\ell]$ and be $\lambda_\ell$-Lipschitz in $f(X)$. If $w$ is bounded by $B_w$ and the VC dimension $V$ of $\mathcal L$ satisfies $1 \le V \le n$, then under Assumption 1 there exists a universal constant $C$ such that with probability at least $1 - \delta$ we have
$$\sup_{f\in\mathcal F}\big|\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)\big| \le \inf_{\eta > 0}\left\{ C\eta + \frac{C}{\sqrt n}\int_\eta^{B_\ell B_w} \sqrt{\log\mathcal N\!\left(\frac{t}{2B_w\lambda_\ell},\ \mathcal F,\ \|\cdot\|_n\right) + \log\mathcal N\!\left(\frac{t^2}{12 B_\ell^2 V_2^2(w)} - \varepsilon,\ F_{\mathcal F},\ \|\cdot\|_\infty\right)}\, dt \right\} + 4C B_\ell V_1(w)\sqrt{\frac{V}{n}} + 6 B_\ell V_1(w)\sqrt{\frac{\log(2/\delta)}{2n}},$$
where $\varepsilon = 2C\sqrt{V/n} + \sqrt{\log(2/\delta)/(2n)}$ is the quantity from Assumption 2.
### 3.2. Lipschitz Weight Functions

In this section, we consider the far simpler case in which $w$ is $\lambda_w$-Lipschitz. Again, we start with a bound on covering numbers.

Lemma 4. When $w$ is $\lambda_w$-Lipschitz, we have the covering number bound
$$\mathcal N\big(t,\ w(F_{\mathcal F}),\ \|\cdot\|_n\big) \le \mathcal N\big(t/\lambda_w,\ F_{\mathcal F},\ \|\cdot\|_\infty\big).$$

This leads naturally to a uniform convergence bound.

Corollary 2. Let $\mathcal F$ be a set of predictors, and let the loss function $\ell$ take values in $[0, B_\ell]$ and be $\lambda_\ell$-Lipschitz in $f(X)$. If $w$ is bounded by $B_w$ and is $\lambda_w$-Lipschitz, and the VC dimension $V$ of $\mathcal L$ satisfies $1 \le V \le n$, then there exists a universal constant $C$ such that with probability at least $1 - \delta$ we have
$$\sup_{f\in\mathcal F}\big|\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)\big| \le \inf_{\eta > 0}\left\{ C\eta + \frac{C}{\sqrt n}\int_\eta^{B_\ell B_w} \sqrt{\log\mathcal N\!\left(\frac{t}{2B_w\lambda_\ell},\ \mathcal F,\ \|\cdot\|_n\right) + \log\mathcal N\!\left(\frac{t}{2B_\ell\lambda_w},\ F_{\mathcal F},\ \|\cdot\|_\infty\right)}\, dt \right\} + 4C B_\ell V_1(w)\sqrt{\frac{V}{n}} + 6 B_\ell V_1(w)\sqrt{\frac{\log(2/\delta)}{2n}}.$$

Note the key difference between Corollary 1 and Corollary 2. The presence of the $t^2/(12 B_\ell^2 V_2^2(w)) - \varepsilon$ radius in the former forces $\eta$ to be non-zero and, further, of order $\sqrt\varepsilon$; this leads to the slow rate of convergence. In the latter corollary this is not a problem, so we obtain the standard rate of convergence.

## 4. Logistic Regression Example

We consider a basic example for logistic regression. To instantiate the bounds, we need two things: (1) a bound on the VC dimension of $\mathcal L$ and (2) a covering number bound on $F_{\mathcal F}$, the distributions of losses over the predictor class $\mathcal F$. Thus, we specify a distribution for $(X, Y)$. Let $S^{d-1} = \{x\in\mathbb R^d : \|x\| = 1\}$ denote the $(d-1)$-dimensional sphere in $\mathbb R^d$, and let $B(d) = \{x\in\mathbb R^d : \|x\| \le 1\}$ denote the unit ball in $\mathbb R^d$. Let $\theta^* \sim \mathrm{Uniform}(S^{d-1})$ be the true regression vector. Suppose that $X \sim \mathrm{Uniform}(B(d))$ and that $Y$ takes the value $+1$ with probability $p = (1 + \theta^{*\top} X)/2$ and $-1$ otherwise.

Next, we restrict the class of regressors $\mathcal F$. We set
$$\mathcal F = \big\{f : f(x) = \theta^\top x \text{ for } \theta\in\mathbb R^d \text{ such that } r_1 \le \|\theta\| \le r_2\big\}, \tag{6}$$
where $r_1, r_2 > 0$ are fixed constants. We may refer directly to the parametrization $\theta$ for simplicity. Note that bounding $\theta$ away from the zero vector is necessary: as $\theta$ approaches the zero vector, the distribution of the losses approaches a step function, making a cover impossible. Finally, recall that the logistic loss is
$$\ell(f, Z) = \log\big(1 + e^{-Y f(X)}\big).$$

Now, we start with our first lemma, on VC dimension.

Lemma 5. Let $\mathcal F$ be the class of linear functions $f(x) = \theta^\top x + b$ for $x\in\mathbb R^d$. Then, the VC dimension of $\mathcal L$ for the logistic loss satisfies $\mathrm{VC}(\mathcal L) \le d + 2$.

This type of bound is not particularly surprising, and other bounds on the VC dimension of $\mathcal L$ in terms of the VC dimension of $\mathcal F$ are known for simple losses. Next, we consider the bound on the covering numbers of the losses.

Lemma 6. Let $(X, Y)$ have the distribution defined above, and let $\mathcal F$ be as in equation (6). Then, $F_{\mathcal F}$ admits an $\varepsilon$-entropy bound of the form
$$\log\mathcal N\big(\varepsilon,\ F_{\mathcal F},\ \|\cdot\|_\infty\big) \le C(\mathcal F)\log(1/\varepsilon).$$
To the best of our knowledge, such bounds on sets of cumulative distribution functions are novel.

Finally, we may apply these lemmas to obtain the following corollaries for weight functions of bounded variation and for Lipschitz weight functions. We start with the former.

Corollary 3. Let $(X, Y)$ have the distribution defined above, and let $\mathcal F$ be as in equation (6). Suppose that the logistic loss is bounded by $B_\ell$ on $B(d)\times\{+1, -1\}$ and is $\lambda_\ell$-Lipschitz with respect to $f(x)$ for all $f\in\mathcal F$. Suppose that $w$ has bounded first and second variations $V_1(w)$ and $V_2(w)$. Then with probability at least $1 - \delta$ we have
$$\sup_{f\in\mathcal F}\big|\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)\big| \le C\big(\mathcal F, d, \delta, B_\ell, B_w, V_1(w), V_2(w)\big)\, n^{-1/4}.$$

Next, we consider Lipschitz weight functions.
Let (𝑋, π‘Œ) have the distribution defined above. Let β„±be as defined in equation (6). Suppose that the logistic loss is bounded by 𝐡ℓon 𝐡(𝑑) {+1, 1} and is πœ†β„“-Lipschitz with respect to 𝑓(π‘₯) for all 𝑓in β„±. Finally, suppose that 𝑀is πœ†π‘€-Lipschitz, and let 𝐢be some universal constant. Then with probability at least 1 𝛿, we have sup 𝑓 β„± |ℒℛ𝑀,𝑛(𝑓) ℒℛ𝑀(𝑓)| 𝐢(β„±, 𝑑, 𝛿, 𝐡ℓ, 𝐡𝑀, πœ†β„“)𝑛 1 Here, we obtain the desired rates of convergence showing that the generalization bound may be instantiated and lead to quantifiable learning rates. Rank-weighted Learning In this section, we prove our main theorem (Theorem 1). Lemmas are proved in the Appendix. The first key is to choose the correct decomposition to analyze the main estimator. This is a necessary step because we cannot directly apply standard generalization results to obtain the bound for rank-weighted estimators since the terms {𝐹𝑓,𝑛(ℓ𝑓,𝑖)}𝑛 𝑖=1 have used the samples {β„“(𝑓, 𝑍𝑖)}𝑛 𝑖=1 twice, and so there is dependence across samples. Our goal is to use symmetrization with the main error. Lemma 7. We have the inequality P ( sup 𝑓 β„± |ℒℛ𝑀,𝑛(𝑓) ℒℛ𝑀(𝑓)| > 𝑑+ 𝑑 ) P ( sup 𝑓 β„± 𝑖=1 ℓ𝑓,𝑖𝑀(𝐹𝑓(ℓ𝑓,𝑖)) > 𝑑 ) + P ( sup 𝑓 β„± 𝑖=1 ℓ𝑓,𝑖𝑀(𝐹𝑓(ℓ𝑓,𝑖)) ℒℛ𝑀(𝑓) > 𝑑 ) The proof of the lemma is immediate, but it sets the stage to analyze the risk. Next, we present a lemma to deal with the first term on the right hand side of Lemma 7. Lemma 8. For some universal constant 𝐢, we have 𝑖=1 ℓ𝑓,𝑖𝑀(𝐹𝑓(ℓ𝑓,𝑖)) > 4𝐡ℓ𝑉1(𝑀) R𝑛(β„’) + 6𝐡ℓ𝑉1(𝑀) The proof of this lemma is given in the Appendix due to space concerns. However, note that this is the second lemma that makes use of partitions towards using first bounded variations in its proof. To deal with the second term of Lemma 7, we use a standard Rademacher complexity bound that we provide in the Appendix, which yields 𝑖=1 ℓ𝑓,𝑖𝑀(𝐹𝑓(ℓ𝑓,𝑖)) ℒℛ𝑀(𝑓) > 2 R𝑛(β„“(β„±) 𝑀(𝐹ℱ)) + 3 6. Experiments Following our theoretical analysis, we test our results on logistic regression and linear regression. We describe our setups in Section 6.1, the metrics as well as the optimization methods we have used in Section 6.2, and finally discuss our results in Sections 6.3. 6.1. Setups In the logistic regression setup, the features are drawn from 𝑋 Uniform(𝐡(𝑑)) where 𝐡(𝑑) is the 𝑑-dimension ball. The labels are sampled from a distribution over { 1, 1} where π‘Œtakes the value +1 with probability (1+𝑋 πœƒ*)/2. where πœƒ* = (1, 0, 0, , 0) R𝑑. We use the logistic loss β„“(πœƒ; (𝑋, π‘Œ)) = log ( 1 + exp( π‘Œπ‘‹ πœƒ) ) . In the linear regression experiment, we draw our covariates from a Gaussian 𝑋 𝒩(0, I𝑑) in R𝑑. The noise distribution is fixed as πœ€ 𝒩(0, 0.01). We draw our response variable π‘Œas, π‘Œ= 𝑋 πœƒ* + πœ€where πœƒ* = (1, 1, , 1) R𝑑. We fix the squared error β„“(πœƒ; (𝑋, π‘Œ)) = 1 2(π‘Œ 𝑋 πœƒ)2 as our loss function. 6.2. Metrics and Methods For each setting, we look at two metrics: (a) approximate uniform error; and (b) training and testing performance of the empirical minimizer of different L-Risks. We approximate the uniform error when 𝑑= 1 in the following manner: for logistic regression, we pick the interval ( 1.5, 1.5) to be our parameter space Θ and construct a grid of size 200 where each grid point represents a model πœƒ. The true L-Risk for each πœƒis then approximated by the empirical L-Risk using 20, 000 samples. The empirical L-Risk is calculated using sample size 𝑛ranging from 20 to 22, 100. 
### 6.2. Metrics and Methods

For each setting, we examine two metrics: (a) the approximate uniform error, and (b) the training and testing performance of the empirical minimizer of different L-Risks. We approximate the uniform error when $d = 1$ as follows. For logistic regression, we take the interval $(-1.5, 1.5)$ as our parameter space $\Theta$ and construct a grid of size 200, where each grid point represents a model $\theta$. The true L-Risk of each $\theta$ is approximated by the empirical L-Risk computed on 20,000 samples. The empirical L-Risk is computed for sample sizes $n$ ranging from 20 to 22,100. Finally, the approximated uniform error at each sample size $n$ is obtained by taking the maximum over all $\theta$ of the difference between the empirical and (approximated) true L-Risk. In the linear regression setup, the parameter space is chosen to be $(-15, 15)$ with a grid of size 500.

To explore the training and testing performance of the empirical minimizer of different L-Risks, we perform the following iterative optimization procedure to obtain a heuristic minimizer:
$$\theta_{t+1} = \theta_t - \gamma_t \sum_{i=1}^{n} w_i^t\, \nabla_\theta \ell(\theta_t; Z_i), \tag{7}$$
for all $t\in\{0, \ldots, T-1\}$, where $w_i^t = w\big(F_n(\ell(\theta_t; Z_i))\big)$ is the rank-dependent weight and $\gamma_t$ is the learning rate. Although the empirical L-Risks are in general non-convex in the model parameter $\theta$, there are special cases such as CVaR, which has a convex dual form that is easy to optimize (Rockafellar et al., 2000):
$$\mathcal R_{\mathrm{CVaR},\alpha}(f) = \inf_{\eta\in\mathbb R}\left\{\eta + \frac{1}{\alpha}\mathbb E\big[\max(\ell_f - \eta, 0)\big]\right\}. \tag{8}$$
In such cases, we compare the heuristic method with the procedure that, at each iteration, first optimizes the L-Risk with respect to $\eta$ and then with respect to $\theta$.
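The update (7) is ordinary gradient descent with per-sample gradients reweighted by the rank of the current losses. The sketch below is our own illustration (assuming NumPy; the CVaR weight, the step size, the one-dimensional toy data, and the folding of the $1/n$ factor into the step size are all illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def cvar_weight(t, alpha=0.1):
    return (t >= 1 - alpha) / alpha

def logistic_loss(theta, x, y):
    return np.log1p(np.exp(-y * (x @ theta)))

def logistic_grad(theta, x, y):
    s = 1.0 / (1.0 + np.exp(y * (x @ theta)))            # sigmoid(-y x.theta)
    return (-y * s)[:, None] * x                         # per-sample gradients

def rank_weighted_gd(x, y, loss, grad, w, lr=0.5, steps=300):
    """Heuristic L-risk minimizer, equation (7); w_i^t = w(F_n(loss_i at theta_t))."""
    n, d = x.shape
    theta = np.zeros(d)
    for _ in range(steps):
        per_sample = loss(theta, x, y)
        ranks = np.searchsorted(np.sort(per_sample), per_sample, side="right") / n
        theta -= lr * np.mean(w(ranks)[:, None] * grad(theta, x, y), axis=0)
    return theta

# toy run on 1-d data resembling the logistic setup of Section 6.1
x = rng.uniform(-1.0, 1.0, size=(2000, 1))
y = np.where(rng.uniform(size=2000) < (1 + x[:, 0]) / 2, 1.0, -1.0)
print(rank_weighted_gd(x, y, logistic_loss, logistic_grad, cvar_weight))
```

For CVaR, this heuristic can be cross-checked against the dual form (8) by alternating minimization over $\eta$ and $\theta$, which is the comparison discussed in Section 6.3.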
Figure 2. Experimental results for logistic regression, averaged over 20 seeds. (a) Log-log plot of the approximated uniform error against the sample size $n$. (b) $d = 5$; $n$ from 20 to 22,100. (c) $n = 2000$; $d$ from 7 to 55. Panels (b) and (c) are log-log plots of the performance of the empirical minimizer of different L-Risks with respect to sample size and dimension. For HRM, we chose $a = 0.6$ and $b = 0.4$; for CVaR and the trimmed mean, $\alpha = 0.1$. The results suggest similar convergence behavior across the different L-Risks.

Figure 3. Experimental results for linear regression, averaged over 20 seeds. (a) Log-log plot of the approximated uniform error against the sample size $n$. (b) $d = 5$; $n$ from 20 to 22,100. (c) $n = 2000$; $d$ from 7 to 55. Panels (b) and (c) are log-log plots of the performance of the empirical minimizer of different L-Risks with respect to sample size and dimension. For HRM, we chose $a = 0.3$ and $b = 0.4$; for CVaR and the trimmed mean, $\alpha = 0.1$. The results suggest similar convergence behavior across the different L-Risks.

### 6.3. Results

Our results for logistic regression are given in Figure 2, and our results for linear regression are given in Figure 3. Figures 2a and 3a are the log-log plots of the approximated uniform error against the sample size in the two experimental settings. We fit a line to the data of each L-Risk to examine the rate of convergence of these approximated uniform errors. In both the logistic regression and linear regression experiments, we see that rank-weighting does not have much effect on the convergence rate. In the logistic regression case, the $n^{-1/2}$ convergence rate of HRM matches our expectation, since its weight function is Lipschitz. The convergence rates for CVaR and the trimmed mean suggest that there might be tighter uniform convergence results for weight functions of bounded variation. Figures 2b, 2c, 3b, and 3c show the behavior of the L-Risk of the empirical minimizers; empirically, the convergence performance across the different L-Risks is similar. Similar to the log-log plots, this may suggest that there are faster rates of convergence than those obtained in Section 3, especially for weighting functions that are not Lipschitz but have bounded variation.

Finally, we remark on optimizing CVaR via its rank-weighted formulation versus its convex dual. As shown in the Appendix, when optimizing CVaR, the heuristic method in equation (7) improves more quickly than the convex dual form in equation (8), although the dual form achieves a marginally better solution in terms of training error. Practically, the comparable performance of the rank-weighted optimization perspective and the provably convex perspective, in the special case where the two can be compared, is encouraging for other weight functions. The code for the experiments can be found at https://bit.ly/2YzwRkJ.

## 7. Discussion

In this work, we study uniform convergence of rank-weighted risks, which we define as L-Risks. Unlike classical population risks, the L-Risk utilizes the ranking of the losses. There are a number of future directions in the study of L-Risks. One direction, suggested by our experimental results in Section 6, is to obtain faster rates of convergence for L-Risks with weighting functions of bounded variation. It is also of interest to obtain lower bounds on the convergence rate for L-Risks with non-Lipschitz weighting functions. Finally, since L-Risks are in general non-convex with respect to the model parameters, using them to learn machine learning models requires investigating algorithms for optimizing them. In particular, it would be useful to better understand simple optimization methods for arbitrary rank-weighting functions, such as the iterative approach used in our experiments.

## Acknowledgements

We acknowledge the support of NSF via IIS-1909816 and ONR via N000141812861.

## References

Akama, Y., Irie, K., Kawamura, A., and Uwano, Y. VC dimensions of principal component analysis. Discrete & Computational Geometry, 44(3):589-598, 2010.

Bhat, S. P. Improved concentration bounds for conditional value-at-risk and cumulative prospect theory using Wasserstein distance. arXiv preprint arXiv:1902.10709, 2019.

Bhatia, K., Jain, P., and Kar, P. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pp. 721-729, 2015.

Carothers, N. L. Real Analysis. Cambridge University Press, 2000.

Daniell, P. Observations weighted according to order. American Journal of Mathematics, 42(4):222-236, 1920.

Diecidue, E. and Wakker, P. P. On the intuition of rank-dependent utility. Journal of Risk and Uncertainty, 23(3):281-298, 2001.

Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Duchi, J. C., Hashimoto, T., and Namkoong, H. Distributionally robust losses against mixture covariate shifts. Preprint, 2019.

GarcΓ­a, J. and FernΓ‘ndez, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437-1480, 2015.

Gopalan, A., Prashanth, L., Fu, M., and Marcus, S. Weighted bandits or: How bandits learn distorted values that are not expected. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Howard, R. A. and Matheson, J. E. Risk-sensitive Markov decision processes. Management Science, 18(7):356-369, 1972.
Jiang, A. H., Wong, D. L.-K., Zhou, G., Andersen, D. G., Dean, J., Ganger, G. R., Joshi, G., Kaminsky, M., Kozuch, M., Lipton, Z. C., et al. Accelerating deep learning by focusing on the biggest losers. arXiv preprint arXiv:1910.00762, 2019.

Krokhmal, P., Zabarankin, M., and Uryasev, S. Modeling and optimization of risk. In Handbook of the Fundamentals of Financial Decision Making: Part II, pp. 555-600. World Scientific, 2013.

Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 665-674. IEEE, 2016.

Leqi, L., Prasad, A., and Ravikumar, P. On human-aligned risk minimization. In Advances in Neural Information Processing Systems, 2019.

Lugosi, G. and Mendelson, S. Robust multivariate mean estimation: the optimality of trimmed mean. arXiv preprint arXiv:1907.11391, 2019.

Musielak, J. and Orlicz, W. On generalized variations (I). Studia Mathematica, 18(1):11-41, 1959.

Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Prasad, A., Balakrishnan, S., and Ravikumar, P. A unified approach to robust mean estimation. arXiv preprint arXiv:1907.00927, 2019.

Prashanth, L., Jie, C., Fu, M., Marcus, S., and SzepesvΓ‘ri, C. Cumulative prospect theory meets reinforcement learning: Prediction and control. In International Conference on Machine Learning, pp. 1406-1415, 2016.

Rockafellar, R. T., Uryasev, S., et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21-42, 2000.

Shah, V., Wu, X., and Sanghavi, S. Choosing the sample with lowest loss makes SGD robust. arXiv preprint arXiv:2001.03316, 2020.

Sra, S., Nowozin, S., and Wright, S. J. Optimization for Machine Learning. MIT Press, 2012.

Tversky, A. and Kahneman, D. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297-323, 1992.