# Uniform Convergence of Rank-weighted Learning

Justin Khim, Liu Leqi, Adarsh Prasad, Pradeep Ravikumar
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA. Correspondence to: Justin Khim.

Abstract. The decision-theoretic foundations of classical machine learning models have largely focused on estimating model parameters that minimize the expectation of a given loss function. However, as machine learning models are deployed in varied contexts, such as high-stakes decision-making and societal settings, it is clear that these models are not evaluated solely by their average performance. In this work, we propose and study a novel notion of L-Risk based on the classical idea of rank-weighted learning. These L-Risks, induced by rank-dependent weighting functions with bounded variation, unify popular risk measures such as conditional value-at-risk and those defined by cumulative prospect theory. We give uniform convergence bounds for this broad class of risk measures and study their consequences on a logistic regression example.

## 1. Introduction

The statistical decision-theoretic foundations of modern machine learning have largely focused on solving tasks by minimizing the expectation of some loss function. This ensures that the resulting models have high average-case performance. However, as machine learning models are deployed alongside humans in decision-making, it is clear that they are evaluated not only by their average-case performance but also by properties such as fairness. There has been increasing interest in capturing these additional properties via appropriate modifications of the classical objective of expected loss (Duchi et al., 2019; GarcΓ­a & FernΓ‘ndez, 2015; Sra et al., 2012).

In parallel, there is a long line of work exploring alternatives to risk measures based on expected loss in decision-making (Howard & Matheson, 1972) and in reinforcement learning, where percentile-based risk measures have been used to quantify the tail risk of models. A recent line of work borrows classical ideas from behavioral economics to make machine learning models more human-aligned. In particular, Prashanth et al. (2016) brought ideas from cumulative prospect theory (Tversky & Kahneman, 1992) into reinforcement learning and bandits, and Leqi et al. (2019) used cumulative prospect theory to introduce the notion of human-aligned risk measures.

A common theme in these prior works is the notion of rank-weighted risks. The aforementioned risk measures weight each loss by its relative rank, and are based upon classical rank-dependent expected utility theory (Diecidue & Wakker, 2001). These rank-dependent utilities have also been used in several different contexts in machine learning. For example, rank-weighted risks have been used to speed up the training of deep networks (Jiang et al., 2019). They have also played a key role in designing estimators that are robust to outliers in data; in particular, trimming procedures that simply discard data with high losses have been used to design outlier-robust estimators (Daniell, 1920; Bhatia et al., 2015; Lai et al., 2016; Prasad et al., 2018; Lugosi & Mendelson, 2019; Prasad et al., 2019; Shah et al., 2020).
While these rank-based objectives have found widespread use in machine learning, establishing their statistical properties has remained elusive. By contrast, we have a clear understanding of the generalization properties of average-case risk measures: loosely speaking, given a collection of models and finite training data, if we choose a model by minimizing the average training error, then we can roughly guarantee how well the chosen model performs on average. Such guarantees are typically obtained by studying the uniform convergence of average-case risk measures. However, as noted above, uniform convergence of rank-weighted risk measures has not been explored in detail. The difficulty comes from the weights depending on the whole data set, thereby inducing complex dependencies. Hence, existing work on generalization has either settled for the weaker notion of pointwise concentration bounds (Bhat, 2019; Duchi & Namkoong, 2018) or focused on specific forms of rank-weighted risk measures such as conditional value-at-risk (CVaR) (Duchi & Namkoong, 2018).

**Contributions.** In this work, we propose the study of rank-weighted risks and prove uniform convergence results. In particular, we propose a new notion of L-Risk in Section 2 that unifies existing rank-dependent risk measures, including CVaR and those defined by cumulative prospect theory. In Section 3, we present uniform convergence results and observe that the learning rate depends on the weighting function. In particular, when the weighting function is Lipschitz, we recover the standard $O(n^{-1/2})$ convergence rate. Finally, we instantiate our result on logistic regression in Section 4 and empirically study the convergence performance of the L-Risks in Section 6.

## 2. Background and Problem Setup

In this section, we provide the necessary background on rank-weighted risk minimization and introduce the notion of bounded variation functions that we consider in this work. Additionally, we provide standard learning-theoretic definitions and notation before moving on to our main results.

### 2.1. L-Risk Estimation

We assume that there is a joint probability distribution $P(X, Y)$ over the space $\mathcal Z = \mathcal X \times \mathcal Y$, and our goal is to learn a function $f: \mathcal X \to \mathcal Y$. In the standard decision-theoretic framework, $f^*$ is chosen from a class of functions $\mathcal F$ using a non-negative loss function $\ell: \mathcal F \times \mathcal Z \to \mathbb R_+$.

**Classical Risk Estimation.** In the traditional setting of risk minimization, the population risk of a function $f$ is the expectation of the loss $\ell(f, Z)$ when $Z$ is drawn according to the distribution $P$:
$$\mathcal R(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z)\big].$$
Given $n$ i.i.d. samples $\{Z_i\}_{i=1}^{n}$, empirical risk minimization substitutes the empirical risk
$$\mathcal R_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, Z_i)$$
for the population risk in the risk minimization objective.

**L-Risk.** As noted in Section 1, there are many scenarios in machine learning where we want to evaluate a function by metrics other than the average loss. To this end, we first define the notion of an L-statistic, which dates back to the classical work of Daniell (1920).

Definition 1. Let $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ be the order statistics of the sample $X_1, X_2, \ldots, X_n$. Then, an L-statistic is a linear combination of order statistics,
$$T_w\big(\{X_i\}_{i=1}^{n}\big) = \frac{1}{n}\sum_{i=1}^{n} w\!\left(\frac{i}{n}\right) X_{(i)},$$
where $w: [0, 1] \to [0, \infty)$ is a scoring function.
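As a quick illustration of Definition 1, the following sketch (our own illustration, assuming only NumPy; it is not part of the paper's released code) evaluates the L-statistic for a few scoring functions. With $w \equiv 1$ it recovers the sample mean, while the other two choices anticipate the CVaR and trimmed-mean weights of Section 2.2.

```python
import numpy as np

def l_statistic(x, w):
    """L-statistic of Definition 1: (1/n) * sum_i w(i/n) * x_(i)."""
    x_sorted = np.sort(x)                 # order statistics x_(1) <= ... <= x_(n)
    n = len(x_sorted)
    ranks = np.arange(1, n + 1) / n       # i/n for i = 1, ..., n
    return np.mean(w(ranks) * x_sorted)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

mean_w    = lambda t: np.ones_like(t)                                 # w = 1: the sample mean
cvar_w    = lambda t, a=0.1: (t >= 1 - a) / a                         # average of the top 10%
trimmed_w = lambda t, a=0.1: ((t >= a) & (t <= 1 - a)) / (1 - 2 * a)  # 10%-trimmed mean

print(l_statistic(x, mean_w), x.mean())   # the first two numbers agree
print(l_statistic(x, cvar_w))
print(l_statistic(x, trimmed_w))
```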
With the notion of an empirical L-statistic at hand, we define the empirical L-risk of a function $f$ by simply replacing the empirical average of the losses with the corresponding L-statistic.

Definition 2. The empirical L-risk of $f$ is
$$\mathcal{LR}_{w,n}(f) = \frac{1}{n}\sum_{i=1}^{n} w\!\left(\frac{i}{n}\right) \ell_{(i)}(f),$$
where $\ell_{(1)}(f) \le \cdots \le \ell_{(n)}(f)$ are the order statistics of the sample losses $\ell(f, Z_1), \ldots, \ell(f, Z_n)$.

Note that the empirical L-risk can alternatively be written in terms of the empirical cumulative distribution function $F_{f,n}$ of the sample losses $\{\ell(f, Z_i)\}_{i=1}^{n}$:
$$\mathcal{LR}_{w,n}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, Z_i)\, w\big(F_{f,n}(\ell(f, Z_i))\big). \tag{1}$$
Accordingly, the population L-risk of any function $f$ can be defined as
$$\mathcal{LR}_{w}(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z)\, w\big(F_f(\ell(f, Z))\big)\big], \tag{2}$$
where $F_f(\cdot)$ is the cumulative distribution function of $\ell(f, Z)$ for $Z$ drawn from $P$.

### 2.2. Illustrative Examples of L-Risk

The framework of risk minimization is a central paradigm of statistical estimation and is widely applicable. In this section, we provide illustrative examples showing that the L-risk generalizes the classical risk and encompasses several other notions of risk. To begin with, observe that by simply setting the weighting function to $w(t) = 1$ for all $t \in [0, 1]$, L-Risk minimization reduces to classical empirical risk minimization.

**Conditional Value-at-Risk (CVaR).** As noted in Section 1, in settings where low-probability events have catastrophic losses, using the classical risk is inappropriate. Conditional value-at-risk was introduced to handle such tail events; it measures the expected loss conditioned on the event that the loss exceeds a certain threshold. Moreover, CVaR has several desirable properties as a risk measure; in particular, it is convex and coherent (Krokhmal et al., 2013). Hence, CVaR is studied across a number of fields such as mathematical finance (Rockafellar et al., 2000), decision making, and more recently machine learning (Duchi et al., 2019). Formally, the CVaR of a function $f$ at confidence level $1 - \alpha$ is defined as
$$\mathcal R_{\mathrm{CVaR},\alpha}(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z) \,\big|\, \ell(f, Z) \ge \mathrm{VaR}_\alpha(\ell(f, Z))\big], \tag{3}$$
where $\mathrm{VaR}_\alpha(\ell(f, Z)) = \inf\{x : F_f(x) \ge 1 - \alpha\}$ is the value-at-risk. Observe that CVaR is a special case of the L-Risk in (2) and can be obtained by choosing $w(t) = \frac{1}{\alpha}\mathbf 1\{t \ge 1 - \alpha\}$, where $\mathbf 1\{\cdot\}$ is the indicator function.

**Human-Aligned Risk (HRM).** Cumulative prospect theory (CPT), which is motivated by empirical studies of human decision-making from behavioral economics (Tversky & Kahneman, 1992), has recently been studied in machine learning (Prashanth et al., 2016; Gopalan et al., 2017; Leqi et al., 2019). In particular, Leqi et al. (2019) proposed the following human-aligned risk objective,
$$\mathcal R_{\mathrm{HRM},a,b}(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z)\, w_{a,b}\big(F_f(\ell(f, Z))\big)\big], \quad\text{where}\quad w_{a,b}(t) = \frac{3 - 3b}{a^2 - a + 1}\big(3t^2 - 2(a+1)t + a\big) + 1.$$

**Trimmed Risk.** The trimmed mean is a measure of the central tendency of a distribution, calculated by discarding samples above and below certain thresholds and averaging the remaining samples. It is known to be more robust to outliers and heavy tails and is widely used across a variety of disciplines, such as finance and the aggregation of scores in sports. The trimmed risk of a function $f$ at trimming level $\alpha$ can be defined as
$$\mathcal R_{\mathrm{TRIM},\alpha}(f) = \mathbb E_{Z\sim P}\big[\ell(f, Z) \,\big|\, F_f(\ell(f, Z)) \in [\alpha, 1 - \alpha]\big].$$
The trimmed risk is also a special case of the L-Risk in (2) and is obtained by setting $w(t) = \frac{1}{1 - 2\alpha}\mathbf 1\{\alpha \le t \le 1 - \alpha\}$.
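To make the equivalence between Definition 2 and equation (1) concrete, the sketch below (our own illustration, assuming NumPy; the exponential draws stand in for the sample losses $\{\ell(f, Z_i)\}$, and the HRM parameters $(a, b) = (0.6, 0.4)$ follow the paper's logistic regression experiment) evaluates the empirical L-risk both from the sorted losses and through the empirical CDF.

```python
import numpy as np

def hrm_weight(t, a=0.6, b=0.4):
    """Human-aligned risk weight w_{a,b}(t) of Leqi et al. (2019)."""
    return (3 - 3 * b) / (a ** 2 - a + 1) * (3 * t ** 2 - 2 * (a + 1) * t + a) + 1

def empirical_l_risk_sorted(losses, w):
    """Definition 2: (1/n) * sum_i w(i/n) * loss_(i)."""
    n = len(losses)
    return np.mean(w(np.arange(1, n + 1) / n) * np.sort(losses))

def empirical_l_risk_ecdf(losses, w):
    """Equation (1): (1/n) * sum_i loss_i * w(F_{f,n}(loss_i))."""
    n = len(losses)
    ecdf = np.searchsorted(np.sort(losses), losses, side="right") / n  # F_{f,n}(loss_i)
    return np.mean(losses * w(ecdf))

rng = np.random.default_rng(1)
losses = rng.exponential(size=500)                 # stand-in for {l(f, Z_i)}
print(empirical_l_risk_sorted(losses, hrm_weight))
print(empirical_l_risk_ecdf(losses, hrm_weight))   # identical when the losses are distinct
```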
### 2.3. Bounded Variation Weighting Functions

Recall from the previous section that the L-risk of any function $f$ depends crucially on the weighting function $w(\cdot)$. Moreover, for popular risk measures such as CVaR and the trimmed risk, this weighting function is neither differentiable nor Lipschitz. In this section, we formally introduce a class of weighting functions called bounded variation functions, which can be viewed as a strict generalization of Lipschitz functions (Carothers, 2000; Musielak & Orlicz, 1959). More formally, we define the $p$-variation as follows.

Definition 3. Let $f: [0, 1] \to \mathbb R$ be a function. Let $P = \{x_1, \ldots, x_N\} \subseteq [0, 1]$ be a partition of $[0, 1]$; without loss of generality, we assume $x_1 \le \cdots \le x_N$. Let $\mathcal P$ be the set of all partitions of $[0, 1]$, where the size of the partitions may vary between elements of $\mathcal P$. The $p$-variation of $f$ with respect to a partition $P$ is
$$V_p(f, P) = \left(\sum_{i=1}^{N-1} \big|f(x_{i+1}) - f(x_i)\big|^p\right)^{1/p}.$$
The $p$-variation of $f$ is $V_p(f) = \sup_{P\in\mathcal P} V_p(f, P)$.

When $p = 1$, the variation $V_p(f) = V_1(f)$ is also called the total variation. Moreover, when the weighting function is $\lambda$-Lipschitz, the $p$-variation is upper bounded by $\lambda$ for all $p \ge 1$. Table 1 summarizes the bounded variation constants for the aforementioned scoring functions; the proofs of these claims can be found in the Appendix.

Table 1. Bounded variation for different weight functions.

| L-Risk | $w(t)$ | $V_1(w)$ | $V_2(w)$ |
| --- | --- | --- | --- |
| $\mathcal R$ | $1$ | $0$ | $0$ |
| $\mathcal R_{\mathrm{CVaR},\alpha}$ | $\frac{1}{\alpha}\mathbf 1\{1-\alpha \le t\}$ | $\frac{1}{\alpha}$ | $\frac{1}{\alpha}$ |
| $\mathcal R_{\mathrm{HRM},a,b}$ | $\frac{3-3b}{a^2-a+1}\big(3t^2 - 2(a+1)t + a\big) + 1$ | $\frac{6(1-b)(2-a)}{a^2-a+1}$ | $\frac{6(1-b)(2-a)}{a^2-a+1}$ |
| $\mathcal R_{\mathrm{TRIM},\alpha}$ | $\frac{1}{1-2\alpha}\mathbf 1\{\alpha \le t \le 1-\alpha\}$ | $\frac{2}{1-2\alpha}$ | $\frac{\sqrt 2}{1-2\alpha}$ |

Moving forward, we work under the assumption that the weighting function $w(\cdot)$ has bounded variation.

Assumption 1. The weighting function $w$ has bounded variation $V_p(w)$ for $p = 1, 2$.

**Stability to $\ell_\infty$ Perturbations.** Under Assumption 1, we next present a deterministic bound which controls the stability of bounded-variation functions to $\ell_\infty$ perturbations.

Lemma 1. Let $P^* = \{\tfrac{i}{n}\}_{i=1}^{n}$ be the $n$-sized equally spaced partition of the interval $[0, 1]$. Then, for any perturbed partition $P = \{\tfrac{i}{n} + \Delta_i\}_{i=1}^{n}$ such that $\sup_i |\Delta_i| = \Delta$, we have
$$\sum_{i=1}^{n}\left|w\!\left(\frac{i}{n} + \Delta_i\right) - w\!\left(\frac{i}{n}\right)\right| \le \lceil 2n\Delta\rceil\, V_1(w).$$

Proof Sketch. The key idea is to construct sufficiently spaced-out partitions $\mathcal P$ so that, for all $i \in [n]$, there exists a partition $P \in \mathcal P$ such that both $i/n$ and $i/n + \Delta_i$ are in $P$ and no other points of $P$ lie between $i/n$ and $i/n + \Delta_i$. Since for all $i \in [n]$ we know that $i/n + \Delta_i \in (i/n - \Delta,\ i/n + \Delta)$, it suffices to have partitions that are $2\Delta$ spaced out. This implies that it suffices to use roughly $2\Delta/(1/n) = 2n\Delta$ partitions. Since $w$ has total variation $V_1(w)$, each partition contributes at most $V_1(w)$, which gives the desired result. Figure 1 illustrates the necessity of having spaced-out partitions.

Figure 1. An illustration of the partition argument. Here, $\Delta_i = \varepsilon_i$ and $\Delta = \varepsilon$. The blue points, $\{(i+1)/n,\ (i+1)/n + \varepsilon_{i+1}\}$, and the red points, $\{i/n,\ i/n + \varepsilon_i,\ (i+2)/n\}$, are in two separate partitions. We construct the partitions this way to ensure that $\varepsilon_{i+2}$ and $(i+2)/n$ do not lie between $i/n$ and $i/n + \varepsilon_i$, while the order of $i/n$, $\varepsilon_i$, $(i+1)/n$, and $\varepsilon_{i+1}$ does not matter because $i/n$ and $(i+1)/n$ are in separate partitions.

The above result is a key tool for studying uniform convergence of L-Risks with weighting functions of bounded variation. To the best of our knowledge it is novel and may be of independent interest.
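As a numerical check on Table 1, the sketch below (ours, assuming NumPy; not part of the paper's code) computes the $p$-variation of the three non-constant weight functions on a fine grid. For $p \ge 1$ and piecewise-monotone $w$, the supremum in Definition 3 is attained by breaking the partition at the local extrema, which is what the helper does. The CVaR and trimmed-weight outputs match $1/\alpha$, $2/(1-2\alpha)$, and $\sqrt 2/(1-2\alpha)$; for the smooth HRM weight the computed variation is finite, consistent with its Lipschitz continuity.

```python
import numpy as np

def p_variation(values, p):
    """p-variation of a piecewise-monotone function sampled on a grid.

    For p >= 1 the supremum over partitions is attained by splitting at the
    local extrema, so we sum |total change|^p over maximal monotone pieces.
    """
    diffs = np.diff(values)
    diffs = diffs[diffs != 0]
    if diffs.size == 0:
        return 0.0
    flips = np.flatnonzero(np.sign(diffs[1:]) != np.sign(diffs[:-1])) + 1
    pieces = np.split(diffs, flips)                       # maximal monotone pieces
    changes = np.array([abs(piece.sum()) for piece in pieces])
    return float((changes ** p).sum() ** (1.0 / p))

t = np.linspace(0.0, 1.0, 100_001)
alpha, a, b = 0.1, 0.6, 0.4
cvar_w = (t >= 1 - alpha) / alpha
trim_w = ((t >= alpha) & (t <= 1 - alpha)) / (1 - 2 * alpha)
hrm_w  = (3 - 3 * b) / (a ** 2 - a + 1) * (3 * t ** 2 - 2 * (a + 1) * t + a) + 1

for name, w in [("CVaR", cvar_w), ("trimmed", trim_w), ("HRM", hrm_w)]:
    print(name, p_variation(w, 1), p_variation(w, 2))
```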
### 2.4. Rademacher Complexity, Covering Numbers, and VC Dimension

Finally, we discuss the notions of function class complexity that we use in our results: Rademacher complexity, covering numbers, and VC dimension. We use all three because Rademacher complexity is often an intermediate step but is difficult to analyze directly; covering number bounds help untangle the product structure of the L-Risks; and, to deal with an empirical cumulative distribution function, it is ultimately simplest to use VC dimension.

We start with Rademacher complexity. Let $\sigma_1, \ldots, \sigma_n$ be i.i.d. Rademacher random variables, i.e., random variables that take the values $+1$ and $-1$ each with probability $1/2$. The empirical Rademacher complexity of a function class $\mathcal F$ given a sample $S = (Z_1, \ldots, Z_n)$ is
$$\hat{\mathfrak R}_n(\mathcal F) = \mathbb E_\sigma \sup_{f\in\mathcal F} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(Z_i). \tag{4}$$
The Rademacher complexity of $\mathcal F$ is then the expectation of the empirical Rademacher complexity with respect to the sample $S$, i.e., $\mathfrak R_n(\mathcal F) = \mathbb E_S \hat{\mathfrak R}_n(\mathcal F)$.

Given a class of functions $\mathcal F: \mathcal X \to \mathbb R$, a finite collection of functions $f_1, \ldots, f_N$ mapping $\mathcal X$ to $\mathbb R$ is called an $\varepsilon$-cover of $\mathcal F$ with respect to a semi-norm $\|\cdot\|$ if, for every $f \in \mathcal F$, we have $\min_{j=1,\ldots,N} \|f - f_j\| \le \varepsilon$. The $\varepsilon$-covering number of $\mathcal F$ with respect to $\|\cdot\|$ is the size of the smallest $\varepsilon$-cover of $\mathcal F$ and is denoted by $\mathcal N(\varepsilon, \mathcal F, \|\cdot\|)$. Bounds are often stated in terms of $\log \mathcal N(\varepsilon, \mathcal F, \|\cdot\|)$, which is also called the metric entropy of $\mathcal F$. One norm of particular interest is the empirical 2-norm: given $n$ samples $X_1, \ldots, X_n$, the empirical 2-norm of $f$ is
$$\|f\|_n = \left(\frac{1}{n}\sum_{i=1}^{n} f(X_i)^2\right)^{1/2}.$$

Finally, we discuss VC dimension. Let $G$ be a class of functions from $\mathcal X$ to some finite set, e.g., $\{0, 1\}$. We define the growth function $\Pi_G: \mathbb N \to \mathbb N$ by
$$\Pi_G(n) = \max_{x_1,\ldots,x_n\in\mathcal X} \big|\{(g(x_1), \ldots, g(x_n)) : g\in G\}\big|.$$
In words, this is the maximum number of ways that functions in $G$ may classify $n$ points. Now, suppose that the functions in $G$ map to a set of two classes, such as $\{0, 1\}$. Then, a set $S = (x_1, \ldots, x_n)$ of $n$ points in $\mathcal X$ is said to be shattered by $G$ if there are functions in $G$ realizing all possible label assignments, i.e.,
$$\Pi_G(n) = 2^n = \big|\{(g(x_1), \ldots, g(x_n)) : g\in G\}\big|.$$
Finally, the VC dimension of $G$, which we denote by $\mathrm{VC}(G)$, is given by
$$\mathrm{VC}(G) = \max\{n : \Pi_G(n) = 2^n\}. \tag{5}$$
In words, if the VC dimension of $G$ is $V$, then there is some set of $V$ points shattered by $G$. The class we are most interested in using with the VC dimension is
$$\mathcal L := \big\{g_{f,t}: \mathcal Z \to \{0, 1\} \,\big|\, g_{f,t}(z) = \mathbf 1\{\ell(f, z) \le t\} \text{ for } f\in\mathcal F,\ t\in[0, B_\ell]\big\}.$$
Thus, the VC dimension depends on both the function class $\mathcal F$ and the loss function. We note that it is not too large for logistic regression (Lemma 5), and for linear regression in $\mathbb R^d$ with squared error loss, the VC dimension is upper bounded by a constant times $d$ (Akama et al., 2010). This VC dimension plays a key role in the following assumption.

Assumption 2. Assume that
$$\sup_{f\in\mathcal F}\ \sup_{x\in[0, B_\ell]} \big|F_{f,n}(x) - F_f(x)\big| \le \varepsilon.$$

We make a few remarks about this assumption. First, it can be thought of as a Glivenko-Cantelli-type assumption, except that here we require uniformity over $\mathcal F$. Second, our main results relying on variation use this assumption in two places. Third, the assumption can be shown to hold with high probability with $\varepsilon = O(n^{-1/2})$ when $\mathcal L$ has bounded VC dimension, via a standard symmetrization and VC-dimension upper bound argument. We state this sufficient condition as an alternative assumption.

Assumption 2 (sufficient condition). The function class $\mathcal L$ has bounded VC dimension.

Consequently, it is natural to wonder whether this can be relaxed into a statement about the VC dimension of $\mathcal F$. Unfortunately, such a result depends on the loss function, so a general statement is unknown to the best of our knowledge. However, we prove it to be the case for logistic regression in Lemma 5, and the proof extends to other widely used continuous classification losses. For linear regression with squared error loss, the VC dimension is bounded by a constant times $d$ (Akama et al., 2010). We speculate that $\mathrm{VC}(\mathcal L)$ is on the order of $\mathrm{VC}(\mathcal F)$ for reasonable continuous losses.
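Assumption 2 is easy to probe empirically. The following sketch (our own illustration, assuming NumPy; the one-dimensional logistic model, the parameter grid, and the large "population" sample used in place of the true $F_f$ are all choices we make for illustration) estimates $\sup_{f}\sup_{x}|F_{f,n}(x) - F_f(x)|$ and lets one watch it shrink roughly like $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, theta_star=1.0):
    """Toy 1-d data: X ~ Uniform(-1, 1), P(Y = +1 | X) = (1 + theta_star * X) / 2."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(rng.uniform(size=n) < (1 + theta_star * x) / 2, 1.0, -1.0)
    return x, y

def losses(theta, x, y):
    return np.log1p(np.exp(-y * theta * x))          # logistic loss

def uniform_cdf_gap(n, n_pop=200_000, thetas=np.linspace(-1.5, 1.5, 50)):
    """Max over theta of the Kolmogorov distance between F_{f,n} and a proxy for F_f."""
    x_n, y_n = sample(n)
    x_p, y_p = sample(n_pop)
    gap = 0.0
    for theta in thetas:
        ln = np.sort(losses(theta, x_n, y_n))
        lp = np.sort(losses(theta, x_p, y_p))
        f_n = np.arange(1, n + 1) / n                          # F_{f,n} at its own jump points
        f_pop = np.searchsorted(lp, ln, side="right") / n_pop  # proxy for F_f at the same points
        gap = max(gap, float(np.max(np.abs(f_n - f_pop))))
    return gap

for n in [100, 1000, 10000]:
    print(n, uniform_cdf_gap(n))
```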
## 3. Uniform Convergence Results

In this section, we present our main generalization result in terms of a Rademacher complexity depending on $w$. Then, we specialize the upper bound using an entropy integral argument in two cases: (a) $w$ with bounded variation and (b) $w$ that is Lipschitz. The former result allows for far more general $w$, but our proofs lead to slower rates; at best, a rate of $O(n^{-1/4})$ can be achieved by instantiating our bounds, although a more refined argument could possibly improve this. The latter result for Lipschitz $w$ allows for the usual $O(n^{-1/2})$ learning rate.

We use $F_{\mathcal F}$ to denote the set $\{F_f\}_{f\in\mathcal F}$ and define $\ell(\mathcal F)\cdot w(F_{\mathcal F})$ to be the set $\{\ell_f\, w(F_f) \mid f\in\mathcal F\}$. We have the following bound.

Theorem 1. Let $\mathcal F$ be a set of predictors and let the loss function $\ell$ take values in $[0, B_\ell]$. Under Assumption 1, with probability at least $1 - \delta$ we have
$$\sup_{f\in\mathcal F}\big|\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)\big| \le 2\,\mathfrak R_n\big(\ell(\mathcal F)\cdot w(F_{\mathcal F})\big) + 4C B_\ell V_1(w)\,\mathfrak R_n(\mathcal L) + 6 B_\ell V_1(w)\sqrt{\frac{\log(2/\delta)}{2n}}.$$

In contrast to standard generalization bounds, our result has the terms $\mathfrak R_n(\mathcal L)$ and $\mathfrak R_n(\ell(\mathcal F)\cdot w(F_{\mathcal F}))$. Since the former is a Rademacher complexity of indicator variables, we simply use a VC-dimension upper bound. The VC dimension then needs to be analyzed for particular losses and $\mathcal F$; examples of losses and $\mathcal F$ that permit finite VC dimension include linear regression with squared error loss and arbitrary $\mathcal F$ with the logistic loss. To analyze the latter Rademacher complexity, we use the standard Dudley entropy integral in order to pass to covering numbers. From there, we can decompose the covering number into a product of covering numbers.

Lemma 2. Suppose that the loss function $\ell$ is $\lambda_\ell$-Lipschitz in $f(X)$ and bounded by $B_\ell$. Additionally, assume that $w$ is bounded by $B_w$. Then, we have
$$\mathcal N\big(t,\ \ell(\mathcal F)\cdot w(F_{\mathcal F}),\ \|\cdot\|_n\big) \le \mathcal N\!\left(\frac{t}{2 B_w \lambda_\ell},\ \mathcal F,\ \|\cdot\|_n\right)\cdot \mathcal N\!\left(\frac{t}{2 B_\ell},\ w(F_{\mathcal F}),\ \|\cdot\|_n\right).$$

Now, we use separate tools to analyze the covering number of $w(F_{\mathcal F})$ for $w$ of bounded variation and for Lipschitz $w$.

### 3.1. Weight Functions of Bounded Variation

In this section, we consider $w$ of bounded 2-variation $V_2(w)$; we have the following lemma.

Lemma 3. Let $\mathcal C(\varepsilon', F_{\mathcal F})$ be an $\varepsilon'$-cover of $F_{\mathcal F}$ in $\|\cdot\|_\infty$. Suppose that Assumption 2 holds with the $\varepsilon$ from the statement of the assumption. Then the set $w(\mathcal C(\varepsilon', F_{\mathcal F}))$ is a $V_2(w)\sqrt{3(\varepsilon + \varepsilon')}$-cover of $w(F_{\mathcal F})$ in $\|\cdot\|_n$.

Lemma 3 is one of our key technical and conceptual contributions: we can bound a covering number even though $w$ may be discontinuous, and this relies on constructing a number of partitions that can be upper bounded via the total variation, as mentioned previously. The shortcoming of this result is also clear: since $\varepsilon$ from Assumption 2 is of order $n^{-1/2}$, the cover is only at resolution $\varepsilon^{1/2} = n^{-1/4}$.

Corollary 1. Let $\mathcal F$ be a set of predictors, and let the loss function $\ell$ take values in $[0, B_\ell]$ and be $\lambda_\ell$-Lipschitz in $f(X)$. If $w$ is bounded by $B_w$ and the VC dimension $V$ of $\mathcal L$ satisfies $1 \le V \le n$, then under Assumption 1 there exists a universal constant $C$ such that with probability at least $1 - \delta$ we have
$$\sup_{f\in\mathcal F}\big|\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)\big| \le \inf_{\eta > 0}\left\{ C\eta + \frac{C}{\sqrt n}\int_\eta^{B_\ell B_w} \sqrt{\log\mathcal N\!\left(\frac{t}{2B_w\lambda_\ell},\ \mathcal F,\ \|\cdot\|_n\right) + \log\mathcal N\!\left(\frac{t^2}{12 B_\ell^2 V_2^2(w)} - \varepsilon,\ F_{\mathcal F},\ \|\cdot\|_\infty\right)}\, dt \right\} + 4C B_\ell V_1(w)\sqrt{\frac{V}{n}} + 6 B_\ell V_1(w)\sqrt{\frac{\log(2/\delta)}{2n}},$$
where $\varepsilon = 2C\sqrt{V/n} + \sqrt{\log(2/\delta)/(2n)}$ is the quantity from Assumption 2.
### 3.2. Lipschitz Weight Functions

In this section, we consider the far simpler case in which $w$ is $\lambda_w$-Lipschitz. Again, we start with a bound on covering numbers.

Lemma 4. When $w$ is $\lambda_w$-Lipschitz, we have the covering number bound
$$\mathcal N\big(t,\ w(F_{\mathcal F}),\ \|\cdot\|_n\big) \le \mathcal N\big(t/\lambda_w,\ F_{\mathcal F},\ \|\cdot\|_\infty\big).$$

This leads naturally to a uniform convergence bound.

Corollary 2. Let $\mathcal F$ be a set of predictors, and let the loss function $\ell$ take values in $[0, B_\ell]$ and be $\lambda_\ell$-Lipschitz in $f(X)$. If $w$ is bounded by $B_w$ and is $\lambda_w$-Lipschitz, and the VC dimension $V$ of $\mathcal L$ satisfies $1 \le V \le n$, then there exists a universal constant $C$ such that with probability at least $1 - \delta$ we have
$$\sup_{f\in\mathcal F}\big|\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)\big| \le \inf_{\eta > 0}\left\{ C\eta + \frac{C}{\sqrt n}\int_\eta^{B_\ell B_w} \sqrt{\log\mathcal N\!\left(\frac{t}{2B_w\lambda_\ell},\ \mathcal F,\ \|\cdot\|_n\right) + \log\mathcal N\!\left(\frac{t}{2B_\ell\lambda_w},\ F_{\mathcal F},\ \|\cdot\|_\infty\right)}\, dt \right\} + 4C B_\ell V_1(w)\sqrt{\frac{V}{n}} + 6 B_\ell V_1(w)\sqrt{\frac{\log(2/\delta)}{2n}}.$$

Note the key difference between Corollary 1 and Corollary 2. The presence of the $t^2/(12 B_\ell^2 V_2^2(w)) - \varepsilon$ radius in the former forces $\eta$ to be non-zero and, further, of order $\sqrt\varepsilon$; this leads to the slow rate of convergence. In the latter corollary this is not a problem, so we obtain the standard rate of convergence.

## 4. Logistic Regression Example

We consider a basic example for logistic regression. To instantiate the bounds, we need two things: (1) a bound on the VC dimension of $\mathcal L$ and (2) a covering number bound on $F_{\mathcal F}$, the distributions of losses over the predictor class $\mathcal F$. Thus, we specify a distribution for $(X, Y)$. Let $S^{d-1} = \{x\in\mathbb R^d : \|x\| = 1\}$ denote the $(d-1)$-dimensional sphere in $\mathbb R^d$, and let $B(d) = \{x\in\mathbb R^d : \|x\| \le 1\}$ denote the unit ball in $\mathbb R^d$. Let $\theta^* \sim \mathrm{Uniform}(S^{d-1})$ be the true regression vector. Suppose that $X \sim \mathrm{Uniform}(B(d))$ and that $Y$ takes the value $+1$ with probability $p = (1 + \theta^{*\top} X)/2$ and $-1$ otherwise.

Next, we restrict the class of regressors $\mathcal F$. We set
$$\mathcal F = \big\{f : f(x) = \theta^\top x \text{ for } \theta\in\mathbb R^d \text{ such that } r_1 \le \|\theta\| \le r_2\big\}, \tag{6}$$
where $r_1, r_2 > 0$ are fixed constants. We may refer directly to the parametrization $\theta$ for simplicity. Note that bounding $\theta$ away from the zero vector is necessary: as $\theta$ approaches the zero vector, the distribution of the losses approaches a step function, making a cover impossible. Finally, recall that the logistic loss is
$$\ell(f, Z) = \log\big(1 + e^{-Y f(X)}\big).$$

Now, we start with our first lemma, on VC dimension.

Lemma 5. Let $\mathcal F$ be the class of linear functions $f(x) = \theta^\top x + b$ for $x\in\mathbb R^d$. Then, the VC dimension of $\mathcal L$ for the logistic loss satisfies $\mathrm{VC}(\mathcal L) \le d + 2$.

This type of bound is not particularly surprising, and other bounds on the VC dimension of $\mathcal L$ in terms of the VC dimension of $\mathcal F$ are known for simple losses. Next, we consider the bound on the covering numbers of the losses.

Lemma 6. Let $(X, Y)$ have the distribution defined above, and let $\mathcal F$ be as in equation (6). Then, $F_{\mathcal F}$ admits an $\varepsilon$-entropy bound of the form
$$\log\mathcal N\big(\varepsilon,\ F_{\mathcal F},\ \|\cdot\|_\infty\big) \le C(\mathcal F)\log(1/\varepsilon).$$
To the best of our knowledge, such bounds on sets of cumulative distribution functions are novel.

Finally, we may apply these lemmas to obtain the following corollaries for weight functions of bounded variation and for Lipschitz weight functions. We start with the former.

Corollary 3. Let $(X, Y)$ have the distribution defined above, and let $\mathcal F$ be as in equation (6). Suppose that the logistic loss is bounded by $B_\ell$ on $B(d)\times\{+1, -1\}$ and is $\lambda_\ell$-Lipschitz with respect to $f(x)$ for all $f\in\mathcal F$. Suppose that $w$ has bounded first and second variations $V_1(w)$ and $V_2(w)$. Then with probability at least $1 - \delta$ we have
$$\sup_{f\in\mathcal F}\big|\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)\big| \le C\big(\mathcal F, d, \delta, B_\ell, B_w, V_1(w), V_2(w)\big)\, n^{-1/4}.$$

Next, we consider Lipschitz weight functions.
Let (𝑋, π‘Œ) have the distribution defined above. Let β„±be as defined in equation (6). Suppose that the logistic loss is bounded by 𝐡ℓon 𝐡(𝑑) {+1, 1} and is πœ†β„“-Lipschitz with respect to 𝑓(π‘₯) for all 𝑓in β„±. Finally, suppose that 𝑀is πœ†π‘€-Lipschitz, and let 𝐢be some universal constant. Then with probability at least 1 𝛿, we have sup 𝑓 β„± |ℒℛ𝑀,𝑛(𝑓) ℒℛ𝑀(𝑓)| 𝐢(β„±, 𝑑, 𝛿, 𝐡ℓ, 𝐡𝑀, πœ†β„“)𝑛 1 Here, we obtain the desired rates of convergence showing that the generalization bound may be instantiated and lead to quantifiable learning rates. Rank-weighted Learning In this section, we prove our main theorem (Theorem 1). Lemmas are proved in the Appendix. The first key is to choose the correct decomposition to analyze the main estimator. This is a necessary step because we cannot directly apply standard generalization results to obtain the bound for rank-weighted estimators since the terms {𝐹𝑓,𝑛(ℓ𝑓,𝑖)}𝑛 𝑖=1 have used the samples {β„“(𝑓, 𝑍𝑖)}𝑛 𝑖=1 twice, and so there is dependence across samples. Our goal is to use symmetrization with the main error. Lemma 7. We have the inequality P ( sup 𝑓 β„± |ℒℛ𝑀,𝑛(𝑓) ℒℛ𝑀(𝑓)| > 𝑑+ 𝑑 ) P ( sup 𝑓 β„± 𝑖=1 ℓ𝑓,𝑖𝑀(𝐹𝑓(ℓ𝑓,𝑖)) > 𝑑 ) + P ( sup 𝑓 β„± 𝑖=1 ℓ𝑓,𝑖𝑀(𝐹𝑓(ℓ𝑓,𝑖)) ℒℛ𝑀(𝑓) > 𝑑 ) The proof of the lemma is immediate, but it sets the stage to analyze the risk. Next, we present a lemma to deal with the first term on the right hand side of Lemma 7. Lemma 8. For some universal constant 𝐢, we have 𝑖=1 ℓ𝑓,𝑖𝑀(𝐹𝑓(ℓ𝑓,𝑖)) > 4𝐡ℓ𝑉1(𝑀) R𝑛(β„’) + 6𝐡ℓ𝑉1(𝑀) The proof of this lemma is given in the Appendix due to space concerns. However, note that this is the second lemma that makes use of partitions towards using first bounded variations in its proof. To deal with the second term of Lemma 7, we use a standard Rademacher complexity bound that we provide in the Appendix, which yields 𝑖=1 ℓ𝑓,𝑖𝑀(𝐹𝑓(ℓ𝑓,𝑖)) ℒℛ𝑀(𝑓) > 2 R𝑛(β„“(β„±) 𝑀(𝐹ℱ)) + 3 6. Experiments Following our theoretical analysis, we test our results on logistic regression and linear regression. We describe our setups in Section 6.1, the metrics as well as the optimization methods we have used in Section 6.2, and finally discuss our results in Sections 6.3. 6.1. Setups In the logistic regression setup, the features are drawn from 𝑋 Uniform(𝐡(𝑑)) where 𝐡(𝑑) is the 𝑑-dimension ball. The labels are sampled from a distribution over { 1, 1} where π‘Œtakes the value +1 with probability (1+𝑋 πœƒ*)/2. where πœƒ* = (1, 0, 0, , 0) R𝑑. We use the logistic loss β„“(πœƒ; (𝑋, π‘Œ)) = log ( 1 + exp( π‘Œπ‘‹ πœƒ) ) . In the linear regression experiment, we draw our covariates from a Gaussian 𝑋 𝒩(0, I𝑑) in R𝑑. The noise distribution is fixed as πœ€ 𝒩(0, 0.01). We draw our response variable π‘Œas, π‘Œ= 𝑋 πœƒ* + πœ€where πœƒ* = (1, 1, , 1) R𝑑. We fix the squared error β„“(πœƒ; (𝑋, π‘Œ)) = 1 2(π‘Œ 𝑋 πœƒ)2 as our loss function. 6.2. Metrics and Methods For each setting, we look at two metrics: (a) approximate uniform error; and (b) training and testing performance of the empirical minimizer of different L-Risks. We approximate the uniform error when 𝑑= 1 in the following manner: for logistic regression, we pick the interval ( 1.5, 1.5) to be our parameter space Θ and construct a grid of size 200 where each grid point represents a model πœƒ. The true L-Risk for each πœƒis then approximated by the empirical L-Risk using 20, 000 samples. The empirical L-Risk is calculated using sample size 𝑛ranging from 20 to 22, 100. 
### 6.2. Metrics and Methods

For each setting, we examine two metrics: (a) the approximate uniform error, and (b) the training and testing performance of the empirical minimizer of different L-Risks. We approximate the uniform error when $d = 1$ as follows. For logistic regression, we take the interval $(-1.5, 1.5)$ as our parameter space $\Theta$ and construct a grid of size 200, where each grid point represents a model $\theta$. The true L-Risk of each $\theta$ is approximated by the empirical L-Risk computed on 20,000 samples. The empirical L-Risk is computed for sample sizes $n$ ranging from 20 to 22,100. Finally, the approximated uniform error at each sample size $n$ is obtained by taking the maximum over all $\theta$ of the difference between the empirical and (approximated) true L-Risk. In the linear regression setup, the parameter space is chosen to be $(-15, 15)$ with a grid of size 500.

To explore the training and testing performance of the empirical minimizer of different L-Risks, we perform the following iterative optimization procedure to obtain a heuristic minimizer:
$$\theta_{t+1} = \theta_t - \gamma_t \sum_{i=1}^{n} w_i^t\, \nabla_\theta \ell(\theta_t; Z_i), \tag{7}$$
for all $t\in\{0, \ldots, T-1\}$, where $w_i^t = w\big(F_n(\ell(\theta_t; Z_i))\big)$ is the rank-dependent weight and $\gamma_t$ is the learning rate. Although the empirical L-Risks are in general non-convex in the model parameter $\theta$, there are special cases such as CVaR, which has a convex dual form that is easy to optimize (Rockafellar et al., 2000):
$$\mathcal R_{\mathrm{CVaR},\alpha}(f) = \inf_{\eta\in\mathbb R}\left\{\eta + \frac{1}{\alpha}\mathbb E\big[\max(\ell_f - \eta, 0)\big]\right\}. \tag{8}$$
In such cases, we compare the heuristic method with the procedure that, at each iteration, first optimizes the L-Risk with respect to $\eta$ and then with respect to $\theta$.
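The update (7) is ordinary gradient descent with per-sample gradients reweighted by the rank of the current losses. The sketch below is our own illustration (assuming NumPy; the CVaR weight, the step size, the one-dimensional toy data, and the folding of the $1/n$ factor into the step size are all illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def cvar_weight(t, alpha=0.1):
    return (t >= 1 - alpha) / alpha

def logistic_loss(theta, x, y):
    return np.log1p(np.exp(-y * (x @ theta)))

def logistic_grad(theta, x, y):
    s = 1.0 / (1.0 + np.exp(y * (x @ theta)))            # sigmoid(-y x.theta)
    return (-y * s)[:, None] * x                         # per-sample gradients

def rank_weighted_gd(x, y, loss, grad, w, lr=0.5, steps=300):
    """Heuristic L-risk minimizer, equation (7); w_i^t = w(F_n(loss_i at theta_t))."""
    n, d = x.shape
    theta = np.zeros(d)
    for _ in range(steps):
        per_sample = loss(theta, x, y)
        ranks = np.searchsorted(np.sort(per_sample), per_sample, side="right") / n
        theta -= lr * np.mean(w(ranks)[:, None] * grad(theta, x, y), axis=0)
    return theta

# toy run on 1-d data resembling the logistic setup of Section 6.1
x = rng.uniform(-1.0, 1.0, size=(2000, 1))
y = np.where(rng.uniform(size=2000) < (1 + x[:, 0]) / 2, 1.0, -1.0)
print(rank_weighted_gd(x, y, logistic_loss, logistic_grad, cvar_weight))
```

For CVaR, this heuristic can be cross-checked against the dual form (8) by alternating minimization over $\eta$ and $\theta$, which is the comparison discussed in Section 6.3.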
Figure 2. Experimental results for logistic regression, averaged over 20 seeds. (a) Log-log plot of the approximated uniform error against the sample size $n$. (b) $d = 5$; $n$ from 20 to 22,100. (c) $n = 2000$; $d$ from 7 to 55. Panels (b) and (c) are log-log plots of the performance of the empirical minimizer of different L-Risks with respect to sample size and dimension. For HRM, we chose $a = 0.6$ and $b = 0.4$; for CVaR and the trimmed mean, $\alpha = 0.1$. The results suggest similar convergence behavior across the different L-Risks.

Figure 3. Experimental results for linear regression, averaged over 20 seeds. (a) Log-log plot of the approximated uniform error against the sample size $n$. (b) $d = 5$; $n$ from 20 to 22,100. (c) $n = 2000$; $d$ from 7 to 55. Panels (b) and (c) are log-log plots of the performance of the empirical minimizer of different L-Risks with respect to sample size and dimension. For HRM, we chose $a = 0.3$ and $b = 0.4$; for CVaR and the trimmed mean, $\alpha = 0.1$. The results suggest similar convergence behavior across the different L-Risks.

### 6.3. Results

Our results for logistic regression are given in Figure 2, and our results for linear regression are given in Figure 3. Figures 2a and 3a are the log-log plots of the approximated uniform error against the sample size in the two experimental settings. We fit a line to the data of each L-Risk to examine the rate of convergence of these approximated uniform errors. In both the logistic regression and linear regression experiments, we see that rank-weighting does not have much effect on the convergence rate. In the logistic regression case, the $n^{-1/2}$ convergence rate of HRM matches our expectation, since its weight function is Lipschitz. The convergence rates for CVaR and the trimmed mean suggest that there might be tighter uniform convergence results for weight functions of bounded variation. Figures 2b, 2c, 3b, and 3c show the behavior of the L-Risk of the empirical minimizers; empirically, the convergence performance across the different L-Risks is similar. Similar to the log-log plots, this may suggest that there are faster rates of convergence than those obtained in Section 3, especially for weighting functions that are not Lipschitz but have bounded variation.

Finally, we remark on optimizing CVaR via its rank-weighted formulation versus its convex dual. As shown in the Appendix, when optimizing CVaR, the heuristic method in equation (7) improves more quickly than the convex dual form in equation (8), although the dual form achieves a marginally better solution in terms of training error. Practically, the comparable performance of the rank-weighted optimization perspective and the provably convex perspective, in the special case where the two can be compared, is encouraging for other weight functions. The code for the experiments can be found at https://bit.ly/2YzwRkJ.

## 7. Discussion

In this work, we study uniform convergence of rank-weighted risks, which we define as L-Risks. Unlike classical population risks, the L-Risk utilizes the ranking of the losses. There are a number of future directions in the study of L-Risks. One direction, suggested by our experimental results in Section 6, is to obtain faster rates of convergence for L-Risks with weighting functions of bounded variation. It is also of interest to obtain lower bounds on the convergence rate for L-Risks with non-Lipschitz weighting functions. Finally, since L-Risks are in general non-convex with respect to the model parameters, using them to learn machine learning models requires investigating algorithms for optimizing them. In particular, it would be useful to better understand simple optimization methods for arbitrary rank-weighting functions, such as the iterative approach used in our experiments.

## Acknowledgements

We acknowledge the support of NSF via IIS-1909816 and ONR via N000141812861.

## References

Akama, Y., Irie, K., Kawamura, A., and Uwano, Y. VC dimensions of principal component analysis. Discrete & Computational Geometry, 44(3):589-598, 2010.

Bhat, S. P. Improved concentration bounds for conditional value-at-risk and cumulative prospect theory using Wasserstein distance. arXiv preprint arXiv:1902.10709, 2019.

Bhatia, K., Jain, P., and Kar, P. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pp. 721-729, 2015.

Carothers, N. L. Real Analysis. Cambridge University Press, 2000.

Daniell, P. Observations weighted according to order. American Journal of Mathematics, 42(4):222-236, 1920.

Diecidue, E. and Wakker, P. P. On the intuition of rank-dependent utility. Journal of Risk and Uncertainty, 23(3):281-298, 2001.

Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Duchi, J. C., Hashimoto, T., and Namkoong, H. Distributionally robust losses against mixture covariate shifts. Preprint, 2019.

GarcΓ­a, J. and FernΓ‘ndez, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437-1480, 2015.

Gopalan, A., Prashanth, L., Fu, M., and Marcus, S. Weighted bandits or: How bandits learn distorted values that are not expected. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Howard, R. A. and Matheson, J. E. Risk-sensitive Markov decision processes. Management Science, 18(7):356-369, 1972.
Jiang, A. H., Wong, D. L.-K., Zhou, G., Andersen, D. G., Dean, J., Ganger, G. R., Joshi, G., Kaminsky, M., Kozuch, M., Lipton, Z. C., et al. Accelerating deep learning by focusing on the biggest losers. arXiv preprint arXiv:1910.00762, 2019.

Krokhmal, P., Zabarankin, M., and Uryasev, S. Modeling and optimization of risk. In Handbook of the Fundamentals of Financial Decision Making: Part II, pp. 555-600. World Scientific, 2013.

Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 665-674. IEEE, 2016.

Leqi, L., Prasad, A., and Ravikumar, P. On human-aligned risk minimization. In Advances in Neural Information Processing Systems, 2019.

Lugosi, G. and Mendelson, S. Robust multivariate mean estimation: the optimality of trimmed mean. arXiv preprint arXiv:1907.11391, 2019.

Musielak, J. and Orlicz, W. On generalized variations (I). Studia Mathematica, 18(1):11-41, 1959.

Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Prasad, A., Balakrishnan, S., and Ravikumar, P. A unified approach to robust mean estimation. arXiv preprint arXiv:1907.00927, 2019.

Prashanth, L., Jie, C., Fu, M., Marcus, S., and SzepesvΓ‘ri, C. Cumulative prospect theory meets reinforcement learning: Prediction and control. In International Conference on Machine Learning, pp. 1406-1415, 2016.

Rockafellar, R. T., Uryasev, S., et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21-42, 2000.

Shah, V., Wu, X., and Sanghavi, S. Choosing the sample with lowest loss makes SGD robust. arXiv preprint arXiv:2001.03316, 2020.

Sra, S., Nowozin, S., and Wright, S. J. Optimization for Machine Learning. MIT Press, 2012.

Tversky, A. and Kahneman, D. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297-323, 1992.