# learning_to_abstain_from_uninformative_data__57294d5a.pdf

Learning to Abstain From Uninformative Data

Yikai Zhang Yikai.Zhang@morganstanley.com Morgan Stanley

Songzhu Zheng Songzhu.Zheng@morganstanley.com Morgan Stanley

Mina Dalirrooyfard Mina.Dalirrooyfard@morganstanley.com Morgan Stanley

Pengxiang Wu pxiangwu@gmail.com Snap Inc.

Anderson Schneider Anderson.Schneider@morganstanley.com Morgan Stanley

Anant Raj Anant.Raj@morganstanley.com Morgan Stanley

Yuriy Nevmyvaka Yuriy.Nevmyvaka@morganstanley.com Morgan Stanley

Chao Chen chao.chen.1@stonybrook.edu Stony Brook University

Learning and decision-making in domains with naturally high noise-to-signal ratios such as Finance or Healthcare is often challenging, while the stakes are very high. In this paper, we study the problem of learning and acting under a general noisy generative process. In this problem, the data distribution has a significant proportion of uninformative samples with high noise in the label, while part of the data contains useful information represented by low label noise. This dichotomy is present during both training and inference, which requires the proper handling of uninformative data during both training and testing. We propose a novel approach to learning under these conditions via a loss inspired by the selective learning theory. By minimizing this loss, the model is guaranteed to make a near-optimal decision by distinguishing informative data from uninformative data and making predictions. We build upon the strength of our theoretical guarantees by describing an iterative algorithm, which jointly optimizes both a predictor and a selector, and evaluates its empirical performance in a variety of settings.

1 Introduction

Despite the success of machine learning (ML) in computer vision (Krizhevsky et al., 2009; He et al., 2016a; Huang et al., 2017) and natural language processing (Vaswani et al., 2017; Devlin et al., 2018), the power of ML is yet to make similarly-weighty impact in other areas such as Finance or Public Health. One major challenge is the inherently high noise-to-signal ratio in certain domains. In financial statistical arbitrage, for instance, the spread between two assets is usually modeled using Orstein-Uhlembeck processes (Øksendal, 2003; Avellaneda & Lee, 2010). Spreads behave almost randomly near zero and are naturally unpredictable.

: Equal contribution.

They become predictable in certain rare pockets/scenarios: for example, when the spread exceeds a certain threshold, it will move towards zero with a high probability, making arbitrage profits possible. In cancer research, due to limited resources, only a small number of the most popular gene mutations are routinely tested for differential diagnosis and prognosis. However, because of the long tail distribution of mutation frequencies across genes, these popular gene mutations can only capture a small proportion of the relevant list of driver mutations of a patient (Reddy et al., 2017). For a significant number of patients, the tested gene mutations may not be in the relevant list of driver mutations and their relationship w.r.t. the outcome may appear completely random. Identifying these patients automatically will justify additional gene mutation testing.

These high noise-to-signal ratio datasets pose new challenges for learning. New methods are required to deal with the large fraction of uninformative/high-noise data in both the training and testing stages. The source of uninformative data can be either due to the random nature of the data-generating process or due to the fact that the real causing factors are missing from the data. Direct application of standard supervised learning methods to such datasets is both challenging and unwarranted. Deep neural networks are even more affected by the presence of noise, due to their strong memorization power (Zhang et al., 2017): they are likely to overfit the noise and make overly confident predictions where weak/no real structure exists.

In this paper, we propose a novel method for learning on datasets where a significant portion of content has high noise. Instead of forcing the classifier to make predictions for every sample, we learn to first decide whether a datapoint is informative or not. Our idea is inspired by the classic selective prediction problem (Chow, 1957), in which one learns to select a subset of the data and only predict on that subset. However, the goal of selective prediction is very different from ours. A selective prediction method pursues a balance between coverage (i.e. proportion of the data selected) and conditional accuracy on the selected data, and does not explicitly model the underlying generative process. In particular, the aforementioned balance needs to be specified by a human expert, as opposed to being derived directly from the data. In our problem, we assume that uninformative data is an integral part of the underlying generative process and needs to be accounted for. By definition, no learning method, no matter how powerful, can successfully make predictions on uninformative data. Our goal is therefore to identify these uninformative/high noise samples and, consequently, to train a classifier that is less influenced by the noisy data.

Our method learns a selector, g, to approximate the optimal indicator function of informative data, g . We assume that g exists as a part of the data generation process, but it is never revealed to us, even during training. Instead of direct supervision, we therefore must rely on mistakes made by the predictor in order to train the selector. To achieve this goal, we propose a novel selector loss enforcing that (1) the selected data best fits the predictor, and (2) in the portion of the data where we abstain from forecasting, the predictor s performance is similar to random chance. The resulting loss function is quite different from the loss in classic selective prediction, which penalizes all unselected data equally.

We theoretically analyze our method under a general noisy data generation process, imposing an additional structure on the standard data-dependent label noise model (Massart & Nédélec, 2006; Hanneke, 2009). We distinguish informative versus uninformative data via a gap in the label noise ratio. A major contribution of this paper is the derivation of theoretical guarantees for the selector loss using empirical risk minimization (ERM). A minimax-optimal sample complexity bound for approximating the optimal selector is provided. We show that optimizing the selector loss can recover nearly all the informative data in a PAC fashion (Valiant, 1984). This guarantee holds even in a challenging setting where the uninformative data has purely random labels and dominates the training set. By leveraging the estimated selector, one can further pick out the informative subset of the training samples. We prove that the classifier generated through risk minimization conditional on the informative subset exhibits a superior upper bound on risk compared to a conventional classifier trained using the complete training dataset.

The theoretical results generalize the method to a more realistic setting where the sample size is limited. Furthermore, the initial predictor is not sufficiently close to the ground truth. Our method yields an iterative algorithm, in which both the predictor and the selector are progressively optimized. The selector is improved by optimizing our novel selector loss. Meanwhile, the predictor is strengthened by optimizing the empirical risk: re-weighted based on the output from the selector, where uninformative samples identified by the

selector are down-weighed. Experiments on both synthetic and real-world datasets demonstrate the merit of our method compared to existing baselines.

2 Related work

Learning with untrusted data aims to recover the ground truth model from a partially corrupted dataset. Different noise models have been studied, including random label noise (Bylander, 1994; Natarajan et al., 2013), Massart Noise (Massart & Nédélec, 2006; Awasthi et al., 2015; Hanneke, 2009; Hanneke & Yang, 2015; Yan & Zhang, 2017; Diakonikolas et al., 2019; 2020; Zhang & Li, 2021) and adversarial noise (Kearns & Li, 1993; Kearns et al., 1994; Kalai et al., 2008; Klivans et al., 2009; Awasthi et al., 2017). Our noise model is similar to General Massart Noise (Massart & Nédélec, 2006; Hanneke, 2009; Diakonikolas et al., 2019), where the label noise is data-dependent, and the label can be generated via a purely random coin flipping. The major distinction in our noisy generative model is the existence of some uninformative data with high noise in label compared to informative data with low noise in label. We characterize such uninformative/informative data structure via a non-vanishing label noise ratio gap. While there exists a long history of literature studying training classifiers with noisy label (Thulasidasan et al., 2019), we are the first to investigate learning a model robust against label noise at inference stage. We study the case where uninformative samples are an integral part of the generative process and thus will appear during inference stage as well, where they must be discarded if detected. We view this as a realistic setup in industries like Finance and Healthcare.

Selective learning is an active research area (Chow, 1957; 1970; El-Yaniv et al., 2010; Kalai et al., 2012; Nan & Saligrama, 2017; Ni et al., 2019; Acar et al., 2020; Gangrade et al., 2021a) that expands on the classic selective prediction problem. It focuses on how to select a subset of data for different learning tasks, and has also been further generalized to other problems, e.g., learning to defer human expert (Madras et al., 2018; Mozannar & Sontag, 2020). We can approximately classify existing methods into 4 categories: Monte Carlo sampling based methods (Gal & Ghahramani, 2016; Kendall & Gal, 2017; Pearce et al., 2020), margin based methods (Fumera & Roli, 2002; Bartlett & Wegkamp, 2008; Grandvalet et al., 2008; Wegkamp et al., 2011; Zhang et al., 2018), confidence based methods (Wiener & El-Yaniv, 2011; Geifman & El-Yaniv, 2017; Jiang et al., 2018) and customized selective loss (Cortes et al., 2016; Geifman & El-Yaniv, 2019; Liu et al., 2019; Gangrade et al., 2021c). Notably, several works propose customized losses, and incorporate them into neural networks. In (Geifman & El-Yaniv, 2019), the network maintains an extra output neuron to indicate rejection of datapoints. Liu et al. (2019) introduces the Gambler loss where a cost term is associated with each output neuron and a doubling-rate-like loss function is used to balance rejections and predictions. Thulasidasan et al. (2019) also applies an extra output neuron for identifying noise label to improve the robustness in learning. Huang et al. (2020) adopts a progressive label smoothing method which prevents DNN from overfitting and improves selective risk when applied to selective classification task. Cortes et al. (2016) performs data selection with an extra model and introduces a selective loss that helps to maximize the coverage ratio, thus trading off a small fraction of data for better precision. Sharing a similar spirit with (Kalai et al., 2012), (Gangrade et al., 2021c) applies a one-sided prediction method to model a high confidence region for each individual class, and maximizes coverage while maintaining a low risk level.

Existing works on selective prediction are all motivated by the trade off between accuracy and coverage - i.e. one wants to make confident predictions to achieve higher precision while maintaining a reasonable recall. To the best of our knowledge, our paper is the first to investigate the case where some (or even the majority) of the data is uninformative, and thus must be discarded at test time. Unlike the selective prediction, our framework considers a latent never-revealed ground truth indicator function about whether a data point should be selected or not. Our method is guaranteed to identify those uninformative samples.

3 Problem formulation

In this section, we describe the framework for the inherently noisy data generation process that we study.

Definition 1 (Noisy Generative Process). Let α (0, 1) be a problem-dependent constant and ΩD Rd be the support of Dα below. We define Noisy Generative Process by the following notation x Dα where

( x DU with prob. 1 α (Uninformative) x DI with prob. α (Informative). (1)

Given X Rd, let the ground truth labeling function f : X {0, 1} be in hypothesis class F. Suppose {ΩU, ΩI} is a partition of ΩD. Let λ(x) ( λ, 1

2] with λ > 0, the latent informative/uninformative status z {1, 0} has posterior distribution:

( 1 2 λ(x), if x ΩU 1 2 + λ(x), if x ΩI. (2)

The observed data (x, y) is generated according to:

z P[z|x]; ( y Bernoulli(0.5), if z = 0 y = f (x), if z = 1.

Since λ(x) > 0, x from ΩU has a lower probability to be observed with true label compared to ΩI, thus it can be viewed as uninformative data in a relative sense. On the contrary, x from ΩI can be viewed as informative data. Our Noisy Generative Process follows standard data-dependent label noise, e.g., Massart Noise (Massart & Nédélec, 2006) and Benign Label Noise (Hanneke, 2009; Hanneke & Yang, 2015; Diakonikolas et al., 2019) with label noise ratio 1

4 (2g (x) 1)λ(x)

2 . Indeed, one can always choose λ(x) [0, 1

2] and α [0, 1] to replicate General Massart noise. Compared to classical label noise models, the assumption λ(x) > λ introduces a label noise ratio gap, which distinguishes the informative and uninformative data. In Equation 3, the Bernoulli(0.5) label noise serves as a proxy for white noise in label corruption. When λ(x) = 1

2 and x ΩU, Bernoulli(0.5) random label noise can be viewed as the strongest known non-adversarial label noise, of both theoretical and practical interest (Diakonikolas et al., 2019). Such Bernoulli(0.5) random label noise could happen when hard-to-classify examples are shown to human annotators (Klebanov & Beigman, 2010), or when fluctuations in financial markets closely resemble a random walk (Tsay, 2005).

A typical setting that is studied in this work is the case that both values of 1 α and λ(x) are non-vanishing, i.e., there is a significant fraction of uninformative data (large 1 α) and the label noise ratio gap is distinguishable between informative and uninformative data (large λ).

The next definition describes a recoverable condition of the optimal function for the latent informative/uninformative status z.

Definition 2 (G-realizable). Given support ΩD and λ(x) ( λ, 1

2], let the posterior distribution of z be defined in Equation (2). We say ΩD is G-realizable if there exists g G : X {0, 1} satisfying g (x) = 1{P[z = 1|x] > 1

Ideally, one would want to select all informative samples where signal dominates noise. This can be done via recovering g ( ), which we view as the ground truth selector we wish to recover. The G-realizable condition is analogous to the realizability condition (Massart & Nédélec, 2006; Hanneke & Yang, 2015) in the classical label noise problem. The major difference and challenge in recovering g ( ) compared to learning a classifier, is that there is no direct observation on the informative/non-informative status z. The major contribution of this work is proposing a natural selector risk which recovers g ( ) without observing the latent variable z.

Having introduced the data generation process, we now describe the learning task:

Assumption 1. Data Sn = {xi, yi}n i=1 is i.i.d generated according to the Noisy Generative Process (Definition 1), with f F and support ΩD satisfies G-realizable condition.

Given the above assumption, we are interested in the following learning task:

Problem 1 (Abstain from Uninformative Data). Under Assumption 1 with i.i.d observations from Dα, we aim to learn a selector bg G that is close to g (x) and predictor bf with low selective risk.

(a) Inform/Uninformative Data (b) Correct/Incorrect classification (c) Selective classification

Figure 1: Illustration of the learning strategy when x is distributed according to a Gaussian mixture. We replace the 0-1 loss with hinge loss and train SVM models for f and g. (a) upper panel shows the original dataset and bottom panel shows the region of informative (easy) and uninformative (hard) data. (b) shows that the classifier has high accuracy in the informative region, but low accuracy in the uninformative region. In (c), the selector trained with bf successfully recovers informative support thus resulting in low selective risk, and we abstain from making a prediction elsewhere.

4 Our method

In this section, we present our approach to learning and abstaining in the presence of uninformative data (Problem 1). The main challenge is that the latent informative/uninformative status of a datapoint is unknown. Our main idea is to introduce a novel yet natural selector loss function that trains a selector based on the performance of the best predictor (Section 4.1). In Section 5, we present our main theoretical result. We show that, given any reasonably good classifier, by finding a selector minimizing the proposed selector loss, we can solve Problem 1 with minimax-optimal sample complexity. Inspired by the theoretical results, in Section 6, we propose a practical algorithm that iteratively optimizes the predictor and the selector.

4.1 Selector loss

In an idealized setting, when access to latent informative/uninformative variables {(xi, zi)}n i=1 is available, recovering g shares a similar spirit with learning a classifier under label noise. It suffices to minimize the following classical classification risk :

Non-Realizable Risk(g; Sn)

i=1 1{g(xi) = zi} (4)

However, in practice z is never revealed. To learn a selector without direct supervision, we have to leverage the performance of a predictor f. We propose to replace z in the Equation 4 with a pseudo-informative label 1{f(x) = y}, which has randomness coming from z and noisy label y.

Definition 3 (Selector Loss). Let scalar β > 0 be some weight coefficient, given f F and its selector g G, we define the following empirical version of weighted 0-1 type risk w.r.t g( ) as selector risk:

RSn(g; f, β)

β1{f(xi) = yi}1{g(xi) > 0} + 1{f(xi) = yi}1{g(xi) 0} (5)

where Sn = {(xi, yi)}n i=1 defined in Assumption 1.

The selector loss is also a natural metric to evaluate the quality of the selector. This loss penalizes when (1) the predictor makes a correct prediction on a datapoint that the selector considers uninformative and

abstains from, or (2) the predictor makes an incorrect prediction when the selector considers informative. Intuitively, the loss will drive the selector to partition the domain into informative and uninformative regions. Within the informative region, the predictor is supposed to fit the data well, and should be more accurate. Meanwhile, within the uninformative region, the label is random and the predictor is prone to make mistakes.

Note that there are two types of errors penalized in the selector loss: an incorrect prediction on a selected datapoint, (f(x) = y) (g(x) > 0), and a correct prediction on an unselected datapoint, (f(x) = y) (g(x) 0). Since the label noise is non-adversarial, y tends to have higher probability of coincidence with f (x), introducing imbalance on the pseudo-informative label. We thus use β to weigh these two types of errors in the loss. An analysis can be found in section A.1 on the choice of β. Our theoretical analysis suggests that, for a wide range of β, the accuracy of the selector is guaranteed. In particular, the range of β depends on the value of label noise ratio gap λ, empirically, many values work - see Appendix - and we choose β = 3 throughout. Our experiments show stability with regard to this choice.

Learning a selector with the novel loss. To learn a selector, one can follow standard procedures e.g., empirical risk minimization (ERM), to get a predictor bf with reasonable quality. The selector can be estimated by minimizing the selector loss bg = arg ming G RSn(g, bf, β), conditioned on the estimated predictor bf.

In Figure 1, we show an example of using the ERM strategy using SVM with 0-1 loss replaced by hinge loss. In this case, the losses are all convex and the empirical minimizers bf and bg can be computed exactly.

In practice, however, empirical minimization is not always possible, as optimization for general complex models (e.g., DNNs), resulting in non-convex losses, remains an open problem. We therefore propose a practical algorithm in the spirit of our theoretical results: it jointly learns f and g by minimizing the selector loss and a reweighed classification risk iteratively (see Section 6).

5 Theoretical results

In this section, we present our theoretical results. The main one can be summarized in the following (informal) statement.

Main Result (Informal) For any reasonably good predictor bf, with sufficient data, the selector bg estimated using bg = arg ming G RSn(g, bf, β) is sufficiently close to the targets g with high probability, where Sn = {(xi, yi)}n i=1 defined in Assumption 1. Furthermore, training predictor f only using informative data selected by bg improves selective risk.

Remark 1. The toolkit we use in the proof is a Bernstein-type inequality for fast generalization rate under margin condition (Massart & Nédélec, 2006; Van Erven et al., 2015; Li & Liu, 2021). We also provide an information theoretic lower bound construction in section A.2 to show our selector risk bound is minimaxoptimal. Our construction of the lower bound is motivated from (Ehrenfeucht et al., 1989; Blumer et al., 1989) and Le Cam s method (Yu, 1997). Due to space constraints, detailed proofs of our theorems are provided in the Appendix. To avoid lengthy technical definitions, we only present a light version of our results in the format of finite hypothesis class. The extension to VC-class using Local Rademacher Average tools (Bartlett et al., 2005) is shown in section B.

Theorem 1 (Minimax-Opitmal Selector Risk Bound). Let Sn = {(xi, yi)}n i=1 be i.i.d sample from Data Generative Process described in Definition 1 under Assumption 1, with f ( ) F and g ( ) G, |F| < |G| < . Given λ, let β 3 2 λ

1+2 λ + λ, min( 3+2 λ

1 2 λ λ 1 4 λ2 , 10) . For any bf( ) F, let bg = arg min g G RSn(g; bf, β).

Then for any ε > 0, δ > 0 such that the following holds: For n max{ 32β2 log( |G|

δ ) λε , 24β log( |F|

δ ) ε }, and for bf that satisfies one of the following condition:

For any bf( ) F such that Ex[ bf(x) = f (x)] ε 8β with prob at least 1 δ,

For any bf( ) F such that Ex[ bf(x) = f (x)|x ΩI] ε 8βα with prob at least 1 δ,

The following holds with probability at least 1 2δ:

R(bg; f , β) R(g ; f , β) ε

Remark 2. The assumption that Ex[ bf(x) = f (x)] ε could be achieved via an ERM on classification loss Pn i=1 1{f(xi) = yi} under some margin condtions (Bousquet, 2004; Massart & Nédélec, 2006; Bartlett et al., 2005). In practice, one can also apply some methods beyond ERM to obtain bf (Namkoong & Duchi, 2017). In particular, in the case λ = 1

2, the data in support ΩU is un-learnable as y are purely random. While approximating f on the full support is not possible in general, one can control the conditional risk Ex[ bf(x) = f (x)|x ΩI] via a standard ERM schema (see proof in appendix Section A.7). We stress that Theorem 1 holds for any classifier that is close to f , even the case where bf and bg are trained on the same dataset.

Corollary 1 (Recovering g ). Given conditions in Theorem 1, if we choose β = 3, we have:

Ex[1{bg(x) = g (x)}] 4ε(1 + 2 λ)

Corollary 1 suggests that by minimizing the empirical version of the loss from Definitions 3, one can recover g in a PAC fashion. The theoretical guarantee holds even under a very challenging case where α > 0.5 and λ = 1

2, .e.g, the majority of the data has purely random labels. The analysis of the selector loss (Theorem 1) relies on the quality of the classifier bf. But since we know that bg is able to abstain from uninformative data, we can retrain bf beyond standard ERM, with upweighted informative data, therefore improving the accuracy of bf. Such circular logic naturally leads to a alternating risk minimization schema. We formulate such schema in equations (8) and (9). Next theorem studies the risk bounds of classifiers by minimizing the empirical risk conditional on the samples selected by bg. The risk bound improves the conditional risk compared to the standard empirical risk minimization method up to some problem-dependent constant. Theorem 2 (Joint Risk Bounds ). Let Sn{(xi, yi)}n i=1 be i.i.d samples from the Data Generative Process described in Definition 1 under Assumption 1, with f ( ) F and g ( ) G, |F| < , |G| < and

2 h, h > 0. For any ϵ > 0 and 0 < δ < 1, if n 64 max log( |G|

δ ) λ2ε , log( |F|

ERM Classifier: bf = arg min f F

i=1 1{f(xi) = yi} (7)

ERM Selector: bg = arg min g G RSn(g; bf, 3) (8)

Subset-ERM Classifier: ef = arg min f F

i=1 1{bg(xi) > 0}1{f(xi) = yi} (9)

the following inequalities hold with probability at least 1 3δ:

Ex[ bf(x) = f (x)] ε (10)

Ex[1{bg(x) = g (x)}] 4ε (11)

Ex[ ef(x) = f (x)|g (x) = 1] h2ε

Theorem 2 implies the convergence of the alternating minimization procedure between equation (8) and (9) by setting bf := ef, due to the fact that equation (12) satisfies the pre-requisite condition on bf in Theorem 1. Another take away of Theorem 2 is an improved selective risk of classifier resulting from training only with samples picked out by the selector. To see this, we note that Massart Noise condition is satisfied with margin h 2 under the assumption that λ(x) < 1

2 h. The risk bound under standard margin condition in equation (10) follows from a standard result (see Section 5.2 in (Bousquet, 2004)). Indeed, equation (10) implies the following risk bounds conditional on ΩI:

Ex[ ef(x) = f (x)|g (x) = 1] ε

Which is improved by a problem-dependent constant factor, of order h2, compared to equation (12). While risk bound in equation (10) is minimax optimal in general due to (Massart & Nédélec, 2006), the problemdependent structure introduced in Definition 1 allows tighter risk bounds on sub-regions. The refined selective risk could be achieved by a weighted-ERM, with binary weights {0, 1} given by selectors.

Remark 3. The analysis presented in (Cortes et al., 2016) suggests that the generalization bound exhibits a rate of O(1/ n). This risk bound covers scenarios where there is no presumption of a ground truth model, nor any gap on the label noise ratio to distinguish informative and uninformative data. In comparison, the risk bounds established in Theorem 2 indicate a faster, mini-max optimal rate of O(1/n) by capitalizing on the structural characteristics of the underlying data generation process. This contrast in assumptions stems from the motivation behind selective learning. While in the works of (Cortes et al., 2016; Geifman & El-Yaniv, 2019), the selective loss is devised with a focus on coverage ratio, i.e., optimizing for higher precision (selective loss) by trading coverage ratio, our approach is tailored to distinguish data that is inherently unlearnable and unpredictable. This discrepancy leads to an alternative theoretical outcome. The analysis in (Cortes et al., 2016) primarily centers on selective risk, whereas our theoretical analysis places emphasis on the selector s capability to differentiate between informative and uninformative data without adjusting the rejection cost imposed by a human.

Algorithm 1 Iterative Soft Abstain (ISA)

1: Input: Dataset Sn = {(x1, y1), ..., (xn, yn))}, weight parameter:β, random initial classifier ˆf 0 and selector ˆg0, number of iterations T 2: for t 1, , T do 3: Optimize loss to update predictor ˆf t : 1

n Pn i=1 ˆgt(xi){yi log( ˆf t(xi)) + (1 yi) log(1 ˆf t(xi))}

4: Approximate the pseudo-informative label : zt i = 1{1{ ˆf t(xi) > 1

2} = yi} 5: Optimize loss to update selector gt : Pn i=1 {zt i log(ˆgt(xi)) + β(1 zt i) log(1 ˆgt(xi))} 6: end for 7: Output: ˆf T , ˆg T

6 A practical algorithm

Motivated by our theoretical analysis, we propose a practical algorithm that shares a spirit similar to the selector loss. From a computational standpoint, we replace the binary loss by cross-entropy loss instead and require that both f and g have continuous-valued output, ranging between 0 and 1 instead of binary output. The label y also needs to be processed so that the values are in the {0, 1} range.

Following alternating steps given in equation (8) and (9), the practical algorithm trains both predictor and selector in an iterative manner. To accelerate the training, instead of applying an alternating minimization schema, we relax the requirement for minimization oracles by stochastic gradient updates. During the joint optimization process, the predictor is counting on the selector to upweight informative data. By putting more effort on the informative data, we wish to improve the performance of the predictor on informative data, as in equation 12. Algorithm 1 shows the logic above. A pictorial example of Algorithm 1 s performance can be found in Figure 7 in the Appendix.

7 Experiments

In this section, we test the efficacy of our practical algorithm (Algorithm 1) on both publicly-available and semi-synthetic datasets. The code for reproducing the results could be found in https://github. com/morganstanley/MSML/tree/main/paper/Learn_to_Abstain. The empirical study aims to answer the following questions:

Q1 : How does Algorithm 1 compare to baselines on semi-synthetic datasets in recovering ground truth selector g ?

The results are presented in Figure 2, 3 and 4, 5. We conducted semi-synthetic experiments to evaluate the ability of each baseline in recovering the g given different values for the noise ratio gap λ and sample size constraints. One can see our proposed method outperforms all listed baselines in this particular task.

Q2 : Does our algorithm outperform ERM on selected data points ?

One important motivation behind selective learning is to find a model that outperforms the ERM on the informative part of the data. This is a challenging task as ERM is minimax optimal in general setting(Massart & Nédélec, 2006; Bartlett & Mendelson, 2006). We show in Table 1 that our proposed algorithm does exploit additional problem structure and achieves lower risk on selected datapoints.

Q3 : How does Algorithm 1 work on real world datasets compared to selective learning baselines?

On real world datasets, Algorithm 1 consistently shows competitive/superior performance against other baselines in the low coverage regime, e.g., the proportion of data chosen by the selector being below 20%. These empirical results suggest that that our method is good at picking out strongly informative data. This observation supports theorem 2.

Baselines. We compare our method to recently proposed selective learning algorithms. (1) Selective Net (Geifman & El-Yaniv, 2019), which integrates an extra neuron as a data selector in the output layer and also introduces a loss term to control the coverage ratio; (2) Deep Gambler (Liu et al., 2019), which also maintains an extra neuron for abstention and uses a doubling-rate-like loss term (i.e., gambler loss) to train the model. (3) Adaptive (Huang et al., 2020) uses the moving average confidence of classifier as the soft-label for the training and an extra output neuron is used to indicate abstention; (4) Oneside (Gangrade et al., 2021b) formulates selective classification problems as multiple one-sided prediction problems where each data class s coverage is maximized under error constrains. The implemented classifier is trained to optimize a relaxed min-max loss. The selection then is performed according to the classifier s confidence; (5) We also create a heuristic baseline that selects data using the model prediction confidence, which we refer to as Confidence. The intuition behind this heuristic baseline is that informative data should have higher confidence compared to uninformative data.

Our Algorithm: ISA and ISA-V2. Several specific implementation details need to be mentioned. First of all, we empirically get better performance if we average zt (in Algorithm 1) over past 10 epochs instead of using zt from the most recent epoch. Second, we use a rolling window average of past 10 epochs of g for sample weights instead of only using current epoch s output. Furthermore, inspired by existing literature, we empirically get better performance in real-world datasets if we allow both the classifier model and the selector model to use the same backbone network to share data representations. We add an extra output neuron to the classifier model to achieve this goal. Finally, we use focal loss Lin et al. (2017) in order to deal with imbalance of informative versus uninformative data. The selector focal loss has the following format:

L(zi, ˆgi) =

i=1 { [1{zi}(1 ˆgi) log ˆgi + β1{1 zi}ˆgi log (1 ˆgi)]} (14)

where zi = 1{ ˆfi = yi} is the classifier ˆf s correctness on ith input. We use ˆgi as the abbreviation for the ith entry in ˆgi(xi).

We will denote as ISA the modified version of Algorithm 1 that adopts the rolling average and focal loss, and ISA-V2 as the version that adopts all 4 relaxations. Notice that both ISA and ISA-V2 can be easily applied in multi-class scenarios in practice.

Experiment Details. We use a lightweight CNN for MNIST+Fashion. Its architecture is given in Table 5 in section D.1. We use Res Net18 (He et al., 2016b) for SVHN. For all of our synthetic experiments, we use Adam optimizer with learning rate 1e-3 and weight decay rate 1e-4. We use batch size 256 and train 60 epochs for MNIST+Fashion and 120 epochs for SVHN. The learning rate is reduced by 0.5 at epochs 15, 35 and 55 for MNIST+Fashion and is reduced by 0.5 at epochs 40, 60 and 80 for SVHN. We repeat each experiment 3 times using seeds 80, 81 and 82.

Regarding hyper-parameters for each baseline, we mainly follow the setting given by the original paper. Except for Deep Gambler in MNIST+Fashion, which the original paper does not contemplate. We set the pay-off o = 1.5 instead of 2.6, which gives better performance for Deep Gambler. For Oneside, since we are not maximizing the coverage (the optimal coverage is fixed in our problem), we fix the Lagrangian multiplier µ to be 0.5 and use t as the score to compute average precision. We made our best effort to implement this algorithm, given that the authors have not released the original code. We list all the hyper-parameters in Table 6 in section D.1.

We conduct all our experiments using Pytorch 3.10 (Paszke et al., 2019). We execute our program on Red Hat Enterprise Linux Server 7.9 (Maipo) and use NVIDIA V100 GPU with cuda version 12.1.

7.1 Experiments Using Semi-Synthetic Data for Q1

Dataset Construction: We explicitly control the support of informative/uninformative data. For MNIST+Fashion-MNIST dataset, images from MNIST are defined to be uninformative, while images from Fashion-MNIST are set to be informative. For SVHN(Netzer et al., 2011) dataset, class 5-9 are set to be uninformative and class 0-4 are set to be informative. Datasets are constructed with different values of informative data fraction α and label noise ratio gap λ driving the noisy generative process. We inject label noise accordingly to Definition 1 by setting λ(x) = λ. Note that when we vary α, we only constrain the sample size in the training set, while in the testing set we always use the complete testing set for performance evaluation.

We present two sets of experiments. In the first experiment, we set λ = 0.5, which gives realizable informative data and completely randomly shuffled uninformative data. We fix the number of uninformative data points and increase the number of informative ones as a proxy for different value of α in the Definition 1. Specifically, we run our experiments with Informative Data Uninformative Data = α 1+α {1.0, 0.75, 0.5, 0.25}. Furthermore, we conduct our experiments in both a complete dataset setting and a sparse dataset setting to test each algorithm s sample efficiency. In a complete dataset setting, we use 100% of all informative data and uninformative data, whie in a sparse dataset setting, we use only 25% of samples.

In the second set of experiments, we inject 10%, 20% and 30% proportion of uniform label noise into the informative part and also inject 80%, 70% and 60% uniform label noise into the uninformative part.

Evaluation Metrics. Every baseline will generate a score besides their classification output to decide if the input should be abstained. For SLNet, Deep Gambler, Adaptive and ISA-V2, the score is given by the extra neuron. For Confidence and Oneside, such score is its maximum probabilistic output. For ISA, the score is given by the selector model. In the semi-synthetic experiment, we can calculate the average precision (AP) using the selection score and ground-truth informative versus uninformative binary label to test each baseline s ability in recovering g . In our study, we especially focus on the case where α is small and informative data is the minority part of the dataset. In this situation, there is severe class imbalance in the latent informative/uninformative labeling. AP is preferred over other metrics, like F1 or AUC, in this scenario (Saito & Rehmsmeier, 2015). In Table 2, Table 3 and Table 4, we use transformation log (1 x) to make the performance difference visually distinguishable.

Results and Discussion. There are 3 empirical observations we can get from Figure 2, 3, 4, where we manifest the performance difference though the transformation log(1 Average Precision) for differentiation purpose. More detailed results can be found in appendix section D.2. Firstly, the proposed ISA method outperforms all baselines under most of the scenarios according to the average precision criterion, which supports the superiority of the proposed method in recovery g . Secondly, we observe that our method s performance is more stable and suffers less deterioration as the data size becomes more limited or the label noise becomes stronger. Finally, in many cases, Confidence, which simply uses the confidence of a model trained with vanilla cross entropy loss, is a competitive baseline. Our proposed method outperforms Confidence in most scenarios.

Figure 2: Average Precision (AP) v.s Different α. Numerical results in Table 7.

Figure 3: Average Precision (AP) v.s Different α - 25% Samples Size. Numerical results in Table 8.

Figure 4: Average Precision (AP) v.s. Different λ . Numerical results in Table 9.

7.2 Experiments Using Semi-Synthetic Data for Q2

Evaluation Metrics: The construction of informative/uninformative data follows Section 7.1. In this section, we want to measure the classification performance of the proposed method conditioning on selected data points, by looking at the selective risk (SR) metric. Its formal definition is given by SRτ = 1

n Pn i I{arg max ˆf(xi) = yi|1{ˆg(xi) > q}}, where the quantity q is used to control the coverage. The confidence method could be viewed as a proxy for ERM method with following adjustments: 1) replacing cross-entropy as surrogate loss for binary classification loss 2) replacing selector 1{bg(x) q} by using estimated margin based selection rule 1{| bf ERM(x) 1/2| q}. Coverage is defined as the proportion of selected data to the whole dataset. In this experiment, we always calculate SR at the ground-truth coverage α = 0.5. It is the proportion of informative data that we mixed into the testing set. Ideally, we hope that the selector can select all but only informative data, which should give the lowest risk given the coverage level.

Table 1: SR . ERM v.s ISA

Dataset Method 100% Sample 25% Sample

τI = 0.3 τU = 0.6 τI = 0.2 τU = 0.7 τI = 0.1 τU = 0.8 τI = 0.3 τU = 0.6 τI = 0.2 τU = 0.7 τI = 0.1 τU = 0.8

MNIST+Fashion Confidence 0.387 0.001 0.293 0.005 0.188 0.005 0.469 0.002 0.362 0.002 0.235 0.002 ISA 0.376 0.004 0.280 0.004 0.186 0.003 0.446 0.002 0.338 0.005 0.233 0.004 ISA-V2 0.378 0.008 0.281 0.001 0.192 0.001 0.487 0.022 0.355 0.008 0.254 0.007

SVHN Confidence 0.409 0.006 0.324 0.006 0.227 0.007 0.526 0.008 0.416 0.012 0.295 0.007 ISA 0.391 0.003 0.310 0.018 0.217 0.003 0.554 0.011 0.365 0.016 0.256 0.010 ISA-V2 0.455 0.006 0.357 0.010 0.243 0.003 0.516 0.007 0.430 0.014 0.307 0.027

Results and discussion. In Table 1, we present the SR for Confidence (ERM) and the proposed ISA. Following the setting in section 7.1, we inject different levels of uniform label noise into uninformative data and informative data separately, denoted by τI and τU, corresponding to label noise ratio for informative and uninformative data respectively. Parameter λ in Definition 1 can be calculated accordingly: λ = 0.9 τI

There are two main empirical observations from Table 1. First, ISA consistently maintains smaller SR than Confidence (ERM) across different scenarios. This result is consistent with Theorem 2, suggesting that upweighting informative data leads to improved risk on informative data. Second, we observe that both methods exhibit performance deterioration as the noise gap becomes smaller, which is suggested by the sample complexity in our theorem. Our theorem suggests that the sample complexity increases as the label noise ratio gap λ decreases. When λ vanishes, informative date becomes less distinguishable from the uninformative data, and thus learning g becomes more challenging.

7.3 Experiments Using Real-world Data for Q3

Dataset. In this section, we report our empirical study on 3 publicly-available datasets: (1) Oxford realized volatility (Volatility) dataset (Heber et al., 2009), (2) breast ultrasound images (BUS) (Al-Dhabyani et al., 2020), and (3) lending club dataset (LC) (Lending Club, 2007). The experiments aim to demonstrate the potential of the proposed algorithm in real-world application and its advantages in selecting useful information out of noisy dataset. The data description and respective train-test splits are presented in Table 11 in section D.1.

Experiment Details. We use the same network architectures, software and hardward as we do in section 7.17.2. For Volatility and LC, we use the same Adam optimizer with learning rate (1e-3), weight decay rate (1e-4) and batch size 256. For BUS, due to its limited sample size, we use smaller batch size (16) and reduce learning rate (1e-4) accordingly. For all three datasets, we train each algorithm for 50 epochs and reduce the learning rate by half at epochs 15 and 35. Each experiment is repeated 6 times with random seeds 77, 78, 79, 80, 81 and 82.

Specifically, for Deep Gambler the default hyper-parameter o = 2.2 gives unreasonably poor performance on LC and BUS. We find that setting o = 1.5 gives the best performance and hence we use this value for Deep Gambler in these two datasets. For other baselines as well as our method, we use consistent hyper-parameters across all experiments. They are listed in Table 4 in section D.1.

Evaluation Metric. Unlike synthetic experiments, in the real-world datasets, the ground-truth binary labels that distinguish whether data is informative/uninformative are not available. Hence metrics like precision/recall are not applicable. Instead, we report the selective risk of each algorithm given different coverage levels. Specifically, we pick testing data points that have top% coverage selective confidence given by each selector, and calculate the testing selective risk at different coverage levels accordingly.

Results and Discussion. From Table 2, we can see that the proposed method gains competitive performance against other baselines at low coverage levels. This suggests that our method is especially good at picking out strongly informative data. Such low risk regime can be captured by our selector loss, leading to lower selective risk, which is consistent with the conclusion in Theorem 2.

Table 2: Real-World Experiments: Selective Risk v.s Coverage

Dataset Coverage Confidence SLNet Deep Gambler Adaptive Oneside ISA ISA-V2

0.001 0.000 0.000 0.200 0.000 0.267 0.121 0.000 0.000 0.050 0.055 0.000 0.000 0.000 0.000 0.100 0.097 0.008 0.122 0.003 0.328 0.015 0.168 0.008 0.113 0.006 0.091 0.008 0.083 0.004 0.200 0.133 0.004 0.198 0.002 0.326 0.009 0.225 0.012 0.159 0.003 0.136 0.002 0.127 0.008 0.500 0.218 0.004 0.316 0.003 0.327 0.005 0.290 0.004 0.239 0.003 0.227 0.002 0.225 0.004 0.900 0.309 0.001 0.332 0.003 0.327 0.001 0.321 0.001 0.314 0.001 0.312 0.002 0.310 0.001 0.999 0.327 0.003 0.333 0.003 0.327 0.001 0.329 0.001 0.332 0.002 0.328 0.001 0.324 0.002

0.001 0.026 0.032 0.167 0.408 0.000 0.000 0.167 0.408 0.052 0.036 0.000 0.000 0.000 0.000 0.100 0.026 0.032 0.396 0.156 0.156 0.147 0.083 0.076 0.052 0.036 0.020 0.031 0.010 0.026 0.200 0.026 0.032 0.422 0.134 0.151 0.057 0.089 0.094 0.052 0.036 0.021 0.026 0.005 0.013 0.500 0.038 0.029 0.432 0.062 0.184 0.046 0.135 0.140 0.052 0.036 0.058 0.019 0.017 0.016 0.900 0.085 0.019 0.393 0.039 0.133 0.035 0.200 0.109 0.093 0.023 0.098 0.018 0.095 0.022 0.999 0.109 0.016 0.382 0.040 0.121 0.032 0.114 0.026 0.219 0.107 0.109 0.013 0.118 0.024

0.001 0.829 0.052 0.410 0.142 0.362 0.058 0.357 0.072 0.106 0.036 0.061 0.012 0.219 0.068 0.100 0.204 0.005 0.370 0.102 0.419 0.031 0.284 0.023 0.207 0.036 0.151 0.007 0.151 0.021 0.200 0.216 0.008 0.393 0.077 0.419 0.023 0.265 0.014 0.228 0.038 0.214 0.006 0.212 0.019 0.500 0.312 0.006 0.421 0.030 0.414 0.011 0.241 0.007 0.305 0.030 0.333 0.005 0.362 0.011 0.900 0.385 0.004 0.399 0.009 0.418 0.004 0.229 0.004 0.373 0.021 0.389 0.006 0.427 0.009 0.999 0.398 0.004 0.399 0.010 0.421 0.004 0.231 0.003 0.386 0.019 0.395 0.007 0.427 0.008

7.4 Ablation Study

In this section we present ablation studies about 1) the sensitivity of the algorithm s performance to the choice of hyper-parameters and 2) aforementioned practical implementations. We first present the results on MNIST+Fashion given different hyper-parameter combinations. In section 7.1 we set β = 3.0, T = 1 and pre-train epochs to be 10. In this section, we vary each of them one-by-one while fix the rest of them. The results are presented in Figure 6 in Appendix. We can see that the proposed algorithm s performance is robust against the choice of hyper-parameters. We next present an ablation study on the aforementioned practical implementations in Table 3. Recall that we have the following relaxations: (1) moving average estimation of sample weight (MAW); (2) moving average of soft-label for informative data; (3) classifier and selector sharing the same backbone and the use of focal loss instead of cross-entropy. We systematically incorporate each module one at a time, evaluating their marginal contributions to the performance on SVHN in the high label noise setting (τI = 0.3 and τU = 0.6). Results shown in Table 3. For Algorithm 1, the best run can get AP = 0.87 but the vanilla algorithm is not stable and end up getting large variance in its performance. In comparison, the relaxed version get much more stable performance. The ablation study suggests that each relaxation strategy helps stabilizing the algorithm and make it robust against variance from stochastic approximation.

Table 3: Ablation Study on Implementation Relaxation. Method AP SR Algorithm 1 0.778 0.131 0.576 0.034 + MA-Weight 0.763 0.094 0.573 0.019 + MA-Soft (ISA) 0.850 0.020 0.554 0.011 + Focal Loss (ISA-V2) 0.906 0.000 0.516 0.007

8 Conclusion and Future Work

In this work, we take the first step towards principled learning in domains where a lot of data is naturally uninformative/highly noisy and should be discarded in both learning and inference stage. We propose a general noisy generative process that formally describes such setting. Supported by theoretical guarantees, a novel loss is designed for the training of the selector model. Based on this loss, we design a practical algorithm that jointly learns the predictor and selector. Empirical analysis demonstrates the effectiveness of our method. There are several directions for future work:

In the current framework, the fundamental principle of proposed method is distinguishing the uninformative data points and to discard them during the training process. While this approach may enhance the statistical efficiency and risk associated with informative data, it may under-utilize the potential of data from a representation learning perspective - for instance, while the uninformative labels are useless by definition, the discarded samples themselves may still be helpful for representation learning. This can be achieved through methods similar to those presented in studies like (Sohn et al., 2020; Li et al., 2021).

using methods similar to (Sohn et al., 2020; Li et al., 2021).

The Noisy Generative Process can be potentially generalized to solve different problems, such as active learning (Cohn et al., 1994) and out of distribution generalization (Arjovsky et al., 2019).

Bridging the gap between Algorithm 1 and Theorem 2 presents two main challenges. Firstly, analyzing the disparity when minimizing cross-entropy as a surrogate for binary classification. We are optimistic that the H-consistency bound framework introduced in (Awasthi et al., 2022) could be utilized for this purpose. Secondly, examining the use of Stochastic Gradient Descent (SGD) instead of the Empirical Risk Minimization (ERM) oracle in Theorem 2. In this regard, we believe that tools for analyzing SGD in bi-level optimization problems (Chen et al., 2021) could be applied.

We look forward to these extensions.

9 Acknowledgement

We thank anonymous reviewers for their constructive feedback, improving the quality of this work. This effort was partially supported by the Intelligence Advanced Research Projects Agency (IARPA) and Army Research Office (ARO) under Contract No. W911NF20C0038. Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.

Durmus Alp Emre Acar, Aditya Gangrade, and Venkatesh Saligrama. Budget learning via bracketing. In International Conference on Artificial Intelligence and Statistics, pp. 4109 4119. PMLR, 2020.

Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in brief, 28:104863, 2020.

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019.

Marco Avellaneda and Jeong-Hyun Lee. Statistical arbitrage in the us equities market. Quantitative Finance, 10(7):761 782, 2010.

Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pp. 167 190. PMLR, 2015.

Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM (JACM), 63(6):1 27, 2017.

Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. H-consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pp. 1117 1174. PMLR, 2022.

Peter L Bartlett and Shahar Mendelson. Empirical minimization. Probability theory and related fields, 135(3): 311 334, 2006.

Peter L Bartlett and Marten H Wegkamp. Classification with a reject option using a hinge loss. JMLR, 9(8), 2008.

Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. The Annals of Statistics, 33(4):1497 1537, 2005.

Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. Vorabversion eines Lehrbuchs, 5, 2016.

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929 965, 1989.

Olivier Bousquet. Theory of classification: a survey of recent advances. ESAIM Probability and Statistics, 2004.

Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Summer school on machine learning, pp. 169 207. Springer, 2003.

Tom Bylander. Learning linear threshold functions in the presence of classification noise. In Proceedings of the seventh annual conference on Computational learning theory, pp. 340 347, 1994.

Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34:25294 25307, 2021.

Chi-Keung Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, (4):247 254, 1957.

CK Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16 (1):41 46, 1970.

David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine learning, 15(2):201 221, 1994.

Corinna Cortes, Giulia De Salvo, and Mehryar Mohri. Learning with rejection. In ICALT, pp. 67 82, 2016.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018.

Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-independent pac learning of halfspaces with massart noise. Advances in Neural Information Processing Systems, 32, 2019.

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with massart noise under structured distributions. In Conference on Learning Theory, pp. 1486 1513. PMLR, 2020.

Richard M Dudley. Uniform central limit theorems, volume 142. Cambridge university press, 2014.

Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247 261, 1989.

Ran El-Yaniv et al. On the foundations of noise-free selective classification. JMLR, 11(5), 2010.

Giorgio Fumera and Fabio Roli. Support vector machines with embedded reject option. In International Workshop on Support Vector Machines, pp. 68 82, 2002.

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, pp. 1050 1059, 2016.

Aditya Gangrade, Anil Kag, Ashok Cutkosky, and Venkatesh Saligrama. Online selective classification with limited feedback. Advances in Neural Information Processing Systems, 34, 2021a.

Aditya Gangrade, Anil Kag, and Venkatesh Saligrama. Selective classification via one-sided prediction. In International Conference on Artificial Intelligence and Statistics, pp. 2179 2187. PMLR, 2021b.

Aditya Gangrade, Anil Kag, and Venkatesh Saligrama. Selective classification via one-sided prediction. In International Conference on Artificial Intelligence and Statistics, pp. 2179 2187. PMLR, 2021c.

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Neur IPS, 2017.

Yonatan Geifman and Ran El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. In ICML, pp. 2151 2159, 2019.

Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, and Stéphane Canu. Support vector machines with a reject option. In Neur IPS, 2008.

Steve Hanneke. Theoretical foundations of active learning. Carnegie Mellon University, 2009.

Steve Hanneke and Liu Yang. Minimax analysis of active learning. J. Mach. Learn. Res., 16(1):3487 3602, 2015.

David Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded vapnik-chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217 232, 1995.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770 778, 2016b.

Heber, Gerd, Lunde Asger, Shephard Neil, and Kevin Sheppard. Oxford-man institute s realized library, 2009. URL https://realized.oxford-man.ox.ac.uk/data.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700 4708, 2017.

Lang Huang, Chao Zhang, and Hongyang Zhang. Self-adaptive training: beyond empirical risk minimization. Advances in neural information processing systems, 33:19365 19376, 2020.

Heinrich Jiang, Been Kim, Melody Y Guan, and Maya R Gupta. To trust or not to trust a classifier. In Neur IPS, pp. 5546 5557, 2018.

Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777 1805, 2008.

Adam Tauman Kalai, Varun Kanade, and Yishay Mansour. Reliable agnostic learning. Journal of Computer and System Sciences, 78(5):1481 1495, 2012.

Michael Kearns and Ming Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22 (4):807 837, 1993.

Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115 141, 1994.

Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Neur IPS, pp. 5580 5590, 2017.

Beata Beigman Klebanov and Eyal Beigman. Some empirical evidence for annotation noise in a benchmarked dataset. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 438 446, 2010.

Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(12), 2009.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991.

Lending Club. 2007 through current lending club accepted and rejected loan data, 2007. Lending Club platform is retired at 2020. The data is retrieved from Kaggle, https://www.kaggle.com/datasets/ wordsforthewise/lending-club.

Junnan Li, Caiming Xiong, and Steven CH Hoi. Comatch: Semi-supervised learning with contrastive graph regularization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9475 9484, 2021.

Shaojie Li and Yong Liu. High probability generalization bounds with fast rates for minimax problems. In International Conference on Learning Representations, 2021.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980 2988, 2017.

Ziyin Liu, Zhikang Wang, Paul Pu Liang, Russ R Salakhutdinov, Louis-Philippe Morency, and Masahito Ueda. Deep gamblers: Learning to abstain with portfolio theory. In Neur IPS, 2019.

David Madras, Toni Pitassi, and Richard Zemel. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in Neural Information Processing Systems, 31, 2018.

Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5): 2326 2366, 2006.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.

Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, pp. 7076 7087. PMLR, 2020.

Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. Advances in neural information processing systems, 30, 2017.

Feng Nan and Venkatesh Saligrama. Adaptive classification for prediction under a budget. ar Xiv preprint ar Xiv:1705.10194, 2017.

Nagarajan Natarajan, Inderjit S Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NIPS, volume 26, pp. 1196 1204, 2013.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

Chenri Ni, Nontawat Charoenphakdee, Junya Honda, and Masashi Sugiyama. On the calibration of multiclass classification with rejection. ar Xiv preprint ar Xiv:1901.10655, 2019.

Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations, pp. 65 84. Springer, 2003.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.

Tim Pearce, Felix Leibfried, and Alexandra Brintrup. Uncertainty in neural networks: Approximately bayesian ensembling. In AISTATS, pp. 234 244, 2020.

Anupama Reddy, Jenny Zhang, Nicholas S Davis, Andrea B Moffitt, Cassandra L Love, Alexander Waldrop, Sirpa Leppa, Annika Pasanen, Leo Meriranta, Marja-Liisa Karjalainen-Lindsberg, et al. Genetic and functional drivers of diffuse large b cell lymphoma. Cell, 171(2):481 494, 2017.

Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. Plo S one, 10(3):e0118432, 2015.

Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145 147, 1972.

Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596 608, 2020.

Sunil Thulasidasan, Tanmoy Bhattacharya, Jeff Bilmes, Gopinath Chennupati, and Jamal Mohd-Yusof. Combating label noise in deep learning using abstention. ar Xiv preprint ar Xiv:1905.10964, 2019.

Ruey S Tsay. Analysis of financial time series, volume 543. John wiley & sons, 2005.

Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, 1984.

Tim Van Erven, Peter Grunwald, Nishant A Mehta, Mark Reid, Robert Williamson, et al. Fast rates in statistical and online learning. 2015.

Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pp. 11 30. Springer, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ar Xiv preprint ar Xiv:1706.03762, 2017.

Mathukumalli Vidyasagar. Learning and generalisation: with applications to neural networks. Springer Science & Business Media, 2013.

Marten Wegkamp, Ming Yuan, et al. Support vector machines with a reject option. Bernoulli, 17(4): 1368 1385, 2011.

Jon Wellner et al. Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 2013.

Yair Wiener and Ran El-Yaniv. Agnostic selective classification. Neur IPS, 24:1665 1673, 2011.

Songbai Yan and Chicheng Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. pp. 1056 1066, 2017.

Bin Yu. Assouad, fano, and le cam. In Festschrift for Lucien Le Cam, pp. 423 435. Springer, 1997.

Chicheng Zhang and Yinan Li. Improved algorithms for efficient active learning halfspaces with massart and tsybakov noise. ar Xiv preprint ar Xiv:2102.05312, 2021.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Chong Zhang, Wenbo Wang, and Xingye Qiao. On reject and refine options in multicategory classification. JASA, 113(522):730 745, 2018.

A Theoretical Results Details

In this appendix section, we present the missing proofs as well as additional theory results.

A.1 Preliminaries

We describe the risk for selector loss on (x, y) X Y Rd {+1, 1}.

R(g; f, β) := Ex,y β1{g(x) = 1}1{f(x) = y} + 1{g(x) = 1}1{f(x) = y} (15)

The choice of β should ensure that given Bayes optimal classifier f ( ), g ( ) is the minimizer for the selector risk R(g; f , β). We have that given f , the risk gap between any selector g and g is R(g; f , β) R(g ; f , β) could be written as:

R(g; f , β) R(g ; f , β)

β1{g(x) = 1}1{g (x) = 1}1{f (x) = y} + β1{g(x) = 1}1{g (x) = 1}1{f (x) = y}

+1{g(x) = 1}1{g (x) = 1}1{f (x) = y} + 1{g(x) = 1}1{g (x) = 1}1{f (x) = y}

β1{g(x) = 1}1{g (x) = 1}1{f (x) = y} + β1{g(x) = 1}1{g (x) = 1}1{f (x) = y}

+1{g(x) = 1}1{g (x) = 1}1{f (x) = y} + 1{g(x) = 1}1{g (x) = 1}1{f (x) = y}

1{g(x) = 1}1{g (x) = 1}

1{g(x) = 1}1{g (x) = 1}

Since λ(x) is data dependent, to ensure that R(g, f , β) R(g ; f , β) for all g G, it suffices to pick β 1

2 0. So we need β sup x

3 2λ(x) 1+2λ(x) and β inf x

3+2λ(x) 1 2λ(x)

which implies that it suffices to pick β 3 2 λ

1+2 λ, 3+2 λ

Assuming λ(x) λ, we pick β within certain margin of the above interval: by picking 3 2 λ

1+2 λ + λ, 3+2 λ

1 2 λ λ 1 4 λ2

R(g; f , β) R(g ; f , β) λ 4(1 + 2 λ)Ex[1{g(x) = g (x)}] (17)

Note that if λ = 1

2, the above interval for β is [2, ].

A.2 Proof of Information Theoretic Lower Bound

In this section we quantify the hardness of recovering g , from an information theoretic perspective. We define a generative process that conforms with the noisy generative process defined in 1, and show a sample lower bound for finding a selector given samples generated from this process.

Let X : {τ e|e {e1, ..., ed}, |τ| 1} where ej represents the j-th cannonical basis. So X consists of τ-scaled basis vectors. Let Y = {+1, 1}, samples are drawn from X Y Rd {+1, 1}. Let w be vector of ones, w = 1. For our lower bound construction we define f as follows for x X.

f (x) = 21{w x > 0} 1.

Let G be the hypothesis class that contains all functions g : X {0, 1}. So |G| = 2d and G contains g (x). Let σ = (σ1, . . . , σd) {+1, 1}d be a d-dimensional Rademacher vector, α (0, 1). We define g σ G as follows

j=1 1{x ej = 0} 1{ x 1 α} (1 σj)/2 + 1{ x α} (1 + σj)/2 .

To clarify the definitions above, suppose for each j = 1, . . . , d, ej is the vector with the jth entry being one and the other entries zero. Let Ωj : {x|x = τ ej} be all the vectors in X with non-zero jth. Then for x Ωj, we have that f (x) is 1 if τ is positive and 1 if τ is negative. Moreover, g (x) is 1 σj

2 if τ 1 α, it is 1+σj

2 if τ α, and it is 0 otherwise. In other words, if σj = 1, the informative part of Ωj is {x| x 1 α} otherwise the informative part of Ωj becomes {x| x α}. Moreover, considering Definition 1, the domain Dα is d j=1Ωj. Let Sσ,n = {(xi, yi)}n i=1 be generated from following process which we denote as Q:

σ Unif{+1, 1}d

j=1 1{x ej = 0} 1{ x 1 α} (1 σj)/2

+ 1{ x α} (1 + σj)/2

Generate S = {(xi, yi)}n i=1 according to:

( j = 1, with prob 1 ε

λ j Unif{2, ..., d} with prob ε

τ Unif[ 1, 1]

( f (x), with prob 3

4 + (2g (x) 1) λ

2 f (x) with prob 1

4 (2g (x) 1) λ

Process Q follows process 1. Let A be any (potentially randomized) algorithm that takes dataset Sσ,n as input where Sσ,n is generated from the process described in Equation 18. Let bg be the hypothesis ouput of algorithm A, i.e. bg( ) = A(Sσ,n). For a parameter β we define the risk on algorithm A as

R(A(Sσ,n), β) = R(bg(x), f , β) = Ex,y β1{bg(x) = 1}1{f (x) = y} + 1{bg(x) = 1}1{f (x) = y} . (19)

Theorem 3. Let β and ε be any real numbers where β 3 2 λ

1+2 λ + λ, 3+2 λ

1 2 λ λ 1 4 λ2 and 0 < ε λ. For any algorithm A that takes n samples Sn from the noisy generative process as in Assumption 1 for some integer n and outputs a selector R(Sn) we have the following: If the risk gap is bounded by ε 8(1+2 λ), i.e.

ESn[R(A(Sn), f , β) R(g , f , β)] ε 8(1 + 2 λ)

then A requires log(|G|)

λε many samples, i.e. n log(|G|)

Proof. The lower bound construction is the process defined in Equation 18. Let bg = A(Sn) be the output of A. From equation (17) the risk gap R(bg, f , β) R(g , f , β) averaged over σ and Sσ,n can be written as follows.

EσESσ,n[R(bg, f , β) R(g σ, f , β)]

λ 4(1 + 2 λ)EσESσ,n

Ex[1{bg(x) = g σ(x)}] σ

λ 4(1 + 2 λ)EσESσ,n

j=2 Px x Ωj]Px bg(x) = g σ(x) x Ωj σ

ε 4(1 + 2 λ)d

Px bg(x) = g σ(x) x Ωj σ

In the last inequality we use the fact that Px[x Ωj] = ϵ/( λ(d 1)) ϵ/( λd) for j 2.

Let σ/j be a Rademacher vector conditional on coordinates {1, ..., j 1, j + 1, ...d}. Let σ{ j} be a vector equal to σ except at the jth entry. We drop the n subscript from Sσ,n for simplicity. Above equation becomes:

EσESσ[R(bg, f , β) R(g σ, f , β)]

ε 8(1 + 2 λ)d

j=2 Eσ/j ESσ

Px bg(x) = g σ(x) x Ωj

Px bg(x) = g σ(x) x Ωj σ/j

σ/j {+1, 1}d 1

ε 8(1 + 2 λ)d

PSσ,x bg(x) = g σ(x) x Ωj

+ PSσ{ j},x bg(x) = g σ(x) x Ωj

We make our notation more specific. Let A(Sσ) = bgσ and A(Sσ{ j}) = bgσ j. For simplicity, by g σ j we mean g σ{ j}.

Note that for all x Ωj, g σ j(x) = g σ(x) could happen only when α x or x 1 α. So equation 21 becomes

EσESσ[R(bg, f , β) R(g σ, f , β)]

σ/j {+1, 1}d 1

ε 8(1 + 2 λ)d

PSσ,x bgσ(x) = g σ(x) x Ωj

+ PSσ{ j},x bgσ j(x) = g σ j(x) x Ωj

σ/j {+1, 1}d 1

αε 8(1 + 2 λ)d

PSσ,x bgσ(x) = g σ(x) x Ωj, α x , or, x 1 α

+ PSσ{ j},x bgσ j(x) = g σ j(x) x Ωj, α x , or, x 1 α

σ/j {+1, 1}d 1

αε 8(1 + 2 λ)d

PSσ,x bgσ(x) = g σ(x) x Ωj, α x , or, x 1 α

+ PSσ{ j},x bgσ j(x) = g σ(x) x Ωj, α x , or, x 1 α

Next we make Equation 22 independent of x.

EσESσ[R(bg, f , β) R(g σ, f , β)]

σ/j {+1, 1}d 1

αε 8(1 + 2 λ)d

PSσ,x bgσ(x) = g σ(x) x Ωj, α x , or, x 1 α

+ PSσ{ j},x bgσ j(x) = g σ(x) x Ωj, α x , or, x 1 α

σ/j {+1, 1}d 1

αε 8(1 + 2 λ)d

ESσ,x 1{bgσ(x) = g σ(x)} x Ωj, α x , or, x 1 α

+ ESσ{ j},x 1{bgσ j(x) = g σ(x)} x Ωj, α x , or, x 1 α

σ/j {+1, 1}d 1

αε 8(1 + 2 λ)d

ESσ,x 1{bgσ = g σ} x Ωj, α x , or, x 1 α

+ ESσ{ j},x 1{bgσ j = g σ} x Ωj, α x , or, x 1 α

σ {+1, 1}d 1

αε 8(1 + 2 λ)d

PSσ bgσ = g σ + PSσ{ j} bgσ j = g σ

σ {+1, 1}d 1

αε 8(1 + 2 λ)d

1 PSσ bgσ = g σ + PSσ{ j} bgσ j = g σ

σ/j {+1, 1}d 1

αε 8(1 + 2 λ)d

1 Q(n) σ Q(n) σ{ j} T V

where Q(n) σ and Q(n) σ{ j} are the product distribution of n samples for Sσ and Sσ j respectively. The last step of inequality follows from the Le Cam s method. In the Equation we use the fact that for all x s.t., x α or x 1 α, 1{bgσj(x) = g σj(x)} = 1{bgσj = g σj} is free of x.

Let Qσ be the distribution of Sσ and Qσ{ j} be distribution of Sσ j. The total variation distance can be bounded using the Hellinger distance, which is denoted as H( , ). Below we bound the TV distance using Hellinger distance.

Q(n) σ Q(n) σ{ j} T V

H(Q(n) σ , Q(n) σ{ j})

1 H2(Q(n) σ , Q(n) σ{ j}) 4

n H(Qσ, Qσ{ j})

1 H2(Q(n) σ , Q(n) σ{ j}) 4

H2(Q(n) σ ,Q(n)

σ{ j}) n H2(Qσ,Qσ{ j})

n H(Qσ, Qσ{ j})

Now we bound the Hellinger distance.

H2(Qσ, Qσ{ j})

Qσ{ j}(x, y) 2dxdy

Qσ{ j}(x, y) 2dxdy

x Ωj, x 1 α

Qσ{ j}(x, y) 2dxdy

Qσ{ j}(x, y) 2dxdy

x Ωj, x 1 α

Qσ{ j}(x, y) 2dxdy

Thus we can bound the total variation distance as:

Q(n) σ Q(n) σ{ j} T V

Note that from inequality 23, inequality 24 and inequality 26 we have

EσESσ[R(A(Sσ), β) R(g , f , β)]

=EσESσ[R(bg, f , β) R(g , f , β)]

d αε 8(1 + 2 λ)

Above implies

sup σ {+1, 1}d ESσ[R(A(Sσ), f , β) R(g , f , β)] EσESσ[R(A(Sσ), β) R(g , f , β)]

d αε 8(1 + 2 λ)

Since |G| = 2d, any algorithm A needs at least n = Ω log |G)|

number of samples so that there is a hope to

achieve sup σ ESσ[R(A(Sσ), β)] R(g , f , β) αε 32(1 + 2 λ).

Replacing αε with ε finishes the proof.

Remark 4. From the second inequality in Equation 20, it can be observed that the construction of information theoretic lower bound for risk function R(g, β) can also be applied to the construction of an Ω(log(|G|/( λε))) sample complexity lower bound for Ex[g(x) = g (x)]. Thus our Corollary 1 also achieves minimax-optimal rate for recovering g for family of Noise Generative Process.

A.3 Proof of Sample Complexity Upper Bound

Here we prove Theorem 1 in which we bound the risk gap R(g; f , β) R(g ; f , β). Recall that the empirical version of the selector loss is

RSn(g; f, β) = 1

β1{g(xi) = 1}1{f(xi) = yi} + 1{g(xi) = 1}1{f(xi) = yi} .

Our high level approach is as follows. We first analyze the gap between RSn(bg; f , β) and RSn(g ; f , β) and provide an upper bound for it. Then we use this upper bound to get an upper bound for the gap between R(bg; f , β) and R(g ; f , β) using concentration properties and Bernstein inequality.

CASE I: bf( ) F and Ex[ bf(x) = f (x)] ε 8β with probability at least 1 δ. To upper bound

RSn(bg; f , β) RSn(g ; f , β), we use RSn(bg; bf, β) and RSn(g ; bf, β) as a middle step. Since bg is the empirical risk minimizer, we have RSn(bg; bf, β) RSn(g ; bf, β). (28)

Next we leverage on the fact that bf is consistent with f to establish an inequality in following fashion:

RSn(bg; f , β) RSn(g ; f , β) + const ε

Note we have that

RSn(bg; bf, β)

i=1 1 bf(xi) = f (xi) β1{bg(xi) = 1}1{ bf(xi) = yi} + 1{bg(xi) = 1}1{ bf(xi) = yi}

i=1 1 bf(xi) = f (xi) β1{bg(xi) = 1}1{ bf(xi) = yi} + 1{bg(xi) = 1}1{ bf(xi) = yi} (29)

Recall that

RSn(bg; f , β) = 1

β1{bg(xi) = 1}1{f (xi) = yi} + 1{bg(xi) = 1}1{f (xi) = yi} .

So RSn(bg; bf, β)

= RSn(bg; f , β)

i=1 1 bf(xi) = f (xi) β1{bg(xi) = 1}1{f (xi) = yi} + 1{bg(xi) = 1}1{f (xi) = yi}

i=1 1 bf(xi) = f (xi) β1{bg(xi) = 1}1{ bf(xi) = yi} + 1{bg(xi) = 1}1{ bf(xi) = yi}

RSn(bg; f , β) β 1

i=1 1 bf(xi) = f (xi)

Recall that in the theorem assumptions we have Ex[ bf(x) = f (x)] ε 8β with probability at least 1 δ. By

Lemma 2, if n 24β2 log(|F|/δ)

ε we have with probability at least 1 δ, 1

n Pn i=1 1{ bf(xi) = f (xi)} ε 4β , so we have RSn(bg; bf, β) RSn(bg; f , β) ε/4 (31)

With a similar approach we get that

RSn(g ; bf, β) RSn(g ; f , β) + ε/4

Thus using (28) the following inequality holds with probability at least 1 δ

RSn(bg; f , β) RSn(g ; f , β) + ε/2. (32)

To get a bound for R(bg; f , β) R(g ; f , β), we first define ℓ(g; f, x, y) = β1{g(x) = 1}1{f(x) = y} + 1{g(x) = 1}1{f(x) = y}. Note that at this point we think of β as fixed and so we have not included it in the arguments of ℓ( ) for simplicity.

Observe that RSn(g, f , β) = 1

n Pn i=1 ℓ(g; f , xi, y) and R(g, f , β) = Ex,y ℓ(g; f , x, y) for any g. First we have the following simple inequality directly taken from (32).

R(bg, f , β) R(g , f , β)

= Ex,yℓ(bg; f , x, y) Ex,yℓ(g ; f , x, y)

RSn(g ; f , β) RSn(bg; f , β) (Ex,yℓ(g ; f , x, y) Ex,yℓ(bg; f , x, y)) + ε/2 (33)

By defining ℓ(g , g, x, y) = ℓ(g ; f , x, y) ℓ(g; f , x, y) for any g, we can express the above inequality as follows: Ex,yℓ(bg; f , x, y) Ex,yℓ(g ; f , x, y)

i=1 ℓ(g , bg, xi, y) Ex,y ℓ(g ; bg, x, y) + ε/2 (34)

n Pn i=1 ℓ(g , bg, xi, y) Ex,y ℓ(g ; bg, x, y) with high probability over all Sn, we need to find a bound on 1

n Pn i=1 ℓ(g , g, xi, y) Ex,y ℓ(g ; g, x, y) that is true for all g simultaneously with high probability. We have :

i=1 ℓ(g , g; xi, yi) Ex,y[ ℓ(g , g; x, y)]

β log(|G|/δ) +

2V ar( (g , g; x, y)) log(|G|/δ)

i=1 ℓ(g , g; xi, yi) Ex,y[ ℓ(g , g; x, y)]

β log(|G|/δ) +

2V ar( (g , g; x, y)) log(|G|/δ)

Now to bound 1

n Pn i=1 ℓ(g , g, xi, y) Ex,y ℓ(g ; g, x, y) we use Bernstein inequality. For that we need to bound V arx,y[ ℓ(g , g, x, y)]. First we expand ℓ(g , g, x, y).

ℓ(g , g, x, y) =ℓ(g ; f , x, y) ℓ(g; f , x, y)

=β1{g (x) = 1}1{f (x) = y} + 1{g (x) = 1}1{f (x) = y}

β1{g(x) = 1}1{f (x) = y} 1{g(x) = 1}1{f (x) = y}

2ℓ(g , g, x, y)

= β2 1{g (x) = 1} + 1{g(x) = 1} 21{g(x) = 1}1{g (x) = 1} 1{f (x) = y}

+ 1{g (x) = 1} + 1{g(x) = 1} 21{g(x) = 1}1{g (x) = 1} 1{f (x) = y}

= β21{f (x) = y} + 1{f (x) = y} 1{g(x) = 1}1{g (x) = 1} + 1{g(x) = 1}1{g (x) = 1}

β21{g (x) = g(x)}

Hence we conclude that V arx,y[ ℓ(g , g, x, y)] Ex,y 2ℓ(g , g, x, y) β2Ex[1{g (x) = g(x}]. On the other hand, Equation 17 implies that R(g; f , β) R(g ; f , β) λ 1+2 λEx[1{g (x) = g(x}]. Thus we can use the following inequality to achieve fast rate of convergence using the Bernstein Inequality:

V arx,y[ ℓ(g , g; x, y)] β2(1 + 2 λ)

λ {R(g; f , β)) R(g ; f , β)}. (37)

We use following version of the Bernstein Inequality, with X1, ..., Xn i.i.d random variable uniformly bounded by b:

i=1 Xi E[X] < b

n log(1/δ) +

2V ar(X) log(1/δ)

Using union bounds, the Bernstein Inequality implies that with probability for all g G simultaneously:

i=1 ℓ(g , g; xi, yi) Ex,y[ ℓ(g , g; x, y)]

n log G /δ +

2V ar( (g , g; x, y)) log G /δ

Thus by applying inequality 38 with bg we have:

i=1 ℓ(g , bg; xi, yi) Ex,y[ ℓ(g , bg; x, y)]

n log G /δ +

2V ar( (g , bg; x, y)) log G /δ

By inequality 32, we have 1 n Pn i=1 ℓ(g , bg; xi, yi) = RSn(g ; f , β) RSn(bg; f , β) ε/2 holds with probability 1 δ. Note R(bg; f , β)) R(g ; f , β) = Ex,y[ ℓ(g , bg; x, y)]. So we have

{R(bg; f , β) R(g ; f , β)}

n log G /δ +

2β2(1 + 2 λ){R(bg; f , β)) R(g ; f , β)} log G /δ

The choice of n 16β2 log( |G|

δ ) λε ensures that with probability at least 1 2δ, R(bg; f , β) R(g ; f , β) ε.

CASE II: bf( ) F that satisfies Ex[ bf(x) = f (x)|x ΩI] ε 8βα with probability at least 1 δ.

Note bf only approximates f ( ) on the informative support ΩI : bf( ) F that satisfies Ex[ bf(x) = f (x)|x ΩI] ε 8βα with probability at least 1 δ. For simplicity of analysis, we introduce a pseudo version of f ( )

denoted as ef . Let e F be following hypothesis class: ef ef(x) =

( f1(x), x ΩU f2(x), x ΩI , f1 F, f2 F

and we let ef ( ) be:

( bf(x), x ΩU f (x), x ΩI

Clearly, ef e F. Note such hypothesis class is only introduced in analysis and is potentially impractical. The cardinality of hypothesis class | e F| |F|2. The construction of ef is to make R(g, ef , β) R(g , ef , β) R(g, f , β) R(g , f , β) for all g G. To see this, we use the fact that f is the Bayes optimal classifier, which implies that x, Ey[1{g (x) = 1}1{ ef (x) = y}] Ey[1{g (x) = 1}1{f (x) = y}]. Subsequently, we have x, Ey[1{g (x) = 1}1{ ef (x) = y}] Ey[1{g (x) = 1}1{f (x) = y}]. So we have the following.

R(g(x), ef , β) R(g (x), ef , β)

=Ex,y β1{g(x) = 1}1{g (x) = 1}1{ ef (x) = y} 1{g(x) = 1}1{g (x) = 1}1{ ef (x) = y}

+1{g(x) = 1}1{g (x) = 1}1{ ef (x) = y} β1{g(x) = 1}1{g (x) = 1}1{ ef (x) = y}

=Ex,y β1{g(x) = 1}1{g (x) = 1}1{ ef (x) = y} 1{g(x) = 1}1{g (x) = 1}1{ ef (x) = y}

+1{g(x) = 1}1{g (x) = 1}1{f (x) = y} β1{g(x) = 1}1{g (x) = 1}1{f (x) = y}

Ex,y β1{g(x) = 1}1{g (x) = 1}1{f (x) = y} 1{g(x) = 1}1{g (x) = 1}1{f (x) = y}

+1{g(x) = 1}1{g (x) = 1}1{f (x) = y} β1{g(x) = 1}1{g (x) = 1}1{f (x) = y}

1{g(x) = 1}1{g (x) = 1}

1{g(x) = 1}1{g (x) = 1}

=R(g; f , β) R(g ; f , β)

Meanwhile ef ( ) also satisfies the property that Ex[ ef (x) = bf(x)] ε 8β with probability at least 1 δ. Thus

by Lemma 2, if n 24β2 log( |F|

δ ) ε we have with probability at least 1 δ, 1

n Pn i=1 1{ bf(xi) = ef (xi)} ε 8β .

The rest of the proof is the same as the proof in CASE I by replacing f with ef , leveraging on the fact that ef is to make R(g(x), ef , β) R(g (x), ef , β) R(g(x), f , β) R(g (x), f , β)

A.4 Proof for Corollary 6

It can be easily verified that β = 3 is in the interval β 3 2 λ

1+2 λ + λ, min( 3+2 λ

1 2 λ λ 1 4 λ2 , 10) . By the choice of β, from (17) we have

R(bg, f , β) R(g , f , β) λ 4(1 + 2 λ)Ex[1{bg(x) = g (x)}],

together with the conclusion in Theorem 1 that

R(bg; f , β) R(g ; f , β) ε

we can conclude that Equation 6 holds.

A.5 Theorem 4 on improving selective risk

Theorem 4 (Improving Predictor Selective Risk ). Let Sn = {(xi, yi)}n i=1 be i.i.d samples from the Data Generative Process described in Definition 1 under Assumption 1, with f ( ) F and g ( ) G, |F| < , |G| < . For any bg( ) G s.t., Ex[1{g (x) = bg(x)}] ε, let ef = arg min f F

1 n Pn i=1 1{bg(xi) > 0}1{f(xi) = yi}.

Then for any ε > 0, δ > 0 such that the following holds: For n max{ 24 log( |G|

δ ) ε , 12 log( |F|

δ ) ε }, we have

Ex[1{ ef(x) = f (x)}|g (x) = bg(x)] ε

Proof. Recall that ef = arg min f F

1 n Pn i=1 1{bg(xi) > 0}1{f(xi) = yi} with bg( ) G s.t., Ex[1{g (x) = bg(x)}]

ε. From the minimization property we have

i=1 1{bg(xi) > 0}1{ ef(xi) = yi} 1

i=1 1{bg(xi) > 0}1{f (xi) = yi} (41)

Note we have for any f:

i=1 1{bg(xi) > 0}1{f(xi) = yi} 1

i=1 1{g (xi) > 0}1{f(xi) = yi}

1{bg(xi) > 0} 1{g (xi) > 0} 1{f(xi) = yi}

1{bg(xi) > 0} 1{g (xi) > 0}

i=1 1{bg(xi) = g (xi)} (42)

For n 3 log( |G|

δ ) ε , by Lemma 2 we have 1

n Pn i=1 1{bg(xi) = g (xi)} ϵ + Ex[1{g (x) = bg(x)}] 2ϵ

with probability at least 1 δ. Thus by inequality 42 and inequality 41 we have both

i=1 1{g (xi) > 0}1{ ef(xi) = yi} 2ε 1

i=1 1{bg(xi) > 0}1{f (xi) = yi}

i=1 1{bg(xi) > 0}1{ ef(xi) = yi} 1

i=1 1{g (xi) > 0}1{f (xi) = yi} + 2ε.

with probability at least 1 δ. Summing up the above inequalities and using inequality 41 we have with probability at least 1 δ:

i=1 1{g (xi) > 0}1{ ef(xi) = yi} 1

i=1 1{g (xi) > 0}1{f (xi) = yi} + 4ε (43)

Next we bound for Ex,y[1{g (x) > 0}{1{ ef(x) = y} 1{f (x) = y}}], we first define (f , f, xi, yi) = {1{f (xi) = yi} 1{f(xi) = yi}} for any f F. Due to inequality (43), we have:

Ex[1{g (x) > 0}{1{ ef(x) = y} 1{f (x) = y}}]

i=1 1{g (xi) > 0} (f , ef, xi, yi) Ex,y[1{g (x) > 0} (f , ef, x, y)}] + 4ε (44)

n Pn i=1 1{g (xi) > 0} (f , ef, xi, yi) Ex,y[1{g (x) > 0} (f , ef, x, y)}] with high probability over all Sn, we use the Bernstein inequality to achieve the fast generalization rate (similar to the proof in Theorem 1).

By the definition of (f , f, x, y) we have

2(f , f, x, y) = 1{g (x) > 0}{1{f(x) = y} 1{f (x) = y}}2 = 1{g (x) > 0}1{f(x) = f (x)} (45)

Let ω(x) P[f (x) = y|x] = 3

4 + λ(x)(2g (x) 1)

2 . We obtain the following:

Ex,y[1{g (x) > 0}{1{ ef(x) = y} 1{f (x) = y}}]

=Ex[1{g (x) > 0}Ey[{1{ ef(x) = y} 1{f (x) = y}}|x]]

=Ex[1{g (x) > 0}{ω(x)1{ ef(x) = f (x)} + (1 ω(x))(1{ ef(x) = f (x)} 1)}|x]]

=Ex[1{g (x) > 0}{(2ω(x) 1)1{ ef(x) = f (x)}}]

4Ex[1{g (x) > 0}{1{ ef(x) = f (x)}]

Thus using equation 45 and inequality 46 we have

V arx,y[1{g (x) > 0} (f , ef; x, y)] 4Ex,y[1{g (x) > 0}{1{ ef(x) = y} 1{f (x) = y}}]. (47)

Using an approach similar to the proof of Theorem 1 one can show that n 16 log( |F|

δ ) ε suffices to guarantee following inequality holds with probability at least 1 2δ.

Ex[1{g (x) > 0}{1{ ef(x) = f (x)}] ε (48)

which implies that

Ex[{1{ ef(x) = f (x)}|1{g (x) > 0}] ε

A.6 Proof for Theorem 2

Define ω(x) P[f (x) = y|x] = 3

4 + λ(x)(2g (x) 1)

2 . Since λ(x) 1

2 h, one can show P[f (x) = y|x]

satisfies Massart condition with margin h

2 . Since n 64 log( |F|

δ ) h2ε , the conclusion that Ex[ bf(x) = f (x)] ε

α follows from 5.2-section in (Bousquet, 2004). The risk bounds on bg and ef follows from Theorem 1 and Theorem 4.

A.7 Proof for controlling conditional risk Ex[ bf(x) = f (x)|x ΩI]

Definition 4 (Growth Function(Vapnik & Chervonenkis, 2015)). Let G be the hypothesis class of function f and Fx1,...,xn = { f(x1), ..., f(xn) : f F} {+1, 1}n. The growth function is defined to be the maximum number of ways in which n points can be classified by the function class: BF(n) = supx1,...,xn |Fx1,...,xn|.

Lemma 1 (Sauer Shelah Lemma(See (Blum et al., 2016; Mohri et al., 2018; Sauer, 1972))). Let dvc(G) be the VC-dimension of hypothesis class G and let BG(n) be the growth function, for all n N,

Theorem 5. For every ε > 0, δ > 0, under Assumption 1, given a set of samples Sn = {(x1, y1), ..., (xn, yn)} drawn i.i.d. from the Noisy Generative Process and

bf = arg min f F

i=1 1{f(xi) = yi},

if n is chosen such that

n 32 d V C(F) log( 1

ε) + log( 1

then with probability at least 1 2δ we have:

Ex,y[ bf(x) = y] 1

2(1 α) + 2ϵα.

Furthermore the following holds: Px[f (x) = bf(x)|x ΩI] 2ϵ

Proof. We first bound the probability of the event that Ex,y[ bf(x) = y] 1

2(1 α) + 2ϵα.

By Lemma 3 we have

PSn[sup f F | 1

i=1 1{f(xi) = yi} Ex,y[1{f(x) = y}]| t] 4BF(2n)e nt2

By setting t = αϵ and n 32(log(4BF(2n))+log( 1

δ )) α2ϵ2 we have 4BF(2n)e nt2

32 δ and so using inequality 50 we have with probability of at least 1 δ:

i=1 1{ bf(xi) = yi} Ex,y[1{ bf(x) = y}]| αϵ

The term BF(2n) could be bounded by Lemma 1 as BF(2n) 2en d V C(F)

d V C(F) . Next we apply the fact that

bf = arg minf F 1 n Pn i=1 1{ bf(xi) = yi}. We have:

Ex,y[1{ bf(x) = y}] αϵ + 1

i=1 1{ bf(xi) = yi} αϵ + 1

i=1 1{f (xi) = yi} (51)

By Lemma 5 we have 1

n Pn i=1 1{f (xi) = yi} 1

2(1 α) + ϵα with failure probability at most δ. Combining this with inequality 51 we have with probability at least 1 2δ:

Ex,y[1{ bf(x) = y}] 1

2(1 α) + 2ϵα.

And this finishes the proof of the first claim. Next we prove the claim that:

Px DI[f (x) = bf(x)] 2ϵ.

First note that Px DI[f (x) = bf(x)] = Px Dα[1{ bf(x) = f (x)}|x ΩI]. Using Ex,y[1{ bf(x) = y}] 1 2(1 α) + 2ϵα below we have:

Ex,y[1{ bf(x) = y}]

=E(x,y) Dα[1{ bf(x) = y}]

= E(x,y) Dα[1{ bf(x) = y}|x ΩU] | {z } 1 2

P(x,y) Dα[x ΩU] | {z } 1 α

+ E(x,y) Dα[1{ bf(x) = y}|x ΩI] | {z }

P(x,y) Dα[1{b f(x) =f (x)}|x ΩI]

P(x,y) Dα[x ΩI] | {z } α

2(1 α) + αPx Dα[ bf(x) = f (x)|x ΩI]

2(1 α) + 2ϵα

= Px Dα[1{ bf(x) = f (x)}|x ΩI] 2ϵ

This finishes the proof of the second claim.

B Extention to VC-Class

In order to leverage the margin condition of distribution of z to obtain a minimax-optimal generalization rate, we leverage on the Local Rademacher Average tool. Our analysis tool largely follows from (Bousquet et al., 2003; Bartlett et al., 2005). Throughout this section, and represent as shorthand for the and that ignores universal constants.

Definition 5 (L2-Covering Number). (Wellner et al., 2013) Let x1:n be set of points. A set of U Rn is

an ε-cover w.r.t L2-norm of F on x1:n, if f F, u U, s.t. q

1 n Pn i=1 |[u]i f(xi)|2 ε, where [u]i is the i-th coordinate of u. We define the covering number N2(ε, F, x1:n) :

N2(ε, F, x1:n) := min{|U|: U is an ε-cover of F on x1:n}

Let N2(ε, F, n) be the maximum cardinality of N2(ε, F, x1:n) over all x1:n. Formally N2(ε, F, n) is defined as:

N2(ε, F, n) := sup x1:n X n min{|U|: U is an ε-cover of F on x1:n}

Definition 6 (Local Rademacher Average (Bartlett et al., 2005; Bousquet et al., 2003)). Let σ1:n be Rademacher sequence of length n, given samples {xi, yi}n i=1 the Empirical Local Rademacher Complexity at distributional and empirical radius r 0 for the class F are defined as

Rn(F, Pf 2 r) Eσ1:n

sup f F,Ex[f(x)2] r

i=1 σif(xi)

Rn(F, Pnf 2 r) Eσ1:n

i=1 f(xi)2 r

i=1 σif(xi)

and their distributional Average as: R(F, Pf 2 r) ESn[Rn(F, Pf 2 r)] and R(F, Pnf 2 r) ESn[Rn(F, Pnf 2 r)]. Definition 7 (Star Hull). (Bartlett et al., 2005; Bousquet et al., 2003) The star hull of set of functions F is defined as F {αf : f F, α [0, 1]}

Definition 8 (Sub-Root Function). (Bartlett et al., 2005; Massart & Nédélec, 2006; Bousquet et al., 2003) A function ψ : R R is sub-root if

ψ is non-decreasing

ψ is non-negative

ψ(r)/ r is non-increasingfor any r > 0

And we say r is a fixed point of ψ if ψ(r ) = r . Theorem 6. [Risk Bound VC-Class] Let Sn = {(xi, yi)}n i=1 be i.i.d samples from Data Generative Process described in Definition 1 under Assumption 1, with f ( ) F and g ( ) G with VC-dimension d V C(F) < d V C(G) < . Given λ, let β 3 2 λ

1+2 λ + λ, min( 3+2 λ

1 2 λ λ 1 4 λ2 , 10) . For any bf( ) F, let bg =

arg min g G RSn(g; bf, β). Then for any ε > 0, δ > 0 such that the following holds: For

n max β4d V C(G) log( 1

ε) + β4 log( 1

δ ) λε , βd V C(F) log( d V C(F)

ε ) + β log( 1

and for bf that satisfies one of the following condition:

For any bf( ) F that satisfies Ex[ bf(x) = f (x)] ε

β with probability at least 1 δ,

For any bf( ) F that satisfies Ex[ bf(x) = f (x)|x ΩI] ε βα with probability at least 1 δ,

The following holds with probability at least 1 3δ:

R(bg; f , β) R(g ; f , β) ε

Proof. The major difference from the proof for Theorem 1 is the fact that G and F are not finite hypothesis classes. To achieve fast generalization rate, we leverage the Local Rademacher Complexity Tool from (Bartlett et al., 2005).

CASE I: bf( ) F and Ex[ bf(x) = f (x)] ε

β with probability at least 1 δ.

We achieve equation 30 which is restated below exactly the same as in Theorem 1.

RSn(bg; bf, β) RSn(bg; f , β) β 1

i=1 1 bf(xi) = f (xi)

Then to achieve inequality 31, we invoke Lemma 8 instead of Lemma 2 as F is a VC-class. In particular, since n β(d V C(F) log( 1

δ )) ε , we have that 1

n Pn i=1 1{f(xi) = f (xi)} Ex1{f(x) = f (x)} + ε. Moreover, from the case assumption we have Ex[ bf(x) = f (x)] ε

β with probability at least 1 δ. So with probability at least 1 δ RSn(bg; bf, β) RSn(bg; f , β) ε/4.

Similarly we get that with probability at least 1 δ

RSn(g ; bf, β) RSn(g ; f , β) + ε/4.

Thus following hold with probability at least 1 δ:

RSn(bg; f , β) RSn(g ; f , β) + ε

Next we turn to bound the risk gap using R(bg; f , β) R(g ; f , β) using the concentration property of inequality 53. Similar to the proof in Theorem 1, for any f F, g G we define ℓ(g; f, x, y) = β1{g(x) = 1}1{f(x) = y} + 1{g(x) = 1}1{f(x) = y}. Based on ℓ, we define following hypothesis class:

ℓ G ℓ(g; g , x, y) = ℓ(g; f , x, y) ℓ(g ; f , x, y) : g G . (54)

To invoke Lemma 6, we need to establish some hypothesis class H that satisfies condition V ar[h] BE[h]. Next we show ℓ G satisfies the condition that V ar[h] BE[h] and thus we can apply Lemma 6 with H = ℓ G. To begin with, using the same approach as in Theorem 1 we can obtain Equation 36 which is restated below. 2ℓ(g , g, x, y) β21{g (x) = g(x)}

By Equation 36 we have that V arx,y[ ℓ(g , g, x, y)] Ex,y 2ℓ(g , g, x, y) β2Ex[1{g (x) = g(x}].

On the other hand, Equation 16 implies that

R(g; f , β) R(g ; f , β) λ 1 + 2 λEx[1{g (x) = g(x}].

Thus we have following holds:

V arx,y[ ℓ(g; g , x, y)] β2(1 + 2 λ)

λ Ex,y{ ℓ(g; g , x, y)} (55)

Above variance bound allows invoking Lemma 6 with H = ℓ G, T(h) = E[h2] and B = β2(1+2 λ)

λ . To satisfy the rest of the assumptions of Lemma 6, we find a subroot function ψ(r) that

ψ(r) β2(1 + 2 λ)

λ ESn[Rn{ ℓ(g; g ) H : E[h2] r}].

Where we shorten ℓ(g; g , x, y) to ℓ(g; g ). To find ψ(r), we show some analysis on the Local Rademacher Average ESn[Rn{ ℓ(g; g ) H : E[h2] r}] as follows.

ESn[Rn( ℓ G, r)] =ESn,σ1:n[ sup g G,Ex,y[ 2ℓ(g;g )] r

i=1 σi ℓ(g; g , xi, yi)]

ESn,σ1:n[ sup g G,Ex[1(g =g )] r

i=1 σi ℓ(g; g , xi, yi)]

| {z } 1(g =g ) 2ℓ(g;g )

βESn,σ1:n[ sup g G,Ex[1(g =g )] r

i=1 σi1{g(xi) = g (xi)}]

| {z } | ℓ(g1;g ) ℓ(g2;g )| β|1(g1 =g ) 1(g2 =g )| Talagrand Contraction Inequality (Ledoux & Talagrand, 1991)

In the last inequality we use Talagrand Contraction Inequality (Ledoux & Talagrand, 1991) for which we need to show that (1) ℓ(g1; g ) is a function of 1{g1 = g } and (2) ℓ(g1; g ) is β Lipschitz in 1{g1 = g }. These conditions are satisfied as follows:

ℓ(g; g ) = 1{g = g} ℓ(g; g )

| ℓ(g1; g ) ℓ(g2; g )| |ℓ(g1, f ) ℓ(g2, f )| β|1(g1 = g2)| = β|1(g1 = g ) 1(g2 = g )|.

This finishes the proof of Inequality 56. Now define the class 1 G 1{g(x) = g (x), g G}. The indicator function 1{g(x) = g (x)} is a Boolean function taking g as input, thus d V C(1 G) d V C(G). Thus we have

β2(1 + 2 λ)

λ ESn[Rn{ ℓ(g; g ) H : E[h2] r}]

β3(1 + 2 λ)

λ ESn[Rn{1{g(x) = g (x)} 1 G : E[1{g(x) = g (x)}] r}]

Above implies that we can pick ψ(r) to be

ψ(r) = β3(1 + 2 λ)

λ ESn[Rn{ 1 G : E[1{g(x) = g (x)}] r}] + 11β2 log n

By Equation 80 in Lemma 6, we have:

Ex,y[ ℓ(bg; g , x, y)] 2

i=1 ℓ(bg; g ; xi, yi) + 1500 λ

β2 r + log(1/δ)(11β + 52

Where r is the fixed point of ψ(r). By inequality 53, we have 1

n Pn i=1 ℓ(bg; g ; xi, yi) = RSn(bg; f , β)

RSn(g ; f , β) ε/2 holds with probability 1 δ. By Lemma 7 we have r β6

λ2 d V C(G) log n

n . Plugging in

Equation 59 we have that n β4(d V C(G) log( 1

ε )+log(1/δ)) λε suffices to achieve Ex,y[ ℓ(bg; g , x, y)] ε. Recall that Ex,y[ ℓ(bg; g , x, y)] = R(bg; f , β) R(g ; f , β) so this finishes the proof.

CASE II: bf( ) F that satisfies Ex[ bf(x) = f (x)|x ΩI] ε 8αβ with probability at least 1 δ. The proof is similar to the one in Theorem 1 except for that we need to bound the VC-dimension of pseudo hypothesis class F. Since ef can be viewed as Boolean function given f1(x), f2(x) as input, with two hypothesis f1, f2 F, by Lemma 3.2.3 in (Blumer et al., 1989) we know d V C( e F) 2d V C(F) log(d V C(F)). The rest of the proof follows from the one in Theorem 1.

Next we present our extension of information theoretic lower bound to VC-class. The lower bounds suggest that the risk bound in Theorem 6 is tight up to some logarithmic factor.

Theorem 7. Consider the noisy generative process defined in Definition 1 with ΩD being G-realizable. Then for any ε λ, to achieve

ESn[R(A(Sn), f , β) R(g , f , β)] ε 8(1 + 2 λ)

with β 3 2 λ

1+2 λ + λ, 3+2 λ

1 2 λ λ 1 4 λ2 , any algorithm A will take at least d V C(G) log(d V C(G)) λε many samples.

Proof. The proof follows from the proof of Theorem 3 except for the fact that we need to have an upper bound on the VC-dimension of our hypothesis construction G. Since G consists of composition of interval hypothesis and each individual interval has VC-dimension at most 3. By Lemma 3.2.3 in (Blumer et al., 1989) we know d V C(G) 6d log(d) which implies a d V C(G) log(d V C(G)) λε lower bound.

Theorem 8 (Improving Predictor Selective Risk for VC-Class). Let Sn{(xi, yi)}n i=1 be i.i.d samples from the Data Generative Process described in Definition 1 under Assumption 1, with f ( ) F and g ( ) G, d V C(F) < , d V C(G) < . For any bg( ) G s.t., Ex[g (x) = bg(x)] ε, let ef = arg min f F

1 n Pn i=1 1{bg(xi) >

0}1{f(xi) = yi}. Then for any ε > 0, δ > 0 such that the following holds: if

n max d V C(F) log( d V C(F)

ε ) + log( 1

δ ) ε , d V C(G) log( d V C(G)

ε ) + log( 1

Ex[ ef(x) = f (x)|g (x) = 1] ε

Proof. We use a proof similar to the one in Theorem 4 up to Equation (42), restated below: For any f: 1 n

i=1 1{bg(xi) > 0}1{f(xi) = yi} 1

i=1 1{g (xi) > 0}1{f(xi) = yi}

i=1 1{bg(xi) = g (xi)} (61)

Since G is a VC-class, we will invoke Lemma 8 instead of Lemma 2. Since n d V C(G) log( d V C (G)

δ ) ε , by Lemma 8 we have 1

n Pn i=1 1{bg(xi) = g (xi)} Ex[1{bg(x) = g (x)}]+ϵ with probability at least 1 δ. Since Ex[1{bg(x) = g (x)}] ϵ, by inequality 61 we have the following with probability at least 1 δ.

i=1 1{g (xi) > 0}1{ ef(xi) = yi} 1

i=1 1{g (xi) > 0}1{f (xi) = yi} + 2ε (62)

Similar to the proof of Theorem 6, we next establish some hypothesis class H that satisfies condition V ar[h] BE[h]. We define:

G F 1{g (xi) > 0} (f; f , x, y) : f F . (63)

To invoke Lemma 6, we need to establish some hypothesis class H that satisfies condition V ar[h] BE[h]. Next we show G F satisfies this condition and thus we can apply Lemma 6 with H = G F. Recall Equation (47) in Theorem 4 shows that,

V arx,y[1{g (x) > 0} ( ef, f ; x, y)] 4Ex,y[1{g (x) > 0}{1{ ef(x) = y} 1{f (x) = y}}].

where (f1, f2, xi, yi) = {1{f1(xi) = yi} 1{f2(xi) = yi}}. So H satisfies the condition that V ar[h] BE[h] for B = 4. To invoke Lemma 6 we let H = G F, T[h] = E[h2] and B = 4. Similar to the proof in Theorem 6 we next find a subroot function ψ(r) that

ψ(r) 4ESn[Rn{1{g (xi) > 0} (f; f ) H : E[h2] r}].

To find ψ(r), we show some analysis on the Local Rademacher Average.

ESn[Rn(G F, r)] =ESn,σ1:n[ sup f F,Ex,y[1{g (x)>0} 2(f;f )] r

i=1 σi1{g (xi) > 0} (f; f ; xi, yi)]

ESn,σ1:n[ sup f F,Ex[1{g (x)>0}1(f =f )] r

i=1 σi1{g (xi) > 0} (f; f ; xi, yi)]

| {z } 1{g (x)>0}1(f =f )=1{g (x)>0} 2(f;f )

ESn,σ1:n[ sup f F,Ex[1{g (x)>0}1(f =f )] r

i=1 σi1{g (x) > 0}1{f(xi) = f (xi)}]

| {z } |1{g (xi)>0} (f1;f ) 1{g (xi)>0} (f2;f )| |1{g (xi)>0}1(f1 =f ) 1{g (xi)>0}1(f2 =f )| Talagrand Contraction Inequality (Ledoux & Talagrand, 1991) (64)

In the last inequality, we use the fact that 1{g (xi) > 0} (f; f ) is a 1-Lipschitz function of 1{g (xi) > 0}1{f = f}. In particular, (f; f ) = 1{f = f} (f; f ) and

|1{g (xi) > 0} (f1; g ) 1{g (xi) > 0} (f2; g )| (65)

1{g (xi) > 0}| (f1; g ) (f2; g )|

1{g (xi) > 0}|1(f1 = y) 1(f2 = y)|

=1{g (xi) > 0}|1(f1 = f2)|

=|1{g (xi) > 0}1(f1 = f ) 1{g (xi) > 0}1(f2 = f )| (66)

And so we can use Talagrand Contraction Inequality and this finishes the proof of inequality 64.

Now define the class G 1 F 1{g (x) > 0}1{f(x) = f (x), f F}. The indicator function 1{g (x) > 0}1{f(x) = f (x) is a Boolean function taking f as input, thus d V C(G 1 F) d V C(F) (Vidyasagar, 2013). Thus we have

ESn[Rn{1{g (x) > 0} (f; f ) H : E[h2] r}]

ESn[Rn{1{g (x) > 0}1{f(x) = f (x)} 1 F : E[1{g (x) > 0}1{f(x) = f (x)}] r}] (67)

Above implies that we can pick ψ(r) to be

ψ(r) = 4ESn[Rn{ G 1 F : E[1{g (x) > 0}1{f(x) = f (x)}] r}] + 176 log n

By Equation (80), we have:

Ex,y[1{g (x) > 0} ( ef; f )] 2

i=1 1{g (xi) > 0} ( ef; f ; xi, yi) + 1500r + 176 log(1/δ)

By inequality (62), we have 1

n Pn i=1 ( ef; f ; xi, yi) 2ε holds with probability 1 δ. By Lemma 7 we have

r d V C(F) log n

n . Plugging in Equation 80 we have that n (d V C(F) log( d V C (F)

ε )+log(1/δ)) ε suffices to achieve Ex,y[1{g (x) > 0} ( ef; f , x, y)] ε. Similar to the proof in Theorem 4, we have:

Ex[{1{ ef(x) = f (x)}|1{g (x) > 0}] ε

C Technical Lemmas

Lemma 2. Let Sn = {(xi, yi)} be i.i.d sample from Data Generative Process described in Definition 1. For

every ε > 0, there exist a δ > 0 such that if n 3 log( |F|

δ ) ε , the following inequality holds simultaneously for all f F with |F| < with probability at least 1 δ

i=1 1{f(xi) = f (xi)} < Ex[1{f(x) = f (x)}] + ε (71)

Proof. By taking union bound one can ensure that

i=1 1{f(xi) = f (xi)} n Ex[1{f(x) = f (x)}] nε

i=1 1{f(xi) = f (xi)} n Ex[1{f(x) = f (x)}] nε

i=1 1{f(xi) = f (xi)} n Ex[1{f(x) = f (x)}] nε

We next apply the following version of Chernoff inequality with a 1: Let X = Pn i=1 Xi where Xi {0, 1}. Then

P[X (1 + a)E(X)] exp ( a2

2 + a E(X)) exp ( a

P[X (1 a)E(X)] exp ( a2

2 E(X)) exp ( a

So we have P[|X E(X)| a E(X)] exp ( a

3E(X)) (72)

For any fixed f F, let Xi = 1{f(xi) = f (xi)}, and let a = ε/Ex[1{f(x) = f (x)}]. Then by inequality 72 we have

i=1 1{f(xi) = f (xi)} n Ex1{f(x) = f (x)} nε

exp ( n Ex[1{f(x) = f (x)}]a

3 ) = exp ( nε

Using (72) and setting δ = |F| exp( nϵ/3) finishes the proof.

Lemma 3. Suppose Sn = {(x1, y1), ..., (xn, yn)} are i.i.d sampled , and let L(f, x, y) {+1, 1} be a function. Let LSn(f) = 1

n Pn i=1 L(f, xi, yi) and L(f) = Ex,y[L(f, x, y)]. Given parameter t such that

then we have: PSn D[sup f F |LSn(f) L(f)| t] 4BF(2n)e nt2

Proof: For two sample sets Sn and S n, if we have |LSn(f) L(f)| t and |LS n(f) L(f)| t

2 then we get that |LSn LS n| t

2. Let bf be f that attains sup f F |LSn(f) L(f)|, one can verify that :

1{|LSn( bf) L( bf)| t} 1{|LS n( bf) L( bf)| t

1{sup f F |LSn(f) LS n(f)| t

Taking expectation w.r.t Sn D and S n D we have

PSn D sup f F |LSn(f) L(f)| t PS n D |LS n( bf) L( bf)| t

2|Sn, bf sup f F |LSn(f) L(f)|

PSn,S n D sup f F |LSn(f) LS n(f)| t

Next we lower bound P |LSn(f) L(f)| t

2 for any fixed f. Note that the choice of bf is free of S n since S n and Sn are iid samples. Since L(f, x, y) [ 1, 1] and so V ar(L(f, x, y)) b2/4, using nt2 2b2 we have that:

PSn D |LSn(f) L(f)| t

2 4V ar(LSn)

So we have PS n D |LS n( bf) L( bf)| t 2|Sn, bf sup f F |LSn(f) L(f)| 1

2. Let FS2n {+1, 1}2n be

projection of F on Sn S S n, combining this inequality with (75) we have

PSn D[sup f F |LSn(f) L(f)| t]

2PSn,S n D[sup f F |LSn(f) LS n(f)| t

=2PSn,S n D[ sup f(x) FS2n |LSn(f) LS n(f)| t

2PS2n PSn=S2n S n[ sup f(x) FS2n |LSn(f) LS n(f)| t

f(x) FS2n PSn=S2n S n[|LSn(f) LS n(f)| t

2PS2n 2|FS2n|e nt2

2PS2n sup S2n |FS2n|e nt2

2 sup S2n |FS2n|PS2n e nt2

2BF(2n)e nt2

Lemma 4 (Hoeffding s Inequality). Let Z1, ..., Zn be independent bounded random variables with Zi [a, b] for all i, where < a < b < . Then for all t > 0:

i=1 Zi E[Zi]| t) 2e 2nt2

(b a)2 (77)

Lemma 5. Consider a set of samples S = {(x1, y1), ..., (xn, yn)} drawn i.i.d. from the Noisy Generative Process and f in the hypothesis class F satisfying f(x) { 1, +1}. If:

Then we have with probability at least 1 δ :

i=1 1{f (xi) = yi} 1

2(1 α) + αε (78)

The terms 1{f(xi) = yi} {0, 1} for i [n] are a set of n independent random variables since (xi, yi) are independent. So we can use Hoeffding s inequality (Lemma 4). By setting b a = 1, t = αϵ in Equation 77, the choice of n ensures that 2nt2

(b a)2 6 log(δ). Thus

i=1 1{f (xi) = yi} Ex,y[1{f (x) = y}]| ϵα] δ.

where we have

E(x,y) Dα[1{f (x) = y}]

= E(x,y) Dα[1{f (x) = y}|x ΩU]P[x ΩU] | {z } 1 2 P[x ΩU]: Since y is labeled by coin flipping in ΩU

+ E(x,y) Dα[1{f (x) = y}|x ΩI]P[x ΩI] | {z } 0: Since y is labeled by f with 0 Bayes Risk in ΩI

This way we have:

i=1 1{f (xi) = yi} 1

2(1 α)| ϵα] δ.

which implies that Equation 78 holds with probability at least 1 δ.

Lemma 6 (Theorem 3.3 in (Bartlett et al., 2005)). Let F be a class of functions with range in [a, b] and assume that there are some functional T : H R+ and some constant B such that for every h H, V ar(h) T(h) BE[h]. Let ψ be a subroot function and r be the fixed point of ψ. Assume the ψ satisfies, for any r r , ψ(r) BESn Rn{h H : T(h) r}

Then with c1 = 704 and c2 = 26, for any K > 1 and every t > 1 with probability at least 1 e t,

h H, P[h] K K 1Pnh + c1K

B r + t(11(b a) + c2BK)

Also with probability at least 1 e t,

h H, Pn[h] K + 1

B r + t(11(b a) + c2BK)

where Pf = Ex[h(x)] and Pn = 1

n Pn i=1 h(xi).

Lemma 7. Given hypothesis class F : X [ b, b] with some universal constant b and its VC-dimension d V C(F) < . Define following sub-root function with B 1:

ψ(r) = 100BESn Rn{ F, r} + 11b2 log n

Let r be fixed point of ψ(r) so that r = ψ(r ), suppose n d V C(F), we have

r B2d V C(F) log( n d V C(F))

Proof. The proof largely follows from the proof in Corollary 3.7 in (Bartlett et al., 2005). We include it here for completeness. Since f is uniformly bounded by b, for any r ψ(r), Corollary 2.2 in (Bartlett et al., 2005) implies that with probability at least 1 1

n, {f F : Pf 2 r} {f F : Pnf 2 2r}. Let E be event that {f F : Pf 2 r} {f F : Pnf 2 2r} holds, above implies

ESn Rn{ F, Pf 2 r}

P[E]ESn[Rn{ F, Pf 2 r}|E] + P[Ec]ESn[Rn{ F, Pf 2 r}|Ec]

ESn[Rn{ F, Pnf 2 2r}] + b

Since r = ψ(r ), r satisfies

r 100BESn Rn{ F, Pnf 2 2r } + b + 11b2 log n

Next we leverage Dudley s chaining bound (Dudley, 2014) to upper bound ERn{ F, Pnf 2 2r } using integral of covering number. We first bound the covering number of a star hull of F. It follows from (Bartlett et al., 2005) Corollary 3.7 that

log N2(ε, F, x1:n) log N2

And covering number log N2(ε, F, n) can be bounded using VC-dimension of F using Haussler s bound on the covering number (Haussler, 1995; Wellner et al., 2013):

2, F, n c1d V C log 1

where c1 is some universal constant. Now we are ready to apply the chaining bound, it follows from Theorem B.7 (Bartlett et al., 2005) that

ESn[Rn( F, Pnf 2 2r )]

log N2(ε, F, x1:n)dε

d V C(F)r log(1/r )

n2 + d V C(F)r log(n/ed V C(F))

Where c2 and c3 are some universal constants. Together with Equation 83 one can solve for r B2d V C(F) log( n d V C (F) )

Lemma 8. Let Sn = {(xi, yi)} be i.i.d sample from Data Generative Process described in Definition 1. For every ε > 0, there exist a δ > 0 such that if n d V C(F) log( 1

δ ) ε , following inequality holds simultaneously for all f F with d V C(F) < , with probability at least 1 δ

i=1 1{f(xi) = f (xi)} Ex1{f(x) = f (x)} + ε (85)

Proof. The proof invokes Lemma 6, in particular, the Equation 81. Let 1 F : 1{f(x) = f (x), f F} be the hypothesis class H in Lemma 6. Since f is a deterministic boolean function, it does not increase the number of points that can be shattered by F. We have d V C(1 F) d V C(F). In particular, we choose the functional T( ) = E[ ] and it is easy to verify that

V ar(1{f(x) = f (x)}) Ex[1{f(x) = f (x)}] = Ex[12{f(x) = f (x)}].

Let ψ(r) = 100ERn{ F, Ef r} + 11 log n

n . We have

ERn{F, Ef 2 r} ERn{ F, Ef 2 r} 100ERn{ F, Ef 2 r} + 11 log n

Since local Rademacher averages of the star-hull is sub-root function, we know for all r r , ψ(r) ψ(r ) = r . By Equation 81 in Lemma 6 we have

i=1 1{f(xi) = f (xi)} 2Ex1{f(x) = f (x)} + 15r + log(1/δ) + 5200

Next we bound r . A direct application of Lemma 7 show that

r d V C(1 F) log( n d V C(1 F))

n d V C(F) log( n d V C(F))

The rest of the proof follows from plugging r in Equation 81 and removing absolute constants.

D More Experimental Results and Details

D.1 Experiment Setting and Implementation Details

Extension to multi-class. Our method extends to the multi-class setting naturally. In the case of K-class classification, our selector loss remains the same while the predictor becomes f(x) = f(x)1:K : X K

where K is the K-simplex. Meanwhile, we use multi-class cross entropy loss to train the classifier. The pseudo-informative label becomes bzi = 1{arg maxk [K] f(xi)k = yi}.

Hyper-parameters and Neural Network Architectures We list the hyper-parameters and neural network architectures in Table 4,5 and 6.

Table 4: Hyper-parameters used in real-world experiments. Notation follows original paper.

Baseline Hyper-params SLNet α = 0.5, λ = 32 Deep Gambler o = 2 in Volatility and o = 1.5 in LC and BUS, Pretrain Epoch=10 Adaptive α = 0.9, Pretrain Epoch=10 Oneside µ = 0.5, Pretrain Epoch=10 ISA β = 30, T = 2, Pretrain Epoch=10

Table 5: CNN Architecture. Layer Name Filter Size Output Size 2d Convolution 3 3 32 28 28 Re LU - 1 28 28 2d Max Pool 2 2 32 14 14 2d Convolution 3 3 64 14 14 Re LU - 64 14 14 2d Max Pool 2 2 64 7 7 Linear - 600 Drop-out - - Linear - 120 Linear - 10

D.2 More Synthetic Experiments Results.

We present the exact numbers corresponding to Figure 2-5:

We also present experiments results that only use 25% of the dataset below to show the sample efficiency of our algorithm. We can see all conclusions still hold when we sharply reduce the sample size.

Table 6: Hyper-parameters used in semi-synthetic experiments. Notation follows original paper.

Dataset Baseline Hyper-params

MNIST+Fashion

SLNet αslnet = 0.5, λslnet = 32 Deep Gambler o = 1.5, Pretrain Epoch=5 Adaptive αadaptive = 0.9, Pretrain Epoch=5 Oneside µ = 0.5, Pretrain Epoch=5 ISA β = 3, T = 10, Pretrain Epoch=5

SLNet αslnet = 0.5, λslnet = 32 Deep Gambler o = 2.6, Pretrain Epoch=10 Adaptive αadaptive = 0.9, Pretrain Epoch=10 Oneside µ = 0.5, Pretrain Epoch=10 ISA β = 10, T = 1, Pretrain Epoch=10

Table 7: g recovering: AP v.s α with 100% Sample Size Dataset α Confidence SLNet Deep Gambler Adaptive Oneside ISA ISA-V2

MNIST + Fashion

0.20 1.000 0.000 0.982 0.011 1.000 0.000 0.807 0.027 1.000 0.000 1.000 0.000 1.000 0.000 0.33 1.000 0.000 0.998 0.002 1.000 0.000 0.825 0.027 1.000 0.000 1.000 0.000 1.000 0.000 0.43 1.000 0.000 0.943 0.002 1.000 0.000 0.855 0.003 1.000 0.000 1.000 0.000 1.000 0.000 0.50 1.000 0.000 0.998 0.003 1.000 0.000 0.829 0.011 1.000 0.000 1.000 0.000 1.000 0.000

0.20 0.984 0.001 0.621 0.256 0.831 0.270 0.984 0.001 0.957 0.007 0.989 0.001 0.982 0.001 0.33 0.989 0.001 0.868 0.039 0.990 0.001 0.990 0.001 0.978 0.002 0.992 0.000 0.987 0.001 0.43 0.990 0.001 0.925 0.002 0.990 0.000 0.990 0.000 0.982 0.001 0.993 0.001 0.990 0.001 0.50 0.991 0.000 0.914 0.053 0.991 0.000 0.989 0.001 0.983 0.002 0.993 0.001 0.991 0.001

Table 8: g Recovering: AP v.s. α with 25% Sample Size Dataset α Confidence SLNet Deep Gambler Adaptive Oneside ISA ISA-V2

MNIST+Fashion

0.20 1.000 0.000 0.412 0.052 0.985 0.019 0.413 0.024 0.969 0.002 1.000 0.000 1.000 0.000 0.33 1.000 0.000 0.704 0.208 0.990 0.001 0.362 0.011 0.977 0.004 1.000 0.000 1.000 0.000 0.43 1.000 0.000 0.881 0.179 0.987 0.010 0.333 0.006 0.976 0.007 1.000 0.000 1.000 0.000 0.50 1.000 0.000 0.876 0.186 0.991 0.009 0.336 0.006 0.980 0.004 1.000 0.000 1.000 0.000

0.20 0.952 0.005 0.711 0.121 0.540 0.024 0.909 0.013 0.607 0.125 0.972 0.003 0.856 0.115 0.33 0.967 0.001 0.914 0.033 0.979 0.002 0.950 0.009 0.926 0.019 0.981 0.001 0.970 0.003 0.43 0.973 0.001 0.931 0.021 0.983 0.002 0.967 0.003 0.962 0.001 0.984 0.001 0.977 0.001 0.50 0.978 0.002 0.947 0.006 0.986 0.000 0.966 0.005 0.965 0.004 0.985 0.001 0.981 0.001

Table 9: g recovering: AP v.s λ with 100% Sample Size Dataset λ Confidence SLNet Deep Gambler Adaptive Oneside ISA ISA-V2

MNIST + Fashion

0.1 0.915 0.010 0.599 0.168 0.985 0.010 0.550 0.033 0.862 0.033 1.000 0.000 0.999 0.001 0.2 0.986 0.001 0.712 0.140 0.992 0.013 0.566 0.018 0.973 0.004 1.000 0.000 1.000 0.000 0.3 1.000 0.000 0.886 0.148 1.000 0.000 0.571 0.008 0.998 0.000 1.000 0.000 1.000 0.000

0.1 0.955 0.004 0.641 0.164 0.736 0.045 0.952 0.002 0.944 0.001 0.931 0.021 0.956 0.003 0.2 0.978 0.002 0.874 0.029 0.974 0.003 0.953 0.001 0.972 0.001 0.977 0.003 0.978 0.002 0.3 0.989 0.002 0.949 0.013 0.981 0.002 0.970 0.008 0.981 0.002 0.986 0.002 0.987 0.000

Table 10: g recovering: AP v.s λ with 25% Sample Size Dataset λ Confidence SLNet Deep Gambler Adaptive Oneside ISA ISA-V2

MNIST+Fashion

0.1 0.866 0.008 0.507 0.011 0.799 0.155 0.935 0.027 0.584 0.017 1.000 0.000 0.968 0.038 0.2 0.974 0.005 0.625 0.100 0.917 0.015 0.795 0.013 0.718 0.031 1.000 0.000 1.000 0.000 0.3 0.999 0.000 0.686 0.105 0.972 0.010 0.521 0.055 0.870 0.004 1.000 0.000 1.000 0.000

0.1 0.883 0.009 0.784 0.086 0.567 0.036 0.795 0.015 0.820 0.014 0.848 0.084 0.892 0.008 0.2 0.938 0.013 0.916 0.039 0.811 0.108 0.857 0.012 0.913 0.011 0.944 0.006 0.959 0.006 0.3 0.965 0.002 0.938 0.015 0.938 0.013 0.901 0.029 0.952 0.006 0.967 0.000 0.969 0.004

D.3 Real-World Dataset Description

Oxford realized volatility (Volatility) (Heber et al., 2009) data set contains 5-min realized volatility of 31 stock indices from 2000 to 2022 and 155107 records in total. We use past volatility and returns as features, and the task here is to predict whether the next day volatility will be higher than current one, making it a

Figure 5: Average Precision (AP) v.s Different λ - 25% Samples Size. Numerical results in Table 10.

binary classification task. We choose data from 2000 to 2020 as our training set and the rest for the testing (2020 Jan. to 2021 Oct.). This data set is used as an example to show our algorithm s possible application in selectively forecasting financial time series.

Breast ultrasound images (BUS) (Al-Dhabyani et al., 2020). BUS contains 780 gray-scale breast ultrasound images among women in ages between 25 and 75 years old. These images have average size 500 500 pixels and can be categorized into 3 classes (487 benign, 210 malign and 133 healthy). We randomly choose 80% of data as training dataset and the rest 20% for testing. We are going to use this dataset as an example to show a possible application of our algorithm in automatic diagnosis. The machine can generate diagnosis result only on selected cases and deliver unsure cases to human expert for further investigation.

Lending club(Lending Club, 2007) is a peer-to-peer lending company that matches borrowers with investors through an online platform. The lending club dataset (LC) contains loan data of its customers from 2007 to 2018. We compare different version of existing dataset of LC and remove all inconsistent and incomplete records. There are different status of loans record in this dataset, we keep 3 types of these record that consist the major part of the dataset (261442 charged off cases, 1035418 fully paid cases and 25757 late cases). We use 20% of the dataset as the testing set. This example shows our algorithm can be use to grant loan given on different risk preference.

Table 12 presents the original accuracy given by neural network on each of these 3 real-world data set. For all dataset, the neural network without using selection mechanism cannot give reliable inference. In mortgage granting, high risk like this can cause significant loss. In medical diagnosis which is healthy issue critical, a diagnosis with miss-diagnose rate as high as 15% is not acceptable. However, if we apply our selective algorithm, we can see that the risk on all dataset sharply reduced. In BUS dataset, we can even almost perfectly guarantee the diagnosis result empirically for our most confident cases. These evidence are of practical interest.

Table 11: Real-world Dataset Description Dataset Category Input Feature Num. Class Train Test Volatility Time Series 2 2 143784 9525 BUS Image 3 324 324 3 624 156 LC Tabular 1805 3 1058093 264524

D.4 Ablation Study Results

E Illustrative Example for Algorithm 1

Table 12: DNN Original Risk on Each Dataset

Volatility BUS LC

Risk 0.340 0.002 0.152 0.008 0.392 0.001

Figure 6: Ablation Study. First column shows the result of different choices of β. Second column shows different choices of T. Third column shows different choices of pre-training epochs which gives different initial model ˆf (0)

. Upper panel presents results of average precision and bottom panel presents results of selective risk.

(a) (b) (c)

(d) (e) (f)

Figure 7: Illustration of Algorithm 1 when λ = 0.4 and α = 0.1. The middle blue horizontal line is the ground truth classifier f . The vertical green line is the boundary separates ΩI and ΩU. The data generation process follows Definition 1. We use SVM with degree-2 polynomial kernel for both ˆf and ˆg. a) shows the ground truth classification labels. b) shows the region of DI and DU. c) shows the latent informative label z. d) shows the realized samples from the noisy data generation process. e) shows the classification result given by ERM. f) shows the classification result given by ISA.

Figure 8: Accuracy on recovering g (x).