# Is Out-of-Distribution Detection Learnable?

Zhen Fang1, Yixuan Li2, Jie Lu1, Jiahua Dong3,4, Bo Han5, Feng Liu1,6
1Australian Artificial Intelligence Institute, University of Technology Sydney. 2Department of Computer Sciences, University of Wisconsin-Madison. 3State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences. 4ETH Zurich, Switzerland. 5Department of Computer Science, Hong Kong Baptist University. 6School of Mathematics and Statistics, University of Melbourne.
{zhen.fang,jie.lu}@uts.edu.au, sharonli@cs.wisc.edu, dongjiahua1995@gmail.com, bhanml@comp.hkbu.edu.hk, feng.liu1@unimelb.edu.au

Abstract

Supervised learning aims to train a classifier under the assumption that training and test data come from the same distribution. To relax this assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Because OOD data are unavailable during training and can be highly diverse, good generalization ability is crucial for effective OOD detection algorithms. To study the generalization of OOD detection, in this paper we investigate the probably approximately correct (PAC) learning theory of OOD detection, which has been posed by researchers as an open problem. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under certain scenarios. Although the impossibility theorems are discouraging, we find that some of their conditions may not hold in practical scenarios. Based on this observation, we then give several necessary and sufficient conditions that characterize the learnability of OOD detection in practical scenarios. Lastly, we offer theoretical support for several representative OOD detection works based on our theory.

1 Introduction

The success of supervised learning rests on an implicit assumption that training and test data share the same distribution, i.e., the in-distribution (ID) [1, 2, 3, 4]. However, the test distribution in many real-world scenarios may violate this assumption and instead contain out-of-distribution (OOD) data whose labels have not been seen during training [5, 6]. To mitigate the risk posed by OOD data, researchers have considered a more practical learning scenario: OOD detection, which determines whether an input is ID or OOD, while classifying the ID data into their respective classes. OOD detection has shown great potential for ensuring the reliable deployment of machine learning models in the real world. A rich line of algorithms has been developed to empirically address the OOD detection problem [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. However, very few works study the theory of OOD detection, which hinders a rigorous path forward for the field. This paper aims to bridge this gap. We provide a theoretical framework to understand the learnability of the OOD detection problem. Specifically, we investigate the probably approximately correct (PAC) learning theory of OOD detection, which has remained an open problem to date. Unlike classical PAC learning theory in the supervised setting, our problem setting is fundamentally challenging due to the absence of OOD data during training.

Corresponding author. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
In many real-world scenarios, OOD data can be diverse and unknown a priori. Given this, we study whether there exists an algorithm that can detect various kinds of OOD data, rather than merely some specified OOD data. This is the significance of studying a learning theory for OOD detection [4], and it motivates our question: is OOD detection PAC learnable? That is, does a PAC learning theory exist that guarantees the generalization ability of OOD detection? To investigate this, we mainly focus on two basic spaces: the domain space and the hypothesis space. The domain space is a space consisting of distributions, and the hypothesis space is a space consisting of classifiers. Existing agnostic PAC theories in supervised learning [21, 22] are distribution-free, i.e., the domain space consists of all domains. Yet, in Theorem 4, we show that the learning theory of OOD detection is not distribution-free. In fact, we discover that OOD detection is learnable only if the domain space and the hypothesis space satisfy some special conditions, e.g., Conditions 1 and 3. Notably, there are many conditions and theorems in existing learning theories and many OOD detection algorithms in the literature. It is therefore very difficult to analyze the relation between these theories and algorithms, and to discover useful conditions that ensure the learnability of OOD detection, especially when we have to explore them from scratch. Thus, the main aim of our paper is to study these essential conditions. From these essential conditions, we can know when OOD detection can be successful in practical scenarios. We restate our question and goal as follows:

Given hypothesis spaces and several representative domain spaces, what are the conditions to ensure the learnability of OOD detection? If possible, we hope that these conditions are necessary and sufficient in some scenarios.

Main Results. We investigate the learnability of OOD detection starting from the largest space, the total space, and give a necessary condition (Condition 1) for learnability. However, we find that overlap between ID and OOD data may cause this necessary condition to fail. We therefore give an impossibility theorem demonstrating that OOD detection fails in the total space (Theorem 4). Next, we study OOD detection in the separate space, where there is no overlap between the ID and OOD data. Unfortunately, an impossibility theorem still exists (Theorem 5), which demonstrates that OOD detection is not learnable in the separate space under some conditions. Although the impossibility theorems obtained in the separate space are discouraging, we find that some of their conditions may not hold in practical scenarios. Based on this observation, we give several necessary and sufficient conditions that characterize the learnability of OOD detection in the separate space (Theorems 6 and 10). In particular, when the model is a fully-connected neural network (FCNN), OOD detection is learnable in the separate space if and only if the feature space is finite. Furthermore, we investigate the learnability of OOD detection in other, more practical domain spaces, e.g., the finite-ID-distribution space (Theorem 8) and the density-based space (Theorem 9). By studying the finite-ID-distribution space, we discover a compatibility condition (Condition 3) that is necessary and sufficient for this space.
Next, we further investigate the compatibility condition in the density-based space, and find that it is also necessary and sufficient in some practical scenarios (Theorem 11).

Implications and Impacts of Theory. Our study is not of purely theoretical interest; it also has practical impacts. First, when we design OOD detection algorithms, we normally only have finite ID datasets, corresponding to the finite-ID-distribution space. In this case, Theorem 8 gives a necessary and sufficient condition for the success of OOD detection. Second, our theory provides theoretical support (Theorems 10 and 11) for several representative OOD detection works [7, 8, 23]. Third, our theory shows that OOD detection is learnable in image-based scenarios when ID images have clearly different semantic labels and styles from OOD images (far-OOD). Fourth, we should not expect a universally working algorithm; it is necessary to design different algorithms for different scenarios.

2 Learning Setups

We start by introducing the necessary concepts and notations for our theoretical framework. Given a feature space $\mathcal{X} \subset \mathbb{R}^d$ and a label space $\mathcal{Y} := \{1, \dots, K\}$, we have an ID joint distribution $D_{X_IY_I}$ over $\mathcal{X} \times \mathcal{Y}$, where $X_I \in \mathcal{X}$ and $Y_I \in \mathcal{Y}$ are random variables. We also have an OOD joint distribution $D_{X_OY_O}$, where $X_O$ is a random variable on $\mathcal{X}$, but $Y_O$ is a random variable whose outputs do not belong to $\mathcal{Y}$. During testing, we meet a mixture of the ID and OOD joint distributions, $D_{XY} := (1-\pi_{\mathrm{out}}) D_{X_IY_I} + \pi_{\mathrm{out}} D_{X_OY_O}$, and can only observe the marginal distribution $D_X := (1-\pi_{\mathrm{out}}) D_{X_I} + \pi_{\mathrm{out}} D_{X_O}$, where the constant $\pi_{\mathrm{out}} \in [0,1)$ is an unknown class-prior probability.

Problem 1 (OOD Detection [4]). Given an ID joint distribution $D_{X_IY_I}$ and training data $S := \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$ drawn independently and identically distributed from $D_{X_IY_I}$, the aim of OOD detection is to train a classifier $f$ using the training data $S$ such that, for any test data $\mathbf{x}$ drawn from the mixed marginal distribution $D_X$: 1) if $\mathbf{x}$ is an observation from $D_{X_I}$, $f$ can classify $\mathbf{x}$ into the correct ID class; and 2) if $\mathbf{x}$ is an observation from $D_{X_O}$, $f$ can detect $\mathbf{x}$ as OOD data.

According to the survey [4], when $K > 1$, OOD detection is also known as open-set recognition or open-set learning [24, 25]; and when $K = 1$, OOD detection reduces to one-class novelty detection and semantic anomaly detection [26, 27, 28].

OOD Label and Domain Space. Based on Problem 1, it is not necessary to classify OOD data into correct OOD classes. Without loss of generality, we allocate all OOD data to one big OOD class, i.e., $Y_O = K+1$ [24, 29]. To investigate the PAC learnability of OOD detection, we define a domain space $\mathscr{D}_{XY}$, which is a set consisting of some joint distributions $D_{XY}$ mixed from ID joint distributions and OOD joint distributions. In this paper, the joint distribution $D_{XY}$ mixed from an ID joint distribution $D_{X_IY_I}$ and an OOD joint distribution $D_{X_OY_O}$ is called a domain.

Hypothesis Spaces and Scoring Function Spaces. A hypothesis space $\mathcal{H}$ is a subset of a function space, i.e., $\mathcal{H} \subset \{h: \mathcal{X} \to \mathcal{Y} \cup \{K+1\}\}$. We set $\mathcal{H}^{\mathrm{in}} \subset \{h: \mathcal{X} \to \mathcal{Y}\}$ to be the ID hypothesis space. We also define $\mathcal{H}^{\mathrm{b}} \subset \{h: \mathcal{X} \to \{1, 2\}\}$ as the hypothesis space for binary classification, where $1$ represents ID data and $2$ represents OOD data. The function $h$ is called a hypothesis function. A scoring function space is a subset of a function space, i.e., $\mathcal{F}_l \subset \{\mathbf{f}: \mathcal{X} \to \mathbb{R}^l\}$, where $l$ is the output dimension of the vector-valued function $\mathbf{f}$. The function $\mathbf{f}$ is called a scoring function.

Loss and Risks.
Let $\mathcal{Y}_{\mathrm{all}} := \mathcal{Y} \cup \{K+1\}$. Given a loss function$^2$ $\ell: \mathcal{Y}_{\mathrm{all}} \times \mathcal{Y}_{\mathrm{all}} \to \mathbb{R}_{\geq 0}$ satisfying that $\ell(y_1, y_2) = 0$ if and only if $y_1 = y_2$, and any $h \in \mathcal{H}$, the risk with respect to $D_{XY}$ is
$$R_D(h) := \mathbb{E}_{(\mathbf{x},y) \sim D_{XY}}\, \ell(h(\mathbf{x}), y). \quad (1)$$
The $\alpha$-risk is $R_D^{\alpha}(h) := (1-\alpha) R_D^{\mathrm{in}}(h) + \alpha R_D^{\mathrm{out}}(h)$ for $\alpha \in [0,1]$, where the risks $R_D^{\mathrm{in}}(h)$ and $R_D^{\mathrm{out}}(h)$ are
$$R_D^{\mathrm{in}}(h) := \mathbb{E}_{(\mathbf{x},y) \sim D_{X_IY_I}}\, \ell(h(\mathbf{x}), y), \qquad R_D^{\mathrm{out}}(h) := \mathbb{E}_{\mathbf{x} \sim D_{X_O}}\, \ell(h(\mathbf{x}), K+1).$$

Learnability. We aim to select a hypothesis function $h \in \mathcal{H}$ with approximately minimal risk, based on finite data. Generally, we expect the approximation to improve as the sample size increases. Algorithms achieving this are said to be consistent. Formally, we introduce the following definition:

Definition 1 (Learnability of OOD Detection). Given a domain space $\mathscr{D}_{XY}$ and a hypothesis space $\mathcal{H} \subset \{h: \mathcal{X} \to \mathcal{Y}_{\mathrm{all}}\}$, we say OOD detection is learnable in $\mathscr{D}_{XY}$ for $\mathcal{H}$ if there exist an algorithm$^3$ $\mathbf{A}: \cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$ and a monotonically decreasing sequence $\epsilon_{\mathrm{cons}}(n)$, such that $\epsilon_{\mathrm{cons}}(n) \to 0$ as $n \to +\infty$, and for any domain $D_{XY} \in \mathscr{D}_{XY}$,
$$\mathbb{E}_{S \sim D_{X_IY_I}^n}\big[ R_D(\mathbf{A}(S)) \big] - \inf_{h \in \mathcal{H}} R_D(h) \leq \epsilon_{\mathrm{cons}}(n). \quad (2)$$
An algorithm $\mathbf{A}$ for which this holds is said to be consistent with respect to $\mathscr{D}_{XY}$.

Definition 1 is a natural extension of the agnostic PAC learnability of supervised learning [30]. If $\pi_{\mathrm{out}} = 0$ for every $D_{XY} \in \mathscr{D}_{XY}$, then Definition 1 reduces to the agnostic PAC learnability of supervised learning. Although the form of Definition 1 differs from the usual definition of agnostic PAC learning in [21], one can easily prove that they are equivalent when $\ell$ is bounded; see Appendix D.3.

Since OOD data are unavailable, it is impossible to obtain information about the class-prior probability $\pi_{\mathrm{out}}$. Furthermore, in the real world, $\pi_{\mathrm{out}}$ can be any value in $[0,1)$. Therefore, the imbalance between ID and OOD distributions and the prior-unknown issue (i.e., $\pi_{\mathrm{out}}$ is unknown) are the core challenges. To ease these challenges, researchers use AUROC, AUPR and FPR95 to evaluate the performance of OOD detection [18, 31, 32, 33, 34, 35]. It seems that there is a gap between Definition 1 and existing works. To eliminate this gap, we revise Eq. (2) as follows:
$$\mathbb{E}_{S \sim D_{X_IY_I}^n}\big[ R_D^{\alpha}(\mathbf{A}(S)) \big] - \inf_{h \in \mathcal{H}} R_D^{\alpha}(h) \leq \epsilon_{\mathrm{cons}}(n), \quad \forall \alpha \in [0,1]. \quad (3)$$
If an algorithm $\mathbf{A}$ satisfies Eq. (3), then the imbalance issue and the prior-unknown issue disappear. That is, $\mathbf{A}$ can simultaneously classify the ID data and detect the OOD data well. Based on the above discussion, we define the strong learnability of OOD detection as follows:

$^2$Note that $\mathcal{Y}_{\mathrm{all}} \times \mathcal{Y}_{\mathrm{all}}$ is a finite set; therefore, $\ell$ is bounded.
$^3$Similar to [30], in this paper we regard an algorithm as a mapping from $\cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n$ to $\mathcal{H}$.

Definition 2 (Strong Learnability of OOD Detection). Given a domain space $\mathscr{D}_{XY}$ and a hypothesis space $\mathcal{H} \subset \{h: \mathcal{X} \to \mathcal{Y}_{\mathrm{all}}\}$, we say OOD detection is strongly learnable in $\mathscr{D}_{XY}$ for $\mathcal{H}$ if there exist an algorithm $\mathbf{A}: \cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$ and a monotonically decreasing sequence $\epsilon_{\mathrm{cons}}(n)$, such that $\epsilon_{\mathrm{cons}}(n) \to 0$ as $n \to +\infty$, and for any domain $D_{XY} \in \mathscr{D}_{XY}$,
$$\mathbb{E}_{S \sim D_{X_IY_I}^n}\big[ R_D^{\alpha}(\mathbf{A}(S)) \big] - \inf_{h \in \mathcal{H}} R_D^{\alpha}(h) \leq \epsilon_{\mathrm{cons}}(n), \quad \forall \alpha \in [0,1].$$

In Theorem 1, we show that the strong learnability of OOD detection is equivalent to the learnability of OOD detection if the domain space $\mathscr{D}_{XY}$ is a prior-unknown space (see Definition 3). In this paper, we mainly discuss learnability in prior-unknown spaces. Therefore, when we say that OOD detection is learnable, we also mean that OOD detection is strongly learnable.
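To make the risks in Eqs. (1)-(3) concrete, the following minimal Python sketch (our own illustration, not part of the paper's formal setup) estimates the empirical ID risk, OOD risk and $\alpha$-risk of a fixed classifier from samples under the zero-one loss; the arrays, the toy classifier `h`, and the threshold are illustrative assumptions.

```python
import numpy as np

def alpha_risk(h, X_in, y_in, X_out, K, alpha):
    """Monte-Carlo estimate of R^in_D(h), R^out_D(h) and the alpha-risk
    R^alpha_D(h) = (1 - alpha) * R^in_D(h) + alpha * R^out_D(h),
    using the zero-one loss. OOD data are assigned the extra label K + 1."""
    r_in = np.mean(h(X_in) != y_in)       # ID samples misclassified among the K ID classes
    r_out = np.mean(h(X_out) != K + 1)    # OOD samples not detected as class K + 1
    return (1 - alpha) * r_in + alpha * r_out

# Toy usage: a threshold "classifier" on 1-D features with K = 1 ID class.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_in = rng.normal(0.0, 1.0, size=1000)          # ID features
    y_in = np.ones(1000, dtype=int)                 # single ID class, label 1
    X_out = rng.normal(6.0, 1.0, size=1000)         # OOD features, far from the ID support
    h = lambda x: np.where(np.abs(x) < 3.0, 1, 2)   # label 2 = K + 1 plays the OOD class
    for a in (0.0, 0.5, 0.9):
        print(a, alpha_risk(h, X_in, y_in, X_out, K=1, alpha=a))
```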
Goal of Theory. Note that the agnostic PAC learnability of supervised learning is distribution-free, i.e., the domain space $\mathscr{D}_{XY}$ consists of all domains. However, due to the absence of OOD data during training [8, 14, 24], the learnability of OOD detection is clearly not distribution-free (i.e., Theorem 4). In fact, we discover that the learnability of OOD detection is deeply correlated with the relationship between the domain space $\mathscr{D}_{XY}$ and the hypothesis space $\mathcal{H}$. That is, OOD detection is learnable only when the domain space $\mathscr{D}_{XY}$ and the hypothesis space $\mathcal{H}$ satisfy some special conditions, e.g., Condition 1 and Condition 3. We present our goal as follows:

Goal: given a hypothesis space $\mathcal{H}$ and several representative domain spaces $\mathscr{D}_{XY}$, what are the conditions to ensure the learnability of OOD detection? Furthermore, if possible, we hope that these conditions are necessary and sufficient in some scenarios.

Therefore, compared to the agnostic PAC learnability of supervised learning, our theory does not focus on the distribution-free case, but on discovering essential conditions that guarantee the learnability of OOD detection in several representative and practical domain spaces $\mathscr{D}_{XY}$. From these essential conditions, we can know when OOD detection can be successful in real applications.

3 Learning in Prior-unknown Spaces

We first investigate a special space, called the prior-unknown space. In such a space, Definition 1 and Definition 2 are equivalent. Furthermore, we also prove that if OOD detection is strongly learnable in a space $\mathscr{D}_{XY}$, then one can find a larger domain space, which is prior-unknown, that ensures the learnability of OOD detection. These results imply that it suffices to develop our theory in prior-unknown spaces. The prior-unknown space is introduced as follows:

Definition 3. Given a domain space $\mathscr{D}_{XY}$, we say $\mathscr{D}_{XY}$ is a prior-unknown space if for any domain $D_{XY} \in \mathscr{D}_{XY}$ and any $\alpha \in [0,1)$, we have $D_{XY}^{\alpha} := (1-\alpha) D_{X_IY_I} + \alpha D_{X_OY_O} \in \mathscr{D}_{XY}$.

Theorem 1. Given domain spaces $\mathscr{D}_{XY}$ and $\mathscr{D}'_{XY} = \{D_{XY}^{\alpha} : D_{XY} \in \mathscr{D}_{XY}, \alpha \in [0,1)\}$, then 1) $\mathscr{D}'_{XY}$ is a prior-unknown space and $\mathscr{D}_{XY} \subset \mathscr{D}'_{XY}$; 2) if $\mathscr{D}_{XY}$ is a prior-unknown space, then Definition 1 and Definition 2 are equivalent; 3) OOD detection is strongly learnable in $\mathscr{D}_{XY}$ if and only if OOD detection is learnable in $\mathscr{D}'_{XY}$.

The second result of Theorem 1 bridges learnability and strong learnability, implying that if an algorithm $\mathbf{A}$ is consistent with respect to a prior-unknown space, then $\mathbf{A}$ can address both the imbalance between ID and OOD distributions and the prior-unknown issue well. Based on Theorem 1, we develop our theory in prior-unknown spaces. Furthermore, to demystify the learnability of OOD detection, we introduce five representative prior-unknown spaces:

- Single-distribution space $\mathscr{D}_{XY}^{D_{XY}}$: for a domain $D_{XY}$, $\mathscr{D}_{XY}^{D_{XY}} := \{D_{XY}^{\alpha} : \alpha \in [0,1)\}$.
- Total space $\mathscr{D}_{XY}^{\mathrm{all}}$, which consists of all domains.
- Separate space $\mathscr{D}_{XY}^{s}$, which consists of all domains satisfying the separate condition: for any $D_{XY} \in \mathscr{D}_{XY}^{s}$, $\mathrm{supp}\, D_{X_O} \cap \mathrm{supp}\, D_{X_I} = \emptyset$, where $\mathrm{supp}$ denotes the support set.
- Finite-ID-distribution space $\mathscr{D}_{XY}^{F}$, which is a prior-unknown space in which the number of distinct ID joint distributions $D_{X_IY_I}$ is finite, i.e., $|\{D_{X_IY_I} : D_{XY} \in \mathscr{D}_{XY}^{F}\}| < +\infty$.
- Density-based space $\mathscr{D}_{XY}^{\mu,b}$, which is a prior-unknown space consisting of domains satisfying the following: for any $D_{XY}$, there exists a density function $f$ with $1/b \leq f \leq b$ on $\mathrm{supp}\,\mu$ such that $0.5\, D_{X_I} + 0.5\, D_{X_O} = \int f \,\mathrm{d}\mu$, where $\mu$ is a measure defined over $\mathcal{X}$.

Note that if $\mu$ is discrete, then $D_X$ is a discrete distribution; and if $\mu$ is the Lebesgue measure, then $D_X$ is a continuous distribution. The above representative spaces widely exist in real applications. For example: 1) if images with different semantic labels and different styles are clearly different, then those images can form a distribution belonging to a separate space $\mathscr{D}_{XY}^{s}$; and 2) when designing an algorithm, we only have finite ID datasets, e.g., CIFAR-10, MNIST, SVHN, and ImageNet, to build a model, and the finite-ID-distribution space $\mathscr{D}_{XY}^{F}$ handles this real scenario. Note that the single-distribution space is a special case of the finite-ID-distribution space. In this paper, we mainly discuss these five spaces.

4 Impossibility Theorems for OOD Detection

In this section, we first give a necessary condition for the learnability of OOD detection. Then, we show that this necessary condition does not hold in the total space $\mathscr{D}_{XY}^{\mathrm{all}}$ and the separate space $\mathscr{D}_{XY}^{s}$.

Necessary Condition. We find a necessary condition for the learnability of OOD detection, i.e., Condition 1, motivated by the experiments in Figure 1. Details of Figure 1 can be found in Appendix C.2.

Figure 1 (panels: (a) ID and OOD Distributions; (b) Overlap Exists; (c) No Overlap; the x-axes of (b) and (c) show values of $\alpha$, with curves for $n = 15{,}000$, $n = 20{,}000$, $n = 25{,}000$ and $\inf_h R_D^{\alpha}(h)$): Illustration of $\inf_{h \in \mathcal{H}} R_D^{\alpha}(h)$ (solid lines with triangle marks) and the estimated $\mathbb{E}_{S \sim D_{\mathrm{in}}^n} R_D^{\alpha}(\mathbf{A}(S))$ (dashed lines) for $\alpha \in [0,1)$ in different scenarios, where $D_{\mathrm{in}} = D_{X_IY_I}$ and the algorithm $\mathbf{A}$ is the free-energy OOD detection method [23]. Subfigure (a) shows the ID and OOD distributions; $\mathrm{gap}_{\mathrm{IO}}$ represents the distance between the support sets of the ID and OOD distributions. In (b), since there is an overlap between the ID and OOD data, the solid line is a polyline. In (c), since there is no overlap between the ID and OOD data, one can check that $\inf_{h \in \mathcal{H}} R_D^{\alpha}(h)$ forms a straight line (the solid line). Since the dashed lines are always straight lines, two observations follow from (b) and (c): 1) the dashed lines cannot approximate the solid polyline in (b), which implies the unlearnability of OOD detection; and 2) the solid line in (c) is a straight line and may be approximated by the dashed lines in (c). These observations motivate us to propose Condition 1.

Condition 1 (Linear Condition). For any $D_{XY} \in \mathscr{D}_{XY}$ and any $\alpha \in [0,1)$,
$$\inf_{h \in \mathcal{H}} R_D^{\alpha}(h) = (1-\alpha) \inf_{h \in \mathcal{H}} R_D^{\mathrm{in}}(h) + \alpha \inf_{h \in \mathcal{H}} R_D^{\mathrm{out}}(h).$$

To reveal the importance of Condition 1, Theorem 2 shows that Condition 1 is a necessary and sufficient condition for the learnability of OOD detection when $\mathscr{D}_{XY}$ is the single-distribution space.

Theorem 2. Given a hypothesis space $\mathcal{H}$ and a domain $D_{XY}$, OOD detection is learnable in the single-distribution space $\mathscr{D}_{XY}^{D_{XY}}$ for $\mathcal{H}$ if and only if the linear condition (i.e., Condition 1) holds.

Theorem 2 implies that Condition 1 is important for the learnability of OOD detection. Due to the simplicity of the single-distribution space, Theorem 2 also implies that Condition 1 is a necessary condition for the learnability of OOD detection in any prior-unknown space; see Lemma 1 in Appendix F.
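Figure 1's experiment uses FCNNs and the free-energy method; as a self-contained illustration (not the paper's experiment), the following sketch brute-forces $\inf_{h} R_D^{\alpha}(h)$ over all hypotheses on a three-point feature space with $K = 1$ and compares it with the right-hand side of Condition 1. The specific distributions are illustrative assumptions; the output shows the linear condition failing when the ID and OOD supports overlap and holding when they are separate.

```python
import itertools
import numpy as np

# Finite feature space and a single ID class (K = 1): label 1 = ID, label 2 = OOD.
X = [0, 1, 2]
H_all = list(itertools.product([1, 2], repeat=len(X)))  # all 2^3 hypotheses h: X -> {1, 2}

def inf_alpha_risk(p_id, p_ood, alpha):
    """Brute-force inf_h R^alpha_D(h) over H_all under the zero-one loss."""
    best = np.inf
    for h in H_all:
        r_in = sum(p for x, p in p_id.items() if h[x] != 1)    # ID point misclassified
        r_out = sum(p for x, p in p_ood.items() if h[x] != 2)   # OOD point not detected
        best = min(best, (1 - alpha) * r_in + alpha * r_out)
    return best

p_id = {0: 0.5, 1: 0.5}            # ID marginal
ood_overlap = {1: 0.5, 2: 0.5}     # shares the point x = 1 with the ID support
ood_separate = {2: 1.0}            # disjoint support

for name, p_ood in [("overlap", ood_overlap), ("separate", ood_separate)]:
    alphas = np.linspace(0, 0.99, 5)
    lhs = [inf_alpha_risk(p_id, p_ood, a) for a in alphas]
    rhs = [(1 - a) * inf_alpha_risk(p_id, p_ood, 0.0)
           + a * inf_alpha_risk(p_id, p_ood, 1.0) for a in alphas]
    # Condition 1 holds iff lhs == rhs for every alpha.
    print(name, np.round(lhs, 3), np.round(rhs, 3))
```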
Impossibility Theorems. Here, we first study whether Condition 1 holds in the total space $\mathscr{D}_{XY}^{\mathrm{all}}$. If Condition 1 does not hold, then OOD detection is not learnable. Theorem 3 shows that Condition 1 is not always satisfied, especially when there is an overlap between the ID and OOD distributions:

Definition 4 (Overlap Between ID and OOD). We say a domain $D_{XY}$ has overlap between the ID and OOD distributions if there is a $\sigma$-finite measure $\mu$ such that $D_X$ is absolutely continuous with respect to $\mu$, and $\mu(A_{\mathrm{overlap}}) > 0$, where $A_{\mathrm{overlap}} = \{\mathbf{x} \in \mathcal{X} : f_I(\mathbf{x}) > 0 \text{ and } f_O(\mathbf{x}) > 0\}$. Here $f_I$ and $f_O$ are the representers of $D_{X_I}$ and $D_{X_O}$ given by the Radon-Nikodym Theorem [36]:
$$D_{X_I} = \int f_I \,\mathrm{d}\mu, \qquad D_{X_O} = \int f_O \,\mathrm{d}\mu.$$

Theorem 3. Given a hypothesis space $\mathcal{H}$ and a prior-unknown space $\mathscr{D}_{XY}$, if there is a $D_{XY} \in \mathscr{D}_{XY}$ that has overlap between ID and OOD, and $\inf_{h \in \mathcal{H}} R_D^{\mathrm{in}}(h) = 0$ and $\inf_{h \in \mathcal{H}} R_D^{\mathrm{out}}(h) = 0$, then Condition 1 does not hold. Therefore, OOD detection is not learnable in $\mathscr{D}_{XY}$ for $\mathcal{H}$.

Theorem 3 clearly shows that, under proper conditions, Condition 1 does not hold if there exists a domain whose ID and OOD distributions overlap. By Theorem 3, we obtain that OOD detection is not learnable in the total space $\mathscr{D}_{XY}^{\mathrm{all}}$ for any non-trivial hypothesis space $\mathcal{H}$.

Theorem 4 (Impossibility Theorem for Total Space). OOD detection is not learnable in the total space $\mathscr{D}_{XY}^{\mathrm{all}}$ for $\mathcal{H}$ if $|\phi \circ \mathcal{H}| > 1$, where $\phi$ maps ID labels to $1$ and OOD labels to $2$.

Since overlap between the ID and OOD distributions may cause Condition 1 to fail, we then study the learnability of OOD detection in the separate space $\mathscr{D}_{XY}^{s}$, where there is no overlap between the ID and OOD distributions. However, Theorem 5 shows that even in the separate space, OOD detection is still not learnable in some scenarios. Before introducing the impossibility theorem for the separate space, i.e., Theorem 5, we need a mild assumption:

Assumption 1 (Separate Space for OOD). A hypothesis space $\mathcal{H}$ is separate for OOD data if, for each data point $\mathbf{x} \in \mathcal{X}$, there exists at least one hypothesis function $h_{\mathbf{x}} \in \mathcal{H}$ such that $h_{\mathbf{x}}(\mathbf{x}) = K+1$.

Assumption 1 means that every data point $\mathbf{x}$ has the possibility of being detected as OOD data. Assumption 1 is mild and can be satisfied by many hypothesis spaces, e.g., the FCNN-based hypothesis space (Proposition 1 in Appendix K), the score-based hypothesis space (Proposition 2 in Appendix K) and the universal kernel space. Next, we use the Vapnik-Chervonenkis (VC) dimension [22] to measure the size of the hypothesis space, and study the learnability of OOD detection in $\mathscr{D}_{XY}^{s}$ based on the VC dimension.

Theorem 5 (Impossibility Theorem for Separate Space). If Assumption 1 holds, $\mathrm{VCdim}(\phi \circ \mathcal{H}) < +\infty$ and $\sup_{h \in \mathcal{H}} |\{\mathbf{x} \in \mathcal{X} : h(\mathbf{x}) \in \mathcal{Y}\}| = +\infty$, then OOD detection is not learnable in the separate space $\mathscr{D}_{XY}^{s}$ for $\mathcal{H}$, where $\phi$ maps ID labels to $1$ and OOD labels to $2$.

A finite VC dimension normally implies the learnability of supervised learning. However, in our results, a finite VC dimension cannot guarantee the learnability of OOD detection in the separate space, which reveals the difficulty of OOD detection. Although the above impossibility theorems are discouraging, there is still room to discuss the conditions in Theorem 5 and to find proper conditions that ensure the learnability of OOD detection in the separate space (see Sections 5 and 6).

5 When OOD Detection Can Be Successful

Here, we discuss when OOD detection can be learnable in the separate space $\mathscr{D}_{XY}^{s}$, the finite-ID-distribution space $\mathscr{D}_{XY}^{F}$ and the density-based space $\mathscr{D}_{XY}^{\mu,b}$.
We first study the separate space $\mathscr{D}_{XY}^{s}$.

OOD Detection in the Separate Space. Theorem 5 has indicated that $\mathrm{VCdim}(\phi \circ \mathcal{H}) = +\infty$ or $\sup_{h \in \mathcal{H}} |\{\mathbf{x} \in \mathcal{X} : h(\mathbf{x}) \in \mathcal{Y}\}| < +\infty$ is necessary to ensure the learnability of OOD detection in $\mathscr{D}_{XY}^{s}$ if Assumption 1 holds. However, hypothesis spaces generated by feed-forward neural networks with proper activation functions generally have finite VC dimension [37, 38]. Therefore, we study the learnability of OOD detection in the case that $|\mathcal{X}| < +\infty$, which implies that $\sup_{h \in \mathcal{H}} |\{\mathbf{x} \in \mathcal{X} : h(\mathbf{x}) \in \mathcal{Y}\}| < +\infty$. Additionally, Theorem 10 implies that $|\mathcal{X}| < +\infty$ is the necessary and sufficient condition for the learnability of OOD detection in the separate space when the hypothesis space is generated by an FCNN. Hence, $|\mathcal{X}| < +\infty$ may be necessary in the space $\mathscr{D}_{XY}^{s}$. For simplicity, we first discuss the case $K = 1$, i.e., one-class novelty detection. We show the necessary and sufficient condition for the learnability of OOD detection in $\mathscr{D}_{XY}^{s}$ when $|\mathcal{X}| < +\infty$.

Theorem 6. Let $K = 1$ and $|\mathcal{X}| < +\infty$. Suppose that Assumption 1 holds and the constant function $h^{\mathrm{in}} := 1$ is in $\mathcal{H}$. Then OOD detection is learnable in $\mathscr{D}_{XY}^{s}$ for $\mathcal{H}$ if and only if $(\mathcal{H}^{\mathrm{all}} - \{h^{\mathrm{out}}\}) \subset \mathcal{H}$, where $\mathcal{H}^{\mathrm{all}}$ is the hypothesis space consisting of all hypothesis functions, and $h^{\mathrm{out}}$ is the constant function $h^{\mathrm{out}} := 2$; here $1$ represents ID data and $2$ represents OOD data.

The condition $h^{\mathrm{in}} \in \mathcal{H}$ in Theorem 6 is mild. Many practical hypothesis spaces satisfy it, e.g., the FCNN-based hypothesis space (Proposition 1 in Appendix K), the score-based hypothesis space (Proposition 2 in Appendix K) and the universal kernel-based hypothesis space. Theorem 6 implies that if $K = 1$ and OOD detection is learnable in $\mathscr{D}_{XY}^{s}$ for $\mathcal{H}$, then the hypothesis space $\mathcal{H}$ should contain almost all hypothesis functions; that is, if OOD detection is to be learnable in the distribution-agnostic case, a large-capacity model is necessary.

Next, we extend Theorem 6 to the general case $K > 1$. When $K > 1$, we first use a binary classifier $h^{\mathrm{b}}$ to separate ID from OOD data. Then, for the ID data identified by $h^{\mathrm{b}}$, an ID hypothesis function $h^{\mathrm{in}}$ is used to classify them into the corresponding ID classes. We state this strategy as follows: given a hypothesis space $\mathcal{H}^{\mathrm{in}}$ for the ID distribution and a binary classification hypothesis space $\mathcal{H}^{\mathrm{b}}$ introduced in Section 2, we use $\mathcal{H}^{\mathrm{in}}$ and $\mathcal{H}^{\mathrm{b}}$ to construct an OOD detection hypothesis space $\mathcal{H}$, which consists of all hypothesis functions $h$ satisfying the following condition: there exist $h^{\mathrm{in}} \in \mathcal{H}^{\mathrm{in}}$ and $h^{\mathrm{b}} \in \mathcal{H}^{\mathrm{b}}$ such that for any $\mathbf{x} \in \mathcal{X}$,
$$h(\mathbf{x}) = i, \ \text{if } h^{\mathrm{in}}(\mathbf{x}) = i \text{ and } h^{\mathrm{b}}(\mathbf{x}) = 1; \quad \text{otherwise, } h(\mathbf{x}) = K+1. \quad (4)$$
We use $\mathcal{H}^{\mathrm{in}} \circ \mathcal{H}^{\mathrm{b}}$ to represent the hypothesis space consisting of all $h$ defined in Eq. (4). In addition, we need a condition on the loss function $\ell$:

Condition 2. $\ell(y_2, y_1) \leq \ell(K+1, y_1)$ for any in-distribution labels $y_1, y_2 \in \mathcal{Y}$.

Theorem 7. Let $|\mathcal{X}| < +\infty$ and $\mathcal{H} = \mathcal{H}^{\mathrm{in}} \circ \mathcal{H}^{\mathrm{b}}$. If $(\mathcal{H}^{\mathrm{all}} - \{h^{\mathrm{out}}\}) \subset \mathcal{H}^{\mathrm{b}}$ and Condition 2 holds, then OOD detection is learnable in $\mathscr{D}_{XY}^{s}$ for $\mathcal{H}$, where $\mathcal{H}^{\mathrm{all}}$ and $h^{\mathrm{out}}$ are defined in Theorem 6.
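The construction in Eq. (4) is straightforward to implement. Below is a minimal sketch, assuming numpy arrays of features and two toy classifiers `h_in` and `h_b` that are illustrative placeholders rather than models from the paper.

```python
import numpy as np

def combine(h_in, h_b, K):
    """Sketch of the hypothesis construction in Eq. (4): route an input to the
    ID classifier h_in when the binary detector h_b outputs 1 ("ID"), and to the
    extra OOD class K + 1 when h_b outputs 2 ("OOD")."""
    def h(x):
        x = np.asarray(x)
        ood_label = np.full(x.shape[0], K + 1)
        return np.where(h_b(x) == 1, h_in(x), ood_label)
    return h

# Toy usage with K = 2 ID classes on 1-D features (purely illustrative).
h_in = lambda x: np.where(x < 0.0, 1, 2)         # ID classifier over classes {1, 2}
h_b = lambda x: np.where(np.abs(x) < 3.0, 1, 2)  # detector: 1 = ID, 2 = OOD
h = combine(h_in, h_b, K=2)
print(h(np.array([-1.0, 1.0, 5.0])))             # -> [1 2 3]; 3 = K + 1 is the OOD class
```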
OOD Detection in the Finite-ID-Distribution Space. Since researchers can only collect finite ID datasets as training data when designing algorithms, it is worthwhile to study the learnability of OOD detection in the finite-ID-distribution space $\mathscr{D}_{XY}^{F}$. We first introduce two necessary concepts.

Definition 5 (ID Consistency). Given a domain space $\mathscr{D}_{XY}$, we say two domains $D_{XY} \in \mathscr{D}_{XY}$ and $D'_{XY} \in \mathscr{D}_{XY}$ are ID consistent if $D_{X_IY_I} = D'_{X_IY_I}$.

We use the notation $\sim$ to represent ID consistency, i.e., $D_{XY} \sim D'_{XY}$ if and only if $D_{XY}$ and $D'_{XY}$ are ID consistent. It is easy to check that ID consistency is an equivalence relation. Therefore, we define the set $[D_{XY}] := \{D'_{XY} \in \mathscr{D}_{XY} : D_{XY} \sim D'_{XY}\}$ as the equivalence class with respect to the space $\mathscr{D}_{XY}$.

Condition 3 (Compatibility). For any equivalence class $[D'_{XY}]$ with respect to $\mathscr{D}_{XY}$ and any $\epsilon > 0$, there exists a hypothesis function $h^{\epsilon} \in \mathcal{H}$ such that for any domain $D_{XY} \in [D'_{XY}]$,
$$h^{\epsilon} \in \Big\{h' \in \mathcal{H} : R_D^{\mathrm{out}}(h') \leq \inf_{h \in \mathcal{H}} R_D^{\mathrm{out}}(h) + \epsilon\Big\} \cap \Big\{h' \in \mathcal{H} : R_D^{\mathrm{in}}(h') \leq \inf_{h \in \mathcal{H}} R_D^{\mathrm{in}}(h) + \epsilon\Big\}.$$

In Appendix F, Lemma 2 implies that Condition 3 is a general version of Condition 1. Next, Theorem 8 indicates that Condition 3 is the necessary and sufficient condition in the space $\mathscr{D}_{XY}^{F}$.

Theorem 8. Suppose that $\mathcal{X}$ is a bounded set. OOD detection is learnable in the finite-ID-distribution space $\mathscr{D}_{XY}^{F}$ for $\mathcal{H}$ if and only if the compatibility condition (i.e., Condition 3) holds. Furthermore, the learning rate $\epsilon_{\mathrm{cons}}(n)$ can attain $O(1/\sqrt{n^{1-\theta}})$ for any $\theta \in (0,1)$.

Theorem 8 shows that, in the process of algorithm design, OOD detection cannot be successful without the compatibility condition. Theorem 8 also implies that Condition 3 is essential for the learnability of OOD detection. This motivates us to study whether OOD detection can be successful in more general spaces (e.g., the density-based space) when the compatibility condition holds.

OOD Detection in the Density-based Space. To ensure that Condition 3 holds, we consider a basic assumption in learning theory, the Realizability Assumption (see Appendix D.2): for any $D_{XY} \in \mathscr{D}_{XY}$, there exists $h^* \in \mathcal{H}$ such that $R_D(h^*) = 0$. We discover that in the density-based space $\mathscr{D}_{XY}^{\mu,b}$, the Realizability Assumption implies the compatibility condition (i.e., Condition 3). Based on this observation, we can prove the following theorem:

Theorem 9. Given a density-based space $\mathscr{D}_{XY}^{\mu,b}$, if $\mu(\mathcal{X}) < +\infty$ and the Realizability Assumption holds, then when $\mathcal{H}$ has finite Natarajan dimension [21], OOD detection is learnable in $\mathscr{D}_{XY}^{\mu,b}$ for $\mathcal{H}$. Furthermore, the learning rate $\epsilon_{\mathrm{cons}}(n)$ can attain $O(1/\sqrt{n^{1-\theta}})$ for any $\theta \in (0,1)$.

To further investigate the importance and necessity of the Realizability Assumption, Theorem 11 indicates that, in some practical scenarios, the Realizability Assumption is the necessary and sufficient condition for the learnability of OOD detection in the density-based space. Therefore, the Realizability Assumption may be indispensable for the learnability of OOD detection in some practical scenarios.

6 Connecting Theory to Practice

In Section 5, we have shown the scenarios where the OOD detection problem can be addressed in theory. In this section, we discuss how the proposed theory applies to two representative hypothesis spaces: neural-network-based hypothesis spaces and score-based hypothesis spaces.

Fully-connected Neural Networks. Given a sequence $q = (l_1, l_2, \dots, l_g)$, where $l_i$ and $g$ are positive integers and $g > 2$, we use $g$ to represent the depth of the neural network and $l_i$ to represent the width of the $i$-th layer. After the activation function $\sigma$ is selected$^4$, we obtain the architecture of the FCNN according to the sequence $q$. Let $\mathbf{f}_{\mathbf{w},\mathbf{b}}$ be the function generated by the FCNN with weights $\mathbf{w}$ and biases $\mathbf{b}$. An FCNN-based scoring function space is defined as $\mathcal{F}_q^{\sigma} := \{\mathbf{f}_{\mathbf{w},\mathbf{b}} : \text{weights } \mathbf{w}, \text{ biases } \mathbf{b}\}$.
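As a concrete illustration, here is a minimal numpy sketch (our own, not the paper's implementation) of an FCNN $\mathbf{f}_{\mathbf{w},\mathbf{b}}$ built from an architecture sequence $q$; the random weights are placeholders (a real model would be trained on $S$), and the final arg-max line anticipates the induced hypothesis defined below.

```python
import numpy as np

def make_fcnn(q, rng=None):
    """Sketch of an FCNN f_{w,b} for an architecture sequence q = (l_1, ..., l_g):
    l_1 is the input dimension, l_g the output dimension, with ReLU between layers
    (no activation after the last layer). Weights are random placeholders."""
    rng = np.random.default_rng(0) if rng is None else rng
    params = [(rng.normal(scale=0.1, size=(q[i], q[i + 1])), np.zeros(q[i + 1]))
              for i in range(len(q) - 1)]

    def f(x):
        z = np.atleast_2d(x)
        for i, (w, b) in enumerate(params):
            z = z @ w + b
            if i < len(params) - 1:
                z = np.maximum(z, 0.0)      # ReLU activation sigma
        return z                            # shape (n, l_g): scoring function output
    return f

# Architecture q = (d, 64, 64, K + 1) with d = 2 and K = 2 ID classes.
f = make_fcnn((2, 64, 64, 3))
scores = f(np.array([[0.5, -1.2]]))
h = np.argmax(scores, axis=1) + 1           # induced hypothesis over {1, ..., K + 1}
print(scores.shape, h)
```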
In addition, for simplicity, given any two sequences $q = (l_1, \dots, l_g)$ and $q' = (l'_1, \dots, l'_{g'})$, we use the notation $q \preceq q'$ to represent the following equations and inequalities: 1) $g \leq g'$, $l_1 = l'_1$, $l_g = l'_{g'}$; 2) $l_i \leq l'_i$ for $i = 1, \dots, g-1$; and 3) $l_{g-1} \leq l'_i$ for $i = g, \dots, g'-1$. In Appendix L, Lemma 10 shows that $q \preceq q'$ implies $\mathcal{F}_q^{\sigma} \subset \mathcal{F}_{q'}^{\sigma}$. We use $\preceq$ to compare the sizes of FCNNs.

FCNN-based Hypothesis Space. Let $l_g = K+1$. The FCNN-based scoring function space $\mathcal{F}_q^{\sigma}$ induces an FCNN-based hypothesis space. For any $\mathbf{f}_{\mathbf{w},\mathbf{b}} \in \mathcal{F}_q^{\sigma}$, the induced hypothesis function is
$$h_{\mathbf{w},\mathbf{b}} := \arg\max_{k \in \{1,\dots,K+1\}} f^k_{\mathbf{w},\mathbf{b}},$$
where $f^k_{\mathbf{w},\mathbf{b}}$ is the $k$-th coordinate of $\mathbf{f}_{\mathbf{w},\mathbf{b}}$. The FCNN-based hypothesis space is then defined as $\mathcal{H}_q^{\sigma} := \{h_{\mathbf{w},\mathbf{b}} : \text{weights } \mathbf{w}, \text{ biases } \mathbf{b}\}$.

Score-based Hypothesis Space. Many OOD detection algorithms detect OOD data using a score-based strategy. That is, given a threshold $\lambda$, a scoring function space $\mathcal{F}_l \subset \{\mathbf{f}: \mathcal{X} \to \mathbb{R}^l\}$ and a scoring function $E$ applied to the outputs of $\mathcal{F}_l$, $\mathbf{x}$ is regarded as ID data if and only if $E(\mathbf{f}(\mathbf{x})) \geq \lambda$. We introduce several representative scoring functions $E$: for any $\mathbf{f} = [f^1, \dots, f^l] \in \mathcal{F}_l$, the softmax-based function [7] and the temperature-scaled function [8], with $\lambda \in (\tfrac{1}{l}, 1)$ and $T > 0$,
$$E(\mathbf{f}) = \max_{k \in \{1,\dots,l\}} \frac{\exp(f^k)}{\sum_{c=1}^{l} \exp(f^c)}, \qquad E(\mathbf{f}) = \max_{k \in \{1,\dots,l\}} \frac{\exp(f^k/T)}{\sum_{c=1}^{l} \exp(f^c/T)}; \quad (5)$$
and the energy-based function [23], with $\lambda \in (0, +\infty)$ and $T > 0$,
$$E(\mathbf{f}) = T \log \sum_{c=1}^{l} \exp(f^c/T). \quad (6)$$
Using $E$, $\lambda$ and $\mathbf{f} \in \mathcal{F}_q^{\sigma}$, we obtain a classifier: $h^{\lambda}_{\mathbf{f},E}(\mathbf{x}) = 1$ if $E(\mathbf{f}(\mathbf{x})) \geq \lambda$; otherwise, $h^{\lambda}_{\mathbf{f},E}(\mathbf{x}) = 2$, where $1$ represents ID data and $2$ represents OOD data. Hence, a binary classification hypothesis space $\mathcal{H}^{\mathrm{b}}$, consisting of all $h^{\lambda}_{\mathbf{f},E}$, is generated. We define $\mathcal{H}_{q,E}^{\sigma,\lambda} := \{h^{\lambda}_{\mathbf{f},E} : \mathbf{f} \in \mathcal{F}_q^{\sigma}\}$.

Learnability of OOD Detection in Different Hypothesis Spaces. Next, we present applications of our theory to the two practical and important hypothesis spaces $\mathcal{H}_q^{\sigma}$ and $\mathcal{H}_{q,E}^{\sigma,\lambda}$.

Theorem 10. Suppose that Condition 2 holds and the hypothesis space $\mathcal{H}$ is FCNN-based or score-based, i.e., $\mathcal{H} = \mathcal{H}_q^{\sigma}$ or $\mathcal{H} = \mathcal{H}^{\mathrm{in}} \circ \mathcal{H}^{\mathrm{b}}$, where $\mathcal{H}^{\mathrm{in}}$ is an ID hypothesis space, $\mathcal{H}^{\mathrm{b}} = \mathcal{H}_{q,E}^{\sigma,\lambda}$, $\mathcal{H}^{\mathrm{in}} \circ \mathcal{H}^{\mathrm{b}}$ is introduced below Eq. (4), and $E$ is given by Eqs. (5) or (6). Then:
- There is a sequence $q = (l_1, \dots, l_g)$ such that OOD detection is learnable in the separate space $\mathscr{D}_{XY}^{s}$ for $\mathcal{H}$ if and only if $|\mathcal{X}| < +\infty$.
- Furthermore, if $|\mathcal{X}| < +\infty$, then there exists a sequence $q = (l_1, \dots, l_g)$ such that, for any sequence $q'$ satisfying $q \preceq q'$, OOD detection is learnable in $\mathscr{D}_{XY}^{s}$ for the hypothesis space $\mathcal{H}$ generated by $q'$.

Theorem 10 states that 1) when the hypothesis space is FCNN-based or score-based, a finite feature space is the necessary and sufficient condition for the learnability of OOD detection in the separate space; and 2) a larger FCNN architecture is more likely to achieve the learnability of OOD detection in the separate space. Note that when we select Eqs. (5) or (6) as the scoring function $E$, Theorem 10 also shows that these scoring functions can guarantee the learnability of OOD detection, which provides theoretical support for the representative works [8, 23, 7]. Furthermore, Theorem 11 also offers theoretical support for these works in the density-based space when $K = 1$.

$^4$We consider the rectified linear unit (ReLU) function as the default activation function $\sigma$, defined by $\sigma(x) = \max\{x, 0\}$ for $x \in \mathbb{R}$. We will not repeatedly mention the definition of $\sigma$ in the rest of the paper.
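For concreteness, the following numpy sketch (our own illustration, not the authors' implementation) computes the scoring functions of Eqs. (5) and (6) and the induced threshold detector $h^{\lambda}_{\mathbf{f},E}$; the logits and thresholds are illustrative assumptions.

```python
import numpy as np

def msp_score(f, T=1.0):
    """Maximum softmax probability score (Eq. (5)); T > 1 gives the
    temperature-scaled variant. f: array of shape (n, l) of logits f(x)."""
    z = f / T
    z = z - z.max(axis=1, keepdims=True)             # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def energy_score(f, T=1.0):
    """Energy score T * log sum_c exp(f^c / T) (Eq. (6)), via log-sum-exp."""
    z = f / T
    m = z.max(axis=1, keepdims=True)
    return T * (m.squeeze(1) + np.log(np.exp(z - m).sum(axis=1)))

def h_lambda(f, E, lam, **kw):
    """Score-based detector h^lambda_{f,E}: 1 (ID) if E(f(x)) >= lambda, else 2 (OOD)."""
    return np.where(E(f, **kw) >= lam, 1, 2)

# Toy usage on random "logits" with l = K + 1 = 3 outputs.
logits = np.random.default_rng(0).normal(size=(4, 3))
print(h_lambda(logits, msp_score, lam=0.7))
print(h_lambda(logits, energy_score, lam=1.0, T=1.0))
```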
Theorem 11. Suppose that each domain $D_{XY}$ in $\mathscr{D}_{XY}^{\mu,b}$ is attainable, i.e., $\arg\min_{h \in \mathcal{H}} R_D(h) \neq \emptyset$ (finite discrete domains satisfy this). Let $K = 1$ and let the hypothesis space $\mathcal{H}$ be score-based ($\mathcal{H} = \mathcal{H}_{q,E}^{\sigma,\lambda}$, where $E$ is given by Eqs. (5) or (6)) or FCNN-based ($\mathcal{H} = \mathcal{H}_q^{\sigma}$). If $\mu(\mathcal{X}) < +\infty$, then the following four statements are equivalent: learnability in $\mathscr{D}_{XY}^{\mu,b}$ for $\mathcal{H}$; Condition 1; the Realizability Assumption; Condition 3.

Theorem 11 still holds if the function space $\mathcal{F}_q^{\sigma}$ is generated by a convolutional neural network.

Overlap and Benefits of the Multi-class Case. We investigate what happens, when the hypothesis space is FCNN-based or score-based, if there exists an overlap between the ID and OOD distributions.

Theorem 12. Let $K = 1$ and let the hypothesis space $\mathcal{H}$ be score-based ($\mathcal{H} = \mathcal{H}_{q,E}^{\sigma,\lambda}$, where $E$ is given by Eqs. (5) or (6)) or FCNN-based ($\mathcal{H} = \mathcal{H}_q^{\sigma}$). Given a prior-unknown space $\mathscr{D}_{XY}$, if there exists a domain $D_{XY} \in \mathscr{D}_{XY}$ that has an overlap between the ID and OOD distributions (see Definition 4), then OOD detection is not learnable in the domain space $\mathscr{D}_{XY}$ for $\mathcal{H}$.

When $K = 1$ and the hypothesis space is FCNN-based or score-based, Theorem 12 shows that overlap between the ID and OOD distributions is a sufficient condition for the unlearnability of OOD detection. Theorem 12 takes root in the conditions $\inf_{h \in \mathcal{H}} R_D^{\mathrm{in}}(h) = 0$ and $\inf_{h \in \mathcal{H}} R_D^{\mathrm{out}}(h) = 0$. However, when $K > 1$, we can ensure $\inf_{h \in \mathcal{H}} R_D^{\mathrm{in}}(h) > 0$ if the ID distribution $D_{X_IY_I}$ has overlap between ID classes. Based on this observation, we conjecture that when $K > 1$, OOD detection is learnable in some special cases where overlap exists, even if the hypothesis space is FCNN-based or score-based.

7 Discussion

Understanding Far-OOD Detection. Many existing works [7, 39] study the far-OOD detection problem. Existing benchmarks include 1) MNIST [40] as the ID dataset, and Texture [41], CIFAR-10 [42] or Places365 [43] as OOD datasets; and 2) CIFAR-10 [42] as the ID dataset, and MNIST [40] or Fashion-MNIST [43] as OOD datasets. In the far-OOD case, the ID and OOD datasets have different semantic labels and different styles. From the theoretical view, we can define far-OOD detection tasks as follows: for $\tau > 0$, a domain space $\mathscr{D}_{XY}$ is $\tau$-far-OOD if for any domain $D_{XY} \in \mathscr{D}_{XY}$, $\mathrm{dist}(\mathrm{supp}\, D_{X_O}, \mathrm{supp}\, D_{X_I}) > \tau$. Theorems 7, 8 and 10 imply that, under an appropriate hypothesis space, $\tau$-far-OOD detection is learnable. In Theorem 7, the condition $|\mathcal{X}| < +\infty$ is necessary for the separate space. However, one can prove that in the far-OOD case, when $\mathcal{H}^{\mathrm{in}}$ is agnostic PAC learnable for the ID distribution, the results in Theorem 7 still hold if the condition $|\mathcal{X}| < +\infty$ is replaced by the weaker condition that $\mathcal{X}$ is compact. In addition, it is notable that when $\mathcal{H}^{\mathrm{in}}$ is agnostic PAC learnable for the ID distribution and $\mathcal{X}$ is compact, the KNN-based OOD detection algorithm [44] is consistent in the $\tau$-far-OOD case.
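Since the KNN-based detector [44] is cited above as consistent in the $\tau$-far-OOD case, here is a minimal distance-based sketch in that spirit (our own simplification, not the published algorithm's configuration); the value of k, the threshold and the synthetic features are illustrative assumptions.

```python
import numpy as np

def knn_ood_detector(train_feats, k=5, lam=2.5):
    """Sketch of a KNN-style detector: score a test point by the distance to its
    k-th nearest ID training feature and flag it as OOD (label 2) when that
    distance exceeds a threshold lam. Feature extraction is omitted."""
    def h(x):
        x = np.atleast_2d(x)
        d = np.linalg.norm(x[:, None, :] - train_feats[None, :, :], axis=-1)
        kth = np.sort(d, axis=1)[:, k - 1]      # distance to the k-th nearest ID point
        return np.where(kth <= lam, 1, 2)       # 1 = ID, 2 = OOD
    return h

rng = np.random.default_rng(0)
id_feats = rng.normal(0.0, 1.0, size=(500, 8))  # ID features (e.g., penultimate layer)
h = knn_ood_detector(id_feats, k=5, lam=2.5)
print(h(rng.normal(0.0, 1.0, size=(3, 8))))     # near the ID support -> mostly 1
print(h(rng.normal(8.0, 1.0, size=(3, 8))))     # far from the ID support -> 2
```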
Understanding Near-OOD Detection. When the ID and OOD datasets have similar semantics or styles, OOD detection becomes more challenging. [45, 46] consider this issue and name it near-OOD detection. Existing benchmarks include 1) MNIST [40] as the ID dataset, and Fashion-MNIST [43] or Not-MNIST [47] as OOD datasets; and 2) CIFAR-10 [42] as the ID dataset, and CIFAR-100 [48] as the OOD dataset. From the theoretical view, some near-OOD tasks may imply the overlap condition, i.e., Definition 4. Therefore, Theorems 3 and 12 imply that near-OOD detection may not be learnable. Developing a theory to understand the feasibility of near-OOD detection is still an open question.

Understanding One-class Novelty Detection. In one-class novelty detection and semantic anomaly detection (i.e., $K = 1$), Theorem 6 reveals that it is necessary to use a large-capacity model to ensure good generalization in the separate space. Theorems 3 and 12 suggest that we should try to avoid overlap between the ID and OOD distributions in the one-class case. If the overlap cannot be avoided, we suggest considering multi-class OOD detection instead of the one-class case. Additionally, Theorem 11 shows that it is necessary to select a suitable hypothesis space satisfying the Realizability Assumption to ensure the learnability of OOD detection in the density-based space. Generally, a large-capacity model is helpful for guaranteeing that the Realizability Assumption holds.

8 Related Work

We briefly review the related theoretical works below. See Appendix A for a detailed discussion.

OOD Detection Theory. [49] understands OOD detection via goodness-of-fit tests and the typical set hypothesis, and argues that minimal density estimation errors can lead to OOD detection failures without assuming an overlap between ID and OOD distributions. Beyond [49], [50] paves a new avenue toward designing provable OOD detection algorithms. Compared to [50, 49], our theory focuses on the PAC learnability of OOD detection and identifies several necessary and sufficient conditions for its learnability, opening a door to studying OOD detection in theory.

Open-set Learning Theory. [51] and [29, 52] propose agnostic PAC learning bounds for open-set detection and open-set domain adaptation, respectively. Unfortunately, [29, 51, 52] all require access to test data during the training process. To investigate open-set learning (OSL) without accessing test data during training, [24] proposes and investigates an almost agnostic PAC learnability for OSL. However, the assumptions used in [24] are very strong and impractical.

Learning Theory for Classification with Reject Option. Many works [53, 54] also investigate the classification with reject option (CwRO) problem, which is similar to OOD detection in some cases. [55, 56, 57, 58, 59] study the learning theory and propose PAC learning bounds for CwRO. However, compared to our work on OOD detection, existing CwRO theories mainly focus on how the ID risk $R_D^{\mathrm{in}}$ (i.e., the risk of ID data being wrongly classified) is influenced by special rejection rules. Our theory not only considers the ID risk, but also pays attention to the OOD risk.

Robust Statistics. In the field of robust statistics [60], researchers aim to propose estimators and testers that can mitigate the negative effects of outliers (which are similar to OOD data). The proposed estimators are supposed to be independent of the potentially high dimensionality of the data [61, 62, 63]. Existing works [64, 65, 66] in this field have identified and resolved the statistical limits of outlier-robust statistics by constructing estimators and proving impossibility results. Studying the robustness of OOD detection based on robust statistics is a promising and interesting research direction for the future.

PQ Learning Theory. Under some conditions, PQ learning theory [67, 68] can be regarded as a PAC theory for OOD detection in the semi-supervised or transductive learning cases, i.e., test data are required during training. Besides, [67, 68] aim to give the PAC estimation under the Realizability Assumption [21].
Our theory studies not only the PAC estimation in the realizable case, but also other cases, which are more difficult than PAC theory under the Realizability Assumption.

9 Conclusions and Future Works

Detecting OOD data has shown its significance in improving the reliability of machine learning. However, very few works discuss OOD detection in theory, which hinders real-world applications of OOD detection algorithms. In this paper, we are the first to provide a PAC theory for OOD detection. Our results imply that we cannot expect a universally consistent algorithm to handle all scenarios in OOD detection. Yet, it is still possible to make OOD detection learnable in certain scenarios. For example, when we design OOD detection algorithms, we normally only have finite ID datasets. In this real scenario, Theorem 8 provides a necessary and sufficient condition for the success of OOD detection. Our theory reveals many necessary and sufficient conditions for the learnability of OOD detection, hence opening a door to studying the learnability of OOD detection. In the future, we will focus on studying the robustness of OOD detection based on robust statistics [64, 69].

Acknowledgment

JL and ZF were supported by the Australian Research Council (ARC) under FL190100149. YL is supported by the AFOSR Young Investigator Program Award. BH was supported by the RGC Early Career Scheme No. 22200720 and NSFC Young Scientists Fund No. 62006202. ZF would also like to thank Prof. Peter Bartlett and Dr. Tongliang Liu for productive discussions.

References

[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[2] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[3] Yen-Chang Hsu, Yilin Shen, Hongxia Jin, and Zsolt Kira. Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data. In CVPR, 2020.
[4] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. CoRR, abs/2110.11334, 2021.
[5] Abhijit Bendale and Terrance E. Boult. Towards open set deep networks. In CVPR, 2016.
[6] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. ATOM: Robustifying out-of-distribution detection using outlier mining. ECML, 2021.
[7] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
[8] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.
[9] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018.
[10] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Dae-ki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In ICLR, 2018.
[11] Stanislav Pidhorskyi, Ranya Almohsen, and Gianfranco Doretto. Generative probabilistic novelty detection with adversarial autoencoders. In NeurIPS, 2018.
[12] Eric T. Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Görür, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? In ICLR, 2019.
[13] Dan Hendrycks, Mantas Mazeika, and Thomas G. Dietterich. Deep anomaly detection with outlier exposure. In ICLR, 2019.
[14] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In NeurIPS, 2019.
[15] Ziqian Lin, Sreya Dutta Roy, and Yixuan Li. MOOD: Multi-level out-of-distribution detection. In CVPR, 2021.
[16] Mohammadreza Salehi, Hossein Mirzaei, Dan Hendrycks, Yixuan Li, Mohammad Hossein Rohban, and Mohammad Sabokrou. A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges. arXiv preprint arXiv:2110.14051, 2021.
[17] Yiyou Sun, Chuan Guo, and Yixuan Li. ReAct: Out-of-distribution detection with rectified activations. In NeurIPS, 2021.
[18] Rui Huang, Andrew Geng, and Yixuan Li. On the importance of gradients for detecting distributional shifts in the wild. In NeurIPS, 2021.
[19] Stanislav Fort, Jie Ren, and Balaji Lakshminarayanan. Exploring the limits of out-of-distribution detection. In NeurIPS, 2021.
[20] Yifei Ming, Hang Yin, and Yixuan Li. On the impact of spurious correlation for out-of-distribution detection. AAAI, 2022.
[21] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
[22] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2018.
[23] Weitang Liu, Xiaoyun Wang, John D. Owens, and Yixuan Li. Energy-based out-of-distribution detection. In NeurIPS, 2020.
[24] Zhen Fang, Jie Lu, Anjin Liu, Feng Liu, and Guangquan Zhang. Learning bounds for open-set learning. In ICML, 2021.
[25] Guangyao Chen, Peixi Peng, Xiangqian Wang, and Yonghong Tian. Adversarial reciprocal points learning for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[26] Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert A. Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In ICML, 2018.
[27] Sachin Goyal, Aditi Raghunathan, Moksh Jain, Harsha Vardhan Simhadri, and Prateek Jain. DROCC: Deep robust one-class classification. In ICML, 2020.
[28] Lucas Deecke, Robert A. Vandermeulen, Lukas Ruff, Stephan Mandt, and Marius Kloft. Image anomaly detection with generative adversarial networks. In ECML, 2018.
[29] Z. Fang, Jie Lu, F. Liu, Junyu Xuan, and G. Zhang. Open set domain adaptation: Theoretical bound and algorithm. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[30] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635-2670, 2010.
[31] Guangyao Chen, Limeng Qiao, Yemin Shi, Peixi Peng, Jia Li, Tiejun Huang, Shiliang Pu, and Yonghong Tian. Learning open set network with discriminative reciprocal points. ICCV, 2020.
[32] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. Informative outlier matters: Robustifying out-of-distribution detection using outlier mining. ICML Workshop, 2020.
[33] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. Robust out-of-distribution detection for neural networks. arXiv preprint arXiv:2003.09711, 2020.
[34] Wentao Bao, Qi Yu, and Yu Kong. Evidential deep learning for open set action recognition. ICCV, 2021.
[35] Wentao Bao, Qi Yu, and Yu Kong. OpenTAL: Towards open set temporal action localization. CVPR, 2022.
[36] Donald L. Cohn. Measure theory. Springer, 2013.
[37] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1-17, 2019.
[38] Marek Karpinski and Angus Macintyre. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. J. Comput. Syst. Sci., 54(1):169-176, 1997.
[39] Jingkang Yang, Kaiyang Zhou, and Ziwei Liu. Full-spectrum out-of-distribution detection. CoRR, 2022.
[40] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag., 2012.
[41] Gustaf Kylberg. Kylberg texture dataset v. 1.0. 2011.
[42] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on CIFAR-10. Technical report, Citeseer, 2009.
[43] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2018.
[44] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In ICML, 2022.
[45] Jie Ren, Stanislav Fort, Jeremiah Liu, Abhijit Guha Roy, Shreyas Padhy, and Balaji Lakshminarayanan. A simple fix to Mahalanobis distance for improving near-OOD detection. CoRR, abs/2106.09022, 2021.
[46] Stanislav Fort, Jie Ren, and Balaji Lakshminarayanan. Exploring the limits of out-of-distribution detection. In NeurIPS, 2021.
[47] Yaroslav Bulatov. NotMNIST dataset. Google (Books/OCR), Tech. Rep. [Online]. Available: http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html, 2011.
[48] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 and CIFAR-100 datasets. 2009.
[49] Lily H. Zhang, Mark Goldstein, and Rajesh Ranganath. Understanding failures in out-of-distribution detection with deep generative models. In ICML, 2021.
[50] Peyman Morteza and Yixuan Li. Provable guarantees for understanding out-of-distribution detection. AAAI, 2022.
[51] Si Liu, Risheek Garrepalli, Thomas G. Dietterich, Alan Fern, and Dan Hendrycks. Open category detection with PAC guarantees. In ICML, 2018.
[52] Yadan Luo, Zijian Wang, Zi Huang, and Mahsa Baktashmotlagh. Progressive graph learning for open-set domain adaptation. In ICML, 2020.
[53] C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 1970.
[54] Vojtech Franc, Daniel Průša, and V. Voracek. Optimal strategies for reject option classifiers. CoRR, abs/2101.12523, 2021.
[55] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In ALT, 2016.
[56] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Boosting with abstention. In NeurIPS, 2016.
[57] Chenri Ni, Nontawat Charoenphakdee, Junya Honda, and Masashi Sugiyama. On the calibration of multiclass classification with rejection. In NeurIPS, 2019.
[58] Nontawat Charoenphakdee, Zhenghang Cui, Yivan Zhang, and Masashi Sugiyama. Classification with rejection based on cost-sensitive classification. In ICML, 2021.
[59] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 2008.
[60] Peter J. Rousseeuw, Frank R. Hampel, Elvezio M. Ronchetti, and Werner A. Stahel. Robust statistics: the approach based on influence functions. John Wiley & Sons, 2011.
[61] Elvezio M. Ronchetti and Peter J. Huber. Robust statistics. John Wiley & Sons, 2009.
[62] Ilias Diakonikolas, Daniel M.
Kane, and Ankit Pensia. Outlier robust mean estimation with subgaussian rates via stability. In NeurIPS, 2020.
[63] Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Eric Price, and Alistair Stewart. Outlier-robust high-dimensional sparse estimation via iterative filtering. In NeurIPS, 2019.
[64] Ilias Diakonikolas, Daniel M. Kane, Alistair Stewart, and Yuxin Sun. Outlier-robust learning of Ising models under Dobrushin's condition. In COLT, 2021.
[65] Yu Cheng, Ilias Diakonikolas, Daniel M. Kane, Rong Ge, Shivam Gupta, and Mahdi Soltanolkotabi. Outlier-robust sparse estimation via non-convex optimization. In NeurIPS, 2021.
[66] Ilias Diakonikolas, Daniel M. Kane, Jasper C.H. Lee, and Ankit Pensia. Outlier-robust sparse mean estimation for heavy-tailed distributions. In NeurIPS, 2022.
[67] Shafi Goldwasser, Adam Tauman Kalai, Yael Kalai, and Omar Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. In NeurIPS, 2020.
[68] Adam Tauman Kalai and Varun Kanade. Efficient learning with arbitrary covariate shift. In ALT, Proceedings of Machine Learning Research, 2021.
[69] Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. A shorter version appears as an invited book chapter in Beyond the Worst-Case Analysis of Algorithms, 2020.
[70] Akshay Raj Dhamija, Manuel Günther, and Terrance E. Boult. Reducing network agnostophobia. In NeurIPS, pages 9175-9186, 2018.
[71] Haoran Wang, Weitang Liu, Alex Bocchieri, and Yixuan Li. Can multi-label classification networks know what they don't know? In NeurIPS, 2021.
[72] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, 2018.
[73] Zhisheng Xiao, Qing Yan, and Yali Amit. Likelihood regret: An out-of-distribution detection score for variational auto-encoder. In NeurIPS, 2020.
[74] Alireza Zaeemzadeh, Niccolò Bisagno, Zeno Sambugaro, Nicola Conci, Nazanin Rahnavard, and Mubarak Shah. Out-of-distribution detection using union of 1-dimensional subspaces. In CVPR, 2021.
[75] Joost Van Amersfoort, Lewis Smith, Yee Whye Teh, and Yarin Gal. Uncertainty estimation using a single deep deterministic neural network. In ICML, 2020.
[76] Sachin Vernekar, Ashish Gaurav, Vahdat Abdelzad, Taylor Denouden, Rick Salay, and Krzysztof Czarnecki. Out-of-distribution detection in classifiers via generation. In NeurIPS Workshop, 2019.
[77] Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
[78] Takashi Ishida, Gang Niu, and Masashi Sugiyama. Binary classification from positive-confidence data. In NeurIPS, 2018.
[79] Shuo Chen, Gang Niu, Chen Gong, Jun Li, Jian Yang, and Masashi Sugiyama. Large-margin contrastive learning with distance polarization regularizer. In ICML, 2021.
[80] Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, and Xiaowei Xu. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In CVPR, 2020.
[81] Zhen Fang, Jie Lu, Feng Liu, and Guangquan Zhang. Semi-supervised heterogeneous domain adaptation: Theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[82] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[83] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 2012.
[84] Itay Safran and Ohad Shamir. Depth-width tradeoffs in approximating natural functions with neural networks. In ICML, 2017.
[85] Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195, 1999.
[86] Peter L. Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, 2003.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Appendix B.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Appendix B.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]