# on_the_learnability_of_outofdistribution_detection__c8e73b32.pdf

Journal of Machine Learning Research 25 (2024) 1-83 Submitted 9/23; Published 4/24

On the Learnability of Out-of-distribution Detection

Zhen Fang zhen.fang@uts.edu.au Australian Artiﬁcial Intelligence Institute University of Technology Sydney 61 Broadway, Ultimo NSW 2007, Australia

Yixuan Li sharonli@cs.wisc.edu Department of Computer Sciences The University of Wisconsin Madison 1210 W Dayton St, Madison, WI 53706, USA

Feng Liu feng.liu1@unimelb.edu.au School of Computing and Information Systems The University of Melbourne 700 Swanston Street, Carlton VIC 3053, Australia

Bo Han bhanml@comp.hkbu.edu.hk Department of Computer Science Hong Kong Baptist University Kowloon Tong, Hong Kong SAR

Jie Lu jie.lu@uts.edu.au Australian Artiﬁcial Intelligence Institute University of Technology Sydney 61 Broadway, Ultimo NSW 2007, Australia

Editor: Amos Storkey

Supervised learning aims to train a classiﬁer under the assumption that training and test data are from the same distribution. To ease the above assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for eﬀective OOD detection algorithms, and corresponding learning theory is still an open problem. To study the generalization of OOD detection, this paper investigates the probably approximately correct (PAC) learning theory of OOD detection that ﬁts the commonly used evaluation metrics in the literature. First, we ﬁnd a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we ﬁnd that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and suﬃcient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we oﬀer theoretical support for representative OOD detection works based on our OOD theory.

Keywords: out-of-distribution detection, weakly supervised learning, learnability

2024 Zhen Fang, Yixuan Li, Feng Liu, Bo Han, Jie Lu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1257.html.

Fang, Li, Liu, Han, Lu

1. Introduction

The success of supervised learning is established on an in-distribution (ID) assumption that training and test data share the same distribution (Dosovitskiy et al., 2021; Huang et al., 2017; Hsu et al., 2020; Yang et al., 2021). However, in many real-world scenarios, the distribution of test data violates the assumption and, instead, contains out-of-distribution (OOD) data whose labels have not been seen during the training process (Bendale and Boult, 2016; Chen et al., 2021a). To mitigate the risk brought by OOD data, a more practical learning scenario is considered in the machine learning ﬁeld: OOD detection, which determines whether an input is ID/OOD, while classifying the ID data into respective classes.

OOD detection can signiﬁcantly increase the reliability of machine learning models when deploying them in the real world. Many seminar algorithms have been developed to empirically address the OOD detection problem (Hendrycks and Gimpel, 2017; Liang et al., 2018; Lee et al., 2018; Zong et al., 2018; Pidhorskyi et al., 2018; Nalisnick et al., 2019; Hendrycks et al., 2019; Ren et al., 2019; Lin et al., 2021; Salehi et al., 2021; Sun et al., 2021). A common solution paradigm to OOD detection is to propose a new learning objective or/and a score function to identify if one upcoming data point is OOD data. When evaluating algorithms under this solution paradigm, both threshold-dependent metrics (e.g., risk) and threshold-independent metrics (e.g., AUC) will be used to see to what extent the algorithms can successfully identify OOD data. However, very few works study theory of OOD detection, which hinders the rigorous path forward for the ﬁeld. This paper aims to bridge the gap.

In this paper, a theoretical framework is proposed to understand the learnability of OOD detection problem in view of threshold-dependent metrics and threshold-independent metrics1. We investigate the probably approximately correct (PAC) learning theory of OOD detection when the evaluation metrics are risk and AUC, which is posed as an open problem to date. Unlike the classical PAC learning theory in a supervised setting, our problem setting is fundamentally challenging due to the absence of OOD data in training. Because OOD data can be diverse in many real-world scenarios, we want to study whether there exists an algorithm that can be used to detect data from various OOD distributions instead of merely data from some speciﬁed OOD distributions. Such is the signiﬁcance of studying the learning theory for OOD detection (Yang et al., 2021). This motivates our question: is OOD detection PAC learnable? i.e., is there the PAC learning theory to guarantee the generalization ability of OOD detection under two common metrics: risk and AUC?

To answer the above research question and investigate the learning theory, we mainly focus on two basic spaces: domain space and function space. The domain space is a space consisting of some distributions, and the function space is a space consisting of some classiﬁers or ranking functions. Existing agnostic PAC theories in supervised learning (Shalev-Shwartz and Ben-David, 2014; Mohri et al., 2018) are distribution-free, i.e., the domain space consists of all domains. Yet, in Theorem 5 and Theorem 8, we show that the learning theory of OOD detection is not distribution-free. Furthermore, we ﬁnd that OOD detection is learnable only if the domain space and the function space satisfy some special conditions, e.g., Conditions

1. This paper is an extended version of our previous conference paper (Fang et al., 2022a). In Section 7, we discuss the main diﬀerence between this paper and Fang et al. (2022a).

On the Learnability of Out-of-distribution Detection

1, 2, and 4. Notably, there are many conditions and theorems in existing learning theories and many OOD detection algorithms in the literature. Thus, it is very diﬃcult to analyze the relation between these theories and algorithms, and explore useful conditions to ensure the learnability of OOD detection, especially when we have to explore them from the scratch. Thus, the main aim of our paper is to study these essential conditions under risk and AUC metrics. From these essential conditions, we can know when OOD detection can be successful in practical scenarios. We restate our question and goal in the following: Given hypothesis spaces and several representative domain spaces, what are the conditions to ensure the learnability of OOD detection in terms of risk and AUC? If possible, we hope that these conditions are necessary and suﬃcient in some scenarios.

Main Results. We start to study the learnability of OOD detection in the largest space the total space, and give two necessary conditions for the learnability of OOD detection under risk and AUC (Condition 1 for risk and Condition 2 for AUC). However, we ﬁnd that the overlap between ID and OOD data may result in that both necessary conditions do not hold. Therefore, we give two impossibility theorems to demonstrate that OOD detection fails in the total space (Theorem 5 under risk and Theorem 8 under AUC). Then, we investigate OOD detection in a separate space, where the ID and OOD data do not overlap. Unfortunately, there still exists impossibility theorems (Theorem 6 under risk and Theorem 9 under AUC), meaning that we cannot expect OOD detection is learnable under risk and AUC in the separate space under some conditions of the developed theorems.

It is frustrating to ﬁnd the impossibility theorems regarding OOD detection in a separate space, but we ﬁnd that some conditions of these impossibility theorems may not hold in several practical scenarios. Stemming from this observation, we give several necessary and suﬃcient conditions to characterize the learnability of OOD detection under risk and AUC in the separate space (Theorems 10 and 16 under risk, and Theorems 12 and 17 under AUC). Especially, when our function space is based on fully-connected neural network (FCNN), OOD detection is learnable under risk and AUC in the separate space if and only if the feature space is ﬁnite. Then, we focus on other more practical domain spaces, e.g., the ﬁnite-ID-distribution space and the density-based space and investigate the learnability of OOD detection in both spaces. Theorem 13 shows a necessary and suﬃcient condition of learnability of OOD detection under risk. Theorems 14 and 15 show two suﬃcient conditions for the learnability of OOD detection under risk and AUC, respectively. It should be noted that when studying learnability of OOD detection in the ﬁnite-ID-distribution space, we discover a compatibility condition (Condition 4) that is a necessary and suﬃcient condition of learnability of OOD detection under risk for this space. Then, we explore the compatibility condition in the density-based space, and ﬁnd that such condition is also the necessary and suﬃcient condition in some practical scenarios (Theorem 18).

Implications and Impacts of Theory. Our study is not of purely theoretical interest; it has also practical impacts. (i) From the perspective of domain space, we consider the ﬁnite ID-distribution space that ﬁts the common scenarios in the real world: we normally only have ﬁnite ID datasets. In this case, Theorem 13 gives a necessary and suﬃcient condition to the success of OOD detection under risk. More importantly, our theory shows that OOD detection is learnable in image-based scenarios when ID images have clearly diﬀerent semantic

Fang, Li, Liu, Han, Lu

labels and styles (far-OOD) from OOD images. (ii) From the perspective of function space, we investigate the learnability of OOD detection under risk and AUC for commonly used FCNN-based function spaces. Our theory provides theoretical support (Theorems 16 and 18 under risk, and Theorems 17 and 19 under AUC) for several representative OOD detection works (Hendrycks and Gimpel, 2017; Liang et al., 2018; Liu et al., 2020). (iii) From the perspective of evaluation metrics, our paper studies the learnability of OOD detection under risk and the learnability of OOD detection under AUC, which covers the major evaluation metrics used in OOD detection evaluation and provides theoretical guidance when users have diﬀerent requirement in evaluating OOD detection performance. Based on all of our theoretical results, they suggest we should not expect a universally working OOD detection algorithm. It is necessary to design diﬀerent algorithms in diﬀerent scenarios.

2. Learning Setups

We begin by introducing the necessary concepts and notations for our theoretical framework. Given a feature space X Rd and a label space Y := {1, . . . , K}, we have an ID joint distribution DXIYI over X Y, where XI X and YI Y are random variables. We also have an OOD joint distribution DXOYO, where XO is a random variable from X, but YO is a random variable whose outputs do not belong to Y. During testing, we encounter a mixture of ID and OOD joint distributions: DXY = (1 πout)DXIYI + πout DXOYO, and we can only observe the marginal distribution DX = (1 πout)DXI + πout DXO, where the constant πout [0, 1) represents an unknown class-prior probability. Next, we provide the formal deﬁnition of the OOD detection problem and key concepts used in this paper.

2.1 Problem Setting and Concepts

Problem 1 (OOD Detection (Yang et al., 2021)) Given an ID joint distribution DXIYI and a training data S := {(x1, y1), ..., (xn, yn)} drawn independent and identically distributed from DXIYI, the aim of OOD detection is to train a classiﬁer f by using the training data S such that, for any test data x drawn from the mixed marginal distribution DX:

if x is an observation from DXI, f can classify x into correct ID classes;

if x is an observation from DXO, f can detect x as OOD data.

According to Yang et al. (2021), when K = 1, OOD detection reduces to one-class novelty detection or semantic anomaly detection (Ruﬀet al., 2018; Goyal et al., 2020; Deecke et al., 2018). Next, we introduce some basic and important concepts and notations.

OOD Label and Domain Space. Based on Problem 1, we know it is not necessary to classify OOD data into the correct OOD classes. Without loss of generality, let all OOD data be allocated to one big OOD class, i.e., YO = K + 1 (Fang et al., 2021, 2020). To investigate the PAC learnability of OOD detection, we deﬁne a domain space DXY , which is a set consisting of some joint distributions DXY mixed by some ID joint distributions and some OOD joint distributions. In this paper, the joint distribution DXY mixed by ID joint distribution DXIYI and OOD joint distribution DXOYO is called domain.

On the Learnability of Out-of-distribution Detection

Hypothesis Spaces and Scoring Function Spaces. A hypothesis space H is a subset of function space, i.e., H {h : X Y {K + 1}}. We set Hin {h : X Y} to the ID hypothesis space. We also deﬁne Hb {h : X {1, 2}} as the hypothesis space for binary classiﬁcation, where 1 represents the ID data, and 2 represents the OOD data. The function h is called the hypothesis function. A scoring function space is a subset of function space, i.e., Fl {f : X Rl}, where l is the output s dimension of the vector-valued function f. The function f is called the scoring function.

Ranking Function Spaces. Most representative OOD detection algorithms (Liu et al., 2020) output a ranking function from a given ranking function space R Rall = {r : X R}. If the ranking function r(x) has a higher value, then x is from DXI with a higher probability. A perfect ranking function r fulﬁlls the condition r (x) > r (x ) for all x from DXI and all x from DXO, indicating that rankings of ID data are always higher than rankings of OOD data. The general strategy to construct the ranking function space R is to design a scoring function E : Rl R and integrate it with the scoring function space Fl, i.e., R = E Fl.

Loss, Risks and AUC Metric. Let Yall = Y {K + 1}. Given a loss function ℓ: Yall Yall R 0 satisfying that ℓ(y1, y2) = 0 if and only if y1 = y2, and any h H, then the risk with respect to DXY is

RD(h) := E(x,y) DXY ℓ(h(x), y). (1)

The α-risk Rα D(h) := (1 α)Rin D(h) + αRout D (h), α [0, 1], where Rin D(h) and Rout D (h) are

Rin D(h) := E(x,y) DXIYIℓ(h(x), y), Rout D (h) := Ex DXOℓ(h(x), K + 1).

Except for using risk to evaluate the OOD detection performance, AUC is also a promising metric to see if a ranking function r can separate the ID and OOD data:

AUC(r; DXY ) = Ex DXIEx DXO 1r(x)>r(x ) + 1

21r(x)=r(x ) . (2)

Note that since value of AUC only denpends on the marginal distributions DXI and DXO, therefore, it is also convientent for us to rewrite AUC(r; DXY ) as AUC(r; DXI, DXO).

2.2 Learnability under Risk

Based on risk deﬁned in Eq. (1), OOD detection aims to select a hypothesis function h H with approximately minimal risk, based on ﬁnite data. Generally, we expect the approximation to get better, with the increase in sample size. Algorithms achieving this are said to be consistent under risk. Formally, we have:

Condition 1 (Learnability of OOD Detection under Risk) Given a domain space DXY and a hypothesis space H {h : X Yall}, we say OOD detection is learnable in DXY for H under risk, if there exists an algorithm A2 : + n=1(X Y)n H and a

2. Similar to Shalev-Shwartz et al. (2010), in this paper, we regard an algorithm as a mapping from + n=1(X Y)n to H or R.

Fang, Li, Liu, Han, Lu

monotonically decreasing sequence ϵcons(n), such that ϵcons(n) 0, as n + , and for any domain DXY DXY ,

ES Dn XIYI RD(A(S)) inf h H RD(h) ϵcons(n), (3)

An algorithm A for which this holds is said to be consistent with respect to DXY .

Deﬁnition 1 is a natural extension of agnostic PAC learnability of supervised learning (Shalev-Shwartz et al., 2010). If for any DXY DXY , πout = 0, then Deﬁnition 2 is the agnostic PAC learnability of supervised learning. Although the expression of Deﬁnition 1 is diﬀerent from the normal deﬁnition of agnostic PAC learning in Shalev-Shwartz and Ben-David (2014), one can prove that they are equivalent if ℓis bounded, see Appendix A.3. Since OOD data are unavailable, it is impossible to obtain any information about the class-prior probability πout. Furthermore, in the real world, it is possible that πout can be any value in [0, 1). Therefore, the imbalance issue between ID and OOD distributions, and the priori-unknown issue (i.e., πout is unknown) are the core challenges. To mitigate this challenge, we revise Eq. (3) as follows:

ES Dn XIYI Rα D(A(S)) inf h H Rα D(h) ϵcons(n), α [0, 1]. (4)

If an algorithm A satisﬁes Eq. (4), then the imbalance issue and the prior-unknown issue disappear. That is, A can simultaneously classify the ID data and detect the OOD data well. Based on the above discussion, we deﬁne the strong learnability of OOD detection under risk as follows:

Condition 2 (Strong Learnability of OOD Detection under Risk) Given a domain space DXY and a hypothesis space H {h : X Yall}, we say OOD detection is strongly learnable in DXY for H, if there exists an algorithm A : + n=1(X Y)n H and a monotonically decreasing sequence ϵcons(n), such that ϵcons(n) 0, as n + , and for any domain DXY DXY ,

ES Dn XIYI Rα D(A(S)) inf h H Rα D(h) ϵcons(n), α [0, 1].

Remark. In Theorem 1, we have shown that the strong learnability of OOD detection under risk is equivalent to the learnability of OOD detection under risk, if the domain space DXY is a prior-unknown space (see Deﬁnition 4). In this paper, we mainly discuss the learnability in the prior-unknown space. Therefore, when we mention that OOD detection is learnable under risk, we also mean that OOD detection is strongly learnable under risk.

2.3 Learnability under AUC

Based on AUC deﬁned in Eq. (2), OOD detection aims to select a ranking function r R with approximately maximal AUC, based on ﬁnite data. Generally, we expect the approximation to get better, with the increase in sample size. Algorithms achieving this are said to be consistent under AUC. Formally, we have:

Condition 3 (Learnability of OOD Detection under AUC) Given a domain space DXY , a ranking function space R {r : X R}, we say OOD detection is learnable

On the Learnability of Out-of-distribution Detection

in DXY for R under AUC, if there exists an algorithm A : + n=1(X Y)n R and a monotonically decreasing sequence ϵcons(n), such that ϵcons(n) 0, as n + , and for any domain DXY DXY ,

ES Dn XIYI sup r R AUC(r; DXY ) AUC(A(S); DXY ) ϵcons(n). (5)

An algorithm A for which this holds is said to be AUC consistent with respect to DXY .

Deﬁnition 3 is another version of Deﬁnition 1. Here, we use AUC instead of risk to evaluate the performance of the OOD detection. Note that the learnability of OOD detection under AUC is not inﬂuenced by the πout, as AUC is directly calculated by using DXI and DXO.

2.4 Goal of Our Theory

Note that the agnostic PAC learnability of supervised learning is distribution-free, i.e., the domain space DXY consists of all domains. However, due to the absence of OOD data during the training process (Liang et al., 2018; Ren et al., 2019; Fang et al., 2021), it is obvious that the learnability of OOD detection is not distribution-free (i.e., Theorem 5 and Theorem 8). In fact, we discover that the learnability of OOD detection is deeply correlated with the relationship between the domain space DXY and the hypothesis space H (or the ranking function space R). That is, OOD detection is learnable only when the domain space DXY and the hypothesis space H (or the ranking function space R) satisfy some special conditions, e.g., Conditions 1, 4 (under risk), Conditions 2 (under AUC). We present our goal as follows: Goal: given a hypothesis space H (or a ranking function space R), and several representative domain spaces DXY , what are the conditions to ensure the learnability of OOD detection? Furthermore, if possible, we hope that these conditions are necessary and suﬃcient in some scenarios.

Therefore, compared to the agnostic PAC learnability of supervised learning, our theory doesn t focus on the distribution-free case, but focuses on discovering essential conditions to guarantee the learnability of OOD detection in several representative and practical domain spaces DXY . By these essential conditions, we can know when OOD detection can be successful in real applications.

3. Learning in Priori-unknown Spaces

We ﬁrst investigate a special space, called prior-unknown space and prove that if OOD detection is strongly learnable under risk or learnable under AUC in a space DXY , then one can discover a larger domain space, which is prior-unknown, to ensure the learnability of OOD detection under risk or AUC. These results imply that it is enough to study learnabiligy of OOD detection in the prior-unknown spaces. The prior-unknown space is as follows:

Condition 4 Given a domain space DXY , we say DXY is a priori-unknown space, if for any domain DXY DXY and any α [0, 1), we have Dα XY := (1 α)DXIYI + αDXOYO DXY .

Then the following theorem presents importance and necessity of priori-unknown space.

Fang, Li, Liu, Han, Lu

Theorem 1 Given spaces DXY and D XY = {Dα XY : DXY DXY , α [0, 1)}, then 1) D XY is a priori-unknown space and DXY D XY ; 2) if DXY is a priori-unknown space, then Deﬁnition 1 and Deﬁnition 2 are equivalent; 3) OOD detection is strongly learnable in DXY under risk if and only if OOD detection is learnable in D XY under risk; 4) OOD detection is learnable in DXY under AUC if and only if OOD detection is learnable in D XY under AUC.

The second result of Theorem 1 bridges the learnability and strong learnability under risk, which implies that if an algorithm A is consistent with respect to a prior-unknown space, then this algorithm A can address the imbalance issue between ID and OOD distributions, and the priori-unknown issue well. The fourth result of Theorem 1 shows that the learnability of OOD detection under AUC is not inﬂuenced by the unknown class-prior probability πout. Based on Theorem 1, we focus on our theory in the prior-unknown spaces. To demystify the learnability of OOD detection, we introduce ﬁve representative priori-unknown spaces:

Single-distribution space DDXY XY . For a domain DXY , DDXY XY := {Dα XY : α [0, 1)}.

Total space Dall XY , which consists of all domains.

Separate space Ds XY , which consists of all domains that satisfy the separate condition, that is for any DXY Ds XY , supp DXO supp DXI = , where supp D means the support set of a distribution D.

Finite-ID-distribution space DF XY , which is a prior-unknown space satisfying that the number of distinct ID joint distributions DXIYI in DF XY is ﬁnite, i.e., |{DXIYI : DXY DF XY }| < + .

Density-based space Dµ,b XY , which is a prior-unknown space consisting of some domains satisfying that: for any DXY , there exists a density function f with 1/b f b in suppµ and 0.5 DXI + 0.5 DXO = R fdµ, where µ is a measure deﬁned over X. Note that if µ is discrete, then DX is a discrete distribution; and if µ is the Lebesgue measure, then DX is a continuous distribution.

The above representative spaces widely exist in real applications. For example, 1) if the images from diﬀerent semantic labels with diﬀerent styles are clearly diﬀerent, then those images can form a distribution belonging to a separate space Ds XY ; and 2) when designing an algorithm, we only have ﬁnite ID datasets, e.g., CIFAR-10, MNIST, SVHN, and Image Net, to build a model. Then, ﬁnite-ID-distribution space DF XY can handle this real scenario. Note that the single-distribution space is a special case of the ﬁnite-ID-distribution space. In this paper, we mainly discuss these ﬁve spaces.

4. Impossibility Theorems for OOD Detection

In this section, we ﬁrst give a necessary condition for the learnability of OOD detection. Then, we show this necessary condition does not hold in the total space Dall XY and the separate space Ds XY .

On the Learnability of Out-of-distribution Detection

4.1 Necessary Conditions for Learnability of OOD Detection

We ﬁrst ﬁnd a necessary condition for the learnability of OOD detection under risk (AUC), i.e., Condition 1 (Condition 2).

Condition 1 (Linear Condition under Risk) For any DXY DXY and any α [0, 1),

inf h H Rα D(h) = (1 α) inf h H Rin D(h) + α inf h H Rout D (h).

The importance of Condition 1 is reﬂected by Theorem 2, showing that Condition 1 is a necessary and suﬃcient condition for the learnability of OOD detection under risk if the DXY is the single-distribution space.

Theorem 2 Given a hypothesis space H and a domain DXY , OOD detection is learnable under risk in the single-distribution space DDXY XY for H if and only if Condition 1 holds.

Theorem 2 implies that Condition 1 is important for the learnability of OOD detection under risk. Due to the simplicity of single-distribution space, Theorem 2 implies that Condition 1 is the necessary condition for the learnability of OOD detection under risk in the prior-unknown space, see Lemma 1 in Appendix C. Then, we focus on ﬁnding a necessary condition for the learnability of OOD detection under AUC. The condition is similar to Condition 1 but replacing risk with AUC. Note that, for simplicity, in the following of this paper, we use AUC(r; DXI, DXO) to present AUC(r; DXY ).

Condition 2 (Linear Condition under AUC) For any DXY = βDXIYI+(1 β)DXOYO, D XY = β DXIYI + (1 β )D XOYO DXY , then for any α [0, 1),

α sup r R AUC(r; DXI, DXO) + (1 α) sup r R AUC(r; DXI, D XO) = sup r R AUC(r; DXI, Dα XO),

where Dα XO = αDXO + (1 α)D XO.

The importance of Condition 2 is reﬂected in Theorem 3, showing that Condition 2 is a necessary condition for the learnability of OOD detection under AUC if the DXY is a simple distribution space.

Theorem 3 Given a ranking function space R and a domain space DXY , if OOD detection is learnable under AUC for R in DXY , then for any DXY , D XY DXY , the linear condition under AUC (i.e., Condition 2) holds.

Since the Condition 2 is a necessary condition for the learnability of OOD detection under AUC, this condition provides a new way to check if an OOD detection is learnable under AUC. Namely, if Condition 2 does not hold, OOD detection is not learnable under AUC.

4.2 Impossibility Theorems under Risk

In this subsection, we ﬁrst study whether Condition 1 holds in the total space Dall XY . If Condition 1 does not hold, then OOD detection is not learnable under risk. Theorem 4 shows that Condition 1 is not always satisﬁed, especially, when there is an overlap between the ID and OOD distributions:

Fang, Li, Liu, Han, Lu

Condition 5 (Overlap Between ID and OOD) We say a domain DXY has overlap between ID and OOD distributions, if there is a σ-ﬁnite measure µ such that DX is absolutely continuous with respect to µ, and µ(Aoverlap) > 0, where Aoverlap = {x X : f I(x) > 0 and f O(x) > 0}. Here f I and f O are the representers of DXI and DXO in Radon Nikodym Theorem (Cohn, 2013),

DXI = Z f Id µ, DXO = Z f Od µ.

Lemma 4 Given a hypothesis space H and a prior-unknown space DXY , if there is DXY DXY , which has overlap between ID and OOD, and infh H Rin D(h) = 0, infh H Rout D (h) = 0, then Condition 1 does not hold. Therefore, OOD detection is not learnable under risk in DXY for H.

Lemma 4 clearly shows that under proper conditions, Condition 1 does not hold, if there exists a domain whose ID and OOD distributions have overlap. By Lemma 4, we can obtain that the OOD detection is not learnable in the total space Dall XY for any non-trivial hypothesis space H.

Theorem 5 (Impossibility Theorem for Total Space under Risk) OOD detection is not learnable under risk in the total space Dall XY for H, if |φ H| > 1, where φ maps ID labels to 1 and maps OOD labels to 2.

Since the overlaps between ID and OOD distributions may cause that Condition 1 does not hold, we then consider studying the learnability of OOD detection in the separate space Ds XY , where there are no overlaps between the ID and OOD distributions. However, Theorem 6 shows that even if we consider the separate space, the OOD detection is still not learnable in some scenarios. Before introducing the impossibility theorem for separate space, i.e., Theorem 6, we need a mild assumption:

Assumption 1 (Separate Space for OOD under Risk) A hypothesis space H is separate for OOD data, if for each data point x X, there exists at least one hypothesis function hx H such that hx(x) = K + 1.

Assumption 1 means that every data point x has the possibility to be detected as OOD data. Assumption 1 is mild and can be satisﬁed by many hypothesis spaces, e.g., the FCNN-based hypothesis space (Proposition 3 in Appendix N), score-based hypothesis space (Proposition 4 in Appendix N) and universal kernel space. Next, we use Vapnik Chervonenkis (VC) dimension (Mohri et al., 2018) to measure the size of hypothesis space, and study the learnability of OOD detection in Ds XY based on the VC dimension.

Theorem 6 (Impossibility Theorem for Separate Space under Risk) If Assumption 1 holds, VCdim(φ H) < + and suph H |{x X : h(x) Y}| = + , OOD detection is not learnable under risk in the separate space Ds XY for H, where φ maps ID labels to 1 and maps OOD labels to 2.

The ﬁnite VC dimension normally implies the learnability of supervised learning. However, in our results, the ﬁnite VC dimension cannot guarantee the learnability of OOD detection under risk in the separate space, which reveals the diﬃculty of the OOD detection.

On the Learnability of Out-of-distribution Detection

4.3 Impossibility Theorems under AUC

We then study whether Condition 2 holds in the total space Dall XY . If Condition 2 does not hold, then OOD detection is not learnable under AUC. We ﬁrst present Lemma 7 to point out when Condition 2 does not hold.

Lemma 7 Given a ranking function space R, a domain space DXY and DXY = βDXIYI + (1 β)DXOYO, D XY = β DXIYI + (1 β )D XOYO DXY , let P be the overlap set between DXI and DXO and P be the overlap set between DXI and D XO based on the Deﬁnition 5. If

sup r R AUC(r; DXI, DXO) = sup r Rall AUC(r; DXI, DXO)

sup r R AUC(r; DXI, D XO) = sup r Rall AUC(r; DXI, D XO),

and DXI(P P ) < min{DXI(P), DXI(P )}, then Condition 2 does not hold, where Rall is a ranking function space consisting of all ranking functions from X to R. Therefore, OOD detection is not learnable under AUC in DXY for R.

Based on Lemma 7, we know that, under proper conditions, Condition 2 does not hold once there is one domain whose ID and OOD distributions overlap. Then, based on Lemma 7, we can obtain that the OOD detection is not learnable in the total space Dall XY for any non-trivial ranking function space R.

Theorem 8 (Impossibility Theorem for Total Space under AUC) Given ranking function space R, if there exist x, x X and r, r R such that

r(x) > r(x ) and r (x ) > r (x),

then the learnability of OOD detection under AUC is not distribution-free for R.

From Lemma 7, we know that the overlap between DXI and DXO is an important factor to inﬂuence the learnability of OOD detection under AUC. Thus, similar to the situation under risk, we want to study the learnability of OOD detection under AUC in separate space Ds XY ﬁrst. Before introducing the impossibility theorem for separate space, we need a mild assumption demonstrated below.

Assumption 2 (Separate Space for OOD under AUC) A ranking function space R is called separate ranking function space, if for any x X, there exists rx R such that rx(x) < rx(x ), for any x X {x}.

Note that, the above assumption is weak and can be satisﬁed by some well-known spaces (see Propositions 1 and 2). The above assumption means that, for any data point x, its ranking can be the lowest one compared to other data points in the space X. Finally, we use Vapnik Chervonenkis (VC) dimension (Mohri et al., 2018) to help measure the size of ranking function space, and study the learnability of OOD detection under AUC in Ds XY with the help the VC dimension.

Fang, Li, Liu, Han, Lu

Theorem 9 (Impossibility Theorem for Separate Space under AUC) Given a separate ranking function space R, if VC[φ R] = d < + and |X| (28d+14) log(14d+7), then OOD detection is not learnable under AUC in Ds XY for R, where φ R = {1r(x)>r(x ) : r R}.

Based on Theorem 9, we obtain a similar result to the learnability of OOD detection under risk: the ﬁnite VC dimension cannot guarantee the learnability of OOD detection under AUC in the separate space, which further reveals the diﬃculty of OOD detection. Although the above impossibility theorems (under risk and AUC) are frustrating, there is still room to discuss the conditions in Theorem 6 and Theorem 9, and to ﬁnd out the proper conditions for ensuring the learnability of OOD detection under risk and AUC in the separate space (see the following section).

5. When OOD Detection Can Be Successful

Here, we discuss when the OOD detection can be learnable under risk/AUC in diﬀerent spaces. We ﬁrst study the separate space Ds XY .

5.1 OOD Detection in the Separate Space

Both Theorem 6 and Theorem 9 have indicated that VCdim(φ H) = + or suph H |{x X : h(x) Y}| < + (or supr R |{x X : r(x) R}| < + under AUC metric) is necessary to ensure the learnability of OOD detection under risk or AUC in Ds XY if Assumption 1 or Assumption 2 holds. However, generally, hypothesis spaces generated by feed-forward neural networks with proper activation functions have ﬁnite VC dimension (Bartlett et al., 2019; Karpinski and Macintyre, 1997). Therefore, we study the learnability of OOD detection in the case that |X| < + , which implies that suph H |{x X : h(x) Y}| < + under risk metric or supr R |{x R : r(x) X}| < + under AUC metric. Additionally, Theorem 16 also implies that |X| < + is the necessary and suﬃcient condition for the learnability of OOD detection under risk in a separate space, when the hypothesis space is generated by FCNN. Hence, |X| < + may be necessary in the space Ds XY .

Learnability under Risk. For simplicity, we ﬁrst discuss the case that K = 1, i.e., the one-class novelty detection. We show the necessary and suﬃcient condition for the learnability of OOD detection under risk in Ds XY , when |X| < + .

Theorem 10 Let K = 1 and |X| < + . Suppose that Assumption 1 holds and the constant function hin := 1 H. Then OOD detection is learnable under risk in Ds XY for H if and only if Hall {hout} H, where Hall is the hypothesis space consisting of all hypothesis functions, and hout is a constant function that hout := 2, here 1 represents ID data and 2 represents OOD data.

The condition hin H presented in Theorem 10 is mild. Many practical hypothesis spaces satisfy this condition, e.g., the FCNN-based hypothesis space (Proposition 3 in Appendix N), score-based hypothesis space (Proposition 4 in Appendix N) and universal kernel-based hypothesis space. Theorem 10 implies that if K = 1 and OOD detection is learnable under risk in Ds XY for H, then the hypothesis space H should contain almost all hypothesis

On the Learnability of Out-of-distribution Detection

functions, implying that if the OOD detection can be learnable under risk in the distributionagnostic case, then a large-capacity model is necessary.

Next, we extend Theorem 10 to a general case, i.e., K > 1. When K > 1, we will ﬁrst use a binary classiﬁer hb to classify the ID and OOD data. Then, for the ID data identiﬁed by hb, an ID hypothesis function hin will be used to classify them into corresponding ID classes. We state this strategy as follows: given a hypothesis space Hin for ID distribution and a binary classiﬁcation hypothesis space Hb introduced in Section 2, we use Hin and Hb to construct an OOD detection s hypothesis space H, which consists of all hypothesis functions h satisfying the following condition: there exist hin Hin and hb Hb such that x X,

h(x) = i, if hin(x) = i and hb(x) = 1; otherwise, h(x) = K + 1. (6)

We use Hin Hb to represent a hypothesis space consisting of all h deﬁned in Eq. (6). In addition, we also need an additional condition for the loss function ℓ, shown as follows:

Condition 3 ℓ(y2, y1) ℓ(K + 1, y1), for any in-distribution labels y1 and y2 Y.

Theorem 11 Let |X| < + and H = Hin Hb. If Hall {hout} Hb and Condition 3 holds, then OOD detection is learnable under risk in Ds XY for H, where Hall and hout are deﬁned in Theorem 10.

Learnability under AUC. Then, we study the learnability of OOD detection under AUC in the separate space. Here we require to introduce a basic assumption in learning theory for AUC AUC-based Realizability Assumption, i.e., for any DXY DXY , there exists r R such that AUC(r ; DXI, DXO) = 1 (see Appendix A.2). Based on this AUC-based Realizability Assumption, we prove the following theorem.

Theorem 12 Given a separate ranking function space R, if |X| < + , then OOD detection is learnable under AUC in the separate space Ds XY for R if and only if AUC-based Realizability Assumption holds.

Theorem 12 indicates the signiﬁcance of AUC-based Realizability Assumption in OOD detection under AUC, which also means that a large ranking function space is essential for the success of OOD detection under AUC.

5.2 OOD Detection in the Finite-ID-Distribution Space

Since researchers can only collect ﬁnite ID datasets as the training data in the process of algorithm design, it is worthy to study the learnability of OOD detection under risk in the ﬁnite-ID-distribution space DF XY . We ﬁrst show two necessary concepts below.

Condition 6 (ID Consistency) Given a domain space DXY , we say any two domains DXY DXY and D XY DXY are ID consistency, if DXIYI = D XIYI. We use to represent the ID consistency, i.e., DXY D XY if and only if DXY and D XY are ID consistency.

It is easy to check that the ID consistency is an equivalence relation. Therefore, we deﬁne the set [DXY ] := {D XY DXY : DXY D XY } as the equivalence class regarding DXY .

Fang, Li, Liu, Han, Lu

Condition 4 (Compatibility) For any equivalence class [D XY ] with respect to DXY and any ϵ > 0, there exists a hypothesis function hϵ H such that for any domain DXY [D XY ],

hϵ {h H : Rout D (h ) inf h H Rout D (h) + ϵ} {h H : Rin D(h ) inf h H Rin D(h) + ϵ}.

In Appendix C, Lemma 2 has implied that Condition 4 is a general version of Condition 1. Next, Theorem 13 shows that Condition 4 is the necessary and suﬃcient condition in DF XY .

Theorem 13 Suppose that X is bounded. OOD detection is learnable under risk in DF XY for H if and only if the compatibility condition (i.e., Condition 4) holds. Furthermore, the learning rate ϵcons(n) can attain O(1/

n1 θ), for any θ (0, 1).

Theorem 13 shows that, in the process of algorithm design, OOD detection cannot be successful without the compatibility condition if we use risk to evaluate the performance. Theorem 13 also implies that Condition 4 is essential for the learnability of OOD detection under risk. This motivates us to study whether OOD detection can be successful in more general spaces (e.g., the density-based space), when the compatibility condition holds.

As for the learnability of OOD detection under AUC in the ﬁnite-ID-distribution space, since Condition 2 only considers linearity between OOD distributions instead of OOD and ID distributions as shown in Condition 1. To further reveal the learnability of OOD detection under AUC in the ﬁnite-ID-distribution space, we might need to discover a new condition for compatibility w.r.t. OOD and ID distributions to extend Condition 2.

5.3 OOD Detection in the Density-based Space

Learnability under Risk. To ensure that Condition 4 holds, we consider a basic assumption in learning theory Risk-based Realizability Assumption (see Appendix A.2), i.e., for any DXY DXY , there exists h H such that RD(h ) = 0. We discover that in the densitybased space Dµ,b XY , Risk-based Realizability Assumption can conclude the compatibility condition (Condition 4). Based on this observation, we prove the following theorem:

Theorem 14 Given a density-based space Dµ,b XY , if µ(X) < + , the Risk-based Realizability Assumption holds, then when H has ﬁnite Natarajan dimension (Shalev-Shwartz and Ben David, 2014), OOD detection is learnable in Dµ,b XY for H. Furthermore, the learning rate ϵcons(n) can attain O(1/

n1 θ), for any θ (0, 1).

To further investigate the importance and necessary of Risk-based Realizability Assumption, Theorem 18 has indicated that in some practical scenarios, Risk-based Realizability Assumption is the necessary and suﬃcient condition for the learnability of OOD detection under risk in the density-based space. Therefore, Risk-based Realizability Assumption may be indispensable for the learnability of OOD detection under risk in some practical scenarios.

Learnability under AUC. To study the learnability of OOD detection under AUC in the density-based space, we ﬁrst need to introduce a constant-closure assumption for R.

Assumption 3 We say a ranking function space R is constant closure, if for any r R, the constant function space r(X) := {c : r(x) = c, x X, r R} R.

On the Learnability of Out-of-distribution Detection

Note that, the above assumption is weak and can be satisﬁed by some well-known ranking function space (see Propositions 1 and 2). Based on this assumption, we give a suﬃcient condition for learnability of OOD detection under AUC in the density-based space:

Theorem 15 Suppose that R is constant closure, separate, and µ(X) < + . Given a density-based space Dµ,b XY , if the AUC-based Realizability Assumption holds, then when VC[φ R] < + , OOD detection is learnable under AUC in Dµ,b XY for R, where φ R = {1r1(x)>r2(x ) : r1, r2 R}. Furthermore, the learning rate ϵcons(n) can attain O(1/

n1 θ), for any θ (0, 1).

Based on Theorem 15, we ﬁnd that the AUC-based Realizability Assumption is also important for the learnability of OOD detection under AUC in the density-based space.

6. Connecting Theory to Practice

In Section 5, we have shown the successful scenarios where OOD detection problem can be addressed in theory under risk or AUC metric. In this section, we will discuss how the proposed theory is applied to two representative hypothesis spaces neural-network-based spaces and score-based spaces.

6.1 Key Concepts Regarding Fully-connected Neural Networks

Fully-connected Neural Networks. Given a sequence q = (l1, l2, ..., lg), where li and g are positive integers and g > 2, we use g to represent the depth of neural network and use li to represent the width of the i-th layer. After the activation function σ is selected3, we can obtain the architecture of FCNN according to the sequence q. Let fw,b be the function generated by FCNN with weights w and bias b. An FCNN-based scoring function space is deﬁned as: Fσ q := {fw,b : weights w, bias b}. In addition, for simplicity, given any two sequences q = (l1, ..., lg) and q = (l 1, ..., l g ), we use the notation q q to represent the following equations and inequalities:

1) g g , l1 = l 1, lg = l g ; 2) li l i, i = 1, ..., g 1; and 3) lg 1 l i, i = g, ..., g 1.

Lemma 14 shows q q Fσ q Fσ q . We use to compare the sizes of FCNNs.

FCNN-based Hypothesis Space. Let lg = K + 1. The FCNN-based scoring function space Fσ q can induce an FCNN-based hypothesis space. For any fw,b Fσ q, the induced hypothesis function is:

hw,b := arg max k {1,...,K+1} fk w,b, where fk w,b is the k-th coordinate of fw,b.

Then, the FCNN-based hypothesis space is deﬁned as Hσ q := {hw,b : weights w, bias b}.

3. We consider the rectiﬁed linear unit (Re LU) function as the default activation function σ, which is deﬁned by σ(x) = max{x, 0}, x R. We will not repeatedly mention the deﬁnition of σ in the rest of our paper.

Fang, Li, Liu, Han, Lu

FCNN-based Ranking Function Space. Then, based on the deﬁnition of FCNN, we show that, given a speciﬁc q, under some mild conditions, Fσ q is the separate and constant closure ranking function space.

Proposition 1 Let X be a bounded feature space. Given q = (l1, ..., lg 1, 1), then

if some s with 1 < s < g, d = l1 l2 ... ls, and ls 2d, Fσ q is the separate ranking function space;

Fσ q is constant closure;

{1f1(x)<f2(x ) : f1, f2 Fσ q} has ﬁnite VC dimension.

Score-based Hypothesis Space. Many OOD detection algorithms detect OOD data by using a score-based strategy. That is, given a threshold λ, a scoring function space Fl {f : X Rl} and a score function E : Fl R, then x is regarded as ID data if and only if E(f(x)) λ. We introduce several representative score functions E as follows: for any f = [f1, ..., fl] Fl,

softmax-based function (Hendrycks and Gimpel, 2017): λ (1

l , 1) and T > 0,

E(f) = max k {1,...,l} exp (fk) Pl c=1 exp (fc) ; (7)

temperature-scaled function (Liang et al., 2018): λ (1

l , 1) and T > 0,

E(f) = max k {1,...,l} exp (fk/T) Pl c=1 exp (fc/T) ; (8)

energy-based function (Liu et al., 2020): λ (0, + ) and T > 0,

E(f) = T log

c=1 exp (fc/T). (9)

Using E, λ and f Fσ q, we have a classiﬁer: hλ f,E(x) = 1, if E(f(x)) λ; otherwise, hλ f,E(x) = 2, where 1 represents the ID data and 2 represents the OOD data. Hence, a binary classiﬁcation hypothesis space Hb, which consists of all hλ f,E, is generated. We deﬁne

Hσ,λ q,E := {hλ f,E : f Fσ q}.

Score-based Ranking Function Space. Similar to the FCNN-based ranking function space, for several representative score functions E (e.g.,, Eqs (7), (8), and (9)), the FCNNbased score ranking function space E Fσ q is separate, which is evidence that Assumption 2 is weak and can be easily satisﬁed.

Proposition 2 Given q = (l1, ..., lg 1, l) and q = (l1, ..., lg 1, 1), let R = E Fσ q, then

if Fσ q is a separate ranking function space, R is the separate ranking function space;

On the Learnability of Out-of-distribution Detection

R is constant closure;

{1r1(x)<r2(x ), r1, r2 R} has ﬁnite VC dimension,

where E is Eq. (7), (8) or (9).

According to the previous section, we ﬁnd that FCNN-based ranking function space and FCNN-based score ranking function space can satisfy almost all conditions in theorems.

6.2 Learnability of OOD Detection in Diﬀerent Spaces

Next, we present applications of our theory regarding the above practical and important hypothesis spaces and ranking function spaces.

Theorem 16 Suppose that Condition 3 holds and the hypothesis space H is FCNN-based or score-based, i.e., H = Hσ q or H = Hin Hb, where Hin is an ID hypothesis space, Hb = Hσ,λ q,E and H = Hin Hb is introduced below Eq. (6), here E is Eq. (7), (8) or (9). Then

There is a sequence q = (l1, ..., lg) such that OOD detection is learnable under risk in the separate space Ds XY for H if and only if |X| < + .

Furthermore, if |X| < + , then there exists a sequence q = (l1, ..., lg) such that for any sequence q satisfying that q q , OOD detection is learnable under risk in Ds XY for H.

If we consider the ranking function space, we can obtain a similar theoretical result.

Theorem 17 Suppose the ranking function space R is separate, and FCNN-based or scorebased, i.e., R = Fσ q or R = E Fσ q, where E is Eq. (7), (8) or (9). Then

There is a sequence q = (l1, ..., lg) such that OOD detection is AUC learnable in the separate space Ds XY for R if and only if |X| < + .

Furthermore, if |X| < + , then there is a sequence q = (l1, ..., lg) such that for any sequence q satisfying that q q , OOD detection is learnable under AUC in Ds XY for R.

Theorems 16 and 17 state that 1) when the hypothesis space or ranking function space is FCNN-based or score-based, the ﬁnite feature space is the necessary and suﬃcient condition for the learnability of OOD detection (under risk or AUC) in the separate space; and 2) a larger architecture of FCNN has a greater probability to achieve the learnability of OOD detection in the separate space. Note that when we select Eqs. (7), (8), or (9) as the score function E, Theorems 16 and 17 also show that the selected score functions E can guarantee the learnability of OOD detection (under risk or AUC), which is a theoretical support for the representative works (Liang et al., 2018; Liu et al., 2020; Hendrycks and Gimpel, 2017). Furthermore, Theorems 18 and 19 also oﬀer theoretical supports for these works in the density-based space.

Theorem 18 Suppose that each domain DXY in Dµ,b XY is attainable, i.e., arg minh H RD(h) = (the ﬁnite discrete domains satisfy this). Let K = 1 and the hypothesis space H be scorebased (H = Hσ,λ q,E, where E is in Eq. (7), (8), or (9)) or FCNN-based (H = Hσ q). If

Fang, Li, Liu, Han, Lu

µ(X) < + , then the following four conditions are equivalent:

Learnability in Dµ,b XY for H Condition 1 Risk-based Realizability Assumption Condition 4

Theorem 19 Suppose that the ranking function space R is separate and score-based (R = E Fσ q) or FCNN-based (R = Fσ q), where E is Eq. (7), (8) or (9). If µ(X) < + , then the following three conditions satisfy:

AUC-based Realizability Assumption Learnability in Dµ,b XY for R Condition 2

Compared to Theorem 18, Theorem 19 cannot obtain the equivalence among Realizability Assumption, Learnability in Dµ,b XY and linear condition under AUC. The main reason is that Condition 2 is a weaker necessary condition for learnability of OOD detection under AUC than Condition 1 for learnability of OOD detection under risk. We need to discover a strong necessary condition for learnability of OOD detection under AUC to obtain a similar equivalence that appeared in Theorem 184.

6.3 Overlap and Beneﬁts of Multi-class Case

We investigate when the hypothesis space is FCNN-based or score-based, what will happen if there exists an overlap between the ID and OOD distributions?

Theorem 20 Let K = 1 and the hypothesis space H be score-based (H = Hσ,λ q,E, where E is in Eq. (7), (8), or (9)) or FCNN-based (H = Hσ q). Given a prior-unknown space DXY , if there exists a domain DXY DXY , which has an overlap between ID and OOD distributions (see Deﬁnition 5), then OOD detection is not learnable under risk in DXY for H.

When K = 1 and the hypothesis space is FCNN-based or score-based, Theorem 20 shows that overlap between ID and OOD distributions is the suﬃcient condition for the unlearnability of OOD detection under risk. Theorem 20 takes roots in the conditions infh H Rin D(h) = 0 and infh H Rout D (h) = 0. However, when K > 1, we can ensure infh H Rin D(h) > 0 if ID distribution DXIYI has overlap between ID classes. By this observation, we conjecture that when K > 1, OOD detection is learnable in some special cases where overlap exists, even if the hypothesis space is FCNN-based or score-based. As for the ranking function space, we can obtain a corresponding but weaker theoretical result shown below.

Theorem 21 Let the separate ranking function space R be FCNN-based or score-based (where the score function E is Eq. (7), (8), or (9)). Suppose that DXY , D XY DXY are discrete distributions with DXIYI = DXIYI and DXO = δx, D XO = δx . If DXO = δx, D XO = δx have overlaps with DXIYI and DXO = D XO, then OOD detection is not learnable under AUC in DXY for R.

4. A stronger necessary condition normally means that this necessary condition is closer to the necessary and suﬃcient condition.

On the Learnability of Out-of-distribution Detection

7. Discussion

Connections between Theorems under Risk and AUC. In our previous conference paper (Fang et al., 2022a), we mainly focus on the learnability of OOD detection under risk. However, in practice, AUC-related metrics are often used to evaluate the performance of OOD detection algorithms (Lin et al., 2021; Huang et al., 2021; Fort et al., 2021b; Ming et al., 2022; Yang et al., 2022). To ﬁll this gap, we take a further step: investigating the learnability of OOD detection under AUC. Figure 1 illustrates the connections between the main theoretical results of learnability of OOD detection under risk and AUC. In the most of domain spaces considered in this paper, we can obtain similar theoretical results under AUC compared to theoretical results under risk. However, when considering the ﬁnite-ID-distribution space, we cannot get a corresponding theorem for the learnability of OOD detection under AUC. The main reason is that Condition 2 is a weaker necessary condition for learnability of OOD detection under AUC than Condition 1 for learnability of OOD detection under risk. Thus, additional information might be required to obtain a stronger necessary condition for AUC learnability. The inﬂuence of Condition 2 also appears in Theorem 19 where we cannot obtain a strong theoretical result like Theorem 18. To conclude, since Condition 1 is stronger (i.e., it is closer to necessary and suﬃcient condition) than Condition 2, the theoretical results under risk are stronger than those under AUC. In the future, we need to discover a stronger necessary condition for learnability of OOD detection under AUC.

Understanding Far-OOD Detection. Many existing works (Hendrycks and Gimpel, 2017; Yang et al., 2022) study the far-OOD detection issue. Existing benchmarks include 1) MNIST (Deng, 2012) as ID dataset, and Texture (Kylberg, 2011), CIFAR-10 (Krizhevsky and Hinton, 2009) or Place365 (Zhou et al., 2018) as OOD datasets; and 2) CIFAR-10 (Krizhevsky and Hinton, 2009) as ID dataset, and MNIST (Deng, 2012), or Fashion-MNIST (Zhou et al., 2018) as OOD datasets. In far-OOD case, we ﬁnd that the ID and OOD datasets have diﬀerent semantic labels and diﬀerent styles. From the theoretical view, we can deﬁne far-OOD detection tasks as follows: for τ > 0, a domain space DXY is τ-far-OOD, if for any domain DXY DXY ,

dist(supp DXO, supp DXI) > τ.

Theorems 11, 13 and 16 imply that under appropriate hypothesis space, τ-far-OOD detection is learnable under risk. Theorem 17 implies that under appropriate hypothesis space, τfar-OOD detection is learnable under AUC. In Theorem 11, the condition |X| < + is necessary for the separate space. However, one can prove that in the far-OOD case, when Hin is agnostic PAC learnable for ID distribution, the results in Theorem 11 still holds, if the condition |X| < + is replaced by a weaker condition that X is compact. In addition, it is notable that when Hin is agnostic PAC learnable for ID distribution and X is compact, the KNN-based OOD detection algorithm (Sun et al., 2022) is consistent in the τ-far-OOD case.

Understanding Near-OOD Detection. When the ID and OOD datasets have similar semantics or styles, OOD detection tasks become more challenging. (Ren et al., 2021; Fort et al., 2021a) consider this issue and name it near-OOD detection. Existing benchmarks

Fang, Li, Liu, Han, Lu

Learnability of OOD Detection

Learnability of OOD Detection

Total Space

Separate Space

Finite-ID-distribution Space

Density-based Space

Not Learnable

Under Some Conditions, Learnable

Not Learnable

Theorem 10 Learnable

Under Some Conditions, Learnable

Unknown, due to lack of proper definition of Compatibility Condition

Theorem 12 Theorem 13

Under Some Conditions, Learnable

Connecting Theory to Practice

Theorem 14 Theorem 15

Under Some Conditions, Learnable

Not Learnable

Learnable Learnable

Not Learnable

Lemma 1 Lemma 2

Not Learnable

Figure 1: Connections among main theoretical results in this paper. Compared to risk, AUC has a more strict requirement for the classiﬁcation. Perfect classiﬁcation (i.e., accuracy is 100%) does not imply perfect AUC, which is the reason why theories regarding risk and AUC are very diﬀerent.

include 1) MNIST (Deng, 2012) as ID dataset, and Fashion-MNIST (Zhou et al., 2018) or Not-MNIST (Bulatov, 2011) as OOD datasets; and 2) CIFAR-10 (Krizhevsky and Hinton, 2009) as ID dataset, and CIFAR-100 (Krizhevsky et al., 2009) as OOD dataset. From the theoretical view, some near-OOD tasks may imply the overlap condition, i.e. Deﬁnition 5. Therefore, Lemma 4 and Theorem 20 imply that near-OOD detection may be not

On the Learnability of Out-of-distribution Detection

learnable under risk, and Lemma 7 and Theorem 21 imply that near-OOD detection may be not learnable under AUC. Developing a theory to understand the feasibility of near-OOD detection is an open question.

8. Related Work

OOD Detection Algorithms. We will brieﬂy review many representative OOD detection algorithms in three categories. 1) Classiﬁcation-based methods use an ID classiﬁer to detect OOD data (Hendrycks and Gimpel, 2017)5. Representative works consider using the maximum softmax score (Hendrycks and Gimpel, 2017), temperature-scaled score (Ren et al., 2019) and energy-based score (Liu et al., 2020; Wang et al., 2021) to identify OOD data. 2) Density-based methods aim to estimate an ID distribution and identify the low-density area as OOD data (Zong et al., 2018). 3) The recent development of generative models provides promising ways to make them successful in OOD detection (Pidhorskyi et al., 2018; Nalisnick et al., 2019; Ren et al., 2019; Kingma and Dhariwal, 2018; Xiao et al., 2020). Distance-based methods are based on the assumption that OOD data should be relatively far away from the centroids of ID classes (Lee et al., 2018), including Mahalanobis distance (Lee et al., 2018; Ren et al., 2021), cosine similarity (Zaeemzadeh et al., 2021), and kernel similarity (Amersfoort et al., 2020).

Early works consider using the maximum softmax score to express the ID-ness (Hendrycks and Gimpel, 2017). Then, temperature scaling functions are used to amplify the separation between the ID and OOD data (Ren et al., 2019). Recently, researchers propose hyperparameter-free energy scores to improve the OOD uncertainty estimation (Liu et al., 2020; Wang et al., 2021). Additionally, researchers also consider using the gradients to help improve the performance of OOD detection (Huang et al., 2021).

Except for the above algorithms, researchers also study the situation, where auxiliary OOD data can be obtained during the training process (Hendrycks et al., 2019; Dhamija et al., 2018). These methods are called outlier exposure, and have much better performance than the above methods due to the appearance of OOD data. However, the exposure of OOD data is a strong assumption (Yang et al., 2021). Thus, researchers also consider generating OOD data to help the separation of OOD and ID data (Vernekar et al., 2019). In this paper, we do not make an assumption that OOD data are available during training, since this assumption may not hold in real world.

OOD Detection Theory. Zhang et al. (2021) rejects the typical set hypothesis, the claim that relevant OOD distributions can lie in high likelihood regions of data distribution, as implausible. Zhang et al. (2021) argues that minimal density estimation errors can lead to OOD detection failures without assuming an overlap between ID and OOD distributions. Compared to Zhang et al. (2021), our theory focuses on the PAC learnable theory of OOD detection. If detectors are generated by FCNN, our theory (Theorem 20) shows that the overlap is the suﬃcient condition to the failure of learnability of OOD detection, which is complementary to Zhang et al. (2021). In addition, we identify several necessary and

5. Note that, some methods assume that OOD data are available in advance (Hendrycks et al., 2019; Dhamija et al., 2018). However, the exposure of OOD data is a strong assumption (Yang et al., 2021). We do not consider this situation in our paper.

Fang, Li, Liu, Han, Lu

suﬃcient conditions for the learnability of OOD detection, which opens a door to studying OOD detection in theory. Beyond Zhang et al. (2021), Morteza and Li (2022) paves a new avenue to designing provable OOD detection algorithms. Compared to Morteza and Li (2022), our paper aims to characterize the learnability of OOD detection to answer the question: is OOD detection PAC learnable?

Open-set Learning Theory. Liu et al. (2018) is the ﬁrst to propose the agnostic PAC guarantees for open-set detection. Unfortunately, the test data must be used during the training process. Fang et al. (2020) considers the open-set domain adaptation (OSDA) (Luo et al., 2020) and proposes the ﬁrst learning bound for OSDA. Fang et al. (2020) mainly depends on the positive-unlabeled learning techniques (Kiryo et al., 2017; Ishida et al., 2018; Chen et al., 2021b). However, similar to Liu et al. (2018), the test data must be available during training. To study open-set learning (OSL) without accessing the test data during training, Fang et al. (2021) proposes and studies the almost PAC learnability for OSL, which is motivated by transfer learning (Dong et al., 2020; Fang et al., 2022b). Recently, Wang et al. (2022) proposes a novel AUC-based OOD detection objective named Open AUC (Yang et al., 2023; Jiang et al., 2023) as objective function to learn open-set predictors, and builds a corresponding AUC-based open-set learning theory. In our paper, we study the PAC learnability for OOD detection, which is an open problem proposed by Fang et al. (2021).

Learning Theory for Classiﬁcation with Reject Option. Many works (Chow, 1970; Franc et al., 2021) also investigate the classiﬁcation with reject option (Cw RO) problem, which is similar to OOD detection in some cases. Cortes et al. (2016b,a); Ni et al. (2019); Charoenphakdee et al. (2021); Bartlett and Wegkamp (2008) study the learning theory and propose the agnostic PAC learning bounds for Cw RO. However, compared to our work regarding OOD detection, existing Cw RO theories mainly focus on how the ID risk (i.e., the risk that ID data is wrongly classiﬁed) is inﬂuenced by special rejection rules. Our theory not only focuses on the ID risk, but also pays attention to the OOD risk.

Robust Statistics. In the ﬁeld of robust statistics (Rousseeuw et al., 2011), researchers aim to propose estimators and testers that can mitigate the negative eﬀects of outliers (similar to OOD data). The proposed estimators are supposed to be independent of the potentially high dimensionality of the data (Ronchetti and Huber, 2009; Diakonikolas et al., 2020, 2019). Existing works (Diakonikolas et al., 2021; Cheng et al., 2021; Diakonikolas et al., 2022) in the ﬁeld have identiﬁed and resolved the statistical limits of outlier robust statistics by constructing estimators and proving impossibility results. In the future, it is a promising and interesting research direction to study the robustness of OOD detection based on robust statistics.

PQ Learning Theory. Under some conditions, PQ learning theory (Goldwasser et al., 2020; Kalai and Kanade, 2021) can be regarded as the PAC theory for OOD detection in the semi-supervised or transductive learning cases, i.e., test data are required during the training process. Additionally, PQ learning theory in Goldwasser et al. (2020); Kalai and Kanade (2021) aims to give the PAC estimation under Realizability Assumption (Shalev-Shwartz and Ben-David, 2014). Our theory focuses on the PAC theory in diﬀerent cases, which is more diﬃcult and more practical than PAC theory under Realizability Assumption.

On the Learnability of Out-of-distribution Detection

9. Conclusions and Future Works

OOD detection has become an important technique to increase the reliability of machine learning. However, its theoretical foundation is merely investigated, which hinders real-world applications of OOD detection algorithms. This paper is the ﬁrst to provide the PAC theory for OOD detection in terms of two commonly used metrics: risk and AUC. Our results imply that a universally consistent algorithm might not exist for all scenarios in OOD detection. Yet, we still discover some scenarios where OOD detection is learnable under risk or AUC metrics. Our theory reveals many necessary and suﬃcient conditions for the learnability of OOD detection under risk or AUC, hence paving a road to studying the learnability of OOD detection. In the future, we will focus on studying the robustness of OOD detection based on robust statistics (Diakonikolas et al., 2021; Diakonikolas and Kane, 2020).

Acknowledgments

JL and ZF were supported by the Australian Research Council (ARC) under FL190100149. YL is supported by National Science Foundation (NSF) Award No. IIS-2237037. FL was supported by the ARC with grant numbers DP230101540 and DE240101089, and the NSF&CSIRO Responsible AI program with grant number 2303037. ZF would also like to thank Prof. Peter Bartlett, Dr. Tongliang Liu and Dr. Zhiyong Yang for productive discussions.

J. V. Amersfoort, L. Smith, Y. W. Teh, and Y. Gal. Uncertainty estimation using a single deep deterministic neural network. In ICML, 2020.

P. L. Bartlett and W. Maass. Vapnik-chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, 2003.

P. L. Bartlett and M. H. Wegkamp. Classiﬁcation with a reject option using a hinge loss. Journal of Machine Learning Research, 2008.

P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1 17, 2019.

A. Bendale and T. E. Boult. Towards open set deep networks. In CVPR, 2016.

Y. Bulatov. Notmnist dataset. Google (Books/OCR), Tech. Rep.[Online]. Available: http://yaroslavvb. blogspot. it/2011/09/notmnist-dataset. html,2, 2011.

N. Charoenphakdee, Z. Cui, Y. Zhang, and M. Sugiyama. Classiﬁcation with rejection based on cost-sensitive classiﬁcation. In ICML, 2021.

J. Chen, Y. Li, X. Wu, Y. Liang, and S. Jha. Atom: Robustifying out-of-distribution detection using outlier mining. ECML, 2021a.

Fang, Li, Liu, Han, Lu

S. Chen, G. Niu, C. Gong, J. Li, J. Yang, and M. Sugiyama. Large-margin contrastive learning with distance polarization regularizer. In ICML, 2021b.

Y. Cheng, I. Diakonikolas, D. M. Kane, R. Ge, S. Gupta, and M. Soltanolkotabi. Outlierrobust sparse estimation via non-convex optimization. In Neur IPS, 2021.

C. K. Chow. On optimum recognition error and reject tradeoﬀ. IEEE Transactions on Information Theory, 1970.

D. L. Cohn. Measure theory. Springer, 2013.

C. Cortes, G. De Salvo, and M. Mohri. Boosting with abstention. In Neur IPS, 2016a.

C. Cortes, G. De Salvo, and M. Mohri. Learning with rejection. In ALT, 2016b.

L. Deecke, R. A. Vandermeulen, L. Ruﬀ, S. Mandt, and M. Kloft. Image anomaly detection with generative adversarial networks. In ECML, 2018.

L. Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag., 2012.

A. R. Dhamija, M. G unther, and T. E. Boult. Reducing network agnostophobia. In Neur IPS, pages 9175 9186, 2018.

I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. A shorter version appears as an Invited Book Chapter in Beyond the Worst-Case Analysis of Algorithms, 2020.

I. Diakonikolas, D. Kane, S. Karmalkar, E. Price, and A. Stewart. Outlier-robust highdimensional sparse estimation via iterative ﬁltering. In Neur IPS, 2019.

I. Diakonikolas, D. M. Kane, and A. Pensia. Outlier robust mean estimation with subgaussian rates via stability. In Neur IPS, 2020.

I. Diakonikolas, D. M. Kane, A. Stewart, and Y. Sun. Outlier-robust learning of ising models under dobrushin s condition. In COLT, 2021.

I. Diakonikolas, D. M. Kane, J. C. Lee, and A. Pensia. Outlier-robust sparse mean estimation for heavy-tailed distributions. In Neur IPS, 2022.

J. Dong, Y. Cong, G. Sun, B. Zhong, and X. Xu. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In CVPR, 2020.

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

Z. Fang, J. Lu, F. Liu, J. Xuan, and G. Zhang. Open set domain adaptation: Theoretical bound and algorithm. IEEE Transactions on Neural Networks and Learning Systems, 2020.

On the Learnability of Out-of-distribution Detection

Z. Fang, J. Lu, A. Liu, F. Liu, and G. Zhang. Learning bounds for open-set learning. In ICML, 2021.

Z. Fang, Y. Li, J. Lu, J. Dong, B. Han, and F. Liu. Is out-of-distribution detection learnable? In Neur IPS, 2022a.

Z. Fang, J. Lu, F. Liu, and G. Zhang. Semi-supervised heterogeneous domain adaptation: Theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022b.

S. Fort, J. Ren, and B. Lakshminarayanan. Exploring the limits of out-of-distribution detection. In Neur IPS, 2021a.

S. Fort, J. Ren, and B. Lakshminarayanan. Exploring the Limits of Out-of-Distribution Detection. In Neur IPS, 2021b.

V. Franc, D. Pr uˇsa, and V. Voracek. Optimal strategies for reject option classiﬁers. Co RR, abs/2101.12523, 2021.

S. Goldwasser, A. T. Kalai, Y. Kalai, and O. Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. In Neur IPS, 2020.

S. Goyal, A. Raghunathan, M. Jain, H. V. Simhadri, and P. Jain. DROCC: deep robust one-class classiﬁcation. In ICML, 2020.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch olkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 2012.

D. Hendrycks and K. Gimpel. A baseline for detecting misclassiﬁed and out-of-distribution examples in neural networks. In ICLR, 2017.

D. Hendrycks, M. Mazeika, and T. G. Dietterich. Deep anomaly detection with outlier exposure. In ICLR, 2019.

Y. Hsu, Y. Shen, H. Jin, and Z. Kira. Generalized ODIN: detecting out-of-distribution image without learning from out-of-distribution data. In CVPR, 2020.

G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.

R. Huang, A. Geng, and Y. Li. On the Importance of Gradients for Detecting Distributional Shifts in the Wild. In Neur IPS, 2021.

T. Ishida, G. Niu, and M. Sugiyama. Binary classiﬁcation from positive-conﬁdence data. In Neur IPS, 2018.

Y. Jiang, Q. Xu, Y. Zhao, Z. Yang, P. Wen, X. Cao, and Q. Huang. Positive-unlabeled learning with label distribution alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Fang, Li, Liu, Han, Lu

A. T. Kalai and V. Kanade. Eﬃcient learning with arbitrary covariate shift. In ALT, Proceedings of Machine Learning Research, 2021.

M. Karpinski and A. Macintyre. Polynomial bounds for VC dimension of sigmoidal and general pfaﬃan neural networks. J. Comput. Syst. Sci., 54(1):169 176, 1997.

D. P. Kingma and P. Dhariwal. Glow: Generative ﬂow with invertible 1x1 convolutions. In Neur IPS, 2018.

R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Neur IPS, 2017.

A. Krizhevsky and G. Hinton. Convolutional deep belief networks on cifar-10. Technical report, Citeseer, 2009.

A. Krizhevsky, V. Nair, and G. Hinton. Cifar-10 and cifar-100 datasets. 2009.

G. Kylberg. Kylberg texture dataset v. 1.0. 2011.

K. Lee, K. Lee, H. Lee, and J. Shin. A simple uniﬁed framework for detecting out-ofdistribution samples and adversarial attacks. In Neur IPS, 2018.

S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.

Z. Lin, S. D. Roy, and Y. Li. Mood: Multi-level out-of-distribution detection. In CVPR, 2021.

S. Liu, R. Garrepalli, T. G. Dietterich, A. Fern, and D. Hendrycks. Open category detection with PAC guarantees. In ICML, 2018.

W. Liu, X. Wang, J. D. Owens, and Y. Li. Energy-based out-of-distribution detection. In Neur IPS, 2020.

Y. Luo, Z. Wang, Z. Huang, and M. Baktashmotlagh. Progressive graph learning for open-set domain adaptation. In ICML, 2020.

Y. Ming, H. Yin, and Y. Li. On the impact of spurious correlation for out-of-distribution detection. AAAI, 2022.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.

P. Morteza and Y. Li. Provable guarantees for understanding out-of-distribution detection. AAAI, 2022.

E. T. Nalisnick, A. Matsukawa, Y. W. Teh, D. G or ur, and B. Lakshminarayanan. Do deep generative models know what they don t know? In ICLR, 2019.

C. Ni, N. Charoenphakdee, J. Honda, and M. Sugiyama. On the calibration of multiclass classiﬁcation with rejection. In Neur IPS, 2019.

On the Learnability of Out-of-distribution Detection

S. Pidhorskyi, R. Almohsen, and G. Doretto. Generative probabilistic novelty detection with adversarial autoencoders. In Neur IPS, 2018.

A. Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8: 143 195, 1999.

J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. De Pristo, J. V. Dillon, and B. Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Neur IPS, 2019.

J. Ren, S. Fort, J. Liu, A. G. Roy, S. Padhy, and B. Lakshminarayanan. A simple ﬁx to mahalanobis distance for improving near-ood detection. Co RR, abs/2106.09022, 2021.

E. M. Ronchetti and P. J. Huber. Robust statistics. John Wiley & Sons, 2009.

P. J. Rousseeuw, F. R. Hampel, E. M. Ronchetti, and W. A. Stahel. Robust statistics: the approach based on inﬂuence functions. John Wiley & Sons, 2011.

L. Ruﬀ, N. G ornitz, L. Deecke, S. A. Siddiqui, R. A. Vandermeulen, A. Binder, E. M uller, and M. Kloft. Deep one-class classiﬁcation. In ICML, 2018.

I. Safran and O. Shamir. Depth-width tradeoﬀs in approximating natural functions with neural networks. In ICML, 2017.

M. Salehi, H. Mirzaei, D. Hendrycks, Y. Li, M. H. Rohban, and M. Sabokrou. A uniﬁed survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges. ar Xiv preprint ar Xiv:2110.14051, 2021.

S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635 2670, 2010.

Y. Sun, C. Guo, and Y. Li. React: Out-of-distribution detection with rectiﬁed activations. In Neur IPS, 2021.

Y. Sun, Y. Ming, X. Zhu, and Y. Li. Out-of-distribution detection with deep nearest neighbors. In ICML, 2022.

S. Vernekar, A. Gaurav, V. Abdelzad, T. Denouden, R. Salay, and K. Czarnecki. Out-ofdistribution detection in classiﬁers via generation. In Neur IPS Workshop, 2019.

H. Wang, W. Liu, A. Bocchieri, and Y. Li. Can multi-label classiﬁcation networks know what they don t know? In Neur IPS, 2021.

Z. Wang, Q. Xu, Z. Yang, Y. He, X. Cao, and Q. Huang. Openauc: Towards auc-oriented open-set recognition. Advances in Neural Information Processing Systems, 35:25033 25045, 2022.

Fang, Li, Liu, Han, Lu

Z. Xiao, Q. Yan, and Y. Amit. Likelihood regret: An out-of-distribution detection score for variational auto-encoder. In Neur IPS, 2020.

J. Yang, K. Zhou, Y. Li, and Z. Liu. Generalized out-of-distribution detection: A survey. Co RR, abs/2110.11334, 2021.

J. Yang, K. Zhou, and Z. Liu. Full-spectrum out-of-distribution detection. Co RR, 2022.

Z. Yang, Q. Xu, W. Hou, S. Bao, Y. He, X. Cao, and Q. Huang. Revisiting auc-oriented adversarial training with loss-agnostic perturbations. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1 18, 2023.

A. Zaeemzadeh, N. Bisagno, Z. Sambugaro, N. Conci, N. Rahnavard, and M. Shah. Out-ofdistribution detection using union of 1-dimensional subspaces. In CVPR, 2021.

L. H. Zhang, M. Goldstein, and R. Ranganath. Understanding failures in out-of-distribution detection with deep generative models. In ICML, 2021.

B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2018.

B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In ICLR, 2018.

On the Learnability of Out-of-distribution Detection

Table of Contents of Appendix

A Notations 31

A.1 Main Notations and Their Descriptions . . . . . . . . . . . . . . . . . . . . 31

A.2 Realizability Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

A.3 Learnability and PAC learnability . . . . . . . . . . . . . . . . . . . . . . . 32

B Proof of Theorem 1 34

C Proof of Theorem 2 36

D Proof of Theorem 3 41

E Proofs of Theorem 4 and Theorem 5 42

E.1 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

E.2 Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

F Proof of Theorem 6 44

G Proofs of Lemma 7 and Theorem 8 50

G.1 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

G.2 Proof of Theorem 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

H Proof of Theorem 9 53

I Proofs of Theorem 10 and Theorem 11 54

I.1 Proof of Theorem 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

I.2 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

J Proof of Theorem 12 59

K Proofs of Theorems 13 and 14 60

K.1 Proof of Theorem 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

K.2 Proof of Theorem 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

L Proof of Theorem 15 66

M Proofs of Proposition 1 and Proposition 2 69

N Proofs of Proposition 3 and Proposition 4 71

O Proof of Theorem 16 73

Fang, Li, Liu, Han, Lu

P Proof of Theorem 17 80

Q Proof of Theorem 18 81

R Proof of Theorem 19 82

S Proof of Theorem 20 82

T Proof of Theorem 21 83

On the Learnability of Out-of-distribution Detection

Table 1: Main notations and their descriptions. Notation Description Spaces and Labels d and X Rd the feature dimension of data point and feature space Y ID label space {1, ..., K} K + 1 K + 1 represents the OOD labels Yall Y {K + 1} Distributions XI, XO, YI, YO ID feature, OOD feature, ID label, OOD label random variables DXIYI, DXOYO ID joint distribution and OOD joint distribution Dα XY Dα XY = (1 α)DXIYI + αDXOYO, α [0, 1] πout class-prior probability for OOD distribution DXY DXY = (1 πout)DXIYI + πout DXOYO, called domain DXI, DXO, DX marginal distributions for DXIYI, DXOYO and DXY , respectively Domain Spaces DXY domain space consisting of some domains Dall XY total space Ds XY seperate space DDXY XY single-distribution space DF XY ﬁnite-ID-distribution space Dµ,b XY density-based space Loss Function, Function Spaces ℓ( , ) loss: Yall Yall R 0: ℓ(y1, y2) = 0 if and only if y1 = y2 H hypothesis space Hin ID hypothesis space Hb hypothesis space in binary classiﬁcation Fl scoring function space consisting some l dimensional vectorvalued functions R ranking function space consisting some ranking functions Risks, Partial Risks and AUC RD(h) risk corresponding to DXY Rin D(h) partial risk corresponding to DXIYI Rout D (h) partial risk corresponding to DXOYO Rα D(h) α-risk corresponding to Dα XY AUC(r; DXY ) AUC corresponding to DXY AUC(r; DXI, DXO) AUC corresponding to DXI, DXO Fully-Connected Neural Networks q a sequence (l1, ..., lg) to represent the architecture of FCNN σ activation function. In this paper, we use Re LU function Fσ q FCNN-based scoring function space Hσ q FCNN-based hypothesis space fw,b FCNN-based scoring function, which is from Fσ q hw,b FCNN-based hypothesis function, which is from Hσ q Score-based Hypothesis Space E scoring function λ threshold Hσ,λ q,E score-based hypothesis space a binary classiﬁcation space hλ f,E score-based hypothesis function a binary classiﬁer

Appendix A. Notations

A.1 Main Notations and Their Descriptions

In this section, we summarize important notations in Table 1.

Fang, Li, Liu, Han, Lu

Given f = [f1, ..., fl] , for any x X,

arg max k {1,...,l} fk(x) := max{k {1, ..., l} : fk(x) fi(x), i = 1, ..., l},

where fk is the k-th coordinate of f and fi is the i-th coordinate of f. The above deﬁnition about arg max aims to overcome some special cases. For example, there exist k1, k2 (k1 < k2) such that fk1(x) = fk2(x) and fk1(x) > fi(x), fk2(x) > fi(x), i {1, ..., l} {k1, k2}. Then, according to the above deﬁnition, k2 = arg maxk {1,...,l} fk(x).

A.2 Realizability Assumptions

Assumption 4 (Risk-based Realizability Assumption) A domain space DXY and hypothesis space H satisfy the Risk-based Realizability Assumption, if for each domain DXY DXY , there exists at least one hypothesis function h H such that RD(h ) = 0.

Assumption 5 (AUC-based Realizability Assumption) A domain space DXY and ranking function space R satisfy the AUC-based Realizability Assumption under AUC, if for each domain DXY DXY , there exists at least one hypothesis function r R such that AUC(h ; DXY ) = 1.

A.3 Learnability and PAC learnability

Here we give a proof to show that Learnability given in Deﬁnition 1 and PAC learnability are equivalent.

First, we prove that Learnability concludes the PAC learnability.

According to Deﬁnition 1,

ES Dn XIYIRD(A(S)) inf h H RD(h) + ϵcons(n),

which implies that

ES Dn XIYI[RD(A(S)) inf h H RD(h)] ϵcons(n).

Note that RD(A(S)) infh H RD(h) 0. Therefore, by Markov s inequality, we have

P(RD(A(S)) inf h H RD(h) < ϵ) > 1 ES Dn XIYI[RD(A(S)) inf h H RD(h)]/ϵ 1 ϵcons(n)/ϵ.

Because ϵcons(n) is monotonically decreasing, we can ﬁnd a smallest m such that ϵcons(m) ϵδ and ϵcons(m 1) < ϵδ, for δ (0, 1). We deﬁne that m(ϵ, δ) = m. Therefore, for any ϵ > 0 and δ (0, 1), there exists a function m(ϵ, δ) such that when n > m(ϵ, δ), with the probability at least 1 δ, we have RD(A(S)) inf h H RD(h) < ϵ,

which is the deﬁnition of PAC learnability.

On the Learnability of Out-of-distribution Detection

Second, we prove that the PAC learnability concludes Learnability.

PAC-learnability: for any ϵ > 0 and 0 < δ < 1, there exists a function m(ϵ, δ) > 0 such that when the sample size n > m(ϵ, δ), we have that with the probability at least 1 δ > 0,

RD(A(S)) inf h H RD(h) ϵ.

Note that the loss ℓdeﬁned in Section 2 has upper bound (because Y {K +1} is a ﬁnite set). We assume the upper bound of ℓis M. Hence, according to the deﬁnition of PAC-learnability, when the sample size n > m(ϵ, δ), we have that

ES[RD(A(S)) inf h H RD(h)] ϵ(1 δ) + 2Mδ < ϵ + 2Mδ.

If we set δ = ϵ, then when the sample size n > m(ϵ, ϵ), we have that

ES[RD(A(S)) inf h H RD(h)] < (2M + 1)ϵ,

this implies that lim n + ES[RD(A(S)) inf h H RD(h)] = 0,

which implies the Learnability in Deﬁnition 1. We have completed this proof.

Fang, Li, Liu, Han, Lu

Appendix B. Proof of Theorem 1

Theorem 1 Given spaces DXY and D XY = {Dα XY : DXY DXY , α [0, 1)}, then 1) D XY is a priori-unknown space and DXY D XY ; 2) if DXY is a priori-unknown space, then Deﬁnition 1 and Deﬁnition 2 are equivalent; 3) OOD detection is strongly learnable in DXY under risk if and only if OOD detection is learnable in D XY under risk; 4) OOD detection is learnable in DXY under AUC if and only if OOD detection is learnable in D XY under AUC.

Proof [Proof of Theorem 1.] Proof of the First Result. To prove that D XY is a priori-unknown space, we need to show that for any Dα XY D XY , then Dα XY D XY for any α [0, 1).

According to the deﬁnition of D XY , for any Dα XY D XY , we can ﬁnd a domain DXY DXY , which can be written as DXY = (1 πout)DXIYI + πout DXOYO (here πout [0, 1)) such that

Dα XY = (1 α )DXIYI + α DXOYO.

Note that Dα XY = (1 α)DXIYI + αDXOYO. Therefore, based on the deﬁnition of D XY , for any α [0, 1), Dα XY D XY , which implies that D XY is a prior-known space. Additionally, for any DXY DXY , we can rewrite DXY as Dπout XY , thus DXY = Dπout XY D XY , which implies that DXY D XY .

Proof of the Second Result. First, we prove that Deﬁnition 1 concludes Deﬁnition 2, if DXY is a prior-unknown space: DXY is a priori-unknown space, and OOD detection is learnable in DXY for H. OOD detection is strongly learnable in DXY for H: there exist an algorithm A : + n=1(X Y)n H, and a monotonically decreasing sequence ϵ(n), such that ϵ(n) 0, as n +

ES Dn XIYI Rα D(A(S)) inf h H Rα D(h) ϵ(n), α [0, 1], DXY DXY .

In the priori-unknown space, for any DXY DXY , we have that for any α [0, 1),

Dα XY = (1 α)DXIYI + αDXOYO DXY .

Then, according to the deﬁnition of learnability of OOD detection, we have an algorithm A and a monotonically decreasing sequence ϵcons(n) 0, as n + , such that for any α [0, 1),

ES Dn XIYIRDα(A(S)) inf h H RDα(h) + ϵcons(n), (by the property of priori-unknown space)

RDα(A(S)) = Z

X Yall ℓ(A(S)(x), y)d Dα XY (x, y), RDα(h) = Z

X Yall ℓ(h(x), y)d Dα XY (x, y).

On the Learnability of Out-of-distribution Detection

Since RDα(A(S)) = Rα D(A(S)) and RDα(h) = Rα D(h), we have that

ES Dn XIYIRα D(A(S)) inf h H Rα D(h) + ϵcons(n), α [0, 1). (10)

Next, we consider the case that α = 1. Note that

lim inf α 1 inf h H Rα D(h) lim inf α 1 α inf h H Rout D (h) = inf h H Rout D (h). (11)

Then, we assume that hϵ H satisﬁes that

Rout D (hϵ) inf h H Rout D (h) ϵ.

It is obvious that Rα D(hϵ) inf h H Rα D(h).

Let α 1. Then, for any ϵ > 0,

Rout D (hϵ) = lim α 1 Rα D(hϵ) = lim sup α 1 Rα D(hϵ) lim sup α 1 inf h H Rα D(h),

which implies that

inf h H Rout D (h) = lim ϵ 0 Rout D (hϵ) lim ϵ 0 lim sup α 1 inf h H Rα D(h) = lim sup α 1 inf h H Rα D(h). (12)

Combining Eq. (11) with Eq. (12), we have

inf h H Rout D (h) = lim sup α 1 inf h H Rα D(h) = lim inf α 1 inf h H Rα D(h), (13)

which implies that inf h H Rout D (h) = lim α 1 inf h H Rα D(h). (14)

ES Dn XIYIRα D(A(S)) = (1 α)ES Dn XIYIRin D(A(S)) + αES Dn XIYIRout D (A(S)).

Hence, Lebesgue s Dominated Convergence Theorem (Cohn, 2013) implies that

lim α 1 ES Dn XIYIRα D(A(S)) = ES Dn XIYIRout D (A(S)). (15)

Using Eq. (10), we have that

lim α 1 ES Dn XIYIRα D(A(S)) lim α 1 inf h H Rα D(h) + ϵcons(n). (16)

Combining Eq. (14), Eq. (15) with Eq. (16), we obtain that

ES Dn XIYIRout D (A(S)) inf h H Rout D (h) + ϵcons(n).

Since Rout D (A(S)) = R1 D(A(S)) and Rout D (h) = R1 D(h), we obtain that

ES Dn XIYIR1 D(A(S)) inf h H R1 D(h) + ϵcons(n). (17)

Fang, Li, Liu, Han, Lu

Combining Eq. (10) and Eq. (17), we have proven that: if the domain space DXY is a priori-unknown space, then OOD detection is learnable in DXY for H. OOD detection is strongly learnable in DXY for H: there exist an algorithm A : + n=1(X Y)n H, and a monotonically decreasing sequence ϵ(n), such that ϵ(n) 0, as n + ,

ES Dn XIYIRα D(A(S)) inf h H Rα D(h) + ϵ(n), α [0, 1], DXY DXY .

Second, we prove that Deﬁnition 2 concludes Deﬁnition 1: OOD detection is strongly learnable in DXY for H: there exist an algorithm A : + n=1(X Y)n H, and a monotonically decreasing sequence ϵ(n), such that ϵ(n) 0, as n + ,

ES Dn XIYI Rα D(A(S)) inf h H Rα D(h) ϵ(n), α [0, 1], DXY DXY .

OOD detection is learnable in DXY for H. If we set α = πout, then ES Dn XIYIRα D(A(S)) infh H Rα D(h) + ϵ(n) implies that

ES Dn XIYIRD(A(S)) inf h H RD(h) + ϵ(n),

which means that OOD detection is learnable in DXY for H. We have completed this proof.

Proof of the Third Result. The third result is a simple conclusion of the second result. Hence, we omit it.

Proof of the Fourth Result. The fourth result is a simple conclusion of the property of AUC metric. Hence, we omit it.

Appendix C. Proof of Theorem 2

Before introducing the proof of Theorem 2, we extend Condition 1 to a general version (Condition 5). Then, Lemma 1 proves that Conditions 1 and 5 are the necessary conditions for the learnability of OOD detection. First, we provide the details of Condition 5. Let o l = {(λ1, ..., λl) : Pl j=1 λj < 1 and λj 0, j = 1, ..., l}, where l is a positive integer. Next, we introduce an important deﬁnition as follows:

Condition 7 (OOD Convex Decomposition and Convex Domain) Given any domain DXY DXY , we say joint distributions Q1, ..., Ql, which are deﬁned over X {K + 1}, are the OOD convex decomposition for DXY , if

j=1 λj)DXIYI +

On the Learnability of Out-of-distribution Detection

for some (λ1, ..., λl) o l . We also say domain DXY DXY is an OOD convex domain corresponding to OOD convex decomposition Q1, ..., Ql, if for any (α1, ..., αl) o l ,

j=1 αj)DXIYI +

j=1 αj Qj DXY .

We extend the linear condition (Condition 1) to a multi-linear scenario.

Condition 5 (Multi-linear Condition) For each OOD convex domain DXY DXY corresponding to OOD convex decomposition Q1, ..., Ql, the following function

f D,Q(α1, ..., αl) := inf h H

j=1 αj)Rin D(h) +

j=1 αj RQj(h) , (α1, ..., αl) o l

satisﬁes that

f D,Q(α1, ..., αl) = (1

j=1 αj)f D,Q(0) +

j=1 αjf D,Q(αj),

where 0 is the 1 l vector, whose elements are 0, and αj is the 1 l vector, whose j-th element is 1 and other elements are 0.

This is a more general condition compared to Condition 1. When l = 1 and the domain space DXY is a priori-unknown space, Condition 5 degenerates into Condition 1. Lemma 1 shows that Condition 5 is necessary for the learnability of OOD detection.

Lemma 1 Given a priori-unknown space DXY and a hypothesis space H, if OOD detection is learnable in DXY for H, then Conditions 1 and 5 hold.

Proof [Proof of Lemma 1] Since Condition 1 is a special case of Condition 5, we only need to prove that Condition 5 holds.

For any OOD convex domain DXY DXY corresponding to OOD convex decomposition Q1, ..., Ql, and any (α1, ..., αl) o l , we set

Qα = 1 Pl i=1 αi

Then, we deﬁne

i=1 αi)DXIYI + (

i=1 αi)Qα, which belongs to DXY .

Let Rα D(h) = Z

X Yall ℓ(h(x), y)d Dα XY (x, y).

Fang, Li, Liu, Han, Lu

Since OOD detection is learnable in DXY for H, there exist an algorithm A : + n=1(X Y)n H, and a monotonically decreasing sequence ϵ(n), such that ϵ(n) 0, as n + , and

0 ES Dn XIYIRα D(A(S)) inf h H Rα D(h) ϵ(n).

ES Dn XIYIRα D(A(S)) = (1

j=1 αj)ES Dn XIYIRin D(A(S)) +

j=1 αj ES Dn XIYIRQj(A(S)),

and inf h H Rα D(h) = f D,Q(α1, ..., αl),

where RQj(A(S)) = Z

X {K+1} ℓ(A(S)(x), y)d Qj(x, y).

Therefore, we have that for any (α1, ..., αl) o l ,

j=1 αj)ES Dn XIYIRin D(A(S)) +

j=1 αj ES Dn XIYIRQj(A(S)) f D,Q(α1, ..., αl) ϵ(n).

gn(α1, ..., αl) = (1

j=1 αj)ES Dn XIYIRin D(A(S)) +

j=1 αj ES Dn XIYIRQj(A(S)).

Note that Eq. (18) implies that

lim n + gn(α1, ..., αl) = f D,Q(α1, ..., αl), (α1, ..., αl) o l ,

lim n + gn(0) = f D,Q(0). (19)

Step 1. Since αj / o l , we need to prove that

lim n + ES Dn XIYIRQj(A(S)) = f(αj), i.e., lim n + gn(αj) = f(αj), (20)

where αj is the 1 l vector, whose j-th element is 1 and other elements are 0. Let DXY = 0.5 DXIYI + 0.5 Qj. The second result of Theorem 1 implies that

ES Dn XIYIRout D (A(S)) inf h H Rout D (h) + ϵ(n).

Since Rout D (A(S)) = RQj(A(S)) and Rout D (h) = RQj(h),

ES Dn XIYIRQj(A(S)) inf h H RQj(h) + ϵ(n).

On the Learnability of Out-of-distribution Detection

Note that infh H RQj(h) ES Dn XIYIRQj(A(S)). We have

0 ES Dn XIYIRQj(A(S)) inf h H RQj(h) ϵ(n). (21)

Eq. (21) implies that

lim n + ES Dn XIYIRQj(A(S)) = inf h H RQj(h). (22)

We note that infh H RQj(h) = f D,Q(αj). Therefore,

lim n + ES Dn XIYIRQj(A(S)) = f D,Q(αj), i.e., lim n + gn(αj) = f(αj). (23)

Step 2. It is easy to check that for any (α1, ..., αl) o l ,

lim n + gn(α1, ..., αl) = lim n + (1

j=1 αj)gn(0) +

j=1 αjgn(αj)

j=1 αj) lim n + gn(0) +

j=1 αj lim n + gn(αj).

According to Eq. (19) and Eq. (23), we have

lim n + gn(α1, ..., αl) = f D,Q(α1, ..., αl), (α1, ..., αl) o l ,

lim n + gn(0) = f D,Q(0),

lim n + gn(αj) = f(αj),

Combining Eq. (25) with Eq. (24), we complete the proof.

inf h H Rα D(h) = (1 α) inf h H Rin D(h) + α inf h H Rout D (h), α [0, 1),

if and only if for any ϵ > 0,

{h H : Rin D(h ) inf h H Rin D(h) + 2ϵ} {h H : Rout D (h ) inf h H Rout D (h) + 2ϵ} = .

Proof [Proof of Lemma 2] For the sake of convenience, we set f D(α) = infh H Rα D(h), for any α [0, 1]. First, we prove that f D(α) = (1 α)f D(0) + αf D(1), α [0, 1) implies

{h H : Rin D(h ) inf h H Rin D(h) + 2ϵ} {h H : Rout D (h ) inf h H Rout D (h) + 2ϵ} = .

For any ϵ > 0 and 0 α < 1, we can ﬁnd hα ϵ H satisfying that

Rα D(hα ϵ ) inf h H Rα D(h) + ϵ.

Fang, Li, Liu, Han, Lu

inf h H Rα D(h) = inf h H

(1 α)Rin D(h) + αRout D (h) (1 α) inf h H Rin D(h) + α inf h H Rout D (h).

(1 α) inf h H Rin D(h) + α inf h H Rout D (h) inf h H Rα D(h) Rα D(hα ϵ ) inf h H Rα D(h) + ϵ. (26)

Note that f D(α) = (1 α)f D(0) + αf D(1), α [0, 1), i.e.,

inf h H Rα D(h) = (1 α) inf h H Rin D(h) + α inf h H Rout D (h), α [0, 1). (27)

Using Eqs. (26) and (27), we have that for any 0 α < 1,

ϵ Rα D(hα ϵ ) inf h H Rα D(h) = (1 α) Rin D(hα ϵ ) inf h H Rin D(h) + α Rout D (hα ϵ ) inf h H Rout D (h) .

(28) Since Rout D (hα ϵ ) infh H Rout D (h) 0 and Rin D(hα ϵ ) infh H Rin D(h) 0, Eq. (28) implies that: for any 0 < α < 1,

Rin D(hα ϵ ) inf h H Rin D(h) + ϵ/(1 α),

Rout D (hα ϵ ) inf h H Rout D (h) + ϵ/α.

hα ϵ {h H : Rin D(h ) inf h H Rin D(h)+ϵ/(1 α)} {h H : Rout D (h ) inf h H Rout D (h)+ϵ/α}.

If we set α = 0.5, we obtain that for any ϵ > 0,

{h H : Rin D(h ) inf h H Rin D(h) + 2ϵ} {h H : Rout D (h ) inf h H Rout D (h) + 2ϵ} = .

Second, we prove that for any ϵ > 0, if

{h H : Rin D(h ) inf h H Rin D(h) + 2ϵ} {h H : Rout D (h ) inf h H Rout D (h) + 2ϵ} = ,

then f D(α) = (1 α)f D(0) + αf D(1), for any α [0, 1).

Let hϵ {h H : Rin D(h ) infh H Rin D(h)+2ϵ} {h H : Rout D (h ) infh H Rout D (h)+2ϵ}. Then,

inf h H Rα D(h) Rα D(hϵ) (1 α) inf h H Rin D(h) + α inf h H Rout D (h) + 2ϵ inf h H Rα D(h) + 2ϵ,

which implies that |f D(α) (1 α)f D(0) αf D(1)| 2ϵ. As ϵ 0, |f D(α) (1 α)f D(0) αf D(1)| 0. We have completed the proof.

On the Learnability of Out-of-distribution Detection

Theorem 2 Given a hypothesis space H and a domain DXY , OOD detection is learnable under risk in the single-distribution space DDXY XY for H if and only if Condition 1 holds.

Proof [Proof of Theorem 2] Based on Lemma 1, we obtain that Condition 1 is the necessary condition for the learnability of OOD detection in the single-distribution space DDXY XY . Next, it suﬃces to prove that Condition 1 is the suﬃcient condition for the learnability of OOD detection in the single-distribution space DDXY XY . We use Lemma 2 to prove the suﬃcient condition.

Let F be the inﬁnite sequence set that consists of all inﬁnite sequences, whose coordinates are hypothesis functions, i.e.,

F = {h = (h1, ..., hn, ...) : hn H, n = 1, ...., + }.

For each h F, there is a corresponding algorithm Ah6: Ah(S) = hn, if |S| = n. F generates an algorithm class A = {Ah : h F}. We select a consistent algorithm from the algorithm class A .

We construct a special inﬁnite sequence h = ( h1, ..., hn, ...) F. For each positive integer n, we select hn from {h H : Rin D(h ) infh H Rin D(h) + 2/n} {h H : Rout D (h ) infh H Rout D (h) + 2/n} (the existence of hn is based on Lemma 2). It is easy to check that

ES Dn XIYIRin D(A h(S)) inf h H Rin D(h) + 2/n.

ES Dn XIYIRout D (A h(S)) inf h H Rout D (h) + 2/n.

Since (1 α) infh H Rin D(h) + α infh H Rout D (h) infh H Rα D(h), we obtain that for any α [0, 1],

ES Dn XIYIRα D(A h(S)) inf h H Rα D(h) + 2/n.

We have completed this proof.

Appendix D. Proof of Theorem 3

Theorem 3 Given a ranking function space R and a domain space DXY , if OOD detection is learnable under AUC for R in DXY , then for any DXY , D XY DXY , the linear condition under AUC (i.e., Condition 2) holds.

Proof [Proof of Theorem 3] According to Deﬁnition 3, we assume that A is the AUC learnable algorithm: there exists learning rate ϵ(n) such that for any DXY DXY ,

ES Dn XIYI sup r R AUC(r; DXY ) AUC(A(S); DXY ) ϵ(n). (29)

6. In this paper, we regard an algorithm as a mapping from + n=1(X Y)n to H or R. So we can design an algorithm like this.

Fang, Li, Liu, Han, Lu

For any DXY = βDXIYI + (1 β)DXOYO, D XY = β DXIYI + (1 β )D XOYO DXY , we set Dα XO = αDXO + (1 α)D XO. Then it is clear that for any β (0, 1],

AUC(A(S); βDXIYI + (1 β)Dα XOYO)

=AUC(A(S); DXI, Dα XO)

=αAUC(A(S); DXI, DXO) + (1 α)AUC(A(S); DXI, D XO)

=αAUC(A(S); DXY ) + (1 α)AUC(A(S); D XY ).

ES Dn XIYI sup r R AUC(r; βDXIYI + (1 β)Dα XOYO) AUC(A(S); βDXIYI + (1 β)Dα XOYO)

ES Dn XIYI α sup r R AUC(r; DXY ) + (1 α) sup r R AUC(r; D XY )

αAUC(A(S); DXY ) (1 α)AUC(A(S); D XY )] ϵ(n),

which implies that

sup r R AUC(r; DXI, Dα XO) = lim n + ES Dn XIYIAUC(A(S); DXI, Dα XO).

lim n + ES Dn XIYIAUC(A(S); DXI, Dα XO)

=α lim n + ES Dn XIYIAUC(A(S); DXI, DXO) + (1 α) lim n + ES Dn XIYIAUC(A(S); DXI, D XO)

=α sup r R AUC(r; DXI, DXO) + (1 α) sup r R AUC(r; DXI, D XO).

sup r R AUC(r; DXI, Dα XO) = α sup r R AUC(r; DXI, DXO) + (1 α) sup r R AUC(r; DXI, D XO).

Appendix E. Proofs of Theorem 4 and Theorem 5

E.1 Proof of Theorem 4

Lemma 22 Given a hypothesis space H and a prior-unknown space DXY , if there is DXY DXY , which has overlap between ID and OOD, and infh H Rin D(h) = 0, infh H Rout D (h) = 0, then Condition 1 does not hold. Therefore, OOD detection is not learnable under risk in DXY for H.

Proof [Proof of Theorem 4] We ﬁrst explain how we get f I and f O in Deﬁnition 5. Since DX is absolutely continuous respect to µ (DX µ), then DXI µ and DXO µ. By

On the Learnability of Out-of-distribution Detection

Radon-Nikodym Theorem (Cohn, 2013), we know there exist two non-negative functions deﬁned over X: f I and f O such that for any µ-measurable set A X,

A f I(x)dµ(x), DXO(A) = Z

A f O(x)dµ(x).

Second, we prove that for any α (0, 1), infh H Rα D(h) > 0. We deﬁne Am = {x X : f I(x) 1

m and f O(x) 1

m}. It is clear that

+ m=1Am = {x X : f I(x) > 0 and f O(x) > 0} = Aoverlap,

and Am Am+1.

Therefore, lim m + µ(Am) = µ(Aoverlap) > 0,

which implies that there exists m0 such that

µ(Am0) > 0.

For any α (0, 1), we deﬁne cα = miny1 Yall (1 α) miny2 Y ℓ(y1, y2) + αℓ(y1, K + 1) . It is clear that cα > 0 for α (0, 1). Then, for any h H,

X Yall ℓ(h(x), y)d Dα XY (x, y)

X Y (1 α)ℓ(h(x), y)d DXIYI(x, y) + Z

X {K+1} αℓ(h(x), y)d DXOYO(x, y)

Am0 Y (1 α)ℓ(h(x), y)d DXIYI(x, y) + Z

Am0 {K+1} αℓ(h(x), y)d DXOYO(x, y)

Y ℓ(h(x), y)d DYI|XI(y|x) d DXI(x)

Am0 αℓ(h(x), K + 1)d DXO(x)

Am0 (1 α) min y2 Y ℓ(h(x), y2)d DXI(x) + Z

Am0 αℓ(h(x), K + 1)d DXO(x)

Am0 (1 α) min y2 Y ℓ(h(x), y2)f I(x)dµ(x) + Z

Am0 αℓ(h(x), K + 1)f O(x)dµ(x)

Am0 (1 α) min y2 Y ℓ(h(x), y2)dµ(x) + 1

Am0 αℓ(h(x), K + 1)dµ(x)

(1 α) min y2 Y ℓ(h(x), y2) + αℓ(h(x), K + 1) dµ(x) cα

m0 µ(Am0) > 0.

Therefore, inf h H Rα D(h) cα

m0 µ(Am0) > 0.

Fang, Li, Liu, Han, Lu

Third, Condition 1 indicates that infh H Rα D(h) = (1 α) infh H Rin D(h)+α infh H Rin D(h) = 0 (here we have used conditions infh H Rin D(h) = 0 and infh H Rout D (h) = 0), which contradicts with infh H Rα D(h) > 0 (α (0, 1)). Therefore, Condition 1 does not hold. Using Lemma 1, we obtain that OOD detection in DXY is not learnable for H.

E.2 Proof of Theorem 5

Theorem 5 (Impossibility Theorem for Total Space under Risk) OOD detection is not learnable under risk in the total space Dall XY for H, if |φ H| > 1, where φ maps ID labels to 1 and maps OOD labels to 2.

Proof [Proof of Theorem 5] We need to prove that OOD detection is not learnable in the total space Dall XY for H, if H is non-trivial, i.e., {x X : h1, h2 H, s.t. h1(x) Y, h2(x) = K + 1} = . The main idea is to construct a domain DXY satisfying that:

1) the ID and OOD distributions have overlap (Deﬁnition 5);

2) Rin D(h1) = 0, Rout D (h2) = 0.

According to the condition that H is non-trivial, we know that there exist h1, h2 H such that h1(x1) Y, h2(x1) = K+1, for some x1 X. We set DXY = 0.5 δ(x1,h1(x1))+0.5 δ(x1,h2(x1)), where δ is the Dirac measure. It is easy to check that Rin D(h1) = 0, Rout D (h2) = 0, which implies that infh H Rin D(h) = 0 and infh H Rout D (h) = 0. In addition, the ID distribution δ(x1,h1(x1)) and OOD distribution δ(x1,h2(x1)) have overlap x1. By using Theorem 4, we have completed this proof.

Appendix F. Proof of Theorem 6

Before proving Theorem 6, we need three important lemmas.

Lemma 3 Suppose that DXY is a domain with OOD convex decomposition Q1, ..., Ql (convex decomposition is given by Deﬁnition 7 in Appendix C), and DXY is a ﬁnite discrete distribution, then (the deﬁnition of f D,Q is given in Condition 5)

f D,Q(α1, ..., αl) = (1

j=1 αj)f D,Q(0) +

j=1 αjf D,Q(αj), (α1, ..., αl) o l ,

if and only if

arg min h H RD(h) =

j=1 arg min h H RQj(h) \ arg min h H Rin D(h),

On the Learnability of Out-of-distribution Detection

where 0 is the 1 l vector, whose elements are 0, and αj is the 1 l vector, whose j-th element is 1 and other elements are 0, and

X {K+1} ℓ(h(x), y)d Qj(x, y).

Proof [Proof of Lemma 3] To better understand this proof, we recall the deﬁnition of f D,Q(α1, ..., αl):

f D,Q(α1, ..., αl) = inf h H

j=1 αj)Rin D(h) +

j=1 αj RQj(h) , (α1, ..., αl) o l

First, we prove that if

f D,Q(α1, ..., αl) = (1

j=1 αj)f D,Q(0) +

j=1 αjf D,Q(αj), (α1, ..., αl) o l ,

arg min h H RD(h) =

j=1 arg min h H RQj(h) \ arg min h H Rin D(h).

Let DXY = (1 Pl j=1 λj)DXIYI + Pl j=1 λj Qj, for some (λ1, ..., λl) o l . Since DXY has ﬁnite support set, we have

arg min h H RD(h) = arg min h H

j=1 λj)Rin D(h) +

j=1 λj RQj(h) = .

We can ﬁnd that h0 arg minh H (1 Pl j=1 λj)Rin D(h) + Pl j=1 λj RQj(h) . Hence,

j=1 λj)Rin D(h0) +

j=1 λj RQj(h0) = inf h H

j=1 λj)Rin D(h) +

j=1 λj RQj(h) . (30)

Note that the condition f D,Q(α1, ..., αl) = (1 Pl j=1 αj)f D,Q(0) + Pl j=1 αjf D,Q(αj) implies

j=1 λj) inf h H Rin D(h)+

j=1 λj inf h H RQj(h) = inf h H

j=1 λj)Rin D(h)+

j=1 λj RQj(h) . (31)

Therefore, Eq. (30) and Eq. (31) imply that

j=1 λj) inf h H Rin D(h) +

j=1 λj inf h H RQj(h) = (1

j=1 λj)Rin D(h0) +

j=1 λj RQj(h0). (32)

Since Rin D(h0) infh H Rin D(h) and RQj(h0) infh H Rin Qj(h), for j = 1, ..., l, then using Eq. (32), we have that

Rin D(h0) = inf h H Rin D(h),

RQj(h0) = inf h H RQj(h), j = 1, ..., l,

Fang, Li, Liu, Han, Lu

which implies that

j=1 arg min h H RQj(h) \ arg min h H Rin D(h).

arg min h H RD(h)

j=1 arg min h H RQj(h) \ arg min h H Rin D(h). (33)

Additionally, using

f D,Q(α1, ..., αl) = (1

j=1 αj)f D,Q(0) +

j=1 αjf D,Q(αj), (α1, ..., αl) o l ,

we obtain that for any h Tl j=1 arg minh H RQj(h) T arg minh H Rin D(h),

inf h H RD(h) = inf h H

j=1 λj)Rin D(h) +

j=1 λj RQj(h)

j=1 λj) inf h H Rin D(h) +

j=1 λj inf h H RQj(h)

j=1 λj)Rin D(h ) +

j=1 λj RQj(h ) = RD(h ),

which implies that h arg min h H RD(h).

Therefore, l\

j=1 arg min h H RQj(h) \ arg min h H Rin D(h) arg min h H RD(h). (34)

Combining Eq. (33) with Eq. (34), we obtain that

j=1 arg min h H RQj(h) \ arg min h H Rin D(h) = arg min h H RD(h).

Second, we prove that if

arg min h H RD(h) =

j=1 arg min h H RQj(h) \ arg min h H Rin D(h),

f D,Q(α1, ..., αl) = (1

j=1 αj)f D,Q(0) +

j=1 αjf D,Q(αj), (α1, ..., αl) o l .

On the Learnability of Out-of-distribution Detection

j=1 arg min h H RQj(h) \ arg min h H Rin D(h),

then, for any (α1, ..., αl) o l ,

j=1 αj) inf h H Rin D(h) +

j=1 αj inf h H RQj(h) inf h H

j=1 αj)Rin D(h) +

j=1 αj RQj(h)

j=1 αj)Rin D(h0) +

j=1 αj RQj(h0)

j=1 αj) inf h H Rin D(h) +

j=1 αj inf h H RQj(h).

Therefore, for any (α1, ..., αl) o l ,

j=1 αj) inf h H Rin D(h) +

j=1 αj inf h H RQj(h) = inf h H

j=1 αj)Rin D(h) +

j=1 αj RQj(h) ,

which implies that: for any (α1, ..., αl) o l ,

f D,Q(α1, ..., αl) = (1

j=1 αj)f D,Q(0) +

j=1 αjf D,Q(αj).

We have completed this proof.

Fang, Li, Liu, Han, Lu

Lemma 4 Suppose that Assumption 1 holds. If there is a ﬁnite discrete domain DXY Ds XY such that infh H Rout D (h) > 0, then OOD detection is not learnable in Ds XY for H.

Proof [Proof of Lemma 4] Suppose that supp DXO = {xout 1 , ..., xout l }, then it is clear that DXY has OOD convex decomposition δxout 1 , ..., δxout l , where δx is the dirac measure whose support set is {x}.

Since H is the separate space for OOD (i.e., Assumption 1 holds), then j = 1, ..., l,

inf h H Rδxout j (h) = 0,

Rδxout j (h) = Z

X ℓ(h(x), K + 1)dδxout j (x).

This implies that: if Tl j=1 arg minh H Rδxout j (h) = , then for h Tl j=1 arg minh H Rδxout j (h),

h (xout i ) = K + 1, i = 1, ..., l.

Therefore, if Tl j=1 arg minh H Rδxout j (h) T arg minh H Rin D(h) = ,

then for any h Tl j=1 arg minh H Rδxout j (h) T arg minh H Rin D(h), we have that

h (xout i ) = K + 1, i = 1, ..., l.

Proof by Contradiction: assume OOD detection is learnable in Ds XY for H, then Lemmas 1 and 3 imply that

j=1 arg min h H Rδxout j (h) \ arg min h H Rin D(h) = arg min h H RD(h) = .

Therefore, for any h arg minh H RD(h), we have that

h (xout i ) = K + 1, i = 1, ..., l,

which implies that for any h arg minh H RD(h), we have Rout D (h ) = 0, which implies that infh H Rout D (h) = 0. It is clear that infh H Rout D (h) = 0 is inconsistent with the condition infh H Rout D (h) > 0. Therefore, OOD detection is not learnable in Ds XY for H.

Lemma 5 If Assumption 1 holds, VCdim(φ H) = v < + and suph H |{x X : h(x) Y}| > m such that v < m, then OOD detection is not learnable in Ds XY for H, where φ maps ID s labels to 1 and maps OOD s labels to 2.

On the Learnability of Out-of-distribution Detection

Proof [Proof of Lemma 5] Due to suph H |{x X : h(x) Y}| > m, we can obtain a set

C = {x1, ..., xm, xm+1},

which satisﬁes that there exists h H such that h(xi) Y for any i = 1, ..., m, m + 1. Let Hφ C = {(φ h(x1), ..., φ h(xm), φ h(xm+1) : h H}. It is clear that

(1, 1, ..., 1) = (φ h(x1), ..., φ h(xm), φ h(xm+1)) Hφ C,

where (1, 1, ..., 1) means all elements are 1.

Let Hφ m+1 = {(φ h(x1), ..., φ h(xm), φ h(xm+1) : h is any hypothesis function from X to Yall}. Clearly, Hφ C Hφ m+1 and |Hφ m+1| = 2m+1. Sauer-Shelah-Perles Lemma (Lemma 6.10 in (Shalev-Shwartz and Ben-David, 2014)) implies that

Since Pv i=0 m+1 i < 2m+1 1 (because v < m), we obtain that |Hφ C| 2m+1 2. Therefore, Hφ C {(2, 2..., 2)} is a proper subset of Hφ m+1, where (2, 2, ..., 2) means that all elements are 2. Note that (1, 1..., 1) (all elements are 1) also belongs to Hφ C. Hence, Hφ C {(2, 2..., 2)} {(1, 1..., 1)} is a proper subset of Hφ m+1, which implies that we can obtain a hypothesis function h satisfying that:

1)(φ h (x1), ..., φ h (xm), φ h (xm+1)) / Hφ C;

2) There exist xj, xp C such that φ h (xj) = 2 and φ h (xp) = 1.

Let CI = C {x X : φ h (x) = 1} and CO = C {x X : φ h (x) = 2}.

Then, we construct a special domain DXY :

DXY = 0.5 DXI DYI|XI + 0.5 DXO DYO|XO, where

DXI = 1 |CI|

x CI δx and DYI|XI(y|x) = 1, if h(x) = y and x CI;

and DXO = 1 |CO|

x CO δx and DYO|XO(K + 1|x) = 1, if x CO.

Since DXY is a ﬁnite discrete distribution and (φ h (x1), ..., φ h (xm), φ h (xm+1)) / Hφ C, it is clear that arg minh H RD(h) = and infh H RD(h) > 0. Additionally, Rin D( h) = 0. Therefore, infh H Rin D(h) = 0.

Proof by Contradiction: suppose that OOD detection is learnable in Ds XY for H, then Lemma 1 implies that

inf h H RD(h) = 0.5 inf h H Rin D(h) + 0.5 inf h H Rout D (h).

Fang, Li, Liu, Han, Lu

Therefore, if OOD detection is learnable in Ds XY for H, then infh H Rout D (h) > 0. Until now, we have constructed a domain DXY (deﬁned over X Yall) with ﬁnite support and satisfying that infh H Rout D (h) > 0. Note that H is the separate space for OOD data (Assumption 1 holds). Using Lemma 4, we know that OOD detection is not learnable in Ds XY for H, which is inconsistent with our assumption that OOD detection is learnable in Ds XY for H. Therefore, OOD detection is not learnable in Ds XY for H. We have completed the proof.

Theorem 6 (Impossibility Theorem for Separate Space under Risk) If Assumption 1 holds, VCdim(φ H) < + and suph H |{x X : h(x) Y}| = + , OOD detection is not learnable under risk in the separate space Ds XY for H, where φ maps ID labels to 1 and maps OOD labels to 2.

Proof [Proof of Theorem 6] Let VCdim(φ H) = v. Since suph H |{x X : h(x) Y}| = + , it is clear that suph H |{x X : h(x) Y}| > v. Using Lemma 5, we complete this proof.

Appendix G. Proofs of Lemma 7 and Theorem 8

G.1 Proof of Lemma 7

Condition 8 Given a ranking function space R, the corresponding hypothesis space G consists of all gr satisfying that there exists a r R such that

gr(x, x ) = sign(r(x) r(x )).

Condition 9 (Equivalence) Given measures µ1, µ2, we say two ranking functions r R and r R are AUC equivalent over µ1, µ2, i.e., f f w.r.t. µ1, µ2, if and only if the corresponding hypothesis functions gf = gf a.e. µ1 µ2.

Lemma 6 Assume that DXI = R g XIdµ and DXO = R g XOdµ, then

sup r Rall AUC(f; DXI, DXO) = 1

2Ex µEx µ max{g XI(x)g XO(x ), g XI(x )g XO(x)}.

Proof Let D(X) = R gdµ, DXI = R g XIdµ and DXO = R g XOdµ satisfying that g XI + g XO = 2g. Let P = {x X : g(x) > 0}, PXI = {x X : g XI(x) > 0} and PXO = {x X : g XO(x) > 0}. To any two points x and x , we consider that

R(r, x, x ) = 1r(x)>r(x ) + 1

21r(x)=r(x ) g XI(x)g XO(x )

+ 1r(x )>r(x) + 1

21r(x)=r(x ) g XI(x )g XO(x).

We set r (x) = sigmoid g XI(x)/g XO(x) , if g XO(x) > 0; otherwise, r (x) = 1. It is clear that we have that

R(r , x, x ) = max r Rall R(r, x, x ) = max{g XI(x)g XO(x ), g XI(x )g XO(x)}.

On the Learnability of Out-of-distribution Detection

Due to 2AUC(r; DXI, DXO) = Z

P R(r, x, x )dµ(x)dµ(x ),

2AUC(r ; DXI, DXO) = Z

P max r Rall R(r, x, x )dµ(x)dµ(x )

P R(r, x, x )dµ(x)dµ(x ) = max r Rall 2AUC(r; DXI, DXO).

Therefore, r is the optimal solution, and

AUC(r ; DXI, DXO) = 1

2Ex µEx µ max{g XI(x)g XO(x ), g XI(x )g XO(x)}.

We have completed this proof.

Lemma 7 Given a ranking function space R Rall, DXI = R g XIdµ, DXO = R g XOdµ and D XO = R g XOdµ, if

sup r R AUC(r; DXI, DXO) = sup r Rall AUC(r; DXI, DXO),

sup r R AUC(r; DXI, D XO) = sup r Rall AUC(r; DXI, D XO),

and there exists α (0, 1) such that

α sup r R AUC(r; DXI, DXO) + (1 α) sup r R AUC(r; DXI, D XO) = sup r R AUC(r; DXI, Dα XO),

where Dα XO = αDXO + (1 α)D XO. Then

g XI g XI + g XO g XI g XI + g XO w.r.t. DXI|P XO PXO, DXI|PXO P XO,

where PXO = {x : g XO(x) > 0} and P XO = {x : g XO(x) > 0}

Proof Let ηXI = g XI g XI+g XO and η XI = g XI g XI+g XO . It is easy to check that the proof process of

Lemma 6 implies that to each i = 1, 2,

ηi XI arg max r Rall AUC(f; DXI, Di XO).

Additionally,

sup r R AUC(r; DXI, Dα XO) =α sup r R AUC(r; DXI, DXO) + (1 α) sup r R AUC(r; DXI, D XO)

=α sup r Rall AUC(r; DXI, DXO) + (1 α) sup r Rall AUC(r; DXI, D XO)

sup r Rall AUC(r; DXI, Dα XO) sup r R AUC(r; DXI, Dα XO).

Fang, Li, Liu, Han, Lu

Therefore, under the condition that

sup r R AUC(r; DXI, DXO) = sup r Rall AUC(r; DXI, DXO),

sup r R AUC(r; DXI, D XO) = sup r Rall AUC(r; DXI, D XO),

we obtain that

sup r Rall AUC(r; DXI, Dα XO) = α sup r Rall AUC(r; DXI, DXO) + (1 α) sup r Rall AUC(r; DXI, D XO).

Because supr Rall AUC(r; DXI, DXO) and supr Rall AUC(r; DXI, DXO) are attainable (the proof process of Lemma 6 implies this), it is easy to check (similar the proof of Lemma 3) that arg max r Rall AUC(r; DXI, DXO) arg max r Rall AUC(r; DXI, D XO) = .

Combining with Lemma 6, above equality implies there exists r such that

r ηXI, w.r.t. DXI, DXO, r η XI, w.r.t. DXI, D XO

Therefore, ηXI η XI w.r.t. DXI|P XO PXO, DXI|PXO P XO We have completed the proof.

Lemma 23 Given a ranking function space R, a domain space DXY and DXY = βDXIYI + (1 β)DXOYO, D XY = β DXIYI + (1 β )D XOYO DXY , let P be the overlap set between DXI and DXO and P be the overlap set between DXI and D XO based on the Deﬁnition 5. If

sup r R AUC(r; DXI, DXO) = sup r Rall AUC(r; DXI, DXO)

sup r R AUC(r; DXI, D XO) = sup r Rall AUC(r; DXI, D XO),

and DXI(P P ) < min{DXI(P), DXI(P )}, then Condition 2 does not hold, where Rall is a ranking function space consisting of all ranking functions from X to R. Therefore, OOD detection is not learnable under AUC in DXY for R.

Proof [Proof of Lemma 7] By the condition that DXI(P1 P2) < min{DXI(P1), DXI(P2)}, we can ensure that DXI(P1 P2) > 0, DXI(P2 P1) > 0.

By this, one can easily check that

g XI g XI + g XO g XI g XI + g XO w.r.t. DXI|P2 P1, DXI|P1 P2 does not hold.

Therefore, Lemma 7 implies that Condition 2 does not hold.

On the Learnability of Out-of-distribution Detection

G.2 Proof of Theorem 8

Theorem 8 (Impossibility Theorem for Total Space under AUC) Given ranking function space R, if there exist x, x X and r, r R such that

r(x) > r(x ) and r (x ) > r (x),

then the learnability of OOD detection under AUC is not distribution-free for R.

Proof [Proof of Theorem 8] Let DXI = 1

2δx , DXO = δx and DX O = δx , where δ is the Dirac measure. Then the condition in Lemma 7 holds, which implies the result of Theorem 8.

Appendix H. Proof of Theorem 9

Lemma 8 Given a separate ranking function space R, if there exists ﬁnite discrete DXY Ds XY such that sup r R AUC(r; DXY ) < 1,

then OOD detection is not learnable under AUC in Ds XY for R.

Proof [Proof of Lemma 8] Assume that if OOD detection is learnable under AUC in DXY for R, then Condition 2 implies that: let DXO = Pm i=1 λiδxi (Pm i=1 = 1),

sup r R AUC(r; DXY ) =

i=1 λi sup r R AUC(r; DXI, δxi).

Because R is the separate ranking function space, we know that

sup r R AUC(r; DXY ) =

i=1 λi sup r R AUC(r; DXI, δxi) = 1,

which is conﬂict with the condition that supr R AUC(r; DXY ) < 1. We have completed this proof.

Theorem 9 (Impossibility Theorem for Separate Space under AUC) Given a separate ranking function space R, if VC[φ R] = d < + and |X| (28d+14) log(14d+7), then OOD detection is not learnable under AUC in Ds XY for R, where φ R = {1r(x)>r(x ) : r R}.

Proof [Proof of Theorem 9] Given disjoint samples X = {x1, ..., xm}. Consider the following matrix and set Br[X] = 1r(xi)>r(xj) , BR[X] = {Br[X] : r R}.

Let DXI,XO[X] be the set consisting of all (DXI, DXO) which satisﬁes the following conditions:

DXI, DXO are from the separate space;

Fang, Li, Liu, Han, Lu

DXI, DXO are uniform distributions;

supp DXI supp DXO = X.

Additionally, let

DR[X] = {(DXI, DXO) DXI,XO[X] : r R, AUC(r; DXI, DXO) = 1}.

It is clear that

dd m2d; |DR[X]| (m 1)|BR[X]| ed

dd m2d+1; |DXI,XO[X]| = 2m 2.

When m is large enough (m (28d + 14) log(14d + 7)), we have that

|DR[X]| < |DXI,XO[X]|.

Therefore, we can ﬁnd

(DXI, DXO) DXI,XO[X] such that sup r R AUC(r; DXI, DXO) < 1.

By Lemma 8, we have completed this proof.

Appendix I. Proofs of Theorem 10 and Theorem 11

I.1 Proof of Theorem 10

Firstly, we need two lemmas, which are motivated by Lemma 19.2 and Lemma 19.3 in (Shalev-Shwartz and Ben-David, 2014).

Lemma 9 Let C1,...,Cr be a cover of space X, i.e., Pr i=1 Ci = X. Let SX = {x1, ..., xn} be a sequence of n data drawn from DXI, i.i.d. Then

i:Ci SX= DXI(Ci) r

Proof [Proof of Lemma 9]

i:Ci SX= DXI(Ci) =

DXI(Ci) ESX Dn XI 1Ci SX= ,

where 1 is the characteristic function. For each i,

ESX Dn XI 1Ci SX= = Z

X n 1Ci SX= d Dn XI(SX)

X 1Ci {x}= d DXI(x) n

= 1 DXI(Ci) n e n DXI(Ci).

On the Learnability of Out-of-distribution Detection

i:Ci S= DXI(Ci)

i=1 DXI(Ci)e n DXI(Ci)

r max i {1,...,r} DXI(Ci)e n DXI(Ci) r

here we have used inequality: maxi {1,...,r} aie nai 1/(ne). The proof has been completed.

Lemma 10 Let K = 1. When X Rd is a bounded set, there exists a monotonically decreasing sequence ϵcons(m) satisfying that ϵcons(m) 0, as m 0, such that

Ex DXI,S Dn XIYIdist(x, π1(x, S)) < ϵcons(n),

where dist is the Euclidean distance, π1(x, S) = arg min x SX dist(x, x), here SX is the feature part of S, i.e., SX = {x1, ..., xn}, if S = {(x1, y1), ..., (xn, yn)}.

Proof [Proof of Lemma 10] Since X is bounded, without loss of generality, we set X [0, 1)d. Fix ϵ = 1/T, for some integer T. Let r = T d and C1, C2, ..., Cr be a cover of X: for every (a1, ..., a T ) [T]d := [1, ..., T]d, there exists a Ci = {x = (x1, ..., xd) : j {1, ..., d}, xj [(aj 1)/T, aj/T)}.

If x, x belong to some Ci, then dist(x, x )

dϵ; otherwise, dist(x, x )

d. Therefore,

Ex DXI,S Dn XIYIdist(x, π1(x, S))

i:Ci SX = DXI(Ci) +

i:Ci SX= DXI(Ci)

i:Ci SX = DXI(Ci) +

i:Ci SX= DXI(Ci) .

Note that C1, ..., Cr are disjoint. Therefore, P i:Ci SX = DXI(Ci) DXI(P i:Ci SX = Ci) 1. Using Lemma 9, we obtain

Ex DXI,S Dn XIYIdist(x, π1(x, S))

If we set ϵ = 2n 1/(d+1), then

Ex DXI,S Dn XIYIdist(x, π1(x, S)) 2

d n1/(d+1) +

d 2den1/(d+1) .

If we set ϵcons(n) = 2

d n1/(d+1) +

d 2den1/(d+1) , we complete this proof.

Fang, Li, Liu, Han, Lu

Theorem 10 Let K = 1 and |X| < + . Suppose that Assumption 1 holds and the constant function hin := 1 H. Then OOD detection is learnable under risk in Ds XY for H if and only if Hall {hout} H, where Hall is the hypothesis space consisting of all hypothesis functions, and hout is a constant function that hout := 2, here 1 represents ID data and 2 represents OOD data.

Proof [Proof of Theorem 10] First, we prove that if the hypothesis space H is a separate space for OOD (i.e., Assumption 1 holds), the constant function hin := 1 H, then that OOD detection is learnable in Ds XY for H implies Hall {hout} H. Proof by Contradiction: suppose that there exists h Hall such that h = hout and h / H. Let X = {x1, ..., xm}, CI = {x X : h (x) Y} and CO = {x X : h (x) = K + 1}. Because h = hout, we know that CI = . We construct a special domain DXY Ds XY : if CO = , then DXY = DXI DYI|XI; otherwise,

DXY = 0.5 DXI DYI|XI + 0.5 DXO DYO|XO, where

DXI = 1 |CI|

x CI δx and DYI|XI(y|x) = 1, if h (x) = y and x CI,

and DXO = 1 |CO|

x CO δx and DYO|XO(K + 1|x) = 1, if x CO.

Since h / H and |X| < + , then arg minh H RD(h) = , and infh H RD(h) > 0. Additionally, Rin D(hin) = 0 (here hin = 1), hence, infh H Rin D(h) = 0. Since OOD detection is learnable in Ds XY for H, Lemma 1 implies that

inf h H RD(h) = (1 πout) inf h H Rin D(h) + πout inf h H Rout D (h),

where πout = DY (Y = K + 1) = 1 or 0.5. Since infh H Rin D(h) = 0 and infh H RD(h) > 0, we obtain that infh H Rout D (h) > 0. Until now, we have constructed a special domain DXY Ds XY satisfying that infh H Rout D (h) > 0. Using Lemma 4, we know that OOD detection in Ds XY is not learnable for H, which is inconsistent with the condition that OOD detection is learnable in Ds XY for H. Therefore, the assumption (there exists h Hall such that h = hout and h / H) doesn t hold, which implies that Hall {hout} H.

Second, we prove that if Hall {hout} H, then OOD detection is learnable in Ds XY for H.

To prove this result, we need to design a special algorithm. Let d0 = minx,x X and x =x dist(x, x ), where dist is the Euclidean distance. It is clear that d0 > 0. Let

( 1, if dist(x, π1(x, S)) < 0.5 d0;

2, if dist(x, π1(x, S)) 0.5 d0, where π1(x, S) = arg min x SX dist(x, x), here SX is the feature part of S, i.e., SX = {x1, ..., xn}, if S = {(x1, y1), ..., (xn, yn)}.

On the Learnability of Out-of-distribution Detection

For any x supp DXI, it is easy to check that for almost all S Dn XIYI,

dist(x, π1(x, S)) > 0.5 d0,

which implies that A(S)(x) = 2,

hence, ES Dn XIYIRout D (A(S)) = 0. (35)

Using Lemma 10, for any x supp DXI, we have

Ex DXI,S Dn XIYIdist(x, π1(x, S)) < ϵcons(n),

where ϵcons(n) 0, as n 0 and ϵcons(n) is a monotonically decreasing sequence.

Hence, we have that

DXI Dn XIYI({(x, S) : dist(x, π1(x, S)) 0.5 d0}) 2ϵcons(n)/d0,

where DXI Dn XIYI is the product measure of DXI and Dn XIYI (Cohn, 2013). Therefore,

DXI Dn XIYI({(x, S) : A(S)(x) = 1}) > 1 2ϵcons(n)/d0,

which implies that ES Dn XIYIRin D(A(S)) 2Bϵcons(n)/d0, (36)

where B = max{ℓ(1, 2), ℓ(2, 1)}. Using Eq. (35) and Eq. (36), we have proved that

ES Dn XIYIRD(A(S)) 0 + 2Bϵcons(m)/d0 inf h H RD(h) + 2Bϵcons(m)/d0. (37)

It is easy to check that A(S) Hall {hout}. Therefore, we have constructed a consistent algorithm A for H. We have completed this proof.

I.2 Proof of Theorem 11

Theorem 11 Let |X| < + and H = Hin Hb. If Hall {hout} Hb and Condition 3 holds, then OOD detection is learnable under risk in Ds XY for H, where Hall and hout are deﬁned in Theorem 10.

Proof [Proof of Theorem 11] Since |X| < + , we know that |H| < + , which implies that Hin is agnostic PAC learnable for supervised learning in classiﬁcation. Therefore, there exist an algorithm Ain : + n=1(X Y)n Hin and a monotonically decreasing sequence ϵ(n), such that ϵ(n) 0, as n + , and for any DXY Ds XY ,

ES Dn XIYIRin D(Ain(S)) inf h Hin Rin D(h) + ϵ(n).

Since |X| < + and Hb almost contains all binary classiﬁers, then using Theorem 10 and Theorem 1, we obtain that there exist an algorithm Ab : + n=1(X {1, 2})n Hb and a

Fang, Li, Liu, Han, Lu

monotonically decreasing sequence ϵ (n), such that ϵ (n) 0, as n + , and for any DXY Ds XY ,

ES Dn XIYIRin φ(D)(Ab(φ(S))) inf h Hb Rin φ(D)(h) + ϵ (n),

ES Dn XIYIRout φ(D)(Ab(φ(S))) inf h Hb Rout φ(D)(h) + ϵ (n),

where φ maps ID s labels to 1 and OOD s label to 2,

Rin φ(D)(Ab(φ(S))) = Z

X Y ℓ(Ab(φ(S))(x), φ(y))d DXIYI(x, y), (38)

Rin φ(D)(h) = Z

X Y ℓ(h(x), φ(y))d DXIYI(x, y), (39)

Rout φ(D)(Ab(φ(S))) = Z

X {K+1} ℓ(Ab(φ(S))(x), φ(y))d DXOYO(x, y), (40)

Rout φ(D)(h) = Z

X {K+1} ℓ(h(x), φ(y))d DXOYO(x, y), (41)

here φ(S) = {(x1, φ(y1)), ..., (xn, φ(yn))}, if S = {(x1, y1), ..., (xn, yn)}.

Note that Hb almost contains all classiﬁers, and Ds XY is the separate space. Hence,

ES Dn XIYIRin φ(D)(Ab(φ(S))) ϵ (n), ES Dn XIYIRout φ(D)(Ab(φ(S))) ϵ (n).

Next, we construct an algorithm A using Ain and Aout.

( K + 1, if Ab(φ(S))(x) = 2;

Ain(S)(x), if Ab(φ(S))(x) = 1.

Since infh H Rin φ(D)(φ h) = 0, infh H Rout D (h) = 0, then by Condition 3, it is easy to check that inf h Hin Rin D(h) = inf h H Rin D(h).

Additionally, the risk Rin D(A(S)) is from two parts: 1) ID data are detected as OOD data; 2) ID data are detected as ID data, but are classiﬁed as incorrect ID classes. Therefore, we have the inequality:

ES Dn XIYIRin D(A(S)) ES Dn XIYIRin D(Ain(S)) + c ES Dn XIYIRin φ(D)(Ab(φ(S)))

inf h Hin Rin D(h) + ϵ(n) + cϵ (n) = inf h H Rin D(h) + ϵ(n) + cϵ (n), (42)

where c = maxy1,y2 Y ℓ(y1, y2)/ min{ℓ(1, 2), ℓ(2, 1)}.

On the Learnability of Out-of-distribution Detection

Note that the risk Rout D (A(S)) is from the case that OOD data are detected as ID data. Therefore,

ES Dn XIYIRout D (A(S)) c ES Dn X|rm I YIRout φ(D)(Ab(φ(S)))

cϵ (n) inf h H Rout D (h) + cϵ (n). (43)

Note that (1 α) infh H Rin D(h) + α infh H Rout D (h) infh H Rα D(h). Then, using Eq. (42) and Eq. (43), we obtain that for any α [0, 1],

ES Dn XIYIRα D(A(S)) inf h H Rα D(h) + ϵ(n) + cϵ (n).

According to Theorem 1 (the second result), we complete the proof.

Appendix J. Proof of Theorem 12

Lemma 11 Suppose that |X| < + . If AUC-based Realizability Assumption holds for AUC metric, then OOD detection is learnable under AUC in separate space for R.

Proof [Proof of Lemma 11] Without loss of generality, we assume that K = 1, and any r R satisﬁes that 0 < r < 1 (one can achieve this by using sigmoid function). Given m data points Sm = {x 1, ..., x m} X m. We consider the following learning rule

max r R,τ (0,1)

i=1 1r(x i) τ1x i / S, subject to 1

j=1 1r(xj) τ = 0.

We denote the algorithm, which solves the above rule, as ASm (ASm outputs r and a corresponding τ). For diﬀerent data points Sm, we have diﬀerent algorithm ASm. Let S be the set that consists of all data points, i.e.,

S := {Sm : Sm are any m data points, m = 1, ..., + }. (44)

Using S, we construct an algorithm space as follows:

A := {AS : S S}.

We will ﬁnd an algorithm A from A , which is learnable. Let S = X. We will show that AX can guarantee the learnability. Suppose that r S and τS is the output of AX (S), then the realizability assumption implies that there exists learning rate ϵ(n) such that

ES Dn XIEx DXI1r S(x) τS ϵ(n). (45)

Additionally, due to supp(DXO) X S, the AUC-based Realizability Assumption implies that ES Dn XIEx DXO1r S(x) τS = 1. (46)

Fang, Li, Liu, Han, Lu

ES Dn XIAUC(f S; DXI, DXO)

ES Dn XIEx DXOEx DXI1r S(x)<r S(x )

ES Dn XIEx DXOEx DXI1r S(x) τS1r S(x )>τS

Then the ranking function part r S of AX A is the universally consistent algorithm, i.e.,

ES Dn XIAUC(r S; DXI, DXO) 1 ϵ(n).

Theorem 12 Given a separate ranking function space R, if |X| < + , then OOD detection is learnable under AUC in the separate space Ds XY for R if and only if AUC-based Realizability Assumption holds.

Proof [Proof of Theorem 12] Lemmas 8 and 11 imply this result.

Appendix K. Proofs of Theorems 13 and 14

K.1 Proof of Theorem 13

Lemma 12 Given a prior-unknown space DXY and a hypothesis space H, if Condition 4 holds, then for any equivalence class [D XY ] with respect to DXY , OOD detection is learnable in the equivalence class [D XY ] for H. Furthermore, the learning rate can attain O(1/n).

Proof Let F be a set consisting of all inﬁnite sequences, whose coordinates are hypothesis functions, i.e., F = {h = (h1, ..., hn, ...) : hn H, n = 1, ...., + }.

For each h F, there is a corresponding algorithm Ah: Ah(S) = hn, if |S| = n. F generates an algorithm class A = {Ah : h F}. We select a consistent algorithm from the algorithm class A .

We construct a special inﬁnite sequence h = ( h1, ..., hn, ...) F. For each positive integer n, we select hn from \

DXY [D XY ] {h H : Rout D (h ) inf h H Rout D (h)+2/n} \ {h H : Rin D(h ) inf h H Rin D(h)+2/n}.

The existence of hn is based on Condition 4. It is easy to check that for any DXY [D XY ],

ES Dn XIYIRin D(A h(S)) inf h H Rin D(h) + 2/n.

ES Dn XIYIRout D (A h(S)) inf h H Rout D (h) + 2/n.

On the Learnability of Out-of-distribution Detection

Since (1 α) infh H Rin D(h) + α infh H Rout D (h) infh H Rα D(h), we obtain that for any α [0, 1],

ES Dn XIYIRα D(A h(S)) inf h H Rα D(h) + 2/n.

Using Theorem 1 (the second result), we have completed this proof.

Theorem 13 Suppose that X is bounded. OOD detection is learnable under risk in DF XY for H if and only if the compatibility condition (i.e., Condition 4) holds. Furthermore, the learning rate ϵcons(n) can attain O(1/

n1 θ), for any θ (0, 1).

Proof [Proof of Theorem 13] First, we prove that if OOD detection is learnable in DF XY for H, then Condition 4 holds.

Since DF XY is the prior-unknown space, by Theorem 1, there exist an algorithm A : + n=1(X Y)n H and a monotonically decreasing sequence ϵcons(n), such that ϵcons(n) 0, as n + , and for any DXY DF XY ,

ES Dn XIYI Rin D(A(S)) inf h H Rin D(h) ϵcons(n),

ES Dn XIYI Rout D (A(S)) inf h H Rout D (h) ϵcons(n).

Then, for any ϵ > 0, we can ﬁnd nϵ such that ϵ ϵcons(nϵ), therefore, if n = nϵ, we have

ES Dnϵ XIYI Rin D(A(S)) inf h H Rin D(h) ϵ,

ES Dnϵ XIYI Rout D (A(S)) inf h H Rout D (h) ϵ,

which implies that there exists Sϵ Dnϵ XIYI such that

Rin D(A(Sϵ)) inf h H Rin D(h) ϵ,

Rout D (A(Sϵ)) inf h H Rout D (h) ϵ.

Therefore, for any equivalence class [D XY ] with respect to DF XY and any ϵ > 0, there exists a hypothesis function A(Sϵ) H such that for any domain DXY [D XY ],

A(Sϵ) {h H : Rout D (h ) inf h H Rout D (h) + ϵ} {h H : Rin D(h ) inf h H Rin D(h) + ϵ},

which implies that Condition 4 holds.

Second, we prove Condition 4 implies the learnability of OOD detection in DF XY for H. For convenience, we assume that all equivalence classes are [D1 XY ], ..., [Dm XY ]. By Lemma 12, for every equivalence class [Di XY ], we can ﬁnd a corresponding algorithm ADi such that OOD detection is learnable in [Di XY ] for H. Additionally, we also set the learning rate for ADi is ϵi(n). By Lemma 12, we know that ϵi(n) can attain O(1/n).

Fang, Li, Liu, Han, Lu

Let Z be X Y. Then, we consider a bounded universal kernel K( , ) deﬁned over Z Z. Consider the maximum mean discrepancy (MMD) (Gretton et al., 2012), which is a metric between distributions: for any distributions P and Q deﬁned over Z, we use MMDK(Q, P) to represent the distance. Let F be a set consisting of all ﬁnite sequences, whose coordinates are labeled data, i.e.,

F = {S = (S1, ..., Si, ..., Sm) : i = 1, ..., m and labeled data Si}.

Then, we deﬁne an algorithm space as follows:

A = {AS7 : S F},

where AS(S) = ADi(S), if i = arg min i {1,...m} MMDK(PSi, PS),

here PS = 1

(x,y) S δ(x,y), PSi = 1

(x,y) Si , δ(x,y)

and δ(x,y) is the Dirac measure. Next, we prove that we can ﬁnd an algorithm A from the algorithm space A such that A is the consistent algorithm. Since the number of diﬀerent equivalence classes is ﬁnite, we know that there exists a constant c > 0 such that for any diﬀerent equivalence classes [Di XY ] and [Dj XY ] (i = j),

MMDK(Di XIYI, Dj XIYI) > c.

Additionally, according to (Gretton et al., 2012) and the property of DF XY (the number of diﬀerent equivalence classes is ﬁnite), there exists a monotonically decreasing ϵ(n) 0, as n + such that for any DXY D,

ES Dn XIYIMMDK(DXIYI, PS) ϵ(n), where ϵ(n) = O( 1

n1 θ ). (47)

Therefore, for every equivalence class [Di XY ], we can ﬁnd data points SDi such that

MMDK(Di XIYI, PSDi) < c 100.

Let S = {SD1, ..., SDi, ..., SDm}. Then, we prove that AS is a consistent algorithm. By Eq. (47), it is easy to check that for any i {1, ..., m} and any 0 < δ < 1,

PS Di,n XIYI

MMDK(Di XIYI, PS) ϵ(n)

which implies that

PS Di,n XIYI

MMDK(PSDi, PS) ϵ(n)

δ + c 100 > 1 δ.

7. In this paper, we regard an algorithm as a mapping from + n=1(X Y)n to H or R. So we can design an algorithm like this.

On the Learnability of Out-of-distribution Detection

Therefore, (here we set δ = 200ϵ(n)/c)

PS Di,n XIYI

AS (S) = ADi(S) 200ϵ(n)

Because ADi is a consistent algorithm for [Di XY ], we conclude that for all α [0, 1],

ES Di,n XIYI

Rα D(AS (S)) inf h H Rα D(h) ϵi(n) + 200Bϵ(n)

where ϵi(n) = O(1/n) is the learning rate of ADi and B is the upper bound of the loss ℓ. Let ϵmax(n) = max{ϵ1(n), ..., ϵm(n)} + 200Bϵ(n)

c . Then, we obtain that for any DXY DF XY and all α [0, 1],

ES Dn XIYI Rα D(AS (S)) inf h H Rα D(h) ϵmax(n) = O( 1

According to Theorem 1 (the second result), AS is the consistent algorithm. This proof is completed.

K.2 Proof of Theorem 14

Theorem 14 Given a density-based space Dµ,b XY , if µ(X) < + , the Risk-based Realizability Assumption holds, then when H has ﬁnite Natarajan dimension (Shalev-Shwartz and Ben David, 2014), OOD detection is learnable in Dµ,b XY for H. Furthermore, the learning rate ϵcons(n) can attain O(1/

n1 θ), for any θ (0, 1).

Proof [Proof of Theorem 14] First, we consider the case that the loss ℓis the zero-one loss. Since µ(X) < + , without loss of generality, we assume that µ(X) = 1. We also assume that f I is DXI s density function and f O is DXO s density function. Let f be the density function for 0.5 DXI + 0.5 DXO. It is easy to check that f = 0.5 f I + 0.5 f O. Additionally, due to Risk-based Realizability Assumption, it is obvious that for any samples S = {(x1, y1), ..., (xn, yn)} Dn XIYI, i.i.d., we have that there exists h H such that

i=1 ℓ(h (xi), yi) = 0.

Given m data points Sm = {x 1, ..., x m} X m. We consider the following learning rule:

min h H 1 m

j=1 ℓ(h(x j), K + 1), subject to 1

i=1 ℓ(h(xi), yi) = 0.

We denote the algorithm, which solves the above rule, as ASm 8. For diﬀerent data points Sm, we have diﬀerent algorithm ASm. Let S be the inﬁnite sequence set that consists of all inﬁnite sequences, whose coordinates are data points, i.e.,

S := {S := (S1, S2, ..., Sm, ...) : Sm are any m data points, m = 1, ..., + }. (48)

8. In this paper, we regard an algorithm as a mapping from + n=1(X Y)n to H or R. So we can design an algorithm like this.

Fang, Li, Liu, Han, Lu

Using S, we construct an algorithm space as follows:

A := {AS : S S}, where AS(S) = ASn(S), if |S| = n.

Next, we prove that there exists an algorithm AS A , which is a consistent algorithm. Given data points Sn µn, i.i.d., using the Natarajan dimension theory and Empirical risk minimization principle (Shalev-Shwartz and Ben-David, 2014), it is easy to obtain that there exists a uniform constant Cθ such that (we mainly use the uniform bounds to obtain the following bounds)

ES Dn XIYI sup h HS Rin D(h) inf h H Rin D(h) + Cθ

and because of HS H,

ESn µn sup S (X Y)n[Rµ(ASn(S), K + 1) inf h HS Rµ(h, K + 1)] Cθ

n1 θ , (49)

HS = {h H :

i=1 ℓ(h(xi), yi) = 0}, here S = {(x1, y1), ..., (xn, yn)},

Rµ(h, K + 1) = Ex µℓ(h(x), K + 1) = Z

X ℓ(h(x), K + 1)dµ(x).

We set DI = {DXIYI : there exists DXOYOsuch that (1 α)DXIYI + αDXOYO Dµ,b XY }. Then by Eq. (49), we have

ESn µn sup DXIYI DI ES Dn XIYI[Rµ(ASn(S), K + 1) inf h HS Rµ(h, K + 1)] Cθ

n1 θ . (50)

Due to Risk-based Realizability Assumption, we obtain that infh H Rin D(h) = 0. Therefore,

ES Dn XIYI sup h HS Rin D(h) Cθ

n1 θ , (51)

which implies that (in following inequalities, g is the groundtruth labeling function, i.e., RD(g) = 0)

Cθ n ES Dn XIYI sup h HS Rin D(h) =ES Dn XIYI sup h HS

g<K+1 ℓ(h(x), g(x))f I(x)dµ(x)

b ES Dn XIYI sup h HS

g<K+1 ℓ(h(x), g(x))dµ(x).

This implies that (here we have used the property of zero-one loss)

ES Dn XIYI inf h HS

g<K+1 ℓ(h(x), K + 1)dµ(x) µ(x X : g(x) < K + 1) Cθb

On the Learnability of Out-of-distribution Detection

ES Dn XIYI inf h HS Rµ(h, K + 1) µ(x X : g(x) < K + 1) Cθb

n1 θ . (52)

Additionally, Rµ(g, K + 1) = µ(x X : g(x) < K + 1) and g HS, which implies that

inf h HS Rµ(h, K + 1) µ(x X : g(x) < K + 1). (53)

Combining inequalities (52) and (53), we obtain that

ES Dn XIYI inf h HS Rµ(h, K + 1) µ(x X : g(x) < K + 1) Cθb

n1 θ . (54)

Using inequalities (50) and (54), we obtain that

ESn µn sup DXIYI DI

ES Dn XIYIRµ(ASn(S), K + 1) µ(x X : g(x) < K + 1) Cθ(b + 1)

(55) Using inequality (51), we have

ESn µn sup DXIYI DI ES Dn XIYIRin D(ASn(S)) Cθ

n1 θ , (56)

which implies that (here we use the property of zero-one loss)

ESn µn sup DXIYI DI ES Dn XIYI Z

g<K+1 ℓ(ASn(S)(x), K + 1)dµ(x)

+ µ(x X : g(x) < K + 1) 2b Cθ

n1 θ . (57)

Combining inequalities (55) and (57), we have

ESn µn sup DXIYI DI ES Dn XIYI

g=K+1 ℓ(ASn(S)(x), K + 1)dµ(x) 2b Cθ

n1 θ + Cθ(b + 1)

Therefore, there exist data points S n such that

sup DXIYI DI ES Dn XIYIRout D (AS n)

= sup DXIYI DI ES Dn XIYI

g=K+1 ℓ(AS n(S)(x), K + 1)f O(x)dµ(x)

2b sup DXIYI DI ES Dn XIYI

g=K+1 ℓ(AS n(S)(x), K + 1)dµ(x) 4b2Cθ

n1 θ + 2Cθ(b2 + b)

Combining inequalities (51) and (58), we obtain that for any n, there exists data points S n such that

ES Dn XIYIRα D(AS n) max 4b2Cθ

n1 θ + 2Cθ(b2 + b)

Fang, Li, Liu, Han, Lu

We set data point sequences S = (S 1, S 2, ..., S n, ...). Then, AS A is the universally consistent algorithm, i.e., for any α [0, 1]

ES Dn XIYIRα D(AS ) max 4b2Cθ

n1 θ + 2Cθ(b2 + b)

We have completed this proof when ℓis the zero-one loss.

Second, we prove the case that ℓis not the zero-one loss. We use the notation ℓ0 1 as the zero-one loss. According the deﬁnition of loss introduced in Section 2, we know that there exists a constant M > 0 such that for any y1, y2 Yall,

1 M ℓ0 1(y1, y2) ℓ(y1, y2) Mℓ0 1(y1, y2).

Hence, 1 M Rα,ℓ0 1 D (h) Rα,ℓ D (h) MRα,ℓ0 1 D (h),

where Rα,ℓ0 1 D is the α-risk with zero-one loss, and Rα,ℓ D is the α-risk for loss ℓ. Above inequality tells us that Risk-based Realizability Assumption holds with zero-one loss if and only if Risk-based Realizability Assumption holds with the loss ℓ. Therefore, we use the result proven in ﬁrst step. We can ﬁnd a consistent algorithm A such that for any α [0, 1],

ES Dn XIYIRα,ℓ0 1 D (A) O( 1

which implies that for any α [0, 1],

1 M ES Dn XIYIRα,ℓ D (A) O( 1

We have completed this proof.

Appendix L. Proof of Theorem 15

Theorem 15 Suppose that R is constant closure, separate, and µ(X) < + . Given a density-based space Dµ,b XY , if the AUC-based Realizability Assumption holds, then when VC[φ R] < + , OOD detection is learnable under AUC in Dµ,b XY for R, where φ R = {1r1(x)>r2(x ) : r1, r2 R}. Furthermore, the learning rate ϵcons(n) can attain O(1/

n1 θ), for any θ (0, 1).

Proof [Proof of Theorem 15] Without loss of generality, we assume that K = 1, and any r R satisﬁes that 0 < r < 1 (one can achieve this by using sigmoid function). Then it is clear that R(0,1) = {1r(x) τ : r R, τ (0, 1)} has ﬁnite VC-dimension by the condition constant closure and VC[φ F] < + . Given m data points Sm = {x 1, ..., x m} X m. We consider the following learning rule:

max r R,τ (0,1) 1 m

i=1 1r(x i) τ, subject to 1

j=1 1r(xj) τ = 0.

On the Learnability of Out-of-distribution Detection

We denote the algorithm, which solves the above rule, as ASm. For diﬀerent data points Sm, we have diﬀerent algorithm ASm. Let S be the inﬁnite sequence set that consists of all inﬁnite sequences, whose coordinates are data points, i.e.,

S := {S := (S1, S2, ..., Sm, ...) : Sm are any m data points, m = 1, ..., + }. (59)

Using S, we construct an algorithm space as follows:

A := {AS : S S}, where AS(S) = ASn(S), if |S| = n.

Then we can check that

ES Dn XI sup 1r τ GS Ex DXI1r(x) τ inf 1r τ R(0,1) Ex DXI1r(x) τ + Cθ

and because of GS R(0,1),

ESn µn sup S X n[ sup 1r τ GS Rµ(1r τ) Rµ(ASn(S))] Cθ

n1 θ , (60)

GS = {1r(x) τ R(0,1) :

i=1 1r(xj) τ = 0 < 0}, here S = {x1, ..., xn},

and Rµ(1r τ) = Ex µ1r(x) τ.

Let DI be the set consisting of all ID distribution in the density-based space. Then we have

ESn µn sup DXI DI ES Dn XI[ sup 1r τ GS Rµ(1r τ) Rµ(ASn(S))] Cθ

n1 θ , (61)

Due to AUC-based Realizability Assumption, we obtain that inf1r τ R(0,1) Ex DXI1r(x) τ = 0, therefore,

ES Dn XI sup 1r τ GS Ex DXI1r(x) τ Cθ

n1 θ . (62)

Let r R be the optimal ranking function satisfying that

AUC(r ; DXI, DXO) = 1,

which implies that there exists τ satisfying that for any ϵ > 0

DXI(x : r (x) < τ ϵ) = 0, DXI(x : r (x) < τ + ϵ) > 0.

Then we consider set ΩXI := {x X : r (x) > τ }, if DXI(x : r (x) = τ ) = 0; otherwise, ΩXI := {x X : r (x) τ }. ΩXO := X ΩXI. Then we have that

n1 θ ES Dn XI sup 1r τ GS Ex DXI1r(x) τ 2

b ES Dn XI sup 1r τ GS

ΩXI 1r(x) τdµ(x),

Fang, Li, Liu, Han, Lu

which implies that

ES Dn XI inf 1r τ GS

ΩXI 1r(x)>τdµ(x) + b Cθ 2

n1 θ µ(ΩXI).

ES Dn XI inf 1r τ GS

X 1 1r(x) τdµ(x) + b Cθ 2

n1 θ µ(ΩXI),

which implies that

µ(ΩXO) + b Cθ 2

n1 θ ES Dn XI sup 1r τ GS Rµ(1r(x) τ). (63)

Additionally, due to 1r (x) τ ϵ GS, it is clear that

sup 1r τ GS Rµ(1r τ) µ(ΩXO). (64)

Combining Eq. (63) with Eq. (64), we have

|ES Dn XI sup 1r τ GS Rµ(1r τ) µ(ΩXO)| b Cθ 2

By using Eq. (61) and Eq. (64), we have

ESn µn sup DXI DI ES Dn XI[µ(ΩXO) Rµ(ASn(S))] Cθ

n1 θ + b Cθ 2

n1 θ . (65)

Using inequality (62), we have

ESn µn sup DXI DI ES Dn XIEx DXIASn(S)(x) Cθ

n1 θ , (66)

which implies that

ESn µn sup DXI DI ES Dn XI

ΩXI ASn(S)(x)dµ(x) b Cθ 2

n1 θ . (67)

Then inequalities (65) and (67) imply that

ESn µn sup DXI DI ES Dn XI

ΩXO 1 ASn(S)(x)dµ(x) (1 + b)Cθ

ESn µn sup DXI DI ES Dn XI

X 1 ASn(S)(x)d DXO(x) 2b(1 + b)Cθ

We assume that ASn(S) = 1r Sn,S τSn,S. Then above inequality implies that

ESn µn inf DXI DI ES Dn XIEx DXO1r Sn,S(x) τSn,S 1 2b(1 + b)Cθ

On the Learnability of Out-of-distribution Detection

Inequality (66) implies that

ESn µn inf DXI DI ES Dn XIEx DXI1r Sn,S(x)>τSn,S 1 Cθ

which shows that there exists S n µn and C such that

inf DXI DI ES Dn XIAUC(f S n,S; DXI, DXO)

inf DXI DI ES Dn XIEx DXOEx DXI1r S n,S(x)<r S n,S(x )

inf DXI DI ES Dn XIEx DXOEx DXI1r S n,S(x) τS n,S1r S n,S(x )>τS n,S

1 max{Cθ, C }

n1 θ 2b(1 + b)Cθ

We set data point sequences S = (S 1, S 2, ..., S n, ...). Then, the function part r S n,S of AS A is the universally consistent algorithm, i.e.,

inf DXI DI ES Dn XIAUC(r S n,S; DXI, DXO) 1 max{Cθ, C }

n1 θ 2b(1 + b)Cθ

We have completed this proof.

Appendix M. Proofs of Proposition 1 and Proposition 2

Proposition 1 Let X be a bounded feature space. Given q = (l1, ..., lg 1, 1), then

if some s with 1 < s < g, d = l1 l2 ... ls, and ls 2d, Fσ q is the separate ranking function space;

Fσ q is constant closure;

{1f1(x)<f2(x ) : f1, f2 Fσ q} has ﬁnite VC dimension.

Proof [Proof of Proposition 1 ] Firstly, we prove that if some s with 1 < s < g, d = l1 l2 ... ls, and ls 2d, Fσ q is the separate ranking function space.

First, we show that if X R1, and q = (1, 2, 1), then Fσ q is the separate ranking space. For any x X,

fx(x ) = [1, 1] σ( 1 1

It is easy to check that for any x = x,

fx(x) < fx(x ).

Fang, Li, Liu, Han, Lu

Next, we prove that if X Rd, and q = (d, 2d, 1), then Fσ q is the separate ranking space. Let vi R2d 1 is a vector whose 2i-th and 2i + 1-th coordinates are 1; otherwise, other coordinates are 0. For any x = [x1, ..., xd] X,

fx(x ) = [1, 1..., 1]1 (2d) σ([v1, v2, ..., vd] x + Md+1) + 0,

where Md+1 R2d 1 is a vector whose 2i-th coordinate is xi and (2i + 1)-th coordinate is xi. It is easy to check that for any x = x,

fx(x) < fx(x ).

Thirdly, we prove that if d = l1 = l2 = ... = lr 1 and lr = 2d, then Fσ q is the separate ranking space. Due to X is bounded, we can ﬁnd b such that any x = [x1, ..., xd] X satisﬁes that xi + b 0. Then, we set w2, w3, ..., wr 1 are identity matrices, b2 is the matrix whose all coordinates are b, and b3, ..., br 1 are 0. Then the result in second step implies the result. Finally, using the result that q q Fσ q Fσ q (Lemma 14) implies the ﬁnal result.

Secondly, it is clear that Fσ q is constant closure. We omit the proof. Thirdly, by Theorems 5 and 8 in (Bartlett and Maass, 2003), we can obtain the third result.

Proposition 2 Given q = (l1, ..., lg 1, l) and q = (l1, ..., lg 1, 1), let R = E Fσ q, then

if Fσ q is a separate ranking function space, R is the separate ranking function space;

R is constant closure;

{1r1(x)<r2(x ), r1, r2 R} has ﬁnite VC dimension,

where E is Eq. (7), (8) or (9).

Proof [Proof of Proposition 2 ] Firstly, we prove that if Fσ q is a separate ranking function space, R is the separate ranking function space.

For the softmax-based function:

E(f) = max k {1,...,l} exp (fk) Pl c=1 exp (fc) .

Note that Fσ q is separate ranking space. Then for any x X, there exists fx Fσ q such that 0 = fx(x) < fx(x ), for any x X and x = x. Then fx = [fx, fx, .., fx] Fσ q can ensure that for any x X and x = x,

E(fx(x)) < E(fx(x )).

Using the same strategy, we can prove that the temperature-scaled function is the separate ranking space. For the energy-based function:

E(f) = T log

c=1 exp (fc/T).

On the Learnability of Out-of-distribution Detection

Note that Fσ q is separate ranking space. Then for any x X, there exists fx Fσ q such that 0 = fx(x) < fx(x ), for any x X and x = x. Then fx = [fx, fx, .., fx] Fσ q can ensure that for any x X and x = x,

E(fx(x)) < E(fx(x )).

Secondly, it is easy to show that R is constant closure. We omit it. Thirdly, by Theorems 5 and 8 in (Bartlett and Maass, 2003), we can obtain the third result.

Appendix N. Proofs of Proposition 3 and Proposition 4

To better understand the contents in Appendices N-Q, we introduce the important notations for FCNN-based hypothesis space and score-based hypothesis space detaily.

FCNN-based Hypothesis Space. Given a sequence q = (l1, l2, ..., lg), where li and g are positive integers and g > 2, we use g to represent the depth of neural network and use li to represent the width of the i-th layer. After the activation function σ is selected, we can obtain the architecture of FCNN according to the sequence q. Given any weights wi Rli li 1 and bias bi Rli 1, the output of the i-layer can be written as follows: for any x Rl1, fi(x) = σ(wifi 1(x) + bi), i = 2, ..., g 1,

where fi 1(x) is the i-th layer output and f1(x) = x. Then, the output of FCNN is fw,b(x) = wgfg 1(x) + bg, where w = {w2, ..., wg} and b = {b2, ..., bg}. An FCNN-based scoring function space is deﬁned as:

Fσ q := {fw,b : wi Rli li 1, bi Rli 1, i = 2, ..., g}.

Additionally, given two sequences q = (l1, ..., lg) and q = (l 1, ..., l g ), we use the notation q q to represent the following equations and inequalities:

g g , l1 = l 1, lg = l g ,

li l i, i = 1, ..., g 1,

lg 1 l i, i = g, ..., g 1.

Given a sequence q = (l1, ...lg) satisfying that l1 = d and lg = K + 1, the FCNN-based scoring function space Fσ q can induce an FCNN-based hypothesis space. Before deﬁning the FCNN-based hypothesis space, we deﬁne the induced hypothesis function. For any fw,b Fσ q, the induced hypothesis function is:

hw,b(x) := arg max k {1,...,K+1} fk w,b(x), x X,

where fk w,b(x) is the k-th coordinate of fw,b(x). Then, we deﬁne the FCNN-based hypothesis space as follows:

Hσ q := {hw,b : wi Rli li 1, bi Rli 1, i = 2, ..., g}.

Fang, Li, Liu, Han, Lu

Score-based Hypothesis Space. Many OOD algorithms detect OOD data using a scorebased strategy. That is, given a threshold λ, a scoring function space Fl {f : X Rl} and a scoring function E : Fl R, then x is regarded as ID, if E(f(x)) λ; otherwise, x is regarded as OOD.

Using E, λ and f Fσ q, we can generate a binary classiﬁer hλ f,E:

hλ f,E(x) :=

( 1, if E(f(x)) λ;

2, if E(f(x)) < λ,

where 1 represents ID data, and 2 represents OOD data. Hence, a binary classiﬁcation hypothesis space Hb, which consists of all hλ f,E, is generated. We deﬁne the score-based

hypothesis space Hσ,λ q,E := {hλ f,E : f Fσ q}. Next, we introduce two important propositions.

Proposition 3 Given a sequence q = (l1, ...lg) satisfying that l1 = d and lg = K + 1 (note that d is the dimension of input data and K + 1 is the dimension of output), then the constant functions h1, h2,...,h K+1 belong to Hσ q, where hi(x) = i, for any x X. Therefore, Assumption 1 holds for Hσ q.

Proof [Proof of Proposition 3] Note that the output of FCNN can be written as

fw,b(x) = wgfg 1(x) + bg,

where wg R(K+1) lg 1, bg R(K+1) 1 and fg 1(x) is the output of the lg 1-th layer. If we set wg = 0, and set bg = yi, where yi is the one-hot vector corresponding to label i. Then fw,b(x) = yi, for any x X. Therefore, hi(x) Hσ q, for any i = 1, ..., K, K +1.

Note that in some works (Safran and Shamir, 2017), bg is ﬁxed to 0. In fact, it is easy to check that when g > 2 and activation function σ is not a constant, Proposition 1 still holds, even if bg = 0.

Proposition 4 For any sequence q = (l1, ..., lg) satisfying that l1 = d and lg = l (note that d is the dimension of input data and l is the dimension of output), if {v Rl : E(v) λ} = and {v Rl : E(v) < λ} = , then the functions h1 and h2 belong to Hσ,λ q,E, where h1(x) = 1 and h2(x) = 2, for any x X, where 1 represents the ID labels, and 2 represents the OOD labels. Therefore, Assumption 1 holds.

Proof [Proof of Proposition 4] Since {v Rl : E(v) λ} = and {v Rl : E(v) < λ} = , we can ﬁnd v1 {v Rl : E(v) λ} and v2 {v Rl : E(v) < λ}. For any fw,b Fσ q, we have

fw,b(x) = wgfg 1(x) + bg,

where wg Rl lg 1, bg Rl 1 and fg 1(x) is the output of the lg 1-th layer.

On the Learnability of Out-of-distribution Detection

If we set wg = 0l lg 1 and bg = v1, then fw,b(x) = v1 for any x X, where 0l lg 1 is l lg 1 zero matrix. Hence, h1 can be induced by fw,b. Therefore, h1 Hσ,λ q,E.

Similarly, if we set wg = 0l lg 1 and bg = v2, then fw,b(x) = v2 for any x X, where 0l lg 1 is l lg 1 zero matrix. Hence, h2 can be induced by fw,b. Therefore, h2 Hσ,λ q,E.

It is easy to check that when g > 2 and activation function σ is not a constant, Proposition 4 still holds, even if bg = 0.

Appendix O. Proof of Theorem 16

Before proving Theorem 16, we need several lemmas.

Lemma 13 Let σ be Re LU function: max{x, 0}. Given q = (l1, ..., lg) and q = (l 1, ..., l g) such that lg = l g and l1 = l 1, and li l i (i = 1, ..., g 1), then Fσ q Fσ q and Hσ q Hσ q .

Proof [Proof of Lemma 13] Given any weights wi Rli li 1 and bias bi Rli 1, the i-layer output of FCNN with architecture q can be written as

fi(x) = σ(wifi 1(x) + bi), x Rl1, i = 2, ..., g 1,

where fi 1(x) is the i-th layer output and f1(x) = x. Then, the output of last layer is

fw,b(x) = wgfg 1(x) + bg.

We will show that fw,b Fσ q . We construct fw ,b as follows: for every w i Rl i l i 1, if l i li > 0 and l i 1 li 1 > 0, we set

" wi 0li (l i 1 li 1) 0(l i li) l i 1 0(l i li) (l i 1 li 1)

, b i = bi 0(l i li) 1

where 0pq means the p q zero matrix. If l i li = 0 and l i 1 li 1 > 0, we set

w i = h wi 0li (l i 1 li 1) i , b i = bi.

If l i 1 li 1 = 0 and l i li > 0, we set

w i = wi 0(l i li) l i 1

, b i = bi 0(l i li) 1

If l i 1 li 1 = 0 and l i li = 0, we set

w i = wi, b i = bi.

It is easy to check that if l i li > 0

f i = fi 0(l i li) 1

Fang, Li, Liu, Han, Lu

If l i li = 0, f i = fi.

Since l g lg = 0, f g = fg, i.e., fw ,b = fw,b.

Therefore, fw,b Fσ q , which implies that Fσ q Fσ q . Therefore, Hσ q Hσ q .

Lemma 14 Let σ be the Re LU function: σ(x) = max{x, 0}. Then, q q implies that Fσ q Fσ q , Hσ q Hσ q , where q = (l1, ..., lg) and q = (l 1, ..., l g ).

Proof [Proof of Lemma 14] Given l = (l 1, ..., l g ) satisfying that g g , l i = li for i = 1, ..., g 1, l i = lg 1 for i = g, ..., g 1, and l g = lg, we ﬁrst prove that Fσ q Fσ q and Hσ q Hσ q .

Given any weights wi Rli li 1 and bias bi Rli 1, the i-th layer output of FCNN with architecture q can be written as

fi(x) = σ(wifi 1(x) + bi), x Rl1, i = 2, ..., g 1,

where fi 1(x) is the i-th layer output and f1(x) = x. Then, the output of the last layer is

fw,b(x) = wgfg 1(x) + bg.

We will show that fw,b Fσ q . We construct fw ,b as follows: if i = 2, ..., g 1, then w i = w and b i = bi; if i = g, ..., g 1, then w i = Ilg 1 lg 1 and b i = 0lg 1 1, where Ilg 1 lg 1 is the lg 1 lg 1 identity matrix, and 0lg 1 1 is the lg 1 1 zero matrix; and if i = g , then w g = wg, b g = bg. Then it is easy to check that the output of the i-th layer is f i = fg 1, i = g 1, g, ..., g 1.

Therefore, fw ,b = fw,b, which implies that Fσ q Fσ q . Hence, Hσ q Hσ q . When g = g , we use Lemma 13 (q and q satisfy the condition in Lemma 13), which implies that Fσ q Fσ q , Hσ q Hσ q . Therefore, Fσ q Fσ q , Hσ q Hσ q .

Lemma 15 (Pinkus, 1999) If the activation function σ is not a polynomial, then for any continuous function f deﬁned in Rd, and any compact set C Rd, there exists a fullyconnected neural network with architecture q (l1 = d, lg = 1) such that

inf fw,b Fσ q max x C |fw,b(x) f(x)| < ϵ.

Proof [Proof of Lemma 15] The proof of Lemma 15 can be found in Theorem 3.1 in (Pinkus, 1999).

On the Learnability of Out-of-distribution Detection

Lemma 16 If the activation function σ is the Re LU function, then for any continuous vector-valued function f C(Rd; Rl), and any compact set C Rd, there exists a fullyconnected neural network with architecture q (l1 = d, lg = l) such that

inf fw,b Fσ q max x C fw,b(x) f(x) 2 < ϵ,

where 2 is the ℓ2 norm. (Note that we can also prove the same result, if σ is not a polynomial.)

Proof [Proof of Lemma 16] Let f = [f1, ..., fl] , where fi is the i-th coordinate of f. Based on Lemma 15, we obtain l sequences q1, q2,...,ql such that

inf g1 Fσ q1 max x C |g1(x) f1(x)| < ϵ/

inf g2 Fσ q2 max x C |g2(x) f2(x)| < ϵ/

inf gl Fσ ql max x C |gl(x) fl(x)| < ϵ/

It is easy to ﬁnd a sequence q = (l1, ..., lg) (lg = 1) such that qi q, for all i = 1, ..., l. Using Lemma 14, we obtain that Fσ qi Fσ q. Therefore,

inf g Fσ q max x C |g(x) f1(x)| < ϵ/

inf g Fσ q max x C |g(x) f2(x)| < ϵ/

inf g Fσ q max x C |g(x) fl(x)| < ϵ/

Therefore, for each i, we can ﬁnd gwi,bi from Fσ q such that

max x C |gwi,bi(x) fi(x)| < ϵ/

where wi represents weights and bi represents bias. We construct a larger FCNN with q = (l 1, l 2, ..., l g) satisfying that l 1 = d, l i = l li, for i = 2, ..., g. We can regard this larger FCNN as a combinations of l FCNNs with architecture q, that is: there are m disjoint sub-FCNNs with architecture q in the larger FCNN with architecture q . For i-th sub-FCNN, we use weights wi and bias bi. For weights and bias which connect diﬀerent sub-FCNNs, we set these weights and bias to 0. Finally, we can obtain that gw,b = [gw1,b1, gw2,b2, ..., gwl,bl] Fσ q , which implies that

max x C gw,b(x) f(x) 2 < ϵ.

Fang, Li, Liu, Han, Lu

We have completed this proof.

Given a sequence q = (l1, ..., lg), we are interested in following function space Fσ q,M:

Fσ q,M := {M (σ f) : f Fσ q},

where means the composition of two functions, means the product of two matrices, and

M = 11 (lg 1) 0 01 (lg 1) 1

here 11 (lg 1) is the 1 (lg 1) matrix whose all elements are 1, and 01 (lg 1) is the 1 (lg 1) zero matrix. Using Fσ q,M, we can construct a binary classiﬁcation space Hσ q,M, which consists of all classiﬁers satisfying the following condition:

h(x) = arg min k={1,2} fk M(x),

where fk M(x) is the k-th coordinate of M (σ f).

Lemma 17 Suppose that σ is the Re LU function: max{x, 0}. Given a sequence q = (l1, ..., lg) satisfying that l1 = d and lg = K + 1, then the space Hσ q,M contains φ Hσ q, and Hσ q,M has ﬁnite VC dimension (Vapnik Chervonenkis dimension), where φ maps ID data to 1 and OOD data to 2. Furthermore, if given q = (l 1, ..., l g) satisfying that l g = K and l i = li, for i = 1, ..., g 1, then Hσ q Hσ q Hσ q,M.

Proof [Proof of Lemma 17] For any hw,b Hσ q, then there exists fw,b Fσ q such that hw,b is induced by fw,b. We can write fw,b as follows:

fw,b(x) = wgfg 1(x) + bg,

where wg R(K+1) lg 1, bg R(K+1) 1 and fg 1(x) is the output of the lg 1-th layer. Suppose that

v1 v2 ... v K v K+1

b1 b2 ... b K b K+1

where vi R1 lg 1 and bi R.

We set fw ,b (x) = w gfg 1(x) + b g,

v1 v2 ... v K

b1 b2 ... b K

On the Learnability of Out-of-distribution Detection

It is obvious that fw ,b Fσ q . Using fw ,b Fσ q , we construct a classiﬁer hw ,b Hσ q :

hw ,b = arg max k {1,...,K} fk w ,b ,

where fk w ,b is the k-th coordinate of fw ,b .

Additionally, we consider

fw,b,B = M σ(B fw,b) Fσ q,M,

B = I(lg 1) (lg 1) 1(lg 1) 1 01 (lg 1) 0

here I(lg 1) (lg 1) is the (lg 1) (lg 1) identity matrix, 01 (lg 1) is the 1 (lg 1) zero matrix, and 1(lg 1) 1 is the (lg 1) 1 matrix, whose all elements are 1. Then, we deﬁne that for any x X,

hw,b,B(x) := arg max k {1,2} fk w,b,B(x),

where fk w,b,B(x) is the k-th coordinate of fw,b,B(x). Furthermore, we can check that hw,b,B can be written as follows: for any x X,

hw,b,B(x) =

( 1, if f1 w,b,B(x) > 0;

2, if f1 w,b,B(x) 0.

It is easy to check that hw,b,B = φ hw,b,

where φ maps ID labels to 1 and OOD labels to 2.

Therefore, hw,b(x) = K + 1 if and only if hw,b,B = 2; and hw,b(x) = k (k = K + 1) if and only if hw,b,B = 1 and hw ,b (x) = k. This implies that Hσ q Hσ q Hσ q,M and φ Hσ q Hσ q,M. Let q be (l1, ..., lg, 2). Then Fσ q,M Fσ q. Hence, Hσ q,M Hσ q. According to the VC dimension theory (Bartlett et al., 2019) for feed-forward neural networks, Hσ q has ﬁnite VC dimension. Hence, Hσ q,M has ﬁnite VC-dimension. We have completed the proof.

Lemma 18 Let |X| < + and σ be the Re LU function: max{x, 0}. Given r hypothesis functions h1, h2, ..., hr {h : X {1, ..., l}}, then there exists a sequence q = (l1, ..., lg) with l1 = d and lg = l, such that h1, ..., hr Hσ q.

Proof [Proof of Lemma 18] For each hi (i = 1, ..., r), we introduce a corresponding fi (deﬁned over X) satisfying that for any x X, fi(x) = yk if and only if hi(x) = k, where yk Rl is

Fang, Li, Liu, Han, Lu

the one-hot vector corresponding to the label k. Clearly, fi is a continuous function in X, because X is a discrete set. Tietze Extension Theorem implies that fi can be extended to a continuous function in Rd.

Since X is a compact set, then Lemma 16 implies that there exist a sequence qi = (li 1, ..., li gi) (li 1 = d and li gi = l) and fw,b Fσ qi such that

max x X fw,b(x) fi(x) ℓ2 < 1 10 l,

where ℓ2 is the ℓ2 norm in Rl. Therefore, for any x X, it easy to check that

arg max k {1,...,l} fk w,b(x) = hi(x),

where fk w,b(x) is the k-th coordinate of fw,b(x). Therefore, hi(x) Hσ qi.

Let q be (l1, ..., lg) (l1 = d and lg = l) satisfying that qi q. Using Lemma 14, we obtain that Hσ qi Hσ q, for each i = 1, ..., r. Therefore, h1, ..., hr Hσ q.

Lemma 19 Let the activation function σ be the Re LU function. Suppose that |X| < + . If {v Rl : E(v) λ} and {v Rl : E(v) < λ} both contain nonempty open sets of Rl

(here, open set is a topological terminology). There exists a sequence q = (l1, ..., lg) (l1 = d and lg = l) such that Hσ,λ q,E consists of all binary classiﬁers.

Proof [Proof of Lemma 19] Since {v Rl : E(v) λ}, {v Rl : E(v) < λ} both contain nonempty open sets, we can ﬁnd v1 {v Rl : E(v) λ}, v2 {v Rl : E(v) < λ} and a constant r > 0 such that Br(v1) {v Rl : E(v) λ} and Br(v2) {v Rl : E(v) < λ}, where Br(v1) = {v : v v1 ℓ2 < r} and Br(v2) = {v : v v2 ℓ2 < r}, here ℓ2 is the ℓ2 norm.

For any binary classiﬁer h over X, we can induce a vector-valued function as follows: for any x X,

( v1, if h(x) = 1;

v2, if h(x) = 2.

Since X is a ﬁnite set, then Tietze Extension Theorem implies that f can be extended to a continuous function in Rd. Since X is a compact set, Lemma 16 implies that there exists a sequence qh = (lh 1, ..., lh gh) (lh 1 = d and lh gh = l) and fw,b Fσ qh such that

max x X fw,b(x) f(x) ℓ2 < r

where ℓ2 is the ℓ2 norm in Rl. Therefore, for any x X, it is easy to check that E(fw,b(x)) λ if and only if h(x) = 1, and E(fw,b(x)) < λ if and only if h(x) = 2. For each h, we have found a sequence qh such that h is induced by fw,b Fσ qh, E and λ. Since |X| < + , only ﬁnite binary classiﬁers are deﬁned over X. Using Lemma 18,

On the Learnability of Out-of-distribution Detection

we can ﬁnd a sequence q such that Hb all = Hσ,λ q,E, where Hb all consists of all binary classiﬁers.

Lemma 20 Suppose the hypothesis space is score-based. Let |X| < + . If {v Rl : E(v) λ} and {v Rl : E(v) < λ} both contain nonempty open sets, and Condition 3 holds, then there exists a sequence q = (l1, ..., lg) (l1 = d and lg = l) such that for any sequence q satisfying q q and any ID hypothesis space Hin, OOD detection is learnable in the separate space Ds XY for Hin Hb, where Hb = Hσ,λ q ,E and Hin Hb is deﬁned below Eq. (6).

Proof [Proof of Lemma 20] Note that we use the Re LU function as the activation function in this lemma. Using Lemma 14, Lemma 19 and Theorem 11, we can prove this result.

Theorem 16 Suppose that Condition 3 holds and the hypothesis space H is FCNN-based or score-based, i.e., H = Hσ q or H = Hin Hb, where Hin is an ID hypothesis space, Hb = Hσ,λ q,E and H = Hin Hb is introduced below Eq. (6), here E is Eq. (7), (8) or (9). Then

There is a sequence q = (l1, ..., lg) such that OOD detection is learnable under risk in the separate space Ds XY for H if and only if |X| < + .

Furthermore, if |X| < + , then there exists a sequence q = (l1, ..., lg) such that for any sequence q satisfying that q q , OOD detection is learnable under risk in Ds XY for H.

Proof [Proof of Theorem 16] Note that we use the Re LU function as the activation function in this theorem. The Case that H is FCNN-based. First, we prove that if |X| = + , then OOD detection is not learnable in Ds XY for Hσ q, for any sequence q = (l1, ..., lg) (l1 = d and lg = K + 1). By Lemma 17, Theorems 5 and 8 in (Bartlett and Maass, 2003), we know that VCdim(φ Hσ q) < + , where φ maps ID data to 1 and maps OOD data to 2. Additionally, Proposition 3 implies that Assumption 1 holds and suph Hσq |{x X : h(x) Y}| = + , when |X| = + . Therefore, Theorem 6 implies that OOD detection is not learnable in Ds XY for Hσ q, when |X| = + .

Second, we prove that if |X| < + , there exists a sequence q = (l1, ..., lg) (l1 = d and lg = K + 1) such that OOD detection is learnable in Ds XY for Hσ q. Since |X| < + , it is clear that |Hall| < + , where Hall consists of all hypothesis functions from X to Yall. According to Lemma 18, there exists a sequence q such that Hall Hσ q. Additionally, Lemma 17 implies that there exist Hin and Hb such that Hσ q Hin Hb. Since Hall consists all hypothesis space, Hall = Hσ q = Hin Hb. Therefore, Hb contains all binary classiﬁers from X to {1, 2}. Theorem 11 implies that OOD detection is learnable in Ds XY for Hσ q.

Third, we prove that if |X| < + , then there exists a sequence q = (l1, ..., lg) (l1 = d and lg = K + 1) such that for any sequence q = (l 1, ..., l g ) satisfying that q q , OOD detection is learnable in Ds XY for Hσ q .

Fang, Li, Liu, Han, Lu

We can use the sequence q constructed in the second step of the proof. Therefore, Hσ q = Hall. Lemma 14 implies that Hσ q Hσ q . Therefore, Hσ q = Hall = Hσ q. The proving process (second step of the proof) has shown that if |X| < + , Condition 3 holds and hypothesis space H consists of all hypothesis functions, then OOD detection is learnable in Ds XY for H. Therefore, OOD detection is learnable in Ds XY for Hσ q . We complete the proof when the hypothesis space H is FCNN-based.

The Case that H is score-based Fourth, we prove that if |X| = + , then OOD detection is not learnable in Ds XY for Hin Hb, where Hb = Hσ,λ q,E for any sequence q = (l1, ..., lg) (l1 = d, lg = l), where E is in Eq. (7), (8), or (9). By Theorems 5 and 8 in (Bartlett and Maass, 2003), we know that VCdim(Hσ,λ q,E) < + . Additionally, Proposition 4 implies that Assumption 1 holds and suph Hσq |{x X : h(x) Y}| = + , when |X| = + . Hence, Theorem 6 implies that OOD detection is not learnable in Ds XY for Hσ q, when |X| = + .

Fifth, we prove that if |X| < + , there exists a sequence q = (l1, ..., lg) (l1 = d and lg = l) such that OOD detection is learnable in Ds XY for for Hin Hb, where Hb = Hσ,λ q,E for any sequence q = (l1, ..., lg) (l1 = d, lg = l), where E is in Eq. (7), (8), or Eq. (9). Based on Lemma 20, we only need to show that {v Rl : E(v) λ} and {v Rl : E(v) < λ} both contain nonempty open sets for diﬀerent score functions E. Since maxk {1,...,l} exp (vk) Pl c=1 exp (vc), maxk {1,...,l} exp (vk/T) PK c=1 exp (vc/T) and T log Pl c=1 exp (vc/T) are

continuous functions, whose ranges contain (1

l , 1), (0, + ) and (0, + ), respectively. Based on the property of continuous function (E 1(A) is an open set, if A is an open set), we obtain that {v Rl : E(v) λ} and {v Rl : E(v) < λ} both contain nonempty open sets. Using Lemma 20, we complete the ﬁfth step.

Sixth, we prove that if |X| < + , then there exists a sequence q = (l1, ..., lg) (l1 = d and lg = l) such that for any sequence q = (l 1, ..., l g ) satisfying that q q , OOD detection is

learnable in Ds XY for Hin Hb, where Hb = Hσ,λ q ,E, where E is in Eq. (7), (8), or Eq. (9). In the ﬁfth step, we have proven that Eqs. (7), (8), and (9) meet the condition in Lemma 20. Therefore, Lemma 20 implies this result. We complete the proof when the hypothesis space H is score-based.

Appendix P. Proof of Theorem 17

Theorem 17 Suppose the ranking function space R is separate, and FCNN-based or scorebased, i.e., R = Fσ q or R = E Fσ q, where E is Eq. (7), (8) or (9). Then

There is a sequence q = (l1, ..., lg) such that OOD detection is AUC learnable in the separate space Ds XY for R if and only if |X| < + .

Furthermore, if |X| < + , then there is a sequence q = (l1, ..., lg) such that for any sequence q satisfying that q q , OOD detection is learnable under AUC in Ds XY for R.

On the Learnability of Out-of-distribution Detection

Proof [Proof of Theorem 17] Using a similar strategy of Theorem 16, we can prove this theorem by Theorem 9 and Lemma 11.

Appendix Q. Proof of Theorem 18

Theorem 18 Suppose that each domain DXY in Dµ,b XY is attainable, i.e., arg minh H RD(h) = (the ﬁnite discrete domains satisfy this). Let K = 1 and the hypothesis space H be scorebased (H = Hσ,λ q,E, where E is in Eq. (7), (8), or (9)) or FCNN-based (H = Hσ q). If µ(X) < + , then the following four conditions are equivalent:

Learnability in Dµ,b XY for H Condition 1 Risk-based Realizability Assumption Condition 4

Proof [Proof of Theorem 18] 1) By Lemma 1, we conclude that Learnability in Dµ,b XY for H Condition 1. 2) By Proposition 3 and Proposition 4, we know that when K = 1, there exist h1, h2 H, where h1 = 1 and h2 = 2, here 1 represents ID, and 2 represent OOD. Therefore, we know that when K = 1, infh H Rin D(h) = 0 and infh H Rout D (h) = 0, for any DXY Dµ,b XY . By Condition 1, we obtain that infh H RD(h) = 0, for any DXY Dµ,b XY . Because each domain DXY in Dµ,b XY is attainable, we conclude that Risk-based Realizability Assumption holds. We have proven that Condition 1 Risk-based Realizability Assumption. 3) By Theorems 5 and 8 in (Bartlett and Maass, 2003), we know that VCdim(φ Hσ q) < + and VCdim(Hσ,λ q,E) < + . Then, using Theorem 14, we conclude that Risk-based

Realizability Assumption Learnability in Dµ,b XY for H. 4) According to the results in 1), 2) and 3), we have proven that Learnability in Dµ,b XY for H Condition 1 Risk-based Realizability Assumption. 5) By Lemma 2, we conclude that Condition 4 Condition 1. 6) Here we prove that Learnability in Dµ,b XY for H Condition 4. Since Dµ,b XY is the prior-unknown space, by Theorem 1, there exist an algorithm A : + n=1(X Y)n H and a monotonically decreasing sequence ϵcons(n), such that ϵcons(n) 0, as n + , and for any DXY Dµ,b XY ,

ES Dn XIYI Rin D(A(S)) inf h H Rin D(h) ϵcons(n),

ES Dn XIYI Rout D (A(S)) inf h H Rout D (h) ϵcons(n).

Then, for any ϵ > 0, we can ﬁnd nϵ such that ϵ ϵcons(nϵ), therefore, if n = nϵ, we have

ES Dnϵ XIYI Rin D(A(S)) inf h H Rin D(h) ϵ,

ES Dnϵ XIYI Rout D (A(S)) inf h H Rout D (h) ϵ,

Fang, Li, Liu, Han, Lu

which implies that there exists Sϵ Dnϵ XIYI such that

Rin D(A(Sϵ)) inf h H Rin D(h) ϵ,

Rout D (A(Sϵ)) inf h H Rout D (h) ϵ.

Therefore, for any equivalence class [D XY ] with respect to Dµ,b XY and any ϵ > 0, there exists a hypothesis function A(Sϵ) H such that for any domain DXY [D XY ],

A(Sϵ) {h H : Rout D (h ) inf h H Rout D (h) + ϵ} {h H : Rin D(h ) inf h H Rin D(h) + ϵ},

which implies that Condition 4 holds. Therefore, Learnability in Dµ,b XY for H Condition 4. 7) Note that in 4), 5) and 6), we have proven that Learnability in Dµ,b XY for H Condition 4 Condition 1, and Learnability in Dµ,b XY for H Condition 1, thus, we conclude that Learnability in Dµ,b XY for H Condition 4 Condition 1. 8) Combining 4) and 7), we have completed the proof.

Appendix R. Proof of Theorem 19

Theorem 19 Suppose that the ranking function space R is separate and score-based (R = E Fσ q) or FCNN-based (R = Fσ q), where E is Eq. (7), (8) or (9). If µ(X) < + , then the following three conditions satisfy:

AUC-based Realizability Assumption Learnability in Dµ,b XY for R Condition 2

Proof [Proof of Theorem 19] The result can be obtained by Theorems 15 and 3.

Appendix S. Proof of Theorem 20

Theorem 20 Let K = 1 and the hypothesis space H be score-based (H = Hσ,λ q,E, where E is in Eq. (7), (8), or (9)) or FCNN-based (H = Hσ q). Given a prior-unknown space DXY , if there exists a domain DXY DXY , which has an overlap between ID and OOD distributions (see Deﬁnition 5), then OOD detection is not learnable under risk in DXY for H.

Proof [Proof of Theorem 20] Using Proposition 3 and Proposition 4, we obtain that infh H Rin D(h) = 0 and infh H Rout D (h) = 0. Then, Theorem 4 implies this result.

Note that if we replace the activation function σ (Re LU function) in Theorem 20 with any other activation functions, Theorem 20 still hold.

On the Learnability of Out-of-distribution Detection

Appendix T. Proof of Theorem 21

Theorem 21 Let the separate ranking function space R be FCNN-based or score-based (where the score function E is Eq. (7), (8), or (9)). Suppose that DXY , D XY DXY are discrete distributions with DXIYI = DXIYI and DXO = δx, D XO = δx . If DXO = δx, D XO = δx have overlaps with DXIYI and DXO = D XO, then OOD detection is not learnable under AUC in DXY for R.

Proof [Proof of Theorem 21] This is a conclusion of Lemma 7.