# OUT-OF-DISTRIBUTION DETECTION WITH IMPLICIT OUTLIER TRANSFORMATION

Published as a conference paper at ICLR 2023

Qizhou Wang1, Junjie Ye2, Feng Liu3, Quanyu Dai2, Marcus Kalander2, Tongliang Liu4, Jianye Hao2, Bo Han1
1Hong Kong Baptist University, 2Huawei Noah's Ark Lab, 3The University of Melbourne, 4Sydney AI Centre, The University of Sydney
{csqzwang, bhanml}@comp.hkbu.edu.hk, fengliu.ml@gmail.com, tongliang.liu@sydney.edu.au, {yejunjie4, daiquanyu, marcus.kalander, haojianye}@huawei.com
Correspondence to Bo Han (bhanml@comp.hkbu.edu.hk) and Junjie Ye (yejunjie4@huawei.com).

ABSTRACT

Outlier exposure (OE) is powerful in out-of-distribution (OOD) detection, enhancing detection capability via model fine-tuning with surrogate OOD data. However, surrogate data typically deviate from test OOD data. Thus, the performance of OE, when facing unseen OOD data, can be weakened. To address this issue, we propose a novel OE-based approach that makes the model perform well even for unseen OOD cases. It leads to a min-max learning scheme: searching to synthesize OOD data that lead to worst judgments and learning from such OOD data for uniform performance in OOD detection. In our realization, these worst OOD data are synthesized by transforming the original surrogate ones. Specifically, the associated transform functions are learned implicitly, based on our novel insight that model perturbation leads to data transformation. Our methodology offers an efficient way of synthesizing OOD data that can further benefit the detection model beyond the surrogate OOD data. We conduct extensive experiments under various OOD detection setups, demonstrating the effectiveness of our method against its advanced counterparts. The code is publicly available at: github.com/qizhouwang/doe.

1 INTRODUCTION

Deep learning systems in the open world often encounter out-of-distribution (OOD) data whose label space is disjoint with that of the in-distribution (ID) samples. For many safety-critical applications, deep models should make reliable predictions for ID data, while OOD cases (Bulusu et al., 2020) should be reported as anomalies. This leads to the well-known OOD detection problem (Lee et al., 2018c; Fang et al., 2022), which has attracted intensive attention in reliable machine learning. OOD detection remains non-trivial since deep models can be over-confident when facing OOD data (Nguyen et al., 2015; Bendale & Boult, 2016), and many efforts have been made in pursuing reliable detection models (Yang et al., 2021; Salehi et al., 2021).

Building upon discriminative models, existing OOD detection methods can generally be attributed to two categories, namely, post-hoc approaches and fine-tuning approaches. The post-hoc approaches assume a model well trained on ID data with fixed parameters, using model responses to devise various scoring functions that indicate ID and OOD cases (Hendrycks & Gimpel, 2017; Liang et al., 2018; Lee et al., 2018c; Liu et al., 2020; Sun et al., 2021; 2022; Wang et al., 2022). By contrast, the fine-tuning methods allow the target model to be further adjusted, boosting its detection capability by regularization (Lee et al., 2018a; Hendrycks et al., 2019; Tack et al., 2020; Mohseni et al., 2020; Sehwag et al., 2021; Chen et al., 2021; Du et al., 2022; Ming et al., 2022; Bitterwolf et al., 2022). Typically, fine-tuning approaches benefit from explicit knowledge of unknowns during training and thus generally reveal reliable performance across various real-world situations (Yang et al., 2021).
For the fine-tuning approaches, outlier exposure (OE) (Hendrycks et al., 2019) is among the most potent ones, engaging surrogate OOD data during training to discern ID and OOD patterns. By regularizing the model to produce low-confidence predictions on these surrogate OOD data, OE explicitly enables the detection model to learn knowledge for effective OOD detection. A caveat is that one can hardly know what kind of OOD data will be encountered when the model is deployed. Thus, a distribution gap exists between surrogate (training-time) and unseen (test-time) OOD cases. Basically, this distribution gap is harmful for OOD detection since one can hardly ensure the model performance when facing OOD data that largely deviate from the surrogate OOD data (Yang et al., 2021; Dong et al., 2020).

Figure 1: Comparison between OE, DRO, and DOE. Black boxes indicate support sets for surrogate/test OOD data. Intensities of color indicate the coverage of the learning schemes: a deeper-colored region indicates that the associated model can make more reliable detection therein. As we can see, OE directly makes the model learn from surrogate OOD data, largely deviating from test OOD situations. DRO further makes the model perform uniformly well regarding sub-populations, and the model can excel in the support set of the surrogate case. Moreover, DOE makes the model learn from additional OOD data besides surrogate cases, covering wider OOD situations (exceeding the support set) than those of OE and DRO. Thus, OOD detection capability increases from left to right.

Addressing the OOD distribution gap issue is essential but challenging for OE. Several works are related to this problem, typically shrinking the gap by making the model learn from additional OOD data. For example, Lee et al. (2018a) use generative models to synthesize OOD data on which the model will make mistakes, and the synthetic data are learned by the detection model for low-confidence predictions. However, synthesizing unseen data is intractable in general (Du et al., 2022), meaning that the corresponding data may not fully benefit OE training. Instead, Zhang et al. (2023) mix up ID and surrogate OOD data to expand the coverage of OOD cases, and Du et al. (2022) sample OOD data from the low-likelihood region of the class-conditional distribution in the low-dimensional feature space. However, linear interpolation in the former can hardly cover diverse OOD situations, and feature-space data generation in the latter may fail to fully benefit the underlying feature extractors. Hence, there is still a long way to go to address the OOD distribution gap issue in OE.

To overcome the above drawbacks, we suggest a simple yet powerful way to access extra OOD data, where we transform available surrogate data into new OOD data that further benefit our detection models. The key insight is that model perturbation implicitly leads to data transformation, and the detection model can learn from such implicit data by updating the model after its perturbation. The associated transform functions are free from tedious manual designs (Zhang et al., 2023; Huang et al., 2023) and complex generative models (Lee et al., 2018b), while remaining flexible enough to yield synthetic OOD data that deviate from the original data.
Here, two factors support the effectiveness of our data synthesis: 1) the implicit data follow a different distribution from that of the original data (cf. Theorem 1), and 2) the discrepancy between the original and transformed data distributions can be very large, given that our detection model is deep enough (cf. Lemma 1). This indicates that one can effectively synthesize extra OOD data that are largely different from the original ones. Then, we can learn from such data to further benefit the detection model.

Accordingly, we propose Distributional-agnostic Outlier Exposure (DOE), a novel OE-based approach built upon our implicit data transformation. The term distributional-agnostic reflects our ultimate goal of making the detection model perform uniformly well with respect to various unseen OOD distributions, accessing only ID and surrogate OOD data during training. In DOE, we measure the model performance in OOD detection by the worst OOD regret (WOR) regarding a candidate set of OOD distributions (cf. Definition 2), leading to a min-max learning scheme as in equation 6. Then, based on our systematic way of implicit data synthesis, we iterate between 1) searching for implicit OOD data that lead to large WOR via model perturbation and 2) learning from such data for uniform detection power of the detection model.

DOE is related to distributionally robust optimization (DRO) (Rahimian & Mehrotra, 2019), which similarly learns from the worst-case distributions. Their conceptual comparison is summarized in Figure 1. Therein, DRO considers a closed-world setting, striving for uniform performance regarding various data distributions within the support (Sagawa et al., 2020). However, it fails in the open-world OOD setting that requires detecting unseen data (cf. Section 5.3), i.e., data from the part of the test support that is disjoint with the surrogate one in Figure 1(b). By contrast, our data transformation offers an effective approach to learning from unseen data, pursuing uniform performance over regions beyond the surrogate support. Thus, DOE can mitigate the distribution gap issue to some extent, reflected by the smaller disjoint region than in the DRO case in Figure 1(c).

We conduct extensive experiments in Section 5 on widely used benchmark datasets, verifying the effectiveness of our method with respect to a wide range of different OOD detection setups. For common OOD detection, our DOE reduces the average FPR95 by 7.26%, 20.30%, and 13.97% compared with the original OE on the CIFAR-10, CIFAR-100, and ImageNet datasets. For hard OOD detection, our DOE reduces the FPR95 by 7.45%, 7.75%, and 4.09% compared with advanced methods on various hard OOD datasets.

2 PRELIMINARY

Let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the input space and $\mathcal{Y} = \{1, \ldots, C\}$ the label space. We consider the ID distribution $D_{\mathrm{ID}}$ defined over $\mathcal{X} \times \mathcal{Y}$ and the OOD distribution $D_{\mathrm{OOD}}$ defined over $\mathcal{X}$. In general, the OOD distribution $D_{\mathrm{OOD}}$ is defined as an irrelevant distribution whose label set has no intersection with $\mathcal{Y}$ (Yang et al., 2021), which is unseen during training and should not be predicted by the model.

2.1 SOFTMAX SCORING

Building upon a model $h \in \mathcal{H}: \mathcal{X} \to \mathbb{R}^C$ with logit outputs, our goal is to utilize a scoring function $s: \mathcal{X} \to \mathbb{R}$ to discern test-time inputs drawn from $D_{\mathrm{ID}}$ from those drawn from $D_{\mathrm{OOD}}$. Typically, if the score value $s(x)$ is greater than a threshold $\tau \in \mathbb{R}$, the associated input $x \in \mathcal{X}$ is taken as an ID case, otherwise an OOD case.
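As a minimal illustration (ours, not the paper's released code), the thresholding rule above can be written as follows; `score_fn` and `tau` are placeholders for any scoring function from this section and a threshold chosen on held-out data.

```python
import torch

def detect_id(x, model, score_fn, tau):
    """Threshold rule of Section 2.1: s(x) > tau -> ID, otherwise OOD.

    `score_fn` and `tau` are assumptions for illustration (e.g., MSP in
    equation 1 and a validation-chosen threshold); they are not from the paper.
    """
    with torch.no_grad():
        scores = score_fn(model(x))   # per-input scores s(x)
    return scores > tau               # boolean mask: True = treated as ID
```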
A representative scoring function in the literature is the maximum softmax prediction (MSP) (Hendrycks & Gimpel, 2017), following

$$s_{\mathrm{MSP}}(x; h) = \max_k \mathrm{softmax}_k\, h(x), \quad (1)$$

where $\mathrm{softmax}_k(\cdot)$ denotes the $k$-th element of a softmax output. Since the true labels of OOD data are not in the ID label space, the model is expected to return lower scores for them than for ID cases.

2.2 OUTLIER EXPOSURE

Unfortunately, for a normally trained model $h(\cdot)$, MSP may make over-confident predictions for some OOD data (Liu et al., 2020), which is detrimental to effective OOD detection. To this end, OE (Hendrycks et al., 2019) boosts the detection capability by making the model $h(\cdot)$ learn from the surrogate OOD distribution $D^s_{\mathrm{OOD}}$, with the associated learning objective of the form:

$$L(h) = \underbrace{\mathbb{E}_{D_{\mathrm{ID}}}\left[\ell_{\mathrm{CE}}(h(x), y)\right]}_{L_{\mathrm{CE}}(h;\, D_{\mathrm{ID}})} + \lambda\, \underbrace{\mathbb{E}_{D^s_{\mathrm{OOD}}}\left[\ell_{\mathrm{OE}}(h(x))\right]}_{L_{\mathrm{OE}}(h;\, D^s_{\mathrm{OOD}})}, \quad (2)$$

where $\lambda$ is the trade-off parameter, $\ell_{\mathrm{CE}}(\cdot)$ is the cross-entropy loss, and $\ell_{\mathrm{OE}}(\cdot)$ is defined by the Kullback-Leibler divergence to the uniform distribution, which can be written as $\ell_{\mathrm{OE}}(h(x)) = -\sum_k \log \mathrm{softmax}_k\, h(x) / C$. Basically, the OE loss $\ell_{\mathrm{OE}}(\cdot)$ plays the role of regularization, making the model learn from surrogate OOD data with low-confidence predictions. Since the model can see some OOD data during training, OE typically reveals reliable performance in OOD detection.

Note that since we know nothing about unseen OOD data during training, the surrogate distribution $D^s_{\mathrm{OOD}}$ is generally largely different from the real one $D_{\mathrm{OOD}}$. Then, the difference between surrogate and unseen OOD data leads to the OOD distribution gap between training-time (i.e., $D^s_{\mathrm{OOD}}$) and test-time (i.e., $D_{\mathrm{OOD}}$) situations. When deployed, the model inherits this data bias, potentially making over-confident predictions for unseen OOD data that differ from the surrogate ones.
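The following is a short sketch of the OE objective in equation 2 in PyTorch-style code. It is our own illustration under the definitions above; function names are ours, and `model` is any classifier with logit outputs.

```python
import torch
import torch.nn.functional as F

def oe_loss(logits_ood):
    """OE regularizer of equation 2: cross-entropy to the uniform distribution,
    i.e., -(1/C) * sum_k log softmax_k h(x), averaged over the surrogate OOD batch."""
    return -F.log_softmax(logits_ood, dim=1).mean()

def oe_objective(model, x_id, y_id, x_ood, lam=1.0):
    """L(h) = L_CE on ID data + lambda * L_OE on surrogate OOD data (equation 2)."""
    loss_ce = F.cross_entropy(model(x_id), y_id)
    return loss_ce + lam * oe_loss(model(x_ood))
```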
3 OOD SYNTHESIS

The OOD distribution gap issue stems from our insufficient knowledge about (test-time) unseen OOD data. Therefore, a direct remedy is to give the model access to extra OOD data via data synthesis, doing our best to fill the distribution gap between training- and test-time situations. When it comes to data synthesis, a direct approach is to utilize generative models (Lee et al., 2018a), while generating unseen data is intractable in general (Du et al., 2022). Therefore, MixOE (Zhang et al., 2023) mixes up ID and surrogate OOD data to expand the coverage of various OOD situations, and VOS (Du et al., 2022) generates additional OOD data in the embedding space with respect to low-likelihood ID regions. However, the former relies on manually designed synthesizing procedures, which can hardly cover diverse OOD situations. The latter generates OOD data in a low-dimensional space, which relies on specific assumptions about the ID distribution (e.g., a mixture of Gaussians) and hardly benefits the underlying feature extractors in learning meaningful OOD patterns.

3.1 MODEL PERTURBATION FOR DATA SYNTHESIS

Considering the previous drawbacks in OOD synthesis, we suggest a new way to access additional OOD data, which is simple yet powerful. Overall, we transform the available surrogate OOD data to synthesize new data that can further benefit our model. The associated transform function is parasitic on our detection model, which is learnable without auxiliary deep models or manual designs. The key insight is that perturbing model parameters has the same effect as transforming data, where specific model perturbations indicate specific transform functions. For the beneficial data of our interest (e.g., the worst OOD data), we can implicitly access them by finding the corresponding model perturbation. Updating the detection model thereafter, it can learn from the transformed data (i.e., the beneficial ones) instead of the original inputs.

Now, we formalize our intuition. We study the piecewise affine ReLU network model (Arora et al., 2018), covering a large group of deep models with ReLU activations, fully connected layers, convolutional layers, residual layers, etc. Here, we consider the recursive definition of an $L$-layer ReLU network, following

$$z^{(l)} = h^{(l)}(W^{(l-1)} z^{(l-1)}) \quad \text{for } l = 1, \ldots, L, \quad (3)$$

where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the $l$-th layer weights and $h^{(l)}(z) = \max\{0, z\}$ the ReLU activation. We have $z^{(1)} = x$ the model input and $z^{(L)} = h(x)$ the model output. If necessary, we write $h_W$ in place of $h$, with the joint form of weights $W = \{W^{(l)}\}_{l=1}^L$ that contains all trainable parameters. Our discussion is on a specific form of model perturbation named multiplicative perturbation.

Definition 1 (Multiplicative Perturbation (Petzka et al., 2021)). For an $L$-layer ReLU network $h(\cdot)$, its $l$-th layer is multiplicatively perturbed if $W^{(l)}$ is changed into

$$W^{(l)}(I + \alpha A^{(l)}), \quad (4)$$

where $\alpha > 0$ is the perturbation strength and $A^{(l)} \in \mathbb{R}^{n_{l-1} \times n_{l-1}}$ is the perturbation matrix. Furthermore, the model $h(\cdot)$ is multiplicatively perturbed if all its layers are multiplicatively perturbed.

Now, we link the multiplicative perturbation of the $l$-th layer to data transformation in the associated embedding space, summarized by the following proposition.

Proposition 1. Consider the data distribution $D$ and the multiplicative perturbation regarding the $l$-th layer of a ReLU network. Then, measured in the feature space, multiplicative perturbation is equivalent to data transformation. Further, the transformed data follow a new distribution $D'$ that is different from $D$ if the eigenvalues of $A^{(l)}$ are greater than 0.

Therefore, model perturbation offers an alternative way to modify data and their distribution implicitly. Now, we generalize Proposition 1 to the multiplicative perturbation of the whole model, showing that it can modify the data distribution in the original input space.

Theorem 1. Consider the data distribution $D$ and an $L$-layer ReLU network. Measured in the input space $\mathcal{X} \subseteq \mathbb{R}^d$, multiplicative perturbation of the model is equivalent to data transformation in the input space, with the transformed data following a distribution $D'$. Then, $D$ and $D'$ are different if the eigenvalues of $A^{(l)}$ are greater than 0 and $W^{(l),\dagger} = W^{(l),-1}$ for $l = 1, \ldots, L$.

The proof of the above theorem directly leads to the following lemma, indicating that our data-synthesizing approach can benefit from the layer-wise architectures of deep models.

Lemma 1. Consider an $L$-layer ReLU network with the multiplicative perturbation $\{A^{(l)}_L\}_{l=1}^L$ and the associated transformed distribution $D'_L$. Then, there exists an $(L+1)$-layer ReLU network with the multiplicative perturbation $\{A^{(l)}_{L+1}\}_{l=1}^{L+1}$ and the associated transformed distribution $D'_{L+1}$, such that the difference between $D'_{L+1}$ and $D$ is no smaller than the difference between $D'_L$ and $D$.

All the above proofs can be found in Appendix A, revealing that model perturbation leads to data transformation. There are two points worth emphasizing. First, the distribution of the transformed data can be very different from that of the original data under the mild condition of positive eigenvalues. Further, the corresponding transform function is complex enough with layer-wise nonlinearity, where deeper models induce stronger forms of transformations (regarding distributions).
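The feature-space equivalence behind Proposition 1 can be checked numerically with a single ReLU layer; this is an illustrative sketch of ours (the matrices and sizes are arbitrary), showing that perturbing the weights multiplicatively and transforming the incoming features by $I + \alpha A$ give the same output.

```python
import torch

torch.manual_seed(0)
n_in, n_out, alpha = 8, 6, 0.1

W = torch.randn(n_out, n_in)      # layer weights W^(l)
A = torch.randn(n_in, n_in)       # perturbation matrix A^(l)
z = torch.randn(n_in)             # features entering the layer
I = torch.eye(n_in)

# (i) ReLU layer with multiplicatively perturbed weights, applied to the original features
perturbed_weights = torch.relu(W @ (I + alpha * A) @ z)
# (ii) original ReLU layer applied to the affinely transformed features (I + alpha A) z
transformed_input = torch.relu(W @ ((I + alpha * A) @ z))

# the eigenvalue condition in Proposition 1 concerns when the distributions differ,
# not this pointwise equivalence, which holds for any A
print(torch.allclose(perturbed_weights, transformed_input))  # True (up to float error)
```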
4 DISTRIBUTIONAL-AGNOSTIC OUTLIER EXPOSURE

Our data synthesis scheme allows the model $h(\cdot)$ to learn from additional OOD data besides the surrogate ones. Recall that we aim for the model to perform uniformly well for various unseen OOD data. Then, a critical issue is what kinds of synthesized OOD data can benefit our model the most. To begin with, we measure the detection capability by the worst-case OOD performance of the detection model, leading to the following definition of the worst OOD regret (WOR).

Definition 2 (Worst OOD Regret). For the detection model $h(\cdot)$, its worst OOD regret is

$$\mathrm{WOR}(h) = \sup_{D \in \mathfrak{D}_{\mathrm{OOD}}} \Big[ L_{\mathrm{OE}}(h; D) - \inf_{h' \in \mathcal{H}} L_{\mathrm{OE}}(h'; D) \Big], \quad (5)$$

where $\mathfrak{D}_{\mathrm{OOD}}$ denotes the set of all OOD distributions and $\mathcal{H}$ is the hypothesis space.

Minimizing the WOR upper-bounds the worst-case OOD regret and thus promotes uniform performance of the detection model across OOD cases. Therefore, synthetic OOD data that lead to a large WOR are of our interest, and learning from such data can benefit our model the most. Note that one could also measure the detection capability by the worst risk, i.e., $\sup_{D \in \mathfrak{D}_{\mathrm{OOD}}} L_{\mathrm{OE}}(h; D)$, but we find that our regret-based measurement is better since it further considers the fitting power of the model when facing an extremely large space of unseen data.

4.1 LEARNING OBJECTIVE

The WOR measures the worst OOD regret with respect to the worst OOD distribution, which suits our perturbation-based data transformation that can lead to new data distributions (cf. Theorem 1). Therefore, to empirically upper-bound the WOR, one can first find the model perturbation that leads to a large OOD regret and then update the model parameters after its perturbation. Here, an implicit assumption is that the associated data given by model perturbation (with surrogate OOD inputs) are valid OOD cases. It is reasonable since the WOR in equation 5 does not involve any term that makes the associated data close to ID data in either semantics or style.

Then, we propose an OE-based method for uniformly good OOD detection, namely, Distributional-agnostic Outlier Exposure (DOE). It is formalized by a min-max learning problem, namely,

$$L_{\mathrm{DOE}}(h_W; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; D_{\mathrm{ID}}) + \lambda \max_{P: \|P\| \le 1} \underbrace{\Big[ L_{\mathrm{OE}}(h_{W+\alpha P}; D^s_{\mathrm{OOD}}) - \min_{W'} L_{\mathrm{OE}}(h_{W'+\alpha P}; D^s_{\mathrm{OOD}}) \Big]}_{\mathrm{WOR}_P(h_W;\, D^s_{\mathrm{OOD}})}, \quad (6)$$

where $\mathrm{WOR}_P(h_W; D^s_{\mathrm{OOD}})$ is a perturbation-based realization of the WOR calculation. Several points therein require our attention. First, ID data remain the same during training and testing, and the distribution gap occurs only for OOD cases. Therefore, the WOR is applied only to the surrogate OOD data, and the original risk $L_{\mathrm{CE}}(h_W; D_{\mathrm{ID}})$ is applied to the ID data. Furthermore, we adopt the implicit data transformation to search for the worst OOD distribution, substituting the search space of distributions $\mathfrak{D}_{\mathrm{OOD}}$ with the search space of perturbations, i.e., $\{P: \|P\| \le 1\}$. Here, we adopt a fixed threshold of 1 since one can change the perturbation strength via the parameter $\alpha$. Finally, we adopt the additive perturbation $W + \alpha P$, which is easier to implement than the multiplicative counterpart, and they are equivalent when assuming $P = WA$.
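The claimed equivalence between the multiplicative and additive forms follows from $W(I + \alpha A) = W + \alpha W A$; a tiny numeric check of this identity (our illustration, with arbitrary sizes) is given below.

```python
import torch

torch.manual_seed(0)
n_out, n_in, alpha = 6, 8, 0.1
W = torch.randn(n_out, n_in)
A = torch.randn(n_in, n_in)

# multiplicative perturbation W (I + alpha A) equals additive perturbation W + alpha P with P = W A
multiplicative = W @ (torch.eye(n_in) + alpha * A)
additive = W + alpha * (W @ A)
print(torch.allclose(multiplicative, additive))  # True
```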
4.2 REALIZATION

We consider a stochastic realization of DOE, where ID and OOD mini-batches are randomly sampled in each iteration, denoted by $\mathcal{B}_{\mathrm{ID}}$ and $\mathcal{B}^s_{\mathrm{OOD}}$, respectively. The overall DOE algorithm is summarized in Appendix B. Here, we emphasize several vital points.

Regret Estimation. The exact regret computation is hard since we need to find the optimal risk for each candidate perturbation. As an effective estimation, following (Arjovsky et al., 2019; Agarwal & Zhang, 2022), we calculate the norm of the gradient with respect to the risk $L_{\mathrm{OE}}$, namely,

$$\mathrm{WOR}_G(h_W; \mathcal{B}^s_{\mathrm{OOD}}) = \big\| \nabla_\sigma \big|_{\sigma=1.0}\, L_{\mathrm{OE}}(\sigma \cdot h_{W+\alpha P}; \mathcal{B}^s_{\mathrm{OOD}}) \big\|_2. \quad (7)$$

Intuitively, a large value of the gradient norm indicates that the current model is far from optimal, and thus the corresponding regret should be large. It leads to an efficient indicator of the regret.

Perturbation Estimation. Gradient ascent is employed to find the proper perturbation $P$ for the max operation in equation 6. In each step, the perturbation is updated by

$$P \leftarrow P + \nabla_P\, \mathrm{WOR}_G(h_{W+\alpha P}; \mathcal{B}^s_{\mathrm{OOD}}), \quad (8)$$

with $P$ initialized to 0. We further normalize $P$ using $P_{\mathrm{NORM}} = \mathrm{NORM}(P)$ to satisfy the norm constraint. By default, we employ one step of gradient update as an efficient estimation of its value, which can be taken as the solution for the first-order Taylor-approximated model.

Stable Estimation. Equation 8 is calculated for a mini-batch of OOD samples, which is biased from the exact solution of $P$ that leads to the worst regret regarding the whole training sample. To mitigate this gap, we adopt the moving average of the resultant $P_{\mathrm{NORM}}$ across training steps, namely,

$$P_{\mathrm{MA}} \leftarrow (1 - \beta)\, P_{\mathrm{MA}} + \beta\, P_{\mathrm{NORM}}, \quad (9)$$

where $\beta \in (0, 1]$ is the smoothing strength. Overall, a smaller $\beta$ indicates that we take the average over a wider range of steps, leading to a more stable estimation of the perturbation.

Scoring Function. After training, we adopt the MaxLogit scoring (Hendrycks et al., 2022) for OOD detection, which is better than the MSP scoring when facing large semantic spaces. It is of the form:

$$s_{\mathrm{ML}}(x; h) = \max_k h_k(x), \quad (10)$$

where $h_k(\cdot)$ denotes the $k$-th element of the logit output. In general, a large value of $s_{\mathrm{ML}}(x; h)$ indicates high confidence of the associated $x$ being an ID case.

5 EXPERIMENTS

This section conducts extensive experiments in OOD detection. In Section 5.1, we verify the superiority of our DOE against state-of-the-art methods on both the CIFAR (Krizhevsky & Hinton, 2009) and the ImageNet (Deng et al., 2009) benchmarks. In Section 5.2, we demonstrate the effectiveness of our method for hard OOD detection. In Section 5.3, we further conduct an ablation study to understand our learning mechanism in depth. The code is publicly available at: github.com/qizhouwang/doe.

Baseline Methods. We compare our DOE with advanced methods in OOD detection. For post-hoc approaches, we consider MSP (Hendrycks & Gimpel, 2017), ODIN (Liang et al., 2018), Mahalanobis (Lee et al., 2018c), Free Energy (Liu et al., 2020), ReAct (Sun et al., 2021), and KNN (Sun et al., 2022); for fine-tuning approaches, we consider OE (Hendrycks et al., 2019), CSI (Tack et al., 2020), SSD+ (Sehwag et al., 2021), MixOE (Zhang et al., 2023), and VOS (Du et al., 2022).

Evaluation Metrics. The OOD detection performance of a detection model is evaluated via two representative metrics, which are both threshold-independent (Davis & Goadrich, 2006): the false positive rate of OOD data when the true positive rate of ID data is at 95% (FPR95), and the area under the receiver operating characteristic curve (AUROC), which can be viewed as the probability that an ID case receives a greater score than an OOD case.
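For concreteness, one common way to compute the two metrics from per-sample scores is sketched below; this is our own illustration (not the paper's evaluation code) and assumes NumPy arrays of ID and OOD scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fpr_at_95_tpr(scores_id, scores_ood):
    """FPR95: fraction of OOD samples above the threshold that keeps 95% TPR on ID data."""
    tau = np.percentile(scores_id, 5)          # about 95% of ID scores lie above this threshold
    return float(np.mean(scores_ood >= tau))

def auroc(scores_id, scores_ood):
    """AUROC: probability that a random ID sample scores higher than a random OOD sample."""
    labels = np.concatenate([np.ones_like(scores_id), np.zeros_like(scores_ood)])
    scores = np.concatenate([scores_id, scores_ood])
    return float(roc_auc_score(labels, scores))
```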
Table 1: Comparison in OOD detection on the CIFAR and ImageNet benchmarks. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a bold font indicates the best results in a column.

| Methods | CIFAR-10 FPR95 ↓ | CIFAR-10 AUROC ↑ | CIFAR-100 FPR95 ↓ | CIFAR-100 AUROC ↑ | ImageNet FPR95 ↓ | ImageNet AUROC ↑ |
|---|---|---|---|---|---|---|
| *Post-hoc Approaches* | | | | | | |
| MSP | 53.77 | 88.40 | 76.73 | 76.24 | 75.32 | 76.96 |
| ODIN | 42.80 | 88.69 | 63.25 | 75.72 | 77.43 | 71.04 |
| Mahalanobis | 34.98 | 93.21 | 65.57 | 78.03 | 86.50 | 58.78 |
| Free Energy | 37.77 | 88.27 | 71.56 | 78.51 | 71.14 | 79.50 |
| ReAct | 58.22 | 82.21 | 69.94 | 78.21 | 70.31 | 81.42 |
| KNN | 34.56 | 93.43 | 50.24 | 86.73 | 64.75 | 80.91 |
| *Fine-tuning Approaches* | | | | | | |
| OE | 12.41 | 97.85 | 45.68 | 87.61 | 73.80 | 78.90 |
| CSI | 17.39 | 96.87 | 83.72 | 65.94 | 86.80 | 65.54 |
| SSD+ | 14.84 | 97.36 | 56.65 | 87.38 | 64.55 | 77.46 |
| MixOE | 13.55 | 97.59 | 52.04 | 86.46 | 74.36 | 77.28 |
| VOS | 31.55 | 91.56 | 73.43 | 79.98 | 87.87 | 61.36 |
| DOE | **5.15** | **98.78** | **25.38** | **93.97** | **59.83** | **83.54** |

Pre-training Setups. For the CIFAR benchmarks, we employ WRN-40-2 (Zagoruyko & Komodakis, 2016) as the backbone model following (Liu et al., 2020). The models have been trained for 200 epochs via empirical risk minimization, with a batch size of 64, momentum of 0.9, and an initial learning rate of 0.1. The learning rate is divided by 10 after 100 and 150 epochs. For ImageNet, we employ ResNet-50 (He et al., 2016) with well-trained parameters downloaded from the PyTorch repository following (Sun et al., 2021).

DOE Setups. Hyper-parameters are chosen based on the OOD detection performance on validation datasets, which are separated from the ID and surrogate OOD data. For the CIFAR benchmarks, DOE is run for 10 epochs with an initial learning rate of 0.01 and cosine decay (Loshchilov & Hutter, 2017). The batch size is 128 for ID cases and 256 for OOD cases. The number of warm-up epochs is set to 5, λ is 1, and β is 0.6. For the ImageNet dataset, DOE is run for 4 epochs with an initial learning rate of 0.0001 and cosine decay. The batch sizes are 64 for both ID and surrogate OOD cases. The number of warm-up epochs is 2, λ is 1, and β is 0.1. For both the CIFAR and the ImageNet benchmarks, σ is uniformly sampled from {1e-1, 1e-2, 1e-3, 1e-4} in each training step, which allows covering a wider range of OOD situations than assigning fixed values. Furthermore, the number of perturbation steps is fixed to 1.

Surrogate OOD Datasets. For the CIFAR benchmarks, we adopt the tiny-ImageNet dataset (Le & Yang, 2015) as the surrogate OOD dataset for training. For the ImageNet dataset, we employ the ImageNet-21K-P dataset (Ridnik et al., 2021), which cleanses invalid classes and resizes images compared with the original ImageNet-21K (Deng et al., 2009).
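For readability, the CIFAR-benchmark settings reported above can be collected into a single configuration; the dictionary below is only our illustrative summary (key names are ours, not a released config file).

```python
# Illustrative summary of the CIFAR-benchmark DOE settings reported above (assumptions in key names).
doe_cifar_config = {
    "backbone": "WRN-40-2",
    "epochs": 10,
    "lr": 0.01,                  # initial learning rate with cosine decay
    "batch_size_id": 128,
    "batch_size_ood": 256,
    "warmup_epochs": 5,
    "lambda": 1.0,               # trade-off weight on the OE term
    "beta": 0.6,                 # moving-average smoothing for the perturbation
    "sigma_choices": [1e-1, 1e-2, 1e-3, 1e-4],  # sampled uniformly per training step, as reported above
    "perturbation_steps": 1,
    "surrogate_ood": "tiny-ImageNet",
}
```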
5.1 COMMON OOD DETECTION

We begin with our main experiments on the CIFAR and ImageNet benchmarks. Model performance is tested on several common OOD datasets widely adopted in the literature (Sun et al., 2022). For the CIFAR cases, we employ Texture (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), Places365 (Zhou et al., 2018), LSUN-Crop (Yu et al., 2015), and iSUN (Xu et al., 2015); for the ImageNet case, we employ iNaturalist (Horn et al., 2018), SUN (Xu et al., 2015), Places365 (Zhou et al., 2018), and Texture (Cimpoi et al., 2014). In Table 1, we report the average performance (i.e., FPR95 and AUROC) over the OOD datasets mentioned above. Please refer to Tables 4-5 and 8 in Appendix C for the detailed results.

CIFAR Benchmarks. Overall, the fine-tuning methods can lead to effective OOD detection in that they (e.g., OE and DOE) generally demonstrate better results than most of the post-hoc approaches (e.g., Mahalanobis and KNN). Furthermore, compared with the OE-based methods (i.e., OE and MixOE), other fine-tuning methods only show comparable, or even inferior, performance in OOD detection. Therefore, the OE-based methods that utilize surrogate OOD data remain hard to beat among state-of-the-art methods, even with their inherent OOD distribution gap issue. Further, DOE's improvement in OOD detection is notable compared to OE and MixOE, with 7.26 and 8.40 better FPR95 results on the CIFAR-10 dataset, and with 20.30 and 26.66 better FPR95 results on the CIFAR-100 dataset. Note that the tiny-ImageNet dataset is adopted as the surrogate OOD data, which is largely different from the considered test OOD datasets. Thus, we emphasize that the improvement of our method is due to our novel distributional-robust learning scheme, mitigating the OOD distribution gap between the surrogate and the unseen OOD cases.

Figure 2: The scoring densities of OE and DOE on the CIFAR-100 dataset, where the MaxLogit scoring is employed.

We emphasize that the improvement of our DOE compared with that of OE is not dominated by our specific choice of scoring strategy. To verify this, we conduct experiments with OE and DOE and then employ the MaxLogit scoring after model training. Figure 2 illustrates the scoring densities with (a) OE and (b) DOE on the CIFAR-100 dataset, where we consider two test-time OOD datasets, namely, Texture and SVHN. Compared with that of OE, the overlap regions of DOE between the ID (i.e., CIFAR-100) and the OOD (i.e., Texture and SVHN) distributions are reduced. It reveals that even with the same scoring function (i.e., MaxLogit), DOE can still improve the model's detection capability compared with the original OE. Therefore, we state that the key reason for our improved performance is our novel learning strategy of learning from extra OOD data that can benefit the model. Please refer to Appendix C for their detailed comparison.

ImageNet Benchmark. Huang & Li (2021) show that many advanced methods developed on the CIFAR benchmarks can hardly work for the ImageNet dataset due to its large semantic space with about 1k classes. Therefore, Table 1 also compares the results of DOE with advanced methods on ImageNet. As we can see, similar to the cases with the CIFAR benchmarks, the fine-tuning approaches generally reveal superior results compared with the post-hoc approaches, and DOE remains effective, showing the best detection performance in expectation. Overall, Table 1 demonstrates the effectiveness of DOE across widely adopted experimental settings, revealing the power of our implicit data search scheme and distributional robust learning scheme.

5.2 HARD OOD DETECTION

Besides the above test OOD datasets, we also consider hard OOD scenarios (Tack et al., 2020), of which the test OOD data are very similar to the ID cases in style.
Following the common setup (Sun et al., 2022) with the CIFAR-10 dataset being the ID case, we evaluate our DOE on three hard OOD datasets, namely, LSUN-Fix (Yu et al., 2015), ImageNet-Resize (Deng et al., 2009), and CIFAR-100. Note that data in ImageNet-Resize (1,000 classes) that share the semantic space of tiny-ImageNet (200 classes) are removed. We compare our DOE with several works reported to perform well in hard OOD detection, including KNN, OE, CSI, and SSD+, where the results are summarized in Table 2.

Table 2: Comparison of DOE and advanced methods in hard OOD detection. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a bold font indicates the best results in a column.

| Methods | LSUN-Fix FPR95 ↓ | LSUN-Fix AUROC ↑ | ImageNet-Resize FPR95 ↓ | ImageNet-Resize AUROC ↑ | CIFAR-100 FPR95 ↓ | CIFAR-100 AUROC ↑ |
|---|---|---|---|---|---|---|
| KNN | 25.76 | 95.00 | 40.65 | 86.30 | 64.50 | 86.32 |
| OE | 10.45 | 98.33 | 14.95 | 97.78 | 53.55 | 90.40 |
| CSI | 34.85 | 91.72 | 33.30 | 90.50 | 45.64 | 87.64 |
| SSD+ | 23.95 | 95.74 | 53.52 | 84.00 | 46.87 | 90.50 |
| DOE | **3.00** | **99.15** | **7.20** | **98.55** | **41.55** | **91.85** |

As we can see, our DOE beats these advanced methods across all the considered datasets, even for the challenging CIFAR-10 versus CIFAR-100 setting. To some extent, it may indicate that our implicit data synthesis can even cover some hard OOD cases, and thus our DOE can lead to improved performance in hard OOD detection.

5.3 ABLATION STUDY

Our proposal claims two key contributions. The first one is the implicit data transformation via model perturbation, and the second one is the distributional robust learning scheme regarding the WOR. Here, we design a series of experiments to demonstrate their respective power.

Table 3: Effectiveness of implicit data transformation and distributional robust learning on CIFAR-100. All-ones, Gaussian, and Uniform are implicit data transformation variants; χ2, WD, and AT are distributional robust learning variants. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a bold font indicates the best results in a row.

| Metric | All-ones | Gaussian | Uniform | χ2 | WD | AT | DOE | OE |
|---|---|---|---|---|---|---|---|---|
| FPR95 ↓ | 38.30 | 32.78 | 32.50 | 46.93 | 42.85 | 45.24 | **25.38** | 45.68 |
| AUROC ↑ | 92.67 | 91.25 | 91.55 | 89.17 | 90.51 | 90.45 | **93.97** | 87.61 |

Implicit Data Transformation. In Section 3.1, we demonstrate that model perturbation can lead to data transformation. Here, we verify that other realizations (besides searching for the WOR) can also benefit the model with additional OOD data. We employ perturbations with fixed values of ones (All-ones) and two types of random noise, namely, Gaussian noise with 0 mean and identity covariance matrix (Gaussian) and uniform noise over the interval [-1, 1] (Uniform) (cf. Appendix B). We summarize their results on CIFAR-100 in Table 3 (Implicit Data Transformation). Compared to MSP and OE without model perturbation, all forms of perturbation lead to improved detection, indicating that our implicit data transformation is general in benefiting the model with additional OOD data.

Distributional Robust Learning. In Section 4, we employ the implicit data transformation for uniform performance in OOD detection. As mentioned in Section 1, DRO (Rahimian & Mehrotra, 2019) also focuses on distributional robustness. Here, we conduct experiments with two realizations of DRO, with the χ2 divergence (χ2) (Hashimoto et al., 2018) and the Wasserstein distance (WD) (Kwon et al., 2020) (cf. Appendix B). We also consider adversarial training (AT) (Madry et al., 2018b) as a baseline method, which can also be interpreted through the lens of DRO. We summarize the related experiments on CIFAR-100 in Table 3 (Distributional Robust Learning). The two traditional DRO-based methods (i.e., χ2 and WD) mainly consider the cases where the support of the test OOD data is a subset of the surrogate case. This closed-world setup fails in OOD detection, and thus they reveal unsatisfactory results. Though AT also performs data transformation, its transformation is limited to additive noise, which can hardly cover the diversity of unseen data.
In contrast, our DOE can search for complex transform functions that exploit unseen data, yielding large improvements compared to all other robust learning methods.

6 CONCLUSION

Our proposal makes two key contributions. The first is the implicit data transformation for OOD synthesis, based on our novel insight that model perturbation leads to data transformation. Synthetic data follow a distribution that deviates from the original one, enabling the target model to learn from unseen data. The second contribution is a distributional-robust learning method, built upon a min-max optimization scheme that searches for the worst regret. We demonstrate that learning from the worst regret in OOD detection can yield better results than the risk-based counterpart. Accordingly, we propose DOE to mitigate the OOD distribution gap issue inherent in OE-based methods, and extensive experiments verify its effectiveness. Our two contributions may not be limited to the OOD detection field. We will explore their usage scenarios in other areas, such as OOD generalization, adversarial training, and distributionally robust optimization.

7 ACKNOWLEDGMENTS

QZW and BH were supported by NSFC Young Scientists Fund No. 62006202, Guangdong Basic and Applied Basic Research Foundation No. 2022A1515011652, RGC Early Career Scheme No. 22200720, and RGC Research Matching Grant Scheme No. RMGS20221102, No. RMGS20221306, and No. RMGS20221309. BH was also supported by the CAAI-Huawei MindSpore Open Fund and the HKBU CSD Departmental Incentive Grant. TLL was partially supported by Australian Research Council Projects IC-190100031, LP-220100527, DP-220102121, and FT-220100318.

8 ETHICS STATEMENT

This paper does not raise any ethical concerns. This study does not involve any human subjects, practices related to dataset releases, potentially harmful insights, methodologies and applications, potential conflicts of interest and sponsorship, discrimination/bias/fairness concerns, privacy and security issues, legal compliance, or research integrity issues.

9 REPRODUCIBILITY STATEMENT

The experimental setups for training and evaluation as well as the hyper-parameters are described in detail in Section 5, and the experiments are all conducted using public datasets. The code is publicly available at: github.com/qizhouwang/doe.

REFERENCES

Alekh Agarwal and Tong Zhang. Minimax regret optimization for robust machine learning under distribution shift. In COLT, 2022.
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In ICLR, 2018.
Abhijit Bendale and Terrance E. Boult. Towards open set deep networks. In CVPR, 2016.
Julian Bitterwolf, Alexander Meinke, Maximilian Augustin, and Matthias Hein. Breaking down out-of-distribution detection: Many methods based on OOD training data estimate a combination of the same core quantities. In ICML, 2022.
Saikiran Bulusu, Bhavya Kailkhura, Bo Li, P. Varshney, and Dawn Song. Anomalous instance detection in deep learning: A survey. Technical report, Lawrence Livermore National Lab., 2020.
Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. ATOM: Robustifying out-of-distribution detection using outlier mining. In ECML, 2021.
Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.
Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In ICML, 2006.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, and Xiaowei Xu. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In CVPR, 2020.
Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li. VOS: Learning what you don't know by virtual outlier synthesis. In ICLR, 2022.
Zhen Fang, Yixuan Li, Jie Lu, Jiahua Dong, Bo Han, and Feng Liu. Is out-of-distribution detection learnable? In NeurIPS, 2022.
Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In ICML, 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
Dan Hendrycks, Mantas Mazeika, and Thomas G. Dietterich. Deep anomaly detection with outlier exposure. In ICLR, 2019.
Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joseph Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. Scaling out-of-distribution detection for real-world settings. In ICML, 2022.
Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
Rui Huang and Yixuan Li. MOS: Towards scaling out-of-distribution detection for large semantic space. In CVPR, 2021.
Zhuo Huang, Xiaobo Xia, Li Shen, Bo Han, Mingming Gong, Chen Gong, and Tongliang Liu. Harnessing out-of-distribution examples via augmenting content and style. In ICLR, 2023.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, 2009.
David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation. In ICML, 2021.
Yongchan Kwon, Wonyoung Kim, Joong-Ho Won, and Myunghee Cho Paik. Principled learning method for Wasserstein distributionally robust optimization with local perturbations. In ICML, 2020.
Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.
Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018a.
Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018b.
Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018c.
Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.
Weitang Liu, Xiaoyun Wang, John D. Owens, and Yixuan Li. Energy-based out-of-distribution detection. In NeurIPS, 2020.
Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018a.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018b.
Yifei Ming, Ying Fan, and Yixuan Li. POEM: Out-of-distribution detection with posterior sampling. In ICML, 2022.
Sina Mohseni, Mandar Pitale, J. B. S. Yadawa, and Zhangyang Wang. Self-supervised learning for generalizable out-of-distribution detection. In AAAI, 2020.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 2015.
Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. In NeurIPS, 2021.
Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
Tal Ridnik, Emanuel Ben Baruch, Asaf Noy, and Lihi Zelnik. ImageNet-21K pretraining for the masses. In NeurIPS Datasets and Benchmarks, 2021.
Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR, 2020.
Mohammadreza Salehi, Hossein Mirzaei, Dan Hendrycks, Yixuan Li, Mohammad Hossein Rohban, and Mohammad Sabokrou. A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges. arXiv preprint arXiv:2110.14051, 2021.
Vikash Sehwag, Mung Chiang, and Prateek Mittal. SSD: A unified framework for self-supervised outlier detection. In ICLR, 2021.
Aman Sinha, Hongseok Namkoong, and John C. Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018.
Yiyou Sun, Chuan Guo, and Yixuan Li. ReAct: Out-of-distribution detection with rectified activations. In NeurIPS, 2021.
Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In ICML, 2022.
Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. CSI: Novelty detection via contrastive learning on distributionally shifted instances. In NeurIPS, 2020.
Qizhou Wang, Feng Liu, Yonggang Zhang, Jing Zhang, Chen Gong, Tongliang Liu, and Bo Han. Watermarking for out-of-distribution detection. In NeurIPS, 2022.
Pingmei Xu, Krista A. Ehinger, Yinda Zhang, Adam Finkelstein, Sanjeev R. Kulkarni, and Jianxiong Xiao. TurkerGaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755, 2015.
Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.
Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
Jingyang Zhang, Nathan Inkawhich, Randolph Linderman, Yiran Chen, and Hai Li. Mixture outlier exposure: Towards out-of-distribution detection in fine-grained environments. In WACV, 2023.
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452-1464, 2018.

A PROOFS

This section provides the detailed proofs for our theoretical claims in the main text.

A.1 PROOF OF PROPOSITION 1

Proof. To make the derivation clear, we adopt an equivalent form of the recursive definition of the model in equation 3, following

$$h^{(l+1)}(W^{(l)} z^{(l)}) = h^{(l+1)}(z^{(l)}; W^{(l)}). \quad (11)$$

Then, by multiplicatively perturbing the $l$-th layer of the model, we have

$$h^{(l+1)}\big(z^{(l)}; W^{(l)}(I + \alpha A^{(l)})\big) = \max\big\{ W^{(l)}(I + \alpha A^{(l)})\, z^{(l)},\, 0 \big\} = \max\big\{ W^{(l)} \big( (I + \alpha A^{(l)}) z^{(l)} \big),\, 0 \big\} = h^{(l+1)}\big( (I + \alpha A^{(l)}) z^{(l)}; W^{(l)} \big). \quad (12)$$

Therefore, measured in the feature space $\mathcal{Z}^{(l)}$, multiplicative perturbation modifies the original features $z^{(l)}$ by an affine transformation $I + \alpha A^{(l)}$. Assuming that the original data are i.i.d. drawn from the distribution with probability density function (pdf) $f_{Z^{(l)}}(z^{(l)})$, then the transformed data are i.i.d. drawn from the distribution with pdf $f_{Z'^{(l)}}(z'^{(l)}) = f_{Z^{(l)}}(z^{(l)})\, |I + \alpha A^{(l)}|^{-1}$. Using the KL divergence to measure the discrepancy between the original feature distribution and the transformed feature distribution, we have

$$D_{\mathrm{KL}}(f_{Z^{(l)}} \,\|\, f_{Z'^{(l)}}) = \mathbb{E}_{f_{Z^{(l)}}(z^{(l)})}\left[ \log \frac{f_{Z^{(l)}}(z^{(l)})}{f_{Z'^{(l)}}(z'^{(l)})} \right] = \log \big|I + \alpha A^{(l)}\big|. \quad (13)$$

Without loss of generality, we assume $K$ different eigenvalues for the matrix $A^{(l)}$. Then, by the Jordan matrix decomposition, we can write $A^{(l)} = T^{(l),-1} J^{(l)} T^{(l)}$. Therein, $J^{(l)}$ is the block-diagonal matrix

$$J^{(l)} = \mathrm{diag}\big( J(\lambda_1), J(\lambda_2), \ldots, J(\lambda_k), \ldots, J(\lambda_K) \big), \quad (14)$$

where $J(\lambda_k)$ is the $k$-th Jordan block (of size $n_k \times n_k$) corresponding to the $k$-th eigenvalue of the matrix $A^{(l)}$. Then, we have $|I + \alpha A^{(l)}| = |T^{(l),-1}(I + \alpha J^{(l)}) T^{(l)}| = |I + \alpha J^{(l)}|$. Since $J^{(l)}$ is an upper-triangular matrix, we can write $|I + \alpha J^{(l)}| = \prod_{k=1}^K (\alpha \lambda_k + 1)^{n_k}$. Accordingly, if the eigenvalues of the matrix $A^{(l)}$ are all greater than 0 and $\alpha > 0$, we have $|I + \alpha A^{(l)}| > 1$ and $D_{\mathrm{KL}}(f_{Z^{(l)}} \| f_{Z'^{(l)}}) > 0$. Therefore, the distributions $f_{Z^{(l)}}$ and $f_{Z'^{(l)}}$ are different regarding the KL divergence. Thus we complete our proof.

A.2 PROOF OF THEOREM 1

Proof. We consider an induction proof, justifying that the multiplicative perturbation $A^{(l)}$ of any layer $l = 1, \ldots, L$ can be transformed into an equivalent multiplicative perturbation $\tilde{A}^{(l-1)}$ in the $(l-1)$-th layer, and moreover $|\tilde{A}^{(l-1)}| > 0$ if $|A^{(l)}| > 0$. Then, one can transform the multiplicative perturbation of the model into an equivalent form in the input space. Since the determinant of the equivalent perturbation is greater than 0, by applying Proposition 1, we conclude that multiplicative perturbation can lead to data transformation in the original input space.

To find the equivalent perturbation matrix $\tilde{A}^{(l-1)}$ in the $(l-1)$-th layer regarding the original one $A^{(l)}$ in the $l$-th layer, we solve the following equation:

$$W^{(l)}(I + \alpha A^{(l)})\, h^{(l)}(W^{(l-1)} z^{(l-1)}) = W^{(l)}\, h^{(l)}\big(W^{(l-1)}(I + \alpha \tilde{A}^{(l-1)}) z^{(l-1)}\big). \quad (15)$$

If $[z^{(l-1)}]_i \neq 0$ in each dimension, equation 15 can be rewritten as

$$A^{(l)}\, h^{(l)}(W^{(l-1)} z^{(l-1)}) = h^{(l)\prime}(W^{(l-1)} z^{(l-1)})\, W^{(l-1)} \tilde{A}^{(l-1)} z^{(l-1)}, \quad (16)$$

by applying the Taylor theorem to the right-hand side.¹ Then, since the ReLU activation is applied, we solve the equivalent formulation of equation 16, following

$$A^{(l)} W^{(l-1)} = W^{(l-1)} \tilde{A}^{(l-1)}. \quad (17)$$

Then, the solution of $\tilde{A}^{(l-1)}$ is $W^{(l-1),\dagger} A^{(l)} W^{(l-1)}$, with $\dagger$ being the Moore-Penrose inverse.

¹ With the usual adjustments that the equations only hold almost everywhere in parameter space.
We have thus justified that the multiplicative perturbation in the $l$-th layer can be transformed to that of the $(l-1)$-th layer. Therefore, the equivalent perturbation $\tilde{A}^{(l-1)}$ and the original perturbation in the $(l-1)$-th layer can be formulated as a joint perturbation $\bar{A}^{(l-1)}$, namely, $I + \alpha \bar{A}^{(l-1)}$, with

$$\bar{A}^{(l-1)} = A^{(l-1)} + \tilde{A}^{(l-1)} + \alpha A^{(l-1)} \tilde{A}^{(l-1)}. \quad (18)$$

Now, we justify that $\bar{A}^{(l-1)}$ can also lead to a distributional transformation. If $W^{(l-1),\dagger} = W^{(l-1),-1}$ (here, we implicitly assume that $n_{l-1} = n_{l-2}$) and the eigenvalues of the matrix $A^{(l)}$ are all greater than 0, then the eigenvalues of the matrix $\tilde{A}^{(l-1)}$ are all greater than 0. Again, we have $|I + \alpha \tilde{A}^{(l-1)}| > 1$. Then, the joint perturbation $\bar{A}^{(l-1)}$ satisfies:

$$\big|I + \alpha \bar{A}^{(l-1)}\big| = \big|(I + \alpha A^{(l-1)})(I + \alpha \tilde{A}^{(l-1)})\big| \quad (19)$$
$$= \big|I + \alpha A^{(l-1)}\big|\, \big|I + \alpha \tilde{A}^{(l-1)}\big| \quad (20)$$
$$> \big|I + \alpha A^{(l-1)}\big|. \quad (21)$$

By induction, the multiplicative perturbation of the model can be approximated by an input-space transformation. By applying Proposition 1, we know that $x$ and its perturbation-based transformed counterpart follow different data distributions. Thus we complete our proof.

A.3 PROOF OF LEMMA 1

Proof. For the $(L+1)$-layer ReLU network, we assume its model parameters and the model perturbation are the same as those of the corresponding layers of the $L$-layer ReLU network (except for the $(L+1)$-th layer). Then, by inspecting equation 20, the perturbation from the $(L+1)$-th layer can make the perturbation matrices of the $(L+1)$-layer network no smaller than those of the $L$-layer network regarding each layer (including the input space) of the joint multiplicative perturbation. Thus, we complete our proof.

A.4 EXCESS RISK BOUND

We further derive the learning bound of DOE. Here, we make standard assumptions for our learning problem. First, we assume that the Rademacher complexity $\mathcal{R}_n(\mathcal{H})$ of $\mathcal{H}$ is bounded, i.e., there is a $C_{\mathcal{H}}$ such that $\mathcal{R}_n(\mathcal{H}) \le C_{\mathcal{H}}/\sqrt{n}$, which holds for ReLU models. Further, the CE loss is bounded by $A_{\mathrm{CE}}$ and is $L_{\mathrm{CE}}$-Lipschitz continuous; the OE loss is bounded by $A_{\mathrm{OE}}$ and is $L_{\mathrm{OE}}$-Lipschitz continuous. To ease notation, we also define $\epsilon(C, L, A) = 2CL + A$. We are now ready to state the upper bound for the worst-case population performance of our DOE.

Theorem 2. Given ID and surrogate OOD training samples $S_{\mathrm{ID}}$ and $S^s_{\mathrm{OOD}}$, we write the optimal solution as $h^*_W = \arg\min_{h_W \in \mathcal{H}} L_{\mathrm{DOE}}(h_W; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}})$ and the empirical counterpart as $\hat{h}_W = \arg\min_{h_W \in \mathcal{H}} L_{\mathrm{DOE}}(h_W; S_{\mathrm{ID}}, S^s_{\mathrm{OOD}})$. Then, under the above assumptions, w.p. at least $1 - \delta$, we have

$$L_{\mathrm{DOE}}(\hat{h}_W; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) \le L_{\mathrm{DOE}}(h^*_W; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) + (2 + 4\lambda)\, \epsilon(C_{\mathcal{H}}, L, A) \Big/ \sqrt{\min\{|S_{\mathrm{ID}}|, |S^s_{\mathrm{OOD}}|\}}, \quad (24)$$

where $L = \max\{L_{\mathrm{CE}}, L_{\mathrm{OE}}\}$ and $A = \max\{A_{\mathrm{CE}}, A_{\mathrm{OE}}\}$.

Proof. We apply the Rademacher bound for $L_{\mathrm{CE}}$ and $L_{\mathrm{OE}}$, such that w.p. at least $1 - \delta$ we have

$$|L_{\mathrm{CE}}(h_W; D_{\mathrm{ID}}) - L_{\mathrm{CE}}(h_W; S_{\mathrm{ID}})| \le \epsilon(C_{\mathcal{H}}, L_{\mathrm{CE}}, A_{\mathrm{CE}}) / \sqrt{|S_{\mathrm{ID}}|}, \quad (25)$$
$$|L_{\mathrm{OE}}(h_W; D^s_{\mathrm{OOD}}) - L_{\mathrm{OE}}(h_W; S^s_{\mathrm{OOD}})| \le \epsilon(C_{\mathcal{H}}, L_{\mathrm{OE}}, A_{\mathrm{OE}}) / \sqrt{|S^s_{\mathrm{OOD}}|}, \quad (26)$$

for all $h_W \in \mathcal{H}$. When the hypothesis space $\mathcal{H}$ is large enough, we have

$$h^*_W = \arg\min_{h_W \in \mathcal{H}} L_{\mathrm{CE}}(h_W; D_{\mathrm{ID}}) = \arg\min_{h_W \in \mathcal{H}} \max_{P: \|P\| \le \rho} \mathrm{Regret}_{\mathrm{OE}}(h_{W+\alpha P}; D^s_{\mathrm{OOD}}). \quad (27)$$

Accordingly, by the definition of $L_{\mathrm{CE}}(h^*_W; D_{\mathrm{ID}})$, for any $\epsilon > 0$, there exists $h^\epsilon_W$ such that $L_{\mathrm{CE}}(h^\epsilon_W; D_{\mathrm{ID}}) \le L_{\mathrm{CE}}(h^*_W; D_{\mathrm{ID}}) + \epsilon$.
Thus, using $L_{\mathrm{CE}}(\hat{h}_W; S_{\mathrm{ID}}) \le L_{\mathrm{CE}}(h^\epsilon_W; S_{\mathrm{ID}})$, we can write

$$L_{\mathrm{CE}}(\hat{h}_W; D_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^*_W; D_{\mathrm{ID}}) = L_{\mathrm{CE}}(\hat{h}_W; D_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^\epsilon_W; D_{\mathrm{ID}}) + L_{\mathrm{CE}}(h^\epsilon_W; D_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^*_W; D_{\mathrm{ID}}) \quad (28)$$
$$\le L_{\mathrm{CE}}(\hat{h}_W; D_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^\epsilon_W; D_{\mathrm{ID}}) + \epsilon \quad (29)$$
$$= L_{\mathrm{CE}}(\hat{h}_W; D_{\mathrm{ID}}) - L_{\mathrm{CE}}(\hat{h}_W; S_{\mathrm{ID}}) + L_{\mathrm{CE}}(\hat{h}_W; S_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^\epsilon_W; D_{\mathrm{ID}}) + \epsilon \quad (30)$$
$$\le L_{\mathrm{CE}}(\hat{h}_W; D_{\mathrm{ID}}) - L_{\mathrm{CE}}(\hat{h}_W; S_{\mathrm{ID}}) + L_{\mathrm{CE}}(h^\epsilon_W; S_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^\epsilon_W; D_{\mathrm{ID}}) + \epsilon \quad (31)$$
$$\le 2 \sup_{h \in \mathcal{H}} |L_{\mathrm{CE}}(h; D_{\mathrm{ID}}) - L_{\mathrm{CE}}(h; S_{\mathrm{ID}})| + \epsilon. \quad (32)$$

Since equation 25 holds for all $h_W \in \mathcal{H}$ and any $\epsilon > 0$, we have

$$L_{\mathrm{CE}}(\hat{h}_W; D_{\mathrm{ID}}) \le L_{\mathrm{CE}}(h^*_W; D_{\mathrm{ID}}) + 2\epsilon(C_{\mathcal{H}}, L_{\mathrm{CE}}, A_{\mathrm{CE}}) / \sqrt{|S_{\mathrm{ID}}|}. \quad (33)$$

For any $h_W \in \mathcal{H}$, we also have

$$\sup_{P: \|P\| \le \rho} \mathrm{Regret}^{\alpha, P}_{\mathrm{OE}}(\hat{h}_W; D^s_{\mathrm{OOD}}) \le \sup_{P: \|P\| \le \rho} \Big[ L_{\mathrm{OE}}(\hat{h}_{W+\alpha P}; S^s_{\mathrm{OOD}}) - \min_{W'} L_{\mathrm{OE}}(h_{W'+\alpha P}; S^s_{\mathrm{OOD}}) \Big] + \frac{2\epsilon(C_{\mathcal{H}}, L_{\mathrm{OE}}, A_{\mathrm{OE}})}{\sqrt{|S^s_{\mathrm{OOD}}|}} \quad (34)$$
$$= \sup_{P: \|P\| \le \rho} \mathrm{WOR}_P(\hat{h}_W; S^s_{\mathrm{OOD}}) + \frac{2\epsilon(C_{\mathcal{H}}, L_{\mathrm{OE}}, A_{\mathrm{OE}})}{\sqrt{|S^s_{\mathrm{OOD}}|}} \quad (35)$$
$$\le \sup_{P: \|P\| \le \rho} \Big[ L_{\mathrm{OE}}(h^*_{W+\alpha P}; S^s_{\mathrm{OOD}}) - \min_{W'} L_{\mathrm{OE}}(h_{W'+\alpha P}; S^s_{\mathrm{OOD}}) \Big] + \frac{2\epsilon(C_{\mathcal{H}}, L_{\mathrm{OE}}, A_{\mathrm{OE}})}{\sqrt{|S^s_{\mathrm{OOD}}|}} \quad (36)$$
$$\le \sup_{P: \|P\| \le \rho} \Big[ L_{\mathrm{OE}}(h^*_{W+\alpha P}; D^s_{\mathrm{OOD}}) - \min_{W'} L_{\mathrm{OE}}(h_{W'+\alpha P}; D^s_{\mathrm{OOD}}) \Big] + \frac{4\epsilon(C_{\mathcal{H}}, L_{\mathrm{OE}}, A_{\mathrm{OE}})}{\sqrt{|S^s_{\mathrm{OOD}}|}} \quad (37)$$
$$= \sup_{P: \|P\| \le \rho} \mathrm{Regret}^{\alpha, P}_{\mathrm{OE}}(h^*_W; D^s_{\mathrm{OOD}}) + \frac{4\epsilon(C_{\mathcal{H}}, L_{\mathrm{OE}}, A_{\mathrm{OE}})}{\sqrt{|S^s_{\mathrm{OOD}}|}}, \quad (38)$$

indicating that

$$\sup_{P: \|P\| \le \rho} \mathrm{Regret}^{\alpha, P}_{\mathrm{OE}}(\hat{h}_W; D^s_{\mathrm{OOD}}) \le \sup_{P: \|P\| \le \rho} \mathrm{Regret}^{\alpha, P}_{\mathrm{OE}}(h^*_W; D^s_{\mathrm{OOD}}) + \frac{4\epsilon(C_{\mathcal{H}}, L_{\mathrm{OE}}, A_{\mathrm{OE}})}{\sqrt{|S^s_{\mathrm{OOD}}|}}. \quad (39)$$

Combining equation 33 and equation 39, we complete our proof.

The theorem states that the empirical solution leads to promising detection capability in expectation, which considers the uniform OOD performance via the WOR. A critical point is that the original surrogate OOD data remain very important (i.e., a small sample size of $S^s_{\mathrm{OOD}}$ leads to a loose excess bound), even though our method can synthesize additional OOD data.

Algorithm 1 Distributional-agnostic Outlier Exposure (DOE).

Input: ID and OOD samples from $D_{\mathrm{ID}}$ and $D^s_{\mathrm{OOD}}$, respectively; $P_{\mathrm{MA}} = 0$;
for $n_s$ = 1 to num_step do
  Sample $\mathcal{B}_{\mathrm{ID}}$ and $\mathcal{B}^s_{\mathrm{OOD}}$ from the ID and surrogate OOD samples, respectively; $P = 0$;
  if $n_s$ > num_warm then
    for $n_p$ = 1 to num_pert do
      $\mathrm{WOR}_G(h_W; \mathcal{B}^s_{\mathrm{OOD}}) = \| \nabla_\sigma|_{\sigma=1.0}\, L_{\mathrm{OE}}(\sigma \cdot h_{W+\alpha P}; \mathcal{B}^s_{\mathrm{OOD}}) \|_2$;
      $P \leftarrow P + \nabla_P\, \mathrm{WOR}_G(h_{W+\alpha P}; \mathcal{B}^s_{\mathrm{OOD}})$;
    end for
    $P_{\mathrm{MA}} \leftarrow (1 - \beta)\, P_{\mathrm{MA}} + \beta\, \mathrm{NORM}(P)$;
    $W \leftarrow W - \mathrm{lr} \cdot \nabla_W \big[ L_{\mathrm{CE}}(h_W; \mathcal{B}_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_{W+\alpha P_{\mathrm{MA}}}; \mathcal{B}^s_{\mathrm{OOD}}) \big]$;
  else
    $W \leftarrow W - \mathrm{lr} \cdot \nabla_W \big[ L_{\mathrm{CE}}(h_W; \mathcal{B}_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_W; \mathcal{B}^s_{\mathrm{OOD}}) \big]$;
  end if
end for
Output: detection model $h_W(\cdot)$.
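Below is a condensed PyTorch-style sketch of one DOE training step under Algorithm 1; it is our own illustration (not the authors' released implementation), simplified to a single perturbation step. The names `oe_loss`, `doe_step`, and `p_ma` (a dict of per-parameter moving-average perturbations initialized to zeros), as well as the global-norm choice for NORM(P), are our assumptions.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def oe_loss(logits):
    # cross-entropy to the uniform distribution (the OE regularizer, equation 2)
    return -F.log_softmax(logits, dim=1).mean()

def doe_step(model, optimizer, x_id, y_id, x_ood, p_ma, alpha=1e-2, lam=1.0, beta=0.6):
    """One simplified DOE training step (cf. Algorithm 1, with num_pert = 1)."""
    params = dict(model.named_parameters())

    # inner max: estimate the worst perturbation P via the gradient-norm regret (equations 7-8)
    P = {n: torch.zeros_like(p, requires_grad=True) for n, p in params.items()}
    perturbed = {n: params[n] + alpha * P[n] for n in params}
    sigma = torch.ones((), requires_grad=True)
    loss_oe = oe_loss(sigma * functional_call(model, perturbed, (x_ood,)))
    (g_sigma,) = torch.autograd.grad(loss_oe, sigma, create_graph=True)
    wor_g = g_sigma.abs()                                  # WOR_G of equation 7 (scalar gradient)
    grads = torch.autograd.grad(wor_g, list(P.values()))   # one ascent step from P = 0 (equation 8)
    with torch.no_grad():
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
        for name, g in zip(P.keys(), grads):               # NORM(P) and moving average (equation 9)
            p_ma[name] = (1 - beta) * p_ma[name] + beta * (g / total_norm)

    # outer min: update W on L_CE(ID) + lambda * L_OE under the averaged perturbation (equation 6)
    perturbed_ma = {n: params[n] + alpha * p_ma[n] for n in params}
    loss = F.cross_entropy(model(x_id), y_id) + lam * oe_loss(functional_call(model, perturbed_ma, (x_ood,)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```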
B ALGORITHM DESIGNS

We summarize the details of algorithm designs for a set of related learning schemes.

B.1 DISTRIBUTIONAL ROBUSTNESS AND DISTRIBUTION GAP

Overall, to demonstrate why our distributional-robust learning scheme can mitigate the OOD distribution gap, we consider the following two situations: (1) the true OOD distribution contains all the different OOD situations; and (2) the capacity of implicit data transformation is large enough.

For the first situation, we assume that the true OOD distribution contains all the different OOD situations, i.e., all samples with labels outside the considered label space. It is a reasonable consideration since we do not know what kinds of OOD data will be encountered during testing, and thus all the different OOD situations can be encountered. In this case, the surrogate and the (associated) implicit OOD data are subsets of the true OOD distribution since they do not have overlapping semantics with the ID distribution. Then, compared with OE, which learns only from surrogate OOD data, our DOE can further benefit from implicit OOD data. It can enlarge the coverage of OOD situations since the implicit data follow new data distributions beyond the surrogate OOD distribution (cf. Theorem 1).

For the second situation, we assume that the capacity of implicit data transformation is large enough to cover sufficiently many OOD cases. This is also a reasonable assumption since the transformation's capacity can benefit from layer-wise architectures (cf. Lemma 1), and deep models (which contain many layers) are typically adopted in OOD detection. Accordingly, although we do not know precisely what the true OOD distribution is, we can upper-bound the worst OOD performance to guarantee uniform performance of the model under various test situations (cf. Theorem 2). When the capacity is large enough (covering many test OOD situations), DOE performs well under these unseen test OOD data, thus mitigating the OOD distribution gap.

Algorithm 1 summarizes a stochastic realization of our DOE. The overall algorithm is run for num_step steps, with num_warm epochs of warm-up employing the original OE. Then, in each training step, we first calculate the perturbation $P$ regarding the OOD mini-batch for num_pert steps, and the normalized result is used to update the moving average $P_{\mathrm{MA}}$. With the resultant perturbation $P_{\mathrm{MA}}$ for the OE loss, we update the model via one step of mini-batch gradient descent. After training, we apply the MaxLogit scoring to discern ID and OOD cases.

B.3 WORST RISK-BASED DOE

Our proposed DOE searches for the model perturbation that leads to the WOR, which is the worst regret-based realization. In the main text, we state its superiority over the risk-based counterpart in Section 4, with experimental verification in Section 5.3. For completeness, we further describe the realization of the worst risk-based DOE, named DOE-risk. Similar to our proposed DOE, DOE-risk can also be formalized by a min-max learning problem:

$$L_{\mathrm{DOE\text{-}risk}}(h_W; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; D_{\mathrm{ID}}) + \lambda \max_{P: \|P\| \le 1} L_{\mathrm{OE}}(h_{W+\alpha P}; D^s_{\mathrm{OOD}}). \quad (40)$$

Then, for its stochastic realization, one step of gradient ascent is employed for the perturbation with respect to the mini-batch, namely,

$$P \leftarrow P + \nabla_P \big|_{P=0}\, L_{\mathrm{OE}}(h_{W+\alpha P}; \mathcal{B}^s_{\mathrm{OOD}}). \quad (41)$$

All other parts follow the realization of the original DOE. After training, we also employ the MaxLogit scoring to discern ID and OOD data.

B.4 IMPROVED OE WITH PREDEFINED PERTURBATIONS

We consider several implicit data transformations with predefined perturbations in Section 5.3. Here, we briefly summarize their realizations.

All-ones Matrices. For perturbation matrices with fixed values, we employ simple all-ones matrices, namely, $P_{\mathrm{one}} = \{I^{(l)}\}_{l=1}^L$, with $I^{(l)} \in \mathbb{R}^{n_{l-1} \times n_{l-1}}$ for $l = 1, \ldots, L$ being the all-ones matrix. Then, the associated learning objective can be written as:

$$L_{\mathrm{OE\text{-}one}}(h_W; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; D_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_{W+\alpha P_{\mathrm{one}}}; D^s_{\mathrm{OOD}}). \quad (42)$$

Gaussian Noise. When adopting Gaussian noise for random perturbation, we have $P_{\mathrm{gau}} = \{N^{(l)}\}_{l=1}^L$, with the elements drawn from a Gaussian distribution with 0 mean and 1 standard deviation. Then, the associated learning objective is of the form

$$L_{\mathrm{OE\text{-}gau}}(h_W; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; D_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_{W+\alpha P_{\mathrm{gau}}}; D^s_{\mathrm{OOD}}). \quad (43)$$

Uniform Noise. Similarly, one can adopt uniform noise for random perturbation, which we denote by $P_{\mathrm{uni}} = \{U^{(l)}\}_{l=1}^L$. The elements of $U^{(l)}$ are drawn from the uniform distribution over the interval $[-1, 1]$. Then, the associated learning objective is

$$L_{\mathrm{OE\text{-}uni}}(h_W; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; D_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_{W+\alpha P_{\mathrm{uni}}}; D^s_{\mathrm{OOD}}). \quad (44)$$

Distributionally robust optimization (DRO) (Rahimian & Mehrotra, 2019) is a traditional technique to make the model perform uniformly well.
Distributionally robust optimization (DRO) (Rahimian & Mehrotra, 2019) is a classical technique for making a model perform uniformly well. In OE, one can utilize DRO by replacing the original OE risk in equation 2 with its distributionally robust counterpart, namely,
$L_{\mathrm{DRO}}(h; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h; D_{\mathrm{ID}}) + \lambda \underbrace{\sup_{D^w_{\mathrm{OOD}} \in \mathcal{U}(D^s_{\mathrm{OOD}})} L_{\mathrm{OE}}(h; D^w_{\mathrm{OOD}})}_{L^{\mathrm{DRO}}_{\mathrm{OE}}(h; D^s_{\mathrm{OOD}})}$, (45)
where $\mathcal{U}(D^s_{\mathrm{OOD}})$ is the ambiguity set. Basically, $\mathcal{U}(D^s_{\mathrm{OOD}})$ constrains the difference between the surrogate OOD distribution $D^s_{\mathrm{OOD}}$ and its worst counterpart $D^w_{\mathrm{OOD}}$. In expectation, equation 45 makes the training procedure cover a wide range of potential test OOD distributions in $\mathcal{U}(D^s_{\mathrm{OOD}})$, guaranteeing uniform performance by bounding the worst OOD risk induced by $D^w_{\mathrm{OOD}}$.

The ambiguity set $\mathcal{U}(D^s_{\mathrm{OOD}})$ is defined by $\{D^w : \mathrm{Div}_f(D^w \,\|\, D^s_{\mathrm{OOD}}) \le \rho\}$, with $\mathrm{Div}_f(\cdot\|\cdot)$ the f-divergence and $\rho$ the constraint. For the worst OOD distribution that leads to the worst OOD risk, a weighting-based searching scheme can be derived for the empirical counterpart of equation 45, following the form of a re-weighted empirical risk, namely,
$\sum_x p_x\, \ell_{\mathrm{OE}}(h(x)) \quad \text{s.t.} \quad p \in \{p \mid p \ge 0 \ \text{and}\ D_f(p \,\|\, \mathbf{1}) \le \rho\}$. (46)
However, due to this equivalent re-weighting scheme, DRO actually assumes that the support of test-time OOD data lies within that of the training situation. This assumption is violated in OOD detection since the surrogate OOD data can be largely different from the unseen situations, i.e., their support sets can differ greatly. Therefore, traditional DRO cannot lead to much-improved results compared with the original OE, which we demonstrate with the experimental results in Section 5.3. Note that some DRO methods (Krueger et al., 2021) try to search for worst distributions that go beyond the support set of the training data. However, they rely on more than one training domain, which is not directly applicable in OOD detection.

$\chi^2$-divergence DRO. Hashimoto et al. (2018) define the ambiguity set by the $\chi^2$ divergence, given by $D_{\chi^2}(P\|Q) = \int \big(\frac{\mathrm{d}P}{\mathrm{d}Q} - 1\big)^2 \mathrm{d}Q$. They assume that the data distribution can be written as a mixture of sub-populations, i.e., $D^s_{\mathrm{OOD}} = \sum_{k \in [K]} \alpha_k D^{s,k}_{\mathrm{OOD}}$. Then, one can derive the dual form of $L^{\mathrm{DRO}}_{\mathrm{OE}}(h; D^s_{\mathrm{OOD}})$ in equation 46 with respect to the $\chi^2$ divergence, namely,
$\big(2(1/\alpha_{\min} - 1)^2 + 1\big)^{1/2} \big(\mathbb{E}_{D^s_{\mathrm{OOD}}}\big[\max\{\ell_{\mathrm{OE}}(h(x)) - \eta, 0\}^2\big]\big)^{1/2} + \eta$, (47)
where $\alpha_{\min} = \min_k \alpha_k$. Then, Hashimoto et al. suggest that for deep models that rely on stochastic gradient descent, one can utilize the dual objective in equation 47, leading to the learning objective of the form:
$L_{\chi^2}(h; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h; D_{\mathrm{ID}}) + \lambda\, \mathbb{E}_{D^s_{\mathrm{OOD}}}\big[\max\{\ell_{\mathrm{OE}}(h(x)) - \eta, 0\}^2\big]$, (48)
where $\eta$ is treated as a hyperparameter. Overall, equation 48 ignores all data points whose loss values are below $\eta$, while losses above $\eta$ are up-weighted due to the squaring operation.

Wasserstein DRO. The ambiguity set with the Wasserstein distance has also attracted much attention in the literature. Specifically, the Wasserstein distance is given by
$W_r(P, Q) = \inf_{O \in \mathcal{J}(P,Q)} \big( \int\!\!\int \|\zeta - \zeta'\|^r \,\mathrm{d}O(\zeta, \zeta') \big)^{1/r}$, (49)
where $\mathcal{J}(P,Q)$ denotes the set of couplings of $P$ and $Q$. However, the direct calculation of Wasserstein DRO is intractable, and Kwon et al. (2020) propose a simple learning method that leads to an effective approximation. Specifically, if the loss function is differentiable and its gradient is Hölder continuous, one can optimize the following surrogate objective as an effective approximation for the optimal solution of Wasserstein DRO:
$L_{\mathrm{WDRO}}(h; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h; D_{\mathrm{ID}}) + \lambda \big( L_{\mathrm{OE}}(h; D^s_{\mathrm{OOD}}) + \mathbb{E}_{D^s_{\mathrm{OOD}}}\|\nabla_x \ell_{\mathrm{OE}}(h(x))\| \big)$. (50)
Please refer to (Kwon et al., 2020) for an in-depth discussion.
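For concreteness, the following hedged sketch shows how the two DRO surrogates above could be implemented on top of per-sample OE losses; the value η = 0.5 is a placeholder, the trade-off factor of equation 47 is assumed to be folded into λ, and the helper names are ours rather than any released API.

```python
import torch
import torch.nn.functional as F

def per_sample_loss_oe(logits):
    # per-sample OE loss: cross-entropy towards the uniform label distribution (up to a constant)
    return -F.log_softmax(logits, dim=1).mean(dim=1)        # shape: (batch,)

def loss_oe_chi2(logits_ood, eta=0.5):
    # chi^2-DRO dual surrogate (Eq. 48): samples with OE loss below eta are ignored,
    # larger excess losses are up-weighted by squaring; eta is a hyper-parameter.
    excess = torch.clamp(per_sample_loss_oe(logits_ood) - eta, min=0.0)
    return (excess ** 2).mean()

def loss_oe_wdro(model, x_ood):
    # Wasserstein-DRO surrogate (Eq. 50): OE risk plus the norm of its input gradient.
    x = x_ood.clone().requires_grad_(True)
    ell = per_sample_loss_oe(model(x))
    grad_x, = torch.autograd.grad(ell.sum(), x, create_graph=True)
    return ell.mean() + grad_x.flatten(1).norm(dim=1).mean()

# The full objectives add the usual L_CE term on ID data, e.g.
#   loss = F.cross_entropy(model(x_id), y_id) + lam * loss_oe_chi2(model(x_ood))
```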
Adversarial training (AT) (Madry et al., 2018a) directly modifies input features in directions that increase the risk, which can also be interpreted through the lens of distributionally robust learning (Sinha et al., 2018). For OE, one can modify the features of surrogate OOD data by adding adversarial noise, namely,
$\delta_p \leftarrow \mathrm{Proj}\big(\delta_p + \kappa\, \mathrm{sign}(\nabla_{\delta_p} \ell_{\mathrm{OE}}(h(x + \delta_p)))\big)$, (51)
where $\kappa$ controls the magnitude of the perturbation, $\mathrm{Proj}$ is the clipping operation that keeps $\delta_p$ valid, and $\mathrm{sign}$ is the signum function. Equation 51 iterates for several steps, and $\delta_p$ is typically initialized by random noise. Applying the adversarial noise to surrogate OOD data, the resultant learning objective is
$L_{\mathrm{AT}}(h; D_{\mathrm{ID}}, D^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h; D_{\mathrm{ID}}) + \lambda\, \mathbb{E}_{D^s_{\mathrm{OOD}}}\big[\ell_{\mathrm{OE}}(h(x + \delta_p))\big]$. (52)
AT can be viewed as a direct way of transforming data. However, as demonstrated in Section 5.3, the associated transform function is less flexible than that of our DOE, and the performance of AT is accordingly inferior to our DOE.
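A minimal sketch of the attack in equation 51 and the objective in equation 52 is given below, assuming image-like inputs; the step size κ, the projection radius, the number of iterations, and the helper names are illustrative choices rather than the settings used in Section 5.3.

```python
import torch
import torch.nn.functional as F

def loss_oe(logits):
    # OE loss: cross-entropy towards the uniform label distribution (up to a constant)
    return -F.log_softmax(logits, dim=1).mean(dim=1).mean()

def adversarial_ood(model, x_ood, kappa=2.0 / 255, eps=8.0 / 255, num_steps=5):
    """PGD-style attack of Eq. 51 on surrogate OOD inputs (step sizes are placeholders)."""
    delta = torch.empty_like(x_ood).uniform_(-eps, eps)     # random initialisation
    for _ in range(num_steps):
        delta.requires_grad_(True)
        grad, = torch.autograd.grad(loss_oe(model(x_ood + delta)), delta)
        # ascend the OE loss, then project back to the valid perturbation set
        delta = torch.clamp(delta.detach() + kappa * grad.sign(), -eps, eps)
    return (x_ood + delta).detach()

# Eq. 52: the AT objective adds lambda * loss_oe(model(adversarial_ood(model, x_ood)))
# to the usual L_CE term on ID data.
```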
Table 4: Comparison of DOE and advanced methods on the CIFAR-10 dataset. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a shaded row of results indicates the best method among previous post-hoc (or fine-tuning) methods; and a bold font indicates the best result in a column. Each OOD column reports FPR95 ↓ / AUROC ↑.

Method        SVHN          LSUN          iSUN          Texture       Places365     Average       ID ACC
Post-hoc Approach
MSP           65.60/81.23   23.05/96.74   56.55/89.59   61.45/87.47   62.20/86.95   53.77/88.40   94.28
ODIN          55.30/83.65    8.55/98.48   36.95/92.66   54.00/84.34   59.20/84.33   42.80/88.69   94.28
Mahalanobis    9.35/98.00   45.15/92.90   37.15/93.54   11.80/97.92   71.45/83.68   34.98/93.21   94.28
Free Energy   55.40/76.92    3.65/99.21   31.65/93.26   52.80/84.20   45.35/87.77   37.77/88.27   94.28
ReAct         76.95/74.91    0.95/99.21   64.15/83.19   81.00/72.59   68.05/81.16   58.22/82.21   94.28
KNN           31.29/95.01   26.84/95.33   29.48/94.28   41.21/92.08   44.02/90.47   34.56/93.43   94.28
Fine-tuning Approach
OE             4.30/99.12    0.85/99.76   11.45/98.17   17.35/97.03   28.10/95.17   12.41/97.85   94.58
CSI           20.48/96.63    6.18/98.78    5.49/98.99   21.07/96.27   33.73/93.68   17.39/96.87   94.33
SSD+           0.28/99.10    4.07/98.71   35.96/95.28    8.90/98.34   25.00/95.40   14.84/97.36   95.46
MixOE         22.35/96.21    1.25/99.68    7.75/98.65   17.15/96.75   19.25/96.68   13.55/97.59   94.79
VOS           35.20/91.57    6.15/98.85   26.95/93.65   49.35/85.06   40.10/88.69   31.55/91.56   95.45
DOE            2.65/99.36    0.00/99.89    0.75/99.67    7.25/98.47   15.10/96.53    5.15/98.78   94.18

Table 5: Comparison of DOE and advanced methods on the CIFAR-100 dataset. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a shaded row of results indicates the best method among post-hoc (or fine-tuning) methods; and a bold font indicates the best result in a column. Each OOD column reports FPR95 ↓ / AUROC ↑.

Method        SVHN          LSUN          iSUN          Texture       Places365     Average       ID ACC
Post-hoc Approach
MSP           80.90/75.19   51.25/87.93   85.35/73.48   83.40/71.94   82.75/72.66   76.73/76.24   73.98
ODIN          70.75/72.57   63.38/76.55   60.23/74.83   60.31/76.96   61.61/77.72   63.25/75.72   73.98
Mahalanobis   58.45/86.54   99.80/52.40   34.70/92.96   44.40/90.13   80.50/68.14   65.57/78.03   73.98
Free Energy   89.70/73.09   16.90/96.96   86.45/75.20   82.75/73.76   82.00/73.56   71.56/78.51   73.98
ReAct         79.10/81.73    8.50/98.46   92.55/64.78   85.30/73.58   84.25/72.47   69.94/78.21   73.98
KNN           49.73/88.06   31.94/93.81   37.11/91.86   48.30/87.96   84.16/71.96   50.24/86.73   73.98
Fine-tuning Approach
OE            55.15/85.62   11.65/96.92   48.40/87.90   47.35/87.02   65.85/80.57   45.68/87.61   75.33
CSI           62.96/84.75   96.47/49.28   95.91/52.98   78.30/71.25   85.00/71.45   83.72/65.94   74.30
SSD+          13.30/97.45   82.55/86.03   38.74/91.69   71.24/82.52   77.41/79.20   56.65/87.38   75.91
MixOE         83.80/74.26   20.10/96.26   54.20/87.24   55.80/86.13   46.30/88.39   52.04/86.46   75.81
VOS           60.22/88.57   85.45/83.62   50.57/88.80   80.65/74.22   90.30/64.73   73.43/79.98   73.55
DOE           19.20/96.43    4.15/99.02   12.80/97.65   32.75/91.88   58.00/84.87   25.38/93.97   74.51

C FURTHER EXPERIMENTS

This section provides further experiments to demonstrate the effectiveness of our proposal.

C.1 CIFAR BENCHMARKS

We first summarize the main experiments on the CIFAR benchmarks for common OOD detection in Tables 4-5. A brief version can also be found in Table 1 in the main text. Overall, our DOE reveals superior average performance regarding both evaluation metrics, FPR95 and AUROC. However, when it comes to individual test-time OOD datasets, our DOE may not work best in all situations (e.g., KNN on the OOD dataset Places365 with the ID dataset CIFAR-100). We emphasize that this does not challenge the generality of our proposal, since DOE demonstrates stable improvements over the original OE. An interesting point is that if OE can further benefit from the latest progress in OOD scoring, the performance of our method in effective OOD detection could be improved further, which we leave for future study.

Now, we compare OE and DOE on the CIFAR benchmarks over five individual trials in Table 6, where we report the mean results and the standard deviations. As we can see, our DOE not only leads to improved average performance in OOD detection but also yields more stable results than the original OE. The superiority of DOE in stability may lie in the fact that the target model can learn from more data than in the OE case, further demonstrating the effectiveness of our proposal.

Also, we compare the performance of OE and DOE when using the MSP scoring and the Max Logit scoring, with the experiments summarized in Table 7. For both scoring functions, our DOE always achieves superior performance to OE, demonstrating that our proposal can genuinely mitigate the OOD distribution gap issue in OOD detection. Further, comparing the results across the two scoring strategies, we observe that using the Max Logit scoring leads to better results than using the MSP scoring. Therefore, we choose the Max Logit scoring in our DOE.
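For reference, the two scoring functions compared in Table 7 can be sketched as follows; the thresholding convention in the closing comment reflects the FPR95 metric and is illustrative.

```python
import torch
import torch.nn.functional as F

def msp_score(logits):
    # maximum softmax probability (MSP): larger scores indicate more ID-like inputs
    return F.softmax(logits, dim=1).max(dim=1).values

def max_logit_score(logits):
    # Max Logit: the largest unnormalised logit, the scoring adopted with DOE in this paper
    return logits.max(dim=1).values

# At test time, an input is flagged as OOD when its score falls below a threshold,
# e.g., the value that retains 95% of ID data (the FPR95 operating point).
```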
Table 6: Comparison of DOE and OE on the CIFAR benchmarks over 5 individual trials. ↓ (or ↑) indicates that smaller (or larger) values are preferred; and a bold font indicates the best results in the corresponding column. Each entry reports FPR95 ↓ / AUROC ↑ as mean ± standard deviation; the first OE/DOE pair uses CIFAR-10 as ID data and the second uses CIFAR-100.

Method   SVHN                       LSUN                       iSUN                       Texture                     Places365
OE        2.91±0.60 / 99.34±0.08     0.46±0.13 / 99.80±0.03     9.05±1.56 / 98.45±0.20    18.08±0.88 / 96.91±0.11     28.02±0.68 / 95.01±0.08
DOE       2.66±0.10 / 99.41±0.01     0.15±0.01 / 99.91±0.01     1.26±0.11 / 99.48±0.04     7.40±0.27 / 98.32±0.02     15.42±0.42 / 96.33±0.03
OE       55.48±1.46 / 86.99±0.99    12.28±0.95 / 86.76±0.19    44.38±2.75 / 88.54±0.90    47.57±1.42 / 86.93±0.21     65.05±1.27 / 80.83±0.23
DOE      28.47±0.35 / 95.45±0.02     5.27±0.18 / 98.51±0.01    22.28±0.92 / 96.36±0.10    40.00±0.54 / 91.34±0.06     50.70±0.27 / 88.42±0.03

Table 7: Comparison of DOE and OE on the CIFAR and ImageNet benchmarks with the MSP scoring and the Max Logit scoring. ↓ (or ↑) indicates that smaller (or larger) values are preferred; and a bold font indicates the best results in the corresponding column. Each entry reports FPR95 ↓ / AUROC ↑.

Method   CIFAR-10 (Max Logit)   CIFAR-10 (MSP)   CIFAR-100 (Max Logit)   CIFAR-100 (MSP)   ImageNet (Max Logit)   ImageNet (MSP)
OE       11.07/97.98            12.41/97.85      35.95/92.42             45.68/87.61       71.25/79.18            73.80/78.90
DOE       5.15/98.78             7.83/98.46      25.38/93.97             30.50/92.75       59.83/83.54            65.20/80.83

C.2 IMAGENET BENCHMARKS

Table 8 lists the detailed experiments on the ImageNet benchmark. Overall, our DOE achieves superior average performance against all the considered baselines. Further, for the cases with iNaturalist and Places365, which are regarded as challenging OOD datasets in the ImageNet setting, our DOE also achieves considerable improvements over all other advanced methods. This demonstrates that our DOE can also work well for challenging detection scenarios with an extremely large semantic space and complex data patterns.

C.3 TRAINING FROM SCRATCH WITH DOE

Table 9: Comparison of OE and DOE when training from scratch on the CIFAR benchmarks. Each entry reports FPR95 / AUROC.

from scratch   CIFAR-10       CIFAR-100
OE             15.46/95.77    46.02/88.14
DOE             5.85/98.52    26.47/93.16

This section further considers the setup of training from scratch, where we mainly focus on the detection performance of OE and DOE on the CIFAR benchmarks. Specifically, the models are trained for 150 epochs for both OE and DOE via stochastic gradient descent. We fix the initial learning rate to 0.1, divided by 10 every 30 epochs. For DOE, the warm-up epochs are set to 90. All other hyper-parameters follow the same setup as in Section 5. We summarize the experimental results in Table 9. As we can see, our DOE can still improve OE by a large margin, revealing that our method also applies to the setup of training from scratch.

C.4 WORST OOD REGRET AND WORST OOD RISK

Another issue related to distributional robustness is our definition of the worst OOD distribution. It is defined by the OOD regret $L_{\mathrm{OE}}(h; D) - \inf_{h' \in \mathcal{H}} L_{\mathrm{OE}}(h'; D)$, and we claim its superiority over its risk counterpart, i.e., $L_{\mathrm{OE}}(h; D)$ (cf., Appendix B).

Table 8: Comparison of DOE and advanced methods on the ImageNet dataset. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a shaded row of results indicates the best method among post-hoc (or fine-tuning) methods; and a bold font indicates the best results in the corresponding column. Each OOD column reports FPR95 ↓ / AUROC ↑.
Method        iNaturalist   SUN           Places365     Texture       Average       ID ACC
Post-hoc Approach
MSP           72.98/77.22   80.89/74.24   76.69/77.81   70.73/78.58   75.32/76.96   74.55
ODIN          63.85/77.78   89.98/61.80   88.00/67.17   67.87/77.40   77.43/71.04   74.55
Mahalanobis   95.90/60.56   95.42/45.33   98.90/44.65   55.80/84.60   86.50/58.78   74.55
Free Energy   69.10/77.39   82.36/76.08   76.15/80.23   56.97/84.32   71.14/79.50   74.55
ReAct         56.11/84.94   82.79/75.87   75.00/80.72   70.37/82.16   70.31/81.42   74.55
KNN           65.40/83.73   75.62/77.33   79.20/74.34   40.80/86.45   64.75/80.91   74.55
Fine-tuning Approach
OE            78.31/75.23   80.10/76.55   70.41/81.78   66.38/82.04   73.80/78.90   75.51
CSI           75.85/82.63   90.62/47.83   94.90/44.62   85.85/87.11   86.80/65.54   74.27
SSD+          59.60/85.54   75.62/73.80   83.60/68.11   39.40/82.40   64.55/77.46   78.80
MixOE         80.51/74.30   74.62/79.81   84.33/69.20   58.00/85.83   74.36/77.28   74.62
VOS           94.83/57.69   98.72/38.50   87.75/65.65   70.20/83.62   87.87/61.36   74.43
DOE           55.87/85.98   80.94/76.26   67.84/83.05   34.67/88.90   59.83/83.54   75.50

Table 10: Robust learning with the worst OOD regret and the worst OOD risk.

Methods   DOE-regret   DOE-risk
FPR95     25.74        30.33
AUROC     94.25        94.01

Table 10 summarizes the results on the CIFAR-100 dataset, comparing searching for the worst OOD regret (DOE-regret) against searching for the worst OOD risk (DOE-risk). Both realizations improve the results compared with the original OE. However, the regret-based DOE reveals better results than the risk-based one, with a further improvement of 4.59 in FPR95. Here, the worst regret better indicates the worst OOD distribution than the risk counterpart, and thus DOE-regret, as employed in Algorithm 1, demonstrates superior results in Table 10.

C.5 EFFECT OF HYPER-PARAMETERS

We study the effect of the hyper-parameters on the final performance of our DOE, where we consider the trade-off parameter λ, the perturbation strength α, the smoothing strength β, the perturbation steps num_pert, and the warm-up epochs num_warm. We also study the case of sub-model perturbation, where the model perturbation is applied only to a part of the whole model. All the above experiments are conducted on the CIFAR-100 dataset.

As one can see from Tables 11-14, our DOE is fairly robust to different choices of the hyper-parameters (i.e., λ, β, num_pert, and num_warm), and the results are superior to OE across most of the hyper-parameter settings. Nevertheless, a proper choice of the hyper-parameters can further improve OOD detection, reflecting that all the introduced hyper-parameters are useful in our proposed DOE. In Table 15, we further demonstrate that randomly selecting α from the candidates reveals superior performance to assigning fixed values. Note that random selection covers a wider range of OOD situations than fixed values; since the model can then learn from more implicit OOD data, its capability in OOD detection is better than in the fixed-value case. Finally, we show the experimental results with sub-model perturbation in Table 16, where only a part of the model is perturbed in our DOE. Here, we separate the WRN-40-2 into 3 blocks, following the block structure in (Zagoruyko & Komodakis, 2016). As we can see, perturbing the whole model reveals superior performance compared with the cases of sub-model perturbation. This can be explained by our Lemma 1, in that perturbing the whole model benefits the data transformation most from the layer-wise structure of deep models.
Then, with the more flexible form of the transform function, perturbing the whole model reveals better results than the cases with sub-model perturbation, since the model can learn from more diverse (implicit) OOD data.

Table 11: DOE on CIFAR-100 with various λ.
λ     FPR95   AUROC
0.1   41.43   90.82
0.5   32.35   93.66
1.0   25.80   93.77
1.5   27.59   94.20
2.0   25.52   94.35
2.5   25.84   94.34
3.0   26.04   94.31
3.5   25.60   94.32
4.0   25.96   94.28
4.5   26.35   94.23

Table 12: DOE on CIFAR-100 with various β.
β     FPR95   AUROC
0.1   23.52   94.48
0.2   24.31   64.36
0.3   23.12   94.47
0.4   23.90   94.41
0.5   24.79   94.25
0.6   25.11   94.22
0.7   26.37   93.95
0.8   28.12   93.71
0.9   30.23   93.19
1.0   27.25   93.77

Table 13: DOE on CIFAR-100 with various num_pert.
num_pert   FPR95   AUROC
1          25.59   94.50
2          24.90   94.83
3          26.75   94.22
4          25.37   94.41
5          24.62   94.83
6          25.37   94.40
7          24.60   94.38
8          25.80   94.15
9          24.83   94.71
10         24.54   94.88

Table 14: DOE on CIFAR-100 with various num_warm.
num_warm   FPR95   AUROC
1          25.50   94.09
2          25.21   93.94
3          24.81   94.03
4          23.96   94.39
5          25.50   94.11
6          26.18   93.79
7          26.02   94.49
8          29.01   94.07
9          36.05   92.19
10         35.33   92.89

Table 15: DOE on CIFAR-100 with various α.
candidate α                 FPR95   AUROC
{1e-1}                      53.20   86.49
{1e-2}                      27.29   94.04
{1e-3}                      31.75   93.71
{1e-4}                      32.72   93.63
{1e-1, 1e-2}                36.14   91.21
{1e-2, 1e-3}                28.56   94.17
{1e-3, 1e-4}                34.35   93.49
{1e-1, 1e-2, 1e-3}          27.59   94.20
{1e-2, 1e-3, 1e-4}          30.83   93.69
{1e-1, 1e-2, 1e-3, 1e-4}    25.20   94.33

Table 16: DOE on CIFAR-100 with sub-model perturbation.
Perturbed Block       FPR95   AUROC
Block 1               35.76   93.82
Block 2               32.88   93.73
Block 3               32.50   93.54
Block 1-2             27.29   94.06
Block 2-3             31.89   93.63
Block 1 and Block 3   28.07   94.04
Whole Model           25.25   94.47