# Certifiable Out-of-Distribution Generalization

Nanyang Ye1*, Lin Zhu1, Jia Wang2, Zhaoyu Zeng1, Jiayao Shao3, Chensheng Peng1, Bikang Pan4, Kaican Li5, Jun Zhu6

1 Shanghai Jiao Tong University, Shanghai, China; 2 University of Cambridge, Cambridge, United Kingdom; 3 University of Warwick, Warwick, United Kingdom; 4 ShanghaiTech University, Shanghai, China; 5 Huawei Noah's Ark Lab, Hong Kong, China; 6 Tsinghua University, Beijing, China

ynylincoln@sjtu.edu.cn, zhulin sjtu@sjtu.edu.cn, jw2054@cam.ac.uk, zlatanwilliams@sjtu.edu.cn, jiayao shao@163.com, pesiterswift@sjtu.edu.cn, panbk@shanghaitech.edu.cn, mjust.lkc@gmail.com, dcszj@mail.tsinghua.edu.cn

*Corresponding author. The full Appendix is available on arXiv.

## Abstract

Machine learning methods suffer from test-time performance degeneration when faced with out-of-distribution (OoD) data, whose distribution is not necessarily the same as the training data distribution. Although a plethora of algorithms have been proposed to mitigate this issue, it has been demonstrated that achieving better performance than ERM simultaneously on different types of distribution-shift datasets is challenging for existing approaches. Besides, without theoretical guarantees, it is unknown how and to what extent these methods work on any OoD datum. In this paper, we propose a certifiable out-of-distribution generalization method that provides provable OoD generalization performance guarantees via a functional optimization framework leveraging random distributions and max-margin learning for each input datum. With this approach, the proposed algorithmic scheme can provide certified accuracy for each input datum's prediction on the semantic space, and it achieves better performance simultaneously on OoD datasets dominated by correlation shifts or diversity shifts. Our code is available at https://github.com/ZlatanWilliams/StochasticDisturbanceLearning.

## Introduction

Deep learning has achieved success in various domains, including computer vision and natural language processing. However, traditional algorithms only exhibit human-superior behaviors on datasets that are independent and identically distributed (i.i.d.) (Silver et al. 2016), while their performance degenerates when exposed to out-of-distribution (OoD) data. This precludes many applications, especially in high-risk sectors such as healthcare, autonomous driving, and securities, where the distribution shift between training and testing data is ubiquitous and the impact of machines' mistakes is severe. Many previous works have been devoted to OoD generalization with empirical performance improvements (Ye et al. 2021; Ahuja et al. 2020a). However, due to the complexity of the OoD generalization problem, where the model has to generalize across various unseen domains, it still remains largely unsolved. For example, it was recently found that few methods can outperform empirical risk minimization (ERM) with extensive data augmentation (Gulrajani and Lopez-Paz 2021). It has also been demonstrated that proposed OoD generalization algorithms exhibit preferences for one type of distribution shift and fail on the other. Besides, many empirically well-performing algorithms are provided without performance guarantees (Ye et al. 2021; Wiles et al. 2022).
All this begs the following question: can we have a certifiable OoD generalization algorithm that works well under multiple types of distribution shifts? Previous pioneering work (Zhao et al. 2019; Ben-David et al. 2010) proposed upper bounds for generalization performance, but the bounds are not computable. In this paper, we propose the Stochastic Disturbance Learning (SDL) algorithm, which aims to achieve state-of-the-art performance under different kinds of distribution shifts while certifiably providing correct predictions within some set around each input datum in the semantic space. To achieve this, we derive the performance guarantee by measuring the performance of deep neural networks (DNNs) under random disturbances via a functional optimization framework that does not rely on a local convexity assumption on DNNs. Our main contributions are as follows:

1. We propose an OoD generalization algorithm that achieves better performance than ERM on OoD datasets dominated by correlation shifts and diversity shifts simultaneously (see the Preliminaries for definitions of these shifts).
2. We provide a certification methodology with max-margin learning that can provide theoretical performance guarantees. Ablation studies confirm the effectiveness of the algorithmic components. The theoretical analysis also helps explain why the commonly-used dropout method can improve DNNs' generalization abilities.

## Preliminaries

This section briefly summarizes the literature on OoD generalization methods and introduces the topics that motivate the proposed algorithmic framework.

Figure 1: The causal graph for the mentioned random variables. The directed arrow (resp. dotted line) indicates causal (resp. confounding) relations between two random variables. More specifically, the directed arrow from A to B indicates a causal path from A to B.

### OoD Generalization

OoD generalization is the task of generalizing a model's performance under distribution shifts between the training and the unseen testing distributions. This is in stark contrast with adversarial defense, where the objective is to have a classifier robust against steered minor perturbations added to the images, which resemble noise in the images. OoD generalization focuses on generalizing to data subsuming the same semantic information for classification but with different environment or style information, arguably more commonly observed in practical scenarios than deliberate adversarial attacks. For example, an autonomous driving system has to generalize to unseen environments for safety. Existing OoD generalization algorithms can be broadly divided into four genres: domain generalization-based methods that focus on learning coherent patterns in data collected from diverse environments (Huang et al. 2020; Li et al. 2018b; Ganin et al. 2016); invariant learning-based methods that exclude spurious correlations existing in the data (Arjovsky et al. 2019; Ahuja et al. 2020b, 2021); distributionally robust optimization methods that fabricate challenging data distributions from the original one (Sagawa* et al. 2020); and causal learning-based methods that leverage causal inference techniques (Shen et al. 2020b,a; Tran et al. 2016). These methods have demonstrated empirical improvements on OoD generalization tasks, while their theoretical performance on OoD generalization is largely untapped.
### Diversity and Correlation Shifts

It was recently found that multiple dimensions exist in OoD generalization datasets, and algorithms typically perform better than ERM on one dimension but not as well on the other (Wiles et al. 2022; Ye et al. 2021). In (Ye et al. 2021), these dimensions are delineated as diversity shift and correlation shift. The diversity shift is formally defined as the difference between the environmental semantic feature's training and testing probability density functions (p.d.f.s) on the overall differences between the two distributions' supports (a function f's support is informally interpreted as the subset of the domain where f takes non-zero values). In comparison, the correlation shift is defined as the difference between the training and testing marginal p.d.f.s over the environmental semantic feature on the intersection of the training and testing distributions' supports.

Figure 2: Summary of certifiable out-of-distribution generalization. The black center point is the living input datum in the semantic space, which should be classified as a cat. The black circle denotes the certifiable range, inside which data are classified as cats with theoretical guarantees by our method. The blue region outside the black circle is the semantic space where input samples are also classified as cats but may have no guarantees. The green region is the semantic space where input samples are classified as other types, such as bears, elephants, and giraffes. The red lines denote the model's decision boundary shaped by the proposed max-margin training method to separate different categories.

For example, consider data pairs (X, Y), where X is the covariate variable and Y is a dependent variable (e.g., the label); we wish to model Y from X (i.e., we wish to find a function $f: X \to Y$). Z is defined to be the semantic/feature/latent space (the semantic feature Z will be formally defined in the following section and can be viewed equivalently as the latent space; this differs from some other papers where "semantic feature" means causal feature), with $Z_c \subseteq Z$ and $Z_n \subseteq Z$ being vectors of latent (unobserved) confounding causal and non-causal random variables. The causal graph in Figure 1 illustrates the settings described above. In other words, we can interpret X as being caused by $Z_c$ and $Z_n$, while Y is determined by $Z_c$ only. The reason for assuming this model is that in most real-world distribution shifts, the interpretation is valid and produces productive outcomes. Intuitively, diversity shifts summarise the traits of novel features between the training and testing distributions. In contrast, correlation shifts summarise the spurious correlation between the label Y and the non-causal features $Z_n$, which can also be interpreted as the environmental semantic feature. It was found that very few methods can achieve better performance than ERM concurrently on the two kinds of OoD shifts (Ye et al. 2021; Huang et al. 2022).

### Theoretical Results in OoD Generalization

As mentioned in the previous paragraphs, there is no theoretical approach that solves the OoD generalization problem completely, and existing theories mostly require assumptions and optimization under constraints. One of the most popular directions is distributionally robust optimization (DRO) (Rahimian and Mehrotra 2019), which falls under the robust optimization framework. The idea behind DRO is to minimize the worst-case risk over an uncertainty set of distributions centered on the training distribution.
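As a point of reference for the discussion below (this is the textbook DRO formulation, not a result specific to this paper), the worst-case objective can be written as

$$
\min_{\theta} \; \sup_{Q:\, D(Q,\, P_{\mathrm{train}}) \le \rho} \; \mathbb{E}_{(x,y) \sim Q}\big[\ell(f_\theta(x), y)\big],
$$

where $D$ is a chosen discrepancy between distributions and $\rho$ is the radius of the uncertainty set around the training distribution $P_{\mathrm{train}}$.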
Under this scenario, there is freedom in choosing a sensible distance measure (e.g., f-divergence (Weber et al. 2022; Duchi and Namkoong 2019), Wasserstein distance (Gao and Kleywegt 2016; Sinha, Namkoong, and Duchi 2018), or MMD (Staib and Jegelka 2019)) to define the uncertainty set, which leads to various angles for tackling the OoD generalization problem. In (Gao and Kleywegt 2016), bounds for the worst-case risk are obtained under the minimal assumption that the loss function $\ell$ is bounded, for any black-box machine learning function. Another direction is invariance-based optimization (Rojas-Carulla et al. 2018; Koyama and Yamaguchi 2020b), which defines an optimization problem based on information theory. Specifically, the Shannon mutual information (Cover and Thomas 2006) between two random variables is optimized under an invariance set. Despite the success of these theoretical certificates, the limitations of such optimization models are that the results are only non-trivial within a small ball of distribution shifts and the derived bounds are not numerically computable (Weber et al. 2022). Sometimes strong assumptions need to be imposed on the loss function $\ell$ or on the machine learning models to guarantee validity (Gao and Kleywegt 2016; Sinha, Namkoong, and Duchi 2018; Staib and Jegelka 2019). These limitations make such models hard to apply to real-world data, where distribution shifts are generally large, and hard to use for providing theoretical guarantees for existing algorithms.

## Proposed Methodology

In this section, we first introduce the certification methodology for providing certifiable guarantees for each input datum's prediction. Then, based on the theoretical results, max-margin training is proposed and analyzed via neural tangent kernel theory to improve the certification bounds. A summary of certifiable OoD generalization is illustrated in Figure 2. Finally, we propose an instantiation of a practical certifiable OoD generalization algorithm.

### Certification Methodology

**Notations and settings.** We use (X, Y) to denote the dataset with n data pairs, where $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n}$. We denote $f_\theta(\cdot) = f(\cdot; \theta) = f_{L-1} \circ f_{L-2} \circ \cdots \circ f_0$ as an L-layer DNN with the last layer as the classification layer, where $\theta \in \mathbb{R}^{\Theta}$ are the parameters of the DNN. We also write $f(\cdot)$ for short in the following paragraphs. For simplicity and without loss of generality, we consider a 0-1 classification problem (a multi-class classification problem can be seen as multiple 0-1 classification problems, one versus the others), and the range of f's output lies in [0, 1]. As previous research revealed that the intermediate representations learned by a DNN exhibit semantic features of the objects to be recognized (Zeiler and Fergus 2013), given an input datum x, we define the semantic representation as the concatenation of the intermediate layers' representations: $z = [f_0(x), f_1(x), \dots, f_{L-2}(x)]$. Without loss of generality, we want to certify that if $f(z) > 1/2$, then this still holds for some set B around z, i.e., for any change in the semantic information $\delta \in B$, $f(z \oplus \delta) > 1/2$, where $\oplus$ is an operator denoting addition or multiplication. Formally, we have the following B-Generalizable definition for OoD generalization:

**Definition 1.** (B-Generalizable) For a 0-1 classification problem with notations inherited from above, where $f(z) \in [0, 1]$, given a closed set B and $f(z) > 1/2$, we say the function f is B-Generalizable at z if for any perturbation $\delta$ in the set B, i.e., $\delta \in B$, we have $f(z \oplus \delta) > 1/2$.

Remark: For convenience we will simply write "f(z) is B-Generalizable".
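To make the construction of the semantic representation z concrete, below is a minimal PyTorch-style sketch (our own illustration, not the authors' released code) that collects intermediate-layer outputs with forward hooks and concatenates them; the toy network and hook setup are hypothetical placeholders.

```python
import torch
import torch.nn as nn

# A hypothetical L-layer network f = f_{L-1} ∘ ... ∘ f_0; any feed-forward model works.
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),    # f_0, f_1
    nn.Linear(64, 64), nn.ReLU(),    # f_2, f_3
    nn.Linear(64, 1), nn.Sigmoid(),  # f_{L-1}: classification layer, output in [0, 1]
)

# Collect the intermediate representations [f_0(x), ..., f_{L-2}(x)] via forward hooks.
intermediates = []
hooks = [
    layer.register_forward_hook(lambda mod, inp, out: intermediates.append(out))
    for layer in list(model)[:-2]    # every layer except the final classification layer
]

x = torch.randn(8, 32)               # a batch of 8 input data
y_hat = model(x)                     # forward pass; the hooks fill `intermediates`
z = torch.cat([h.flatten(1) for h in intermediates], dim=1)  # semantic representation z

for h in hooks:
    h.remove()
print(z.shape, y_hat.shape)
```

In the method itself, the stochastic disturbance defined next is applied to such intermediate representations rather than to the raw input x.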
For clarity, we posit the proposed definition for the 0-1 classification problem; it can also be extended to multi-class cases. Next, we introduce certifiable methods to elicit B-Generalizable models and the corresponding sets B. In the following, we introduce the proposed randomly disturbed version of the model. Assuming $\pi_0$ is the distribution of the stochastic disturbance, the randomized model is defined as the expectation of the prediction averaged over the distribution of the semantic representation:

$$
f_{\pi_0}(z) := \mathbb{E}_{\eta \sim \pi_0}[f(z \oplus \eta)] \tag{1}
$$

We want to prove that if the original classifier still gives the correct prediction under stochastic disturbance ($f_{\pi_0}(z) > 1/2$), then for any perturbation within some range $\delta \in B$, the following inequality still holds:

$$
\min_{\delta \in B} f_{\pi_0}(z \oplus \delta) = \min_{\delta \in B} f_{\pi_\delta}(z) = \min_{\delta \in B} \mathbb{E}_{\eta \sim \pi_0}[f(z \oplus \eta \oplus \delta)] > \frac{1}{2} \tag{2}
$$

where $\oplus$ is element-wise addition or element-wise multiplication and $\pi_\delta$ is the distribution of $\eta \oplus \delta$. To derive a tractable lower bound of $\min_{\delta \in B} f_{\pi_0}(z \oplus \delta)$, we further relax f to the functional space $\mathcal{F} = \{\hat{f} : \hat{f}(z) \in [0, 1],\ \forall z \in \mathbb{R}^{Z}\}$, the set of all functions bounded in [0, 1], along with an equality constraint at the original function f:

$$
\min_{\delta \in B} f_{\pi_0}(z \oplus \delta) \ge \min_{\hat{f} \in \mathcal{F}} \min_{\delta \in B} \hat{f}_{\pi_0}(z \oplus \delta) \quad \text{s.t.} \quad \hat{f}_{\pi_0}(z) = f_{\pi_0}(z) \tag{3}
$$

Remark 1: To make the lower bound computable, since f is a high-dimensional non-linear function (a deep neural network) that is generally intractable, we relax it to any function bounded in [0, 1], with an extra constraint that it should give the same prediction as the original function. Note that the above inequality can be handled with the Lagrangian method.

**Theorem 1.** (Lagrangian) Denote by $\pi_\delta$ the distribution of $\eta \oplus \delta$. Solving Inequality 3 is equivalent to solving the following problem:

$$
L = \min_{\hat{f} \in \mathcal{F}} \min_{\delta \in B} \max_{\lambda \in \mathbb{R}} \left\{ \hat{f}_{\pi_0}(z \oplus \delta) - \lambda\big( \hat{f}_{\pi_0}(z) - f_{\pi_0}(z) \big) \right\} \ge \lambda f_{\pi_0}(z) - \max_{\delta \in B} D_{\mathcal{F}}(\lambda \pi_0, \pi_\delta) \tag{4}
$$

where $D_{\mathcal{F}}(\lambda \pi_0, \pi_\delta)$ is:

$$
D_{\mathcal{F}}(\lambda \pi_0, \pi_\delta) = \max_{\hat{f} \in \mathcal{F}} \left\{ \lambda \mathbb{E}_{\eta \sim \pi_0}[\hat{f}(z \oplus \eta)] - \mathbb{E}_{\eta \sim \pi_\delta}[\hat{f}(z \oplus \eta)] \right\} =
\begin{cases}
\int [\lambda \pi_0(\eta) - \pi_\delta(\eta)]_+ \, d\eta, & \text{if } \pi \text{ is continuous,} \\
\sum [\lambda \pi_0(\eta) - \pi_\delta(\eta)]_+, & \text{if } \pi \text{ is discrete.}
\end{cases} \tag{5}
$$

Remark 2: The proofs of Theorem 1 and of the following Propositions are given in Appendix C. The theorem does not rely on additional assumptions of (local) convexity of deep neural networks and is network-architecture agnostic, which means the elicited bound can be applied to any black-box model. From this result, we can derive the OoD generalizable set B by solving $L > 1/2$. Based on Theorem 1, we have the following Propositions for various distributions.

**Proposition 1.** (Gaussian distribution) Instantiate $\pi_0$ as a Gaussian distribution centered at 0, $\mathcal{N}(0, \sigma^2 I)$, instantiate $\oplus$ as the addition operator, and assume $f_{\pi_0}(z) > \tfrac{1}{2}$. Then $f_{\pi_0}(z)$ is B-Generalizable for $B = \{\delta : \|\delta\|_2 \le r\}$, with r satisfying the prediction confidence lower bound

$$
L \ge \Phi\!\left( \Phi^{-1}\big(f_{\pi_0}(z)\big) - \frac{r}{\sigma} \right) > \frac{1}{2} \tag{6}
$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard Gaussian distribution, which gives:

$$
r < \sigma \, \Phi^{-1}\big(f_{\pi_0}(z)\big) \tag{7}
$$

Thus, the randomized classifier always gives correct predictions when the perturbation range r lies within the generalizable set B. For the Laplace distribution the result is similar, and we leave it to the Appendix. We also consider the Bernoulli distribution, which offers a discrete stochastic disturbance to the semantic representations. The Bernoulli distribution corresponds to the widely-used dropout trick for reducing overfitting in deep neural networks, so our results can shed some light on explaining why dropout works.

**Proposition 2.** (Bernoulli distribution) Instantiate $\pi_0$ as a Bernoulli distribution with probability $p \in [0, 1)$ of setting a coordinate to zero, instantiate $\oplus$ as multiplication, and assume $f_{\pi_0}(z) > \tfrac{1}{2}$.
Here $\|\cdot\|_0$ denotes the $\ell_0$-norm, which counts the number of non-zero elements in a vector. Then $f_{\pi_0}(z)$ is B-Generalizable for $B = \{\delta : \|\delta - 1\|_0 \le r\}$, with r satisfying the prediction confidence lower bound

$$
L \ge \max\big\{ f_{\pi_0}(z) - 1 + p^{\,r},\; 0 \big\} > \frac{1}{2} \tag{8}
$$

which gives

$$
r < \frac{\ln\big(1.5 - f_{\pi_0}(z)\big)}{\ln(p)} \tag{9}
$$

The proof of Proposition 2 is challenging as it involves discrete distributions; we leave it to the Appendix. From Proposition 2, it can be concluded that the radius of the generalizable set B is proportional to the inverse of ln(p). This provides a plausible theoretical explanation of why the widely-used dropout method can help avoid over-fitting and improve generalization performance. It also reveals an inherent trade-off between selecting a higher dropout rate p and keeping $f_{\pi_0}(z) > \tfrac{1}{2}$ (a too-large dropout rate p may lead to wrong prediction results, i.e., $f_{\pi_0}(z) < \tfrac{1}{2}$). Additionally, this analysis indicates a new research direction of automatically searching for the optimal parameters of the random distributions, e.g., the dropout rate, to improve generalization abilities, which we leave for future work.

Remark 3: The above analysis equips us with methods for certifying OoD generalization algorithms. Though our analysis is based on a 0-1 classification problem, it can be directly extended to a multi-class classification problem by constructing multiple one-vs-others classification problems. For the above propositions to hold, a necessary condition is $f_{\pi_0}(z) > \tfrac{1}{2}$. Besides, in the binary classification setting, the closer $f_{\pi_0}(z)$ is to 1 (i.e., the further away from the decision boundary), the larger the allowed perturbation range r (i.e., the certifiable region B) from Equation 7; for the Bernoulli case the result is similar, as maintaining a higher $f_{\pi_0}(z)$ value leaves room for a much smaller magnitude of ln(p).

### Max-Margin Training

To find as large a certifiable region as possible, we introduce max-margin training, which aligns with the goal of staying away from the decision boundary. Maximal margin training is the task of finding the optimal hyperplane that linearly separates two separable classes. To ensure robustness, the optimal hyperplane is naturally defined as the hyperplane that maximizes its distance (= 1/2 margin) from the closest points of the two clouds of separable data. From the neural tangent kernel (NTK) theory in (Jacot, Gabriel, and Hongler 2018), we can approximate a high-dimensional non-linear deep neural network model $f_{\pi_0}(x; \theta)$ by kernelized linear regression when the width of the network goes to infinity:

$$
f_{\pi_0}(x; \theta) \approx f_{\pi_0}(x; w) \approx f_{\pi_0}(x; w_0) + \Psi_{\pi_0}(x; w_0)(w - w_0) \tag{10}
$$

where $\Psi_{\pi_0}(x; w_0)$ is the NTK, w denotes the last layer's parameters, and $w_0$ is the initialization of the last layer. We then derive the max-margin linear classifier for separating samples.

**Theorem 2.** (Max-margin classifier) Suppose we wish to allow outliers (points within the margin, or even on the other side of the decision boundary). The NTK max-margin classifier's parameters satisfy the following optimality conditions:

$$
\min_{\xi, w} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \tag{11}
$$
$$
\text{s.t.} \quad y_i \big( f_{\pi_0}(x_i; w_0) + \Psi_{\pi_0}(x_i; w_0)(w - w_0) \big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
$$

where $\xi_i / \|w\|$ is the distance of the furthest outlier point to the decision boundary (margin) along the i-th axis, and C is the hyper-parameter controlling the cost of outliers. This is equivalent to solving the following optimization problem:

$$
\min_{w} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \underbrace{\max\Big(0,\; 1 - y_i \big( f_{\pi_0}(x_i; w_0) + \Psi_{\pi_0}(x_i; w_0)(w - w_0) \big)\Big)}_{\text{Training loss}} \tag{12}
$$

We show the equivalence of Equation 11 and Equation 12 in the Appendix.
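Before moving to the practical algorithm, the following is a minimal sketch (our own illustration of Proposition 1, not the paper's Certification Algorithm from Appendix B) of how the smoothed prediction $f_{\pi_0}(z)$ and the Gaussian certified radius might be estimated by Monte Carlo sampling; finite-sample confidence corrections are omitted, and the function names are hypothetical.

```python
import torch
from torch.distributions import Normal

def smoothed_score(head, z, sigma, n_samples=1000):
    """Monte Carlo estimate of f_{pi_0}(z) = E_eta[f(z + eta)] for Gaussian eta.

    `head` maps the semantic representation z (batch, dim) to a probability in
    [0, 1], playing the role of the last layer f_{L-1} in the paper's notation.
    """
    with torch.no_grad():
        scores = torch.zeros(z.shape[0], device=z.device)
        for _ in range(n_samples):
            eta = sigma * torch.randn_like(z)      # eta ~ N(0, sigma^2 I)
            scores += head(z + eta).squeeze(-1)
        return scores / n_samples

def certified_radius(p_smooth, sigma):
    """Gaussian radius r < sigma * Phi^{-1}(f_{pi_0}(z)) from Proposition 1.

    Returns 0 where the smoothed prediction is not above 1/2 (no certificate).
    """
    std_normal = Normal(0.0, 1.0)
    radius = sigma * std_normal.icdf(p_smooth.clamp(1e-6, 1 - 1e-6))
    return torch.where(p_smooth > 0.5, radius, torch.zeros_like(radius))
```

The certified accuracy reported in the experiments is then the fraction of test points whose true-class radius exceeds a given threshold; the exact procedure used by the paper is its Certification Algorithm in Appendix B.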
From Equation 12, we can see that the essence of the max-margin classifier is the introduction of the hinge loss, which only retains training losses larger than a constant. In practice, this can also be achieved by selecting the samples that yield the largest losses in a batch for training, which avoids introducing an extra task-dependent hyper-parameter.

### Practical Algorithm Instantiation

Based on the theoretical analysis, we propose an instantiation of the algorithm, Stochastic Disturbance Learning, with the Gaussian distribution as an example. For the Gaussian distribution, the expectation of the randomly disturbed loss for a data pair $(X_i, Y_i)$ is:

$$
\mathbb{E}_{\pi_0}[\ell(f(X_i; \theta), Y_i)] = \mathbb{E}_{\eta \sim \mathcal{N}(0, \sigma^2)}\big[\ell\big(f_{L-1}(z + \eta; \theta), Y_i\big)\big] \tag{13}
$$

where $z = f_{L-2} \circ \cdots \circ f_0(X_i; \theta)$ and $\sigma^2$ is the variance of the added Gaussian noise. The proposed SDL algorithm is shown in Algorithm 1; an illustrative code sketch of this training loop is given below, after the experimental setup. Based on the derived Propositions, an instantiation of the certification algorithm that obtains B is given in Appendix B. After training DNNs with Algorithm 1, we can use the Certification Algorithm in Appendix B to provide certified accuracies within the closed set B calculated by the Propositions in the previous sections. This can provide a theoretical performance guarantee for high-stakes applications.

```
Algorithm 1: Training procedure of stochastic disturbance learning

Require: Training set (X, Y); maximum number of epochs T; percentage of max-margin
         training epochs κ; percentage of top-loss samples used in max-margin training η;
         batch size B; variance of the Gaussian distribution σ.
Ensure:  The model's parameters θ.

1: while t ≤ (1 − κ)T do
2:   Calculate the forward-pass loss for every datum with added stochastic disturbances:
     E_{π0}[ℓ(f_θ(X_i), Y_i)] (see Eq. 13).
3:   Select the top ηB loss samples for max-margin training: (X_max^i, Y_max^i).
4:   Train the model with (X_max^i, Y_max^i): E_{π0}[ℓ(f_θ(X_max^i), Y_max^i)].
5: end while
6: while (1 − κ)T < t ≤ T do
7:   Calculate the forward-pass loss for every datum: E_{π0}[ℓ(f_θ(X_i), Y_i)].
8:   Train the model with (X, Y) using E_{π0}[ℓ(f_θ(X_i), Y_i)].
9: end while
```

## Experimental Results

This section demonstrates the effectiveness of the proposed algorithmic framework with empirical experiments, since benchmark results on OoD datasets have been found to be sensitive to hyper-parameter choices. For a fair comparison, we evaluate our method with the OoD-Bench suite (Ye et al. 2021) based on the DomainBed implementation (Gulrajani and Lopez-Paz 2021). With the OoD-Bench suite, we can evaluate OoD generalization performance on datasets dominated by diversity shifts or correlation shifts. Ablation studies are then conducted for further analysis.

### OoD-Bench Results on Distribution Shift Datasets

We follow the setting of OoD-Bench (Ye et al. 2021) for the comparison experiments. Specifically, we select PACS (Li et al. 2017), OfficeHome (Venkateswara et al. 2017), Terra Incognita (Beery, Horn, and Perona 2018), and Camelyon17-WILDS (Koh et al. 2020) for benchmarking on diversity-shift datasets, and Colored MNIST (Arjovsky et al. 2019), NICO (He, Shen, and Cui 2020), and a modified version of CelebA (Liu et al. 2015) for benchmarking on correlation-shift datasets. We use ResNet-18 for all experiments except on the Colored MNIST dataset, for which a multi-layer perceptron is used. For the hyper-parameter search, we run twenty iterations for each algorithm, and the search procedure is repeated three times. The means and standard deviations of accuracies are reported.
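As promised above, here is a minimal PyTorch-style sketch of a single SDL epoch under the Gaussian instantiation of Equation 13 (our own illustration under stated assumptions, not the authors' released implementation). The names `feature_extractor` (playing f_0 through f_{L-2}), `head` (f_{L-1}), `eta`, and `n_noise` are hypothetical; the noise draws approximate the expectation in Equation 13 by simple Monte Carlo.

```python
import torch
import torch.nn.functional as F

def sdl_epoch(feature_extractor, head, optimizer, loader, sigma,
              max_margin_phase, eta=0.5, n_noise=4):
    """One epoch of SDL (Algorithm 1), Gaussian instantiation of Eq. 13.

    During the max-margin phase only the top eta fraction of per-sample
    losses in each batch is back-propagated; otherwise all samples are used.
    """
    for x, y in loader:
        z = feature_extractor(x)                    # z = f_{L-2} ∘ ... ∘ f_0(x)
        # Monte Carlo estimate of E_eta[ℓ(f_{L-1}(z + eta), y)] with n_noise draws.
        losses = torch.stack([
            F.cross_entropy(head(z + sigma * torch.randn_like(z)), y, reduction="none")
            for _ in range(n_noise)
        ]).mean(dim=0)                              # per-sample disturbed loss

        if max_margin_phase:
            k = max(1, int(eta * losses.numel()))
            losses, _ = torch.topk(losses, k)       # keep only the top-loss samples

        optimizer.zero_grad()
        losses.mean().backward()
        optimizer.step()
```

A full run would call this for the first (1 − κ)T epochs with `max_margin_phase=True` and for the remaining epochs with `max_margin_phase=False`, mirroring the two loops of Algorithm 1.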
For every dataset-algorithm pair, depending on whether the attained accuracy is lower than, within, or higher than the standard error bar of the ERM accuracy on the same dataset, a ranking score of -1, 0, or +1 is assigned to the pair. Eighteen strong OoD generalization algorithms are compared, including invariant risk minimization methods (e.g., IRM (Arjovsky et al. 2019), VREx (Krueger et al. 2020)), a distributionally robust optimization method (GroupDRO (Sagawa* et al. 2020)), and domain generalization methods (e.g., MLDG (Li et al. 2018a), ERDG (Zhao et al. 2020)), among others. The results for diversity-shift-dominated datasets are shown in Table 1; for correlation-shift-dominated datasets, the results are shown in Table 2. From Table 1 and Table 2, it can be observed that all methods except the proposed SDL achieve better results than ERM on at most one type of distribution shift. From Table 1, we observe that the proposed SDL method achieves the best performance. Among the top performers, RSC, MMD, and SagNet outperform the standard empirical risk minimization method. This demonstrates that, under systematic evaluation, only a few methods can achieve better performance than ERM, revealing the inherent challenge of OoD generalization. From Table 2, the proposed method is still the best-performing method among all candidates. This answers a question left open in the previous paper (Ye et al. 2021): whether there exists an OoD generalization algorithm that can achieve better performance than ERM simultaneously on diversity- and correlation-shift-dominated datasets. For overall performance, the proposed SDL obtains a ranking score of +5, followed by RSC and MMD with a ranking score of +1, which means SDL achieves better performance stably on most OoD datasets in OoD-Bench.

| Algorithm | PACS | OfficeHome | Terra Incognita | Camelyon17 | Average | Ranking score |
|---|---|---|---|---|---|---|
| SDL (Proposed) | 84.8 ± 0.6 | 63.9 ± 0.1 | 44.1 ± 1.1 | 95.4 ± 0.3 | 72.1 | +4 |
| RSC | 82.8 ± 0.4 | 62.9 ± 0.4 | 43.6 ± 0.5 | 94.9 ± 0.2 | 71.1 | +2 |
| MMD | 81.7 ± 0.2 | 63.8 ± 0.1 | 38.3 ± 0.4 | 94.9 ± 0.4 | 69.7 | +2 |
| SagNet | 81.6 ± 0.4 | 62.7 ± 0.4 | 42.3 ± 0.7 | 95.0 ± 0.2 | 70.4 | +1 |
| ERM | 81.5 ± 0.0 | 63.3 ± 0.2 | 42.6 ± 0.9 | 94.7 ± 0.1 | 70.5 | 0 |
| IGA | 80.9 ± 0.4 | 63.6 ± 0.2 | 41.3 ± 0.8 | 95.1 ± 0.1 | 70.2 | 0 |
| CORAL | 81.6 ± 0.6 | 63.8 ± 0.3 | 38.3 ± 0.7 | 94.2 ± 0.3 | 69.5 | 0 |
| IRM | 81.1 ± 0.3 | 63.0 ± 0.2 | 42.0 ± 1.8 | 95.0 ± 0.4 | 70.3 | -1 |
| VREx | 81.8 ± 0.1 | 63.5 ± 0.1 | 40.7 ± 0.7 | 94.1 ± 0.3 | 70.0 | -1 |
| GroupDRO | 80.4 ± 0.3 | 63.2 ± 0.2 | 36.8 ± 1.1 | 95.2 ± 0.2 | 68.9 | -1 |
| ERDG | 80.5 ± 0.5 | 63.0 ± 0.4 | 41.3 ± 1.2 | 95.5 ± 0.2 | 70.1 | -2 |
| DANN | 81.1 ± 0.4 | 62.9 ± 0.6 | 39.5 ± 0.2 | 94.9 ± 0.0 | 69.6 | -2 |
| MTL | 81.2 ± 0.4 | 62.9 ± 0.2 | 38.9 ± 0.6 | 95.0 ± 0.1 | 69.5 | -2 |
| Mixup | 79.8 ± 0.6 | 63.3 ± 0.5 | 39.8 ± 0.3 | 94.6 ± 0.3 | 69.4 | -2 |
| ANDMask | 79.5 ± 0.0 | 62.0 ± 0.3 | 39.8 ± 1.4 | 95.3 ± 0.1 | 69.2 | -2 |
| ARM | 81.0 ± 0.4 | 63.2 ± 0.2 | 39.4 ± 0.7 | 93.5 ± 0.6 | 69.2 | -3 |
| MLDG | 73.0 ± 0.4 | 52.4 ± 0.2 | 27.4 ± 2.0 | 91.2 ± 0.4 | 61.0 | -4 |
| Average | 80.8 | 62.6 | 39.8 | 94.6 | 69.4 | |

Table 1: Performance of ERM and OoD generalization algorithms on datasets dominated by diversity shift. The baseline methods include RSC (Huang et al. 2020), MMD (Li et al. 2018b), SagNet (Nam et al. 2019), IGA (Koyama and Yamaguchi 2020a), CORAL (Sun and Saenko 2016), IRM (Arjovsky et al. 2019), VREx (Krueger et al. 2020), GroupDRO (Sagawa* et al. 2020), ERDG (Zhao et al. 2020), DANN (Ganin et al. 2016), MTL (Blanchard et al. 2017), Mixup (Yan et al. 2020), ANDMask (Parascandolo et al. 2021), ARM (Zhang et al. 2020), and MLDG (Li et al. 2018a). A score of -1 is assigned on a dataset where an algorithm performs below ERM's error bar, +1 where it performs above, and 0 otherwise; the scores indicate on how many datasets the candidate algorithm performs better than ERM.
Adding up the scores across all datasets produces the ranking score for each algorithm. The proposed SDL achieves the best average performance and the highest ranking score.

### Certified Accuracy on OoD Datasets and Ablation Studies

The certified accuracy is defined as the fraction of test-set samples that are certifiably correct within the maximum allowable generalizable set B. We show the certified accuracies of the proposed algorithmic scheme taking PACS and OfficeHome as examples. We plot the certified accuracies of the proposed SDL against the radius of B while varying the variance σ, as shown in Figure 3 (a). We observe that varying the variance σ yields different trade-offs between certified accuracy and the radius of the generalizable set B. A larger variance generally leads to higher certified accuracy where the semantic information deviates further. This aligns well with our theoretical analysis that a larger variance can lead to a higher allowable radius (Equation 7). The certification result with the Bernoulli distribution is shown in Appendix C.

The effectiveness of max-margin training and random noise in SDL is investigated in ablation studies. We compare SDL with its variants without max-margin training (Ma1), without random noise (Ma2), and without both (ERM). The results are visualized in Figure 3 (b). From Figure 3 (b), we observe that SDL achieves statistically significantly better results than its variants for all radius selections. Furthermore, when the deviation degree increases, the certified accuracy drops more significantly after removing the random-noise component, confirming the necessity of the algorithmic components.

We have analyzed the relationship between the radius of B and the variance σ both theoretically and empirically (see Equation 7 and Figure 4(a)). The degree of domain shift between training and test data affects B through $f_{\pi_0}(z)$. For a dataset with a very large degree of distribution shift, it may be quite hard for the baseline to learn $f_{\pi_0}(z)$ well, which may then be close to the classification bound of 0.5. We address this issue with max-margin training, which is quite effective according to the ablation study. In fact, a very large domain shift does not necessarily cause the B-set to become very small or to disappear according to the experimental results. For example, on the PACS dataset, which has a very large degree of diversity shift (about 0.8 in OoD-Bench), the certified test accuracies of SDL (σ = 3.0) do not start to drop significantly until the radius reaches 8.

## Conclusion

In this paper, we propose a certifiable out-of-distribution generalization algorithm that can provide certified accuracy for each input datum's prediction on the semantic space. The proposed method simultaneously achieves state-of-the-art empirical performance on datasets dominated by the two types of distribution shifts, with theoretically guaranteed performance. For future work, we will explore how to apply this method to other tasks, such as autonomous driving or medical image processing, to further improve OoD generalization performance.

| Algorithm | Colored MNIST | CelebA | NICO | Average | Ranking score |
|---|---|---|---|---|---|
| SDL (Proposed) | 58.8 ± 2.2 | 88.6 ± 0.5 | 71.7 ± 0.6 | 73.0 | +2 |
| VREx (Krueger et al. 2020) | 56.3 ± 1.9 | 87.3 ± 0.2 | 71.5 ± 2.3 | 71.7 | +1 |
| GroupDRO (Sagawa* et al. 2020) | 32.5 ± 0.2 | 87.5 ± 1.1 | 71.0 ± 0.4 | 63.7 | +1 |
| ERM (Vapnik 1998) | 29.9 ± 0.9 | 87.2 ± 0.6 | 72.1 ± 1.6 | 63.1 | 0 |
| IRM (Arjovsky et al. 2019) | 60.2 ± 2.4 | 85.4 ± 1.2 | 73.3 ± 2.1 | 73.0 | 0 |
| MTL (Blanchard et al. 2017) | 29.3 ± 0.1 | 87.0 ± 0.7 | 70.6 ± 0.8 | 62.3 | 0 |
| ERDG (Zhao et al. 2020) | 31.6 ± 1.3 | 84.5 ± 0.2 | 72.7 ± 1.9 | 62.9 | 0 |
| ARM (Zhang et al. 2020) | 34.6 ± 1.8 | 86.6 ± 0.7 | 67.3 ± 0.2 | 62.8 | 0 |
| MMD (Li et al. 2018b) | 50.7 ± 0.1 | 86.0 ± 0.5 | 68.9 ± 1.2 | 68.5 | -1 |
| RSC (Huang et al. 2020) | 28.6 ± 1.5 | 85.9 ± 0.2 | 74.3 ± 1.9 | 62.9 | -1 |
| IGA (Koyama and Yamaguchi 2020a) | 29.7 ± 0.5 | 86.2 ± 0.7 | 71.0 ± 0.1 | 62.3 | -1 |
| CORAL (Sun and Saenko 2016) | 30.0 ± 0.5 | 86.3 ± 0.5 | 70.8 ± 1.0 | 62.4 | -1 |
| Mixup (Yan et al. 2020) | 27.6 ± 1.8 | 87.5 ± 0.5 | 72.5 ± 1.5 | 62.5 | -1 |
| MLDG (Li et al. 2018a) | 32.7 ± 1.1 | 85.4 ± 1.3 | 66.6 ± 2.4 | 61.6 | -1 |
| SagNet (Nam et al. 2019) | 30.5 ± 0.7 | 85.8 ± 1.4 | 69.8 ± 0.7 | 62.0 | -2 |
| ANDMask (Parascandolo et al. 2021) | 27.2 ± 1.4 | 86.2 ± 0.2 | 71.2 ± 0.8 | 61.5 | -2 |
| DANN (Ganin et al. 2016) | 24.5 ± 0.8 | 86.0 ± 0.4 | 69.4 ± 1.7 | 60.0 | -3 |
| Average | 36.2 | 86.4 | 70.9 | 64.5 | |

Table 2: Performance of ERM and OoD generalization algorithms on datasets dominated by correlation shift. A score of -1 is assigned on a dataset where an algorithm performs below ERM's error bar, +1 where it performs above, and 0 otherwise; the scores indicate on how many datasets the candidate algorithm performs better than ERM. Adding up the scores across all datasets produces the ranking score for each algorithm. The proposed SDL achieves the best average performance and the highest ranking score.

Figure 3: (a) Certified accuracy on PACS and OfficeHome. The left line chart shows the certified accuracy under SDL with different σ of π(η); the x-axis denotes the radius of the generalizable set B. (b) The right bar chart shows the ablation study results of removing max-margin training (Ma1) or random noise (Ma2).

## Acknowledgements

Nanyang Ye was supported in part by the National Natural Science Foundation of China under Grant No. 62106139.

## References

Ahuja, K.; Caballero, E.; Zhang, D.; Gagnon-Audet, J.-C.; Bengio, Y.; Mitliagkas, I.; and Rish, I. 2021. Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization. In NeurIPS.

Ahuja, K.; Shanmugam, K.; Varshney, K.; and Dhurandhar, A. 2020a. Invariant Risk Minimization Games. In ICML.

Ahuja, K.; Shanmugam, K.; Varshney, K. R.; and Dhurandhar, A. 2020b. Invariant Risk Minimization Games. arXiv:1907.02893.

Arjovsky, M.; Bottou, L.; Gulrajani, I.; and Lopez-Paz, D. 2019. Invariant Risk Minimization. arXiv:1907.02893.

Beery, S.; Horn, G. V.; and Perona, P. 2018. Recognition in Terra Incognita. In ECCV.

Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning, 79(1): 151-175.

Blanchard, G.; Deshmukh, A. A.; Dogan, U.; Lee, G.; and Scott, C. 2017. Domain generalization by marginal transfer learning. arXiv:1711.07910.

Cover, T. M.; and Thomas, J. A. 2006. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). USA: Wiley-Interscience. ISBN 0471241954.

Duchi, J.; and Namkoong, H. 2019. Learning Models with Uniform Performance via Distributionally Robust Optimization. arXiv:1810.08750.

Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-Adversarial Training of Neural Networks. JMLR.

Gao, R.; and Kleywegt, A. J. 2016. Distributionally Robust Stochastic Optimization with Wasserstein Distance. arXiv:1604.02199.

Gulrajani, I.; and Lopez-Paz, D. 2021. In Search of Lost Domain Generalization. In ICLR.

He, Y.; Shen, Z.; and Cui, P. 2020. Towards Non-IID Image Classification: A Dataset and Baselines. Pattern Recognition.

Huang, Z.; Wang, H.; Huang, D.; Lee, Y. J.; and Xing, E. P. 2022. The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization. arXiv:2204.04384.
Huang, Z.; Wang, H.; Xing, E. P.; and Huang, D. 2020. Self-challenging improves cross-domain generalization. arXiv:2007.02454.

Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural tangent kernel: Convergence and generalization in neural networks. arXiv:1806.07572.

Koh, P. W.; Sagawa, S.; Marklund, H.; Xie, S. M.; Zhang, M.; Balsubramani, A.; Hu, W.; Yasunaga, M.; Phillips, R. L.; Beery, S.; et al. 2020. WILDS: A Benchmark of in-the-Wild Distribution Shifts. arXiv:2012.07421.

Koyama, M.; and Yamaguchi, S. 2020a. Out-of-distribution generalization with maximal invariant predictor. arXiv:2008.01883.

Koyama, M.; and Yamaguchi, S. 2020b. When is invariance useful in an Out-of-Distribution Generalization problem? arXiv:2008.01883.

Krueger, D.; Caballero, E.; Jacobsen, J.-H.; Zhang, A.; Binas, J.; Priol, R. L.; and Courville, A. 2020. Out-of-Distribution Generalization via Risk Extrapolation (REx). arXiv:2003.00688.

Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. 2018a. Learning to generalize: Meta-learning for domain generalization. In AAAI.

Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. M. 2017. Deeper, Broader and Artier Domain Generalization. In ICCV.

Li, H.; Jialin Pan, S.; Wang, S.; and Kot, A. C. 2018b. Domain generalization with adversarial feature learning. In CVPR.

Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep Learning Face Attributes in the Wild. In ICCV.

Nam, H.; Lee, H.; Park, J.; Yoon, W.; and Yoo, D. 2019. Reducing domain gap via style-agnostic networks. arXiv:1910.11645.

Parascandolo, G.; Neitz, A.; Orvieto, A.; Gresele, L.; and Schölkopf, B. 2021. Learning explanations that are hard to vary. In ICLR.

Rahimian, H.; and Mehrotra, S. 2019. Distributionally Robust Optimization: A Review. arXiv:1908.05659.

Rojas-Carulla, M.; Schölkopf, B.; Turner, R.; and Peters, J. 2018. Invariant Models for Causal Transfer Learning. JMLR, 19(36): 1-34.

Sagawa*, S.; Koh*, P. W.; Hashimoto, T. B.; and Liang, P. 2020. Distributionally Robust Neural Networks. In ICLR.

Shen, Z.; Cui, P.; Liu, J.; Zhang, T.; Li, B.; and Chen, Z. 2020a. Stable Learning via Differentiated Variable Decorrelation. In KDD.

Shen, Z.; Cui, P.; Zhang, T.; and Kuang, K. 2020b. Stable Learning via Sample Reweighting. In AAAI.

Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529: 484-489.

Sinha, A.; Namkoong, H.; and Duchi, J. 2018. Certifiable Distributional Robustness with Principled Adversarial Training. In ICLR.

Staib, M.; and Jegelka, S. 2019. Distributionally Robust Optimization and Generalization in Kernel Methods. In NeurIPS.

Sun, B.; and Saenko, K. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV.

Tran, D.; Ruiz, F. J.; Athey, S.; and Blei, D. M. 2016. Model Criticism for Bayesian Causal Inference. arXiv:1610.09037.

Vapnik, V. 1998. Statistical Learning Theory. Wiley.

Venkateswara, H.; Eusebio, J.; Chakraborty, S.; and Panchanathan, S. 2017. Deep Hashing Network for Unsupervised Domain Adaptation. In CVPR.

Weber, M.; Li, L.; Wang, B.; Zhao, Z.; Li, B.; and Zhang, C. 2022. Certifying Out-of-Domain Generalization for Blackbox Functions. arXiv:2202.01679.

Wiles, O.; Gowal, S.; Stimberg, F.; Rebuffi, S.-A.; Ktena, I.; Dvijotham, K. D.; and Cemgil, A. T. 2022. A Fine-Grained Analysis on Distribution Shift. In ICLR.
Yan, S.; Song, H.; Li, N.; Zou, L.; and Ren, L. 2020. Improve unsupervised domain adaptation with mixup training. arXiv:2001.00677.

Ye, N.; Li, K.; Bai, H.; Yu, R.; Hong, L.; Zhou, F.; Li, Z.; and Zhu, J. 2021. OoD-Bench: Benchmarking and Understanding Out-of-Distribution Generalization Datasets and Algorithms. arXiv:2106.03721.

Zeiler, M. D.; and Fergus, R. 2013. Visualizing and Understanding Convolutional Networks. arXiv:1311.2901.

Zhang, M.; Marklund, H.; Dhawan, N.; Gupta, A.; Levine, S.; and Finn, C. 2020. Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Distribution Shift. arXiv:2007.02931.

Zhao, H.; des Combes, R. T.; Zhang, K.; and Gordon, G. J. 2019. On Learning Invariant Representation for Domain Adaptation. CoRR, abs/1901.09453.

Zhao, S.; Gong, M.; Liu, T.; Fu, H.; and Tao, D. 2020. Domain Generalization via Entropy Regularization. In NeurIPS.