# Certifying Out-of-Domain Generalization for Blackbox Functions

Maurice Weber¹, Linyi Li², Boxin Wang², Zhikuan Zhao¹, Bo Li², Ce Zhang¹

¹Department of Computer Science, ETH Zurich. ²UIUC, USA. Correspondence to: Maurice Weber, Ce Zhang.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Certifying the robustness of model performance under bounded data distribution drifts has recently attracted intensive interest under the umbrella of distributional robustness. However, existing techniques either make strong assumptions on the model class and loss functions that can be certified, such as smoothness expressed via Lipschitz continuity of gradients, or require solving complex optimization problems. As a result, the wider application of these techniques is currently limited by their scalability and flexibility: these techniques often do not scale to large-scale datasets with modern deep neural networks, or cannot handle loss functions which may be non-smooth, such as the 0-1 loss. In this paper, we focus on the problem of certifying distributional robustness for blackbox models and bounded loss functions, and propose a novel certification framework based on the Hellinger distance. Our certification technique scales to ImageNet-scale datasets, complex models, and a diverse set of loss functions. We then focus on one specific application enabled by such scalability and flexibility, i.e., certifying out-of-domain generalization for large neural networks and loss functions such as accuracy and AUC. We experimentally validate our certification method on a number of datasets, ranging from ImageNet, where we provide the first non-vacuous certified out-of-domain generalization, to smaller classification tasks, where we are able to compare with the state of the art and show that our method performs considerably better.

1. Introduction

The wide application of machine learning models in the real world brings an emerging challenge of understanding the performance of a machine learning model under different data distributions: ML systems operating autonomous vehicles which are trained on data collected in the northern hemisphere might fail when deployed in desert-like environments or under different weather conditions (Volk et al., 2019; Dai & Van Gool, 2018), while recognition systems have been shown to fail when deployed in new environments (Beery et al., 2018). Similar concerns also apply to many mission-critical applications such as medicine and cyber-security (Koh et al., 2021; AlBadawy et al., 2018; Gulrajani & Lopez-Paz, 2021). In all these applications, it is imperative to have a sound understanding of the model's robustness and possible failure cases in the presence of a shift in the data distribution, and to have corresponding guarantees on the performance.

Recently, this problem has attracted intensive interest under the umbrella of distributional robustness (Scarf, 1958; Ben-Tal et al., 2013; Gao & Kleywegt, 2016; Kuhn et al., 2019; Blanchet & Murthy, 2019; Duchi et al., 2021). Specifically, let $P$ be a joint data distribution over features $X \in \mathcal{X}$ and labels $Y \in \mathcal{Y}$, and let $h_\theta : \mathcal{X} \to \mathcal{Y}$ be a machine learning model parameterized by $\theta$. For a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, we hope to compute

$$R_\theta(\mathcal{U}_P) := \sup_{Q \in \mathcal{U}_P} \mathbb{E}_{(X,Y) \sim Q}[\ell(h_\theta(X), Y)] \tag{1}$$

where $\mathcal{U}_P \subseteq \mathcal{P}(\mathcal{Z})$ is a set of probability distributions on $\mathcal{Z}$, called the uncertainty set.
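As a concrete illustration of the quantity in (1), the following minimal sketch (our own toy example; the outcome space, per-outcome losses, and candidate distributions are all invented for illustration) evaluates the worst-case risk over a small, finite uncertainty set of discrete distributions:

```python
import numpy as np

# Illustrative setup: a discrete sample space with three outcomes and a
# fixed per-outcome loss, e.g. the 0-1 loss of a frozen classifier h_theta.
losses = np.array([0.0, 1.0, 1.0])

# A toy finite uncertainty set U_P: candidate distributions Q "near" P.
P = np.array([0.8, 0.1, 0.1])
uncertainty_set = [
    P,
    np.array([0.7, 0.2, 0.1]),
    np.array([0.6, 0.2, 0.2]),
]

# Worst-case risk (1): the largest expected loss over all Q in U_P.
worst_case_risk = max(float(q @ losses) for q in uncertainty_set)
print(worst_case_risk)  # 0.4, attained by the most shifted distribution
```

In realistic settings, $\mathcal{U}_P$ is of course an infinite family of distributions, which is precisely what makes (1) hard to compute and motivates closed-form certificates.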
Intuitively, (1) measures the worst-case risk of $h_\theta$ when the data distribution drifts from $P$ to another distribution in $\mathcal{U}_P$. Providing a technical solution to this problem has gained increased attention over the years, as summarized in Table 1. However, most, if not all, existing approaches impose strong constraints, such as bounded Lipschitz gradients, on both $h$ and $\ell$, and rely on expensive certification methods such as direct minimax optimization. As a result, these methods have been applied only to small-scale datasets and ML models. In this paper, we consider the case where both $h$ and $\ell$ can be non-convex and non-smooth: $h$ can be a full-fledged neural network, e.g., ImageNet-scale EfficientNet-B7 (Tan & Le, 2019), and $\ell$ can be a general non-smooth loss function such as the 0-1 loss. We provide, to the best of our knowledge, the first practical method for blackbox functions that scales to real-world, ImageNet-scale neural networks and datasets.

| Ref. | Assumptions on $\ell$ | Assumption on $h$ | Distance | Largest Dataset |
| --- | --- | --- | --- | --- |
| (Gao & Kleywegt, 2016) | Generalised Lipschitz Continuity | — | Wasserstein | — |
| (Sinha et al., 2018) | Bounded, Smoothness | Smoothness | Wasserstein | MNIST |
| (Staib & Jegelka, 2019) | Bounded, Continuous | Kernel Methods | MMD | — |
| (Shafieezadeh-Abadeh et al., 2019) | Lipschitz Continuity | — | Wasserstein | — |
| (Blanchet & Murthy, 2019) | Bounded, Smoothness | Smoothness | Wasserstein | — |
| (Cranko et al., 2021) | Generalised Lipschitz Continuity | — | Wasserstein | — |
| Our Method | Bounded | any Blackbox | Hellinger | ImageNet |

Table 1. Current landscape of certified distributional robustness.

Our key innovation is a novel algorithmic framework that arises from bounding inner products between elements of a suitable Hilbert space. Specifically, we can characterize the upper bound on the performance of $h$ on any $Q$ within the uncertainty set as a function of the Hellinger distance, a specific type of $f$-divergence, and the expectation and variance of the loss of $h$ on $P$. We then apply our framework to the problem of certifying the out-of-domain generalization performance of a given classifier, taking advantage of its scalability and flexibility. Specifically, let $P$ be the in-domain distribution, and $h_\theta$ a classifier. Then, to reason about the performance of $h_\theta$ on shifted distributions $Q$, we provide a certificate of the following form:

$$\forall\, Q : \mathrm{dist}(Q, P) \le \rho \implies \mathbb{E}_{(X,Y) \sim Q}[\ell(h(X), Y)] \le C_\ell(\rho, P) \tag{2}$$

where $C_\ell$ is a bound which depends on the distance $\rho$ and the distribution $P$. This requires several nontrivial instantiations of our framework with careful practical considerations. To this end, we first develop a certification algorithm that relies only on a finite set of samples from the in-domain distribution $P$. Moreover, we also instantiate it with different domain drift models, such as label drift and covariate drift, connecting the general Hellinger distance to the degree of domain drift specific to these scenarios. We then consider a diverse range of loss functions, including the JSD loss, the 0-1 loss, and AUC. To the best of our knowledge, we provide the first certificate for such diverse realistic scenarios which is able to scale to large problems. Last but not least, we conduct intensive experiments verifying the efficiency and effectiveness of our result. Our method is able to scale to datasets and neural networks as large as ImageNet and full-fledged models like EfficientNet-B7 and BERT. We further apply our method on smaller-scale datasets, in order to compare with strong, state-of-the-art methods.
We show that our method provides much tighter certificates. Our contributions can be summarized as follows:

- We present a novel framework which provides a non-vacuous, computationally tractable bound on the distributionally robust worst-case risk $R_\theta(\mathcal{U}_P)$ for general bounded loss functions $\ell$ and models $h$.
- We apply this framework to the problem of certifying out-of-domain generalization for blackbox functions and provide a means to certify distributional robustness in specific scenarios such as label and covariate drifts.
- We provide an extensive experimental study of our approach on a wide range of datasets, including the large-scale ImageNet (Russakovsky et al., 2015) dataset, as well as NLP datasets with complex models.

2. Distributional Robustness for Blackbox Functions

In this section, we present our main results, namely a computationally tractable upper bound on the worst-case risk (1) for uncertainty sets expressed in terms of Hellinger balls around the data-generating distribution $P$. The technique is based on the non-negativity of Gram matrices which, by expressing expectation values as inner products between elements of a suitable Hilbert space, can be leveraged to relate expectation values of a blackbox function under different probability distributions $P$ and $Q$.¹ We describe the underlying technique leading to our main result in Theorem 2.2, which upper bounds the worst-case population loss using both the expectation and variance.

¹The idea behind our methods is inspired by how Gram matrices are used in quantum chemistry (Weinhold, 1968; Weber et al., 2021) to bound expectation values of quantum observables. However, the adaptation to machine learning is nontrivial and requires careful analysis.

For the remainder of this section, to simplify notation and maintain generality, we consider generic loss functions $\ell : \mathcal{Z} \to \mathbb{R}_+$ which contain the model $h$ and take inputs from a generic input space $\mathcal{Z}$. For example, in the context of supervised learning, $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ can be the product space of features and labels, and the loss $\ell(z) = \ell(h_\theta(x), y)$ can be seen as a composition of the loss function $\ell$ and the model $h_\theta$. We denote the set of probability measures on the space $\mathcal{Z}$ by $\mathcal{P}(\mathcal{Z})$. For two measures $\mu, \nu$ on $\mathcal{Z}$, we say that $\nu$ is absolutely continuous with respect to $\mu$, denoted $\nu \ll \mu$, if $\mu(A) = 0$ implies that $\nu(A) = 0$ for any measurable set $A \subseteq \mathcal{Z}$. Among the plethora of distances between probability measures, such as the total variation and Wasserstein distances, a particularly popular choice is the family of $f$-divergences, which has been extensively studied in the context of distributionally robust optimization (Ben-Tal et al., 2013; Lam, 2016; Duchi & Namkoong, 2019; Duchi et al., 2021). In this paper, we focus on the Hellinger distance, which is a particular type of $f$-divergence.

**Definition 2.1** (Hellinger distance). Let $P, Q \in \mathcal{P}(\mathcal{Z})$ be probability measures on $\mathcal{Z}$ that are absolutely continuous with respect to a reference measure $\mu$, i.e., $P, Q \ll \mu$. The Hellinger distance between $P$ and $Q$ is defined as

$$H(P, Q) := \left( \frac{1}{2} \int_{\mathcal{Z}} \left( \sqrt{p(z)} - \sqrt{q(z)} \right)^2 \mathrm{d}\mu(z) \right)^{1/2} \tag{3}$$

where $p = \frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $q = \frac{\mathrm{d}Q}{\mathrm{d}\mu}$ are the Radon-Nikodym derivatives of $P$ and $Q$ with respect to $\mu$. The Hellinger distance is independent of the choice of the reference measure $\mu$.

The Hellinger distance is bounded, with values in $[0, 1]$: $H(P, Q) = 0$ if and only if $P = Q$, and the maximum value of 1 is attained when $P$ and $Q$ have disjoint supports. Furthermore, $H$ defines a metric on the space of probability measures and hence satisfies the triangle inequality.
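For intuition, these properties are easy to verify numerically in the discrete case, where the reference measure $\mu$ can be taken to be the counting measure. The following sketch (our illustration; the distributions are made up) implements Definition 2.1 for discrete distributions and checks the boundary cases and the triangle inequality:

```python
import numpy as np

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    """Hellinger distance between two discrete distributions.

    Discrete instance of Definition 2.1 with the counting measure as the
    reference measure: H(P, Q) = sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2)),
    which equals sqrt(1 - sum(sqrt(p * q))) (the Bhattacharyya-coefficient form).
    """
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

P = np.array([0.8, 0.2, 0.0])
Q = np.array([0.6, 0.2, 0.2])
R = np.array([0.0, 0.0, 1.0])  # support disjoint from P

print(hellinger(P, P))  # 0.0: identical distributions
print(hellinger(P, R))  # 1.0: disjoint supports attain the maximum
print(hellinger(P, R) <= hellinger(P, Q) + hellinger(Q, R))  # True (triangle inequality)
```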
We will now show how the Hellinger distance can be expressed in terms of an inner product between elements of a suitable Hilbert space, which ultimately enables us to use the theory of Gram matrices to derive an upper bound on the worst-case population risk (1) for uncertainty sets given by Hellinger balls. Consider the Hilbert space $L^2(\mathcal{Z}, \Sigma, \mu)$² of square-integrable functions $f : \mathcal{Z} \to \mathbb{R}$, endowed with the inner product $\langle f, g \rangle = \int_{\mathcal{Z}} fg \, \mathrm{d}\mu$. Within this space, we can identify any probability distribution $P \ll \mu$ with a unit vector $\psi_P \in L^2(\mathcal{Z}, \Sigma, \mu)$ via the square root of its Radon-Nikodym derivative, $\psi_P := \sqrt{\mathrm{d}P/\mathrm{d}\mu}$. This mapping enables us to write the Hellinger distance and, more generally, expectation values in terms of inner products. To see this, note that for any two probability measures $P, Q$ on $\mathcal{Z}$, it holds that

$$\langle \psi_P, \psi_Q \rangle = \int_{\mathcal{Z}} \sqrt{pq} \, \mathrm{d}\mu = 1 - H^2(P, Q) \tag{4}$$

and similarly, for any essentially bounded function $f \in L^\infty$, we have

$$\mathbb{E}_P[f(Z)] = \int_{\mathcal{Z}} f(z) \, \mathrm{d}P(z) = \langle \psi_P, f\psi_P \rangle \tag{5}$$

where the product $(f\psi_P)(z) = f(z)\,\psi_P(z)$ is to be understood as pointwise multiplication³.

²We take $\Sigma$ to be the Borel $\sigma$-algebra on $\mathcal{Z}$, being the smallest $\sigma$-algebra containing all open sets on $\mathcal{Z}$.

³More precisely, every $f \in L^\infty(\mathcal{Z}, \Sigma, \mu)$ defines a bounded linear operator $M_f : L^2 \to L^2$ acting on elements of $L^2$ via pointwise multiplication, $g \mapsto M_f(g) := f \cdot g$ with $(f \cdot g)(z) = f(z) \cdot g(z)$ for any $z \in \mathcal{Z}$.

For $f \in L^\infty$, consider the Gram matrix of the Hilbert space elements $\psi_Q$, $\psi_P$ and $f\psi_P$, defined as

$$G = \begin{pmatrix} 1 & \langle \psi_Q, \psi_P \rangle & \langle \psi_Q, f\psi_P \rangle \\ \langle \psi_P, \psi_Q \rangle & 1 & \langle \psi_P, f\psi_P \rangle \\ \langle f\psi_P, \psi_Q \rangle & \langle f\psi_P, \psi_P \rangle & \langle f\psi_P, f\psi_P \rangle \end{pmatrix}. \tag{6}$$

The crucial observation is that $G$ is positive semidefinite and thus has a non-negative determinant, which can be viewed as a second-degree polynomial $\pi(x)$ evaluated at $x = \langle \psi_Q, f\psi_P \rangle$:

$$\det(G) = \pi(x)\big|_{x = \langle \psi_Q, f\psi_P \rangle} \tag{7}$$

where $\pi(x) = ax^2 + bx + c$ is a polynomial with coefficients

$$a = -1, \qquad b = 2 \langle \psi_P, \psi_Q \rangle \langle \psi_P, f\psi_P \rangle, \qquad c = \left(1 - |\langle \psi_P, \psi_Q \rangle|^2\right) \langle f\psi_P, f\psi_P \rangle - \langle f\psi_P, \psi_P \rangle^2. \tag{8}$$

The non-negativity of $\det(G)$ implies that $\pi(x = \langle \psi_Q, f\psi_P \rangle) \ge 0$ and thus effectively restricts the values which $\langle \psi_Q, f\psi_P \rangle$ can take to lie between the roots of $\pi$, so that

$$\frac{b}{2} - \sqrt{\frac{b^2}{4} + c} \;\le\; \langle \psi_Q, f\psi_P \rangle \;\le\; \frac{b}{2} + \sqrt{\frac{b^2}{4} + c}. \tag{9}$$

For positive functions $f \ge 0$, we can upper bound $\langle \psi_Q, f\psi_P \rangle$ by $\sqrt{\mathbb{E}_Q[f(Z)]\,\mathbb{E}_P[f(Z)]}$ via the Cauchy-Schwarz inequality; combined with the lower bound in (9), this yields a lower bound on the expectation of $f$ under $Q$. Taking as our function $f$ the loss function of interest, $f := \ell$, under the assumption that $\sup_{z \in \mathcal{Z}} |\ell(z)| \le M$ for some $M > 0$, we can finally recast this lower bound as an upper bound on the expectation of $\ell$ under $Q$, by applying the argument to the non-negative function $M - \ell$. Taking the supremum with respect to $Q$ leads to a bound on the worst-case risk (1). We remark that in this way, we obtain both lower and upper bounds on the expected loss. As we will show, these bounds can be used to bound useful statistics, such as the accuracy or the AUC score used in binary classification. In the following theorem, we state our main result as an upper bound to the worst-case risk (1) and refer the reader to Appendix A.2 for the analogous lower bound.

**Theorem 2.2.** Let $\ell : \mathcal{Z} \to \mathbb{R}_+$ be a loss function and suppose that $\sup_{z \in \mathcal{Z}} |\ell(z)| \le M$ for some $M > 0$. Then, for any probability measure $P$ on $\mathcal{Z}$ and $\rho > 0$, we have

$$\sup_{Q \in B_\rho(P)} \mathbb{E}_Q[\ell(Z)] \le \mathbb{E}_P[\ell(Z)] + 2 C_\rho \sqrt{\mathbb{V}_P[\ell(Z)]} + \rho^2 (2 - \rho^2) \left( M - \mathbb{E}_P[\ell(Z)] - \frac{\mathbb{V}_P[\ell(Z)]}{M - \mathbb{E}_P[\ell(Z)]} \right) \tag{10}$$

where $C_\rho = \sqrt{\rho^2 (1 - \rho^2)^2 (2 - \rho^2)}$ and $B_\rho(P) = \{Q \in \mathcal{P}(\mathcal{Z}) : H(P, Q) \le \rho\}$ is the Hellinger ball of radius $\rho$ centered at $P$. The radius $\rho$ is required to be small enough such that

$$\rho^2 \le 1 - \left( 1 + \frac{(M - \mathbb{E}_P[\ell(Z)])^2}{\mathbb{V}_P[\ell(Z)]} \right)^{-1/2}.$$

We refer the reader to Appendix A.1 for a full proof and now make some general observations about this result.
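Before turning to these observations, note that the right-hand side of (10) is a closed-form expression in four scalars, so the certificate itself is essentially free to evaluate once the loss statistics are known. The following sketch (ours; the numbers are illustrative, and in practice the mean and variance must be estimated from samples as in Section 3.1) evaluates the bound together with its radius condition:

```python
import numpy as np

def certified_upper_bound(mean: float, var: float, M: float, rho: float) -> float:
    """Evaluate the right-hand side of (10): an upper bound on the expected
    loss over all Q within Hellinger distance rho of P.

    `mean` and `var` are E_P[loss] and V_P[loss], `M` bounds the loss from
    above, and `rho` is the Hellinger radius.
    """
    gap = M - mean
    # Radius condition of Theorem 2.2; beyond it we fall back to the
    # trivial bound E_Q[loss] <= M.
    if rho**2 > 1.0 - 1.0 / np.sqrt(1.0 + gap**2 / var):
        return M
    c_rho = np.sqrt(rho**2 * (1.0 - rho**2) ** 2 * (2.0 - rho**2))
    return mean + 2.0 * c_rho * np.sqrt(var) + rho**2 * (2.0 - rho**2) * (gap - var / gap)

# Example: a 0-1 loss (M = 1) with a 5% in-domain error rate, whose variance
# is 0.05 * 0.95 = 0.0475, certified at Hellinger radius rho = 0.05.
print(certified_upper_bound(mean=0.05, var=0.0475, M=1.0, rho=0.05))  # ~0.085
```

For these illustrative numbers, the certified out-of-domain error is roughly 8.5%, far below the vacuous bound of 100%, and it degrades gracefully as $\rho$ grows.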
The bound (10) presents a pointwise guarantee in the sense that it upper bounds the distributional worst-case risk for a particular model $\ell(\cdot)$. This is in contrast to bounds which hold uniformly over an entire model class and introduce complexity measures, such as covering numbers and the VC-dimension, which are hard to compute for many practical problems. Other techniques which yield a pointwise robustness certificate of the form (10) typically express the uncertainty set as a Wasserstein ball around the distribution $P$ (Sinha et al., 2018; Shafieezadeh-Abadeh et al., 2019; Blanchet & Murthy, 2019; Cranko et al., 2021) and require the model $\ell$ to be sufficiently smooth. For example, the certificate presented in (Sinha et al., 2018) can only be tractably computed for small neural networks, for which one can upper bound their smoothness by bounding the Lipschitz constant of their gradients. For more general and large-scale neural networks, these bounds quickly become intractable and/or lead to vacuous certificates. For example, it is known that computing the Lipschitz constant of neural networks with ReLU activations is NP-hard (Virmaux & Scaman, 2018).

Secondly, we emphasize that our bound (10) is *faithful*, in the sense that, as the radius approaches zero, $\rho \to 0$, the bound converges towards the true expectation $\mathbb{E}_P[\ell(Z)]$. This is of course desirable for any such bound, as it indicates that any intrinsic gap vanishes as the covered distributions become increasingly close to the reference distribution $P$.

A third observation is that the bound (10) is monotonically increasing in the variance, indicating that low-variance models exhibit better generalization properties, which can be seen in light of the bias-variance tradeoff. More specifically, from the form our bound (10) takes, we see that minimizing the variance-regularized objective $L(\theta) = \mathbb{E}_{Z \sim P}[\ell_\theta(Z)] + \lambda \mathbb{V}_{Z \sim P}[\ell_\theta(Z)]$ effectively amounts to minimizing an upper bound on the worst-case risk. Indeed, various recent works have highlighted the connection between variance regularization and generalization (Lam, 2016; Maurer & Pontil, 2009; Gotoh et al., 2018; Duchi & Namkoong, 2019), and our result provides further evidence for this observation.

3. Certifying Out-of-domain Generalization

Taking advantage of our weak assumptions on the loss functions and models, we now apply our framework to the problem of certifying the out-of-domain generalization performance of a given classifier, when measured in terms of different loss functions. In practice, one is typically only given a finite sample $Z_1, \ldots, Z_n$ from the in-domain distribution $P$, and the bound (10) needs to be estimated empirically. To address this problem, our next step is to present a finite-sampling version of the bound (10) which holds with arbitrarily high probability over the distribution $P$. Second, we instantiate our results with specific distribution shifts, namely shifts in the label distribution and shifts which only affect the covariates. Finally, we highlight specific loss and score functions and show how our result can be applied to certify the out-of-domain generalization of these functions.

3.1. Finite Sample Results

Let $Z_1, \ldots, Z_n \overset{\mathrm{iid}}{\sim} P$ be an independent and identically distributed sample from the in-domain distribution $P$.
One immediate way to use our bound would be to construct the empirical distribution $\hat{P}_n$ and consider the worst-case risk over distributions $Q \in B_\rho(\hat{P}_n)$, while computing the bound on the right-hand side of (10) with the empirical mean and unbiased sample variance. However, for $\rho < 1$, the Hellinger ball $B_\rho(\hat{P}_n)$ will in general only contain distributions with discrete support, since any continuous distribution $Q$ has distance 1 from $\hat{P}_n$. We therefore take another path and make use of concentration inequalities for the population mean and variance, in order to obtain statistically sound guarantees which hold with arbitrarily high probability. To achieve this, we bound the expectation value via Hoeffding's inequality (Hoeffding, 1963), and the population variance via a bound presented in (Maurer & Pontil, 2009). In a second step, we use the union bound as a means to bound both variance and expectation simultaneously with high probability. We leave the derivation and proof to Appendix 3.1. These ingredients lead to the finite-sampling-based version of Theorem 2.2, which we state in the following corollary.

**Corollary 3.1** (Finite-sampling bound). Let $Z_1, \ldots, Z_n$ be independent random variables drawn from $P$ and taking values in $\mathcal{Z}$. For a loss function $\ell : \mathcal{Z} \to [0, M]$, let $\hat{L}_n := \frac{1}{n} \sum_{i=1}^n \ell(Z_i)$ be the empirical mean and $S_n^2 := \frac{1}{n(n-1)} \sum_{1 \le i < j \le n} (\ell(Z_i) - \ell(Z_j))^2$ the unbiased sample variance.
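Since the statement of Corollary 3.1 is abbreviated above, the following sketch (ours) should be read as an illustration of the recipe just described rather than of the corollary's exact constants: it upper-bounds the population mean via Hoeffding's inequality and the population standard deviation via the bound of Maurer & Pontil (2009), both in their standard forms for losses in $[0, M]$, splits the failure probability $\delta$ across the two events with a union bound, and plugs the resulting upper confidence bounds into (10), reusing `certified_upper_bound` from the sketch after Theorem 2.2:

```python
import numpy as np

def empirical_certificate(losses: np.ndarray, M: float, rho: float,
                          delta: float = 0.05) -> float:
    """Hedged sketch of a finite-sample certificate that holds with
    probability at least 1 - delta over the draw of the sample."""
    n = len(losses)
    mean_hat = losses.mean()
    var_hat = losses.var(ddof=1)  # unbiased sample variance S_n^2
    # Hoeffding: E_P[loss] <= mean_hat + M * sqrt(log(2/delta) / (2n)),
    # with probability at least 1 - delta/2.
    mean_ub = mean_hat + M * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    # Maurer & Pontil (2009): sqrt(V_P[loss]) <= S_n + M * sqrt(2 log(2/delta) / (n - 1)),
    # with probability at least 1 - delta/2.
    std_ub = np.sqrt(var_hat) + M * np.sqrt(2.0 * np.log(2.0 / delta) / (n - 1))
    if mean_ub >= M:
        return M  # sample too small for a non-vacuous certificate
    # Within its validity region, the bound (10) is non-decreasing in both the
    # mean and the variance, so plugging in upper confidence bounds is sound.
    return certified_upper_bound(mean_ub, std_ub**2, M, rho)

# Example: 10,000 simulated 0-1 losses with a 5% in-domain error rate.
rng = np.random.default_rng(0)
losses = (rng.random(10_000) < 0.05).astype(float)
print(empirical_certificate(losses, M=1.0, rho=0.05))  # roughly 0.10
```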