# Certifying Out-of-Domain Generalization for Blackbox Functions

Maurice Weber¹, Linyi Li², Boxin Wang², Zhikuan Zhao¹, Bo Li², Ce Zhang¹

¹Department of Computer Science, ETH Zurich. ²UIUC, USA. Correspondence to: Maurice Weber, Ce Zhang.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Certifying the robustness of model performance under bounded data distribution drifts has recently attracted intensive interest under the umbrella of distributional robustness. However, existing techniques either make strong assumptions on the model class and loss functions that can be certified, such as smoothness expressed via Lipschitz continuity of gradients, or require solving complex optimization problems. As a result, the wider application of these techniques is currently limited by their scalability and flexibility: these techniques often do not scale to large-scale datasets with modern deep neural networks, or cannot handle loss functions which may be non-smooth, such as the 0-1 loss. In this paper, we focus on the problem of certifying distributional robustness for blackbox models and bounded loss functions, and propose a novel certification framework based on the Hellinger distance. Our certification technique scales to ImageNet-scale datasets, complex models, and a diverse set of loss functions. We then focus on one specific application enabled by such scalability and flexibility, i.e., certifying out-of-domain generalization for large neural networks and loss functions such as accuracy and AUC. We experimentally validate our certification method on a number of datasets, ranging from ImageNet, where we provide the first non-vacuous certified out-of-domain generalization, to smaller classification tasks, where we are able to compare with the state of the art and show that our method performs considerably better.

1. Introduction

The wide application of machine learning models in the real world brings an emerging challenge of understanding the performance of a machine learning model under different data distributions: ML systems operating autonomous vehicles which are trained on data collected in the northern hemisphere might fail when deployed in desert-like environments or under different weather conditions (Volk et al., 2019; Dai & Van Gool, 2018), while recognition systems have been shown to fail when deployed in new environments (Beery et al., 2018). Similar concerns also apply to many mission-critical applications such as medicine and cyber-security (Koh et al., 2021; AlBadawy et al., 2018; Gulrajani & Lopez-Paz, 2021). In all these applications, it is imperative to have a sound understanding of the model's robustness and possible failure cases in the presence of a shift in the data distribution, and to have corresponding guarantees on the performance.

Recently, this problem has attracted intensive interest under the umbrella of distributional robustness (Scarf, 1958; Ben-Tal et al., 2013; Gao & Kleywegt, 2016; Kuhn et al., 2019; Blanchet & Murthy, 2019; Duchi et al., 2021). Specifically, let $P$ be a joint data distribution over features $X \in \mathcal{X}$ and labels $Y \in \mathcal{Y}$, and let $h_\theta : \mathcal{X} \to \mathcal{Y}$ be a machine learning model parameterized by $\theta$. For a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, we hope to compute

$$R_\theta(\mathcal{U}_P) := \sup_{Q \in \mathcal{U}_P} \mathbb{E}_{(X,Y) \sim Q}[\ell(h_\theta(X), Y)] \tag{1}$$

where $\mathcal{U}_P \subseteq \mathcal{P}(\mathcal{Z})$ is a set of probability distributions on $\mathcal{Z}$, called the uncertainty set.
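As a concrete illustration of the quantity in (1), the following minimal sketch (our own toy example; the outcome space, per-outcome losses, and candidate distributions are all invented for illustration) evaluates the worst-case risk over a small, finite uncertainty set of discrete distributions:

```python
import numpy as np

# Illustrative setup: a discrete sample space with three outcomes and a
# fixed per-outcome loss, e.g. the 0-1 loss of a frozen classifier h_theta.
losses = np.array([0.0, 1.0, 1.0])

# A toy finite uncertainty set U_P: candidate distributions Q "near" P.
P = np.array([0.8, 0.1, 0.1])
uncertainty_set = [
    P,
    np.array([0.7, 0.2, 0.1]),
    np.array([0.6, 0.2, 0.2]),
]

# Worst-case risk (1): the largest expected loss over all Q in U_P.
worst_case_risk = max(float(q @ losses) for q in uncertainty_set)
print(worst_case_risk)  # 0.4, attained by the most shifted distribution
```

In realistic settings, $\mathcal{U}_P$ is of course an infinite family of distributions, which is precisely what makes (1) hard to compute and motivates closed-form certificates.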
Intuitively, (1) measures the worst-case risk of $h_\theta$ when the data distribution drifts from $P$ to another distribution in $\mathcal{U}_P$. Providing a technical solution to this problem has gained increased attention over the years, as summarized in Table 1. However, most, if not all, existing approaches impose strong constraints, such as bounded Lipschitz gradients, on both $h$ and $\ell$, and rely on expensive certification methods such as direct minimax optimization. As a result, these methods have been applied only to small-scale datasets and ML models. In this paper, we consider the case where both $h$ and $\ell$ can be non-convex and non-smooth: $h$ can be a full-fledged neural network, e.g., ImageNet-scale EfficientNet-B7 (Tan & Le, 2019), and $\ell$ can be a general non-smooth loss function such as the 0-1 loss. We provide, to the best of our knowledge, the first practical method for blackbox functions that scales to real-world, ImageNet-scale neural networks and datasets.

| Ref. | Assumptions on $\ell$ | Assumption on $h$ | Distance | Largest Dataset |
| --- | --- | --- | --- | --- |
| (Gao & Kleywegt, 2016) | Generalised Lipschitz Continuity | — | Wasserstein | — |
| (Sinha et al., 2018) | Bounded, Smoothness | Smoothness | Wasserstein | MNIST |
| (Staib & Jegelka, 2019) | Bounded, Continuous | Kernel Methods | MMD | — |
| (Shafieezadeh-Abadeh et al., 2019) | Lipschitz Continuity | — | Wasserstein | — |
| (Blanchet & Murthy, 2019) | Bounded, Smoothness | Smoothness | Wasserstein | — |
| (Cranko et al., 2021) | Generalised Lipschitz Continuity | — | Wasserstein | — |
| Our Method | Bounded | any Blackbox | Hellinger | ImageNet |

Table 1. Current landscape of certified distributional robustness.

Our key innovation is a novel algorithmic framework that arises from bounding inner products between elements of a suitable Hilbert space. Specifically, we can characterize the upper bound on the performance of $h$ on any $Q$ within the uncertainty set as a function of the Hellinger distance, a specific type of $f$-divergence, and the expectation and variance of the loss of $h$ on $P$. We then apply our framework to the problem of certifying the out-of-domain generalization performance of a given classifier, taking advantage of its scalability and flexibility. Specifically, let $P$ be the in-domain distribution, and $h_\theta$ a classifier. Then, to reason about the performance of $h_\theta$ on shifted distributions $Q$, we provide a certificate of the following form:

$$\forall\, Q : \mathrm{dist}(Q, P) \le \rho \implies \mathbb{E}_{(X,Y) \sim Q}[\ell(h(X), Y)] \le C_\ell(\rho, P) \tag{2}$$

where $C_\ell$ is a bound which depends on the distance $\rho$ and the distribution $P$. This requires several nontrivial instantiations of our framework with careful practical considerations. To this end, we first develop a certification algorithm that relies only on a finite set of samples from the in-domain distribution $P$. Moreover, we also instantiate it with different domain drift models, such as label drift and covariate drift, connecting the general Hellinger distance to the degree of domain drift specific to these scenarios. We then consider a diverse range of loss functions, including the JSD loss, the 0-1 loss, and AUC. To the best of our knowledge, we provide the first certificate for such diverse realistic scenarios which is able to scale to large problems. Last but not least, we conduct intensive experiments verifying the efficiency and effectiveness of our result. Our method is able to scale to datasets and neural networks as large as ImageNet and full-fledged models like EfficientNet-B7 and BERT. We further apply our method on smaller-scale datasets, in order to compare with strong, state-of-the-art methods.
We show that our method provides much tighter certificates. Our contributions can be summarized as follows:

- We present a novel framework which provides a non-vacuous, computationally tractable bound on the distributionally robust worst-case risk $R_\theta(\mathcal{U}_P)$ for general bounded loss functions $\ell$ and models $h$.
- We apply this framework to the problem of certifying out-of-domain generalization for blackbox functions and provide a means to certify distributional robustness in specific scenarios such as label and covariate drifts.
- We provide an extensive experimental study of our approach on a wide range of datasets, including the large-scale ImageNet (Russakovsky et al., 2015) dataset, as well as NLP datasets with complex models.

2. Distributional Robustness for Blackbox Functions

In this section, we present our main results, namely a computationally tractable upper bound on the worst-case risk (1) for uncertainty sets expressed in terms of Hellinger balls around the data-generating distribution $P$. The technique is based on the non-negativity of Gram matrices which, by expressing expectation values as inner products between elements of a suitable Hilbert space, can be leveraged to relate expectation values of a blackbox function under different probability distributions $P$ and $Q$.¹ We describe the underlying technique leading to our main result in Theorem 2.2, which upper bounds the worst-case population loss using both the expectation and variance.

¹The idea behind our methods is inspired by how Gram matrices are used in quantum chemistry (Weinhold, 1968; Weber et al., 2021) to bound expectation values of quantum observables. However, the adaptation to machine learning is nontrivial and requires careful analysis.

For the remainder of this section, to simplify notation and maintain generality, we consider generic loss functions $\ell : \mathcal{Z} \to \mathbb{R}_+$ which contain the model $h$ and take inputs from a generic input space $\mathcal{Z}$. For example, in the context of supervised learning, $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ can be the product space of features and labels, and the loss $\ell(z) = \ell(h_\theta(x), y)$ can be seen as a composition of the loss function $\ell$ and the model $h_\theta$. We denote the set of probability measures on the space $\mathcal{Z}$ by $\mathcal{P}(\mathcal{Z})$. For two measures $\mu, \nu$ on $\mathcal{Z}$, we say that $\nu$ is absolutely continuous with respect to $\mu$, denoted $\nu \ll \mu$, if $\mu(A) = 0$ implies that $\nu(A) = 0$ for any measurable set $A \subseteq \mathcal{Z}$. Among the plethora of distances between probability measures, such as the total variation and Wasserstein distances, a particularly popular choice is the family of $f$-divergences, which has been extensively studied in the context of distributionally robust optimization (Ben-Tal et al., 2013; Lam, 2016; Duchi & Namkoong, 2019; Duchi et al., 2021). In this paper, we focus on the Hellinger distance, which is a particular type of $f$-divergence.

**Definition 2.1** (Hellinger distance). Let $P, Q \in \mathcal{P}(\mathcal{Z})$ be probability measures on $\mathcal{Z}$ that are absolutely continuous with respect to a reference measure $\mu$, i.e., $P, Q \ll \mu$. The Hellinger distance between $P$ and $Q$ is defined as

$$H(P, Q) := \left( \frac{1}{2} \int_{\mathcal{Z}} \left( \sqrt{p(z)} - \sqrt{q(z)} \right)^2 \mathrm{d}\mu(z) \right)^{1/2} \tag{3}$$

where $p = \frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $q = \frac{\mathrm{d}Q}{\mathrm{d}\mu}$ are the Radon-Nikodym derivatives of $P$ and $Q$ with respect to $\mu$. The Hellinger distance is independent of the choice of the reference measure $\mu$.

The Hellinger distance is bounded, with values in $[0, 1]$: $H(P, Q) = 0$ if and only if $P = Q$, and the maximum value of 1 is attained when $P$ and $Q$ have disjoint supports. Furthermore, $H$ defines a metric on the space of probability measures and hence satisfies the triangle inequality.
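For intuition, these properties are easy to verify numerically in the discrete case, where the reference measure $\mu$ can be taken to be the counting measure. The following sketch (our illustration; the distributions are made up) implements Definition 2.1 for discrete distributions and checks the boundary cases and the triangle inequality:

```python
import numpy as np

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    """Hellinger distance between two discrete distributions.

    Discrete instance of Definition 2.1 with the counting measure as the
    reference measure: H(P, Q) = sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2)),
    which equals sqrt(1 - sum(sqrt(p * q))) (the Bhattacharyya-coefficient form).
    """
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

P = np.array([0.8, 0.2, 0.0])
Q = np.array([0.6, 0.2, 0.2])
R = np.array([0.0, 0.0, 1.0])  # support disjoint from P

print(hellinger(P, P))  # 0.0: identical distributions
print(hellinger(P, R))  # 1.0: disjoint supports attain the maximum
print(hellinger(P, R) <= hellinger(P, Q) + hellinger(Q, R))  # True (triangle inequality)
```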
We will now show how the Hellinger distance can be expressed in terms of an inner product between elements of a suitable Hilbert space, which ultimately enables us to use the theory of Gram matrices to derive an upper bound on the worst-case population risk (1) for uncertainty sets given by Hellinger balls. Consider the Hilbert space $L^2(\mathcal{Z}, \Sigma, \mu)$² of square-integrable functions $f : \mathcal{Z} \to \mathbb{R}$, endowed with the inner product $\langle f, g \rangle = \int_{\mathcal{Z}} fg \, \mathrm{d}\mu$. Within this space, we can identify any probability distribution $P \ll \mu$ with a unit vector $\psi_P \in L^2(\mathcal{Z}, \Sigma, \mu)$ via the square root of its Radon-Nikodym derivative, $\psi_P := \sqrt{\mathrm{d}P/\mathrm{d}\mu}$. This mapping enables us to write the Hellinger distance and, more generally, expectation values in terms of inner products. To see this, note that for any two probability measures $P, Q$ on $\mathcal{Z}$, it holds that

$$\langle \psi_P, \psi_Q \rangle = \int_{\mathcal{Z}} \sqrt{pq} \, \mathrm{d}\mu = 1 - H^2(P, Q) \tag{4}$$

and similarly, for any essentially bounded function $f \in L^\infty$, we have

$$\mathbb{E}_P[f(Z)] = \int_{\mathcal{Z}} f(z) \, \mathrm{d}P(z) = \langle \psi_P, f\psi_P \rangle \tag{5}$$

where the product $(f\psi_P)(z) = f(z)\,\psi_P(z)$ is to be understood as pointwise multiplication³.

²We take $\Sigma$ to be the Borel $\sigma$-algebra on $\mathcal{Z}$, being the smallest $\sigma$-algebra containing all open sets on $\mathcal{Z}$.

³More precisely, every $f \in L^\infty(\mathcal{Z}, \Sigma, \mu)$ defines a bounded linear operator $M_f : L^2 \to L^2$ acting on elements of $L^2$ via pointwise multiplication, $g \mapsto M_f(g) := f \cdot g$ with $(f \cdot g)(z) = f(z) \cdot g(z)$ for any $z \in \mathcal{Z}$.

For $f \in L^\infty$, consider the Gram matrix of the Hilbert space elements $\psi_Q$, $\psi_P$ and $f\psi_P$, defined as

$$G = \begin{pmatrix} 1 & \langle \psi_Q, \psi_P \rangle & \langle \psi_Q, f\psi_P \rangle \\ \langle \psi_P, \psi_Q \rangle & 1 & \langle \psi_P, f\psi_P \rangle \\ \langle f\psi_P, \psi_Q \rangle & \langle f\psi_P, \psi_P \rangle & \langle f\psi_P, f\psi_P \rangle \end{pmatrix}. \tag{6}$$

The crucial observation is that $G$ is positive semidefinite and thus has a non-negative determinant, which can be viewed as a second-degree polynomial $\pi(x)$ evaluated at $x = \langle \psi_Q, f\psi_P \rangle$:

$$\det(G) = \pi(x)\big|_{x = \langle \psi_Q, f\psi_P \rangle} \tag{7}$$

where $\pi(x) = ax^2 + bx + c$ is a polynomial with coefficients

$$a = -1, \qquad b = 2 \langle \psi_P, \psi_Q \rangle \langle \psi_P, f\psi_P \rangle, \qquad c = \left(1 - |\langle \psi_P, \psi_Q \rangle|^2\right) \langle f\psi_P, f\psi_P \rangle - \langle f\psi_P, \psi_P \rangle^2. \tag{8}$$

The non-negativity of $\det(G)$ implies that $\pi(x = \langle \psi_Q, f\psi_P \rangle) \ge 0$ and thus effectively restricts the values which $\langle \psi_Q, f\psi_P \rangle$ can take to lie between the roots of $\pi$, so that

$$\frac{b}{2} - \sqrt{\frac{b^2}{4} + c} \;\le\; \langle \psi_Q, f\psi_P \rangle \;\le\; \frac{b}{2} + \sqrt{\frac{b^2}{4} + c}. \tag{9}$$

For positive functions $f \ge 0$, we can upper bound $\langle \psi_Q, f\psi_P \rangle$ by $\sqrt{\mathbb{E}_Q[f(Z)]\,\mathbb{E}_P[f(Z)]}$ via the Cauchy-Schwarz inequality; combined with the lower bound in (9), this yields a lower bound on the expectation of $f$ under $Q$. Taking as our function $f$ the loss function of interest, $f := \ell$, under the assumption that $\sup_{z \in \mathcal{Z}} |\ell(z)| \le M$ for some $M > 0$, we can finally recast this lower bound as an upper bound on the expectation of $\ell$ under $Q$, by applying the argument to the non-negative function $M - \ell$. Taking the supremum with respect to $Q$ leads to a bound on the worst-case risk (1). We remark that in this way, we obtain both lower and upper bounds on the expected loss. As we will show, these bounds can be used to bound useful statistics, such as the accuracy or the AUC score used in binary classification. In the following theorem, we state our main result as an upper bound to the worst-case risk (1) and refer the reader to Appendix A.2 for the analogous lower bound.

**Theorem 2.2.** Let $\ell : \mathcal{Z} \to \mathbb{R}_+$ be a loss function and suppose that $\sup_{z \in \mathcal{Z}} |\ell(z)| \le M$ for some $M > 0$. Then, for any probability measure $P$ on $\mathcal{Z}$ and $\rho > 0$, we have

$$\sup_{Q \in B_\rho(P)} \mathbb{E}_Q[\ell(Z)] \le \mathbb{E}_P[\ell(Z)] + 2 C_\rho \sqrt{\mathbb{V}_P[\ell(Z)]} + \rho^2 (2 - \rho^2) \left( M - \mathbb{E}_P[\ell(Z)] - \frac{\mathbb{V}_P[\ell(Z)]}{M - \mathbb{E}_P[\ell(Z)]} \right) \tag{10}$$

where $C_\rho = \sqrt{\rho^2 (1 - \rho^2)^2 (2 - \rho^2)}$ and $B_\rho(P) = \{Q \in \mathcal{P}(\mathcal{Z}) : H(P, Q) \le \rho\}$ is the Hellinger ball of radius $\rho$ centered at $P$. The radius $\rho$ is required to be small enough such that

$$\rho^2 \le 1 - \left( 1 + \frac{(M - \mathbb{E}_P[\ell(Z)])^2}{\mathbb{V}_P[\ell(Z)]} \right)^{-1/2}.$$

We refer the reader to Appendix A.1 for a full proof and now make some general observations about this result.
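Before turning to these observations, note that the right-hand side of (10) is a closed-form expression in four scalars, so the certificate itself is essentially free to evaluate once the loss statistics are known. The following sketch (ours; the numbers are illustrative, and in practice the mean and variance must be estimated from samples as in Section 3.1) evaluates the bound together with its radius condition:

```python
import numpy as np

def certified_upper_bound(mean: float, var: float, M: float, rho: float) -> float:
    """Evaluate the right-hand side of (10): an upper bound on the expected
    loss over all Q within Hellinger distance rho of P.

    `mean` and `var` are E_P[loss] and V_P[loss], `M` bounds the loss from
    above, and `rho` is the Hellinger radius.
    """
    gap = M - mean
    # Radius condition of Theorem 2.2; beyond it we fall back to the
    # trivial bound E_Q[loss] <= M.
    if rho**2 > 1.0 - 1.0 / np.sqrt(1.0 + gap**2 / var):
        return M
    c_rho = np.sqrt(rho**2 * (1.0 - rho**2) ** 2 * (2.0 - rho**2))
    return mean + 2.0 * c_rho * np.sqrt(var) + rho**2 * (2.0 - rho**2) * (gap - var / gap)

# Example: a 0-1 loss (M = 1) with a 5% in-domain error rate, whose variance
# is 0.05 * 0.95 = 0.0475, certified at Hellinger radius rho = 0.05.
print(certified_upper_bound(mean=0.05, var=0.0475, M=1.0, rho=0.05))  # ~0.085
```

For these illustrative numbers, the certified out-of-domain error is roughly 8.5%, far below the vacuous bound of 100%, and it degrades gracefully as $\rho$ grows.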
The bound (10) presents a pointwise guarantee in the sense that it upper bounds the distributional worst-case risk for a particular model $\ell(\cdot)$. This is in contrast to bounds which hold uniformly over an entire model class and introduce complexity measures, such as covering numbers and the VC-dimension, which are hard to compute for many practical problems. Other techniques which yield a pointwise robustness certificate of the form (10) typically express the uncertainty set as a Wasserstein ball around the distribution $P$ (Sinha et al., 2018; Shafieezadeh-Abadeh et al., 2019; Blanchet & Murthy, 2019; Cranko et al., 2021) and require the model $\ell$ to be sufficiently smooth. For example, the certificate presented in (Sinha et al., 2018) can only be tractably computed for small neural networks, for which one can upper bound their smoothness by bounding the Lipschitz constant of their gradients. For more general and large-scale neural networks, these bounds quickly become intractable and/or lead to vacuous certificates. For example, it is known that computing the Lipschitz constant of neural networks with ReLU activations is NP-hard (Virmaux & Scaman, 2018).

Secondly, we emphasize that our bound (10) is *faithful*, in the sense that, as the radius approaches zero, $\rho \to 0$, the bound converges towards the true expectation $\mathbb{E}_P[\ell(Z)]$. This is of course desirable for any such bound, as it indicates that any intrinsic gap vanishes as the covered distributions become increasingly close to the reference distribution $P$.

A third observation is that the bound (10) is monotonically increasing in the variance, indicating that low-variance models exhibit better generalization properties, which can be seen in light of the bias-variance tradeoff. More specifically, from the form our bound (10) takes, we see that minimizing the variance-regularized objective $L(\theta) = \mathbb{E}_{Z \sim P}[\ell_\theta(Z)] + \lambda \mathbb{V}_{Z \sim P}[\ell_\theta(Z)]$ effectively amounts to minimizing an upper bound on the worst-case risk. Indeed, various recent works have highlighted the connection between variance regularization and generalization (Lam, 2016; Maurer & Pontil, 2009; Gotoh et al., 2018; Duchi & Namkoong, 2019), and our result provides further evidence for this observation.

3. Certifying Out-of-domain Generalization

Taking advantage of our weak assumptions on the loss functions and models, we now apply our framework to the problem of certifying the out-of-domain generalization performance of a given classifier, when measured in terms of different loss functions. In practice, one is typically only given a finite sample $Z_1, \ldots, Z_n$ from the in-domain distribution $P$, and the bound (10) needs to be estimated empirically. To address this problem, our next step is to present a finite-sampling version of the bound (10) which holds with arbitrarily high probability over the distribution $P$. Second, we instantiate our results with specific distribution shifts, namely shifts in the label distribution and shifts which only affect the covariates. Finally, we highlight specific loss and score functions and show how our result can be applied to certify the out-of-domain generalization of these functions.

3.1. Finite Sample Results

Let $Z_1, \ldots, Z_n \overset{\mathrm{iid}}{\sim} P$ be an independent and identically distributed sample from the in-domain distribution $P$.
One immediate way to use our bound would be to construct the empirical distribution $\hat{P}_n$ and consider the worst-case risk over distributions $Q \in B_\rho(\hat{P}_n)$, while computing the bound on the right-hand side of (10) with the empirical mean and unbiased sample variance. However, for $\rho < 1$, the Hellinger ball $B_\rho(\hat{P}_n)$ will in general only contain distributions with discrete support, since any continuous distribution $Q$ has distance 1 from $\hat{P}_n$. We therefore take another path and make use of concentration inequalities for the population mean and variance, in order to obtain statistically sound guarantees which hold with arbitrarily high probability. To achieve this, we bound the expectation value via Hoeffding's inequality (Hoeffding, 1963), and the population variance via a bound presented in (Maurer & Pontil, 2009). In a second step, we use the union bound as a means to bound both variance and expectation simultaneously with high probability. We leave the derivation and proof to Appendix 3.1. These ingredients lead to the finite-sampling-based version of Theorem 2.2, which we state in the following corollary.

**Corollary 3.1** (Finite-sampling bound). Let $Z_1, \ldots, Z_n$ be independent random variables drawn from $P$ and taking values in $\mathcal{Z}$. For a loss function $\ell : \mathcal{Z} \to [0, M]$, let $\hat{L}_n := \frac{1}{n} \sum_{i=1}^n \ell(Z_i)$ be the empirical mean and $S_n^2 := \frac{1}{n(n-1)} \sum_{1 \le i < j \le n} (\ell(Z_i) - \ell(Z_j))^2$ the unbiased sample variance.
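Since the statement of Corollary 3.1 is abbreviated above, the following sketch (ours) should be read as an illustration of the recipe just described rather than of the corollary's exact constants: it upper-bounds the population mean via Hoeffding's inequality and the population standard deviation via the bound of Maurer & Pontil (2009), both in their standard forms for losses in $[0, M]$, splits the failure probability $\delta$ across the two events with a union bound, and plugs the resulting upper confidence bounds into (10), reusing `certified_upper_bound` from the sketch after Theorem 2.2:

```python
import numpy as np

def empirical_certificate(losses: np.ndarray, M: float, rho: float,
                          delta: float = 0.05) -> float:
    """Hedged sketch of a finite-sample certificate that holds with
    probability at least 1 - delta over the draw of the sample."""
    n = len(losses)
    mean_hat = losses.mean()
    var_hat = losses.var(ddof=1)  # unbiased sample variance S_n^2
    # Hoeffding: E_P[loss] <= mean_hat + M * sqrt(log(2/delta) / (2n)),
    # with probability at least 1 - delta/2.
    mean_ub = mean_hat + M * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    # Maurer & Pontil (2009): sqrt(V_P[loss]) <= S_n + M * sqrt(2 log(2/delta) / (n - 1)),
    # with probability at least 1 - delta/2.
    std_ub = np.sqrt(var_hat) + M * np.sqrt(2.0 * np.log(2.0 / delta) / (n - 1))
    if mean_ub >= M:
        return M  # sample too small for a non-vacuous certificate
    # Within its validity region, the bound (10) is non-decreasing in both the
    # mean and the variance, so plugging in upper confidence bounds is sound.
    return certified_upper_bound(mean_ub, std_ub**2, M, rho)

# Example: 10,000 simulated 0-1 losses with a 5% in-domain error rate.
rng = np.random.default_rng(0)
losses = (rng.random(10_000) < 0.05).astype(float)
print(empirical_certificate(losses, M=1.0, rho=0.05))  # roughly 0.10
```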