Published as a conference paper at ICLR 2022

OPTIMAL REPRESENTATIONS FOR COVARIATE SHIFT

Yangjun Ruan¹², Yann Dubois², Chris J. Maddison¹²
¹University of Toronto & ²Vector Institute
{yjruan,yanndubois,cmaddis}@cs.toronto.edu

ABSTRACT

Machine learning systems often experience a distribution shift between training and testing. In this paper, we introduce a simple variational objective whose optima are exactly the set of all representations on which risk minimizers are guaranteed to be robust to any distribution shift that preserves the Bayes predictor, e.g., covariate shifts. Our objective has two components. First, a representation must remain discriminative for the task, i.e., some predictor must be able to simultaneously minimize the source and target risk. Second, the representation's marginal support needs to be the same across source and target. We make this practical by designing self-supervised objectives that only use unlabelled data and augmentations to train robust representations. Our objectives give insights into the robustness of CLIP, and further improve CLIP's representations to achieve SOTA results on DomainBed.

1 INTRODUCTION

It is hard to build machine learning (ML) systems that are robust to distribution shifts between a source (train) and target (test) domain. One promising approach to domain generalization (DG) is learning robust representations from which predictors trained on the source must perform well on the target. In practice, however, no current DG methods for learning representations uniformly outperform empirical source-risk minimizers (ERM) (Gulrajani & Lopez-Paz, 2021). Furthermore, our theoretical understanding of DG is still lacking.
Specifically, while previous work has studied properties that would or would not imply robust representations (Ben-David et al., 2007; 2010a; Zhao et al., 2019; Johansson et al., 2019), the minimal set of achievable requirements for perfect DG is not yet known.

We introduce the first simple variational objective whose optima are exactly the set of all representations on which source risk minimizers are guaranteed to generalize across distribution shifts that preserve the Bayes predictor. We work in an idealized DG (IDG) setting, where we assume that a learner has access to the source population risk. Our variational characterization implies that it is both sufficient and necessary for optimal IDG that a representation: (a) remains discriminative for the learning task, i.e., there must exist predictors from the representation to the labels that can simultaneously minimize both source and target risk; and (b) keeps the support of its marginal distribution invariant to shifts.

This means that any optimal representation learning method must seek discriminative information about the target. Even worse, we prove that without access to some knowledge about the target, no representation learning algorithm can uniformly (over all target domains) outperform a constant representation, which may explain why DG methods struggle to outperform ERM.

We show, in theory and practice, how to overcome these challenges using only a large set of unlabeled examples and particular data augmentations that retain all discriminative information but minimal domain-specific information. Text descriptions of images are examples of such augmentations: they are informative for many downstream classification tasks, but they remove a lot of domain-specific information. With such augmentations, we design practical self-supervised learning (SSL) objectives for learning robust representations.
Our objectives give insights into the robustness of CLIP (Radford et al., 2021) over other SSL methods, and lead to improved CLIP-based representations that achieve state-of-the-art (SOTA) results on DomainBed (Gulrajani & Lopez-Paz, 2021). To summarize, we:

- provide minimal sufficient objectives whose optima achieve optimal DG under covariate shift;
- prove that it is impossible to learn useful representations without accessing target information;
- provide practical objectives to learn optimally robust representations using specific augmentations;
- obtain SOTA results on typical domain generalization benchmarks.¹

(Authors contributed equally.)
¹Our implementation is released at https://github.com/ryoungj/optdom.

2 BACKGROUND: DOMAIN GENERALIZATION AND REPRESENTATIONS

We are interested in predictions that are robust across distribution shifts. We formalize this using domain generalization (DG) language. Given a distribution $p_{X,Y \mid d_s}$ over inputs $x \in \mathcal{X}$ and labels $y \in \mathcal{Y}$ from the source domain $d_s \in \mathcal{D}$, we select a predictor $f : \mathcal{X} \to \Gamma$. The predictions $\gamma \in \Gamma$ could, for example, be labels or distributions over labels. Despite being selected on the source domain, we would like $f$ to achieve a small expected risk with respect to a loss function $\ell : \mathcal{Y} \times \Gamma \to \mathbb{R}_{\geq 0}$,

$$R_d^f[Y \mid X] := \mathbb{E}_{p_{X,Y \mid d}}[\ell(Y, f(X))], \qquad (1)$$

on a distribution $p_{X,Y \mid d}$ from a target domain $d = d_t \in \mathcal{D}$, which is somehow related to $d_s$.

A common strategy for DG is to learn robust representations, which splits the problem into two. First, learn an encoder $p_{Z \mid X}$, which maps inputs $X$ to representations $Z$. Then, learn a predictor $h : \mathcal{Z} \to \Gamma$ from representations $Z$ to labels $Y$ using standard risk minimization. The goal is to design a robust representation $Z$, so that predictors $h$ trained to minimize the source risk $R_{d_s}^h[Y \mid Z]$ also achieve low target risk $R_{d_t}^h[Y \mid Z]$.
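As a minimal illustration of this two-stage recipe (the data and the coordinate-selecting "encoders" below are our own synthetic stand-ins, not the paper's setup), the sketch fixes a frozen encoder, fits a least-squares predictor $h$ on the source domain only, and evaluates it on a mean-shifted target where the Bayes predictor $y = x_0$ is unchanged (covariate shift):

```python
import numpy as np

# Two-stage DG sketch: a frozen "encoder" selects coordinates of X, then a
# least-squares predictor h is fit on the source domain only and evaluated
# on a covariate-shifted target with the same Bayes predictor y = x_0.
rng = np.random.default_rng(0)
x_src = rng.normal(size=(400, 5))
y_src = x_src[:, 0]
x_tgt = rng.normal(size=(400, 5)) + 2.0        # shifted target inputs
y_tgt = x_tgt[:, 0]                            # invariant Bayes predictor

def target_risk(cols):
    """Fit h on source representations Z = X[:, cols]; report target MSE."""
    z_src, z_tgt = x_src[:, cols], x_tgt[:, cols]
    h, *_ = np.linalg.lstsq(z_src, y_src, rcond=None)
    return float(np.mean((z_tgt @ h - y_tgt) ** 2))

good = target_risk([0, 1, 2])   # Z keeps the Bayes-relevant coordinate
bad = target_risk([1, 2, 3])    # Z discards it: no predictor from Z works
```

With the discriminative representation the source risk minimizer transfers (near-zero target risk); discarding the Bayes-relevant coordinate leaves every predictor from $Z$ useless on the target, previewing the "discriminative" requirement of Sec. 3.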
Many methods have been proposed to try to learn such $Z$, e.g., by enforcing domain invariance of the marginal $p_{Z \mid d}$ (e.g., Ganin et al., 2016). Still, many of these proposals are not sound (Zhao et al., 2019; Johansson et al., 2019). Furthermore, they rarely outperform source empirical risk minimization (ERM) in practice (Gulrajani & Lopez-Paz, 2021).

3 OPTIMAL REPRESENTATIONS FOR DOMAIN GENERALIZATION

To separate domain generalization from finite-sample generalization, we consider an idealized DG (IDG) setting, where the predictor $h$ is selected on the source population risk rather than the empirical risk. We assume the sample spaces $\mathcal{X}, \mathcal{Z}, \mathcal{Y}, \mathcal{D}$ are discrete; formal statements and proofs are in Appxs. A and B.

3.1 DEFINING OPTIMAL REPRESENTATIONS FOR IDEALIZED DOMAIN GENERALIZATION

We want to evaluate the quality of a representation $Z$ of $X$. In our IDG, the learner is given a random source $D_s$; she selects any source risk minimizer; and she is scored according to her risk on a random target domain $D_t$. To give uniform guarantees while reflecting the uncertainty over the source-target pair $(D_s, D_t)$, we measure the quality of $Z$ as the expected risk of the learner's worst-case choice.

Definition. The idealized domain generalization risk (IDG risk) of an encoder $p_{Z \mid X}$ is the expected (over domains) worst-case (over source risk minimizers) target risk, i.e.,

$$R_{\mathrm{IDG}}[Y \mid Z] := \mathbb{E}_{p_{D_s, D_t}}\Big[\sup_{h \in \mathcal{H}^\star_{D_s}} R_{D_t}^h[Y \mid Z]\Big], \qquad (2)$$

where $\mathcal{H}^\star_{D_s} := \arg\min_h R_{D_s}^h[Y \mid Z]$ is the set of source risk minimizers, and $p_{D_s, D_t}$ is any joint distribution with full support over $\mathcal{D} \times \mathcal{D}$. We call a representation $Z$ (or its encoder) optimal for IDG if it minimizes the IDG risk.

3.2 CHARACTERIZING OPTIMAL REPRESENTATIONS FOR IDG UNDER COVARIATE SHIFT

The IDG risk is useful to evaluate representations, but it gives few insights into IDG and is impractical to optimize due to the supremum in Eq. (2). Under mild assumptions, we provide a simplified, equivalent objective, which is easier to optimize.
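The IDG risk of Eq. (2) can be computed exactly in a tiny discrete example (the domains, labels, and encoders below are our own toy construction, not from the paper): enumerate all predictors $h : \mathcal{Z} \to \mathcal{Y}$, keep the source risk minimizers, and take the worst target risk among them.

```python
import itertools
import numpy as np

# Toy IDG risk with 0-1 loss, a fixed source d_s and target d_t.
X_ALL = [0, 1, 2, 3]
f_star = {x: x % 2 for x in X_ALL}          # shared Bayes predictor
domains = {"ds": [0, 1], "dt": [2, 3]}      # uniform p(X|d) on each support

def risk(h, z_of, support):
    """Population 0-1 risk of a predictor h: Z -> Y on a uniform domain."""
    return float(np.mean([h[z_of[x]] != f_star[x] for x in support]))

def idg_risk(z_of, n_z):
    """Worst-case target risk over all source risk minimizers."""
    preds = [dict(enumerate(hs)) for hs in itertools.product([0, 1], repeat=n_z)]
    src = [risk(h, z_of, domains["ds"]) for h in preds]
    best = min(src)
    return max(risk(h, z_of, domains["dt"])
               for h, r in zip(preds, src) if r == best)

# Identity encoder: discriminative, but supp(p_{Z|ds}) != supp(p_{Z|dt}),
# so some source risk minimizer mispredicts everywhere on the target.
print(idg_risk({x: x for x in X_ALL}, n_z=4))        # -> 1.0

# Parity encoder: discriminative AND support-matched.
print(idg_risk({x: x % 2 for x in X_ALL}, n_z=2))    # -> 0.0
```

The identity encoder exhibits the support-mismatch failure mode, while the support-matched parity encoder attains the Bayes IDG risk, matching what Theorem 1 below formalizes.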
For convenience, we assume that there is a unique Bayes predictor $f^\star$ that minimizes the expected risk over domains, i.e., $f^\star = \arg\min_f \mathbb{E}_{p_{D_t}}[R_{D_t}^f[Y \mid X]]$. This is satisfied by standard ML tasks $p_{Y,X}$ and losses $\ell$. More importantly, we assume the following domain structure, which ensures the existence of optimal encoders and allows our simplification.

Assumptions. All domains $d \in \mathcal{D}$ we consider are related by the following assumptions:

1. Generalized covariate shift. All domain-specific risk minimizers $f \in \arg\min_f R_d^f[Y \mid X]$ are equal to the Bayes predictor $f^\star$ on their support, i.e., $f(x) = f^\star(x)$ for all $x \in \mathrm{supp}(p_{X \mid d})$.
2. Invariance of Bayes predictions. The set of Bayes predictions is the same for all domains, i.e., $\{f^\star(x) \mid x \in \mathrm{supp}(p_{X \mid d})\} = \{f^\star(x) \mid x \in \mathcal{X}\}$.

[Figure 1: (a) Optimal representations for IDG must have invariant supports while being simultaneously discriminative on all domains; (b) without the discriminative requirement, a source-risk minimizer can mispredict the target; and (c) without support match, some risk minimizer will perform poorly.]

Generalized covariate shift (GCS) ensures that $f^\star$ is simultaneously optimal on all domains. For log-loss $\ell$ it recovers standard covariate shift, i.e., $p_{Y \mid x, d} = p_{Y \mid x}$. For other losses, GCS is weaker: e.g., it only requires invariance of the most likely labels for 0-1 loss, and of the conditional expectations for MSE. Invariance of Bayes predictions is necessary to learn useful predictors using a single domain; for example, for 0-1 loss it ensures that each label is seen at least once in each domain.

The intuition behind our objective is that under GCS any source risk minimizer will make optimal predictions on target samples $x$ that are also in the source.
Thus, IDG optimal representations are exactly those that (a) have the same support in $\mathcal{Z}$ for all domains, and (b) retain GCS from $Z$ without sacrificing the ability to predict $Y$, which can be ensured by minimizing the risk from $Z$. See Fig. 1.

Theorem 1. Under our assumptions, an encoder $p_{Z \mid X}$ is optimal for IDG if and only if it minimizes the risk $R^\star[Y \mid Z] := \inf_h \mathbb{E}_{p_{D_t}}[R_{D_t}^h[Y \mid Z]]$ while matching the support of $Z$ across domains, i.e.,

$$p^\star_{Z \mid X} \in \arg\min_{p_{Z \mid X}} R^\star[Y \mid Z] \quad \text{s.t.} \quad \forall d \in \mathcal{D},\ \mathrm{supp}(p_{Z \mid d}) = \mathrm{supp}(p_Z). \qquad (3)$$

Moreover, such encoders exist and their IDG risk is the Bayes risk, $R_{\mathrm{IDG}}[Y \mid Z^\star] = R^\star[Y \mid X]$.

Theorem 1 provides an objective to learn representations on which performing risk minimization using a single domain and $Z$ is as good as performing risk minimization on the target domain from the inputs $X$. Other sufficient conditions have previously been hinted at, e.g., matching the marginal $p_{Z \mid d}$ instead of its support (e.g., Ben-David et al., 2010a), which is the focus of most DG methods (e.g., Ganin et al., 2016). Note that previous conditions are nevertheless generally not necessary and could be too stringent to be achievable. To our knowledge, Thm. 1 is the first characterization of necessary and sufficient conditions, which gives better insight into the essential goal of optimal IDG and provides a guide for deriving the least stringent objectives in practice.

The risk minimization in Eq. (3) shows that one must have some knowledge about the target domains to learn optimal representations for IDG. Access to targets might seem unrealistic, but without such knowledge or additional assumptions it is provably impossible to beat even constant representations.

Proposition 1 (No free lunch for IDG). Let $d_s$ be any source domain, $Z_{d_s}$ be any representation chosen on source $d_s$, and $C \in \mathcal{Z}$ be a constant representation.
Under minor assumptions, for every good target domain outside the source's support on which $Z_{d_s}$ outperforms $C$ for IDG, there are many bad target domains on which $Z_{d_s}$ is strictly worse than $C$. Formal statement in Appx. B.3.

Proposition 1 shows that target knowledge is necessary for learning useful representations in IDG. This may explain why previous DG methods have been unable to outperform ERM on standard benchmarks (Gulrajani & Lopez-Paz, 2021): the knowledge they have access to is insufficient to generalize. Taken together, Prop. 1 and Thm. 1 say that either you have access to target domains $d_t$, in which case you can achieve an IDG risk that matches supervised learning, or you do not access $d_t$, in which case any representation learning algorithm can achieve worse IDG risk than a constant.

4 LEARNING REPRESENTATIONS UNDER COVARIATE SHIFT

4.1 SELF-SUPERVISED LEARNING USING DOMAIN-AGNOSTIC AUGMENTATIONS

Our characterization of optimal representations for IDG (Thm. 1) requires labeled data from all domains, which is impractical. We show how this can be overcome with self-supervised learning (SSL), which is a technique for training representations without direct access to labels, and a particular class of data augmentations.

[Figure 2: Image-text augmentations are practical domain-agnostic augmentations. Arrows denote augmenters; bubbles denote inputs that have the same representations, as induced by predicting the augmentations. (a) Standard augmentations are not domain-agnostic. (b) Supervised augmentations uniformly augment inputs inside their label class, irrespective of domains. (c) Image-text augmentations (e.g., "A dog with floppy ears.", "A pointy-eared dog.") are (nearly) domain-agnostic, as they map images across domains to similar descriptions.]
E.g., in CLIP, images are augmented with alt-text collected on the internet, and invariance is enforced between the representations of an image and its text pair (Radford et al., 2021). Representations learned like this preserve discriminative information about all downstream tasks $Y$ whose label information is preserved by the augmentation (e.g., Dubois et al., 2021).

More precisely, an augmentation $A$ is a random variable sampled conditionally on the input $X$. The key requirement is that augmentations retain task information. Specifically, if any samples $x, x' \in \mathcal{X}$ have the same augmentation conditional $p_{A \mid x} = p_{A \mid x'}$, then their Bayes predictions must be the same, $f^\star(x) = f^\star(x')$. With such $A$, one can learn an encoder that minimizes the risk $R^\star[Y \mid Z]$ by instead maximizing the mutual information $I[A; Z]$. Intuitively, if $Z$ has all augmentation information, then it must have information about the conditional $p_{A \mid X}$, and thus about the Bayes prediction $f^\star(X)$.

This suggests learning optimal representations for IDG by replacing Eq. (3) with a maximization of $I[A; Z]$. Unfortunately, fully optimizing $I[A; Z]$ w.r.t. $p_{Z \mid X}$ is not generally possible under the support constraint of Eq. (3). This can be overcome under a domain-agnostic assumption, which requires that the set of possible augmentation distributions is the same across domains, i.e., $\{p_{A \mid x} \mid x \in \mathrm{supp}(p_{X \mid d})\} = \{p_{A \mid x} \mid x \in \mathcal{X}\}$.

Proposition 2. Let $p_{A \mid X}$ be a domain-agnostic augmenter. Then any optimal solution $p^\star_{Z \mid X}$ of the following objective is optimal for IDG:

$$p^\star_{Z \mid X} \in \arg\max_{p_{Z \mid X}} I[A; Z] \quad \text{s.t.} \quad \forall d \in \mathcal{D},\ \mathrm{supp}(p_{Z \mid d}) = \mathrm{supp}(p_Z). \qquad (4)$$

Proposition 2 shows that we can still learn IDG optimal representations without labels if we have access to the right augmentations. How realistic are those augmentations? For 0-1 loss $\ell$, the most likely label should be preserved, which is satisfied by standard image augmentations like rotations and color jittering.
Those augmentations are nevertheless not domain-agnostic for typical domains (e.g., sketches and photos), since the outputs $A$ are correlated with the input's domain $D$; see Fig. 2a. A practical choice of augmentation that is nearly domain-agnostic is a mapping from images to text descriptions, as with CLIP (Radford et al., 2021), which uses text-image pairs.

Image-text augmentations have many advantages. First, text augmentations preserve label information for many downstream tasks. Second, they are close to being domain-agnostic, since images from different domains (e.g., sketches and photos) but with similar semantics are often mapped to similar descriptions² (Fig. 2c). This gives insight into the open question (Radford et al., 2021) of why CLIP's representations are so robust compared to other SSL methods. Finally, image-text pairs are easy to access in practice given their abundance on the internet. Many other multi-modal augmentations, e.g., audio-video (Wang et al., 2021), are also likely domain-agnostic and can be explored in practice.

In practice, even the domain information $D$ is usually unknown. One can nevertheless still optimize Eq. (4) by replacing the support constraint with a stronger one that does not rely on $D$, e.g., minimizing $I[Z; X]$ (see Sec. 4.2.2). This highlights the potential of Prop. 2: if one can find a large source of inputs $X$ and domain-agnostic augmentations $A$ (e.g., the 400M image-text pairs of CLIP), then one can, in principle, learn optimal representations for IDG on any downstream task $Y$ that $A$ preserves.

²Although text descriptions might contain domain information (e.g., referring to a "sketch"), they are still much better than standard augmentations, which rarely map together images from different domains.

4.2 PRACTICAL OBJECTIVES

We now design practical objectives for learning optimal representations without labels.
Proposition 2 does provide an objective, but it is impractical as it involves constrained optimization. We can nevertheless convert it to the following unconstrained objective by using a Lagrangian relaxation and introducing a domain bottleneck $\mathrm{B}[Z, D]$ that enforces support match:

$$\arg\min_{p_{Z \mid X}} \; -I[A; Z] + \lambda \, \mathrm{B}[Z, D]. \qquad (5)$$

Eq. (5) is a valid reformulation of Prop. 2 as long as minimizing $\mathrm{B}[Z, D]$ while maximizing $I[A; Z]$ enforces the support constraint in Eq. (4). Below, we provide different choices of such $\mathrm{B}[Z, D]$, each of which results in a different SSL objective.

In practice, however, the terms in Eq. (5) are hard to estimate from finite samples. We now discuss two variational bounds that can be efficiently estimated and optimized with stochastic gradient descent (Bottou, 2010). For simplicity, we use a deterministic encoder $e_\phi : \mathcal{X} \to \mathcal{Z}$ for the rest of the paper. Detailed derivations are in Appx. C.

For both practical objectives we use a contrastive variational lower bound on $I[A; Z]$ based on InfoNCE (Oord et al., 2018), which is standard in SSL. Specifically, for a sample $X$, we first obtain the augmented positive $A$ by sampling from $p_{A \mid X}$. We then obtain $n$ augmented negatives $\{A'_i\}_{i=1}^n$ i.i.d. from the marginal $p_A$ by first independently sampling $\bar{X} := \{X'_i\}_{i=1}^n$ from $p_X$ and then sampling $A'_i$ from $p_{A \mid X'_i}$. We denote $\bar{A} := \{A, A'_1, \ldots, A'_n\}$. InfoNCE then uses a critic $s_\psi$ to score how likely each $A' \in \bar{A}$ is to be the positive, resulting in the following variational bound:

$$I[A; Z] \geq \log(n+1) + \mathbb{E}_{p_{\bar{A}, X, Z}}\Big[\log \frac{\exp s_\psi(A, Z)}{\sum_{A' \in \bar{A}} \exp s_\psi(A', Z)}\Big]. \qquad (6)$$

When $\mathcal{A} = \mathcal{X}$, one can tie the parameters of the critic and the encoder by passing augmentations through the encoder and taking an inner product, i.e., $s_\psi(A, Z) := e_\phi(A)^T Z$. Many previous DG regularizers (e.g., Ganin et al., 2016; Li et al., 2018b;a) could be valid domain bottlenecks. In the following, we discuss two possible $\mathrm{B}[Z, D]$, the first of which is novel.
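A minimal NumPy sketch of a Monte-Carlo estimate of the bound in Eq. (6), assuming the tied inner-product critic $s_\psi(A,Z) = e_\phi(A)^T Z$ and treating encoder outputs as given arrays (shapes and data here are illustrative, not from the paper's code):

```python
import numpy as np

def logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True)), axis)

def info_nce_bound(z, a_pos, a_neg):
    """Estimate of Eq. (6) with an inner-product critic.
    z: (B, d) representations; a_pos: (B, d) encoded positives;
    a_neg: (B, n, d) encoded negatives (n per sample)."""
    n = a_neg.shape[1]
    s_pos = np.einsum("bd,bd->b", a_pos, z)        # s_psi(A, Z)
    s_neg = np.einsum("bnd,bd->bn", a_neg, z)      # s_psi(A'_i, Z)
    s_all = np.concatenate([s_pos[:, None], s_neg], axis=1)
    return np.log(n + 1) + (s_pos - logsumexp(s_all, axis=1)).mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 4))
aligned = info_nce_bound(z, 20.0 * z, rng.normal(size=(32, 8, 4)))
shuffled = info_nce_bound(z, rng.normal(size=(32, 4)), rng.normal(size=(32, 8, 4)))
# A strongly aligned positive nearly saturates the log(n + 1) cap; an
# uninformative positive drives the estimate towards 0.
```

Since the log-softmax term is non-positive, the estimate never exceeds $\log(n+1)$, which is the well-known saturation of the InfoNCE bound.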
4.2.1 CONTRASTIVE ADVERSARIAL DOMAIN BOTTLENECK (CAD)

Algorithm 1 CAD objective
Require: $e_\phi$, $s_\psi$, $D$, $X$, $n$
1: $Z \leftarrow e_\phi(X)$
2: $A \leftarrow \mathrm{sample}(p_{A \mid X})$
3: $(D'_i, X'_i, A'_i)_{i=1}^n \leftarrow$ i.i.d. $\mathrm{sample}(p_{D, X, A})$
4: $\bar{X} \leftarrow \{X'_i\}_{i=1}^n$, $\quad \bar{A} \leftarrow \{A\} \cup \{A'_i\}_{i=1}^n$
5: $\bar{X}_{\neq D} \leftarrow \{X'_i \mid D'_i \neq D,\ i \in [n]\}$
6: $\mathcal{L}_{\mathrm{aug}} \leftarrow -\log \frac{\exp s_\psi(A, Z)}{\sum_{A' \in \bar{A}} \exp s_\psi(A', Z)}$ ▷ estimates $-I[A; Z]$
7: $\mathcal{L}_{\mathrm{supp}} \leftarrow -\log \frac{\sum_{X' \in \bar{X}_{\neq D}} \exp(e_\phi(X')^T Z)}{\sum_{X' \in \bar{X}} \exp(e_\phi(X')^T Z)}$ ▷ estimates $I[Z; D]$
8: return $\mathcal{L}_{\mathrm{CAD}} = \mathcal{L}_{\mathrm{aug}} + \lambda \mathcal{L}_{\mathrm{supp}}$

Our first domain bottleneck minimizes $\mathrm{B}[Z, D] = I[Z; D]$, which enforces support match using a KL divergence. Dropping constants w.r.t. $Z$, we thus aim to maximize $H[D \mid Z]$. The domain-adversarial neural network (DANN, Ganin et al., 2016) does so by ensuring that a domain classifier $q_\varphi$ cannot predict domains from representations, i.e., it maximizes $\mathbb{E}_{p_{D,Z}}[-\log q_\varphi(D \mid Z)] \geq H[D \mid Z]$ w.r.t. the encoder parameters $\phi$ but minimizes it w.r.t. $\varphi$. However, DANN suffers from two issues: (i) it maximizes an upper bound on the desired term; and (ii) it requires adversarial training, which is challenging in practice.

To overcome these issues, we construct $q(D \mid Z)$ without introducing additional parameters and with a bound that is tight given enough samples. In short, using the equality $p_{D \mid Z} = \mathbb{E}_{p_{X \mid Z}}[p_{D \mid X}]$, we set our variational distribution to $q(D \mid Z) = \mathbb{E}_{q_{\phi, \bar{X}}}[\hat{p}(D \mid X)]$, where $q_{\phi, \bar{X}}(X \mid Z)$ is a contrastive variational distribution of $p_{X \mid Z}$ constructed with the samples $\bar{X}$ and a critic $e_\phi(X)^T Z$ tied with the encoder, and $\hat{p}$ is a count estimate of $p_{D \mid X}$. Detailed derivations and explanations are in Appx. C.3.

The resulting contrastive adversarial domain (CAD) objective is given in Algorithm 1. First, sample domains $\{D'_i\}_{i=1}^n$ jointly with the negatives $\bar{X}$. Then collect the inputs associated with a domain different from the current domain $D$, i.e., $\bar{X}_{\neq D} := \{X'_i \mid D'_i \neq D,\ i \in [n]\}$. Ignoring constants, the final loss is

$$\mathcal{L}_{\mathrm{CAD}}(\phi, \psi) := \mathbb{E}_{p_{D, \bar{A}, X, Z}}\Big[-\log \frac{\exp s_\psi(A, Z)}{\sum_{A' \in \bar{A}} \exp s_\psi(A', Z)} - \lambda \log \sum_{X' \in \bar{X}_{\neq D}} q_{\phi, \bar{X}}(X' \mid Z)\Big]. \qquad (7)$$

In Appx.
C.4, we also derive a conditional variant of CAD that minimizes $I[Z; D \mid Y]$, which can be used when labels are available and supervised augmentations are used.

4.2.2 ENTROPY BOTTLENECK (ENT)

Our second domain bottleneck is the entropy bottleneck (Ent), which minimizes $H[Z] = I[Z; X] \geq I[Z; D]$, where the first equality uses the encoder's determinism. Ent enforces support match by removing all information that is not needed to maximize $I[Z; A]$. In particular, minimizing $I[Z; X]$ is more stringent than minimizing $I[Z; D]$, as it also matches the representations inside a domain. The advantage of Ent is that it does not require domain samples $D$, which are rarely accessible in SSL. We consider the standard variational bound used in neural compression (Ballé et al., 2016; Theis et al., 2017), $H[Z] \leq \mathbb{E}_{p_Z}[-\log q_\theta(Z)]$, where $q_\theta(Z)$ is a learned entropy model. This leads to

$$\mathcal{L}_{\mathrm{Ent}}(\psi, \phi, \theta) := \mathbb{E}_{p_{\bar{A}, X, Z}}\Big[-\log \frac{\exp s_\psi(A, Z)}{\sum_{A' \in \bar{A}} \exp s_\psi(A', Z)} - \lambda \log q_\theta(Z)\Big]. \qquad (8)$$

5 RELATED WORK

Provably robust representations under covariate shift. Previous work mostly focuses on domain generalization bounds for robust representations. Ben-David et al. (2007; 2010a) bound the target risk using the source risk, a divergence between the source and target distributions, and the joint optimal risk over source and target domains. Mansour et al. (2009) generalize these results from 0-1 loss to more general losses. Johansson et al. (2019) take this further by deriving a support-based bound. In our setting, these bounds only hint towards a sufficient condition for optimality, i.e., matching the marginal $p_{Z \mid d}$ or its support while minimizing $R^\star[Y \mid Z]$. However, these bounds can often be loose, and the implied sufficient conditions are neither necessary nor generally achievable. Ben-David et al. (2010b) suggest that separately minimizing $R^\star[Y \mid Z]$ or matching the marginal is not sufficient, while Zhao et al.
(2019) also prove that minimizing only the source risk $R_{d_s}[Y \mid Z]$ is not sufficient; but none of them proves the desired necessary condition. Our work is distinguished from previous work in three key aspects: (i) we are the first to study and formalize optimally robust representations, and we provide achievable sufficient and necessary conditions; (ii) we prove that one can practically learn optimal $Z$ with SSL using domain-agnostic augmentations; (iii) we consider a more general framework with any standard loss and a less stringent generalized covariate shift assumption. Still, our work is more specific than others, as we consider idealized DG and unrestricted predictors $\mathcal{H}$.

Practical objectives for DG. The most popular DG methods aim to learn domain-invariant representations by minimizing various divergences between the marginal distributions $p_{Z \mid d}$ and $p_Z$ (Long et al., 2015; Ganin et al., 2016; Sun & Saenko, 2016; Long et al., 2017; Li et al., 2018a; Shen et al., 2018; Nguyen et al., 2021). Others propose matching the conditional $p_{Z \mid y, d}$ across domains instead (Gong et al., 2016; Li et al., 2018b; Tachet des Combes et al., 2020). These regularizers would all be valid domain bottlenecks $\mathrm{B}[Z, D]$. Another line of work aims at learning $Z$ with invariant predictors $p_{Y \mid z, d}$ across domains (e.g., Arjovsky et al., 2019; Krueger et al., 2021; Li et al., 2021). However, none of these methods outperform ERM under fair model selection (Gulrajani & Lopez-Paz, 2021).

6 EXPERIMENTS

In our experiments, we aimed to: (i) verify our theoretical results in practice; (ii) investigate our proposed representation learning objectives in practical DG; and (iii) take advantage of pretrained SSL models (in particular, CLIP) to achieve powerful models for DG. Unless stated otherwise, we consider a two-stage training setup. First, the representation learner ("the representor") trains an encoder $p_{Z \mid X}$ using a specified objective and freezes it.
Then, the person performing predictions ("the learner") trains her predictor $h$ from $Z$ by minimizing the risk on source data. Finally, the representation $Z$ and predictor $h$ are evaluated on target data. In all experiments, the learner uses a linear classifier for $h$. For the Ent bottleneck, we used Ballé et al.'s (2018) entropy model. For the CAD bottleneck we used its conditional version whenever labels were available. When a model contains no domain bottleneck, we label it as "Base". For experimental details and additional results see Appxs. E and F.

6.1 SCIENTIFIC SETTING: EXPLORING OPTIMAL REPRESENTATIONS FOR WORST-CASE DG

To validate our theory, we studied optimal representations in a scientific setup that is as close to our IDG framework as possible, with log-loss $\ell$. In particular, we used the PACS dataset (Li et al., 2017)

[Figure 3: (a) Adding bottlenecks significantly improves the worst-case DG performance, and using domain-agnostic (DA) augmentations ($H[A \mid Z]$) performs as well as using labels ($R^\star[Y \mid Z]$). (b) Increasing the domain bottleneck weight $\lambda$ improves target performance until it starts to decrease source performance. (c) DA augmentations are crucial, but approximately DA augmentations might also be sufficient.]

and approximated idealized DG by treating the dataset as the population distribution, i.e., we did not split the datasets into train and test sets. To approximate the worst-case source predictor, we followed Dubois et al. (2020) by incorporating wrongly labeled target data into the source domain.
The experimental setup goes as follows: (i) the representor trains a ResNet-18 (He et al., 2016) to minimize the objective on labeled data from all domains; (ii) the learner trains a worst-case source classifier $h$ on every possible (source, target) pair; (iii) the negative target risk (log likelihood) of each $h$ is evaluated. We report the log likelihood averaged over 5 seeds. For more realistic scenarios (i.e., non-idealized average-case DG) see Appx. F.2, which replicates the following results.

Do our domain bottlenecks improve worst-case DG? In Fig. 3a, we compare the IDG performance of representations trained with (Ent, CAD) and without (Base) domain bottlenecks. We see that both bottlenecks significantly improve the worst-case DG, and nearly achieve the source-domain performance (0 log likelihood). This shows the importance of support match (Thm. 1) and the effectiveness of our bottlenecks in enforcing it. In Appx. F.2, we show that bottlenecks also help in practical scenarios, i.e., non-idealized average-case DG evaluated with accuracy (95.9% → 96.7%).

What is the effect of λ? Fig. 3b shows the effect of the bottleneck weight $\lambda$ on the worst-case target and source performance. We see that increasing $\lambda$ decreases the DG gap. As a result, the target performance improves until $\lambda \approx 10^2$, where source performance starts to decrease.

What if the representor has access to domain-agnostic augmentations instead of labels? In Sec. 4.2, we provide a contrastive objective for using augmentations. To show its effectiveness, we compared minimizing $H[A \mid Z]$ using Eq. (6) to standard supervised risk minimization $R^\star[Y \mid Z]$, using the domain-agnostic supervised augmentations (Fig. 2b). The 1st and 2nd rows of Fig. 3a show that our objective performs similarly to direct label prediction.

How important is the choice of augmentations? Prop. 2 shows that domain-agnostic (DA) augmentations are sufficient for achieving IDG, but it does not give necessary conditions.
Here we investigate the effect of using our loss with different choices of augmentations. Specifically, we used $\mathcal{L}_{\mathrm{CAD}}$ with five augmentations. The first two are DA. "Supervised": augment inputs inside their label class across all domains, as in Fig. 2b; "Single Dom": augment inputs to same-label samples from a fixed domain. The next two are not DA. "Standard": standard SSL augmentations (Chen et al., 2020), as in Fig. 2a; "Intra Dom": augment inputs to same-label, same-domain samples. Finally, we consider "Approx DA", which is approximately DA: it augments 10% of the time with "Supervised" and 90% of the time with "Intra Dom". Fig. 3c shows that the non-DA augmentations give terrible results compared to the DA ones. Interestingly, "Approx DA" also performs very well, which suggests that approximately DA augmentations might be sufficient to learn optimal representations in practice.

What if the representor does not have access to target domains? Prop. 1 shows that DG without access to target domains is generally impossible. We verified this empirically by excluding a predefined target domain $d_t$ from the representor's training set, i.e., $\mathcal{L}_{\mathrm{CAD}}$ is optimized on 3 of the 4 domains. The learner then trains a predictor $h$ on each source. We finally evaluate each $h$ on the target domain $d_t$, and average over choices of $d_t$. The resulting worst-case log likelihood was $-4.2 \pm 0.2$, which is significantly worse than when the representor had access to all domains ($-0.8 \pm 0.2$).

6.2 APPROXIMATING OPTIMAL REPRESENTATIONS BY EXPLOITING PRETRAINED SSL

As discussed in Sec. 4.1, one can learn optimal representations for IDG by performing SSL with a domain bottleneck on a large sample of inputs $X$ and domain-agnostic augmentations $A$. This is nearly how CLIP was pretrained (SSL with 400M image-text pairs), except that it did not include a domain bottleneck. In this section, we investigate how to take advantage of CLIP to approximate optimal representations for IDG.
We did so in two simple steps. First, we froze the pretrained CLIP and added a multi-layer perceptron (MLP) that could effectively finetune CLIP's representations. Then, we trained the MLP by minimizing our CAD bottleneck and $R^\star[Y \mid Z]$ on the available data.

In all experiments, we used the standard DomainBed benchmark (with the non-MNIST datasets) and protocol (Gulrajani & Lopez-Paz, 2021). In particular, we left out a target domain for evaluation and used the union of the other domains for training both the encoder and the classifier. Contrary to our scientific setting, the representor does not get access to the target domain. All our representations were evaluated by fitting a linear classifier on source domains with source-validation selection. As in DomainBed, we selected the encoder based on oracle selection over 10 hyperparameters, and we report the target accuracy averaged over all choices of targets and 5 random seeds, with standard errors. Note that using oracle selection is more consistent with our theory, since it gets access to the necessary target information (for model selection), as discussed in Appx. F.3. Due to space limits, we only include as baselines ERM and "DomainBed SOTA", which for each dataset is the best result over all baselines. Extended results and baselines are in Table 4; details in Appx. E.3. We investigated two pretrained CLIP models with different numbers of parameters: the larger ViT-B/32, denoted CLIP-L, and the smaller ResNet-50, denoted CLIP-S.

Table 1: CLIP significantly outperforms the previous SOTA result on DomainBed, as supported by our theoretical analysis. Finetuning CLIP with our CAD bottleneck consistently improves the robustness of its representations and achieves SOTA performance.
| Algorithm | VLCS | PACS | OfficeHome | TerraIncognita | DomainNet |
|---|---|---|---|---|---|
| ERM | 77.6 ± 0.3 | 86.7 ± 0.3 | 66.4 ± 0.5 | 53.0 ± 0.3 | 41.3 ± 0.1 |
| DomainBed SOTA | 79.9 ± 0.2 | 87.2 ± 0.1 | 68.4 ± 0.2 | 54.4 ± 0.3 | 41.8 ± 0.1 |
| DINO + CAD | 69.6 ± 0.6 | 76.1 ± 0.1 | 56.9 ± 0.5 | 25.9 ± 1.2 | 33.6 ± 0.1 |
| CLIP-S | 81.1 ± 0.5 | 90.3 ± 0.2 | 70.6 ± 0.1 | 29.6 ± 0.8 | 47.7 ± 0.0 |
| CLIP-S + Base | 81.3 ± 0.5 | 91.2 ± 0.3 | 70.6 ± 0.1 | 36.4 ± 0.7 | 46.8 ± 0.2 |
| CLIP-S + CAD | 82.3 ± 0.3 | 92.0 ± 0.2 | 71.9 ± 0.2 | 36.2 ± 0.8 | 48.8 ± 0.1 |
| CLIP-L | 80.7 ± 0.4 | 93.7 ± 0.8 | 79.6 ± 0.1 | 36.9 ± 0.6 | 52.8 ± 0.1 |
| CLIP-L + CAD | 81.6 ± 0.1 | 94.9 ± 0.3 | 80.0 ± 0.2 | 40.6 ± 1.1 | 53.7 ± 0.1 |
| Approx. Optimal Z | 86.8 ± 0.6 | 97.2 ± 0.6 | 86.3 ± 1.6 | 76.5 ± 4.1 | 66.7 ± 0.2 |

Can we approximate optimal representations by exploiting pretrained CLIP? The row CLIP-L + CAD in Table 1 shows that finetuning a large pretrained CLIP model with our CAD achieves SOTA on nearly all DomainBed benchmarks by a very large margin (see 2nd row). Note that the poor performance on TerraIncognita is likely because CLIP's dataset does not cover such images (camera traps monitoring animals). The last row essentially shows an optimal representation, which we approximate by finetuning CLIP-L with our CAD on all domains, including the target. The gap between CLIP-L + CAD and the upper bound suggests that one can still learn better representations. We hypothesize that end-to-end training of our objective would greatly shrink this gap.

Are gains due to architectural differences? DomainBed's baselines finetuned an ImageNet-pretrained ResNet-50. In contrast, CLIP-L pretrained a larger ViT. To decouple gains due to our objective from architectural gains, we evaluated the ResNet-50-pretrained CLIP-S. Table 1 shows that CLIP-S + CAD still significantly outperforms DomainBed baselines. Note that our theory does not constrain the encoder, and so we expect larger encoders to be better, as seen in Table 1.

What is the effect of domain bottlenecks? In the CLIP rows of Table 1, we investigated the effect of finetuning CLIP with our CAD bottleneck.
We see that for both CLIP-L and CLIP-S, it consistently improves results by around 1–2%. These gains are due to the bottleneck, rather than to finetuning on source data, as seen by CLIP-S + Base. We believe the gains could potentially be much larger if CLIP were trained end-to-end with our bottleneck. Note that raw CLIP-S already significantly outperforms baselines. We hypothesize that this is because SGD acts as an information bottleneck that naturally favors support match (Shwartz-Ziv & Tishby, 2017).

Which pretrained SSL model to use? Our theory suggests that we can exploit pretrained SSL models as long as their augmentations are domain-agnostic and their training set covers the desired domains. We investigated adaptation of SSL models that do not satisfy those properties by finetuning DINO (Caron et al., 2021), the current SOTA on self-supervised ImageNet. DINO is pretrained using standard augmentations. As a result, Table 1 shows that the finetuned DINO + CAD significantly underperforms compared to CLIP-S and DomainBed baselines. This supports our hypothesis that CLIP is much more robust than other SSL methods due to its domain-agnostic augmentations.

6.3 TOWARDS GENERIC ROBUST REPRESENTATIONS WITH SSL

In the previous section, we finetuned CLIP in a task-specific fashion by optimizing R[Y | Z] and our CAD bottleneck. To get generic (task-agnostic) robust representations, one should instead directly use our objectives on a sufficiently large dataset with image-text augmentations. Unfortunately, we cannot fully train CLIP with our bottlenecks, as we do not have access to CLIP's original dataset or sufficient compute. In this section, we aim to emulate such training of generic robust representations. To do so, we used LAION-400M (Schuhmann et al., 2021), a public dataset that contains 400M web-crawled image-text pairs.
Due to our computational budget, we again froze the pretrained CLIP-L and only finetuned an additional MLP with our L_Ent. We used L_Ent as it only requires access to paired images X and text A, but no prior information about the domain D. As in CLIP's paper, we evaluated the learned representation Z in Taori et al.'s (2020) realistic setting, where a linear classifier h from Z is trained on ImageNet and tested on 7 natural distribution shift datasets. Details in Appx. E.4.

Would training CLIP with a bottleneck have improved its robustness? As shown in the last 2 rows of Table 2, finetuning CLIP-L on LAION with L_Ent (Tuned w/ Ent) outperforms finetuning without the bottleneck (Tuned w/o Ent) on all 7 distribution shift datasets. This suggests that directly training CLIP with our Ent bottleneck would improve the robustness of the learned representations. We hypothesize that the gains could be larger if SSL models trained L_Ent end-to-end. In Appx. F.4, we show similar results on DomainBed. Note that both models underperform the original CLIP-L, likely due to non-end-to-end training and LAION data of (possibly) lower quality than CLIP's data.

Table 2: Finetuning CLIP-L on LAION with an entropy bottleneck improves its robustness on 7 distribution shift datasets compared to finetuning without one. The pretrained CLIP-L is still better, likely due to end-to-end training with higher quality data. IN denotes ImageNet.

| Model | IN | IN-V2 | IN-S | YT-BB | IN-Vid | ObjectNet | IN-A | IN-R | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| CLIP-L | 75.2 | 64.2 | 41.0 | 58.4 | 71.6 | 42.8 | 27.5 | 62.9 | 52.6 |
| Tuned w/o Ent | 73.8 | 62.1 | 37.0 | 56.9 | 68.8 | 41.3 | 26.0 | 58.1 | 50.0 |
| Tuned w/ Ent | 74.2 | 62.7 | 38.9 | 58.1 | 70.1 | 42.1 | 26.2 | 60.8 | 51.3 |

7 CONCLUSION

We gave a simple variational characterization of all representations on which source-risk minimizers are guaranteed to generalize to target domains that preserve the Bayes predictor. Similar to previous work, our theory strongly implies the need for target information when learning representations for domain generalization.
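The entropy bottleneck penalizes B[Z, X, Y, D] = H[Z]. As a simplified illustration, if representations are discretized into codes, H[Z] can be estimated from a batch with a plug-in estimator; the practical L_Ent objective instead uses a learned prior over continuous Z, so the discrete-code view below is an assumption for exposition only.

```python
import math
from collections import Counter

def entropy_bottleneck(codes):
    """Plug-in estimate of H[Z] (in bits) from a batch of discretized
    representation codes. Treating Z as discrete codes is a simplifying
    assumption; the paper's L_Ent works with continuous representations."""
    n = len(codes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(codes).values())
```

For example, a batch spread uniformly over 4 codes costs 2 bits, while a batch collapsed onto one code costs 0 bits; minimizing this term alongside the SSL loss pushes toward compact representations.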
Nevertheless, we identified a domain-agnostic property of data augmentations that makes it possible to learn optimal representations from unlabelled data. Thus, we showed that it is possible to learn robust representations using only large sources of inputs X and augmentations A.

There are caveats that need to be addressed in future work. First, we studied an idealized DG, which assumes access to the population distributions. This gives insights into the challenges that are specific to DG, rather than the finite-sample challenges faced throughout ML. Second, we considered risk minimizers from an unconstrained hypothesis class. The support constraint can likely be weakened if the hypothesis class is constrained. Finally, we focused only on optimal representations, but it would be interesting to characterize approximately optimal representations. Nevertheless, in this idealized setting, our characterization is a springboard from which all future objectives can be derived, and, in general, it brings us closer to the goal of robust machine learning systems.

Acknowledgement We would like to thank Elliot Creager, Roger Grosse, Elan Rosenfeld, Guodong Zhang, Han Zhao, and anonymous reviewers for their helpful feedback and encouragement. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), RGPIN-2021-03445.

Reproducibility For our theoretical results, we include formal assumptions, statements, and proofs in Appxs. A and B. We include the detailed derivations of our algorithms in Appx. C. For our experiments, we include experimental details for reproducing our results in Appx. E and have released our code at https://github.com/ryoungj/optdom.
Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.

Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.

Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Danny Gutfreund, Joshua Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. 2019.

Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473, 2018.

Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010a.

Shai Ben-David, Tyler Lu, Teresa Luu, and David Pal. Impossibility theorems for domain adaptation. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 129–136, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010b. PMLR. URL https://proceedings.mlr.press/v9/david10a.html.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, pp. 177–186. Springer, 2010.
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

Yann Dubois, Douwe Kiela, David J Schwab, and Ramakrishna Vedantam. Learning optimal representations with the decodable information bottleneck. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 18674–18690. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/d8ea5f53c1b1eb087ac2e356253395d8-Paper.pdf.

Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, and Chris J. Maddison. Lossy compression for lossless prediction. arXiv preprint arXiv:2106.10800, 2021.

Chen Fang, Ye Xu, and Daniel N Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1657–1664, 2013.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pp. 2839–2848. PMLR, 2016.

Ian Goodfellow.
NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=lQdXeXDoWtI.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241, 2020.

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262–15271, 2021.

Fredrik D Johansson, David Sontag, and Rajesh Ranganath. Support and invertibility in domain-invariant representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 527–536. PMLR, 2019.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215, 2017.

David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). In International Conference on Machine Learning, pp. 5815–5826. PMLR, 2021.
Bo Li, Yifei Shen, Yezhen Wang, Wenzhen Zhu, Colorado J Reed, Jun Zhang, Dongsheng Li, Kurt Keutzer, and Han Zhao. Invariant information bottleneck for domain generalization. arXiv preprint arXiv:2106.06333, 2021.

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5542–5550, 2017.

Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409, 2018a.

Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639, 2018b.

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97–105. PMLR, 2015.

Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pp. 2208–2217. PMLR, 2017.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.

A Tuan Nguyen, Toan Tran, Yarin Gal, Philip HS Torr, and Atılım Güneş Baydin. KL guided domain adaptation. arXiv preprint arXiv:2106.07780, 2021.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1406–1415, 2019.
Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180. PMLR, 2019.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8748–8763. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/radford21a.html.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 5628–5637. PMLR, 2019. URL http://proceedings.mlr.press/v97/saunshi19a.html.

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

Ohad Shamir, Sivan Sabato, and Naftali Tishby.
Learning and generalization with the information bottleneck. Theor. Comput. Sci., 411(29–30):2696–2711, 2010. doi: 10.1016/j.tcs.2010.04.006. URL https://doi.org/10.1016/j.tcs.2010.04.006.

Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and Ludwig Schmidt. Do image classifiers generalize across time? arXiv preprint arXiv:1906.02168, 2019.

Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL http://arxiv.org/abs/1703.00810.

Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pp. 443–450. Springer, 2016.

Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. Advances in Neural Information Processing Systems, 33, 2020.

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. arXiv preprint arXiv:2007.00644, 2020.

Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027, 2017.

Haohan Wang, Songwei Ge, Eric P Xing, and Zachary C Lipton.
Learning robust global representations by penalizing local predictive power. arXiv preprint arXiv:1905.13549, 2019.

Luyu Wang, Pauline Luc, Adria Recasens, Jean-Baptiste Alayrac, and Aaron van den Oord. Multimodal self-supervised learning of general audio representations. arXiv preprint arXiv:2104.12807, 2021.

Aolin Xu and Maxim Raginsky. Minimum excess risk in Bayesian learning. arXiv preprint arXiv:2012.14868, 2020.

Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.

Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pp. 7523–7532. PMLR, 2019.

Table of Contents

A Preliminaries
  A.1 Notation
  A.2 Definitions
  A.3 Assumptions
B Proofs
  B.1 Lemmas for general losses
  B.2 Proof of Theorem 1
  B.3 Impossibility results
  B.4 Augmentations
C Practical objectives
  C.1 Mutual information bottleneck B[Z, X, Y, D] = I[Z; X]
  C.2 Entropy bottleneck B[Z, X, Y, D] = H[Z]
  C.3 Contrastive adversarial domain bottleneck B[Z, X, Y, D] = I[Z; D]
  C.4 Conditional CAD B[Z, X, Y, D] = I[Z; D | Y]
D Extended Related Work
E Experimental Details
  E.1 Scientific
  E.2 Bridge
  E.3 DomainBed
  E.4 LAION
F Additional Experimental Results
  F.1 Scientific
  F.2 Bridge
  F.3 DomainBed
  F.4 LAION

A PRELIMINARIES

A.1 NOTATION

For the most part, we will assume that all spaces are discrete probability spaces. A full list of assumptions is found in Appx. A.3.

General. The image of a set A ⊆ X under a function f : X → Y is denoted f(A) = {f(x) | x ∈ A}. The pre-image is denoted f⁻¹(B) = {x ∈ X | f(x) ∈ B} for B ⊆ Y.

Probability. Random variables (r.v.s) are denoted by uppercase letters (e.g., X), and their sample space and realizations are denoted by the corresponding calligraphic (e.g., 𝒳) and lowercase (e.g., x) letters, respectively. The probability mass function (pmf) of a random variable X is denoted p_X. We use capital P instead of p to denote the measure under p. The support supp(p_X) of a discrete distribution is the set of all points x ∈ X with positive probability, i.e., supp(p_X) = {x ∈ X | p_X(x) > 0}. The space of all probability distributions on X is denoted P(X) = {p_X | p_X(x) ≥ 0 and Σ_{x∈X} p_X(x) = 1}. When it is necessary to be explicit, we will denote that X is distributed as p_X using the notation X ∼ p_X. Expectations are written as E_{p_X}[f(X)], independence of two r.v.s as ⫫, and conditional independence as ⫫ · | ·. For jointly distributed random variables (X, Y) taking value in (t.v.i.)
X × Y, the conditional distribution is denoted p_{Y|X} : Y × X → [0, 1]. For convenience, let p_{Y|x} = p_{Y|X}(· | x) be the conditional distribution of Y given x. All random variables are independently distributed, unless an explicit joint distribution or coupling is given.

A.2 DEFINITIONS

We are interested in prediction problems with domain shift. There are three random variables: the target domain D_t, the input X, and the label Y. They have the following joint distribution:

(D_t, X, Y) ∼ p_{D_t} p_{X,Y | D_t}    (9)

where we drop the arguments of the probability densities for clarity. We make a variety of convenience assumptions on these random variables (Assumption 6). Crucially, we will be making the Bayes invariance assumption on p_{D_t,X,Y}, which can be thought of as a generalized covariate shift assumption (Assumption 4). We will be studying the effect of changing the representation of the data. This is done by encoding X into a representation Z using a conditional distribution p_{Z|X}.

Definition 1 (Encoder). An encoder is a conditional distribution p_{Z|X} : Z × X → [0, 1] from the input space X to the representation space Z.

The data together with the representation has the following joint:

(D_t, X, Y, Z) ∼ p_{D_t} p_{X,Y | D_t} p_{Z|X}    (10)

The key thing to notice here is that Z is conditionally independent of Y and D_t given X. In particular, the same encoder is used across all domains.

A.2.1 RISK MINIMIZATION

Our ultimate goal is to predict Y from the representation Z of X in a manner that is robust to changes in the domain. We formalize this in the standard way by making predictions γ ∈ Γ in a space of predictions or actions. For example, the prediction space may be the set of all possible labels, Γ = Y, in which case we would be predicting deterministic labels. Or we may predict a distribution over labels, in which case the prediction space would be the set of all probability distributions on Y, i.e., Γ = P(Y).
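The factorized joint in Eq. (10) can be made concrete by ancestral sampling: draw a domain, then an input-label pair from that domain, then a representation from the shared encoder. The dict-based toy distributions below are an illustrative encoding, not notation from the paper.

```python
import random

def sample_joint(p_d, p_xy_given_d, p_z_given_x):
    """Ancestral sampling from the joint of Eq. (10):
    D_t ~ p_{D_t}, (X, Y) ~ p_{X,Y|D_t}, then Z ~ p_{Z|X}.
    Because the same encoder p_{Z|X} is applied in every domain,
    Z is conditionally independent of (Y, D_t) given X."""
    d = random.choices(list(p_d), weights=list(p_d.values()))[0]
    xy_dist = p_xy_given_d[d]
    x, y = random.choices(list(xy_dist), weights=list(xy_dist.values()))[0]
    z_dist = p_z_given_x[x]          # encoder shared across domains
    z = random.choices(list(z_dist), weights=list(z_dist.values()))[0]
    return d, x, y, z
```

Note that `p_z_given_x` never sees `d` or `y`, which is exactly the conditional-independence structure Z ⫫ (Y, D_t) | X stated above.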
A predictor is a function mapping inputs to predictions, i.e., f : X → Γ, or representations to predictions, i.e., h : Z → Γ. For example, f may be a neural network that takes as input a sample x and outputs a vector of logits that parameterize a softmax distribution over finitely many labels. We select predictors according to the risk, defined via a loss function ℓ : Y × Γ → ℝ≥0 ∪ {∞}:

R_f[Y | X] := E_{p_{X,Y}}[ℓ(Y, f(X))].    (11)

In particular, we are interested in the Bayes (minimum) risk over all predictors:

R*[Y | X] := inf_f R_f[Y | X].    (12)

We denote the set of all optimal predictors from X as

F* := {f | R_f[Y | X] = R*[Y | X]}.    (13)

Similarly, we define the risk R_h[Y | Z], the Bayes risk R*[Y | Z], and the set of optimal predictors

H*_Z := {h | R_h[Y | Z] = R*[Y | Z]}    (14)

from Z, all of which vary as a function of the encoder p_{Z|X}. Note that in the main body of the paper we omitted the subscript Z from H*_Z for clarity, but we will keep it in the Appendices. We assume that together our loss and prediction space always admit optima (Item 2 of Assumption 2), and thus F*, H*_Z are always non-empty. We will be assuming that the risk admits a unique optimal prediction when predicting from X (Item 3 of Assumption 2). Thus it makes sense to define the following:

Definition 2 (The Bayes predictor). The Bayes predictor f* : X → Γ is the unique predictor that is optimal for all x ∈ X:

f*(x) = argmin_{γ∈Γ} E_{p_{Y|x}}[ℓ(Y, γ)]    (15)

Definition 3 (The Bayes image). The image of all the inputs under the Bayes predictor will be denoted Γ* = f*(X) and called the Bayes image.

Note that F* becomes a singleton {f*}, but this is not necessarily the case for H*_Z, since we will not be making any uniqueness assumption on optimal prediction from Z.

A.2.2 DOMAIN GENERALIZATION

We are interested in controlling the risk in a domain generalization setting, and so we define the domain-conditional risk,

R^d_f[Y | X] := E_{p_{X,Y|d}}[ℓ(Y, f(X))].
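For the log loss ℓ(y, γ) = −log γ(y), the Bayes predictor of Definition 2 is f*(x) = p_{Y|x}, and the Bayes risk of Eq. (12) reduces to the conditional entropy H[Y|X]. A small sketch computing this for toy dict-based pmfs (an illustrative encoding, not the paper's code):

```python
import math

def bayes_risk_log_loss(p_x, p_y_given_x):
    """Bayes risk R*[Y|X] under the log loss l(y, gamma) = -log gamma(y).
    The Bayes predictor is f*(x) = p_{Y|x}, so the Bayes risk equals the
    conditional entropy H[Y|X] in nats."""
    risk = 0.0
    for x, px in p_x.items():
        # expected loss of the optimal prediction gamma* = p_{Y|x}
        risk += px * -sum(p * math.log(p)
                          for p in p_y_given_x[x].values() if p > 0)
    return risk
```

When labels are deterministic given x the risk is 0; when p_{Y|x} is uniform over two labels for every x, the risk is log 2, no matter which predictor is used.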
(16)

R^d_*[Y | X] and F*_d are defined as in Eqs. (12) and (13), respectively, but with respect to R^d_f. Similarly, define the Bayes image for domain d as

Γ*_d := f*(supp(p_{X|d})).    (17)

We also define domain-conditional quantities for prediction from a representation Z. The most important term, which we will be investigating, is an idealization of the domain generalization worst-case risk.

Definition 4 (IDG risk). Given an encoder p_{Z|X} and a distribution p_{D_t,D_s} over a target domain D_t and source domain D_s, the idealized domain generalization worst-case risk, IDG risk for short, is the expected worst-case target risk taken over source minimizers, i.e.,

R_IDG[Y | Z] := E_{p_{D_t,D_s}}[ sup_{h* ∈ H*_{Z,D_s}} R^{D_t}_{h*}[Y | Z] ]    (18)

Note that the IDG risk is well-defined because H*_{Z,D_s} is non-empty by Assumption 2. The desired optimal representations are then those that minimize the IDG risk.

Definition 5 (Optimal representations for IDG). An encoder p*_{Z|X} is optimal for idealized domain generalization if and only if it minimizes the IDG risk, i.e.,

R_IDG[Y | Z*] = inf_{p_{Z|X}} R_IDG[Y | Z]    (19)

A.3 ASSUMPTIONS

We make the following assumptions throughout the paper. All these assumptions should hold in practical settings.

Assumption 1 (Convenience: discrete probability spaces). All data spaces (D, X, Y, Z, A) are discrete spaces. Because the distributions of X, Y, D are fixed, we assume for convenience that supp(p_X) = X, supp(p_Y) = Y, and supp(p_{D_t}) = D.

Assumption 1 is a convenience assumption to avoid measure theory for the sake of clarity. It always holds in practice due to the finiteness of computers, i.e., all spaces will be finite but arbitrarily large. We believe that our claims can nevertheless be generalized to typical continuous spaces with some minor technical assumptions.

Assumption 2 (Losses admit optima). We assume that our risk always admits optimal predictions:

1. |Γ| > 1.

2.
For all p_Υ ∈ P(Y), there exists γ* ∈ Γ such that E_{p_Υ}[ℓ(Υ, γ*)] ≤ E_{p_Υ}[ℓ(Υ, γ)] for all γ ∈ Γ.    (20)

3. For all x ∈ X, there exists γ* ∈ Γ such that E_{p_{Y|x}}[ℓ(Y, γ*)] < E_{p_{Y|x}}[ℓ(Y, γ)] for all γ ≠ γ*.    (21)

Note that for the log loss ℓ(y, γ) = −log γ(y) and finite Y, these assumptions are satisfied if Γ = P(Y), where the optimal prediction for Item 3 is γ* = p_{Y|x} by strict properness (Gneiting & Raftery, 2007). If we consider the 0-1 loss (reverse accuracy) ℓ(y, γ) = 1 − 1[y = γ] with Γ = Y and a finite label space, where the optimal prediction for Item 3 is γ* = argmax_{y∈Y} p_{Y|x}(y), this assumption is mostly satisfied, except that we assume p_{Y|x} has a unique mode.

Assumption 2 serves two purposes: Item 2 ensures that for any representation the optimal predictors from Z exist, so that the IDG risk is well-defined as in Def. 5; Item 3 ensures a unique Bayes predictor from X, which simplifies the analysis and is satisfied by common losses as described above.

Assumption 3 (Cardinalities). We assume that

|Z| ≥ |Γ*| ≥ 2    (22)

Assumption 3 is very weak and ensures that optimal representations always exist (Prop. 3).

Assumption 4 (Generalized covariate shift). The Bayes predictor is optimal for all domains. I.e., for all (x, d) ∈ supp(p_{X,D_t}) and all γ ∈ Γ such that γ ≠ f*(x), we have

E_{p_{Y|x,d}}[ℓ(Y, f*(x))] < E_{p_{Y|x,d}}[ℓ(Y, γ)].    (23)

For example, in the case of strictly proper scoring rules, e.g., the log loss, covariate shift p_{Y|X,D} = p_{Y|X} is equivalent to the invariance of the Bayes predictor. For the 0-1 loss, this is guaranteed by invariance of the most likely label. For MSE, it is guaranteed by invariance of the expected label. In the latter two cases, Assumption 4 is less stringent than the typical covariate shift assumption. Assumption 4 is the core assumption for our theoretical results. It ensures that source and target domains are related in a useful way that can be utilized by the representation.

Assumption 5 (Constant Bayes image).
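Strict properness of the log loss, invoked above for Item 3, can be checked numerically: the expected log loss against a true pmf p is uniquely minimized at the prediction γ = p. A tiny sketch with dict-based pmfs (illustrative encoding only):

```python
import math

def expected_log_loss(p, gamma):
    """Cross-entropy E_{Y~p}[-log gamma(Y)]. By strict properness of the
    log loss (Gneiting & Raftery, 2007), this is uniquely minimized at
    gamma = p, which instantiates Item 3 of Assumption 2 with
    Gamma = P(Y) and gamma* = p_{Y|x}."""
    return -sum(p[y] * math.log(gamma[y]) for y in p if p[y] > 0)
```

For p = {a: 0.7, b: 0.3}, predicting p itself beats both the uniform prediction and an overconfident one, as the test below confirms.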
The Bayes image is invariant across domains, i.e., for all d ∈ D,

Γ*_d = Γ*.    (24)

For the case of the 0-1 loss, this simply means that the label set for all domains is the same, which is trivial. For the log loss, this means that the set of possible conditional distributions Γ*_d = {p_{Y|x} | x ∈ supp(p_{X|d})} is the same across domains.

Assumption 5 is crucial to be able to learn. Without it, in the extreme case, one could set each domain to be all examples associated with a single element from the label set (or the Bayes image set), in which case it is impossible to generalize across different domains. Assumption 5 is also necessary to guarantee the existence of optimal representations as in Prop. 3.

Assumption 6 (Domain joint). p_{D_t,D_s} is any distribution such that supp(p_{D_t,D_s}) = D × D.

In a simplified scenario, one could define the source D_s and target D_t as i.i.d. r.v.s from p_{D_t}, in which case p_{D_t,D_s} = p_{D_t} p_{D_s} = p_{D_t} p_{D_t} and Assumption 6 is trivially satisfied.

B.1 LEMMAS FOR GENERAL LOSSES

An important result that we will be using is the generalized data processing inequality of the Bayes risk (Xu & Raginsky, 2020; Dubois et al., 2021). We include it here for completeness.

Lemma 1 (Generalized DPI (Xu & Raginsky, 2020; Dubois et al., 2021)). Let Z — X — Y be a Markov chain of random variables. For any loss function ℓ,

R*[Y | X] ≤ R*[Y | Z].    (25)

For the case of strictly proper losses (Assumption 2) we can go one step further.

Lemma 2. Let Z — X — Y be a Markov chain of random variables. Then, under Assumptions 1 and 2, we have that

R*[Y | Z] = R*[Y | X]  ⟺  ∀h* ∈ H*_Z, ∀(x, z) ∈ supp(p_{X,Z}), h*(z) = f*(x).    (26)

Proof. Suppose that for all h* ∈ H*_Z we have h*(z) = f*(x) on the support of p_{X,Z}. Then,

R*[Y | X] = E_{p_{X,Y}}[ℓ(Y, f*(X))]    (27)
= E_{p_{X,Y} p_{Z|X}}[ℓ(Y, f*(X))]    (28)
= E_{p_{X,Y} p_{Z|X}}[ℓ(Y, h*(Z))]    (29)
= E_{p_{Z,Y}}[ℓ(Y, h*(Z))]    (30)
= R*[Y | Z].
(31) Now suppose there exists a h H Z and a pair (x , z ) supp(p X,Z) such that h (z ) = f (x ). Then R [Y | Z] (32) = Ep X,Zp Y | X[ℓ(Y, h (Z))] (33) = p X,Z(x , z ) Ep Y | x [ℓ(Y, h (z ))] + X (x,z) =(x ,z ) p X,Z(x, z) Ep Y | x[ℓ(Y, h (z))] (34) p X,Z(x , z ) Ep Y | x [ℓ(Y, h (z ))] + X (x,z) =(x ,z ) p X,Z(x, z) Ep Y | x[ℓ(Y, f (x))] (35) > p X,Z(x , z ) Ep Y | x [ℓ(Y, f (x ))] + X (x,z) =(x ,z ) p X,Z(x, z) Ep Y | x[ℓ(Y, f (x))] (36) = R [Y | X] (37) Eq. (35) follows by Item 3 of Assumption 2 along with the definition of f . Eq. (36) follows by Item 3 of Assumption 2 and the fact that h (z ) = f (x ). This completes the proof, because Lemma 1 prevents R [Y | Z] < R [Y | X]. B.2 PROOF OF THEOREM 1 First we will show that the desired representation exists by taking all inputs for which the Bayes predictor predicts similarly and bucketing them to the same representation. This is a direct extension of the example from Dubois et al. s (2020) Proposition 6, to the case of proper losses. Proposition 3 (Existence of optimal representations). Under Assumptions 1 to 5, there exists an encoder p Z | X that is optimal for Eq. (3), i.e., p Z | X arg min p Z | X R [Y | Z] s.t. d D, supp(p Z | d) = supp(p Z). (38) Moreover, we have that R [Y | X] = R [Y | Z ] . (39) Published as a conference paper at ICLR 2022 Proof. Because we assume arbitrary encoders p Z | X, the essence of this construction is simple: we embed the Bayes image into Z. Indeed, let φ : Γ Z be any one-to-one function, which exists due to Assumption 3 (here we use deterministic one-to-one function for simplicity, the construction can be easily extended to stochastic case). Then let Z = φ(f (X)). We now verify the properties of p Z | X. 1. Z satisfies R [Y | X] = R [Y | Z ]. Indeed, R [Y | X] = Ep X,Y [ℓ(Y, f (X))] (40) = Ep X,Y p Z | X[ℓ(Y, f (X))] (41) = Ep Z ,Y ℓ(Y, φ 1(Z )) (42) R [Y | Z ] . (43) Eq. (42) is by our construction of Z and Eq. (43) is by the definition of the Bayes risk. 
By the data processing inequality for Bayes risk (Lemma 1) we also have R*[Y | X] ≤ R*[Y | Z*], from which we conclude that R*[Y | X] = R*[Y | Z*], i.e., Eq. (39) holds.
2. Recall that Γ* = f*(X) and Γ*_d = f*(supp(p_{X|d})). Let us compute the desired support for all d ∈ D:
supp(p_{Z*|d}) = φ(Γ*_d) (44)
= φ(Γ*) (45)
= supp(p_{Z*}). (46)
Eq. (45) is by Assumption 5. Because R*[Y | X] is the minimum risk achievable by any encoder, with or without the constraint (by Lemma 1), p*_{Z|X} is an optimal encoder for Eq. (3).
The following lemma essentially says that when R*[Y | Z] is minimized, the optimal predictors for each domain all agree on the intersection of their supports.
Lemma 3. Let p_{Z|X} be an encoder such that R*[Y | Z] = R*[Y | X]. Under Assumptions 1 and 2, for all z ∈ supp(p_Z) there exists γ* ∈ Γ such that
E_{p_{Y|z}}[ℓ(Y, γ*)] < E_{p_{Y|z}}[ℓ(Y, γ)] ∀γ ≠ γ*. (47)
In other words, the restriction of any h* ∈ H*_Z to supp(p_Z) is unique. If, in addition, Assumption 4 holds, then for all (z, d) ∈ supp(p_{Z,D_t}) and all γ ∈ Γ such that γ ≠ h*(z),
E_{p_{Y|z,d}}[ℓ(Y, h*(z))] < E_{p_{Y|z,d}}[ℓ(Y, γ)]. (48)
In other words, the restriction of any h ∈ H*_{Z,d} to supp(p_{Z|d}) is unique and equal to h*.
Proof. For the first result, let z ∈ supp(p_Z) and consider x ∈ supp(p_{X|z}). By Lemma 2, f* must be constant on supp(p_{X|z}), so we can pick γ* = f*(x). Now let γ ≠ γ*. We have
E_{p_{Y|z}}[ℓ(Y, γ*)] = E_{p_{X|z} p_{Y|X}}[ℓ(Y, γ*)] (49)
= E_{p_{X|z} p_{Y|X}}[ℓ(Y, f*(X))] (50)
< E_{p_{X|z} p_{Y|X}}[ℓ(Y, γ)] (51)
= E_{p_{Y|z}}[ℓ(Y, γ)]. (52)
Eq. (49) is due to the conditional independence of Y and Z given X. Eq. (51) is due to Assumption 2 and the definition of the Bayes predictor. Let h* : supp(p_Z) → Γ denote the unique Bayes predictor from Z.
Now, for the second result, note that
R*[Y | X] = R_{f*}[Y | X] (53)
= Σ_{d∈D} p_{D_t}(d) R^d_{f*}[Y | X] (54)
= Σ_{d∈D} p_{D_t}(d) R^d_*[Y | X], (Assumption 4) (55)
R*[Y | Z] = R_{h*}[Y | Z] (56)
= Σ_{d∈D} p_{D_t}(d) R^d_{h*}[Y | Z] (57)
≥ Σ_{d∈D} p_{D_t}(d) R^d_*[Y | Z], (58)
where Eq. (58) is due to the definition of the (domain-conditional) Bayes risk. Then
R*[Y | Z] − R*[Y | X] ≥ Σ_{d∈D} p_{D_t}(d) (R^d_*[Y | Z] − R^d_*[Y | X]) (59)
≥ 0. (Lemma 1 conditioned on d) (60)
Thus, any encoder achieving R*[Y | Z] = R*[Y | X] also satisfies R^d_*[Y | Z] = R^d_*[Y | X] for all d ∈ D, since supp(p_{D_t}) = D by Assumption 1. Now let d ∈ D. An argument analogous to Lemma 2 gives
∀h ∈ H*_{Z,d}, ∀(x, z) ∈ supp(p_{X,Z|d}), h(z) = f*(x) = h*(z). (61)
Eq. (61) is derived from R^d_*[Y | Z] = R^d_*[Y | X], using Assumption 4 in place of Item 3 of Assumption 2 for the specific domain d. Let z ∈ supp(p_{Z|d}) and γ ∈ Γ with γ ≠ h*(z). Since supp(p_{X|z,d}) ⊆ supp(p_{X|z}), f* is constant on supp(p_{X|z,d}) and equal to h*(z). As above, we then have
E_{p_{Y|z,d}}[ℓ(Y, h*(z))] = E_{p_{X|z,d} p_{Y|X,d}}[ℓ(Y, h*(z))] (62)
= E_{p_{X|z,d} p_{Y|X,d}}[ℓ(Y, f*(X))] (63)
< E_{p_{X|z,d} p_{Y|X,d}}[ℓ(Y, γ)] (64)
= E_{p_{Y|z,d}}[ℓ(Y, γ)]. (65)
Eq. (64) is due to Assumption 4.
Corollary 1. Let p_{Z|X} be an encoder such that R*[Y | Z] = R*[Y | X]. Under Assumptions 1, 2 and 4 we have H*_Z ⊆ H*_{Z,d} for all d ∈ D, and for all d_s, d_t ∈ D,
inf_{h ∈ H*_{Z,d_s}} R^{d_t}_h[Y | Z] = R^{d_t}_*[Y | Z]. (66)
Proof. H*_Z ⊆ H*_{Z,d} is immediate from Lemma 3. Further, R^{d_t}_h[Y | Z] ≥ R^{d_t}_*[Y | Z] for any h, so the result follows by taking any h ∈ H*_Z ⊆ H*_{Z,d_s} in the inf of Eq. (66).
Theorem 2 (Characterizing optimal representations for IDG; equiv. Theorem 1). Under Assumptions 1 to 6, an encoder p_{Z|X} is optimal for idealized domain generalization if and only if it minimizes the Bayes risk while matching the support of p_{Z|d} and p_Z for all d ∈ D, i.e.,
p_{Z|X} ∈ arg min_{p_{Z|X}} R*[Y | Z] (67)
s.t. ∀d ∈ D, supp(p_{Z|d}) = supp(p_Z). (68)
Proof.
The IDG risk is lower bounded by R*[Y | X]:
R_IDG[Y | Z] ≥ E_{p_{D_s,D_t}}[ inf_{h ∈ H*_{Z,D_s}} R^{D_t}_h[Y | Z] ] (69)
≥ E_{p_{D_s,D_t}}[ R^{D_t}_*[Y | Z] ] (70)
≥ E_{p_{D_s,D_t}}[ R^{D_t}_*[Y | X] ] (Lemma 1) (71)
= R*[Y | X]. (Assumption 4) (72)
We now show that this lower bound is achieved by an encoder if and only if it satisfies Eqs. (67) and (68); such encoders exist by Prop. 3.
Sufficiency (⇐=): Let p_{Z|X} be an encoder satisfying Eqs. (67) and (68). Note that R*[Y | Z] = R*[Y | X] by Prop. 3. Let h* ∈ H*_Z. Then the IDG risk is
R_IDG[Y | Z] (73)
= E_{p_{D_s,D_t}}[ sup_{h ∈ H*_{Z,D_s}} E_{p_{Z,Y|D_t}}[ℓ(Y, h(Z))] ] (74)
= E_{p_{D_s,D_t}}[ E_{p_{Z,Y|D_t}}[ℓ(Y, h*(Z))] ] (Lemma 3 under matching support) (75)
= E_{p_{D_t}}[ E_{p_{Z,Y|D_t}}[ℓ(Y, h*(Z))] ] (constant w.r.t. D_s) (76)
= R*[Y | Z] = R*[Y | X]. (77)
Necessity (=⇒): If the IDG risk equals R*[Y | X], then it must be the case that
R*[Y | Z] = R*[Y | X] (78)
sup_{h ∈ H*_{Z,d_s}} R^{d_t}_h[Y | Z] = R^{d_t}_*[Y | Z] ∀(d_s, d_t) ∈ supp(p_{D_s,D_t}). (79)
We prove by contraposition that Eq. (79) implies support match (Eq. (68)). Suppose support match does not hold. Since supp(p_Z) = ∪_{d∈D} supp(p_{Z|d}) and supp(p_{D_s,D_t}) = D × D (Assumption 6), there must exist (d_s, d_t) ∈ supp(p_{D_s,D_t}) such that supp(p_{Z|d_t}) ⊈ supp(p_{Z|d_s}). Define the sets S := supp(p_{Z|d_s}) ∩ supp(p_{Z|d_t}) and S̄ := supp(p_{Z|d_t}) \ supp(p_{Z|d_s}), let ρ = P_{Z|d_t}(S), and let h* ∈ H*_Z. Then,
sup_{h ∈ H*_{Z,d_s}} R^{d_t}_h[Y | Z] (80)
= sup_{h ∈ H*_{Z,d_s}} ( ρ E_{p_{Y,Z|S,d_t}}[ℓ(Y, h(Z))] + (1−ρ) E_{p_{Y,Z|S̄,d_t}}[ℓ(Y, h(Z))] ) (81)
= sup_{h ∈ H*_{Z,d_s}} ( ρ E_{p_{Y,Z|S,d_t}}[ℓ(Y, h*(Z))] + (1−ρ) E_{p_{Y,Z|S̄,d_t}}[ℓ(Y, h(Z))] ) (Lem. 3) (82)
= ρ E_{p_{Y,Z|S,d_t}}[ℓ(Y, h*(Z))] + (1−ρ) sup_{h ∈ H*_{Z,d_s}} E_{p_{Y,Z|S̄,d_t}}[ℓ(Y, h(Z))] (83)
= R^{d_t}_*[Y | Z] + (1−ρ) sup_{h ∈ H*_{Z,d_s}} E_{p_{Y,Z|S̄,d_t}}[ℓ(Y, h(Z)) − ℓ(Y, h*(Z))] (84)
> R^{d_t}_*[Y | Z]. (Lem. 3) (85)
Eq. (85) uses the following reasoning: 1 − ρ > 0 due to the support mismatch, and for any h ∈ H*_{Z,d_s} such that h ≠ h* on S̄ (such an h exists by Item 1 of Assumption 2), we have
E_{p_{Y,Z|S̄,d_t}}[ℓ(Y, h(Z)) − ℓ(Y, h*(Z))] > 0 (86)
by Lemma 3.
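To make the characterization concrete, the following toy numpy sketch (not from the paper; all distributions are illustrative) builds the optimal encoder of Prop. 3 on a discrete problem by bucketing inputs by their Bayes prediction, and checks the two sides of Theorem 2 under 0-1 loss: the supports match across domains, and a risk minimizer fit only on the source attains the target Bayes risk.

```python
import numpy as np

# Inputs 0,1 form the source domain, inputs 2,3 the target domain.
# p_Y_given_x[x] = distribution of the label Y given input x (illustrative numbers).
p_Y_given_x = np.array([
    [0.9, 0.1],   # x=0, Bayes prediction 0
    [0.2, 0.8],   # x=1, Bayes prediction 1
    [0.8, 0.2],   # x=2, Bayes prediction 0
    [0.3, 0.7],   # x=3, Bayes prediction 1
])
f_star = p_Y_given_x.argmax(axis=1)          # Bayes predictor from X (0-1 loss)
encode = lambda x: f_star[x]                 # Prop. 3 encoder: bucket inputs by Bayes prediction

source_xs, target_xs = [0, 1], [2, 3]
# Support match (Eq. (68)): both domains produce the same set of representations.
assert {encode(x) for x in source_xs} == {encode(x) for x in target_xs}

# Fit the 0-1-loss risk minimizer h on the SOURCE only: for each z, average
# p_Y_given_x over the source inputs mapped to z, then take the mode.
h = {}
for z in {encode(x) for x in source_xs}:
    members = [x for x in source_xs if encode(x) == z]
    h[z] = p_Y_given_x[members].mean(axis=0).argmax()

# Target risk of the source-trained h equals the target Bayes risk from raw X.
risk_h = np.mean([1.0 - p_Y_given_x[x, h[encode(x)]] for x in target_xs])
bayes_risk = np.mean([1.0 - p_Y_given_x[x].max() for x in target_xs])
print(risk_h, bayes_risk)   # → 0.25 0.25
```

Breaking the support match (e.g., an identity encoder, whose target representations were never seen on the source) lets the source risk minimizer behave arbitrarily on the target, which is exactly the mechanism of the necessity direction above.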
As a corollary of the proof strategy, we directly obtain that the optimal IDG risk is simply R*[Y | X]. This means that, with the optimal encoder, training on the source performs just as well as training directly on the target with the raw data.
Corollary 2 (Optimal IDG risk). Under Assumptions 1 to 6, inf_{p_{Z|X}} R_IDG[Y | Z] = R*[Y | X].

B.3 IMPOSSIBILITY RESULTS

As a direct corollary of Thm. 2, it is impossible to learn an optimal representation without knowledge of, or assumptions about, the target domain. We can in fact prove the following much stronger negative result, which essentially states that it is impossible to find a useful representation without some information about the target domain. Specifically, we prove that if there exists a non-trivial target domain on which the representation is advantageous, then there exist infinitely many target domains on which it is disadvantageous compared to predicting from a constant. For clarity, we focus on the proof for standard accuracy (0-1 loss), which is much shorter and simpler to understand, but note that the proof can be generalized to all losses under the right assumptions. The key is that outside of the source domain the label distribution is unconstrained, because generalized covariate shift has no effect there: for any domain that puts probability mass on an example not seen during training, every possible label for that example gives a valid domain. Furthermore, if there exists one domain on which the representation is good, then one can construct a domain on which the representation is bad simply by labelling such a point with the constant prediction.
Proposition 4 (No free lunch for learning representations for IDG; equiv. Proposition 1). Let ℓ be the 0-1 loss with prediction space Γ = Y.
Let Rep : P(X × Y) → P(Z | X) be any algorithm for choosing an encoder p_{Z|X} from the data distribution p_{X,Y}, let C be any constant r.v. taking values in Z, and let p_{X,Y|d_s} be any source distribution such that there is a unique constant prediction γ_C = arg min_{y∈Y} E_{p_{Y|d_s}}[ℓ(Y, y)] and |X \ supp(p_{X|d_s})| > 1. Let p_{Z_{d_s}|X} := Rep(p_{X,Y|d_s}) be the chosen source encoder. If there exists a target domain p_{X,Y|d^g_t} such that
(Non-trivial support) ∅ ≠ supp(p_{X|d^g_t}) ⊆ X \ supp(p_{X|d_s});
(Satisfies Bayes image invariance) Γ*_{d^g_t} = Y, i.e., there is at least one example for every possible label;
(Source encoder is useful) p_{Z_{d_s}|X} performs better than a constant representation,
sup_{h ∈ H*_{Z_{d_s},d_s}} R^{d^g_t}_h[Y | Z_{d_s}] < sup_{h ∈ H*_{C,d_s}} R^{d^g_t}_h[Y | C], (87)
then there exist multiple target domains d^b_t on which p_{Z_{d_s}|X} underperforms a constant encoder,
sup_{h ∈ H*_{Z_{d_s},d_s}} R^{d^b_t}_h[Y | Z_{d_s}] > sup_{h ∈ H*_{C,d_s}} R^{d^b_t}_h[Y | C]. (88)
Proof. Let h* ∈ H*_{Z_{d_s},d_s} be any source Bayes predictor corresponding to our encoder. Partition Z according to whether h* predicts the constant or not:
Z_C := {z ∈ Z | h*(z) = γ_C}, Z_{≠C} := Z \ Z_C. (89)
By assumption, d^g_t is such that
sup_{h ∈ H*_{Z_{d_s},d_s}} R^{d^g_t}_h[Y | Z_{d_s}] < sup_{h ∈ H*_{C,d_s}} R^{d^g_t}_h[Y | C], (90)
which is clearly only possible if
P_{Z_{d_s}|d^g_t}(Z_{≠C}) > 0. (91)
In other words, there exists some input x_{≠C} ∈ X \ supp(p_{X|d_s}) that is represented outside of the constant region, i.e.,
P_{Z_{d_s}|x_{≠C}}(Z_{≠C}) > 0. (92)
We now construct the desired bad domain d^b_t by giving nearly all mass to this x_{≠C}: let p_{X|d^b_t}(x_{≠C}) = 1 − δ for some 0 < δ < 1, and assign this example the constant label, i.e., p_{Y|x_{≠C},d^b_t}(γ_C) = 1. The remaining target mass δ is distributed as in the source domain, i.e., p_{X,Y|d^b_t}(x, y) = δ · p_{X,Y|d_s}(x, y) for all (x, y) ∈ supp(p_{X,Y|d_s}). Importantly, the constructed domain d^b_t is valid.
Indeed, the Bayes image is the same as the source's (Assumption 5): we removed no prediction γ from the source's Bayes image (since δ > 0), and we added no new prediction, because f*(x_{≠C}) = γ_C ∈ Y must already have been in Γ* by the validity of d^g_t. Now let us compute the risk on the bad domain and show that the chosen encoder performs worse than a constant encoder:
sup_{h ∈ H*_{Z_{d_s},d_s}} R^{d^b_t}_h[Y | Z_{d_s}] (93)
= sup_{h ∈ H*_{Z_{d_s},d_s}} ( (1−δ) E_{p_{Z_{d_s}|x_{≠C}}}[1 − 1[γ_C = h(Z_{d_s})]] + δ R^{d_s}_h[Y | Z_{d_s}] ) (94)
≥ (1−δ)(1 − P_{Z_{d_s}|x_{≠C}}(Z_C)) (95)
= (1−δ) P_{Z_{d_s}|x_{≠C}}(Z_{≠C}). (96)
In contrast, it is easy to show that sup_{h ∈ H*_{C,d_s}} R^{d^b_t}_h[Y | C] ≤ δ, because the constant predictor is perfect on x_{≠C}. So any choice of 0 < δ < P_{Z_{d_s}|x_{≠C}}(Z_{≠C}) / (1 + P_{Z_{d_s}|x_{≠C}}(Z_{≠C})) satisfies Eq. (88). We conclude the proof by noting that there are infinitely many such choices of δ, each of which yields a different valid bad domain d^b_t.
Note that representations can often be much worse than a constant r.v. Specifically, if an encoder p_{Z|X} maps some x outside of the source support, then there exist infinitely many target domains on which that representation is the worst possible representation.
Proposition 5 (Worst representation). Let Rep, p_{X,Y|d_s}, p_{Z_{d_s}|X}, ℓ be as in Prop. 4, and let ϵ > 0. If there exists an example x_b ∈ X \ supp(p_{X|d_s}) that is mapped outside of the source support, i.e., supp(p_{Z_{d_s}|x_b}) ∩ supp(p_{Z|d_s}) = ∅, then there exist many target domains p_{X,Y|d_t} such that p_{Z_{d_s}|X} is ϵ-close to the worst possible loss, i.e.,
sup_{h ∈ H*_{Z_{d_s},d_s}} R^{d_t}_h[Y | Z_{d_s}] ≥ 1 − ϵ. (97)
Proof. By assumption there exists an x_b whose representation support lies outside the source support. Similarly to Prop. 4, we construct a bad target domain d_t by giving nearly all mass to that example, p_{X|d_t}(x_b) = 1 − δ with δ > 0, and assigning it with probability 1 some label in the source Bayes image, i.e., p_{Y|x_b,d_t}(γ_b) = 1 for some γ_b ∈ Γ*_{d_s}.
The remaining target mass δ is distributed over the source inputs as in Prop. 4. As in Prop. 4, such a target domain d_t satisfies our assumptions. Now we compute the risk on d_t and show that the chosen encoder performs arbitrarily badly:
sup_{h ∈ H*_{Z_{d_s},d_s}} R^{d_t}_h[Y | Z_{d_s}] (98)
= sup_{h ∈ H*_{Z_{d_s},d_s}} ( (1−δ) E_{p_{Z_{d_s}|x_b}}[1 − 1[γ_b = h(Z_{d_s})]] + δ R^{d_s}_h[Y | Z_{d_s}] ) (cf. Eq. (94)) (99)
≥ sup_{h ∈ H*_{Z_{d_s},d_s}} (1−δ) E_{p_{Z_{d_s}|x_b}}[1 − 1[γ_b = h(Z_{d_s})]] (100)
= 1 − δ. (101)
Eq. (101) uses the fact that H*_{Z_{d_s},d_s} is unconstrained outside of the source support and that, by assumption, supp(p_{Z_{d_s}|x_b}) ∩ supp(p_{Z_{d_s}|d_s}) = ∅; to achieve the supremum 1 − δ it then suffices to predict some γ ≠ γ_b in Γ. We thus see that Eq. (97) holds for d_t as long as 0 < δ < ϵ. We conclude the proof by noting that there are infinitely many possible choices of δ, each giving rise to a bad target domain.

B.4 AUGMENTATIONS

Proposition 2 shows that optimal representations for IDG can be learned with augmentations in a self-supervised fashion. Here we provide the formal definitions, assumptions, and proofs.
Definition 6 (Augmenter). An augmenter is a conditional distribution p_{A|X} : A × X → [0, 1] from the input space X to an augmentation space A. For example, in CLIP, X is the space of images and A is the space of text. In standard SSL, A is typically the same as X (e.g., both X and A are the space of images).
Definition 7 (Augmentation conditional set). Given an augmenter p_{A|X}, define the augmentation conditional set as the set of conditionals of A given X:
P(A | X) := { p_{A|x} | x ∈ X }. (102)
Similarly, define the augmentation conditional set for domain d:
P_d(A | X) := { p_{A|x} | x ∈ supp(p_{X|d}) }. (103)
These sets are clearly countable. Note that the augmentation conditional set can be seen as a special case of the Bayes image (Def. 3), if we view the augmentation A as the label and consider the log-loss, for which the conditional distribution is the Bayes-optimal predictor by strict properness (Gneiting & Raftery, 2007).
Assumption 7 (Finite augmentation entropy). We consider augmenters p_{A|X} such that the entropy of the augmentation A is finite, i.e., H[A] < ∞.
Assumption 8 (Cardinalities). We assume that
|Z| ≥ |P(A | X)|. (104)
This is analogous to Assumption 3 and ensures the existence of optimal representations.
Assumption 9 (Domain-agnostic augmentation). We assume that the augmentation A is domain-agnostic, i.e., the augmentation conditional set is invariant across domains:
P_d(A | X) = P(A | X), ∀d ∈ D. (105)
This assumption generalizes the constant Bayes image assumption (Assumption 5), which guarantees the existence of optimal representations. Domain-agnostic augmentation essentially ensures that each augmentation conditional p_{A|x} ∈ P(A | X) is seen at least once in every domain. Introduce the equivalence relation x ∼ x′ iff p_{A|x} = p_{A|x′}, with equivalence classes [x] := {x′ ∈ X | x′ ∼ x}. Under this relation, it is easy to see that the assumption is satisfied if and only if every equivalence class [x] ∈ {[x′] | x′ ∈ X} intersects all domains:
[x] ∩ supp(p_{X|d}) ≠ ∅, ∀d ∈ D. (106)
Not all augmentations are domain-agnostic. In particular, the standard image augmentations used by typical SSL models like SimCLR are not domain-agnostic, but the text-image augmentations of CLIP nearly are, as discussed in the main body (Sec. 4).
Assumption 10 (Bayes-preserving augmentation). We assume that the augmentation A is Bayes-preserving, i.e.,
∀x, x′ ∈ X, p_{A|x} = p_{A|x′} =⇒ f*(x) = f*(x′). (107)
In terms of the equivalence relation of Assumption 9, this means that within each equivalence class [x], all x′ ∈ [x] have the same Bayes prediction.
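The equivalence-class condition in Eq. (106) is easy to check on a finite toy problem. The sketch below (illustrative, not from the paper) groups inputs by their augmentation conditional p_{A|x} and tests whether every class intersects every domain's support:

```python
# Group inputs by their augmentation conditional p_{A|x} and check Eq. (106):
# an augmenter is domain-agnostic iff every equivalence class [x] meets every
# domain's support. All distributions and domain names are illustrative.
p_A_given_x = {            # x -> p_{A|x} over two augmentations
    0: (0.7, 0.3), 1: (0.7, 0.3),   # x=0 and x=1 are equivalent
    2: (0.1, 0.9), 3: (0.1, 0.9),   # x=2 and x=3 are equivalent
}
domains = {"sketch": {0, 2}, "photo": {1, 3}}   # supp(p_{X|d}) per domain

def is_domain_agnostic(p_A_given_x, domains):
    classes = {}            # the maximal invariant M: conditional -> equivalence class
    for x, cond in p_A_given_x.items():
        classes.setdefault(cond, set()).add(x)
    return all(cls & supp for cls in classes.values() for supp in domains.values())

print(is_domain_agnostic(p_A_given_x, domains))                              # → True
print(is_domain_agnostic(p_A_given_x, {"sketch": {0, 1}, "photo": {2, 3}}))  # → False
```

The second call illustrates the SimCLR-style failure mode: when each equivalence class lives entirely inside one domain, no augmentation conditional ties the domains together.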
Note that most augmentations used in practice, such as standard image augmentations, are Bayes-preserving. Next, we show that under the above assumptions we can learn optimal representations by maximizing the mutual information I[A; Z] (i.e., using the log-loss ℓ) under the support match constraint. We use the log-loss simply because it is the loss typically used for training in practice; the learned representations are optimal for any strictly proper loss.
Proposition 6 (Learning optimal representations without labels; equiv. Proposition 2). Let p_{A|X} be an augmenter. Under Assumptions 1 to 10, any encoder p_{Z|X} such that
p_{Z|X} ∈ arg max_{p_{Z|X}} I[A; Z] (108)
s.t. ∀d ∈ D, supp(p_{Z|d}) = supp(p_Z) (109)
is optimal for idealized domain generalization.
Proof. The support match constraint Eq. (109) is the same as the support match constraint Eq. (68). Thus, by Prop. 3 and Thm. 2, we only need to prove that maximizing the mutual information between A and Z under the support constraint implies
R*[Y | Z] = R*[Y | X]. (110)
We prove this by constructing an optimal predictor h*. Since H[A] < ∞ (Assumption 7), we have
arg max_{p_{Z|X}} I[A; Z] = arg min_{p_{Z|X}} H[A | Z]. (111)
Recall that the conditional entropy is the Bayes risk under the log-loss (Gneiting & Raftery, 2007), i.e., H[A | Z] = R*[A | Z]. By construction, A satisfies covariate shift w.r.t. X (and is thus Bayes invariant), since A — X — D forms a Markov chain. Together with Assumptions 1 and 7 to 9, this means that the optimization problem in Eqs. (108) and (109) satisfies the assumptions of Prop. 3 with A in place of Y. Thus, an optimal encoder satisfies R*[A | Z] = R*[A | X], which gives
H[A | Z] = H[A | X]. (112)
By Assumption 7, we can invoke Lemma 2 with the fact that A — X — Z forms a Markov chain to show that for all (x, z) ∈ supp(p_{X,Z}),
p_{A|z} = p_{A|x}, (113)
since the conditional distributions are the Bayes-optimal predictors by strict properness of the log-loss. Now define the following equivalence relation on X:
x ∼ x′ ⇔ p_{A|x} = p_{A|x′}. (114)
Because the number of equivalence classes under ∼ is countable, there exists a maximal invariant M : X → N from X to the natural numbers (for our definition of a maximal invariant see Definition 2 of Dubois et al., 2021). By Assumption 10, f* is invariant on the equivalence classes [x] := {x′ ∈ X | x′ ∼ x} for all x ∈ X. Thus, there exists a function g such that f* = g ∘ M (Lemma 5, Dubois et al., 2021). Given z ∈ supp(p_Z), we construct h* as follows: let x_z ∈ supp(p_{X|z}) be any input that could have led to the representation z, and define
h*(z) = g(M(x_z)). (115)
By Eq. (113), all x ∈ supp(p_{X|z}) lie in the same equivalence class and hence share the same value of f*. Thus, by the definition of M,
M(x_Z) = M(X) for (X, Z) ∼ p_{X,Z}. (116)
Then
R_{h*}[Y | Z] = E_{p_{Y,Z}}[ℓ(Y, h*(Z))] (117)
= E_{p_{Y|X} p_{X,Z}}[ℓ(Y, h*(Z))] (118)
= E_{p_{Y|X} p_{X,Z}}[ℓ(Y, g(M(x_Z)))] (Eq. (115)) (119)
= E_{p_{Y|X} p_{X,Z}}[ℓ(Y, g(M(X)))] (Eq. (116)) (120)
= E_{p_{Y|X} p_{X,Z}}[ℓ(Y, f*(X))] = R*[Y | X]. (121)

C PRACTICAL OBJECTIVES

Proposition 6 provides an objective for obtaining the desired optimal representations; compared to Thm. 2 it is more practical in that it does not require direct access to labels and can use augmentations under appropriate assumptions. There nevertheless remain several issues for deriving objectives that can be trained in practice: (i) the support constraint is hard to satisfy in practice; (ii) the mutual information I[A; Z] is hard to estimate from samples (Poole et al., 2019); (iii) the objective is constrained, which makes it harder to optimize.
We now present different objectives, and variational bounds thereof, that do not suffer from these issues and still recover the desired encoders at their optima. In contrast to the proofs of the main theoretical results (previous section), the derivations here are less formal.
As we have seen in Proposition 6, the optimal representation achieves I[A; Z] = I[A; X]. In the following, we rewrite the objective as the constrained optimization
p_{Z|X} ∈ arg min_{p_{Z|X}} B[Z, X, Y, D] (122)
s.t. I[A; Z] = I[A; X], (123)
where we introduce the domain bottleneck B[Z, X, Y, D] as the objective enforcing support match (denoted B[Z, D] in the main body for simplicity). The requirement on the domain bottleneck is that minimizing Eq. (122) under Eq. (123) implies that the support match constraint holds (and is achievable by some encoder), which leads to optimal representations for IDG. Different domain bottlenecks are derived later in this section. We then use a Lagrangian relaxation to obtain the unconstrained objective
arg min_{p_{Z|X}} −I[A; Z] + λ B[Z, X, Y, D]. (124)
The first term can be easily optimized using variational bounds on MI. Throughout the paper, we use a contrastive variational lower bound based on InfoNCE (Oord et al., 2018). Namely, let X be the input sample and A the positive augmentation sampled from p_{A|X}. We obtain n negative augmentations {A⁻_i}_{i=1}^n by first independently sampling {X⁻_i}_{i=1}^n from the marginal p_X and then sampling A⁻_i from p_{A|X⁻_i}; it is easy to see that the negatives A⁻_i follow the marginal p_A. We construct Ã := {A, A⁻_1, ..., A⁻_n}. Let Z be the representation of X obtained by passing it through the encoder p_ϕ := p_{Z|X} parameterized by ϕ, and let s_ψ be the critic function, parameterized by ψ, used to score which A′ ∈ Ã is the positive augmentation.
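This sampling-and-scoring scheme translates directly into a single-sample estimator of the contrastive bound. A minimal numpy sketch (illustrative: random features stand in for the learned critic features g_ψ(A) and h_ψ(Z)):

```python
import numpy as np

# Single-sample InfoNCE-style estimate of the contrastive lower-bound term.
# Random features are illustrative stand-ins for g_psi(A) and h_psi(Z).
rng = np.random.default_rng(0)
n, dim = 64, 16                        # n negatives, feature dimension

z = rng.normal(size=dim)               # h_psi(Z): representation feature of input X
pos = z + 0.1 * rng.normal(size=dim)   # g_psi(A): positive feature, correlated with z
negs = rng.normal(size=(n, dim))       # g_psi(A_i^-): negatives drawn from the marginal

# Separable critic s(A', Z) = g(A')^T h(Z); positive score first.
scores = np.concatenate([[pos @ z], negs @ z])
m = scores.max()                       # stabilized log-sum-exp
log_softmax = scores[0] - m - np.log(np.exp(scores - m).sum())
bound = np.log(n + 1) + log_softmax    # estimate of the lower bound on I[A; Z]
print(bound)
```

Since the log-softmax term is never positive, the estimate is always at most log(n + 1), reflecting the well-known saturation of this bound at the log batch size.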
Then we have the following variational lower bound (Poole et al., 2019):
I[A; Z] ≥ log(n + 1) + E_{p_{Ã,X,Z}}[ log ( exp s_ψ(A, Z) / Σ_{A′∈Ã} exp s_ψ(A′, Z) ) ]. (125)
With unconstrained variational families s_ψ, p_ϕ and infinitely many samples (n → ∞), this bound recovers I[A; Z] up to a constant (see Oord et al. (2018); Dubois et al. (2021)). Typically the critic is separable, i.e., s_ψ(A, Z) := g_ψ(A)ᵀ h_ψ(Z); as discussed in the main body, it can be tied with the encoder p_ϕ when A = X. In the following we focus on the second term B[Z, X, Y, D] and discuss several choices. Throughout this section, M : X → N is the maximal invariant defined in Prop. 6 via the equivalence relation of Eq. (114).

C.1 MUTUAL INFORMATION BOTTLENECK B[Z, X, Y, D] = I[Z; X]

The first bottleneck we consider is the mutual information (MI) bottleneck B[Z, X, Y, D] = I[Z; X], introduced by Tishby et al. (2000) to trade off the predictive power and the complexity of representations. Intuitively, it removes all information in Z that is not needed for maximizing I[Z; A]. In particular, using the fact that Z — X — D forms a Markov chain and the chain rule of MI, we have I[Z; X] = I[Z; X, D] = I[Z; D] + I[Z; X | D]. Thus it not only minimizes I[Z; D], i.e., matches the representation distribution across domains, but also minimizes I[Z; X | D], i.e., matches the representation distribution inside domains.
Why. The key is to show that minimizing Eq. (122), i.e., arg min_{p_{Z|X}} I[Z; X] under I[A; Z] = I[A; X], implies the support match constraint. This can be seen as a special case of Dubois et al.'s (2021) Corollary 15, with A in place of Y and M(X) induced by p_{A|X} as in the proof of Prop. 6. From the corollary, we know that min_{p_{Z|X}} I[Z; X] = H[M(X)], which is achieved by any Z such that p_{Z|x} = p_{Z|x′} ⇔ M(x) = M(x′).
With the assumption of domain-agnostic augmentations (Assumption 9), the set of maximal invariants {M(x) | x ∈ supp(p_{X|d})} is invariant across domains. We then directly have supp(p_{Z|d}) = ∪_{x∈supp(p_{X|d})} supp(p_{Z|x}) = ∪_{x∈supp(p_X)} supp(p_{Z|x}) = supp(p_Z), where we use the fact that all x within the same equivalence class share the same p_{Z|x}.
How. Essentially, we can use any variational upper bound on the mutual information. We consider the one used by the Variational Information Bottleneck (Alemi et al., 2016):
I[Z; X] = E_{p_{X,Z}}[log pϕ(Z | X) − log p_Z(Z)] (126)
= E_{p_{X,Z}}[log pϕ(Z | X) − log qθ(Z)] − D_KL[p_Z(Z) ∥ qθ(Z)] (127)
≤ E_{p_{X,Z}}[log pϕ(Z | X) − log qθ(Z)] (128)
= E_{p_X}[ D_KL[pϕ(Z | X) ∥ qθ(Z)] ], (129)
where a variational distribution qθ is used to approximate p_Z and is jointly optimized with pϕ to minimize the bound. The approximation gap of the bound is D_KL[p_Z(Z) ∥ qθ(Z)]. Ignoring constants, the final loss becomes
L_MI(ψ, ϕ, θ) := −E_{p_{X,Ã,Z}}[ log ( exp s_ψ(A, Z) / Σ_{A′∈Ã} exp s_ψ(A′, Z) ) ] + λ E_{p_X}[ D_KL[pϕ(Z | X) ∥ qθ(Z)] ], (130)
which recovers the optimal encoder in the case of unconstrained variational families pϕ, qθ, s_ψ, infinitely many samples n → ∞, and any λ > 1 (Dubois et al., 2021).

C.2 ENTROPY BOTTLENECK B[Z, X, Y, D] = H[Z]

The entropy (Ent) bottleneck introduced in the main body is a special case of the MI bottleneck in which the encoder is a deterministic mapping, i.e., pϕ(Z | x) is a Dirac delta for every x ∈ X; we denote by eϕ(x) the deterministic encoder, so that pϕ(eϕ(x) | x) = 1.
Why. In the deterministic case, the MI bottleneck becomes the entropy bottleneck because I[X; Z] = H[Z] − H[Z | X] = H[Z], using the fact that H[Z | X] = 0. Importantly, restricting to deterministic encoders does not limit our ability to learn optimal encoders. Indeed, just as with the MI bottleneck, optimizing the objective with the entropy bottleneck under I[A; Z] = I[A; X] recovers encoders such that eϕ(x) = eϕ(x′) ⇔ M(x) = M(x′), which also satisfy the support match constraint as discussed before.
How. Using the same derivation as for the MI bottleneck, we obtain the variational upper bound on the entropy
H[Z] ≤ E_{p_Z}[−log qθ(Z)], (131)
which is the standard variational bound used in neural compression (Ballé et al., 2016; Theis et al., 2017). Putting everything together, we have
L_Ent(ψ, θ, ϕ) := −E_{p_{X,Ã,Z}}[ log ( exp s_ψ(A, Z) / Σ_{A′∈Ã} exp s_ψ(A′, Z) ) ] − λ E_{p_Z}[log qθ(Z)], (132)
which also recovers the optimal encoder with unconstrained variational families, infinitely many samples, and λ > 1, as with the MI bottleneck. The detailed algorithm is provided in Algorithm 2. Note that the discreteness of Z can make gradient-based optimization difficult; following Ballé et al. (2016), we add uniform noise to Z as a differentiable substitute for rounding during training. In our experiments, we mostly use the Ent bottleneck instead of the MI bottleneck, to avoid introducing stochastic encoders.

Algorithm 2 Ent objective
Require: eϕ, sψ, qθ, X, n
1: Z ← eϕ(X)
2: A ← sample(p_{A|X})
3: {(X⁻_i, A⁻_i)}_{i=1}^n ← i.i.d. sample(p_{X,A})
4: Ã ← {A} ∪ {A⁻_i}_{i=1}^n
5: L_aug ← −log( exp s_ψ(A, Z) / Σ_{A′∈Ã} exp s_ψ(A′, Z) )   ▷ ≈ −I[A; Z]
6: L_supp ← −log qθ(Z)   ▷ ≈ H[Z]
7: return L_Ent = L_aug + λ L_supp

C.3 CONTRASTIVE ADVERSARIAL DOMAIN BOTTLENECK B[Z, X, Y, D] = I[Z; D]

The previous two bottlenecks remove as much information from Z (about X) as possible, which seems unnecessary given that our ultimate goal is only to match the support of Z across domains. We now introduce a bottleneck B[Z, X, Y, D] = I[Z; D] that only seeks to remove the information Z carries about the domain D. This is closely related to work on invariant representation learning for domain generalization/adaptation (e.g., Ganin et al., 2016; Li et al., 2018a). We derive a new variational bound, called the contrastive adversarial domain (CAD) bottleneck, that is more stable to train and leads to better empirical performance. For simplicity, we consider the deterministic encoder eϕ(x), as in the main body.
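Before detailing the CAD bottleneck, note that Algorithm 2 above maps almost line-for-line onto code. The following numpy sketch is illustrative only (linear maps stand in for eϕ and the critic features, and a standard Gaussian stands in for qθ):

```python
import numpy as np

# Illustrative sketch of the Ent objective (Algorithm 2); all models are toy stand-ins.
rng = np.random.default_rng(0)
n, d_in, d_z, lam = 32, 8, 4, 1.5

W_enc = rng.normal(size=(d_in, d_z))             # e_phi: deterministic encoder
x = rng.normal(size=d_in)
z = x @ W_enc                                    # step 1: Z <- e_phi(X)
z_noisy = z + rng.uniform(-0.5, 0.5, size=d_z)   # uniform-noise substitute for rounding

a_pos = z + 0.1 * rng.normal(size=d_z)           # step 2: positive augmentation feature
a_neg = rng.normal(size=(n, d_z))                # step 3: negatives from the marginal
scores = np.concatenate([[a_pos @ z], a_neg @ z])  # step 4-5: critic scores over A~

# step 5: L_aug = -log softmax score of the positive (contrastive term)
m = scores.max()
l_aug = -(scores[0] - m - np.log(np.exp(scores - m).sum()))

# step 6: L_supp = -log q_theta(Z), with q_theta = N(0, I) as the entropy model
l_supp = 0.5 * (z_noisy @ z_noisy) + 0.5 * d_z * np.log(2 * np.pi)

l_ent = l_aug + lam * l_supp                     # step 7: L_Ent = L_aug + lambda * L_supp
print(l_aug >= 0, np.isfinite(l_ent))            # → True True
```

In a real implementation eϕ, sψ, and qθ would be trained networks and the entropy model would follow Ballé et al. (2016); the sketch only fixes the shape of the computation.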
Why. As in the previous analyses, we aim to show that arg min_{p_{Z|X}} I[Z; D] under I[A; Z] = I[A; X] leads to the support match constraint. Using Eq. (116), we have I[Z; D] = I[Z, M(x_Z); D] = I[Z, M(X); D] = I[M(X); D] + I[Z; D | M(X)], where the last equality uses the chain rule of mutual information. By non-negativity of (conditional) mutual information, the minimum of I[Z; D] under I[A; Z] = I[A; X] is I[M(X); D]. This minimum is achievable: the same optimal encoder eϕ(X) constructed for the Ent bottleneck clearly satisfies I[Z; D | M(X)] = 0. It is then easy to show, by contraposition, that the support match constraint must hold when I[Z; D | M(X)] = 0: if the support constraint did not hold, we would have I[Z; D | M(X)] > 0, so the encoder could not be optimal.
How. The typical way of minimizing I[Z; D] is through the variational bound
I[Z; D] = H[D] − H[D | Z] (133)
= (const) + E_{p_{D,Z}}[log p_{D|Z}(D | Z)] (134)
≥ (const) + E_{p_{D,Z}}[log qφ(D | Z)], (135)
where a variational distribution (or domain classifier) qφ is used to approximate p_{D|Z} and is jointly trained to maximize the bound. This recovers the domain-adversarial training method of Ganin et al. (2016). However, it has two potential issues: 1) it gives a lower bound instead of the desired upper bound on I[Z; D]; 2) it requires adversarial training, which is unstable in practice (Goodfellow, 2016; Kodali et al., 2017). We propose the contrastive adversarial domain (CAD) bottleneck, which is based on the explicit formulation above but uses a variational distribution qφ(D | Z) that is tied to the rest of the model, so that no separate domain classifier needs to be learned. Suppose we have access to a set of inputs X̃; we first introduce a contrastive variational distribution q_{ϕ,X̃}(X | Z) approximating p_{X|Z}:
q_{ϕ,X̃}(X | Z) := exp s_ϕ(X, Z) / Σ_{X′∈X̃} exp s_ϕ(X′, Z), (136)
where s_ϕ(X, Z) := eϕ(X)ᵀ Z is tied to the encoder eϕ.
Note that qϕ,X has support over X, and equals p X | Z when sϕ(X, Z) log p X,Z(X, Z) and X recovers X. In practice, we use a variety of crude approximations. Our first crude approximation is that we use the minibatch of samples, i.e., the independently sampled X i n i=1 as X. Now, since p D | Z can be rewritten as Ep X | Z p D | X using the fact that D X Z forms a Markov chain, we obtain the following variational distribution: qϕ,X(D | Z) = Eqϕ,X p D | X(D | X) (137) Published as a conference paper at ICLR 2022 which recovers p D | Z when qϕ,X = p X | Z. Note that p D | X is still not available. For our second crude approximation, we use a count estimate ˆp D,X. In particular, we obtain a collection D by taking each X X and independently sampling D from p D | X to get D := {D i }n i=1. In other words, {(D i , X i )}n i=1 are all i.i.d. sampled from p D,X. Then we use a count estimate ˆp D,X(d | x) = Pn i=1 I (X i = x, D i = d) Pn i=1 I (X i = x) (138) which is an accurate estimate with infinite samples. This leads to our final variational distribution: qϕ,X,D(D | Z) = X X X qϕ,X(X | Z)ˆp D,X(D | X ) (139) Putting all together we get that the loss: LCAD(ϕ, ψ) := Ep D,X,A.Z log exp sψ(A, Z) P A A exp sψ(A , Z) + λ log X X qϕ,X(X | Z)ˆp D,X(D | X ) (140) In practice, ˆp D,X(D | X) is typically a dirac delta function since it is rare to have the same samples in a minibatch. Thus, in Eq. (139) we only need to sum qϕ,X(X | Z) over those associated with the same domain label D as X, i.e., XD :={X i | D i = D, i [n]} where [n] :={1, . . . , n}. This leads to the simplified loss: LCAD(ϕ, ψ) := Ep D,X,A,Z log exp sψ(A, Z) P A A exp sψ(A , Z) + λ log X XD qϕ,X(X | Z) In practice, we find that the second term that minimizes the log probability leads to numerical instability. Intuitively, this could be seen by the exploding gradient of the function log(p) when p 0. We thus replace it with log(1 p) which has the same optima. I.e. 
in practice we maximize the log of the probability summed over 𝒳_{¬D} := 𝒳 \ 𝒳_D. This reduces Eq. (141) to Eq. (7) described in the main body, with a detailed algorithm in Algorithm 1. Note that it is easy to generalize Algorithm 1 to parallel computation within a batch of samples: for each sample in the batch, we can view all other samples in the batch as negatives and compute the loss efficiently in parallel.

C.4 CONDITIONAL CAD: B[Z, X, Y, D] = I[Z; D | Y]

The analysis of the CAD bottleneck also implies that we can minimize the conditional mutual information I[Z; D | M(X)] if we have access to M(X). However, since M(X) is typically not available in practice, we consider the special case where M(X) = Y. In particular, this is the case where labels are available and supervised augmentations are used (see Fig. 2b). This reduces the bottleneck to B[Z, X, Y, D] = I[Z; D | Y], which is related to the conditional version of the domain-adversarial neural network (Li et al., 2018b). In practice, minimizing I[Z; D | Y] can be easier to optimize than I[Z; D], as it does not require removing the information that D carries about Y. In the following, we derive the conditional CAD (C2AD) bottleneck using a similar idea as CAD.

How. In this case, we want to minimize

I[Z; D | Y] = H[D | Y] − H[D | Z, Y]    (142)
            = (const) − H[D | Z, Y]    (143)
            ≥ (const) + E_{p_{D,Z,Y}}[log q(D | Z, Y)],    (144)

where q(D | Z, Y) is a variational approximation of p_{D|Z,Y}. As in the unconditional case, we aim to use a non-parametric approximation tied to other parts of the model, obtained via p_{D|Z,Y} = E_{p_{X|Z,Y}}[p_{D|X}]. Specifically, let Y be the label of input X sampled from p_{Y|X}, and let 𝒴 := {Y_1, . . . , Y_n} be the collection of labels obtained by independently sampling a label from p_{Y|X} for each X_i ∈ 𝒳.
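Under the Dirac-delta simplification and the log(1 − p) trick, the batch-parallel version of the CAD support term reduces to a masked log-sum-exp over the in-batch contrastive softmax. A minimal NumPy sketch under those assumptions (names are ours, not from the released code):

```python
import numpy as np

def cad_support_term(z, d, tau=0.05):
    """Batch-parallel CAD support term: for each anchor z_i, compute the
    contrastive softmax q(x' | z_i) over the other samples in the batch and
    *maximize* the log-probability mass on samples from other domains
    (the log(1 - p) form of the simplified loss, Eq. (141)).

    z: (n, k) representations (rows assumed L2-normalized).
    d: (n,) integer domain labels.
    """
    s = z @ z.T / tau                    # critic s(x', z_i) = e(x')^T z_i
    np.fill_diagonal(s, -np.inf)         # an anchor is not its own candidate
    # log q(x' | z_i): log-softmax over the minibatch
    s_max = s.max(axis=1, keepdims=True)
    log_q = s - (s_max + np.log(np.exp(s - s_max).sum(axis=1, keepdims=True)))
    # probability mass on samples whose domain differs from the anchor's
    other = d[None, :] != d[:, None]
    log_mass = np.log(np.where(other, np.exp(log_q), 0.0).sum(axis=1))
    return -log_mass.mean()              # minimized jointly with the aug term
```

Only the support term is sketched; the augmentation term of Eq. (141) is a standard InfoNCE loss over augmentations.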
We collect the samples associated with label Y, i.e., 𝒳_Y := {X_i | Y_i = Y, i ∈ [n]}, and obtain a variational distribution of p_{X|Z,Y}:

q_{ϕ,𝒳,𝒴}(X | Z, Y) := exp s_ϕ(X, Z) / Σ_{X′∈𝒳_Y} exp s_ϕ(X′, Z),    (145)

where we use the same critic s_ϕ(X, Z) := e_ϕ(X)^T Z tied to the encoder e_ϕ as before, but take the softmax only over samples with the same label Y. For the term p_{D|X}, we use the same count estimate p̂_{D,𝒳} as in Eq. (138). We then obtain the variational distribution of p_{D|Z,Y}:

q_{ϕ,𝒳,𝒟,𝒴}(D | Z, Y) = Σ_{X′∈𝒳_Y} q_{ϕ,𝒳,𝒴}(X′ | Z, Y) p̂_{D,𝒳}(D | X′).    (146)

Putting it all together, we get the final loss

L_C2AD(ϕ, ψ) := E_{p_{D,X,A,Y,Z}}[ −log( exp s_ψ(A, Z) / Σ_{A′∈𝒜} exp s_ψ(A′, Z) ) + λ log Σ_{X′∈𝒳_Y} q_{ϕ,𝒳,𝒴}(X′ | Z, Y) p̂_{D,𝒳}(D | X′) ].    (147)

Again, since p̂_{D,𝒳}(D | X) is typically a Dirac delta in practice, the summation in Eq. (146) can be taken only over samples with the same label Y and the same domain label D as X, i.e., 𝒳_{Y,D} := {X_i | Y_i = Y, D_i = D, i ∈ [n]}. Similarly, instead of minimizing the log of the probability summed over 𝒳_{Y,D}, we maximize the log of the probability summed over 𝒳_{Y,¬D} := 𝒳_Y \ 𝒳_{Y,D} = {X_i | Y_i = Y, D_i ≠ D, i ∈ [n]}. Finally, we obtain the simplified loss

L_C2AD(ϕ, ψ) := E_{p_{D,X,A,Y,Z}}[ −log( exp s_ψ(A, Z) / Σ_{A′∈𝒜} exp s_ψ(A′, Z) ) − λ log Σ_{X′∈𝒳_{Y,¬D}} q_{ϕ,𝒳,𝒴}(X′ | Z, Y) ].    (148)

A detailed algorithm is given in Algorithm 3.

Algorithm 3 conditional CAD (C2AD) objective
Require: e_ϕ, s_ψ, D, X, Y, n
1: Z ← e_ϕ(X)
2: A ← sample(p_{A|X})
3: {(D_i, X_i, A_i, Y_i)}_{i=1}^n ← i.i.d. sample(p_{D,X,A,Y})
4: 𝒳, 𝒜 ← {X_i}_{i=1}^n, {A_i}_{i=1}^n
5: 𝒳_Y ← {X_i | Y_i = Y, i ∈ [n]}
6: 𝒳_{Y,¬D} ← {X_i | Y_i = Y, D_i ≠ D, i ∈ [n]}
7: L_aug ← −log( exp s_ψ(A, Z) / Σ_{A′∈𝒜} exp s_ψ(A′, Z) )   ▷ for I[A; Z]
8: L_supp ← −log( Σ_{X′∈𝒳_{Y,¬D}} exp e_ϕ(X′)^T Z / Σ_{X′∈𝒳_Y} exp e_ϕ(X′)^T Z )   ▷ for I[Z; D | Y]
9: return L_C2AD = L_aug + λ L_supp

D EXTENDED RELATED WORK

Provably optimal representations.
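Line 8 of Algorithm 3 can be computed in batch-parallel form with two masks, one restricting the softmax to same-label samples 𝒳_Y and one selecting same-label, different-domain samples 𝒳_{Y,¬D}. A NumPy sketch (names are ours, purely illustrative):

```python
import numpy as np

def c2ad_support_term(z, y, d, tau=0.05):
    """Batch-parallel version of L_supp in Algorithm 3:
    -log( sum_{x' in X_{Y,not D}} exp(e(x')^T z)
          / sum_{x' in X_Y} exp(e(x')^T z) ).
    Every anchor is assumed to have at least one same-label,
    different-domain partner in the batch.
    """
    s = z @ z.T / tau
    same_y = y[None, :] == y[:, None]
    np.fill_diagonal(same_y, False)               # exclude the anchor itself
    # denominator: log-sum-exp over X_Y (same label)
    s_y = np.where(same_y, s, -np.inf)
    m = s_y.max(axis=1, keepdims=True)
    log_denom = m + np.log(np.exp(s_y - m).sum(axis=1, keepdims=True))
    # numerator: log-sum-exp over X_{Y, not D} (same label, other domain)
    pos = same_y & (d[None, :] != d[:, None])
    s_yd = np.where(pos, s, -np.inf)
    m2 = s_yd.max(axis=1, keepdims=True)
    log_num = m2 + np.log(np.exp(s_yd - m2).sum(axis=1, keepdims=True))
    return -(log_num - log_denom).mean()
```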
Many previous works have theoretically studied the advantages of representations in various two-stage settings (representation learning followed by standard training of predictors) by bounding downstream performance (e.g., Ben-David et al., 2007; Shamir et al., 2010; Saunshi et al., 2019). As learning-theoretic bounds can be loose, it is hard to know whether they give the right insights into the problem. Our work instead proves properties of optimal representations, which ensure the best downstream performance. Those properties need to be approximated in practice, but they give the right insights into what to aim for. This perspective and our proofs were inspired by Dubois et al. (2020), who give sufficient conditions for optimal representations in supervised learning.

E EXPERIMENTAL DETAILS

E.1 SCIENTIFIC

In both the scientific setting and the following bridge setting, we consider rather unrealistic setups for verifying our theory, where we have access to labels from all domains. We can choose to directly minimize the risk R[Y | Z] with the cross-entropy loss (denoted CE henceforth), or minimize H[A | Z] (i.e., maximize I[A; Z]) with supervised augmentations as in Fig. 2b, detailed below.

Implementation of supervised augmentations. When using supervised augmentations, for each sample we obtain its augmentations from within its label class across all domains. A contrastive loss with such augmentations essentially reduces to the supervised contrastive loss (SupCon, Khosla et al., 2020). In particular, for a single sample in a batch, all samples in the batch with the same label can be used as positives (whether from the same domain or different domains) and the others as negatives. In Khosla et al. (2020), two variants of the SupCon loss were introduced to handle multiple positives, depending on whether the summation over positives is located inside (SupCon-In, Eq. (3) in Khosla et al. (2020)) or outside (SupCon-Out, Eq. (2) in Khosla et al. (2020)) the log.
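The two SupCon variants can be sketched side by side; by Jensen's inequality, the Out variant (average over positives outside the log) upper-bounds the In variant, with equality only when all positive logits are equal. A NumPy sketch with our own naming:

```python
import numpy as np

def supcon_in_out(z, y, tau=0.1):
    """SupCon-In vs. SupCon-Out (Eqs. (3) and (2) of Khosla et al., 2020):
    the average over the positive set P(i) sits inside the log for -In and
    outside for -Out. z: (n, k) L2-normalized features; y: (n,) labels."""
    s = z @ z.T / tau
    np.fill_diagonal(s, -np.inf)                   # drop self-pairs
    s_max = s.max(axis=1, keepdims=True)
    log_p = s - (s_max + np.log(np.exp(s - s_max).sum(axis=1, keepdims=True)))
    pos = y[None, :] == y[:, None]
    np.fill_diagonal(pos, False)
    n_pos = pos.sum(axis=1)
    # SupCon-In: -log( (1/|P(i)|) * sum_{p in P(i)} q(p | i) )
    l_in = -(np.log(np.where(pos, np.exp(log_p), 0.0).sum(axis=1))
             - np.log(n_pos))
    # SupCon-Out: -(1/|P(i)|) * sum_{p in P(i)} log q(p | i)
    l_out = -np.where(pos, log_p, 0.0).sum(axis=1) / n_pos
    return l_in.mean(), l_out.mean()
```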
Though Khosla et al. (2020) chose SupCon-Out because it worked better than SupCon-In, we hypothesized that this is because SupCon-Out has an implicit bottleneck effect. Intuitively, by Jensen's inequality SupCon-Out upper-bounds SupCon-In and attains its optimum only if the logits for the positive samples are all equal, which may encourage positive samples from different domains to cluster together. Since this might confound the effect of our bottlenecks, we chose to use SupCon-In, though it performed slightly worse in our initial experiments. For the implementation of SupCon, we followed Khosla et al. (2020), except that no projection head was used. Specifically, the temperature was set to 0.1, and normalization was applied when computing the logits.

In the scientific setting, we tried to simulate our theory to the greatest extent possible. In particular, we had two special considerations, detailed below:

Eliminating empirical generalization. As our theory focuses on idealized domain generalization, which assumes access to the population distribution, we considered a setup where empirical generalization is eliminated. Specifically, we treated the dataset as the population distribution and used the same dataset for training the encoder and for training/evaluating the predictor. The ResNet-18 encoder was trained for 300 epochs without any regularization, using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 5e-5, a batch size of 192 (48 per domain), and a cosine learning rate decay schedule.

Worst-case approximation. To approximate the worst-case source predictor, we added target data with randomly assigned wrong labels to the training set of the source predictor. The target samples were down-weighted with a sample weight chosen to maximize the target risk while keeping the source risk close to its optimum (which is 0).
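The worst-case construction above can be sketched as follows; this is a schematic of the dataset assembly only, with names of our own choosing:

```python
import numpy as np

def worst_case_train_set(xs, ys, xt, yt, n_cls, weight, seed=0):
    """Approximate-worst-case training set for the source predictor: source
    data with true labels plus target data with randomly assigned *wrong*
    labels, the latter down-weighted by a small sample weight."""
    rng = np.random.default_rng(seed)
    # a uniformly random wrong label: shift the true label by 1..n_cls-1
    shift = rng.integers(1, n_cls, size=len(yt))
    yt_wrong = (yt + shift) % n_cls
    x = np.concatenate([xs, xt], axis=0)
    y = np.concatenate([ys, yt_wrong])
    w = np.concatenate([np.ones(len(ys)), np.full(len(yt), weight)])
    return x, y, w
```

The sample weight `weight` is the quantity swept in the next paragraph; the classifier is then trained with these per-sample weights.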
We selected the sample weight by sweeping over [1e-10, 1] on a logarithmic scale using CE-Base and SupCon-Base, as shown in Fig. 4. As the sample weight increases, the target log likelihood (negative risk) first decreases and then increases. We hypothesized that the increasing trend was due to the source performance no longer being optimal (though this is not visible in the figure), so we selected a weight close to the turning point; 1e-5 seemed reasonable for both CE-Base and SupCon-Base. Although we did not adaptively select the sample weight for each setup due to the computational cost, the pre-specified sample weight turned out to be reasonable for all other losses and different λ combinations. Furthermore, we also removed regularization when training the linear classifier and initialized the linear weights i.i.d. from N(0, 1).

[Figure 4: Sweeping the sample weight using (a) CE-Base and (b) SupCon-Base; each panel plots source and target log likelihood against the sample weight. We selected 1e-5, which seemed reasonable for both cases.]

Next, we provide other experimental details for reproducibility:

Implementation of standard augmentations. We followed SimCLR (Chen et al., 2020) for implementing standard image augmentations. For a fair comparison between the cases of standard augmentations (SimCLR) and supervised augmentations (SupCon), we kept the total batch size the same and used the same configuration for computing the SupCon loss, i.e., temperature 0.1, no projection head, and normalization applied.

Details of Fig. 3c. In Fig. 3c, we considered different choices of augmentations. The Standard augmentation implementation is described above (Appx. E.1). The Supervised augmentation was essentially implemented using the SupCon loss as described in Appx. E.1.
For the other augmentations considered, we implemented them by dropping out inter-domain supervised augmentations in SupCon. Specifically, for each sample in the batch, we randomly masked the samples from different domains (i.e., both inter-domain positives and negatives) i.i.d. with the specified dropout probability, while samples from the same domain were always kept. IntraDom and ApproxDA correspond to dropout probabilities 1 and 0.9, respectively. SingleDom was implemented by dropping out all inter-domain samples with probability 1 except for one fixed domain (the A domain of PACS in our case).

E.2 BRIDGE

In the bridge setting (see Appx. F.2), we aimed to bridge the gap between our theoretical setup and the practical setup. The main differences from the scientific setup are that the empirical generalization gap is considered and the average-case source predictor is used, as detailed below:

Incorporating empirical generalization. In practice, the empirical generalization gap should also be considered besides the source-target generalization gap. Thus, we randomly split the PACS dataset into 80% training and 20% validation splits for each domain. The training splits were used to train both the encoder and the source predictor, and the validation splits were used for encoder and source-predictor selection as well as for evaluation on target domains. We used a ResNet-50 encoder initialized from an ImageNet-pretrained model. The encoder was trained for a maximum of 50 epochs with 1e-5 weight decay, using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 5e-5, a batch size of 112 (28 per domain), and a cosine learning rate decay schedule.

Using an average-case source predictor. Instead of approximating the worst-case source predictor as in the scientific setting, we considered the average-case³ source predictor, which is closer to common practice.
Specifically, we froze the encoder and trained an SVM classifier with L2 regularization on the source training split. The regularization parameter was tuned over {1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3} using the source validation accuracy.

³Here we slightly abuse the phrase "average-case" to distinguish it from the worst-case used in the scientific setting. In fact, the source predictor could be close to the best case, since a max-margin classifier (SVM) was used.

Next, we provide other experimental details for reproducibility:

Selection of λ. For all setups considered in the bridge setting, the CAD bottleneck was used and λ was tuned over {1e-3, 1e-2, 1e-1, 1, 1e1} independently for each setup.

E.3 DOMAINBED

Datasets. We used the non-MNIST datasets on DomainBed that are non-synthetic, including VLCS (Fang et al., 2013), PACS (Li et al., 2017), OfficeHome (Venkateswara et al., 2017), TerraIncognita (Beery et al., 2018), and DomainNet (Peng et al., 2019). Each dataset was split into 80%/20% training/validation sets following DomainBed.

SSL-based models & training. For all models based on pretrained SSL models (either CLIP-based or DINO-based) with finetuning in this experiment, we froze the pretrained SSL model and added on top a 1-layer MLP with hidden size 1024 and a residual connection. We used CLIP ResNet-50 (CLIP_S) to obtain the fairest possible comparison with baselines from DomainBed, and CLIP ViT-B/32 (CLIP_L) to achieve the best results. Note that the ResNet-50 model of CLIP_S was modified as described in Radford et al. (2021) and contains 38M parameters (more than the 23M of the original ResNet-50). The model was trained for 300 epochs on DomainNet and 50 epochs on the other datasets (an epoch is defined as a single pass over the smallest domain, following DomainBed). No data augmentation was used, and the temperature for scaling the logits in CAD was fixed to 0.05.
We used the Adam optimizer with 1e-5 weight decay and a cosine learning rate decay schedule. The hyperparameter search space was:

- Learning rate: discrete set {1e-4, 3e-4, 1e-3, 3e-3}
- Batch size: discrete set {128, 256, 512} for DomainNet and OfficeHome, and {64, 128, 256} for the other datasets
- MLP dropout: discrete set {0., 0.1, 0.5}
- Learning rate warmup: discrete set {True, False}

End-to-end models & training. In Table 1, we also included an end-to-end trained model without any pretrained SSL model. We used exactly the same model architecture (the original ResNet-50, initialized from an ImageNet-pretrained model), training procedure, and evaluation protocol as the baselines on DomainBed. Importantly, the linear classifier was jointly trained with the encoder, and no refitting was applied. The model was trained for a maximum of 5000 steps on each dataset, and data augmentations were applied. The Adam optimizer was used without any particular learning rate schedule. The hyperparameter search space was (the same as DomainBed, except that we added the temperature):

- Learning rate: log-uniform over [1e-5, 1e-3.5]
- Batch size: log-uniform over [8, 64] for DomainNet, and [8, 2^5.5] for the other datasets
- MLP dropout: discrete set {0., 0.1, 0.5}
- Weight decay: log-uniform over [1e-6, 1e-2]
- Temperature: discrete set {0.05, 0.1}

Linear probe evaluation. In all experiments except the end-to-end training setup, we always followed the two-stage training procedure, where we first trained the encoder with the specified objectives and then refit the classifier. For all datasets except DomainNet, we fitted an SVM classifier and tuned the regularization parameter over {1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3} with source-validation selection. Since DomainNet is too large for the SVM to fit efficiently, we used a logistic regression classifier trained with a batch size of 512, the Adam optimizer with a learning rate of 5e-4, and early stopping.
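The two-stage probe selection above (fit a regularized linear classifier on frozen features for each regularization value, keep the best by source-validation accuracy) can be sketched as follows; a closed-form one-vs-all ridge classifier stands in here for the SVM/logistic regression actually used, and all names are ours:

```python
import numpy as np

def fit_linear_probe(z_tr, y_tr, z_va, y_va,
                     regs=(1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3)):
    """Fit a linear probe on frozen representations for each regularization
    strength and return the weights with the best source-validation accuracy.
    The selection loop, not the ridge stand-in, is the point of the sketch."""
    n_cls = int(max(y_tr.max(), y_va.max())) + 1
    targets = np.eye(n_cls)[y_tr]                  # one-hot regression targets
    best_acc, best_w = -1.0, None
    for lam in regs:
        a = z_tr.T @ z_tr + lam * np.eye(z_tr.shape[1])
        w = np.linalg.solve(a, z_tr.T @ targets)   # closed-form ridge solution
        acc = np.mean((z_va @ w).argmax(axis=1) == y_va)
        if acc > best_acc:                         # source-validation selection
            best_acc, best_w = acc, w
    return best_w, best_acc
```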
Note that an alternative was to just use the linear head fitted while training the representor (since we used the CE loss with source labels), and we found this could work better than refitting, since that classifier was less overfitted to the source domain. However, we did not do so, since we wanted to stick to the representation-learning protocol with two-stage training. We did do so in our end-to-end training setup, since we wanted it to be completely comparable to the baselines on DomainBed (which do not refit).

Selection of λ. In our experiments, we treated λ as a special hyperparameter. For each model, we used the same λ selected on PACS for all datasets except DomainNet, because our bottleneck is fairly robust to the choice of λ. For the large-scale DomainNet dataset, we selected its λ individually. The λ values chosen for each model were:

- CLIP_S: 1 on DomainNet and 1e-2 on the other datasets
- CLIP_L: 1e-1 on DomainNet and 1e-2 on the other datasets
- DINO: 1e-1 on all datasets
- End-to-end ResNet-50: 1e-5 on all datasets

Model. We used the CLIP_L model (i.e., CLIP ViT-B/32) with an additional network on top for finetuning. The additional network consisted of two blocks of 2-layer MLPs, each with hidden size 2048, pre-activation batch normalization, a residual connection, and dropout probability 0.1. Note that the original CLIP_L model was frozen and only the additional network was trained.

Dataset. We used the LAION-400M dataset, which contains 400 million image-text pairs, for training. Though the dataset may not be as clean as the original CLIP training data (as evidenced by our experimental results), it was the largest publicly available image-text-pair dataset that we could access. As we froze the CLIP_L model and only did finetuning, we used the 1TB of preprocessed embeddings provided by LAION-400M⁴. No further preprocessing was applied.

Training. We used the image-text contrastive loss introduced in Radford et al.
(2021) for training the model. The temperature was learnable, initialized to 0.07 and clipped at a minimum of 0.01. The model was trained for 1 epoch using the Adam optimizer with a batch size of 16384 and a cosine learning rate decay schedule. The learning rate was tuned over {3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2}, and the λ value for the Ent bottleneck was tuned over {1e-3, 1e-2, 1e-1, 1, 1e1}.

Evaluation. For the evaluation on the ImageNet-related datasets, we followed a procedure similar to Radford et al. (2021): a linear classifier was fitted on ImageNet using the model representations and evaluated on 7 natural distribution-shift datasets. In particular, we fitted a logistic regression classifier with 1e-5 L2 regularization on the ImageNet training set, trained with a batch size of 512, the Adam optimizer with a learning rate of 3e-4, and early stopping. Note that this differs from Radford et al. (2021), where a logistic regression classifier was fitted on full-batch data with extensive hyperparameter tuning; the difference is due to our computational budget. For the evaluation on natural distribution-shift datasets, we followed Taori et al. (2020) and used their released testbed⁵. The evaluation datasets and their abbreviations used in Table 2 are: ImageNetV2 (IN-V2, Recht et al., 2019), ImageNet-Sketch (IN-S, Wang et al., 2019), YouTube-BB (YT-BB, Shankar et al., 2019), ImageNet-Vid (IN-Vid, Shankar et al., 2019), ObjectNet (Barbu et al., 2019), ImageNet Adversarial (IN-A, Hendrycks et al., 2021), and ImageNet Rendition (IN-R, Hendrycks et al., 2020).

⁴See https://laion.ai/laion-400-open-dataset/ for details.
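The symmetric image-text contrastive loss with a learnable, clipped temperature described above can be sketched as follows (a NumPy forward-pass sketch; the real training code would keep the log-temperature as a trainable parameter):

```python
import numpy as np

def clip_contrastive_loss(img, txt, log_temp=np.log(0.07), temp_min=0.01):
    """Symmetric image-text contrastive loss in the style of Radford et al.
    (2021): InfoNCE over both the image->text and text->image directions,
    with the temperature stored in log space, initialized at 0.07, and
    clipped to a minimum of 0.01 as in the training setup above."""
    t = max(float(np.exp(log_temp)), temp_min)
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / t                       # (n, n) similarity matrix

    def nll_of_diagonal(s):
        s = s - s.max(axis=1, keepdims=True)       # stable log-softmax
        log_p = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))            # matched pair i <-> i

    return 0.5 * (nll_of_diagonal(logits) + nll_of_diagonal(logits.T))
```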
⁵https://github.com/modestyachts/imagenet-testbed

F ADDITIONAL EXPERIMENTAL RESULTS

F.1 SCIENTIFIC

[Figure 5: The worst-case DG performance of the Ent bottleneck is more sensitive to λ than CAD. (a) CE (R[Y | Z]) objectives; (b) SupCon (H[A | Z]) objectives; each panel plots worst-case target log likelihood against λ for Base, CAD, and Ent.]

What is the effect of λ on the worst-case DG performance for different objectives? Fig. 5 shows the worst-case target log likelihood versus λ for the different objectives. We found that Ent is much more sensitive to the choice of λ than CAD, which was part of the reason why we used the latter in most of our experiments. Note that SupCon-Ent with small λ values was worse than SupCon-Base because of the discretization introduced by the Ent bottleneck, which we verified by observing that setting λ = 0 led to similar results.

F.2 BRIDGE

The scientific setup is closer to our theory than what we do in practice, in that the worst-case predictor was considered and the empirical generalization gap was ignored. Here we bridged these gaps with a more practical setup. In particular, we split the PACS dataset into training and validation splits for each domain and considered the following setting: the representor trains the encoder on the all-domain training splits with validation-loss selection; the learner trains the SVM predictor (average-case) on the source training split, selects it on the source validation split, and evaluates on the validation splits of the other (target) domains. The target validation accuracy averaged over all (source, target) pairs is reported. For simplicity, we use CE to denote the objective with the cross-entropy loss that uses labels to minimize R[Y | Z], and SupCon for the contrastive loss that uses supervised augmentations to minimize H[A | Z]. We use CAD in the following experiments unless otherwise specified (chosen based on initial experiments). Details are in Appx.
E.2.

Table 3: We repeated most of the empirical analysis (from the scientific setting) in the more practical bridge setting and observed similar results.

Setup | Avg. target acc.
CE-Base | 95.9 ± 0.5
CE-CAD | 96.7 ± 0.2
CE-CAD (partial domains) | 82.6 ± 0.5
SupCon-CAD | 96.7 ± 0.4
SupCon-CAD (SingleDom) | 96.7 ± 0.3
SupCon-CAD (ApproxDA) | 96.6 ± 0.3
SupCon-CAD (IntraDom) | 96.2 ± 0.7
SimCLR-CAD | 61.7 ± 0.8

Does the domain bottleneck improve the average-case DG performance? Though our theory focuses on worst-case DG, we empirically show that adding bottlenecks to enforce support match can also improve average-case DG performance, by comparing CE-Base and CE-CAD in Table 3.

What if the representor only has access to source domains? As in the scientific setting, we considered the setup where one single domain is designated as the target domain, excluded from the training set of the representor, and used for evaluation with source predictors trained on the other domains. This is denoted CE-CAD (partial domains) in Table 3, and it is much worse than CE-CAD. This shows the necessity of access to target-domain information for DG.

What if the representor only has access to domain-agnostic augmentations? In Table 3, we also compared SupCon-CAD, which uses supervised augmentations through the labels, with CE-CAD; they achieved the same performance. This shows that in practice the representor can still learn good representations without labels, using only domain-agnostic augmentations.

Can we use standard augmentations? In Fig. 2, we point out that standard augmentations are not domain-agnostic and thus not suitable for SSL with our objectives. We showed this empirically by using the augmentations of SimCLR (see Appx. E.1 for details) with our objectives (SimCLR-CAD). In Table 3, we indeed observe that standard augmentations (SimCLR-CAD) perform much worse than the desired augmentations (SupCon-CAD).

How do augmentations matter?
Besides the Supervised (SupCon-CAD) and Standard (SimCLR-CAD) augmentations investigated above, we also compared the three other augmentations from the scientific section: SingleDom, IntraDom, and ApproxDA. As shown in Table 3, SupCon-CAD (SingleDom) and (ApproxDA) maintained the DG performance, but SupCon-CAD (IntraDom) was slightly worse (a 0.5 accuracy drop). We assumed the small gap was due to the specific dataset used (PACS). We ran the same analysis on VLCS, where SupCon-CAD with Supervised, SingleDom, and IntraDom augmentations gave 84.7 ± 0.4, 83.2 ± 0.3, and 77.5 ± 2.3, respectively. This shows the importance of using domain-agnostic augmentations in practice.

Do standard augmentations affect source performance? Previously, we showed that using standard augmentations hurts DG performance as measured by average target accuracy. It is natural to ask whether standard augmentations also hurt source performance, since we should also be interested in effective robustness (Taori et al., 2020). We therefore also report the average source accuracy of SupCon-CAD and SimCLR-CAD, which was 96.9 ± 0.2 and 90.1 ± 0.2, respectively. The source performance with standard augmentations was indeed worse, but the source-target gap was 0.2 for SupCon-CAD versus 28.4 for SimCLR-CAD, which again indicates that with non-domain-agnostic standard augmentations it is harder to enforce support match. To be even more convincing, we ran the same analysis on VLCS: the average source accuracies of SupCon-CAD and SimCLR-CAD were 86.6 ± 0.1 and 84.6 ± 0.5, which are fairly close, but the average target accuracies were 84.7 ± 0.4 and 57.5 ± 1.7, respectively.

F.3 DOMAINBED

Full results of Table 1. We include the full results of Table 1 with all baselines on DomainBed in Table 4.
We considered the most representative baselines from DomainBed, most of which learn invariant representations or optimal classifiers across domains. Specifically, we included IRM (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2019), Mixup (Yan et al., 2020), CORAL (Sun & Saenko, 2016), MMD (Li et al., 2018a), DANN (Ganin et al., 2016), CDANN (Li et al., 2018b), and VREx (Krueger et al., 2021). We also included the result of the pretrained CLIP_S model with a zero-shot classifier using text representations (CLIP_S Zero-Shot), which demonstrated better DG performance than CLIP_S with a linear probe, but we observed that it was outperformed by our CLIP_S + CAD.

What is the impact of CLIP pretraining? To ensure that our gains are due not only to the novel CAD bottleneck but to the synergy between enforcing the support constraint and using suitable SSL models, we investigated CAD under the standard DomainBed protocol, denoted CAD in the table. It shows that CAD on its own performs similarly to the DomainBed baselines (see Table 4 for a full comparison).

Table 4: Full results on DomainBed with the oracle selection method.
Algorithm | VLCS | PACS | OfficeHome | TerraIncognita | DomainNet
ERM | 77.6 ± 0.3 | 86.7 ± 0.3 | 66.4 ± 0.5 | 53.0 ± 0.3 | 41.3 ± 0.1
IRM | 76.9 ± 0.6 | 84.5 ± 1.1 | 63.0 ± 2.7 | 50.5 ± 0.7 | 28.0 ± 5.1
GroupDRO | 77.4 ± 0.5 | 87.1 ± 0.1 | 66.2 ± 0.6 | 52.4 ± 0.1 | 33.4 ± 0.3
Mixup | 78.1 ± 0.3 | 86.8 ± 0.3 | 68.0 ± 0.2 | 54.4 ± 0.3 | 39.6 ± 0.1
CORAL | 77.7 ± 0.2 | 87.1 ± 0.5 | 68.4 ± 0.2 | 52.8 ± 0.2 | 41.8 ± 0.1
MMD | 77.9 ± 0.1 | 87.2 ± 0.1 | 66.2 ± 0.3 | 52.0 ± 0.4 | 23.5 ± 9.4
DANN | 79.7 ± 0.5 | 85.2 ± 0.2 | 65.3 ± 0.8 | 50.6 ± 0.4 | 38.3 ± 0.1
CDANN | 79.9 ± 0.2 | 85.8 ± 0.8 | 65.3 ± 0.5 | 50.8 ± 0.6 | 38.5 ± 0.2
VREx | 78.1 ± 0.2 | 87.2 ± 0.6 | 65.7 ± 0.3 | 51.4 ± 0.5 | 30.1 ± 3.7
CAD | 78.0 ± 0.1 | 87.3 ± 0.2 | 67.0 ± 0.5 | 53.5 ± 0.9 | 41.5 ± 0.1
DINO + CAD | 69.6 ± 0.6 | 76.1 ± 0.1 | 56.9 ± 0.5 | 25.9 ± 1.2 | 33.6 ± 0.1
CLIP_S | 81.1 ± 0.5 | 90.3 ± 0.2 | 70.6 ± 0.1 | 29.6 ± 0.8 | 47.7 ± 0.0
CLIP_S (Zero-Shot) | 80.9 ± 0.1 | 91.8 ± 0.1 | 70.4 ± 0.2 | 19.1 ± 0.1 | 46.9 ± 0.0
CLIP_S + Base | 81.3 ± 0.5 | 91.2 ± 0.3 | 70.6 ± 0.1 | 36.4 ± 0.7 | 46.8 ± 0.2
CLIP_S + CAD | 82.3 ± 0.3 | 92.0 ± 0.2 | 71.9 ± 0.2 | 36.2 ± 0.8 | 48.8 ± 0.1
CLIP_L | 80.7 ± 0.4 | 93.7 ± 0.8 | 79.6 ± 0.1 | 36.9 ± 0.6 | 52.8 ± 0.1
CLIP_L + CAD | 81.6 ± 0.1 | 94.9 ± 0.3 | 80.0 ± 0.2 | 40.6 ± 1.1 | 53.7 ± 0.1

Why oracle selection? In the main body, we provided results with oracle selection, which is the closest to our theory among the model-selection methods in DomainBed (in the sense that target-domain information is needed to achieve IDG). Here, we also provide results with source-validation selection in Table 5. Source-validation selection relies on the assumption that source and target data follow similar distributions (Gulrajani & Lopez-Paz, 2021), so that source and target accuracy are highly correlated, which does not really hold in practice. We found several issues with the source-validation selection results: the model with the highest source validation accuracy tends to overfit the source domain, and thus possibly performs worse on the target domain, as suggested by the fact that the finetuned CLIP models (CLIP + Base or CLIP + CAD) were generally worse than the original CLIP model; and selecting models by source validation accuracy tends to diminish the effect of the bottlenecks.
This can be seen from the fact that the gap between CLIP + Base and CLIP + CAD under source-validation selection is much smaller than under oracle selection. Moreover, the source accuracy is not a good indicator of the target accuracy, so its results have a larger variance.

Evaluation results on DomainBed. We include the evaluation results of the trained models on DomainBed in Table 6, where we followed exactly the same linear evaluation protocol discussed in Appx. E.3. We observe results similar to Table 2: the CLIP_L model trained with the Ent bottleneck on LAION (Tuned w/ Ent) outperforms the one without (Tuned w/o Ent) on all DomainBed datasets, but slightly underperforms the original CLIP_L model (which might be due to the quality of the LAION-400M dataset).

Table 5: Results on DomainBed with source-validation selection. The source-validation-selected model tends to overfit more to the source domain and diminishes the effect of the bottlenecks.

Algorithm | VLCS | PACS | OfficeHome | TerraIncognita | DomainNet
ERM | 77.5 ± 0.4 | 85.5 ± 0.2 | 66.5 ± 0.3 | 46.1 ± 1.8 | 40.9 ± 0.1
IRM | 78.5 ± 0.5 | 83.5 ± 0.8 | 64.3 ± 2.2 | 47.6 ± 0.8 | 33.9 ± 2.8
GroupDRO | 76.7 ± 0.6 | 84.4 ± 0.8 | 66.0 ± 0.7 | 43.2 ± 1.1 | 33.3 ± 0.2
Mixup | 77.4 ± 0.6 | 84.6 ± 0.6 | 68.1 ± 0.3 | 47.9 ± 0.8 | 39.2 ± 0.1
CORAL | 78.8 ± 0.6 | 86.2 ± 0.3 | 68.7 ± 0.3 | 47.6 ± 1.0 | 41.5 ± 0.1
MMD | 77.5 ± 0.9 | 84.6 ± 0.5 | 66.3 ± 0.1 | 42.2 ± 1.6 | 23.4 ± 9.5
DANN | 78.6 ± 0.4 | 83.6 ± 0.4 | 65.9 ± 0.6 | 46.7 ± 0.5 | 38.3 ± 0.1
CDANN | 77.5 ± 0.1 | 82.6 ± 0.9 | 65.8 ± 1.3 | 45.8 ± 1.6 | 38.3 ± 0.3
VREx | 78.3 ± 0.2 | 84.9 ± 0.6 | 66.4 ± 0.6 | 46.4 ± 0.6 | 33.6 ± 2.9
CAD | 78.0 ± 0.5 | 85.2 ± 0.9 | 67.4 ± 0.2 | 47.3 ± 2.2 | 41.0 ± 0.1
DINO + CAD | 68.9 ± 0.9 | 75.4 ± 0.5 | 56.4 ± 0.7 | 23.6 ± 1.2 | 31.0 ± 2.3
CLIP_S | 81.1 ± 0.5 | 90.3 ± 0.2 | 70.6 ± 0.1 | 29.6 ± 0.8 | 47.7 ± 0.0
CLIP_S (Zero-Shot) | 80.9 ± 0.1 | 91.8 ± 0.1 | 70.4 ± 0.2 | 19.1 ± 0.1 | 46.9 ± 0.0
CLIP_S + Base | 81.4 ± 0.4 | 89.6 ± 0.7 | 70.4 ± 0.2 | 30.9 ± 2.2 | 44.6 ± 1.6
CLIP_S + CAD | 81.2 ± 0.6 | 90.0 ± 0.6 | 70.5 ± 0.3 | 30.3 ± 0.9 | 45.5 ± 2.1
CLIP_L | 80.6 ± 0.7 | 93.5 ± 0.8 | 79.4 ± 0.2 | 37.5 ± 0.7 | 50.1 ± 1.1
CLIP_L + CAD | 80.8 ± 0.7 | 93.5 ± 0.7 | 79.7 ± 0.2 | 37.4 ± 1.2 | 51.7 ± 1.4

Table 6: Finetuning CLIP_L on LAION with an entropy
bottleneck performs better on DomainBed than finetuning without.

Algorithm | VLCS | PACS | OfficeHome | TerraIncognita | DomainNet
CLIP_L | 80.7 ± 0.4 | 93.7 ± 0.8 | 79.9 ± 0.1 | 36.9 ± 0.6 | 52.8 ± 0.1
Tuned w/o Ent | 79.2 ± 0.7 | 93.4 ± 0.3 | 77.2 ± 0.5 | 36.1 ± 0.4 | 51.2 ± 0.1
Tuned w/ Ent | 80.7 ± 0.4 | 94.3 ± 0.8 | 78.2 ± 0.2 | 36.8 ± 0.4 | 52.2 ± 0.1